Data quality profiling and exploratory data analysis are crucial steps in Data Science and Machine Learning development. ydata-profiling is a pioneering Python package and a leading tool in the data understanding step of the data science workflow: it automates and standardizes the generation of detailed profiling reports, complete with statistics and visualizations. Its significance lies in how it streamlines the process of understanding and preparing data for analysis into a single line of code! If you're ready to get started, see the quickstart!
Looking for a scalable solution that integrates with your database systems?
Leverage the YData Fabric Data Catalog to connect to different databases and storages (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.) and enjoy an interactive, guided profiling experience in Fabric.
Check out the Community Version.
Why use ydata-profiling?
ydata-profiling is a valuable tool for data scientists and analysts because it streamlines EDA, provides comprehensive insights, enhances data quality,
and promotes data science best practices.
- Simple to use: a single line of code is all you need to get started. Do you really need more to convince you? 😛
- Comprehensive insights in a report: the report includes a wide range of statistics and visualizations, providing a holistic view of your data. It can be shared as an HTML file or integrated as a widget in a Jupyter Notebook.
- Data quality assessment: the package excels at identifying missing data, duplicate entries, and outliers. These insights are essential for data cleaning and preparation, ensure the reliability of your analysis, and lead to the early identification of problems.
- Ease of integration with other flows: all data profiling metrics can be consumed in a standard JSON format.
- Data exploration for large datasets: even with datasets with a large number of rows, ydata-profiling can help, as it supports both pandas DataFrames and Spark DataFrames.
To learn more about the package, check out the concepts overview.
📝 Features, functionalities & integrations
ydata-profiling can be used to deliver a variety of applications. The documentation includes guides, tips and tricks for tackling them:
| Features & functionalities | Description |
|---|---|
| Comparing datasets | Comparing multiple versions of the same dataset |
| Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
| Profiling large datasets | Tips on how to prepare data and configure ydata-profiling for working with large datasets |
| Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
| Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
| Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
| Profiling Databases | For a seamless profiling experience in your organization's databases, check the Fabric Data Catalog, which allows you to consume data from different types of storages such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
Looking for how to use certain features, or how to integrate ydata-profiling in your current stack and workflows? Check our step-by-step tutorials.
- How to master exploratory data analysis with ydata-profiling? Check this step-by-step tutorial.
- Wondering how to do exploratory data analysis for time-series 🕛? Check how in this blogpost. To learn more about this feature, check the documentation.
- How to compare two datasets? We've got you covered with this step-by-step tutorial. To learn more about this feature, check the documentation.
- Want to scale to larger datasets? Check the release information with ⭐⚡Spark support! For more information about the Spark integration, check the documentation.
Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:
- Stack Overflow: ideal for asking questions on how to use the package
- GitHub Issues: bugs, proposals for changes, feature requests
- Discord: ideal for projects discussions, ask questions, collaborations, general chat
Help us prioritize: before reporting, double-check existing issues; it is always better to upvote!
Before reporting an issue on GitHub, check out Common Issues.
If you want to check whether your request was prioritized, see the project pipeline details.
Learn how to get involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.
A big thank you to all our amazing contributors!
⚡ We need your help - Spark!
Spark support has been released, but we are always looking for an extra pair of hands 👐. Check the current work in progress!