Welcome

Data quality profiling and exploratory data analysis are crucial steps in data science and machine learning development. ydata-profiling is a pioneering Python package and a leading tool for the data understanding step of the data science workflow.

ydata-profiling is a leading package for data profiling that automates and standardizes the generation of detailed reports, complete with statistics and visualizations. Its significance lies in how it streamlines understanding and preparing data for analysis, all in a single line of code! If you're ready to get started, see the quickstart!

Looking for a scalable solution that integrates with database systems?

Leverage YData Fabric Data Catalog to connect to different databases and storages (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.) and get an interactive, guided profiling experience in Fabric.

Check out the Community Version.

ydata-profiling report

Why use ydata-profiling?

ydata-profiling is a valuable tool for data scientists and analysts because it streamlines EDA, provides comprehensive insights, enhances data quality, and promotes data science best practices.

Simple to use: a single line of code is all you need to get started. Do you really need more to convince you? 😛

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('data.csv')
profile = ProfileReport(df, title="Profiling Report")

Comprehensive insights in a report: the report includes a wide range of statistics and visualizations, providing a holistic view of your data. It can be shared as an HTML file or embedded as a widget in a Jupyter Notebook.
Data quality assessment: ydata-profiling excels at identifying missing data, duplicate entries and outliers. These insights are essential for data cleaning and preparation, ensuring the reliability of your analysis and surfacing problems early.
Ease of integration with other flows: all data profiling metrics can be consumed in a standard JSON format.
Data exploration for large datasets: even for datasets with a large number of rows, ydata-profiling can help, as it supports both pandas DataFrames and Spark DataFrames.
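
The HTML and JSON outputs mentioned in the list can be produced from the same report object. The sketch below is a minimal illustration, not an official recipe: the `export_report` helper, the `reports` directory and the file names are hypothetical, while `to_file` is the report's own method and picks the output format from the file extension.

```python
from pathlib import Path


def export_report(profile, out_dir="reports"):
    """Write a profiling report to HTML and JSON and return both paths.

    `profile` is expected to be a ydata_profiling.ProfileReport. The helper
    name, directory and file names are illustrative, not part of the package.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    html_path = out / "report.html"
    json_path = out / "report.json"

    # to_file chooses the output format from the file extension.
    profile.to_file(html_path)  # shareable HTML report
    profile.to_file(json_path)  # machine-readable metrics
    return html_path, json_path
```

With the `profile` object created earlier, `export_report(profile)` writes both files. Inside a Jupyter Notebook, `profile.to_widgets()` or `profile.to_notebook_iframe()` renders the report inline, and `profile.to_json()` returns the JSON as a string instead of writing it to disk.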

To learn more about the package, check out the concepts overview.

📝 Features, functionalities & integrations

ydata-profiling can be used in a variety of applications. The documentation includes guides, tips and tricks for tackling them:

Features & functionalities	Description
Comparing datasets	Comparing multiple versions of the same dataset
Profiling a Time-Series dataset	Generating a report for a time-series dataset with a single line of code
Profiling large datasets	Tips on how to prepare data and configure `ydata-profiling` for working with large datasets
Handling sensitive data	Generating reports which are mindful about sensitive data in the input dataset
Dataset metadata and data dictionaries	Complementing the report with dataset details and column-specific data dictionaries
Customizing the report's appearance	Changing the appearance of the report's page and of the contained visualizations
Profiling Databases	For a seamless profiling experience in your organization's databases, check Fabric Data Catalog, which lets you consume data from different types of storage such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others.
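
Several rows in the table above come down to a few `ProfileReport` keyword arguments. The sketch below gathers them in one place as a rough guide. The `profile_options` helper and its parameter names are hypothetical; the keys it returns (`minimal`, `sensitive`, `tsmode`, `sortby`, `dataset`) are arguments accepted by `ProfileReport`. The linked guides remain the reference for the details.

```python
def profile_options(*, large=False, sensitive=False, time_column=None,
                    description=None):
    """Build ProfileReport keyword arguments for a few common needs.

    The helper and its parameter names are illustrative only; the keys of
    the returned dict are ProfileReport arguments.
    """
    options = {}
    if large:
        # Minimal mode switches off the most expensive computations.
        options["minimal"] = True
    if sensitive:
        # Sensitive mode keeps the report to aggregate information.
        options["sensitive"] = True
    if time_column is not None:
        # Time-series mode, with rows ordered by the given column.
        options["tsmode"] = True
        options["sortby"] = time_column
    if description is not None:
        # Dataset-level metadata shown in the report overview.
        options["dataset"] = {"description": description}
    return options
```

A call such as `ProfileReport(df, title="Profiling Report", **profile_options(large=True, time_column="date"))` would then request a minimal, time-series-aware report, assuming `df` has a column named `date`.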

Tutorials

Looking for how to use certain features or how to integrate ydata-profiling into your current stack and workflows? Check our step-by-step tutorials.

How to master exploratory data analysis with ydata-profiling? Check this step-by-step tutorial.
Looking to do exploratory data analysis for time-series 🕛? Check how in this blogpost. To learn more about this feature, check the documentation.
How to compare 2 datasets? We've got you covered with this step-by-step tutorial. To learn more about this feature, check the documentation.
Want to scale for larger datasets? Check the information about the release with ⭐⚡Spark support! For more information about the Spark integration, check the documentation.
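
As a quick taste of the dataset comparison covered in the tutorial, the sketch below compares two existing reports and saves the result. It is a minimal illustration: the `save_comparison` helper and the default file name are hypothetical, while `compare` and `to_file` are methods of the report object.

```python
def save_comparison(report_a, report_b, out_file="comparison.html"):
    """Compare two ProfileReport objects and write the result to disk.

    The helper name and the default file name are illustrative only.
    """
    # compare() returns a new report that covers both datasets side by side.
    comparison = report_a.compare(report_b)
    comparison.to_file(out_file)
    return comparison
```

The two inputs would typically be built from two versions of the same dataset, for example `ProfileReport(train_df, title="Train")` and `ProfileReport(test_df, title="Test")`, where `train_df` and `test_df` are placeholder names.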

🙋 Support

Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:

Stack Overflow: ideal for asking questions on how to use the package
GitHub Issues: bugs, proposals for changes, feature requests
Discord: ideal for project discussions, asking questions, collaborations, general chat

Help us prioritize: before reporting, double-check whether the issue already exists; it is always better to upvote!

Before reporting an issue on GitHub, check out Common Issues.

If you want to check whether your request was prioritized, see the project pipeline details.

🤝🏽 Contributing

Learn how to get involved in the Contribution Guide.

A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.

A big thank you to all our amazing contributors!

⚡ We need your help - Spark!

Spark support has been released, but we are always looking for an extra pair of hands 👐. Check the current work in progress!