
Welcome

Data quality profiling and exploratory data analysis are crucial steps in data science and machine learning development. ydata-profiling, a pioneering Python package, is a leading tool for the data understanding step of the data science workflow.

ydata-profiling automates and standardizes the generation of detailed data profiling reports, complete with statistics and visualizations. Its main value is that it reduces the work of understanding and preparing data for analysis to a single line of code. If you're ready to get started, see the quickstart!

Advent of Code - Get featured on ydata-profiling

“I want to get into open source, but I don’t know how.” - Does this sound familiar to you? Have you been wanting to get more involved with open-source software, but no one’s given you an entry point?

That's why we joined Advent of Code this year. Contribute to ydata-profiling and win some 🐼🐼 swag!

How can you be part of it?

ydata-profiling report

Why use ydata-profiling?

ydata-profiling is a valuable tool for data scientists and analysts because it streamlines EDA, provides comprehensive insights, enhances data quality, and promotes data science best practices.

  • Simple to use: a single line of code is all you need to get started. Do you really need more to convince you? 😛
    import pandas as pd
    from ydata_profiling import ProfileReport
    
    # Load the dataset to profile
    df = pd.read_csv('data.csv')
    # Build the profiling report for the DataFrame
    profile = ProfileReport(df, title="Profiling Report")
    
  • Comprehensive insights in a report: the report includes a wide range of statistics and visualizations, providing a holistic view of your data. It can be shared as an HTML file or embedded as a widget in a Jupyter Notebook.
  • Data quality assessment: excels at identifying missing data, duplicate entries, and outliers. These insights are essential for data cleaning and preparation: they help you catch problems early and make your analysis more reliable.
  • Ease of integration with other flows: all profiling metrics can be consumed in a standard JSON format.
  • Data exploration for large datasets: even for datasets with a large number of rows, ydata-profiling can help, as it supports both pandas DataFrames and Spark DataFrames.
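
As a rough sketch of the JSON integration point: once a report has been serialized (for example with `profile.to_json()`), downstream code can read it using only the standard library. The helper name `columns_with_missing`, the key names `variables` and `p_missing`, and the sample payload are assumptions for illustration; check them against the JSON your installed version actually produces.

```python
import json


def columns_with_missing(report_json, threshold=0.0):
    """Return {column: missing_fraction} for columns above `threshold`.

    Assumes a top-level "variables" mapping whose entries carry a
    "p_missing" fraction; adjust the keys if your report differs.
    """
    report = json.loads(report_json)
    flagged = {}
    for name, stats in report.get("variables", {}).items():
        fraction = stats.get("p_missing", 0.0)
        if fraction > threshold:
            flagged[name] = fraction
    return flagged


# Hand-written stand-in for the string returned by profile.to_json()
sample = json.dumps({
    "variables": {
        "age": {"p_missing": 0.25},
        "city": {"p_missing": 0.0},
    }
})

print(columns_with_missing(sample))
```

A check like this could run in a data pipeline to flag columns whose share of missing values exceeds a chosen threshold, without opening the HTML report.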

To learn more about the package, check out the concepts overview.

πŸ“ Features, functionalities & integrations

ydata-profiling supports a variety of use cases. The documentation includes guides, tips, and tricks for tackling them:

Data Catalog with data profiling for databases & storages

Need to profile directly from databases and data storages (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.)?

Try YData Fabric Data Catalog for interactive and scalable data profiling

Check out the free Community Version.

Features & functionalities | Description
Comparing datasets | Comparing multiple versions of the same dataset
Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code
Profiling large datasets | Tips on how to prepare data and configure ydata-profiling for working with large datasets
Handling sensitive data | Generating reports that are mindful of sensitive data in the input dataset
Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries
Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations
Profiling Relational databases ** | For a seamless profiling experience in your organization's databases, check Fabric Data Catalog, which lets you consume data from different types of storages, such as RDBMS (Azure SQL, PostgreSQL, Oracle, etc.), Snowflake, and object storage (Google Cloud Storage, AWS S3, etc.), among others.
PII classification & management ** | Automated PII classification and management through a UI experience
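
Several of the scenarios in the table above are switched on through keyword arguments to `ProfileReport`. The sketch below collects them in one place. `report_options` is a hypothetical helper written for this illustration, not part of the package; the keyword names it emits (`minimal`, `tsmode`, `sortby`, `sensitive`) are the `ProfileReport` options as I understand them, so verify them against the guides for your installed version.

```python
def report_options(large=False, time_series=False, sort_column=None, sensitive=False):
    """Build ProfileReport keyword arguments for common scenarios.

    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    options = {}
    if large:
        # Minimal mode skips the most expensive computations
        options["minimal"] = True
    if time_series:
        # Time-series mode; sorting by a time column is optional
        options["tsmode"] = True
        if sort_column is not None:
            options["sortby"] = sort_column
    if sensitive:
        # Sensitive mode keeps raw record values out of the report
        options["sensitive"] = True
    return options


kwargs = report_options(time_series=True, sort_column="date")
print(kwargs)

# Intended use (requires ydata-profiling and a DataFrame `df`):
#   profile = ProfileReport(df, title="Profiling Report", **kwargs)
```

Keeping the options in a plain dictionary makes it easy to log or reuse the exact configuration a report was generated with.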

Tutorials

Looking for how to use certain features or how to integrate ydata-profiling into your current stack and workflows? Check our step-by-step tutorials.

🙋 Support

Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:

  • Stack Overflow: ideal for asking questions on how to use the package
  • GitHub Issues: bugs, proposals for changes, feature requests
  • Discord: ideal for project discussions, questions, collaborations, and general chat

Help us prioritize: before reporting, double-check whether the issue already exists. Upvoting an existing report is always better than opening a duplicate!

Before reporting an issue on GitHub, check out Common Issues.

To check whether your request has been prioritized, see the project pipeline details.

🤝🏽 Contributing

Learn how to get involved in the Contribution Guide.

A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.

A big thank you to all our amazing contributors!

⚡ We need your help - Spark!

Spark support has been released, but we are always looking for an extra pair of hands. Check the current work in progress!