Profiling large datasets
By default, `ydata-profiling` comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in quasi real-time. For larger datasets, deciding upfront which calculations to make might be required. Whether a computation scales to a large dataset depends not only on the exact size of the dataset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck, `ydata-profiling` offers several alternatives to overcome it.
Scale in a fully managed system
Looking for a fully managed system that is able to scale the profiling of large datasets? Sign up for the Fabric community for distributed data profiling.
Pyspark
This feature was introduced in version v4.0.0
`ydata-profiling` now supports Spark Dataframes profiling. You can find an example of the integration here.
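As a minimal sketch of what the integration can look like (the file path, column handling, and report title below are illustrative assumptions, not part of the official example):

```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("profiling-example").getOrCreate()

# Read a large dataset directly into a Spark DataFrame;
# "large_dataset.csv" is a placeholder path
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# ProfileReport accepts the Spark DataFrame directly
report = ProfileReport(df, title="Spark profiling report")
report.to_file("report.html")
```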
Features supported:

- Univariate variables' analysis
- Head and Tail dataset sample
- Correlation matrices: Pearson and Spearman
Coming soon:

- Missing values analysis
- Interactions
- Improved histogram computation
Keep an eye on the GitHub page to follow the updates on the implementation of Pyspark Dataframes support.
Minimal mode
This mode was introduced in version v2.4.0
`ydata-profiling` includes a minimal configuration file where the most expensive computations are turned off by default. This is the recommended starting point for larger datasets.
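A minimal sketch of enabling this mode from Python (the file paths are illustrative placeholders):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# "large_dataset.csv" is a placeholder path
large_dataset = pd.read_csv("large_dataset.csv")

# minimal=True applies the minimal configuration, turning off the most
# expensive computations by default
profile = ProfileReport(large_dataset, title="Profiling Report", minimal=True)
profile.to_file("output.html")
```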
This configuration file can be found here: config_minimal.yaml. More details on settings and configuration are available in [Available settings](../advanced_usage/available_settings).
Sample the dataset
An alternative way to handle really large datasets is to use a portion of them to generate the profiling report. Several users report that this is a good way to scale back the computation time while maintaining representativeness.
> pandas-profiling is a nifty tool to compute descriptive statistics on a dataframe and issue warnings on columns with many missing values, high skew, high cardinality categorical values, high correlation, etc: https://t.co/57IezPW1nI demo notebook: https://t.co/JpqTO9FK1p
>
> — Olivier Grisel (@ogrisel) January 11, 2018
Sampling a large dataset:
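A minimal sketch, reusing the `large_dataset` DataFrame from the sketch above (the sample size is an illustrative choice):

```python
from ydata_profiling import ProfileReport

# Draw a random sample of 10,000 rows from the full dataset
sample = large_dataset.sample(n=10_000)

profile = ProfileReport(sample, title="Sample of a large dataset", minimal=True)
profile.to_file("output.html")
```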
The reader of the report might want to know that the profile is generated using a sample from the data. This can be done by adding a description to the report (see metadata for details).
Sample 5% of your dataset:
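A sketch of combining a 5% sample with such a description, passed through the `dataset` metadata argument (again reusing the `large_dataset` DataFrame from above):

```python
from ydata_profiling import ProfileReport

# A note for the report reader, shown in the report metadata
description = (
    "Disclaimer: this profiling report was generated using "
    "a sample of 5% of the original dataset."
)

# Sample 5% of the rows of the DataFrame
sample = large_dataset.sample(frac=0.05)

profile = ProfileReport(
    sample,
    dataset={"description": description},
    minimal=True,
)
profile.to_file("output.html")
```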
Disable expensive computations
To decrease the computational burden on particularly large datasets while still retaining the information of interest, some computations can be restricted to certain columns. In particular, a list of targets can be provided for Interactions, so that only the interactions involving these specific variables are computed.
The setting controlling this, `interactions.targets`, can be changed via multiple interfaces (configuration files or environment variables). For details, see [Changing settings](../advanced_usage/changing_settings).
Concurrency
`ydata-profiling` is a project under active development. One of the highly desired features is the addition of a scalable backend such as Modin or Dask.
Keep an eye on the GitHub page to follow the updates on the implementation of a concurrent and highly scalable backend. Specifically, development of a Spark backend is currently underway.