
Available Settings

A set of options is available to customize the behaviour of ydata-profiling and the appearance of the generated report. This depth of customization makes it possible to tailor the analysis to the specific dataset being analysed. The available settings are listed below. To learn how to change them, see the Changing settings page.

General settings

Global report settings:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| title | string | Pandas Profiling Report | Title for the report, shown in the header and title bar. |
| pool_size | integer | 0 | Number of workers in thread pool. When set to zero, it is set to the number of CPUs available. |
| progress_bar | boolean | True | If True, ydata-profiling will display a progress bar. |
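
As a minimal sketch, the general settings can be collected in a dictionary and passed to `ProfileReport` as keyword arguments. The title value and the helper `effective_pool_size` are illustrative assumptions: the helper only restates the documented rule for `pool_size` and is not the library's own code.

```python
import os

# Keyword arguments mirroring the general settings listed above.
general_settings = {
    "title": "Sales data profile",  # hypothetical title
    "pool_size": 0,                 # 0 means "use the number of available CPUs"
    "progress_bar": False,          # hide the progress bar, e.g. in batch jobs
}


def effective_pool_size(pool_size: int) -> int:
    """Illustrate the documented rule: zero falls back to the CPU count."""
    return pool_size if pool_size > 0 else (os.cpu_count() or 1)


def build_report(df):
    """Create a report with the settings above (ydata-profiling is imported lazily)."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **general_settings)
```

Calling `build_report(df).to_file("report.html")` would then write the report as in the other examples on this page.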

Variable summary settings

Settings related to the information displayed for each variable.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sort | None, asc or desc | None | Sort the variables in ascending (asc) or descending (desc) order, or leave the original order (None). |
| variables.descriptions | dict | {} | Display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'}). |
| vars.num.quantiles | list[float] | [0.05,0.25,0.5,0.75,0.95] | The quantiles to calculate. Note that .25, .5 and .75 are required for the computation of other metrics (median and IQR). |
| vars.num.skewness_threshold | integer | 20 | Warn if the skewness is above this threshold. |
| vars.num.low_categorical_threshold | integer | 5 | If the number of distinct values is smaller than this number, then the series is considered to be categorical. Set to 0 to disable. |
| vars.num.chi_squared_threshold | float | 0.999 | Set to 0 to disable chi-squared calculation. |
| vars.cat.length | boolean | True | Check the string length and aggregate values (min, max, mean, median). |
| vars.cat.characters | boolean | False | Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive. |
| vars.cat.words | boolean | False | Check the distribution of words. Often informative, but may be computationally expensive. |
| vars.cat.cardinality_threshold | integer | 50 | Warn if the number of distinct values is above this threshold. |
| vars.cat.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |
| vars.cat.n_obs | integer | 5 | Display this number of observations. |
| vars.cat.chi_squared_threshold | float | 0.999 | Same as vars.num.chi_squared_threshold, but for categorical variables. |
| vars.bool.n_obs | integer | 3 | Same as vars.cat.n_obs, but for boolean variables. |
| vars.bool.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |

Configuration example
  profile = df.profile_report(
      sort="ascending",
      vars={
          "num": {"low_categorical_threshold": 0},
          "cat": {
              "length": True,
              "characters": False,
              "words": False,
              "n_obs": 5,
          },
      },
  )

  profile.config.variables.descriptions = {
      "files": "Files in the filesystem",
      "datec": "Creation date",
      "datem": "Modification date",
  }

  profile.to_file("report.html")
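
Because the quantiles .25, .5 and .75 are required for the median and IQR, a custom `vars.num.quantiles` list should always contain them. The sketch below shows one way to guarantee that before building the settings. The helper `with_required_quantiles` and the extra quantile values are illustrative assumptions, not part of ydata-profiling.

```python
# Quantiles that the settings table marks as required (median and IQR).
REQUIRED_QUANTILES = (0.25, 0.5, 0.75)


def with_required_quantiles(quantiles):
    """Return a sorted quantile list that always contains 0.25, 0.5 and 0.75."""
    return sorted(set(quantiles) | set(REQUIRED_QUANTILES))


# Hypothetical choice: also report the 1st and 99th percentiles.
custom_quantiles = with_required_quantiles([0.01, 0.5, 0.99])

# Nested dictionary in the same shape as the `vars` argument used above.
vars_settings = {"num": {"quantiles": custom_quantiles}}


def build_report(df):
    """Create a report with the custom quantiles (ydata-profiling is imported lazily)."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, vars=vars_settings)
```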

Setting dataset schema type

Configure the schema type for a given dataset.

Set the variable type schema to generate the profile report
  import pandas as pd

  from ydata_profiling import ProfileReport
  from ydata_profiling.utils.cache import cache_file

  file_name = cache_file(
      "titanic.csv",
      "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
  )
  df = pd.read_csv(file_name)

  type_schema = {"Survived": "categorical", "Embarked": "categorical"}

  # type_schema only needs to list the variables whose types we are certain of.
  # The types of all other variables are inferred automatically.
  report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

  report.to_file("report.html")

Missing data overview plots

Settings related to the missing data section and the visualizations it can include.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| missing_diagrams.bar | boolean | True | Display a bar chart with counts of missing values for each column. |
| missing_diagrams.matrix | boolean | True | Display a matrix of missing values. Similar to the bar chart, but may also give an overview of the co-occurrence of missing values in rows. |
| missing_diagrams.heatmap | boolean | True | Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another). |

Configuration example: disable heatmap for large datasets
  profile = df.profile_report(
      missing_diagrams={
          "heatmap": False,
      }
  )
  profile.to_file("report.html")
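
The choice to drop the heatmap can also be made from the dataset size at run time. The following is a sketch under stated assumptions: the row threshold `LARGE_DATASET_ROWS` and the helper `missing_diagram_settings` are hypothetical and should be tuned to your own data and hardware.

```python
# Hypothetical cut-off: row count at or above which the heatmap is skipped.
LARGE_DATASET_ROWS = 1_000_000


def missing_diagram_settings(n_rows, large_threshold=LARGE_DATASET_ROWS):
    """Keep the bar chart and matrix, but drop the heatmap for large inputs."""
    return {
        "bar": True,
        "matrix": True,
        "heatmap": n_rows < large_threshold,
    }


def build_report(df):
    """Create a report whose missing-data plots depend on the number of rows."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, missing_diagrams=missing_diagram_settings(len(df)))
```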

Correlations

Settings regarding correlation metrics and thresholds.
The default value is auto, but the following correlation matrices are available:

| Parameter | Description |
| --- | --- |
| auto | Calculates the pairwise column correlation depending on the type schema: Spearman correlation coefficient for numerical to numerical variables; Cramer's V association coefficient for categorical to categorical variables; Cramer's V association coefficient, with the numerical variable discretized automatically, for numerical to categorical variables. |
| spearman | Spearman's correlation measures the strength and direction of the monotonic association between two variables. Well suited to ordinal or continuous variables whose relationship is monotonic but not necessarily linear. |
| pearson | The Pearson correlation coefficient is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the linear relationship between two variables. |
| kendall | The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. Kendall's tau is often used when data doesn't meet one of the requirements of Pearson's correlation. |
| phi_k | Phi K is especially suitable for working with mixed-type variables. Using this coefficient we can find (un)expected correlations and evaluate their statistical significance. |
| cramers | Cramer's V measures the association between categorical variables and is commonly used when the contingency table is larger than 2x2. |
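
To make the difference between a linear measure (Pearson) and a monotonic one (Spearman) concrete, here is a small pure-Python illustration. It is not how ydata-profiling computes correlations internally; the functions and the sample data are assumptions made for the example, and the ranking helper assumes no tied values.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# A monotonic but non-linear relationship: y grows as the cube of x.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

print("pearson :", round(pearson(xs, ys), 3))
print("spearman:", round(spearman(xs, ys), 3))
```

Spearman treats the cubic relationship as a perfect monotonic association, while Pearson stays below 1 because the relationship is not linear.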

For each correlation matrix you can use the following configurations:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| correlations.auto.calculate | boolean | True | Whether to compute 'auto' correlation |
| correlations.auto.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.auto.threshold | float | 0.9 | Warning threshold |
| correlations.pearson.calculate | boolean | False | Whether to calculate Pearson correlation |
| correlations.pearson.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.pearson.threshold | float | 0.9 | Warning threshold |
| correlations.spearman.calculate | boolean | False | Whether to calculate Spearman correlation |
| correlations.spearman.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.spearman.threshold | float | 0.9 | Warning threshold |
| correlations.kendall.calculate | boolean | False | Whether to calculate Kendall rank correlation |
| correlations.kendall.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.kendall.threshold | float | 0.9 | Warning threshold |
| correlations.phi_k.calculate | boolean | False | Whether to calculate Phi K correlation |
| correlations.phi_k.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.phi_k.threshold | float | 0.9 | Warning threshold |
| correlations.cramers.calculate | boolean | False | Whether to calculate Cramer's V association coefficient |
| correlations.cramers.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.cramers.threshold | float | 0.9 | Warning threshold |

For instance, to disable all correlation computations (might be relevant for large datasets):

Disabling all correlation matrices
    profile = df.profile_report(
        title="Report without correlations",
        correlations={
            "auto": {"calculate": False},
            "pearson": {"calculate": False},
            "spearman": {"calculate": False},
            "kendall": {"calculate": False},
            "phi_k": {"calculate": False},
            "cramers": {"calculate": False},
        },
    )

    # or using a shorthand that is available for correlations
    profile = df.profile_report(
        title="Report without correlations",
        correlations=None,
    )
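
The same nested dictionary can enable only selected matrices and adjust their warning thresholds. The helper below is a sketch: `only_correlations` and the threshold of 0.8 are illustrative assumptions, while the matrix names and the `calculate` and `threshold` keys come from the settings listed above.

```python
# Names of the correlation matrices described in this section.
CORRELATION_NAMES = ("auto", "pearson", "spearman", "kendall", "phi_k", "cramers")


def only_correlations(enabled, threshold=0.9):
    """Enable the given matrices, disable the rest, and set one warning threshold."""
    unknown = set(enabled) - set(CORRELATION_NAMES)
    if unknown:
        raise ValueError(f"Unknown correlation names: {sorted(unknown)}")
    return {
        name: {"calculate": name in enabled, "threshold": threshold}
        for name in CORRELATION_NAMES
    }


# Hypothetical choice: compute only Pearson and Spearman, warn above 0.8.
correlation_settings = only_correlations({"pearson", "spearman"}, threshold=0.8)


def build_report(df):
    """Create a report with the selected correlations (lazy import)."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, correlations=correlation_settings)
```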

Interactions

Settings related to the interactions section.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| interactions.continuous | boolean | True | Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs. |
| interactions.targets | list | [] | When a list of variable names is given, only interactions between these and all other variables are computed. |
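
Restricting interactions to a few target variables can shorten report generation on wide datasets. A minimal sketch follows, assuming the settings are passed as an `interactions` keyword argument in the same nested form as the other sections. The helper `interaction_settings` is hypothetical; it only checks that each target is a real column before building the dictionary.

```python
def interaction_settings(columns, targets):
    """Build the interactions setting, rejecting targets that are not columns."""
    missing = [name for name in targets if name not in columns]
    if missing:
        raise ValueError(f"Unknown target columns: {missing}")
    return {"continuous": True, "targets": list(targets)}


def build_report(df, targets):
    """Create a report limited to interactions with the given targets (lazy import)."""
    from ydata_profiling import ProfileReport

    settings = interaction_settings(list(df.columns), targets)
    return ProfileReport(df, interactions=settings)
```

For the Titanic data used earlier, `build_report(df, ["Age", "Fare"])` would plot only the interactions that involve those two columns.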

Report's appearance

Settings related to the appearance and style of the report.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| html.minify_html | boolean | True | If True, the output HTML is minified using the htmlmin package. |
| html.use_local_assets | boolean | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
| html.inline | boolean | True | If True, all assets are contained in the report. If False, then a web export is created, where all assets are stored in the '[REPORT_NAME]_assets/' directory. |
| html.navbar_show | boolean | True | Whether to include a navigation bar in the report |
| html.style.theme | string | None | Select a bootswatch theme. Available options: flatly (dark) and united (orange) |
| html.style.logo | string | nan | A base64 encoded logo, to display in the navigation bar. |
| html.style.primary_color | string | #337ab7 | The primary color to use in the report. |
| html.style.full_width | boolean | False | By default, the width of the report is fixed. If set to True, the full width of the screen is used. |
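
As a sketch of how the appearance settings fit together, the dictionary below mirrors the dotted parameter names above as nested keys under `html`. The specific values and the report title are illustrative choices, not recommendations.

```python
# Nested form of the html.* settings; dotted names become nested keys.
html_settings = {
    "minify_html": True,
    "inline": False,       # write assets to '[REPORT_NAME]_assets/' instead of inlining
    "navbar_show": True,
    "style": {
        "theme": "united",           # one of the two themes listed above
        "primary_color": "#337ab7",  # the documented default, shown explicitly
        "full_width": True,
    },
}


def export_report(df, path="report.html"):
    """Write a styled report to `path` (ydata-profiling is imported lazily)."""
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="Styled report", html=html_settings)
    report.to_file(path)
    return path
```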