Available Settings

A set of options is available to customize the behaviour of ydata-profiling and the appearance of the generated report. This depth of customization makes it possible to tailor the analysis to the specific dataset being analysed. The available settings are listed below. To learn how to change them, see :doc:changing_settings.

General settings

Global report settings:

Parameter	Type	Default	Description
`title`	string	Pandas Profiling Report	Title for the report, shown in the header and title bar.
`pool_size`	integer	0	Number of workers in thread pool. When set to zero, it is set to the number of CPUs available.
`progress_bar`	boolean	`True`	If `True`, `ydata-profiling` will display a progress bar.
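
As a minimal sketch, the general settings above might be passed as keyword arguments, in the same way as the `title` argument used in the examples on this page. The `GENERAL_SETTINGS` name, the `build_report` helper and the chosen values are illustrative assumptions, not part of the library.

```python
# Illustrative values for the three general settings listed above.
GENERAL_SETTINGS = {
    "title": "Sales data overview",  # shown in the header and title bar
    "pool_size": 4,                  # 0 would mean "use all available CPUs"
    "progress_bar": False,           # hide the progress bar, e.g. in batch jobs
}


def build_report(df):
    """Create a report for `df` with the settings above (hypothetical helper)."""
    # Imported lazily so the settings can be defined without loading the library.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **GENERAL_SETTINGS)
```

With a pandas DataFrame `df` at hand, `build_report(df).to_file("report.html")` would then write the report.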

Variable summary settings

Settings related to the information displayed for each variable.

Parameter	Type	Default	Description
`sort`	None, asc or desc	`None`	Sort the variables in ascending (asc) or descending (desc) order, or leave the original order (None).
`variables.descriptions`	dict	{}	Ability to display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'}).
`vars.num.quantiles`	list[float]	[0.05,0.25,0.5,0.75,0.95]	The quantiles to calculate. Note that .25, .5 and .75 are required for the computation of other metrics (median and IQR).
`vars.num.skewness_threshold`	integer	20	Warn if the skewness is above this threshold.
`vars.num.low_categorical_threshold`	integer	5	If the number of distinct values is smaller than this number, then the series is considered to be categorical. Set to 0 to disable.
`vars.num.chi_squared_threshold`	float	0.999	Set to 0 to disable chi-squared calculation.
`vars.cat.length`	boolean	`True`	Check the string length and aggregate values (min, max, mean, median).
`vars.cat.characters`	boolean	`False`	Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive.
`vars.cat.words`	boolean	`False`	Check the distribution of words. Often informative, but may be computationally expensive.
`vars.cat.cardinality_threshold`	integer	50	Warn if the number of distinct values is above this threshold.
`vars.cat.imbalance_threshold`	float	0.5	Warn if the imbalance score is above this threshold.
`vars.cat.n_obs`	integer	5	Display this number of observations.
`vars.cat.chi_squared_threshold`	float	0.999	Same as `vars.num.chi_squared_threshold`, but for categorical variables.
`vars.bool.n_obs`	integer	3	Same as `vars.cat.n_obs`, but for boolean variables.
`vars.bool.imbalance_threshold`	float	0.5	Warn if the imbalance score is above this threshold.
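
The note on `vars.num.quantiles` can be illustrated with the standard library alone: the median is the 0.5 quantile, and the IQR is the difference between the 0.75 and 0.25 quantiles, so dropping any of the three leaves those metrics undefined. This is only a sketch of the arithmetic on made-up values; the interpolation method used by the library itself may differ.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three quartile cut points (0.25, 0.5 and 0.75).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range is the spread between the outer quartiles.
iqr = q3 - q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```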

Configuration example
  profile = df.profile_report(
      sort="ascending",
      vars={
          "num": {"low_categorical_threshold": 0},
          "cat": {
              "length": True,
              "characters": False,
              "words": False,
              "n_obs": 5,
          },
      },
  )

  profile.config.variables.descriptions = {
      "files": "Files in the filesystem",
      "datec": "Creation date",
      "datem": "Modification date",
  }

  profile.to_file("report.html")

Setting dataset schema type

Configure the schema type for a given dataset.

Set the variable type schema to generate the profile report
  import pandas as pd

  from ydata_profiling import ProfileReport
  from ydata_profiling.utils.cache import cache_file

  file_name = cache_file(
      "titanic.csv",
      "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
  )
  df = pd.read_csv(file_name)

  type_schema = {"Survived": "categorical", "Embarked": "categorical"}

  # type_schema only needs to cover the variables whose types we are certain of.
  # The types of all other variables are inferred automatically.
  report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

  report.to_file("report.html")

Missing data overview plots

Settings related to the missing data section and the visualizations it can include.

Parameter	Type	Default	Description
`missing_diagrams.bar`	boolean	`True`	Display a bar chart with counts of missing values for each column.
`missing_diagrams.matrix`	boolean	`True`	Display a matrix of missing values. Similar to the bar chart, but can also reveal the co-occurrence of missing values in rows.
`missing_diagrams.heatmap`	boolean	`True`	Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another).

Configuration example: disable heatmap for large datasets
  profile = df.profile_report(
      missing_diagrams={
          "heatmap": False,
      }
  )
  profile.to_file("report.html")
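
The nullity correlation shown in the heatmap can be sketched in plain Python: each column is turned into a 0/1 "is missing" indicator, and the indicators of two columns are then correlated. The rows, the column names and both helper functions are made up for illustration; the library's own computation may differ in detail.

```python
import math

# Toy rows; None marks a missing value. Column names are made up.
rows = [
    {"age": 31, "income": 4200},
    {"age": None, "income": None},
    {"age": 45, "income": 5100},
    {"age": None, "income": None},
    {"age": 28, "income": None},
    {"age": 52, "income": 6100},
]


def missing_indicator(rows, column):
    """Return 1.0 where the column is missing and 0.0 where it is present."""
    return [1.0 if row[column] is None else 0.0 for row in rows]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Undefined (division by zero) if either sequence is constant,
    e.g. for a column without any missing values.
    """
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


nullity_corr = pearson(
    missing_indicator(rows, "age"),
    missing_indicator(rows, "income"),
)
print(round(nullity_corr, 3))  # 0.707
```

A value close to 1 means the two columns tend to be missing in the same rows; a value close to -1 means one tends to be present when the other is missing.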

Correlations

Settings regarding correlation metrics and thresholds.
By default, only the `auto` correlation is calculated, but the following correlation matrices are available:

Parameter	Description
`auto`	Calculates the column pairwise correlation depending on the type schema:
	- numerical to numerical variable: Spearman correlation coefficient
	- categorical to categorical variable: Cramer's V association coefficient
	- numerical to categorical: Cramer's V association coefficient with the numerical variable discretized automatically
`spearman`	Spearman's correlation measures the strength and direction of the monotonic association between two variables. It is well suited to ordinal variables and to continuous variables whose relationship is monotonic but not necessarily linear.
`pearson`	The Pearson correlation coefficient is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables.
`kendall`	The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. Kendall's tau is often used when the data does not meet one of the requirements of Pearson's correlation.
`phi_k`	Phi K is especially suitable for working with mixed-type variables. Using this coefficient we can find (un)expected correlations and evaluate their statistical significance.
`cramers`	Cramer's V is an association coefficient commonly used to examine the association between categorical variables when the contingency table is larger than 2x2.
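
The difference between a linear (Pearson) and a monotonic (Spearman) measure can be sketched in plain Python, using the fact that Spearman's coefficient is the Pearson coefficient computed on ranks. The helper functions and the data are illustrative only and are not the library's implementation; the ranking helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic in x, but not linear

print(spearman(x, y))           # 1.0
print(round(pearson(x, y), 3))  # 0.943
```

Because `y` always increases with `x`, the rank-based coefficient is exactly 1, while the Pearson coefficient falls below 1 since the relationship is not a straight line.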

For each correlation matrix you can use the following configurations:

Parameter	Type	Default	Description
`correlations.auto.calculate`	boolean	`True`	Whether to compute 'auto' correlation
`correlations.auto.warn_high_correlations`	boolean	`True`	Show warning for correlations higher than the threshold
`correlations.auto.threshold`	float	0.9	Warning threshold
`correlations.pearson.calculate`	boolean	`False`	Whether to calculate Pearson correlation
`correlations.pearson.warn_high_correlations`	boolean	`True`	Show warning for correlations higher than the threshold
`correlations.pearson.threshold`	float	0.9	Warning threshold
`correlations.spearman.calculate`	boolean	`False`	Whether to calculate Spearman correlation
`correlations.spearman.warn_high_correlations`	boolean	`False`	Show warning for correlations higher than the threshold
`correlations.spearman.threshold`	float	0.9	Warning threshold
`correlations.kendall.calculate`	boolean	`False`	Whether to calculate Kendall rank correlation
`correlations.kendall.warn_high_correlations`	boolean	`False`	Show warning for correlations higher than the threshold
`correlations.kendall.threshold`	float	0.9	Warning threshold
`correlations.phi_k.calculate`	boolean	`False`	Whether to calculate Phi K correlation
`correlations.phi_k.warn_high_correlations`	boolean	`False`	Show warning for correlations higher than the threshold
`correlations.phi_k.threshold`	float	0.9	Warning threshold
`correlations.cramers.calculate`	boolean	`False`	Whether to calculate Cramer's V association coefficient
`correlations.cramers.warn_high_correlations`	boolean	`True`	Show warning for correlations higher than the threshold
`correlations.cramers.threshold`	float	0.9	Warning threshold

For instance, to disable all correlation computations (might be relevant for large datasets):

Disabling all correlation matrices
    profile = df.profile_report(
        title="Report without correlations",
        correlations={
            "auto": {"calculate": False},
            "pearson": {"calculate": False},
            "spearman": {"calculate": False},
            "kendall": {"calculate": False},
            "phi_k": {"calculate": False},
            "cramers": {"calculate": False},
        },
    )

    # or using a shorthand that is available for correlations
    profile = df.profile_report(
        title="Report without correlations",
        correlations=None,
    )

Interactions

Settings related to the interactions section.

Parameter	Type	Default	Description
`interactions.continuous`	boolean	`True`	Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs.
`interactions.targets`	list	[]	When a list of variable names is given, only interactions between these and all other variables are computed.
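
A minimal sketch of how the interaction settings above might be combined, assuming they are passed as a nested dictionary like the `vars` and `correlations` settings in the other examples. The `INTERACTION_SETTINGS` name and the `build_report` helper are hypothetical, and the column names `Age` and `Fare` assume the Titanic dataset used earlier on this page.

```python
# Only compute interactions between the target columns and all other variables.
INTERACTION_SETTINGS = {
    "interactions": {
        "continuous": True,          # keep the scatter / hexbin plots enabled
        "targets": ["Age", "Fare"],  # assumed column names (Titanic dataset)
    }
}


def build_report(df):
    """Create a report restricted to the target interactions (hypothetical helper)."""
    # Imported lazily so the settings can be defined without loading the library.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title="Titanic interactions", **INTERACTION_SETTINGS)
```

Restricting `targets` is mainly useful on wide datasets, where plotting every pair of continuous variables becomes expensive.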

Report's appearance

Settings related to the appearance and style of the report.

Parameter	Type	Default	Description
`html.minify_html`	boolean	`True`	If `True`, the output HTML is minified using the `htmlmin` package.
`html.use_local_assets`	boolean	`True`	If `True`, all assets (stylesheets, scripts, images) are stored locally. If `False`, a CDN is used for some stylesheets and scripts.
`html.inline`	boolean	`True`	If `True`, all assets are contained in the report. If `False`, then a web export is created, where all assets are stored in the '[REPORT_NAME]_assets/' directory.
`html.navbar_show`	boolean	`True`	Whether to include a navigation bar in the report
`html.style.theme`	string	`None`	Select a Bootswatch theme. Available options: flatly (dark blue) and united (orange).
`html.style.logo`	string	nan	A base64 encoded logo, to display in the navigation bar.
`html.style.primary_color`	string	#337ab7	The primary color to use in the report.
`html.style.full_width`	boolean	`False`	By default, the width of the report is fixed. If set to `True`, the full width of the screen is used.
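
Configuration example: a sketch of how some of the appearance settings above might be set, assuming the nested `html` dictionary follows the same pattern as the other settings on this page. The `APPEARANCE_SETTINGS` name and the `build_report` helper are hypothetical.

```python
# Nested keys mirror the dotted parameter names in the table above,
# e.g. "html.style.full_width" becomes html -> style -> full_width.
APPEARANCE_SETTINGS = {
    "html": {
        "minify_html": False,  # keep the HTML readable, e.g. for debugging
        "navbar_show": False,  # omit the navigation bar
        "style": {
            "full_width": True,  # use the full width of the screen
        },
    }
}


def build_report(df):
    """Create a report with the appearance settings above (hypothetical helper)."""
    # Imported lazily so the settings can be defined without loading the library.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **APPEARANCE_SETTINGS)
```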