Dataset description
When sharing reports with coworkers or publishing them online, it can be
important to include metadata about the dataset, such as the author, the
copyright holder or a description. ydata-profiling allows complementing a
report with that information. Inspired by schema.org's Dataset, the
currently supported properties are description, creator, author, url,
copyright_year and copyright_holder.
The following example shows how to generate a report with a
description, copyright_holder, copyright_year and url.
In the generated report, these properties are found in the Overview,
under About.
Add profile report description |
---|
| from pathlib import Path

# df is an existing pandas DataFrame
report = df.profile_report(
    title="Masked data",
    dataset={
        "description": "This profiling report was generated using a sample of 5% of the original dataset.",
        "copyright_holder": "StataCorp LLC",
        "copyright_year": 2020,
        "url": "http://www.stata-press.com/data/r15/auto2.dta",
    },
)
report.to_file(Path("stata_auto_report.html"))
|
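Since only the properties listed above are supported, it can be useful to check a metadata dictionary for typos before generating a report. The following is a minimal sketch and not part of ydata-profiling: `SUPPORTED_DATASET_KEYS` and `build_dataset_metadata` are hypothetical names, and the set of keys is simply copied from the list of supported properties in this section.

```python
# Keys copied from the supported properties listed in this section.
SUPPORTED_DATASET_KEYS = {
    "description",
    "creator",
    "author",
    "url",
    "copyright_year",
    "copyright_holder",
}


def build_dataset_metadata(**properties):
    """Hypothetical helper: return a dict for the `dataset` argument.

    Raises ValueError for keys outside SUPPORTED_DATASET_KEYS and drops
    properties whose value is None.
    """
    unknown = sorted(set(properties) - SUPPORTED_DATASET_KEYS)
    if unknown:
        raise ValueError(f"Unsupported dataset properties: {unknown}")
    return {key: value for key, value in properties.items() if value is not None}


dataset = build_dataset_metadata(
    description="This profiling report was generated using a sample of 5% of the original dataset.",
    copyright_holder="StataCorp LLC",
    copyright_year=2020,
    url="http://www.stata-press.com/data/r15/auto2.dta",
    creator=None,  # left out of the resulting dict
)
```

The resulting dictionary can then be passed as `df.profile_report(dataset=dataset)`, as in the example above.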
Column descriptions
In addition to providing dataset details, users often want to include
column-specific descriptions when sharing reports with team members and
stakeholders. ydata-profiling supports adding these descriptions, so
that the report includes a built-in data dictionary. By default, the
descriptions are presented in the Overview section of the report, next
to each variable.
Generate a report with per-variable descriptions |
---|
| profile = df.profile_report(
    variables={
        "descriptions": {
            # variable name: variable description
            "files": "Files in the filesystem",
            "datec": "Creation date",
            "datem": "Modification date",
        }
    }
)
profile.to_file("report.html")
|
Alternatively, column descriptions can be loaded from a JSON file:
dataset_column_definition.json |
---|
| {
    "column name 1": "column 1 definition",
    "column name 2": "column 2 definition"
}
|
Generate a report with descriptions per variable from a JSON definitions file |
---|
| import json

import pandas as pd
import ydata_profiling

definition_file = "dataset_column_definition.json"

# Read the variable descriptions
with open(definition_file, "r") as f:
    definitions = json.load(f)

# df is an existing pandas DataFrame.
# By default, the descriptions are presented in the Overview section, next to each variable
report = df.profile_report(variables={"descriptions": definitions})

# We can disable showing the descriptions next to each variable
report = df.profile_report(
    variables={"descriptions": definitions}, show_variable_description=False
)
report.to_file("report.html")
|
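A definitions file can drift out of sync with the dataset, so a small check before profiling may help. The sketch below is illustrative only: `load_column_descriptions` is a hypothetical helper, the file is written to a temporary directory purely for demonstration, and a plain list of names stands in for `list(df.columns)`.

```python
import json
import tempfile
from pathlib import Path


def load_column_descriptions(definition_file, columns):
    """Hypothetical helper: load descriptions and reject unknown column names."""
    with open(definition_file, "r", encoding="utf-8") as f:
        definitions = json.load(f)
    unknown = sorted(set(definitions) - set(columns))
    if unknown:
        raise KeyError(f"Descriptions given for unknown columns: {unknown}")
    return definitions


with tempfile.TemporaryDirectory() as tmp:
    # Write a small definitions file for demonstration purposes.
    definition_file = Path(tmp) / "dataset_column_definition.json"
    definition_file.write_text(
        json.dumps({"datec": "Creation date", "datem": "Modification date"}),
        encoding="utf-8",
    )

    columns = ["files", "datec", "datem"]  # stand-in for list(df.columns)
    definitions = load_column_descriptions(definition_file, columns)
```

Columns without a description (here `files`) are allowed; only descriptions that refer to a column missing from the dataset raise an error.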
Dataset schema
In addition to providing dataset details, users often want to specify
the data type of each variable. This is particularly important when
integrating ydata-profiling report generation with the information
already in a data catalog. When using ydata-profiling's ProfileReport,
users can set the type_schema property to control the data types used
for profiling. By default, the type_schema is automatically inferred
with visions.
Set the variable type schema to generate the profile report |
---|
| import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file
file_name = cache_file(
"titanic.csv",
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
df = pd.read_csv(file_name)
type_schema = {"Survived": "categorical", "Embarked": "categorical"}
# We can set the type_schema only for the variables whose types we are certain of. All the others will be inferred automatically.
report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)
report.to_file("report.html")
|
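When the types come from a data catalog, the type_schema can be derived from the catalog entries rather than written by hand. The sketch below assumes a hypothetical catalog that labels columns as `enum`, `number` or `timestamp`. The mapping to the names `"numeric"` and `"datetime"` is an assumption, since only `"categorical"` appears in the example above. Columns with no mapping are left out so that their types are inferred automatically.

```python
# Assumed mapping from hypothetical catalog type names to type_schema values.
# Only "categorical" is taken from the example above; the rest are assumptions.
CATALOG_TO_PROFILING_TYPE = {
    "enum": "categorical",
    "number": "numeric",
    "timestamp": "datetime",
}


def type_schema_from_catalog(catalog_types):
    """Hypothetical helper: build a type_schema, skipping unmapped catalog types."""
    return {
        column: CATALOG_TO_PROFILING_TYPE[catalog_type]
        for column, catalog_type in catalog_types.items()
        if catalog_type in CATALOG_TO_PROFILING_TYPE
    }


catalog_types = {
    "Survived": "enum",
    "Embarked": "enum",
    "Fare": "number",
    "Name": "free_text",  # no mapping, so the type is left to automatic inference
}
type_schema = type_schema_from_catalog(catalog_types)
```

The resulting dictionary can be passed as `ProfileReport(df, title="Titanic EDA", type_schema=type_schema)`.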