Handling sensitive data
In certain data-sensitive contexts (for instance, private health records), sharing a report that includes a sample would violate privacy constraints. The following configuration shorthand groups together various options so that only aggregate information is provided in the report and no individual records are shown:
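A minimal sketch of that shorthand, assuming it is passed as the `sensitive=True` keyword argument of `ProfileReport`. The helper name `build_sensitive_report` and the toy records are hypothetical and only serve as illustration.

```python
import pandas as pd


def build_sensitive_report(df: pd.DataFrame):
    """Build a report that shows aggregate information only (hypothetical helper)."""
    # Imported inside the function so the helper can be defined even when
    # the profiling library has not been loaded yet.
    from ydata_profiling import ProfileReport

    # Assumption: `sensitive=True` is the configuration shorthand described above.
    return ProfileReport(df, sensitive=True)


# Made-up records standing in for a private dataset.
records = pd.DataFrame({"age": [34, 58, 41], "diagnosis": ["A", "B", "A"]})

# Typical usage (left commented out so nothing is written to disk here):
# build_sensitive_report(records).to_file("report.html")
```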
Additionally, ydata-profiling does not send data to external services, making it suitable for private data.
Sample and duplicates
The sections that show a dataset's sample and duplicate rows can be disabled to guarantee that the report does not directly leak any data:
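One way this might look, assuming the `samples` and `duplicates` options are each switched off by setting them to `None`. The names `no_row_output` and `build_report_without_rows` are hypothetical.

```python
import pandas as pd

# Assumption: passing None for an option disables the corresponding section.
no_row_output = {"samples": None, "duplicates": None}


def build_report_without_rows(df: pd.DataFrame):
    """Build a report with no sample and no duplicate-row sections (hypothetical helper)."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **no_row_output)
```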
Alternatively, it is possible to still show a sample but replace its contents with mock or synthetic data. The following snippet demonstrates how to generate the report using mock/synthetic data in the dataset sample sections. Note that the name and caption keys are optional.
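A sketch of such a configuration, assuming the custom sample is supplied through the `sample` argument of `ProfileReport` as a dictionary with a `data` key plus the optional `name` and `caption` keys. The mock rows, the caption text, and the helper name are invented for illustration.

```python
import pandas as pd

# Mock rows that follow the shape of the real dataset without containing
# any real values. Replace them with the output of a mock or synthetic
# data generator of your choice.
mock_sample = pd.DataFrame(
    {
        "name": ["Jane Doe", "John Roe"],
        "phone": ["0600000000", "0600000001"],
    }
)

sample_config = {
    "name": "Mock data sample",  # optional
    "data": mock_sample,
    "caption": (  # optional
        "Disclaimer: this sample consists of synthetic data that follows "
        "the format of the underlying dataset."
    ),
}


def build_report_with_mock_sample(df: pd.DataFrame):
    """Profile `df` but display `mock_sample` in the sample section (hypothetical helper)."""
    from ydata_profiling import ProfileReport

    return ProfileReport(df, sample=sample_config)
```

Statistics in the rest of the report are still computed from the real data; only the rows displayed in the sample section come from `mock_sample`.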
Warning
Be aware when using pandas.read_csv with sensitive data such as phone numbers. pandas' type guessing will by default coerce phone numbers such as 0612345678 to numeric. This leads to information leakage through aggregates (min, max, quantiles). To prevent this from happening, keep the string representation.
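The effect can be reproduced with a small in-memory CSV. This is an illustrative sketch: the column names and phone numbers are made up, and the `dtype` argument of `pandas.read_csv` is used to keep the column as text.

```python
import io

import pandas as pd

# Made-up data; a real file path can be passed instead of the StringIO buffer.
csv_text = "name,phone\nAlice,0612345678\nBob,0687654321\n"

# Default behaviour: the phone column is inferred as an integer column, so the
# leading zero is dropped and numeric aggregates become meaningful.
inferred = pd.read_csv(io.StringIO(csv_text))
leaked_minimum = inferred["phone"].min()  # an actual phone number, minus its leading zero

# Keeping the string representation: the column stays textual, so no
# min/max/quantile statistics are computed over it.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})
```

With the string representation, the column is treated as text rather than as a number, so numeric summaries such as the minimum no longer expose an individual value.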
Note that type detection is hard, which is why visions, a type system that helps developers solve these cases, was developed.
Automated PII classification & management
You can find more details about this feature here.