Overview

Dataset statistics

Number of variables: 14
Number of observations: 32561
Missing cells: 4262
Missing cells (%): 0.9%
Duplicate rows: 24
Duplicate rows (%): 0.1%
Total size in memory: 18.1 MiB
Average record size in memory: 583.0 B
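
The percentages and the average record size follow from the counts in this block. A minimal sketch of that arithmetic, using only the reported figures (variable names are illustrative; the memory total is the rounded 18.1 MiB, so the record size is only approximate):

```python
# Figures copied from the "Dataset statistics" block.
n_variables = 14
n_observations = 32561
missing_cells = 4262
duplicate_rows = 24
total_memory_mib = 18.1  # rounded value as shown in the report

total_cells = n_variables * n_observations

# Share of missing cells over all cells, and of duplicate rows over all rows.
missing_cells_pct = 100 * missing_cells / total_cells
duplicate_rows_pct = 100 * duplicate_rows / n_observations

# Average bytes per record from the rounded total (1 MiB = 1024 * 1024 bytes).
avg_record_bytes = total_memory_mib * 1024 * 1024 / n_observations

print(f"Missing cells (%): {missing_cells_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_rows_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.0f} B")
```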

Variable types

Numeric: 6
Categorical: 8

Dataset

Description: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
Creator: Barry Becker
Author: Ronny Kohavi and Barry Becker
URL: https://archive.ics.uci.edu/ml/datasets/adult

Variable descriptions

age: definition 0
workclass: definition 1
fnlwgt: definition 2
education: definition 3
education-num: definition 4
marital-status: definition 5
occupation: definition 6
relationship: definition 7
race: definition 8
sex: definition 9
capital-gain: definition 10
capital-loss: definition 11
hours-per-week: definition 12
native-country: definition 13

Alerts

Dataset has 24 (0.1%) duplicate rows [Duplicates]
education is highly overall correlated with education-num [High correlation]
education-num is highly overall correlated with education [High correlation]
relationship is highly overall correlated with sex [High correlation]
sex is highly overall correlated with relationship [High correlation]
workclass is highly imbalanced (52.8%) [Imbalance]
race is highly imbalanced (65.6%) [Imbalance]
native-country is highly imbalanced (84.5%) [Imbalance]
workclass has 1836 (5.6%) missing values [Missing]
occupation has 1843 (5.7%) missing values [Missing]
native-country has 583 (1.8%) missing values [Missing]
capital-gain has 29849 (91.7%) zeros [Zeros]
capital-loss has 31042 (95.3%) zeros [Zeros]
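
The imbalance percentages are consistent with an entropy-based score: one minus the Shannon entropy of the category frequencies divided by its maximum, log(k) for k categories. The sketch below assumes that definition (the report does not state the formula) and recomputes it from the non-missing category counts listed in the workclass and race sections; the function name `imbalance_score` is illustrative.

```python
import math

def imbalance_score(counts):
    """Return 1 - H / log(k): 0 for a uniform split, 1 when one class holds all rows."""
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        return 0.0  # a single class is treated as balanced in this sketch
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log(k)

# Non-missing category counts from the value tables of this report.
workclass_counts = [22696, 2541, 2093, 1298, 1116, 960, 14, 7]
race_counts = [27816, 3124, 1039, 311, 271]

print(f"workclass: {100 * imbalance_score(workclass_counts):.1f}%")
print(f"race: {100 * imbalance_score(race_counts):.1f}%")
```

Under this assumption the two scores round to the 52.8% and 65.6% shown in the alerts.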

Reproduction

Analysis started: 2024-10-29 15:19:41.851334
Analysis finished: 2024-10-29 15:19:47.572841
Duration: 5.72 seconds
Software version: ydata-profiling v0.0.dev0
Download configuration: config.json
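
The duration is the difference between the two timestamps. A small check with the standard library (the names `started` and `finished` are illustrative):

```python
from datetime import datetime

# Timestamps copied from the "Reproduction" block.
FMT = "%Y-%m-%d %H:%M:%S.%f"
started = datetime.strptime("2024-10-29 15:19:41.851334", FMT)
finished = datetime.strptime("2024-10-29 15:19:47.572841", FMT)

# Subtracting two datetimes yields a timedelta; total_seconds() keeps the fraction.
duration_seconds = (finished - started).total_seconds()
print(f"Duration: {duration_seconds:.2f} seconds")
```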

Variables

age
Real number (ℝ)

Distinct: 73
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 38.581647
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 17
5-th percentile: 19
Q1: 28
median: 37
Q3: 48
95-th percentile: 63
Maximum: 90
Range: 73
Interquartile range (IQR): 20
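
Range and IQR are derived from the other quantiles: range = maximum - minimum and IQR = Q3 - Q1. A short sketch using the values reported for three of the numeric variables in this report (the helper name `spread` is illustrative):

```python
# (minimum, Q1, Q3, maximum) as reported in the quantile tables of this report.
quantiles = {
    "age": (17, 28, 48, 90),
    "education-num": (1, 9, 12, 16),
    "hours-per-week": (1, 40, 45, 99),
}

def spread(minimum, q1, q3, maximum):
    """Return (range, interquartile range) from four order statistics."""
    return maximum - minimum, q3 - q1

for name, values in quantiles.items():
    value_range, iqr = spread(*values)
    print(f"{name}: range={value_range}, IQR={iqr}")
```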

Descriptive statistics

Standard deviation: 13.640433
Coefficient of variation (CV): 0.35354718
Kurtosis: -0.16612746
Mean: 38.581647
Median Absolute Deviation (MAD): 10
Skewness: 0.55874337
Sum: 1256257
Variance: 186.0614
Monotonicity: Not monotonic
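
Several of these statistics are linked by simple identities: CV = standard deviation / mean, variance = standard deviation squared, and sum = mean × number of observations. A brief check using only the reported figures for age (variable names are illustrative; small differences come from the rounding of the printed values):

```python
# Values copied from the descriptive statistics for age.
n_observations = 32561
mean = 38.581647
std_dev = 13.640433
reported_cv = 0.35354718
reported_variance = 186.0614
reported_sum = 1256257

cv = std_dev / mean             # coefficient of variation
variance = std_dev ** 2         # variance is the squared standard deviation
total = mean * n_observations   # sum recovered from the mean

print(f"CV: {cv:.6f} (reported {reported_cv})")
print(f"Variance: {variance:.4f} (reported {reported_variance})")
print(f"Sum: {total:.0f} (reported {reported_sum})")
```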
Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
36 | 898 | 2.8%
31 | 888 | 2.7%
34 | 886 | 2.7%
23 | 877 | 2.7%
35 | 876 | 2.7%
33 | 875 | 2.7%
28 | 867 | 2.7%
30 | 861 | 2.6%
37 | 858 | 2.6%
25 | 841 | 2.6%
Other values (63) | 23834 | 73.2%

Minimum 10 values

Value | Count | Frequency (%)
17 | 395 | 1.2%
18 | 550 | 1.7%
19 | 712 | 2.2%
20 | 753 | 2.3%
21 | 720 | 2.2%
22 | 765 | 2.3%
23 | 877 | 2.7%
24 | 798 | 2.5%
25 | 841 | 2.6%
26 | 785 | 2.4%

Maximum 10 values

Value | Count | Frequency (%)
90 | 43 | 0.1%
88 | 3 | < 0.1%
87 | 1 | < 0.1%
86 | 1 | < 0.1%
85 | 3 | < 0.1%
84 | 10 | < 0.1%
83 | 6 | < 0.1%
82 | 12 | < 0.1%
81 | 20 | 0.1%
80 | 22 | 0.1%

workclass
Categorical

[Imbalance] [Missing]

Distinct: 8
Distinct (%): < 0.1%
Missing: 1836
Missing (%): 5.6%
Memory size: 2.0 MiB

Private: 22696
Self-emp-not-inc: 2541
Local-gov: 2093
State-gov: 1298
Self-emp-inc: 1116
Other values (3): 981

Length

Max length: 17
Median length: 8
Mean length: 9.2745972
Min length: 8

Characters and Unicode

Total characters: 284962
Distinct characters: 28
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: State-gov
2nd row: Self-emp-not-inc
3rd row: Private
4th row: Private
5th row: Private
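
The length and character totals for workclass can be reproduced from the category counts if each raw value carries one leading space (for example " Private"), which is an assumption inferred from the minimum length of 8 rather than something the report states. A minimal sketch, with illustrative names:

```python
# Non-missing category counts for workclass, as listed in this section.
counts = {
    "Private": 22696, "Self-emp-not-inc": 2541, "Local-gov": 2093,
    "State-gov": 1298, "Self-emp-inc": 1116, "Federal-gov": 960,
    "Without-pay": 14, "Never-worked": 7,
}

# Assumption: every raw value has one leading space, so " Private" has length 8.
PADDING = 1

lengths = {name: len(name) + PADDING for name in counts}
n_values = sum(counts.values())
total_characters = sum(lengths[name] * count for name, count in counts.items())
mean_length = total_characters / n_values

print(f"Min length: {min(lengths.values())}")
print(f"Max length: {max(lengths.values())}")
print(f"Total characters: {total_characters}")
print(f"Mean length: {mean_length:.7f}")
```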

Common Values

Value | Count | Frequency (%)
Private | 22696 | 69.7%
Self-emp-not-inc | 2541 | 7.8%
Local-gov | 2093 | 6.4%
State-gov | 1298 | 4.0%
Self-emp-inc | 1116 | 3.4%
Federal-gov | 960 | 2.9%
Without-pay | 14 | < 0.1%
Never-worked | 7 | < 0.1%
(Missing) | 1836 | 5.6%
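
The frequencies in this table are shares of all 32561 rows, with missing entries counted as their own row. A table restricted to non-missing values divides by 30725 instead, which is why Private is 69.7% here but 73.9% in the plot table of this section. A short sketch of both denominators (the helper name `share` is illustrative):

```python
# Category counts for workclass, copied from the "Common Values" table.
counts = {
    "Private": 22696, "Self-emp-not-inc": 2541, "Local-gov": 2093,
    "State-gov": 1298, "Self-emp-inc": 1116, "Federal-gov": 960,
    "Without-pay": 14, "Never-worked": 7,
}
n_missing = 1836

n_present = sum(counts.values())  # rows with a recorded workclass
n_total = n_present + n_missing   # all rows, including missing ones

def share(count, denominator):
    """Percentage of `count` in `denominator`, rounded to one decimal."""
    return round(100 * count / denominator, 1)

for name, count in counts.items():
    print(f"{name}: {share(count, n_total)}% of all rows, "
          f"{share(count, n_present)}% of non-missing rows")
```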

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
private | 22696 | 73.9%
self-emp-not-inc | 2541 | 8.3%
local-gov | 2093 | 6.8%
state-gov | 1298 | 4.2%
self-emp-inc | 1116 | 3.6%
federal-gov | 960 | 3.1%
without-pay | 14 | < 0.1%
never-worked | 7 | < 0.1%

Most occurring characters

Value | Count | Frequency (%)
e | 33249 | 11.7%
(space) | 30725 | 10.8%
t | 27861 | 9.8%
a | 27061 | 9.5%
v | 27054 | 9.5%
i | 26367 | 9.3%
r | 23670 | 8.3%
P | 22696 | 8.0%
- | 14227 | 5.0%
o | 9006 | 3.2%
Other values (18) | 43046 | 15.1%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 284962 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 284962 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 284962 | 100.0%

fnlwgt
Real number (ℝ)

Distinct: 21648
Distinct (%): 66.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 189778.37
Minimum: 12285
Maximum: 1484705
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 12285
5-th percentile: 39460
Q1: 117827
median: 178356
Q3: 237051
95-th percentile: 379682
Maximum: 1484705
Range: 1472420
Interquartile range (IQR): 119224

Descriptive statistics

Standard deviation: 105549.98
Coefficient of variation (CV): 0.55617497
Kurtosis: 6.218811
Mean: 189778.37
Median Absolute Deviation (MAD): 59894
Skewness: 1.4469801
Sum: 6.1793734 × 10^9
Variance: 1.1140798 × 10^10
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
164190 | 13 | < 0.1%
203488 | 13 | < 0.1%
123011 | 13 | < 0.1%
113364 | 12 | < 0.1%
121124 | 12 | < 0.1%
148995 | 12 | < 0.1%
126675 | 12 | < 0.1%
188246 | 11 | < 0.1%
155659 | 11 | < 0.1%
102308 | 11 | < 0.1%
Other values (21638) | 32441 | 99.6%

Minimum 10 values

Value | Count | Frequency (%)
12285 | 1 | < 0.1%
13769 | 1 | < 0.1%
14878 | 1 | < 0.1%
18827 | 1 | < 0.1%
19214 | 1 | < 0.1%
19302 | 5 | < 0.1%
19395 | 2 | < 0.1%
19410 | 1 | < 0.1%
19491 | 1 | < 0.1%
19520 | 1 | < 0.1%

Maximum 10 values

Value | Count | Frequency (%)
1484705 | 1 | < 0.1%
1455435 | 1 | < 0.1%
1366120 | 1 | < 0.1%
1268339 | 1 | < 0.1%
1226583 | 1 | < 0.1%
1184622 | 1 | < 0.1%
1161363 | 1 | < 0.1%
1125613 | 1 | < 0.1%
1097453 | 1 | < 0.1%
1085515 | 1 | < 0.1%

education
Categorical

[High correlation]

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

HS-grad: 10501
Some-college: 7291
Bachelors: 5355
Masters: 1723
Assoc-voc: 1382
Other values (11): 6309

Length

Max length: 13
Median length: 12
Mean length: 9.433709
Min length: 4

Characters and Unicode

Total characters: 307171
Distinct characters: 32
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Bachelors
2nd row: Bachelors
3rd row: HS-grad
4th row: 11th
5th row: Bachelors

Common Values

Value | Count | Frequency (%)
HS-grad | 10501 | 32.3%
Some-college | 7291 | 22.4%
Bachelors | 5355 | 16.4%
Masters | 1723 | 5.3%
Assoc-voc | 1382 | 4.2%
11th | 1175 | 3.6%
Assoc-acdm | 1067 | 3.3%
10th | 933 | 2.9%
7th-8th | 646 | 2.0%
Prof-school | 576 | 1.8%
Other values (6) | 1912 | 5.9%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
hs-grad | 10501 | 32.3%
some-college | 7291 | 22.4%
bachelors | 5355 | 16.4%
masters | 1723 | 5.3%
assoc-voc | 1382 | 4.2%
11th | 1175 | 3.6%
assoc-acdm | 1067 | 3.3%
10th | 933 | 2.9%
7th-8th | 646 | 2.0%
prof-school | 576 | 1.8%
Other values (6) | 1912 | 5.9%

Most occurring characters

Value | Count | Frequency (%)
(space) | 32561 | 10.6%
e | 29415 | 9.6%
o | 26424 | 8.6%
- | 21964 | 7.2%
l | 20564 | 6.7%
a | 19059 | 6.2%
r | 18619 | 6.1%
c | 18584 | 6.1%
S | 17792 | 5.8%
g | 17792 | 5.8%
Other values (22) | 84397 | 27.5%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 307171 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 307171 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 307171 | 100.0%

education-num
Real number (ℝ)

[High correlation]

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.080679
Minimum: 1
Maximum: 16
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 5
Q1: 9
median: 10
Q3: 12
95-th percentile: 14
Maximum: 16
Range: 15
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.5727203
Coefficient of variation (CV): 0.25521299
Kurtosis: 0.62344407
Mean: 10.080679
Median Absolute Deviation (MAD): 1
Skewness: -0.31167587
Sum: 328237
Variance: 6.6188899
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=16)

Common values

Value | Count | Frequency (%)
9 | 10501 | 32.3%
10 | 7291 | 22.4%
13 | 5355 | 16.4%
14 | 1723 | 5.3%
11 | 1382 | 4.2%
7 | 1175 | 3.6%
12 | 1067 | 3.3%
6 | 933 | 2.9%
4 | 646 | 2.0%
15 | 576 | 1.8%
Other values (6) | 1912 | 5.9%

Minimum 10 values

Value | Count | Frequency (%)
1 | 51 | 0.2%
2 | 168 | 0.5%
3 | 333 | 1.0%
4 | 646 | 2.0%
5 | 514 | 1.6%
6 | 933 | 2.9%
7 | 1175 | 3.6%
8 | 433 | 1.3%
9 | 10501 | 32.3%
10 | 7291 | 22.4%

Maximum 10 values

Value | Count | Frequency (%)
16 | 413 | 1.3%
15 | 576 | 1.8%
14 | 1723 | 5.3%
13 | 5355 | 16.4%
12 | 1067 | 3.3%
11 | 1382 | 4.2%
10 | 7291 | 22.4%
9 | 10501 | 32.3%
8 | 433 | 1.3%
7 | 1175 | 3.6%
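
Because education-num has only 16 distinct values, the smallest-values and largest-values listings together give its complete frequency table, and the summary statistics can be rebuilt from it. A minimal sketch (names are illustrative; the variance uses the n - 1 divisor, which is what agrees with the reported figure):

```python
# Full frequency table for education-num (value -> count), assembled from
# the smallest-values and largest-values listings in this section.
counts = {
    1: 51, 2: 168, 3: 333, 4: 646, 5: 514, 6: 933, 7: 1175, 8: 433,
    9: 10501, 10: 7291, 11: 1382, 12: 1067, 13: 5355, 14: 1723,
    15: 576, 16: 413,
}

n = sum(counts.values())                       # number of observations
total = sum(v * c for v, c in counts.items())  # the "Sum" statistic
mean = total / n

# Sample variance: squared deviations weighted by count, divided by n - 1.
variance = sum(c * (v - mean) ** 2 for v, c in counts.items()) / (n - 1)

# Ten most frequent values, with the remainder pooled as "Other values".
top10 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
other_count = n - sum(c for _, c in top10)

print(f"n={n}, sum={total}, mean={mean:.6f}, variance={variance:.7f}")
for value, count in top10:
    print(f"{value} | {count} | {100 * count / n:.1f}%")
print(f"Other values ({len(counts) - len(top10)}) | {other_count} | "
      f"{100 * other_count / n:.1f}%")
```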

marital-status
Categorical

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.2 MiB

Married-civ-spouse: 14976
Never-married: 10683
Divorced: 4443
Separated: 1025
Widowed: 993
Other values (2): 441

Length

Max length: 22
Median length: 19
Mean length: 15.414054
Min length: 8

Characters and Unicode

Total characters: 501897
Distinct characters: 25
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Never-married
2nd row: Married-civ-spouse
3rd row: Divorced
4th row: Married-civ-spouse
5th row: Married-civ-spouse

Common Values

Value | Count | Frequency (%)
Married-civ-spouse | 14976 | 46.0%
Never-married | 10683 | 32.8%
Divorced | 4443 | 13.6%
Separated | 1025 | 3.1%
Widowed | 993 | 3.0%
Married-spouse-absent | 418 | 1.3%
Married-AF-spouse | 23 | 0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
married-civ-spouse | 14976 | 46.0%
never-married | 10683 | 32.8%
divorced | 4443 | 13.6%
separated | 1025 | 3.1%
widowed | 993 | 3.0%
married-spouse-absent | 418 | 1.3%
married-af-spouse | 23 | 0.1%

Most occurring characters

Value | Count | Frequency (%)
e | 70787 | 14.1%
r | 68351 | 13.6%
i | 46512 | 9.3%
- | 41517 | 8.3%
d | 33554 | 6.7%
(space) | 32561 | 6.5%
s | 31252 | 6.2%
v | 30102 | 6.0%
a | 28568 | 5.7%
o | 20853 | 4.2%
Other values (15) | 97840 | 19.5%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 501897 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 501897 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 501897 | 100.0%

occupation
Categorical

[Missing]

Distinct: 14
Distinct (%): < 0.1%
Missing: 1843
Missing (%): 5.7%
Memory size: 2.2 MiB

Prof-specialty: 4140
Craft-repair: 4099
Exec-managerial: 4066
Adm-clerical: 3770
Sales: 3650
Other values (9): 10993

Length

Max length: 18
Median length: 16
Mean length: 13.873983
Min length: 6

Characters and Unicode

Total characters: 426181
Distinct characters: 32
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Adm-clerical
2nd row: Exec-managerial
3rd row: Handlers-cleaners
4th row: Handlers-cleaners
5th row: Prof-specialty

Common Values

Value | Count | Frequency (%)
Prof-specialty | 4140 | 12.7%
Craft-repair | 4099 | 12.6%
Exec-managerial | 4066 | 12.5%
Adm-clerical | 3770 | 11.6%
Sales | 3650 | 11.2%
Other-service | 3295 | 10.1%
Machine-op-inspct | 2002 | 6.1%
Transport-moving | 1597 | 4.9%
Handlers-cleaners | 1370 | 4.2%
Farming-fishing | 994 | 3.1%
Other values (4) | 1735 | 5.3%
(Missing) | 1843 | 5.7%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
prof-specialty | 4140 | 13.5%
craft-repair | 4099 | 13.3%
exec-managerial | 4066 | 13.2%
adm-clerical | 3770 | 12.3%
sales | 3650 | 11.9%
other-service | 3295 | 10.7%
machine-op-inspct | 2002 | 6.5%
transport-moving | 1597 | 5.2%
handlers-cleaners | 1370 | 4.5%
farming-fishing | 994 | 3.2%
Other values (4) | 1735 | 5.6%

Most occurring characters

Value | Count | Frequency (%)
e | 42979 | 10.1%
r | 40333 | 9.5%
a | 39289 | 9.2%
(space) | 30718 | 7.2%
- | 29219 | 6.9%
i | 28751 | 6.7%
c | 26001 | 6.1%
l | 22136 | 5.2%
s | 20302 | 4.8%
t | 17359 | 4.1%
Other values (22) | 129094 | 30.3%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 426181 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 426181 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 426181 | 100.0%

relationship
Categorical

[High correlation]

Distinct: 6
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

Husband: 13193
Not-in-family: 8305
Own-child: 5068
Unmarried: 3446
Wife: 1568

Length

Max length: 15
Median length: 14
Mean length: 10.119744
Min length: 5

Characters and Unicode

Total characters: 329509
Distinct characters: 26
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Not-in-family
2nd row: Husband
3rd row: Not-in-family
4th row: Husband
5th row: Wife

Common Values

Value | Count | Frequency (%)
Husband | 13193 | 40.5%
Not-in-family | 8305 | 25.5%
Own-child | 5068 | 15.6%
Unmarried | 3446 | 10.6%
Wife | 1568 | 4.8%
Other-relative | 981 | 3.0%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
husband | 13193 | 40.5%
not-in-family | 8305 | 25.5%
own-child | 5068 | 15.6%
unmarried | 3446 | 10.6%
wife | 1568 | 4.8%
other-relative | 981 | 3.0%

Most occurring characters

Value | Count | Frequency (%)
(space) | 32561 | 9.9%
n | 30012 | 9.1%
i | 27673 | 8.4%
a | 25925 | 7.9%
- | 22659 | 6.9%
d | 21707 | 6.6%
l | 14354 | 4.4%
b | 13193 | 4.0%
H | 13193 | 4.0%
u | 13193 | 4.0%
Other values (16) | 115039 | 34.9%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 329509 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 329509 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 329509 | 100.0%

race
Categorical

[Imbalance]

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.0 MiB

White: 27816
Black: 3124
Asian-Pac-Islander: 1039
Amer-Indian-Eskimo: 311
Other: 271

Length

Max length: 19
Median length: 6
Mean length: 6.5389884
Min length: 6

Characters and Unicode

Total characters: 212916
Distinct characters: 23
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: White
2nd row: White
3rd row: White
4th row: Black
5th row: Black

Common Values

Value | Count | Frequency (%)
White | 27816 | 85.4%
Black | 3124 | 9.6%
Asian-Pac-Islander | 1039 | 3.2%
Amer-Indian-Eskimo | 311 | 1.0%
Other | 271 | 0.8%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
white | 27816 | 85.4%
black | 3124 | 9.6%
asian-pac-islander | 1039 | 3.2%
amer-indian-eskimo | 311 | 1.0%
other | 271 | 0.8%

Most occurring characters

Value | Count | Frequency (%)
(space) | 32561 | 15.3%
i | 29477 | 13.8%
e | 29437 | 13.8%
t | 28087 | 13.2%
h | 28087 | 13.2%
W | 27816 | 13.1%
a | 6552 | 3.1%
c | 4163 | 2.0%
l | 4163 | 2.0%
k | 3435 | 1.6%
Other values (13) | 19138 | 9.0%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 212916 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 212916 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 212916 | 100.0%

sex
Categorical

[High correlation]

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.9 MiB

Male: 21790
Female: 10771

Length

Max length: 7
Median length: 5
Mean length: 5.661589
Min length: 5

Characters and Unicode

Total characters: 184347
Distinct characters: 7
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Male
2nd row: Male
3rd row: Male
4th row: Male
5th row: Female

Common Values

Value | Count | Frequency (%)
Male | 21790 | 66.9%
Female | 10771 | 33.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
male | 21790 | 66.9%
female | 10771 | 33.1%

Most occurring characters

Value | Count | Frequency (%)
e | 43332 | 23.5%
a | 32561 | 17.7%
(space) | 32561 | 17.7%
l | 32561 | 17.7%
M | 21790 | 11.8%
F | 10771 | 5.8%
m | 10771 | 5.8%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 184347 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 184347 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 184347 | 100.0%

capital-gain
Real number (ℝ)

[Zeros]

Distinct: 119
Distinct (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1077.6488
Minimum: 0
Maximum: 99999
Zeros: 29849
Zeros (%): 91.7%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 7385.2921
Coefficient of variation (CV): 6.8531527
Kurtosis: 154.79944
Mean: 1077.6488
Median Absolute Deviation (MAD): 0
Skewness: 11.953848
Sum: 35089324
Variance: 54542539
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
0 | 29849 | 91.7%
15024 | 347 | 1.1%
7688 | 284 | 0.9%
7298 | 246 | 0.8%
99999 | 159 | 0.5%
3103 | 97 | 0.3%
5178 | 97 | 0.3%
4386 | 70 | 0.2%
5013 | 69 | 0.2%
8614 | 55 | 0.2%
Other values (109) | 1288 | 4.0%

Minimum 10 values

Value | Count | Frequency (%)
0 | 29849 | 91.7%
114 | 6 | < 0.1%
401 | 2 | < 0.1%
594 | 34 | 0.1%
914 | 8 | < 0.1%
991 | 5 | < 0.1%
1055 | 25 | 0.1%
1086 | 4 | < 0.1%
1111 | 1 | < 0.1%
1151 | 8 | < 0.1%

Maximum 10 values

Value | Count | Frequency (%)
99999 | 159 | 0.5%
41310 | 2 | < 0.1%
34095 | 5 | < 0.1%
27828 | 34 | 0.1%
25236 | 11 | < 0.1%
25124 | 4 | < 0.1%
22040 | 1 | < 0.1%
20051 | 37 | 0.1%
18481 | 2 | < 0.1%
15831 | 6 | < 0.1%

capital-loss
Real number (ℝ)

[Zeros]

Distinct: 92
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 87.30383
Minimum: 0
Maximum: 4356
Zeros: 31042
Zeros (%): 95.3%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 402.96022
Coefficient of variation (CV): 4.6156076
Kurtosis: 20.376802
Mean: 87.30383
Median Absolute Deviation (MAD): 0
Skewness: 4.5946291
Sum: 2842700
Variance: 162376.94
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
0 | 31042 | 95.3%
1902 | 202 | 0.6%
1977 | 168 | 0.5%
1887 | 159 | 0.5%
1485 | 51 | 0.2%
1848 | 51 | 0.2%
2415 | 49 | 0.2%
1602 | 47 | 0.1%
1740 | 42 | 0.1%
1590 | 40 | 0.1%
Other values (82) | 710 | 2.2%

Minimum 10 values

Value | Count | Frequency (%)
0 | 31042 | 95.3%
155 | 1 | < 0.1%
213 | 4 | < 0.1%
323 | 3 | < 0.1%
419 | 3 | < 0.1%
625 | 12 | < 0.1%
653 | 3 | < 0.1%
810 | 2 | < 0.1%
880 | 6 | < 0.1%
974 | 2 | < 0.1%

Maximum 10 values

Value | Count | Frequency (%)
4356 | 3 | < 0.1%
3900 | 2 | < 0.1%
3770 | 2 | < 0.1%
3683 | 2 | < 0.1%
3004 | 2 | < 0.1%
2824 | 10 | < 0.1%
2754 | 2 | < 0.1%
2603 | 5 | < 0.1%
2559 | 12 | < 0.1%
2547 | 4 | < 0.1%

hours-per-week
Real number (ℝ)

Distinct: 94
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40.437456
Minimum: 1
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 18
Q1: 40
median: 40
Q3: 45
95-th percentile: 60
Maximum: 99
Range: 98
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 12.347429
Coefficient of variation (CV): 0.30534633
Kurtosis: 2.9166868
Mean: 40.437456
Median Absolute Deviation (MAD): 3
Skewness: 0.22764254
Sum: 1316684
Variance: 152.459
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
40 | 15217 | 46.7%
50 | 2819 | 8.7%
45 | 1824 | 5.6%
60 | 1475 | 4.5%
35 | 1297 | 4.0%
20 | 1224 | 3.8%
30 | 1149 | 3.5%
55 | 694 | 2.1%
25 | 674 | 2.1%
48 | 517 | 1.6%
Other values (84) | 5671 | 17.4%

Minimum 10 values

Value | Count | Frequency (%)
1 | 20 | 0.1%
2 | 32 | 0.1%
3 | 39 | 0.1%
4 | 54 | 0.2%
5 | 60 | 0.2%
6 | 64 | 0.2%
7 | 26 | 0.1%
8 | 145 | 0.4%
9 | 18 | 0.1%
10 | 278 | 0.9%

Maximum 10 values

Value | Count | Frequency (%)
99 | 85 | 0.3%
98 | 11 | < 0.1%
97 | 2 | < 0.1%
96 | 5 | < 0.1%
95 | 2 | < 0.1%
94 | 1 | < 0.1%
92 | 1 | < 0.1%
91 | 3 | < 0.1%
90 | 29 | 0.1%
89 | 2 | < 0.1%

native-country
Categorical

[Imbalance] [Missing]

Distinct: 41
Distinct (%): 0.1%
Missing: 583
Missing (%): 1.8%
Memory size: 2.2 MiB

United-States: 29170
Mexico: 643
Philippines: 198
Germany: 137
Canada: 121
Other values (36): 1709

Length

Max length: 27
Median length: 14
Mean length: 13.49975
Min length: 5

Characters and Unicode

Total characters: 431695
Distinct characters: 45
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1

Unique

Unique: 1
Unique (%): < 0.1%

Sample

1st row: United-States
2nd row: United-States
3rd row: United-States
4th row: United-States
5th row: Cuba

Common Values

Value | Count | Frequency (%)
United-States | 29170 | 89.6%
Mexico | 643 | 2.0%
Philippines | 198 | 0.6%
Germany | 137 | 0.4%
Canada | 121 | 0.4%
Puerto-Rico | 114 | 0.4%
El-Salvador | 106 | 0.3%
India | 100 | 0.3%
Cuba | 95 | 0.3%
England | 90 | 0.3%
Other values (31) | 1204 | 3.7%
(Missing) | 583 | 1.8%

Length

Histogram of lengths of the category

Common Values (Plot)

Value | Count | Frequency (%)
united-states | 29170 | 91.2%
mexico | 643 | 2.0%
philippines | 198 | 0.6%
germany | 137 | 0.4%
canada | 121 | 0.4%
puerto-rico | 114 | 0.4%
el-salvador | 106 | 0.3%
india | 100 | 0.3%
cuba | 95 | 0.3%
england | 90 | 0.3%
Other values (31) | 1204 | 3.8%

Most occurring characters

Value | Count | Frequency (%)
t | 88030 | 20.4%
e | 59820 | 13.9%
(space) | 31978 | 7.4%
a | 31774 | 7.4%
i | 31372 | 7.3%
n | 30568 | 7.1%
d | 29801 | 6.9%
- | 29503 | 6.8%
s | 29416 | 6.8%
S | 29396 | 6.8%
Other values (35) | 40037 | 9.3%

Most occurring categories

Value | Count | Frequency (%)
(unknown) | 431695 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
(unknown) | 431695 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
(unknown) | 431695 | 100.0%

Interactions
