Classification of income using US census data (1994)

sam33frodon
Dec 27, 2020
25 min read

Updated: Jan 25, 2021

Goal of project: Classification the income per year using the 1994 US Census Data

Classification techniques: decision tree, random forest, KNN, and Naïve Bayes.

Data Set Information

Data extraction was done by Barry Becker from the 1994 Census database. The dataset can be downloaded from: “https://www.kaggle.com/uciml/incomecensus-census-income”

Response variable

A. income: Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

Independent variables

B. age: Numerical variable that contains the age of the respondent.

C. workclass : Categorical variable that contains the type of employer of the respondent

D. fnlwgt: Numerical variable that contains the number of respondents that each row of the data set represents.

E. education: Categorical variable that represents the level of education of the respondent (Doctorate, Prof-school, Masters, Bachelors, Assoc-acdm, Assoc-voc, Some-college, HS-grad, 12th, 11th, 10th, 9th, 7th-8th, 5th-6th, 1st-4th, Preschool)

F. education.num: Numerical variable that represents the education variable.

G. marital.status: The marital status of the respondent

H. occupation: Categorical variable that represents the type of employment of the respondent (?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving).

I. relationship: Categorical variable that represents the position in the family of the respondent (Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife).

J. race: Categorical variable that represents the race of the respondent (Amer-Indian-Eskimo, Asian-Pac-Islander, Black,Other, White).

K. sex: Categorical variable that represent the sex of the respondent (Female, Male).

L. capital.gain: Numerical variable that represents the income gained by the respondent from sources other than salary/wages.

M. capital.loss: Numerical variable that represents the income lost by the respondent from sources other than salary/wages.

N. hours.per.week: Numerical variable that represents the hours worked per week by the respondent.

O. native.country: Categorical variable that represents the native country of the respondent.

1. Importing data

We will read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().

library(readr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(DataExplorer)

incomecensus <- read_csv("adultincome.csv")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   fnlwgt = col_double(),
##   education = col_character(),
##   education.num = col_double(),
##   marital.status = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   capital.gain = col_double(),
##   capital.loss = col_double(),
##   hours.per.week = col_double(),
##   native.country = col_character(),
##   income = col_character()
## )

2. Inspecting the data

First look at the data

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  15 variables:
##  $ age           : num  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : chr  "?" "Private" "?" "Private" ...
##  $ fnlwgt        : num  77053 132870 186061 140359 264663 ...
##  $ education     : chr  "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
##  $ education.num : num  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: chr  "Widowed" "Widowed" "Widowed" "Divorced" ...
##  $ occupation    : chr  "?" "Exec-managerial" "?" "Machine-op-inspct" ...
##  $ relationship  : chr  "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
##  $ race          : chr  "White" "White" "Black" "White" ...
##  $ sex           : chr  "Female" "Female" "Female" "Female" ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 4356 4356 3900 3900 ...
##  $ hours.per.week: num  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   workclass = col_character(),
##   ..   fnlwgt = col_double(),
##   ..   education = col_character(),
##   ..   education.num = col_double(),
##   ..   marital.status = col_character(),
##   ..   occupation = col_character(),
##   ..   relationship = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   capital.gain = col_double(),
##   ..   capital.loss = col_double(),
##   ..   hours.per.week = col_double(),
##   ..   native.country = col_character(),
##   ..   income = col_character()
##   .. )

We will not need the column fnlwgt

incomecensus <- incomecensus %>% 
  select(-c(fnlwgt))

head(incomecensus)

## # A tibble: 6 x 15
##     age workclass fnlwgt education education.num marital.status occupation
##   <dbl> <chr>      <dbl> <chr>             <dbl> <chr>          <chr>     
## 1    90 ?          77053 HS-grad               9 Widowed        ?         
## 2    82 Private   132870 HS-grad               9 Widowed        Exec-mana~
## 3    66 ?         186061 Some-col~            10 Widowed        ?         
## 4    54 Private   140359 7th-8th               4 Divorced       Machine-o~
## 5    41 Private   264663 Some-col~            10 Separated      Prof-spec~
## 6    34 Private   216864 HS-grad               9 Divorced       Other-ser~
## # ... with 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## #   capital.gain <dbl>, capital.loss <dbl>, hours.per.week <dbl>,
## #   native.country <chr>, income <chr>

We can see that there are some rows containing question marks (“?”).

The next step is to ensure the consistency of data before building any classification technique.

Assigning correct R data types to each column

To convert character to factor

incomecensus <- incomecensus %>%
  mutate_if(is.character,as.factor)

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  15 variables:
##  $ age           : num  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : Factor w/ 9 levels "?","Federal-gov",..: 1 5 1 5 5 5 5 8 2 5 ...
##  $ fnlwgt        : num  77053 132870 186061 140359 264663 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 12 12 16 6 16 12 1 11 12 16 ...
##  $ education.num : num  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 7 7 1 6 1 6 5 1 5 ...
##  $ occupation    : Factor w/ 15 levels "?","Adm-clerical",..: 1 5 1 8 11 9 2 11 11 4 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 2 5 5 4 5 5 3 2 5 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 1 1 2 ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 4356 4356 3900 3900 ...
##  $ hours.per.week: num  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 1 ...
##  $ income        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 1 2 ...

3. Preparing the consistent the data

3.1. Removing unnecessary column

incomecensus <- incomecensus %>% 
  select(-c(fnlwgt))

3.2. Detecting missing values

There are many ways to detect missing values. In this project, we will use the package DataExplorer to quickly see if there are any missing values in the data.

library(DataExplorer)

plot_missing(incomecensus)

There is no missing value.

3.3. Checking duplicate records

 # to check for duplicate records
incomecensus %>%
  summarize(record_count = n(),
            distinct_records = n_distinct(.))

## # A tibble: 1 x 2
##   record_count distinct_records
##          <int>            <int>
## 1        32561            29096

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.

If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

incomecensus <- incomecensus %>% 
  distinct()

The resulting data frame contain 26904 records.

3.4. Dealing with unusual data points

3.4.1. First we will find unusual data points that appear in many columns. Dealing with rows containing “?” (Note that these rows are not empty) Example: How many row with “?” in the column work class

incomecensus %>% 
  group_by(workclass) %>% 
  summarise(count = n())

## # A tibble: 9 x 2
##   workclass        count
##   <fct>            <int>
## 1 ?                 1632
## 2 Federal-gov        946
## 3 Local-gov         2040
## 4 Never-worked         7
## 5 Private          19621
## 6 Self-emp-inc      1091
## 7 Self-emp-not-inc  2473
## 8 State-gov         1272
## 9 Without-pay         14

We will check if there are any other columns containing question marks.

First, we will check for the records with any “?”.

question_marks_count <- purrr::map_df(incomecensus, 
                                      ~ stringr::str_detect(., pattern = "\\?")) %>%
  rowSums() %>%
  tbl_df() %>%
  filter(value > 0) %>%
  summarize(question_marks_count= n()) 

question_marks_count

## # A tibble: 1 x 1
##   question_marks_count
##                  <int>
## 1                 2192

In total, there are 2192 rows containing question marks (“?”).

Now we will count the number of question marks for each column.

count.NA.per.col <- plyr::ldply(incomecensus, function(c) sum(c == "?"))
count.NA.per.col

##               .id   V1
## 1             age    0
## 2       workclass 1632
## 3       education    0
## 4   education.num    0
## 5  marital.status    0
## 6      occupation 1639
## 7    relationship    0
## 8            race    0
## 9             sex    0
## 10   capital.gain    0
## 11   capital.loss    0
## 12 hours.per.week    0
## 13 native.country  580
## 14         income    0

There are three columns that contain “?”, including workclass, occupation, and native.country.

We will remove all of these records.

3.4.2 Checking the consistency of each column

We will use package ggplot2 to detect any unusual points.

3.4.2A. age and education

incomecensus %>%
  ggplot(aes(age, education)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") + 
  geom_point(aes(col = (age<= 20 & education == 'Masters')), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('blue','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Education Level") +
  ggtitle("Age vs Education Level")

There are a few data points (in blue) that are outlier datapoints. It is doubtful that people with ages of 20 years and less can complete a Masters degree. In reality, it might be possible. These data points can be removed.

incomecensus <- incomecensus %>%
  filter(!(age <= 20 & education == 'Masters'))

3.4.2B. sex and relationship status

table(incomecensus$sex, incomecensus$relationship)

##         
##          Husband Not-in-family Other-relative Own-child Unmarried  Wife
##   Female       1          3262            384      1576      2354  1365
##   Male     10808          3853            489      2077       732     1

There are two records where the husbands are female. There is also one record where the wife is female. In 1994, this situation was not possible in the US. Where are these records?

incomecensus %>%
  select(sex, relationship, marital.status) %>% 
  filter((sex == 'Male' & relationship == 'Wife')  | (sex == 'Female' & relationship == 'Husband'))

## # A tibble: 2 x 3
##   sex    relationship marital.status    
##   <fct>  <fct>        <fct>             
## 1 Male   Wife         Married-civ-spouse
## 2 Female Husband      Married-civ-spouse

These points can be removed using the code below.

incomecensus <- incomecensus %>%
  filter(!((sex == 'Male' & relationship == 'Wife') | (sex == 'Female' & relationship == 'Husband')))

3.4.2C. Relationship and marital status

The marital status classification identifies five major categories: never married, married, widowed, and divorced.

table(incomecensus$relationship, incomecensus$marital.status)

##                 
##                  Divorced Married-AF-spouse Married-civ-spouse
##   Husband               0                 9              10799
##   Not-in-family      2153                 0                 14
##   Other-relative      102                 1                118
##   Own-child           305                 1                 83
##   Unmarried          1449                 0                  0
##   Wife                  0                10               1355
##                 
##               Married-spouse-absent Never-married Separated Widowed
##   Husband           0             0         0       0
##   Not-in-family     181          3959       382     426
##   Other-relative     26           533        53      40
##   Own-child          43          3120        89      12
##   Unmarried         120           774       404     339
##   Wife              0             0         0       0

The group “other married, spouse absent” includes married people living apart because either the husband or wife was employed and living at a considerable distance from home, was serving away from home in the Armed Forces, had moved to another area, or had a different place of residence for any other reason except separation as defined above. (https://www.unmarried.org/government-terminology/)

incomecensus %>%
  filter(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried') %>% 
  summarise(count = n())

## # A tibble: 1 x 1
##   count
##   <int>
## 1   120

There are 130 rows where people are ‘Married-spouse-absent’ and ‘Unmarried’ at the same time. These records will be removed.

incomecensus <- incomecensus %>%
  filter(!(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried'))

3.4.2D. age and weekly working hour

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(size = 2, shape = 23, color = "blue") + 
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

Some people aged older than 70 and work more than 80 hours a week. Even for people with 90 years of age, there are records of 100 hours per week.

To highlight these data point

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") +
  geom_point(aes(col = (age >= 70 & hours.per.week >= 40)), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('red','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

incomecensus %>%
  filter(age >= 70 & hours.per.week > 50) %>% 
  summarise(counts =n())

## # A tibble: 1 x 1
##   counts
##    <int>
## 1     25

Data points with age more than and equal to 70 years and working hours greater than 50 hours can be removed.

incomecensus <- incomecensus %>%
   filter(!(age >= 70 & hours.per.week > 50))

3.4.2E. capital loss and capital gain

Taking a look at the capital gain.

incomecensus %>% ggplot(aes(capital.gain)) + 
  geom_histogram(fill= "lightblue",
                 color = 'blue',
                 binwidth = 5000) +   
  labs(title= "Capital gain Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

There are many points with 99999 as Capital gain, which seems suspicious and should be eliminated from the analysis. We can remove all the records where the capital gain is equal to 99999.

incomecensus %>%
  group_by(capital.gain) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.gain)) %>% 
  head(10)

## # A tibble: 10 x 2
##    capital.gain counts
##           <dbl>  <int>
##  1        99999    147
##  2        41310      2
##  3        34095      3
##  4        27828     32
##  5        25236     10
##  6        25124      1
##  7        22040      1
##  8        20051     31
##  9        18481      2
## 10        15831      6

incomecensus <- incomecensus %>%
   filter(!(capital.gain == 99999))

incomecensus %>% ggplot(aes(capital.loss)) + 
  geom_histogram(fill= "lightblue",
                 color = 'blue',
                 binwidth = 100) +   
  labs(title= "Capital loss Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>%
  group_by(capital.loss) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.loss)) %>% 
  head(10)

## # A tibble: 10 x 2
##    capital.loss counts
##           <dbl>  <int>
##  1         4356      1
##  2         3900      2
##  3         3770      2
##  4         3683      2
##  5         3004      1
##  6         2824      8
##  7         2754      2
##  8         2603      4
##  9         2559     12
## 10         2547      4

It seems that there is no unusual data point.

3.4.2F Work class and occupation

table(incomecensus$occupation, incomecensus$workclass)

##                    
##                        ? Federal-gov Local-gov Never-worked Private
##   ?                    0           0         0            0       0
##   Adm-clerical         0         308       269            0    2360
##   Armed-Forces         0           9         0            0       0
##   Craft-repair         0          62       138            0    2377
##   Exec-managerial      0         175       209            0    2296
##   Farming-fishing      0           8        27            0     427
##   Handlers-cleaners    0          21        44            0    1056
##   Machine-op-inspct    0          14        11            0    1571
##   Other-service        0          33       187            0    2349
##   Priv-house-serv      0           0         0            0     137
##   Prof-specialty       0         164       664            0    2001
##   Protective-serv      0          27       291            0     183
##   Sales                0          14         7            0    2508
##   Tech-support         0          66        38            0     667
##   Transport-moving     0          24       112            0    1093
##                    
##                  Self-emp-inc Self-emp-not-inc State-gov Without-pay
##   Adm-clerical            28               46       245           3
##   Armed-Forces             0                0         0           0
##   Craft-repair            96              485        52           1
##   Exec-managerial        360              369       181           0
##   Farming-fishing         51              415        15           5
##   Handlers-cleaners        2               15         9           1
##   Machine-op-inspct       10               35        13           1
##   Other-service           27              171       119           0
##   Priv-house-serv          0                0         0           0
##   Prof-specialty         138              346       392           0
##   Protective-serv          5                6       112           0
##   Sales                  266              362        10           0
##   Tech-support             3               26        55           0
##   Transport-moving        26              118        40           1

4. Exploratory data analysis (EDA)

4A. income

Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n())

## # A tibble: 2 x 2
##   income count
##   <fct>  <int>
## 1 <=50K  19888
## 2 >50K    6720

income_prop <- incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(income)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

ggplot(data = income_prop, 
       aes(x = "", 
           y = percentage, 
           fill = income))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, label = paste(percentage,"%", sep = "")), 
            col = "blue", 
            size = 5) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 14),
        legend.text = element_text(color = "black", size = 14))

The summary of the data shows that 75 % of the observations have an income less than or equal to 50k dollars. Specifically, 22654 persons have an income <=50k dollars, while 7508 people earn more than 50k.

4B. Age

incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(fill= "lightblue",
                 color = 'blue',
                 binwidth = 5) +   
  labs(title= "Age Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

summary(incomecensus$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   38.00   38.97   48.00   90.00

incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(aes(fill=income),
                 color = 'grey',
                 binwidth = 1) +   
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(title= "Age Distribution for Income")+
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  ggplot(aes(age, 
             fill= income)) +
  geom_density(alpha= 0.7, color = 'blue') +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(x = "Age", 
       y = "Density", 
       title = "Density graph of age distribution")

Older people tends to earn more. On the other hand, majority of the people with an age around 25 years earns less than 50k per annum.

4C. workclass

library(ggthemes)
library(hrbrthemes)

incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, -counts),
             y  = counts,
             fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n",Percentage,"%")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,25000)) +
  theme_minimal() +
  labs(x = "Work class",
       y = "Frequency",
       caption = "Income census US 1994")

Almost three-quarters of sample work in the private sector.

In this data, there are dictinction between Local-gov, State-gov, and Federal-gov.

We might group these three category if there is no difference in income level between them.

library(ggpubr)
library(scales)

p1 <- incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n",Percentage,"%")), 
            vjust = 0.5, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Work class",y = "Frequency") + 
  coord_flip()

p2 <- incomecensus %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name= "Percentage", 
                     labels = percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p1, p2, nrow = 1)

income_gov <- incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>%
  summarise(count = n()) %>% 
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

income_gov <-  income_gov %>% transmute(income, percent = count*100/sum(count))
income_gov

## # A tibble: 6 x 3
## # Groups:   workclass [3]
##   workclass   income percent
##   <fct>       <fct>    <dbl>
## 1 State-gov   >50K      27.0
## 2 Local-gov   >50K      29.4
## 3 Federal-gov >50K      38.4
## 4 Federal-gov <=50K     61.6
## 5 Local-gov   <=50K     70.6
## 6 State-gov   <=50K     73.0

incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())

There is a slight difference between Local-gov and Stave-gov.

Working for the federal government can have a greater chance of getting an income higher than 50K.

incomecensus %>% 
  filter(workclass %in% c("Self-emp-not-inc", "Self-emp-inc", "Without-pay")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank()

4D. education

incomecensus %>% 
  group_by(education) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))

## # A tibble: 16 x 2
##    education    counts
##    <fct>         <int>
##  1 HS-grad        8201
##  2 Some-college   5853
##  3 Bachelors      4447
##  4 Masters        1534
##  5 Assoc-voc      1248
##  6 Assoc-acdm      990
##  7 11th            937
##  8 10th            763
##  9 7th-8th         534
## 10 Prof-school     486
## 11 9th             445
## 12 12th            356
## 13 Doctorate       349
## 14 5th-6th         275
## 15 1st-4th         146
## 16 Preschool        44

incomecensus %>% 
  group_by(education.num) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))

## # A tibble: 16 x 2
##    education.num counts
##            <dbl>  <int>
##  1             9   8201
##  2            10   5853
##  3            13   4447
##  4            14   1534
##  5            11   1248
##  6            12    990
##  7             7    937
##  8             6    763
##  9             4    534
## 10            15    486
## 11             5    445
## 12             8    356
## 13            16    349
## 14             3    275
## 15             2    146
## 16             1     44

We can see that these two columns are identical. The column education.num is simply converted from the column education.

If we keep the column education which is factor type, we might regroup some categories in this column. If we keep the column education.num which is numerical type,, we can keep the column as it is (numeric).

eudcation_df <- incomecensus %>%
  group_by(education.num) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(education.num)

eudcation_df

## # A tibble: 16 x 3
##    education.num counts Percentage
##            <dbl>  <int>      <dbl>
##  1             1     44       0.17
##  2             2    146       0.55
##  3             3    275       1.03
##  4             4    534       2.01
##  5             5    445       1.67
##  6             6    763       2.87
##  7             7    937       3.52
##  8             8    356       1.34
##  9             9   8201      30.8 
## 10            10   5853      22   
## 11            11   1248       4.69
## 12            12    990       3.72
## 13            13   4447      16.7 
## 14            14   1534       5.77
## 15            15    486       1.83
## 16            16    349       1.31

ggplot(data = eudcation_df ,
           aes(x= education.num,
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_x_continuous(breaks = c(0:16)) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n","(",Percentage,"%)")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,12000)) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Frequency",
       caption = "Income census US 1994") +
  ggtitle("Distribution of years of education") +
    theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12))

About 50% of respondents have 9-10 years of education.

About 17% of people in the sample hold bachelor degree (13 years of education).

education.pct <- incomecensus %>%
  group_by(education.num, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

ggplot(education.pct, 
       aes(education.num, pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  scale_x_continuous(breaks = c(0:16)) +
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  ggtitle("Income by years of education") + 
  xlab("Education (years)") + 
  ylab("Percentage") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12))

Higher years of education led to a higher chance of having >50K.

4D. marital status

incomecensus %>% 
  group_by(marital.status) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))

## # A tibble: 7 x 2
##   marital.status        counts
##   <fct>                  <int>
## 1 Married-civ-spouse     12236
## 2 Never-married           8369
## 3 Divorced                3996
## 4 Separated                926
## 5 Widowed                  811
## 6 Married-spouse-absent    249
## 7 Married-AF-spouse         21

We can put all the married people in the same group

This column have 7 labels, three of them are:

Married-AF-spouse: Married armed forces spouse
Married-civ-spouse: Married civilian spouse
Married-spouse-absent

These levels can be grouped into the group “married”.

Replace “Married-AF-spouse”, “Married-civ-spouse”, and “Married-spouse-absent” by “Married”

pat = c("Married-AF-spouse|Married-civ-spouse|Married-spouse-absent")
incomecensus <- incomecensus %>% 
  mutate(marital_status = stringr::str_replace_all(marital.status, pat, "Married"))

incomecensus <- incomecensus %>% 
  select(-marital.status)

marital_df <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),1)) %>% 
  arrange(desc(counts))

marital_df

## # A tibble: 5 x 3
##   marital_status counts Percentage
##   <chr>           <int>      <dbl>
## 1 Married         12506       47  
## 2 Never-married    8369       31.5
## 3 Divorced         3996       15  
## 4 Separated         926        3.5
## 5 Widowed           811        3

The proportion of married people that have an income higher than 50K is highest.

p3 <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(marital_status, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Marital status",y = "Frequency") + 
  coord_flip()

p4 <- incomecensus %>% 
  group_by(marital_status, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(marital_status, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p3, p4, nrow = 1)

4F. occupation

Category_df <- incomecensus %>%
  group_by(occupation) %>% 
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(counts)

ggplot(data = Category_df,
           aes(x= reorder(occupation, -counts),
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.7) +
  geom_text(aes(label = paste0(counts,", ", Percentage,"%")), 
            vjust = 0.2, 
            hjust = -0.1,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,5000)) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  labs(x = "Category",
       y = "Frequency",
       caption = "Income census US 1994") +
  coord_flip()

occupation.pct <- incomecensus %>%
  group_by(occupation, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

occupation.pct$occupation <- factor(occupation.pct$occupation,
                                    levels = occupation.pct$occupation[1:(nrow(occupation.pct)/2)])

ggplot(occupation.pct, aes(reorder(occupation,-pct), pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  ggtitle("Income by occupation") + 
  xlab("Occupation") + 
  ylab("Percentage") + 
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "bottom", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12)) + 
  coord_flip()

About 25% of the sample has an income of more than 50K. There are five categories of jobs that have a higher percentage: Protective service, tech-support, Sales, Exec-managerial, and Prof-speciality.

Question: Are these people work in the private sector?

incomecensus %>% 
  filter(occupation %in% c("Protective-serv", "Tech-support", "Sales", "Exec-managerial", "Prof-specialty")) %>% 
  group_by(workclass) %>% 
  summarise(freq =  n()) %>% 
  mutate(percentage = round(freq*100/sum(freq),1)) %>%
  arrange(desc(percentage))

## # A tibble: 6 x 3
##   workclass         freq percentage
##   <fct>            <int>      <dbl>
## 1 Private           7655       64.1
## 2 Local-gov         1209       10.1
## 3 Self-emp-not-inc  1109        9.3
## 4 Self-emp-inc       772        6.5
## 5 State-gov          750        6.3
## 6 Federal-gov        446        3.7

About two-third of respondents work in the private sector.

Governmental job accounts for 18.81 %.

occupation_distr <- incomecensus %>%
  group_by(occupation) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(occupation, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Grouped Occupation", y = "Frequency") + 
  coord_flip()

occupation_pct <- incomecensus %>% 
  group_by(occupation, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(occupation, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(occupation_distr, occupation_pct, nrow = 1)

4G. relationship

incomecensus %>% 
  ggplot(aes(income, color= income, fill= income)) +
  geom_bar( alpha = 0.8, width = 0.8) +
  facet_grid(~ relationship) + 
  labs(x = "Incomes", y = "Count", title = "Incomes by relationship")

This column contains categories that might be overlap with other feature.

For example, unmarried people in the relationship attribute is the same as the unmarried level in the marital column.

relationship_distr <- incomecensus %>%
  group_by(relationship) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(relationship, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Relationship", y = "Frequency") + 
  coord_flip()

relationship_pct <- incomecensus %>% 
  group_by(relationship, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(relationship, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(relationship_distr, relationship_pct, nrow = 1)

This variable may be redundant as it collapse somehow with the feature marital_status.

4H. race

incomecensus %>% 
  ggplot(aes(race, fill= income)) +
  geom_bar(position = "fill") +
  labs(x = "Race", 
       y = "Proportion", 
       title = "Incomes by race")+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_hline(yintercept = 0.2489, col="blue") +
  coord_flip()

However, the percent of whites and “Asian-Pac-Islander” that earn more than 50k is over the general population average.

4H. sex

Here we can see that the vast majority of people having an income greater than 50000 dollars are males.

gender_prop <- incomecensus %>% 
  group_by(sex) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(sex)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

gender_distr <- ggplot(data = gender_prop, 
       aes(x = "", 
           y = percentage, 
           fill = sex))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, 
                label = paste(percentage,"%", sep = "")), col = "blue", size = 4) +
  scale_fill_manual(values=c("orange", "lightblue"),
                    name = "Gender") +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))


gender_prop <- incomecensus %>% 
  group_by(sex, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(sex, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

ggarrange(gender_distr, gender_prop, nrow = 1)

Two-third of respondents are male.

The proportion of women that earn more than 50k is much lower than that of their male counterparts.

4I. house per week

Distribution of hours per week

incomecensus %>% ggplot(aes(hours.per.week)) + 
  geom_histogram(fill= "orange",
                 color = 'blue',
                 binwidth = 5) +   
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  group_by(income) %>%
  summarise("Mean hours per week" = mean(hours.per.week),
            "Standard deviatrion" = sd(hours.per.week))

## # A tibble: 2 x 3
##   income `Mean hours per week` `Standard deviatrion`
##   <fct>                  <dbl>                 <dbl>
## 1 <=50K                   39.5                  12.3
## 2 >50K                    45.8                  11.0

ggplot(data = incomecensus, 
       aes(income, 
           hours.per.week, 
           fill = income))+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  geom_boxplot()+
  labs(x = "Incomes", 
       y = "Worked hours per week", 
       title = "Incomes by working hours")

There are many outliers. Some people work for 100 hours a week (which is possible?).

4J. native.country

There are 40 different native countries and some of them have just a few cases, for instance, Honduras have just 10 cases. Therefore, the first attempt is to group the countries by continent.

incomecensus <- incomecensus %>% 
  mutate(native_continent = case_when(
    native.country %in% c("France", "Greece", "Hungary", "Italy", "Portugal",
                          "Scotland", "England", "Germany", "Holand-Netherlands",
                          "Ireland", "Poland", "Yugoslavia") ~ "Europe",
    native.country %in% c("Columbia", "Dominican-Republic", "El-Salvador", "Haiti",
                          "Honduras", "Mexico", "Outlying-US(Guam-USVI-etc)",
                          "Cuba", "Ecuador", "Guatemala", "Jamaica", "Nicaragua",
                          "Peru", "Puerto-Rico", "Trinadad&Tobago") ~ "Latin America", # "Trinadad&Tobago" and "Outlying-US(Guam-USVI-etc" 
    native.country %in% c("Iran", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia",
                          "China", "Hong", "India", "Laos", "South", "Thailand") ~ "Asia",
    native.country %in% c("United-States","Canada") ~ "USA/Canada"))

incomecensus %>% 
  group_by(native_continent) %>% 
  count()

## # A tibble: 4 x 2
## # Groups:   native_continent [4]
##   native_continent     n
##   <chr>            <int>
## 1 Asia               693
## 2 Europe             490
## 3 Latin America     1314
## 4 USA/Canada       24111

incomecensus %>% 
  group_by(native_continent, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(native_continent, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "blue") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

5. Preparing the final data for machine learning

We will remove some columns

incomecensus <- incomecensus %>% 
  select(-c("education","capital.gain","capital.loss", "relationship", "native.country"))

The final data

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of  10 variables:
##  $ age             : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass       : Factor w/ 9 levels "?","Federal-gov",..: 5 5 5 5 5 8 2 5 7 5 ...
##  $ education.num   : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ occupation      : Factor w/ 5 levels "?","White_collar",..: 5 4 5 3 2 5 5 5 5 3 ...
##  $ race            : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex             : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ hours.per.week  : num  18 40 40 45 40 20 40 35 45 20 ...
##  $ income          : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
##  $ marital_status  : chr  "Widowed" "Divorced" "Separated" "Divorced" ...
##  $ native_continent: chr  "USA/Canada" "USA/Canada" "USA/Canada" "USA/Canada" ...

We need to convert two columns marital_status and native_continent in factor

incomecensus$marital_status <- as.factor(incomecensus$marital_status)
incomecensus$native_continent <- as.factor(incomecensus$native_continent)

We will save this data as income.csv

write.csv(incomecensus,file = "income.csv", row.names= FALSE)

6. Building models

data <- read_csv("income.csv")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   education.num = col_double(),
##   occupation = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   hours.per.week = col_double(),
##   income = col_character(),
##   marital_status = col_character(),
##   native_continent = col_character()
## )

data <- data %>%
  mutate_if(is.character,as.factor)

str(data)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of  10 variables:
##  $ age             : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass       : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
##  $ education.num   : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ occupation      : Factor w/ 4 levels "Blue_collar",..: 2 1 2 3 4 2 2 2 2 3 ...
##  $ race            : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex             : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ hours.per.week  : num  18 40 40 45 40 20 40 35 45 20 ...
##  $ income          : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
##  $ marital_status  : Factor w/ 5 levels "Divorced","Married",..: 5 1 4 1 4 3 1 1 3 5 ...
##  $ native_continent: Factor w/ 4 levels "Asia","Europe",..: 4 4 4 4 4 4 4 4 4 4 ...

library(caret)

Trainindex <- createDataPartition(y = data$income , p = .70, list = FALSE)

training <- data[Trainindex ,]
validation <- data[-Trainindex,]

training_new <- training[-8]
validation_new <- validation[-8]

income_training_label <- training$income
income_validation_label <- validation$income

6.1. Logistic regression

set.seed(123)
default_glm_mod <- train(form = income ~ .,
                         data = training,
                         method = "glm",
                         family = "binomial",
                         tuneLength = 5)

glm_pred <- predict(default_glm_mod, newdata = validation)

confusionMatrix(glm_pred, validation$income)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5495  972
##      >50K    471 1044
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.8106, 0.8276)
##     No Information Rate : 0.7474          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4783          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9211          
##             Specificity : 0.5179          
##          Pos Pred Value : 0.8497          
##          Neg Pred Value : 0.6891          
##              Prevalence : 0.7474          
##          Detection Rate : 0.6884          
##    Detection Prevalence : 0.8102          
##       Balanced Accuracy : 0.7195          
##                                           
##        'Positive' Class : <=50K           
##

6.2. Tree based methods

6.2.1. rpart

library(rpart)
library(rpart.plot)

rparttree <- rpart(income ~ ., 
                   data = training,
                   method = "class")

rparttree

## n= 18626 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 18626 4704 <=50K (0.74744980 0.25255020)  
##   2) marital_status=Divorced,Never-married,Separated,Widowed 9806  725 <=50K (0.92606567 0.07393433) *
##   3) marital_status=Married 8820 3979 <=50K (0.54886621 0.45113379)  
##     6) education.num< 12.5 6116 2072 <=50K (0.66121648 0.33878352) *
##     7) education.num>=12.5 2704  797 >50K (0.29474852 0.70525148) *

# Plot the trees
rpart.plot(rparttree)

Marital status is the most important feature for this data set.

6.2.2. C5.0 with boosting

library(C50)

set.seed(123)

treeC5 <- C5.0(income ~ .,
                data = training,
                trials = 100)

treeC5_Pred <- predict(treeC5, validation)

confusionMatrix(treeC5_Pred, validation$income)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5439  842
##      >50K    527 1174
##                                         
##                Accuracy : 0.8285        
##                  95% CI : (0.82, 0.8367)
##     No Information Rate : 0.7474        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.521         
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.9117        
##             Specificity : 0.5823        
##          Pos Pred Value : 0.8659        
##          Neg Pred Value : 0.6902        
##              Prevalence : 0.7474        
##          Detection Rate : 0.6814        
##    Detection Prevalence : 0.7869        
##       Balanced Accuracy : 0.7470        
##                                         
##        'Positive' Class : <=50K         
##

6.2.3. Random forest

library(randomForest)

set.seed(123)
treeRf <- randomForest(income ~ .,
 data = training,
 ntree = 500,
 mtry = 3,
 importance = TRUE)

treeRf_Pred <- predict(treeRf, validation)

confusionMatrix(treeRf_Pred, validation$income)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5377  865
##      >50K    589 1151
##                                           
##                Accuracy : 0.8178          
##                  95% CI : (0.8092, 0.8263)
##     No Information Rate : 0.7474          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4946          
##                                           
##  Mcnemar's Test P-Value : 5.517e-13       
##                                           
##             Sensitivity : 0.9013          
##             Specificity : 0.5709          
##          Pos Pred Value : 0.8614          
##          Neg Pred Value : 0.6615          
##              Prevalence : 0.7474          
##          Detection Rate : 0.6736          
##    Detection Prevalence : 0.7820          
##       Balanced Accuracy : 0.7361          
##                                           
##        'Positive' Class : <=50K           
##

6.3. NAIVE BAYES

library(e1071)

set.seed(123)
NB <- naiveBayes(income ~., data = training)
# that one is faster than the package caret

NB_pred <- predict(NB, validation, type="class")
confusionMatrix(NB_pred,validation$income)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5224  760
##      >50K    742 1256
##                                           
##                Accuracy : 0.8118          
##                  95% CI : (0.8031, 0.8203)
##     No Information Rate : 0.7474          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5001          
##                                           
##  Mcnemar's Test P-Value : 0.6609          
##                                           
##             Sensitivity : 0.8756          
##             Specificity : 0.6230          
##          Pos Pred Value : 0.8730          
##          Neg Pred Value : 0.6286          
##              Prevalence : 0.7474          
##          Detection Rate : 0.6545          
##    Detection Prevalence : 0.7497          
##       Balanced Accuracy : 0.7493          
##                                           
##        'Positive' Class : <=50K           
##

6.4. KNN

set.seed(3333)
knn_fit <- train(income ~., 
                 data = training, 
                 method = "knn",
                 preProcess = c("center", "scale"),tuneLength = 5)

knn_fit

## k-Nearest Neighbors 
## 
## 18626 samples
##     9 predictor
##     2 classes: '<=50K', '>50K' 
## 
## Pre-processing: centered (24), scaled (24) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 18626, 18626, 18626, 18626, 18626, 18626, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7844624  0.4197798
##    7  0.7927082  0.4377792
##    9  0.7980115  0.4499211
##   11  0.8006633  0.4561392
##   13  0.8028696  0.4613813
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.

KNN_pred <- predict(knn_fit, validation, type="raw")
confusionMatrix(KNN_pred,validation$income)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5319  874
##      >50K    647 1142
##                                          
##                Accuracy : 0.8094         
##                  95% CI : (0.8007, 0.818)
##     No Information Rate : 0.7474         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4758         
##                                          
##  Mcnemar's Test P-Value : 6.837e-09      
##                                          
##             Sensitivity : 0.8916         
##             Specificity : 0.5665         
##          Pos Pred Value : 0.8589         
##          Neg Pred Value : 0.6383         
##              Prevalence : 0.7474         
##          Detection Rate : 0.6664         
##    Detection Prevalence : 0.7759         
##       Balanced Accuracy : 0.7290         
##                                          
##        'Positive' Class : <=50K          
##

Research analyst
Data analyst
Data storyteller

Classification of income using US census data (1994)

1. Importing data

2. Inspecting the data

3. Preparing the consistent the data

3.1. Removing unnecessary column

3.2. Detecting missing values

3.3. Checking duplicate records

3.4. Dealing with unusual data points

3.4.2 Checking the consistency of each column

3.4.2A. age and education

3.4.2B. sex and relationship status

3.4.2C. Relationship and marital status

3.4.2D. age and weekly working hour

3.4.2E. capital loss and capital gain

3.4.2F Work class and occupation

4. Exploratory data analysis (EDA)

4A. income

4B. Age

4C. workclass

4D. education

4D. marital status

4F. occupation

4G. relationship

4H. race

4H. sex

4I. house per week

4J. native.country

5. Preparing the final data for machine learning

6. Building models

6.1. Logistic regression

6.2. Tree based methods

6.2.1. rpart

6.2.2. C5.0 with boosting

6.2.3. Random forest

6.3. NAIVE BAYES

6.4. KNN

Recent Posts

Comments

Research analyst Data analyst Data storyteller

1. Importing data

2. Inspecting the data

3. Preparing the consistent the data

3.1. Removing unnecessary column

3.2. Detecting missing values

3.3. Checking duplicate records

3.4. Dealing with unusual data points

3.4.2 Checking the consistency of each column

3.4.2A. age and education

3.4.2B. sex and relationship status

3.4.2C. Relationship and marital status

3.4.2D. age and weekly working hour

3.4.2E. capital loss and capital gain

3.4.2F Work class and occupation

4. Exploratory data analysis (EDA)

4A. income

4B. Age

4C. workclass

4D. education

4D. marital status

4F. occupation

4G. relationship

4H. race

4H. sex

4I. house per week

4J. native.country

5. Preparing the final data for machine learning

6. Building models

6.1. Logistic regression

6.2. Tree based methods

6.2.1. rpart

6.2.2. C5.0 with boosting

6.2.3. Random forest

6.3. NAIVE BAYES

6.4. KNN

Comments

Research analyst
Data analyst
Data storyteller