Feature Engineering 2 - Missing values - Categorial

4 minute read

Missing Values - Feature Engineering - Categorial Variable

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

How to handle categorial missing values

df=pd.read_csv('Datasets/House_prices/train.csv')
df.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

df[df.columns[df.isna().sum()>0]].isna().mean().sort_values()

Electrical      0.000685
MasVnrType      0.005479
MasVnrArea      0.005479
BsmtQual        0.025342
BsmtCond        0.025342
BsmtFinType1    0.025342
BsmtExposure    0.026027
BsmtFinType2    0.026027
GarageCond      0.055479
GarageQual      0.055479
GarageFinish    0.055479
GarageType      0.055479
GarageYrBlt     0.055479
LotFrontage     0.177397
FireplaceQu     0.472603
Fence           0.807534
Alley           0.937671
MiscFeature     0.963014
PoolQC          0.995205
dtype: float64

1. Frequent categories imputation

df1=pd.read_csv('Datasets/House_prices/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

df1.head()

	BsmtQual	FireplaceQu	GarageType	SalePrice
0	Gd	NaN	Attchd	208500
1	Gd	TA	Attchd	181500
2	Gd	TA	Attchd	223500
3	TA	Gd	Detchd	140000
4	Gd	TA	Attchd	250000

df1.isna().sum()

BsmtQual        37
FireplaceQu    690
GarageType      81
SalePrice        0
dtype: int64

df1.isna().mean().sort_values(ascending=True)

SalePrice      0.000000
BsmtQual       0.025342
GarageType     0.055479
FireplaceQu    0.472603
dtype: float64

df1.shape

(1460, 4)

since there are less number of missing values in BsmtQual and GarageType we can relace them with most frequent category as it will not distort its relationship with another variable or the distribution

# Computing the frequency with every feature

df1.BsmtQual.value_counts()
#df.groupby(['BsmtQual'])['BsmtQual'].count()

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64

df1.BsmtQual.value_counts().plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe8df97a750>

linearly separable data

df1.GarageType.value_counts().plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe8dfaa4310>

linearly separable data

df1.FireplaceQu.value_counts().plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe8de336810>

linearly separable data

df1.GarageType.mode()[0]

'Attchd'

df1.GarageType.value_counts().index[0]

'Attchd'

def impute_nan(df,variable):
    most_freq_cat=df[variable].value_counts().index[0] #most_freq_cat=df[variable].mode()[0]
    df[variable].fillna(most_freq_cat,inplace=True)

for features in ['BsmtQual','GarageType','FireplaceQu']:
    impute_nan(df1,features)

df1

	BsmtQual	FireplaceQu	GarageType	SalePrice
0	Gd	Gd	Attchd	208500
1	Gd	TA	Attchd	181500
2	Gd	TA	Attchd	223500
3	TA	Gd	Detchd	140000
4	Gd	TA	Attchd	250000
...	...	...	...	...
1455	Gd	TA	Attchd	175000
1456	Gd	TA	Attchd	210000
1457	TA	Gd	Attchd	266500
1458	TA	Gd	Attchd	142125
1459	TA	Gd	Attchd	147500

1460 rows × 4 columns

df1.isna().sum()

BsmtQual       0
FireplaceQu    0
GarageType     0
SalePrice      0
dtype: int64

df1.FireplaceQu.value_counts().plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe8de7459d0>

linearly separable data

As FireplaceQu has large missing values (~47%), imputing the same with mode might distort the relationship of FireplaceQu with other variable. So it is not a good practice to impute with mode in case where there is higher percentage of missing values

Advantages

Easy and faster way to implement

Disadvantages

Since we are using more frequent label, it may use them in an over-represented way if there are many NANs.
It distorts the relation of the most frequent label.

2. Adding a variable to capture NAN

df2=pd.read_csv('Datasets/House_prices/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

df2['BsmtQual_NAN']=np.where(df2.BsmtQual.isna(),1,0)

impute_nan(df2,'BsmtQual')

df2.head()

	BsmtQual	FireplaceQu	GarageType	SalePrice
0	Gd	NaN	Attchd	208500
1	Gd	TA	Attchd	181500
2	Gd	TA	Attchd	223500
3	TA	Gd	Detchd	140000
4	Gd	TA	Attchd	250000

we are capturing the importance of missing value (here by using a variable+_NAN column) and also imputing the column with mode (most frequnent category)
Since we are capturing the importance of missing values, we can also apply this to the column having higher percentage of missing value (i.e. FireplaceQu)

df2['FireplaceQu_NAN']=np.where(df2.FireplaceQu.isna(),1,0)

impute_nan(df2,'FireplaceQu')

df2

	BsmtQual	FireplaceQu	GarageType	SalePrice	BsmtQual_NAN	FireplaceQu_NAN
0	Gd	Gd	Attchd	208500	0	1
1	Gd	TA	Attchd	181500	0	0
2	Gd	TA	Attchd	223500	0	0
3	TA	Gd	Detchd	140000	0	0
4	Gd	TA	Attchd	250000	0	0
...	...	...	...	...	...	...
1455	Gd	TA	Attchd	175000	0	0
1456	Gd	TA	Attchd	210000	0	0
1457	TA	Gd	Attchd	266500	0	0
1458	TA	Gd	Attchd	142125	0	1
1459	TA	Gd	Attchd	147500	0	1

1460 rows × 6 columns

3. Replace NAN with a new category

we use it if we have more frequent categories

df3=pd.read_csv('Datasets/House_prices/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

def impute_nan_new_cat(df,variable):
    df[variable+"_newvar"]=df[variable].fillna('Missing_'+variable)

for features in ['BsmtQual','GarageType','FireplaceQu']:
    impute_nan_new_cat(df3,features)

df3

	BsmtQual	FireplaceQu	GarageType	SalePrice	BsmtQual_newvar	GarageType_newvar	FireplaceQu_newvar
0	Gd	NaN	Attchd	208500	Gd	Attchd	Missing_FireplaceQu
1	Gd	TA	Attchd	181500	Gd	Attchd	TA
2	Gd	TA	Attchd	223500	Gd	Attchd	TA
3	TA	Gd	Detchd	140000	TA	Detchd	Gd
4	Gd	TA	Attchd	250000	Gd	Attchd	TA
...	...	...	...	...	...	...	...
1455	Gd	TA	Attchd	175000	Gd	Attchd	TA
1456	Gd	TA	Attchd	210000	Gd	Attchd	TA
1457	TA	Gd	Attchd	266500	TA	Attchd	Gd
1458	TA	NaN	Attchd	142125	TA	Attchd	Missing_FireplaceQu
1459	TA	NaN	Attchd	147500	TA	Attchd	Missing_FireplaceQu

1460 rows × 7 columns

Share on

Twitter Facebook Google+ LinkedIn

Rohit Kumar

Feature Engineering 2 - Missing values - Categorial

Missing Values - Feature Engineering - Categorial Variable

How to handle categorial missing values

1. Frequent categories imputation

Advantages

Disadvantages

2. Adding a variable to capture NAN

3. Replace NAN with a new category

Share on

You May Also Enjoy

Python Pandas - String and Regular Expression

Python Pandas - Joining and merging DataFrame

Python Pandas - DataFrame

Feature Engineering 7 - Outliers and Its impact on ML usecases