chi2 is not appropriate

sklearn's chi2 is mainly intended for binary indicators or term counts, so one needs to be careful before treating its output as a Pearson chi-square test on other kinds of features. Below I compare several ways of computing the statistic; the dataset I use is the Pima diabetes dataset.

A peek at the dataset

from pandas import read_csv

# load data
filename = 'dataset/diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename)
dataframe.columns = names
dataframe.head(2)
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0

Use function chi2

  • Convert the data frame to an array
array = dataframe.values
array
array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
[ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],
[ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],
...,
[ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],
[ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],
[ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])
  • Use chi2 to calculate chi-square values
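import pandas as pd
from sklearn.feature_selection import chi2

# split the array into features (first 8 columns) and class label (last column)
X = array[:, 0:8]
Y = array[:, 8]

# chi square values
scores, pvalues = chi2(X, Y)
pd.Series(scores, index=names[0:8])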
preg     111.519691
plas 1411.887041
pres 17.605373
skin 53.108040
test 2175.565273
mass 127.669343
pedi 5.392682
age 181.303689
dtype: float64

Accumulate values under the same class

# %%
import pandas as pd
import numpy as np
from scipy.sparse import issparse
from sklearn.utils import check_array
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.extmath import safe_sparse_dot

label = 'class'
X,y = dataframe.drop(columns = label), dataframe[label]

def preprocess_X_y(X, y):
    X = check_array(X, accept_sparse="csr")
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features

    feature_count = X.sum(axis=0).reshape(1, -1)
    class_prob = Y.mean(axis=0).reshape(1, -1)
    expected = np.dot(class_prob.T, feature_count)
    return observed, expected

f_obs, f_exp = preprocess_X_y(X, y)

# %%
from scipy.stats import chisquare

pd.Series(chisquare(f_obs, f_exp).statistic, index=X.columns)
preg     111.519691
plas 1411.887041
pres 17.605373
skin 53.108040
test 2175.565273
mass 127.669343
pedi 5.392682
age 181.303689
dtype: float64
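
Written out, the quantity reproduced above appears to be the following. The symbols are introduced here only for illustration: x_{ij} is the value of feature j for sample i, y_i is the class of sample i, and p_k is the fraction of samples in class k.

```latex
O_{kj} = \sum_{i:\, y_i = k} x_{ij}, \qquad
E_{kj} = p_k \sum_{i} x_{ij}, \qquad
\chi^2_j = \sum_{k} \frac{(O_{kj} - E_{kj})^2}{E_{kj}}
```

Here O corresponds to observed, E to expected, and the sum over k is the sum over classes that chisquare performs. The raw feature values are summed within each class, which is why the result matches chi2 exactly.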

Use function chi2_contingency

from scipy.stats import chi2_contingency

chi_square = pd.Series(float(0), index=names[0:8])

for n in names[0:8]:
    # Create contingency table without margins: chi2_contingency expects only
    # the observed counts (a Total row/column would inflate dof and the p-value)
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'])
    score, pvalue, dof, expected = chi2_contingency(data_crosstab)
    chi_square[n] = score
chi_square
preg     64.594809
plas 269.733242
pres 54.933964
skin 73.562894
test 227.769830
mass 286.470253
pedi 533.024096
age 140.937520
dtype: float64
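
A side note on the contingency table passed to chi2_contingency: the function expects only the observed counts. The sketch below uses a made-up 3x2 table (not the Pima data) to illustrate that appending a Total row and column, as crosstab does with margins=True, should leave the statistic unchanged but changes the degrees of freedom, and with them the p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: rows are categories of a feature, columns are classes.
table = np.array([[20, 10],
                  [15, 25],
                  [30, 30]])

# Same table with a "Total" row and column appended,
# mimicking pd.crosstab(..., margins=True).
with_margins = np.zeros((4, 3), dtype=int)
with_margins[:3, :2] = table
with_margins[:3, 2] = table.sum(axis=1)
with_margins[3, :2] = table.sum(axis=0)
with_margins[3, 2] = table.sum()

res_plain = chi2_contingency(table)
res_margins = chi2_contingency(with_margins)

# Index 0 is the statistic, index 2 the degrees of freedom.
stat_plain, dof_plain = res_plain[0], res_plain[2]
stat_margins, dof_margins = res_margins[0], res_margins[2]

print(stat_plain, dof_plain)
print(stat_margins, dof_margins)
```

The margin cells have expected counts equal to their observed counts, so they add nothing to the statistic; only the degrees of freedom are affected.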

Use regular formula

In this section I build the contingency table myself and compute the chi-square scores from it directly.

  • Contingency table between pregnancy and class
data_crosstab = pd.crosstab(dataframe['preg'], dataframe['class'],
                            margins=True, margins_name="Total")
data_crosstab
preg \ class     0     1   Total
0               73    38     111
1              106    29     135
...            ...   ...     ...
Total          500   268     768
  • Use the chi-square formula to calculate the chi-square score for each variable
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# initialise with float(0) so the Series has a float dtype for accumulation
chi_square = pd.Series(float(0), index=names[0:8])
columns = dataframe['class'].unique()

for n in names[0:8]:
    # Create contingency table (with totals, which the formula below uses)
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'],
                                margins=True, margins_name="Total")
    rows = dataframe[n].unique()
    for i in columns:
        for j in rows:
            # observed count for value j of the feature and class i
            O = data_crosstab[i][j]
            # expected count: column total * row total / grand total
            E = data_crosstab[i]['Total'] * data_crosstab['Total'][j] / data_crosstab['Total']['Total']
            chi_square[n] += (O - E) ** 2 / E
chi_square
preg     64.594809
plas 269.733242
pres 54.933964
skin 73.562894
test 227.769830
mass 286.470253
pedi 533.024096
age 140.937520
dtype: float64
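
The same formula can also be written without explicit loops. The following is a minimal sketch on made-up data (the column names feature and label are hypothetical, not from the Pima dataset): it builds the expected counts as an outer product of row and column totals and cross-checks the result against chi2_contingency with the continuity correction switched off.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: a discrete feature and a binary class label.
df = pd.DataFrame({
    "feature": [0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 1, 2],
    "label":   [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})

# Observed counts, without margins.
observed = pd.crosstab(df["feature"], df["label"]).to_numpy()

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)   # shape (n_rows, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, n_cols)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic summed over all cells.
chi_square_manual = ((observed - expected) ** 2 / expected).sum()

# Cross-check with scipy (index 0 of the result is the statistic).
chi_square_scipy = chi2_contingency(observed, correction=False)[0]

print(chi_square_manual, chi_square_scipy)
```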

Conclusion

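For continuous variables it is better not to use chi2: the function sums the raw feature values within each class as if they were counts, so the resulting statistic need not follow a chi-square distribution, and it differs from the score obtained from a contingency table.

One could also refer to the documentation of chi2, which indicates that this function is usually applied to binary indicators or term counts.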
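
One way to see the problem is that the score returned by chi2 depends on the unit of measurement. The sketch below uses synthetic data (the feature and its relation to the class are made up for illustration) and rescales a continuous feature by a factor of 100; since observed and expected sums both scale with the feature, the score should scale by the same factor.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Synthetic binary class and a non-negative continuous feature
# that is loosely related to the class.
y = rng.integers(0, 2, size=200)
x = rng.uniform(20.0, 40.0, size=200) + 3.0 * y
X = x.reshape(-1, 1)

score_original, _ = chi2(X, y)
# The same measurement expressed in a unit 100 times smaller.
score_rescaled, _ = chi2(X * 100.0, y)

ratio = score_rescaled[0] / score_original[0]
print(score_original[0], score_rescaled[0], ratio)
```

A test statistic that grows with the choice of unit cannot be compared against a fixed chi-square distribution, which is consistent with the conclusion above.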

Reference