
chi2 is not appropriate

Posted on 2022-12-07 Edited on 2022-12-09 In python

scikit-learn's `chi2` is mainly meant for binary indicators or term counts, so one needs to be careful when treating its output as a Pearson chi-square test on other kinds of features. Below I compare several ways of computing the statistic; the dataset I use is the Pima dataset.

A peek at the dataset

from pandas import read_csv

# load data
filename = 'dataset/diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename)
dataframe.columns = names
dataframe.head(2)

	preg	plas	pres	skin	test	mass	pedi	age	class
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0

Use function `chi2`

Convert the data frame to an array

array = dataframe.values
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

Use chi2 to calculate chi-square values

import pandas as pd
from sklearn.feature_selection import chi2

# split the array into the 8 feature columns and the class label
X = array[:, 0:8]
Y = array[:, 8]

# chi square values
scores, pvalues = chi2(X, Y)
pd.Series(scores, index=names[0:8])

preg     111.519691
plas    1411.887041
pres      17.605373
skin      53.108040
test    2175.565273
mass     127.669343
pedi       5.392682
age      181.303689
dtype: float64

Accumulate values under the same class

# %%
import pandas as pd
import numpy as np
from scipy.sparse import issparse
from sklearn.utils import check_array
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.extmath import safe_sparse_dot

label = 'class'
X,y = dataframe.drop(columns = label), dataframe[label]

def preprocess_X_y(X, y):
    X = check_array(X, accept_sparse="csr")
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features

    feature_count = X.sum(axis=0).reshape(1, -1)
    class_prob = Y.mean(axis=0).reshape(1, -1)
    expected = np.dot(class_prob.T, feature_count)
    return observed, expected

f_obs, f_exp = preprocess_X_y(X, y)

# %%
from scipy.stats import chisquare

pd.Series(chisquare(f_obs, f_exp).statistic, index=X.columns)

preg     111.519691
plas    1411.887041
pres      17.605373
skin      53.108040
test    2175.565273
mass     127.669343
pedi       5.392682
age      181.303689
dtype: float64
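
Because the observed and expected values above are sums of the raw feature values, the score depends on the unit in which a feature is measured. The following is a minimal sketch of that effect, assuming made-up data (`X_toy` and `y_toy` are hypothetical and not part of the Pima dataset): multiplying every feature by 10 multiplies every score by 10, which a test statistic for independence should not do.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Hypothetical non-negative continuous features and a binary label
X_toy = rng.uniform(0, 100, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

scores, _ = chi2(X_toy, y_toy)
# Same data expressed in a different unit (every value times 10)
scores_rescaled, _ = chi2(X_toy * 10, y_toy)

# Ratio of the two score vectors, one entry per feature
print(scores_rescaled / scores)
```

For genuine counts there is no such freedom of unit, which is one way to read the restriction to binary indicators and term counts.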

Use function `chi2_contingency`

from scipy.stats import chi2_contingency
chi_square = pd.Series(float(0), index=names[0:8])

for n in names[0:8]:
    # Create the contingency table without margin totals, so that the
    # degrees of freedom and p-value are based on the real cells only
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'])
    score, pvalue, dof, expected = chi2_contingency(data_crosstab)
    chi_square[n] = score
chi_square

preg     64.594809
plas    269.733242
pres     54.933964
skin     73.562894
test    227.769830
mass    286.470253
pedi    533.024096
age     140.937520
dtype: float64
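
Whether a `Total` row and column are passed to `chi2_contingency` matters for more than display. The sketch below uses a hypothetical 3x2 table of counts (not taken from the Pima dataset) and appends the totals by hand. Under these assumptions the statistic stays the same, since every margin cell equals its expected value, but the degrees of freedom and therefore the p-value change.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: 3 feature values (rows) x 2 classes (columns)
table = np.array([[10, 20], [30, 40], [25, 15]])

# Append row totals as a column, then column totals as a row,
# which mimics a crosstab built with margins=True
with_rows = np.hstack([table, table.sum(axis=1, keepdims=True)])
with_margins = np.vstack([with_rows, with_rows.sum(axis=0, keepdims=True)])

stat_plain, p_plain, dof_plain, _ = chi2_contingency(table)
stat_margins, p_margins, dof_margins, _ = chi2_contingency(with_margins)

print(stat_plain, dof_plain, p_plain)
print(stat_margins, dof_margins, p_margins)
```

So the scores can be compared either way, but p-values should only be read from a table without the totals.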

Use regular formula

In this section I build the contingency tables and calculate the chi-square scores directly from the formula.

Contingency table between pregnancy and class

data_crosstab = pd.crosstab(dataframe['preg'], dataframe['class'],
                            margins = True, margins_name = "Total")
data_crosstab

class (preg)	0	1	Total
0	73	38	111
1	106	29	135
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
Total	500	268	768

Use the chi-square formula to calculate the chi-square score for each variable
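
For reference, the formula in question is the usual Pearson statistic. The symbols are my own notation rather than names from the code: \(O_{ij}\) is the observed count for feature value \(i\) and class \(j\), \(R_i\) and \(C_j\) are the corresponding row and column totals, and \(N\) is the grand total.

\[
E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

In the code that follows, `O` and `E` play the roles of \(O_{ij}\) and \(E_{ij}\), and the totals are read from the `Total` row and column of the crosstab.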

import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# initialise with float(0) so the Series has a float dtype and can accumulate fractional values
chi_square = pd.Series(float(0),index = names[0:8])
columns = dataframe['class'].unique()

for n in names[0:8]:
    # Create contingency table
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'],
                            margins = True, margins_name = "Total")
    rows = dataframe[n].unique()
    for i in columns:
        for j in rows:
            O = data_crosstab[i][j]
            E = data_crosstab[i]['Total'] * data_crosstab['Total'][j]/data_crosstab['Total']['Total']
            chi_square[n] += (O-E) **2/E
chi_square

preg     64.594809
plas    269.733242
pres     54.933964
skin     73.562894
test    227.769830
mass    286.470253
pedi    533.024096
age     140.937520
dtype: float64
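
The double loop can also be written in vectorised form. The following is only a sketch on a hypothetical 3x2 table of counts (not the Pima data): the expected counts are the outer product of the row and column totals divided by the grand total, and the result is compared with `chi2_contingency`. `correction=False` is passed only to make explicit that no continuity correction is involved.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 3 feature values (rows) x 2 classes (columns)
observed = np.array([[10, 20], [30, 40], [25, 15]], dtype=float)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# E_ij = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / observed.sum()
stat_manual = ((observed - expected) ** 2 / expected).sum()

stat_scipy, _, _, expected_scipy = chi2_contingency(observed, correction=False)

print(stat_manual, stat_scipy)
```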

Conclusion

For continuous variables it is better not to use `chi2`: the function treats the feature values as counts, so the resulting statistic may not follow a chi-square distribution, and the scores above differ considerably from those of the contingency-table test.

The help document of `chi2` points the same way: it indicates that the function is usually applied to binary indicators or term counts.
