chi2 is not appropriate
scikit-learn's `chi2` is mainly intended for binary indicators or term counts, so one needs to be careful when using it to perform a Pearson chi-square test. I will compare several ways of computing the statistic to show how they differ; the dataset I use is the Pima dataset.
A peek at the dataset
from pandas import read_csv
| | preg | plas | pres | skin | test | mass | pedi | age | class |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
Use function chi2
- Convert the data frame to an array
array = dataframe.values
array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],
- Use chi2 to calculate chi-square values
import pandas as pd
preg 111.519691
Accumulate values under the same class
# %%
preg 111.519691
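The accumulation step can be sketched in plain Python. This is a minimal illustration, assuming the score is built by treating the per-class sums of a feature as observed frequencies and the class-proportional share of the feature total as expected frequencies. The function name `count_style_chi2` and the toy data are hypothetical and are not taken from the Pima dataset.

```python
from collections import defaultdict


def count_style_chi2(values, labels):
    """Chi-square score computed from per-class sums of a feature.

    Assumption: the per-class sums play the role of observed frequencies,
    which only makes sense when the feature holds counts or 0/1 indicators.
    """
    n = len(values)
    total = sum(values)

    # Observed "frequency" per class: the sum of the feature values in that class.
    observed = defaultdict(float)
    class_size = defaultdict(int)
    for v, y in zip(values, labels):
        observed[y] += v
        class_size[y] += 1

    # Expected value per class: the feature total split by class proportion.
    score = 0.0
    for y, size in class_size.items():
        expected = total * size / n
        score += (observed[y] - expected) ** 2 / expected
    return score


# Toy data, made up for illustration only.
preg = [6, 1, 8, 1, 0, 5, 3, 10]
label = [1, 0, 1, 0, 0, 0, 1, 1]
score = count_style_chi2(preg, label)
print(round(score, 4))
```

Note that no contingency table is built here: the feature values themselves are summed, so the result depends on the magnitude of the values and not only on how often each value occurs.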
Use function chi2_contingency
from scipy.stats import chi2_contingency
preg 64.594809
Use regular formula
In this section I generate contingency tables and use them to calculate the chi-square scores.
- Contingency table between pregnancy (`preg`) and `class`
data_crosstab = pd.crosstab(dataframe['preg'], dataframe['class'],
class (preg) | 0 | 1 | Total |
---|---|---|---|
0 | 73 | 38 | 111 |
1 | 106 | 29 | 135 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
Total | 500 | 268 | 768 |
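For reference, the standard Pearson statistic for a table with \(r\) rows and \(c\) columns can be written as below. The symbols are introduced here for illustration and are not the author's notation: \(O_{ij}\) is the observed count in cell \((i, j)\), \(R_i\) is the total of row \(i\), \(C_j\) is the total of column \(j\), and \(N\) is the grand total.

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
\]

For example, in the table above the cell for `preg` = 0 and `class` = 0 has \(O = 73\) and \(E = 111 \times 500 / 768 \approx 72.27\).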
- Use chi-square formula to calculate chi-square scores for each variable
import pandas as pd
preg 64.594809
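A minimal plain-Python sketch of the contingency-table calculation follows. It assumes the feature is treated as categorical, so every distinct value becomes one row of the table, and it applies no continuity correction. The function name `pearson_chi2` and the toy data are hypothetical.

```python
from collections import Counter


def pearson_chi2(feature, labels):
    """Pearson chi-square statistic from the feature-by-class contingency table."""
    n = len(feature)
    cell = Counter(zip(feature, labels))  # observed count per (level, class) cell
    row = Counter(feature)                # row totals, one row per feature level
    col = Counter(labels)                 # column totals, one column per class

    stat = 0.0
    for r, r_total in row.items():
        for c, c_total in col.items():
            expected = r_total * c_total / n
            # Counter returns 0 for cells that never occur in the data.
            stat += (cell[(r, c)] - expected) ** 2 / expected

    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


# Toy data, made up for illustration only.
preg = [0, 0, 1, 1, 1, 2, 2, 2]
label = [0, 0, 0, 1, 0, 1, 1, 1]
stat, dof = pearson_chi2(preg, label)
print(round(stat, 4), dof)
```

Here the statistic depends only on how often each (value, class) pair occurs, not on the numeric size of the values.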
Conclusion
For continuous variables, it is better not to use `chi2`, as the resulting statistic may not follow a chi-square distribution.
One could also refer to the documentation of `chi2`, which indicates that this function is usually applied to binary indicators or term counts.
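One way to see the problem is to check how the score reacts to a change of units. The sketch below assumes the score is computed from per-class sums, as in the accumulation section above. The symbols are introduced here for illustration: \(x_i\) is the feature value of sample \(i\), \(y_i\) its class, \(n_k\) the number of samples in class \(k\), and \(n\) the total number of samples.

\[
O_k = \sum_{i:\, y_i = k} x_i, \qquad
E_k = \frac{n_k}{n} \sum_{i=1}^{n} x_i, \qquad
S(x) = \sum_k \frac{(O_k - E_k)^2}{E_k}
\]

\[
S(a x) = \sum_k \frac{(a O_k - a E_k)^2}{a E_k} = a \, S(x), \qquad a > 0
\]

Under this assumption, rescaling a continuous feature (for example, changing its unit of measurement) rescales the score by the same factor, which would not happen for a statistic computed from frequencies in a contingency table.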
Reference
- chi2 should support categorical data other than binary or document count
- DOC only use chi2 on binary and counts features
- Scipy and Sklearn chi2 implementations give different results
- Does Chi-square test for independence (sklearn.feature_selection.SelectKBest) produce incorect results?
- Erroneous use of chi2 in the documentation
- derivation for chi-square test