
Multicollinearity arises in polynomial regression. As a special case of the linear model, polynomial regression suffers from two problems: the first is multicollinearity among the powers of \(x\), and the second is that if the power \(k\) is large, say more than \(3\) or \(4\), the magnitudes of these powers tend to vary over a very wide range.

To deal with these problems, two strategies are commonly used. One is to limit the degree to a cubic (\(k=3\)) if at all possible, and generally to no more than a quintic (\(k=5\)). The other is to center the \(x\)-variable, i.e., fit the model \[ y=\beta_0^*+\beta_1^*(x-\bar{x})+\cdots+\beta_k^*(x-\bar{x})^k \] In addition to centering, one can also scale the \(x\)-variable by dividing by its standard deviation \(s_x\), thus standardizing it to have zero mean and unit standard deviation.
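As an illustration (not from the book), here is a minimal numpy sketch that compares the conditioning of a raw cubic design matrix with a centered and standardized one; the simulated data and variable names are purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1000, 2000, size=200)   # wide-ranging predictor
y = 3 + 0.002 * x + 1e-6 * (x - x.mean()) ** 2 + rng.normal(0, 0.5, size=200)

# Raw powers: columns 1, x, x^2, x^3 are highly collinear
X_raw = np.vander(x, N=4, increasing=True)

# Centered and scaled powers: 1, z, z^2, z^3 with z = (x - x_bar) / s_x
z = (x - x.mean()) / x.std(ddof=1)
X_std = np.vander(z, N=4, increasing=True)

# Condition number as a rough collinearity diagnostic
print(np.linalg.cond(X_raw), np.linalg.cond(X_std))

# The standardized design matrix can be used for least squares as usual
beta_std, *_ = np.linalg.lstsq(X_std, y, rcond=None)
```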

Reference

  • Statistics and Data Analysis: From Elementary to Intermediate (p. 418)

When approximating the posterior distribution, one tends to ignore the normalizing constant in the equation \[ \begin{gather*} \pi(\theta|x)=\frac{\pi(x|\theta)\pi(\theta)}{\int\pi(x|\theta)\pi(\theta)d\theta}\\ =\frac{\pi(x|\theta)\pi(\theta)}{\pi(x)}\\ \propto\pi(x|\theta)\pi(\theta) \end{gather*} \] In fact, not all MCMC methods avoid the need for the normalizing constant. However, many of them do (such as the Metropolis-Hastings algorithm), since each iteration is based on the ratio \[ \begin{gather*} R(\theta_1,\theta_2)=\frac{\pi(\theta_1|x)}{\pi(\theta_2|x)}\\ =\frac{\pi(x|\theta_1)\pi(\theta_1)}{\pi(x)}\bigg/\frac{\pi(x|\theta_2)\pi(\theta_2)}{\pi(x)}\\ =\frac{\pi(x|\theta_1)\pi(\theta_1)}{\pi(x|\theta_2)\pi(\theta_2)} \end{gather*} \] You can easily see that \(R(\theta_1,\theta_2)\) does not involve the normalizing constant, so \(\pi(x)\) cancels out when approximating the posterior distribution \(\pi(\theta|x)\). Therefore, you can state that \(\pi(\theta|x)\) is proportional to \(\pi(x|\theta)\pi(\theta)\) and use this fact to approximate the posterior distribution.
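To make this concrete, here is a minimal sketch of a random-walk Metropolis-Hastings sampler for a toy normal-mean model; only the unnormalized posterior \(\pi(x|\theta)\pi(\theta)\) is ever evaluated. The model, prior, data, and tuning values are illustrative assumptions, not taken from the references.

```python
import numpy as np
from scipy import stats

# Toy model: x | theta ~ N(theta, 1), prior theta ~ N(0, 5^2)
x_obs = np.array([1.2, 0.8, 1.5, 0.3, 1.1])

def log_unnorm_posterior(theta):
    # log pi(x | theta) + log pi(theta); pi(x) is never needed
    return (stats.norm.logpdf(x_obs, loc=theta, scale=1.0).sum()
            + stats.norm.logpdf(theta, loc=0.0, scale=5.0))

rng = np.random.default_rng(1)
theta, draws = 0.0, []
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.5)        # symmetric random-walk proposal
    log_R = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if np.log(rng.uniform()) < log_R:            # accept with probability min(1, R)
        theta = proposal
    draws.append(theta)

print(np.mean(draws[1000:]))   # approximates the posterior mean after burn-in
```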

Reference

Normalizing constant irrelevant in Bayes Theorem?

Rationale behind ignoring the “denominator” in Bayes Rule

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

| | Column 1 | Column 2 | Total |
| --- | --- | --- | --- |
| Row 1 | \(a\) | \(b\) | \(n_{1\cdot}=a+b\) |
| Row 2 | \(c\) | \(d\) | \(n_{2\cdot}=c+d\) |
| Total | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(N\) |

The goal is to approximate the variance of the log odds-ratio, \(VAR[ln(OR)]\), when \(OR\) is expressed as \[ OR=\frac{p_1/(1-p_1)}{p_2/(1-p_2)} \] where \(p_1\) and \(p_2\) are the probabilities for group 1 and group 2, respectively.

Suppose \(Y=f(X)\); if \(E(X)=\mu\), the delta method approximates \(f(X)\) by a Taylor expansion around \(\mu\), \[ f(X)\approx f(\mu)+f'(\mu)(X-\mu)+\frac{1}{2}f''(\mu)(X-\mu)^2 \] Assuming \(VAR(X)=\sigma^2\), we have \[ \begin{gather*} E(Y)=E[f(X)]\approx f(\mu)+\frac{1}{2}f''(\mu)\sigma^2\\ VAR(Y)=VAR[f(X)]\approx [f'(\mu)]^2VAR(X) \end{gather*} \]
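As a quick sanity check (my own illustration, with made-up sample size and probability), the following simulation compares the delta-method variance approximation with the empirical variance of the sample logit.

```python
import numpy as np

# Check VAR[f(X)] ~ [f'(mu)]^2 VAR(X) for f(p) = ln(p / (1 - p)), X = p_hat
rng = np.random.default_rng(2)
n, p = 200, 0.3
p_hat = rng.binomial(n, p, size=100_000) / n
p_hat = np.clip(p_hat, 1e-6, 1 - 1e-6)   # guard against log(0) in rare samples

logit = np.log(p_hat / (1 - p_hat))
print(logit.var())                        # simulated variance of ln(p_hat/(1-p_hat))
print(1 / (n * p * (1 - p)))              # delta-method approximation
```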

Note that if \(E[Y]=E[f(X)]\) can be obtained exactly from the density of the transformed variable, then the delta method is not needed to approximate \(Y=f(X)\).

Let \(Y=ln(\hat{OR})\); since group 1 and group 2 are independent, we have \[ \begin{gather*} VAR(ln(\hat{OR}))=VAR[ln(\frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)})]\\ =VAR[ln(\frac{\hat{p}_1}{1-\hat{p}_1})]+VAR[ln(\frac{\hat{p}_2}{1-\hat{p}_2})]\\ =(\frac{1}{\hat{p}_1(1-\hat{p}_1)})^2\frac{\hat{p}_1(1-\hat{p}_1)}{n_{1\cdot}}+(\frac{1}{\hat{p}_2(1-\hat{p}_2)})^2\frac{\hat{p}_2(1-\hat{p}_2)}{n_{2\cdot}}\\ =\frac{1}{n_{1\cdot}\hat{p}_1(1-\hat{p}_1)}+\frac{1}{n_{2\cdot}\hat{p}_2(1-\hat{p}_2)}\\ =\frac{1}{n_{1\cdot}\hat{p}_1}+\frac{1}{n_{1\cdot}(1-\hat{p}_1)}+\frac{1}{n_{2\cdot}\hat{p}_2}+\frac{1}{n_{2\cdot}(1-\hat{p}_2)}\\ =\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d} \end{gather*} \] Where \(n_{1\cdot},n_{2\cdot}\) are counts for group 1 and group 2, respectively; \(a,b,c,d\) are 4 counts for the \(2\times 2\) contingency table.
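For example (hypothetical cell counts, not real data), the counts \(a,b,c,d\) from the \(2\times 2\) table plug directly into the formula above:

```python
import numpy as np

# Hypothetical 2x2 counts a, b, c, d
a, b, c, d = 20, 80, 10, 90

log_or = np.log((a / b) / (c / d))               # ln(OR) = ln(ad / bc)
var_log_or = 1 / a + 1 / b + 1 / c + 1 / d        # delta-method variance
se_log_or = np.sqrt(var_log_or)

# Approximate 95% confidence interval for the odds ratio
ci = np.exp(log_or + np.array([-1.96, 1.96]) * se_log_or)
print(log_or, se_log_or, ci)
```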

Reference

How to calculate the standard deviation of the log-odds?

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

The goal is to prove that the marginal distribution of the data \(n\) is a negative binomial distribution when the likelihood is a Poisson distribution and the prior on its rate is a Gamma distribution.

Suppose the count \(n\) is drawn from a Poisson distribution \[ n\sim Poisson(\mu) \] where \(E\) is a known constant (e.g., an exposure), \(\lambda=\mu/E\sim \Gamma(\alpha,\beta)\), and \(\beta\) is the rate parameter.

Re-express the distribution of \(n\) as \[ \begin{gather*} n|\lambda\sim Poisson(\lambda E)\\ f(n|\lambda,E)=\frac{(\lambda E)^n}{n!}e^{-\lambda E} \end{gather*} \] Suppose \(\alpha,\beta\) are known. Integrating over all possible values of \(\lambda\), we obtain the marginal distribution of \(n\) as \[ \begin{gather*} f(n|\alpha,\beta)=\int_0^{\infty}f(n|\lambda)\pi(\lambda)d\lambda\\ =\int_0^{\infty}\frac{(\lambda E)^n}{n!}e^{-\lambda E}\frac{(\beta \lambda)^{\alpha -1}e^{-\lambda \beta}}{\Gamma(\alpha)}\beta d\lambda\\ =\frac{E^n\beta^{\alpha}}{n!\Gamma(\alpha)}\int_0^{\infty}\lambda^ne^{-\lambda E}\lambda^{\alpha-1}e^{-\lambda \beta}d\lambda\\ =\frac{E^n\beta^{\alpha}}{n!\Gamma(\alpha)}\int_0^{\infty}e^{-\lambda(E+\beta)}\lambda^{n+\alpha-1}d\lambda\\ =\frac{E^n\beta^{\alpha}\Gamma(\alpha+n)}{n!\Gamma(\alpha)(\beta+E)^{n+\alpha}}\int_0^{\infty}\frac{(\beta+E)^{n+\alpha}}{\Gamma(\alpha+n)}e^{-\lambda(E+\beta)}\lambda^{n+\alpha-1}d\lambda\\ =\frac{1}{n!\Gamma(\alpha)}(\frac{E}{\beta+E})^n(\frac{\beta}{\beta+E})^{\alpha}\Gamma(\alpha+n)\cdot 1\\ =\frac{\Gamma(\alpha+n)}{n!\Gamma(\alpha)}(\frac{\beta}{E}+1)^{-n}(\frac{E}{\beta}+1)^{-\alpha} \end{gather*} \] \(f(n|\alpha,\beta)\) is exactly the probability mass function of the negative binomial distribution, and it is also the likelihood function with \(\alpha,\beta\) as parameters.

Based on the details provided in the derivation, it can also be stated that the negative binomial distribution is essentially a Gamma-Poisson mixture distribution, as it integrates aspects of both the Gamma and Poisson distributions.
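A quick numerical check of this Gamma-Poisson mixture claim (with arbitrary illustrative values of \(\alpha\), \(\beta\), and \(E\)): simulate the mixture and compare the empirical pmf with scipy's negative binomial pmf using success probability \(p=\beta/(\beta+E)\).

```python
import numpy as np
from scipy import stats

alpha, beta, E = 2.5, 1.5, 4.0      # illustrative values
rng = np.random.default_rng(3)

# Simulate the mixture: lambda ~ Gamma(alpha, rate=beta), n | lambda ~ Poisson(lambda * E)
lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)
n = rng.poisson(lam * E)

# Compare the empirical pmf with the negative binomial pmf derived above
k = np.arange(10)
empirical = np.array([(n == ki).mean() for ki in k])
theoretical = stats.nbinom.pmf(k, alpha, beta / (beta + E))
print(np.round(empirical, 4))
print(np.round(theoretical, 4))
```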

It is also straightforward to show that the posterior predictive distribution of the data, \(f(\hat{n}|n)\), is a negative binomial distribution. Since the prior of \(\lambda\) is a Gamma distribution, the posterior of \(\lambda\) given \(n\) is also a Gamma distribution, and the likelihood of \(\hat{n}\) given a fixed but unknown \(\lambda\) is a Poisson distribution. We can therefore write the posterior predictive distribution of \(\hat{n}\) given \(n\) as \[ \begin{gather*} p(\hat{n}|n)=\int_0^{\infty}p(\hat{n}|\lambda)p(\lambda|n)d\lambda\\ =\int_0^{\infty}(Poisson\times Gamma) d\lambda \end{gather*} \] Thus \(\hat{n}|n\) is again a Gamma-Poisson mixture distribution.

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

The goal is to calculate the expected value of the logarithm of a Gamma-distributed random variable, \[ E[lnX] \] where \(X\sim Gamma(\alpha,\beta)\).

Part I

To deal with the problem, we need to first figure out how to calculate the density of the transformed variable, which is the density \(f_Y(y)\) when \(Y=g(X)\).

Suppose \(Y=g(X)\), where \(g\) is strictly increasing and differentiable, and \(X\) is a random variable on a probability space \((\Omega,\mathcal{F},\mathcal{P})\) with continuous density \(f\) and distribution function \(F\).

\(\Rightarrow Y=g(X)\) has the distribution function \(F_Y(y)=P(g(X)\leq y)=P(X\leq g^{-1}(y))=F(g^{-1}(y))\).

Taking derivatives via the chain rule, we have \[ \begin{gather*} f_Y(y)=\frac{dF(g^{-1}(y))}{dy}\\=\frac{dF(g^{-1}(y))}{dx}\frac{dx}{dy}\\ =\frac{dF(g^{-1}(y))}{dx}\frac{d[g^{-1}(y)]}{dy}\\ =f(g^{-1}(y))\frac{d[g^{-1}(y)]}{dy} \end{gather*} \] Recall that the derivative of an inverse function is \[ \begin{gather*} \frac{d}{dy}[g^{-1}(y)]\\ =\frac{dx}{d[g(g^{-1}(y))]}\\ =\frac{1}{g'(g^{-1}(y))} \end{gather*} \] The density function of \(g(X)\) is then expressed as \[ \begin{gather*} f_Y(y)=f(g^{-1}(y))\frac{1}{g'(g^{-1}(y))}\\ =\frac{f(g^{-1}(y))}{g'(g^{-1}(y))} \end{gather*} \]
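As a sanity check of this change-of-variables formula (my own example), take \(X\sim N(0,1)\) and \(g(x)=e^x\); the formula should reproduce the standard lognormal density.

```python
import numpy as np
from scipy import stats

# Check f_Y(y) = f_X(g^{-1}(y)) / g'(g^{-1}(y)) for Y = g(X) = exp(X), X ~ N(0, 1).
# The exact answer is the standard lognormal density.
y = np.linspace(0.1, 5, 50)
g_inv = np.log(y)              # g^{-1}(y)
g_prime = np.exp(g_inv)        # g'(x) = exp(x), evaluated at x = g^{-1}(y)

f_Y = stats.norm.pdf(g_inv) / g_prime
print(np.max(np.abs(f_Y - stats.lognorm.pdf(y, s=1.0))))   # should be ~ 0
```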

Part II

Now that we have figured out how to calculate the density of the transformed variable, we can proceed to obtain the density of \(Y=lnX\).

Since \(X\sim \Gamma(\alpha,\beta)\) and \(Y=g(X)=ln(X)\), we have \(X=e^y\). Because \(\beta\) is a rate parameter (equivalently, \(1/\beta\) is the scale), writing \(X=X'/\beta\) with \(X'\sim\Gamma(\alpha,1)\) shows that its only effect is to shift the logarithm by a constant, namely \(-ln\beta\). So we can start with the case \(\beta=1\).

That gives us the density of \(X\) (with \(\beta=1\)) as \[ f_X(x)=\frac{1}{\Gamma(\alpha)}x^{\alpha}e^{-x}\frac{1}{x} \] where \(\Gamma(\alpha)=\int_0^{\infty}x^{\alpha}e^{-x}\frac{dx}{x}\) is the normalizing constant.

Substituting \(x=e^y\) entails \(dx=e^y\,dy=x\,dy\), i.e. \(dy=\frac{dx}{x}\); thus we have \[ \begin{gather*} f_Y(y)=\frac{f(g^{-1}(y))}{g'(g^{-1}(y))}\\ =\frac{1}{\Gamma(\alpha)}e^{\alpha y}e^{-e^y}\frac{1}{e^y}\Big/\Big(\frac{1}{e^y}\Big)\\ =\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y} \end{gather*} \] The possible values of \(Y\) now range over all real numbers \(\mathbb{R}\).

Part III

Since \(f_Y\) must integrate to unity, \[ \begin{gather*} \int_{\mathbb{R}}f_Y(y)dy=\int_{\mathbb{R}}\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}e^{\alpha y-e^y}dy\\ =1 \end{gather*} \] we can obtain \[ \Gamma(\alpha)=\int_{\mathbb{R}}e^{\alpha y-e^y}dy \] To calculate \(E(Y)=\int_{\mathbb{R}}yf_Y(y)dy\), we would like to express the integrand \(yf_Y(y)\) as a derivative with respect to \(\alpha\), so that the integral can be evaluated through \(\Gamma(\alpha)\).

To do so, notice that \(e^{\alpha y-e^y}\) is a differentiable function of \(\alpha\), with \[ \frac{d}{d\alpha}e^{\alpha y-e^y}=ye^{\alpha y-e^y}=y\Gamma(\alpha)f_Y(y) \] since \(e^{\alpha y-e^y}=\Gamma(\alpha)f_Y(y)\) follows from \[ f_Y(y)=\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y} \] We can now calculate \(E(Y)\), interchanging differentiation and integration, as \[ \begin{gather*} E(Y)=\int_{\mathbb{R}}yf_Y(y)dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}\Gamma(\alpha)yf_Y(y)dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}\frac{d}{d\alpha}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\frac{d}{d\alpha}\int_{\mathbb{R}}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\frac{d}{d\alpha}\Gamma(\alpha)\\ =\frac{d}{d\alpha}ln\Gamma(\alpha)\\ =\Psi(\alpha) \end{gather*} \] \(\Psi(\cdot)\) is the digamma function, the logarithmic derivative of the gamma function.

By restoring the rate parameter \(\beta\), we obtain the general result \[ E[lnX]=\Psi(\alpha)-ln\beta \]
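A short Monte Carlo check of this result, with arbitrary illustrative values of \(\alpha\) and \(\beta\) (note that numpy's Gamma sampler is parameterized by shape and scale, so the scale is \(1/\beta\)):

```python
import numpy as np
from scipy.special import digamma

alpha, beta = 3.2, 2.0                 # illustrative shape and rate
rng = np.random.default_rng(4)

# numpy uses shape and scale; scale = 1 / rate
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=1_000_000)

print(np.log(x).mean())                # Monte Carlo estimate of E[ln X]
print(digamma(alpha) - np.log(beta))   # psi(alpha) - ln(beta)
```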

Reference

How to calculate the density function of \(g(X)\)?

What is the expected value of the logarithm of Gamma distribution?

The field of statistics essentially comprises two main branches: frequentist and Bayesian. The primary distinction between these two approaches, as articulated by Michael I. Jordan, lies in their respective viewpoints:

Frequentists typically proceed by moving forward from the state of nature to the data, while Bayesians operate in the opposite direction, moving from the data back to the state of nature. This fundamental contrast is exemplified in concepts such as the false discovery rate, which considers the probability of a hypothesis given that a discovery has been made, conditioning it on the data.

Contemplating these divergent perspectives gives rise to intriguing questions. Below are some inquiries I have formulated and explored:

  1. What are the distinctions between confidence interval and credible interval? What are the distinctions between coverage probability and credible probability?

    [George Casella, Statistical Inference, p. 435] When describing the interaction between a confidence interval and the parameter, we should carefully say that the interval covers the parameter, not that the parameter is inside the interval. This is done on purpose: it is imperative to stress that the random quantity in the classical frequentist paradigm is the interval, not the parameter. The parameter is fixed. When we say that a 90% confidence interval for a parameter \(\lambda\) is \(a\leq \lambda \leq b\), we are tempted to say (and many experimenters do) that “the probability is 90% that \(\lambda\) is in the interval \([a,b]\).” Within classical statistics, however, such a statement is invalid, since the parameter is assumed fixed. Formally, the interval \([a,b]\) is one of the possible realized values of the random interval \([f(\alpha/2),g(1-\alpha/2)]\). And since the parameter \(\lambda\) does not move, \(\lambda\) is in the realized interval \([a,b]\) with probability either \(0\) or \(1\). When we say that the realized interval \([a,b]\) has a 90% chance of coverage, we only mean that 90% of the sample points of the random interval cover the true parameter.

    In contrast, the Bayesian setup allows us to say that \(\lambda\) is inside \([a,b]\) with some probability, not \(0\) or \(1\). This is because, under the Bayesian model, \(\lambda\) is a random variable with a probability distribution. All Bayesian claims of coverage are made with respect to the posterior distribution of the parameter.

    It is important not to confuse credible probability (the Bayesian posterior probability) with coverage probability (the classical probability). These probabilities are very different entities, with different meanings and interpretations. Credible probability comes from the posterior distribution, which in turn gets its probability from the prior distribution. Thus, credible probabilities reflect the experimenter’s subjective beliefs, as expressed in the prior distribution and updated with the data to the posterior distribution. A Bayesian assertion of 90% coverage means that the experimenter, upon combining prior knowledge with data, is 90% sure of coverage.

  2. How to calculate the credible interval?

    In Bayesian statistics, the posterior distribution of a parameter can be derived and used for inference about that parameter. Since the parameter is considered a random variable in this framework, this distribution can be divided into numerous intervals, each covering 95% of the distribution, thus having a credible probability of 95%. Importantly, while there are multiple ways to construct these 95% credible intervals, they are not considered random in Bayesian analysis. Typically, we would choose the intervals that are symmetrically constructed around the distribution's central point. Once an interval is established, we interpret it as having a 95% probability of containing the true parameter value.
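    To make question 2 concrete, here is a minimal sketch (my own example) of an equal-tailed 95% credible interval for a proportion under a conjugate Beta(1, 1) prior; the counts are made up.

```python
from scipy import stats

# Beta(1, 1) prior on a proportion theta, updated with 12 successes in 40 trials
successes, trials = 12, 40
posterior = stats.beta(1 + successes, 1 + trials - successes)

# Equal-tailed 95% credible interval: cut 2.5% from each tail of the posterior
lower, upper = posterior.ppf([0.025, 0.975])
print(lower, upper)
```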

  3. In a frequentist hypothesis testing framework, is it possible to compare the probabilities of the null hypothesis (H0) and the alternative hypothesis (H1) to make a decision about accepting or rejecting the hypothesis?

    [George Casella, Statistical Inference, p. 379] The probabilities \(P(H_0\text{ is true})\) and \(P(H_1\text{ is true})\) are not meaningful to the classical statistician. The classical statistician considers \(\theta\) to be a fixed number, so a hypothesis is either true or false. If \(\theta\in \Theta_0\), then \(P(H_0\text{ is true})=1\) and \(P(H_1\text{ is true})=0\) for all values of \(\mathbf{x}\). If \(\theta\in\Theta_0^c\), these values are reversed. Since these probabilities are unknown (because \(\theta\) is unknown) and do not depend on the sample \(\mathbf{x}\), the classical statistician does not compute \(P(H_0\text{ is true})\) and \(P(H_1\text{ is true})\). In a Bayesian formulation of a hypothesis testing problem, these probabilities depend on the sample \(\mathbf{x}\) and can give useful information about the veracity of \(H_0\) and \(H_1\).
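    As a toy illustration of this Bayesian formulation (my own example, not from the book), the posterior probabilities of one-sided hypotheses about a proportion can be read directly off a conjugate posterior:

```python
from scipy import stats

# H0: theta <= 0.5 vs H1: theta > 0.5, Beta(1, 1) prior, 32 successes in 50 trials
posterior = stats.beta(1 + 32, 1 + 50 - 32)

p_H0 = posterior.cdf(0.5)   # P(H0 is true | x)
p_H1 = 1 - p_H0             # P(H1 is true | x)
print(p_H0, p_H1)
```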

  4. Is the Likelihood Ratio Test (LRT) specific to the Bayesian or frequentist statistical approach, or is it applicable to both methodologies?

    The likelihood ratio test is a method used within the frequentist statistical framework. It is a hypothesis test that compares two statistical models based on the ratio of their likelihoods. The test evaluates how well two models, typically one being a special case or a subset of the other (referred to as the null and alternative models), explain a set of observed data.

    The likelihood ratio test is not inherently Bayesian; it is primarily a frequentist approach. However, the concept of comparing likelihoods of different models or hypotheses is also fundamental in Bayesian statistics, except it's applied in a different context and with a different framework.

Unsolved questions

  1. What is the evaluation method for a Bayesian test?

  2. What is the Bayesian p-value?

    [What are Bayesian p-values?](https://stats.stackexchange.com/questions/171386/what-are-bayesian-p-values)

  3. Is power analysis necessary for a Bayesian test? If so, how do we calculate the power of Bayesian tests?

    [Is power analysis necessary in Bayesian Statistics?](https://stats.stackexchange.com/questions/65754/is-power-analysis-necessary-in-bayesian-statistics)

    [Power analysis from Bayesian point of view [duplicate]](https://stats.stackexchange.com/questions/110346/power-analysis-from-bayesian-point-of-view)

  4. Does a Bayesian test have a type I error rate when the parameter is assumed to come from a probability distribution?

    Analysis of type I and II error rates of Bayesian and frequentist parametric and nonparametric two-sample hypothesis tests under preliminary assessment of normality

  5. Why is the false discovery rate somewhat like Bayesian thinking, going back from the data to the state of nature? Why is FDR unlike other metrics, such as recall, precision, or accuracy?

    False discovery rate

    Original paper: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

    Bradley Efron empirical Bayes book

IOGSMT

| Issue (Problem) | Objectives (Goal) | Goals (Detailed Goal) | Strategies (How to) | Measurement | Tactics (Detailed How to) |
| --- | --- | --- | --- | --- | --- |
| Website lacks users | Increase number of users | Increase users by 10% in one month | Website SEO | Increase users by 5% | Screen out available SEO corporations |
| | | | Decrease subscription fee | Increase users by 5% | Calculate a subscription fee based on the cost |

Remark: IOGSMT is a blueprint to achieve our aspirations. While it focuses on the What and How, we can also incorporate the Why into it.

Framework for learning everything in a short period of time

Build structures

Mark timelines

Always ask what, how and why

Example

The framework of learning a new area:

  • Build structure
  • Mark timeline
  • Find results-oriented data

Several useful databases:

How to train a final machine learning model

This section relates mostly to the following several questions:

  • What is the purpose of k-fold cross validation?
  • How to finalize a model?

What is the purpose of k-fold cross validation?

Both train-test splits and k-fold cross validation are examples of resampling methods. The purpose of these methods is to compute a performance measure for the model on held-out test data. They are best viewed as methods for estimating the performance of a statistical procedure, rather than of a particular fitted model.

How to finalize a model?

Once we have obtained the estimated skill of the procedure, we are finished with the resampling method:

  • If we are using a train-test split, that means we can discard the split datasets and the trained model.
  • If we are using k-fold cross validation, that means we can throw away all of the trained models.

In other words, once we have selected the best-tuned model via the resampling method, we finalize the model by applying the chosen machine learning procedure to the whole dataset.
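For comparison with the caret workflow shown later, here is a rough scikit-learn sketch of the same idea: cross validation selects the hyperparameter, and the chosen procedure is then refit on the whole dataset. The dataset, pipeline, and grid values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 10-fold cross validation chooses k; refit=True then retrains the chosen
# procedure on the whole dataset to produce the final model.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(3, 25, 2))}
grid = GridSearchCV(pipe, param_grid, cv=10, refit=True)
grid.fit(X, y)

print(grid.best_params_)            # hyperparameter chosen by cross validation
final_model = grid.best_estimator_  # already refit on all of X, y
```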

This leads to another question:

  • Why not keep the model trained on the training dataset, or the best model from cross validation, as the final model?

In fact, we could use the sub-models as the final model, but it may take an indefinite amount of training time before we obtain a satisfactory one; while cross validation and other resampling approaches let us compare various procedures, they also prevent us from selecting an overfit model. Furthermore, overfitting has to do with model complexity rather than the amount of data. As such, the model will likely perform better when trained on all of the available data than on just the subset used to estimate its performance.

If one really wants to employ the models obtained from cross validation as final models, one way to do so is to keep all of them and take the mean of their predictions as an ensemble, in order to reduce the variance.

It is not advisable to use the k models from cross validation as a voting classifier if the true objective is to develop a voting classifier as the final model. In cross validation, each fold serves as the test dataset only once; hence the k models derived from cross validation cannot aggregate their results in a way that serves the same purpose as a purpose-built voting classifier.

An alternative when employing ensemble models is to use an ensemble-type estimate of performance or generalization error, such as the out-of-bag error. Using an un-aggregated cross-validation estimate for an ensemble model can result in a pessimistic bias that ranges from negligible to substantial, depending on the stability of the CV surrogate models and the number of aggregated surrogate models. Another option is to pick several models that appear to generalize well across various test sets; each would then be trained on the whole training dataset with its best-tuned hyperparameters to produce a final model, and these final models can be combined to form the ensemble.

Finalize the model in Caret

In caret, the train function will carry out a cross-validation experiment (if we ask it to) to determine the best-tuned hyperparameters, and then use those hyperparameters to train on the entire dataset to obtain the final model.

In this section, we will specify different parameters to see how cross validation works in caret functions.

Use the iris data to apply 10-fold cross validation to the kNN algorithm.

> library(caret)
> inTraining <- createDataPartition(iris$Petal.Width, p = .75, list = FALSE)
> training <- iris[ inTraining,]
> testing <- iris[-inTraining,]
> dim(training)
[1] 114 5
> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = TRUE)

The tuning grid length is 10, which means there are \(10\times 114\) cross-validation predictions.

> knnFit1 <- train(Species ~ ., data = training, 
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

# 10 groups of performance WRT 10 groups of tuned hyperparameters K
> knnFit1$results
k Accuracy Kappa AccuracySD KappaSD
1 5 0.9500000 0.9250000 0.08958064 0.13437096
2 7 0.9583333 0.9375000 0.07081972 0.10622957
3 9 0.9575758 0.9362500 0.07135347 0.10711656
4 11 0.9575758 0.9362500 0.07135347 0.10711656
5 13 0.9659091 0.9487500 0.05903369 0.08867270
6 15 0.9659091 0.9487500 0.05903369 0.08867270
7 17 0.9575758 0.9362500 0.05956599 0.08945242
8 19 0.9325758 0.8987500 0.08626242 0.12942850
9 21 0.9225758 0.8838246 0.08332400 0.12498478
10 23 0.9125758 0.8688993 0.06840424 0.10258360

# A peek of cross validation prediction
> head(knnFit1$pred)
pred obs setosa versicolor virginica rowIndex k Resample
1 setosa setosa 1 0 0 1 5 Fold01
2 setosa setosa 1 0 0 9 5 Fold01
3 setosa setosa 1 0 0 12 5 Fold01
4 setosa setosa 1 0 0 19 5 Fold01
5 versicolor versicolor 0 1 0 40 5 Fold01
6 versicolor versicolor 0 1 0 67 5 Fold01

# Each observation is predicted 10 times
> dim(knnFit1$pred)
[1] 1140 8

From 10-fold cross validation, the best-tuned hyperparameter is k = 15.

> knnFit1
k-Nearest Neighbors

114 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 103, 102, 102, 102, 102, 104, ...
Resampling results across tuning parameters:

k Accuracy Kappa
5 0.9500000 0.9250000
7 0.9583333 0.9375000
9 0.9575758 0.9362500
11 0.9575758 0.9362500
13 0.9659091 0.9487500
15 0.9659091 0.9487500
17 0.9575758 0.9362500
19 0.9325758 0.8987500
21 0.9225758 0.8838246
23 0.9125758 0.8688993

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 15.

The final model uses the best-tuned hyperparameter(s) and is trained on the entire training set; thus there are 114 fitted values.

> knnFit1$finalModel
15-nearest neighbor model
Training set outcome distribution:

setosa versicolor virginica
37 39 38
> head(knnFit1$finalModel$learn$X)
Sepal.Length Sepal.Width Petal.Length Petal.Width
X1 -0.9054682 0.95513347 -1.337733 -1.313374
X2 -1.1395348 -0.14029123 -1.337733 -1.313374
X4 -1.4906348 0.07879371 -1.281974 -1.313374
X5 -1.0225015 1.17421841 -1.337733 -1.313374
X6 -0.5543683 1.83147324 -1.170455 -1.053210
X7 -1.4906348 0.73604853 -1.337733 -1.183292

> dim(knnFit1$finalModel$learn$X)
[1] 114 4

Setting the parameter savePredictions = 'final' makes the model keep only the cross-validation predictions from the sub-models trained with the best-tuned hyperparameter(s).

> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = 'final')

> knnFit1 <- train(Species ~ ., data = training,
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

> head(knnFit1$pred)
k pred obs setosa versicolor virginica rowIndex Resample
1 15 setosa setosa 1.00000000 0.00000000 0.00000000 9 Fold01
2 15 versicolor versicolor 0.06666667 0.86666667 0.06666667 45 Fold02
3 15 versicolor versicolor 0.00000000 0.93333333 0.06666667 46 Fold02
4 15 setosa setosa 1.00000000 0.00000000 0.00000000 1 Fold01
5 15 virginica virginica 0.00000000 0.06666667 0.93333333 93 Fold04
6 15 virginica virginica 0.00000000 0.33333333 0.66666667 96 Fold04

> dim(knnFit1$pred)
[1] 114 8

How to use preprocessing with cross validation?

Data leakage

Data leakage is a significant issue associated with cross validation. It occurs when information from the held-out (test) data seeps into the training procedure. For instance, normalizing or standardizing the complete training dataset prior to learning would not give a proper test, because the scale of the held-out data would have affected the preprocessing applied to the training folds. One can use the Pipeline utility in Python's scikit-learn to solve this issue. The caret package in R is also capable of handling it: the author states that caret automatically applies preprocessing within each resampled version of the data to avoid leakage.
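For the Python side, here is a minimal sketch of the Pipeline approach (illustrative dataset and model): because the scaler sits inside the pipeline, it is re-fit on the training folds only within each cross-validation split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is fit on the training folds only inside each CV split,
# so the held-out fold never influences the preprocessing (no leakage).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```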

As elements of statistical learning chapter 7 Model Assessment and Selection states:

In cross validation, samples must be “left out” before any selection or filtering steps are applied. But there is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with the highest variance across all 50 samples before starting cross validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.

Subsampling

Usually, optimistic estimates of performance are more likely to arise if subsampling happens before cross validation. Some discussions can be found in the following URLs:

Complication

One complication is still related to preprocessing: should the subsampling occur before or after the preprocessing? For example, if you down-sample the data and use PCA for signal extraction, should the loadings be estimated from the entire training set? The estimate is potentially better since the entire training set is being used, but the subsample may happen to capture only a small portion of the PCA space. There isn’t an obvious answer.

Subsampling during resampling


Reference

  • A good brain diet

    • avocado
    • blueberries
    • broccoli
    • eggs
    • coconut/olive oil
    • green leafy dark vegetables
    • wild salmon
    • turmeric
  • Killing ANTs (Automatic Negative thoughts)

  • Exercise

  • Brain nutrients/Brain Vitamins

  • Positive peer group

  • Clean environment

  • Sleep

  • Brain protection (such as electromagnetic fields)

  • New learnings

  • Stress management

sklearn's chi2 is mainly intended for binary indicators or term counts, so one needs to be really careful when using it as a Pearson chi-square test. I will use several functions to compare these differences; the dataset I use is the Pima Indians diabetes dataset.

A peek at the dataset

from pandas import read_csv

# load data
filename = 'dataset/diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename)
dataframe.columns = names
dataframe.head(2)
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0

Use function chi2

  • Convert data frame to the array
array = dataframe.values
array
array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
[ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],
[ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],
...,
[ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],
[ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],
[ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])
  • Use chi2 to calculate chi-square values
import pandas as pd
from sklearn.feature_selection import chi2

# split the array into features X and target Y (the last column, 'class')
X = array[:, 0:8]
Y = array[:, 8]

# chi square values
scores, pvalues = chi2(X, Y)
scores
pd.Series(scores, index=names[0:8])
preg     111.519691
plas 1411.887041
pres 17.605373
skin 53.108040
test 2175.565273
mass 127.669343
pedi 5.392682
age 181.303689
dtype: float64

Accumulate feature values under each class (reproducing what chi2 does internally)

# %%
import pandas as pd
import numpy as np
from scipy.sparse import issparse
from sklearn.utils import check_array
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.extmath import safe_sparse_dot

label = 'class'
X, y = dataframe.drop(columns=label), dataframe[label]

def preprocess_X_y(X, y):
    X = check_array(X, accept_sparse="csr")
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features

    feature_count = X.sum(axis=0).reshape(1, -1)
    class_prob = Y.mean(axis=0).reshape(1, -1)
    expected = np.dot(class_prob.T, feature_count)
    return observed, expected

f_obs, f_exp = preprocess_X_y(X, y)

# %%
from scipy.stats import chisquare

pd.Series(chisquare(f_obs, f_exp).statistic, index=X.columns)
preg     111.519691
plas 1411.887041
pres 17.605373
skin 53.108040
test 2175.565273
mass 127.669343
pedi 5.392682
age 181.303689
dtype: float64

Use function chi2_contingency

from scipy.stats import chi2_contingency

chi_square = pd.Series(float(0), index=names[0:8])
columns = dataframe['class'].unique()

for n in names[0:8]:
    # Create contingency table
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'],
                                margins=True, margins_name="Total")
    score, pvalue, dof, expected = chi2_contingency(data_crosstab)
    chi_square[n] = score
chi_square
preg     64.594809
plas 269.733242
pres 54.933964
skin 73.562894
test 227.769830
mass 286.470253
pedi 533.024096
age 140.937520
dtype: float64

Use the regular formula

In this section I generate contingency tables and calculate the chi-square scores directly from the formula.

  • Contingency table between pregnancy (preg) and class
data_crosstab = pd.crosstab(dataframe['preg'], dataframe['class'],
margins = True, margins_name = "Total")
data_crosstab
| preg \ class | 0 | 1 | Total |
| --- | --- | --- | --- |
| 0 | 73 | 38 | 111 |
| 1 | 106 | 29 | 135 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Total | 500 | 268 | 768 |
  • Use chi-square formula to calculate chi-square scores for each variable
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# set up float(0) so that the calculation is more accurate
chi_square = pd.Series(float(0), index=names[0:8])
columns = dataframe['class'].unique()

for n in names[0:8]:
    # Create contingency table
    data_crosstab = pd.crosstab(dataframe[n], dataframe['class'],
                                margins=True, margins_name="Total")
    rows = dataframe[n].unique()
    for i in columns:
        for j in rows:
            O = data_crosstab[i][j]
            E = data_crosstab[i]['Total'] * data_crosstab['Total'][j] / data_crosstab['Total']['Total']
            chi_square[n] += (O - E) ** 2 / E
chi_square
preg     64.594809
plas 269.733242
pres 54.933964
skin 73.562894
test 227.769830
mass 286.470253
pedi 533.024096
age 140.937520
dtype: float64

Conclusion

For continuous variables, it is better not to use chi2, as the assumed chi-square distribution for the resulting statistic may not hold.

One could also refer to the help document of chi2, which indicates that this function is usually applied to binary indicators or term counts.

Reference