Cross validation related

How to train a final machine learning model

This section mostly addresses the following questions:

  • What is the purpose of k-fold cross validation?
  • How to finalize a model?

What is the purpose of k-fold cross validation?

Both train-test splits and k-fold cross validation are examples of resampling methods. The purpose of these methods is to estimate a performance measure for the model on unseen (test) data. They are best viewed as methods to estimate the performance of a statistical procedure, rather than of a particular statistical model.
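
As a minimal sketch of that viewpoint (my own illustration on the built-in iris data, not from the original text), the same cross validation folds can be used to estimate and compare the skill of two procedures rather than of any single fitted model:

library(caret)

set.seed(1)
# Use the same folds for both procedures so their skill estimates are comparable
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

knn_fit  <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)
cart_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Fold-by-fold skill estimates of each procedure (not of one final model)
summary(resamples(list(knn = knn_fit, cart = cart_fit)))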

How to finalize a model?

Once we have obtained the estimated skill, we are finished with the resampling method:

  • If we are using a train-test split, that means we can discard the split datasets and the trained model.
  • If we are using k-fold cross validation, that means we can throw away all of the trained models.

In other words, once we have selected the best tuned model using a resampling method, we can finalize the model by applying the chosen machine learning procedure to the whole dataset, as sketched below.
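
A minimal sketch of this workflow on the iris data, assuming a simple train/test split (the object names here are my own):

library(caret)

set.seed(1)
idx      <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[ idx, ]
testing  <- iris[-idx, ]

# 1. Estimate the skill of the procedure on the held-out data
est_fit <- train(Species ~ ., data = training, method = "knn")
confusionMatrix(predict(est_fit, testing), testing$Species)

# 2. Finalize: discard the split and apply the same procedure to the whole dataset
final_fit <- train(Species ~ ., data = iris, method = "knn")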

This leads to another question:

  • Why not withhold the model trained on the training dataset or the best model from the cross validation?

In fact, we can keep a sub-model as the final model, which may be worthwhile when training takes a very long time. However, cross validation and other resampling approaches are meant to compare procedures and keep us from selecting an overfit model, not to produce the final model itself. Furthermore, overfitting has to do with model complexity, not with the amount of data. As such, the model will likely perform better when trained on all of the available data than on just the subset used to estimate its performance.

If one really wants to employ the models obtained from cross validation as the final model, there is still a way to do so: create multiple final models and take the mean of their ensemble of predictions in order to reduce the variance.
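
A rough sketch of that idea (my own illustration, not a built-in caret feature): keep the k surrogate models from cross validation and average their predicted class probabilities.

library(caret)

set.seed(1)
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

# One surrogate model per fold, each fitted with a fixed k = 15
cv_models <- lapply(folds, function(train_idx)
  train(Species ~ ., data = iris[train_idx, ], method = "knn",
        trControl = trainControl(method = "none"),
        tuneGrid = data.frame(k = 15)))

# Ensemble prediction: mean of the k probability estimates
new_obs <- iris[1:5, 1:4]
probs   <- lapply(cv_models, predict, newdata = new_obs, type = "prob")
Reduce(`+`, probs) / length(probs)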

If the true objective is to develop a voting classifier as the final model, however, it is not advisable to simply reuse the k models from cross validation as that classifier. In cross validation each fold serves as the test set only once, so every observation is predicted by a single surrogate model; the k models never aggregate their results during cross validation, and the CV estimate therefore does not serve the same purpose as a voting classifier.

An alternative for ensemble models is to use an ensemble-type estimate of performance or generalization error, such as the out-of-bag estimate. Using an un-aggregated cross validation estimate for an ensemble model can result in a pessimistic bias that ranges from negligible to substantial, depending on the stability of the CV surrogate models and the number of aggregated surrogate models. Another option is to pick several models that appear to generalize well across various test sets, train each of them on the whole training dataset with its best-tuned hyperparameters to produce a final model, and then combine these final models into the ensemble.
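
A sketch of that last option, reusing the training/testing split from the earlier sketch; the choice of learners (kNN, CART, LDA) is purely illustrative:

library(caret)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)
fits <- list(
  knn  = train(Species ~ ., data = training, method = "knn",   trControl = ctrl),
  cart = train(Species ~ ., data = training, method = "rpart", trControl = ctrl),
  lda  = train(Species ~ ., data = training, method = "lda",   trControl = ctrl))

# Each final model is tuned and refit on the full training set by train();
# the ensemble then averages their class probabilities
probs    <- lapply(fits, predict, newdata = testing, type = "prob")
ens_prob <- Reduce(`+`, probs) / length(probs)
ens_pred <- factor(colnames(ens_prob)[apply(ens_prob, 1, which.max)],
                   levels = levels(testing$Species))
confusionMatrix(ens_pred, testing$Species)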

Finalize the model in caret

In caret, the train function will carry out a cross validation experiment (if we choose to) to determine the best tuned hyperparameters, and then use those hyperparameters to train on the entire dataset to obtain the final model.

In this section, we will specify different parameters to see how cross validation works in caret functions.

Use the iris data to apply 10-fold cross validation to the kNN algorithm.

> library(caret)
> inTraining <- createDataPartition(iris$Petal.Width, p = .75, list = FALSE)
> training <- iris[ inTraining,]
> testing <- iris[-inTraining,]
> dim(training)
[1] 114 5
> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = TRUE)

The tuning grid length is 10, which means there are \(10 \times 114 = 1140\) cross validation predictions.

> knnFit1 <- train(Species ~ ., data = training, 
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

# 10 groups of performance WRT 10 groups of tuned hyperparameters K
> knnFit1$results
k Accuracy Kappa AccuracySD KappaSD
1 5 0.9500000 0.9250000 0.08958064 0.13437096
2 7 0.9583333 0.9375000 0.07081972 0.10622957
3 9 0.9575758 0.9362500 0.07135347 0.10711656
4 11 0.9575758 0.9362500 0.07135347 0.10711656
5 13 0.9659091 0.9487500 0.05903369 0.08867270
6 15 0.9659091 0.9487500 0.05903369 0.08867270
7 17 0.9575758 0.9362500 0.05956599 0.08945242
8 19 0.9325758 0.8987500 0.08626242 0.12942850
9 21 0.9225758 0.8838246 0.08332400 0.12498478
10 23 0.9125758 0.8688993 0.06840424 0.10258360

# A peek of cross validation prediction
> head(knnFit1$pred)
pred obs setosa versicolor virginica rowIndex k Resample
1 setosa setosa 1 0 0 1 5 Fold01
2 setosa setosa 1 0 0 9 5 Fold01
3 setosa setosa 1 0 0 12 5 Fold01
4 setosa setosa 1 0 0 19 5 Fold01
5 versicolor versicolor 0 1 0 40 5 Fold01
6 versicolor versicolor 0 1 0 67 5 Fold01

# Each observation is predicted 10 times
> dim(knnFit1$pred)
[1] 1140 8

From the 10-fold cross validation, the best tuned hyperparameter is k = 15.

> knnFit1
k-Nearest Neighbors

114 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 103, 102, 102, 102, 102, 104, ...
Resampling results across tuning parameters:

k Accuracy Kappa
5 0.9500000 0.9250000
7 0.9583333 0.9375000
9 0.9575758 0.9362500
11 0.9575758 0.9362500
13 0.9659091 0.9487500
15 0.9659091 0.9487500
17 0.9575758 0.9362500
19 0.9325758 0.8987500
21 0.9225758 0.8838246
23 0.9125758 0.8688993

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 15.

The final model uses the best tuned hyperparameter(s) and is trained on the entire training dataset; thus there are 114 fitted values.

> knnFit1$finalModel
15-nearest neighbor model
Training set outcome distribution:

setosa versicolor virginica
37 39 38
> head(knnFit1$finalModel$learn$X)
Sepal.Length Sepal.Width Petal.Length Petal.Width
X1 -0.9054682 0.95513347 -1.337733 -1.313374
X2 -1.1395348 -0.14029123 -1.337733 -1.313374
X4 -1.4906348 0.07879371 -1.281974 -1.313374
X5 -1.0225015 1.17421841 -1.337733 -1.313374
X6 -0.5543683 1.83147324 -1.170455 -1.053210
X7 -1.4906348 0.73604853 -1.337733 -1.183292

> dim(knnFit1$finalModel$learn$X)
[1] 114 4

Setting the parameter savePredictions = 'final' makes the model keep only the cross validation predictions from the sub-models trained with the best tuned hyperparameter(s).

> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = 'final')

> knnFit1 <- train(Species ~ ., data = training,
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

> head(knnFit1$pred)
k pred obs setosa versicolor virginica rowIndex Resample
1 15 setosa setosa 1.00000000 0.00000000 0.00000000 9 Fold01
2 15 versicolor versicolor 0.06666667 0.86666667 0.06666667 45 Fold02
3 15 versicolor versicolor 0.00000000 0.93333333 0.06666667 46 Fold02
4 15 setosa setosa 1.00000000 0.00000000 0.00000000 1 Fold01
5 15 virginica virginica 0.00000000 0.06666667 0.93333333 93 Fold04
6 15 virginica virginica 0.00000000 0.33333333 0.66666667 96 Fold04

> dim(knnFit1$pred)
[1] 114 8

How to use preprocessing with cross validation?

Data leakage

The data leakage problem is a significant issue associated with cross validation. It occurs when information from the held-out data seeps into the training process. For instance, normalizing or standardizing the complete training dataset before resampling would not give a proper test, because the scale of the held-out data would have affected the data each sub-model is trained on. In Python one might use scikit-learn's Pipeline to solve this issue. The caret package in R can also handle it: its author states that caret applies the preprocessing only to each resampled version of the data, which avoids data leakage.
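
A small sketch of the difference on the iris data (my own example): the first fit leaks the scale of the held-out folds into training, while the second lets caret estimate the preprocessing inside each resample.

library(caret)

ctrl <- trainControl(method = "cv", number = 10)

# Leaky: centering/scaling statistics are estimated on ALL rows, including
# the rows that will later be held out in each fold
iris_scaled        <- iris
iris_scaled[, 1:4] <- scale(iris_scaled[, 1:4])
leaky_fit <- train(Species ~ ., data = iris_scaled, method = "knn", trControl = ctrl)

# Leak-free: pass preProcess to train(); caret re-estimates the centering
# and scaling parameters within each resample
clean_fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl,
                   preProcess = c("center", "scale"))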

As The Elements of Statistical Learning, Chapter 7 (Model Assessment and Selection), states:

In cross validation, samples must be “left out” before any selection or filtering steps are applied. But there is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.
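
A small illustration of that qualification (my own example, not from the book): an unsupervised variance filter may use all rows before cross validation, because it never looks at the class labels.

library(caret)

X <- iris[, 1:4]
y <- iris$Species

# Unsupervised screening on ALL samples: keep the two highest-variance predictors
keep <- names(sort(sapply(X, var), decreasing = TRUE))[1:2]

# Cross validation afterwards still gives a fair estimate of the procedure's skill
var_fit <- train(x = X[, keep], y = y, method = "knn",
                 trControl = trainControl(method = "cv", number = 10))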

Subsampling

Usually, optimistic estimates of performance are more likely to arise if subsampling happens before cross validation; this point has been discussed in several places online.

Complication

One complication is still related to preprocessing: should the subsampling occur before or after the preprocessing? For example, if you down-sample the data and use PCA for signal extraction, should the loadings be estimated from the entire training set? The estimate is potentially better since the entire training set is being used, but the subsample may happen to capture only a small portion of the PCA space. There isn’t any obvious answer.
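
One pragmatic option, sketched below on an artificially imbalanced two-class subset of iris (my own construction), is to let caret perform both the down-sampling and the PCA within each resample, via the sampling argument of trainControl and the preProcess argument of train:

library(caret)

set.seed(2)
# Build an imbalanced two-class outcome: 50 versicolor vs. 10 virginica
two_class         <- iris[iris$Species != "setosa", ]
two_class$Species <- factor(two_class$Species)
two_class         <- two_class[c(1:50, 51:60), ]

# Down-sampling and PCA are both handled inside each resample
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")
fit  <- train(Species ~ ., data = two_class, method = "knn",
              trControl = ctrl, preProcess = "pca")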

Subsampling during resampling

