Cross validation related
How to train a final machine learning model
This section mostly relates to the following questions:
- What is the purpose of k-fold cross validation?
- How to finalize a model?
What is the purpose of k-fold cross validation?
Both train-test splits and k-fold cross validation are examples of resampling methods. The purpose of these methods is to compute a performance measure for the model on held-out data. They are best viewed as methods to estimate the performance of a statistical procedure, rather than of a particular fitted model.
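As a minimal sketch of this idea (base R plus the class package; the fold assignment and the choice of 3-nearest-neighbours are assumptions for illustration), 10-fold cross validation below estimates the skill of the procedure rather than of any single fitted model:

```r
library(class)   # knn(); any learning procedure would do here

set.seed(1)
k_folds <- 10
fold_id <- sample(rep(1:k_folds, length.out = nrow(iris)))

# Estimate the skill of the procedure "3-NN on the iris measurements"
fold_acc <- sapply(1:k_folds, function(i) {
  train_x <- iris[fold_id != i, 1:4]
  train_y <- iris$Species[fold_id != i]
  test_x  <- iris[fold_id == i, 1:4]
  test_y  <- iris$Species[fold_id == i]
  pred <- knn(train_x, test_x, cl = train_y, k = 3)
  mean(pred == test_y)                 # accuracy on the held-out fold
})

mean(fold_acc)   # the CV estimate describes the procedure, not one fold's model
```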
How to finalize a model?
Once we have obtained the estimated skill, we are finished with the resampling method:
- If we are using a train-test split, we can discard the split datasets and the trained model.
- If we are using k-fold cross validation, we can throw away all of the trained models.
In other words, once we have selected the best tuned model using a resampling method, we finalize it by applying the chosen machine learning procedure to the whole dataset.
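Continuing the sketch above (the 3-NN procedure and the stand-in "new observation" are illustrative assumptions), finalizing simply means re-applying the same procedure with the same settings to every available row and discarding the fold models:

```r
library(class)

# Finalize: the chosen procedure (3-NN) is applied to the whole dataset.
# For a lazy learner such as knn this just means keeping all rows; for most
# other algorithms it would be one last call to the fitting function.
final_x <- iris[, 1:4]
final_y <- iris$Species

new_obs <- iris[1, 1:4]   # stand-in for a genuinely unseen observation
knn(final_x, new_obs, cl = final_y, k = 3)
```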
This leads to another question:
- Why not keep the model trained on the training dataset, or the best model from the cross validation, as the final model?
In fact, we can use one of these sub-models as the final model, and doing so saves effort when training takes a very long time. However, cross validation and other resampling approaches are meant to compare procedures and to keep us from selecting an overfit model. Furthermore, overfitting has to do with model complexity, not with the amount of data. As such, the model will likely perform better when trained on all of the available data than on just the subset used to estimate its performance.
If one really wants to employ models obtained from cross validation as final models, there is still a way to do so: create multiple final models and average an ensemble of their predictions in order to reduce the variance.
It is not advisable, however, to use the k models from cross validation as a voting classifier when the true objective is to develop a voting classifier as the final model. In cross validation, each fold serves as the test dataset only once; hence, the k models derived from cross validation cannot aggregate their results in a way that serves the same purpose as a voting classifier.
An alternative for employing ensemble models would be to use an ensemble-type estimate of performance or generalization error, such as the out-of-bag error. Using an un-aggregated cross validation estimate for an ensemble model can result in a pessimistic bias that can range from negligible to substantial, depending on the stability of the CV surrogate models and the number of aggregated surrogate models. Another option is to pick several models that appear to generalize well across various test sets. Each model would then be trained on the whole training dataset with its best tuned hyperparameters to produce a final model. Finally, these final models can be combined to form the ensemble model.
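As a rough sketch of that last approach (the choice of methods, the seed, and the training/testing data frames from the caret walkthrough below are assumptions, and method = "rf" additionally needs the randomForest package), each procedure is tuned by resampling, refit on the full training set by train, and the resulting final models are averaged:

```r
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)

# Each call tunes via 10-fold CV and then refits on all of `training`
fit_knn <- train(Species ~ ., data = training, method = "knn",
                 trControl = ctrl, tuneLength = 10)
fit_rf  <- train(Species ~ ., data = training, method = "rf",
                 trControl = ctrl, tuneLength = 5)

# Combine the final models by averaging their predicted class probabilities
p_knn <- predict(fit_knn, newdata = testing, type = "prob")
p_rf  <- predict(fit_rf,  newdata = testing, type = "prob")
p_avg <- (p_knn + p_rf) / 2
ensemble_pred <- colnames(p_avg)[max.col(as.matrix(p_avg))]
```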
Finalize the model in Caret
In caret, the train function will carry out a cross validation experiment (if we choose to) to determine the best tuned hyperparameters, and then use them to train on the entire dataset in order to obtain the final model.
In this section, we will specify different parameters to see how cross validation works in caret functions.
Use the iris data to apply 10-fold cross validation to the knn algorithm:
```r
> library(caret)
```
The tuning parameter grid has length 10, which means there are \(10 \times 114\) held-out predictions in total (each of the 114 training samples is predicted once for each of the 10 candidate values of k).
```r
> knnFit1 <- train(Species ~ ., data = training,
```
From the 10-fold cross validation, the best tuned hyperparameter is k = 15.
```r
> knnFit1
```
The final model uses the best tuned hyperparameter(s) and is trained on the entire training dataset; thus there will be 114 fitted values.
```r
> knnFit1$finalModel
```
```r
> fitControl <- trainControl(## 10-fold CV
```
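Pieced together, the truncated snippets above correspond roughly to the sketch below; the seed, the createDataPartition split at p = 0.75, and the exact train arguments are assumptions rather than the original code.

```r
library(caret)

set.seed(100)
inTrain  <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[inTrain, ]    # the walkthrough above reports 114 training rows
testing  <- iris[-inTrain, ]

fitControl <- trainControl(## 10-fold CV
                           method = "cv",
                           number = 10)

knnFit1 <- train(Species ~ ., data = training,
                 method = "knn",
                 trControl = fitControl,
                 tuneLength = 10)      # 10 candidate values of k

knnFit1               # resampled accuracy for each candidate k
knnFit1$finalModel    # the best k, refit on the entire training set
```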
How to use preprocessing with cross validation?
Data leakage
The data leakage problem is a significant issue associated with cross validation. It occurs when information from the held-out (test) data seeps into model training. For instance, applying normalization or standardization to the complete training dataset prior to learning would not give a proper test, because the scale of the data in the held-out set would have influenced the preprocessing of the training data. In Python, one might use the Pipeline class in scikit-learn to solve this issue. The caret package in R is also capable of handling it: its author states that caret automatically applies preprocessing only to the resampled versions of the data to avoid data leakage.
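For example (a sketch reusing the training data frame from the walkthrough above), passing preProcess to train lets caret estimate the centering and scaling parameters within each resample instead of once on the full training set:

```r
library(caret)

fitControl <- trainControl(method = "cv", number = 10)

# Centering/scaling are recomputed inside each resample, so the held-out
# fold never influences the preprocessing applied to the analysis data.
knnFitPP <- train(Species ~ ., data = training,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  trControl = fitControl,
                  tuneLength = 10)
```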
As The Elements of Statistical Learning, Chapter 7 (Model Assessment and Selection), states:
In cross validation, samples must be “left out” before any selection or filtering steps are applied. But there is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.
Subsampling
Optimistic estimates of performance are likely to result when subsampling happens before cross validation rather than within it. Some discussions can be found at the following URLs; a caret sketch of subsampling within resampling follows the links:
- subsampling during resampling
- Avoiding leakage in cross-validation when using SMOTE
- Why is the true (test) error rate of any classifier 50%?
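In caret this ordering can be requested through trainControl (a sketch, assuming a hypothetical imbalanced data frame imbal_train with a two-class factor outcome Class), so that the down-sampling is performed within each resample rather than once up front:

```r
library(caret)

# Down-sampling is applied inside each resample, so the held-out folds
# keep the original class distribution.
ctrl <- trainControl(method = "cv", number = 10, sampling = "down")

imbalFit <- train(Class ~ ., data = imbal_train,
                  method = "glm",       # logistic regression as a simple baseline
                  trControl = ctrl)
```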
Complication
One complication is still related to preprocessing: should the subsampling occur before or after the preprocessing? For example, if you down-sample the data and use PCA for signal extraction, should the loadings be estimated from the entire training set? The estimate is potentially better, since the entire training set is being used, but the subsample may happen to capture only a small portion of the PCA space. There isn’t an obvious answer.
Reference
How to choose a predictive model after k-fold cross-validation?
Assessing and improving the stability of chemometric models in small sample size situations
How to obtain optimal hyperparameters after nested cross validation?
Training, saving and distributing the model – what about the data transformations?