type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published
Resampling methods involve repeatedly drawing samples from a training dataset and refitting a statistical model on each of the samples in order to obtain additional information about the fitted model.
For example, to estimate the variability of a linear regression model, we can repeatedly draw different samples from the training data, fit a linear regression model to each new sample, and then examine the extent to which the fits differ.
There are two common resampling methods:
  1. Cross-validation
  2. Bootstrap
Cross-validation can be used to estimate the test error of a specific statistical learning method, or to select the appropriate level of flexibility of a method.
The bootstrap can be used to quantify the accuracy of parameter estimates, or of a statistical learning method.

There are two kinds of CV: k-fold and leave-one-out cross-validation (LOOCV).
As mentioned in FIT3152, here are the reasons we use CV.

Problem:

One can build a model with 100% accuracy on the training dataset, but it may still fail to generalise to unseen data. Such a model is not a good model, because it is overfitting the training set.

Solution:

  • Cross-validation is one method of protecting against overfitting in a predictive model, particularly when the amount of data is limited.
  • In cross-validation, you make a fixed number of folds (partitions) of the data, run the analysis on each fold, and then average the overall error estimate.
  • In typical cross-validation, the training and validation sets cross over in successive rounds, so that each data point has a chance of being validated against.
English explanation
K-fold cross-validation

What

  • k-fold cross-validation splits the dataset into k different (disjoint) subsets of approximately the same size, and uses, in turn, one of the subsets for estimating the generalisation error and the others for training the system (in your case a k-NN classifier).
  • If you are performing 2-fold cross-validation, you divide your dataset into two halves: first you use the first half for training and the second for testing, then you use the second half for training and the first for testing.

Why

  • Less computationally expensive than LOOCV.
  • Protects against overfitting.
  • Helps lower variance on small datasets.
  • The test error is then estimated by averaging the resulting MSE estimates (see the formula below).
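In symbols (standard ISLR notation, where MSE_i is the mean squared error computed on the i-th held-out fold), the k-fold CV estimate of the test error is:

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$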
A visualisation is shown below:

[figure: k-fold cross-validation splits]

How

Let's say k = 5.
Training on k − 1 partitions means that 4 folds (80% of the data) form the training set; the remaining fold (20% of the data) is used as the test set.
Stratification ensures that the classes are represented proportionally across the partitions.

Some remarks:

  1. The splitting is done without replacement, so each observation appears in the test set exactly once; in every other fold it is part of the training set.
    1. For example, suppose observation A (the 15th row) is assigned to a given fold. That row is only ever predicted once (when its fold is the test set) and is part of the training set in all other iterations.
  2. Usually K = 5 or 10; for small datasets we can use higher values of k. These choices give a good compromise in the bias-variance tradeoff.
  3. When k = 5, 20% of the data is held back as the test set each time; when k = 10, 10% is held back each time, and so on.
LOOCV means k = the number of observations in the dataset (see the code sketch below).
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
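To make the mechanics concrete, here is a minimal base-R sketch of 5-fold CV. The simulated data frame, the lm() model, and the names df, x, y, fold_id are illustrative assumptions (not the spam example used later), and this simple version does not stratify.

# Minimal 5-fold cross-validation in base R (illustrative sketch)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(df)))  # assign each row to one fold, without replacement

mse <- numeric(k)
for (i in 1:k) {
  train <- df[fold_id != i, ]          # k - 1 folds (~80% of the data) for training
  test  <- df[fold_id == i, ]          # the held-out fold (~20%) for testing
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$y - pred)^2)    # MSE on the held-out fold
}
mean(mse)  # the k-fold CV estimate of the test MSE
# Setting k <- nrow(df) turns this into LOOCV.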
 
Chinese explanation of cross-validation (translated below)
https://zhuanlan.zhihu.com/p/67986077
Another compromise approach is called K-fold cross-validation. Unlike LOOCV, each test set now contains more than one observation; the exact number depends on the choice of K. For example, if K = 5, the steps of five-fold cross-validation are:
1. Split the whole dataset into 5 parts.
2. Without repetition, take one part as the test set each time, train the model on the other four parts, and compute that model's MSE_i on the test set.
3. Average the 5 values of MSE_i to obtain the final MSE.
Comparison with validation set approach (Here)
The good thing about the validation set approach is that it is conceptually easy to grasp and easy to implement: you simply partition the existing training data into two sets. This can be useful in industry when explaining to stakeholders how models were tested. The validation set approach also has a computational advantage.
However, there are two drawbacks:
  • The validation estimate of the test error can be highly variable, depending on which observations are placed in the training set and which in the validation set.
  • Only a subset of the observations is used to fit the model, so the validation set error tends to overestimate the test error for a model fit on the entire dataset.
Comparison with LOOCV approach
Good things about K-fold:
1) k-fold CV is the same as LOOCV when k = n. However, in a situation where k = 10 and n = 10,000, k-fold CV fits 10 models whereas LOOCV fits 10,000. That is, LOOCV is the most computationally intensive method, since the model must be fit n times.
2) Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate, since LOOCV has higher variance but lower bias.
More detail
The LOOCV approach is a special case of k-fold cross-validation in which k = n. This approach has two drawbacks compared to k-fold cross-validation.
First, it requires fitting the potentially computationally expensive model n times, compared to k-fold cross-validation, which requires the model to be fitted only k times.
Second, LOOCV gives approximately unbiased estimates of the test error, since each training set contains n − 1 observations; however, this approach has higher variance than k-fold cross-validation (since we are averaging the outputs of n fitted models trained on almost identical sets of observations, these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than that of less correlated ones).
So there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically, using k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Further discussion is here.
 
Steps to conduct 10-fold cross-validation on the model fitting and estimate the test classification error
Code
library(tidymodels)  # loads rsample, parsnip, yardstick, purrr, tibble, broom, ...
# Assumes the spam data and logistic_mod (a parsnip logistic regression spec) are defined earlier.

set.seed(2021)
spam_folds <- vfold_cv(data = spam, strata = spam)  #<<

compute_fold_error <- function(split) {
  train <- analysis(split)     # training portion of the fold
  test  <- assessment(split)   # held-out portion of the fold
  fit <- logistic_mod %>%
    fit(spam ~ `day of week` + `time of day` + domain, data = train)
  test_pred <- augment(fit, test)
  error <- tibble(error = 1 - metrics(test_pred,
                                      truth = spam,
                                      estimate = .pred_class)$.estimate[1]) %>%
    rsample::add_resample_id(split = split)
  return(error)
}

kfold_results <- map_df(spam_folds$splits, ~ compute_fold_error(.x))

kfold_results %>% summarise(mean(error))
#   mean(error)
# 1   0.1471407
Source: Tutorial 4, ETC3250
Steps
Make a function that carries out the following steps in sequence:
  1. Use fit() to fit the chosen model on the training portion of the fold.
  2. Use augment() to generate predictions for the held-out (test) portion of the fold.
  3. Calculate the error as 1 minus the classification accuracy.
Then use a for loop or map_df() to repeat that function over each fold.
Q: What are the issues with CV?
  • Since each training set is smaller than the original dataset, the estimate of the prediction error tends to be more biased.
  • Although LOOCV (K = n) minimises this bias, the resulting estimate has high variance.
  • So k = 5 or 10 is a good compromise in the bias-variance tradeoff.

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
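As a minimal illustration (not from the post), here is a base-R sketch that bootstraps the standard error of a sample median; the simulated data x, the number of resamples B, and the choice of statistic are assumptions for demonstration:

# Minimal bootstrap sketch in base R (illustrative assumptions: simulated data, median as the statistic)
set.seed(1)
x <- rnorm(100)   # a hypothetical sample
B <- 1000         # number of bootstrap resamples
boot_medians <- replicate(B, {
  resample <- sample(x, size = length(x), replace = TRUE)  # draw n observations with replacement
  median(resample)
})
sd(boot_medians)  # bootstrap estimate of the standard error of the median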
Clustering
Model assessment (Classification)