type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published
Resampling methods involve repeatedly drawing samples from a training dataset and refitting a statistical model on each of the samples in order to obtain additional information about the fitted model.
For example, to estimate the variability of a linear regression model, we can repeatedly draw different samples from the training data, fit a linear regression model to each new sample, and then examine the extent to which the fits differ.
There are two common resampling methods:
  1. Cross-validation
  2. Bootstrap
Cross-validation can be used to estimate the test error of a specific statistical learning method, or to select the appropriate level of flexibility of a method.
The bootstrap can be used to quantify the accuracy of parameter estimates, or of a statistical learning method.

There are two kinds of CV: k-fold and leave-one-out cross-validation (LOOCV).
As mentioned in FIT3152, here are the reasons we use CV.

Problem:

One can build a model with 100% accuracy on the training dataset, but it may still fail to generalise to unseen data. Such a model is not a good model, because it is overfitting the training set.

Solution:

  • Cross-validation is one method of protecting against overfitting in a predictive model, particularly when the amount of data is limited.
  • In cross-validation, you make a fixed number of folds (partitions) of the data, run the analysis on each fold, and then average the overall error estimate.
  • In typical cross-validation, the training and validation sets cross over in successive rounds, so that each data point has a chance of being validated against.
English explanation
K-fold cross-validation

What

  • k-fold cross-validation splits the dataset into k different (disjoint) subsets of approximately the same size, and uses, in turn, one of the subsets for estimating the generalisation error and the others for training the system (in your case a k-NN classifier).
  • If you are performing 2-fold cross-validation, you divide your dataset into two halves: first you use the first half for training and the second for testing, then you use the second half for training and the first for testing.

Why

  • Less computationally expensive than LOOCV.
  • Protects against overfitting.
  • Helps lower variance on small datasets.
  • The test error is then estimated by averaging the resulting MSE estimates (see the formula below).
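In symbols (standard ISLR notation, where MSE_i is the mean squared error computed on the i-th held-out fold), the k-fold CV estimate of the test error is:

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$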
A visualisation is shown below:

[figure: k-fold cross-validation splits]

How

Let's say k = 5.
Training on k − 1 partitions means that 4 folds (80% of the data) form the training set; the remaining fold (20% of the data) is used as the test set.
Stratification ensures that the classes are represented proportionally across the partitions.

Some remarks:

  1. The splitting is done without replacement, so each observation appears in the test set exactly once; in every other fold it is part of the training set.
    1. For example, suppose observation A (the 15th row) is assigned to a given fold. That row is only ever predicted once (when its fold is the test set) and is part of the training set in all other iterations.
  2. Usually K = 5 or 10; for small datasets we can use higher values of k. These choices give a good compromise in the bias-variance tradeoff.
  3. When k = 5, 20% of the data is held back as the test set each time; when k = 10, 10% is held back each time, and so on.
LOOCV means k = the number of observations in the dataset (see the code sketch below).
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
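To make the mechanics concrete, here is a minimal base-R sketch of 5-fold CV. The simulated data frame, the lm() model, and the names df, x, y, fold_id are illustrative assumptions (not the spam example used later), and this simple version does not stratify.

# Minimal 5-fold cross-validation in base R (illustrative sketch)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(df)))  # assign each row to one fold, without replacement

mse <- numeric(k)
for (i in 1:k) {
  train <- df[fold_id != i, ]          # k - 1 folds (~80% of the data) for training
  test  <- df[fold_id == i, ]          # the held-out fold (~20%) for testing
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$y - pred)^2)    # MSE on the held-out fold
}
mean(mse)  # the k-fold CV estimate of the test MSE
# Setting k <- nrow(df) turns this into LOOCV.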
 
Chinese explanation of cross-validation (translated below)
https://zhuanlan.zhihu.com/p/67986077
Another compromise approach is called K-fold cross-validation. Unlike LOOCV, each test set now contains more than one observation; the exact number depends on the choice of K. For example, if K = 5, the steps of five-fold cross-validation are:
1. Split the whole dataset into 5 parts.
2. Without repetition, take one part as the test set each time, train the model on the other four parts, and compute that model's MSE_i on the test set.
3. Average the 5 values of MSE_i to obtain the final MSE.
Comparison with validation set approach (Here)
The good thing about the validation set approach is that it is conceptually easy to grasp and easy to implement: you simply partition the existing training data into two sets. This can be useful in industry when explaining to stakeholders how models were tested. The validation set approach also has a computational advantage.
However, there are two drawbacks:
  • The validation estimate of the test error can be highly variable, depending on which observations are placed in the training set and which in the validation set.
  • Only a subset of the observations is used to fit the model, so the validation set error tends to overestimate the test error for a model fit on the entire dataset.
Comparison with LOOCV approach
Good things about K-fold:
1) k-fold CV is the same as LOOCV when k = n. However, in a situation where k = 10 and n = 10,000, k-fold CV fits 10 models whereas LOOCV fits 10,000. That is, LOOCV is the most computationally intensive method, since the model must be fit n times.
2) Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate, since LOOCV has higher variance but lower bias.
More detail
The LOOCV approach is a special case of k-fold cross-validation in which k = n. This approach has two drawbacks compared to k-fold cross-validation.
First, it requires fitting the potentially computationally expensive model n times, compared to k-fold cross-validation, which requires the model to be fitted only k times.
Second, LOOCV gives approximately unbiased estimates of the test error, since each training set contains n − 1 observations; however, this approach has higher variance than k-fold cross-validation (since we are averaging the outputs of n fitted models trained on almost identical sets of observations, these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than that of less correlated ones).
So there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically, using k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Further discussion is here.
 
Steps to conduct 10-fold cross-validation on the model fitting and estimate the test classification error
Code
library(tidymodels)  # loads rsample, parsnip, yardstick, purrr, tibble, broom, ...
# Assumes the spam data and logistic_mod (a parsnip logistic regression spec) are defined earlier.

set.seed(2021)
spam_folds <- vfold_cv(data = spam, strata = spam)  #<<

compute_fold_error <- function(split) {
  train <- analysis(split)     # training portion of the fold
  test  <- assessment(split)   # held-out portion of the fold
  fit <- logistic_mod %>%
    fit(spam ~ `day of week` + `time of day` + domain, data = train)
  test_pred <- augment(fit, test)
  error <- tibble(error = 1 - metrics(test_pred,
                                      truth = spam,
                                      estimate = .pred_class)$.estimate[1]) %>%
    rsample::add_resample_id(split = split)
  return(error)
}

kfold_results <- map_df(spam_folds$splits, ~ compute_fold_error(.x))

kfold_results %>% summarise(mean(error))
#   mean(error)
# 1   0.1471407
Source: Tutorial 4, ETC3250
Steps
Make a function that carries out the following steps in sequence:
  1. Use fit() to fit the chosen model on the training portion of the fold.
  2. Use augment() to generate predictions for the held-out (test) portion of the fold.
  3. Calculate the error as 1 minus the classification accuracy.
Then use a for loop or map_df() to repeat that function over each fold.
Q: What are the issues with CV?
  • Since each training set is smaller than the original dataset, the estimate of the prediction error tends to be more biased.
  • Although LOOCV (K = n) minimises this bias, the resulting estimate has high variance.
  • So k = 5 or 10 is a good compromise in the bias-variance tradeoff.

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
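As a minimal illustration (not from the post), here is a base-R sketch that bootstraps the standard error of a sample median; the simulated data x, the number of resamples B, and the choice of statistic are assumptions for demonstration:

# Minimal bootstrap sketch in base R (illustrative assumptions: simulated data, median as the statistic)
set.seed(1)
x <- rnorm(100)   # a hypothetical sample
B <- 1000         # number of bootstrap resamples
boot_medians <- replicate(B, {
  resample <- sample(x, size = length(x), replace = TRUE)  # draw n observations with replacement
  median(resample)
})
sd(boot_medians)  # bootstrap estimate of the standard error of the median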
Clustering
Model assessment (Classification)