type
Post
Created date
Jun 16, 2022 01:43 PM
category
Data Science
tags
Machine Learning
status
Published
Before knowing what Bootstrap is, we need to know what RESAMPLING means.
RESAMPLING (800)

What

  • Procedures that make economical use of the available data to estimate a population parameter.

Why

Why use it?
  • To resample from the original data (either directly or via a fitted model) to create replicate datasets
    • The result can be both
      • a more accurate estimate of the parameter (such as taking the mean of the estimates) and
      • a quantification of the uncertainty of the estimate (e.g. adding a confidence interval).
  • Very easy to use, requiring little mathematical knowledge
Why might we not use it?
  • Computationally expensive, requiring tens, hundreds, or even thousands of resamples to develop a robust estimate of the population parameter.

How


BOOTSTRAP

What

  • A kind of RESAMPLING that approximates the sampling distribution of an estimator, i.e. a technique for estimating the confidence interval. (ITP p560)
  • Does not improve your point estimates. (ITP p559)
  • Allows us to estimate the variance without re-collecting more data (ITP p559)

Why

When do we use Bootstrap?
  • When a confidence interval cannot be derived from a simple formula. For the mean (a point estimate), the standard CI formula is easy to apply, but for another estimator $\hat{\theta}$, e.g. one that estimates the MEDIAN, it is difficult.
    • For simple problems such as the sample average, analyzing $\mathrm{Var}(\hat{\theta})$ is not difficult. However, if $\hat{\theta}$ is a more complicated statistic, e.g. the median, analyzing $\mathrm{Var}(\hat{\theta})$ may not be as straightforward. So, if you want to estimate the confidence interval, you need to use Bootstrap. (ITP p559)
  • In a nutshell, we use bootstrapping when the estimator does not have a simple expression for the variance.
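
To make the contrast concrete, here is a minimal sketch (not taken from ITP). It assumes a hypothetical skewed sample drawn from an exponential distribution and a helper named `bootstrap_variance`, both invented for illustration. For the sample mean, the textbook formula $s^2/n$ and the bootstrap should roughly agree; for the median there is no comparably simple formula, so only the bootstrap number is computed.

```python
import numpy as np

def bootstrap_variance(data, statistic, n_resamples=2000, seed=0):
    """Estimate Var(statistic) by resampling `data` with replacement.

    Hypothetical helper for illustration; n_resamples is an arbitrary choice.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_resamples)
    for k in range(n_resamples):
        resample = rng.choice(data, size=n, replace=True)
        estimates[k] = statistic(resample)
    return estimates.var()

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)  # assumed example data

# Sample mean: a simple expression exists, Var(mean) is about s^2 / n.
analytic_var_mean = sample.var(ddof=1) / len(sample)
boot_var_mean = bootstrap_variance(sample, np.mean)

# Sample median: no equally simple expression, so we rely on the bootstrap.
boot_var_median = bootstrap_variance(sample, np.median)

print(f"mean:   analytic={analytic_var_mean:.4f}  bootstrap={boot_var_mean:.4f}")
print(f"median: bootstrap={boot_var_median:.4f}")
```

The two numbers for the mean will not match exactly, since the bootstrap value carries Monte Carlo noise from the finite number of resamples.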

Why is it difficult to provide a confidence interval for estimators such as the median?
notion image

How

There are two examples, both following the same logic. First, we introduce the brute force approach, which does not work in the real world. That limitation is the reason we use Bootstrap.

Example 1 (ITP p559)

1.1. Brute force approach

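Imagine we know the population distribution $F_X$ (which rarely happens in the real world, but let's suppose).
Some notation:
  • $\mathcal{X} = \{X_1, \ldots, X_N\}$ : a dataset of $N$ samples drawn from $F_X$
  • $\hat{\theta} = g(\mathcal{X})$ : the estimator computed from a dataset, e.g. the median
  • $K$ : the number of replicate datasets
By having the population distribution, we can generate as many datasets $\mathcal{X}^{(k)}$ as we want.

Step 1. We draw $K$ replicate datasets $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$ from $F_X$.
 
notion image

Since we are interested in constructing the confidence interval for $\hat{\theta}$, we need to analyze the mean and variance of $\hat{\theta}$. The true mean and the estimated mean of $\hat{\theta}$ are
$$\mathbb{E}[\hat{\theta}] \quad \text{and} \quad M(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}, \qquad \text{where } \hat{\theta}^{(k)} = g(\mathcal{X}^{(k)}).$$
Step 2. Compute $M(\hat{\theta})$ and $V(\hat{\theta})$ based on the samples, where
$$V(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{\theta}^{(k)} - M(\hat{\theta}) \right)^2.$$

Problem :
We only have one dataset $\mathcal{X}$. We do not have access to $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$, and we do not have access to $F_X$. Therefore, we are not able to approximate the variance using the above brute force simulation.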
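
The brute force procedure can be sketched as below, under the (unrealistic) assumption that the population is known. The exponential population, the sample size `N`, the number of replicates `K`, and the choice of the median as the estimator are all assumptions made for illustration, not values from ITP.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5000  # assumed sample size and number of replicate datasets

# Step 1: K independent datasets of size N, drawn from the known population.
datasets = rng.exponential(scale=2.0, size=(K, N))

# One estimate per dataset: here the estimator is the median of each row.
theta_hats = np.median(datasets, axis=1)

# Step 2: estimated mean and variance of the estimator across replicates.
M = theta_hats.mean()
V = ((theta_hats - M) ** 2).mean()

print(f"M = {M:.4f}, V = {V:.4f}")
```

This only works because `rng.exponential` plays the role of the population; with real data there is no such generator to call, which is exactly the problem stated above.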

1.2 Solution : Bootstrap

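Assume: We do NOT have access to the population distribution $F_X$, but we have one dataset $\mathcal{X} = \{X_1, \ldots, X_N\}$.
Step 1: Generate $K$ datasets $\mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)}$ from the empirical distribution $\hat{F}_X$, by sampling with replacement from $\mathcal{X}$.
Step 2 : Compute $M_{\text{boot}}(\hat{\theta})$ and $V_{\text{boot}}(\hat{\theta})$ based on the samples $\hat{\theta}^{(k)} = g(\mathcal{Y}^{(k)})$, where $g$ is the estimator applied to a dataset:
$$M_{\text{boot}}(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}, \qquad V_{\text{boot}}(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{\theta}^{(k)} - M_{\text{boot}}(\hat{\theta}) \right)^2.$$
So, at the end, the output $V_{\text{boot}}(\hat{\theta})$ is the final answer: the bootstrap estimate of the variance of $\hat{\theta}$.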
notion image
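
The two bootstrap steps can be sketched as follows. The single observed dataset is simulated here as a stand-in, and the names `x`, `K`, `M_boot` and `V_boot`, as well as the median as the estimator, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 5000  # assumed sample size and number of bootstrap datasets

# The one dataset we actually observed (simulated here as a stand-in).
x = rng.exponential(scale=2.0, size=N)

# Step 1: K bootstrap datasets. Each row holds N indices sampled with
# replacement, so every bootstrap value comes from x itself.
idx = rng.integers(0, N, size=(K, N))
boot_datasets = x[idx]

# Step 2: apply the estimator to each bootstrap dataset, then summarize.
theta_boot = np.median(boot_datasets, axis=1)
M_boot = theta_boot.mean()
V_boot = ((theta_boot - M_boot) ** 2).mean()

print(f"M_boot = {M_boot:.4f}, V_boot = {V_boot:.4f}")
```

Note that no new data is drawn from the population after `x` is observed; only `x` is reused.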

 

Example X and Y return

Can be found p.218 s
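
Assuming this heading refers to the portfolio example in the bootstrap section of ITS, the quantity of interest is the fraction $\alpha$ invested in X that minimizes the variance of $\alpha X + (1-\alpha) Y$, namely $\alpha = (\sigma_Y^2 - \sigma_{XY}) / (\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY})$. Below is a rough sketch of bootstrapping its standard error. The simulated returns, the covariance values, `n` and `B` are assumptions for illustration, and `alpha_hat` is a hypothetical helper name.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the minimum-variance allocation to X."""
    cov = np.cov(x, y)  # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy)

rng = np.random.default_rng(0)
n, B = 100, 1000  # assumed number of observations and bootstrap resamples

# Simulated returns standing in for real data (assumed parameters).
assumed_cov = [[1.0, 0.5], [0.5, 1.25]]
returns = rng.multivariate_normal([0.0, 0.0], assumed_cov, size=n)
x, y = returns[:, 0], returns[:, 1]

alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # resample (x_i, y_i) pairs together
    alphas[b] = alpha_hat(x[idx], y[idx])

estimate = alpha_hat(x, y)
std_error = alphas.std(ddof=1)
print(f"alpha_hat = {estimate:.3f}, bootstrap SE = {std_error:.3f}")
```

The pairs are resampled jointly (same index for X and Y) so that the dependence between the two returns is preserved in every bootstrap dataset.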
 

How to bootstrap a Confidence Interval?
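
One common way to do this is the percentile method: take the empirical quantiles of the bootstrap estimates as the interval endpoints. The sketch below is one possible approach and not necessarily the one from the lecture. The function name `percentile_ci`, the simulated sample, and the default settings are assumptions for illustration.

```python
import numpy as np

def percentile_ci(data, statistic, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for `statistic`.

    Hypothetical helper: resamples `data` with replacement, applies
    `statistic` to each resample, and returns the matching quantiles.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    idx = rng.integers(0, n, size=(n_resamples, n))
    estimates = np.apply_along_axis(statistic, 1, data[idx])
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(estimates, [tail, 1.0 - tail])
    return lo, hi

rng = np.random.default_rng(7)
sample = rng.normal(loc=10.0, scale=3.0, size=150)  # assumed example data

lo, hi = percentile_ci(sample, np.median)
print(f"95% percentile CI for the median: [{lo:.3f}, {hi:.3f}]")
```

Other constructions exist (for example, intervals built from the bootstrap standard error), so treat this as a starting point.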

From the lecture:
 
When does non-parametric Bootstrap fail?
  • When the statistic is an extreme value of the sample, e.g. using the max
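
A small simulation hints at why the max is a problem: a resample can never exceed the observed maximum, and it reproduces that exact value with probability $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ as $n$ grows. The uniform population, `n` and `K` below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 10000  # assumed sample size and number of bootstrap resamples

x = rng.uniform(0.0, 1.0, size=n)  # assumed data; the true upper bound is 1

# Bootstrap distribution of the sample maximum.
idx = rng.integers(0, n, size=(K, n))
boot_max = x[idx].max(axis=1)

# Share of resamples whose max equals the observed max exactly.
frac_equal = np.mean(boot_max == x.max())
expected = 1.0 - (1.0 - 1.0 / n) ** n

print(f"observed share = {frac_equal:.3f}, theory = {expected:.3f}")
```

So the bootstrap distribution of the max is a lumpy discrete distribution piled up at the observed maximum, which does not resemble the true sampling distribution of the max.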

Reference

(800)
(ITS) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer Texts in Statistics. Springer. (Here)
(ITP) Chan, S. (2021). Introduction to Probability for Data Science. Michigan Publishing Services. (Here) The best source to get the gist.
Extra resource from MIT class 24 (Here)

By bootstrapping (resampling from the sample), we can pretend that:
  • sample mean = pop. mean
  • sample variance = pop. variance
A similar idea can be found here
ELI5 examples
Example 1 : proportion of selecting 2 Financial assets
https://zhuanlan.zhihu.com/p/24851814
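Example 2 : Population of males and females
Suppose you want to find the ratio of men to women in your residential community, but finding out whether every single resident is male or female would be a lot of trouble. So you take a stool, sit at the community gate, and spend fifteen minutes counting, with 200 slips of paper prepared. Whenever a man walks by, you take a slip and write "M" on it; whenever a woman walks by, you write "S". When you get home, you spread the 200 slips on the coffee table, randomly pick 100 of them, and count how many are M and how many are S. You surely feel this cannot represent the whole community, right? So you put those slips back among the 200, randomly draw 100 again, and count once more. ... Repeat this 10 times or more, and the result will roughly represent the male-to-female ratio of your whole community.
Still think it is not accurate? There is no way around it: precisely because the exact sample cannot be known, Bootstrap is used to simulate it.

Author: EdisonChen
Link: https://www.zhihu.com/question/22929263/answer/23098749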