type: Post

Created date: Jun 16, 2022 01:43 PM

category: Data Science

tags: Machine Learning

status: Published

Before looking at what Bootstrap is, we need to know what RESAMPLING means.

## RESAMPLING (800)

### What

- Procedures that describe how to use the available data economically to estimate a population parameter.

### Why

**Why use this?**

- It resamples from the original data, either directly or via a fitted model, to create replicate datasets.
- The result can be both:
  - a more accurate estimate of the parameter (such as taking the mean of the estimates), and
  - a quantification of the uncertainty of the estimate (e.g. adding a confidence interval).
- It is very easy to use and requires little mathematical knowledge.

**Why might we not use this?**

- It is computationally very expensive, requiring tens, hundreds, or even thousands of resamples in order to develop a robust estimate of the population parameter.

### How
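
A minimal sketch of the idea in Python with NumPy, assuming we resample directly from the data with replacement. The sample values, the seed, and the helper name `resample_estimates` are illustrative assumptions, not taken from the sources cited in this post.

```python
import numpy as np

def resample_estimates(data, statistic, n_resamples=1000, seed=0):
    """Draw replicate datasets from `data` with replacement and
    return the statistic computed on each replicate."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Each replicate has the same size as the original data.
        replicate = rng.choice(data, size=data.size, replace=True)
        estimates[i] = statistic(replicate)
    return estimates

# Hypothetical sample; in practice this is the one dataset you collected.
data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6])
estimates = resample_estimates(data, np.mean)

point_estimate = estimates.mean()    # average of the replicate estimates
uncertainty = estimates.std(ddof=1)  # spread of the replicate estimates
print(point_estimate, uncertainty)
```

The loop is also where the cost comes from: the statistic is recomputed once per replicate, so the run time grows with `n_resamples`.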

## What

- A kind of RESAMPLING used to approximate the sampling distribution of an estimator, i.e. a technique to estimate the confidence interval (ITP p560).

- Does not improve your point estimates (ITP p559).

- Allows us to estimate the variance without collecting more data (ITP p559).

## Why

**When do we use Bootstrap?**

- It is easy to construct a confidence interval (CI) when the point estimate is the mean, but difficult for other estimators $\hat{\theta}$, for example when you are estimating the median.
- For simple problems such as the sample average, analyzing $\mathrm{Var}(\hat{\theta})$ is not difficult. However, if $\hat{\theta}$ is a more complicated statistic, e.g. the median, analyzing $\mathrm{Var}(\hat{\theta})$ may not be as straightforward.

*So, if you want to estimate the confidence interval in such cases, you need to use* **Bootstrap** (ITP p559).

- In a nutshell, we use bootstrapping when the estimator does not have a simple expression for the variance
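
To make the contrast concrete, here is a standard textbook comparison. It assumes $N$ i.i.d. samples with variance $\sigma^2$ and a density $f$ that is positive at the true median $m$; these symbols are introduced here only for illustration. The variance of the sample average $\bar{X}$ has a closed form, while the sample median $\hat{m}$ only has a large-sample approximation:

$$
\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{N}, \qquad \mathrm{Var}(\hat{m}) \approx \frac{1}{4 N f(m)^2}.
$$

The second expression depends on the unknown density $f$ evaluated at the unknown median $m$, so it cannot be plugged in from the data as easily as $\sigma^2$, which can be replaced by the sample variance.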

*Why is it difficult to provide a confidence interval for estimators such as the median?*

## How

There are 2 examples, both of which follow the same logic. First, we introduce the brute force approach, which does not work in the real world. That limitation is the reason we use Bootstrap.

### Example 1 (ITP p559)

#### 1.1. Brute force approach

Imagine we have the population distribution (which rarely happens in the real world, but let's suppose).

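Some notation:

- $F_X$: the population distribution.
- $\mathcal{X}^{(1)}, \dots, \mathcal{X}^{(K)}$: $K$ independent datasets, each drawn from $F_X$.
- $\hat{\theta}^{(k)}$: the estimator $\hat{\theta}$ computed on the $k$-th dataset $\mathcal{X}^{(k)}$.

By having the population distribution, we can generate as many datasets $\mathcal{X}^{(k)}$ as we want.

Since we are interested in constructing the confidence interval for $\hat{\theta}$, we need to analyze the mean and variance of $\hat{\theta}$. The true mean and the estimated mean of $\hat{\theta}$ are

$$
M(\hat{\theta}) = \mathbb{E}[\hat{\theta}], \qquad \widehat{M}(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}.
$$

The variance is estimated analogously, by the sample variance of $\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(K)}$.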

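Problem:

We only have one dataset $\mathcal{X}$. We do not have access to the replicate datasets $\mathcal{X}^{(1)}, \dots, \mathcal{X}^{(K)}$, and we do not have access to the population distribution $F_X$. Therefore, we are not able to approximate the variance using the above brute force simulation.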

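#### 1.2. Solution: **Bootstrap**

Assume: we do NOT have access to the population distribution $F_X$, but we have one dataset $\mathcal{X} = \{X_1, \dots, X_N\}$.

We draw $N$ points from $\mathcal{X}$ with replacement to form a bootstrap dataset, compute $\hat{\theta}$ on it, and repeat this many times.

So, at the end, we have the bootstrap estimate of $\mathrm{Var}(\hat{\theta})$ as the final answer.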
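
The bootstrap procedure for this example can be sketched in Python with NumPy. This is a minimal sketch under assumptions: the statistic is the median, the dataset is simulated only so that the block runs on its own, and the names `bootstrap_variance` and `n_boot` and the seeds are hypothetical choices, not taken from ITP.

```python
import numpy as np

def bootstrap_variance(data, statistic, n_boot=2000, seed=0):
    """Estimate Var(statistic) from ONE dataset by resampling it with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.size
    # Row b holds the indices of the b-th bootstrap dataset (n draws with replacement).
    idx = rng.integers(0, n, size=(n_boot, n))
    replicates = statistic(data[idx], axis=1)  # one estimate per bootstrap dataset
    return replicates.var(ddof=1), replicates

# Stand-in for the single observed dataset (simulated for illustration only).
rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=2.0, size=200)

var_hat, replicates = bootstrap_variance(x, np.median)
print(f"sample median = {np.median(x):.3f}")
print(f"bootstrap estimate of Var(median) = {var_hat:.4f}")
```

Note that only `x` is used inside `bootstrap_variance`: the population distribution never appears, which is the difference from the brute force approach.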

### Example X and Y return

Can be found p.218 s

### How to bootstrap a confidence interval?

From the lecture:

When does the non-parametric Bootstrap fail?

- When the statistic is the sample maximum (max).
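
One common way to bootstrap a confidence interval is the percentile method, which takes quantiles of the bootstrap replicates. The sketch below applies it to the median and then to the maximum, to illustrate the failure case listed above. It is an illustration under assumptions, not the method from the lecture: the helper name `percentile_ci`, the simulated data, and the choice of Uniform(0, theta) for the max example are all hypothetical.

```python
import numpy as np

def percentile_ci(data, statistic, level=0.95, n_boot=5000, seed=0):
    """Percentile bootstrap interval: quantiles of the bootstrap replicates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    replicates = statistic(data[idx], axis=1)
    alpha = 1.0 - level
    lo, hi = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return lo, hi, replicates

rng = np.random.default_rng(1)

# Case 1: the median of a skewed sample, where the percentile interval is usable.
x = rng.exponential(scale=1.0, size=100)
lo, hi, _ = percentile_ci(x, np.median)
print(f"95% percentile CI for the median: ({lo:.3f}, {hi:.3f})")

# Case 2: the maximum of Uniform(0, theta) data, used to estimate theta.
theta = 1.0
u = rng.uniform(0.0, theta, size=100)
lo_max, hi_max, reps = percentile_ci(u, np.max)
# A resample can never exceed the sample maximum, which itself is below theta,
# so the interval cannot reach theta.
frac_at_max = np.mean(reps == u.max())
print(f"95% percentile CI for the max: ({lo_max:.3f}, {hi_max:.3f}), theta = {theta}")
print(f"fraction of replicates equal to the sample max: {frac_at_max:.3f}")
```

In the second case a large share of the replicates are exactly the sample maximum, so the bootstrap distribution is a lumpy set of a few order statistics rather than a good approximation of the sampling distribution.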

## Reference

(800)

(ITS) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer Texts in Statistics. Springer.

(ITP) Chan, S. (2021). Introduction to Probability for Data Science. Michigan Publishing Services. *The best source to get the gist.*

Extra resource from MIT class 24.

By bootstrapping (resampling from the sample), we can pretend that:

`sample mean = pop. mean`

`sample variance = pop. variance`

A similar idea can be found here

## Eli5 example

## Example 2: Population of males and females

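Suppose you want to find the ratio of men to women in your residential compound, but finding out the gender of every single resident is a hassle, right? So you take a stool, sit at the gate of the compound, and spend fifteen minutes counting, with 200 slips of paper ready. Every time a man walks by, you write "M" on a slip; every time a woman walks by, you write "S". Back home, you put the 200 slips on the coffee table, randomly pick 100 of them, and count how many are M and how many are S. You surely feel that this cannot represent the whole compound, right? So you put them back with the 200 slips, randomly draw 100 again, and count once more. Repeat this 10 times or more, and the result roughly represents the male-to-female ratio of your whole compound.

Still think it is inaccurate? That cannot be helped. It is precisely because we cannot know the exact population that we use Bootstrap to simulate it.

Author: EdisonChen

Link: https://www.zhihu.com/question/22929263/answer/23098749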
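
The story can be simulated in a few lines of Python. This is a sketch under assumptions: the true share of men (0.55), the seed, and the 1,000 repetitions are made up for illustration. Unlike the story, each redraw here takes 200 slips with replacement, which is how the bootstrap is usually defined.

```python
import numpy as np

rng = np.random.default_rng(7)

# The 200 slips written at the gate: True = "M", False = "S" (labels as in the story).
true_share_m = 0.55  # assumed for the simulation; unknown in real life
slips = rng.random(200) < true_share_m

# Redraw 200 slips WITH replacement many times and record the share of "M" each time.
n_repeats = 1000
shares = np.array([
    rng.choice(slips, size=slips.size, replace=True).mean()
    for _ in range(n_repeats)
])

print(f"share of M on the original slips: {slips.mean():.3f}")
print(f"average share over the redraws:   {shares.mean():.3f}")
print(f"spread (std) over the redraws:    {shares.std(ddof=1):.3f}")
```

The redraws are centered on the share observed on the slips, not on the true share. The bootstrap describes how much the estimate would vary; it does not make the point estimate itself better.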

**Author:** Jason Siu

**URL:** https://jason-siu.com/article%2Fb0ab4cee-61dc-4384-a234-f6fcbce89940

**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
