type: Post

Created date: Nov 15, 2021 02:40 AM

category: Data Science

tags: Machine Learning

status: Published

## Table of Contents

- Week 1: Sampling
  - What is CDF?
  - What is the probability density function (PDF)?
  - What is the difference between PDF and CDF?
- Week 2: Hypothesis testing
  - What is a hypothesis?
  - How to conduct hypothesis testing in R
  - Wilcoxon test (non-parametric)
  - Shapiro test
- Week 3: Permutation and resampling
  - Confidence interval
  - Estimator
  - Trimmed mean and mean
  - Bias
  - The bootstrap principle
  - The parametric bootstrap
  - The non-parametric bootstrap
- Week 4: Likelihoods and parameter estimation
- Week 5: Bootstraps, uncertainty, and parameter estimates
- Week 6: Regression for explaining data
- Week 7: Regression for causal estimation - The Directed Acyclic Graph (DAG)
  - What & Why DAG
  - How DAG
  - ToP 1: FORK (common cause)
  - ToP 2: PIPE (chain of logic)
  - ToP 3: Collider (very common)
  - Regression and DAG
  - Add or not add?
- Week 8: Regression for prediction
  - Interpretation of MSE
- Week 9: Classification
  - Confusion matrix
  - Precision
  - Recall (aka Sensitivity)
  - Logistic regression
  - Math of logistic regression: log-odds
  - What if there are a lot of variables? Penalty
- Week 10: Using Bayes theorem to infer parameters
- Week 11: Bayesian regression

## TOC by topic

**How randomness affects estimates**

**Regression**

**Bayesian reasoning**

## The three parts of this unit

**First**, we looked at how randomness affects estimates drawn from data:

- In particular, we examined sampling distributions using direct simulations and the bootstrap

- We looked at how we can make confidence statements about things estimated from the data

**Second**, we looked at:

- Fitting a straight line through data

- Assessing assumptions of linear regression

- When you can (and can't) make causal statements

- Regression for prediction

- Classification (including logistic regression and k-nearest neighbours)

**Third**, Bayesian reasoning:

- A formal system for updating information in the light of new data

- A powerful technique for both modelling data and understanding data

## Week 1 : Sampling

In Week 1, we learnt about how the sample size of an experiment can affect the variability of objects like the CDF.

#### What is Random Variable ?

- You record/quantify the result of a random process.

- The values of your event, like 1 to 6 when rolling a die. (1006 lecture)

Computers cannot produce a truly random sequence of numbers, so instead they produce *pseudo-random number sequences*, which behave like random numbers.

- This is not a bad thing: because the numbers aren't truly random, we can make our code reproduce the same results at a later time.

- That means we can use `set.seed()`.
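A quick sketch of why this matters (the seed value below is arbitrary, chosen just for illustration):

```r
# Without resetting the seed, two runs of rnorm() would differ;
# resetting the seed replays exactly the same "random" sequence.
set.seed(2420)            # arbitrary seed
first_draw <- rnorm(5)

set.seed(2420)            # same seed again
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE: the results are reproducible
```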

#### What is CDF?

- AKA "probability distribution function"

- Used where the possibilities are **discrete**

- A way to check the type of distribution

For example, we have a dataset called `sample`:

```r
library(tidyverse)

tibble(sample = data) %>%
  ggplot(aes(x = sample)) +
  stat_ecdf() +
  stat_function(fun = pnorm, colour = "blue",
                linetype = "dashed")
```

We are checking whether the dataset looks like a normal distribution (ND).

CDF and eCDF (empirical cumulative distribution function)

The difference between them:

- The CDF is a theoretical construct, whereas the eCDF is built from an actual data set.
- That means the sample size behind a CDF is infinite, while an eCDF's is limited.
- The CDF is continuous, whereas the eCDF is a discrete step function.

*So the higher the sample size, the closer the eCDF gets to the CDF (a smooth line).*

#### What is the probability density function (PDF)?

- Used where the possibilities are **continuous.**

- Notation: the PDF is denoted f(x), and the CDF is denoted F(x), where F(x) = P(X ≤ x).

#### What is the difference between PDF and CDF?

*(The outputs are different):*

- A CDF is a function whose output is a probability. The PDF is a function whose output is a non-negative number. The PDF itself is not a probability (unlike the CDF), but it can be used to calculate probabilities.

*(The types of possibility are different):*

- A probability distribution function (CDF) is used where the possibilities are discrete: coin flips, dice rolls, favourite colours, first names, etc. The sum of the probabilities is one.
- A probability density function is used where the possibilities are continuous, not discrete. Here it is the integral that is one, not the sum. This means the magnitude of the instantaneous probability density depends on the units used to express the measure, because the integration includes "dx" in it and measuring x in different units changes the "dx".
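To make the PDF/CDF relationship concrete, a small check with the standard normal (in R, `dnorm()` is the PDF and `pnorm()` the CDF):

```r
# The CDF is the integral of the PDF: integrating dnorm from -Inf to q
# reproduces pnorm(q).
q <- 1.5
cdf_from_pdf <- integrate(dnorm, lower = -Inf, upper = q)$value
cdf_direct   <- pnorm(q)

all.equal(cdf_from_pdf, cdf_direct)  # TRUE, up to numerical tolerance
```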

## Functions learnt

- `rep()`
- CDF
- eCDF: `stat_ecdf()`

## Week 2 : Hypothesis testing

### What is a hypothesis?

- An idea or theory that is not proven but that leads to further study or discussion.

- Hypothesis testing is used to assess whether or not a data set is meaningfully different from some baseline distribution.

When we do hypothesis testing, there are a number of components:

### How to conduct Hypo Testing in R

`Rules of thumb`

— Known: (null distribution, test statistic) → Unknown: (p-value) → decide whether to reject the null hypothesis.

## Experiments vs observational study

Observational - you just look at something happening and try to study it.

Experimental - you actually change the thing you're studying and see the effects.

Observational: you examine trends of lung cancer vs subscribers of PewDiePie. You see that lung cancer goes down as PewDiePie subscriptions go up. You conclude that there is a correlation between PewDiePie subscribers and lung cancer.

Experimental: you kidnap 1000 people and force them to watch PewDiePie and see if their rate of lung cancer decreases. It doesn't. You conclude that there is no correlation, and likely no causation, between the two.

Experimental studies usually get better results, as you can eliminate surrounding variables that may affect the results. But, particularly in social sciences like psychology, that's sometimes hard to do without being highly unethical.

Hypothesis testing process

## Test statistics

"I am a tool to decide whether or not to reject the null hypothesis. "

- It is obtained by taking the observed value (the sample statistic) and converting it into a standard score under the assumption that the null hypothesis is true.

#### What

- A standard score calculated from your data, under the assumption that the null hypothesis is true. (Lecture)

- There are a few kinds of test statistics, depending on the kind of test you do. Say you are doing a t-test; then the t value is your test statistic. (Scribbr)
- A value calculated from the sample, used to decide whether to accept or reject *H*0. Commonly used test statistics include Z, t, F, and χ². (Opengate)

- A summary of a data set that reduces the data to one value that can be used to perform the hypothesis test. (Wiki)

#### Why

- Calculate the `p-value` of your results

#### How

- Look at the formula from (Wiki)

(Good link)
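As a sketch of the idea, for a one-sample t-test the statistic standardises the sample mean under H0: μ = μ0 (the data and μ0 below are made up):

```r
# One-sample t statistic computed by hand, then checked against t.test().
x   <- c(5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.2)  # made-up sample
mu0 <- 5                                      # hypothesised mean under H0

t_by_hand  <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_built_in <- unname(t.test(x, mu = mu0)$statistic)

all.equal(t_by_hand, t_built_in)  # TRUE
```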

## Hypothesis testing concepts

## Null hypothesis

`The mean of a random variable = 0`

"I have a sad life because I either get rejected or fail to be rejected; I never get accepted or proved"

- So, as an analyst, you can only reject or fail to reject the null hypothesis.

## Alternative hypothesis

"I am rebellious: whatever the null hypothesis says, I am ALWAYS its opposition."

## Assumptions

Consider :

- the level of measurement of the variable

- the method of sampling

- the shape of the population distribution

- the sample size

## Significance level (α)

- A probability threshold below which the null hypothesis will be rejected.

- Common values are 5% and 1%.

## Sampling distribution under the null hypothesis: test statistic

An example could be the permutation test, which is explained below:

## Permutation test

## What

- A nonparametric test with light assumptions; the null distribution of the test statistic is built by recomputing it over many random permutations of the group labels.
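A minimal sketch of a two-sample permutation test for a difference in means (the data here are made up):

```r
set.seed(1)
a <- c(5.2, 6.1, 5.8, 6.4, 5.9)   # group A (made-up data)
b <- c(4.9, 5.0, 5.3, 4.7, 5.1)   # group B (made-up data)

observed <- mean(a) - mean(b)
pooled   <- c(a, b)

# Null distribution: shuffle the group labels many times and
# recompute the statistic each time.
perm_stats <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:5]) - mean(shuffled[6:10])
})

# Two-sided p-value: proportion of permuted statistics at least as extreme.
p_value <- mean(abs(perm_stats) >= abs(observed))
```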

## Why

## P-value: reject the null hypothesis if the p-value is small

- The probability of seeing the observed test statistic (or something more extreme) under the null hypothesis

- A conditional probability, given that the null hypothesis is TRUE. (Here)

- A value that lies beyond the critical values

## Decision - critical region

- Decide to either reject the null hypothesis in favor of the alternative or not reject it.

- The decision rule is to reject the null hypothesis H0 if the p-value ≤ .05, meaning that the test statistic is in the critical region;
- otherwise, "fail to reject" the null hypothesis when p > 0.05.

## Example

## Hypo Testing steps

1. Decide on a null hypothesis H0.

2. Decide on an alternative hypothesis H1.

3. Decide on a significance level.

4. Calculate the appropriate test statistic.

5. Find from tables the corresponding tabulated test statistic.

6. Compare calculated and tabulated test statistics and decide whether to accept or reject the null hypothesis.

7. State the conclusion and assumptions of the test.

Source Rees, D.G. Essential Statistics, Chapman and Hall 1995. (1006 lecture)
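The steps above can be sketched with R's built-in `sleep` data and a one-sample t-test (the choice of data and α = 0.05 are mine, for illustration):

```r
# H0: mean extra sleep = 0; H1: mean extra sleep != 0; alpha = 0.05.
extra_sleep <- sleep$extra[sleep$group == 2]  # built-in 'sleep' dataset

result <- t.test(extra_sleep, mu = 0)
result$statistic   # the calculated test statistic
result$p.value     # compare with alpha instead of a table lookup

reject_h0 <- result$p.value < 0.05
```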

## Power (Good link & here)

"I am the judge, or you may say the enemy, of the null hypothesis"

- The probability of correctly rejecting the null hypothesis; or, put another way, *the probability of not making a Type II error*.

- Known as 1 − β (generally, β should be `0.2`)

**Optimistic analysis** - testing how big our sample will need to be to achieve a certain power

- Very useful when you are planning experiments

- The target power is 80%; that's why β should generally be `0.2`
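A sketch with base R's `power.t.test()`, asking what sample size reaches 80% power (the effect size `delta` and `sd` below are made-up values):

```r
# Solve for n given power = 0.8 (i.e. beta = 0.2).
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)

ceiling(res$n)  # participants needed per group (two-sample test by default)
```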

## Factors affecting Power (Here)

- Size of the effect: Larger effects are more easily detected.

- Measurement error: Systematic and random errors in recorded data reduce power.

- Sample size: Larger samples reduce sampling error and increase power.

- Significance level: Increasing the significance level increases power

## Significance level and decision errors

## Graph of decision errors

There are 2 types of errors: Type 1 error and Type 2 error.

Type 1 (`false +ve`): wrong rejection of a true H0

- More precisely, *incorrectly rejecting the null hypothesis* in favour of a false alternative hypothesis

- *Known as α*: the significance level, usually set at 0.05 or 5%, meaning that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.

Type 2 (`false -ve`): wrong acceptance of a false H0

- More precisely, *failing to reject a false null hypothesis* when the alternative hypothesis is in fact true

- *Known as β*

#### Example: the average price of cars is 3200. A test is conducted to see if that is true. State the types of errors.

Step 1: Find H0 and HA

H0: μ = 3200

HA: μ ≠ 3200

Step 2: Find the errors

Type 1: wrong rejection of H0 (the test shows μ ≠ 3200, but in fact μ = 3200)

Type 2: wrong acceptance of H0 (the test shows μ = 3200, but in fact μ ≠ 3200)

## Good source about Hypo testing

#### We also learnt about the Wilcoxon test; it is non-parametric.

Its null hypotheses are that:

1. the differences in marks are symmetrically distributed around zero (note: normality is not assumed, which is why the test is non-parametric);

2. quiz1 and quiz2 share the same distribution;

3. the median difference = 0.

#### Shapiro test

The null-hypothesis of this test is that the population is normally distributed.

Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected (whereas, for an alpha level of .05, a data set with a p value of less than .05 would reject the null hypothesis that the data are from a normally distributed population).
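A sketch of both tests on made-up quiz marks, assuming paired marks for the same students:

```r
set.seed(42)
quiz1 <- rnorm(30, mean = 70, sd = 8)         # made-up marks
quiz2 <- quiz1 + rnorm(30, mean = 4, sd = 5)  # same students, second quiz

# Shapiro-Wilk: H0 = the differences are normally distributed.
p_shapiro  <- shapiro.test(quiz2 - quiz1)$p.value

# Paired Wilcoxon signed-rank: H0 = the median difference is 0.
p_wilcoxon <- wilcox.test(quiz1, quiz2, paired = TRUE)$p.value
```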

## Week 3 : Permutation and resampling

Last week we used the example "Will the web re-design increase profits?". If it really would, your boss will then ask "by how much?". By understanding today's concept, the estimated effect, you will be able to answer that.

### Confidence interval :

a range of plausible values where we may find the true population value.

#### Estimator (p. 8)

## estimand, estimator and estimate

(Estimand -> Estimator -> Estimate)

Example 1:

We rely on the observations in our sample and use a linear (regression) function (the estimator) to estimate the causal effect of education on income in our sample which is our estimate for the population-level causal effect (the estimand).

#### Trimmed mean and mean (p. 9)

## Trimmed mean (20%) of this data set

711, 710, 783, 814, 797 793, 711, 678, 773, 615 701, 841, 764, 330, 653, 558, 738, 795, 702, 752

## Steps

- Order the numbers from small to large.

- Remove 20% of the points on both ends. Hint: What's 20% of 20 values?

- Compute the mean with the remaining values.

- ans is 711.0

## difference between Mean and trimmed mean

- Using a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional mean. (here)
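In R this is a one-liner with the `trim` argument of `mean()`; here it is applied to the data set above:

```r
marks <- c(711, 710, 783, 814, 797, 793, 711, 678, 773, 615,
           701, 841, 764, 330, 653, 558, 738, 795, 702, 752)

mean(marks)              # ordinary mean, dragged down by the outlier 330
mean(marks, trim = 0.2)  # drops 4 values from each end before averaging
```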

#### Confidence interval (p. 11)

Confidence Intervals

#### Bias (p. 14)

### The bootstrap principle

Bootstrap

#### The parametric bootstrap

- Used when you know the family of the probability distribution

#### The non-parametric bootstrap

- Used when you DON'T know the family of the probability distribution
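A minimal non-parametric bootstrap for a 95% confidence interval of the mean (the sample below is made up):

```r
set.seed(2420)
x <- rexp(50, rate = 1 / 10)  # made-up skewed sample

# Resample the data with replacement many times and
# recompute the statistic of interest each time.
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

# Percentile 95% confidence interval for the mean.
ci <- quantile(boot_means, c(0.025, 0.975))
```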

When doesn't it work?

When does it work?

## Week 4: Likelihoods and parameter estimation

The entire process of MLE revolves around determining the parameters that maximise the probability of the observed data.

Maximum Likelihood Estimate (MLE)

Interpretation:

The bigger the likelihood, the more likely it is that the data comes from that model, and the more appropriate it is to use that model.
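A sketch of the idea: maximising the log-likelihood of a normal model's mean with `optimize()`, then checking it lands on the sample mean (the data are made up, and the sd is fixed at its sample value for simplicity):

```r
set.seed(7)
x <- rnorm(100, mean = 3, sd = 2)  # made-up data

# Log-likelihood of mu under a normal model.
log_lik <- function(mu) sum(dnorm(x, mean = mu, sd = sd(x), log = TRUE))

mle_mu <- optimize(log_lik, interval = c(-10, 10), maximum = TRUE)$maximum

# For a normal model, the MLE of the mean is the sample mean.
all.equal(mle_mu, mean(x), tolerance = 1e-4)
```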

## Week 5 : Bootstraps, uncertainty, and parameter estimates

## Censor data

## Week 6 : Regression for explaining data

## Slide 1

Multilevel Models

- Fixed effects vs random effects

- Mixed effects models

- Diagnostics

## w7 Linear Model

## Residual

## What is R^2?

- A common measurement of the strength of a linear model's fit.

- It tells us the % of variability/variance in the response that is explained by the model.

## What is the difference between augment() and predict()?

Both are used in testing the model, but they are used differently:

- `augment()`: used when the **training dataset** is used to test a model

- `predict()`: used when a **testing dataset** is used to test a model

## The residuals and fitted values can be extracted using `augment()` from broom:

```r
library(broom)

fit <- lm( ... )  # you know it: your formula and data

mod_heights_diagnostics <- augment(fit)
```

**Interpretation of a good model**

## Indicator to judge a good model or not

- R Square / Adjusted R Square

## AIC

## BIC

## Residual plot

```r
library(ggResidpanel)  # resid_panel() comes from the ggResidpanel package

resid_panel(fit, plots = "all")
```

Residual: the difference between the predicted (fitted) and actual (observed) values.

Conditions for a good model, based on the residuals:

- Independent

- Normally distributed

## Residual plot: a good linear model's residuals should be independent, i.e. show no relationship

## How to see whether there is a relationship?

If there is no pattern in the residual plot, the residuals have no relationship, hence a good model.

## QQ plot: a good linear model's residuals should be normally distributed

## How to see whether they are normally distributed?

If all points lie on the blue line, then they are normally distributed.

## What about in this case?

- Extreme points deviate from the line, which can be highly inﬂuential.

- That is to say that some data points can change the regression coefficient to the extent that you would see something different if you left it out.

## Cook's distance and residual-leverage plot: a good linear model should avoid outliers as much as possible

## Cook's distance (to spot outliers, then remove them accordingly)

- > 4/n is a good rule of thumb for a big value.
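The rule of thumb can be applied directly with base R's `cooks.distance()`; the model below uses the built-in `mtcars` data purely as an example:

```r
fit <- lm(mpg ~ wt, data = mtcars)  # example model on built-in data

d <- cooks.distance(fit)
influential <- which(d > 4 / nrow(mtcars))  # rule of thumb: > 4/n

names(influential)  # candidate outliers to inspect (and possibly remove)
```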

## Residual-leverage plot

If points lie below the red line, then there are outliers that need to be removed.

Residual-leverage plot:

## This data is not normally distributed

## Model is good when residuals are small (ETC2420 Week 6)

When residuals are small, that means:

- The smaller the residual sum of squares, the better your model fits your data.

## What if there is no Covariate/ Predictor variables / features in the regression? (ETC2420 Week 6)

## Hypothesis test in `lm()` (ETC2420 Week 6)

## What are the outputs from `lm()` for the test statistic and the hypothesis test?

"Covariate" means predictor variable/feature.

- `lm()` tests the null hypothesis that each covariate's coefficient = 0.
- If there is only 1 covariate in your model,
- rejecting the null hypothesis indicates that there is a linear relationship between the response and the covariate.
- If there is more than one covariate in your model,
- it indicates that there is a linear relationship between the covariate and the response, even after we have taken into account the other covariates.

- This doesn't indicate the strength of the relationship.
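Concretely, each coefficient row of `summary(lm())` carries this hypothesis test; `mtcars` is used below purely as an example:

```r
# Each row reports a t statistic and p-value for H0: coefficient = 0.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # example with built-in data

coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
```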

## Assumptions when calculating the P-value and standard errors

## Week 7 :Regression for causal estimation - The Directed Acyclic Graph (DAG)

### What & Why DAG

A graphical technique:

- To visualise assumptions about the relationships between variables (AKA nodes, in the context of graphs).

- To see how variables interact with each other in ways that change the regression.

- To ensure that spurious correlation between X and Y isn't introduced.

#### How DAG

There are 4 kinds of elemental confounds (or, as I call them, Types of Path, **ToP**):

#### ToP 1 : FORK (common cause)

Z is the common cause of X & Y

**Example:**

e.g. 1: `weight ← (unhealthy lifestyle) → smoking`

e.g. 2: `bad impact on health ← (staying up late) → work efficiency`

e.g. 3: `Marriage Rate ← (Age at Marriage) → Divorce Rate`

**X and Y are correlated except for when we control for Z.**

e.g. `weight` and `smoking` are correlated except for when we control for the common cause `unhealthy lifestyle`.

*X and Y are independent, conditional on Z.*

*If we don't condition on the common cause, we will see spurious correlation.*

#### ToP 2 : PIPE (Chain of logic)

Z mediates the effect of X on Y. (i.e X causes Z causes Y)

**Example :**

e.g. 1: `smoking → cholesterol rises → increased risk of cardiac arrest`

e.g. 2: `unhealthy lifestyle → weight rises → cholesterol`

**Z is an intermediate variable between X & Y.**

**If we were to intervene or control for Z, then the relationship between X and Y would be broken and there should be no correlation.**

- In a statistical framework, we would not want to control for Z.

**If we condition on Z, we also block the path from X to Y. So in both a fork and a pipe, conditioning on the middle variable blocks the path.**

#### ToP 3 : Collider (Very COMMON)

There is no association between X and Y unless you condition on Z. (i.e X and Y jointly causes Z )

**Example:**

- e.g. 1: `Flu → Fever ← Chicken pox (varicella-zoster)`
- e.g. 2: `Switch → Light ← Electricity`
- e.g. 3: `Newsworthy → Publish ← Trustworthy`

**An inverted fork is not an open path; it is blocked at the collider Z.**

**X & Y are independent if Z is not conditioned on**

Conversely, conditioning on Z — the collider variable, opens the path. Once the path is open, information flows between X and Y.

Example: `influenza and chicken pox are independent` if fever is not conditioned on.

**Learning X and Z reveals Y.**

### Regression and DAG

Say we want to identify a causal effect between X and Y :

#### Add or Not Add??

**When there is a Collider:** we DO NOT include the collider variable in the regression.

**When there is a FORK:** we DO include the variable in the regression.

**When there is a PIPE:** we DO NOT include the variable in the regression.

If we add the pipe variable W, we will see bias in our causal estimate:

- You will see the direct effect of X on Y.

- You will not see the indirect effect captured by W.

## Week 8 : Regression for prediction

## Slide 1 : Bayesian inference

Likelihood

Normalising constant

Estimation

Incorporating a prior

Posterior

Credible intervals

Continuous distributions

Comparison with MLE

Conjugate priors

Bayesian regression

## Slide 2: Conditional probability and Bayesian methods

- Events – Union, Intersection & Disjoint events

- Independent, Dependent and Exclusive events (with implementation in R)

- Conditional Probability (with implementation in R)

- Bayes Theorem (with implementation in R)

- Probability trees

- Frequentist vs Bayesian definitions of probability

- Open Challenges

## Slides 3: Bayesian Statistics explained to Beginners in Simple English

- The drawbacks of frequentist statistics lead to the need for Bayesian Statistics

- Discover Bayesian Statistics and Bayesian Inference

- There are various methods to test the significance of the model like p-value, confidence interval, etc

## Slides 5 : The Odds, Continually Updated

**Mean Square Error**

**What is MSE?**

- The average of the squared differences between the actual and estimated values (Here)

**Description of the graph **(Here)

In the diagram, predicted values are points on the line and actual values are shown by small circles.

- Error in prediction is shown as the distance between the data point and fitted line.

- MSE for the line is calculated as the average of the squared errors over all data points.

- For all such lines possible for a given dataset, the line that gives minimal or least MSE is considered as the best fit.

### How is MSE calculated? (Here)

**Interpretation of MSE** (Here)

- To check how close our predictions are to the actual values.

- The lower the MSE, the closer the forecast is to the actual values, and the better the fit. So we want to *minimise MSE*.

- Used to compare different regression models.
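A sketch of both uses, computing MSE as the mean of squared residuals and comparing two candidate models (the `mtcars` models are illustrative choices):

```r
# MSE of a fitted regression: mean of the squared residuals.
fit  <- lm(mpg ~ wt, data = mtcars)       # example with built-in data
mse  <- mean(residuals(fit)^2)

# Comparing models: a lower MSE means a closer fit to these data.
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
mse2 <- mean(residuals(fit2)^2)

mse2 < mse  # TRUE here: the richer model fits the training data more closely
```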

## Week 9 : Classification

### Confusion Matrix

### Interpretation of indicator: (Here)

#### Precision ("isn't" mistaken for "is", i.e. false positives)

- Used to determine the best model when the cost of a False Positive is high.
- For instance, email spam detection:
- a false positive means that a non-spam email (actual negative) has been identified as spam (predicted spam).
- The email user might *lose important emails* if the precision of the spam detection model is *not high*.

#### Recall ("is" mistaken for "isn't", i.e. false negatives), aka Sensitivity

- Used to determine the best model when the cost associated with a False Negative is high.

- Calculates how many of the actual positives our model captures by labelling them as positive (True Positive).
- In fraud detection:
- if a *fraudulent* transaction (actual positive) is predicted as *non-fraudulent* (predicted negative), the consequence can be very bad for the bank.
- In sick patient detection:
- similarly, if a *sick patient* (actual positive) goes through the test and is predicted as *not sick* (predicted negative),
- the cost associated with the False Negative will be extremely high if the sickness is contagious.

### Logistic regression

#### When do we use Logistic regression?

- When the response variable is binary, taking one of two values, 0 or 1

#### Math of Logistic regression: Log-odd

- Sum(Square Residuals) + ()

(More details can be read from the week 9 lecture slides.)
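In R, logistic regression is fitted with `glm()` and a binomial family; the `mtcars` model below is an illustrative choice, not from the slides:

```r
# Binary response (am: 0 = automatic, 1 = manual) modelled on the log-odds scale.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)  # coefficients are changes in log-odds per unit of wt

# Predicted probability P(am = 1) for a hypothetical car of weight 3.
p_hat <- predict(fit, newdata = data.frame(wt = 3), type = "response")
```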

#### What if there are a lot of variables? Penalty

**Problem:**

When there are a lot of variables, *multicollinearity* is introduced. That means the independent (predictor) variables in a regression model are correlated, which violates an assumption of regression.

*How do you know if your data has multicollinearity?*

- The coefficients are very big.

- The least-squares estimates are unbiased, but their variances are large; this results in predicted values far away from the actual values.

- The above leads to overfitting and instability.

**Solution: add a penalty**

We add a penalty term that discourages large coefficient values.

*How does the penalty parameter λ behave?*

- Essentially, we tweak the value of λ to trade off between bias and variance.
- When λ = 0, the coefficients are the ordinary least-squares estimates (asymptotically unbiased).
- Increasing λ shrinks the coefficients; when λ is very large, the coefficients are all zero (no variance). [They move in opposite directions.]
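A sketch of the shrinkage idea in base R, using the ridge closed form (the data are made up, and the intercept is omitted for simplicity):

```r
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly collinear with x1
y  <- 2 * x1 + rnorm(n)

X <- cbind(x1, x2)

# Ridge estimate: (X'X + lambda * I)^{-1} X'y
ridge <- function(lambda)
  solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)

l2 <- function(b) sqrt(sum(b^2))

l2(ridge(0))   # least-squares: large, unstable coefficients
l2(ridge(10))  # penalised: noticeably smaller coefficients
```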

## Week 10 : Using Bayes theorem to infer parameters

- Working with text

- Web scraping

- Grammar of graphics

- Randomisation to assess structure perceived in plots

Bayes theorem / Bayes' rule

The Beta-Binomial Conjugate Prior

## Week 11 : Bayesian regression

MCMC

Credible interval

## FAQ

## In the week 2 lab, we learnt how the sample size of an experiment can affect the variability of the CDF. So for the t-distribution, how does changing the sample size or df affect the CDF?

## Week 3

## What is the difference between bootstrapping and the eCDF?

## How are modelling and ML different? Why do we need to know the underlying data distribution?

The purpose of both could be different.

ML: used for prediction tasks; the models are quite often black boxes, and they literally do not care what the underlying data generation process/distributions are.

Modelling: used for inference; it is very common to form a hypothesis about what the underlying data generation process is, then to collect some data and test that hypothesis with the model.

## Difference between **Frequentist and Bayesian**

**Author:** Jason Siu

**URL:** https://jason-siu.com/article%2Fc11f51a4-c98b-4f44-bb58-d61c49af5453

**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
