ETC2420 - Statistical thinking

type

Post

Created date

Nov 15, 2021 02:40 AM

Week 1 : Sampling

In Week 1, we learnt how the sample size of an experiment can affect the variability of objects like the empirical CDF.

What is a random variable?

A random variable records / quantifies the result of a random process.

Its values are the possible outcomes of your event, like 1 - 6 when rolling a die. (1006 lecture)

A computer cannot produce a truly random sequence of numbers, so instead it produces pseudo-random number sequences, which behave like random numbers.

This is not a bad thing: because the numbers aren't truly random, we can make our code reproduce the same results at a later time.

That means we can use set.seed().
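
A minimal sketch of that reproducibility, using only base R. The seed value 2420 and the ten die rolls are arbitrary choices made for this illustration.

```r
# Same seed, same pseudo-random sequence
set.seed(2420)   # arbitrary seed chosen for this sketch
first_run <- sample(1:6, size = 10, replace = TRUE)   # ten rolls of a die

set.seed(2420)   # reset to the same seed
second_run <- sample(1:6, size = 10, replace = TRUE)

identical(first_run, second_run)   # the two runs match exactly
```

Without the second set.seed() call, the second run would continue the sequence and generally give different rolls.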

What is CDF?

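Cumulative distribution function, F(x) = P(X ≤ x); AKA the "distribution function"

defined for both discrete and continuous random variables (it is the probability mass function, PMF, that only applies where the possibilities are discrete)

A way to check the type of distribution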

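For example, suppose we have a numeric vector called data:

library(tidyverse)

set.seed(1)
data <- rnorm(100)  # assumption for illustration: 100 standard normal draws

# empirical CDF of the sample, with the standard normal CDF overlaid (dashed blue)
tibble(sample = data) %>%
  ggplot(aes(x = sample)) +
  stat_ecdf() +
  stat_function(fun = pnorm, colour = "blue", linetype = "dashed")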

We are checking whether the dataset looks like it comes from a normal distribution (ND).

If the data really are normal, the deviation from the theoretical CDF is purely due to the small sample size. If the empirical CDF of the sample goes outside this window, we are seeing something qualitatively different from our 100 draws.

CDF and ECDF (empirical cumulative distribution function)

The difference between them is that :

CDF is the theoretical construct WHEREAS eCDF is built from an actual data set.

So you can think of the CDF as what you get from an infinite sample, whereas the eCDF comes from a limited sample.
For a continuous distribution the CDF is a smooth curve WHEREAS the eCDF is always a step function.

So, the larger the sample size, the closer the eCDF gets to the CDF (smooth line)
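
The effect of sample size can be checked numerically. This is a sketch in base R that assumes standard normal data and measures the largest vertical gap between the eCDF and the normal CDF; the helper name max_gap and the sample sizes are made up for this illustration.

```r
set.seed(1)

# Largest vertical gap between the eCDF of n normal draws and the normal CDF
# (this gap is the Kolmogorov-Smirnov statistic reported by ks.test)
max_gap <- function(n) {
  x <- rnorm(n)
  unname(ks.test(x, "pnorm")$statistic)
}

sizes <- c(10, 100, 10000)
gaps <- sapply(sizes, max_gap)
names(gaps) <- sizes
gaps   # the gap tends to shrink as the sample size grows
```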

What is probability density function (PDF)?

used where the possibilities are continuous.

Notation :

PDF is denoted f(x)
CDF is denoted F(x)

What is the difference between PDF and CDF?

(Outputs are different):

So a CDF is a function whose output is a probability. The PDF is a function whose output is a non-negative number. The PDF itself is not a probability (unlike the CDF), but it can be used to calculate probabilities.

(The types of possibility are different) :

A probability mass function (PMF, sometimes loosely called a "probability distribution function") is used where the possibilities are discrete. Coin flips, dice rolls, favorite colors, first names, etc. The sum of the probabilities is one. The CDF itself exists for both discrete and continuous variables.
A probability density function is used where the possibilities are continuous, not discrete. Here it is the integral that is one, not the sum. This means the magnitude of the instantaneous probability density depends on the units used to express the measure, because the integration includes "dx" in it and measuring x in different units means the "dx" is different.
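
A small numerical check of these points, sketched in base R with the normal distribution as an assumed example: the density integrates to one, a density value can be larger than one, and the CDF is the area under the PDF.

```r
# The PDF integrates to one
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# A density value can exceed 1, so it is not itself a probability
peak_density <- dnorm(0, mean = 0, sd = 0.1)

# The CDF at q is the area under the PDF up to q
q <- 1.96
area_up_to_q <- integrate(dnorm, lower = -Inf, upper = q)$value
cdf_at_q <- pnorm(q)

c(total_area = total_area, peak_density = peak_density,
  area_up_to_q = area_up_to_q, cdf_at_q = cdf_at_q)
```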

Functions learnt

rep()

CDF: pnorm()

ECDF: stat_ecdf()

Week 2 : Hypothesis testing

What is a hypothesis?

A hypothesis is an idea or theory that is not proven but that leads to further study or discussion.

Hypothesis testing is used to assess whether or not a data set is meaningfully different from some baseline distribution.

When we do hypothesis testing, there are a number of components :

Null hypothesis

Alternative hypothesis

Test statistics

Power

How to conduct Hypo Testing in R

Rule of thumb: known (null distribution, test statistic) → unknown (p-value) → decide whether to reject the null hypothesis.

Experiments vs observational studies

Observational - you just look at something happening and try to study it.

Experimental - you actually change the thing you're studying and see the effects.

Observational: you examine trends of lung cancer vs PewDiePie subscriber counts. You see that lung cancer goes down as PewDiePie subscriptions go up. You conclude that there is a correlation between PewDiePie subscribers and lung cancer.

Experimental: you kidnap 1000 people and force them to watch PewDiePie and see if their rate of lung cancer decreases. It doesn't. You conclude that there is no correlation and likely no causation between the two.

Experimental studies usually give more reliable results, as you can control the surrounding variables that may affect the results. But, particularly in social sciences like psychology, it's kinda hard sometimes without being highly unethical.

Hypothesis testing process

Test statistics

"I am a tool to decide whether or not to reject the null hypothesis. "

It is obtained by taking the observed value (the sample statistic) and converting it into a standard score under the assumption that the null hypothesis is true.

What

A standard score calculated from your data, under the assumption that the null hypothesis is true. (Lecture)

There are several kinds of test statistic, depending on the kind of test you do. Say you are doing a t-test, then the t value is your test statistic. (Scribbr)

A value computed from the sample, used to decide whether to accept or reject H0. Commonly used test statistics are Z, t, F and χ2. (Opengate)

A summary of a data set that reduces the data to one value that can be used to perform the hypothesis test. (Wiki)

Why

Calculate the p-value of your results

How

Look at the formula from (Wiki)

(Good link)
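
As one concrete case, here is a sketch of the one-sample t statistic computed by hand and checked against R's built-in t.test(). The simulated sample and the null value mu0 = 0 are assumptions made for this illustration, not data from the lecture.

```r
set.seed(2)
x <- rnorm(30, mean = 0.4, sd = 1)   # hypothetical sample
mu0 <- 0                             # value of the mean under H0

# t = (sample mean - null mean) / (standard error of the mean)
t_by_hand <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = length(x) - 1)

# the built-in test gives the same numbers
built_in <- t.test(x, mu = mu0)
c(by_hand = t_by_hand, built_in = unname(built_in$statistic))
```

This shows the chain from the notes: known null distribution and test statistic first, then the p-value.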

Hypothesis testing concepts

Null hypothesis

e.g. the mean of a random variable = 0

"I have a sad life because I either get rejected or fail to be rejected; I never get accepted or proved"

So, as an analyst, you can only reject or fail to reject the null hypothesis.

Alternative hypothesis

"I am rebellious: whatever the null hypothesis says, I am ALWAYS its opposition."

Assumptions

Consider :

the level of measurement of the variable

the method of sampling

the shape of the population distribution

the sample size

Significance level (α)

A probability threshold below which the null hypothesis will be rejected.

Common values are 5% and 1%.

Sampling distribution of the test statistic under the null hypothesis

An example is the permutation test, which is explained below :

Permutation test

This link visually explains how it works

YouTube tutorial of permutation test

What

a nonparametric test with light assumptions: the null distribution of the test statistic is built by repeatedly shuffling the group labels

Why

P-value - reject the null hypothesis if the p-value is small.

This link explains how the p-value works

The probability of seeing a test statistic at least as extreme as the observed one under the null hypothesis

It is a conditional probability, given that the null hypothesis is TRUE. (Here)

A small p-value corresponds to a test statistic that lies beyond the critical values
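
A minimal sketch of a two-group permutation test in base R. The two groups a and b are made-up numbers, the difference in means is one possible choice of test statistic, and 2000 shuffles is an arbitrary count.

```r
set.seed(3)
a <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 11.8)   # made-up group a
b <- c(14.9, 15.3, 14.2, 15.8, 14.6, 15.1, 16.0, 14.4)   # made-up group b
values <- c(a, b)
labels <- rep(c("a", "b"), times = c(length(a), length(b)))

# Test statistic: difference in group means
diff_in_means <- function(v, g) mean(v[g == "b"]) - mean(v[g == "a"])
observed <- diff_in_means(values, labels)

# Null distribution: shuffle the labels, recompute the statistic
n_perm <- 2000
null_stats <- replicate(n_perm, diff_in_means(values, sample(labels)))

# Two-sided p-value: share of shuffles at least as extreme as observed
# (the +1 counts the observed arrangement itself)
p_value <- (sum(abs(null_stats) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

Because the two made-up groups do not overlap at all, almost no shuffle is as extreme as the observed difference, so the p-value comes out very small.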

Decision - critical region

Decide to either reject the null hypothesis in favor of the alternative or not reject it.

The decision rule is to reject the null hypothesis H0 if the p-value ≤ .05, meaning that the test statistic is in the critical region,

or otherwise, to "fail to reject" (not "accept") the null hypothesis when p > 0.05.

Example

Hypo Testing steps

1 Decide on a null hypothesis H0.

2 Decide on an alternative hypothesis H1.

3 Decide on a significance level.

4 Calculate the appropriate test statistic.

5 Find from tables the corresponding tabulated test statistic.

6 Compare calculated and tabulated test statistics and decide whether to accept or reject the null hypothesis.

7 State the conclusion and assumptions of the test.

Source Rees, D.G. Essential Statistics, Chapman and Hall 1995. (1006 lecture)

Power (Good link & here)

"I am the judge or enemy, you say, of the Null hypothesis"

The probability of correctly rejecting the null hypothesis; or you may say : the probability of not making a Type II error

Known as 1 − β (generally, β should be 0.2)

Optimistic analysis - testing how big our sample will need to be to achieve a certain power

very useful when you are planning experiments

the target power is 80%; that's why, generally, β should be 0.2

At a sample size of around 40, 1 − β should be about 0.8

Factors affecting Power (Here)

Size of the effect: Larger effects are more easily detected.

Measurement error: Systematic and random errors in recorded data reduce power.

Sample size: Larger samples reduce sampling error and increase power.

Significance level: Increasing the significance level increases power
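
These factors can be explored with power.t.test() from base R's stats package. The sketch below assumes a two-sample t-test and a true difference of 0.5 standard deviations; those numbers are illustrative, so the resulting power values need not match the lecture's figure.

```r
# Power of a two-sample t-test for several sample sizes per group,
# assuming a true difference of 0.5 standard deviations
sizes <- c(10, 20, 40, 80)
power_by_n <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power
})
names(power_by_n) <- sizes

# Sample size per group needed to reach the usual 80% target
n_needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n

# A looser significance level gives more power at the same sample size
power_alpha_10 <- power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.10)$power

power_by_n
ceiling(n_needed)
```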

Significance level and decision errors

Graph of decision errors

There are 2 types of errors : Type 1 error and Type 2 error

Type 1 (false +ve) : wrong rejection of H0

More precisely, incorrectly rejecting a true null hypothesis in favour of a false alternative hypothesis

Known as α

The significance level is usually set at 0.05 or 5%, meaning that results this extreme have a 5% chance of occurring, or less, if the null hypothesis is actually true.

Type 2 (false -ve) : wrong rejection of HA = wrong acceptance of H0

More precisely, failing to reject a false null hypothesis when the alternative hypothesis is true

Known as β

Example: the avg price of cars is 3200. A test is conducted to see if that is true. State the types of errors :

Step 1 : Find the H0 and HA

H0 : M = 3200

HA : M ≠ 3200

Step 2 : Find the errors

Type 1: wrong rejection of H0 (the test shows that M ≠ 3200, but in fact M = 3200)

Type 2: wrong rejection of HA (the test shows that M = 3200, but in fact M ≠ 3200)
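
Both error rates can be estimated by simulation. This is a sketch under assumed conditions: normally distributed prices with standard deviation 400, samples of 25 cars, and a one-sample t-test of H0 : M = 3200. The alternative mean of 3400 is also made up for the illustration.

```r
set.seed(4)
alpha <- 0.05
n_sims <- 2000

# Share of simulated samples in which H0: M = 3200 is rejected
reject_rate <- function(true_mean) {
  mean(replicate(n_sims, {
    prices <- rnorm(25, mean = true_mean, sd = 400)   # hypothetical sample
    t.test(prices, mu = 3200)$p.value <= alpha
  }))
}

type1_rate <- reject_rate(3200)       # H0 true: any rejection is a Type 1 error
type2_rate <- 1 - reject_rate(3400)   # H0 false: any non-rejection is a Type 2 error

c(type1 = type1_rate, type2 = type2_rate)
```

The Type 1 rate should land near the chosen α, while the Type 2 rate depends on how far the true mean is from 3200 and on the sample size.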

Good source about Hypo testing

Probability and Statistics - 假設檢定：基本流程總整理 Process of Hypothesis Testing Statistics

ntpu Slide

淺談p值(p-value) (A brief look at the p-value)

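We also learnt about the Wilcoxon test, which is non-parametric.

Its null hypothesis is that :

1. quiz1 and quiz2 share the same distribution;

2. so the differences in marks are centred on 0 (diff = 0).

Unlike the paired t-test, it does not require the differences in marks to follow a ND.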

Shapiro test

The null-hypothesis of this test is that the population is normally distributed.

Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed (e.g., for an alpha level of .05, a data set with a p value of less than .05 rejects the null hypothesis that the data are from a normally distributed population).

On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected.
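
A short sketch of the test in base R, using shapiro.test() on two simulated samples. The sample sizes and distributions are assumptions chosen for the illustration.

```r
set.seed(5)
normal_sample <- rnorm(200)
skewed_sample <- rexp(200)   # exponential draws: clearly not normal

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

alpha <- 0.05
c(p_normal = p_normal, p_skewed = p_skewed)
p_skewed < alpha   # normality is rejected for the skewed sample
```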

Week 3 : Permutation and resampling

Last week we used the example of "Will the web re-design increase profits?". If it really would, your boss will then ask you "by how much?". By understanding today's concept (the estimated effect), you will be able to answer that.

Confidence interval :

a range of plausible values where we may find the true population value.

Estimator p8

estimand, estimator and estimate

(Estimand -> Estimator -> Estimate)

Example 1:

We rely on the observations in our sample and use a linear (regression) function (the estimator) to estimate the causal effect of education on income in our sample which is our estimate for the population-level causal effect (the estimand).

trimmed mean and mean p9

Trimmed mean (20%) of this data set

711, 710, 783, 814, 797, 793, 711, 678, 773, 615, 701, 841, 764, 330, 653, 558, 738, 795, 702, 752

Steps

Order the numbers from small to large.

Remove 20% of the points from each end. Hint: What's 20% of 20 values?

Compute the mean with the remaining values.

ans is 734.7 (the plain, untrimmed mean is 711.0)
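
The steps can be checked in base R. For this data set, dropping 4 values from each end gives a 20% trimmed mean of about 734.7, while the plain mean of all 20 values is 710.95.

```r
scores <- c(711, 710, 783, 814, 797, 793, 711, 678, 773, 615,
            701, 841, 764, 330, 653, 558, 738, 795, 702, 752)

# Step by step: sort, drop 20% of the values (4 of 20) from each end, average
sorted <- sort(scores)
k <- floor(0.2 * length(sorted))
by_hand <- mean(sorted[(k + 1):(length(sorted) - k)])

# Built-in shortcut
built_in <- mean(scores, trim = 0.2)

plain_mean <- mean(scores)
c(by_hand = by_hand, built_in = built_in, plain_mean = plain_mean)
```

The gap between the two means comes mostly from the low outlier 330, which the trimming removes.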

difference between Mean and trimmed mean

Using a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional mean. (here)

Confidence interval p11

Confidence Intervals

bias p14

The bootstrap principle

👢

Bootstrap

The parametric bootstrap

Used when you know the family of the probability distribution

The non-parametric bootstrap

Used when you DON'T know the family of the probability distribution
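
A sketch contrasting the two bootstraps for the sample mean, in base R. The exponential sample, the choice of the exponential family for the parametric version, and the 2000 replicates are all assumptions made for this illustration.

```r
set.seed(6)
x <- rexp(50, rate = 1 / 5)   # hypothetical sample of 50 waiting times
n_boot <- 2000

# Non-parametric: resample the observed data with replacement
np_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Parametric: assume an exponential family, fit it, then simulate from the fit
rate_hat <- 1 / mean(x)
p_means <- replicate(n_boot, mean(rexp(length(x), rate = rate_hat)))

# 95% percentile intervals for the mean
ci_np <- quantile(np_means, c(0.025, 0.975))
ci_p  <- quantile(p_means,  c(0.025, 0.975))
rbind(non_parametric = ci_np, parametric = ci_p)
```

The parametric version is only as good as the assumed family; the non-parametric version leans on the sample itself standing in for the population.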

When doesn't it work?

When does it work?

Week 4 : Likelihoods and parameter estimation

The entire process of MLE revolves around determining the parameters which maximise the probability (likelihood) of the observed data

Maximum Likelihood Estimate (MLE)

Interpretation :

The bigger the likelihood, the more plausible it is that the data come from that model, and the more appropriate it is to use that model
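
A minimal sketch of MLE in base R, assuming an exponential model for simulated data. The numerical search with optimize() is compared with the known closed-form answer for this model, 1 / sample mean.

```r
set.seed(7)
x <- rexp(100, rate = 0.5)   # hypothetical data; the true rate is 0.5

# Log-likelihood of an exponential model as a function of its rate
log_lik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Search for the rate that maximises the log-likelihood
fit <- optimize(log_lik, interval = c(0.001, 10), maximum = TRUE, tol = 1e-8)
mle_numeric <- fit$maximum

# For the exponential model the MLE has a closed form: 1 / sample mean
mle_closed_form <- 1 / mean(x)

c(numeric = mle_numeric, closed_form = mle_closed_form)
```

Working with the log-likelihood rather than the likelihood gives the same maximiser and avoids multiplying many tiny numbers.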

Week 5 : Bootstraps, uncertainty, and parameter estimates

Censored data

Week 6 : Regression for explaining data

Slide 1

Multilevel Models

Fixed effects vs random effects

Mixed effects models

Diagnostics

w7 Linear Model

Residual

Definition

What is R^2

is a common measurement of strength of linear model fit.

tells us % variability / variance in response explained by model.

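What is the difference between augment() and predict()?

Both are used to evaluate a model, but they are used differently :

augment() (from broom) : returns the data with the fitted values and residuals added as columns. By default it uses the training dataset, i.e. the data the model was fitted on.

predict() : returns predictions only. It is typically used with a testing dataset, passed through newdata.

The residuals and fitted values can be extracted using the augment() function from broom:

library(broom)

fit <- lm(response ~ covariate, data = your_data)  # placeholder formula and data

mod_heights_diagnostics <- augment(fit)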
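
The same quantities are available in base R. This sketch uses the built-in cars data as a stand-in, since the lab's own data set is not shown here, and also recomputes R^2 from the residuals.

```r
fit <- lm(dist ~ speed, data = cars)   # cars ships with R

fitted_values <- fitted(fit)
resids <- residuals(fit)               # observed minus fitted

# R^2 = 1 - (residual sum of squares / total sum of squares)
rss <- sum(resids^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - rss / tss

# predict() on new data the model has not seen
new_cars <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_cars)

r_squared
predicted
```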

Interpretation of a good model

Indicator to judge a good model or not

R Square / Adjusted R Square

AIC

BIC

Residual plot

resid_panel(fit, plots = "all")

residual : the difference between the actual (observed) and predicted (fitted) values, i.e. observed minus fitted

Conditions for judging a good model based on its residuals

Independent

Normally distributed

Residual plot: a good linear model's residuals should be independent, i.e. show no relationship with the fitted values.

How to see whether there is a relationship or not?

If there is no pattern in the residual plot, there is no relationship, which points to a good model.

QQ plot : a good linear model's residuals should be normally distributed.

How to see whether they are normally distributed or not?

If the points lie close to the blue line, the residuals are approximately normally distributed.

What about in this case?

Extreme points deviate from the line, which can be highly influential.

That is to say that some data points can change the regression coefficient to the extent that you would see something different if you left it out.

Cook's distance and residuals vs leverage plot : a good linear model should have as few influential outliers as possible

Cook's distance (measures how much the fit changes when a point is left out; use it to spot influential outliers and decide whether to remove them)

> 4/n is a good rule of thumb for a big value.
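
A sketch of the 4/n rule of thumb using cooks.distance() from base R, again with the built-in cars data as a stand-in for the lab data.

```r
fit <- lm(dist ~ speed, data = cars)

cooks_d <- cooks.distance(fit)    # one value per observation
threshold <- 4 / nrow(cars)       # rule-of-thumb cut-off
flagged <- which(cooks_d > threshold)

flagged            # row numbers of the influential points
cooks_d[flagged]   # their Cook's distances
```

Flagged points are candidates to investigate, not automatic deletions: it is worth refitting without them and seeing how much the coefficients move.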

Residuals vs leverage plot

If points lie beyond the red line, they are influential outliers that may need to be removed.

Residuals vs leverage plot :

This data is not normally distributed

Model is good when residuals are small (ETC2420 Week 6)

Residuals are small, that means :

The smaller the residual sum of squares, the better your model fits your data

What if there is no Covariate/ Predictor variables / features in the regression? (ETC2420 Week 6)

hypothesis test in lm() (ETC2420 Week 6)

What does lm() output for the test statistic and hypothesis test?

Covariate means predictor variable / feature.

It tests the null hypothesis that the coefficient of the covariate = 0

If there is only 1 covariate in your model,

Rejecting the null hypothesis indicates that there is a linear relationship between the response and the covariate.

If there are more than one covariate in your model,

it indicates that there is a linear relationship between the covariate and the response, even after we have taken into account the other covariates.

This doesn’t indicate the strength of the
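
The test reported by lm() can be reproduced by hand from the coefficient table. This sketch uses the built-in mtcars data with two covariates as an assumed example.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef_table <- summary(fit)$coefficients

# t value = estimate / standard error, for H0: coefficient = 0
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# two-sided p-value from the t distribution on the residual degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = df.residual(fit))

cbind(t_by_hand, t_reported = coef_table[, "t value"],
      p_by_hand, p_reported = coef_table[, "Pr(>|t|)"])
```

Each row tests one covariate with the other covariates kept in the model.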

Assumptions when calculating the P-value and standard errors

Week 7 :Regression for causal estimation - The Directed Acyclic Graph (DAG)

What & Why DAG

A graphical technique

To visualise assumptions about the relationship between variables (AKA nodes in the context of graphs).

To see how variables interact with each other in ways that change the regression

To ensure that a spurious correlation between X and Y isn’t introduced.

How DAG

There are 4 kinds of elemental confounds (I call them types of path, ToP); three are covered here

ToP 1 : FORK (common cause)

Z is the common cause of X & Y

Example :

e.g. 1 : weight ← (unhealthy lifestyle) → smoking

e.g. 2 : bad impact on health ← (staying up late) → work efficiency

e.g. 3 : Marriage Rate ← (Age at Marriage) → Divorce Rate

X and Y are correlated except for when we control for Z.

e.g. weight and smoking are correlated except for when we control for the common cause, unhealthy lifestyle.

X and Y are independent, conditional on Z.

If we don’t condition on the common cause, we will see spurious correlation.

ToP 2 : PIPE (Chain of logic)

Z mediates the effect of X on Y. (i.e X causes Z causes Y)

Example :

e.g 1 : smoking → cholesterol to rise → increases risk for cardiac arrest.

e.g 2: unhealthy life style → weight rises → Cholesterol

Z is an intermediate variable between X & Y.

If we were to intervene or control for Z, then the relationship between X and Y would be broken and there should be no correlation.

In a statistical framework, we would not want to control for Z

If we condition on Z now, we also block the path from X to Y. So in both a fork and a pipe, conditioning on the middle variable blocks the path.

ToP 3 : Collider (Very COMMON)

There is no association between X and Y unless you condition on Z. (i.e X and Y jointly causes Z )

Example :

e.g 1 : Flu → Fever ← Chicken pox (varicella-zoster)

e.g 2 : Switch → Light ← Electricity

e.g 3 : Newsworthy → Publish ← Trustworthy

An inverted fork is not an open path; it is blocked at the collider Z.

X & Y are independent if Z is not conditioned on

Conversely, conditioning on Z — the collider variable, opens the path. Once the path is open, information flows between X and Y.

Example : Influenza and chicken pox are independent if fever is not conditioned on

Learning X and Z reveals Y.

Regression and DAG

Say we want to identify a causal effect between X and Y :

Add or Not Add??

When there is a Collider:

-We DO NOT include Collider variable in the regression.

When there is a FORK:

-We DO include the variable in the regression

When there is a Pipe:

-We DO NOT include the variable in the regression

If we add the pipe variable W, we will see bias in our causal estimate.

You will only see the direct effect of X on Y.

You will not see the indirect effect captured by W.
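
The fork and collider rules can be seen in a small simulation. This is a sketch in base R with made-up linear relationships and unit-variance noise; in both set-ups X has no causal effect on Y, so the correct coefficient on X is 0.

```r
set.seed(8)
n <- 5000

# Fork: Z causes both X and Y; X has no effect on Y
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)
fork_without_z <- coef(lm(y ~ x))["x"]       # spurious, about 0.5
fork_with_z    <- coef(lm(y ~ x + z))["x"]   # about 0 once Z is included

# Collider: X and Y are independent and both cause Z
x2 <- rnorm(n)
y2 <- rnorm(n)
z2 <- x2 + y2 + rnorm(n)
collider_without_z <- coef(lm(y2 ~ x2))["x2"]        # about 0
collider_with_z    <- coef(lm(y2 ~ x2 + z2))["x2"]   # spurious, about -0.5

round(c(fork_without_z, fork_with_z, collider_without_z, collider_with_z), 2)
```

Including the fork variable removes the spurious association, while including the collider creates one.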

DAGitty - drawing and analyzing causal diagrams (DAGs)

Causal Diagrams Cheat Sheet (dagitty.net)

PMAP 8521 • Example: DAGs with Dagitty

Week 8 : Regression for prediction

Slide 1 : Bayesian inference

Likelihood

Normalising constant

Estimation

Incorporating a prior

Posterior

Credible intervals

Continuous distributions

Comparison with MLE

Conjugate priors

Bayesian regression

Slide 2: Conditional probability and Bayesian methods

Events – Union, Intersection & Disjoint events

Independent, Dependent and Exclusive events (with implementation in R)

Conditional Probability (with implementation in R)

Bayes Theorem (with implementation in R)

Probability trees

Frequentist vs Bayesian definitions of probability

Open Challenges

Slides 3: Bayesian Statistics explained to Beginners in Simple English

The drawbacks of frequentist statistics lead to the need for Bayesian Statistics

Discover Bayesian Statistics and Bayesian Inference

There are various methods to test the significance of the model like p-value, confidence interval, etc

Slides 4: A visual guide to Bayesian thinking

Slides 5 : The Odds, Continually Updated

Mean Square Error

What is MSE?

Avg(the square of the difference between actual and estimated values) (Here)

Description of the graph (Here)

In the diagram, predicted values are points on the line and actual values are shown by small circles.

Error in prediction is shown as the distance between the data point and fitted line.

MSE for the line is calculated as the average of the squared errors over all data points.

For all such lines possible for a given dataset, the line that gives minimal or least MSE is considered as the best fit.

How is MSE calculated? (Here)

Here RMSE is the root of MSE. You can think of :

What is Mean Absolute Error (MAE)?

The average of the absolute differences between actual and predicted values.

What is R²?

A coefficient of determination

The variance explained by the model / the total variance.

Comparison between RMSE, MAE, and R²

Interpretation of MSE (Here)

To check how close our prediction is to actual values.

The lower the MSE, the closer the forecast is to the actual values and the better the fit. So we want to minimise MSE.

Used to compare different regression models.
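
A worked example of the error measures in base R, on four made-up actual and predicted values. Here the errors are 1, 0, -1 and -2, so MSE = 1.5, MAE = 1 and R² = 0.7.

```r
actual    <- c(3, 5, 7, 9)    # made-up observed values
predicted <- c(2, 5, 8, 11)   # made-up model predictions
errors <- actual - predicted

mse  <- mean(errors^2)        # mean squared error
rmse <- sqrt(mse)             # root mean squared error, in the units of y
mae  <- mean(abs(errors))     # mean absolute error

# R^2: share of the total variation explained by the predictions
r_squared <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

c(mse = mse, rmse = rmse, mae = mae, r_squared = r_squared)
```

MSE and RMSE punish the single large error (-2) more heavily than MAE does, because the errors are squared before averaging.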

Week 9 : Classification

Confusion Matrix

Interpretation of indicator: (Here)

Accuracy

Proportion of times the classifier is correct

This is useful, code found in ETC2420 lecture slide week 9.

Precision ("not" turned into "is", i.e. false positives)

To determine the best model when the costs of False Positive is high.

For instance, email spam detection.

In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted spam).
The email user might lose important emails if the precision is not high for the spam detection model.

Recall ("is" turned into "not", i.e. false negatives) aka Sensitivity

To determine the best model when the cost associated with False Negative is high.

Calculates how many of the Actual Positives our model captures by labelling them as Positive (True Positive).

For instance, in fraud detection,

If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequence can be very bad for the bank.

For instance, in sick patient detection,

Similarly, a sick patient (Actual Positive) may go through the test and be predicted as not sick (Predicted Negative).
The cost associated with False Negative will be extremely high if the sickness is contagious.
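
The three indicators can be computed from the confusion-matrix counts. This sketch uses ten made-up labels in base R, where 1 means positive; it is not the code from the lecture slides.

```r
actual    <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)   # made-up true labels
predicted <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0)   # made-up classifier output

tp <- sum(predicted == 1 & actual == 1)   # true positives
fp <- sum(predicted == 1 & actual == 0)   # false positives
fn <- sum(predicted == 0 & actual == 1)   # false negatives
tn <- sum(predicted == 0 & actual == 0)   # true negatives

accuracy  <- (tp + tn) / length(actual)   # share of all calls that are right
precision <- tp / (tp + fp)               # of predicted positives, share correct
recall    <- tp / (tp + fn)               # of actual positives, share caught

table(predicted = predicted, actual = actual)   # the confusion matrix
c(accuracy = accuracy, precision = precision, recall = recall)
```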

Logistic regression

When do we use Logistic regression?

When the response variable is binary, i.e. takes one of two values, 0 or 1

Math of logistic regression: log-odds

Sum(Square Residuals) + ()

(More details can be read from the week 9 lecture slides.)
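
A minimal sketch of the log-odds idea with glm() in base R. It assumes the built-in mtcars data, with the binary variable am as the response and wt as the only covariate; this is an illustration, not the lecture's example.

```r
# Logistic regression: the model is linear on the log-odds scale
fit <- glm(am ~ wt, family = binomial, data = mtcars)

log_odds <- predict(fit, type = "link")       # linear predictor (log-odds)
prob     <- predict(fit, type = "response")   # fitted probability that am = 1

# The inverse logit turns log-odds into probabilities
prob_by_hand <- plogis(log_odds)              # 1 / (1 + exp(-log_odds))

# Classify as 1 when the fitted probability is above 0.5
predicted_class <- as.integer(prob > 0.5)

coef(fit)
head(cbind(log_odds, prob, predicted_class))
```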

What if there are lot of variables?: penalty

Problem :

When there are a lot of variables, multicollinearity is introduced. That means the independent variables / predictor variables in a regression model are correlated, which violates an assumption of regression.

How do you know if your data has multicollinearity?

The βs, i.e. the coefficients, are very big.

The least-squares estimates are unbiased but their variances are large, which results in predicted values that are far away from the actual values.

All of the above leads to overfitting and instability.

Solution:

We add a penalty term that discourages large values of β, i.e. the coefficients.

How much penalty do you add?

Essentially, we tweak the value of λ to trade off between bias and variance.

When λ = 0, the coefficients are (asymptotically) unbiased.
Increasing λ decreases the βs. When λ → ∞, the coefficients are all zero (no variance). [It is an inverse relationship.]

In the code, we change the `mixture` parameter.

When mixture = 0, this is called ridge regression.

The βs are shrunk.

When mixture = 1, this is called the logistic lasso.

The βs are shrunk by making the least important ones exactly zero.
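
The shrinking effect of the penalty can be seen with a hand-rolled ridge fit. Note that this sketch swaps in linear (not logistic) ridge regression, because it has a closed-form solution in base R; the helper ridge_coef, the mtcars covariates and the penalty values are all made up for the illustration.

```r
# Standardise the covariates and centre the response so no intercept is needed
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Ridge solution for a given penalty: (X'X + lambda I)^(-1) X'y
ridge_coef <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100, 1000)
coefs <- sapply(lambdas, ridge_coef)
colnames(coefs) <- lambdas

size <- sqrt(colSums(coefs^2))   # overall size of the coefficient vector
round(coefs, 3)
round(size, 3)                   # shrinks towards 0 as lambda grows
```

With lambda = 0 this is ordinary least squares; the ridge penalty shrinks all coefficients smoothly, whereas a lasso penalty would set some of them exactly to zero.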

Resource about Logistic regression

Tutorial 27- Ridge and Lasso Regression Indepth Intuition- Data Science - YouTube

Regularization Part 1: Ridge (L2) Regression - YouTube

Week 10 : Using Bayes theorem to infer parameters

Working with text

Web scraping

Grammar of graphics

Randomisation to assess structure perceived in plots

Bayes theorem / Bayes' rule

The Beta-Binomial Conjugate Prior

Week 11 : Bayesian regression

MCMC

Credible interval

FAQ

In the week 2 lab, we learnt how the sample size of an experiment can affect the variability of the CDF. So for the t-distribution, how does changing the sample size or DF affect the CDF?

Week 3

What is the difference between bootstrapping and ecdf

https://stats.stackexchange.com/questions/466809/bootstrapping-and-ecdf

How are modelling and ML different? Why do we need to know the underlying data distribution?

The purpose of both could be different.

ML : used for prediction tasks; the models are quite often black boxes and they literally do not care what the underlying data generation process / distribution(s) are.

Modeling : used for inference; It is very common to form a hypothesis about what the underlying data generation process is, then to collect some data and test the hypothesis that the model

Confidence Intervals vs Prediction Intervals vs Tolerance Intervals

Difference between Frequentist vs Bayesian
