Created date
Nov 15, 2021 02:40 AM
Data Science
Machine Learning
This unit has three parts:
  • We looked at how randomness affects estimates drawn from data
  • In particular, we examined sampling distributions using direct simulations and bootstrap
  • We looked at how we can make confidence statements about things estimated from the data
In the second part of the course, we looked at:
  • Fitting a straight line through data
  • Assessing assumptions of linear regression
  • When you can (and can’t) make causal statements
  • Regression for prediction
  • Classification (including logistic regression and k-nearest neighbours)
The third part is Bayesian reasoning --
  • A formal system for updating information in the light of new data
  • A powerful technique for both modelling data and understanding data


Week 1 : Sampling

In Week 1, we learnt about how the sample size of an experiment can affect the variability of objects like the empirical CDF.

What is a Random Variable?

  • You record / quantify the result of a random process
  • The values of your event, like 1 to 6 when rolling a die. (1006 lecture)
notion image

A computer cannot produce a truly random sequence of numbers, so instead it produces pseudo-random number sequences, which behave like random numbers.
  • This is not a bad thing: because the numbers aren't truly random, we can make our code reproduce the same results at a later time.
  • That means we can use set.seed().
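The course uses R's set.seed(); as an illustrative sketch of the same idea in Python/NumPy (the seed value 2420 is an arbitrary choice):

```python
import numpy as np

# Seeding the generator makes the pseudo-random sequence reproducible,
# the same idea as set.seed() in R.
rng1 = np.random.default_rng(2420)
rng2 = np.random.default_rng(2420)

draws1 = rng1.normal(size=5)
draws2 = rng2.normal(size=5)

print(np.allclose(draws1, draws2))  # the two seeded runs give identical numbers
```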

What is a CDF (cumulative distribution function)?

  • AKA "probability distribution function"
  • used where the possibilities are discrete
  • A way to check the type of distribution

For example, we have a numeric vector called data:
tibble(sample = data) %>%
  ggplot(aes(x = sample)) +
  stat_ecdf() +
  stat_function(fun = pnorm, colour = "blue", linetype = "dashed")
We are checking whether the dataset looks like a normal distribution.
Deviation from the theoretical CDF can be purely due to the small sample size. If the empirical CDF of the sample goes outside this window, we are seeing something qualitatively different from our 100 draws.
CDF and eCDF (empirical cumulative distribution function)

The difference between them is that :
  • CDF is the theoretical construct WHEREAS eCDF is built from an actual data set.
    • So, that means, the sample size behind a CDF is effectively infinite, while an eCDF comes from a limited sample.
    • The CDF is continuous (for a continuous variable) WHEREAS the eCDF is a discrete step function.
    • So, the higher the sample size, the closer the eCDF gets to the CDF (smooth line).
notion image
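To see the eCDF approaching the CDF as the sample grows, here is a small simulation sketch (Python/NumPy; the grid and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

def ecdf(sample, x):
    # Proportion of sample values <= x: the empirical CDF, a step function.
    return np.mean(sample[:, None] <= x, axis=0)

rng = np.random.default_rng(1)
grid = np.linspace(-3, 3, 61)

# The largest gap between the eCDF and the true normal CDF shrinks as n grows.
gaps = {}
for n in (20, 200, 2000):
    sample = rng.normal(size=n)
    gaps[n] = np.max(np.abs(ecdf(sample, grid) - norm.cdf(grid)))
    print(n, round(gaps[n], 3))
```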

What is probability density function (PDF)?

  • used where the possibilities are continuous.
  • Notation :
    • the PDF is denoted f(x)
    • the CDF is denoted F(x) = P(X ≤ x)
notion image
Linking PDF with CDF: F(x) is the integral of f(t) from −∞ up to x, and f(x) = F′(x).

What is the difference between PDF and CDF?

  • (Outputs are different):
    • So a CDF is a function whose output is a probability. The PDF is a function whose output is a non-negative number. The PDF itself is not a probability (unlike the CDF), but it can be used to calculate probabilities.
  • (The types of possibility are different) :
    • A probability distribution function (a probability mass function, for discrete variables) is used where the possibilities are discrete: coin flips, dice rolls, favourite colours, first names, etc. The sum of the probabilities is one.
    • A probability density function is used where the possibilities are continuous, not discrete. Here it is the integral that is one, not the sum. This means the magnitude of the instantaneous probability density depends on the units used to express the measure, because the integration includes "dx" in it, and measuring x in different units means the "dx" is different.


Functions learnt
notion image
ECDF : stat_ecdf()

Week 2 : Hypothesis testing

What is a Hypothesis?

  • An idea or theory that is not proven but that leads to further study or discussion.
  • Hypothesis testing is used to assess whether or not a data set is meaningfully different from some baseline distribution.

When we do hypothesis testing, there are a number of components:

How to conduct Hypo Testing in R

Rule of thumb: Known (null distribution, test statistic) → Unknown (p-value) → Decide whether to reject the null hypothesis.
Experiments vs observational studies
Observational - you just look at something happening and try to study it.
Experimental - you actually change the thing you're studying and see the effects.
Observational: you examine trends of lung cancer vs subscribers of pewdiepie. You see that lung cancer goes down as pewdiepie subscriptions go up. You conclude that there is a correlation between pewdiepie subscribers and lung cancer.
Experimental: you kidnap 1000 people and force them to watch pewdiepie and see if their rate of lung cancer decreases. It doesn't. You conclude that there is no correlation and likely no causation between the two.

Experimental studies usually get better results, as you can eliminate surrounding variables that may affect the results. But, particularly in social sciences like psychology, it's kinda hard sometimes to do this without being highly unethical.
Hypothesis testing process
Test statistics
"I am a tool to decide whether or not to reject the null hypothesis. "
  • It is obtained by taking the observed value (the sample statistic) and converting it into a standard score under the assumption that the null hypothesis is true.


  • A standard score calculated from your data, under the assumption that the null hypothesis is true. (Lecture)
  • There are a few kinds of test statistics, depending on the kind of test you do. Say you are doing a t-test; then the t value is your test statistic. (Scribbr)
    • A value calculated from the sample, used to decide whether to accept or reject H0. Common test statistics include Z, t, F and χ². (Opengate)
  • A summary of a data set that reduces the data to one value that can be used to perform the hypothesis test. (Wiki)


  • Calculate the p-value of your results


  • Look at the formula from (Wiki)
(Good link)
Hypothesis testing concepts
Null hypothesis
Example: the mean of a random variable = 0
"I have a sad life because I either get rejected or fail to be rejected; I never get accepted or proved."
  • So, as analyst, you can only reject or fail to reject the Null hypothesis.
Alternative hypothesis
"I am rebellious that whatever Null hypothesis, I am ALWAYS his opposition."
Consider :
  • the level of measurement of the variable
  • the method of sampling
  • the shape of the population distribution
  • the sample size
Significance level (α)
  • A probability threshold below which the null hypothesis will be rejected.
  • Common values are 5% and 1%.
Sampling distribution under the null hypothesis : Test statistic
An example could be the permutation test, which is explained below:
Permutation test
  • An awesome nonparametric test with light assumptions: the null distribution of the test statistic is built by shuffling (permuting) the group labels.
notion image
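A minimal sketch of a two-sample permutation test (Python/NumPy; the data and the choice of 10,000 permutations are made up for illustration, not from the course):

```python
import numpy as np

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means.

    Under H0 the group labels are exchangeable, so we shuffle them
    and rebuild the null distribution of the test statistic."""
    rng = np.random.default_rng(seed)
    observed = np.mean(x) - np.mean(y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = np.mean(pooled[:len(x)]) - np.mean(pooled[len(x):])
        if abs(stat) >= abs(observed):
            count += 1
    return count / n_perm  # two-sided p-value

x = np.array([5.1, 4.8, 6.0, 5.5, 5.9])
y = np.array([4.2, 4.5, 4.0, 4.8, 4.3])
p = permutation_test(x, y)
print(p)
```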
P-value - reject the null hypothesis if the p-value is small.
  • The probability of seeing the observed test statistic under the null hypothesis
  • Conditional probability given that the null hypothesis is TRUE. (Here)
A value that lies beyond the critical values
notion image
Decision - critical region
  • Decide to either reject the null hypothesis in favor of the alternative or not reject it.
  • The decision rule is to reject the null hypothesis H0 if the p-value ≤ .05, meaning the test statistic is in the critical region,
    • or otherwise, to accept or "fail to reject" the hypothesis when p > 0.05.
Hypo Testing steps
1 Decide on a null hypothesis H0.
2 Decide on an alternative hypothesis H1.
3 Decide on a significance level.
4 Calculate the appropriate test statistic.
5 Find from tables the corresponding tabulated test statistic.
6 Compare calculated and tabulated test statistics and decide whether to accept or reject the null hypothesis.
7 State the conclusion and assumptions of the test.
Source Rees, D.G. Essential Statistics, Chapman and Hall 1995. (1006 lecture)
Power (Good link & here)
"I am the judge or enemy, you say, of the Null hypothesis"
  • The probability of correctly rejecting the null hypothesis; or you may say: the probability of not making a Type II error.
  • Known as 1 − β. (Generally, β should be 0.2.)
Power analysis - testing how big our sample will need to be to achieve a certain power
  • very useful when you are planning experiments
  • the target power is 80%; that's why β is generally 0.2
Around sample size = 40, the power 1 − β should be 0.8.
Factors affecting Power (Here)
  • Significance level: Increasing the significance level increases power
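Power can also be estimated by simulation: simulate many experiments under a given true effect and count how often H0 is rejected. A sketch (Python/SciPy; the effect size 0.45 and all other settings are made-up illustrations):

```python
import numpy as np
from scipy import stats

def simulated_power(n, effect=0.45, alpha=0.05, n_sims=2000, seed=1):
    """Fraction of simulated experiments in which H0 is (correctly) rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)      # control group
        b = rng.normal(effect, 1.0, n)   # treatment group: true effect exists
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

powers = {n: simulated_power(n) for n in (10, 40, 160)}
print(powers)  # power grows with the sample size
```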
Significance level and decision errors
Graph of decision errors
notion image
notion image
There are 2 types of Errors : Type 1 Error and Type 2 Error
Type 1 (false +ve) : Wrong rejection of H0
  • More precisely, incorrectly rejecting a true null hypothesis in favour of a false alternative hypothesis
  • Known as α
    • The significance level is usually set at 0.05 or 5%, meaning that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.
Type 2 (false -ve) : Wrong acceptance of H0
  • More precisely, failing to reject a false null hypothesis when the alternative hypothesis is true
  • Known as β

Example: the avg price of cars is 3200. A test is conducted to see if that is true. State the types of errors:

Step 1 : Find the H0 and HA
H0 : M = 3200
HA : M ≠ 3200
Step 2 : Find the Errors
Type 1: wrong rejection of H0 (test shows M ≠ 3200, but in fact M = 3200)
Type 2: wrong acceptance of H0 (test fails to reject M = 3200, but in fact M ≠ 3200)
Good source about Hypo testing

We also learnt about the Wilcoxon test; it is non-parametric.

Depending on the version, its null hypothesis is that:
1. the differences in marks are symmetrically distributed around 0;
2. quiz1 and quiz2 share the same distribution;
3. the median difference = 0

Shapiro test

The null-hypothesis of this test is that the population is normally distributed.
Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected (e.g., for an alpha level of .05, a data set with a p value greater than .05 fails to reject the null hypothesis that the data are from a normally distributed population).
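The course runs these tests in R (shapiro.test(), wilcox.test()); here is a Python/SciPy sketch of the same idea, on made-up paired quiz marks:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
quiz1 = rng.normal(loc=60, scale=10, size=30)
quiz2 = quiz1 + rng.normal(loc=5, scale=4, size=30)  # quiz2 genuinely higher

# Shapiro-Wilk: H0 = the differences come from a normal population.
w, p_shapiro = stats.shapiro(quiz2 - quiz1)

# Wilcoxon signed-rank: H0 = the paired differences are symmetric about 0.
stat, p_wilcoxon = stats.wilcoxon(quiz1, quiz2)

print(round(p_shapiro, 3), round(p_wilcoxon, 6))
```

Because quiz2 was simulated with a real shift, the Wilcoxon p-value should come out small.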

Week 3 : Permutation and resampling

Last week we used the example of "Will the web re-design increase profits?". If it really would, your boss will further ask you "by how much?". By understanding today's concept, the estimated effect, you will be able to answer that.

Confidence interval :

a range of plausible values where we may find the true population value.

Estimator p8

estimand, estimator and estimate
(Estimand -> Estimator -> Estimate)
notion image

Example 1:
We rely on the observations in our sample and use a linear (regression) function (the estimator) to estimate the causal effect of education on income in our sample which is our estimate for the population-level causal effect (the estimand).

trimmed mean and mean p9

Trimmed mean (20%) of this data set
711, 710, 783, 814, 797, 793, 711, 678, 773, 615, 701, 841, 764, 330, 653, 558, 738, 795, 702, 752
  1. Order the numbers from small to large.
  2. Remove 20% of the points on both ends. Hint: what's 20% of 20 values?
  3. Compute the mean with the remaining values.
  4. The trimmed mean is ≈ 734.7; the plain mean is 711.0, dragged down by the outlier 330.
difference between Mean and trimmed mean
  • Using a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional mean. (here)
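As a quick check of the arithmetic (in R this is mean(x, trim = 0.2), which trims 20% from each end; here is an equivalent SciPy sketch):

```python
import numpy as np
from scipy.stats import trim_mean

marks = np.array([711, 710, 783, 814, 797, 793, 711, 678, 773, 615,
                  701, 841, 764, 330, 653, 558, 738, 795, 702, 752])

# Trim 20% from EACH end: 4 of the 20 sorted values are dropped on both sides.
trimmed = trim_mean(marks, 0.2)
plain = marks.mean()
print(round(trimmed, 2), round(plain, 2))  # the outlier 330 drags the plain mean down
```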

Confidence interval p11

Confidence Intervals

bias p14

The bootstrap principle


The parametric bootstrap

  • Used when you know the family of the probability distribution

The non-parametric bootstrap

  • Used when you DON'T know the family of the probability distribution
When doesn't it work?
When does it work?
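A minimal sketch of the non-parametric bootstrap (Python/NumPy, percentile confidence interval; the number of resamples and the reuse of the marks data above are arbitrary choices):

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=5000, level=0.95, seed=0):
    """Non-parametric bootstrap percentile CI: resample the data WITH
    replacement, recompute the statistic each time, take quantiles."""
    rng = np.random.default_rng(seed)
    boot_stats = np.array([stat(rng.choice(data, size=len(data), replace=True))
                           for _ in range(n_boot)])
    alpha = (1 - level) / 2
    return np.quantile(boot_stats, [alpha, 1 - alpha])

data = np.array([711, 710, 783, 814, 797, 793, 711, 678, 773, 615,
                 701, 841, 764, 330, 653, 558, 738, 795, 702, 752])
lo, hi = bootstrap_ci(data)
print(round(lo, 1), round(hi, 1))  # interval for the population mean
```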

Week 4 : Likelihoods and parameter estimation

The entire process of MLE revolves around determining the parameter values which maximise the probability of the data.
Maximum Likelihood Estimate (MLE)
Interpretation :
The bigger the likelihood, the more likely it is that the data comes from that model, and the more appropriate it is to use that model.
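A small sketch of MLE by direct optimisation (Python/SciPy), assuming a normal model with known sigma; in that case the MLE of mu should come out equal to the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, scale=1.0, size=500)

def nll(mu):
    # Negative log-likelihood of N(mu, 1) for the observed data;
    # minimising this is the same as maximising the likelihood.
    return -np.sum(norm.logpdf(data, loc=mu, scale=1.0))

mle = minimize_scalar(nll, bounds=(-10, 10), method="bounded").x

# For a normal with known sigma, the MLE of mu is the sample mean.
print(round(mle, 4), round(data.mean(), 4))
```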


Week 5 : Bootstraps, uncertainty, and parameter estimates

Censored data
notion image


Week 6 : Regression for explaining data

Slide 1
Multilevel Models
  • Fixed effects vs random effects
  • Mixed effects models
  • Diagnostics
w7 Linear Model
What is R^2
  • A common measurement of the strength of a linear model's fit.
  • It tells us the % of variability / variance in the response explained by the model.
What is the difference between augment() and predict()
Both are used in testing the model, but the way they are used is different :
  • augment() : used when the training dataset is used to test a model
  • predict() : used when a testing dataset is used to test a model
The residuals and fitted values can be extracted using the augment() function from broom:
fit = lm( you know it )
mod_heights_diagnostics <- augment(fit)
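As a sketch of what augment() hands back (fitted values and residuals) and how R^2 is computed from them, on a made-up simulated linear model (Python/NumPy; the course does this in R with broom):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)

# Least-squares fit; fitted values and residuals are what augment() returns in R.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# R^2: the share of the variance in y that the model explains.
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(round(r2, 3))
```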
Interpretation of a good model
Indicator to judge a good model or not
  • R Square / Adjusted R Square
Residual plot
resid_panel(fit, plots = "all")
data is normally distributed
notion image
residual : the difference between the predicted (fitted) and actual (observed) values
3 conditions to evaluate a good model based on residuals
  • Independent
  • Normally distributed
  • Constant variance (homoscedastic)
residual plot : a good linear model's residuals should be independent, i.e. show no relationship.
How to view if there is relationship or not?
if there is no pattern shown in the residual plot, the residuals have no relationship with the fitted values; hence a good model
QQplot : a good linear model's residual should be normally distributed
How to view if it is normally distributed or not?
If all points lie on the blue line, then it is normally distributed
What about in this case?
  • Extreme points deviate from the line, which can be highly influential.
  • That is to say that some data points can change the regression coefficients to the extent that you would see something different if you left them out.
Cook's distance and residual leverage plot : a good linear model should avoid as many outliers as possible
Cook's distance (once you can see which points are outliers, just remove them accordingly)
  • > 4/n is a good rule of thumb for a big value.
Residual leverage plot
if points lie below the red line, then there are outliers that need to be removed
Residual leverage plot :
This data is not normally distributed
notion image
Model is good when residuals are small (ETC2420 Week 6)
Residuals are small, that means :
  • The smaller the residual sum of squares, the better your model fits your data
What if there are no covariates / predictor variables / features in the regression? (ETC2420 Week 6)
notion image
hypothesis test in lm() (ETC2420 Week 6)
What are the outputs from lm() of the test statistic and the hypothesis test ?
covariate means predictor variables / features.
  • It tests the null hypothesis that the covariate's coefficient = 0
    • If there is only 1 covariate in your model,
      • rejecting the null hypothesis indicates that there is a linear relationship between the response and the covariate.
    • If there is more than one covariate in your model,
      • it indicates that there is a linear relationship between the covariate and the response, even after we have taken into account the other covariates.
  • This doesn't indicate the strength of the relationship.
Assumptions when calculating the P-value and standard errors
notion image


Week 7 : Regression for causal estimation - The Directed Acyclic Graph (DAG)

What & Why DAG

A graphical technique
  • To visualise assumptions about the relationship between variables (AKA nodes in the context of graphs).
  • To see how variables interact with each other, which changes the regression
  • To ensure that the spurious correlation between X and Y isn’t introduced.


There are 4 kinds of elemental confounds (or as I call them, Types of Path, ToP)

ToP 1 : FORK (common cause)

Z is the common cause of X & Y
  • Example :
    • e.g. 1 : weight ← (unhealthy lifestyle) → smoking
      e.g. 2 : bad impact on health ← (staying up late) → work efficiency
      e.g. 3 : Marriage Rate ← (Age at Marriage) → Divorce Rate
  • X and Y are correlated except for when we control for Z.
    • e.g. weight and smoking are correlated except for when we control for the common cause, unhealthy lifestyle.
  • X and Y are independent, conditional on Z.
  • If we don’t condition on the common cause, we will see spurious correlation.
notion image

ToP 2 : PIPE (Chain of logic)

Z mediates the effect of X on Y (i.e. X causes Z, which causes Y).
  • Example :
e.g. 1 : smoking → cholesterol rises → increased risk of cardiac arrest.
e.g. 2 : unhealthy lifestyle → weight rises → cholesterol rises
  • Z is an intermediate variable between X & Y.
  • If we were to intervene or control for Z, then the relationship between X and Y would be broken and there should be no correlation.
    • In a statistical framework, we would not want to control for Z
  • If we condition on Z now, we also block the path from X to Y. So in both a fork and a pipe, conditioning on the middle variable blocks the path.
notion image


ToP 3 : Collider (Very COMMON)

There is no association between X and Y unless you condition on Z (i.e. X and Y jointly cause Z).
  • Example :
    • e.g 1 : Flu → Fever ← Chicken pox (varicella-zoster)
    • e.g 2 : Switch → Light ← Electricity
      notion image
      notion image
      e.g 3 : Newsworthy → Publish ← Trustworthy
      notion image
  • An inverted fork is not an open path; it is blocked at the collider Z.
  • X & Y are independent if Z is not conditioned on
    • Conversely, conditioning on Z — the collider variable, opens the path. Once the path is open, information flows between X and Y.
      Example : Influenza and chicken pox are independent if fever is not conditioned on
  • Learning X and Z reveals Y.
notion image

Regression and DAG

Say we want to identify a causal effect between X and Y :
notion image

Add or Not Add??

When there is a Collider:
- We DO NOT include the collider variable in the regression.

When there is a FORK:
- We DO include the fork variable in the regression.

When there is a Pipe:
- We DO NOT include the pipe variable in the regression.
If we add the pipe variable W, we bias our causal estimate:
You will see only the direct effect of X on Y.
You will not see the indirect effect captured by W.
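The fork rule above can be checked by simulation: when Z causes both X and Y, and X has no real effect on Y, a regression without Z shows a spurious slope, and adding Z removes it (Python/NumPy sketch with made-up data):

```python
import numpy as np

# Fork: Z causes both X and Y; X has NO direct effect on Y.
rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

def slope_of_x(design):
    # Coefficient of x from a least-squares fit of y on the given design.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

ones = np.ones(n)
naive = slope_of_x(np.column_stack([ones, x]))        # omits the fork Z
adjusted = slope_of_x(np.column_stack([ones, x, z]))  # conditions on Z

print(round(naive, 2), round(adjusted, 2))  # spurious ~0.5 vs ~0.0
```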
notion image
z is the variable
notion image
DAGitty - drawing and analyzing causal diagrams (DAGs)

Week 8 : Regression for prediction

Slide 1 : Bayesian inference
Normalising constant
Incorporating a prior
Credible intervals
Continuous distributions
Comparison with MLE
Conjugate priors
Bayesian regression
Slide 2: Conditional probability and Bayesian methods
  1. Events – Union, Intersection & Disjoint events
  2. Independent, Dependent and Exclusive events (with implementation in R)
  3. Conditional Probability (with implementation in R)
  4. Bayes Theorem (with implementation in R)
  5. Probability trees
  6. Frequentist vs Bayesian definitions of probability
  7. Open Challenges
Slides 3: Bayesian Statistics explained to Beginners in Simple English
  • The drawbacks of frequentist statistics lead to the need for Bayesian Statistics
  • Discover Bayesian Statistics and Bayesian Inference
  • There are various methods to test the significance of the model like p-value, confidence interval, etc
Slides 4: A visual guide to Bayesian thinking
Slides 5 : The Odds, Continually Updated
Mean Square Error

What is MSE?

  • The average of the squared differences between actual and estimated values. (Here)
Description of the graph (Here)
In the diagram, predicted values are points on the line and actual values are shown by small circles.
  • Error in prediction is shown as the distance between the data point and fitted line.
  • MSE for the line is calculated as the average of the squared errors over all data points.
  • For all such lines possible for a given dataset, the line that gives minimal or least MSE is considered as the best fit.
notion image

How is MSE calculated? (Here)

notion image

Here RMSE is the square root of MSE. You can think of RMSE as the typical size of a prediction error, expressed in the same units as the response.


What is Mean Absolute Error (MAE)?

  • The average of the absolute differences between actual and predicted values.

What is R²?

  • The coefficient of determination
  • The variance explained by the model / total variance.
notion image
Comparison between RMSE, MAE, and R²

Interpretation of MSE (Here)

  • To check how close our prediction is to actual values.
  • The lower the MSE, the closer the forecast is to the actual values, and the better the fit. So we want to minimise MSE.
  • Used to compare different regression models.
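The metrics above follow directly from their formulas; a small worked sketch (Python/NumPy, made-up numbers):

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0, 4.6])
predicted = np.array([2.8, 5.3, 2.9, 6.8, 4.5])

errors = actual - predicted
mse = np.mean(errors**2)             # mean squared error
rmse = np.sqrt(mse)                  # same units as the response
mae = np.mean(np.abs(errors))        # mean absolute error
r2 = 1 - np.sum(errors**2) / np.sum((actual - actual.mean())**2)

print(round(mse, 4), round(rmse, 4), round(mae, 4), round(r2, 4))
```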

Week 9 : Classification

Confusion Matrix

notion image
notion image

Interpretation of indicator: (Here)


Accuracy
  • Proportion of times the classifier is correct
This is useful, code found in ETC2420 lecture slide week 9.

Precision (a "no" misclassified as a "yes")

  • To determine the best model when the costs of False Positive is high.
    • For instance, email spam detection.
      • In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted spam).
      • The email user might lose important emails if the precision is not high for the spam detection model.

Recall (a "yes" misclassified as a "no") aka Sensitivity

  • To determine the best model when the cost associated with False Negative is high.
  • Calculates how many of the actual positives our model captures by labelling them as positive (true positives).
    • For instance of fraud detection,
      • If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequence can be very bad for the bank.
    • For instance of sick patient detection,
      • Similarly, in sick patient detection, a sick patient (Actual Positive) goes through the test and is predicted as not sick (Predicted Negative).
      • The cost associated with False Negative will be extremely high if the sickness is contagious.
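From the confusion-matrix counts, accuracy, precision and recall are simple ratios; a tiny worked sketch (made-up counts):

```python
# Confusion-matrix counts for a made-up classifier.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)  # proportion of correct predictions
precision = tp / (tp + fp)                  # of predicted positives, how many are truly positive
recall = tp / (tp + fn)                     # of actual positives, how many we captured

print(accuracy, precision, recall)
```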

Logistic regression

When do we use Logistic regression?

  • When the response variable is binary, i.e. takes one of two values, 0 or 1

Math of Logistic regression: Log-odds

  • The model is linear in the log-odds: log(p / (1 − p)) = β0 + β1x1 + …
  • With regularisation (see the penalty section below), the loss becomes: Sum(Squared Residuals) + λ × (penalty on the coefficients)
(More details can be read from the week 9 lecture slides.)

What if there are a lot of variables?: penalty

Problem :
When there are a lot of variables, multicollinearity is introduced. That means the independent variables / predictor variables in a regression model are correlated, which violates an assumption of regression.
How do you know if your data is of Multicollinearity?
  • i.e. the coefficients are very big.
  • Least-squares estimates are unbiased, but their variances are large; this results in predicted values being far away from the actual values.
  • The above leads to overfitting and instability.
We add a penalty term that discourages large values of the coefficients.
How do you choose the penalty λ?
  • Essentially, we tweak the value of λ to trade off between bias and variance.
    • When λ = 0, the coefficients are the usual estimates (asymptotically unbiased).
    • Increasing λ decreases the βs; when λ is large enough, the coefficients are all zero (no variance). [It's an inverse relationship]

In the code, we change the mixture parameter.

  • When mixture = 0, this is called Ridge regression.
    • The βs are shrunk.
  • When mixture = 1, this is called the Logistic lasso.
    • The βs are shrunk by making the least important ones exactly zero.
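The course fits these with glmnet-style tools in R (mixture corresponds to glmnet's alpha). As a from-scratch sketch of the ridge case (mixture = 0) only (the lasso needs a different optimiser), here is penalised logistic regression by gradient descent on simulated collinear data; everything here (data, λ values, learning rate) is made up:

```python
import numpy as np

def fit_logistic(X, y, lam=0.0, lr=0.1, n_iter=3000):
    """Logistic regression by gradient descent, with an L2 (ridge-style)
    penalty lam * sum(beta^2) on the non-intercept coefficients."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (p_hat - y) / n
        grad[1:] += 2 * lam * beta[1:]   # don't penalise the intercept
        beta -= lr * grad
    return beta

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
prob = 1 / (1 + np.exp(-(x1 + x2)))
y = (rng.uniform(size=n) < prob).astype(float)
X = np.column_stack([np.ones(n), x1, x2])

betas = {lam: fit_logistic(X, y, lam=lam) for lam in (0.0, 0.1, 1.0)}
for lam, b in betas.items():
    print(lam, np.round(b[1:], 2))   # coefficients shrink as lam grows
```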
Resource about Logistic regression
notion image

Week 10 : Using Bayes theorem to infer parameters

  • Working with text
  • Web scraping
  • Grammar of graphics
  • Randomisation to assess structure perceived in plots

Bayes theorem / Bayes' rule
The Beta-Binomial Conjugate Prior

Week 11 : Bayesian regression


Credible interval
In the week 2 lab, we learnt how the sample size of an experiment can affect the variability of the CDF. So for the t-distribution, how does changing the sample size or degrees of freedom (DF) affect the CDF?
Week 3
What is the difference between bootstrapping and the eCDF?
How are modeling and ML different? Why do we need to know the underlying data distribution?
The purpose of both could be different.
ML : used for prediction tasks; the models are quite often black boxes and they literally do not care what the underlying data generation process / distribution(s) are.
Modeling : used for inference; it is very common to form a hypothesis about what the underlying data generation process is, then to collect some data and test that hypothesis with the model.
Confidence Intervals vs Prediction Intervals vs Tolerance Intervals
Difference between Frequentist vs Bayesian
notion image
Regression model (Time Series) (RTS)
The Beta-Binomial Conjugate Prior

Jason Siu
A warm welcome! I am a tech enthusiast who loves sharing my passion for learning and self-discovery through my website.