type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

Definition


  • As the name suggests, LDA is a linear model for classification and dimensionality reduction. It is commonly used as a dimension reduction method when the response is categorical. [vidhya]
  • It starts by finding directions that maximise the separation between classes, then uses these directions to predict the class of individuals. These directions, called linear discriminants, are linear combinations of the predictor variables. [STHDA]
  • LDA MAXIMISES the between-class variance (‘separability’) and MINIMISES the within-class variance. [Here]
  • It finds a linear combination of predictors that maximizes the separation between groups. [Brendi]
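A minimal R sketch (my own illustration on the built-in iris data, not from the cited sources) showing LDA used both as a classifier and as a dimension-reduction step:

library(MASS)

# Fit LDA: Species is the class, the four measurements are the predictors
fit <- lda(Species ~ ., data = iris)

fit$scaling              # the linear discriminants (linear combinations of predictors)
head(predict(fit)$x)     # data projected onto LD1/LD2 (reduced dimensions)
head(predict(fit)$class) # predicted classes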

Theory


Problem:
  1. Logistic regression is a linear classification model that performs well for binary classification but falls short for multi-class problems with well-separated classes. [vidhya]
  2. High-dimensional data. [vidhya]
Solution:
  1. LDA handles both of these quite efficiently. [vidhya]
  2. Like PCA, it can be used to reduce the number of features, which reduces the computing cost significantly. [vidhya]

Assumptions: LDA only works well when

  • The predictors are normally distributed. [vidhya] (i.e., all samples come from normal populations [Slides])
  • Each of the classes has an identical variance-covariance matrix. [vidhya] Remember that the shape of the data is determined by the variance-covariance matrix.
    • The variances and covariances are the same for the y=1 group and the y=0 group. [Here]
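A quick way to eyeball these assumptions (my own addition, shown on iris rather than the olive data used below):

# Compare the group variance-covariance matrices; they should look similar
by(iris[, 1:4], iris$Species, cov)

# Check approximate normality of a predictor within one group, e.g. with a QQ plot
x <- iris$Sepal.Length[iris$Species == "setosa"]
qqnorm(x); qqline(x)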

Advantage

  1. LDA (and QDA) are more stable and will not vary too much when different training samples are used.
  2. LDA is interpretable.

Disadvantage

LDA and QDA are too simple for complicated decision boundaries.

Example

Worked examples with figures can be found in the Brendi and Bredwin notes.

R code

library(tidyverse)
library(tidymodels)
library(discrim)
library(MASS)

# Load the olive oil data, drop the row-id and area columns, and make region a factor
olive <- read_csv("http://ggobi.org/book/data/olive.csv") %>%
  dplyr::select(-`...1`, -area) %>%
  mutate(region = factor(region))

# Standardise variables
olive_std <- olive %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))

# Train/test split, stratified by region
set.seed(775)
olive_split <- initial_split(olive_std, prop = 2/3, strata = region)
olive_train <- training(olive_split)
olive_test  <- testing(olive_split)

# LDA model specification with equal priors, using the MASS engine
lda_mod <- discrim_linear() %>%
  set_engine("MASS", prior = c(1/3, 1/3, 1/3)) %>%
  translate()

olive_lda_fit <- lda_mod %>%
  fit(region ~ ., data = olive_train)
olive_lda_fit
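A short follow-up sketch (my own addition; the held-out test set is otherwise unused above) to predict on the test data and compute accuracy with yardstick:

# Attach class predictions to the test set and compute accuracy
olive_pred <- olive_test %>%
  bind_cols(predict(olive_lda_fit, new_data = olive_test))

accuracy(olive_pred, truth = region, estimate = .pred_class)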

R interpretation

LDA determines the group means and computes, for each individual, the probability of belonging to each of the groups. The individual is then assigned to the group with the highest probability score.
The lda() outputs contain the following elements (the examples below refer to the iris data used in [STHDA]):
  • Prior probabilities of groups: the proportion of training observations in each group. For example, 31% of the training observations are in the setosa group.
  • Group means: the group centre of gravity. Shows the mean of each variable in each group.
  • Coefficients of linear discriminants: shows the linear combinations of predictor variables that are used to form the LDA decision rule. For example, LD1 = 0.91*Sepal.Length + 0.64*Sepal.Width - 4.08*Petal.Length - 2.3*Petal.Width. Similarly, LD2 = 0.03*Sepal.Length + 0.89*Sepal.Width - 2.2*Petal.Length - 2.6*Petal.Width.
  • The proportion of trace: the percentage of separation achieved by each discriminant function.
    • The proportion of between-class variance.
      • For example, LD1 achieves 99.05% of the separability.
Using the function plot() produces plots of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.
Further interpretation and prediction can be read via [STHDA].
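These elements can be pulled straight out of a fitted MASS::lda object; a minimal sketch (my own addition, again on iris rather than the olive fit above):

fit <- MASS::lda(Species ~ ., data = iris)

fit$prior                   # prior probabilities of groups
fit$means                   # group means
fit$scaling                 # coefficients of the linear discriminants
fit$svd^2 / sum(fit$svd^2)  # proportion of trace (between-class variance per LD)

plot(fit)                   # training observations plotted on LD1/LD2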

Math


Bayes' theorem is about probabilities: to decide whether an observation belongs to group 1, group 2, or group 3, we look at its value under each group's density function; whichever group gives the highest density value is the most likely class, and that is what the Bayes rule corresponds to. [Lecture]
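As a worked formula (standard LDA theory, not taken from a specific slide): with prior $\pi_k$ and class-conditional density $f_k(x)$ for class $k$, the Bayes rule assigns $x$ to the class with the largest posterior, and under LDA's equal-covariance normal assumption this reduces to a linear discriminant score:

$$
\hat{y} = \arg\max_k \; \pi_k f_k(x), \qquad
\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k
$$

Assign $x$ to the class $k$ with the largest $\delta_k(x)$; the resulting class boundaries are linear in $x$, which is what makes LDA a linear classifier.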

LDA finds the mean vectors of each class, then finds a projection direction (rotation) that maximizes the separation of the means.
It also takes the within-class variance into account, finding a projection that minimizes the overlap of the class distributions (covariance) while maximizing the separation of the means.
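A compact R sketch of that idea (my own illustration on iris, not from the cited sources): the discriminant directions are the leading eigenvectors of inverse(Sw) %*% Sb, where Sw is the within-class scatter and Sb the between-class scatter.

X <- as.matrix(iris[, 1:4])
y <- iris$Species
overall_mean <- colMeans(X)

Sw <- matrix(0, ncol(X), ncol(X))  # within-class scatter
Sb <- matrix(0, ncol(X), ncol(X))  # between-class scatter
for (k in levels(y)) {
  Xk <- X[y == k, , drop = FALSE]
  mk <- colMeans(Xk)
  Sw <- Sw + crossprod(scale(Xk, center = mk, scale = FALSE))  # sum of (x - mk)(x - mk)'
  Sb <- Sb + nrow(Xk) * tcrossprod(mk - overall_mean)          # nk * (mk - m)(mk - m)'
}

# Directions that maximise between-class relative to within-class variance
ev <- eigen(solve(Sw) %*% Sb)
Re(ev$vectors[, 1])  # first discriminant direction (up to scaling)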
Another source covering the math: [vidhya]
 

FAQ


PCA vs LDA
Component axes here mean directions.
In practice, a PCA is often done first, followed by an LDA, for dimensionality reduction (see the sketch after this list).
  • Very similar; they only differ in that LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes. [TDS]
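A minimal sketch of that PCA-then-LDA pipeline (my own illustration on iris; the number of retained components is an arbitrary choice):

library(MASS)

# PCA first, keeping the leading principal components
pca <- prcomp(iris[, 1:4], scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:3])
scores$Species <- iris$Species

# LDA on the retained PCA scores
lda_fit <- lda(Species ~ ., data = scores)
head(predict(lda_fit)$x)  # discriminant scores (LD1, LD2)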
What is the difference between LDA and logistic regression?
If the classes are well-separated, estimates from logistic regression tend to be unstable.
If there are only a small number of observations, estimates from logistic regression tend to be unstable. In both cases LDA tends to give more stable estimates.
 

Reference


Brendi notes [Brendi]
Slides [Slides]

Extra resources

[STHDA] explains how to run LDA in R
Stanford lecture
Calculation illustrations
Lab
ETC3250 Lab
3550 exam explanation
 