type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

## Principal Components Analysis

Problem:

Imagine the dataset's dimension is

`100 rows * 10 columns`

which you want to use for exploratory data analysis. You instantly think of making 2-d scatterplots, each of which contains the n observations' measurements on two of the features. However, there are p(p − 1)/2 such scatterplots. For example, with p = 10 there are 45 plots!
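The count of pairwise scatterplots grows quadratically with p. A quick sanity check of the p(p − 1)/2 figure (a Python sketch; the post's own code is in R):

```python
from math import comb

# Number of distinct 2-d scatterplots for p features: C(p, 2) = p*(p-1)/2
def n_scatterplots(p):
    return comb(p, 2)

print(n_scatterplots(10))   # 45 plots for p = 10
print(n_scatterplots(100))  # 4950 plots for p = 100
```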

*This is where PCA comes into play!*

Definition:

Principal components do nothing more than find uncorrelated linear combinations of the features that explain variance. The components are just eigenvectors.

**Principal components analysis is useful for**

- Creating a single index

## Combining correlated variables into one variable

Since population and advertising spending are correlated, we can create a linear combination that captures both values in a single explanatory variable.

- Understanding relationships between variables

- Seeing how variables are associated with observations on a single biplot.

- Visualising high-dimensional time series.

### Assumption

- Normality: features are assumed to be approximately normally distributed.

- Linearity: the components are linear combinations of the variables.

- Each PC must be uncorrelated with the preceding PCs (e.g. PC2 with PC1).

### Some terms

**Component loading**

- In multivariate (multiple-variable) space, the correlation between a component and the original variables is called the component loading.

- Think of loadings as correlation coefficients: squaring them gives the amount of explained variation.
- The component loadings therefore tell us *how much of the variation in a variable is explained by the component*.
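The loading-as-correlation idea can be verified numerically. Below is a minimal Python sketch on made-up two-feature data (the post's own code is R): it computes PC1 from the correlation matrix, correlates it with each standardised variable, and squares the result to get the share of each variable's variance explained:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated features (hypothetical data)
x = rng.normal(size=500)
X = np.column_stack([x + 0.3 * rng.normal(size=500),
                     x + 0.3 * rng.normal(size=500)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardise

# PCA via eigen-decomposition of the correlation matrix
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]                  # largest variance first
pc1_scores = Z @ eigvecs[:, order[0]]

# Component loading = correlation between the component and each variable
loadings = np.array([np.corrcoef(pc1_scores, Z[:, j])[0, 1]
                     for j in range(Z.shape[1])])
print(loadings ** 2)  # squared loading = share of each variable's variance explained
```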

### Weighted linear combination

Oftentimes, we want to make a single index to **explain or summarise the variation** of the data from multiple variables. **The use cases are listed below:**

*Marketing*

In surveys we may ask a large number of questions about customer experience, and then create a single overall measure of customer experience.

*Finance*

There may be several ways to assess the credit worthiness of firms. A credit score summarises all the information about the likelihood of bankruptcy for a company.

*Economics*

The development of a country or state can be measured in the Human Development Index, taking Income, Illiteracy, Life Expectancy, Murder Rate, High School Graduation Rate into account.

So, a convenient way to combine variables is through a linear combination (LC)

### Maximise variance

To make a better measure, we want it to be weighted, much like a final WAM being based on 50% assignments and 50% exams.

The index (the mark in the class) should record a LARGE VARIANCE, in order to differentiate the best performing students from the weakest performing students.

The LC with the highest variance is the

**first principal component** of the data; it is **a new variable that explains as much variance as possible in the original variables.**

#### Second Principal Component

Problem: a single index sometimes still oversimplifies the data (its ability to explain the data is low).

Solution: the second principal component is a linear combination that:

- is **uncorrelated** with the first PC.

- has the highest variance out of all LCs that satisfy condition 1.
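Both conditions can be checked numerically. A minimal Python/numpy sketch on synthetic data (the post itself works in R): it extracts the top two eigenvectors of the correlation matrix and verifies that PC1 has the larger variance and that the two score vectors are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]             # sort PCs by variance, descending
w1, w2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

pc1, pc2 = Z @ w1, Z @ w2
print(np.var(pc1) >= np.var(pc2))             # PC1 has the larger variance
print(abs(np.corrcoef(pc1, pc2)[0, 1]))       # essentially zero: uncorrelated
```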

## Why would you want the 2nd PC uncorrelated with the 1st PC?

There is no need for PC2 to explain any variance already explained by PC1, so PC2 and PC1 are constructed to be uncorrelated.

Example : Correlation between a variable and the corresponding PC.

A high (low) weight indicates a strong positive (negative) association between a variable and the corresponding PC.

#### Biplot

A plot that overlays the weight vectors on the scatterplot of the principal components. It lets us:

- see how the `observations` relate to one another (`by viewing the distances between the observations`)
- see how the `variables` relate to one another (`by viewing the angles between the variables`)
- see how the `observations` relate to the `variables` (`draw an extended line for the variable and view how close the observation is to the variable`)

If the variables are ignored, this is identical to a scatter plot of the principal components. The distance between observations implies **similarity** between observations. The angles between variables tell us something about *correlation* (approximately):

- The angle between them is `close to zero` (an acute angle): positively correlated.
- The angle between them is `close to 90 degrees` (a right angle): roughly uncorrelated.
- The angle between them is `close to 180 degrees` (an obtuse angle): negatively correlated.

There are two ways to draw biplots:

**Distance Biplot**

This one mainly looks at the `observations`.

## Example : US

Louisiana (LA) and South Carolina (SC) are close therefore are similar.

Arkansas (AR) and California (CA) are far apart and therefore different.

**Correlation Biplot**

This one mainly looks at the `variables`.

## Example : US

Income and HSGrad are highly **positively correlated**.

LifeExp and Income are close to **uncorrelated**.

Murder and LifeExp are highly **negatively correlated**.

#### Third PC

- uncorrelated with PC1 and PC2.

- has the highest variance out of all LCs that satisfy condition 1.

- cannot be visualised with a (two-dimensional) biplot.

Example: suppose there are 109 macroeconomic variables.

You cannot look at 109 time series plots to visualise general macroeconomic conditions. However, you can look at time series plots of the PCs of these variables.

There are as many principal components as there are variables, and the lecture explains how they are proportionally weighted.

Usually, a small number of principal components can often explain a large proportion of the variance.

For example, 3 PCs may explain 35% of the total variation of 109 variables.

### Implementation of PCA

#### Standardisation

Before conducting PCA, it is important to scale the data if the variables are on different scales; otherwise the weights are influenced by the units of measurement (they are sensitive to the data's units).

Without scaling, a single variable (LifeExp here) can account for as much as 99.9% of the variance.
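The effect of scaling is easy to demonstrate. In the Python sketch below (hypothetical data; the variable names `pop` and `rate` are invented for illustration), the unscaled first weight vector is dominated by the large-unit column, while standardising first makes the weights comparable:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: one column in large units, one in small units
pop = rng.normal(50_000, 10_000, size=300)
rate = rng.normal(5.0, 1.0, size=300)
X = np.column_stack([pop, rate])

def first_pc_weights(M):
    # Weight vector of PC1: eigenvector of the covariance matrix
    # with the largest eigenvalue
    C = np.cov(M, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, np.argmax(eigvals)]

print(first_pc_weights(X))                    # dominated by the large-unit column
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise first
print(first_pc_weights(Z))                    # weights now comparable in size
```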

#### Do PCA in R

Use

`prcomp()`

- The output of the prcomp function is a prcomp object.

- It is a list that contains a lot of information. Of most interest (accessed with `$`) are:
- The principal components: `x`

- The weights: `rotation`

- What is the Cumulative Proportion?
- It is the proportion of the total variance explained by the components so far (higher is better).
- So PC1, PC2, and PC3 together explain 90.7% of the variance in the data.
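The Cumulative Proportion row that R's `summary(pca)` prints can be reproduced by hand: with scaled data, the component variances are the eigenvalues of the correlation matrix. A Python sketch on toy data (the post's own workflow is R):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # centre and scale

# Component variances = eigenvalues of the correlation matrix,
# sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]

prop = eigvals / eigvals.sum()          # "Proportion of Variance" row
cum = np.cumsum(prop)                   # "Cumulative Proportion" row
print(np.round(cum, 3))                 # last entry is always 1.0
```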

**Biplot**

Use

`biplot(pca)`

, where two PCs must be selected. By default, biplot produces the distance biplot. If you want the correlation biplot, try:

`biplot(pca,scale = 0)`

`( scale = 0 ) = Correlation Biplot`

`( scale = 1 ) = Distance Biplot`

How to interpret: the higher the proportion of variance explained by the plotted PCs, the more accurately the biplot represents the data.

**Scree plot**: it is like an elbow chart.

- Along the horizontal axis is the principal component.

- Along the vertical axis is the variance corresponding to each principal component.

- The scree plot indicates `how much each PC explains of the total variance of the data`.

- Look for the part where the plot flattens out, also called the *elbow of the scree plot*.

`screeplot(pca,type="lines")`
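The heights that screeplot draws are just the per-component variances (the eigenvalues). A Python sketch on synthetic data computing them, together with the common "variance greater than 1" cutoff:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Scree-plot heights: variance (eigenvalue) of each PC, largest first
variances = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
print(variances)
print(int(np.sum(variances > 1)))   # PCs kept under the "variance > 1" cutoff
```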

Another measure to select PCs is *Kaiser's Rule*: select all PCs with a variance greater than 1.

#### Significance of loadings (ETC3250)

To see whether a variable's loading is significant enough, we can construct a confidence interval.

**The bootstrap is the way to construct the interval.** It is used to assess whether the coefficients of a PC are significantly different from 0.

## Steps for **bootstrap confidence intervals**

- Generate B bootstrap samples of the data.

- Compute PCA on each sample and record the loadings.

- Re-orient the loadings, by choosing one variable with a large coefficient as the direction base.

- If B = 1000, the 25th and 975th sorted values yield the lower and upper bounds of the confidence interval for each coefficient.

## Code

## Data prep

````{r eval=FALSE}
# devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
aflw <- fetch_player_stats(2020, comp = "AFLW")
save(aflw, file="aflw.rda")
````

## Code for bootstrap

````{r}
library(boot)
library(tidyverse)   # for as_tibble, gather, and the pipe
# The first variable, goals, can be used as the
# indicator of sign because it has a large coefficient
compute_PC2 <- function(data, index) {
  pc2 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,2]
  # Coordinate signs
  if (sign(pc2[1]) < 0)
    pc2 <- -pc2
  return(pc2)
}
# Make sure sign of first PC element is positive
PC2_boot <- boot(data=aflw_av[,4:36], compute_PC2, R=1000)
colnames(PC2_boot$t) <- colnames(aflw_av[,4:36])
PC2_boot_ci <- as_tibble(PC2_boot$t) %>%
  gather(var, coef) %>%
  mutate(var = factor(var, levels=colnames(aflw_av[,4:36]))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025),
            q50 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC2_boot$t0)
````

## Plot

````{r fig.height=4}
PC2_boot_ci %>%
  ggplot() +
  geom_point(aes(x=var, y=t0)) +
  geom_errorbar(aes(x=var, ymin=q2.5, ymax=q97.5), width=0.2) +
  geom_hline(yintercept=c(-1/sqrt(nrow(PC2_boot_ci)),
                          1/sqrt(nrow(PC2_boot_ci))),
             colour="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Predictors") + ylab("Coefficient") +
  geom_hline(yintercept = 0, color = "gray")
````

## Interpretation

To judge whether a variable is significant, check two things; failing either means the variable is insignificant:

1) If the interval or point does not touch the 0 line, the variable is significant.

2) If the interval or point goes beyond the red line, the variable is significant.

So, in this case, on PC2 m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.

### Math

- The sum of squared loadings of each PC equals 1 (each weight vector is a unit eigenvector), given standardised variables.

To understand how to compute the proportion of variance explained (PVE), we need to understand the term total variance: with standardised variables, the total variance equals the number of variables.
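Both facts can be confirmed numerically: each weight vector is a unit eigenvector, and with standardised variables the eigenvalues sum to the number of variables, so the PVE of a component is its eigenvalue divided by that total. A Python sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4)) @ rng.normal(size=(4, 4))   # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# Each weight (loading) vector is a unit eigenvector:
# its squared entries sum to 1
print((eigvecs ** 2).sum(axis=0))

# Total variance of standardised data = number of variables = sum of eigenvalues
pve = np.sort(eigvals)[::-1] / eigvals.sum()
print(pve)   # proportion of variance explained by each PC
```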

### Drawing Scree plot

### FAQ

## PCA vs LDA

Component axes here mean directions.

In practice, a PCA is often done first, followed by an LDA, for dimensionality reduction.

- They are very similar; they only differ in that
**LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes**. [TDS]

## What is a variance-covariance matrix?

A covariance matrix is a convenient way of describing patterns of variability and covariation in the data.

The aim is to understand how the variables of the input data set vary from the mean with respect to each other. In other words, it examines whether there is any relationship between them.

The diagonal elements of the covariance matrix contain the variances of each variable.

- The variance measures how much the data are scattered about the mean.

- The variance is equal to the square of the standard deviation.

Interpretation of the values:

- If positive: the two variables increase or decrease together (correlated).

- If negative: one increases when the other decreases (inversely correlated).
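A small numpy illustration of these properties, using invented variables where `y` moves with `x` and `z` moves against it:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=400)
y = 2 * x + rng.normal(size=400)      # moves with x    -> positive covariance
z = -x + rng.normal(size=400)         # moves against x -> negative covariance

C = np.cov(np.column_stack([x, y, z]), rowvar=False)
print(np.diag(C))      # diagonal: the variance of each variable
print(C[0, 1] > 0)     # x and y increase together
print(C[0, 2] < 0)     # x up, z down
```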

## What is an eigenvalue?

The total amount of variance that can be explained by a given principal component.

## What is the use of an eigenvector?

It gives a direction of the spread of our data.
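The two definitions fit together: projecting the centred data onto an eigenvector of the covariance matrix yields a new variable whose variance is exactly the corresponding eigenvalue. A Python sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data with unequal spread along different directions
X = rng.normal(size=(300, 3)) @ np.array([[2.0, 0, 0],
                                          [1.0, 1, 0],
                                          [0.0, 0, 0.5]])
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)      # eigvals ascending

# Projecting onto an eigenvector gives a variable whose
# variance equals the corresponding eigenvalue
proj = Xc @ eigvecs[:, -1]                # direction of largest spread
print(np.isclose(np.var(proj, ddof=1), eigvals[-1]))
# The eigenvalues together account for all the variance in the data
print(np.isclose(eigvals.sum(), np.trace(C)))
```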

## What is PC?

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.

These combinations are done in a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.

### Reference

Lecture

Lab

Video

Stanford

### Extra resource

## Numerical example

Concept of eigenvectors: *A Step-by-Step Explanation of Principal Component Analysis (PCA)*

**Author:** Jason Siu
**URL:** https://jason-siu.com/article%2F081f9ee4-2392-4600-8019-8fd2027f42b4
**Copyright:** All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!
