Principal Component Analysis (PCA)

Type: Post

Created date: Jun 16, 2022 01:21 PM

Principal Components Analysis

Problem : (From here)

Imagine a dataset with 100 rows and 10 columns on which you want to do exploratory data analysis. You might instantly think of making 2D scatterplots, each of which contains the n observations’ measurements on two of the features.

However, with p features there are p(p - 1)/2 such scatterplots. For example, with p = 10 there are 45 plots!
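
The number of pairwise scatterplots is the number of ways to choose 2 features out of p. A quick check in R (the variable names are only illustrative):

```{r}
p <- 10
# Number of distinct pairs of features, i.e. p * (p - 1) / 2
n_plots <- choose(p, 2)
n_plots
```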

So this is where PCA comes into play!

Definition :

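Principal components analysis does nothing more than find uncorrelated linear combinations of the features that explain as much variance as possible. The weight vectors of the components are just the eigenvectors of the covariance (or correlation) matrix of the features.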

Principal components analysis is useful for

Creating a single index

Combine correlated variables into one variable

Since population and advertising spending are correlated, we can create a linear combination that captures both values in one variable.

Understanding relationships between variables

Seeing how variables are associated with observations on a single biplot.

Visualising high-dimensional time series.

Assumption

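Features are ideally approximately normally distributed. (PCA can be computed without normality, but it only uses means, variances and covariances, which summarise roughly normal data best.)

Linearity: the components are assumed to be linear combinations of the original variables.

PC2 has to be uncorrelated with PC1.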

Some terms

Component loading

In multivariate (multiple variable) space, the correlations between a component and the original variables are called the component loadings.

Think of them as correlation coefficients: squaring them gives the proportion of variation explained.

Therefore the squared component loadings tell us how much of the variation in a variable is explained by the component.
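
As a minimal sketch of this idea, the loadings can be computed as plain correlations between the variables and the PC scores. The built-in state.x77 data is used here only as a stand-in for the US states example (an assumption; the data behind these notes is not shown).

```{r}
# Assumed stand-in data: five columns of the built-in state.x77 matrix
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Component loadings: correlation of every variable with every PC
loadings <- cor(X, pca$x)

# Squared loadings: share of each variable's variation explained by each PC
round(loadings^2, 2)

# Across all PCs, the squared loadings of a variable add up to 1
rowSums(loadings^2)
```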

Weighted linear combination

Oftentimes, we want to make a single index to **explain / summarise the variation** of the data from multiple variables.

The use cases are listed below:

Marketing

In surveys, we may ask a large number of questions about customer experience and then create a single overall measure of customer experience.

Finance

There may be several ways to assess the credit worthiness of firms. A credit score summarises all the information about the likelihood of bankruptcy for a company.

Economics

The development of a country or state can be measured in the Human Development Index, taking Income, Illiteracy, Life Expectancy, Murder Rate, High School Graduation Rate into account.

So, a convenient way to combine variables is through a linear combination (LC).

Maximise variance

To make a better measure, we want it to be weighted. It is like our final WAM being based on 50% assignment and 50% exam.

The index (the mark in the class) should have a LARGE VARIANCE, in order to differentiate the best performing students from the weakest performing students.

The LC with the highest variance, among LCs whose squared weights sum to 1, is the first Principal Component of the data; it is a new variable that explains as much variance as possible in the original variables.
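
The maximum-variance property can be checked numerically. The sketch below compares the variance of PC1 with the variance of 1000 random linear combinations whose squared weights sum to 1. The state.x77 data and the random-weights comparison are illustrative assumptions, not part of the original notes.

```{r}
set.seed(1)
# Assumed stand-in data, standardised so every variable has variance 1
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
Z <- scale(X)
pca <- prcomp(Z)

# Variance of the first principal component
var_pc1 <- var(pca$x[, 1])

# Variances of random linear combinations with normalised weights
random_var <- replicate(1000, {
  w <- rnorm(ncol(Z))
  w <- w / sqrt(sum(w^2))   # squared weights sum to 1
  var(as.vector(Z %*% w))
})

# No random combination should beat PC1
max(random_var) <= var_pc1
```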

Second Principal Component

Problem : Sometimes a single index still oversimplifies the data. (Its ability to explain the data is low.)

Solution : The second principal component is a linear combination that :

1) Is uncorrelated with the first PC.

2) Has the highest variance out of all LCs that satisfy condition 1.

Why would you prefer the 2nd PC to be uncorrelated with the 1st PC?

There is no need for PC2 to explain any variance already explained by PC1, so PC2 is required to be uncorrelated with PC1.
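
A quick way to see the uncorrelatedness is to look at the correlation matrix of the PC scores, which should be the identity matrix up to rounding. This is a sketch on the built-in state.x77 data (an assumed stand-in for the example data).

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Correlations between the PC scores: 1 on the diagonal, 0 elsewhere
pc_cor <- cor(pca$x)
round(pc_cor, 10)
```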

Example : Correlation between a variable and the corresponding PC.

A large positive (negative) weight indicates a strong positive (negative) association between a variable and the corresponding PC.

Biplot

A plot that shows the weight vectors of the variables on top of the scatterplot of the PC scores of the observations.

Why?

We want Biplot to :

See how the observations relate to one another

See how the variables relate to one another

See how the observations relate to the variables

How ?

There are two ways to draw Biplots:

Distance Biplot

See how the observations relate to one another

(By viewing the distances between the observations)

See how the variables relate to one another

(By viewing the angles between the variables)

See how the observations relate to the variables

(Draw an extended line for the variable and view how close the observation is to the variable)

Correlation Biplot

Distance Biplot (mainly for looking at the observations)

The distance between observations implies similarity between observations

Example : US

Louisiana (LA) and South Carolina (SC) are close and therefore similar.

Arkansas (AR) and California (CA) are far apart and therefore different.

If the variables are ignored this is identical to a scatter plot of principal components.
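
The distances read off a distance biplot are ordinary Euclidean distances between PC scores. The sketch below computes them on the PC1-PC2 plane for the two pairs of states mentioned above, using the built-in state.x77 data as an assumed stand-in (the values may differ from the lecture's plot).

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Distances between states on the PC1-PC2 plane (what the biplot shows)
d2 <- as.matrix(dist(pca$x[, 1:2]))
d2["Louisiana", "South Carolina"]
d2["Arkansas", "California"]

# With all PCs, the distances equal those in the standardised data
d_all <- as.matrix(dist(pca$x))
d_std <- as.matrix(dist(scale(X)))
all.equal(d_all, d_std, check.attributes = FALSE)
```

The two-dimensional distances are only an approximation; how good it is depends on how much variance PC1 and PC2 explain.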

Correlation Biplot (mainly for looking at the variables)

The angles between variables tell us something about correlation (approximately).

Example : US

Income and HSGrad are highly positively correlated.

The angle between them is close to zero (an acute angle).

LifeExp and Income are close to uncorrelated.

The angle between them is close to 90 degrees (a right angle).

Murder and LifeExp are highly negatively correlated.

The angle between them is close to 180 degrees (an obtuse angle).
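
To make the link between angles and correlations concrete, the sketch below builds the variable arrows of a correlation biplot (weights scaled by the standard deviation of each PC) and measures the angle between two of them. The helper angle_deg and the use of state.x77 are illustrative assumptions.

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Variable arrows on the PC1-PC2 plane: weights times PC standard deviations
var_arrows <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Angle in degrees between the arrows of two variables (hypothetical helper)
angle_deg <- function(a, b) {
  u <- var_arrows[a, ]
  v <- var_arrows[b, ]
  cos_uv <- sum(u * v) / sqrt(sum(u^2) * sum(v^2))
  acos(max(-1, min(1, cos_uv))) * 180 / pi
}

angle_deg("Income", "HS Grad")     # compare with the correlation below
cor(X)["Income", "HS Grad"]
angle_deg("Murder", "Life Exp")
cor(X)["Murder", "Life Exp"]
```

With only two PCs the cosine of the angle approximates the correlation; using all PCs, the inner products of the scaled weight vectors reproduce the correlation matrix exactly.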

Third PC

Is uncorrelated with PC1 and PC2.

Has the highest variance out of all LCs that satisfy that condition.

Cannot be shown on a two-dimensional biplot together with PC1 and PC2.

Example : There are 109 macroeconomic variables. You cannot look at 109 time series plots to visualise general macroeconomic conditions. However, you can look at time series plots of the first few PCs of these variables.

So, there are as many principal components as there are variables. The lecture explains how they are proportionally weighted.

Usually, a small number of principal components can explain a large proportion of the variance.

For example, 3 PCs explain 35% of the total variation of the 109 variables.

Implementation of PCA

Standardisation

Before conducting PCA, it is important to scale the variables if they are measured on different scales. Otherwise the weights can be influenced by the units of measurement. (PCA is sensitive to the scale of the data.)

In the example, the difference can be as high as 99.9% for LifeExp if you do not scale.
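
A small sketch of the effect of scaling, using the built-in state.x77 data as an assumed stand-in for the example. Income is measured in dollars and has a far larger variance than the other variables, so it should dominate the unscaled PCA.

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]

pca_raw <- prcomp(X)                  # no scaling
pca_std <- prcomp(X, scale. = TRUE)   # variables standardised first

# Proportion of variance explained by each PC, unscaled vs scaled
pve_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
pve_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)
round(pve_raw, 4)
round(pve_std, 4)

# PC1 weights: without scaling, the large-variance variable takes over
round(pca_raw$rotation[, 1], 3)
round(pca_std$rotation[, 1], 3)
```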

Do PCA in R

Use prcomp()

The output of the prcomp function is a prcomp object.

It is a list that contains a lot of information. Of most interest are the following elements (accessed with $) :

The principal components : x
The weights : rotation
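
A minimal sketch of a prcomp call and of the two elements listed above, again on the built-in state.x77 data (an assumed stand-in for the example data):

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

head(pca$x)       # the principal components (scores), one row per state
pca$rotation      # the weights, one column per PC

# The scores are the standardised data multiplied by the weights
scores_by_hand <- scale(X) %*% pca$rotation
all.equal(pca$x, scores_by_hand, check.attributes = FALSE)
```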

What is the Cumulative Proportion?

It is the proportion of the total variance explained by the PCs up to and including the current one. (Higher is better.)

So, PC1, PC2, and PC3 together can explain 90.7% of the variance in the data.
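
The cumulative proportion reported by summary() can be reproduced by hand from the standard deviations of the PCs. A sketch on the built-in state.x77 data (an assumed stand-in, so the numbers will not match the 90.7% quoted above):

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Proportion of variance explained by each PC, and its running total
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
round(cum_pve, 3)

# The same numbers appear in the summary table
summary(pca)$importance
```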

Biplot

Use biplot(pca); two PCs must be selected (by default the first two).

By default (scale = 1), biplot produces the correlation biplot. If you want the distance biplot, try :

biplot(pca, scale = 0)

( scale = 0 ) = Distance Biplot

( scale = 1 ) = Correlation Biplot

How to interpret :

The higher the proportion of variance explained by the two selected PCs, the more accurate the biplot.

Screeplot : It is like an elbow chart

Along the horizontal axis is the Principal Component.

Along the vertical axis is the variance corresponding to each Principal Component.

The scree plot indicates how much of the total variance of the data each PC explains.

Look for the part where the plot flattens out, also called the elbow of the scree plot.

screeplot(pca,type="lines")

Another way to select PCs is Kaiser’s Rule. The rule is to select all PCs with a variance greater than 1 (this applies when the variables have been standardised).
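
Kaiser's Rule is easy to apply once the variances of the PCs are extracted from the prcomp object. A sketch on the built-in state.x77 data (an assumed stand-in for the example data):

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, scale. = TRUE)

# Variance of each PC (these are the heights shown in the scree plot)
pc_var <- pca$sdev^2
round(pc_var, 3)

# Kaiser's Rule: keep the PCs whose variance is greater than 1
keep <- which(pc_var > 1)
keep
```

The threshold of 1 is the variance of a single standardised variable, so a retained PC explains more than any one original variable does.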

Significance of loadings (ETC3250)

To see whether a variable contributes significantly to a PC, we can construct a confidence interval for its loading.

The bootstrap is a way to construct the interval. It is used to assess whether the coefficients of a PC are significantly different from 0.

Steps of Bootstrap confidence intervals

Generate B bootstrap samples of the data

Compute the PCA on each sample and record the loadings

Re-orient the loadings by choosing one variable with a large coefficient to be the direction base

If B = 1000, the 25th and 975th sorted values yield the lower and upper bounds of a 95% confidence interval for each loading of the PC.
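
The steps above can be sketched in base R without the boot package. This is only an illustration under stated assumptions: it uses the built-in state.x77 data, looks at PC1, and picks the variable with the largest absolute PC1 loading as the direction base.

```{r}
set.seed(1)
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
B <- 1000

# Direction base: the variable with the largest absolute PC1 loading
ref <- which.max(abs(prcomp(X, scale. = TRUE)$rotation[, 1]))

boot_loadings <- matrix(NA_real_, nrow = B, ncol = ncol(X),
                        dimnames = list(NULL, colnames(X)))
for (b in seq_len(B)) {
  idx <- sample(nrow(X), replace = TRUE)             # 1. bootstrap sample
  w <- prcomp(X[idx, ], scale. = TRUE)$rotation[, 1] # 2. PCA, record loadings
  if (w[ref] < 0) w <- -w                            # 3. re-orient the sign
  boot_loadings[b, ] <- w
}

# 4. 25th and 975th sorted values give the 95% interval for each loading
ci <- t(apply(boot_loadings, 2, function(v) sort(v)[c(25, 975)]))
colnames(ci) <- c("lower", "upper")
round(ci, 3)
```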

Code

Data prep


```{r eval=FALSE}
# devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
aflw <- fetch_player_stats(2020, comp = "AFLW")
save(aflw, file="aflw.rda")
```

Code for bootstrap


```{r}
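library(boot)
library(tidyverse)  # as_tibble(), gather(), mutate(), summarise(), %>%

# aflw_av is assumed to hold per-player averages computed from aflw
# (that step is not shown here); columns 4:36 are the statistics used.

# Statistic for boot(): the PC2 loadings of one bootstrap resample
compute_PC2 <- function(data, index) {
  pc2 <- prcomp(data[index,], 
                center=TRUE, 
                scale.=TRUE)$rotation[,2]
  # Coordinate signs: the first variable, goals, can be used as the
  # indicator of sign because it has a large coefficient, so make
  # sure the sign of the first PC element is positive
  if (sign(pc2[1]) < 0) 
    pc2 <- -pc2 
  return(pc2)
}

PC2_boot <- boot(data=aflw_av[,4:36], compute_PC2, R=1000)
colnames(PC2_boot$t) <- colnames(aflw_av[,4:36])
# 2.5%, 50% and 97.5% quantiles of each bootstrapped loading
PC2_boot_ci <- as_tibble(PC2_boot$t) %>%
  gather(var, coef) %>% 
  mutate(var = factor(var, 
                      levels=colnames(aflw_av[,4:36]))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025), 
            q50 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC2_boot$t0)  # loadings from the original data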
```

Plot


```{r fig.height=4}
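library(ggplot2)
# Point = loading from the original data, bar = bootstrap interval
PC2_boot_ci %>% ggplot() +
  geom_point(aes(x=var, y=t0)) +
  geom_errorbar(aes(x=var, ymin=q2.5, ymax=q97.5),
                width=0.2) + 
  # Red lines at +/- 1/sqrt(p): the size of each loading if all
  # p variables contributed equally to the PC
  geom_hline(yintercept=c(-1/sqrt(nrow(PC2_boot_ci)),
                          1/sqrt(nrow(PC2_boot_ci))),
             colour="red") +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust=1)) +
  xlab("Predictors") + ylab("Coefficient") +
  geom_hline(yintercept = 0, colour = "gray") 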
```

Interpretation

To judge whether a variable is significant, check two things; if either one fails, the variable is treated as insignificant :

1) The interval or point does not touch the 0 line.

2) The interval or point goes beyond the red lines (drawn at plus and minus 1/sqrt(p), where p is the number of variables).

So, in this case, on PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.

Math

For each PC, the squared loadings (weights) sum to 1.

To understand how to compute the proportion of variance explained (PVE), we need to understand the term total variance (TV):

If the variables are standardised, the variance of each variable = 1. So, TV = # of the variables.
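
Written out as formulas, this is a sketch in notation that is not from the original notes: $\lambda_m$ is assumed to denote the variance of the m-th PC, $X_j$ the j-th variable and $p$ the number of variables.

$$
\mathrm{TV} = \sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{m=1}^{p} \lambda_m,
\qquad
\mathrm{PVE}_m = \frac{\lambda_m}{\mathrm{TV}}
$$

$$
\text{Standardised variables: } \mathrm{Var}(X_j) = 1
\;\Rightarrow\; \mathrm{TV} = p
\;\Rightarrow\; \mathrm{PVE}_m = \frac{\lambda_m}{p}
$$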

Drawing Scree plot

FAQ

PCA vs LDA

Linear Discriminant Analysis (sebastianraschka.com)

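Component axes here mean directions.

In practice, often a PCA is done followed by an LDA for dimensionality reduction.

Both find linear combinations of the features; PCA ignores class labels and maximises variance, whereas LDA uses the class labels and maximises the separation between classes. LDA does not have class-specific covariance matrices, but one shared covariance matrix among the classes. [TDS]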

What is a variance-covariance matrix?

A convenient summary of the variability and covariation of the data.

The aim is to understand how the variables of the input data set vary from the mean with respect to each other. In other words, it examines whether there is any relationship between them.

The diagonal elements of the covariance matrix contain the variances of each variable.

The variance measures how much the data are scattered about the mean.

The variance is equal to the square of the standard deviation.

Interpretation of the off-diagonal values (covariances):

if positive then : the two variables increase or decrease together (positively correlated)

if negative then : one increases when the other decreases (inversely correlated)
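
These properties can be read directly off a covariance matrix in R. A sketch using the built-in state.x77 data (an assumed stand-in for the example data):

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]

# Variance-covariance matrix of the variables
S <- cov(X)
round(S, 2)

# Diagonal elements are the variances, i.e. squared standard deviations
diag(S)
apply(X, 2, sd)^2

# Sign of an off-diagonal element gives the direction of the relationship
S["Income", "HS Grad"]    # positive: the two move together
S["Murder", "Life Exp"]   # negative: one goes up when the other goes down
```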

What is eigenvalue?

The amount of variance explained by a given principal component.

What is the use of eigenvector?

The directions of the spread of our data
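
To connect eigenvalues and eigenvectors with the prcomp output, the sketch below runs an eigendecomposition of the correlation matrix and compares it with a scaled PCA. The state.x77 data is an assumed stand-in for the example data.

```{r}
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]

e <- eigen(cor(X))                 # eigendecomposition of the correlation matrix
pca <- prcomp(X, scale. = TRUE)    # PCA on the standardised variables

# Eigenvalues are the variances of the PCs
e$values
pca$sdev^2

# Eigenvectors are the weights (each column is defined only up to its sign)
round(abs(e$vectors), 3)
round(abs(unname(pca$rotation)), 3)
```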

What is PC?

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.

These combinations are done in a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.