Imagine a dataset with 100 rows and 10 columns on which you want to do exploratory data analysis. You instantly think of making 2D scatterplots, each of which contains the n observations’ measurements on two of the features.
However, there are p(p-1)/2 such scatterplots. For example, with p = 10 there are 45 plots!
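As a quick check of that count, the number of pairwise scatterplots is "p choose 2". A minimal sketch (the variable names p and n_plots are just for illustration):

```{r}
# Number of distinct pairwise scatterplots for p features
p <- 10
n_plots <- choose(p, 2)   # same as p * (p - 1) / 2
n_plots                   # 45
```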
This is where PCA comes into play!
Definition :
PCA does nothing more than find uncorrelated linear combinations of the features that explain as much variance as possible. The weights of each component are just an eigenvector of the covariance (or correlation) matrix.
Principal components analysis is useful for
Creating a single index
Combine correlated variables into one variable
Since population and advertising spending are correlated, we can create a linear combination that captures both values in one variable.
Understanding relationships between variables
Seeing how variables are associated with observations on a single biplot.
Visualising high-dimensional time series.
Assumption
Features are ideally roughly normally distributed. (PCA itself only uses variances and covariances, so this is not a strict requirement.)
Linearity: The principal components are assumed to be linear combinations of the original variables.
PC2 has to be uncorrelated with PC1.
Some terms
Component loading
In multivariate (multiple variable) space, the correlation between the component and the original variables is called the component loadings.
Think of them as correlation coefficients: squaring them gives the proportion of variation explained.
Therefore the component loadings tell us how much of the variation in a variable is explained by the component.
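To make the idea concrete, here is a small sketch that computes loadings as correlations. It uses R's built-in state.x77 data as a stand-in (an assumption, not necessarily the data used in the lecture), and the object names X, loadings and loadings_alt are only for illustration.

```{r}
# Five variables from the built-in US states data
X <- state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Component loadings: correlation between each variable and each PC
loadings <- cor(X, pca$x)

# For standardised data the same numbers are weight * sd of the PC
loadings_alt <- sweep(pca$rotation, 2, pca$sdev, `*`)
all.equal(loadings, loadings_alt, check.attributes = FALSE)

# Squared loadings: share of each variable's variance explained by each PC.
# Summed over all PCs they add up to 1 for every variable.
round(loadings[, 1:2]^2, 2)
rowSums(loadings^2)
```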
Weighted linear combination
Oftentimes, we want to make a single index to EXPLAIN / SUMMARISE the variation in the data from multiple variables.
The use cases are listed below:
Marketing
In surveys we may ask a large number of questions about customer experience, and then create a single overall measure of customer experience.
Finance
There may be several ways to assess the creditworthiness of firms. A credit score summarises all the information about the likelihood of bankruptcy for a company.
Economics
The development of a country or state can be measured in the Human Development Index, taking Income, Illiteracy, Life Expectancy, Murder Rate, High School Graduation Rate into account.
So, a convenient way to combine variables is through a linear combination (LC)
Maximise variance
To make a better measure, we want the variables to be weighted. It is like a final WAM that is based 50% on the assignment and 50% on the exam.
The index (like a mark in the class) should have a LARGE VARIANCE, in order to differentiate the best performing students from the weakest performing students.
The LC with the highest variance (among those whose squared weights sum to 1) is the first Principal Component of the data; it is a new variable that explains as much variance as possible in the original variables.
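The following sketch compares the variance of an equal-weight index with the variance of PC1. It again uses the built-in state.x77 data as an assumed stand-in, and the names w_equal, w_pc1, var_equal and var_pc1 are hypothetical. Both weight vectors are normalised to unit length so the comparison is fair.

```{r}
# Standardise five variables from the built-in US states data
Z <- scale(state.x77[, c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")])

# Index 1: equal weights, normalised so the squared weights sum to 1
w_equal <- rep(1, ncol(Z)) / sqrt(ncol(Z))
var_equal <- var(as.numeric(Z %*% w_equal))

# Index 2: the weights of the first principal component
pca <- prcomp(Z)
w_pc1 <- pca$rotation[, 1]
var_pc1 <- var(as.numeric(Z %*% w_pc1))

# PC1 has at least as much variance as any other unit-length weighting
c(equal = var_equal, pc1 = var_pc1)
```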
Second Principal Component
Problem : Sometimes a single index still oversimplifies the data. (Its ability to explain the data is low.)
Solution : The second principal component is a Linear Combination that :
Is uncorrelated with the first PC.
Has the highest variance out of all LCs that satisfy condition 1.
Why would you prefer PC2 to be uncorrelated with PC1?
There is no need for PC2 to explain any variance already explained by PC1, so PC2 is required to be uncorrelated with PC1.
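A quick way to see this in practice is to look at the correlation matrix of the PC scores. This is a minimal sketch on the built-in state.x77 data (an assumed stand-in); score_cor is a hypothetical name.

```{r}
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

# Correlations between the PC scores: 1 on the diagonal, 0 elsewhere
score_cor <- cor(pca$x)
round(score_cor[1:3, 1:3], 10)
```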
Example : Correlation between a variable and the corresponding PC.
A large positive (negative) weight indicates a strong positive (negative) association between a variable and the corresponding PC.
Biplot
A plot that overlays the weight vectors of the variables on the scatterplot of the principal components.
Why?
We want a Biplot to :
See how the observations relate to one another
See how the variables relate to one another
See how the observations relate to the variables
How ?
There are two ways to draw Biplots:
Distance Biplot
See how the observations relate to one another
(By looking at the distances between the observations)
See how the variables relate to one another
(By looking at the angles between the variables)
See how the observations relate to the variables
(Extend the line for the variable and see how close the observation is to it)
Correlation Biplot
Distance Biplot (mainly for looking at observations)
The distance between observations implies similarity between observations
Example : US
Louisiana (LA) and South Carolina (SC) are close therefore are similar.
Arkansas (AR) and California (CA) are far apart and therefore different.
If the variables are ignored this is identical to a scatter plot of principal components.
Correlation Biplot (mainly for looking at variables)
The angles between variables tell us something about correlation (approximately).
Example : US
Income and HSGrad are highly positively correlated.
The angle between them is close to zero (an acute angle).
LifeExp and Income are close to uncorrelated.
The angle between them is close to 90 degrees (a right angle).
Murder and LifeExp are highly negatively correlated.
The angle between them is close to 180 degrees (an obtuse angle).
Third PC
Is uncorrelated with PC1 and PC2.
Has the highest variance out of all LCs that satisfy condition 1.
Three PCs cannot be shown together on a single two-dimensional biplot.
Example : There are 109 macroeconomic variables.
You cannot look at 109 time series plots to visualise general macroeconomic conditions. However, you can look at time series plots of the first few PCs of these variables.
So, there are as many principal components as there are variables. The lecture explains how they are proportionally weighted.
A small number of principal components can often explain a large proportion of the variance.
For example, 3 PCs explain 35% of the total variation of 109 variables.
Implementation of PCA
Standardisation
Before conducting PCA, it is important to scale the data if the variables are measured on different scales. Otherwise the weights are influenced by the units of measurement. (PCA is sensitive to the scale of the data.)
In the example, the difference for LifeExp can be as high as 99.9% if you do not scale.
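The effect of scaling can be sketched with the built-in state.x77 data (an assumed stand-in for the lecture data; prop_raw, prop_std and dominant are hypothetical names). Without scaling, the variable measured in the largest units dominates PC1.

```{r}
pca_raw <- prcomp(state.x77, center = TRUE, scale. = FALSE)  # no scaling
pca_std <- prcomp(state.x77, center = TRUE, scale. = TRUE)   # standardised

# Proportion of total variance explained by each PC
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)
round(rbind(unscaled = prop_raw, scaled = prop_std)[, 1:3], 3)

# Variable with the largest absolute weight on the unscaled PC1
dominant <- names(which.max(abs(pca_raw$rotation[, 1])))
dominant
```

Comparing round(pca_raw$rotation[, 1], 3) with round(pca_std$rotation[, 1], 3) shows how differently the weights are spread across the variables in the two cases.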
Do PCA in R
Use prcomp().
The output of the prcomp function is a prcomp object.
It is a list that contains a lot of information. Of most interest (accessed with $) are :
The principal components : x
The weights : rotation
What is the Cumulative Proportion?
It is the proportion of the total variance explained by all PCs up to and including the current one. (Higher is better.)
So, PC1, PC2, and PC3 together explain 90.7% of the variance in the data.
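A minimal sketch of inspecting a prcomp object and its cumulative proportion, using the built-in state.x77 data as an assumed stand-in (so the numbers will differ from the 90.7% quoted above); imp and cum_by_hand are hypothetical names.

```{r}
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

dim(pca$x)          # scores: one row per observation, one column per PC
dim(pca$rotation)   # weights: one row per variable, one column per PC

# summary() reports the standard deviation, proportion of variance
# and cumulative proportion for every PC
imp <- summary(pca)$importance
imp["Cumulative Proportion", ]

# The same cumulative proportion computed by hand from the variances
cum_by_hand <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_by_hand, 3)
```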
Biplot
Use biplot(pca). Two PCs must be selected; by default these are PC1 and PC2.
By default (scale = 1) biplot produces the correlation biplot. If you want the distance biplot, try :
biplot(pca, scale = 0)
( scale = 0 ) = Distance Biplot
( scale = 1 ) = Correlation Biplot
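To compare the two settings side by side, here is a small sketch on the built-in state.x77 data (an assumed stand-in). In R's biplot method for prcomp objects, scale = 0 plots the observations at their raw PC scores, so distances between points are kept as they are; scale = 1 standardises the scores and stretches the variable arrows by the PC standard deviations, which is what makes the angles between arrows reflect correlations.

```{r}
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

op <- par(mfrow = c(1, 2))                  # two panels side by side
biplot(pca, scale = 0, main = "scale = 0")  # raw scores, unit-length weights
biplot(pca, scale = 1, main = "scale = 1")  # standardised scores (the default)
par(op)                                     # restore plotting settings
```

A different pair of PCs can be selected with the choices argument, for example biplot(pca, choices = c(1, 3)).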
How to interpret :
The higher the proportion of variance explained by the two PCs, the more accurate the biplot.
Screeplot : It is like an elbow chart
Along the horizontal axis is the Principal Component.
Along the vertical axis is the variance corresponding to each Principal Component.
The Scree plot indicates how much of the total variance of the data each PC explains.
Look for the point where the plot flattens out, also called the elbow of the Scree Plot.
screeplot(pca,type="lines")
Another way to select PCs is Kaiser’s Rule: select all PCs with a variance greater than 1 (this assumes the data were standardised).
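Both selection ideas can be sketched together: draw the scree plot, mark the Kaiser threshold, and list the PCs that pass it. The built-in state.x77 data is used as an assumed stand-in, and variances and keep are hypothetical names.

```{r}
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

# Variance of each PC (these are the values on the vertical axis)
variances <- pca$sdev^2

screeplot(pca, type = "lines")
abline(h = 1, lty = 2)        # Kaiser threshold for standardised data

# Kaiser's Rule: keep the PCs whose variance is greater than 1
keep <- which(variances > 1)
keep
```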
Significance of loadings (ETC3250)
To see whether a variable's loading is significant, we can construct a confidence interval.
The bootstrap is one way to construct the interval. It is used to assess whether the coefficients of a PC are significantly different from 0.
Steps of Bootstrap confidence intervals
Generate B bootstrap samples of the data
Compute the PCA on each sample and record the loadings
Re-orient the loadings, choosing one variable with a large coefficient to fix the direction (sign)
If B = 1000, the 25th and 975th sorted values give the lower and upper bounds of the 95% confidence interval for each coefficient of the PC.
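```{r}
library(boot)
library(dplyr)   # %>%, mutate, group_by, summarise
library(tidyr)   # gather
library(tibble)  # as_tibble

# aflw_av is assumed to be a data frame already loaded in the session,
# with the numeric variables for the PCA in columns 4 to 36.

# Statistic for boot(): PC2 loadings computed on one bootstrap sample.
# The first variable, goals, can be used as the
# indicator of sign because it has a large coefficient
compute_PC2 <- function(data, index) {
  pc2 <- prcomp(data[index, ],
                center = TRUE,
                scale. = TRUE)$rotation[, 2]
  # Align signs: make sure the first element of PC2 is positive
  if (sign(pc2[1]) < 0)
    pc2 <- -pc2
  return(pc2)
}

PC2_boot <- boot(data = aflw_av[, 4:36], compute_PC2, R = 1000)
colnames(PC2_boot$t) <- colnames(aflw_av[, 4:36])

# 2.5%, 50% and 97.5% bootstrap quantiles for each loading
PC2_boot_ci <- as_tibble(PC2_boot$t) %>%
  gather(var, coef) %>%
  mutate(var = factor(var,
                      levels = colnames(aflw_av[, 4:36]))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025),
            q50 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC2_boot$t0)  # loadings from the original (non-resampled) data
```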
In practice, a PCA is often followed by an LDA for dimensionality reduction.
The two are very similar; they differ in that LDA uses the class labels and assumes one covariance matrix shared among the classes rather than class-specific covariance matrices. [TDS]
What is a variance-covariance matrix?
A convenient way to express the patterns of variability and covariation in the data.
The aim is to understand how the variables of the input data set vary from the mean with respect to each other. In other words, it examines whether there is any relationship between them.
The diagonal elements of the covariance matrix contain the variances of each variable.
The variance measures how much the data are scattered about the mean.
The variance is equal to the square of the standard deviation.
Interpretation of the off-diagonal values (the covariances):
if positive : the two variables increase or decrease together (positively correlated)
if negative : one increases when the other decreases (inversely correlated)
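These properties are easy to check numerically. A minimal sketch, assuming the built-in state.x77 data as a stand-in (X and S are hypothetical names):

```{r}
X <- state.x77[, c("Income", "HS Grad", "Life Exp", "Murder")]
S <- cov(X)   # variance-covariance matrix
round(S, 2)

# Diagonal elements are the variances, i.e. the squared standard deviations
all.equal(diag(S), apply(X, 2, sd)^2)

# Sign of the off-diagonal elements
sign(S["Income", "HS Grad"])    # positive: they increase together
sign(S["Murder", "Life Exp"])   # negative: one goes up as the other goes down

# Rescaling the covariance matrix gives the correlation matrix
all.equal(cov2cor(S), cor(X))
```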
What is an eigenvalue?
The amount of variance explained by a given principal component.
What is the use of an eigenvector?
It gives a direction of spread in our data (the weights of the corresponding principal component).
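The link between eigenvalues, eigenvectors and prcomp() can be checked directly. A minimal sketch on the built-in state.x77 data (an assumed stand-in; e, same_values and same_vectors are hypothetical names). Eigenvectors are only defined up to sign, so absolute values are compared.

```{r}
# Eigen-decomposition of the correlation matrix
e <- eigen(cor(state.x77))

# PCA on the standardised data
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the principal components
same_values <- all.equal(e$values, pca$sdev^2, tolerance = 1e-6)

# Eigenvectors are the weights (rotation), up to a sign flip per column
same_vectors <- all.equal(abs(e$vectors), abs(unname(pca$rotation)),
                          tolerance = 1e-6, check.attributes = FALSE)

c(values = same_values, vectors = same_vectors)
```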
What is PC?
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.
These combinations are done in a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.