type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • Divide the data objects into a set of NESTED clusters organised as a hierarchical tree
  • The number of clusters does not need to be chosen in advance; any number can be obtained by cutting the dendrogram
  • Visualised as a dendrogram
notion image

Theory


There are 2 types of Hierarchical Clustering (i.e., Agglomerative & Divisive)

  • Agglomerative:
    • At the start, every observation is its own cluster. The most similar clusters are merged step by step until all observations are in one cluster.
    • There are several ways of defining the distance between clusters A and B (the linkage methods discussed below).
  • Divisive:
    • Works in the opposite direction to Agglomerative.
    • At the start, all observations are in one cluster. Clusters are split step by step until each observation is in its own cluster.
    • (It is very slow compared to Agglomerative.)

Types of Hierarchical Clustering


There are two kinds: Agglomerative (合, merge) and Divisive (分, split)
Divisive:
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster contains a single point (or until there are k clusters)
Agglomerative:
  • Works in the opposite direction: start with the points as individual clusters
  • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

Since Agglomerative is the more popular of the two, here is how it works (a small R sketch follows the steps):

  1. Compute the distance matrix; let each data point be a cluster
  2. Repeat:
    2.1. Merge the two closest clusters
    2.2. Update the distance matrix
  3. Until only a single cluster remains
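As a minimal sketch (base R only, with a small made-up matrix whose values don't matter), the merge sequence produced by these steps can be inspected through the merge and height components returned by hclust():

```{r}
# A small made-up data matrix, purely for illustration
set.seed(1)
toy <- matrix(rnorm(10), ncol = 2)

# Step 1: compute the distance matrix (each point starts as its own cluster)
d <- dist(toy)

# Steps 2-3: hclust() repeatedly merges the two closest clusters
hc <- hclust(d, method = "complete")

# Each row of hc$merge records one merge: negative entries are single points,
# positive entries refer to clusters formed in earlier rows
hc$merge
hc$height  # the distance at which each merge happened
```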

There are 4 ways to Define Inter-Cluster Similarity :


  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
Of course, each of these has its own advantages and disadvantages (a sketch of the four distance definitions follows the figure):
notion image
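To make the four definitions concrete, here is a rough base-R sketch (with two tiny made-up clusters A and B) that computes each inter-cluster distance directly from the pairwise point distances:

```{r}
# Hand-computed inter-cluster distances for two tiny made-up clusters
A <- matrix(c(0, 0,
              0, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(3, 0,
              4, 2), ncol = 2, byrow = TRUE)

# All pairwise distances between a point in A and a point in B
pairwise <- outer(seq_len(nrow(A)), seq_len(nrow(B)),
                  Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))

min(pairwise)                             # MIN (single linkage)
max(pairwise)                             # MAX (complete linkage)
mean(pairwise)                            # Group average linkage
sqrt(sum((colMeans(A) - colMeans(B))^2))  # Distance between centroids
```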

Assumption


 

Why and Why not HC?


Advantage
  1. It compensates for a key limitation of K-Means:
      • There is no need to assume any particular number of clusters
        • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
Disadvantage
Only practical for small datasets, since time and memory requirements grow quickly with the number of observations

Example

Brendi
notion image
Bredwin
notion image

R code

```{r}
# Read in data
library(dplyr)

data(flea, package = "tourr")

# Standardise the numeric variables
flea_std <- flea %>%
  mutate(across(where(is.numeric),
                ~ (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)))
```
Data
```{r}
library(ggplot2)
library(ggdendro)    # for ggdendrogram()
library(patchwork)

# Compute distance matrix on the standardised variables
flea_dist <- flea_std %>%
  select(where(is.numeric)) %>%
  dist()

# Hierarchical clustering for each method
flea_hc_ward   <- hclust(flea_dist, method = "ward.D2")
flea_hc_single <- hclust(flea_dist, method = "single")

# View the dendrogram for each method
p1 <- ggdendrogram(flea_hc_ward,   rotate = TRUE, size = 4) + labs(title = "Ward.D2")
p2 <- ggdendrogram(flea_hc_single, rotate = TRUE, size = 4) + labs(title = "Single")
p1 / p2
```
Plot dendrogram
```{r}
# Static dendrogram, then interactive cluster selection with identify()
plot(flea_hc_ward)
identify(flea_hc_ward)

# ggplot-style dendrogram
ggdendrogram(flea_hc_ward, rotate = TRUE, size = 4) + labs(title = "Ward.D2")
```

R interpretation

Grand tour (tourr)
Cut the two dendrograms at the 3-cluster solution, creating new columns containing the cluster labels. Then, using a grand tour, examine the cluster solutions and comment on which one best captures the cluster structure.
```{r}
library(tourr)

# Create cluster memberships for the 3-cluster solutions
flea_std <- flea_std %>%
  mutate(cl_ward   = cutree(flea_hc_ward,   k = 3),
         cl_single = cutree(flea_hc_single, k = 3))

# Grand tour, coloured by each solution
animate_xy(flea_std[, 1:6], col = flea_std$cl_ward)
animate_xy(flea_std[, 1:6], col = flea_std$cl_single)  # shows the chaining of single linkage
```
Ward Linkage
  • The result from Ward's linkage captures the three clusters that we see when we look at this 6D data in a tour.
  • It is quite satisfying to see that it has discovered these clusters based only on interpoint distances.
Single Linkage
  • The three-cluster solution from single linkage leaves all observations except two in one cluster.
  • The two single points placed in separate clusters could be considered outlying in some directions of the 6D space.
  • From the tour, you can see several more observations that might be considered outliers. If the single linkage dendrogram is cut further down, at the 5- or 6-cluster solutions, these observations may also be placed in separate clusters, identifying them as outliers too.
For 2 through 10 clusters compute the cluster statistics, “within.cluster.ss”,“wb.ratio”, “ch”, “pearsongamma”, “dunn”, “dunn2”. What would be the conclusion on the number of clusters based on these metrics?

Brendi usually uses WB ratio and gamma

```{r}
library(fpc)      # for cluster.stats()
library(tidyr)

# Reuse flea_dist from the earlier chunk: the distance matrix on the
# six standardised variables used to fit flea_hc_ward
flea_hc_cl <- NULL
flea_hc_cl_stats <- NULL
for (i in 2:10) {
  cl <- cutree(flea_hc_ward, i)
  x  <- cluster.stats(flea_dist, cl)
  flea_hc_cl <- cbind(flea_hc_cl, cl)
  flea_hc_cl_stats <- rbind(flea_hc_cl_stats,
                            c(i, x$within.cluster.ss, x$wb.ratio, x$ch,
                              x$pearsongamma, x$dunn, x$dunn2))
}
colnames(flea_hc_cl_stats) <- c("cl", "within.cluster.ss", "wb.ratio", "ch",
                                "pearsongamma", "dunn", "dunn2")
flea_hc_cl_stats <- as_tibble(flea_hc_cl_stats)

flea_hc_cl_stats_long <- flea_hc_cl_stats %>%
  pivot_longer(-cl, names_to = "stat", values_to = "value")

ggplot(data = flea_hc_cl_stats_long) +
  geom_line(aes(x = cl, y = value)) +
  xlab("# clusters") + ylab("") +
  facet_wrap(~stat, ncol = 3, scales = "free_y") +
  theme_bw()
```
notion image
notion image
  • wb.ratio : average within-cluster distance / average between-cluster distance
    • preferably low
    • it always drops as clusters are added, so look for large drops
  • pearsongamma : the correlation between the vector of pairwise distances and a 0/1 vector that is 0 when a pair is in the same cluster and 1 when it is in different clusters
    • preferably high (a value near 0 means the clustering bears little relation to the distances)
Further formulas and comments from Di Cook (a hand-computed sketch follows):
  • dunn : minimum separation / maximum diameter. The Dunn index, see Halkidi et al. (2002). Higher is better.
  • dunn2 : minimum average dissimilarity between two clusters / maximum average within-cluster dissimilarity, another member of the Dunn family of indexes. The function help would indicate that dunn2 is read the same way as dunn, higher the better, although this particular example looks like it should be read the opposite way. I would follow the function help and consider it to be read the same way, and ignore dunn2 anyway(!). (Otherwise I would need to go back to the original paper.)
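To make wb.ratio and dunn concrete, here is a hand-computed sketch (assuming flea_dist and the 3-cluster Ward labels cl_ward created in the earlier chunks). The dunn value follows the definition above exactly; the wb.ratio here uses a simple mean over pairs, so it may differ slightly from cluster.stats(), which reweights by observation.

```{r}
# Hand-computed versions of two of the statistics above, straight from the
# distance matrix. Assumes flea_dist and flea_std$cl_ward from earlier chunks.
dmat <- as.matrix(flea_dist)
cl   <- flea_std$cl_ward

pair <- upper.tri(dmat)      # each pair of observations counted once
same <- outer(cl, cl, "==")  # TRUE if the pair is in the same cluster

# wb.ratio: average within-cluster distance / average between-cluster distance
mean(dmat[pair & same]) / mean(dmat[pair & !same])

# dunn: minimum separation between clusters / maximum cluster diameter
min(dmat[pair & !same]) / max(dmat[pair & same])
```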

Dendrogram

What is the dendrogram used in hierarchical clustering?

  • A ‘dendrogram’ shows how the clusters are merged hierarchically (looks like a tree diagram)
    • It decomposes data objects into several levels of nested partitions (tree of clusters)
    • The height represents the distance between clusters
    • A clustering of the data objects is obtained by cutting the dendrogram at the desired level, and then each connected component forms a cluster
  • a visualisation of the Hierarchical Clustering

How to interpret :

  • Think of the distance axis (the y-axis) as measuring a ‘tolerance level’
    • If the distance between two clusters is within the tolerance, they are merged into one cluster.
    • As the tolerance increases, more and more clusters are merged, leaving fewer clusters overall (a sketch of cutting by tolerance follows).
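In code, this tolerance reading corresponds to cutting the tree at a height h rather than asking for k clusters; a small sketch (assuming flea_hc_ward from the R code section, with an arbitrary height of 5 chosen only for illustration):

```{r}
# Cutting the dendrogram by tolerance level (height) vs. by cluster count.
# Assumes flea_hc_ward from the earlier chunk; the height 5 is arbitrary.
cl_h <- cutree(flea_hc_ward, h = 5)  # everything closer than height 5 is merged
cl_k <- cutree(flea_hc_ward, k = 3)  # or ask directly for 3 clusters

table(cl_h)  # cluster sizes at this tolerance level
```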

Viewing Stability in Dendrogram

What is Stability?
Stability refers to how long a cluster survives in the dendrogram: the gap in height between the merge that creates it and the merge that absorbs it.
  • Aka the range of tolerance levels
Changing the tolerance (threshold) changes the clustering, so stability is judged by the range of tolerance levels over which a cluster persists (a sketch follows the illustration).
  • An illustration:
      If the range of tolerance levels for cluster 1 > the range of tolerance levels for cluster 2, then cluster 1 is more STABLE than cluster 2.
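One way to quantify this for whole cluster solutions (a sketch, assuming flea_hc_ward from the R code section) is to measure the gap between the consecutive merge heights that create and destroy each k-cluster solution; a wide gap means a stable solution:

```{r}
# Tolerance range (stability) of each k-cluster solution.
# Assumes flea_hc_ward from the earlier chunk.
h <- sort(flea_hc_ward$height)  # the n - 1 merge heights
n <- length(h) + 1              # number of observations

# Cutting anywhere between h[n - k] and h[n - k + 1] gives k clusters,
# so the width of that interval is the solution's tolerance range.
k <- 2:6
stability <- h[n - k + 1] - h[n - k]
setNames(round(stability, 2), paste0(k, " clusters"))
```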
 

FAQ

How do we measure the distance between clusters in Hierarchical Clustering?
Illustration
For agglomerative clustering
Single Linkage (aka nearest neighbour) (MIN)
The minimum distance between two points, one from each cluster.
First, calculate the distances between all pairs of points across the two clusters.
Then find the minimum of these distances, which becomes the single linkage distance.
This is to find the CLOSEST neighbour, as shown below :
Complete Linkage (aka furthest neighbour) (MAX)
The maximum distance between two points, one from each cluster.
First, calculate the distances between all pairs of points across the two clusters.
Then find the maximum of these distances, which becomes the complete linkage distance.
(For the pros and cons of Single vs Complete Linkage, see the FAQ entries on chaining and crowding below.)
This is to find the FURTHEST neighbour, as shown below :
notion image
Average Linkage (AVG)
The distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster.
Avg of all pairwise distances :
notion image
Centroid Method
The distance between the centroids (means) of the two clusters
notion image
Ward's algorithm

Why

The most suitable method for quantitative variables.
Problem : variance and distance are linked
  • Normally the distance between clusters is SMALL where the variance is tiny, but when the distance between clusters is large, the variance is large as well.
Solution :
  • Ward’s method merges the two clusters that minimise the increase in within-cluster variance.

How

  • Instead of measuring the distance directly, it analyses the variance of the clusters.
  • Ward’s method says that the distance between two clusters, A and B, is how much the total sum of squares will increase when we merge them (a sketch follows the figure).
notion image
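As a sketch of this criterion (base R, with two small made-up clusters; this illustrates the definition rather than the Lance-Williams update that hclust uses internally), the Ward distance between A and B is the increase in total within-cluster sum of squares caused by merging them:

```{r}
# Ward's distance as the increase in within-cluster sum of squares.
# A and B are small made-up clusters for illustration only.
ess <- function(x) sum(scale(x, scale = FALSE)^2)  # sum of squares about the centroid

set.seed(2)
A <- matrix(rnorm(10), ncol = 2)            # 5 made-up points
B <- matrix(rnorm(8, mean = 3), ncol = 2)   # 4 made-up points

# How much the total sum of squares grows if A and B are merged
ess(rbind(A, B)) - (ess(A) + ess(B))
```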
The difference amongst MIN, MAX and AVG
Is there a correct number of clusters? If not, how do we determine it?
NO! But generally, DO NOT :
  • choose too many clusters (driven by the desire for homogeneity within clusters):
    • A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
  • choose too few clusters (driven by the desire for parsimony):
    • If you choose the 1-cluster solution there is no point in doing clustering at all.
 
How can we know when to use Single Linkage or Complete Linkage? (Pros and Cons)
For Single linkage: the main disadvantage is chaining (see the next question).
Disadvantage of Complete Linkage: it avoids chaining but suffers from crowding.
What is Chaining? (found on Cons of Single Linkage in ETF3500 p.56)
Chaining is a problem that occurs when clusters are not compact and some observations in the same cluster are far away from one another.
notion image
What is an in-lier?
A point that lies between the major clusters of data. In-liers affect some linkage methods, e.g. single linkage, which will tend to "chain" through the data, grouping everything together (a sketch follows).
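Here is a small sketch of that effect (base R, made-up one-dimensional data: two tight groups joined by a loose chain of bridge points); single linkage chains through the bridge while complete linkage does not:

```{r}
# Chaining under single linkage, on made-up 1-D data:
# two tight groups (near 0 and near 10) joined by a loose chain of bridge points.
x <- c(0, 0.2, 0.4,  2, 3, 4, 5, 6, 7, 8,  10, 10.2, 10.4)
d <- dist(x)

cl_single   <- cutree(hclust(d, method = "single"),   k = 2)
cl_complete <- cutree(hclust(d, method = "complete"), k = 2)

# Single linkage chains through the bridge: one cluster stretches from 0 to 8
# while the other holds only the three points near 10.
split(x, cl_single)

# Complete linkage produces a more balanced split instead.
split(x, cl_complete)
```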
What is the difference between the centroid and average linkage method?
  • In average linkage:
      1. Compute the distances between all pairs of observations across the two clusters
      2. Average these distances
Illustration
This method involves looking at the distances between all pairs and averaging all of these distances.
  • In the centroid method:
      1. Average the observations to obtain the centroid of each cluster
      2. Find the distance between the centroids
Illustration
This involves finding the mean vector location for each of the clusters and taking the distance between the two centroids.
What is Stability?
See the section ‘Viewing Stability in Dendrogram’ above.
 
What is Robust? Is Single Linkage (MIN) Robust? (only in Hierarchical)
  • Robust means the analysis does not change dramatically, even when a single observation is added.
  • Is Single Linkage (MIN) robust? No:
    • In this instance, the new observation was not even an outlier but an in-lier, yet it changed the single linkage solution.
    • Methods that are not affected by single observations are often called robust.
What are the linkages besides Single and Complete? Why do we prefer using them to SL or CL?
Average Linkage
Centroid method
Ward’s Method
They are preferred because they are more robust (i.e. less sensitive to outliers)
Complete linkage avoids chaining, but suffers from crowding
What is centroid?
The centre (mean vector) of a cluster
When do we use hierarchical and non-hierarchical cluster techniques
Although both types of algorithm can be used in an analysis, hierarchical methods are suited to small data sets (since the dendrogram can be more easily interpreted), while non-hierarchical methods are well suited to large data sets.
What are the advantages and disadvantages of hierarchical & non-hierarchical clustering
Hierarchical: sequential, either building up from separate clusters (agglomerative) or breaking down from one cluster (divisive).
Non-hierarchical: the number of clusters is chosen a priori.

Advantages of Hierarchical clustering

  • All possible solutions (with respect to the number of clusters) are provided in a single analysis
  • It is structured and can be explored using the dendrogram.
  • Don’t need to assume any particular number of clusters. Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

Advantages of K-Means Clustering

  • Can explore cluster allocations that will never be visited by the path of hierarchical solutions.
  • With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.
  • K-Means might yield tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages of Hierarchical clustering

  • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
  • Time complexity: not suitable for large datasets
  • Very sensitive to outliers
 

Disadvantages of K-Means Clustering

  • Need to specify the number of clusters (the K value) in advance, which is hard to choose.
  • Initial seeds (centers of each cluster) have a strong influence on the final results
  • Sensitive to scale.

Rand Index

What is Rand Index

  • The probability of picking two observations at random that are in agreement.
  • Lies between 0 and 1 and higher numbers indicate agreement.
  • Expresses what proportion of the cluster assignments are ‘correct’.
Problem : Even if observations are clustered at random, there will still be some agreement due to chance.

Solution — Adjusted Rand Index

  • Can use adjustedRandIndex() from the mclust package (a sketch follows the interpretation)
Interpretation
  • 0 = the level of agreement is what would be expected if the clustering were done at random.
  • 1 = the two clustering solutions are in perfect agreement. (Good)
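As a sketch (assuming the cl_ward and cl_single labels created in the R code section, and using the species column of the original flea data as a reference partition):

```{r}
library(mclust)

# Agreement between the two hierarchical solutions
adjustedRandIndex(flea_std$cl_ward, flea_std$cl_single)

# Agreement of the Ward solution with the known flea species
adjustedRandIndex(flea_std$cl_ward, flea$species)
```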
Can the Rand Index be the same if solution A is a three-cluster solution and solution B is a two-cluster solution, given that they come from the same dataset?
No.
 

Math


Reference


Extra Resource


Practice
KNN
Subset selection