type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

Definition


  1. Partition Clustering (Non-Hierarchical)
      • Divides the data objects into disjoint subsets, where the number of subsets (i.e., k) is pre-defined.
      • Unlike hierarchical methods, which work sequentially by building up from separate clusters or breaking down from one cluster, partition methods assign every object directly to one of the k clusters.
      • In some cases the exact number of clusters is known in advance (e.g., marketing customer segments); partition clustering is a natural fit there.
      • A good clustering is one for which the within-cluster variation is as small as possible.
Knowing what Partition Clustering is, let's explore one of its kinds: K-Means.
  • Each cluster is associated with a centroid
  • Each point is assigned to the cluster with the CLOSEST centroid.

Theory


Assumption


  1. The data lie in a feature space, so dissimilarity between points can be measured with distance metrics such as Manhattan or Euclidean distance.
  2. Each data point can only belong to one cluster.
  3. Each data point is a numeric feature vector; there are no class labels, since clustering is unsupervised.
  4. The number of clusters k is specified in advance, and features should be on comparable scales (standardise them first if they are not).

Mechanism - Here are the steps of K-Means (a base-R sketch follows the list) :


  1. Select k points (at random) as the initial centroids.
  2. Repeat :
    2.1. Form k clusters by assigning every point to the CLOSEST centroid.
    2.2. Re-compute the centroid of each cluster.
  3. Until the centroids don’t change.
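The loop above is Lloyd's algorithm. As a minimal, illustrative sketch (not the original post's code; the function name `kmeans_sketch`, the 1e-8 tolerance, and the lack of empty-cluster handling are my assumptions), it can be written in base R as follows. In practice, use `stats::kmeans()` as in the R code section further down.

```{r}
# Minimal sketch of the K-Means (Lloyd's) loop, for illustration only.
kmeans_sketch <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: pick k points at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # step 2.1: assign every point to its CLOSEST centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    cluster <- max.col(-d)  # row-wise index of the smallest distance
    # step 2.2: re-compute each centroid as the mean of the points assigned to it
    new_centroids <- t(sapply(seq_len(k),
                              function(j) colMeans(x[cluster == j, , drop = FALSE])))
    # step 3: stop when the centroids no longer change (empty clusters not handled in this sketch)
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# usage, assuming the standardised flea_std data frame used later in this post:
# flea_fit <- kmeans_sketch(flea_std[, 2:7], k = 3)
```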

Why and why not K-Means :


Advantages :
  • Simple to implement.
  • Computationally efficient: with a small number of clusters K, it is often faster than hierarchical clustering, even on large datasets (see the comparison further down).
Disadvantages :
The result depends on the initial centroid values, and K-Means struggles when the true clusters have :
  • Different sizes
  • Different densities
  • Non-globular shapes
  • Outliers in the data

Example

[Worked example images omitted: Bredwin, Brendi]

R code

```{r}
set.seed(2021)  # set seed for reproducibility (k-means initialises at different random starts)
flea_km <- stats::kmeans(flea_std[, 2:7], centers = 3)  # flea_std is a numeric data frame
```
Elbow plot
```{r}
# factoextra::fviz_nbclust: helper to visualise & determine the optimal k
factoextra::fviz_nbclust(
  x = tourr::flea %>% select(-species),
  FUNcluster = kmeans,
  method = "wss",   # within-cluster sum of squares
  k.max = 10        # max number of clusters to consider
)
```
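For intuition, the quantity that fviz_nbclust plots can also be computed by hand. This is a sketch (not from the original post), assuming dplyr is loaded for `%>%`, the same `tourr::flea` data as above, and `nstart = 25` as an illustrative choice:

```{r}
# manual elbow computation: total within-cluster SS for k = 1..10
set.seed(2021)
X   <- tourr::flea %>% dplyr::select(-species)
wss <- sapply(1:10, function(k) stats::kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```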
Measure

R interpretation

Comparing with Hierarchical Clustering
This assumes you have already run the following :
```{r}
# --- (II) (III) compute distance matrix + pass it into `stats::hclust` (Ward's method)
flea_hc_w <- stats::dist(flea_std[,2:7]) %>%
  stats::hclust(method = "ward.D2")
```
b. Map the cluster labels from the two results, and calculate the agreement.

Map the cluster labels to the observations from both methods:
```{r}
flea_std <- flea_std %>%
  mutate(cl_w  = stats::cutree(flea_hc_w, k = 3),  # extract cluster solution from hclust; k = 3
         cl_km = flea_km$cluster)                   # extract cluster solution from k-means; k = 3
```
Compute a confusion table to compare both results (note: the cluster labels themselves are arbitrary, only the groupings matter):
```{r}
flea_std %>%
  count(cl_w, cl_km) %>%  # compute agreement
  # pivot into wide form to create the confusion table
  pivot_wider(names_from = cl_w, values_from = n, values_fill = 0) %>%
  rename("cl_km/cl_wards" = cl_km)
```

```{r}
# confusion matrix: true species labels vs fitted k-means clusters
table(actual = flea_std$species, fitted = flea_km$cluster)
```
 

K-Means and SSE

  • Split ‘loose’ clusters, i.e., clusters with relatively HIGH SSE.
    • If you have 100 data points and only 1 cluster, your SSE is at its highest.
  • Merge clusters that are ‘close’ and that have relatively LOW SSE.
    • If you have 100 data points and 100 clusters, your SSE is 0 (a numeric sketch follows this list).
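A quick numeric sketch of these two extremes, using simulated data and the tot.withinss field of `stats::kmeans` (not part of the original post):

```{r}
# SSE (tot.withinss) at the two extremes, on 100 simulated points
set.seed(2021)
pts <- matrix(rnorm(100 * 2), ncol = 2)
stats::kmeans(pts, centers = 1)$tot.withinss    # one big cluster: SSE equals the total variation (highest)
stats::kmeans(pts, centers = 100)$tot.withinss  # every point is its own centroid: SSE = 0
```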

FAQ

What are the advantages and disadvantages of hierarchical & non-hierarchical clustering?
Hierarchical: sequential, either building up from separate clusters (agglomerative) or breaking down from one cluster (divisive).
Non-hierarchical: the number of clusters is chosen a priori.

Advantages of Hierarchical clustering

  • All possible solutions (with respect to the number of clusters) are provided in a single analysis.
  • The solution is structured and can be explored using the dendrogram.
  • No need to assume any particular number of clusters: any desired number can be obtained by ‘cutting’ the dendrogram at the proper level.

Advantages of K-Means Clustering

  • Can explore cluster allocations that will never be visited by the path of hierarchical solutions.
  • With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.
  • K-Means might yield tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages of Hierarchical clustering

  • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
  • Time complexity: not suitable for large datasets
  • Very sensitive to outliers
 

Disadvantages of K-Means Clustering

  • Need to specify the number of clusters (the K value) in advance, which is hard to predict.
  • Initial seeds (centers of each cluster) have a strong influence on the final results
  • Sensitive to scale.
How does each point find its closest centroid?
  • Calculate the "distance" between the point and every centroid,
    • the centroid with the smallest distance is the closest one (a tiny sketch follows below).
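A tiny illustrative sketch of that calculation (the point and centroid values are made up):

```{r}
# distance from one point to each of three hypothetical centroids
point     <- c(2, 3)
centroids <- rbind(c(0, 0), c(5, 5), c(2, 4))
dists     <- sqrt(rowSums(sweep(centroids, 2, point)^2))  # Euclidean distances
which.min(dists)                                          # index of the closest centroid (here: 3)
```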
Is there any pre-processing for K-Means?
Yes, you need to (a sketch follows the list) :
  • Normalise the data
  • Eliminate outliers
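A minimal sketch of both steps, assuming the tourr::flea data from above and an illustrative |z| > 3 outlier rule:

```{r}
# sketch: standardise features and drop crude outliers before clustering
X       <- tourr::flea %>% dplyr::select(-species)
X_std   <- scale(X)                        # normalise: mean 0, sd 1 per column
keep    <- apply(abs(X_std), 1, max) < 3   # simple rule: drop rows with any |z| > 3
X_clean <- X_std[keep, ]
```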
Is there any post-processing for K-Means?
  • Eliminate small clusters that may represent outliers (points that do not fit into any cluster are discarded).
  • SSE measures how tightly packed a cluster is; if a cluster's SSE is too high, split it.
  • Split ‘loose’ clusters, i.e., clusters with relatively high SSE.
  • Merge clusters that are ‘close’ and that have relatively low SSE.
Will the initial centroids influence the final clusters? If so, how do you cope with that?
Yes: the initial seeds have a strong influence on the final result (see the disadvantages above). Typical remedies are to fix a random seed for reproducibility and to run K-Means from several random starts (e.g., the nstart argument of stats::kmeans), keeping the solution with the lowest total within-cluster SS.
Why does K-Means tend to produce a good clustering?
Because the within-cluster variation tends to be small: the algorithm explicitly minimises it.

Math


By partitioning the data into K clusters, we want to make the total within-cluster variation as small as possible.
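The formula image from the original post is not reproduced here; as a standard statement of this objective (assuming Euclidean distance and the usual within-cluster sum of squares), K-Means solves:

$$
\min_{C_1,\ldots,C_K} \; \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
$$

where $\mu_k$ is the centroid of cluster $C_k$. The assignment and re-computation steps in the mechanism above each decrease (or leave unchanged) this total, which is why the loop converges.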
