type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

Definition


  1. Partition Clustering (Non-Hierarchical)
      • Divides the data objects into disjoint subsets, where the number of subsets (i.e., k) is pre-defined.
      • Unlike hierarchical methods, which work sequentially by building up from separate clusters or breaking down from one cluster, partition methods assign every object directly to one of the k clusters.
      • In some cases the exact number of clusters is known in advance (e.g., marketing customer segments); partition clustering is a natural fit there.
      • A good clustering is one for which the within-cluster variation is as small as possible.
Knowing what Partition Clustering is, let's explore one of its kinds: K-Means.
  • Each cluster is associated with a centroid
  • Each point is assigned to the cluster with the CLOSEST centroid.

Theory


Assumption


  1. The data lie in a feature space, so dissimilarity between points can be measured with distance metrics such as Manhattan or Euclidean distance.
  2. Each data point can only belong to one cluster.
  3. Each data point is a numeric feature vector; there are no class labels, since clustering is unsupervised.
  4. The number of clusters k is specified in advance, and features should be on comparable scales (standardise them first if they are not).

Mechanism - Here are the steps of K-Means (a base-R sketch follows the list) :


  1. Select k points (at random) as the initial centroids.
  2. Repeat :
    2.1. Form k clusters by assigning every point to the CLOSEST centroid.
    2.2. Re-compute the centroid of each cluster.
  3. Until the centroids don’t change.
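The loop above is Lloyd's algorithm. As a minimal, illustrative sketch (not the original post's code; the function name `kmeans_sketch`, the 1e-8 tolerance, and the lack of empty-cluster handling are my assumptions), it can be written in base R as follows. In practice, use `stats::kmeans()` as in the R code section further down.

```{r}
# Minimal sketch of the K-Means (Lloyd's) loop, for illustration only.
kmeans_sketch <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: pick k points at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # step 2.1: assign every point to its CLOSEST centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    cluster <- max.col(-d)  # row-wise index of the smallest distance
    # step 2.2: re-compute each centroid as the mean of the points assigned to it
    new_centroids <- t(sapply(seq_len(k),
                              function(j) colMeans(x[cluster == j, , drop = FALSE])))
    # step 3: stop when the centroids no longer change (empty clusters not handled in this sketch)
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# usage, assuming the standardised flea_std data frame used later in this post:
# flea_fit <- kmeans_sketch(flea_std[, 2:7], k = 3)
```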

Why and why not K-Means :


Advantages :
  • Simple to implement.
  • Computationally efficient: with a small number of clusters K, it is often faster than hierarchical clustering, even on large datasets (see the comparison further down).
Disadvantages :
The result depends on the initial centroid values, and K-Means struggles when the true clusters have :
  • Different sizes
  • Different densities
  • Non-globular shapes
  • Outliers in the data

Example

[Worked example images omitted: Bredwin, Brendi]

R code

```{r}
set.seed(2021)  # set seed for reproducibility (k-means initialises at different random starts)
flea_km <- stats::kmeans(flea_std[, 2:7], centers = 3)  # flea_std is a numeric data frame
```
Elbow plot
```{r}
# factoextra::fviz_nbclust: helper to visualise & determine the optimal k
factoextra::fviz_nbclust(
  x = tourr::flea %>% select(-species),
  FUNcluster = kmeans,
  method = "wss",   # within-cluster sum of squares
  k.max = 10        # max number of clusters to consider
)
```
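For intuition, the quantity that fviz_nbclust plots can also be computed by hand. This is a sketch (not from the original post), assuming dplyr is loaded for `%>%`, the same `tourr::flea` data as above, and `nstart = 25` as an illustrative choice:

```{r}
# manual elbow computation: total within-cluster SS for k = 1..10
set.seed(2021)
X   <- tourr::flea %>% dplyr::select(-species)
wss <- sapply(1:10, function(k) stats::kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```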
Measure

R interpretation

Comparing with Hierarchical Clustering
This assumes you have already run the following :
```{r}
# --- (II) (III) compute distance matrix + pass it into `stats::hclust` (Ward's method)
flea_hc_w <- stats::dist(flea_std[,2:7]) %>%
  stats::hclust(method = "ward.D2")
```
b. Map the cluster labels from the two results, and calculate the agreement.

Map the cluster labels to the observations from both methods:
```{r}
flea_std <- flea_std %>%
  mutate(cl_w  = stats::cutree(flea_hc_w, k = 3),  # extract cluster solution from hclust; k = 3
         cl_km = flea_km$cluster)                   # extract cluster solution from k-means; k = 3
```
Compute a confusion table to compare both results (note: the cluster labels themselves are arbitrary, only the groupings matter):
```{r}
flea_std %>%
  count(cl_w, cl_km) %>%  # compute agreement
  # pivot into wide form to create the confusion table
  pivot_wider(names_from = cl_w, values_from = n, values_fill = 0) %>%
  rename("cl_km/cl_wards" = cl_km)
```

```{r}
# confusion matrix: true species labels vs fitted k-means clusters
table(actual = flea_std$species, fitted = flea_km$cluster)
```
 

K-Means and SSE

  • Split ‘loose’ clusters, i.e., clusters with relatively HIGH SSE.
    • If you have 100 data points and only 1 cluster, your SSE is at its highest.
  • Merge clusters that are ‘close’ and that have relatively LOW SSE.
    • If you have 100 data points and 100 clusters, your SSE is 0 (a numeric sketch follows this list).
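A quick numeric sketch of these two extremes, using simulated data and the tot.withinss field of `stats::kmeans` (not part of the original post):

```{r}
# SSE (tot.withinss) at the two extremes, on 100 simulated points
set.seed(2021)
pts <- matrix(rnorm(100 * 2), ncol = 2)
stats::kmeans(pts, centers = 1)$tot.withinss    # one big cluster: SSE equals the total variation (highest)
stats::kmeans(pts, centers = 100)$tot.withinss  # every point is its own centroid: SSE = 0
```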

FAQ

What are the advantages and disadvantages of hierarchical & non-hierarchical clustering?
Hierarchical: sequential, either building up from separate clusters (agglomerative) or breaking down from one cluster (divisive).
Non-hierarchical: the number of clusters is chosen a priori.

Advantages of Hierarchical clustering

  • All possible solutions (with respect to the number of clusters) are provided in a single analysis.
  • The solution is structured and can be explored using the dendrogram.
  • No need to assume any particular number of clusters: any desired number can be obtained by ‘cutting’ the dendrogram at the proper level.

Advantages of K-Means Clustering

  • Can explore cluster allocations that will never be visited by the path of hierarchical solutions.
  • With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.
  • K-Means might yield tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages of Hierarchical clustering

  • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
  • Time complexity: not suitable for large datasets
  • Very sensitive to outliers
 

Disadvantages of K-Means Clustering

  • Need to specify the number of clusters (the K value) in advance, which is hard to predict.
  • Initial seeds (centers of each cluster) have a strong influence on the final results
  • Sensitive to scale.
How does each point find its closest centroid?
  • Calculate the "distance" between the point and every centroid,
    • the centroid with the smallest distance is the closest one (a tiny sketch follows below).
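A tiny illustrative sketch of that calculation (the point and centroid values are made up):

```{r}
# distance from one point to each of three hypothetical centroids
point     <- c(2, 3)
centroids <- rbind(c(0, 0), c(5, 5), c(2, 4))
dists     <- sqrt(rowSums(sweep(centroids, 2, point)^2))  # Euclidean distances
which.min(dists)                                          # index of the closest centroid (here: 3)
```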
Is there any pre-processing for K-Means?
Yes, you need to (a sketch follows the list) :
  • Normalise the data
  • Eliminate outliers
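A minimal sketch of both steps, assuming the tourr::flea data from above and an illustrative |z| > 3 outlier rule:

```{r}
# sketch: standardise features and drop crude outliers before clustering
X       <- tourr::flea %>% dplyr::select(-species)
X_std   <- scale(X)                        # normalise: mean 0, sd 1 per column
keep    <- apply(abs(X_std), 1, max) < 3   # simple rule: drop rows with any |z| > 3
X_clean <- X_std[keep, ]
```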
Is there any post-processing for K-Means?
  • Eliminate small clusters that may represent outliers (points that do not fit into any cluster are discarded).
  • SSE measures how tightly packed a cluster is; if a cluster's SSE is too high, split it.
  • Split ‘loose’ clusters, i.e., clusters with relatively high SSE.
  • Merge clusters that are ‘close’ and that have relatively low SSE.
Will the initial centroids influence the final clusters? If so, how do you cope with that?
Yes: the initial seeds have a strong influence on the final result (see the disadvantages above). Typical remedies are to fix a random seed for reproducibility and to run K-Means from several random starts (e.g., the nstart argument of stats::kmeans), keeping the solution with the lowest total within-cluster SS.
Why does K-Means tend to produce a good clustering?
Because the within-cluster variation tends to be small: the algorithm explicitly minimises it.

Math


By partitioning the data into K clusters, we want to make the total within-cluster variation as small as possible.
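The formula image from the original post is not reproduced here; as a standard statement of this objective (assuming Euclidean distance and the usual within-cluster sum of squares), K-Means solves:

$$
\min_{C_1,\ldots,C_K} \; \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
$$

where $\mu_k$ is the centroid of cluster $C_k$. The assignment and re-computation steps in the mechanism above each decrease (or leave unchanged) this total, which is why the loop converges.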
