
Basic functions


In week 3, we learnt some basic functions :
by
Apply a function to a data frame split by factors
do.call
Constructs and executes a single function call from a function and a list of arguments (instead of writing the calls out yourself in a loop)
These are functions you should mostly already understand:
as.table
converts the output format to a table
as.data.frame
Coerce the previous output into a data frame
colnames
assigns new column names to a dataframe.
Merging (could use cbind() if the two data frames are already row-aligned)
merge(Sepal.cor, Petal.cor, by = "Species") merges the two data frames using the common column "Species"
Making new column
Deleting columns
which.max
which.max(iris[, 3]) finds the row index of the longest petal
rbind
Stacks data frames (or matrices) with the same columns on top of each other (appends rows)
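A minimal sketch (not the original lecture code) tying these functions together on the built-in iris data; the object names Sepal.cor and Petal.cor are assumptions that mirror the merge() example above:

# by(): apply a function to iris split by the Species factor
Sepal.cor <- by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))
Petal.cor <- by(iris, iris$Species, function(d) cor(d$Petal.Length, d$Petal.Width))

# do.call(): call one function (here data.frame) with a list of arguments built in code
Sepal.df <- do.call(data.frame,
                    list(Species = names(Sepal.cor), Sepal.cor = as.numeric(Sepal.cor)))
Petal.df <- data.frame(Species = names(Petal.cor), Petal.cor = as.numeric(Petal.cor))

# colnames() renames columns; merge() joins on the shared "Species" column
# (cbind() would also work here because the rows are already aligned)
colnames(Petal.df) <- c("Species", "Petal.cor")
both <- merge(Sepal.df, Petal.df, by = "Species")

# Make a new column, then delete it again
both$ratio <- both$Sepal.cor / both$Petal.cor
both$ratio <- NULL

# which.max(): row of iris with the longest petal; rbind() stacks rows
iris[which.max(iris[, 3]), ]
rbind(head(iris, 2), tail(iris, 2))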
 

 
 
 
 

Data Visualisation


In week 4, we learnt about

- How to plot a graph using ggplot

The difference between qplot () and ggplot () is
  • The syntax of qplot() is similar to plot(), making it easier to transition to ggplot().
The following code shows how similar qplot(), ggplot() and plot() are:
#make a plot using base
plot(x=diamonds$carat, y=diamonds$price, type="p")
title(main="I'm a base plot")

#make a "quick" plot using ggplot2
qplot(x=carat, y=price, data=diamonds, geom="point") + ggtitle("I'm a qplot")

#the full ggplot() equivalent
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("I'm a ggplot")

- Elements of a graph

If you are interested in the different types of graphs, search for the "Visualisation Zoo" and explore it.
 
 

Dirty data


What is dirty data ?

Incorrect data:
Values do not adhere to their domain (the set of valid values).
e.g. a month cannot be 13 or above, as its domain is 1 to 12.
 
Inaccurate data
A data value can be correct without being accurate.
e.g, the state code "VIC" and the city name "Sydney" are both correct, but when used together (such as Sydney, VIC), the state code is wrong because Sydney is in NSW.
Business rule violations:
Another type of inaccurate data value is one that violates business rules.
e.g a start date should always precede a finish date.
Inconsistent data:
Uncontrolled data redundancy results in inconsistencies.
e.g JASON CHING YUEN, Siu Ching Yuen, Jason Siu
All of which mean the same person
Incomplete data:
Often we do not gather data requirements from downstream information consumers (e.g. the marketing department).
Hence, we need to find the most important data elements:
e.g. 1 — marketing department:
Gender, Customer code or Postcode might not be captured at all, or only haphazardly.
e.g. 2 — building a system for the lending department of a bank:
Initial Loan Amount, Monthly Payment Amount and Interest Rate.
Non-integrated data:
Most organisations store data redundantly and inconsistently across many systems, which were never designed with integration or analytics in mind.
e.g. customer data may exist on two or more outsourced systems under different customer numbers with different spellings of the customer name and even different phone numbers or addresses.

What is tidy data ?

Tidy data refers to data that has variables organized in columns and observations in rows.
When you clean data by wrangling it, you are applying the principles of tidy data. All the commands we have learned during the semester (filter, mutate, summarise, etc.) aim at getting your data into a tidy format.
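A small hedged sketch of that idea using tidyr's pivot_longer() (the data frame and column names below are made up for illustration):

library(tidyr)
library(dplyr)

untidy <- data.frame(country = c("AUS", "NZ"),
                     `2019` = c(10, 4), `2020` = c(12, 5), check.names = FALSE)

tidy <- untidy %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "year", values_to = "cases")   # one variable per column,
                                                          # one observation per row
tidy %>% filter(year == "2020") %>% summarise(total = sum(cases))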
 
 

Regression


You can refer to the notes I have written before

For regression, especially multiple one, you need to understand how to :
  • determine the best attributes
  • interpret if the model is good or not based on the diagnostic
  • predict or forecast the results
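A minimal sketch of that workflow on the built-in mtcars data (the chosen predictors are only illustrative, not from the unit):

# multiple regression, diagnostics and prediction
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)                        # coefficients, R-squared: is the model any good?
plot(fit)                           # diagnostic plots (residuals vs fitted, QQ, leverage)
step(lm(mpg ~ ., data = mtcars))    # one way to search for a better set of attributes

predict(fit, newdata = data.frame(wt = 3, hp = 110),
        interval = "prediction")    # forecast a new observation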
Things you always forget :
  • Price prediction
Network
Machine Learning
Decision tree
  • Components include:
    • Leaf nodes
    • Branches
The following methods are used to enhance the performance of models.

Cross validation


There are many resampling approaches, like bootstrapping and CV.

Problem :

One can build a model with 100% accuracy on the training dataset, but it may fail to generalise to unseen data. So it is not a good model, given that it is overfitting the training set.

Solution :

  • Cross-validation is one method to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.
  • In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate.
  • In typical cross-validation, the training and validation sets must cross-over in successive rounds such that each data point has a chance of being validated against.
English explanation
K-fold Cross validation

What

  • k-fold cross validation splits the dataset into k different (disjoint) subsets of approximately the same size, and uses in turn one of the subsets for estimating the generalisation error and the others for training the system (in your case a K-NN classifier).
  • If you are performing 2-fold cross validation, you are dividing your dataset into two halves, and first, you use the first one for training and the second for testing, and then you use the second for training and the first for testing.

Why

  • Less computationally expensive than LOOCV.
  • Used to protect against overfitting
  • helps lower variance in small data sets
  • The test error is then estimated by averaging the  resulting MSE estimates.
 

How

Let's say K = 5
Training on k − 1 partitions means that 4 pieces (80% of the data) go into the training set; the remaining 1 piece (20% of the data) is used as the test set.
Stratification ensures the classes are represented across partitions.
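A hand-rolled sketch of the same idea in base R (5 folds and a linear model on mtcars are purely illustrative choices; no stratification here):

set.seed(2021)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold labels, no replacement

mse <- sapply(1:k, function(i) {
  train <- mtcars[fold != i, ]        # k - 1 partitions (~80% of the data) for training
  test  <- mtcars[fold == i, ]        # the remaining partition (~20%) for testing
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, test))^2)
})
mean(mse)                             # CV estimate of the test MSE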

Some remark :

  1. The splitting process is done without replacement, so each observation appears in the test set exactly once; in all other folds it is part of the training set.
    For example, say observation A (the 15th row) falls into one fold. That row is predicted on only once (when its fold is the test set), and it is part of the training set at all other times.
  2. Usually k = 5 or 10; for small datasets we can use higher values of k, so that there is a good compromise in the bias-variance tradeoff.
  3. When k = 5, 20% of the data is held back as the test set each time; when k = 10, 10% is held back each time, and so on.
LOOCV means k = the number of observations in the dataset.
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
 
Chinese explanation of cross-validation (translated)
https://zhuanlan.zhihu.com/p/67986077
Another compromise approach is called K-fold cross-validation. The difference from LOOCV is that each test set no longer contains just one observation but several; the exact number depends on the choice of K. For example, if K = 5, the steps of five-fold cross-validation are:
1. Split the whole dataset into 5 parts.
2. Without repetition, take one part as the test set each time, train the model on the other four parts, and compute the model's MSE on that test set.
3. Average the 5 MSE values to obtain the final cross-validated MSE.
Comparison with validation set approach (Here)
The good thing about the validation set approach is that it is conceptually easy to grasp and easily implemented, since you simply partition the existing training data into two sets. This can be useful in industry when explaining to stakeholders how models were tested. The validation set approach also has a computational advantage.
However, there are 2 drawbacks: the test error estimate can be highly variable depending on which observations end up in the validation set, and because the model is fit on only a subset of the data, the validation error tends to overestimate the test error.
Comparison with LOOCV approach
Good things about K-fold:
1) k-fold CV is the same as LOOCV when k = n. However, in a situation where, for example, k = 10 and n = 10,000, k-fold CV will fit 10 models whereas LOOCV will fit 10,000. That is, LOOCV is the most computationally intense method, since the model must be fit n times.
2) Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate; LOOCV has a higher variance, but lower bias.
More detail
The LOOCV cross-validation approach is a special case of k-fold cross-validation in which k = n. This approach has two drawbacks compared to k-fold cross-validation.
First, it requires fitting the potentially computationally expensive model n times, compared to k-fold cross-validation which requires the model to be fitted only k times.
Second, the LOOCV cross-validation approach may give approximately unbiased estimates of the test error, since each training set contains n − 1 observations; however, this approach has higher variance than k-fold cross-validation (since we are averaging the outputs of n fitted models trained on an almost identical set of observations, these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than less correlated ones).
So, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically using k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Further discussion is here.
 
Steps to conduct 10-fold cross-validation on the model fitting and estimate the test classification error
Code
set.seed(2021)
spam_folds <- vfold_cv(data = spam, strata = spam) #<<

compute_fold_error <- function(split) {
  train <- analysis(split)
  test <- assessment(split)
  fit <- logistic_mod %>%
    fit(spam ~ `day of week` + `time of day` + domain, data = train)
  test_pred <- augment(fit, test)
  error <- tibble(error = 1 - metrics(test_pred, truth = spam,
                                      estimate = .pred_class)$.estimate[1]) %>%
    rsample::add_resample_id(split = split)
  return(error)
}

kfold_results <- map_df(spam_folds$splits, ~compute_fold_error(.x))
kfold_results %>% summarise(mean(error))
# mean(error)
# 1 0.1471407
tute 4 ETC3250
Steps
Make a function to contain such steps in sequence:
  1. Use fit() to fit the desired model on the training split.
  2. Use augment() to generate predictions on the test split.
  3. Calculate the error as 1 minus the accuracy.
Then use a for loop / map_df() to repeat that function over every fold.

 
 

Ensemble method


It is the creation of a better classifier from a collection of weaker classifiers (團結就是力量, "unity is strength").
There are a few ensemble methods we learnt, namely Bagging and Boosting.
Assumption :
  • The individual classifiers are moderately (> 50%) accurate.
  • Individual classifiers are created independently.
  • Pooling the results of each classifier reduces the variance of the overall classification.
  • Decision trees work well as the individual classifiers
  • Disadvantage: model produced is not as easy to interpret as a single tree.
 

Bagging (Bootstrap Aggregation)


Bagging is a method to decrease variance and avoid over-fitting. It works well for high-variance machine learning algorithms, typically decision trees.
Bagging is nothing but SAMPLING WITH REPLACEMENT, using the same sample size of your distribution.
  • What is SAMPLING WITH REPLACEMENT ?
    • (That means same observation can occur more than once in the bootstrap data set. )
Each bootstrap replicate may have multiple instances of the original data points, and contains approximately 63% of the original data set.
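A quick sketch that checks the ~63% claim by drawing one bootstrap replicate of row indices (n is an arbitrary choice):

set.seed(1)
n <- 10000
boot_idx <- sample(n, replace = TRUE)   # sampling WITH replacement, same size n
length(unique(boot_idx)) / n            # roughly 0.632 (= 1 - 1/e) distinct rows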

The way it works :


For example, in the samples drawn from the original set (called replicates here), the same observation can appear more than once (sampling with replacement).
Step 2: Construct a single classifier for each replicate.
Step 3: Combine the classifiers by taking a majority vote to produce the final decision (see the sketch below).
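A minimal bagging sketch of those three steps, using rpart trees on a two-class subset of iris (the subset, the number of replicates B and the tree settings are my own illustrative choices):

library(rpart)

df <- droplevels(subset(iris, Species != "setosa"))
set.seed(1)
B <- 25
trees <- lapply(1:B, function(b) {
  idx <- sample(nrow(df), replace = TRUE)                 # step 1: bootstrap replicate
  rpart(Species ~ ., data = df[idx, ], method = "class")  # step 2: one classifier per replicate
})

# step 3: combine the classifiers by majority vote
votes <- sapply(trees, function(t) as.character(predict(t, df, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == as.character(df$Species))             # training accuracy of the ensemble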

FAQ for Bagging :
When to use bagging
  • Useful when there is noise in the data.
  • Useful for unstable classifiers – that is, small changes in the training data cause large changes in the classifier; Unstable classifiers include decision trees, neural networks, linear regression.
  • Not recommended for stable classifiers such as K Nearest Neighbours, Naïve Bayes.
 

Boosting


How to perform Boosting

  • Assign equal weights to each point in the training set; fit a basic tree.
  • Repeat n iterations: update the weights of misclassified items and normalise; update the tree, building on the current tree.
  • Output the final classifier as a weighted sum of votes from each tree (a hand-rolled sketch follows).
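A hand-rolled AdaBoost-style sketch of that loop, using decision stumps via rpart on a two-class subset of iris; the data, predictors and the standard AdaBoost weight-update formulas are my own choices, not taken from the unit's slides:

library(rpart)

df <- droplevels(subset(iris, Species != "setosa"))   # two-class toy problem
n  <- nrow(df)
w  <- rep(1 / n, n)                                   # step 1: equal weights
M  <- 10
stumps <- vector("list", M); alpha <- numeric(M)

for (m in 1:M) {                                      # step 2: repeat M iterations
  stumps[[m]] <- rpart(Species ~ Petal.Length + Petal.Width, data = df,
                       weights = w, method = "class",
                       control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  pred <- predict(stumps[[m]], df, type = "class")
  miss <- as.numeric(pred != df$Species)              # 1 = misclassified
  err  <- sum(w * miss) / sum(w)
  alpha[m] <- 0.5 * log((1 - err) / err)              # vote weight for this tree
  w <- w * exp(alpha[m] * miss); w <- w / sum(w)      # up-weight misclassified, normalise
}

# step 3: final classifier = weighted sum of votes from each tree
lvl   <- levels(df$Species)
votes <- sapply(1:M, function(m)
  ifelse(predict(stumps[[m]], df, type = "class") == lvl[2], alpha[m], -alpha[m]))
final <- ifelse(rowSums(votes) > 0, lvl[2], lvl[1])
mean(final == df$Species)                             # training accuracy of the ensemble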
 
Bagging and Boosting are similar in that they are both ensemble techniques, where a set of weak learners is combined to create a strong learner that obtains better performance than a single one. But you can think of Boosting as an advanced version of Bagging.

Here are the differences between the two (this link is very useful):

  1. How we sample for # of replicate
  2. How we model
  3. How we produce the result

1. How we sample for # of replicate


In the case of Bagging, any element has the same probability (1/N, where N is the number of observations) of appearing in a new data set. However, for Boosting the observations are weighted, and therefore some of them will take part in the new sets more often.
 
 

 

2. How we model


The training stage of the two models is different; parallel for Bagging (i.e., each model is built independently) and sequential for Boosting.
 
It is sequential in a sense that each classifier is trained on data, taking into account the previous classifiers’ success.
After each training step, the weights are redistributed. Misclassified data (the false positives and false negatives) gets its weights increased to emphasise the most difficult cases. In this way, subsequent learners will prioritise them during their training.

3. How we produce the result


The way they produce the result is different: a plain average for Bagging and a weighted average for Boosting.
In Bagging the result is obtained by averaging the responses of the N learners (or majority vote).
However, Boosting assigns a second set of weights, this time for the N classifiers, in order to take a weighted average of their estimates.
 

 
The way Boosting learns is via evaluation: a learner with a good classification result on the training data is assigned a higher weight than a poor one.
 
That's why the Boosting algorithm allocates weights to each resulting model.
 

So, in short, Boosting tries to achieve better accuracy and Bagging tries to reduce variance and avoid over-fitting

Random forest


Random forest is bagging applied to decision trees with one extra twist: at each split only a random subset of the predictors is considered, which de-correlates the individual trees and usually improves on plain bagging.

Clustering algorithm


  • Given a set of data points with a set of attributes, the algorithm will find groups of objects such that
    • objects within the same cluster are similar to one another,
    • and objects in different clusters are dissimilar.
  • Clustering is an unsupervised algorithm, i.e. its data is unlabelled; the learning algorithm tries to find patterns within the data, or to group the data into clusters or sets.
That similarity is measured by :
  • Euclidean Distance (think Pythagoras’ theorem).
  • Other distance-based measures (for example, Manhattan).
  • Other measures if the attribute values are not continuous
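For example, base R's dist() computes both of the distance measures named above (the four-row iris subset is arbitrary):

x <- iris[1:4, 1:4]
dist(x, method = "euclidean")   # Pythagoras-style distance
dist(x, method = "manhattan")   # sum of absolute differences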
Why Clustering?
  • The set of data to be analysed can be reduced.
  • There are two types of clustering; the difference between the two is that Partition clustering is un-nested while Hierarchical clustering is nested.

      1. Partition Clustering (Non-Hierarchical)
          • Divides the data objects into non-overlapping (unique) subsets; the number of clusters is pre-defined.
          • In some cases, the exact number of clusters may be known (like marketing customer groups); then we can use partition clustering.

      2. Hierarchical Clustering
      • Divides the data objects into a set of NESTED clusters.
      • The clusters are built up (or broken down) sequentially, so the number of clusters does not need to be chosen a priori.
      There are 2 types of Hierarchical Clustering
      • Agglomerative:
        • At start, every observation is a cluster. Merge the most similar clusters step by step until all observations are in one cluster.
        • In Agglomerative clustering, there are several ways of defining the distance between clusters A and B (the linkage methods discussed below).
      • Divisive:
        • At start, all observations are in one cluster. Split step by step until each observation is in its own cluster.
        • (it is very slow compared to Agglomerative)

K-means clustering - Lloyd's algorithm
FAQ about clustering
How to measure a distance between clusters in Hierarchical Clustering
Illustration
For agglomerative clustering
Single Linkage (aka. nearest neighbour) (MIN)
The minimal distance between two points.
First, calculate the distances between all pairs of points across the two clusters,
then find the minimum of these distances (which then becomes the single linkage).
This finds the CLOSEST neighbour.
Complete Linkage (aka furthest neighbour)(MAX)
The maximum distance between two points.
First, calculate the distances between all pairs of points across the two clusters,
then find the maximum of these distances (which then becomes the complete linkage).
This finds the FURTHEST neighbour.
Average Linkage (AVG)
The distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster. (From here)
Avg of all pairwise distances :
Centroid Method
distance between centroids (mean) of two clusters
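These linkage choices map directly onto the method argument of hclust(); a small sketch on an arbitrary subset of iris:

d <- dist(iris[1:20, 1:4])                      # Euclidean distances on a small subset
hc_single   <- hclust(d, method = "single")     # MIN / nearest neighbour
hc_complete <- hclust(d, method = "complete")   # MAX / furthest neighbour
hc_average  <- hclust(d, method = "average")    # average linkage
hc_centroid <- hclust(d^2, method = "centroid") # centroid method (expects squared distances)
hc_ward     <- hclust(d, method = "ward.D2")    # Ward's method (next section)
plot(hc_complete)                               # compare the resulting dendrograms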
Ward's algorithm

Why

The most suitable method for quantitative variables.
Problem: distance and variance are linked
  • Normally the distance between clusters is SMALL where the variance is tiny, but when the distance is large, the variance is large as well.
Solution :
  • Ward’s method merges two clusters to minimise within cluster variance.

How

  • Instead of measuring the distance directly, it analyzes the variance of clusters.
  • Ward’s method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.
The difference amongst MIN, MAX and AVG
Are there a correct number of clusters? If not, how to determine it
NO! But generally, DO NOT :
  • choose too many clusters: (parsimony)
    • A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
  • choose too few clusters: (homogeneity within clusters)
    • If you choose the 1-cluster solution there is no point in doing clustering at all.
 
How can we know when to use Single Linkage or Complete Linkage? (Pros and Cons)
For Single linkage
Disadvantage of CL
What is Chaining? (found on Cons of Single Linkage in ETF3500 p.56)
Chaining is a problem that occurs when clusters are not compact and some observations in the same cluster are far away from one another.
What is the difference between the centroid and average linkage method?
  • In average linkage
      1. Compute the distances between pairs of observations
      2. Average these distances
Illustration
This method involves looking at the distances between all pairs and averaging all of these distances.
  • In the centroid method
      1. Average the observations to obtain the centroid of each cluster.
      2. Find the distance between centroids
Illustration
This involves finding the mean vector location for each of the clusters and taking the distance between the two centroids.
What is Stability?
Stability is the height between the cluster-1 solution and the cluster-(n−1) solution in the dendrogram visualisation.
  • Aka tolerance level
Changing the tolerance (threshold) affects the stability.
  • The way to determine stability is to look at the range of tolerance levels.
    • I will give an illustration below:
      the range of tolerance levels for cluster 1 > the range of tolerance levels for cluster 2. Therefore, cluster 1 is more STABLE than cluster 2.
 
What is Robust? Is Single Linkage (MIN) Robust? (only in Hierarchical)
  • Robust means the analysis does not dramatically change, even when a single observation is added.
  • Single Linkage (MIN) is NOT robust:
    • adding a single new observation can change the solution dramatically — and in the classic example the new observation is not even an outlier, but an "inlier" sitting between clusters.
    • Methods that are not affected by single observations are often called robust.
What are the linkages besides Single and Complete? Why do we prefer using them to SL or CL?
Average Linkage
Centroid method
Ward’s Method
They are preferred because they are more robust (i.e. less sensitive to outliers).
What is centroid?
The center of a cluster
When do we use hierarchical and non-hierarchical cluster techniques
Although both algorithms can be used in analysis, hierarchical are suited to small data sets (since the dendrogram can be more easily interpreted) while non-hierarchical methods are well suited to large data sets.
What are the advantages and disadvantages of hierarchical & non-hierarchical clustering
Hierarchical: are sequential either building up from separate clusters or breaking down from the one cluster.
Non-hierarchical: The number of clusters is chosen a priori.

Advantages of Hierarchical clustering

  • All possible solutions (with respect to the number of clusters) are provided in a single analysis
  • It is structured and can be explored using the dendrogram.
  • Don’t need to assume any particular number of clusters. Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

Advantages of K-Means Clustering

  • Can explore cluster allocations that will never be visited by the path of hierarchical solutions.
  • With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.
  • K-Means might yield tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages of Hierarchical clustering

  • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
  • Time complexity: not suitable for large datasets
  • Very sensitive to outliers
 

Disadvantages of K-Means clustering

  • Need to assume a particular number of clusters (the K value), which is hard to predict.
  • Initial seeds (centers of each cluster) have a strong influence on the final results
  • Sensitive to scale.
Dendrogram
  • a visualisation of the Hierarchical Clustering

How to interpret :

  • Think of the distance axis (y-axis) as measuring a 'tolerance level'
    • If the distance between two clusters is within the tolerance they are merged into one cluster.
    • As the tolerance increases, more and more clusters are merged, leading to fewer clusters overall.
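A small sketch of 'cutting' the dendrogram at a chosen number of clusters or at a chosen height/tolerance (the data and linkage method below are arbitrary choices):

hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
plot(hc)              # the y-axis height is the 'tolerance level'
cutree(hc, k = 3)     # cut so that exactly 3 clusters remain
cutree(hc, h = 10)    # or cut at a chosen height / tolerance instead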

Viewing Stability in Dendrogram

What is Stability?
Stability is the height between the cluster-1 solution and the cluster-(n−1) solution in the dendrogram visualisation.
  • Aka tolerance level
Changing the tolerance (threshold) affects the stability.
  • The way to determine stability is to look at the range of tolerance levels.
    • I will give an illustration below:
      the range of tolerance levels for cluster 1 > the range of tolerance levels for cluster 2. Therefore, cluster 1 is more STABLE than cluster 2.
 
Rand Index

What is Rand Index

  • The probability of picking two observations at random that are in agreement.
  • Lies between 0 and 1 and higher numbers indicate agreement.
  • Expresses what proportion of the cluster assignments are 'correct'.
Problem: even if observations are clustered at random, there will still be some agreement due to chance.

Solution — Adjusted Rand Index

  • Can use adjustedRandIndex() from the mclust package
Interpretation
  • 0 = if the level of agreement equals the case where clustering is done at random.
  • 1 = if the two clustering solutions are in perfect agreement. (Good)
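A minimal sketch comparing a k-means solution with the true iris species labels (the k-means setup is my own choice, and the value of roughly 0.7 is indicative only):

library(mclust)
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
adjustedRandIndex(km$cluster, iris$Species)   # roughly 0.7: well above chance, below perfect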
Can the Rand Index be the same if solution A is a three-cluster solution and solution B is a two-cluster solution, given that they are computed from the same dataset?
No.
 

K-means


Knowing what partition clustering is, let's explore one of its kind — K-means.
  • Each cluster is associated with a centroid
  • Each point is assigned to the cluster with the CLOSEST centroid.

Here are the steps of K-means (Lloyd's algorithm); a kmeans() sketch follows the list:


  1. Select k points (at random) as the initial centroids
  2. Repeat:
    2.1. Form k clusters by assigning all points to the CLOSEST centroid
    2.2. Re-compute the centroid of each cluster
  3. Until the centroids don't change
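Those steps are what kmeans() runs under the hood; a minimal sketch (scaling and the nstart value are my own choices, not prescribed by the notes):

set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)  # nstart = multiple random starts
km$centers                       # the final centroids
table(km$cluster, iris$Species)  # cluster membership vs the true species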

Why and why not K-means:


Advantage :
  • Simple to implement
  • It is computationally efficient and, when K is small, can be faster than hierarchical clustering on large datasets.
Disadvantage:
It depends on the initial centroid values, and it has problems when the clusters have differing:
  • Sizes
  • Density
  • Non-globular shapes
  • or when the data contain outliers
FAQ of K-means
How does each point find its closest centroid?
  • To find its centroid, you calculate the "distance" between the point and all the centroids,
    • then you will know which one is the closest.
Are there any pre-processing steps for K-means?
Yes, you need to :
• Normalise the data
• Eliminate outliers
Are there any post-processing steps for K-means?
• Eliminate small clusters that may represent outliers (points that do not fit into any cluster are set aside)
SSE is the measure of how tightly a cluster holds together; if it is too high, split the cluster.
Split ‘loose’ clusters, i.e., clusters with relatively high SSE.
Merge clusters that are ‘close’ and that have relatively low SSE
Will the initial centroids influence final clusters? if so, how do you cope with that?
Yes, so there are few ways to do :
  1. Multiple runs
  2. Select more than k initial centroids and then select among these initial centroids

Hierarchical Clustering


  • Produces a set of nested clusters organised as a hierarchical tree
  • Visualised as a dendrogram

Why and Why not HC?


Advantage
  1. It compensates for what K-means can't do:
      • You don't need to assume any particular number of clusters
        • Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
Disadvantage

What is a dendrogram used in it?

  • A ‘dendrogram’ shows how the clusters are merged hierarchically (looks like tree diagram)
    • It decomposes data objects into several levels of nested partitions (tree of clusters)
    • The height represents the distance between clusters
    • A clustering of the data objects is obtained by cutting the dendrogram at the desired level, and then each connected component forms a cluster

Types of Hierarchical Clustering


There are two kinds: Agglomerative (合, merging) and Divisive (分, splitting)
Divisive:
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster contains a point (or there are k clusters)
Agglomerative:
  • Works the opposite way: start with the points as individual clusters
  • At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

Since Agglomerative is the most popular one, here is how it works:

  1. Compute the distance matrix; let each data point be a cluster
  2. Repeat:
    2.1. Merge the two closest clusters
    2.2. Update the distance matrix
  3. Until only a single cluster remains
 

There are 4 ways to Define Inter-Cluster Similarity :


  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
Of course, each of these has its own advantages and disadvantages:

Text Analysis


Some parts of the introduction are skipped; I assume you understand what they are.

Vector space model (VSM)

  1. A commonly used approach to text analysis is the VSM,
    1. Here is how :
      • Document text is converted to a bag of words (or tokens).
      • Counts of words are then treated as orthogonal vectors in n-dimensional space.
      • The angle between documents indicates their degree of similarity.
2. VSM uses the 'bag of words' approach, with the assumption that:
  • Each document is assumed to be just a collection of words; grammar and word order are ignored
  • The order of the words in a document does not matter.
    • Syntactically similar documents are semantically similar – which is often the case
    • However, it does not always work : "John ate peach" ≠ "Peach ate John"

Each document is represented by one vector. Proximity of documents is based on a similarity measure defined over the vector space.

The Process of extracting structure from text


  • Tokenise:
    • document is split into a stream of words;
    • all punctuation marks are removed
  • Convert case
  • Remove stop words
Stem & Lemmatize
  • Create n-grams
After the above process is finished, we create a Term-Document (Frequency) Matrix, whose entries are the frequency of each word in each document.
There are two things to notice:
  • Terms should not be too common (not helpful for clustering).
  • Terms should not be too infrequent, those occurring very rarely often removed (not helpful for clustering)
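A tiny hand-rolled term-document matrix, built after tokenising, converting case and removing punctuation; the three documents are made up, and a real analysis would use a text-mining package:

docs <- c(d1 = "John ate a peach.", d2 = "A peach ate John!", d3 = "Peaches are nice.")
tokens <- lapply(docs, function(d) {
  d <- tolower(gsub("[[:punct:]]", "", d))   # convert case, remove punctuation
  strsplit(d, "\\s+")[[1]]                   # tokenise on whitespace
})
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tk) table(factor(tk, levels = terms)))
tdm                                          # rows = terms, columns = documents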
 

Analysing text and documents based on Term importance and Document similarity


Term importance:


  • Term Document Matrices
    • As explained above
  • Inverse Document Frequency

    Document similarity : Cosine Distance


    FAQ of Document similarity
    Why calculate document similarity?
    In dialogue and question-and-answer systems, many of the questions asked actually have duplicate answers. If you can judge that two differently worded descriptions are the same question, you can return the existing answer directly, avoiding repeated questions and answers and reducing the time needed to find a solution.
    When converting documents into vectors in a space, we can use the distance between the vectors and the cosine value to calculate the similarity of the text.
     
    Cosine similarity between documents:
    • Closeness evaluated in n-dimensional space, where n = number of tokens in DTM.
    • Cosine similarity closer to 1 means documents are more similar than smaller values.
    https://www.programmersought.com/article/38654975219/
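    A sketch of the cosine similarity computation, reusing the toy term-document matrix tdm built earlier (the cosine() helper is my own, not from a package):

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    cosine(tdm[, "d1"], tdm[, "d2"])   # = 1 here: same bag of words even though the meaning differs
    cosine(tdm[, "d1"], tdm[, "d3"])   # = 0 here: no shared terms ("peach" vs "peaches" — stemming would fix this)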
     

    TF-IDF
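    A hedged sketch of the standard tf-idf weighting — term frequency multiplied by the log of (number of documents / document frequency) — applied to the toy tdm above:

    tf    <- t(t(tdm) / colSums(tdm))           # term frequency within each document
    idf   <- log(ncol(tdm) / rowSums(tdm > 0))  # inverse document frequency: log(N / df)
    tfidf <- tf * idf                           # frequent here but rare overall scores highest
    round(tfidf, 2)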

     
