
Basic functions


In week 3, we learnt some basic functions :
by
Apply a function to a data frame split by factors
do.call
Constructs and executes a single function call from a function and a list of arguments (instead of writing the calls out yourself in a loop)
These are functions you should mostly already understand:
as.table
converts the output format to a table
as.data.frame
Coerce the previous output into a data frame
colnames
assigns new column names to a dataframe.
Merging (could use cbind() if the two data frames are already row-aligned)
merge(Sepal.cor, Petal.cor, by = "Species") merges the two data frames using the common column "Species"
Making new column
Deleting columns
which.max
which.max(iris[, 3]) finds the row index of the longest petal
rbind
Stacks data frames (or matrices) with the same columns on top of each other (appends rows)
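A minimal sketch (not the original lecture code) tying these functions together on the built-in iris data; the object names Sepal.cor and Petal.cor are assumptions that mirror the merge() example above:

# by(): apply a function to iris split by the Species factor
Sepal.cor <- by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))
Petal.cor <- by(iris, iris$Species, function(d) cor(d$Petal.Length, d$Petal.Width))

# do.call(): call one function (here data.frame) with a list of arguments built in code
Sepal.df <- do.call(data.frame,
                    list(Species = names(Sepal.cor), Sepal.cor = as.numeric(Sepal.cor)))
Petal.df <- data.frame(Species = names(Petal.cor), Petal.cor = as.numeric(Petal.cor))

# colnames() renames columns; merge() joins on the shared "Species" column
# (cbind() would also work here because the rows are already aligned)
colnames(Petal.df) <- c("Species", "Petal.cor")
both <- merge(Sepal.df, Petal.df, by = "Species")

# Make a new column, then delete it again
both$ratio <- both$Sepal.cor / both$Petal.cor
both$ratio <- NULL

# which.max(): row of iris with the longest petal; rbind() stacks rows
iris[which.max(iris[, 3]), ]
rbind(head(iris, 2), tail(iris, 2))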
 

 
 
 
 

Data Visualisation


In week 4, we learnt about

- How to plot a graph using ggplot

The difference between qplot () and ggplot () is
  • The syntax of qplot() is similar to plot(), making it easier to transition to ggplot().
The following code shows how similar qplot(), ggplot() and plot() are:
#make a plot using base
plot(x=diamonds$carat, y=diamonds$price, type="p")
title(main="I'm a base plot")

#make a "quick" plot using ggplot2
qplot(x=carat, y=price, data=diamonds, geom="point") + ggtitle("I'm a qplot")

#the full ggplot() equivalent
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("I'm a ggplot")

- Elements of a graph

If you are interested in the different types of graphs, search for the "Visualisation Zoo" and explore it.
 
 

Dirty data


What is dirty data ?

Incorrect data:
Values do not adhere to their domain (the set of valid values).
e.g. a month cannot be 13 or above, as its domain is 1 to 12.
 
Inaccurate data
A data value can be correct without being accurate.
e.g, the state code "VIC" and the city name "Sydney" are both correct, but when used together (such as Sydney, VIC), the state code is wrong because Sydney is in NSW.
Business rule violations:
Another type of inaccurate data value is one that violates business rules.
e.g a start date should always precede a finish date.
Inconsistent data:
Uncontrolled data redundancy results in inconsistencies.
e.g JASON CHING YUEN, Siu Ching Yuen, Jason Siu
All of which mean the same person
Incomplete data:
Often we do not gather data requirements from downstream information consumers (e.g. the marketing department).
Hence, we need to find the most important data elements:
e.g. 1 — marketing department:
Gender, Customer code or Postcode might not be captured at all, or only haphazardly.
e.g. 2 — building a system for the lending department of a bank:
Initial Loan Amount, Monthly Payment Amount and Interest Rate.
Non-integrated data:
Most organisations store data redundantly and inconsistently across many systems, which were never designed with integration or analytics in mind.
e.g. customer data may exist on two or more outsourced systems under different customer numbers with different spellings of the customer name and even different phone numbers or addresses.

What is tidy data ?

Tidy data refers to data that has variables organized in columns and observations in rows.
When you clean data by wrangling it, you are applying the principles of tidy data. All the commands we have learned during the semester (filter, mutate, summarise, etc.) aim at getting your data into a tidy format.
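A small hedged sketch of that idea using tidyr's pivot_longer() (the data frame and column names below are made up for illustration):

library(tidyr)
library(dplyr)

untidy <- data.frame(country = c("AUS", "NZ"),
                     `2019` = c(10, 4), `2020` = c(12, 5), check.names = FALSE)

tidy <- untidy %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "year", values_to = "cases")   # one variable per column,
                                                          # one observation per row
tidy %>% filter(year == "2020") %>% summarise(total = sum(cases))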
 
 

Regression


You can refer to the notes I have written before

For regression, especially multiple one, you need to understand how to :
  • determine the best attributes
  • interpret if the model is good or not based on the diagnostic
  • predict or forecast the results
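A minimal sketch of that workflow on the built-in mtcars data (the chosen predictors are only illustrative, not from the unit):

# multiple regression, diagnostics and prediction
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)                        # coefficients, R-squared: is the model any good?
plot(fit)                           # diagnostic plots (residuals vs fitted, QQ, leverage)
step(lm(mpg ~ ., data = mtcars))    # one way to search for a better set of attributes

predict(fit, newdata = data.frame(wt = 3, hp = 110),
        interval = "prediction")    # forecast a new observation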
Things you always forget :
  • Price prediction
Network
Machine Learning
Decision tree
  • Components include:
    • Leaf nodes
    • Branches
The following methods are used to enhance the performance of models.

Cross validation


There are many resampling approaches, like bootstrapping and CV.

Problem :

One can build a model with 100% accuracy on the training dataset, but it may fail to generalise to unseen data. So it is not a good model, given that it is overfitting the training set.

Solution :

  • Cross-validation is one method to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.
  • In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate.
  • In typical cross-validation, the training and validation sets must cross-over in successive rounds such that each data point has a chance of being validated against.
English explanation
K-fold Cross validation

What

  • k-fold cross validation splits the dataset into k different (disjoint) subsets of approximately the same size, and uses in turn one of the subsets for estimating the generalisation error and the others for training the system (in your case a K-NN classifier).
  • If you are performing 2-fold cross validation, you are dividing your dataset into two halves, and first, you use the first one for training and the second for testing, and then you use the second for training and the first for testing.

Why

  • Less computationally expensive than LOOCV.
  • Used to protect against overfitting
  • helps lower variance in small data sets
  • The test error is then estimated by averaging the  resulting MSE estimates.
 

How

Let's say K = 5
Training on k − 1 partitions means that 4 pieces (80% of the data) go into the training set; the remaining 1 piece (20% of the data) is used as the test set.
Stratification ensures the classes are represented across partitions.
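A hand-rolled sketch of the same idea in base R (5 folds and a linear model on mtcars are purely illustrative choices; no stratification here):

set.seed(2021)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold labels, no replacement

mse <- sapply(1:k, function(i) {
  train <- mtcars[fold != i, ]        # k - 1 partitions (~80% of the data) for training
  test  <- mtcars[fold == i, ]        # the remaining partition (~20%) for testing
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, test))^2)
})
mean(mse)                             # CV estimate of the test MSE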

Some remark :

  1. The splitting process is done without replacement, so each observation appears in the test set exactly once; in all other folds it is part of the training set.
    For example, say observation A (the 15th row) falls into one fold. That row is predicted on only once (when its fold is the test set), and it is part of the training set at all other times.
  2. Usually k = 5 or 10; for small datasets we can use higher values of k, so that there is a good compromise in the bias-variance tradeoff.
  3. When k = 5, 20% of the data is held back as the test set each time; when k = 10, 10% is held back each time, and so on.
LOOCV means k = the number of observations in the dataset.
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
 
Chinese explanation of cross-validation (translated)
https://zhuanlan.zhihu.com/p/67986077
Another compromise approach is called K-fold cross-validation. The difference from LOOCV is that each test set no longer contains just one observation but several; the exact number depends on the choice of K. For example, if K = 5, the steps of five-fold cross-validation are:
1. Split the whole dataset into 5 parts.
2. Without repetition, take one part as the test set each time, train the model on the other four parts, and compute the model's MSE on that test set.
3. Average the 5 MSE values to obtain the final cross-validated MSE.
Comparison with validation set approach (Here)
The good thing about the validation set approach is that it is conceptually easy to grasp and easily implemented, since you simply partition the existing training data into two sets. This can be useful in industry when explaining to stakeholders how models were tested. The validation set approach also has a computational advantage.
However, there are 2 drawbacks: the test error estimate can be highly variable depending on which observations end up in the validation set, and because the model is fit on only a subset of the data, the validation error tends to overestimate the test error.
Comparison with LOOCV approach
Good things about K-fold:
1) k-fold CV is the same as LOOCV when k = n. However, in a situation where, for example, k = 10 and n = 10,000, k-fold CV will fit 10 models whereas LOOCV will fit 10,000. That is, LOOCV is the most computationally intense method, since the model must be fit n times.
2) Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate; LOOCV has a higher variance, but lower bias.
More detail
The LOOCV cross-validation approach is a special case of k-fold cross-validation in which k = n. This approach has two drawbacks compared to k-fold cross-validation.
First, it requires fitting the potentially computationally expensive model n times, compared to k-fold cross-validation which requires the model to be fitted only k times.
Second, the LOOCV cross-validation approach may give approximately unbiased estimates of the test error, since each training set contains n − 1 observations; however, this approach has higher variance than k-fold cross-validation (since we are averaging the outputs of n fitted models trained on an almost identical set of observations, these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than less correlated ones).
So, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically using k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Further discussion is here.
 
Steps to conduct 10-fold cross-validation on the model fitting and estimate the test classification error
Code
set.seed(2021)
spam_folds <- vfold_cv(data = spam, strata = spam) #<<

compute_fold_error <- function(split) {
  train <- analysis(split)
  test <- assessment(split)
  fit <- logistic_mod %>%
    fit(spam ~ `day of week` + `time of day` + domain, data = train)
  test_pred <- augment(fit, test)
  error <- tibble(error = 1 - metrics(test_pred, truth = spam,
                                      estimate = .pred_class)$.estimate[1]) %>%
    rsample::add_resample_id(split = split)
  return(error)
}

kfold_results <- map_df(spam_folds$splits, ~compute_fold_error(.x))
kfold_results %>% summarise(mean(error))
# mean(error)
# 1 0.1471407
tute 4 ETC3250
Steps
Make a function to contain such steps in sequence:
  1. Use fit() to fit the desired model on the training split.
  2. Use augment() to generate predictions on the test split.
  3. Calculate the error as 1 minus the accuracy.
Then use a for loop / map_df() to repeat that function over every fold.

 
 

Ensemble method


It is the creation of a better classifier from a collection of weaker classifiers (團結就是力量, "unity is strength").
There are a few ensemble methods we learnt, namely Bagging and Boosting.
Assumption :
  • The individual classifiers are moderately (> 50%) accurate.
  • Individual classifiers are created independently.
  • Pooling the results of each classifier reduces the variance of the overall classification.
  • Decision trees work well as the individual classifiers
  • Disadvantage: model produced is not as easy to interpret as a single tree.
 

Bagging (Bootstrap Aggregation)


Bagging is a method to decrease variance and avoid over-fitting. It works well for high-variance machine learning algorithms, typically decision trees.
Bagging is nothing but SAMPLING WITH REPLACEMENT, using the same sample size of your distribution.
  • What is SAMPLING WITH REPLACEMENT ?
    • (That means same observation can occur more than once in the bootstrap data set. )
Each bootstrap replicate may have multiple instances of the original data points, and contains approximately 63% of the original data set.
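A quick sketch that checks the ~63% claim by drawing one bootstrap replicate of row indices (n is an arbitrary choice):

set.seed(1)
n <- 10000
boot_idx <- sample(n, replace = TRUE)   # sampling WITH replacement, same size n
length(unique(boot_idx)) / n            # roughly 0.632 (= 1 - 1/e) distinct rows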

The way it works :


For example, in the samples drawn from the original set (called replicates here), the same observation can appear more than once (sampling with replacement).
Step 2: Construct a single classifier for each replicate.
Step 3: Combine the classifiers by taking a majority vote to produce the final decision (see the sketch below).
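A minimal bagging sketch of those three steps, using rpart trees on a two-class subset of iris (the subset, the number of replicates B and the tree settings are my own illustrative choices):

library(rpart)

df <- droplevels(subset(iris, Species != "setosa"))
set.seed(1)
B <- 25
trees <- lapply(1:B, function(b) {
  idx <- sample(nrow(df), replace = TRUE)                 # step 1: bootstrap replicate
  rpart(Species ~ ., data = df[idx, ], method = "class")  # step 2: one classifier per replicate
})

# step 3: combine the classifiers by majority vote
votes <- sapply(trees, function(t) as.character(predict(t, df, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == as.character(df$Species))             # training accuracy of the ensemble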

FAQ for Bagging :
When to use bagging
  • Useful when there is noise in the data.
  • Useful for unstable classifiers – that is, small changes in the training data cause large changes in the classifier; Unstable classifiers include decision trees, neural networks, linear regression.
  • Not recommended for stable classifiers such as K Nearest Neighbours, Naïve Bayes.
 

Boosting


How to perform Boosting

  • Assign equal weights to each point in the training set; fit a basic tree.
  • Repeat n iterations: update the weights of misclassified items and normalise; update the tree, building on the current tree.
  • Output the final classifier as a weighted sum of votes from each tree (a hand-rolled sketch follows).
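A hand-rolled AdaBoost-style sketch of that loop, using decision stumps via rpart on a two-class subset of iris; the data, predictors and the standard AdaBoost weight-update formulas are my own choices, not taken from the unit's slides:

library(rpart)

df <- droplevels(subset(iris, Species != "setosa"))   # two-class toy problem
n  <- nrow(df)
w  <- rep(1 / n, n)                                   # step 1: equal weights
M  <- 10
stumps <- vector("list", M); alpha <- numeric(M)

for (m in 1:M) {                                      # step 2: repeat M iterations
  stumps[[m]] <- rpart(Species ~ Petal.Length + Petal.Width, data = df,
                       weights = w, method = "class",
                       control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  pred <- predict(stumps[[m]], df, type = "class")
  miss <- as.numeric(pred != df$Species)              # 1 = misclassified
  err  <- sum(w * miss) / sum(w)
  alpha[m] <- 0.5 * log((1 - err) / err)              # vote weight for this tree
  w <- w * exp(alpha[m] * miss); w <- w / sum(w)      # up-weight misclassified, normalise
}

# step 3: final classifier = weighted sum of votes from each tree
lvl   <- levels(df$Species)
votes <- sapply(1:M, function(m)
  ifelse(predict(stumps[[m]], df, type = "class") == lvl[2], alpha[m], -alpha[m]))
final <- ifelse(rowSums(votes) > 0, lvl[2], lvl[1])
mean(final == df$Species)                             # training accuracy of the ensemble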
 
Bagging and Boosting are similar in that they are both ensemble techniques, where a set of weak learners is combined to create a strong learner that obtains better performance than a single one. But you can think of Boosting as an advanced version of Bagging.

Here are the differences between the two (this link is very useful):

  1. How we sample for # of replicate
  2. How we model
  3. How we produce the result

1. How we sample for # of replicate


In the case of Bagging, any element has the same probability (1/N, where N is the number of observations) of appearing in a new data set. However, for Boosting the observations are weighted, and therefore some of them will take part in the new sets more often.
 
 

 

2. How we model


The training stage of the two models is different; parallel for Bagging (i.e., each model is built independently) and sequential for Boosting.
 
It is sequential in a sense that each classifier is trained on data, taking into account the previous classifiers’ success.
After each training step, the weights are redistributed. Misclassified data (the false positives and false negatives) gets its weights increased to emphasise the most difficult cases. In this way, subsequent learners will prioritise them during their training.

3. How we produce the result


The way they produce the result is different: a plain average for Bagging and a weighted average for Boosting.
In Bagging the result is obtained by averaging the responses of the N learners (or majority vote).
However, Boosting assigns a second set of weights, this time for the N classifiers, in order to take a weighted average of their estimates.
 

 
The way Boosting learns is via evaluation: a learner with a good classification result on the training data is assigned a higher weight than a poor one.
 
That's why the Boosting algorithm allocates weights to each resulting model.
 

So, in short, Boosting tries to achieve better accuracy and Bagging tries to reduce variance and avoid over-fitting

Random forest


Random forest is bagging applied to decision trees with one extra twist: at each split only a random subset of the predictors is considered, which de-correlates the individual trees and usually improves on plain bagging.

Clustering algorithm


  • Given a set of data points with a set of attributes, the algorithm will find groups of objects such that
    • objects within the same cluster are similar to one another,
    • and objects in different clusters are dissimilar.
  • Clustering is an unsupervised algorithm, i.e. its data is unlabelled; the learning algorithm tries to find patterns within the data, or to group the data into clusters or sets.
That similarity is measured by :
  • Euclidean Distance (think Pythagoras’ theorem).
  • Other distance-based measures (for example, Manhattan).
  • Other measures if the attribute values are not continuous
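For example, base R's dist() computes both of the distance measures named above (the four-row iris subset is arbitrary):

x <- iris[1:4, 1:4]
dist(x, method = "euclidean")   # Pythagoras-style distance
dist(x, method = "manhattan")   # sum of absolute differences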
Why Clustering?
  • The set of data to be analysed can be reduced.
  • There are two types of clustering; the difference between the two is that Partition clustering is un-nested while Hierarchical clustering is nested.

      1. Partition Clustering (Non-Hierarchical)
          • Divides the data objects into non-overlapping (unique) subsets; the number of clusters is pre-defined.
          • In some cases, the exact number of clusters may be known (like marketing customer groups); then we can use partition clustering.

      2. Hierarchical Clustering
      • Divides the data objects into a set of NESTED clusters.
      • The clusters are built up (or broken down) sequentially, so the number of clusters does not need to be chosen a priori.
      There are 2 types of Hierarchical Clustering
      • Agglomerative:
        • At start, every observation is a cluster. Merge the most similar clusters step by step until all observations are in one cluster.
        • In Agglomerative clustering, there are several ways of defining the distance between clusters A and B (the linkage methods discussed below).
      • Divisive:
        • At start, all observations are in one cluster. Split step by step until each observation is in its own cluster.
        • (it is very slow compared to Agglomerative)

K-means clustering - Lloyd's algorithm
FAQ about clustering
How to measure a distance between clusters in Hierarchical Clustering
Illustration
For agglomerative clustering
Single Linkage (aka. nearest neighbour) (MIN)
The minimal distance between two points.
First, calculate the distances between all pairs of points across the two clusters,
then find the minimum of these distances (which then becomes the single linkage).
This finds the CLOSEST neighbour.
Complete Linkage (aka furthest neighbour)(MAX)
The maximum distance between two points.
First, calculate the distances between all pairs of points across the two clusters,
then find the maximum of these distances (which then becomes the complete linkage).
This finds the FURTHEST neighbour.
Average Linkage (AVG)
The distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster. (From here)
Avg of all pairwise distances :
Centroid Method
distance between centroids (mean) of two clusters
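These linkage choices map directly onto the method argument of hclust(); a small sketch on an arbitrary subset of iris:

d <- dist(iris[1:20, 1:4])                      # Euclidean distances on a small subset
hc_single   <- hclust(d, method = "single")     # MIN / nearest neighbour
hc_complete <- hclust(d, method = "complete")   # MAX / furthest neighbour
hc_average  <- hclust(d, method = "average")    # average linkage
hc_centroid <- hclust(d^2, method = "centroid") # centroid method (expects squared distances)
hc_ward     <- hclust(d, method = "ward.D2")    # Ward's method (next section)
plot(hc_complete)                               # compare the resulting dendrograms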
Ward's algorithm

Why

The most suitable method for quantitative variables.
Problem: distance and variance are linked
  • Normally the distance between clusters is SMALL where the variance is tiny, but when the distance is large, the variance is large as well.
Solution :
  • Ward’s method merges two clusters to minimise within cluster variance.

How

  • Instead of measuring the distance directly, it analyzes the variance of clusters.
  • Ward’s method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.
The difference amongst MIN, MAX and AVG
Are there a correct number of clusters? If not, how to determine it
NO! But generally, DO NOT :
  • choose too many clusters: (parsimony)
    • A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
  • choose too few clusters: (homogeneity within clusters)
    • If you choose the 1-cluster solution there is no point in doing clustering at all.
 
How can we know when to use Single Linkage or Complete Linkage? (Pros and Cons)
For Single linkage
Disadvantage of CL
What is Chaining? (found on Cons of Single Linkage in ETF3500 p.56)
Chaining is a problem that occurs when clusters are not compact and some observations in the same cluster are far away from one another.
What is the difference between the centroid and average linkage method?
  • In average linkage
      1. Compute the distances between pairs of observations
      2. Average these distances
Illustration
This method involves looking at the distances between all pairs and averaging all of these distances.
  • In the centroid method
      1. Average the observations to obtain the centroid of each cluster.
      2. Find the distance between centroids
Illustration
This involves finding the mean vector location for each of the clusters and taking the distance between the two centroids.
What is Stability?
Stability is the height between the cluster-1 solution and the cluster-(n−1) solution in the dendrogram visualisation.
  • Aka tolerance level
Changing the tolerance (threshold) affects the stability.
  • The way to determine stability is to look at the range of tolerance levels.
    • I will give an illustration below:
      the range of tolerance levels for cluster 1 > the range of tolerance levels for cluster 2. Therefore, cluster 1 is more STABLE than cluster 2.
 
What is Robust? Is Single Linkage (MIN) Robust? (only in Hierarchical)
  • Robust means the analysis does not dramatically change, even when a single observation is added.
  • Single Linkage (MIN) is NOT robust:
    • adding a single new observation can change the solution dramatically — and in the classic example the new observation is not even an outlier, but an "inlier" sitting between clusters.
    • Methods that are not affected by single observations are often called robust.
What are the linkages besides Single and Complete? Why do we prefer using them to SL or CL?
Average Linkage
Centroid method
Ward’s Method
They are preferred because they are more robust (i.e. less sensitive to outliers).
What is centroid?
The center of a cluster
When do we use hierarchical and non-hierarchical cluster techniques
Although both algorithms can be used in analysis, hierarchical are suited to small data sets (since the dendrogram can be more easily interpreted) while non-hierarchical methods are well suited to large data sets.
What are the advantages and disadvantages of hierarchical & non-hierarchical clustering
Hierarchical: are sequential either building up from separate clusters or breaking down from the one cluster.
Non-hierarchical: The number of clusters is chosen a priori.

Advantages of Hierarchical clustering

  • All possible solutions (with respect to the number of clusters) are provided in a single analysis
  • It is structured and can be explored using the dendrogram.
  • Don’t need to assume any particular number of clusters. Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

Advantages of K-Means Clustering

  • Can explore cluster allocations that will never be visited by the path of hierarchical solutions.
  • With a large number of variables, if K is small, K-Means may be computationally faster than hierarchical clustering.
  • K-Means might yield tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages of Hierarchical clustering

  • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
  • Time complexity: not suitable for large datasets
  • Very sensitive to outliers
 

Disadvantages of K-Means clustering

  • Need to assume a particular number of clusters (the K value), which is hard to predict.
  • Initial seeds (centers of each cluster) have a strong influence on the final results
  • Sensitive to scale.
Dendrogram
  • a visualisation of the Hierarchical Clustering

How to interpret :

  • Think of the distance axis (y-axis) as measuring a 'tolerance level'
    • If the distance between two clusters is within the tolerance they are merged into one cluster.
    • As the tolerance increases, more and more clusters are merged, leading to fewer clusters overall.
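A small sketch of 'cutting' the dendrogram at a chosen number of clusters or at a chosen height/tolerance (the data and linkage method below are arbitrary choices):

hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
plot(hc)              # the y-axis height is the 'tolerance level'
cutree(hc, k = 3)     # cut so that exactly 3 clusters remain
cutree(hc, h = 10)    # or cut at a chosen height / tolerance instead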

Viewing Stability in Dendrogram

What is Stability?
Stability is the height between the cluster-1 solution and the cluster-(n−1) solution in the dendrogram visualisation.
  • Aka tolerance level
Changing the tolerance (threshold) affects the stability.
  • The way to determine stability is to look at the range of tolerance levels.
    • I will give an illustration below:
      the range of tolerance levels for cluster 1 > the range of tolerance levels for cluster 2. Therefore, cluster 1 is more STABLE than cluster 2.
 
Rand Index

What is Rand Index

  • The probability of picking two observations at random that are in agreement.
  • Lies between 0 and 1 and higher numbers indicate agreement.
  • Expresses what proportion of the cluster assignments are 'correct'.
Problem: even if observations are clustered at random, there will still be some agreement due to chance.

Solution — Adjusted Rand Index

  • Can use adjustedRandIndex() from the mclust package
Interpretation
  • 0 = if the level of agreement equals the case where clustering is done at random.
  • 1 = if the two clustering solutions are in perfect agreement. (Good)
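A minimal sketch comparing a k-means solution with the true iris species labels (the k-means setup is my own choice, and the value of roughly 0.7 is indicative only):

library(mclust)
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
adjustedRandIndex(km$cluster, iris$Species)   # roughly 0.7: well above chance, below perfect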
Can the Rand Index be the same if solution A is a three-cluster solution and solution B is a two-cluster solution, given that they are computed from the same dataset?
No.
 

K-means


Knowing what partition clustering is, let's explore one of its kind — K-means.
  • Each cluster is associated with a centroid
  • Each point is assigned to the cluster with the CLOSEST centroid.

Here are the steps of K-means (Lloyd's algorithm); a kmeans() sketch follows the list:


  1. Select k points (at random) as the initial centroids
  2. Repeat:
    2.1. Form k clusters by assigning all points to the CLOSEST centroid
    2.2. Re-compute the centroid of each cluster
  3. Until the centroids don't change
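Those steps are what kmeans() runs under the hood; a minimal sketch (scaling and the nstart value are my own choices, not prescribed by the notes):

set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)  # nstart = multiple random starts
km$centers                       # the final centroids
table(km$cluster, iris$Species)  # cluster membership vs the true species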

Why and why not K-means:


Advantage :
  • Simple to implement
  • It is computationally efficient and, when K is small, can be faster than hierarchical clustering on large datasets.
Disadvantage:
It depends on the initial centroid values, and it has problems when the clusters have differing:
  • Sizes
  • Density
  • Non-globular shapes
  • or when the data contain outliers
FAQ of K-means
How does each point find its closest centroid?
  • To find its centroid, you calculate the "distance" between the point and all the centroids,
    • then you will know which one is the closest.
Are there any pre-processing steps for K-means?
Yes, you need to :
• Normalise the data
• Eliminate outliers
Are there any post-processing steps for K-means?
• Eliminate small clusters that may represent outliers (points that do not fit into any cluster are set aside)
SSE is the measure of how tightly a cluster holds together; if it is too high, split the cluster.
Split ‘loose’ clusters, i.e., clusters with relatively high SSE.
Merge clusters that are ‘close’ and that have relatively low SSE
Will the initial centroids influence final clusters? if so, how do you cope with that?
Yes, so there are few ways to do :
  1. Multiple runs
  2. Select more than k initial centroids and then select among these initial centroids

Hierarchical Clustering


  • Produces a set of nested clusters organised as a hierarchical tree
  • Visualised as a dendrogram

Why and Why not HC?


Advantage
  1. It compensates for what K-means can't do:
      • You don't need to assume any particular number of clusters
        • Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
Disadvantage

What is a dendrogram used in it?

  • A ‘dendrogram’ shows how the clusters are merged hierarchically (looks like tree diagram)
    • It decomposes data objects into several levels of nested partitions (tree of clusters)
    • The height represents the distance between clusters
    • A clustering of the data objects is obtained by cutting the dendrogram at the desired level, and then each connected component forms a cluster

Types of Hierarchical Clustering


There are two kinds: Agglomerative (合, merging) and Divisive (分, splitting)
Divisive:
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster contains a point (or there are k clusters)
Agglomerative:
  • Works the opposite way: start with the points as individual clusters
  • At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

Since Agglomerative is the most popular one, here is how it works:

  1. Compute the distance matrix; let each data point be a cluster
  2. Repeat:
    2.1. Merge the two closest clusters
    2.2. Update the distance matrix
  3. Until only a single cluster remains
 

There are 4 ways to Define Inter-Cluster Similarity :


  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
Of course, each of these has its own advantages and disadvantages:

Text Analysis


Some parts of the introduction are skipped; I assume you understand what they are.

Vector space model (VSM)

  1. A commonly used approach to text analysis is the VSM,
    1. Here is how :
      • Document text is converted to a bag of words (or tokens).
      • Counts of words are then treated as orthogonal vectors in n-dimensional space.
      • The angle between documents indicates their degree of similarity.
2. VSM uses the 'bag of words' approach, with the assumption that:
  • Each document is assumed to be just a collection of words; grammar and word order are ignored
  • The order of the words in a document does not matter.
    • Syntactically similar documents are semantically similar – which is often the case
    • However, it does not always work : "John ate peach" ≠ "Peach ate John"

Each document is represented by one vector. Proximity of documents is based on a similarity measure defined over the vector space.

The Process of extracting structure from text


  • Tokenise:
    • document is split into a stream of words;
    • all punctuation marks are removed
  • Convert case
  • Remove stop words
Stem & Lemmatize
  • Create n-grams
After the above process is finished, we create a Term-Document (Frequency) Matrix, whose entries are the frequency of each word in each document.
There are two things to notice:
  • Terms should not be too common (not helpful for clustering).
  • Terms should not be too infrequent, those occurring very rarely often removed (not helpful for clustering)
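A tiny hand-rolled term-document matrix, built after tokenising, converting case and removing punctuation; the three documents are made up, and a real analysis would use a text-mining package:

docs <- c(d1 = "John ate a peach.", d2 = "A peach ate John!", d3 = "Peaches are nice.")
tokens <- lapply(docs, function(d) {
  d <- tolower(gsub("[[:punct:]]", "", d))   # convert case, remove punctuation
  strsplit(d, "\\s+")[[1]]                   # tokenise on whitespace
})
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tk) table(factor(tk, levels = terms)))
tdm                                          # rows = terms, columns = documents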
 

Analysing text and documents based on Term importance and Document similarity


Term importance:


  • Term Document Matrices
    • As explained above
  • Inverse Document Frequency

    Document similarity : Cosine Distance


    FAQ of Document similarity
    Why calculate document similarity?
    In dialogue and question-and-answer systems, many of the questions asked actually have duplicate answers. If you can judge that two differently worded descriptions are the same question, you can return the existing answer directly, avoiding repeated questions and answers and reducing the time needed to find a solution.
    When converting documents into vectors in a space, we can use the distance between the vectors and the cosine value to calculate the similarity of the text.
     
    Cosine similarity between documents:
    • Closeness evaluated in n-dimensional space, where n = number of tokens in DTM.
    • Cosine similarity closer to 1 means documents are more similar than smaller values.
    https://www.programmersought.com/article/38654975219/
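    A sketch of the cosine similarity computation, reusing the toy term-document matrix tdm built earlier (the cosine() helper is my own, not from a package):

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    cosine(tdm[, "d1"], tdm[, "d2"])   # = 1 here: same bag of words even though the meaning differs
    cosine(tdm[, "d1"], tdm[, "d3"])   # = 0 here: no shared terms ("peach" vs "peaches" — stemming would fix this)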
     

    TF-IDF
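    A hedged sketch of the standard tf-idf weighting — term frequency multiplied by the log of (number of documents / document frequency) — applied to the toy tdm above:

    tf    <- t(t(tdm) / colSums(tdm))           # term frequency within each document
    idf   <- log(ncol(tdm) / rowSums(tdm > 0))  # inverse document frequency: log(N / df)
    tfidf <- tf * idf                           # frequent here but rare overall scores highest
    round(tfidf, 2)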

     
