type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • SVM finds the optimal decision boundary that separates data points from different groups (or classes), and then predicts the class of new observations based on this separation boundary. [STHDA] In other words, it finds the maximum margin separating hyperplane [Cornell].
  • In simple words, SVM tries to find the hyperplane that separates the data points as widely as possible; maximizing this margin improves the model’s accuracy on test (unseen) data. A minimal fitting sketch follows this list.
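A minimal sketch of this idea, assuming the kernlab package and a made-up, linearly separable toy dataset (the names `toy`, `x1`, `x2`, `class` are hypothetical, not part of the original post):
```{r}
# Minimal sketch: fit a linear SVM on a made-up, linearly separable toy dataset.
# kernlab is assumed to be installed; the data and column names are hypothetical.
library(kernlab)

set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  x2 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  class = factor(rep(c("A", "B"), each = 20))
)

# vanilladot = linear kernel; C is the cost of margin violations
fit <- ksvm(class ~ x1 + x2, data = toy,
            kernel = "vanilladot", C = 1, scaled = FALSE)
fit  # printing shows the number of support vectors and the training error
```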

Theory


A few terms need to be introduced first:
Hyperplane
In two dimensions the decision boundary is a straight line; with more dimensions, the decision boundary is called a “hyperplane”.
Support Vector
  • Points that are closest to the hyperplane. A separating line will be defined with the help of these data points. [vidhya]
  • Points that lie on the margins.
  • They are the data points that are most difficult to classify.
  • They have a direct bearing on the optimum location of the decision surface.
Margin
  • The distance between the hyperplane and the observations closest to the hyperplane (support vectors).
  • Large margin is considered a good margin.
  • There are two types of margins: hard margin and soft margin. More on these two in a later section. [vidhya] (A short sketch extracting the support vectors and the margin width follows the image below.)
notion image
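To make these terms concrete, here is a hedged sketch that continues the toy `fit` from the Definition sketch above: it lists the support vectors and computes the margin width 2/‖w‖ for the linear kernel.
```{r}
# Continuing the toy linear fit from the Definition sketch (hypothetical data).
# Support vectors: the training points closest to the hyperplane.
fit@SVindex         # row indices of the support vectors
toy[fit@SVindex, ]  # the support vectors themselves

# For a linear kernel the weight vector is w = sum_i (alpha_i * y_i) * x_i,
# where alpha_i * y_i is stored in fit@coef and x_i in fit@xmatrix.
w <- colSums(fit@coef[[1]] * fit@xmatrix[[1]])

# Margin width = 2 / ||w|| (the distance between the two margin lines)
2 / sqrt(sum(w^2))
```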

Knowing what these are, the next question is: which hyperplane does SVM select? There can be an infinite number of separating hyperplanes that classify the two classes perfectly. [vidhya]
SVM selects the hyperplane with the maximum margin, i.e., the one with the maximum distance between the two classes.

Assumption


  • Can handle both linear and non-linear class boundaries. [STHDA]
  • If the boundary is non-linear, use the kernel trick (see the short sketch after this list).
  • The margin should be as large as possible.
  • The support vectors are the most useful data points, because they are the ones most likely to be misclassified.
  • Data is independent and identically distributed.
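A hedged sketch of the kernel bullet above, using another made-up toy dataset (`circ`) whose true boundary is a circle: the same ksvm() call handles it once the kernel is switched from linear to RBF.
```{r}
# Sketch: swap the kernel when the class boundary is non-linear.
# `circ` is a made-up toy dataset with a circular boundary; kernlab assumed.
library(kernlab)

set.seed(2)
n <- 200
circ <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
circ$class <- factor(ifelse(circ$x1^2 + circ$x2^2 < 0.5, "inside", "outside"))

fit_linear <- ksvm(class ~ ., data = circ, kernel = "vanilladot", C = 1)
fit_rbf    <- ksvm(class ~ ., data = circ, kernel = "rbfdot",     C = 1)

error(fit_linear)  # a straight line cannot separate a circular boundary well
error(fit_rbf)     # the RBF kernel handles it
```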

Advantages of SVM


1. SVM works better when the data is linearly separable.
2. It is more effective in high dimensions.
3. With the help of the kernel trick, we can solve complex, non-linear problems.
4. SVM is not sensitive to outliers.
5. It can help us with image classification.

Disadvantages of SVM


1. Choosing a good kernel is not easy.
2. It doesn’t show good results on big datasets.
3. The SVM hyperparameters are the cost C and gamma. It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact. (A small tuning sketch follows this list.)
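A hedged sketch of disadvantage 3: a simple grid search over cost (C) and the RBF parameter sigma (kernlab's name for what is often called gamma), using ksvm()'s built-in k-fold cross-validation. It reuses the made-up `circ` data from the sketch in the Assumption section.
```{r}
# Sketch: grid search over C and sigma with 5-fold cross-validation.
# Reuses the toy `circ` data from the earlier sketch; kernlab assumed.
library(kernlab)

grid <- expand.grid(C = c(0.1, 1, 10, 100), sigma = c(0.1, 1, 10))
grid$cv_error <- apply(grid, 1, function(g) {
  fit <- ksvm(class ~ ., data = circ, kernel = "rbfdot",
              C = as.numeric(g["C"]),
              kpar = list(sigma = as.numeric(g["sigma"])),
              cross = 5)          # 5-fold cross-validation
  cross(fit)                      # cross-validation error
})

grid[which.min(grid$cv_error), ]  # best (C, sigma) on this small grid
```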

Example

[vidhya]
notion image
To classify green and blue points, we can have many decision boundaries, but the question is which is the best and how do we find it?
NOTE: Since we are plotting the data points in a 2-dimensional graph we call this decision boundary a straight line but if we have more dimensions, we call this decision boundary a “hyperplane”
notion image
The best hyperplane is the one that has the maximum distance from both classes; finding it is the main aim of SVM.
This is done by considering the hyperplanes that classify the labels correctly and then choosing the one that is farthest from the data points, i.e., the one with the maximum margin.
 
 
 
[Brendi]
notion image
notion image
notion image
 
 
 
 
 
 

R code

Lab

Fit the linear SVM to olive oils, using a training split of 2/3, using only regions 2 and 3 and the predictors linoleic and arachidic. Report the training and test error, list the support vectors, the coefficients for the support vectors and the equation for the separating hyperplane,

$$???\times\text{linoleic}+???\times\text{arachidic}+??? > 0$$

and make a plot of the boundary.

```{r}
# Packages assumed: tidyverse (dplyr, tidyr, ggplot2) and tidymodels
# (rsample, parsnip); the olive oils data is assumed to be available as `olive`.
library(tidyverse)
library(tidymodels)

notsouth <- olive %>%
  filter(region != 1) %>%
  select(region, linoleic, arachidic) %>%
  mutate(region = factor(region)) %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))
```

```{r}
set.seed(2021)
notsouth_split <- initial_split(notsouth, prop = 2/3, strata = region)
notsouth_tr <- training(notsouth_split)
notsouth_ts <- testing(notsouth_split)

library(kernlab)
svm_mod <- svm_rbf(cost = 10) %>%
  set_mode("classification") %>%
  set_engine("kernlab",
             kernel = "vanilladot",  # linear kernel, see ?kernlab::ksvm()
             scaled = FALSE)
notsouth_svm <- svm_mod %>%
  fit(region ~ ., data = notsouth_tr)
```

```{r}
notsouth_p <- as_tibble(expand_grid(linoleic = seq(-2.2, 2.2, 0.1),
                                    arachidic = seq(-2, 2, 0.1)))
notsouth_p <- notsouth_p %>%
  mutate(region_svm = predict(notsouth_svm, notsouth_p)$.pred_class)

ggplot() +
  # Predicted values over a grid
  geom_point(data = notsouth_p,
             aes(x = linoleic, y = arachidic, color = region_svm),
             alpha = 0.1) +
  # Overlay with actual data
  geom_point(data = notsouth,
             aes(x = linoleic, y = arachidic, color = region, shape = region)) +
  # Circle the support vectors
  geom_point(data = notsouth_tr %>% slice(notsouth_svm$fit@SVindex),  # extract support vectors
             aes(x = linoleic, y = arachidic),
             shape = 1, size = 3, colour = "black") +
  scale_color_brewer("", palette = "Dark2") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("SVM") +
  geom_abline(intercept = 1.45396, slope = -3.478113)
```

The $\alpha$'s, indexes of support vectors and $\beta_0$, and the observations that are the support vectors are:

```{r}
notsouth_svm$fit@coef     # alpha_i * y_i
notsouth_svm$fit@SVindex  # indexes (row numbers) of the support vectors
notsouth_svm$fit@b        # negative intercept term, -b_0

# Extract the support vectors using the indexes
notsouth_tr[notsouth_svm$fit@SVindex, ]       # support vectors; notsouth_tr %>% slice(6, 102, 132)
notsouth_tr$region[notsouth_svm$fit@SVindex]  # response variable of the support vectors
```

R interpretation

Forming equation from R output
notion image
  1. Make the equation (a sketch of assembling it from the ksvm output follows the image below).
notion image
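As a hedged sketch (not part of the original lab solution) of how the hyperplane equation can be assembled from the kernlab slots printed above; the actual numbers depend on the fitted model:
```{r}
# Sketch: assemble w1 * linoleic + w2 * arachidic + b0 > 0 from the ksvm slots
# used in the lab code above (values depend on the fit; treat as illustrative).
alpha_y <- notsouth_svm$fit@coef[[1]]     # alpha_i * y_i for each support vector
sv      <- notsouth_svm$fit@xmatrix[[1]]  # the support vectors themselves
w       <- colSums(alpha_y * sv)          # weight vector (w1, w2)
b0      <- -notsouth_svm$fit@b            # intercept (kernlab's @b is -b_0)

w
b0
# Decision rule: predict the positive class when
#   w[1] * linoleic + w[2] * arachidic + b0 > 0
# which, solved for arachidic, gives the boundary line drawn by geom_abline():
#   arachidic = -b0 / w[2] - (w[1] / w[2]) * linoleic
```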
We use a confusion matrix and a plot to interpret the fit (a short sketch follows the image below).
notion image
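A hedged sketch of the confusion-matrix step, assuming dplyr and yardstick are available and using the `notsouth_ts` test split from the lab code above:
```{r}
# Sketch: test-set confusion matrix and accuracy for the lab model.
# dplyr and yardstick assumed; notsouth_ts / notsouth_svm come from the lab code.
library(dplyr)
library(yardstick)

notsouth_ts_pred <- notsouth_ts %>%
  mutate(pred_region = predict(notsouth_svm, notsouth_ts)$.pred_class)

conf_mat(notsouth_ts_pred, truth = region, estimate = pred_region)  # confusion matrix
accuracy(notsouth_ts_pred, truth = region, estimate = pred_region)  # test accuracy
```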

Math


The full derivation can be found in [vidhya].
notion image
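For reference, the standard optimisation problems that the [vidhya] derivation arrives at are (hard margin first, then the soft margin with slack variables):
$$\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n$$
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
Here the slack variables $\xi_i$ measure margin violations, and the cost $C$ trades them off against the margin width $2/\lVert\mathbf{w}\rVert$.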

FAQ
What is the difference between LDA and SVM?
Normal distribution assumption
LDA assumes that the classes share a common covariance matrix and that the data are normally distributed. SVM makes no such assumption.
Dataset
SVM focuses only on the points that are difficult to classify (the support vectors), whereas LDA uses all data points.
Empirical
SVM doesn't naturally discriminate between more than two classes (multi-class problems are handled by combining binary classifiers). An outlier-robust alternative is logistic classification. LDA handles several classes well, as long as its assumptions are met.
What is the difference between logistic regression and SVM?
SVM is defined in terms of the support vectors only; we don’t have to worry about the other observations, since the margin is determined by the points closest to the hyperplane (the support vectors). In logistic regression, by contrast, the classifier is defined over all the points. Hence SVM enjoys some natural speed-ups. [vidhya]
Does adding more data points affect the SVM?
It depends.
If the new points lie outside the margin, they barely affect the decision boundary; if they lie inside the margin (or on the wrong side of it), they do. A small sketch follows the image below.
notion image
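A hedged sketch of this (made-up toy data, kernlab assumed): a new point far outside the margin leaves the support vectors, and hence the boundary, essentially unchanged, while a point near the boundary changes them.
```{r}
# Sketch: points outside the margin barely move the boundary; points near or
# inside it do. `d`, `x1`, `x2`, `class` are made-up toy names; kernlab assumed.
library(kernlab)

set.seed(3)
d <- data.frame(x1 = c(rnorm(20, -2), rnorm(20, 2)),
                x2 = c(rnorm(20, -2), rnorm(20, 2)),
                class = factor(rep(c("A", "B"), each = 20)))
fit0 <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 1, scaled = FALSE)

# A new "A" point far from the boundary: the support vectors stay the same
far      <- rbind(d, data.frame(x1 = -8, x2 = -8, class = "A"))
fit_far  <- ksvm(class ~ ., data = far, kernel = "vanilladot", C = 1, scaled = FALSE)

# A new "A" point sitting near the boundary: the support vectors change
near     <- rbind(d, data.frame(x1 = 0.5, x2 = 0.5, class = "A"))
fit_near <- ksvm(class ~ ., data = near, kernel = "vanilladot", C = 1, scaled = FALSE)

length(fit0@SVindex); length(fit_far@SVindex); length(fit_near@SVindex)
```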
What is the difference between soft margin and hard margin ?
A soft margin allows some misclassification (margin violations), whereas a hard margin does not. In practice the cost parameter C controls this trade-off; see the sketch below.
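A hedged sketch of the difference, reusing the toy `d` data from the previous sketch: a small cost C gives a soft margin (many violations tolerated, many support vectors), while a very large C approximates a hard margin.
```{r}
# Sketch: the cost parameter C controls how soft the margin is.
# Reuses the toy data frame `d` from the previous sketch; kernlab assumed.
fit_soft <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 0.01, scaled = FALSE)
fit_hard <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 1000, scaled = FALSE)

length(fit_soft@SVindex)  # soft margin: more support vectors (violations allowed)
length(fit_hard@SVindex)  # ~hard margin: only the few points that pin the boundary
error(fit_soft)
error(fit_hard)
```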

Kernel trick


For a dataset that is not linearly separable (e.g., the true decision boundary is an ellipse), we extend the formulation by mapping the original data into a new, higher-dimensional feature space.
  • A kernel is a function that computes the dot product of points as if they had been mapped into that higher-dimensional space, without actually transforming all the points and computing the dot product there (a small numeric check follows this list).
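A small numeric check of this idea (a hedged sketch, not from the tutorial): for the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$, the kernel value computed in the original 2-d space equals the dot product of the explicitly mapped features $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.
```{r}
# Numeric check of the kernel trick for the degree-2 polynomial kernel
# K(x, z) = (x . z)^2 with explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
x <- c(1, 2)
z <- c(3, 4)

phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

sum(x * z)^2          # 121: kernel computed in the original 2-d space
sum(phi(x) * phi(z))  # 121: same value via an explicit trip to 3-d feature space
```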
ETC3250’s Tute 6 breaks down the formula
notion image
notion image
notion image

Reference


Brendi notes [Brendi]

Extra Resource


Stanford lecture
Lab
 
 