Created date
Jun 16, 2022 01:21 PM
Data Science
Machine Learning


  • SVM finds the optimal decision boundary that separates data points from different groups (or classes), and then predicts the class of new observations based on this separation boundary. [STHDA] In other words, it finds the maximum margin separating hyperplane. [Cornell]
  • In simple words, SVM tries to find the hyperplane that separates the data points as widely as possible, since this margin maximization improves the model’s accuracy on test (unseen) data.
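This margin-maximization goal can be written as the standard hard-margin optimization problem (a textbook formulation, not from these notes):

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\lVert\beta\rVert^2 \quad\text{subject to}\quad y_i(x_i^\top\beta+\beta_0)\ \ge\ 1,\quad i=1,\dots,n$$

The margin width is $2/\lVert\beta\rVert$, so minimizing $\lVert\beta\rVert$ maximizes the margin.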


Some terms need to be introduced:
Hyperplane
  • The decision boundary. In two dimensions it is a straight line; with more dimensions we call it a “hyperplane”.
Support Vectors
  • The data points that are closest to the hyperplane. The separating line is defined with the help of these data points. [vidhya]
  • They lie on the margins.
  • They are the data points most difficult to classify.
  • They have a direct bearing on the optimum location of the decision surface.
Margin
  • The distance between the hyperplane and the observations closest to the hyperplane (the support vectors).
  • A large margin is considered a good margin.
  • There are two types of margins, hard margin and soft margin. More on these two in a later section. [vidhya]
notion image

Knowing what these are, the next question is: which hyperplane does SVM select? There can be an infinite number of hyperplanes passing through a point and classifying the two classes perfectly. [vidhya]
SVM selects the hyperplane with the maximum margin, i.e. the one that maximizes the distance between the two classes.


  • Can handle both linear and non-linear class boundaries. [STHDA]
  • If non-linear, use kernel trick.
  • The margin should be as large as possible.
  • The SVs are the most useful data points because they are the ones most likely to be incorrectly classified.
  • Observations are independent and identically distributed.

Advantages of SVM
1. SVM works better when the data is linearly separable
2. It is more effective in high dimensions
3. With the help of the kernel trick, we can solve complex (non-linear) problems
4. SVM is relatively insensitive to outliers (with a soft margin)
5. It can help us with image classification

Disadvantages of SVM
1. Choosing a good kernel is not easy
2. It doesn’t show good results on a big dataset
3. The SVM hyperparameters are cost (C) and gamma. It is not that easy to fine-tune these hyperparameters, and it is hard to visualize their impact


notion image
To classify green and blue points, we can have many decision boundaries, but the question is which is the best and how do we find it?
NOTE: Since we are plotting the data points in a 2-dimensional graph we call this decision boundary a straight line but if we have more dimensions, we call this decision boundary a “hyperplane”
notion image
The best hyperplane is the one that has the maximum distance from both classes; this is the main aim of SVM.
SVM considers the hyperplanes that classify the labels well, then chooses the one farthest from the data points, i.e. the one with the maximum margin.
notion image
notion image
notion image

R code

Fit the linear SVM to olive oils, using a training split of 2/3, using only regions 2, 3, and the predictors linoleic and arachidic. Report the training and test error, list the support vectors, the coefficients for the support vectors and the equation for the separating hyperplane, $$???\times\text{linoleic}+???\times\text{arachidic}+??? > 0$$ and make a plot of the boundary.

```{r}
notsouth <- olive %>%
  filter(region != 1) %>%
  select(region, linoleic, arachidic) %>%
  mutate(region = factor(region)) %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))
```

```{r}
set.seed(2021)
notsouth_split <- initial_split(notsouth, prop = 2/3, strata = region)
notsouth_tr <- training(notsouth_split)
notsouth_ts <- testing(notsouth_split)

library(kernlab)
svm_mod <- svm_rbf(cost = 10) %>%
  set_mode("classification") %>%
  set_engine("kernlab",
             kernel = "vanilladot", # linear kernel, see ?kernlab::ksvm()
             scaled = FALSE)
notsouth_svm <- svm_mod %>%
  fit(region ~ ., data = notsouth_tr)
```

```{r}
notsouth_p <- as_tibble(expand_grid(linoleic = seq(-2.2, 2.2, 0.1),
                                    arachidic = seq(-2, 2, 0.1)))
notsouth_p <- notsouth_p %>%
  mutate(region_svm = predict(notsouth_svm, notsouth_p)$.pred_class)

ggplot() +
  # Predicted values
  geom_point(data = notsouth_p,
             aes(x = linoleic, y = arachidic, color = region_svm),
             alpha = 0.1) +
  # Overlay with actual data
  geom_point(data = notsouth,
             aes(x = linoleic, y = arachidic, color = region, shape = region)) +
  # Circle the support vectors
  geom_point(data = notsouth_tr %>% slice(notsouth_svm$fit@SVindex), # Extract support vectors
             aes(x = linoleic, y = arachidic),
             shape = 1, size = 3, colour = "black") +
  scale_color_brewer("", palette = "Dark2") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("SVM") +
  geom_abline(intercept = 1.45396, slope = -3.478113)
```

The $\alpha$'s, indexes of support vectors and $\beta_0$, and the observations that are the support vectors are:

```{r}
notsouth_svm$fit@coef    # alpha_i * y_i
notsouth_svm$fit@SVindex # Indexes (row numbers) of support vectors
notsouth_svm$fit@b       # Negative intercept term, -b_0

# Extract support vectors using indexes
notsouth_tr[notsouth_svm$fit@SVindex, ] # Support vectors; notsouth_tr %>% slice(6, 102, 132)
notsouth_tr$region[notsouth_svm$fit@SVindex] # Response variable of support vectors
```

R interpretation

Forming equation from R output
notion image
  1. Make the equation
notion image
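The pieces in the R output combine as $\beta = \sum_i (\alpha_i y_i)\, x_i$ and $\beta_0 = -b$, which gives the coefficients of the hyperplane equation. A self-contained sketch with kernlab on made-up data (the column names are hypothetical stand-ins for the olive-oil predictors):

```r
library(kernlab)

# Made-up, roughly separable 2-D data (hypothetical stand-in for the olive oils)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
colnames(x) <- c("linoleic", "arachidic")
y <- factor(rep(c(2, 3), each = 20))

fit <- ksvm(x, y, kernel = "vanilladot", C = 10, scaled = FALSE)

alpha_y <- fit@coef[[1]]         # alpha_i * y_i for each support vector
sv      <- fit@xmatrix[[1]]      # the support vectors themselves
beta    <- colSums(alpha_y * sv) # slope coefficients of the hyperplane
beta0   <- -fit@b                # @b stores the NEGATIVE intercept

# Separating hyperplane: beta[1]*linoleic + beta[2]*arachidic + beta0 = 0
beta
beta0
```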
We use a confusion matrix and a plot to interpret the fit.
notion image
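A confusion matrix can be computed with base R's `table()`; a toy sketch with made-up true and predicted labels:

```r
# Made-up true and predicted class labels
truth <- factor(c(2, 2, 2, 3, 3, 3, 3, 2))
pred  <- factor(c(2, 2, 3, 3, 3, 3, 2, 2))

cm <- table(Predicted = pred, Truth = truth)
cm

# Misclassification (error) rate: off-diagonal counts over total
1 - sum(diag(cm)) / sum(cm)  # 0.25 here
```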


The following Q&A can be found in [vidhya].
notion image

What is the difference between LDA and SVM?
Normal distribution assumption:
LDA assumes that the data points have the same covariance and that the probability density is normally distributed; SVM makes no such assumption.
SVM focuses only on the points that are difficult to classify (to find the support vectors), whereas LDA uses all data points.
SVM doesn't really discriminate well between more than two classes.
An outlier-robust alternative is logistic classification. LDA handles several classes well, as long as its assumptions are met.
What is the difference between logistic regression and SVM?
SVM is defined in terms of the support vectors only; we don’t have to worry about the other observations, since the margin is made using the points closest to the hyperplane (the support vectors). In logistic regression, the classifier is defined over all the points. Hence SVM enjoys some natural speed-ups. [vidhya]
Does adding more data points affect the SVM?
It depends.
If the new points lie outside the margin of the hyperplane, they do not affect the decision boundary much; otherwise they do.
notion image
What is the difference between soft margin and hard margin ?
A soft margin allows some misclassification, whereas a hard margin does not.
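In practice the cost parameter C controls this trade-off: a small C gives a soft margin (more misclassification tolerated, wider margin, more support vectors), while a very large C approximates a hard margin. A sketch with kernlab on made-up, overlapping toy data:

```r
library(kernlab)

# Overlapping two-class toy data (made-up numbers)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = -1), ncol = 2),
           matrix(rnorm(40, mean =  1), ncol = 2))
y <- factor(rep(c("A", "B"), each = 20))

soft <- ksvm(x, y, kernel = "vanilladot", C = 0.01, scaled = FALSE)  # soft margin
hard <- ksvm(x, y, kernel = "vanilladot", C = 1000, scaled = FALSE)  # ~hard margin

nSV(soft)  # small C: wide margin, many support vectors
nSV(hard)  # large C: narrow margin, fewer support vectors
```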

Kernel trick

For a dataset that is not linearly separable (e.g., the decision boundary is an ellipse), we extend the formulation by mapping the original data into a new, higher-dimensional space.
  • A kernel is a function that computes the dot product of points as if they were mapped into the higher-dimensional space, without actually transforming ALL the points into that feature space and calculating the dot product there.
ETC3250’s Tute 6 breaks down the formula
notion image
notion image
notion image
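The trick can be checked numerically: the degree-2 polynomial kernel $K(x,z) = (x\cdot z)^2$ equals the dot product of the explicit feature maps $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$, so no explicit transformation is ever needed. A base-R sketch:

```r
# Explicit feature map for the degree-2 polynomial kernel (2-D input -> 3-D)
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

# Kernel: dot product in the ORIGINAL 2-D space, squared
K <- function(x, z) (sum(x * z))^2

x <- c(1, 2)
z <- c(3, 4)

K(x, z)              # computed in the original space: (1*3 + 2*4)^2 = 121
sum(phi(x) * phi(z)) # same value via the explicit 3-D feature map: 121
```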


Brendi notes [Brendi]

Extra Resource

Stanford lecture
Random Forest
Classification tree