type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • SVM finds the optimal decision boundary that separates data points from different groups (or classes), and then predicts the class of new observations based on this separation boundary. [STHDA] In other words, it finds the maximum margin separating hyperplane [Cornell].
  • In simple words, SVM tries to find the hyperplane that separates the data points as widely as possible; maximizing this margin improves the model’s accuracy on test (unseen) data. A minimal fitting sketch follows this list.
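A minimal sketch of this idea, assuming the kernlab package and a made-up, linearly separable toy dataset (the names `toy`, `x1`, `x2`, `class` are hypothetical, not part of the original post):
```{r}
# Minimal sketch: fit a linear SVM on a made-up, linearly separable toy dataset.
# kernlab is assumed to be installed; the data and column names are hypothetical.
library(kernlab)

set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  x2 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  class = factor(rep(c("A", "B"), each = 20))
)

# vanilladot = linear kernel; C is the cost of margin violations
fit <- ksvm(class ~ x1 + x2, data = toy,
            kernel = "vanilladot", C = 1, scaled = FALSE)
fit  # printing shows the number of support vectors and the training error
```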

Theory


A few terms need to be introduced first:
Hyperplane
In two dimensions the decision boundary is a straight line; with more dimensions, the decision boundary is called a “hyperplane”.
Support Vector
  • Points that are closest to the hyperplane. A separating line will be defined with the help of these data points. [vidhya]
  • Points that lie on the margins.
  • They are the data points that are most difficult to classify.
  • They have a direct bearing on the optimum location of the decision surface.
Margin
  • The distance between the hyperplane and the observations closest to the hyperplane (support vectors).
  • Large margin is considered a good margin.
  • There are two types of margins: hard margin and soft margin. More on these two in a later section. [vidhya] (A short sketch extracting the support vectors and the margin width follows the image below.)
notion image
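To make these terms concrete, here is a hedged sketch that continues the toy `fit` from the Definition sketch above: it lists the support vectors and computes the margin width 2/‖w‖ for the linear kernel.
```{r}
# Continuing the toy linear fit from the Definition sketch (hypothetical data).
# Support vectors: the training points closest to the hyperplane.
fit@SVindex         # row indices of the support vectors
toy[fit@SVindex, ]  # the support vectors themselves

# For a linear kernel the weight vector is w = sum_i (alpha_i * y_i) * x_i,
# where alpha_i * y_i is stored in fit@coef and x_i in fit@xmatrix.
w <- colSums(fit@coef[[1]] * fit@xmatrix[[1]])

# Margin width = 2 / ||w|| (the distance between the two margin lines)
2 / sqrt(sum(w^2))
```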

Knowing what these are, the next question is: which hyperplane does SVM select? There can be an infinite number of separating hyperplanes that classify the two classes perfectly. [vidhya]
SVM selects the hyperplane with the maximum margin, i.e., the one with the maximum distance between the two classes.

Assumption


  • Can handle both linear and non-linear class boundaries. [STHDA]
  • If the boundary is non-linear, use the kernel trick (see the short sketch after this list).
  • The margin should be as large as possible.
  • The support vectors are the most useful data points, because they are the ones most likely to be misclassified.
  • Data is independent and identically distributed.
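A hedged sketch of the kernel bullet above, using another made-up toy dataset (`circ`) whose true boundary is a circle: the same ksvm() call handles it once the kernel is switched from linear to RBF.
```{r}
# Sketch: swap the kernel when the class boundary is non-linear.
# `circ` is a made-up toy dataset with a circular boundary; kernlab assumed.
library(kernlab)

set.seed(2)
n <- 200
circ <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
circ$class <- factor(ifelse(circ$x1^2 + circ$x2^2 < 0.5, "inside", "outside"))

fit_linear <- ksvm(class ~ ., data = circ, kernel = "vanilladot", C = 1)
fit_rbf    <- ksvm(class ~ ., data = circ, kernel = "rbfdot",     C = 1)

error(fit_linear)  # a straight line cannot separate a circular boundary well
error(fit_rbf)     # the RBF kernel handles it
```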

Advantages of SVM


1. SVM works better when the data is linearly separable.
2. It is more effective in high dimensions.
3. With the help of the kernel trick, we can solve complex, non-linear problems.
4. SVM is not sensitive to outliers.
5. It can help us with image classification.

Disadvantages of SVM


1. Choosing a good kernel is not easy.
2. It doesn’t show good results on big datasets.
3. The SVM hyperparameters are the cost C and gamma. It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact. (A small tuning sketch follows this list.)
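A hedged sketch of disadvantage 3: a simple grid search over cost (C) and the RBF parameter sigma (kernlab's name for what is often called gamma), using ksvm()'s built-in k-fold cross-validation. It reuses the made-up `circ` data from the sketch in the Assumption section.
```{r}
# Sketch: grid search over C and sigma with 5-fold cross-validation.
# Reuses the toy `circ` data from the earlier sketch; kernlab assumed.
library(kernlab)

grid <- expand.grid(C = c(0.1, 1, 10, 100), sigma = c(0.1, 1, 10))
grid$cv_error <- apply(grid, 1, function(g) {
  fit <- ksvm(class ~ ., data = circ, kernel = "rbfdot",
              C = as.numeric(g["C"]),
              kpar = list(sigma = as.numeric(g["sigma"])),
              cross = 5)          # 5-fold cross-validation
  cross(fit)                      # cross-validation error
})

grid[which.min(grid$cv_error), ]  # best (C, sigma) on this small grid
```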

Example

[vidhya]
notion image
To classify green and blue points, we can have many decision boundaries, but the question is which is the best and how do we find it?
NOTE: Since we are plotting the data points in a 2-dimensional graph we call this decision boundary a straight line but if we have more dimensions, we call this decision boundary a “hyperplane”
notion image
The best hyperplane is the one that has the maximum distance from both classes; finding it is the main aim of SVM.
This is done by considering the hyperplanes that classify the labels correctly and then choosing the one that is farthest from the data points, i.e., the one with the maximum margin.
 
 
 
[Brendi]
notion image
notion image
notion image
 
 
 
 
 
 

R code

Lab

Fit the linear SVM to olive oils, using a training split of 2/3, using only regions 2 and 3 and the predictors linoleic and arachidic. Report the training and test error, list the support vectors, the coefficients for the support vectors and the equation for the separating hyperplane,

$$???\times\text{linoleic}+???\times\text{arachidic}+??? > 0$$

and make a plot of the boundary.

```{r}
# Packages assumed: tidyverse (dplyr, tidyr, ggplot2) and tidymodels
# (rsample, parsnip); the olive oils data is assumed to be available as `olive`.
library(tidyverse)
library(tidymodels)

notsouth <- olive %>%
  filter(region != 1) %>%
  select(region, linoleic, arachidic) %>%
  mutate(region = factor(region)) %>%
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))
```

```{r}
set.seed(2021)
notsouth_split <- initial_split(notsouth, prop = 2/3, strata = region)
notsouth_tr <- training(notsouth_split)
notsouth_ts <- testing(notsouth_split)

library(kernlab)
svm_mod <- svm_rbf(cost = 10) %>%
  set_mode("classification") %>%
  set_engine("kernlab",
             kernel = "vanilladot",  # linear kernel, see ?kernlab::ksvm()
             scaled = FALSE)
notsouth_svm <- svm_mod %>%
  fit(region ~ ., data = notsouth_tr)
```

```{r}
notsouth_p <- as_tibble(expand_grid(linoleic = seq(-2.2, 2.2, 0.1),
                                    arachidic = seq(-2, 2, 0.1)))
notsouth_p <- notsouth_p %>%
  mutate(region_svm = predict(notsouth_svm, notsouth_p)$.pred_class)

ggplot() +
  # Predicted values over a grid
  geom_point(data = notsouth_p,
             aes(x = linoleic, y = arachidic, color = region_svm),
             alpha = 0.1) +
  # Overlay with actual data
  geom_point(data = notsouth,
             aes(x = linoleic, y = arachidic, color = region, shape = region)) +
  # Circle the support vectors
  geom_point(data = notsouth_tr %>% slice(notsouth_svm$fit@SVindex),  # extract support vectors
             aes(x = linoleic, y = arachidic),
             shape = 1, size = 3, colour = "black") +
  scale_color_brewer("", palette = "Dark2") +
  theme_bw() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("SVM") +
  geom_abline(intercept = 1.45396, slope = -3.478113)
```

The $\alpha$'s, indexes of support vectors and $\beta_0$, and the observations that are the support vectors are:

```{r}
notsouth_svm$fit@coef     # alpha_i * y_i
notsouth_svm$fit@SVindex  # indexes (row numbers) of the support vectors
notsouth_svm$fit@b        # negative intercept term, -b_0

# Extract the support vectors using the indexes
notsouth_tr[notsouth_svm$fit@SVindex, ]       # support vectors; notsouth_tr %>% slice(6, 102, 132)
notsouth_tr$region[notsouth_svm$fit@SVindex]  # response variable of the support vectors
```

R interpretation

Forming equation from R output
notion image
  1. Make the equation (a sketch of assembling it from the ksvm output follows the image below).
notion image
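As a hedged sketch (not part of the original lab solution) of how the hyperplane equation can be assembled from the kernlab slots printed above; the actual numbers depend on the fitted model:
```{r}
# Sketch: assemble w1 * linoleic + w2 * arachidic + b0 > 0 from the ksvm slots
# used in the lab code above (values depend on the fit; treat as illustrative).
alpha_y <- notsouth_svm$fit@coef[[1]]     # alpha_i * y_i for each support vector
sv      <- notsouth_svm$fit@xmatrix[[1]]  # the support vectors themselves
w       <- colSums(alpha_y * sv)          # weight vector (w1, w2)
b0      <- -notsouth_svm$fit@b            # intercept (kernlab's @b is -b_0)

w
b0
# Decision rule: predict the positive class when
#   w[1] * linoleic + w[2] * arachidic + b0 > 0
# which, solved for arachidic, gives the boundary line drawn by geom_abline():
#   arachidic = -b0 / w[2] - (w[1] / w[2]) * linoleic
```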
We use a confusion matrix and a plot to interpret the fit (a short sketch follows the image below).
notion image
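A hedged sketch of the confusion-matrix step, assuming dplyr and yardstick are available and using the `notsouth_ts` test split from the lab code above:
```{r}
# Sketch: test-set confusion matrix and accuracy for the lab model.
# dplyr and yardstick assumed; notsouth_ts / notsouth_svm come from the lab code.
library(dplyr)
library(yardstick)

notsouth_ts_pred <- notsouth_ts %>%
  mutate(pred_region = predict(notsouth_svm, notsouth_ts)$.pred_class)

conf_mat(notsouth_ts_pred, truth = region, estimate = pred_region)  # confusion matrix
accuracy(notsouth_ts_pred, truth = region, estimate = pred_region)  # test accuracy
```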

Math


The full derivation can be found in [vidhya].
notion image
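For reference, the standard optimisation problems that the [vidhya] derivation arrives at are (hard margin first, then the soft margin with slack variables):
$$\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n$$
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
Here the slack variables $\xi_i$ measure margin violations, and the cost $C$ trades them off against the margin width $2/\lVert\mathbf{w}\rVert$.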

FAQ
What is the difference between LDA and SVM?
Normal distribution assumption
LDA assumes that the classes share a common covariance matrix and that the data are normally distributed. SVM makes no such assumption.
Dataset
SVM focuses only on the points that are difficult to classify (the support vectors), whereas LDA uses all data points.
Empirical
SVM doesn't naturally discriminate between more than two classes (multi-class problems are handled by combining binary classifiers). An outlier-robust alternative is logistic classification. LDA handles several classes well, as long as its assumptions are met.
What is the difference between logistic regression and SVM?
SVM is defined in terms of the support vectors only; we don’t have to worry about the other observations, since the margin is determined by the points closest to the hyperplane (the support vectors). In logistic regression, by contrast, the classifier is defined over all the points. Hence SVM enjoys some natural speed-ups. [vidhya]
Does adding more data points affect the SVM?
It depends.
If the new points lie outside the margin, they barely affect the decision boundary; if they lie inside the margin (or on the wrong side of it), they do. A small sketch follows the image below.
notion image
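A hedged sketch of this (made-up toy data, kernlab assumed): a new point far outside the margin leaves the support vectors, and hence the boundary, essentially unchanged, while a point near the boundary changes them.
```{r}
# Sketch: points outside the margin barely move the boundary; points near or
# inside it do. `d`, `x1`, `x2`, `class` are made-up toy names; kernlab assumed.
library(kernlab)

set.seed(3)
d <- data.frame(x1 = c(rnorm(20, -2), rnorm(20, 2)),
                x2 = c(rnorm(20, -2), rnorm(20, 2)),
                class = factor(rep(c("A", "B"), each = 20)))
fit0 <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 1, scaled = FALSE)

# A new "A" point far from the boundary: the support vectors stay the same
far      <- rbind(d, data.frame(x1 = -8, x2 = -8, class = "A"))
fit_far  <- ksvm(class ~ ., data = far, kernel = "vanilladot", C = 1, scaled = FALSE)

# A new "A" point sitting near the boundary: the support vectors change
near     <- rbind(d, data.frame(x1 = 0.5, x2 = 0.5, class = "A"))
fit_near <- ksvm(class ~ ., data = near, kernel = "vanilladot", C = 1, scaled = FALSE)

length(fit0@SVindex); length(fit_far@SVindex); length(fit_near@SVindex)
```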
What is the difference between soft margin and hard margin ?
A soft margin allows some misclassification (margin violations), whereas a hard margin does not. In practice the cost parameter C controls this trade-off; see the sketch below.
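A hedged sketch of the difference, reusing the toy `d` data from the previous sketch: a small cost C gives a soft margin (many violations tolerated, many support vectors), while a very large C approximates a hard margin.
```{r}
# Sketch: the cost parameter C controls how soft the margin is.
# Reuses the toy data frame `d` from the previous sketch; kernlab assumed.
fit_soft <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 0.01, scaled = FALSE)
fit_hard <- ksvm(class ~ ., data = d, kernel = "vanilladot", C = 1000, scaled = FALSE)

length(fit_soft@SVindex)  # soft margin: more support vectors (violations allowed)
length(fit_hard@SVindex)  # ~hard margin: only the few points that pin the boundary
error(fit_soft)
error(fit_hard)
```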

Kernel trick


For a dataset that is not linearly separable (e.g., the true decision boundary is an ellipse), we extend the formulation by mapping the original data into a new, higher-dimensional feature space.
  • A kernel is a function that computes the dot product of points as if they had been mapped into that higher-dimensional space, without actually transforming all the points and computing the dot product there (a small numeric check follows this list).
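A small numeric check of this idea (a hedged sketch, not from the tutorial): for the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$, the kernel value computed in the original 2-d space equals the dot product of the explicitly mapped features $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.
```{r}
# Numeric check of the kernel trick for the degree-2 polynomial kernel
# K(x, z) = (x . z)^2 with explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
x <- c(1, 2)
z <- c(3, 4)

phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

sum(x * z)^2          # 121: kernel computed in the original 2-d space
sum(phi(x) * phi(z))  # 121: same value via an explicit trip to 3-d feature space
```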
ETC3250’s Tute 6 breaks down the formula
notion image
notion image
notion image

Reference


Brendi notes [Brendi]

Extra Resource


Stanford lecture
Lab
 
 