
I've recently started learning to work with sklearn and have just come across this peculiar result.

I used the digits dataset available in sklearn to try different models and estimation methods.

When I tested a Support Vector Machine model on the data, I found out there are two different classes in sklearn for SVM classification: SVC and LinearSVC, where the former uses the one-against-one approach and the latter uses the one-against-rest approach.

I didn't know what effect that could have on the results, so I tried both. I did a Monte Carlo-style estimation where I ran both models 500 times, each time splitting the sample randomly into 60% training and 40% test and calculating the prediction error on the test set. A sketch of the loop is below.
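This isn't my exact code, just a minimal sketch of the simulation (the variable names are made up, and I'm using the current sklearn.model_selection import; at the time, train_test_split lived in sklearn.cross_validation):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC, LinearSVC

    X, y = load_digits(return_X_y=True)

    svc_errors, linsvc_errors = [], []
    for i in range(500):
        # Fresh random 60/40 split on every iteration
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.6, test_size=0.4, random_state=i)

        # SVC with default settings (rbf kernel, one-vs-one multiclass)
        svc = SVC().fit(X_train, y_train)
        svc_errors.append(1 - svc.score(X_test, y_test))

        # LinearSVC (liblinear, one-vs-rest multiclass)
        lin = LinearSVC().fit(X_train, y_train)
        linsvc_errors.append(1 - lin.score(X_test, y_test))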

The regular SVC estimator produced the following histogram of errors:

[Histogram: SVC error rate]

while the linear SVC estimator produced the following histogram:

[Histogram: LinearSVC error rate]

What could account for such a stark difference? Why does the linear model have so much higher accuracy most of the time?

And, relatedly, what could be causing the stark polarization in the results: an accuracy either close to 1 or close to 0, with nothing in between?

For comparison, a decision tree classifier, evaluated the same way, produced a much more normally distributed error rate, with an accuracy of around 0.85.
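The tree baseline was run with the same loop (again only a sketch; a DecisionTreeClassifier with default settings, reusing X, y and the split from above):

    from sklearn.tree import DecisionTreeClassifier

    tree_errors = []
    for i in range(500):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.6, test_size=0.4, random_state=i)
        tree = DecisionTreeClassifier().fit(X_train, y_train)
        tree_errors.append(1 - tree.score(X_test, y_test))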

metjush
  • I assume the scikit-learn documentation does not highlight the difference? Did you check? – Rohit Sep 02 '15 at 15:06
  • What kernel did you use in SVC? Default settings = "rbf"? One-against-one and one-against-all are different approaches. – kpb Sep 02 '15 at 15:10
  • the documentation is kinda sparse/vague on the topic. It mentions the difference between one-against-one and one-against-rest, and that LinearSVC is "Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better (to large numbers of samples)". – metjush Sep 02 '15 at 15:11
  • for regular SVC, I used the default kernel. I know 1v1 and 1vR are different approaches, but I guess that's what I want to know: why do they produce such different results? Is it the kernel choice or the different approach to multiple-category classification? – metjush Sep 02 '15 at 15:12
  • Titles for X and Y axis would help :-) – dzieciou Oct 08 '20 at 11:28
  • check https://scikit-learn.org/stable/modules/svm.html#svm-mathematical-formulation – Ferroao May 09 '21 at 20:16