I'm learning about Support Vector Machines (SVMs). I understand that their objective is to maximize the margin, i.e. the distance between the separating hyperplane and the points closest to it. For simplicity, assume we have a linear SVM and linearly separable data points $X_i$. We would like to find the maximum distance $D$, together with a weight vector $W$ and bias $W_0$, such that
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\geq D$ for all $X_i$ in the first class
$\dfrac{W^T \cdot X_i+W_0}{\|W\|}\leq-D$ for all $X_i$ in the second class.
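(For context, I am using the standard fact that the signed distance from a point $x$ to the hyperplane $W^T\cdot x+W_0=0$ is $\dfrac{W^T \cdot x+W_0}{\|W\|}$, so the left-hand sides above are exactly the signed distances of the $X_i$ from the decision boundary.)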
Why is this problem equivalent to maximizing $\frac{1}{\|W\|}$ subject to the following constraints:
$W^T\cdot X_i+W_0\geq1$ for all $X_i$ in the first class
$W^T\cdot X_i+W_0\leq-1$ for all $X_i$ in the second class.
I'm more interested in a formal proof than in the intuition.
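Not as a substitute for a proof, but to make the claim concrete, here is a small numerical check I would run (this assumes scikit-learn's `SVC` with a linear kernel and a very large `C` as a stand-in for the hard-margin SVM; the toy data set is my own construction):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable clusters in R^2.
X_pos = rng.normal(loc=[+2.0, +2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard-margin SVM
w = clf.coef_.ravel()
w0 = clf.intercept_[0]

# Functional margins y_i * (W^T X_i + W_0): should all be >= 1 (up to solver
# tolerance), with approximate equality at the support vectors.
functional_margins = y * (X @ w + w0)
print("min functional margin:", functional_margins.min())

# Geometric margin: the smallest distance of any point to the hyperplane,
# which should coincide with 1 / ||W||, the quantity being maximized.
distances = np.abs(X @ w + w0) / np.linalg.norm(w)
print("min distance to hyperplane:", distances.min())
print("1 / ||W||:", 1.0 / np.linalg.norm(w))
```

If the reformulation is correct, the last two printed numbers should coincide (up to solver tolerance) and the minimum functional margin should be about 1. That is consistent with the equivalence, but it is exactly the step from the first formulation to the second that I would like to see justified formally.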