I am writing a bachelor's thesis on a machine learning topic involving Generalized Learning Vector Quantization (GLVQ). Most papers I read give only a brief explanation of the mathematics behind it. I want to understand the mathematics deeply and also present it in full. I am almost through the process, but one step is unclear to me.
Let $\textbf{x}_i$ be an input vector and $\textbf{m}^\pm$ the prototypes associated with the distances $d^\pm$. Let $$\mu(\textbf{x}) = \frac{d^+ - d^-}{d^+ + d^-}$$ be a function measuring the relative proximity of an input vector to the prototypes, where $d^+$ is the distance to the nearest prototype $\textbf{m}^+$ of the same class and $d^-$ the distance to the nearest prototype $\textbf{m}^-$ of a different class.
Let the cost function be defined as $$S = \sum^{N}_{i=1} f(\mu(\textbf{x}_i)).$$
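For concreteness, here is a minimal numerical sketch of these two definitions (the code and all names in it are my own illustration, and it already assumes the squared Euclidean distance from the papers, which I introduce further below):

```python
import numpy as np

def mu(x, prototypes, labels, x_label):
    """Relative distance mu(x) = (d+ - d-) / (d+ + d-)."""
    d = np.sum((prototypes - x) ** 2, axis=1)      # squared Euclidean distances
    d_plus = np.min(d[labels == x_label])          # nearest prototype, same class
    d_minus = np.min(d[labels != x_label])         # nearest prototype, other class
    return (d_plus - d_minus) / (d_plus + d_minus)

def cost(X, y, prototypes, labels, f=lambda t: 1 / (1 + np.exp(-t))):
    """Cost S = sum_i f(mu(x_i)), here with a sigmoid as the monotone f."""
    return sum(f(mu(x, prototypes, labels, yi)) for x, yi in zip(X, y))
```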
We now want to take the partial derivative of $S$ with respect to $\textbf{m}^\pm$, which by the chain rule is $$\frac{\partial S}{\partial \textbf{m}^{\pm}} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial d^{\pm}} \frac{\partial d^{\pm}}{\partial \textbf{m}^{\pm}}$$
Let $f$ be any monotonically increasing function (usually a sigmoid, but it remains unspecified here). Let $\textbf{x}_m$ be the vector that has $\textbf{m}^+$ as its closest prototype of the same class.
Now we get to the derivation.
$$S = f(\mu(\textbf{x}_1)) + \cdots + f(\mu(\textbf{x}_m)) + \cdots + f(\mu(\textbf{x}_N))$$ Since we are only interested in the local stochastic gradient, we continue working with the single summand $f(\mu(\textbf{x}_m))$.
$$ \begin{align} \frac{\partial S}{\partial \textbf{m}^+} &= \frac{\partial f(\mu(\textbf{x}_m))}{\partial\textbf{m}^+}\\ &= \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial \textbf{m}^+} \end{align}$$
Since $d^-$ does not depend on $\textbf{m}^+$, its derivative with respect to $\textbf{m}^+$ vanishes; writing $(\,\cdot\,)'$ for $\partial(\,\cdot\,)/\partial \textbf{m}^+$, the quotient rule gives
$$\require{cancel}\begin{align} \frac{\partial \mu}{\partial \textbf{m}^+} &= \cfrac{ \partial\cfrac{d^+(\textbf{x}_m, \textbf{m}^+) - d^-(\textbf{x}_m, \textbf{m}^-)} {d^+(\textbf{x}_m, \textbf{m}^+) + d^-(\textbf{x}_m, \textbf{m}^-)}} {\partial \textbf{m}^+} \\ &= \frac{(d^+ - d^-)' \cdot (d^+ + d^-) - (d^+ - d^-) \cdot (d^+ + d^-)'}{(d^+ + d^-)^2} \nonumber \\ &= \frac{(d^{+\prime} - \cancel{d^{-\prime}}) \cdot (d^+ + d^-) - (d^+ - d^-) \cdot (d^{+\prime} + \cancel{d^{-\prime}})}{(d^+ + d^-)^2} \nonumber \\ &= \frac{d^{+\prime} \cdot \Big((\cancel{d^+} + d^-) - (\cancel{d^+} - d^-)\Big)}{(d^+ + d^-)^2} \nonumber \\ &= \frac{d^{+\prime} \cdot 2d^-}{(d^+ + d^-)^2} \nonumber \\ &= \cfrac{\cfrac{\partial d^+}{\partial \textbf{m}^+} \cdot 2d^-}{(d^+ + d^-)^2} \end{align}$$
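This intermediate factor can be sanity-checked numerically (my own check, not from the papers): treat $\mu$ as a function of the two scalar distances and compare a central finite difference against $2d^-/(d^+ + d^-)^2$.

```python
import numpy as np

d_plus, d_minus, h = 0.7, 1.3, 1e-6

mu = lambda dp, dm: (dp - dm) / (dp + dm)   # mu as a function of the distances

numeric = (mu(d_plus + h, d_minus) - mu(d_plus - h, d_minus)) / (2 * h)
analytic = 2 * d_minus / (d_plus + d_minus) ** 2

print(numeric, analytic)   # both come out as ~0.65
```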
Now, as a last step, I need to take the partial derivative of the distance function. Papers use the squared Euclidean distance:
$$d^\pm = |\textbf{x} - \textbf{m}^\pm|^2$$
So I am not sure how to proceed from here:
$$ \cfrac{\partial d^+}{\partial \textbf{m}^+} = \frac{\partial |\textbf{x}_m - \textbf{m}^{+}|^2}{\partial \textbf{m}^+} $$
I know the solution has to be
$$\frac{\partial S}{\partial \textbf{m}^+} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial d^{+}} \frac{\partial d^{+}}{\partial \textbf{m}^{+}} = -\frac{\partial f}{\partial \mu} \frac{4d^-}{(d^+ + d^-)^2}(\textbf{x}_m - \textbf{m}^+)$$
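To guard against sign errors, this target formula can at least be verified numerically (again my own sketch: I pick the logistic sigmoid for $f$ and the squared Euclidean distance from above, and compare the closed form with a finite-difference gradient of $f(\mu(\textbf{x}_m))$ with respect to $\textbf{m}^+$):

```python
import numpy as np

x = np.array([0.3, -0.2])        # the sample x_m
m_plus = np.array([0.5, 0.1])    # nearest prototype of the same class
m_minus = np.array([-0.4, 0.8])  # nearest prototype of a different class

f = lambda t: 1 / (1 + np.exp(-t))   # logistic sigmoid
df = lambda t: f(t) * (1 - f(t))     # its derivative

def loss(mp):
    """f(mu(x_m)) as a function of the positive prototype only."""
    dp = np.sum((x - mp) ** 2)
    dm = np.sum((x - m_minus) ** 2)
    return f((dp - dm) / (dp + dm))

# closed form from the target: -f'(mu) * 4 d^- / (d^+ + d^-)^2 * (x - m^+)
dp = np.sum((x - m_plus) ** 2)
dm = np.sum((x - m_minus) ** 2)
mu = (dp - dm) / (dp + dm)
analytic = -df(mu) * 4 * dm / (dp + dm) ** 2 * (x - m_plus)

# central finite differences, one coordinate of m^+ at a time
h, numeric = 1e-6, np.zeros_like(m_plus)
for k in range(len(m_plus)):
    e = np.zeros_like(m_plus)
    e[k] = h
    numeric[k] = (loss(m_plus + e) - loss(m_plus - e)) / (2 * h)

print(analytic)  # agrees with numeric up to finite-difference error
print(numeric)
```

Comparing the target with the quotient-rule result above, the missing factor $\frac{\partial d^+}{\partial \textbf{m}^+}$ would have to come out as $-2(\textbf{x}_m - \textbf{m}^+)$, and that is exactly the step I cannot justify yet.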