
I'm reading Eli Bendersky's blog post that derives the softmax function and its associated loss function and am stuck on one of the first steps of the softmax function derivative [link].

His notation defines the softmax as follows:

$$S_i = \frac{e^{a_i}}{ \sum_{k=1}^{N} e^{a_k} } $$
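For concreteness, this definition is just exponentiation followed by normalization. A minimal numpy sketch (my own, not from the post; the max-subtraction is the usual numerical-stability trick and doesn't change the result):

```python
import numpy as np

def softmax(a):
    """S_i = exp(a_i) / sum_k exp(a_k) for a 1-D logit vector a."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
S = softmax(a)
print(S)         # each S_i lies in (0, 1)
print(S.sum())   # and they sum to 1
```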

He then goes on to start the derivative:

$$ \frac{\partial S_i}{\partial a_j} = \frac{ \partial \frac{e^{a_i} }{ \sum_{k=1}^N e^{a_k}} } {\partial a_j} $$

Here we are computing the derivative of the $i$th output with respect to the $j$th input. Because the expression is a quotient, he says one must apply the quotient rule from calculus:

$$ f(x) = \frac{g(x)}{h(x)} $$ $$ f'(x) = \frac{ g'(x)h(x) - h'(x)g(x) } { (h(x))^2 } $$
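As a sanity check of the quotient rule itself, sympy confirms it symbolically on an example pair $g$, $h$ (my own choice of functions, purely illustrative):

```python
import sympy as sp

x = sp.symbols('x')
g = sp.exp(x)    # example g(x); any differentiable function works
h = 1 + x**2     # example h(x), chosen nonzero so the quotient is defined
f = g / h

# quotient rule: f' = (g'h - h'g) / h^2
by_rule = (sp.diff(g, x) * h - sp.diff(h, x) * g) / h**2
print(sp.simplify(sp.diff(f, x) - by_rule))  # 0: direct derivative and rule agree
```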

In the case of the $S_i$ equation above:

$$ g_i = e^{a_i} $$ $$ h_i = \sum_{k=1}^N e^{a_k} $$

So far so good. Here's where I get confused. He then says: "Note that no matter which $a_j$ we compute the derivative of $h_i$ for, the answer will always be $e^{a_j}$".

If anyone could help me see why this is the case, I'd be very grateful.

duhaime

1 Answer


$$\frac{\partial}{\partial a_j}h_i = \frac{\partial}{\partial a_j}\sum_{k=1}^N e^{a_k}=\sum_{k=1}^N \frac{\partial}{\partial a_j}e^{a_k}=e^{a_j}$$ because $\frac{\partial}{\partial a_j}e^{a_k}=0$ for $k\neq j$.
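A quick finite-difference check makes this concrete (my own numpy sketch, not part of the original answer): bumping any single $a_j$ changes $h$ at rate $e^{a_j}$, regardless of which $j$ you pick.

```python
import numpy as np

a = np.array([0.5, -1.0, 2.0])   # arbitrary example inputs
h = lambda v: np.exp(v).sum()    # h = sum_k e^{a_k}

eps = 1e-6
for j in range(len(a)):
    bumped = a.copy()
    bumped[j] += eps
    numeric = (h(bumped) - h(a)) / eps  # finite-difference estimate of dh/da_j
    print(j, numeric, np.exp(a[j]))     # the estimate matches e^{a_j} for every j
```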

  • Thanks @ThePhenotype this is helpful. Can I please ask you to elaborate on why $\frac{\partial}{\partial a_j}e^{a_k} = 0$ for $k \neq j$? I think that's what I'm not seeing... – duhaime Dec 31 '17 at 20:40
  • @duhaime Every $a_i$ is a variable. If we call $a_j$ $x$ and the other variables $y, z, \ldots$ then you can see that, for example, $\frac{\partial}{\partial y}e^{x}=0$, and likewise when differentiating with respect to $z$, etc. In other words, the function $e^{a_k}$ is constant as $a_j$ varies for $j\neq k$, so its partial derivative with respect to $a_j$ is $0$. – The Phenotype Dec 31 '17 at 20:47
  • Ah, I think that's exactly it! I was taking the derivative of $e^x$ for all values of $x$, which gives $N$ non-zero values, but I should have been taking the derivative wrt $a_j$, in which case we only care about the case where the input index == the output index; is that right? – duhaime Dec 31 '17 at 21:00
  • If you mean by the input index == the output index that the indices of the variable and the partial derivative agree, then yes, you're right. – The Phenotype Dec 31 '17 at 21:07
  • Awesome, thanks again! – duhaime Dec 31 '17 at 21:12