
I'm reading Eli Bendersky's blog post that derives the softmax function and its associated loss function and am stuck on one of the first steps of the softmax function derivative [link].

His notation defines the softmax as follows:

$$S_i = \frac{e^{a_i}}{ \sum_{k=1}^{N} e^{a_k} } $$
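For concreteness, this definition is just exponentiation followed by normalization. A minimal numpy sketch (my own, not from the post; the max-subtraction is the usual numerical-stability trick and doesn't change the result):

```python
import numpy as np

def softmax(a):
    """S_i = exp(a_i) / sum_k exp(a_k) for a 1-D logit vector a."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
S = softmax(a)
print(S)         # each S_i lies in (0, 1)
print(S.sum())   # and they sum to 1
```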

He then goes on to start the derivative:

$$ \frac{\partial S_i}{\partial a_j} = \frac{ \partial \frac{e^{a_i} }{ \sum_{k=1}^N e^{a_k}} } {\partial a_j} $$

Here we are computing the derivative of the $i$th output with respect to the $j$th input. Because the expression is a quotient, he says one must apply the quotient rule from calculus:

$$ f(x) = \frac{g(x)}{h(x)} $$ $$ f'(x) = \frac{ g'(x)h(x) - h'(x)g(x) } { (h(x))^2 } $$
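As a sanity check of the quotient rule itself, sympy confirms it symbolically on an example pair $g$, $h$ (my own choice of functions, purely illustrative):

```python
import sympy as sp

x = sp.symbols('x')
g = sp.exp(x)    # example g(x); any differentiable function works
h = 1 + x**2     # example h(x), chosen nonzero so the quotient is defined
f = g / h

# quotient rule: f' = (g'h - h'g) / h^2
by_rule = (sp.diff(g, x) * h - sp.diff(h, x) * g) / h**2
print(sp.simplify(sp.diff(f, x) - by_rule))  # 0: direct derivative and rule agree
```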

In the case of the $S_i$ equation above:

$$ g_i = e^{a_i} $$ $$ h_i = \sum_{k=1}^N e^{a_k} $$

So far so good. Here's where I get confused. He then says: "Note that no matter which $a_j$ we compute the derivative of $h_i$ for, the answer will always be $e^{a_j}$".

If anyone could help me see why this is the case, I'd be very grateful.

duhaime

1 Answer


$$\frac{\partial}{\partial a_j}h_i = \frac{\partial}{\partial a_j}\sum_{k=1}^N e^{a_k}=\sum_{k=1}^N \frac{\partial}{\partial a_j}e^{a_k}=e^{a_j}$$ because $\frac{\partial}{\partial a_j}e^{a_k}=0$ for $k\neq j$.
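A quick finite-difference check makes this concrete (my own numpy sketch, not part of the original answer): bumping any single $a_j$ changes $h$ at rate $e^{a_j}$, regardless of which $j$ you pick.

```python
import numpy as np

a = np.array([0.5, -1.0, 2.0])   # arbitrary example inputs
h = lambda v: np.exp(v).sum()    # h = sum_k e^{a_k}

eps = 1e-6
for j in range(len(a)):
    bumped = a.copy()
    bumped[j] += eps
    numeric = (h(bumped) - h(a)) / eps  # finite-difference estimate of dh/da_j
    print(j, numeric, np.exp(a[j]))     # the estimate matches e^{a_j} for every j
```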

  • Thanks @ThePhenotype this is helpful. Can I please ask you to elaborate on why $\frac{\partial}{\partial a_j}e^{a_k} = 0$ for $k \neq j$? I think that's what I'm not seeing... – duhaime Dec 31 '17 at 20:40
  • @duhaime Every $a_i$ is a variable. If we call $a_j$ $x$ and the other variables $y, z, \ldots$ then you can see that, for example, $\frac{\partial}{\partial y}e^{x}=0$, and likewise when differentiating with respect to $z$, etc. In other words, the function $e^{a_k}$ is constant as $a_j$ varies for $j\neq k$, so its partial derivative with respect to $a_j$ is $0$. – The Phenotype Dec 31 '17 at 20:47
  • Ah, I think that's exactly it! I was taking the derivative of $e^x$ for all values of $x$, which gives $N$ non-zero values, but I should have been taking the derivative wrt $a_j$, in which case we only care about the case where the input index == the output index; is that right? – duhaime Dec 31 '17 at 21:00
  • If you mean by the input index == the output index that the indices of the variable and the partial derivative agree, then yes, you're right. – The Phenotype Dec 31 '17 at 21:07
  • Awesome, thanks again! – duhaime Dec 31 '17 at 21:12