
We have the following feedforward equations:

$z_1 = W_1x + b_1$

$a_1 = f(z_1)$

$z_2 = W_2a_1 + b_2$

$a_2 = y^* = \operatorname{softmax}(z_2)$

$L(y, y^*) = -\frac{1}{N}\sum_{n \in N} \sum_{i \in C} y_{n,i} \log{y^*_{n,i}}$
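In code, my forward pass looks roughly like this (a NumPy sketch with made-up layer sizes; I'm using $f = \tanh$ only as a placeholder activation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 input features, 5 hidden units, C = 2 classes, N = 3 samples.
D, H, C, N = 4, 5, 2, 3

W1, b1 = rng.normal(size=(H, D)), np.zeros((H, 1))
W2, b2 = rng.normal(size=(C, H)), np.zeros((C, 1))

def softmax(z):
    # Column-wise softmax; subtract the max for numerical stability.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

x = rng.normal(size=(D, N))   # one column per sample
z1 = W1 @ x + b1
a1 = np.tanh(z1)              # f = tanh, as a placeholder
z2 = W2 @ a1 + b2
y_star = softmax(z2)          # a_2 = y^*: one probability column per sample
```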

Now, I'm trying to compute the following partial derivatives: $\frac{dL}{dW_2}$, $\frac{dL}{db_2}$, $\frac{dL}{dW_1}$, $\frac{dL}{db_1}$. I'm familiar with how to compute these gradients in a simple network, but am running into difficulties with the softmax function here. This is what I have so far:

$\frac{dL}{dW_2} = -\frac{1}{N}\sum_{n \in N}\sum_{i \in C}(\frac{d}{dW_2} y_{n,i}\log{y^*_{n,i}}) = -\frac{1}{N}\sum_{n \in N}\sum_{i \in C}( \frac{y_{n,i}}{y^*_{n,i}}\frac{dy^*_{n,i}}{dW_2})$. Then, $\frac{dy^*_{n,i}}{dW_2} = \frac{dy^*_{n,i}}{dz_2}\frac{dz_2}{dW_2}$ and $\frac{dz_2}{dW_2} = a_1$. I'm just not quite sure how to compute $\frac{dy^*_{n,i}}{dz_2}$. Would anyone be able to help with this or suggest a different way of approaching the computation for $\frac{dL}{dW_2}$?
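For a single sample $n$, writing $z_{2,j}$ for the $j$-th component of that sample's $z_2$, I believe the standard softmax Jacobian is

$\frac{\partial y^*_{n,i}}{\partial z_{2,j}} = y^*_{n,i}\,(\delta_{ij} - y^*_{n,j}),$

where $\delta_{ij}$ is the Kronecker delta, but I'm not sure how to carry this through the sums over $n$ and $i$ cleanly.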

  • The general solution to softmax is complicated. You may find it easier if you assume that, for each $n$, only one of the $y_{n,i}$ has value $1$ whilst the rest are $0$. This is a common use case for softmax, and is simpler to work with because you only care about the value of the matching $y^*_{n,i}$. Can any answer also make that assumption? – Neil Slater Oct 16 '19 at 08:04
  • Yes, I suppose so. I need these calculations since I'm trying to manually perform backprop to train a NN – py1123 Oct 16 '19 at 18:01
  • Oh, also, softmax usually goes with multiclass log loss. When you combine those two, you get a really simple formula for $\nabla_{z_2} L$ i.e. $\nabla_{z_2} L = \frac{1}{N}\sum_{n \in N} y^*_n - y_n$ . . . that's because many terms in the two derivatives cancel each other out. An answer could demonstrate that, but isn't a general solution for all possible uses of softmax. – Neil Slater Oct 16 '19 at 19:39
  • It still isn't entirely clear to me why there isn't a solution to this specific problem. I'm attempting to build my own NN for 2-class classification (here $C = 2$), and need to compute these derivatives in order to determine how to modify each parameter to decrease the loss function. – py1123 Oct 16 '19 at 21:06
  • There is definitely a solution - many different ones are out there. I've done it myself here: https://github.com/neilslater/ru_ne_ne/blob/master/ext/ru_ne_ne/core_objective_functions.c#L258-L308 . . . (note this is abandonware, I never published the library, however I do have test routines for that gradient demonstrating it is correct). – Neil Slater Oct 17 '19 at 08:16
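To see the cancellation mentioned in the comments in practice, here is a minimal NumPy sketch (assuming one-hot labels; the logits and dimensions are made up) that checks the combined gradient $\frac{dL}{dz_2} = \frac{1}{N}(y^* - y)$, taken per sample, against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

C, N = 2, 3                              # made-up: 2 classes, 3 samples
Z2 = rng.normal(size=(C, N))             # pretend pre-softmax values, one column per sample
labels = rng.integers(0, C, size=N)
Y = np.eye(C)[:, labels]                 # one-hot targets y, shape (C, N)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def loss(z2):
    # L = -(1/N) * sum_n sum_i y_{n,i} * log y*_{n,i}
    return -np.sum(Y * np.log(softmax(z2))) / N

# Analytic gradient after the softmax Jacobian and the 1/y* factor cancel:
analytic = (softmax(Z2) - Y) / N

# Central-difference check, one entry of z_2 at a time.
numeric = np.zeros_like(Z2)
eps = 1e-6
for idx in np.ndindex(*Z2.shape):
    Zp, Zm = Z2.copy(), Z2.copy()
    Zp[idx] += eps
    Zm[idx] -= eps
    numeric[idx] = (loss(Zp) - loss(Zm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-9 or smaller
```

With samples stored as columns, $\frac{dL}{dW_2}$ should then just be this matrix times $a_1^\top$, and $\frac{dL}{db_2}$ its sum over the sample columns.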

0 Answers