We have the following feedforward equations:
$z_1 = W_1x + b_1$
$a_1 = f(z_1)$
$z_2 = W_2a_1 + b_2$
$a_2 = y^* = \operatorname{softmax}(z_2)$
$L(y, y^*) = -\frac{1}{N}\sum_{n \in N} \sum_{i \in C} y_{n,i} \log{y^*_{n,i}}$
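For concreteness, here is a minimal NumPy sketch of the forward pass and loss as I understand them (the shapes, the row-batch convention, and the choice of $f = \tanh$ as the hidden nonlinearity are my own assumptions, not part of the original setup):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass for the two-layer network above.
    x: (N, D) batch of inputs; W1: (D, H); W2: (H, C).
    tanh is just a placeholder for the hidden activation f."""
    z1 = x @ W1 + b1          # z_1 = W_1 x + b_1 (row-vector convention)
    a1 = np.tanh(z1)          # a_1 = f(z_1)
    z2 = a1 @ W2 + b2         # z_2 = W_2 a_1 + b_2
    # softmax, shifted by the row max for numerical stability
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    y_hat = e / e.sum(axis=1, keepdims=True)   # a_2 = y^* = softmax(z_2)
    return y_hat

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y^*) = -(1/N) * sum_n sum_i y_{n,i} log y^*_{n,i},
    with y the one-hot targets of shape (N, C)."""
    N = y.shape[0]
    return -np.sum(y * np.log(y_hat + eps)) / N
```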
Now, I'm trying to compute the following partial derivatives: $\frac{dL}{dW_2}$, $\frac{dL}{db_2}$, $\frac{dL}{dW_1}$, $\frac{dL}{db_1}$. I'm familiar with how to compute these gradients in a simple network, but I'm running into difficulties with the softmax function here. This is what I have so far:
$\frac{dL}{dW_2} = -\frac{1}{N}\sum_{n \in N}\sum_{i \in C}\frac{d}{dW_2}\left(y_{n,i}\log{y^*_{n,i}}\right) = -\frac{1}{N}\sum_{n \in N}\sum_{i \in C}\frac{y_{n,i}}{y^*_{n,i}}\frac{dy^*_{n,i}}{dW_2}$
Applying the chain rule to the remaining factor,
$\frac{dy^*_{n,i}}{dW_2} = \frac{dy^*_{n,i}}{dz_2}\frac{dz_2}{dW_2}$, with $\frac{dz_2}{dW_2} = a_1$.
I'm just not sure how to compute $\frac{dy^*_{n,i}}{dz_2}$, i.e. the derivative of the softmax output with respect to its input. Would anyone be able to help with this, or suggest a different way of approaching the computation of $\frac{dL}{dW_2}$?
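In case it's useful context: since I don't have the analytic expression yet, I've been comparing candidate gradients against a central-difference estimate. This is only a sketch; `loss_fn` is a hypothetical wrapper that runs the forward pass above and returns the scalar loss for a given $W_2$, with $x$, $y$ and the other parameters held fixed.

```python
def numerical_grad_W2(loss_fn, W2, eps=1e-5):
    """Central-difference estimate of dL/dW2, used to sanity-check
    any analytic expression. loss_fn(W2) must return the scalar loss."""
    grad = np.zeros_like(W2)
    it = np.nditer(W2, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = W2[idx]
        W2[idx] = orig + eps
        L_plus = loss_fn(W2)
        W2[idx] = orig - eps
        L_minus = loss_fn(W2)
        W2[idx] = orig                      # restore the entry
        grad[idx] = (L_plus - L_minus) / (2 * eps)
    return grad
```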