
I am currently taking some deep learning and neural network (NN) courses, and in addition to the course work I am implementing my own "toolkit" of NN techniques to better understand the methods and mathematics behind them, instead of just working with the framework we are provided with.

My next goal is to implement a generalized version of softmax so that it can be combined with any desired loss function, but I am having some trouble understanding how to use the Jacobian matrix that is the derivative of softmax in the backpropagation step.

My current implementation calculates the derivative of the loss function with respect to $z^{l} = w^{l} a^{l-1} + b^{l}$ as follows (a small sketch of this step is included after the list of symbols below):

$$ \frac{\partial L}{\partial z^{l}} = L'(a^l, y) \circ \sigma'(z^l), $$

with:

  • $l$ denotes the output layer
  • $w^l$ and $b^l$ denote the weight matrix and bias vector for layer $l$
  • $a^{l-1}$ is the activation of layer $l-1$
  • $L'$ is the derivative of the loss function
  • $\sigma'$ is the derivative of the activation function for the output layer $l$
  • $\circ$ denotes the element-wise multiplication of two vectors or matrices
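
For concreteness, here is a minimal NumPy sketch of that output-layer step for the element-wise case (MSE loss with a sigmoid activation); the function and variable names are my own and not from any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_delta_elementwise(z, y):
    """dL/dz^l for MSE loss with a sigmoid output layer.

    z and y are column vectors of shape (n_out, 1).
    """
    a = sigmoid(z)            # activation a^l, shape (n_out, 1)
    dL_da = a - y             # L'(a, y) for L = 0.5 * ||a - y||^2
    da_dz = a * (1.0 - a)     # sigma'(z) for the sigmoid, shape (n_out, 1)
    return dL_da * da_dz      # element-wise product, shape (n_out, 1)
```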

When using, for example, the MSE loss and a linear or sigmoid activation for the last layer, the sizes in the element-wise multiplication match: both factors are of size $n_{out} \times 1$, where $n_{out}$ is the number of output nodes. But when using the softmax function as the activation function, the sizes are incompatible: $L' \in \mathbb{R}^{n_{out} \times 1}$ and $\sigma' \in \mathbb{R}^{n_{out} \times n_{out}}$, which does not permit element-wise multiplication.
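
Here is a small NumPy sketch (again with made-up names) that reproduces the shape mismatch numerically, using the softmax Jacobian $\mathrm{diag}(a) - a a^{\top}$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

def softmax_jacobian(z):
    """Jacobian d a_i / d z_j of the softmax, shape (n_out, n_out)."""
    a = softmax(z).reshape(-1, 1)
    return np.diagflat(a) - a @ a.T

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])
a = softmax(z)

dL_da = (a - y).reshape(-1, 1)         # L'(a, y) for MSE, shape (3, 1)
J = softmax_jacobian(z)                # sigma'(z), shape (3, 3)
print(dL_da.shape, J.shape)            # (3, 1) vs (3, 3): element-wise product undefined
```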

I know that the softmax function is almost always combined with the cross-entropy loss function to make the calculations simpler, but is this a necessity for the maths to work?
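
For reference, the simplification I am referring to is the standard result (assuming a one-hot target $y$):

$$ \frac{\partial L}{\partial z^{l}} = a^{l} - y, \qquad \text{when } L(a^{l}, y) = -\sum_{i} y_{i} \log a^{l}_{i} \text{ and } a^{l} = \operatorname{softmax}(z^{l}). $$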

I have searched through a bunch of different forums and threads, but all the threads I can find deal with the derivation of the softmax function and stop there, never delving into how it interacts with backpropagation. Am I missing something really simple that would let me use softmax with other loss functions, or does it only work with cross-entropy?

Edit: After some more research, I found the following thread: Applying the gradient of softmax in backprop. However, there seem to be some differing opinions in the comments, and the derivative is no longer in the expected Jacobian form, so I do not feel that this sufficiently answers my question. Any help would be much appreciated!
