4

I am trying to calculate the derivative of the cross-entropy loss when the softmax layer has a temperature $T$. That is: \begin{equation} p_j = \frac{e^{o_j/T}}{\sum_k e^{o_k/T}} \end{equation}
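
For concreteness, here is a minimal NumPy sketch of this tempered softmax (the function name `softmax_T` is just my own label for the formula above):

```python
import numpy as np

def softmax_T(o, T=1.0):
    """p_j = exp(o_j / T) / sum_k exp(o_k / T)."""
    x = np.asarray(o, dtype=float) / T
    x -= x.max()          # shift by the max for numerical stability
    e = np.exp(x)
    return e / e.sum()
```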

This question was answered here for $T=1$: Derivative of Softmax loss function

Now, what would the final derivative be in terms of $p_i$, $q_i$, and $T$? Please see the linked question for the notation.

Edit: Thanks to Alex for pointing out a typo.

sim_inf
  • 43

3 Answers

2

The standard derivation of the cross-entropy loss for softmax outputs assumes that the set of target values is one-hot encoded rather than a fully defined probability distribution, and that $T = 1$, which is why it does not include the second $1/T$ term.

The following is from this elegantly written article:

\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i}{} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j y_i) \\ & = - t_i + t_i y_i + \sum_{j \neq i}^C t_j y_i = - t_i + \sum_{j = 1}^C t_j y_i = -t_i + y_i \sum_{j = 1}^C t_j \\ & = y_i - t_i \end{split}

where $C$ is the number of output classes. The above derivation assumes neither $T \ne 1$ nor that the target distribution itself is a softmax output. To see what the gradient looks like once we add these two assumptions, let's first plug in $T \ne 1$:

\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i}{} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} \frac{1}{T} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{1}{T} (-y_j y_i) \\ & = -\frac{1}{T} t_i + \frac{1}{T} t_i y_i + \frac{1}{T} \sum_{j \neq i}^C t_j y_i = - \frac{1}{T} t_i + \frac{1}{T} \sum_{j = 1}^C t_j y_i = -\frac{1}{T} t_i + \frac{1}{T} y_i \sum_{j = 1}^C t_j \\ & = \frac{1}{T} (y_i - t_i) \end{split}
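
As a quick sanity check (my own addition, not part of the referenced article), here is a finite-difference comparison against the $\frac{1}{T}(y_i - t_i)$ result; setting `T = 1.0` recovers the first derivation above:

```python
import numpy as np

def softmax(z, T):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def xent(z, t, T):
    # cross-entropy between targets t and the tempered softmax outputs y
    return -np.sum(t * np.log(softmax(z, T)))

rng = np.random.default_rng(0)
C, T, h = 5, 3.0, 1e-6
z = rng.normal(size=C)                 # logits
t = softmax(rng.normal(size=C), 1.0)   # any valid target distribution (sums to 1)

analytic = (softmax(z, T) - t) / T     # claimed gradient: (y_i - t_i) / T
numeric = np.array([
    (xent(z + h * e_i, t, T) - xent(z - h * e_i, t, T)) / (2 * h)
    for e_i in np.eye(C)
])
print(np.allclose(analytic, numeric))  # True
```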

The last part, where the assumption that the targets are soft as well is injected into the derivation, is beautifully summarized in Section 2.1 of Hinton et al.'s 2015 paper 'Distilling the Knowledge in a Neural Network'. Rewriting it in the notation of the derivation above, we get:

\begin{split} \frac{\partial \xi}{\partial z_i} & = \frac{1}{T} (y_i - t_i) = \frac{1}{T} (\frac{e^{z_i/T}}{\sum_{d=1}^C e^{z_d/T}} - \frac{e^{v_i/T}}{\sum_{d=1}^C e^{v_d/T}}) \end{split}

If the temperature is high compared with the magnitude of the logits, we can approximate: \begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{T} (\frac{1 + z_i/T}{C + \sum_{d=1}^C z_d/T} - \frac{1 + v_i/T}{C + \sum_{d=1}^C v_d/T}) \end{split}

since $e^x \approx 1 + x$ when $x$ is small (the denominators are simply the sums of these approximated terms). If we now assume that the logits have been zero-meaned separately for each transfer case, so that $\sum_{d} z_d = \sum_{d} v_d = 0$, the above equation simplifies to: \begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{CT^2} (z_i - v_i) \end{split}

This is where the $1/T^2$ factor comes from. Here, the 'transfer set' refers to the dataset used to train the to-be-distilled student model, labelled with soft targets produced by the softmax outputs of the cumbersome teacher model(s).
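
To illustrate the approximation numerically, here is a small sketch (my own, following the zero-meaned-logits assumption above) comparing the exact gradient $\frac{1}{T}(y_i - t_i)$ with $\frac{1}{CT^2}(z_i - v_i)$ at a temperature that is large relative to the logits:

```python
import numpy as np

def softmax(x, T):
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

rng = np.random.default_rng(1)
C, T = 6, 50.0                            # T large compared with the logit magnitudes
z = rng.normal(size=C); z -= z.mean()     # student logits, zero-meaned per transfer case
v = rng.normal(size=C); v -= v.mean()     # teacher logits, zero-meaned per transfer case

exact = (softmax(z, T) - softmax(v, T)) / T   # (1/T) (y_i - t_i)
approx = (z - v) / (C * T**2)                 # (1/(C T^2)) (z_i - v_i)
print(np.abs(exact - approx).max() / np.abs(exact).max())  # relative discrepancy
```

The discrepancy shrinks as $T$ grows, which is consistent with the $e^x \approx 1 + x$ approximation being first order in $z_i/T$ and $v_i/T$.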

samirzach
  • 136
1

$ \def\o{{\tt1}}\def\p{\partial} \def\F{{\cal L}} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\fracLR#1#2{\L(\frac{#1}{#2}\R)} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} $Before taking derivatives, define the all-ones vector $(\o)$ plus a few more vectors
$$\eqalign{
x &= \fracLR{o}{T} &&\qiq dx &= \fracLR{do}{T} \\
e &= \exp(x),\;\;&E=\Diag e &\qiq de &= E\;dx \\
p &= \frac{e}{\o:e},\;&P= \Diag p &\qiq dp &= \LR{P-pp^T}dx \\
}$$
and also introduce the Frobenius product $(:)$, which is a concise notation for the trace
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \big\|A\big\|^2_F \\
}$$
Write the objective function using the above notation.
$$\eqalign{
\F &= -y:\log(p) \qquad\qquad \\
}$$
Then calculate its differential and gradient.
$$\eqalign{
d\F &= -y:d\log(p) \\
&= -y:P^{-1}\,dp \\
&= -y:P^{-1}\LR{P-pp^T}dx \\
&= -y:\LR{I-\o p^T}dx \\
&= \LR{p\o^T-I}y:dx \\
&= \LR{p-y}:\fracLR{do}{T} \\
&= \fracLR{p-y}{T}:do \\
\grad{\F}{o} &= \fracLR{p-y}{T} \\
}$$
This result is about as nice as one could hope.

Setting $\;T=1\,$ recovers the answer in the linked post.
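
A small numerical check of the differential used above, i.e. that the Jacobian $\partial p/\partial o$ equals $(\operatorname{Diag}(p) - pp^T)/T$ (a sketch I added, not part of the answer):

```python
import numpy as np

def softmax(o, T):
    e = np.exp(o / T - np.max(o / T))
    return e / e.sum()

rng = np.random.default_rng(2)
n, T, h = 4, 2.5, 1e-6
o = rng.normal(size=n)

p = softmax(o, T)
analytic = (np.diag(p) - np.outer(p, p)) / T      # (P - p p^T) / T
numeric = np.column_stack([
    (softmax(o + h * e_i, T) - softmax(o - h * e_i, T)) / (2 * h)
    for e_i in np.eye(n)
])
print(np.allclose(analytic, numeric))             # True
```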

greg
  • 35,825
0

It's called the chain rule: $\frac{\partial L}{\partial s} = \frac{\partial L}{\partial y} \times \frac{\partial y}{\partial s}$. For the first term, in the case of Euclidean loss, it is $(y-L)$. For the second, it is $\sigma(s)(1-\sigma(s)) = y(1-y)$.

Alex
  • 19,262
  • Thanks, but I am familiar with the chain rule. My question is: if we set $T$ to 1, we get $\frac{\partial L}{\partial o_i}=p_i-y_i$; now what if we don't set $T$ to 1? Edit: again, please see the linked question for the notation – sim_inf Aug 13 '20 at 14:56
  • In your question, is it $e^{\frac{0}{T}}$ or $\frac{e^0}{T}$? – Alex Aug 13 '20 at 14:58
  • That is not a zero, it is an O, as in 'orange': \begin{equation} p_j = \frac{e^{o_j/T}}{\sum_k e^{o_k/T}} \end{equation} – sim_inf Aug 13 '20 at 14:59
  • It's the same, just scaled by $\frac{1}{T}$: $\frac{\partial y}{\partial o_j} = \frac{1}{T}p_j(1-p_j)$ – Alex Aug 13 '20 at 15:05
  • What if $i \neq j$ in $\frac{\partial p_j}{\partial o_i}$? – sim_inf Aug 13 '20 at 15:19
  • $p_j \neq f(o_i)$ – Alex Aug 13 '20 at 15:20
  • I am referring to this question: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function

    How would the final line change if we set $T$ in the softmax function?

    \begin{split} \frac{\partial L}{\partial o_i} &= -\sum_k y_k\frac{\partial \log p_k}{\partial o_i} = -\sum_k y_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i} \\ &= -y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_k p_i}}) \\ &= -y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}}) \\ &= -y_i+\color{blue}{y_i p_i+\sum_{k\neq i}y_k\,p_i} \\ &= \color{blue}{p_i\left(\sum_k y_k\right)}-y_i = p_i-y_i \end{split}

    – sim_inf Aug 13 '20 at 15:25
  • Because $o_i$ is an argument of the $i^{th}$ output neuron, and obviously $p_j$ is the $j^{th}$ output – Alex Aug 13 '20 at 15:25
  • How would the gradient scale if we set $T$? I know that it would scale by $1/T^2$, but I need to know the exact equation. Is it precisely $\frac{1}{T^2}(p_i-y_i)$ or something else? – sim_inf Aug 13 '20 at 15:30