
Chapter 10 of the Deep Learning book has

$$ \begin{align} a^{(t)} &= b + Wh^{(t-1)} + Ux^{(t)} \\ h^{(t)} &= \tanh(a^{(t)})\\ o^{(t)} &= c + Vh^{(t)}\\ \hat{y}^{(t)} &= \text{softmax}(o^{(t)})\\ \\ L &= \sum_t L^{(t)}\\ &= -\sum_t \log{p_{\text{model}}(y^{(t)}\ |\ x^{(1)},\dots,x^{(t)})} \end{align} $$ where $p_{\text{model}}(y^{(t)}\ |\ x^{(1)},\dots,x^{(t)})$ is given by reading the entry for $y^{(t)}$ from the model's output vector $\hat{y}^{(t)}$.
...
$$ \frac{\partial L}{\partial L^{(t)}}=1\\ (\nabla_{\pmb{o}^{(t)}}L)_i = \frac{\partial L}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \pmb{1}_{i=y^{(t)}} $$

I got $\frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - y_i^{(t)}$ as shown here. But how do we get the result in the book?
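To make the setup concrete, here is a minimal NumPy sketch of the forward pass and loss written above (the function and variable names are my own, not from the book):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One step of the vanilla RNN recurrence from Chapter 10."""
    a_t = b + W @ h_prev + U @ x_t      # a^(t) = b + W h^(t-1) + U x^(t)
    h_t = np.tanh(a_t)                  # h^(t) = tanh(a^(t))
    o_t = c + V @ h_t                   # o^(t) = c + V h^(t)
    y_hat_t = np.exp(o_t - o_t.max())
    y_hat_t /= y_hat_t.sum()            # y_hat^(t) = softmax(o^(t))
    return h_t, y_hat_t

def total_loss(y_hats, ys):
    """L = sum_t L^(t) = -sum_t log p_model(y^(t) | x^(1),...,x^(t)),
    where ys are integer class labels and y_hat[y] reads the entry for y^(t)."""
    return -sum(np.log(y_hat[y]) for y_hat, y in zip(y_hats, ys))
```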

muser

1 Answer


I think both ways of writing it are identical.

In vector form, you can write $$ \frac{\partial L^{(t)}}{\partial \mathbf{o}^{(t)}} = \mathrm{softmax}(\mathbf{o}^{(t)})-\mathbf{y}^{(t)} $$ where the vector $\mathbf{y}^{(t)}$ is zero everywhere except at the position that indicates the class (its one-hot encoding). For instance, $y_2=1$ if we are dealing with an example from the second class...
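As a quick numerical sanity check (a sketch of my own, not taken from the book), indexing the vector form at component $i$ gives exactly $\hat{y}_i^{(t)} - \pmb{1}_{i=y^{(t)}}$:

```python
import numpy as np

o = np.array([1.0, 2.0, 0.5])   # arbitrary example logits o^(t)
true_class = 1                  # index of the correct class y^(t)

y_hat = np.exp(o - o.max())
y_hat /= y_hat.sum()            # softmax(o)

y_onehot = np.zeros_like(o)
y_onehot[true_class] = 1.0      # one-hot vector y

grad_vector = y_hat - y_onehot  # vector form: softmax(o) - y
grad_indexed = np.array([y_hat[i] - (1.0 if i == true_class else 0.0)
                         for i in range(len(o))])  # componentwise: y_hat_i - 1_{i=y}

print(np.allclose(grad_vector, grad_indexed))  # True
```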

Steph
  • But the book uses $\pmb{1}_{i=y^{(t)}}$ (which appears to be a vector) even for the $i^{th}$ component of the gradient, as cited in the question. Is that incorrect? – muser Dec 04 '22 at 03:44
  • No, it is not a vector. In Goodfellow's book, this term is 1 if $y=i$ (i.e., if the class of the example is $i$) and 0 otherwise. This is the one-hot encoding, and it is how I defined the vector $\mathbf{y}$. – Steph Dec 04 '22 at 19:32