
Let $\displaystyle J(\Theta) = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(\hat p_k^{(i)})$ be the cross-entropy cost function, where $$p = \hat p_k^{(i)} = \frac{\large e^{(\theta^{(k)})^T x^{(i)}}}{\large\sum_{j=1}^K e^{(\theta^{(j)})^T x^{(i)}}}.$$ The matrix $\Theta = [\theta^{(1)} \space \theta^{(2)} \dots \space \theta^{(K)}]$ collects the parameter vectors, $x^{(i)}$ is a feature vector, and $y_k^{(i)}$ equals $1$ if the target class for the $i$th instance is $k$ and is $0$ otherwise.
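For concreteness, here is a minimal NumPy sketch of these definitions (the array names `Theta`, `X`, and `Y` are my own):

```python
import numpy as np

def cross_entropy_cost(Theta, X, Y):
    """J(Theta) for softmax regression.

    Theta : (K, n) array whose k-th row is theta^(k)
    X     : (m, n) array whose i-th row is x^(i)
    Y     : (m, K) one-hot array with Y[i, k] = y_k^(i)
    """
    logits = X @ Theta.T                          # entry (i, k) is (theta^(k))^T x^(i)
    logits -= logits.max(axis=1, keepdims=True)   # stabilizes exp; leaves p unchanged
    expl = np.exp(logits)
    P = expl / expl.sum(axis=1, keepdims=True)    # P[i, k] = p_hat_k^(i)
    return -np.mean(np.sum(Y * np.log(P), axis=1))
```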

How do you take the gradient vector of $J(\Theta)$ with respect to $\theta^{(k)}$?

I understand that the chain rule sends the gradient through the logarithm, giving $\nabla_{\theta^{(k)}} \log(p) = \frac{1}{p}\nabla_{\theta^{(k)}}(p)$, which is of the form $\frac{\text{bottom}}{\text{top}} \cdot \nabla_{\theta^{(k)}}(p)$ since $p$ is a quotient.

But how do you compute $\nabla_{\theta^{(k)}} \frac{\large e^{(\theta^{(k)})^T x^{(i)}}}{\large\sum_{j=1}^K e^{(\theta^{(j)})^T x^{(i)}}}$?

Every partial derivative $\Large\frac{\partial}{\partial \theta^{(k)}_a}$ I take comes out to $\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K y^{(i)}_k x^{(i)}_a [\hat p_k^{(i)}-1]$, but I don't see how these partials assemble into the full gradient $\frac{1}{m}\sum_{i=1}^m(\hat p_{k}^{(i)} -y_k^{(i)})x^{(i)}$.


1 Answer


I will slightly alter your notation. Let $$ J(\Theta)=\frac{-1}{m}\sum_{i=1}^{m}\sum_k y_k^{(i)}\log(\hat{p}_k^{(i)}),\;\;\; \hat{p}_k^{(i)}=\frac{\exp(\theta_k^Tx^{(i)})}{\sum_\alpha \exp(\theta_\alpha^Tx^{(i)})} $$ where $\theta_\xi\in\mathbb{R}^n$ and $\Theta\in\mathbb{R}^{K \times n}$. Since $J$ is a scalar function, the derivative $\nabla_\Theta J(\Theta)$ must be a matrix. We will compute component $j,\ell$ of that matrix.

Since $$ \log(\hat{p}_k^{(i)}) = \theta_k^Tx^{(i)} - \log\sum_\alpha\exp(\theta_\alpha^Tx^{(i)}), $$ we have $$ \nabla_{\theta_{j\ell}} J(\Theta)= \frac{-1}{m}\sum_{i=1}^{m}\sum_k y_k^{(i)}\partial_{j\ell}\log(\hat{p}_k^{(i)}) $$ where $\nabla_{\theta_{j\ell}}=\partial/\partial \theta_{j\ell}=\partial_{j\ell}$.

The first term: $$ \partial_{j\ell}\, \theta_k^Tx^{(i)} = \sum_\beta \partial_{j\ell}\, \theta_{k\beta}x^{(i)}_\beta = \delta_{kj} x_\ell^{(i)} $$ where $\delta_{\xi\eta}$ is the Kronecker delta. Define $$ S_i(\Theta)=\sum_\alpha \exp(\theta_\alpha^Tx^{(i)}). $$ Now the second term: \begin{align} \partial_{j\ell} \log\sum_\alpha\exp(\theta_\alpha^Tx^{(i)}) &= S_i(\Theta)^{-1}\partial_{j\ell} \sum_\alpha \exp(\theta_\alpha^Tx^{(i)})\\ &= S_i(\Theta)^{-1} \sum_\alpha \exp(\theta_\alpha^Tx^{(i)}) \partial_{j\ell}[\theta_\alpha^Tx^{(i)}]\\ &= S_i(\Theta)^{-1} \sum_\alpha \exp(\theta_\alpha^Tx^{(i)}) \delta_{\alpha j} x^{(i)}_\ell \\ &= S_i(\Theta)^{-1} \exp(\theta_j^Tx^{(i)}) x^{(i)}_\ell \\ &= x^{(i)}_\ell \hat{p}_j^{(i)} \end{align}

Putting it all together gives me: $$ \nabla_{\theta_{j\ell}} J(\Theta) = \frac{-1}{m}\sum_{i=1}^m\sum_k y_k^{(i)} \left[ \delta_{kj} x_\ell^{(i)} - x^{(i)}_\ell \hat{p}_j^{(i)} \right]=\frac{1}{m}\sum_{i}x^{(i)}_\ell\sum_k y_k^{(i)} \left[ \hat{p}_j^{(i)} - \delta_{kj} \right] $$

We can simplify the inner sum by continuing this index masochism. We will show that $$ \sum_k y_k^{(i)} \left[ \hat{p}_j^{(i)} - \delta_{kj} \right] = y^{(i)}_j\left[ \hat{p}_j^{(i)} - 1 \right] + \sum_{k\ne j} \hat{p}_j^{(i)}y^{(i)}_k = \hat{p}_j^{(i)} - y^{(i)}_j $$ where the first equality just splits off the $k=j$ term from the sum. There are two cases, depending on whether $y^{(i)}_j$ is $1$ or $0$. \begin{align} &\text{Case 1: } y_j^{(i)}=1 \;\;\;\implies\;\;\; y^{(i)}_j\left[ \hat{p}_j^{(i)} - 1 \right] + \sum_{k\ne j} \hat{p}_j^{(i)}\underbrace{y^{(i)}_k}_0 = \underbrace{y^{(i)}_j}_1 \hat{p}_j^{(i)} - y^{(i)}_j = \hat{p}_j^{(i)} - y^{(i)}_j \\ &\text{Case 2: } y_p^{(i)}=1,\,p\ne j \;\;\;\implies\;\;\; \underbrace{y^{(i)}_j}_0 \left[ \hat{p}_j^{(i)} - 1 \right] + \sum_{k\ne j} \hat{p}_j^{(i)}\underbrace{y^{(i)}_k}_{\delta_{pk}} = \hat{p}_j^{(i)} - \underbrace{y^{(i)}_j}_0 \\ \therefore \;\; & \;\; \sum_k y_k^{(i)} \left[ \hat{p}_j^{(i)} - \delta_{kj} \right] = \hat{p}_j^{(i)} - {y^{(i)}_j} \end{align}

Substituting this into our expression gives: $$ \nabla_{\theta_{j\ell}} J(\Theta) = \frac{-1}{m}\sum_{i=1}^m\sum_k y_k^{(i)} \left[ \delta_{kj} x_\ell^{(i)} - x^{(i)}_\ell \hat{p}_j^{(i)} \right]=\frac{1}{m}\sum_{i}x^{(i)}_\ell \left[ \hat{p}_j^{(i)} - {y^{(i)}_j} \right] $$

Finally, we can gather together the parts of the Jacobian corresponding to the $j$th row, i.e. $\theta_j$, into a vector $\nabla_{\theta_j} J(\Theta)\in\mathbb{R}^{n}$. This is equivalent to simply grouping together the $\ell$ indices into the vector: $$ \nabla_{\theta_{j}} J(\Theta) =\frac{1}{m}\sum_{i}x^{(i)} \left[ \hat{p}_j^{(i)} - {y^{(i)}_j} \right] $$
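As a sanity check on the final formula, here is a small numerical experiment comparing the analytic gradient (whose $j$th row is $\frac{1}{m}\sum_i x^{(i)}[\hat p_j^{(i)} - y_j^{(i)}]$, i.e. $\frac{1}{m}(\hat P - Y)^T X$ in matrix form) against central finite differences. This is a minimal NumPy sketch in the notation above; the function names and random test data are my own:

```python
import numpy as np

def softmax_probs(Theta, X):
    # P[i, k] = exp(theta_k^T x^(i)) / sum_a exp(theta_a^T x^(i))
    logits = X @ Theta.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def J(Theta, X, Y):
    m = X.shape[0]
    return -np.sum(Y * np.log(softmax_probs(Theta, X))) / m

def grad_J(Theta, X, Y):
    # Row j is (1/m) sum_i x^(i) [p_hat_j^(i) - y_j^(i)], i.e. (1/m)(P - Y)^T X
    m = X.shape[0]
    return (softmax_probs(Theta, X) - Y).T @ X / m

# Compare against central finite differences on random data
rng = np.random.default_rng(0)
m, n, K = 5, 4, 3
X = rng.normal(size=(m, n))
Y = np.eye(K)[rng.integers(0, K, size=m)]   # one-hot targets
Theta = rng.normal(size=(K, n))

eps = 1e-6
num = np.zeros_like(Theta)
for j in range(K):
    for l in range(n):
        Tp, Tm = Theta.copy(), Theta.copy()
        Tp[j, l] += eps
        Tm[j, l] -= eps
        num[j, l] = (J(Tp, X, Y) - J(Tm, X, Y)) / (2 * eps)

print(np.max(np.abs(num - grad_J(Theta, X, Y))))
```

The maximum absolute discrepancy should come out on the order of $10^{-9}$ or smaller, consistent with the $O(\varepsilon^2)$ truncation error of central differences.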


