
We want to solve the classification task, i.e., learn the parameters $\theta = (\mathbf{W}, \mathbf{b}) \in \mathbb{R}^{P\times K}\times \mathbb{R}^{K}$ of a function $f_\theta: \mathbb{R}^P \to [0, 1]^K$ whose $k$-th coordinate is the probability that the input belongs to class $k$.

The model is defined as $$ [f_\theta(x)]_k = \mathbb{P}[Y=k \mid x] = \frac{1}{Z} \exp(w_k^\top x + b_k) \enspace , $$ where $w_k$ denotes the $k$-th column of $W$, and $Z$ is a normalizing constant.

As these probabilities must sum to one, we get $$ Z = \sum_{k=1}^K \exp(w_k^\top x + b_k). $$ We recognize the so-called softmax function: $[\sigma(z)]_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}$.
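For concreteness, here is a minimal NumPy sketch of this softmax (the function name and the max-subtraction stabilization trick are my own choices, not part of the statement above):

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis; z has shape (..., K)."""
    # Subtracting the row-wise max does not change the result but avoids overflow.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```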

After one-hot encoding of the target variable (OneHotEncoder), denoting by $\{ y_{ik} \}_{k=1}^{K}$ the indicator sequence for the class of the $i^{\text{th}}$ observation $x_i$ (i.e., if sample $i$ belongs to class $k$, then $y_{ik} = 1$ and $y_{ik'}=0$ for $k'\neq k$), the negative log-likelihood (NLL, a.k.a. the cross-entropy loss) becomes: $$ L(W, b) = - \frac1N \sum_{i=1}^N \log(\mathbb{P}[Y=y_i | x_i]) = -\frac1N \sum_{i=1}^N \log\Bigg(\frac{\exp(w_{y_i}^\top x_i + b_{y_i})}{\sum_{k=1}^K \exp(w_k^\top x_i+ b_k)}\Bigg) \enspace . $$

Note: The notation $w_{y_i}$ means the column of $W$ whose index corresponds to the class value (e.g. 1, 2, ..., K) for the sample $x_i$.

Using the softmax function, we can also rewrite this as $$ L(W, b) = -\frac1N \sum_{i=1}^N \log([\sigma(W^{T} x_i + b)]_{y_i}) $$
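For reference, a minimal NumPy sketch of this loss, reusing the `softmax` helper above and assuming integer labels $y_i \in \{0, \ldots, K-1\}$, $X \in \mathbb{R}^{N\times P}$, $W \in \mathbb{R}^{P\times K}$, $b \in \mathbb{R}^{K}$ (variable names are mine):

```python
def nll_loss(W, b, X, y):
    """Average negative log-likelihood L(W, b).

    X: (N, P) design matrix, y: (N,) integer labels in {0, ..., K-1},
    W: (P, K), b: (K,).
    """
    N = X.shape[0]
    probs = softmax(X @ W + b)                  # row i is sigma(W^T x_i + b)
    # Probability assigned to the true class of each sample.
    return -np.mean(np.log(probs[np.arange(N), y]))
```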


I would now like to derive expressions for the gradients of the loss $L(W, b)$ with respect to its two parameters, $\nabla_W L(W, b) \in \mathbb{R}^{P \times K}$ and $\nabla_b L(W, b) \in \mathbb{R}^{K}$.


Let $s \in \{1,\ldots,P\}$ and $t \in \{1,\ldots,K\}$. First rewrite the loss as $$ \begin{align*} L(W, b) &= -\frac1N \sum_{i=1}^N \left( w_{y_i}^\top x_{i} + b_{y_i} \right) + \frac1N \sum_{i=1}^N \log \left( \sum_{k=1}^K \exp(w_k^\top x_i+ b_k) \right) \\ &= -\frac1N \left( \sum_{i=1}^N b_{y_i} +\sum_{i=1}^N \sum_{j=1}^{P} w_{j,y_i}x_{i,j} \right) + \frac1N \sum_{i=1}^N \log \left( \sum_{k=1}^K \exp(w_k^\top x_i+ b_k) \right) \enspace . \end{align*} $$ Writing $y_{i} \rightarrow t$ to mean that $y_{i}$ one-hot encodes the class $t$, differentiation gives: $$ \partial_{W_{s,t}} L(W,b) = -\frac{1}{N} \sum_{i \,:\, y_{i} \rightarrow t} x_{i,s} + \frac{1}{N} \sum_{i=1}^{N} \frac{ x_{i,s} \exp\left( w_{t}^\top x_{i} + b_{t} \right)}{ \sum_{k=1}^{K} \exp\left( w_{k}^\top x_{i} + b_{k} \right)} = -\frac{1}{N} \sum_{i \,:\, y_{i} \rightarrow t} x_{i,s} + \frac{1}{N} \sum_{i=1}^{N} x_{i,s}\, \big[\sigma\big( W^{T}x_{i}+b \big)\big]_{t} $$ and $$ \partial_{b_{t}} L(W,b) = -\frac{ \operatorname{card}\{i \,:\, y_{i} \rightarrow t \}}{N} + \frac{1}{N} \sum_{i=1}^{N} \frac{ \exp\left( w_{t}^\top x_{i} + b_{t} \right)}{ \sum_{k=1}^{K} \exp\left( w_{k}^\top x_{i} + b_{k} \right)} = -\frac{ \operatorname{card}\{i \,:\, y_{i} \rightarrow t \}}{N} + \frac{1}{N} \sum_{i=1}^{N} \big[\sigma\big( W^{T}x_{i}+b \big)\big]_{t} \enspace . $$
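In matrix form, and if I did not make a mistake, these formulas read $\nabla_W L = \frac1N X^\top(\hat Y - Y)$ and $\nabla_b L = \frac1N \sum_{i=1}^N (\hat y_i - y_i)$, where $\hat Y \in \mathbb{R}^{N\times K}$ collects the predicted probabilities and $Y$ the one-hot targets. Here is a sketch of how I implement and check them by finite differences, reusing the `softmax` and `nll_loss` helpers above (all names are mine):

```python
def gradients(W, b, X, y):
    """Gradients of L(W, b) following the coordinate-wise formulas above.

    Returns (grad_W, grad_b) with shapes (P, K) and (K,).
    """
    N, P = X.shape
    K = W.shape[1]
    probs = softmax(X @ W + b)            # (N, K), predicted probabilities
    Y = np.eye(K)[y]                      # (N, K), one-hot targets
    grad_W = X.T @ (probs - Y) / N        # (P, K)
    grad_b = (probs - Y).mean(axis=0)     # (K,)
    return grad_W, grad_b

# Finite-difference check of one coordinate of grad_W.
rng = np.random.default_rng(0)
N, P, K = 20, 5, 3
X = rng.normal(size=(N, P))
y = rng.integers(K, size=N)
W = rng.normal(size=(P, K))
b = rng.normal(size=K)

grad_W, grad_b = gradients(W, b, X, y)
eps, s, t = 1e-6, 2, 1
W_plus = W.copy();  W_plus[s, t] += eps
W_minus = W.copy(); W_minus[s, t] -= eps
fd = (nll_loss(W_plus, b, X, y) - nll_loss(W_minus, b, X, y)) / (2 * eps)
print(grad_W[s, t], fd)   # the two numbers should agree to ~6 digits
```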


I've been told these derivatives are wrong, but I cannot spot any mistake. One alternative would be to compute the gradients differently, for example with vector/matrix calculus starting from $$ L(W, b) = -\frac1N \sum_{i=1}^N \log([\sigma(W^{T} x_i + b)]_{y_i}) $$ instead of coordinate-wise derivatives, but I am not familiar with the rules of that calculus. So I am posting the problem here: if you want to give it a go, can you find the gradients, or point out the error in mine?

Thank you for your help.



1 Answer


The negative log-likelihood is indeed also called the cross-entropy. I think part of your difficulty comes from the notation. Write the cost function (for one example) as $$ \phi = -\mathbf{y} : \log (\hat{\mathbf{y}}) \enspace , $$ where $:$ denotes the inner product (so $\phi = -\sum_k y_k \log \hat{y}_k$), the log is applied elementwise, $\hat{\mathbf{y}}=\mathrm{softmax}(\mathbf{z})$, and $\mathbf{z}=\mathbf{Wx+b}$.

The key is to show that $$ \frac{\partial \phi}{\partial \mathbf{z}}=\hat{\mathbf{y}}-\mathbf{y} $$ See this for a demonstration

UPDATE: The rest is easily obtained by the chain rule: $$ \frac{\partial \phi}{\partial \mathbf{W}}= (\hat{\mathbf{y}}-\mathbf{y})\mathbf{x}^T $$ and $$ \frac{\partial \phi}{\partial \mathbf{b}}=\hat{\mathbf{y}}-\mathbf{y} \enspace . $$
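A quick numerical sanity check of the key identity $\partial \phi / \partial \mathbf{z} = \hat{\mathbf{y}}-\mathbf{y}$ for a single example (a NumPy sketch, all names are mine; note that $\mathbf{W}$ is $K\times P$ in this convention since $\mathbf{z}=\mathbf{Wx+b}$, whereas the question uses $W \in \mathbb{R}^{P\times K}$ with $z = W^\top x + b$, so the gradients above must be transposed to match that convention):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P = 4, 6
x = rng.normal(size=P)
W = rng.normal(size=(K, P))          # K x P here, since z = W x + b
b = rng.normal(size=K)
y = np.eye(K)[2]                     # one-hot target, class 2 chosen arbitrarily

def phi(z, y):
    """Cross-entropy phi = -y : log(softmax(z)) for one example."""
    z = z - z.max()                  # stable softmax
    y_hat = np.exp(z) / np.exp(z).sum()
    return -y @ np.log(y_hat)

z = W @ x + b
y_hat = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Finite-difference gradient of phi with respect to z.
eps = 1e-6
fd = np.array([(phi(z + eps * e, y) - phi(z - eps * e, y)) / (2 * eps)
               for e in np.eye(K)])
print(np.max(np.abs(fd - (y_hat - y))))   # should be ~1e-10
```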

Steph
  • Okay, I can compute it that way. I'm going to use the matrix differential instead of partial derivatives, which amounts to the same thing. So let $f(W,b)= W^{T}x_{i}+b$ and $g(W,b) = \left[\log\left(\sigma(f(W,b))\right) \right]_{y_{i}}$, which is the function to differentiate. – CechMS Jan 28 '22 at 16:40
  • $$ \partial_{b_{s}}g(W,b) = D_{(W,b)}g\,(0,e_{s}) = \frac{ \big[D_{W^{T}x_{i}+b}\, \sigma\, (e_{s})\big]_{y_{i}} }{ \big[\sigma\left( W^{T}x_{i}+b \right)\big]_{y_{i}} } $$ with $e_{s}$ the $s$-th vector of the canonical basis; indeed $\partial_{b_{s}} (W^{T}x_{i}+b) = e_{s}$, and the term $D_{W^{T}x_{i}+b}\, \sigma\, (e_{s})$ is easy to compute since we know the differential of $\sigma$: $$ D_{z} \sigma = \left( \sigma(z)_{i}(1-\sigma(z)_{j}) \right)_{i,j} \enspace . $$ However this seems to be wrong, as my gradient check in Python tells me. Maybe it's the code, maybe it's the maths. Do you find the same result as me? Thank you – CechMS Jan 28 '22 at 16:41
  • Here you introduce the sigmoid as $\sigma$, but it was the softmax just before! I think you should describe your problem more precisely so that we can help you. – Steph Jan 28 '22 at 17:23
  • I didn't: $\sigma =$ softmax here, as introduced in my first message. – CechMS Jan 28 '22 at 18:01
  • OK? The line $D_{z} \sigma = \left( \sigma(z)_{i}(1-\sigma(z)_{j}) \right)_{i,j}$ was confusing me in this case. I have updated my answer to give you the derivatives w.r.t. the parameters. – Steph Jan 28 '22 at 18:10
  • There is a mistake indeed: it's $1_{i = j}$, not $1$, in the differential (see for example https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/). So do you agree with my partial derivatives? – CechMS Jan 28 '22 at 18:15
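A small NumPy check of the corrected Jacobian from the last comment, $D_{z}\sigma = \left(\sigma(z)_{i}(1_{i=j}-\sigma(z)_{j})\right)_{i,j} = \operatorname{diag}(\sigma(z)) - \sigma(z)\sigma(z)^\top$, against finite differences (a sketch; all names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K = 5
z = rng.normal(size=K)
s = softmax(z)

# Corrected Jacobian: entries sigma(z)_i * (1_{i=j} - sigma(z)_j).
jac = np.diag(s) - np.outer(s, s)

# Finite-difference Jacobian for comparison.
eps = 1e-6
fd = np.array([(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
               for e in np.eye(K)]).T
print(np.max(np.abs(jac - fd)))   # should be ~1e-10
```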