
I have encountered these two gradients $\nabla_{w}\, w^{t}X^{t}y$ and $\nabla_{w}\, w^t X^tXw$, where $w$ is an $n\times 1$ vector, $X$ is an $m\times n$ matrix and $y$ is an $m\times 1$ vector.

My approach for $\nabla_{w}\, w^{t}X^{t}y$ was this:

$$w^{t}X^{t}y = y_1\left(\sum_{i=1}^{n}w_ix_{1i}\right) + y_2\left(\sum_{i=1}^{n}w_ix_{2i}\right) + \cdots + y_m\left(\sum_{i=1}^{n}w_ix_{mi}\right) = \sum_{j=1}^{m}\sum_{i=1}^{n} y_jw_ix_{ji}$$

And I'm stuck there, not knowing how to convert it to matrix notation. I'm not even sure if it is correct.

How can I get the actual gradient $\nabla_{w}\, w^{t}X^{t}y$ out of that partial derivative? Is there an easier way to get the gradient (maybe using some rules, as in ordinary calculus)? This summation approach seems tedious, especially when you have to calculate $\nabla_{w}\, w^t X^tXw$.

How do I then work out $\nabla_{w}\, w^t X^tXw$?
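As a quick sanity check of the double-sum expansion above, here is a minimal sketch in numpy (the shapes, seed, and variable names are arbitrary choices for illustration):

```python
import numpy as np

# Arbitrary sizes: m = 4, n = 3 (any sizes work).
rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))
w = rng.standard_normal(n)
y = rng.standard_normal(m)

# The double summation from the expansion above.
double_sum = sum(y[j] * w[i] * X[j, i] for j in range(m) for i in range(n))

# The matrix form w^t X^t y.
matrix_form = w @ X.T @ y

print(np.isclose(double_sum, matrix_form))  # True: the expansion is correct
```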

  • Your derivation is correct. It's simple to convert this result back into matrix notation, $$\sum_j X_{ja}y_j=(X^Ty)_a,$$ it just takes a little bit of practice. Now try the second one, just remember to handle the two $w$'s separately. – greg Jan 06 '18 at 15:55

3 Answers


Let

$$f (\mathrm x) := \rm x^\top A \, x$$

Hence,

$$f (\mathrm x + h \mathrm v) = (\mathrm x + h \mathrm v)^\top \mathrm A \, (\mathrm x + h \mathrm v) = f (\mathrm x) + h \, \mathrm v^\top \mathrm A \,\mathrm x + h \, \mathrm x^\top \mathrm A \,\mathrm v + h^2 \, \mathrm v^\top \mathrm A \,\mathrm v$$

Thus, the directional derivative of $f$ in the direction of $\rm v$ at $\rm x$ is

$$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \mathrm v^\top \mathrm A \,\mathrm x + \mathrm x^\top \mathrm A \,\mathrm v = \langle \mathrm v , \mathrm A \,\mathrm x \rangle + \langle \mathrm A^\top \mathrm x , \mathrm v \rangle = \langle \mathrm v , \color{blue}{\left(\mathrm A + \mathrm A^\top\right) \,\mathrm x} \rangle$$

Lastly, the gradient of $f$ with respect to $\rm x$ is

$$\nabla_{\mathrm x} \, f (\mathrm x) = \color{blue}{\left(\mathrm A + \mathrm A^\top\right) \,\mathrm x}$$
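Since the answer is derived abstractly, a finite-difference check may help convince yourself of the boxed formula; this is a minimal numpy sketch (the size, seed, and a deliberately non-symmetric $\mathrm A$ are arbitrary choices):

```python
import numpy as np

# Check numerically that the gradient of f(x) = x^T A x is (A + A^T) x.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))  # not symmetric in general
x = rng.standard_normal(n)

f = lambda z: z @ A @ z

# Central finite differences, one coordinate direction at a time.
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])

analytic = (A + A.T) @ x
print(np.allclose(numeric, analytic))  # True
```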

  • Thank you Rodrigo! Even though you didn't answer my question directly, you provided me with a method to evaluate gradients that I was able to use and compute them correctly.

    Is this method going to work generally on any kind of matrix-vector expression? Furthermore, why is it that you have to express it in the form of an inner product like ⟨v, some expression⟩ to get the actual gradient?

    – Stefan Dimeski Jan 06 '18 at 16:07
  • This works for functions that take matrices or vectors and produce scalars. Using the Taylor expansion, $$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \langle \mathrm v , \nabla_{\mathrm x} f (\mathrm x) \rangle$$ (see the numerical check after these comments) – Rodrigo de Azevedo Jan 06 '18 at 16:12
  • Oh, I see, thank you! Is there a similar way to get the derivative of a vector-valued or a matrix-valued function? – Stefan Dimeski Jan 06 '18 at 16:35
  • I will, thank you so much man! – Stefan Dimeski Jan 06 '18 at 17:02
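The inner-product identity from the comments can also be illustrated numerically; a small sketch with numpy (the direction $\mathrm v$, sizes, and seed are illustrative):

```python
import numpy as np

# For f(x) = x^T A x, the directional derivative (f(x + h v) - f(x)) / h
# approaches <v, grad f(x)> = v . (A + A^T) x as h -> 0.
rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)

f = lambda z: z @ A @ z
grad = (A + A.T) @ x

h = 1e-7
directional = (f(x + h * v) - f(x)) / h
print(np.isclose(directional, v @ grad, atol=1e-4))  # True
```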

By the definition of the gradient vector of the map
$$
\mathbb{R}^{n\times 1}\ni w \mapsto w^tX^ty= \sum_{i=1}^n\sum_{j=1}^m w_{i1}\cdot X_{ji}\cdot y_{j1}\in\mathbb{R}
$$
we have
$$
\nabla_w \big( w^tX^ty \big) = \left( \frac{\partial}{\partial w_{11}} ( w^tX^ty ), \frac{\partial}{\partial w_{21}} ( w^tX^ty ), \ldots, \frac{\partial}{\partial w_{i_01}} ( w^tX^ty ), \ldots, \frac{\partial}{\partial w_{n1}}( w^tX^ty ) \right).
$$
For $i_0=1,2,\ldots,n$,
\begin{align}
\frac{\partial}{\partial w_{i_01}} ( w^tX^ty ) &= \frac{\partial}{\partial w_{i_01}} \left( \sum_{i=1}^n\sum_{j=1}^m w_{i1}\cdot X_{ji}\cdot y_{j1} \right) \\
&= \sum_{i=1}^n\sum_{j=1}^m \frac{\partial}{\partial w_{i_01}} (w_{i1}\cdot X_{ji}\cdot y_{j1}) \\
&= \sum_{j=1}^m \frac{\partial}{\partial w_{i_01}} (w_{i_01}\cdot X_{ji_0}\cdot y_{j1}) \\
&= \sum_{j=1}^m X_{ji_0}\cdot y_{j1}.
\end{align}
Then
$$
\nabla_w \big( w^tX^ty \big) = \left( \sum_{j=1}^m X_{j1}\cdot y_{j1}, \sum_{j=1}^m X_{j2}\cdot y_{j1}, \ldots, \sum_{j=1}^m X_{jn}\cdot y_{j1} \right),
$$
which is precisely $X^ty$. With similar calculations, we get the gradient vector of the map
$$
\mathbb{R}^{n\times 1}\ni w \mapsto w^tX^tXw= \sum_{k=1}^n w_{k1}^2\cdot (X^tX)_{kk} + 2\sum_{1\leq k<\ell \leq n} w_{k1}\cdot (X^tX)_{k\ell}\cdot w_{\ell 1} \in\mathbb{R},
$$
where $(X^tX)_{k\ell}=\sum_{j=1}^m X_{jk}\cdot X_{j\ell}$.
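Both results can be checked numerically. The sketch below (with numpy; shapes and seed are arbitrary) compares the componentwise sums against $X^ty$, and compares a finite-difference gradient of $w^tX^tXw$ against $2X^tXw$, which is what Rodrigo's formula gives with $\mathrm A = X^tX$ symmetric:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

# Entries sum_j X_{j,i0} * y_j for i0 = 1, ..., n equal the vector X^t y.
componentwise = np.array([sum(X[j, i0] * y[j] for j in range(m)) for i0 in range(n)])
print(np.allclose(componentwise, X.T @ y))  # True

# Finite-difference gradient of w^t X^t X w matches 2 X^t X w.
g = lambda z: z @ X.T @ X @ z
h = 1e-6
numeric = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(numeric, 2 * X.T @ X @ w))  # True
```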

Elias Costa

Better use $w^tX^ty=(w^tX^ty)^t=y^tXw$, which holds because $w^tX^ty$ is a scalar and so equals its own transpose. The right-hand side is linear in $w$, so the gradient is immediate: $\nabla_w\, (y^tXw) = (y^tX)^t = X^ty$.
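A short numerical confirmation of this identity and of the resulting gradient (a sketch with numpy; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 3
X = rng.standard_normal((m, n))
w = rng.standard_normal(n)
y = rng.standard_normal(m)

# The scalar w^t X^t y equals y^t X w ...
print(np.isclose(w @ X.T @ y, y @ X @ w))  # True

# ... and finite differences of the linear map recover the gradient X^t y.
f = lambda z: y @ X @ z
h = 1e-6
numeric = np.array([(f(w + h * e) - f(w - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(numeric, X.T @ y))  # True
```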

random