
In a multivariate linear model, I have come across the following matrix-valued function of $\beta \in \Bbb R^p$:

$$\beta \mapsto(y-X\beta)(y-X\beta)^{T}$$

where the matrix $X \in \Bbb R^{n \times p}$ and the vector $y \in \Bbb R^n$ are given. I need to differentiate this function with respect to $\beta \in \Bbb R^p$. Can anyone please help me with how to do this?

Some other examples that I have seen on this site involve differentiating $(y-X\beta)^{T}(y-X\beta)$ (which is a scalar), but here the expression is an $n \times n$ matrix and I am not sure how to handle this. Also, I would appreciate references or reading material on this kind of matrix-vector differentiation for beginners.

user587389
  • You need a 3-dimensional matrix. What you should do is differentiate each entry of the output matrix with respect to the vector input, thereby obtaining a gradient. Then collect all $n^2$ gradients in some list. – Rodrigo de Azevedo Apr 24 '22 at 16:24
  • @RodrigodeAzevedo Sorry, but what is a gradient and what does it mean to collect gradients in a list? May I request you to elaborate with some example? – user587389 Apr 24 '22 at 16:33
  • Take a look at this and this and especially this. By "collecting gradients" I mean something like an $n \times n$ matrix where each entry is a gradient rather than a scalar. It would be a "cubical" matrix. – Rodrigo de Azevedo Apr 24 '22 at 17:24
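
To make the "cubical matrix" from the comments concrete, here is a minimal NumPy sketch (the names, sizes, and seed below are my own illustrative assumptions, not part of the thread) that builds the $n \times n \times p$ array of partial derivatives by finite differences:

```python
# Minimal sketch: the derivative of the n x n output with respect to the
# p-vector beta is an n x n x p array whose (i, j, k) entry is the partial
# derivative of the (i, j) output entry with respect to beta_k.
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2                                   # illustrative sizes
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

f = lambda b: np.outer(y - X @ b, y - X @ b)  # the matrix-valued function

h = 1e-6
D = np.empty((n, n, p))
for k in range(p):
    e_k = np.zeros(p)
    e_k[k] = 1.0
    # central difference in the direction of the k-th basis vector
    D[:, :, k] = (f(beta + h * e_k) - f(beta - h * e_k)) / (2 * h)

print(D.shape)  # (4, 4, 2): one n x n slice of partials per component of beta
```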

2 Answers


$ \def\e{\varepsilon}\def\p{\partial} \def\LR#1{\left(#1\right)} \def\BR#1{\bigl(#1\bigr)} \def\vecc#1{\operatorname{vec}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $Here are two approaches to avoid the 3D-matrix (aka tensor) issue mentioned by Rodrigo.

First, note that the gradient of a vector $(b)$ with respect to one of its components $(b_k)$ is the corresponding Cartesian basis vector $(\e_k)$ $$\eqalign{ \grad{b}{b_k} &= \e_k \\ }$$ For ease of typing, I'll use $b$ instead of $\beta,\,$ and also define the vector $$\eqalign{ z &= (Xb-y) \qiq \grad z{b_k} = X\e_k = x_k \\ }$$ where $x_k$ is the $k^{th}$ column of $X$.

Using this, the component-wise gradient is easy to calculate $$\eqalign{ Q &= zz^T \\ \grad {Q}{b_k} &= zx_k^T + x_kz^T \\ }$$ Another approach is to vectorize the matrix differential of the function $$\eqalign{ Q &= zz^T \\ dQ &= X\,db\,z^T + z\,db^TX^T \\ \vecc{dQ} &= \BR{z\otimes X + X\otimes z}\,db \\ \grad{\vecc{Q}}b &= {z\otimes X + X\otimes z} \\ }$$ where $(\otimes)$ denotes the Kronecker product.

Or you could extend the component-wise result to a full tensor result $$\eqalign{ \grad {Q_{ij}}{b_k} &= z_iX_{jk} + X_{ik}z_j \\ }$$
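
If it helps, here is a short NumPy check of these formulas against finite differences (a sketch only; the sizes, seed, and names below are my own assumptions, and $\operatorname{vec}$ is taken column-major to match the Kronecker identity):

```python
# Verify dQ/db_k = z x_k^T + x_k z^T  and  d vec(Q)/db = z (x) X + X (x) z,
# where z = X b - y and (x) denotes the Kronecker product.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
b = rng.standard_normal(p)

Q = lambda b: np.outer(X @ b - y, X @ b - y)
z = X @ b - y
h = 1e-6

# component-wise gradient for one index k
k = 1
e_k = np.zeros(p)
e_k[k] = 1.0
fd = (Q(b + h * e_k) - Q(b - h * e_k)) / (2 * h)        # central difference
print(np.allclose(fd, np.outer(z, X[:, k]) + np.outer(X[:, k], z)))  # True

# vectorized Jacobian (vec stacks columns, i.e. Fortran order)
J = np.kron(z[:, None], X) + np.kron(X, z[:, None])     # shape (n*n, p)
fd_J = np.column_stack([
    ((Q(b + h * np.eye(p)[:, j]) - Q(b - h * np.eye(p)[:, j])) / (2 * h)).reshape(-1, order="F")
    for j in range(p)
])
print(np.allclose(J, fd_J))  # True
```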

greg

In this context, differentiation goes back to the meaning of the derivative as a linear approximation near a given point: that is, if you denote your function by $f$,

$$ f(\beta + \varepsilon \alpha) = f(\beta) + \varepsilon f'(\beta) \cdot \alpha + o(\varepsilon) $$

so $f'(\beta)$ is a linear operator from vectors to $n \times n$ matrices. In this case, you can carry out the expansion yourself:

$$ f(\beta + \varepsilon \alpha) = (y - X \beta - \varepsilon X \alpha) (y - X \beta - \varepsilon X \alpha)^T \\ = (y - X \beta)(y - X \beta)^T - \varepsilon ( X \alpha (y - X \beta)^T + (y - X \beta) (X \alpha)^T ) + O(\varepsilon^2) $$

and you find:

$$ f'(\beta) : \alpha \mapsto -\bigl( X\alpha \,(y - X \beta)^T + (y - X \beta)\, \alpha^T X^T \bigr) $$
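
A quick numerical check of this directional derivative (again only a sketch; the data and names below are illustrative assumptions):

```python
# Compare f'(beta) . alpha with the first-order difference quotient
# (f(beta + eps*alpha) - f(beta)) / eps.
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
alpha = rng.standard_normal(p)

f = lambda b: np.outer(y - X @ b, y - X @ b)

eps = 1e-6
quotient = (f(beta + eps * alpha) - f(beta)) / eps
r = y - X @ beta
derivative = -(np.outer(X @ alpha, r) + np.outer(r, X @ alpha))
print(np.allclose(quotient, derivative, atol=1e-4))  # True up to the O(eps) remainder
```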

Rondoudou