
I am trying to differentiate

$$ \mathbf{b} \mapsto \mathbf{(a+b)}^\intercal \mathbf{A(a+b)} $$

where $\mathbf{a}$ and $\mathbf{b}$ are $n \times 1$ vectors and $\mathbf{A}$ is an $n \times n$ symmetric positive semidefinite matrix.

I expand it out to

$$\mathbf{a}^\intercal \mathbf{Aa} + \mathbf{a}^\intercal \mathbf{Ab} + \mathbf{b}^\intercal \mathbf{Aa} + \mathbf{b}^\intercal \mathbf{Ab}$$

and then differentiate with respect to $\mathbf{b}$, which I calculate as $\mathbf{a}^\intercal \mathbf{A} + \mathbf{Aa} + 2\mathbf{Ab}$. But this does not make sense to me: the first term $\mathbf{a}^\intercal \mathbf{A}$ is $1 \times n$, while the second and third terms $\mathbf{Aa}$ and $2\mathbf{Ab}$ are $n \times 1$, so the dimensions do not match. What am I doing wrong?

  • Note that $a^\top A b = b^\top A a$, so their gradients with respect to $b$ should be identical. If you are using the convention that gradients are column vectors, then the gradient of both terms is $Aa$. It may help to revisit the definition of the gradient from scratch, rather than blindly applying univariate differentiation rules to the multivariate setting. – angryavian Jul 21 '23 at 16:14

3 Answers

1

Your idea almost works, but when you differentiate you need to take the transposes into account. One way to do this is with differential notation. For $$\begin{align} f &=(a+b)^TA(a+b) \\ &=a^TAa+b^TAa+a^TAb+b^TAb \end{align}$$ the differential is $$\begin{align} df &=db^TAa+a^TAdb+db^TAb+b^TAdb \end{align}$$ where the last two terms come from the product rule.

Using $x^TAy=y^TA^Tx$ and the symmetry of $A$, we can rearrange $df$ into something times $db$: $$\begin{align} df &=a^TAdb+a^TAdb+b^TAdb+b^TAdb \\ &=2(a^TA+b^TA)db \\ &=2(a+b)^TAdb. \end{align}$$

Now, $f$ is a scalar. Do you expect its derivative to be a row or a column? If you want the derivative to be a row, then the answer is the factor multiplying $db$ above, i.e. $$\frac{df}{db}=2(a+b)^TA.$$ On the other hand, if you want the answer to be a column, then you can write the above as a dot product of something with $db$, $$df=2A(a+b)\cdot db,$$ and the derivative is the first factor: $$\frac{df}{db}=2A(a+b).$$
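If it helps convince you, here is a quick numerical sanity check of the column-convention result $2A(a+b)$ against a central finite difference (this snippet is just an illustration with numpy, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random symmetric positive semidefinite A, and random vectors a, b
M = rng.standard_normal((n, n))
A = M @ M.T
a = rng.standard_normal(n)
b = rng.standard_normal(n)

def f(b):
    """The scalar objective (a+b)^T A (a+b)."""
    return (a + b) @ A @ (a + b)

# Analytic gradient (column convention): 2 A (a + b)
grad_analytic = 2 * A @ (a + b)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (f(b + eps * e) - f(b - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # True
```

Because $f$ is quadratic, the central difference recovers the gradient exactly up to floating-point rounding.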

Jackozee Hakkiuz
  • 5,583
  • 1
  • 14
  • 35
0

In vector/matrix calculus, we often break things up using the chain rule, rather than by expanding out terms.

Think of it like this: we have $a, b \in \mathbb{R}^n, A \in \mathbb{R}^{n \times n}$.

Now consider the following functions:

$$f: \mathbb{R}^n \rightarrow \mathbb{R}^n, \qquad f(b) = a+b$$

$$g: \mathbb{R}^n \rightarrow \mathbb{R}, \qquad g(x) = x^TAx$$ where $A$ is a symmetric matrix.

By the chain rule: $\frac{dg(f(b))}{d b} = \frac{dg}{df} \frac{df}{db}$ where $\frac{dg}{df} \in \mathbb{R}^{1 \times n}, \frac{df}{db} \in \mathbb{R}^{n \times n}$ and so $\frac{dg(f(b))}{db} \in \mathbb{R}^{1 \times n}$.

Let's calculate those separately:

$$\frac{df}{ db} = \begin{bmatrix} \frac{\partial f_1}{\partial b_1} & \dots & \frac{\partial f_1}{\partial b_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_n}{\partial b_1} & \dots & \frac{\partial f_n}{\partial b_n} \end{bmatrix} \in \mathbb{R}^{n \times n} $$

Now, for each partial derivative in the Jacobian above, we have $\frac{\partial f_i}{\partial b_j} = \frac{\partial }{\partial b_j}a_i + \frac{\partial}{\partial b_j}b_i = \delta_{ij}$. Therefore $\frac{df}{db} = I_n \in \mathbb{R}^{n \times n}$. Now, $$\frac{dg}{df} = \begin{bmatrix} \frac{\partial g}{\partial f_1} & \frac{\partial g}{\partial f_2} & \cdots & \frac{\partial g}{\partial f_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}$$

Note that $g(f) = f^TAf = \sum_{i=1}^{n}{f_i [Af]_i} = \sum_{i=1}^{n}{\sum_{j=1}^{n}{f_iA_{ij}f_j}}$. Therefore:

$$\begin{align} \frac{\partial g}{\partial f_k} &= \sum_{i, j}{A_{ij} \frac{\partial}{\partial f_k}(f_if_j)} = \sum_{i, j}{A_{ij}(f_i\delta_{jk} + f_j\delta_{ik})} \\ &= \sum_{i, j}{A_{ij}f_i\delta_{jk}} + \sum_{i, j}{A_{ij}f_j\delta_{ik}} = \sum_{i}{A_{ik}f_i} + \sum_{j}{A_{kj}f_j} \end{align}$$

Since $A$ is symmetric, $A_{ik} = A_{ki}$ so we have:

$$\frac{\partial g}{\partial f_k} = \sum_{i}{A_{ki}f_i} + \sum_{j}{A_{kj}f_j} $$

Reindexing we are left with:

$$\frac{\partial g}{\partial f_k} = 2 \sum_{i}{A_{ki}f_i} = 2[Af]_k$$

Collecting all the partial derivatives in the Jacobian vector:

$$ \begin{bmatrix} \frac{\partial g}{\partial f_1} & \cdots & \frac{\partial g}{\partial f_n} \end{bmatrix} = 2 \begin{bmatrix} [Af]_1 & \cdots & [Af]_n \end{bmatrix} = 2(Af)^T = 2f^TA^T = 2f^TA \in \mathbb{R}^{1 \times n}$$

Finally, substituting the Jacobians in the chain rule:

$$ \frac{dg(f(b))}{db} = \frac{dg}{df} \cdot \frac{df}{db} =2f^TA \cdot I_n = 2f^TA = 2(a+b)^TA \in \mathbb{R}^{1 \times n}$$

as required.
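The two Jacobians above can also be checked numerically. A small sketch with numpy (illustrative only, names are mine): $df/db = I_n$ is built directly, $dg/df = 2f^TA$ is compared against finite differences of $g$, and their product reproduces $2(a+b)^TA$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
A = M @ M.T                    # symmetric, as the answer assumes
a = rng.standard_normal(n)
b = rng.standard_normal(n)

fvec = a + b                   # f(b) = a + b
J_f = np.eye(n)                # df/db = I_n  (n x n)
J_g = 2 * fvec @ A             # dg/df = 2 f^T A  (1 x n row)

# Chain rule: dg(f(b))/db = (dg/df)(df/db)
J_chain = J_g @ J_f

# Finite-difference check of dg/df at the point f = a + b
g = lambda x: x @ A @ x
eps = 1e-6
J_g_fd = np.array([(g(fvec + eps * e) - g(fvec - eps * e)) / (2 * eps)
                   for e in np.eye(n)])

print(np.allclose(J_chain, 2 * (a + b) @ A))   # True
print(np.allclose(J_g, J_g_fd, atol=1e-5))     # True
```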

Dmarks
  • 381
0

$ \def\BR#1{\Big[#1\Big]} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} $Define the vector $$\eqalign{w = a+b \qiq dw=db}$$ and use it to rewrite the objective function before differentiating $$\eqalign{ \phi &= w^TAw \\ d\phi &= dw^TAw \;+\; w^TA\:dw \\ &= \LR{2Aw}^Tdw \\ }$$ Now change $w\to b$ $$\eqalign{ d\phi &= \BR{2A\LR{a+b}}^Tdb \qquad \\ \grad{\phi}{b} &= 2A\LR{a+b} \\ }$$
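The differential identity $d\phi = (2Aw)^Tdw$ can be seen numerically too: for a small step $db$, the actual change in $\phi$ matches $(2Aw)^Tdb$ up to the second-order remainder $db^TAdb$. A short numpy illustration (not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T                         # symmetric
a = rng.standard_normal(n)
b = rng.standard_normal(n)

phi = lambda v: (a + v) @ A @ (a + v)

db = 1e-6 * rng.standard_normal(n)  # a small step db
w = a + b

# First-order change predicted by the differential: dphi = (2 A w)^T db
dphi_linear = (2 * A @ w) @ db
dphi_actual = phi(b + db) - phi(b)

# The two agree up to the second-order term db^T A db
print(abs(dphi_actual - dphi_linear) < 1e-8)  # True
```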

greg
  • 35,825