7

Prove that the Hessian matrix of a quadratic form $f(x)=x^TAx$ is $f^{\prime\prime}(x) = A + A^T$.


I am not even sure what the Jacobian looks like (I have never computed one for $x \in \Bbb R^n$). Please help.

Smajl
  • 686

7 Answers

8

Let us first compute the first derivative. By definition we need to find, for each $x$, a linear map $f'(x)\colon\mathbb R^n \to \mathbb R$ such that $$ f(x+h) = f(x) + f'(x)h + o(h), \qquad h \to 0. $$ We have \begin{align*} f(x+h) &= (x+h)^tA(x+h)\\ &= x^tAx + h^tAx + x^tAh + h^tAh\\ &= f(x) + x^t(A + A^t)h + h^tAh. \end{align*} As $|h^tAh|\le \|A\||h|^2 = o(h)$, we get $f'(x) = x^t(A + A^t)$ for each $x \in \mathbb R^n$. Now compute $f''$: we have \begin{align*} f'(x+h) &= x^t(A + A^t) + h^t(A + A^t)\\ &= f'(x) + h^t(A + A^t). \end{align*} So $f''(x) = A + A^t$.
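A quick numerical sanity check of this computation (a sketch, assuming NumPy is available; the matrix, point, and direction below are arbitrary illustrative choices): the remainder $f(x+h)-f(x)-x^t(A+A^t)h$ should shrink like $\|h\|^2$.

```python
import numpy as np

# Sketch (not part of the answer above): check that the remainder
# f(x+h) - f(x) - x^T (A + A^T) h  shrinks like |h|^2, i.e. is o(h).
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # deliberately non-symmetric
x = rng.standard_normal(n)
d = rng.standard_normal(n)        # a fixed direction

f = lambda v: v @ A @ v
deriv_row = x @ (A + A.T)         # the claimed derivative x^T (A + A^T)

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * d
    remainder = f(x + h) - f(x) - deriv_row @ h
    print(t, remainder)           # the remainder scales like t^2
```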

martini
  • 84,101
8

Intuitively, the gradient and Hessian of $f$ satisfy \begin{equation} f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac12 \Delta x^T Hf(x) \Delta x \end{equation} and the Hessian is symmetric.

In this problem, \begin{align*} f(x + \Delta x) &= (x + \Delta x)^T A (x + \Delta x) \\ &= x^T A x + \Delta x^T A x + x^T A \Delta x + \Delta x^T A \Delta x \\ &= x^T A x + \Delta x^T(A + A^T)x + \frac12 \Delta x^T(A + A^T) \Delta x. \end{align*}

Comparing this with the approximate equality above, we see that $\nabla f(x) = (A + A^T) x$ and $Hf(x) = A + A^T$.
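One can also verify numerically that with this gradient and Hessian the second-order expansion is in fact exact for the quadratic (a sketch assuming NumPy; all names are illustrative).

```python
import numpy as np

# Sketch assuming NumPy: for a quadratic, the second-order expansion with
# gradient (A + A^T) x and Hessian A + A^T reproduces f(x + dx) exactly.
rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
x, dx = rng.standard_normal(n), rng.standard_normal(n)

f = lambda v: v @ A @ v
g = (A + A.T) @ x                 # gradient
H = A + A.T                       # Hessian

lhs = f(x + dx)
rhs = f(x) + g @ dx + 0.5 * dx @ H @ dx
print(np.isclose(lhs, rhs))       # True: the expansion is exact here
```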

littleO
  • 51,938
6

Write $f$ out explicitly in its second-degree monomials, $$f(x)=\sum_{i,j}a_{ij}x_ix_j.$$ The Hessian is the matrix $$H=(\partial_i\partial_jf(x)),$$ and $\partial_i\partial_jf = a_{ij}+a_{ji}$, which is exactly the $(i,j)$ entry of $A+A^T$.
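A short symbolic check of this recipe (a sketch assuming SymPy; the symbol names and the choice $n=3$ are illustrative) confirms that the matrix of second partials is exactly $A+A^T$.

```python
import sympy as sp

# Sketch assuming SymPy: spell out f(x) = sum_{i,j} a_{ij} x_i x_j for n = 3
# and take the second partials entrywise; the result is A + A^T.
n = 3
A = sp.Matrix(n, n, lambda i, j: sp.Symbol(f'a{i}{j}'))
xs = sp.Matrix(sp.symbols(f'x0:{n}'))
f = (xs.T * A * xs)[0, 0]

H = sp.Matrix(n, n, lambda i, j: sp.diff(f, xs[i], xs[j]))
print(H - (A + A.T))   # prints the 3x3 zero matrix
```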

3

For $f(x)=x^{\top}Ax$ with $f\colon\mathbb R^n \to \mathbb R$, the Jacobian $f'\colon\mathbb R^n \to \mathbb R^{1\times n}$ can be computed as

$f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$

$f(x+h)=(x+h)^{\top}A(x+h)=(x^{\top}A+h^{\top}A)(x+h)=x^{\top}Ax+x^{\top}Ah+h^{\top}Ax+h^{\top}Ah$

$f(x+h)=f(x)+x^{\top}Ah+x^{\top}A^{\top}h+h^{\top}Ah=f(x)+x^{\top}(A+A^{\top})h+h^{\top}Ah$

$f'(x)=\lim_{h\to0}\frac{f(x)+x^{\top}(A+A^{\top})h+h^{\top}Ah-f(x)}{h}=\lim_{h\to0}\frac{(x^{\top}(A+A^{\top})+h^{\top}A)h}{h}$

$f'(x)=\lim_{h\to0}x^{\top}(A+A^{\top})+h^{\top}A=x^{\top}(A+A^{\top})$

Thus, the Hessian $f''\colon\mathbb R^n \to \mathbb R^{n\times n}$ can be found as

$f''(x)=\lim_{h\to0}\frac{f'(x+h)-f'(x)}{h}$

$f'(x+h)=(x+h)^{\top}(A+A^{\top})=x^{\top}(A+A^{\top})+h^{\top}(A+A^{\top})$

$f''(x)=\lim_{h\to0}\frac{x^{\top}(A+A^{\top})+h^{\top}(A+A^{\top})-x^{\top}(A+A^{\top})}{h}=\lim_{h\to0}A+A^{\top}$

Finally, $f''(x)=A+A^{\top}$.
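Since $f'(x)=x^{\top}(A+A^{\top})$ is linear in $x$, its increment is exact rather than only a limit; a small numerical sketch (assuming NumPy, with illustrative names) makes this visible.

```python
import numpy as np

# Sketch assuming NumPy: the first derivative x^T (A + A^T) is linear in x,
# so its change f'(x+h) - f'(x) equals h^T (A + A^T) exactly, which is the
# constant Hessian A + A^T acting on h.
rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
x, h = rng.standard_normal(n), rng.standard_normal(n)

fprime = lambda v: v @ (A + A.T)          # row-vector derivative
print(np.allclose(fprime(x + h) - fprime(x), h @ (A + A.T)))  # True
```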

1

For all who wonder about the step that turns an expression containing $h^{\top}$ into one containing $h$:

$x^{\top}Ah+h^{\top}Ax = x^{\top}Ah+x^{\top}A^{\top}h = x^{\top}(A+A^{\top})h$

you can take a $2\times 2$ matrix and show by direct calculation that $h^{\top}Ax$ is the same as $x^{\top}A^{\top}h$:

$A=\left(\begin{matrix}A_{1,1}&A_{1,2}\\A_{2,1}&A_{2,2}\end{matrix}\right)$

$A^{\top}=\left(\begin{matrix}A_{1,1}&A_{2,1}\\A_{1,2}&A_{2,2}\end{matrix}\right)$

$h^{\top}Ax=\left(\begin{matrix}h_1&h_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}&A_{1,2}\\A_{2,1}&A_{2,2}\end{matrix}\right)\left(\begin{matrix}x_1\\x_2\end{matrix}\right)=\left(\begin{matrix}A_{1,1}h_1+A_{2,1}h_2&A_{1,2}h_1+A_{2,2}h_2\end{matrix}\right)\left(\begin{matrix}x_1\\x_2\end{matrix}\right)$

$=(A_{1,1}h_1+A_{2,1}h_2)x_1+(A_{1,2}h_1+A_{2,2}h_2)x_2$

$x^{\top}A^{\top}h=\left(\begin{matrix}x_1&x_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}&A_{2,1}\\A_{1,2}&A_{2,2}\end{matrix}\right)\left(\begin{matrix}h_1\\h_2\end{matrix}\right)=\left(\begin{matrix}x_1&x_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}h_1+A_{2,1}h_2\\A_{1,2}h_1+A_{2,2}h_2\end{matrix}\right)$

$=(A_{1,1}h_1+A_{2,1}h_2)x_1+(A_{1,2}h_1+A_{2,2}h_2)x_2$
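The identity holds for arbitrary vectors, which a quick numerical sketch (assuming NumPy; names and sizes are illustrative) also confirms.

```python
import numpy as np

# Sketch assuming NumPy: the scalar identity h^T A x = x^T A^T h holds for
# arbitrary vectors h and x, not only in the 2x2 case written out above.
rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
h, x = rng.standard_normal(n), rng.standard_normal(n)

print(np.isclose(h @ A @ x, x @ A.T @ h))   # True
```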

BUT there is another, unrelated problem with the formula for the gradient. A gradient is a column vector, and $x^{\top}(A+A^{\top})$ produces a row vector. AND how do you divide by a vector $h=\left(\begin{matrix}h_1\\\vdots\\h_n\end{matrix}\right)$? It doesn't work, and I think this is the reason why we get a row vector instead of a column vector. One probably needs to use a directional derivative to be rigorous. But what is the gradient written as a directional derivative?

This has some practical use for constructing gradients and Hessians of quadratic forms of Laplacians. And when using Newton optimization you cannot plug in a row vector.

(For comparison, as suggested above, here is the formula from https://en.wikipedia.org/wiki/Taylor_series, section "Taylor series in several variables".)

$T(\mathbf{x}) = f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\mathsf{T} D f(\mathbf{a}) + \frac{1}{2!} (\mathbf{x} - \mathbf{a})^\mathsf{T} \left \{D^2 f(\mathbf{a}) \right \} (\mathbf{x} - \mathbf{a}) + \cdots, $

It is not really fair to compare the results here with the Taylor formula from Wikipedia, because there you would multiply the gradient and the Hessian by $x$ and $x^{\top}$, whereas here we are interested in the gradient and the Hessian only. Still, you can see that the gradient needs to be a column vector.

I think that the gradient of $f(x)=x^{\top}Ax$ will be $\nabla f(x)=(A+A^{\top})x$, but a proof is missing.

Edit: the differential $Df(x)$ is apparently the transpose of the gradient, so it should be $Df(x)=x^{\top}(A+A^{\top})$ (see the comments below).
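A numerical sketch of this last point (assuming NumPy; names are illustrative): the column vector $(A+A^{\top})x$ matches a finite-difference approximation of the partial derivatives, and the row differential $Df(x)=x^{\top}(A+A^{\top})$ is simply its transpose.

```python
import numpy as np

# Sketch assuming NumPy: compare the column gradient (A + A^T) x with a
# central-difference approximation of the partial derivatives; the row
# differential Df(x) = x^T (A + A^T) is just its transpose.
rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ A @ v

grad = (A + A.T) @ x                        # column vector
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad, num_grad))          # True (up to rounding)
```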

  • There is a subtle technical difference between the gradient $\nabla f$ and the first derivative (or differential) $D f$ of a function $f: \mathbb{R}^n \to \mathbb{R}$, namely that they are transposes of each other (see wiki: gradient). – MSDG Dec 29 '19 at 12:38
  • Aah, thanks! I didn't know that! Yes, that is really important to know! Then the Taylor series formula above is technically speaking problematic? Confusing! I mean, $x$ must be a column vector, right? (Otherwise you could not multiply $x^{\top}$ with the Hessian from the left.) And if you try to multiply $x^{\top}=(x_1\ x_2\ ...\ x_n)$ with $Df$ when it is a row vector ($Df=(f_{x_1}\ f_{x_2}\ ...\ f_{x_n})$), that doesn't work out... :-oo – Sönke Schmachtel Dec 29 '19 at 13:06
  • https://en.wikipedia.org/wiki/Gradient#Derivative – Sönke Schmachtel Dec 29 '19 at 13:15
  • It is not problematic, one just needs to be aware of what the notation represents. In the Wikipedia article for Taylor series expansions it is clearly stated below the formula that $Df$ denotes the gradient, not the differential (so it is a column vector, and the multiplication that you find problematic is well-defined). There are several conventions for denoting these things. Personally I like to denote the differential by $\mathrm df$, and the gradient by $\nabla f$ (or $\text{grad } f$). – MSDG Dec 29 '19 at 13:31
  • 1
    Yes :-) Thumbs up! "where $Df(a)$ is the gradient of $f$ evaluated at $x = a$" – I should have read it more carefully. – Sönke Schmachtel Dec 29 '19 at 13:37
0

Digging further into quadratic forms, I came across the fact that a quadratic form of a nonsymmetric matrix $A$ can always be rewritten as a symmetric quadratic form via $x^{\top}Ax=x^{\top}\frac{A+A^{\top}}{2}x=x^{\top}A_{sym}x$.

https://math.stackexchange.com/a/3203658/738033

Especially nice is then that $A_{sym}=Q^{\top}\Lambda Q$, where $Q$ is an orthogonal matrix. Also, if $A$ is positive definite (in particular if it is a Laplacian) you could use the Cholesky decomposition $A_{sym}=LL^{\top}$, or the related LDL factorization for linearly constrained (positive semidefinite) problems :-)
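A numerical sketch of both observations (assuming NumPy; the positive definite matrix below is an illustrative construction, not an actual Laplacian):

```python
import numpy as np

# Sketch assuming NumPy: the quadratic form only sees the symmetric part
# A_sym = (A + A^T)/2, and for a positive definite matrix a Cholesky
# factor L with  M = L L^T  exists.
rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
A_sym = (A + A.T) / 2

print(np.isclose(x @ A @ x, x @ A_sym @ x))    # True

A_pd = A @ A.T + n * np.eye(n)                  # an illustrative positive definite matrix
L = np.linalg.cholesky(A_pd)
print(np.allclose(L @ L.T, A_pd))               # True
```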

0

If you happen to work in machine learning and read Christopher M. Bishop's classic text "Pattern Recognition and Machine Learning", it is like doing univariate calculus:

First-order derivative (gradient): $$\nabla f({\bf x})=\frac{\partial{\bf x^T}{\bf A}{\bf x}}{\partial{\bf x}}=\frac{\partial\,\rm{Tr}({\bf x^T}{\bf A}{\bf x})}{\partial{\bf x}}=\bigl({\bf x}^T({\bf A}+{\bf A}^T)\bigr)^T=({\bf A}+{\bf A}^T){\bf x}=2{\bf A}{\bf x}$$ by equation (C.27), where the last step uses the symmetry of $\bf A$. The Jacobian is the row form of the gradient, i.e., the transpose of the gradient.

Further, the second-order derivative (Hessian): $${\bf H}=\frac{\partial\nabla f({\bf x})}{\partial{\bf x}}=2\frac{\partial}{\partial{\bf x}}({\bf A}{\bf x})=2\left(\frac{\partial{\bf A}}{\partial{\bf x}}{\bf x}+{\bf A}\frac{\partial{\bf x}}{\partial{\bf x}}\right)=2({\bf0}+{\bf A}{\bf I})=2{\bf A}={\bf A}+{\bf A}^T$$ by equation (C.20). So, if you are familiar with the rules and notation of matrix derivatives in Bishop's text, it is very easy and intuitive.
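A numerical sketch of the symmetric case (assuming NumPy; names are illustrative) confirms that these formulas agree with the general ones $(A+A^T)x$ and $A+A^T$:

```python
import numpy as np

# Sketch assuming NumPy: for a symmetric A the gradient collapses to 2 A x
# and the Hessian to 2 A = A + A^T, matching the general formulas.
rng = np.random.default_rng(6)
n = 4
S = rng.standard_normal((n, n))
A = (S + S.T) / 2                  # force symmetry
x = rng.standard_normal(n)

print(np.allclose((A + A.T) @ x, 2 * A @ x))   # gradients agree
print(np.allclose(A + A.T, 2 * A))             # Hessians agree
```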

zzzhhh
  • 163