
Using the product rule, I can write $\frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^{\top} \frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}+(\mathbf{A} \mathbf{x})^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{x}}$ and obtain the correct result $\mathbf{x}^{\top}\left(\mathbf{A}+\mathbf{A}^{\top}\right)$. Here $\mathbf{x}$ is a vector and $\mathbf{A}$ is a matrix that does not depend on $\mathbf{x}$.
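As a quick sanity check of that identity, the formula agrees with finite differences (a minimal NumPy sketch; the names below are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # A does not depend on x
x = rng.standard_normal(n)

def f(v):
    return v @ A @ v              # scalar x^T A x

# central finite differences for the gradient, assembled as a row vector
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_formula = x @ (A + A.T)      # x^T (A + A^T)
print(np.allclose(grad_fd, grad_formula))   # True
```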

However, when I try to calculate the same derivative using a computation graph, I get a different result, and I am not sure at which step I made the mistake. Please point out where I went wrong. I used the graph attached (the red numbers are the derivatives) and the equations below.

[figure: the computation graph, with the derivatives marked in red]

Let $q_{2}=q_{1}x$, $q_{1}=(A^{T} x)^{T}$, and $k=A^{T}x$. I want to find $\frac{d q_{2}}{d x}$.

$\frac{d q_{2}}{d x}=\frac{\partial q_{2}}{\partial x}+\frac{\partial q_{2}}{\partial q_{1}} \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial k} \frac{\partial k}{\partial x} =x^TA+xIA^T = x^TA+xA^T \ne x^{\top}\left({A}+{A}^{\top}\right)$, where $I$ is the $N\times N$ identity matrix.

The result I get here is different. I know it is not correct, but I cannot figure out where I went wrong.

Thank you!

  • Related: https://math.stackexchange.com/q/222894 – Rodrigo de Azevedo Feb 10 '21 at 06:56
  • @RodrigodeAzevedo: Thank you, I've just checked the link out. It does provide some useful info, but I still don't know where I got it wrong in my proof! – Catbuilts Feb 10 '21 at 07:22
  • Your notation is confusing. You use the exact same notation for the gradient and for the Jacobian. – Rodrigo de Azevedo Feb 10 '21 at 07:26
  • Can you clarify it? It would be a great help for me because I'm new to this. Thank you! – Catbuilts Feb 10 '21 at 07:30
  • $\frac{d q_{2}}{d x}=\frac{\partial q_{2}}{\partial x}+\frac{\partial q_{2}}{\partial q_{1}} \frac{\partial q_{1}}{\partial x}$ is not correct. Where does this come from? – mathcounterexamples.net Feb 10 '21 at 07:43
  • Note that $\mathbf{x} \mapsto \mathbf{x}^{\top} \mathbf{A} \mathbf{x}$ is a scalar field, whereas $\mathbf{x} \mapsto \mathbf{A} \mathbf{x}$ is a vector field. – Rodrigo de Azevedo Feb 10 '21 at 08:02
  • @mathcounterexamples.net: I use the chain rule for total derivatives (https://en.wikipedia.org/wiki/Total_derivative) for $d(f(x, q_1))/dx$. What should be the correct answer? – Catbuilts Feb 10 '21 at 19:21
  • @RodrigodeAzevedo: Thank you for the clarification. Do you mean I got it wrong because I mixed scalar field and vector field together? Sorry, I still don't get it. – Catbuilts Feb 10 '21 at 19:23
  • @Catbuilts Can you explain how you are using the "graph" to calculate the derivative? I don't get the boxes. If you are using a PDF/Textbook/Video for this purpose, I'd like to see it so I can explain it with the benefit of knowing that source, which will be more helpful for you. – Sarvesh Ravichandran Iyer Feb 15 '21 at 17:38
  • @TeresaLisbon: Thank you for your help. I follow the vector calculus from this book: https://mml-book.github.io/book/mml-book.pdf , and the graph on page 162. I don't know if it makes sense to use the graph with vectors or not. Please show me where I got it wrong. Thank you! – Catbuilts Feb 15 '21 at 22:43
  • @Catbuilts Will surely have a look, unless the answer below is satisfactory. – Sarvesh Ravichandran Iyer Feb 16 '21 at 03:20
  • @teresa: Thank you, Teresa. It is clearer to me now after reading the answer below. Again, thanks for your time and consideration! – Catbuilts Feb 16 '21 at 15:14
  • @Catbuilts You are welcome! I spent some time looking at the question, then decided the answer below was well written enough. +1 to your question, and I look forward to interacting with you on the site. Always remember to apply the chain rule carefully in the multivariable situation; that is the takeaway from this problem. – Sarvesh Ravichandran Iyer Feb 16 '21 at 15:15

1 Answer


In the line $$\frac{d q_{2}}{d x}=\frac{\partial q_{2}}{\partial x}+\frac{\partial q_{2}}{\partial q_{1}} \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial x}=\dots$$ it looks to me, in terms of the dimensions of the matrices, like you have written $$ [1\times n] = [1\times n] + [1\times n][n\times n] = [1\times n] + [n\times 1][n\times n] = \dots $$ because $x$ is a column vector. The product of matrices $[n\times 1][n\times n]$ is not defined, so there is an error here.
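To make the dimension clash concrete, here is a minimal NumPy sketch (the variable names are my own, and I use an arbitrary $n\times n$ matrix as a stand-in for $\frac{\partial q_1}{\partial x}$):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
x = rng.standard_normal((n, 1))   # column vector, shape (n, 1)

J = rng.standard_normal((n, n))   # stand-in for the (n, n) Jacobian of q1
try:
    x @ J                         # [n x 1][n x n]: not defined
except ValueError as err:
    print("shape error:", err)

print((x.T @ A).shape)            # (1, n): q1(x) = x^T A is fine
```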

Let me try to rewrite this in more explicit notation (I identify $\mathbb R^n$ with the column vectors $\mathbb R^{n\times 1}$):

\begin{align} q&:\mathbb R^n \to \mathbb R,&q(x) &= x^TAx,\\ q_1&: \mathbb R^n \to \mathbb R^{1\times n}, & q_1(x) &= (A^Tx)^T = x^TA,\\ q_2&:\mathbb R^{1\times n} \times \mathbb R^n\to\mathbb R, & q_2(a,b) &= ab \end{align} Thus $$q(x) = q_2(q_1(x),x)$$ Since there are subtly different spaces involved here, I'll elect to use a notation that works for arbitrary spaces, namely the Frechet derivative $d$ which always satisfies the chain rule $$ d(f\circ g)(x)[h] = df(g(x))[dg(x)[h]],$$ and then the only game is to figure out what spaces each differential belongs to.

Letting $\partial$ denote a partial Frechet derivative, we have $$ dq(x)[h]= \partial_aq_2(q_1(x),x)[dq_1(x)[h]] + \partial_b q_2(q_1(x),x)[h].$$

Now, $q_1$ is linear, and $q_2$ is separately linear in each variable. For any linear map $L$, the Frechet derivative $dL(x)[h]=L(h)$. (The nice thing is that this covers multiplication of matrices on the left and on the right.) Thus for $h\in\mathbb R^n$ and $v\in\mathbb R^{1\times n}$,

\begin{align} dq_1(x)[h]&=q_1(h)=h^TA,\\ \partial_a q_2(a,b)[v]&= vb,\\ \partial_b q_2(a,b)[h]&= ah. \end{align} therefore \begin{align} dq(x)h&=dq_1(x)[h]x + q_1(x)h\\ &=h^T Ax + x^TA h \\ &= x^TA^Th +x^TAh \\ &= x^T(A^T+A)h = ((A^T+A)x)\cdot h \end{align} and finally we can read the Jacobian from the final line: $$ \frac {dq}{dx}(x) = x^T(A^T+A).$$ (And, the gradient vector is its transpose $\nabla q(x) = (A^T+A)x$.)
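If a numerical check helps, the directional derivative above agrees with finite differences (a minimal NumPy sketch; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
h = rng.standard_normal(n)

q = lambda v: v @ A @ v                                  # q(x) = x^T A x

eps = 1e-6
dq_fd = (q(x + eps * h) - q(x - eps * h)) / (2 * eps)    # dq(x)[h] by finite differences
dq_formula = x @ (A.T + A) @ h                           # x^T (A^T + A) h

print(np.isclose(dq_fd, dq_formula))                     # True
```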

Equivalently, one has to understand that the Jacobian of a map $\mathbb R^n \to \mathbb R^{1\times n}$ acts on the right: if $f$ is $\mathbb R^{1\times n}$-valued, then $ f(x+h) \approx f(x) + \frac{df}{dx}(x)h $ cannot make sense by dimension counting (remember $h$ is a column vector). So we are forced to accept that the correct approximation is $f(x+h)\approx f(x) + h^T \frac{df}{dx}(x)$. What this means is that the order of terms in the chain rule is reversed! Indeed, the standard chain rule formula for the composition of an $\mathbb R^{n\times 1}$-valued function $f=f(y)$ and $g=g(x)$ is $$ \frac{d(f\circ g)}{dx}(x) = \frac{df}{dy}(g(x))\frac{dg}{dx}(x).$$ If we take the matrix transpose of this equation, we obtain the chain rule for row-vector-valued functions $h=f^T$: $$\frac{d(h\circ g)}{dx}(x) = \left(\frac{dg}{dx}(x)\right)^T\frac{dh}{dy}(g(x)).$$
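For what it is worth, this transposed rule can be illustrated numerically with concrete linear maps (a minimal NumPy sketch; the maps $g(x)=Bx$, $f(y)=Cy$ and all names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, n = 3, 4, 5
B = rng.standard_normal((m, p))   # g(x) = B x,   g: R^p -> R^m
C = rng.standard_normal((n, m))   # f(y) = C y,   f: R^m -> R^{n x 1}

g = lambda x: B @ x
h = lambda y: (C @ y).T           # h = f^T is R^{1 x n}-valued

x = rng.standard_normal((p, 1))
delta = rng.standard_normal((p, 1))

# exact increment of h(g(x)) (everything is linear, so no remainder term)
lhs = h(g(x + delta)) - h(g(x))

# delta^T (dg/dx)^T (dh/dy), with dg/dx = B and dh/dy = C^T
rhs = delta.T @ B.T @ C.T

print(np.allclose(lhs, rhs))      # True
```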

A similar issue occurs in the above where the input is a row vector, and the approximation for $q_2$ reads $ q_2(a+v,b)\approx q_2(a,b) + v \frac{\partial q_2}{\partial a}(a,b)$. Hence (and I think deriving from first principles is easiest) \begin{align} q_2(q_1(x+h),b)&\approx q_2(q_1(x),b) + (q_1(x+h)-q_1(x)) \frac{\partial q_2}{\partial a}(q_1(x),b) \\&\approx q_2(q_1(x),b) + h^T\frac{dq_1}{dx}(x) \frac{\partial q_2}{\partial a}(q_1(x),b) \\& = q_2(q_1(x),b) + \frac{\partial q_2^T}{\partial a}(q_1(x),b)\frac{dq_1^T}{dx}(x)h \end{align} The correct chain rule is therefore, as painful as it seems, $$\frac{dq}{dx}(x)= \frac{\partial q_2^T}{\partial a}(q_1(x),x)\frac{dq_1^T}{dx}(x) + \frac{\partial q_2}{\partial b}(q_1(x),x).$$
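As a sanity check, this corrected rule can be assembled for the concrete $q_1,q_2$ above, where in matrix form $\frac{dq_1}{dx}(x)=A$, $\frac{\partial q_2}{\partial a}(a,b)=b$ and $\frac{\partial q_2}{\partial b}(a,b)=a$ (a minimal NumPy sketch, names mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal((n, 1))       # column vector

q1 = x.T @ A                          # q1(x) = x^T A, shape (1, n)

dq1_dx = A                            # q1(x + h) - q1(x) = h^T A
dq2_da = x                            # q2(a + v, b) - q2(a, b) = v b, with b = x
dq2_db = q1                           # q2(a, b + h) - q2(a, b) = a h, with a = q1(x)

dq_dx = dq2_da.T @ dq1_dx.T + dq2_db  # the corrected chain rule
print(np.allclose(dq_dx, x.T @ (A.T + A)))   # True
```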

It would be easier IMO to avoid this mess by not using an $\mathbb R^{1\times n}$-valued function.

Calvin Khor
  • Very useful explanation, especially the last part! I have a follow-up: are the gradient and the Jacobian always transposes of each other when they are vectors (the part where you say "and the gradient vector is its transpose...")? – Catbuilts Feb 16 '21 at 15:23
  • Glad to hear you found it useful! It is usually a transpose to me, but it is one of those things that not everyone agrees on. – Calvin Khor Feb 16 '21 at 15:53
  • I am reading about numerator layout and denominator layout. Is that related? As far as I understand, if we use numerator layout for $f:\mathbb R^N \to \mathbb R$, the Jacobian is a row vector ($1\times N$) and the gradient is its transpose ($N\times 1$), but if we use denominator layout, they should be the same. Please correct me if I'm wrong. Thanks! – Catbuilts Feb 16 '21 at 15:57
  • I think that's basically the issue. If I'm not wrong, the book calls the Jacobian and the gradient the same thing. Basically, if a calculation you see somewhere looks wrong, this is one of the things you need to check. – Calvin Khor Feb 16 '21 at 16:01
  • Thank you. In the 2 formulas at the end, I think it should be df/dg and dh/dg. – Catbuilts Feb 17 '21 at 01:17
  • @Catbuilts Yes, usually, when discussing the chain rule, you should have two different variable names, e.g. $f=f(y)$ and $g=g(x)$, and then $d(f\circ g)/dx(x) = df/dy (g(x)) dg/dx(x)$. I'll figure out how to reword it. Thank you for the careful reading! – Calvin Khor Feb 17 '21 at 01:30