
I was going through Andrew Ng's course on ML and had a doubt regarding one of the steps while deriving the solution for linear regression using normal equations.

Normal equation: $\theta=(X^TX)^{-1}X^TY$

While deriving, there's this step:

$\frac{\partial}{\partial\theta}\theta^TX^TX\theta = X^TX\frac{\partial}{\partial\theta}\theta^T\theta$

But doesn't taking $X^TX$ out of the derivative like that require matrix multiplication to be commutative, which it isn't?

Thanks

Rishabh

3 Answers


Although that equality is true, it does not give insight into why it is true.

There are many ways to compute that gradient, but here is a direct approach that simply computes all the partial derivatives individually.

Let $A$ be a symmetric matrix. (In your context, $A = X^\top X$.) The partial derivative of $\theta^\top A \theta = \sum_i \sum_j A_{ij} \theta_i \theta_j$ with respect to $\theta_k$ is $$\frac{\partial}{\partial \theta_k} \theta^\top A \theta = \sum_i \sum_j A_{ij} \frac{\partial}{\partial \theta_k}(\theta_i \theta_j) = A_{kk} \cdot 2 \theta_k + \sum_{i \ne k} A_{ik} \theta_i + \sum_{j \ne k} A_{kj} \theta_j = 2\sum_i A_{ki} \theta_i = 2 (A \theta)_k,$$ where the second-to-last equality uses the symmetry $A_{ik} = A_{ki}$. Stacking the partial derivatives into a vector gives you the gradient, so $$\nabla_\theta \theta^\top A \theta = 2 A \theta.$$
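If you want a quick numerical sanity check of this gradient, here is a minimal NumPy sketch (the random $X$ and $\theta$ below are just illustrative choices) comparing $2A\theta$ against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
A = X.T @ X                       # symmetric, as in the question
theta = rng.standard_normal(3)

f = lambda t: t @ A @ t           # scalar function theta^T A theta

# Central finite differences, one coordinate at a time
eps = 1e-6
fd_grad = np.array([
    (f(theta + eps * np.eye(3)[k]) - f(theta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(np.allclose(fd_grad, 2 * A @ theta, atol=1e-5))  # True
```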

angryavian

Given two symmetric matrices $(A, B)$ with $A$ invertible, consider the following scalar functions and their gradients $$\eqalign{ \alpha &= \theta^TA\theta &\implies \frac{\partial\alpha}{\partial\theta}=2A\theta \cr \beta &= \theta^TB\theta &\implies \frac{\partial\beta}{\partial\theta}=2B\theta \cr }$$ It's not terribly illuminating, but you can write the second gradient in terms of the first, i.e. $$\frac{\partial\beta}{\partial\theta} = BA^{-1}\frac{\partial\alpha}{\partial\theta}$$ For the purposes of your question, $A=I$ and $B=X^TX$.
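As a small illustrative check (the random symmetric $A$ and $B$ below are assumptions, with $A$ made positive definite so that $A^{-1}$ exists), the relation can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # symmetric positive definite, hence invertible
B = rng.standard_normal((4, 4))
B = (B + B.T) / 2                  # symmetric
theta = rng.standard_normal(4)

grad_alpha = 2 * A @ theta         # gradient of alpha
grad_beta = 2 * B @ theta          # gradient of beta

# grad_beta should equal B A^{-1} grad_alpha
print(np.allclose(grad_beta, B @ np.linalg.solve(A, grad_alpha)))  # True
```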

greg

A slight generalization of the result in the first answer above is a matrix identity that I believe Andrew Ng may have pointed out at some stage. Although it is not needed for this problem, it's useful to know in general. For any matrices $Z, A$ with $A$ symmetric, we have:

$$\nabla_Z tr(Z^TAZ) = 2AZ$$

I believe Andrew Ng introduces something like this as the matrix analogue of the scalar case $\frac{d}{dx}x^2 = 2x$. Here, $\nabla_Z f(Z)$ is just the matrix of partial derivatives of the function $f$ with respect to each entry of some general $m \times n$ matrix $Z$, and $tr(A)$ is the sum of the diagonal entries of a square matrix $A$. The only difference from the answer above is that it proves the identity for the case where $Z = \theta$ is a vector (rather than a general matrix). But we can generalize further by induction on the number of columns $n$ in $Z$. That answer gives us the base case $n = 1$, where the trace of the scalar result is just the result itself. So we're only left with the inductive case.

In the inductive case, we will make use of block notation for matrices which is very handy for induction on matrices. So we assume the identity holds for all $m \times n'$ matrices with $n' \leq n$ and show that it holds for any $m \times (n + 1)$ matrix. Well, using block notation, such a matrix can always be written as an $m \times n$ matrix concatenated with a vector. So we have:

$$Z = \begin{bmatrix} Z' & \theta \end{bmatrix}$$

Where $\theta$ is some $m \times 1$ matrix i.e. a vector. Now, we can start using algebra to transform the above expression:

$$Z^TAZ = \begin{bmatrix} Z' & \theta \end{bmatrix}^TA\begin{bmatrix} Z' & \theta \end{bmatrix} = \begin{bmatrix} Z'^T \\ \theta^T \end{bmatrix}A\begin{bmatrix} Z' & \theta \end{bmatrix} = \begin{bmatrix} Z'^TA \\ \theta^TA \end{bmatrix}\begin{bmatrix} Z' & \theta \end{bmatrix} = \begin{bmatrix} Z'^TAZ' & Z'^TA \theta\\ \theta^TAZ' & \theta^TA \theta \end{bmatrix}$$

What we've just done is expanded the matrix out into its four quadrants. Now recall that we're interested in the trace of the matrix. A nice property of traces of matrices written in four quadrants like the above is that the trace completely discards the upper right and lower left quadrants. So we have:

$$ tr \begin{bmatrix} Z'^TAZ' & Z'^TA \theta\\ \theta^TAZ' & \theta^TA \theta \end{bmatrix} = tr(Z'^TAZ') + tr(\theta^TA\theta) $$

Now, since $Z = \begin{bmatrix} Z' & \theta \end{bmatrix}$, we can split the matrix derivative of $Z$ for any function $f$ using block notation as follows:

$$\nabla_Zf(Z) = \begin{bmatrix} \nabla_{Z'}f(Z) & \nabla_\theta f(Z) \end{bmatrix}$$

Now, putting these last facts together, we have:

$$ \nabla_Z tr \begin{bmatrix} Z'^TAZ' & Z'^TA \theta\\ \theta^TAZ' & \theta^TA \theta \end{bmatrix} = \nabla_Z (tr(Z'^TAZ') + tr(\theta^TA\theta)) = \\ \begin{bmatrix} \nabla_{Z'}(tr(Z'^TAZ') + tr(\theta^TA\theta)) & \nabla_\theta (tr(Z'^TAZ') + tr(\theta^TA\theta)) \end{bmatrix} $$

An important fact, assumed so far but not yet made explicit, is that all the entries of our matrix $Z$ are independent of each other. In particular, this implies that $Z'$ and $\theta$ are independent of each other, so each term that doesn't involve the variable of differentiation has zero gradient and drops out:

$$ \begin{bmatrix} \nabla_{Z'}(tr(Z'^TAZ') + tr(\theta^TA\theta)) & \nabla_\theta (tr(Z'^TAZ') + tr(\theta^TA\theta)) \end{bmatrix} = \begin{bmatrix} \nabla_{Z'} tr(Z'^TAZ') & \nabla_\theta tr(\theta^TA\theta) \end{bmatrix} $$

Now, finally, we can apply our inductive hypothesis to both block components of our matrix, since both $\theta$ and $Z'$ are smaller than $Z$ (and $\theta$ was our base case anyway), to get:

$$ \begin{bmatrix} \nabla_{Z'} tr(Z'^TAZ') & \nabla_\theta tr(\theta^TA\theta) \end{bmatrix} = \begin{bmatrix} 2AZ' & 2A\theta \end{bmatrix} = 2A\begin{bmatrix} Z' & \theta \end{bmatrix} = 2AZ $$

And that completes the proof. Again, it's overkill for what you need for this exact case, but it's a more general identity that you may want to apply in other scenarios where $Z$ may not be a vector.
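For completeness, here is a small NumPy sketch (the sizes and the random symmetric $A$ are just illustrative) that checks $\nabla_Z \, tr(Z^TAZ) = 2AZ$ entry by entry with finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
S = rng.standard_normal((m, m))
A = (S + S.T) / 2                          # symmetric m x m
Z = rng.standard_normal((m, n))

f = lambda Z_: np.trace(Z_.T @ A @ Z_)     # scalar function of the matrix Z

# Central finite differences with respect to each entry Z[i, j]
eps = 1e-6
fd_grad = np.zeros_like(Z)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(Z)
        E[i, j] = 1.0
        fd_grad[i, j] = (f(Z + eps * E) - f(Z - eps * E)) / (2 * eps)

print(np.allclose(fd_grad, 2 * A @ Z, atol=1e-5))  # True
```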

Colm Bhandal