OK, to continue from the discussion in the comments. I think the confusion is that the authors are using the language of matrix calculus (which is just compressed notation for taking derivatives with respect to the elements of matrices), combined with Lagrange multipliers, to derive PCA from what some people would call an "intuitive cost function." However, I think the authors of what you are reading have been pretty hand-wavy, and what they wrote does not actually make much sense as stated. Anyway...
So, there are really a couple of different questions that could be separated out here. A few of them are better handled on their own, so I'll link to other SO answers in those cases.
Optimization problem
It seems like this part is pretty clear to you. We've set up an optimization problem: find $P$ to maximize the trace of $C_Y = P^TCP$, the covariance of the projected data,
\[f(P) = \operatorname{tr}(P^T C P)\]
subject to the constraint that the columns of $P$ be orthonormal vectors, or in other words subject to\[P^TP=I.\]
Here $C=\frac{1}{m}X^TX$ is the empirical covariance of $X$ (usually after centering!).
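If it helps to make this concrete, here is a small numerical sketch (my own illustration, not from the text you're reading; it assumes numpy, rows of $X$ as observations, and made-up sizes `m`, `n`, `r`): build the centered covariance and evaluate the trace objective at a random orthonormal $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 500, 5, 2                        # observations, features, number of components

X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated toy data
Xc = X - X.mean(axis=0)                                  # center the columns
C = Xc.T @ Xc / m                                        # empirical covariance, C = (1/m) X^T X

# A random n x r matrix with orthonormal columns (P^T P = I), via QR
P, _ = np.linalg.qr(rng.normal(size=(n, r)))

objective = np.trace(P.T @ C @ P)          # f(P) = tr(P^T C P), the variance captured by P
print(objective)
```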
Lagrangian
As written, the Lagrangian $f(P)$ can't be right -- you can see this by noticing that $P^TP-I$ is a matrix, so the right-hand side isn't even a scalar, and a Lagrangian has to be a scalar. We can try to fix it, but I want to argue that this is actually hard to do -- if you look at this answer:
you'll see that it's not so simple to solve the problem, at least for $r>1$. I think that whoever wrote what you are working with was going for more of a qualitative understanding, and they seem to have ignored some of the complications for the sake of intuition, but this might be what was making things confusing.
In the $r=1$ case, it's not too hard. Our constraint just becomes $P^TP=1$, i.e. $P$ is really just a unit column vector. Then we get the Lagrangian
\[L(P,\lambda) = \operatorname{tr}(P^T C P) - \lambda (P^TP - 1).\]
This is not so hard to solve, and it gives the first principal component -- I'll show that in a second, but first I just want to note that extending this to more components is hard. The complications of doing that are addressed in the answer I linked to above, but to get a feel for it, think about what our constraints are: we need all of the unit-length constraints $P_i^T P_i=1$ for $i=1,\dots,r$ and all of the orthogonality constraints $P_i^TP_j=0$ for $i\neq j$, with one multiplier per constraint. So now we have many more dual variables than were present in what you were given.
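To see numerically what the $r>1$ claim amounts to, here is another sketch of my own (again assuming numpy, with a random symmetric positive semi-definite `C` standing in for the covariance): among random orthonormal $P$'s, none beats the basis of the top-$r$ eigenvectors, whose objective value is the sum of the $r$ largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 2

# Random symmetric PSD matrix standing in for the covariance
A = rng.normal(size=(n, n))
C = A @ A.T / n

# Top-r eigenvectors of C (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
P_eig = eigvecs[:, -r:]

best_random = -np.inf
for _ in range(2000):
    P, _ = np.linalg.qr(rng.normal(size=(n, r)))     # random orthonormal columns
    best_random = max(best_random, np.trace(P.T @ C @ P))

print("best random P :", best_random)
print("eigenvector P :", np.trace(P_eig.T @ C @ P_eig))
print("sum of top-r eigenvalues:", eigvals[-r:].sum())
```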
Anyway, back to $r=1$. To solve for $P$, take the derivative with respect to the vector $P$ and set it equal to $0$, using the vector analogues of the matrix calculus identities that you were given:
\[\frac{\partial L}{\partial P} = \frac{\partial \operatorname{tr}(P^T C P)}{\partial P} - \lambda \frac{\partial P^TP}{\partial P}.\]
Note that this is basically what you had above but with a sign change, since the Lagrangian should really be written the way I have it here, with the $-$ sign in front of $\lambda$. The vector partial derivatives here are just a different notation for gradients, so think of them that way if they are confusing. The identities you wrote down hold and can help us solve this:
The gradient $\frac{\partial\operatorname{tr}(P^TCP)}{\partial P}$, with $P$ a column vector (so the trace is just the scalar $P^TCP$) and $C$ symmetric, is $2CP$.
Similarly, the derivative of the dot product $P^TP=\sum_j P_j^2$ with respect to $P_i$ is $2P_i$ (the factor of 2 because $P_i$ shows up in both factors), so the gradient with respect to the whole vector is $2P$.
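If you'd rather not take those two identities on faith, a quick finite-difference check confirms them (my own sketch, assuming numpy; `num_grad` is just a helper I made up for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
C = A @ A.T                   # symmetric C, as a covariance matrix would be
P = rng.normal(size=n)        # a generic vector (the identities don't need unit length)
eps = 1e-6

def num_grad(f, P):
    """Central finite-difference gradient of a scalar function f at P."""
    g = np.zeros_like(P)
    for i in range(len(P)):
        e = np.zeros_like(P)
        e[i] = eps
        g[i] = (f(P + e) - f(P - e)) / (2 * eps)
    return g

# d/dP [P^T C P] = 2 C P  (for symmetric C)
print(np.allclose(num_grad(lambda v: v @ C @ v, P), 2 * C @ P, atol=1e-4))
# d/dP [P^T P] = 2 P
print(np.allclose(num_grad(lambda v: v @ v, P), 2 * P, atol=1e-4))
```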
Plugging in, we get
\[\frac{\partial L}{\partial P}= 2CP - 2\lambda P\]
Setting this equal to 0 to find the critical point, we get that $CP=\lambda P$, or in other words $P$ is an eigenvector of $C$ with eigenvalue $\lambda$.
Now we have to decide which eigenvalue/eigenvector pair to take, since the critical-point condition alone doesn't tell us. Plugging $CP=\lambda P$ back into the objective and using $P^TP=1$ gives $\operatorname{tr}(P^TCP)=\lambda P^TP=\lambda$, so to maximize the captured variance we take the largest eigenvalue, and the first principal component is the corresponding unit eigenvector.
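To close the loop on the $r=1$ case numerically (once more, just my own sketch with numpy): the top eigenvector of $C$ satisfies $CP=\lambda P$, plugging it into the objective returns exactly that largest eigenvalue, and no other unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.normal(size=(n, n))
C = A @ A.T / n                              # symmetric PSD stand-in for the covariance

eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
lam, P = eigvals[-1], eigvecs[:, -1]         # largest eigenvalue and its unit eigenvector

print(np.allclose(C @ P, lam * P))           # the stationarity condition C P = lambda P
print(np.isclose(P @ C @ P, lam))            # objective value at P is the eigenvalue itself

# A random unit vector captures no more variance than the top eigenvector
v = rng.normal(size=n)
v /= np.linalg.norm(v)
print(v @ C @ v <= lam + 1e-12)
```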
I hope this helped with some intuition, but as I said earlier, understanding the full case with $r>1$ takes more work.