
I'm trying to get an intuitive grip on SVDs in order to explore LSI as part of NLP, which I understand relies on the SVD. As I understand it, in an SVD

$A = S\Sigma U^T$

I understand that $S$ is a matrix of eigenvectors of $A^TA$ and that $\Sigma$ is a diagonal matrix holding the square roots of its eigenvalues.

  1. I don't understand how $U$ is the matrix of eigenvectors of $AA^T$

In LSI, we decompose this SVD matrix to get a shorter vector for each term and document to find out the relative distance between those vectors and the centroid.

  2. I don't understand how doing an SVD converts words to their own vectors, by which we can estimate the distance between documents

I'm very new to NLP and I'm finding it hard to get an intuitive sense of the math, so I'd sincerely appreciate any help. Are there any books I can refer to? Thanks!

sprajagopal

2 Answers


I don't understand how $U$ is the matrix of eigenvectors of $AA^T$

Correction: $U$ is the matrix of eigenvectors of $AA^T$. (Note that below I use the common convention $X = USV^T$, with $U$ as the left factor, rather than the question's $A = S\Sigma U^T$.) If you want to understand this in general, read on; if your interest is only incidental to LSA, skip to the next section.

In SVD, we decompose a matrix $X$ into three matrices, each with useful properties.

$X = USV^T$. Here $U,V$ are unitary matrices: $U^T = U^{-1}$ and same for $V$. $S$ is a diagonal matrix.

The only property we need is that $U,V$ are unitary matrices.

I'm assuming real matrices everywhere.

Let's start with $XX^T$:

(please note the dimensions at each step as indicated)

$XX^T = USV^T (USV^T)^T$

$XX^T = USV^T_{n \times n} V_{n \times n}S^TU^T$

Here we use $V^TV = I$:

$XX^T = USI_{n \times n} S^T_{n \times m}U^T$

$I_{n \times n} S^T_{n \times m} = S^T$. Notice that $S$ is a diagonal matrix, so $SS^T$ is also diagonal:

$XX^T = U SS^T U^T$

$XX^T = U SS^T U^{-1}$

This has exactly the form of an eigendecomposition: the columns of $U$ are eigenvectors of $XX^T$, and the diagonal entries of $SS^T$ (the squared singular values) are the corresponding eigenvalues.

You can try a similar chain of proof for $X^TX$ to see that $V$ is a set of eigenvectors of $X^TX$.
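
If a numerical sanity check helps, here is a small NumPy sketch (my own illustration, not part of the original post) confirming that the $U$ returned by an SVD routine matches, up to sign, the eigenvectors of $XX^T$, and that the squared singular values match its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # a small real m x n matrix

U, s, Vt = np.linalg.svd(X)                # full SVD of X
eigvals, eigvecs = np.linalg.eigh(X @ X.T) # eigendecomposition of X X^T

# eigh returns eigenvalues in ascending order; flip to match the SVD's descending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Squared singular values equal the non-zero eigenvalues of X X^T
print(np.allclose(s**2, eigvals[:len(s)]))

# Each column of U is an eigenvector of X X^T, up to a sign flip
for i in range(len(s)):
    match = np.allclose(U[:, i], eigvecs[:, i]) or np.allclose(U[:, i], -eigvecs[:, i])
    print(match)
```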

Distance between documents

In LSI, we decompose this SVD matrix to get a shorter vector for each term and document to find out the relative distance between those vectors and the centroid

I think you mean "approximate" instead of "decompose."

I don't understand how doing an SVD converts words to their own vectors, by which we can estimate the distance between documents

SVD does not convert the words into vectors. The second part of your sentence is correct though: we can estimate the distance between documents by using the SVD to build a low-rank approximation.

This is my understanding: say you have a set of $n$ documents. To represent them mathematically, you strip out the common words such as the, a, here, there. In other words, these are words that you believe can appear in all these documents and have no significance.

After this, you count the remaining words in the document and create a column vector, where each word has a score (the number of times it occurs in this document). So far you have represented $1$ document as a column vector.

You do this for all $n$ documents, then stack the column vectors side by side (making sure each word gets its own row) to form a matrix. Call this matrix $C$.
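
As a concrete toy illustration (the documents and the stop-word list below are made up for the example), building $C$ could look like this:

```python
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "a dog and a cat played here"]
stop_words = {"the", "a", "on", "and", "here", "there"}   # words assumed to carry no significance

# Count the remaining words in each document -> one column vector per document
counts = [Counter(w for w in d.split() if w not in stop_words) for d in docs]

# One row per word, one column per document
vocab = sorted({w for c in counts for w in c})
C = [[c[w] for c in counts] for w in vocab]

for word, row in zip(vocab, C):
    print(f"{word:8s} {row}")
```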

At this point what you have is a massive matrix $C$ which is humanly (and computationally) difficult to handle. This is where SVD is useful. It can decompose this matrix into three matrices, each of which has a useful property.

$C = USV^T$

Here $S$ is a diagonal matrix. The diagonal elements will have different values, but all of them are non-negative.

Now you can think of approximating. Say you notice that most diagonal elements of $S$ are close to zero. You set them to zero. With this thresholding done, you can reconstruct a matrix $C'$ by multiplying the three matrices again.

Mathematically, this $C'$ is an approximation of $C$.

More importantly, after we threshold the diagonal elements of $S$, we'll have several columns of $S$ that are entirely zero. This means the corresponding rows in $V^T$ are actually useless because they do not contribute to the approximated $C'$.

Let's drop those irrelevant rows from $V^T$ and call this new matrix $V_{truncated}^T$.

$V_{truncated}^T$ contains the same number of columns as $C$ but very few rows. This means we started with $n$ documents described by maybe a million rows (one per word) in $C$, but now each document is described by only a handful of rows in $V^T_{truncated}$.

In fact we could threshold the diagonal elements harshly, leaving only $2$ rows in our $V^T_{truncated}$.

The product $S_{truncated}V^T_{truncated}$ (where $S_{truncated}$ keeps only the corresponding rows and columns of $S$) is a low-dimensional representation of our original documents. At this point we could easily draw the low-dimensional documents as 2D points on a piece of paper. Points that are close together represent similar documents, or as you put it, the "distance between documents."
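
Here is a rough NumPy sketch of that last step (the count matrix below is invented just to make the example runnable): keep the two largest singular values and treat the columns of $S_{truncated}V^T_{truncated}$ as 2D coordinates for the documents.

```python
import numpy as np

# C: words x documents count matrix (toy numbers)
C = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 2],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                           # keep only the two largest singular values
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T     # one 2D point per document (per column of C)

# Pairwise Euclidean distances between the low-dimensional documents
dists = np.linalg.norm(doc_coords[:, None, :] - doc_coords[None, :, :], axis=-1)
print(doc_coords)
print(np.round(dists, 3))
```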

Some more material

There are some intuitive answers here. After that, this and this explain what approximating the SVD achieves in this application, i.e. LSI.

sprajagopal

The term-document matrix $A$ is an $m \times n$ matrix where there are $m$ terms and $n$ documents. Each entry shows the count of that word in that document. The SVD is used to decompose the matrix into the product of several matrices. In the process of finding the SVD you're using Gram-Schmidt to form the left and right singular vectors, which are the eigenvectors of $AA^T$ and $A^TA$ respectively. You can check this as

$$ AA^T = (U \Sigma V^T) (U \Sigma V^T)^T $$ $$ = U \Sigma V^T V \Sigma^T U^T $$ then since $V^TV = VV^T = I$ we get $$ = U \Sigma\Sigma^T U^T $$ and $\Sigma \Sigma^{T}$ is a diagonal matrix of the squared singular values, so write $\Sigma \Sigma^{T} = \Lambda$

$$ = U \Lambda U^T $$

You can do the same thing for $A^TA$ to see it equals $V \Lambda V^T$. The point is that Gram-Schmidt takes the columns (which are incredibly sparse here) and forms orthogonal vectors whose entries describe the weights of the terms in each new factor. By doing this you can truncate it; this is called the truncated SVD. If $A$ is $m \times n$ then we have

$$ A_{m \times n} = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^{T}$$

as the full SVD. Now if we take a threshold value $\epsilon$ and set all the singular values $\sigma < \epsilon$ equal to zero, we get a new matrix

$$ A_{m \times n} \approx A_{k} = U_{m \times k} \Sigma_{k \times k} V_{k \times n}^{T}$$

where $k$ is the number of singular values greater than $\epsilon$. By doing this you can then use those vectors to compute the similarity between documents.
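
As a sketch of how that might look in code (the matrix, the threshold $\epsilon$, and the choice of cosine similarity are my own illustration, not prescribed by LSI): the documents are represented by the columns of $\Sigma_{k \times k} V_{k \times n}^{T}$ and compared in that reduced space.

```python
import numpy as np

# A: term-document count matrix (toy numbers)
A = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 2],
              [0, 1, 0, 3],
              [1, 0, 2, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

eps = 1.0                              # threshold on the singular values
k = int(np.sum(s > eps))               # number of singular values kept

# k-dimensional representation of each document: the columns of Sigma_k V_k^T
docs_k = np.diag(s[:k]) @ Vt[:k, :]

# Cosine similarity between every pair of documents in the reduced space
norms = np.linalg.norm(docs_k, axis=0)
similarity = (docs_k.T @ docs_k) / np.outer(norms, norms)
print(k)
print(np.round(similarity, 2))
```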


Another way to view what the SVD is doing is to rewrite it as

$$ A = \sum_{i=1}^{r} \sigma_{i}u_{i}v_{i}^{T}$$

This says the SVD decomposes a matrix (here the term-document matrix) into the sum of $r$ rank-one matrices $u_{i}v_{i}^T$ (where $r$ is the rank of $A$), weighted or scaled by the non-negative constants $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{r} \geq 0$ called the singular values. These are the square roots of the eigenvalues of the matrices $AA^T$ and $A^TA$.
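
A quick numerical check of this rank-one expansion (just a sketch using a random matrix, nothing specific to LSI):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = len(s)                                       # a generic 4 x 3 matrix has rank 3

# Rebuild A as a sum of r rank-one matrices sigma_i * u_i * v_i^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print(np.allclose(A, A_rebuilt))                 # True
```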


The actual distance between the original matrix and the low-rank approximation is given by the Eckart-Young-Mirsky theorem, which says that, using the spectral ($2$-) norm, we have

$$ \|A - A_{k} \|_{2} = \bigg\| \sum_{i=k+1}^{r} \sigma_{i}u_{i}v_{i}^{T} \bigg\|_{2} = \sigma_{k+1}$$

which means that for a rank-$k$ approximation, the distance between $A$ and $A_{k}$ in the $2$-norm is $\sigma_{k+1}$, the $(k+1)$-th singular value.
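
This too is easy to verify numerically (again just a sketch with a random matrix): the spectral-norm error of the rank-$k$ truncation equals the $(k+1)$-th singular value.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation of A

# Eckart-Young-Mirsky: the 2-norm error equals sigma_{k+1} (s[k] with 0-based indexing)
print(np.linalg.norm(A - A_k, 2), s[k])
```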

Ryan Howe