7

For a matrix $A$ the Singular Value Decomposition gives the closest rank-$K$ approximation $$A_K=\sum_{i=1}^K\sigma_i \vec{v}_i \vec{u}_i^T$$ so that $\|A-A_K\|_F$ is minimal.

I'd like to do the same but allow for an additional diagonal term; that is, for a given square matrix $A$ and a positive integer $K$, find a diagonal matrix $D$ and a low-rank approximation $$A_K=D+\sum_{i=1}^K\sigma_i \vec{v}_i \vec{u}_i^T$$ so that, as above, $\|A-A_K\|_F$ is minimal.
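For concreteness, the plain rank-$K$ baseline (without the diagonal term) can be computed directly from the SVD; a minimal numpy sketch, with illustrative names:

import numpy as np

def rank_k_approx(A, K):
    # Best rank-K approximation of A in Frobenius norm (Eckart-Young)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K, :]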

The problem originated in the context of correlation matrices, so answers which further assume $A$ is symmetric and positive semi-definite are also welcome.

Uri Cohen
  • 403
  • 1
    Any particular motivation for this problem? – Ben Grossmann Jul 18 '16 at 18:44
  • 2
    I study a problem where I can make strong predictions on the properties of some dataset if its correlation matrix has this structure. With correlation matrices the diagonal has an easy-to-interpret meaning as variance terms. – Uri Cohen Jul 18 '16 at 19:01

4 Answers

2

Because $D$ can be chosen after the low-rank approximation is known, we only need the low-rank part to fit the off-diagonal terms: $$[U, \sigma] = \arg\min \sum_{i\ne j}\Big(A_{ij}-\sum_{k=1}^K \sigma_k U_i^k U_j^k\Big)^2$$

This problem may be solvable efficiently by applying methods that go under the title of "Weighted Low-Rank Approximations". These generalize SVD (and other problems) by weighting the entries of the matrix to be reconstructed with a weight matrix $W$.

The approach in the link (and reference below) performs an iterative update (possibly reaching a local minimum rather than the global one). The resulting procedure is very simple, just iterative application of a single update rule (eq. 2 in the paper below): $$X_{t+1}=\mathrm{SVD}_k(W\circ A+(1-W)\circ X_t)$$

with $X_t$ the rank-$k$ approximation at iteration $t$ and $X_0$ initialized to either the zero matrix or $A$.

Thus, for the case in question, we may take $W=1-I$, causing the diagonal to be ignored and simplifying the above update even further: $$X_{t+1}=\mathrm{SVD}_k(A-I\circ A+I\circ X_t)$$

Srebro, N., & Jaakkola, T. (2003, August). Weighted low-rank approximations. In ICML (Vol. 3, pp. 720-727).
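For illustration, a minimal numpy sketch of this simplified update (the iteration count and the zero initialization are arbitrary choices):

import numpy as np

def low_rank_plus_diag(A, k, n_iters=100):
    # Iterate X_{t+1} = SVD_k(A - I∘A + I∘X_t): at each step, replace the
    # diagonal of A with the diagonal of the current rank-k approximation.
    X = np.zeros_like(A)
    for _ in range(n_iters):
        M = A - np.diag(np.diag(A)) + np.diag(np.diag(X))
        U, s, Vt = np.linalg.svd(M)
        X = (U[:, :k] * s[:k]) @ Vt[:k, :]
    D = np.diag(np.diag(A - X))  # the optimal diagonal term given X
    return D, X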

Thomas Ahle
  • 4,612
Uri Cohen
  • 403
  • Minimizing the low-rank off-diagonal residual (call this $R_{off}$), then $D$, gives a local minimum, but why global? Little experiments with rank $1$ ($2n$ parameters) show that $R_{off}$ is very spiky. But I'm no expert -- see questions/tagged/non-convex-optimization – denis Oct 02 '18 at 15:25
  • Yes, of course, this in general gives a local minimum. But I found it to be quite stable numerically with my real-world matrices, in the sense that the result is similar under different initial conditions. – Uri Cohen Oct 02 '18 at 16:37
  • I wonder if we could even include the "diagonal replacement" that this does inside the SVD, if we assume the SVD is computed using subspace iteration. – Thomas Ahle Apr 08 '22 at 06:21
1

Things are going to be tricky for this one. Rank and the Frobenius norm are unitarily invariant, but the property of being "diagonal" is not.

The best approach I can think of, off the top of my head, is as follows: we can define a matrix norm by $$ \|M\|_{F_K}^2 = \sum_{i=1}^K [\sigma_i(M)]^2 $$ Your question can then be re-framed as follows: for a matrix $A$, find the diagonal matrix $D$ such that $$ f(D) = \sum_{i=K+1}^n [\sigma_i(A - D)]^2 = \|A - D\|_F^2 - \|A - D\|_{F_K}^2 $$ is minimized. If there's any hope of getting a nice formula for this, it will be from applying some kind of calculus/Lagrange multiplier argument to this. Alternatively, this could presumably be made into some kind of quadratic or semidefinite program.
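For example, one could evaluate $f$ numerically and hand it to a generic optimizer over the $n$ diagonal entries (a rough sketch, not a claim about the best solver):

import numpy as np
from scipy.optimize import minimize

def f(d, A, K):
    # Sum of the squared singular values of A - diag(d) beyond the K-th
    s = np.linalg.svd(A - np.diag(d), compute_uv=False)
    return np.sum(s[K:] ** 2)

# Hypothetical usage: d_opt = minimize(f, x0=np.diag(A), args=(A, K)).x, then D = np.diag(d_opt)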

Ben Grossmann
  • 225,327
  • is it easier to minimize (iteratively) $\|A- \lambda I_c\|_{F_K}$ where $I_c$ is a non-full-rank identity matrix? – reuns Jul 18 '16 at 22:15
  • I would doubt it; I don't see any advantage to "non-full rank identity matrices". – Ben Grossmann Jul 18 '16 at 22:17
  • it means minimizing only on one parameter $\lambda$ – reuns Jul 18 '16 at 22:33
  • I mean, sure. But that's not enough for me to believe that doing your problem $n$ times has any advantage over doing my problem (which has $n$ parameters) once. – Ben Grossmann Jul 18 '16 at 22:36
  • I didn't mean to propose anything easier, just asking if it can be minimized iteratively – reuns Jul 18 '16 at 22:38
  • Oh. Well, you asked "is it easier". Yes, you could minimize this iteratively by restricting $D$ appropriately in $n$ different stages. – Ben Grossmann Jul 18 '16 at 22:41
1

The solution I ended up implementing is publicly available: https://github.com/sompolinsky-lab/dnn-object-manifolds/blob/master/library/optimal_low_rank_structure2.m

The idea behind it is to denote by $A\in\mathbb{R}^{N\times P}$ the data, so that $C=AA^T$, and then find an orthonormal set $V\in\mathbb{R}^{N\times K}$ such that the residual $R=A-V(V^TA)$ has approximately diagonal covariance $RR^T$; this is done by solving the following optimization problem:

$$V^*=\arg\min_{V\ :\ V^TV=I_K} \sum_{i,j} (RR^T)_{ij}/\sqrt{(RR^T)_{ii}(RR^T)_{jj}}$$

This can be done using existing tools for optimization under orthonormality constraints. It has no free parameters you need to fit or optimize over, and for my problems it gives very solid results, i.e. it finds very consistent solutions despite the fact that it converges to a local minimum, with no assurance that it is a good one relative to the global minimum. To achieve this, for every $k=1..MAX\_K$ it repeats the search for a minimum $N\_REPEATS$ times, assuming the rank is $k$ and using the solution for $k-1$ as the initial condition. Then the $k$ with the minimal value of the optimization target is used.
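As a rough illustration of the objective (not the linked implementation, which uses a dedicated solver for orthogonality constraints), one can enforce $V^TV=I_K$ by a QR re-parametrization and use a generic optimizer:

import numpy as np
from scipy.optimize import minimize

def off_diag_cost(x, A, K):
    N = A.shape[0]
    V, _ = np.linalg.qr(x.reshape(N, K))   # retraction: columns of V are orthonormal
    R = A - V @ (V.T @ A)                  # residual after removing the rank-K part
    C = R @ R.T
    d = np.sqrt(np.diag(C)) + 1e-12        # small constant avoids division by zero
    return np.sum(C / np.outer(d, d))      # sum of normalized (correlation-like) entries

# Hypothetical usage:
# res = minimize(off_diag_cost, np.random.randn(N * K), args=(A, K))
# V_star = np.linalg.qr(res.x.reshape(N, K))[0]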

Please refer to the paper if you use this method: https://www.nature.com/articles/s41467-020-14578-5

Uri Cohen
  • 403
  • How do you solve the $V^*$ optimization problem with standard tools? – Thomas Ahle Apr 14 '22 at 23:55
  • The reference is [Zaiwen Wen and Wotao Yin. A Feasible method for Optimization with Orthogonality Constraints, Optimization Online, 11/2010], see the GitHub link for both MATLAB implementation and a link to Python implementation. – Uri Cohen Apr 23 '22 at 20:46
  • is this the python code you refer to? https://github.com/schung039/neural_manifolds_replicaMFT/blob/master/mftma/manifold_analysis_correlation.py#L281 How does one get the diagonal + low-rank decomposition from the outputs? – Thomas Ahle Apr 25 '22 at 05:15
  • Yes, see the "Compute the optimal V" in line 354 using a "Stiefel" optimization. – Uri Cohen Apr 26 '22 at 09:03
1

I implemented a version of the Subspace Iteration algorithm for SVD based on Uri Cohen's answer.

import numpy as np

def ssi(M, k):
    """ Subspace iteration ignoring the diagonal """
    n, _ = M.shape
    U = np.random.randn(n, k)
    last_diag = np.diag(M)
    for _ in range(100):
        # Replace the diagonal of M with the diagonal of the current approximation
        M1 = M - np.diag(np.diag(M) - last_diag)
        U, R = np.linalg.qr(M1 @ U)
        # Diagonal entries of the approximation U diag(|R_cc|) U^T
        last_diag = np.einsum('c,rc,rc->r', np.diag(np.abs(R)), U, U)
    return np.sqrt(np.diag(R)) * U

This returns a low-rank factor $L$, such that $\|M-LL^T\|_F$ is small outside of the diagonal.
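From that output the full diagonal-plus-low-rank approximation can be assembled by reading the diagonal term off the residual, e.g. (a sketch):

L = ssi(M, k)
D = np.diag(np.diag(M - L @ L.T))   # the optimal diagonal term given the low-rank part
approx = D + L @ L.T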

For comparison I implemented Uri Cohen's answer:

def iterated_svd(M, k):
    last_diag = np.diag(M)
    for _ in range(100):
        # Replace the diagonal of M with the diagonal of the current rank-k approximation
        M1 = M - np.diag(np.diag(M) - last_diag)
        U, D, V = np.linalg.svd(M1)
        # Diagonal of the rank-k approximation (M is symmetric, so U is used twice)
        last_diag = np.einsum('c,rc,rc->r', D[:k], U[:, :k], U[:, :k])
    return np.sqrt(D)[None, :k] * U[:, :k]

As well as an SGD version using torch. I tested all of them on 1000 random symmetric matrices of dimension $5$, trying to get a diagonal + rank-2 approximation.
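The torch code isn't shown here; a minimal sketch of what such an SGD version might look like (hypothetical, plain Adam on the off-diagonal squared error):

import torch

def sgd_low_rank_plus_diag(M, k, n_steps=1000, lr=1e-2):
    M = torch.as_tensor(M, dtype=torch.float64)
    n = M.shape[0]
    L = torch.randn(n, k, dtype=torch.float64, requires_grad=True)
    opt = torch.optim.Adam([L], lr=lr)
    mask = 1.0 - torch.eye(n, dtype=torch.float64)   # ignore the diagonal entries
    for _ in range(n_steps):
        opt.zero_grad()
        loss = (((M - L @ L.T) ** 2) * mask).sum()
        loss.backward()
        opt.step()
    L = L.detach()
    D = torch.diag(torch.diag(M - L @ L.T))          # best diagonal given L
    return D, L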

The mean squared errors obtained from each algorithm were:

  • ssi 100its: 2.05
  • svd 1it: 7.07
  • svd 100its: 1.74
  • torch 100its: 4.18

In other words, all algorithms more or less agreed.

To me this suggests that "subspace iteration with replaced diagonal" works pretty well. And since it's much faster than the iterated SVD and torch, I'll be using that personally.

However, there is no proof for this, and I'm sure Ben Grossmann is right that it can fail for some matrices. So be warned!

I should also point out the strange need for using abs(R) in the subspace iteration instead of just R. This doesn't seem to be needed when computing the normal SVD, but experimentally it is crucial here.

Another interesting thing to note: even if your original matrix is symmetric, the approximations $M \approx D + LL^T$ are not necessarily PSD! (Since $LL^T$ itself is PSD, this can only happen when $D$ has negative diagonal entries.) I counted the number of times this happened in my 1000 experiments and got the following number of non-PSD matrices:

  • ssi 100it: 109
  • svd 1it: 0
  • svd 100it: 92
  • torch 100it: 331

While I don't even know a proof that SVD with 1 iteration can't have this problem, it suggests that the improved approximations may not always be worth the trouble if you want to later, say, interpret your approximation as a covariance matrix.

I tried a 5th version that runs ssi, but forces the diagonal matrix to be non-negative, which trivially enforces PSD-ness of the approximation. Concretely I changed the "diagonal replacement" line to

M1 = M - np.diag(np.maximum(np.diag(M) - last_diag, 0))

The mean squared error was 2.33, so it is worse than ssi (2.05) and iterated svd (1.74), but still a lot better than plain svd (7.06), and we get the extra guarantee that the diagonal is non-negative.

Thomas Ahle
  • 4,612