
Can someone fully step through the derivation for ALS with missing entries? https://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf Why does equation (1) become (2)? Please help, I'm lost.

http://danielnee.com/2016/09/collaborative-filtering-using-alternating-least-squares/

Why do I sum over all $y_i$ when differentiating with respect to $x_u$? I will give 50 points.

$ \frac{d}{dx_u} = 2 \sum (r_{ui} - x_u^Ty_i)x_u $

What should I put under the sum? If I differentiate with respect to a row of $X$, do all the columns of $Y$ get affected?

Kong
  • tl;dr of my answer below: you don't sum over all $i$, you just sum over the ones that appear in some observed $r_{u,i}$ (see the subscript notation in formula (2)). – AnonymousCoward Jan 01 '18 at 12:07

1 Answer


The answer is described in the paragraph between equations (1) and (2). In the first for-loop of Algorithm 1, we consider $Y$ as fixed and solve for the $X$ that minimizes (1).

When $Y$ is fixed, the cost function (1) is a convex, quadratic function of $X$, so the minimizing $X$ has a closed-form expression, which you find by differentiating and setting the derivative to zero. (2) is this formula. Here is the derivation.

First, expand the squared errors in (1) and rewrite the loss, keeping each regularization term where it belongs (once per user and once per item, not once per rating):

$$L = \sum_{u}\left[\lambda\,\|x_u\|^2 + \sum_{r_{u,i}\in r_{u,*}}\left(x_u^Ty_iy_i^Tx_u - 2r_{u,i}\,x_u^Ty_i + r_{u,i}^2\right)\right] + \lambda\sum_i \|y_i\|^2$$

The last term involves only $Y$, which is fixed in this step, so it is a constant that does not affect the minimizing $X$.

Note that, for a fixed $u$, the quadratic terms in $x_u$ can be collected as $x_u^T\big(\lambda I + \sum_{i} y_iy_i^T\big)x_u$, where the sum is taken over the $i$'s paired with $u$ among the observed $r_{u,i}$'s (this is the meaning of the notation $r_{u,i} \in r_{u,*}$ in the subscript).

Differentiate (for details, cf. the computations here: https://math.stackexchange.com/a/659982/565): $$\frac{dL}{dx_u} = -\sum_{r_{u,i} \in r_{u,*}} 2r_{u,i}\,y_i + 2\Big(\lambda I + \sum_{r_{u,i} \in r_{u,*}} y_iy_i^T\Big)x_u$$ Set this to zero: $$0 = -\sum_{r_{u,i} \in r_{u,*}} 2r_{u,i}\,y_i + 2\Big(\lambda I + \sum_{r_{u,i} \in r_{u,*}} y_iy_i^T\Big)x_u$$ and solve for $x_u$: $$x_u = \Big(\lambda I + \sum_{r_{u,i} \in r_{u,*}} y_iy_i^T\Big)^{-1}\Big(\sum_{r_{u,i} \in r_{u,*}} r_{u,i}\,y_i\Big)$$ which is exactly formula (2).
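
To make the "sum over observed entries only" point concrete, here is a minimal NumPy sketch of the per-user update in formula (2). The function name `solve_user` and the representation of a user's observed ratings as a list of `(item_index, rating)` pairs are my own choices for illustration, not something from the lecture notes.

```python
import numpy as np

def solve_user(Y, observed, lam):
    """Closed-form update for one user's factor x_u (formula (2)).

    Y        : (n_items, k) array of current item factors; row i is y_i
    observed : list of (i, r_ui) pairs for the items this user actually rated
    lam      : regularization strength lambda
    """
    k = Y.shape[1]
    A = lam * np.eye(k)           # lambda * I
    b = np.zeros(k)
    for i, r_ui in observed:      # sum ONLY over the observed r_{u,i}
        y_i = Y[i]
        A += np.outer(y_i, y_i)   # accumulate y_i y_i^T
        b += r_ui * y_i           # accumulate r_{u,i} y_i
    return np.linalg.solve(A, b)  # x_u = A^{-1} b
```

Items that user $u$ never rated simply never enter `A` or `b`.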

In the second for-loop of Algorithm 1, now that $X$ has been optimized for fixed $Y$, we reverse the roles and optimize $Y$ for this new fixed $X$. Again, the cost function is convex and quadratic, so there is an analogous closed-form formula for each $y_i$. This is repeated, optimizing $X$ and $Y$ back and forth, until a suitable convergence criterion on $X$ and $Y$ is met.
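
For completeness, here is a hedged sketch of that outer alternation, reusing `solve_user` (and the NumPy import) from the block above; by symmetry the same routine updates the item factors. Storing the observed ratings as a dict `{(u, i): r_ui}` and using a fixed iteration count instead of a convergence test are illustrative simplifications, not part of Algorithm 1.

```python
def als(R, n_users, n_items, k=10, lam=0.1, n_iters=20, seed=0):
    """R: dict {(u, i): r_ui} containing only the observed ratings."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_users, k))
    Y = rng.normal(size=(n_items, k))

    # Group the observed ratings by user and by item once, up front.
    by_user = {u: [] for u in range(n_users)}
    by_item = {i: [] for i in range(n_items)}
    for (u, i), r in R.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    for _ in range(n_iters):
        for u in range(n_users):      # Y fixed: solve each x_u in closed form
            X[u] = solve_user(Y, by_user[u], lam)
        for i in range(n_items):      # X fixed: solve each y_i in closed form
            Y[i] = solve_user(X, by_item[i], lam)
    return X, Y
```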

For reference, the computation I think you are confused by might be this one:

Lemma: for $x\in \mathbb{R}^n$ and $A \in M_{n\times n}(\mathbb{R})$ symmetric (e.g. $\lambda I + yy^T$), $$\frac{d}{dx}\, x^TAx = 2Ax.$$

Proof: https://math.stackexchange.com/a/312271/565
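
If it helps, the lemma is easy to sanity-check numerically with a central finite difference (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                    # make A symmetric
x = rng.normal(size=n)

analytic = 2 * A @ x                 # the lemma: d/dx (x^T A x) = 2 A x
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```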

  • Sorry, I am confused about what the star means. $$x_u = \Big(\lambda I + \sum_{r_{u,i} \in r_{u,*}} y_iy_i^T\Big)^{-1}\Big(\sum_{r_{u,i} \in r_{u,*}} r_{u,i}\,y_i\Big)$$ Does it mean that, for the current row $u$, I skip all column entries $i$ that are missing? So I compute $\sum_{r_{u,i} \in r_{u,*}} y_iy_i^T$ only over the non-missing entries? – Kong Jan 01 '18 at 13:11
  • Yes. The sigma notation means: for a fixed $u$, sum over all $i$ such that $r_{u,i}$ is observed. – AnonymousCoward Jan 01 '18 at 16:15