Question about expectation of v_t and the true second moment g_t^2 in the Adam algorithm

Question

The paper is ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

In the proof, how do the authors get from (2) to (3)? I think there is a large gap between them: (2) involves $g_{t-1}$, $g_{t-2}$, etc., whereas (4) contains only $\mathbb{E}[g_t^2]$, which looks very different from (2).

Please try to formulate the question in a self-contained way rather than just referring to the paper. Doing so would lead you to introduce $\zeta$ and help you answer the question. – Kas, Dec 05 '22 at 09:07

score 1 · Answer 1 · answered Dec 05 '22 at 09:17

The answer lies in the definition of $\zeta$. The authors define $\zeta$ to be exactly the "missing term" that makes (2) equal to (3), that is,

$$ \zeta = \mathbb{E}\left[(1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2\right] - \mathbb{E}\left[g_t^2\right](1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}. $$

With this definition, the equality between (2) and (3) holds by construction, and (4) follows from (3) by the geometric sum $(1-\beta_2)\sum_{i=1}^t \beta_2^{t-i} = 1-\beta_2^t$. The authors then explain that one may expect $\zeta$ to be small in general. By linearity of expectation, $\zeta = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}\left(\mathbb{E}[g_i^2] - \mathbb{E}[g_t^2]\right)$. In particular, if $\mathbb{E}[g_i^2] = \mathbb{E}[g_t^2]$ for all $i\in\{1,\ldots,t\}$ (i.e., the second moment does not depend on $i$), then $\zeta=0$.
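For a quick numerical check of the definition of $\zeta$, here is a minimal Python sketch. It assumes the second moments $m_i = \mathbb{E}[g_i^2]$ are given directly as a list, so that $\mathbb{E}[v_t]$ can be computed by linearity of expectation without any sampling. The function names `ema_weights` and `zeta` and the numeric inputs are made up for illustration and do not come from the paper.

```python
def ema_weights(t, beta2):
    """Weights (1 - beta2) * beta2**(t - i) for i = 1..t, as in (2)."""
    return [(1 - beta2) * beta2 ** (t - i) for i in range(1, t + 1)]


def zeta(second_moments, beta2):
    """zeta from its definition, given m_i = E[g_i^2] for i = 1..t.

    Assumption: by linearity of expectation, E[v_t] = sum_i w_i * m_i,
    so the expectations are supplied as numbers instead of being sampled.
    """
    t = len(second_moments)
    weights = ema_weights(t, beta2)
    # Right-hand side of (2): expected value of the moving average v_t.
    expected_v_t = sum(w * m for w, m in zip(weights, second_moments))
    m_t = second_moments[-1]  # E[g_t^2]
    # Subtract the first term of (3); what remains is zeta.
    return expected_v_t - m_t * sum(weights)


# Stationary second moment: every m_i equals m_t, so zeta vanishes.
print(zeta([2.0] * 50, beta2=0.999))
# Drifting second moment: zeta is nonzero (here -0.5).
print(zeta([1.0, 2.0, 3.0], beta2=0.5))
```

In the second call the weights are 0.125, 0.25 and 0.5, so $\mathbb{E}[v_t] = 2.125$ while $\mathbb{E}[g_t^2]\,(1-\beta_2^t) = 3 \cdot 0.875 = 2.625$, which gives $\zeta = -0.5$. The faster the second moment drifts relative to the averaging window, the larger $|\zeta|$ becomes.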
