
The way gradient descent has been presented to me, at least following Levitin and Polyak, is that you take the gradient step $\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t)$ and then project onto your convex set $C$: $\theta_{t+1} = P_C(\theta_{t+1})$. Intuitively, I am wondering why the projection is necessary after every step: shouldn't you be able to just carry out plain gradient descent, and after a long enough time $t$, isn't a single projection at the end close enough to the optimal $\theta^*$? Or is there a counterexample of a convex set for which this fails?
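For concreteness, the scheme in question can be sketched as follows (a minimal sketch, assuming a Euclidean-ball constraint set, for which the projection has a closed form; the function and step-size choices below are just for illustration):

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto the closed ball of given center and radius."""
    d = theta - center
    norm = np.linalg.norm(d)
    if norm <= radius:
        return theta  # already inside C, projection is the identity
    return center + radius * d / norm

def projected_gradient_descent(grad, theta0, project, eta=0.1, steps=1000):
    """Iterate theta_{t+1} = P_C(theta_t - eta * grad(theta_t))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = project(theta - eta * grad(theta))
    return theta
```

The question is whether the `project` call inside the loop can be deferred to a single projection after the loop finishes.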

1 Answer


Projected gradient descent approximates the steepest descent *inside* $C$ in order to find the minimum there. Gradient descent without projection will in general diverge from the descent inside $C$. One of the following two scenarios is likely:

  1. The descent without projection converges to a minimum outside of $C$. Projecting that onto $C$ will usually not give you the minimum inside $C$. The behaviour of $f$ in $C$ might be very different from the global behaviour.
  2. More often, there might not even be a global minimum the descent converges to, so (a) how do you know when to stop, and (b) when you stop, why would that bear any connection to the minimum inside $C$?
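Scenario 2 already shows up in one dimension (a hypothetical illustration; the function and interval are my choice, not from the answer): $f(x) = x$ is convex but has no global minimum, so plain gradient descent runs off towards $-\infty$, while projecting onto $C = [0, 1]$ after each step converges to the constrained minimum at $0$.

```python
eta = 0.1
x_plain = x_proj = 0.5
for _ in range(100):
    x_plain -= eta * 1.0                              # grad f(x) = 1 everywhere
    x_proj = min(max(x_proj - eta * 1.0, 0.0), 1.0)   # gradient step, then project onto [0, 1]
# x_plain keeps decreasing without bound; x_proj is pinned at 0.0
```

There is no sensible stopping time for the unprojected iterate, and projecting it after the fact only recovers the correct answer here by accident of the geometry; the next example shows it failing outright.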

Example. Consider the function $f(x,y) = 0.001 x^2 + y^2$, and $C=B_1(10, 2)$, which is the closed unit ball centred at $(10, 2)$.

  • Performing gradient descent in $\mathbb R^2$ will lead you to the global minimum at $(0,0)$. Projecting that onto $C$ gives the point $\approx (9.02, 1.80)$.
  • The minimum inside $C$ (which should be found by the projected gradient descent) would rather be near $(10, 1)$.
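A quick numerical check of these two points (a sketch; the numbers follow directly from the definitions above):

```python
import numpy as np

f = lambda x, y: 0.001 * x**2 + y**2
center, radius = np.array([10.0, 2.0]), 1.0

# Unconstrained minimum of f, which plain gradient descent converges to:
unconstrained_min = np.array([0.0, 0.0])

# Project it onto C = B_1(10, 2):
d = unconstrained_min - center
proj = center + radius * d / np.linalg.norm(d)   # roughly (9.02, 1.80)

# Compare objective values with the constrained minimizer near (10, 1):
f_projected = f(*proj)          # roughly 3.34
f_constrained = f(10.0, 1.0)    # 1.1, markedly smaller
```

Projecting the unconstrained minimum lands on the side of the ball facing the origin, whereas the true constrained minimizer sits where $y$ is smallest, so the two points differ and the projected-at-the-end point has a substantially larger objective value.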
DominikS