Well... sort of. That is a pretty complicated question. I'm not an expert on most of this, but here is my understanding. So let's begin. I assume gradient descent is straightforward.
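For reference, here is a minimal sketch of plain gradient descent in Python, since everything below builds on it (the quadratic test problem and the step size are just illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, iters=100):
    """Plain gradient descent: x^{k+1} = x^k - gamma * grad f(x^k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - gamma * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), gamma=0.05, iters=500)
```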
Nesterov acceleration: Any of these kinds of methods can be Nesterov accelerated. You can think of this as an "add-on" to these algorithms. As the condition number $\kappa$ of a convex function increases, these algorithms will generally converge much more slowly. Acceleration mitigates the effect of a large condition number to some extent: roughly speaking, the dependence of the convergence rate improves from $\kappa$ to $\sqrt{\kappa}$.
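As a rough sketch of what the "add-on" looks like in practice (this uses the standard $(k-1)/(k+2)$ momentum schedule for smooth convex problems; the schedule and step size are illustrative choices, not taken from any particular paper):

```python
import numpy as np

def nesterov_gd(grad_f, x0, gamma, iters=100):
    """Nesterov-accelerated gradient descent with the (k-1)/(k+2) momentum schedule."""
    x = x0.copy()
    x_prev = x0.copy()
    for k in range(1, iters + 1):
        # Extrapolate using the previous two iterates, then take a gradient step.
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
        x = y - gamma * grad_f(y)
    return x

# Same quadratic as before: f(x) = 0.5 * ||A x - b||^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = nesterov_gd(lambda x: A.T @ (A @ x - b), np.zeros(2), gamma=0.05, iters=200)
```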
Stochastic gradient descent: This is somewhat different from all the other algorithms you have listed. You apply this algorithm when you have a finite-sum structure:
$$f(x)=\frac{1}{n}\sum_{i=1}^nf_i(x)$$
where each $f_i$ is a convex function. Now, you can still apply all the other algorithms you listed to this objective. However, it turns out that when your convex objective can be written like this, certain algorithms can be much faster than algorithms that ignore the finite-sum structure, e.g. SAGA, SVRG, SAG, SDCA. These are known as variance-reduction algorithms. Stochastic gradient descent itself is widely used, but rather slow.
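Here is a minimal sketch of SGD on such a finite sum, with least-squares components $f_i(x)=\frac{1}{2}(a_i^\top x-b_i)^2$ (the decaying step size is one common choice among many):

```python
import numpy as np

def sgd(A, b, x0, gamma0=0.1, iters=2000, seed=0):
    """SGD for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.

    Each step uses the gradient of a single randomly chosen component:
        grad f_i(x) = a_i * (a_i^T x - b_i).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = x0.copy()
    for k in range(1, iters + 1):
        i = rng.integers(n)                 # pick one component at random
        g = A[i] * (A[i] @ x - b[i])        # stochastic gradient
        x = x - (gamma0 / np.sqrt(k)) * g   # decaying step size
    return x

# Example: recover x_true from noiseless linear measurements.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
x_true = np.arange(5.0)
b = A @ x_true
x_hat = sgd(A, b, np.zeros(5))
```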
Proximal gradient: You would use this if your convex function were not differentiable, or if it were differentiable but its gradient were not Lipschitz continuous.
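The workhorse here is the proximal operator $\operatorname{prox}_{\gamma g}(v)=\arg\min_x\big(g(x)+\tfrac{1}{2\gamma}\|x-v\|^2\big)$, which replaces the plain gradient step. For the $\ell_1$ norm it has a well-known closed form (soft-thresholding); a quick sketch:

```python
import numpy as np

def prox_l1(v, gamma):
    """Proximal operator of gamma * ||x||_1, i.e. elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)
```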
Coordinate methods: Here, instead of taking a step along the full gradient (e.g. $x^{k+1}=x^k-\gamma \nabla f(x^k)$), you take a step along a single component of the gradient (e.g. $x^{k+1}=x^k-\gamma \nabla_i f(x^k)\,e_i$, so only the $i$-th coordinate changes). Usually the index $i$ is randomly selected, but it can also be chosen cyclically. The main reason to do this is parallelization. Additionally, if the component-wise Lipschitz constants of the gradient are very different, you can converge much faster than standard gradient descent.
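A sketch of randomized coordinate descent on a quadratic, where each coordinate gets its own step size $1/L_i$ (for a quadratic, the coordinate-wise Lipschitz constant $L_i$ is just the $i$-th diagonal entry of the Hessian):

```python
import numpy as np

def coordinate_descent(A, b, x0, iters=2000, seed=0):
    """Randomized coordinate descent for f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.

    The i-th partial derivative is (A x - b)_i, and the coordinate-wise
    Lipschitz constant is L_i = A[i, i], so coordinate i uses step size 1 / A[i, i].
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = len(x)
    for _ in range(iters):
        i = rng.integers(n)
        grad_i = A[i] @ x - b[i]      # i-th component of the gradient
        x[i] -= grad_i / A[i, i]      # coordinate step with step size 1 / L_i
    return x
```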
Dual methods: Sometimes it is easier to solve what is known as the dual problem. This is a related convex optimization problem that, under mild conditions (strong duality), is equivalent to the original. All the algorithms above (except for finite-sum algorithms like SGD) can equally be applied to the dual problem.
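A standard textbook example of what "the dual problem" means, independent of any particular algorithm: for the equality-constrained problem
$$\min_x\ f(x)\quad\text{subject to}\quad Ax=b,$$
the Lagrangian is $L(x,y)=f(x)+y^\top(Ax-b)$, and the dual problem is
$$\max_y\ \inf_x L(x,y)\;=\;\max_y\ \big(-f^*(-A^\top y)-b^\top y\big),$$
where $f^*$ is the convex conjugate of $f$. Under strong duality the two optimal values coincide, and a primal solution can often be recovered from a dual one.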
Splitting schemes: Sometimes when you have an objective with more than one term, say $f(x)+g(x)$, it is easier to handle these functions individually than together, so there are algorithms that essentially "split" up the work. For proximal gradient, for example, we first take a gradient descent step on $f$, followed by a proximal step on $g$. You would do this, for instance, if $f$ were differentiable but $g$ were merely subdifferentiable.
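Here is a sketch of exactly this split for a lasso-type objective, with $f(x)=\frac{1}{2}\|Ax-b\|^2$ differentiable and $g(x)=\lambda\|x\|_1$ merely subdifferentiable (the step size $1/L$ uses the Lipschitz constant $L=\|A\|_2^2$ of $\nabla f$):

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t * ||x||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, x0, iters=500):
    """Proximal gradient (ISTA) for 0.5 * ||A x - b||^2 + lam * ||x||_1.

    Each iteration: a gradient step on the smooth part f,
    followed by a proximal step on the nonsmooth part g.
    """
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L, with L = ||A||_2^2
    x = x0.copy()
    for _ in range(iters):
        x = x - gamma * (A.T @ (A @ x - b))   # gradient step on f
        x = soft_threshold(x, gamma * lam)    # prox step on g
    return x
```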
BFGS, Newton, etc.: These are second-order methods. I know less about these. They are less popular because their per-iteration cost is extremely high in comparison to first-order methods. The larger the dimension of the problem (and large problems are far more common these days), the more costly these algorithms are. However, I hear they are useful in some contexts.
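For completeness, a bare-bones Newton sketch; the expensive part is forming the Hessian and doing a linear solve with it at every iteration, which is exactly why these methods get costly in high dimensions (quasi-Newton methods like BFGS sidestep this by building up an approximate inverse Hessian from gradient differences):

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, iters=20):
    """Newton's method: x^{k+1} = x^k - [hess f(x^k)]^{-1} grad f(x^k).

    Each step costs a linear solve in the problem dimension, in contrast
    to the cheap vector updates of the first-order methods above.
    """
    x = x0.copy()
    for _ in range(iters):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x
```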
Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization: So let's apply what we now know (I haven't read this paper, but I'm reasonably sure what this is). This is a dual method for solving empirical risk minimization problems (loss minimization). The dual is often used here because it decouples the component functions in the finite sum (hence "dual"). Technically the dual problem is a concave maximization problem (hence "ascent", not descent). It is "stochastic" because it is probably a coordinate method ("coordinate") that computes a coordinate-wise proximal step ("proximal") at each iteration, with the coordinate chosen at random. Acceleration has been applied to the standard algorithm (hence "accelerated").
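If the paper is in the SDCA line of work (again, I haven't read it, so take this as an educated guess), the primal-dual pair typically looks like the following: for the regularized loss minimization problem
$$\min_w\ \frac{1}{n}\sum_{i=1}^n\phi_i(a_i^\top w)+\frac{\lambda}{2}\|w\|^2,$$
the dual is
$$\max_\alpha\ \frac{1}{n}\sum_{i=1}^n-\phi_i^*(-\alpha_i)-\frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n\alpha_i a_i\Big\|^2,$$
where $\phi_i^*$ is the convex conjugate of $\phi_i$. Each dual variable $\alpha_i$ is attached to a single loss term, which is the "decoupling" mentioned above: updating one $\alpha_i$ touches only one component function.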
You might be thinking: "Is this really necessary? All these adjectives?" The short answer is "yes". Every single adjective is there for a good reason. This is a state-of-the-art algorithm that will converge extremely fast (most likely it converges at the fastest possible rate under some oracle model).