
I want to explain a disadvantage of Gradient Descent: the gradient itself gives no information about how far we are from the local/global minimum.

Say we have two functions with an intersection point at which both have the same gradient, but assume that this point is nearer to an optimum of one function than to an optimum of the other. After one gradient descent step, one function would then approach its minimum more closely than the other. An example would be some functions like here.

For that, I am looking for two rather simple 1) 3D functions $f: \mathbb{R}^2 \rightarrow \mathbb{R}$ and $g: \mathbb{R}^2 \rightarrow \mathbb{R}$ that have an 2) intersection point $f(a,b) = g(a,b)$ with 3) the same gradient $\triangledown f(a,b) = \triangledown g(a,b)$ at that point $(a,b)$, but 4) the point should be at different distances from the two functions' optima.

I couldn't think of a way to construct such functions. Can somebody give me a hint on how to approach this problem?

oezguensi

1 Answer


Let $z=(x,y)^T$, and consider the two quadratic functions defined as follows:

$$ f(z) = a + b^Tz + \frac{1}{2}(z-z_0)^TC_f(z-z_0) $$ $$ g(z) = a + b^Tz + \frac{1}{2}(z-z_0)^TC_g(z-z_0), $$

where $a$, $b$, $C_f$ and $C_g$ are parameters of respective dimensions $1 \times 1$, $2\times 1$, $2 \times 2$ and $2 \times 2$. The corresponding gradients are respectively

$$ \frac{\partial f}{\partial z}(z) = b + C_f(z-z_0) $$ $$ \frac{\partial g}{\partial z}(z) = b + C_g(z-z_0) $$

So we have as required, at $z=z_0$, the equalities: $$ f(z_0)=g(z_0)= a+b^Tz_0 $$ $$ \frac{\partial f}{\partial z}(z_0) = \frac{\partial g}{\partial z}(z_0) = b $$

Then, by a suitable choice of $C_f$ and $C_g$, we can make the minimum of $f$ closer to the starting point $z_0$ than the minimum of $g$. Assuming that $C_f$ and $C_g$ are positive definite (and hence full rank), the minima of $f$ and $g$ are respectively: $$z^*_f = z_0 - C_f^{-1}b, $$ $$z^*_g = z_0 - C_g^{-1}b. $$ Take for instance $C_f^{-1}$ to be diagonal with $1/2$ on the main diagonal and $C_g^{-1} = 100I_2$.
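
Below is a minimal numerical sketch (in Python with NumPy, assumed here; the concrete values $z_0 = 0$ and $b = (1,1)^T$ are my own choice, while $C_f^{-1}$ and $C_g^{-1}$ mirror the suggestion above). It checks the shared value and gradient at $z_0$, compares the distances from $z_0$ to the two minima, and takes one gradient descent step from $z_0$ with the same learning rate for both functions:

```python
import numpy as np

# Shared parameters (hypothetical values; z0 = 0 and b = (1, 1)^T for concreteness)
a = 1.0
b = np.array([1.0, 1.0])
z0 = np.array([0.0, 0.0])

# C_f^{-1} = (1/2) I_2  =>  C_f = 2 I_2;   C_g^{-1} = 100 I_2  =>  C_g = 0.01 I_2
Cf = 2.0 * np.eye(2)
Cg = 0.01 * np.eye(2)

def quad(z, C):
    d = z - z0
    return a + b @ z + 0.5 * d @ C @ d

def grad(z, C):
    return b + C @ (z - z0)

# Same value and same gradient at the intersection point z0
print(quad(z0, Cf), quad(z0, Cg))    # both: a + b^T z0 = 1.0
print(grad(z0, Cf), grad(z0, Cg))    # both: b = [1, 1]

# Minima z* = z0 - C^{-1} b and their distances from z0
zf = z0 - np.linalg.solve(Cf, b)     # [-0.5, -0.5],  distance ~ 0.71
zg = z0 - np.linalg.solve(Cg, b)     # [-100, -100],  distance ~ 141.42
print(np.linalg.norm(zf - z0), np.linalg.norm(zg - z0))

# One gradient descent step from z0 with the same learning rate for both
alpha = 0.1
z1 = z0 - alpha * grad(z0, Cf)       # identical step for f and g, since the gradients match
print(np.linalg.norm(z1 - zf), np.linalg.norm(z1 - zg))   # ~0.57 vs ~141.28
```

Since the gradients at $z_0$ coincide, the step itself is identical for both functions, yet it covers a much larger fraction of the remaining distance to the minimum for $f$ than for $g$, which is exactly the effect described in the question.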

Bertrand
  • Well, note that as the function is quadratic, the method of steepest descent will in both cases find the minimum in one iteration. This is not what you expected when you wrote "one function would approach the minimum better than the other". – Bertrand Dec 25 '18 at 22:11
  • Assume we use the formula $z := z - \alpha * \triangledown f$, why would it approach the minimum in one iteration? I mean maybe for the right alpha, but as it is a hyperparameter it is not always optimal. Am I missing something? – oezguensi Dec 26 '18 at 00:09
  • Both your question and your comment are relevant and I changed my former comment in order to try to answer your point. For quadratic functions in $z$, the method of steepest descent will find the minimum after several iterations (for suitable starting value). In this case, yes, "one function would approach the minimum better than the other". With a quadratic optimization method, however, the minimization algorithm finds the minimum in one iteration for both $f$ and $g$ and any quadratic functions in $z$. – Bertrand Dec 29 '18 at 17:59
  • Thank you for the clarification. – oezguensi Dec 29 '18 at 21:06