
Consider the everywhere twice differentiable function $f:\mathbb R^n\to \mathbb R$, the closed and convex set $\mathcal S$, and the convex optimization problem

$$ \min_{x\in \mathcal S} \; f(x). $$

Is there an easy / intuitive way of proving both statements?

  1. $x = x^*$ is a local minimizer if $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$. Specifically, the condition $\nabla^2 f(x^*) \succeq 0$ is not sufficient: as a counterexample, consider $f(x) = x^3$, where at $x = 0$ we have $\nabla^2 f(0) = 0$ and $\nabla f(0) = 0$, but $0$ is not a local minimum (a quick numerical check of this counterexample appears after the list).

  2. $x = x^*$ is a global minimizer if $\nabla f(x^*) = 0$ and $\nabla^2 f(x) \succeq 0$ for all $x\in \mathcal S$.
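
Here is a minimal numerical sketch of the counterexample in statement 1 (Python; the step sizes are arbitrary choices): at $x=0$ both the first and second derivatives of $f(x)=x^3$ vanish, yet every neighborhood of $0$ contains points with strictly smaller function value.

```python
f = lambda x: x**3

# First and second derivatives of f(x) = x^3 at x = 0: both vanish exactly,
# so the (1x1) "Hessian" is PSD but not PD.
print(3 * 0.0**2, 6 * 0.0)   # f'(0) = 0.0, f''(0) = 0.0

# Yet 0 is not a local minimizer: points just to the left are strictly smaller.
for h in [1e-1, 1e-3, 1e-6]:
    print(h, f(-h) < f(0.0))   # prints True for every h > 0
```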

The second statement in particular is quite well known in the convex optimization literature. However, I wonder whether there is a nice proof, to reassure ourselves that there are no corner cases (like the one found in case 1).

Y. S.
    I'm a bit confused. Is $f$ convex? – BigbearZzz Dec 23 '18 at 03:14
  • In the second case, yes (but not strictly convex); in the first, no. But the only assumption I'm stating is that $f$ is twice differentiable over $x\in \mathcal S$; the rest should follow from the stated assumptions. – Y. S. Dec 23 '18 at 03:17
  • What is $S$? Is $S$ convex, closed, bounded, etc., etc.? – copper.hat Dec 23 '18 at 03:54
  • Let's assume $S$ is closed and convex, not necessarily bounded. (question amended) – Y. S. Dec 23 '18 at 03:55
  • So I think the answer might be as simple as: all stationary points are either local minima, local maxima, or saddle points. If the function has a PSD Hessian everywhere, then in the interior of $\mathcal S$ the stationary points must be local minima. Clearly saddles can occur at the boundary, but the "descending" part must happen outside of $\mathcal S$. Anyway, the question is probably unnecessarily pedantic; I mostly just wanted to clarify my understanding here to make sure there weren't any strange corner cases in case 2. – Y. S. Dec 23 '18 at 04:01
  • @Y.S. I'm not really sure what kind of intuition you're looking for but here's my take: $\nabla^2 f(x^*)\ge 0$ doesn't tell us much; it's the fact that we assume $\nabla^2 f\ge 0$ throughout the domain in 2. that allows us to deduce anything at all. That's why merely assuming that $\nabla^2 f(x^*)\ge 0$ in 1. doesn't tell you anything. It's the stronger assumption $\nabla^2 f(x^*)> 0$ (and continuity) that happens to imply the weak form of "$\nabla^2 f\ge 0$ throughout the domain". Hence it's not like "$\nabla^2 f(x^*)= 0$" is a corner case but more that $\nabla^2 f(x^*)> 0$ is nice. – BigbearZzz Dec 23 '18 at 04:06
  • Ok I got it. Basically we just need to prove that if there exist $x^*$ and $\hat x$ in the interior of $\mathcal S$ where $f(\hat x) < f(x^*)$ and also $\nabla f(x^*) = 0$, then by running the mean value theorem twice, this suggests there is a negative second directional derivative at some point between $x^*$ and $\hat x$. – Y. S. Dec 23 '18 at 04:06
  • Specifically, there exists some point between $\hat x$ and $x^*$ (let's call it $y$) where the directional derivative is negative. But all directional derivatives at $x^*$ are 0. So there furthermore exists a point between $x^*$ and $y$ where the SECOND directional derivative is ALSO negative. Therefore the Hessian at that point must be NOT PSD (a small numerical sketch of this argument appears after the comments). – Y. S. Dec 23 '18 at 04:08
  • @Y.S. By assuming continuity of $\nabla^2 f$ (which is a reasonable assumption), you can even get an open neighborhood on which the Hessian fails to be PSD. I hope my answers so far help you somehow :) – BigbearZzz Dec 23 '18 at 04:14
  • @BigbearZzz Thanks for your ongoing discussion. I just want to clarify, though: the issue I am trying to resolve here is not when the Hessian is PD. I agree with you that if you have continuity and all that jazz, then if you construct the argument carefully enough, you can show local optimality. However, we know from the $f(x) = x^3$ example that it is possible to have a PSD (but not PD) Hessian and not be locally optimal, not in any neighborhood no matter how small. – Y. S. Dec 23 '18 at 04:19
  • The question I was asking is: when you force global convexity (and thus have no saddles), can you provably guarantee that all gradient-0, Hessian-PSD points MUST be local minima? (Ignoring globality for the moment.) I still maintain that without using that mean value theorem argument, the statement is simply "memorized folklore" and not mathematically proven. However, thanks for your ongoing discussion; it has helped me clarify what exactly it was I was asking. – Y. S. Dec 23 '18 at 04:19
  • I'm not sure I follow your thought. If we force global convexity then the Hessian doesn't really matter anymore. For a globally convex function, local minima and global minima coincide, hence all we need is $\nabla f(x^*)=0$. – BigbearZzz Dec 23 '18 at 04:28
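
As a minimal numerical sketch of the double mean-value-theorem argument sketched in the comments above (Python/NumPy; the saddle function $f(x,y)=x^2-y^2$ and the particular points are illustrative choices): if $\nabla f(x^*)=0$ but $f(\hat x) < f(x^*)$, then somewhere on the segment between them the second directional derivative along $\hat x - x^*$ is negative, so the Hessian cannot be PSD on the whole segment.

```python
import numpy as np

# Illustrative (non-convex) example: f(x, y) = x^2 - y^2 has a saddle at the origin.
f = lambda p: p[0]**2 - p[1]**2
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian happens to be constant here

x_star = np.array([0.0, 0.0])   # stationary point: the gradient vanishes here
x_hat  = np.array([0.0, 1.0])   # f(x_hat) = -1 < 0 = f(x_star)
d = (x_hat - x_star) / np.linalg.norm(x_hat - x_star)  # unit direction

# Scan the segment from x_star to x_hat: the second directional derivative
# d^T H d is negative, so the Hessian is not PSD along the segment.
for t in np.linspace(0.0, 1.0, 5):
    y = x_star + t * (x_hat - x_star)
    print(t, d @ hess(y) @ d)   # -2.0 at every sample point in this example
```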

1 Answer


Yes to both of the questions (assuming that $\nabla^2 f$ is continuous for the first question).

The result follows from the multivariable Taylor theorem with Lagrange remainder: $$ f(x+v) = f(x) + \nabla f(x)\cdot v + \tfrac{1}{2}(\nabla^2f(x+\theta v): v\otimes v ) $$ for some $\theta\in(0,1)$, where $A : v\otimes v$ denotes $v^\top A v$. Letting $x=x^*$, so that $\nabla f(x^*)=0$, this reduces to $$ f(x^*+v) - f(x^*) = \tfrac{1}{2}(\nabla^2f(x^*+\theta v): v\otimes v ). $$ Since $\mathcal S$ is convex, the point $x^*+\theta v$ lies in $\mathcal S$ whenever $x^*+v\in\mathcal S$, so under the assumption of (2.) the right-hand side is nonnegative, which proves (2.).
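
As a minimal numerical sketch of this identity (Python; the 1-D convex example $f(x)=x^4$ with $x^*=0$ and the closed-form Lagrange point $\theta = 1/\sqrt 6$ are illustrative choices): the left-hand side $f(x^*+v)-f(x^*)$ agrees with $\tfrac12 f''(x^*+\theta v)\,v^2$, and it is nonnegative because $f''\ge 0$ everywhere.

```python
import math

# Illustrative convex 1-D example: f(x) = x^4, x* = 0, f'(0) = 0, f''(x) = 12 x^2 >= 0.
f  = lambda x: x**4
d2 = lambda x: 12 * x**2

x_star = 0.0
theta = 1.0 / math.sqrt(6.0)   # for this f, the Lagrange point works out to be the same for every v

for v in [0.5, -1.0, 3.0]:
    lhs = f(x_star + v) - f(x_star)            # = v^4
    rhs = 0.5 * d2(x_star + theta * v) * v**2  # = 6 * theta^2 * v^4 = v^4
    print(v, lhs, rhs, lhs >= 0)               # lhs and rhs agree (up to rounding), both nonnegative
```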

For (1.), the assumption that $\nabla^2 f(x^*) \succ 0$, together with continuity of $\nabla^2 f$ at $x^*$, means that $\nabla^2 f(x^*+\theta v) \succ 0$ for all sufficiently small $v$ (see this question on the openness of the set of positive definite matrices); thus the formula above shows that $f(x^*+v) - f(x^*) > 0$ for all sufficiently small $v \neq 0$, i.e. $x^*$ is a strict local minimizer.
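
A minimal numerical sketch of this continuity/openness argument (Python/NumPy; the function $f(x,y)=x^2+y^2+x^3$, the sampling radius, and the sample count are illustrative choices): the Hessian at the origin is $2I \succ 0$, and although it fails to be PSD far away (e.g. for $x<-1/3$), its smallest eigenvalue stays positive on a small ball around the origin.

```python
import numpy as np

# Illustrative example: f(x, y) = x^2 + y^2 + x^3.
# At x* = (0, 0): gradient is 0 and the Hessian is 2*I (positive definite),
# but the Hessian is not PSD everywhere (take x < -1/3).
hess = lambda p: np.array([[2.0 + 6.0 * p[0], 0.0], [0.0, 2.0]])

rng = np.random.default_rng(0)
radius = 0.1                                          # size of the neighborhood (arbitrary choice)
samples = rng.uniform(-radius, radius, size=(1000, 2))

# Continuity of the Hessian + openness of the PD cone: the smallest eigenvalue
# stays positive on a small enough ball around x*.
min_eig = min(np.linalg.eigvalsh(hess(p)).min() for p in samples)
print(min_eig > 0)   # True for this radius
```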

BigbearZzz
  • Taylor's theorem only gives the expansion for some $\theta\in (0,1)$, not for all $\theta\in (0,1)$, which is not sufficient to say that $x^*$ is a minimizer (local or global). To be more specific, the possible corner case is $\nabla^2 f(x^*) \succeq 0$ but not $\nabla^2 f(x^*) \succ 0$, which isn't really resolved by this (since $f(x^*+v) \leq f(x^*)$ can happen without really requiring global convexity). – Y. S. Dec 23 '18 at 03:36
  • What do you mean? For each $y\in\Bbb R^n$ we can let $v=y-x^*$ and for each such $v$ there's a corresponding $\theta = \theta(v)$. – BigbearZzz Dec 23 '18 at 03:39
  • So ok, restricting to case 1: for local optimality we need to say that there exists some $\epsilon$ such that for ALL $y$ with $|y-x|\leq \epsilon$, $f(y) \geq f(x)$. This only says that there exists SOME $y$ in the $\epsilon$-ball around $x$. – Y. S. Dec 23 '18 at 03:41
  • That's why I said we require continuity of $\nabla^2 f$ at $x^*$. The link I gave is about the openness of the set of positive definite matrices. Together they imply that $\nabla^2 f(y) > 0$ in a small neighborhood of $x^*$. – BigbearZzz Dec 23 '18 at 03:43
  • Sure, ok. I am probably being unnecessarily pedantic about case 1; with a proof for open set PD-ness, I agree it works. But what about case 2? I think you can't use the same argument here. – Y. S. Dec 23 '18 at 03:45
  • For case 2 it's even simpler, since you assumed $\nabla^2 f \ge 0$ on the entire $\Bbb R^n$. I don't see the problem you're talking about. – BigbearZzz Dec 23 '18 at 03:46
  • I mean, I know it is obviously true. I guess I am just hoping for some math intuition, since previously I thought $\nabla^2 f(x^*) \succeq 0$ was sufficient for local optimality, then found a counterexample. It's more like: what is a nice way of proving this, so we don't have to be paranoid about possible counterexamples? – Y. S. Dec 23 '18 at 03:47
  • I edited the question to clarify that point. – Y. S. Dec 23 '18 at 03:50
  • In the first case, $\nabla^2 f(x^*) > 0$ implies that $\nabla^2 f(y) > 0$ in a small neighborhood by continuity. However, $\nabla^2 f(x^*) \ge 0$ doesn't imply anything, and we don't get $\nabla^2 f(y) \ge 0$ in any neighborhood for free. That's why we need to impose the stronger assumption that $\nabla^2 f(y) \ge 0$ ourselves to ensure the result. – BigbearZzz Dec 23 '18 at 03:51