I'm studying proximal operators, but there is something that bothers me quite a lot and I can't find an answer to it.
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$\text{prox}_{th}(\hat x) = \arg \min_x \, h(x) + \frac{1}{2t} \|x - \hat x \|_2^2$$
This question explains the motivation: What Is the Motivation of Proximal Mapping / Proximal Operator?
"A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat." This strategy yields the following iteration:
$$x^{k+1} = \text{prox}_{th}(x^k - t \nabla g(x^k))$$
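In case it helps to see what I mean concretely, here is a small sketch of that iteration as I understand it (this is my own example, taking $g(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $h(x) = \lambda \|x\|_1$, so the prox is soft-thresholding; `A`, `b`, `lam`, `t` are made-up names):

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1 : shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, t, n_iter=500):
    """x^{k+1} = prox_{t h}(x^k - t * grad g(x^k)) with g = 0.5*||Ax-b||^2, h = lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                    # gradient step on the smooth part g
        x = soft_threshold(x - t * grad, t * lam)   # prox step on the nonsmooth part h
    return x

# made-up usage, step size 1/L with L the Lipschitz constant of grad g
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = rng.standard_normal(50)
x_hat = proximal_gradient(A, b, lam=0.1, t=1.0 / np.linalg.norm(A, 2) ** 2)
```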
I don't understand how this actually minimizes $h$: to me, evaluating the prox still means optimizing a non-differentiable function.
My only partial answer/intuition is that once we have the proximal-operator form, we can use the subgradient optimality condition on $h$ to derive a closed-form solution of the prox problem. But then a second question arises: why can't we use the subgradient optimality condition right away (much like in subgradient methods)?
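To make that partial intuition concrete, here is the kind of derivation I have in mind (my own example, with $h(x) = \lambda \|x\|_1$). The subgradient optimality condition for the prox problem is

$$0 \in \partial h(x) + \frac{1}{t}(x - \hat x),$$

which for the $\ell_1$ norm separates over coordinates and gives the closed-form soft-thresholding solution

$$\big[\text{prox}_{th}(\hat x)\big]_i = \operatorname{sign}(\hat x_i)\,\max\!\big(|\hat x_i| - t\lambda,\ 0\big).$$

So the prox subproblem is solved exactly, not iteratively; my question is why this is preferable to applying the subgradient condition to the original problem directly.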