I am trying to better understand the meaning of "noise" in the context of function optimization - specifically, why "noisy" functions are more difficult to optimize than "non-noisy" functions.
Up until now, I have always thought of "noise" from a signal-processing standpoint: for example, how to remove or filter out the noise component from some signal.
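To make concrete what I mean by the signal-processing view, here is a toy sketch I put together (the signal, the noise level, and the moving-average filter are all made up purely for illustration):

```python
import numpy as np

# Toy example: a clean signal corrupted by additive Gaussian noise,
# then smoothed with a simple moving-average filter.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 500)
clean = np.sin(t)                               # underlying "true" signal
noisy = clean + rng.normal(0, 0.3, t.size)      # observed signal = signal + noise

window = 25
smoothed = np.convolve(noisy, np.ones(window) / window, mode="same")

print("RMSE before filtering:", np.sqrt(np.mean((noisy - clean) ** 2)))
print("RMSE after filtering: ", np.sqrt(np.mean((smoothed - clean) ** 2)))
```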
I also generally think of this in the context of Time Series Analysis, where a time series is separated into non-random components (e.g. seasonal) and a random component (i.e. noise).
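Again, just as a toy illustration of what I mean (the series and its components are invented, and `seasonal_decompose` is only one common way of doing this kind of decomposition):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Invented monthly series: trend + seasonal cycle + random noise.
rng = np.random.default_rng(1)
n = 120
idx = pd.date_range("2010-01-01", periods=n, freq="MS")
series = pd.Series(
    0.05 * np.arange(n)                              # trend
    + 2.0 * np.sin(2 * np.pi * np.arange(n) / 12)    # seasonal component
    + rng.normal(0, 0.5, n),                         # noise ("random" component)
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=12)
print(result.resid.dropna().std())  # spread of the leftover "noise" component
```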
In both of the above cases, "noise" carries a negative connotation: it is something undesirable that hinders or complicates the end goal of (usually) a forecasting or engineering project.
However, I am interested in "noise" from more of a Machine Learning and Optimization perspective. For instance (I am not sure if this is correct), I have heard that because the loss function of a Machine Learning algorithm models a random variable, any such loss function is considered to be a "noisy function".
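As a toy sketch of what I understand by this (my own invented setup, not taken from any textbook): evaluating the "same" minibatch loss twice at the same parameter value gives two different numbers, which is what I mean by the loss being a "noisy function":

```python
import numpy as np

# The "true" objective is the expected squared error over the whole data
# distribution, but in practice we only ever evaluate a noisy estimate of it
# on a random minibatch.
rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 10_000)

def minibatch_loss(w, batch_size=32):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    pred = w * X[idx, 0]
    return np.mean((pred - y[idx]) ** 2)

# Two evaluations at the same point w = 2.5 give different values:
print(minibatch_loss(2.5), minibatch_loss(2.5))
```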
My Question:
Why are "Noisy" Functions difficult to optimize compared to "Non-Noisy" Functions?
I can understand that "Noisy" Functions contain "random noise" (as the name implies) which alters their "fidelity" with regards to the concept they are attempting to represent (i.e. an additional source of "difficulty" when attempting to use them for some applied purpose) - but are "Noisy" Functions inherent more "computationally expensive" to evaluate (e.g. their derivatives) compared to "Non-Noisy" and "Lesser-Noisy" Functions of similar complexity? How exactly does the "Noisiness" of a function contribute to its computational complexity (to the extent that gradient-free methods are often used on "Noisy" Functions in order to reduce their "computational costs")?
I have heard the following argument made informally: given that "noisy" functions are often more computationally expensive to optimize, and that no major theoretical results have been established on the convergence properties of gradient-based optimization algorithms on "noisy" functions, gradient-free optimization algorithms (e.g. evolutionary algorithms, genetic algorithms, metaheuristics) might have certain advantages when optimizing such functions (a toy sketch of what I mean by "gradient-free" is below). Have any significant theoretical results been established regarding the convergence properties of common optimization algorithms (e.g. gradient descent, stochastic gradient descent) on "noisy" functions?
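To be explicit about what I mean by a "gradient-free" method, here is a toy random-search sketch (again my own invention, not any specific published algorithm) that only compares averaged function values and never forms a derivative:

```python
import numpy as np

# Toy noisy objective with its true minimum at x = 2.
rng = np.random.default_rng(4)

def f_noisy(x, sigma=0.1):
    return (x - 2.0) ** 2 + rng.normal(0, sigma)

def avg_f(x, repeats=20):
    # Average several noisy evaluations to compare candidates more reliably.
    return np.mean([f_noisy(x) for _ in range(repeats)])

x, best = 0.0, avg_f(0.0)
for _ in range(200):
    candidate = x + rng.normal(0, 0.5)   # random perturbation ("mutation")
    value = avg_f(candidate)
    if value < best:                     # accept only improvements
        x, best = candidate, value

print(x)   # should end up close to 2
```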
Thanks!