I am trying to better understand the meaning of "noise" in the context of function optimization - specifically, why "noisy" functions are more difficult to optimize than "non-noisy" functions.
Up until now, I have always thought of "noise" from a signal-processing standpoint: for example, how to remove or filter out the noise component from some signal.
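To make concrete what I mean by the signal-processing view, here is a toy sketch I put together (the signal, the noise level, and the moving-average filter are all made up purely for illustration):

```python
import numpy as np

# Toy example: a clean signal corrupted by additive Gaussian noise,
# then smoothed with a simple moving-average filter.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 500)
clean = np.sin(t)                               # underlying "true" signal
noisy = clean + rng.normal(0, 0.3, t.size)      # observed signal = signal + noise

window = 25
smoothed = np.convolve(noisy, np.ones(window) / window, mode="same")

print("RMSE before filtering:", np.sqrt(np.mean((noisy - clean) ** 2)))
print("RMSE after filtering: ", np.sqrt(np.mean((smoothed - clean) ** 2)))
```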
I also generally think of this in the context of Time Series Analysis, where a time series is separated into non-random components (e.g. seasonal) and a random component (i.e. noise).
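Again, just as a toy illustration of what I mean (the series and its components are invented, and `seasonal_decompose` is only one common way of doing this kind of decomposition):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Invented monthly series: trend + seasonal cycle + random noise.
rng = np.random.default_rng(1)
n = 120
idx = pd.date_range("2010-01-01", periods=n, freq="MS")
series = pd.Series(
    0.05 * np.arange(n)                              # trend
    + 2.0 * np.sin(2 * np.pi * np.arange(n) / 12)    # seasonal component
    + rng.normal(0, 0.5, n),                         # noise ("random" component)
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=12)
print(result.resid.dropna().std())  # spread of the leftover "noise" component
```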
In both of the above cases, "noise" carries a negative connotation: it is something undesirable that hinders or complicates the end goal of (usually) a forecasting or engineering project.
However, I am interested in "noise" from more of a Machine Learning and Optimization perspective. For instance (I am not sure if this is correct), I have heard that because the loss function of a Machine Learning algorithm models a random variable, any such loss function is considered to be a "noisy function".
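As a toy sketch of what I understand by this (my own invented setup, not taken from any textbook): evaluating the "same" minibatch loss twice at the same parameter value gives two different numbers, which is what I mean by the loss being a "noisy function":

```python
import numpy as np

# The "true" objective is the expected squared error over the whole data
# distribution, but in practice we only ever evaluate a noisy estimate of it
# on a random minibatch.
rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 10_000)

def minibatch_loss(w, batch_size=32):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    pred = w * X[idx, 0]
    return np.mean((pred - y[idx]) ** 2)

# Two evaluations at the same point w = 2.5 give different values:
print(minibatch_loss(2.5), minibatch_loss(2.5))
```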
My Question:
Why are "Noisy" Functions difficult to optimize compared to "Non-Noisy" Functions?
I can understand that "Noisy" Functions contain "random noise" (as the name implies) which alters their "fidelity" with regards to the concept they are attempting to represent (i.e. an additional source of "difficulty" when attempting to use them for some applied purpose) - but are "Noisy" Functions inherent more "computationally expensive" to evaluate (e.g. their derivatives) compared to "Non-Noisy" and "Lesser-Noisy" Functions of similar complexity? How exactly does the "Noisiness" of a function contribute to its computational complexity (to the extent that gradient-free methods are often used on "Noisy" Functions in order to reduce their "computational costs")?
I have heard the following argument made informally: given that "noisy" functions are often more computationally expensive to optimize, and that no major theoretical results have been established on the convergence properties of gradient-based optimization algorithms on "noisy" functions, gradient-free optimization algorithms (e.g. evolutionary algorithms, genetic algorithms, metaheuristics) might have certain advantages when optimizing such functions (a toy sketch of what I mean by "gradient-free" is below). Have any significant theoretical results been established regarding the convergence properties of common optimization algorithms (e.g. gradient descent, stochastic gradient descent) on "noisy" functions?
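To be explicit about what I mean by a "gradient-free" method, here is a toy random-search sketch (again my own invention, not any specific published algorithm) that only compares averaged function values and never forms a derivative:

```python
import numpy as np

# Toy noisy objective with its true minimum at x = 2.
rng = np.random.default_rng(4)

def f_noisy(x, sigma=0.1):
    return (x - 2.0) ** 2 + rng.normal(0, sigma)

def avg_f(x, repeats=20):
    # Average several noisy evaluations to compare candidates more reliably.
    return np.mean([f_noisy(x) for _ in range(repeats)])

x, best = 0.0, avg_f(0.0)
for _ in range(200):
    candidate = x + rng.normal(0, 0.5)   # random perturbation ("mutation")
    value = avg_f(candidate)
    if value < best:                     # accept only improvements
        x, best = candidate, value

print(x)   # should end up close to 2
```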
Thanks!