
I want to determine how the inclusion of new data affects the hyperparameters of the Gaussian process kernel. For reference, assume a squared-exponential kernel: $$K(x,x') = \sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right)$$ The derivative with respect to the lengthscale describes the effect on the kernel when the lengthscale changes: $$\frac{\partial K}{\partial l} = \sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right) \frac{(x-x')^T(x-x')}{l^3}$$
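For concreteness, here is a minimal NumPy sketch of this kernel and its lengthscale derivative, checked against a central finite difference (the point values and hyperparameters below are arbitrary, just for illustration):

```python
import numpy as np

def se_kernel(x, xp, sigma=1.0, l=1.0):
    """Squared-exponential kernel K(x, x')."""
    r2 = np.sum((x - xp) ** 2)
    return sigma**2 * np.exp(-r2 / (2 * l**2))

def se_kernel_dl(x, xp, sigma=1.0, l=1.0):
    """Analytic derivative of the SE kernel w.r.t. the lengthscale l."""
    r2 = np.sum((x - xp) ** 2)
    return sigma**2 * np.exp(-r2 / (2 * l**2)) * r2 / l**3

# Sanity check: analytic derivative vs. central finite difference.
x, xp, l, eps = np.array([0.3]), np.array([1.1]), 0.7, 1e-6
fd = (se_kernel(x, xp, l=l + eps) - se_kernel(x, xp, l=l - eps)) / (2 * eps)
print(np.isclose(se_kernel_dl(x, xp, l=l), fd))  # True
```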

However, I would like to determine the change in, or effect on, the lengthscale caused by a single new data point. What symbolic expression do I need to evaluate the derivative of?

Is it $$\frac{\partial l}{\partial \mu}$$ of the GP, where $\mu$ is the predictive mean of the GP as follows:

$$\mu(x^*)=K(x^*,X)^\top[K(X,X)+\sigma_n^2\mathbf{I}]^{-1} \mathbf{y_n}$$ If so, how can the derivative expression be formulated? (Just the initial expression; I should be able to work out the derivative from there myself.)
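For reference, this predictive mean can be sketched in a few lines of NumPy (the training data, noise level $\sigma_n$, and hyperparameter values below are made-up placeholders):

```python
import numpy as np

def se_kernel_matrix(A, B, sigma=1.0, l=1.0):
    # Pairwise SE kernel between rows of A (n x d) and B (m x d).
    r2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma**2 * np.exp(-r2 / (2 * l**2))

def gp_predict_mean(X, y, Xstar, sigma_n=0.1, sigma=1.0, l=1.0):
    """mu(x*) = K(x*, X) [K(X, X) + sigma_n^2 I]^{-1} y."""
    K = se_kernel_matrix(X, X, sigma, l) + sigma_n**2 * np.eye(len(X))
    Ks = se_kernel_matrix(Xstar, X, sigma, l)
    return Ks @ np.linalg.solve(K, y)

# Toy data: noisy-free samples of sin, prediction at a training input.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
mu = gp_predict_mean(X, y, np.array([[1.0]]))
```

With small observation noise the posterior mean at a training input stays close to the observed value, which is a quick way to sanity-check the implementation.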

1 Answer


Interesting question. Firstly, the lengthscale does not change with new data; it only changes when you re-optimize the hyperparameters. So I assume you care about how the optimum of the NLML surface, parameterized by hyperparameters and data, changes w.r.t. a new observation. That is to say: I see a new point and re-optimize the kernel hyperparameters. The lengthscale changes; can we quantify this?

Unfortunately, a completely general answer to this is no (as far as I am aware), as the hyperparameter optimization surface is non-analytic (unless you want to go about sampling the entire space and interpolating to fill in the gaps).

But hope is not lost entirely. What I suspect is that you care about the gradient of the hyperparameter surface at the old optimum when the new point is observed, or more completely, the change around the region of the optimum as the new point is observed. The change in the NLML hyperparameter surface is just the difference between NLML$(x)$ and NLML$(x, \bar{x})$, and the same holds for the derivatives.

Each new point is a discrete event so you have to look at differences not analytic gradients.
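To make the "differences, not analytic gradients" point concrete, here is a rough sketch that re-optimizes the lengthscale before and after one extra observation and reports the discrete change. It assumes an SE kernel with $\sigma$ and $\sigma_n$ held fixed, optimizing only the lengthscale on a 1-D slice of the NLML; all numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nlml(X, y, l, sigma=1.0, sigma_n=0.1):
    """Negative log marginal likelihood of a zero-mean GP with an SE kernel."""
    n = len(X)
    r2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * np.exp(-r2 / (2 * l**2)) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

def best_l(X, y):
    # Re-optimize the lengthscale on a bounded 1-D NLML slice.
    return minimize_scalar(lambda l: nlml(X, y, l),
                           bounds=(0.05, 10.0), method="bounded").x

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(8, 1))
y = np.sin(X).ravel()
l_old = best_l(X, y)

# Ingest one new observation and re-optimize.
X_new = np.vstack([X, [[2.5]]])
y_new = np.append(y, np.sin(2.5))
l_new = best_l(X_new, y_new)
delta_l = l_new - l_old  # discrete jump, not an analytic gradient
```

Running this for a sweep of candidate new points $\bar{x}$ would map out how the re-optimized lengthscale responds to where the observation lands.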

Finally, if you care about the change of NLML$(x, \bar{x})$ with respect to the position of $\bar{x}$, we could analytically compute that derivative fairly easily (but I'll wait for feedback from you before I write it all out).

j__
  • Actually I am interested in determining how the acquisition function (Bayesian Optimization) changes or evolves when a new data point is ingested by the GP. One way to determine that is by looking at what changes in the GP when a new point is observed. The things that change in the GP are the "local mean" around the new data point and, when the kernel hyperparameters are re-optimized by MLE or some other optimizer, hyperparameters such as the lengthscale. I want to quantify these two quantities in terms of derivatives, so basically $d\mu / dx^*$ and $dl / dx^*$ – GENIVI-LEARNER Jan 29 '20 at 13:42
  • Maybe what I am looking for is not actually the derivatives; that's why I wanted to also provide the intent of what I want to do. So instead of a real new data point, I want to know what to expect if we instead put the expected mean of the GP at that data point. In other words, "cheat" by adding the expected value of the GP at some location $x^*$ as "observed data", and then compare against real data to see how far the GP with false data (expected mean instead of actual y) is from the GP with real data (actual y) – GENIVI-LEARNER Jan 29 '20 at 13:49
  • Maybe you could point me in the right direction by letting me know what metric I should be looking for? If it is a derivative, then what with respect to what; if something else, then what exactly. – GENIVI-LEARNER Jan 29 '20 at 13:50
  • For submodular functions it is easy, but for non-submodular functions such as the GP it's difficult; anything that is remotely acceptable shall be fine – GENIVI-LEARNER Jan 29 '20 at 13:51
  • Also what is NLML? Non linear machine learning? – GENIVI-LEARNER Jan 29 '20 at 13:52
    NLML is the negative log marginal likelihood - the thing you optimise to work out good hyperparameters. I assume this is for a research paper or equivalent? A direction you could go is to say most BO approaches look only at a single realisation of the kernel function, or marginalise over hyperparameters, which is like averaging them out based on a prior. However, you could take into account uncertainty in both hyperparameter space and function space - having non-linear relations to one another. This might form the basis of a new acquisition function.... it would be intentionally uninformative – j__ Jan 29 '20 at 23:46
  • +1 for the insight. This is for the thesis :) Well, as far as I know, BO approaches look at the optimum of all the realizations of the posterior predictive distribution of the GP with some metric called the acquisition function. (Marginalizing out hyperparameters means averaging out lengthscale and sigma, right?) Now I do know the uncertainty in function space, which is the predictive variance of the GP. So using that uncertainty I can calculate uncertainty in hyperparameter space, especially the "lengthscale", as per your suggestion. So [continued] – GENIVI-LEARNER Jan 30 '20 at 13:02
  • Can I use the calculated uncertainty in lengthscale and say that the lengthscale of (a GP with the predicted mean serving as new data instead of real data) falls within the range of the normal GP, hence the approximation is justified? I just want to claim that over a certain horizon, say n points, I can ingest the predicted mean as new data and use the acquisition function to determine where to go next, instead of recalculating the GP by running a real experiment and using the result with the acquisition function to determine where to go next. – GENIVI-LEARNER Jan 30 '20 at 13:09
  • Think of it the other way around: the hyperparameters "cause" the function space uncertainty (predictive variance). Each hyperparameter setting gives a different predictive variance. We could take a weighted combination of these (Monte Carlo style, or integrate if analytically tractable). This would be a weighted sum of Gaussians, so you could propagate moments maybe. I think in the limit it should be a Gaussian for hyperparameters which are linear, but for non-smooth kernels etc. maybe you could work out the expected distribution? – j__ Jan 30 '20 at 16:07
  • If you are thinking more generally about BO there has been some interesting work on spaces with eg symmetries or monotonic spaces. I think more could be explored in those directions too. – j__ Jan 30 '20 at 16:09
  • Ok, so I have been digesting the content you put in the answer. You mentioned "Each new point is a discrete event so you have to look at differences not analytic gradients". This makes sense. However, won't the change to the hyperparameters be more of an "incremental" nature? Meaning, before observing a new point and re-optimizing the hyperparameters, we have already seen a lot of prior data, so the change to the hyperparameters will not be discrete and will be incremental/decremental in nature, right? – GENIVI-LEARNER Feb 10 '20 at 09:15
  • Also, you mentioned that "Finally, if you care about the change of NLML(x,x¯) with respect to the position of x¯ we could analytically compute that derivative fairly easily (but I'll wait for feedback from you before I write it all out)." Can I request to see that analytical derivative? This is a derivative with respect to some new arbitrary data point, right? – GENIVI-LEARNER Feb 10 '20 at 09:27