
This is a follow-up question to: Gradients of marginal likelihood of Gaussian Process with squared exponential covariance, for learning hyper-parameters.

Given a covariance function:

$K(x,x') = \sigma^2\exp\big(\frac{-(x-x')^T(x-x')}{2l^2}\big)$

The gradient with respect to $l$ is:

$\frac{\partial K}{\partial l} = \sigma^2\exp\big(\frac{-(x-x')^T(x-x')}{2l^2}\big) \frac{(x-x')^T(x-x')}{l^3}$
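As a quick sanity check, this derivative can be verified numerically. Here is a minimal NumPy sketch (the names `k_se` and `dk_dl` are just illustrative), comparing the analytic gradient against a central finite difference:

```python
import numpy as np

# Squared exponential covariance with a single length-scale l
def k_se(x, xp, sigma, l):
    r2 = np.sum((x - xp) ** 2)
    return sigma ** 2 * np.exp(-r2 / (2.0 * l ** 2))

# Analytic derivative dK/dl from the formula above
def dk_dl(x, xp, sigma, l):
    r2 = np.sum((x - xp) ** 2)
    return k_se(x, xp, sigma, l) * r2 / l ** 3

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
sigma, l, eps = 1.5, 0.7, 1e-6

# Central finite difference; should agree with the analytic value closely
fd = (k_se(x, xp, sigma, l + eps) - k_se(x, xp, sigma, l - eps)) / (2 * eps)
print(dk_dl(x, xp, sigma, l), fd)
```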

Assuming $x$ and $x'$ are vectors of length $m$, $l$ can be made into a vector of length $m$ so that the relevance of each element of $x$ can be learned (automatic relevance determination). My question is: if $l$ is a vector (the ARD form I have in mind is written out below), how is the gradient with respect to each $l_i$ calculated? It may be obvious, but I am getting confused by the matrix calculus notation and would appreciate the help.
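For concreteness, I am assuming the ARD form of the covariance, with one length-scale per input dimension:

$K(x,x') = \sigma^2\exp\big(-\sum_{i=1}^{m}\frac{(x_i-x'_i)^2}{2l_i^2}\big)$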

1 Answer


I'm dealing with the same problem, so I'm not sure that my answer is correct. However, the partial derivatives (gradient) of the negative log marginal likelihood (NLML) give you a step size and a direction with which to change the hyperparameter values iteratively; at the minimum the gradient is zero. The iteration ends when the NLML has been reduced to a sufficiently small value. Finally, to answer your question: $l$ is simply the vector of length-scales, one per input dimension, as you can find in Rasmussen & Williams:

Rasmussen & Williams, *Gaussian Processes for Machine Learning*, ch. 5.1, p. 106
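To make the ARD case concrete: since $l_i$ appears only in the $i$-th term of the sum inside the exponential, differentiating gives one partial derivative per dimension, $\frac{\partial K}{\partial l_i} = K(x,x')\frac{(x_i-x'_i)^2}{l_i^3}$. Here is a minimal sketch (assuming the ARD form written out in the question; an illustration, not the book's code):

```python
import numpy as np

# ARD squared exponential: one length-scale l[i] per input dimension
def k_ard(x, xp, sigma, l):
    return sigma ** 2 * np.exp(-0.5 * np.sum(((x - xp) / l) ** 2))

# Only the i-th summand depends on l[i], so the gradient is computed
# elementwise: dK/dl_i = K(x, x') * (x_i - x'_i)^2 / l_i**3
def dk_dl_ard(x, xp, sigma, l):
    return k_ard(x, xp, sigma, l) * (x - xp) ** 2 / l ** 3

rng = np.random.default_rng(1)
x, xp = rng.normal(size=3), rng.normal(size=3)
sigma, l = 1.0, np.array([0.5, 1.0, 2.0])
print(dk_dl_ard(x, xp, sigma, l))  # one gradient entry per l_i
```

Each resulting matrix $\partial K/\partial l_i$ (built over all training pairs) then enters the usual expression for the gradient of the log marginal likelihood, $\frac{1}{2}\mathbf{y}^T K^{-1}\frac{\partial K}{\partial l_i}K^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\big(K^{-1}\frac{\partial K}{\partial l_i}\big)$, exactly as in the scalar case.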
