Exact GPR Method

An instance of response y from a Gaussian process regression (GPR) model can be modeled as

$$P(y_i \mid f(x_i), x_i) \sim N\!\left(y_i \mid h(x_i)^T \beta + f(x_i),\ \sigma^2\right)$$

Hence, making predictions for new data from a GPR model requires:

  • Knowledge of the coefficient vector, β, of the fixed basis functions

  • Ability to evaluate the covariance function k(x,x′|θ) for arbitrary x and x′, given the kernel parameters or hyperparameters, θ

  • Knowledge of the noise variance σ² that appears in the density P(yi|f(xi),xi)

That is, one needs first to estimate β, θ, and σ² from the data (X,y).
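
The ingredients listed above can be made concrete in code. The following sketch is in Python/NumPy purely for illustration (the toolbox this page describes is MATLAB); it builds the covariance matrix K(X,X|θ) and the noisy covariance K(X,X|θ) + σ²Iₙ for a squared-exponential kernel, which is one common choice of k(x,x′|θ). All names here (`sq_exp_kernel`, the data sizes, the parameter values) are illustrative assumptions, not part of the toolbox API.

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f, length_scale):
    # Squared-exponential kernel: one common choice of k(x, x'|theta),
    # with hyperparameters theta = (sigma_f, length_scale).
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))                     # n = 5 training inputs
K = sq_exp_kernel(X, X, sigma_f=1.0, length_scale=0.7)  # K(X,X|theta)
sigma2 = 0.1                                            # noise variance sigma^2
Ky = K + sigma2 * np.eye(5)                             # K(X,X|theta) + sigma^2 I_n
```

`Ky` is the matrix that appears throughout the estimation and prediction formulas below; because the kernel matrix is symmetric positive semidefinite, adding σ²Iₙ makes `Ky` strictly positive definite and safely invertible.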

Parameter Estimation

One approach for estimating the parameters β, θ, and σ² of a GPR model is to maximize the likelihood P(y|X) as a function of β, θ, and σ² [1]. That is, if $\hat{\beta}$, $\hat{\theta}$, and $\hat{\sigma}^2$ are the estimates of β, θ, and σ², respectively, then:

$$\hat{\beta}, \hat{\theta}, \hat{\sigma}^2 = \arg\max_{\beta,\theta,\sigma^2} \log P(y \mid X, \beta, \theta, \sigma^2).$$



Because $P(y \mid X, \beta, \theta, \sigma^2) = N\!\left(y \mid H\beta,\ K(X,X\mid\theta) + \sigma^2 I_n\right)$, the marginal log likelihood function is as follows:

$$\log P(y \mid X, \beta, \theta, \sigma^2) = -\frac{1}{2}(y - H\beta)^T \left[K(X,X\mid\theta) + \sigma^2 I_n\right]^{-1} (y - H\beta) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|K(X,X\mid\theta) + \sigma^2 I_n\right|,$$

where H is the matrix of explicit basis functions, with rows $h(x_i)^T$, and K(X,X|θ) is the covariance function matrix (for more information, see Gaussian Process Regression Models).

To estimate the parameters, the software first computes $\hat{\beta}(\theta,\sigma^2)$, which maximizes the log likelihood function with respect to β for given θ and σ². It then uses this estimate to compute the β-profiled likelihood:

$$\log\left[P\!\left(y \mid X, \hat{\beta}(\theta,\sigma^2), \theta, \sigma^2\right)\right].$$

The estimate of β for given θ and σ² is

$$\hat{\beta}(\theta,\sigma^2) = \left[H^T \left[K(X,X\mid\theta) + \sigma^2 I_n\right]^{-1} H\right]^{-1} H^T \left[K(X,X\mid\theta) + \sigma^2 I_n\right]^{-1} y.$$

Then, the β-profiled log likelihood is given by

$$\log P\!\left(y \mid X, \hat{\beta}(\theta,\sigma^2), \theta, \sigma^2\right) = -\frac{1}{2}\left(y - H\hat{\beta}(\theta,\sigma^2)\right)^T \left[K(X,X\mid\theta) + \sigma^2 I_n\right]^{-1} \left(y - H\hat{\beta}(\theta,\sigma^2)\right) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|K(X,X\mid\theta) + \sigma^2 I_n\right|.$$

The software then maximizes the β-profiled log likelihood over θ and σ² to find their estimates.
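
The parameter-estimation step above can be sketched numerically. This Python/NumPy fragment is illustrative only (the actual fitting is done by the MATLAB software): `beta_hat` computes the generalized-least-squares estimate $\hat{\beta}(\theta,\sigma^2)$, and `profiled_loglik` evaluates the β-profiled log likelihood for a given Ky = K(X,X|θ) + σ²Iₙ. An outer numerical optimizer would then maximize `profiled_loglik` over θ and σ²; the function names and the demo data are assumptions of this sketch.

```python
import numpy as np

def beta_hat(H, Ky, y):
    # GLS estimate: [H' Ky^-1 H]^-1 H' Ky^-1 y, with Ky = K(X,X|theta) + sigma^2 I_n.
    KyinvH = np.linalg.solve(Ky, H)
    Kyinvy = np.linalg.solve(Ky, y)
    return np.linalg.solve(H.T @ KyinvH, H.T @ Kyinvy)

def profiled_loglik(H, Ky, y):
    # Beta-profiled log likelihood: log P(y | X, beta_hat(theta, sigma^2), theta, sigma^2).
    n = y.shape[0]
    r = y - H @ beta_hat(H, Ky, y)               # residual y - H * beta_hat
    _, logdet = np.linalg.slogdet(Ky)            # log |Ky|, computed stably
    return (-0.5 * r @ np.linalg.solve(Ky, r)
            - 0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * logdet)

# Tiny demo: when Ky = I, beta_hat reduces to ordinary least squares.
rng = np.random.default_rng(1)
H = rng.normal(size=(20, 2))
y = rng.normal(size=20)
b = beta_hat(H, np.eye(20), y)
```

Note the design point this illustrates: β has a closed-form maximizer for fixed (θ, σ²), so the numerical search only has to run over the kernel parameters and the noise variance.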


Prediction

Making probabilistic predictions from a GPR model with known parameters requires the density P(ynew|y,X,xnew). Using the definition of conditional probabilities, one can write:

$$P(y_{new} \mid y, X, x_{new}) = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X, x_{new})}.$$
To find the joint density in the numerator, it is necessary to introduce the latent variables $f_{new}$ and $f$ corresponding to $y_{new}$ and $y$, respectively. Then, it is possible to use the joint distribution for $y_{new}$, $y$, $f_{new}$, and $f$ to compute $P(y_{new}, y \mid X, x_{new})$:

$$P(y_{new}, y \mid X, x_{new}) = \int\!\!\int P(y_{new}, y, f_{new}, f \mid X, x_{new})\, df\, df_{new} = \int\!\!\int P(y_{new}, y \mid f_{new}, f, X, x_{new})\, P(f_{new}, f \mid X, x_{new})\, df\, df_{new}.$$
Gaussian process models assume that each response $y_i$ depends only on the corresponding latent variable $f_i$ and the feature vector $x_i$. Writing $P(y_{new}, y \mid f_{new}, f, X, x_{new})$ as a product of conditional densities and based on this assumption produces:

$$P(y_{new}, y \mid f_{new}, f, X, x_{new}) = P(y_{new} \mid f_{new}, x_{new}) \prod_{i=1}^{n} P(y_i \mid f_i, x_i).$$

After integrating with respect to $y_{new}$, the result depends only on $f$ and $X$:

$$P(y \mid f, X) = \prod_{i=1}^{n} P(y_i \mid f_i, x_i) = \prod_{i=1}^{n} N\!\left(y_i \mid h(x_i)^T\beta + f_i,\ \sigma^2\right).$$

Hence,

$$P(y_{new}, y \mid f_{new}, f, X, x_{new}) = P(y_{new} \mid f_{new}, x_{new})\, P(y \mid f, X).$$

Again using the definition of conditional probabilities,

$$P(f_{new}, f \mid X, x_{new}) = P(f_{new} \mid f, X, x_{new})\, P(f \mid X, x_{new}),$$

it is possible to write $P(y_{new}, y \mid X, x_{new})$ as follows:

$$P(y_{new}, y \mid X, x_{new}) = \int\!\!\int P(y_{new} \mid f_{new}, x_{new})\, P(y \mid f, X)\, P(f_{new} \mid f, X, x_{new})\, P(f \mid X, x_{new})\, df\, df_{new}.$$

Using the facts that

$$P(f \mid X, x_{new}) = P(f \mid X)$$

and

$$P(y \mid f, X)\, P(f \mid X) = P(y, f \mid X) = P(f \mid y, X)\, P(y \mid X),$$

one can rewrite $P(y_{new}, y \mid X, x_{new})$ as follows:

$$P(y_{new}, y \mid X, x_{new}) = P(y \mid X) \int\!\!\int P(y_{new} \mid f_{new}, x_{new})\, P(f \mid y, X)\, P(f_{new} \mid f, X, x_{new})\, df\, df_{new}.$$

It is also possible to show that

$$P(y \mid X, x_{new}) = P(y \mid X).$$

Hence, the required density $P(y_{new} \mid y, X, x_{new})$ is:

$$P(y_{new} \mid y, X, x_{new}) = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X, x_{new})} = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X)} = \int\!\!\int \underbrace{P(y_{new} \mid f_{new}, x_{new})}_{(1)}\ \underbrace{P(f \mid y, X)}_{(2)}\ \underbrace{P(f_{new} \mid f, X, x_{new})}_{(3)}\, df\, df_{new}.$$

It can be shown that

$$(1)\quad P(y_{new} \mid f_{new}, x_{new}) = N\!\left(y_{new} \mid h(x_{new})^T\beta + f_{new},\ \sigma^2\right),$$

$$(2)\quad P(f \mid y, X) = N\!\left(f \mid K(X,X)\left[K(X,X) + \sigma^2 I_n\right]^{-1}(y - H\beta),\ K(X,X) - K(X,X)\left[K(X,X) + \sigma^2 I_n\right]^{-1} K(X,X)\right),$$

$$(3)\quad P(f_{new} \mid f, X, x_{new}) = N\!\left(f_{new} \mid K(x_{new}^T, X)\, K(X,X)^{-1} f,\ \Delta\right), \quad \text{where } \Delta = k(x_{new}, x_{new}) - K(x_{new}^T, X)\, K(X,X)^{-1} K(X, x_{new}^T).$$

After the integration and required algebra, the density of the new response $y_{new}$ at a new point $x_{new}$, given $y$ and $X$, is found as

$$P(y_{new} \mid y, X, x_{new}) = N\!\left(y_{new} \mid h(x_{new})^T\beta + \mu,\ \sigma^2 + \Sigma\right),$$

where

$$\mu = K(x_{new}^T, X)\left[K(X,X) + \sigma^2 I_n\right]^{-1}(y - H\beta)$$

and

$$\Sigma = k(x_{new}, x_{new}) - K(x_{new}^T, X)\left[K(X,X) + \sigma^2 I_n\right]^{-1} K(X, x_{new}^T).$$
The expected value of the prediction $y_{new}$ at a new point $x_{new}$, given $y$, $X$, and the parameters β, θ, and σ², is

$$E(y_{new} \mid y, X, x_{new}, \beta, \theta, \sigma^2) = h(x_{new})^T\beta + K(x_{new}^T, X \mid \theta)\,\alpha = h(x_{new})^T\beta + \sum_{i=1}^{n} \alpha_i\, k(x_{new}, x_i \mid \theta),$$

where

$$\alpha = \left[K(X,X\mid\theta) + \sigma^2 I_n\right]^{-1}(y - H\beta).$$
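
The predictive mean and variance formulas translate directly into code. The following Python/NumPy sketch is illustrative (the toolbox itself is MATLAB, and `gpr_predict`, `kernel`, and the demo data are hypothetical names chosen here); it assumes a constant basis function h(x) = 1 and a squared-exponential kernel.

```python
import numpy as np

def kernel(X1, X2, ell=0.7):
    # Squared-exponential kernel: an illustrative choice of k(x, x'|theta).
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

def gpr_predict(Xnew, X, y, beta, sigma2):
    # Predictive mean and variance of y_new under the exact GPR model:
    #   mean = h(x_new)' beta + K(x_new', X) alpha
    #   var  = sigma^2 + k(x_new, x_new) - K(x_new', X) Ky^-1 K(X, x_new')
    n = X.shape[0]
    H = np.ones((n, 1))                        # constant basis h(x) = 1 (assumption)
    Ky = kernel(X, X) + sigma2 * np.eye(n)     # K(X,X|theta) + sigma^2 I_n
    alpha = np.linalg.solve(Ky, y - H @ beta)  # alpha = Ky^-1 (y - H beta)
    Ks = kernel(Xnew, X)                       # K(x_new', X)
    Hnew = np.ones((Xnew.shape[0], 1))
    mean = Hnew @ beta + Ks @ alpha
    var = (sigma2 + np.diag(kernel(Xnew, Xnew))
           - np.einsum('ij,ji->i', Ks, np.linalg.solve(Ky, Ks.T)))
    return mean, var

# Demo: with near-zero noise, the mean should nearly interpolate the data.
X = np.linspace(-1, 1, 8)[:, None]
y = np.sin(3 * X[:, 0])
mean, var = gpr_predict(X, X, y, beta=np.zeros(1), sigma2=1e-6)
```

Note that the mean is a weighted sum of kernel evaluations against the training points, with weights α fixed once per fitted model, which is why prediction after training is comparatively cheap.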
Computational Complexity of Exact Parameter Estimation and Prediction

Training a GPR model with the exact method (when FitMethod is 'Exact') requires the inversion of the n-by-n matrix K(X,X|θ) + σ²Iₙ. The memory requirement for this step scales as O(n^2), because K(X,X|θ) must be stored in memory. One evaluation of log P(y|X) scales as O(n^3). Therefore, the overall computational complexity is O(k·n^3), where k is the number of function evaluations needed for the maximization and n is the number of observations.

Making predictions on new data involves the computation of $\hat{\alpha} = [K(X,X\mid\hat{\theta}) + \hat{\sigma}^2 I_n]^{-1}(y - H\hat{\beta})$. If prediction intervals are desired, this step can also involve the computation and storage of the Cholesky factor of $K(X,X\mid\hat{\theta}) + \hat{\sigma}^2 I_n$ for later use. The computational complexity of the direct computation of $\hat{\alpha}$ is O(n^3), and the memory requirement is O(n^2).
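
The Cholesky-based reuse mentioned above can be sketched as follows (Python/NumPy for illustration; the matrix and vector here are random stand-ins for K(X,X|θ) + σ²Iₙ and y − Hβ). The O(n^3) factorization is performed once and stored in O(n^2) memory; each subsequent solve against the stored factor costs only O(n^2).

```python
import numpy as np

def chol_solve(L, b):
    # Solve (L L') x = b with two triangular solves.
    # np.linalg.solve is used for brevity; a dedicated triangular solver
    # (e.g. scipy.linalg.solve_triangular) would exploit the structure.
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

rng = np.random.default_rng(0)
n = 50
A = rng.normal(size=(n, n))
Ky = A @ A.T + n * np.eye(n)   # stand-in for K(X,X|theta) + sigma^2 I_n (SPD)
L = np.linalg.cholesky(Ky)     # O(n^3), done once; factor stored (O(n^2) memory)
resid = rng.normal(size=n)     # stand-in for y - H beta
alpha = chol_solve(L, resid)   # alpha_hat = Ky^{-1}(y - H beta), O(n^2) per solve
```

The same stored factor also yields the log-determinant (as twice the sum of the log of the diagonal of L) and the predictive variances, which is why it is retained when prediction intervals are requested.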

Hence, for large n, estimating the parameters or computing predictions can be very expensive. The approximation methods usually rearrange the computation so as to avoid the inversion of an n-by-n matrix. For the available approximation methods, see the related links at the bottom of this page.


References

[1] Rasmussen, C. E., and C. K. I. Williams. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.

See Also


Related Topics