Introduction to Gaussian Process Regression (GPR)

subhasish_basak
5 min read · May 18, 2020

Disclaimer: This article is designed to give the reader an introductory knowledge of Gaussian Process Regression, using a demo example in Python. The article is based on the author's personal experience and thus may contain some errors (the reader should never hesitate to bring such issues to the author, whenever found). One should not rely on this document alone; the author highly recommends going through the other tutorials available on the net. This article requires basic knowledge of probability theory.

Introduction

Like any other regression setup we start with a dependent variable Y and an independent variable X (let's keep it simple and easy with just one predictor). We also assume that there is a simple data generative model

Y = X * sin(X)     (a)

This means the data we are observing is originally generated from this sinusoidal functional form.

[Figure: the function f(·) and the observations]

Now, the problem is that we do not know (here we will just pretend that we don't know) the existing relation between X and Y given by (a). All we have are some observed data points. See the figure above: the observations along with the data generative function. We want to build a predictive model which approximates the generative model (a), and we'll use Gaussian Process Regression for that. Let's see how.
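Here is a minimal sketch of this setup in Python; the domain bounds and the observation locations are my own choices for illustration, not the exact points from the figure:

```python
import numpy as np

# The (pretend-unknown) data generative function from (a)
def f(x):
    return x * np.sin(x)

# A dense grid over the domain, used later for plotting the true function
X_grid = np.linspace(0, 10, 200)

# A handful of observed (x, y) tuples; the locations are illustrative picks
X_obs = np.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0])
y_obs = f(X_obs)
```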

Model Assumption:

The main assumption of GPR is

The output/dependent variable is a Gaussian Stochastic Process

In very simple terms, this is equivalent to saying that our observed data points (y0, y1, ...) are just realizations of jointly distributed Gaussian random variables (Y0, Y1, ...) which have some mean vector m and covariance structure S. This line is crucial!

Here lies the main difference between Linear Regression and GPR. Unlike linear regression, the dependent variables in GPR are NOT independently distributed (since we have the covariance structure S). Moreover, some of you might have noticed that, unlike linear regression, I have not included any error term in (a). In GPR we impose the distributional assumption directly on the dependent variables (whereas in the linear regression setup the distributional assumption is imposed on the error terms only). We can obviously work with an error term in (a) (in the GPR setup it is called the noise), which would just be an extension of our current model.
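To make this concrete, here is a small sketch that draws a few realizations of such a jointly Gaussian vector, with mean vector m = 0 and covariance structure S built from the RBF kernel introduced below (the lengthscale and the grid are arbitrary choices):

```python
import numpy as np

# RBF covariance between two sets of 1-D inputs; lengthscale l = 1.0
# is an arbitrary choice for this illustration
def rbf_cov(xa, xb, l=1.0):
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / l) ** 2)

x = np.linspace(0, 10, 100)
m = np.zeros_like(x)                       # mean vector m (taken as 0)
S = rbf_cov(x, x) + 1e-8 * np.eye(len(x))  # covariance S (+ tiny jitter)

# Each draw is one realization (y0, y1, ...) of the jointly Gaussian
# vector (Y0, Y1, ...)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(m, S, size=3)
print(samples.shape)  # (3, 100): three realizations over the grid
```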

How does the model work?

Here comes the amazing property of Gaussian distribution, which says

If Y1 & Y2 jointly follow a Gaussian distribution then Y1|Y2 is also Gaussian

[Figure: the conditional distribution]

So, if we know the mean vector and the covariance structure, we can find the distribution of Y1 given an observation Y2 = y2.
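The original post shows this as an image; for reference, here is the standard closed form of that conditional (a known identity, written in the notation above with m and S split into blocks):

```latex
% (Y_1, Y_2) jointly Gaussian, with mean and covariance in block form:
%   m = (m_1, m_2),   S = [[S_11, S_12], [S_21, S_22]]
% Then the conditional is Gaussian again:
Y_1 \mid Y_2 = y_2 \;\sim\;
\mathcal{N}\!\left( m_1 + S_{12} S_{22}^{-1} (y_2 - m_2),\;
                    S_{11} - S_{12} S_{22}^{-1} S_{21} \right)
```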

Mean & Covariance function:

Following the Simple Kriging case (Kriging is also a regression technique, one which uses the generalized least squares method to compute the Best Linear Unbiased Predictor (BLUP), and it is exactly the same as the GPR model we are using) we take the mean vector to be 0, yes, zero! (life is simple)

For the covariance structure we use one of the available kernels. The covariance structure is very important: it contains our presumptions about the data generative function we wish to learn and defines the closeness and similarity between data points. So let us choose a simple kernel, for example the Radial Basis Function (RBF) or Gaussian kernel. The covariance between two outputs, say Y1 and Y2, is simply computed as K(x1, x2), where K is the kernel function and x1, x2 are the corresponding inputs.
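As a quick sanity check, here is what that computation looks like for the RBF kernel (the lengthscale value below is an arbitrary choice):

```python
import numpy as np

# Gaussian / RBF kernel: K(x1, x2) = exp(-(x1 - x2)^2 / (2 * l^2))
def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

# Covariance between Y1 and Y2 is computed from their inputs x1 and x2:
# nearby inputs -> covariance close to 1, distant inputs -> close to 0
print(rbf_kernel(1.0, 1.2))   # ~0.98, strongly correlated
print(rbf_kernel(1.0, 5.0))   # ~0.0003, nearly uncorrelated
```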

Now, every kernel has one or more parameters (the RBF has one, called the lengthscale), and if we work with a non-zero mean function we may have a couple more parameters there. These are the hyper-parameters of our GPR model, and there are several methods to estimate them. The most famous one is MLE; one can also use Restricted MLE, the MAP estimate (a Bayesian approach), the Leave One Out Cross Validation (LOO-CV) approach, or a fully Bayesian approach using MCMC.
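scikit-learn, which we use below, takes the MLE route: it maximizes the log marginal likelihood of the data over the kernel parameters. A minimal sketch, using observation points of my own choosing:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    return x * np.sin(x)

X_obs = np.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0]).reshape(-1, 1)
y_obs = f(X_obs).ravel()

# The lengthscale is fitted by maximizing the log marginal likelihood
# (i.e. MLE), restarting the optimizer from several random points
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              n_restarts_optimizer=9)
gp.fit(X_obs, y_obs)

print(gp.kernel_)                          # kernel with the fitted lengthscale
print(gp.log_marginal_likelihood_value_)   # value of the maximized likelihood
```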

Let's code it:

Even if some portions above were hard to follow, I'll try to make them easy with this Python demonstration. Remember, at the beginning we started with some observed tuples. We'll use them to fit a GPR model. For the Python demonstration I just used the sample code available on the scikit-learn documentation page.

We first start with a single observation, say (x0, y0), and compute the conditional distribution of the other points based on this single one. We then have the conditional means and conditional standard deviations for all other points in the domain. The graph looks like the figure below.
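Here is a minimal sketch of that first fit, based on scikit-learn's GaussianProcessRegressor; the observation location is my choice, and the lengthscale is kept fixed (optimizer=None) since MLE is not informative with a single point:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    return x * np.sin(x)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)

# A single observation (x0, y0); the location is illustrative
X_obs = np.array([[5.0]])
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None)
gp.fit(X_obs, y_obs)

# Conditional mean and standard deviation at every point of the domain
y_mean, y_std = gp.predict(X_test, return_std=True)

plt.plot(X_test, f(X_test), "b-", label="f(x) = x sin(x)")
plt.plot(X_test, y_mean, "r-", label="conditional mean")
plt.fill_between(X_test.ravel(),
                 y_mean - 1.96 * y_std, y_mean + 1.96 * y_std,
                 alpha=0.3, label="95% confidence band")
plt.scatter(X_obs, y_obs, c="k", zorder=3, label="observation")
plt.legend()
plt.show()
```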

[Figure: with only 1 observation]

Note that the 95% confidence band shrinks to 0 at the observed data point, since it is already a given quantity. The RED line is the plot of the conditional mean; as you can see, it's not even close to our data generative function (the BLUE one). Now let us increase the observations gradually and see what happens.

[Figure: with 2 observations]

Notice the part between the two points: the BLUE and RED lines almost coincide (meaning that our GPR model approximates the generative model well in that region) and the 95% confidence band is also thin. Now let's take some more observations.

[Figure: with 6 observations]

Now our GPR predictive model looks far better and almost resembles the generative model. Note that the 95% confidence band, which highlights the region around the conditional mean, also shrinks around the observations and vanishes at the observed points.
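As a rough check of this shrinking behaviour, here is a sketch that refits the model with more and more observations (the locations are my own picks, not the exact ones from the figures):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    return x * np.sin(x)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)

# Illustrative observation sets of increasing size
for X_obs in ([[3.0], [7.0]],
              [[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]]):
    X_obs = np.array(X_obs)
    y_obs = f(X_obs).ravel()
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  optimizer=None)
    gp.fit(X_obs, y_obs)
    y_mean, y_std = gp.predict(X_test, return_std=True)
    # The widest point of the 95% band shrinks as observations are added,
    # and y_std is ~0 exactly at the observed inputs
    print(len(X_obs), "observations, widest 95% band:",
          round(float((2 * 1.96 * y_std).max()), 3))
```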

This is how we construct a predictive model using GPR. The prediction corresponding to a test data point is nothing but the conditional mean at that point, with the associated conditional variance as the error. This is just an introduction to how GPR works, and as mentioned earlier we can consider other variations of the model (e.g. a non-zero mean function, compositions of kernels, noisy models, etc.; see the references for more details).

References:

I learned about GPR during my summer internship in 2019 at the Laboratoire des signaux et systèmes (L2S), Gif-sur-Yvette, France. I worked on reviewing the scalability of Python toolboxes which implement GPR. You can find a detailed report of the work: click here.

As a part of my work I built a python wrapper which combines several python toolboxes and here is the link to the GitHub repository,

https://github.com/Subhasishbasak/pythongp

Don’t hesitate if you have any comments/suggestions and feel free to contact me (homepage) 😃 .
