Introduction to Bayesian Optimization: A simple Python implementation

subhasish_basak
5 min read · Nov 16, 2020

Disclaimer: This is an introductory article with a demonstration in Python. Basic knowledge of probability theory is assumed.

What is Bayesian Optimization (BO)?

Bayesian Optimization (BO) refers to a set of optimization algorithms which are used to optimize continuous functions (f) that usually possess the following characteristics:

  • f is “expensive” to evaluate (in terms of time, cost, or both).
  • The gradients of f are either “expensive” to evaluate or unavailable.
  • f is a “black box”.

From the above three conditions we have the following restrictions:

  • The total number of evaluations of f is fixed to some integer, say N (known as the evaluation budget).
  • We can NOT use gradient based optimization algorithms.
  • f lacks special known structure like concavity or linearity.

BO algorithms are developed by taking into account the above restrictions.

How is BO different from other optimization techniques?

See the first answer (last 3 points)

Where is BO used?

See the first answer (first 3 points)

How to implement BO?

Here is the simplest, most basic algorithm for BO; needless to say, there is a vast literature of extensions and modifications.

Step 1. Initialization

At first, we have the objective function f, an evaluation budget N and an initialization sample I_n of size n (< N). The initialization sample is simply a set of evaluations of the objective function obtained with some initial space-filling experimental design (e.g. LHSMDU). We denote it as follows,
I_n = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}.

N.B. the input x is a vector when f is multidimensional. BO algorithms work well when the dimension of the input is typically less than 20 (explained later).
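To make Step 1 concrete, here is a minimal sketch in Python. The toy objective, the bounds, the budget N and the initial sample size n are placeholders of my own choosing, and SciPy's Latin hypercube sampler is used as a stand-in for LHSMDU:

```python
import numpy as np
from scipy.stats import qmc

# Toy "expensive" objective: a stand-in for the real black-box f.
def f(x):
    return np.sin(3 * x[0]) + 0.5 * (x[0] - 0.7) ** 2 + 0.1 * x[1] ** 2

N = 20                            # total evaluation budget
n = 5                             # size of the initial design
bounds = np.array([[0.0, 2.0],    # box constraints, one row per input dimension
                   [0.0, 2.0]])

# Space-filling initial design (Latin hypercube; LHSMDU would play the same role).
sampler = qmc.LatinHypercube(d=2, seed=0)
X = qmc.scale(sampler.random(n), bounds[:, 0], bounds[:, 1])
y = np.array([f(x) for x in X])   # the n initial evaluations, i.e. I_n
```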

Now, the idea is to extend the set I_n with N - n additional evaluations of the objective function and report the optimum over the final set as the solution!

Yes! It's that simple.

How to evaluate the next (n+1)-th sample?

We have already evaluated the expensive objective function n times and are left with a budget of N - n. To use it efficiently, we define a function that tells us where to evaluate/sample next. This function is known as the acquisition function, and in this article we are concerned with the simplest one: Expected Improvement (EI).

Step 2. Acquisition function

Suppose we want to minimize f. Let's define,

m_n = min(y_1, y_2, …, y_n) and m = min(f).

m_n is the empirical minimum value obtained so far and m is the true minimum value, which is unknown. We aim to reduce (and never increase) the difference (m_n - m) as n increases. For this to happen, the (n+1)-th evaluation should be less than or equal to m_n. Thus the best possible location for the next evaluation is an input point x that maximizes the quantity max( m_n - f(x), 0 ).

But maximizing this quantity again requires evaluating the expensive f! So we are back where we started. The solution is surrogate modelling: we simply replace f by a dummy replicate g in the above quantity and maximize its expected value (why the expected value? explained later). That is what Expected Improvement (EI) is,

EI_n(x) = E_n[ max( m_n - g(x), 0 ) ]

We optimize the acquisition function EI_n(x) using a normal gradient-based optimizer (yes, we can do that this time!) and evaluate the objective function at the optimal input point.
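In practice EI tends to be multimodal, so a common recipe is a multi-start local optimizer. Here is a minimal sketch; the helper name maximize_acquisition and the restart count are my own choices, not taken from any particular library:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_restarts=10, seed=0):
    """Maximize an acquisition function `acq` (e.g. EI) by multi-start L-BFGS-B.

    `acq` maps a 1-D input vector to a scalar; `bounds` is an array of
    (lower, upper) pairs, one per input dimension.
    """
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    # EI is multimodal, so restart the local optimizer from several random points.
    starts = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_restarts, len(bounds)))
    for x0 in starts:
        res = minimize(lambda x: -acq(x), x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x
```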

The new evaluation is added to I_n, turning it into I_(n+1).

How to build the dummy replicate of f?

We do not (actually, cannot) use f directly, due to the restrictions above (see the first answer); hence we need a surrogate model, or in simple words a dummy replicate of f, call it g, which is “cheap” to evaluate. We use Gaussian Process Regression (GPR) for this.

Step 3. Gaussian Process Regression

Find my introductory article on GPR here.

Now to train the GPR model, we use I_n. Call the resulting surrogate model g_n.

N.B. Here I am treating GPR as a black-box regression technique, to keep things simple for those who are not familiar with it. For the time being, just assume that with GPR you can build a “cheap” replicate g of a function f such that g is normally distributed (aka the posterior distribution of the Kriging predictor).
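For concreteness, a fitted GPR surrogate can be obtained with scikit-learn. This is only a sketch, assuming X and y hold the current evaluations I_n as in the initialization sketch above; the Matérn kernel is just one common choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Fit the surrogate g_n on the current evaluations I_n = (X, y).
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr.fit(X, y)

# At any new input, g_n is Gaussian with this posterior mean and standard deviation.
x_new = np.array([[0.5, 1.0]])                 # a hypothetical query point
mu, sigma = gpr.predict(x_new, return_std=True)
```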

The surrogate g(x) is therefore a Gaussian random variable at every input x, which is why we take the expectation in EI. Here E_n is a conditional expectation, computed w.r.t. the n observed points in I_n. More precisely, the closed-form solution for EI_n(x) depends on the posterior mean and variance of the Kriging predictor, which are functions of I_n.

The derivation is very simple and is left to the reader as an exercise.
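For reference, the standard closed form (for minimization) is EI_n(x) = (m_n - μ_n(x)) Φ(z) + σ_n(x) φ(z) with z = (m_n - μ_n(x)) / σ_n(x), where μ_n and σ_n are the posterior mean and standard deviation, and Φ, φ are the standard normal CDF and PDF. A Python sketch of this formula (the function name and its arguments are my own choices, built on the fitted gpr from the previous sketch):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(x, gpr, m_n):
    """Closed-form EI at input x, for minimization, given a fitted GPR surrogate."""
    mu, sigma = gpr.predict(np.atleast_2d(x), return_std=True)
    mu, sigma = float(mu[0]), float(sigma[0])
    if sigma == 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(m_n - mu, 0.0)
    z = (m_n - mu) / sigma
    return (m_n - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```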

How to compute the final output ?

We are almost done. Next we just repeat Steps 2 and 3 each time a new evaluation is added to the set I.

Step 4. Iteration

Until the budget N is exhausted, we keep sampling new points and evaluating the objective function. With each evaluation the set I is expanded and the corresponding surrogate model is updated.

Finally we have a set I_N = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}.

The final solution to our optimization problem of f is just the optimal observation from this set!

How to code it?

I am attaching a snippet of pseudo code for the above algorithm; the Python implementation is available in my GitHub repository.

The link to the GitHub repository is here.
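As a rough illustration only (not the code from the repository), the whole loop can be sketched by combining the pieces above, reusing the toy objective f, the initial design (X, y), the budget N, n, the bounds, and the helpers maximize_acquisition and expected_improvement:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

for _ in range(N - n):
    # Refit the surrogate g on the current data set I.
    gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gpr.fit(X, y)
    m_n = y.min()

    # Choose the next evaluation point by maximizing EI.
    x_next = maximize_acquisition(
        lambda x: expected_improvement(x, gpr, m_n), bounds)

    # Evaluate the expensive objective and grow I.
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

# The final answer is the best observation in I_N.
best = y.argmin()
print("x* =", X[best], ", f(x*) =", y[best])
```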

Ending remarks

Usually, modelling high-dimensional functions with GPR is a challenging task (although there is some recent literature on scalable GPs). Moreover, tuning the hyperparameters of the GPR model is a crucial step in obtaining a robust surrogate. As a result, the performance of BO depends heavily on the underlying GPR model, and hence most successful applications of BO are in dimensions < 20. To my knowledge, this is an area of active research.

In this article I have described the simplest implementation, but practical industrial problems use more complex acquisition functions and involve multi-objective, multi-constrained and even multi-fidelity optimization. Detailed discussions can easily be found in the recent literature on BO.

Finally, if you have any comments/suggestions, do not hesitate to contact me (homepage) 😃.
