Bayesian Optimization, Striking Gold

Santiago Velez Garcia
Aug 22, 2020 · 7 min read


If you have ever attempted to optimize a machine learning model's hyperparameters, you are familiar with the feeling of digging for gold guided only by your intuition and tales of where other miners found some. If you ask the experts what hyperparameters to use, they will tell you:

“use these, they work well”.

Hyperparameter optimization can feel like digging for gold.

Or, in the equivalent miner's jargon:

“I have heard somebody struck gold there, go dig close to it”.

Not very satisfying. You decide to do things the scientific way, so you lay a grid over your parcel and dig at the intersections of the lines; this is the equivalent of performing a grid search. If you have a two by two grid, you will make four excavations. A ten by ten grid already gives you one hundred holes. Moving to a multidimensional space, say dims=5, and choosing only two values per dimension, you get two to the power of 5, or 32 holes. Already tired just thinking about it, chances are you will run out of money before striking anything worthwhile.
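To get a feel for how fast exhaustive digging gets out of hand, here is a tiny sketch (the hyperparameter names and candidate values are purely illustrative) that counts the grid-search combinations:

```python
from itertools import product

# Five hyperparameters (dims=5), only two candidate values each
grid = {
    'learning_rate': [1e-3, 1e-2],
    'batch_size': [32, 128],
    'keep_prob': [0.6, 0.9],
    'l2_weight': [1e-4, 1e-2],
    'units': [64, 512],
}

combinations = list(product(*grid.values()))
print(len(combinations))  # 2 ** 5 = 32 "holes" to dig, each one a full training run
```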

Fortunately, there is a powerful technique to guide us in this quest. It goes by the name of Bayesian Optimization (BayesOpt), and it relies on another one called the Gaussian Process (GP). Let us move to a more formal discussion of the two.

Gaussian Process

The Gaussian Process belongs to the realm of probability theory and statistics. It is a stochastic process, that is, a collection of random variables indexed by time or space. The defining factor is that every finite collection of those random variables has a multivariate normal distribution or, in other words, every finite linear combination of them is normally distributed.

In Machine Learning, a Gaussian Process uses lazy learning and a measure of the similarity between points, called the kernel function, to predict the value of an unseen point from training data. Besides the prediction, it provides information on the uncertainty of the value found.
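As a quick illustration, here is a minimal sketch using the GPy library on a toy 1-D function (the data and kernel choice are just for demonstration). Note that the prediction comes with a variance, i.e. a measure of the uncertainty:

```python
import numpy as np
import GPy

# Toy training data: a handful of noisy observations of an unknown function
X = np.random.uniform(-3.0, 3.0, (10, 1))
y = np.sin(X) + 0.05 * np.random.randn(10, 1)

# The kernel encodes the similarity between points
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, y, kernel)
model.optimize()  # fit the kernel hyperparameters to the data

# Predicting at an unseen point returns both a mean and a variance
X_new = np.array([[1.5]])
mean, variance = model.predict(X_new)
print(mean, variance)
```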

Bayesian Optimization

BayesOpt is a class of machine-learning-based optimization methods focused on solving the problem max f(x), where x ∈ A. Our wiki friends say:

If an exact functional form for f is not available (that is, f behaves like a “black box”), what can we do? Bayesian optimization proceeds by maintaining a probabilistic belief about f and designing a so-called acquisition function to determine where to evaluate the function next. Bayesian optimization is particularly well-suited to global optimization problems where f is an expensive black-box function; for example, evaluating f might require running an expensive simulation. Bayesian optimization has recently become popular for training expensive machine-learning models whose behavior depends in a complicated way on their parameters (e.g., convolutional neural networks). This is an example of the “AutoML” paradigm.

When we read “expensive black-box function”, think of digging for gold. Every evaluation of the function is a probe we make, or in other words, a hole we dig in the ground looking for the shiny metal; the result is how much gold we found. Evaluating an ML model may not be as dramatic as drilling a probe, but it involves costly computation time you do not want to spend.

There is a fabulous tutorial by Peter I. Frazier on the subject. Alternatively, you can consult this paper from Washington University in St. Louis.

BayesOpt in action

To illustrate the method, we are going to optimize the parameters of a relatively simple neural network trained on the MNIST dataset. It has only two densely connected hidden layers with ReLU activation, feeding an output layer of ten softmax units that tells us which digit the input belongs to.

This is our neural network model. Note that we will optimize on the number of units.
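The model itself appears as an image in the original post; a minimal sketch of such a network (assuming tf.keras, with the 256 units shown in the illustration) could look like this:

```python
import tensorflow as tf

def build_model(units=256):
    """Two densely connected ReLU hidden layers feeding a 10-unit softmax output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),    # MNIST images are 28x28
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),  # one unit per digit
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```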

The first hyperparameter one thinks of, even before starting the training, is the number of units in the hidden layers. Should we go with 16 or 1024? The illustration shows 256.

Regarding the training itself, one immediately thinks of the learning rate and the batch size. Optimization techniques are there to improve convergence speed; my post on the subject offers details on the matter.

And because we are afraid of the hideous over-fitting, we will use some regularization techniques. If you are interested in regularization techniques, take a look at my previous post. How about dropout and L2 regularization? For dropout, we will tune the keep probability. For L2 regularization, we will tune the L2 regularization weight.
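In Keras terms, the two regularizers attach to the hidden layers roughly as follows (a sketch; keep_prob and lambtha are the names that appear in the result captions below):

```python
from tensorflow.keras import layers, regularizers

keep_prob = 0.9   # fraction of units the dropout layer keeps
lambtha = 1e-4    # L2 regularization weight

hidden = layers.Dense(256, activation='relu',
                      kernel_regularizer=regularizers.l2(lambtha))
dropout = layers.Dropout(rate=1.0 - keep_prob)  # Keras expects the drop rate, not the keep probability
```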

One last thing: we have to define our measure of gold, that is, what it is that we are looking for. This is called the satisficing metric. We will default to the simplest one, namely validation accuracy.

Let us put it all together in an expedition sheet (a.k.a. the requirements).

Requirements

  • Script to optimize five different hyperparameters: learning rate, dropout rate, L2 regularization weight, batch size, and number of units in the hidden layers (see the objective-function sketch after this list)
  • Optimize the model on a single satisficing metric: Validation accuracy
  • The model should save a checkpoint of its best iteration during each training session. The filename of the checkpoint should specify the values of the tuned hyperparameters.
  • Perform early stopping.
  • Bayesian optimization should run for a maximum of 30 iterations.
  • Plot the acquisition function as well as the convergence, after optimization.
  • Generate a report to the file ‘bayes_opt.txt’
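The actual script is linked at the end of the post; a hedged sketch of an objective function that would satisfy these requirements might look like the following (build_model is assumed here to be extended to accept all five hyperparameters, and x_train/x_valid, y_train/y_valid are assumed to be pre-loaded MNIST splits):

```python
import numpy as np
import tensorflow as tf

def objective(params):
    """Train once with the proposed hyperparameters and report validation accuracy.

    GPyOpt passes `params` as a 2-D array with one row:
    [learning_rate, keep_prob, lambtha, batch_size, units].
    """
    lr, keep_prob, lambtha, batch_size, units = params[0]
    model = build_model(lr=lr, keep_prob=keep_prob, lambtha=lambtha, units=int(units))

    # The checkpoint filename encodes the tuned hyperparameter values
    ckpt = (f"model_lr{lr:.5f}_kp{keep_prob:.2f}_l2{lambtha:.5f}"
            f"_bs{int(batch_size)}_u{int(units)}.h5")
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3),
        tf.keras.callbacks.ModelCheckpoint(ckpt, monitor='val_accuracy', save_best_only=True),
    ]
    history = model.fit(x_train, y_train,
                        validation_data=(x_valid, y_valid),
                        batch_size=int(batch_size), epochs=30,
                        callbacks=callbacks, verbose=0)

    # GPyOpt minimizes by default, so return the negative of the best validation accuracy
    return -np.max(history.history['val_accuracy'])
```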

We will perform the BayesOpt using GPyOpt, which is a Bayesian optimization library built on top of GPy. For a great tutorial on BayesOpt, as well as examples of implementations in several libraries, I refer you to Martin Krasser's post. Magnus Erik Hvass Pedersen's material is another excellent resource.

To run the BayesOpt, one important decision is the selection of the kernel. Our choice will be Matern52, the reason being that it is often used in ML settings. Look at this post for a discussion of the Matern kernel's merits. Another choice for the kernel could be the Radial Basis Function (RBF).
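Here is a sketch of how this could be wired up with GPyOpt (the domain bounds are illustrative, not the exact values used in the post; swapping the kernel for GPy.kern.RBF(len(domain)) gives the RBF alternative):

```python
import GPy
import GPyOpt

# Search space for the five hyperparameters (bounds chosen for illustration only)
domain = [
    {'name': 'learning_rate', 'type': 'continuous', 'domain': (1e-4, 1e-1)},
    {'name': 'keep_prob',     'type': 'continuous', 'domain': (0.5, 1.0)},
    {'name': 'lambtha',       'type': 'continuous', 'domain': (1e-5, 1e-2)},
    {'name': 'batch_size',    'type': 'discrete',   'domain': (32, 64, 128, 256)},
    {'name': 'units',         'type': 'discrete',   'domain': (16, 64, 256, 1024)},
]

# Matern52 kernel over the 5-dimensional search space
kernel = GPy.kern.Matern52(input_dim=len(domain))

optimizer = GPyOpt.methods.BayesianOptimization(
    f=objective,            # the training wrapper sketched above
    domain=domain,
    kernel=kernel,
    acquisition_type='EI',  # expected improvement
)
optimizer.run_optimization(max_iter=30)  # at most 30 iterations, as required

optimizer.plot_acquisition()             # acquisition function
optimizer.plot_convergence()             # convergence plot
optimizer.save_report('bayes_opt.txt')   # text report
```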

Results and Conclusions

First trial

First trial result: distance and best value.
Trial 1: accuracy and tested hyperparameters (learning rate, batch_size, keep_prob, L2 regularization weight (lambtha), units in hidden layers).

We were not able to improve on our first choice because it was already very good.

Trial 2

Let’s pretend we started with a less lucky choice.

Trial 2: accuracy and tested hyperparameters (learning rate, batch_size, keep_prob, L2 regularization weight (lambtha), units in hidden layers).
Trial 1: accuracy and tested hyperparameters (learning rate, batch_size, keep_prob, L2 regularization weight (lambtha), units in hidden layers).

Even with starting values far from optimal, the method was able to move to an area of high accuracy. With a small number of probes (evaluations), we significantly improved the validation accuracy. Also, the selected evaluation points stayed in the vicinity of a high number of hidden-layer units. Presumably, this hyperparameter has more influence than the others.

The starting point has a significant influence on where the next probe goes. The second point was significantly better than the first, so the method stayed in the high-number-of-units region and optimized the other parameters.

Trial 3

Here we will purposely use a defective kernel: instead of 5 dimensions, we will enter only one.

Trial 3: accuracy and tested hyperparameters (learning rate, batch_size, keep_prob, L2 regularization weight (lambtha), units in hidden layers).

From the result, we see the algorithm is somewhat lost. It bounces back and forth between low and high accuracy, as if it did not really know where to look, much as we would do by picking random points by chance.

Final thoughts

It seems we traded having to choose our model's hyperparameters for having to choose our Bayesian optimization algorithm's parameters. Besides, we have to give BayesOpt a range for each of the hyperparameters. Also, the selection of the two initial probes lies entirely with us.

As Prof. Nando de Freitas said in his Bayesian Optimization lecture at UBC, it is less of a commitment to choose a range than to select a single point. One may argue that we still need to produce the initial probes. That is indeed true, but although they have an influence, the algorithm can move away from them and find promising points.

And the BayesOpt method parameters? The most important is the choice of kernel. In that regard, we can say that there is at least an intuition, if not a complete theory, behind it, and that you can always rely on an old miner (a data scientist) telling you which one to use.

Here is a code snippet you can play with. I hope you have as much fun as I did.
