Bayesian optimization or how I carved boats from wood
As a kid, I spent my summers with my grandmother in a small village with no peers around and very little entertainment. While reading took up most of my time, I also used to carve small boats from wooden logs. My boats had three features to which I paid particular attention — the width of the boat, the length of the keel, and the length of the mast.
Generally, wider boats were more stable but slower, and they also did not look as good. Boats with longer masts required counter-balancing, resulting in longer keels, which slowed them down. I tested boats with random configurations and did not follow a systematic approach.
Designing boats was fun and I had all the time I wanted back then. Finding the best hyper-parameters for most machine learning models, on the other hand, is often an unavoidable necessity. Under tight deadlines this can be a stressful and very expensive task. While a hyper-parameter grid search might suffice for a single parameter, the number of configurations to test grows exponentially with every additional one!
Fortunately, there is help — Bayesian Optimization!
This article is a very short overview of the necessary components and is intended to provide scalable code in the form of a Jupyter notebook. Check it out!
When explaining a problem I find it useful to bear in mind the final goal. The goal of this article is to find hyper-parameters for a deep neural network that maximize the validation accuracy. For this we will be using Bayesian optimization. We can split the process into several steps (a schematic sketch of the loop follows the list):
- Initialize the network with values of the hyper-parameters;
- Train the model and save the best accuracy on the validation set;
- Fit a Gaussian Process (GP) to the performance data collected so far. Here, the hyper-parameter values tested so far are the independent variables and the accuracies of the model obtained with these hyper-parameter values are the predicted (dependent) variable;
- Use an acquisition function on the GP fit to obtain the next set of hyper-parameter values, with which the model is expected to attain better performance;
- Go to step 1.
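As a minimal sketch of this loop, here is some illustrative Python assuming a scikit-learn GP; `train_and_validate` and `propose_next` are hypothetical callables standing in for model training and the acquisition step discussed below (they are not part of the notebook):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def bayesian_optimization(train_and_validate, propose_next, bounds,
                          n_init=3, n_iter=20, seed=0):
    """bounds: array of shape (n_hyper_params, 2) holding [min, max] per parameter."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: evaluate a few random initial configurations.
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, len(bounds)))
    y = np.array([train_and_validate(x) for x in X])

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    for _ in range(n_iter):
        # Step 3: fit a GP to the (hyper-parameters, accuracy) pairs collected so far.
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
        # Step 4: the acquisition function proposes the next configuration.
        x_next = propose_next(gp, y.max(), bounds)
        # Back to steps 1-2 with the proposed configuration.
        X = np.vstack([X, x_next])
        y = np.append(y, train_and_validate(x_next))
    return X[np.argmax(y)], y.max()
```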
Thus the process requires two important components:
- Loss function estimation — this will be achieved with Gaussian Processes;
- Acquisition function — selecting the point with the highest expected improvement in the loss.
Gaussian Processes
Unlike linear models, for which we select a single function with a finite number of parameters, Gaussian Processes (GPs) are non-linear models and do not restrict us to one function: they allow for modelling the data with a whole family of functions.
While working with GPs, we assume that every data point is a normally distributed random variable and that every data point is related to every other. That is, the whole dataset can be described by a multi-variate normal distribution with a non-trivial covariance.
GPs are fully described by the mean function m and the kernel function K. The mean function is very straightforward — it takes in the data-point features and returns the mean of the predicted variable for that data point. The kernel function takes as input two data points and returns the covariance between them. The kernel describes the relationship between the data points and is thus an implicit constraint on how the resulting fit will look.
Let’s start with a simple example consisting of five data points. In our case let’s assume that the y-axis is the loss function of the machine learning algorithm we want to maximize and the x-axis is a hyper-parameter, such as the burn-in period, that impacts the prediction accuracy.
For our Gaussian process we assume that the loss is smooth with respect to the burn-in period — we would not expect the loss to change abruptly after a small change in the burn-in period. Thus we choose a very standard squared exponential kernel to enforce this smoothness prior.
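A common way to write this kernel (the exact parametrisation used in the notebook may differ):

```latex
k(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2 l^2}\right)
```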
Intuitively, if points x and x’ are close together, the covariance will be large, i.e. they have more influence on each other’s location compared to points further apart (here l and sigma are hyper-parameters of the kernel). Kernel functions are usually written in matrix form — the covariance matrix of the multivariate Gaussian distribution modelled with GPs. We can thus enjoy some of the properties this matrix has — positive-definiteness and symmetry.
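As a small illustration (not the notebook’s implementation), the kernel can be evaluated in matrix form as follows:

```python
import numpy as np


def squared_exponential(x1, x2, length_scale=1.0, sigma=1.0):
    """Covariance matrix K[i, j] = sigma^2 * exp(-(x1[i] - x2[j])^2 / (2 * length_scale^2))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)


# Nearby points have covariance close to sigma^2; distant points are nearly independent.
x = np.array([0.0, 0.1, 3.0])
K = squared_exponential(x, x)  # symmetric, positive (semi-)definite 3x3 matrix
```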
The choice of kernel is an art, and there is a lot of research on how to do this automatically, how to combine kernels together and how to choose the kernel for multi-dimensional data, as assumptions made for one dimension might not hold for another. More kernel visualizations can be found here. A whole branch of research is looking into learning kernels at model training time.
Let’s return to the problem and model our data with a GP.
Notice that even at the locations where we have data measurements (loss values) the standard deviation is non-zero. This is because, in addition to the squared exponential kernel, we use the WhiteNoise kernel — a small constant variance added to each point, allowing for some uncertainty in the predicted value. This will be useful when we look for the maximum of the function.
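A minimal sketch of such a fit with scikit-learn (the measurements below are illustrative, not the values from the plot):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative (burn-in period, loss) measurements.
X = np.array([[8500.0], [10000.0], [11500.0], [13000.0], [15000.0]])
y = np.array([0.61, 0.68, 0.74, 0.71, 0.63])

# Squared exponential (RBF) kernel plus a small white-noise term, so the fit
# keeps a non-zero standard deviation even at the observed points.
kernel = RBF(length_scale=2000.0) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Predicted mean and standard deviation over the burn-in range.
x_grid = np.linspace(8000, 16000, 200).reshape(-1, 1)
mean, std = gp.predict(x_grid, return_std=True)
```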
For more details on Gaussian Processes check out Neil Lawrence’s blog and this brilliant blog post.
Bayesian optimization — acquisition function
The task in this step is to find the value of the hyper-parameter that would improve on the current maximum (minimum) of the loss function we have fitted. To quantify the potential of each point to become the new maximum (minimum), for example in terms of the expected improvement or the probability of improvement, we use an acquisition function. The acquisition function takes the obtained GP fit (mean and variance at each point) and returns the hyper-parameter value for the next run of the machine learning algorithm. Below we consider two simple acquisition functions: probability of improvement and expected improvement.
Probability of improvement: With both the mean and the variance estimates, we are looking for the point with the highest probability of being the next peak.
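In a commonly used form, the probability of improvement is written as (the notation is explained just below):

```latex
PI(x) = P\left(f(x) \geq f(x^+) + \epsilon\right)
      = \Phi\left(\frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}\right)
```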
Here Φ is the standard normal cumulative distribution function, mu_t(x) is the predicted mean at point x, f(x+) is the value of the current maximum among the data points, epsilon is a small term that keeps the numerator from being exactly zero at the current maximum, and sigma_t(x) is the standard deviation at point x, which standardizes the distribution. The greater the mean, the higher the probability. On the other hand, a smaller standard deviation results in higher probability values, as we are more certain in the prediction. The probability of improvement thus discourages exploration of the space.
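A sketch of this acquisition function in code, assuming a fitted scikit-learn GP such as the one above (the helper name is illustrative):

```python
from scipy.stats import norm


def probability_of_improvement(gp, x_candidates, f_best, epsilon=0.01):
    """PI(x) = Phi((mu(x) - f_best - epsilon) / sigma(x)) for each candidate point."""
    mean, std = gp.predict(x_candidates, return_std=True)
    return norm.cdf((mean - f_best - epsilon) / (std + 1e-12))
```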
Expected improvement: Another common approach is to use the expected improvement. This approach allows points with higher estimated variance to have a greater chance of being selected. In this case the next point to be selected would maximize the expected improvement, x_(t+1) = argmax_x EI(x).
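A commonly used closed form of EI under a Gaussian posterior is (the exact expression in the derivation referenced below may differ in notation):

```latex
EI(x) =
\begin{cases}
\left(\mu_t(x) - f(x^+) - \epsilon\right)\Phi(Z) + \sigma_t(x)\,\phi(Z) & \text{if } \sigma_t(x) > 0\\
0 & \text{if } \sigma_t(x) = 0
\end{cases}
\qquad
Z = \frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}
```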
For more details on the derivation refer to this article. In the examples below I will be using expected improvement as it has been shown to produce better samples. Expected improvement for the function above is given below.
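Before looking at the plot, here is a corresponding sketch in code mirroring the closed form above (again assuming a fitted scikit-learn GP):

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(gp, x_candidates, f_best, epsilon=0.01):
    """EI(x) = (mu - f_best - eps) * Phi(Z) + sigma * phi(Z), Z = (mu - f_best - eps) / sigma."""
    mean, std = gp.predict(x_candidates, return_std=True)
    improvement = mean - f_best - epsilon
    z = improvement / (std + 1e-12)
    ei = improvement * norm.cdf(z) + std * norm.pdf(z)
    ei[std < 1e-12] = 0.0  # no expected improvement where the model is certain
    return ei
```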
The EI criterion has two distinct peaks at around 11800 and 12900, with flat tails in [8000, 10500] and [14200, 16000]. We can revisit the GP fit in the plot above and notice that the peaks correspond to the values of the fit whose sum of mean and standard deviation is the largest.
Other strategies, based for example on Thompson Sampling, can be found in tutorial 1 and tutorial 2. For even more examples refer to this Distill article.
This article is based on a Jupyter notebook. I have structured the example to be scalable — adding more variables to optimize only requires setting up the names of these variables, their ranges and the number of sampling locations. Thus, the two examples below required minimal code changes.
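The exact interface lives in the notebook; as a purely hypothetical illustration, such a configuration could look like this:

```python
# Hypothetical configuration in the spirit described above: each hyper-parameter
# gets a name, a range and a number of candidate sampling locations.
hyper_parameters = {
    "burn_in":       {"range": (8000, 16000), "n_samples": 100},
    "learning_rate": {"range": (1e-5, 1e-1),  "n_samples": 100},
}
```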
In both examples I use Expected Improvement as the acquisition function. Try running the code yourself and replacing it with the Probability of Improvement.
1D Example
Let’s pretend that we are training a Transformer network, for which we would need to find the best value for the burn-in hyperparameter.
Selected points gradually climb to the maximum of the function over time. As more points are sampled around the maximum, the variance around it decreases. At the same time, even though the predicted mean value in the unexplored regions is smaller than the maximum, the high variance there allows for exploration of the untouched space.
2D Example
The 1D case is not very practical, as there are usually many more hyper-parameters to optimize. In this example we consider a 2D case with two hyper-parameters: the burn-in period and the learning rate. Visualising sampled points in 3D might be difficult, so below I also show the contour plot for the data — a view from the top.
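As a sketch of how a next 2D configuration can be picked (illustrative data, reusing the expected_improvement helper from above), one can evaluate the acquisition function on a grid over both hyper-parameters and take the argmax:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative 2D observations: (rescaled burn-in, rescaled learning rate) -> accuracy.
X = np.array([[0.1, 0.2], [0.4, 0.8], [0.6, 0.3], [0.9, 0.7]])
y = np.array([0.62, 0.71, 0.78, 0.66])

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=[0.3, 0.3]) + WhiteKernel(noise_level=1e-4),
    normalize_y=True,
).fit(X, y)

# Evaluate the acquisition function on a grid over both hyper-parameters
# and pick the argmax as the next configuration to train with.
b, lr = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.column_stack([b.ravel(), lr.ravel()])
ei = expected_improvement(gp, grid, f_best=y.max())
next_point = grid[np.argmax(ei)]
```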
Both examples have eventually converged; running for more iterations would result in finer estimates. A typical stopping criterion is the improvement gained at every new step: once it drops below a chosen threshold, the search can be stopped.
Practicalities
Running Bayesian optimization directly on the original hyper-parameter values, as well as on the original range of the loss function we are modelling, can have a detrimental effect, as the kernel function does not take the varied ranges into account. To account for this in the optimization I have re-scaled the data to the 0–1 range.
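A minimal example of this rescaling (simple min-max scaling; the notebook may implement it differently):

```python
import numpy as np


def rescale(values, low, high):
    """Map hyper-parameter values from [low, high] to [0, 1] before fitting the GP."""
    return (np.asarray(values, dtype=float) - low) / (high - low)


def unscale(values, low, high):
    """Map points proposed by the acquisition function back to the original range."""
    return np.asarray(values, dtype=float) * (high - low) + low


burn_in_scaled = rescale([8500, 10000, 11500, 13000, 15000], low=8000, high=16000)
```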
Since the final accuracy is subject to many conditions, such as the initialization of the machine learning algorithm, we allow for some noise in the prediction by modelling it with the WhiteNoise kernel.
Bayesian optimization is very expensive and practical considerations should be taken into account. For more details on how to run it in practice check out this paper.
For those more interested in an integrated solution, check out Weights & Biases, which provides a ready-made suite for tracking hyper-parameters and metrics of your model, along with its own implementation of Bayesian Optimization.
Distill has a great article on Bayesian Optimization, and definitely check out this review paper from Oxford!
Don’t forget that this article is based on a Jupyter notebook. Check it out!
Many thanks to Param Hanji for his valuable suggestions on the content!