Hyperparameter Optimization — Three parts of a magic trick

Vishwesh Kirthivasan
6 min read · Apr 11, 2021

Hyperparameters are the non-trainable parameters of a neural network model. They include attributes such as the number and types of layers in the architecture, the amount of dropout in each layer, the learning rate, and so on. This list also gives an idea of how diverse these parameters can be: they may be categorical or numerical, differentiable or non-differentiable. In this blog, we discuss two important papers that introduced hyperparameter optimization methods now very widely used in the ML fraternity: Random Search¹ and Sequential Model Based Optimization². But before we get into those papers, let’s set up some context!

Hyperparameter optimization is the process of identifying the hyperparameter configuration that yields the best value of an objective function, usually a loss function of a neural network.

Mathematically, hyperparameter optimization can be written as x* = argmin over x in X of f(x), where f is the objective function (for example, the validation loss) and X is the space of hyperparameter configurations x.
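To make the setup concrete, here is a minimal Python sketch of the quantities involved. The objective is a synthetic stand-in for “train the model and report the validation loss”, and all names and ranges are illustrative, not taken from the papers discussed below.

```python
import math

def objective(config):
    """Toy stand-in for "train a model with these hyperparameters and
    return the validation loss". A real objective would run full training."""
    lr, dropout = config["lr"], config["dropout"]
    # Pretend the best settings are lr ~ 1e-2 and dropout ~ 0.3.
    return (math.log10(lr) + 2) ** 2 + (dropout - 0.3) ** 2

# The search space X: one range per hyperparameter.
search_space = {"lr": (1e-5, 1e-1), "dropout": (0.0, 0.6)}

# Hyperparameter optimization looks for the configuration in X that minimizes objective().
print(objective({"lr": 1e-2, "dropout": 0.3}))  # 0.0 for this toy function
```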

Why Hyperparameter Optimization?

The choice of hyperparameters can make or break a system, and the field of machine learning is rightly recognizing the importance of hyperparameter optimization. A substantial amount of research in the field is therefore devoted to it.

For the longest time, hyperparameter optimization has been done manually: machine learning engineers try out various settings, observe the behavior of the model, and rely on intuition to decide which parameters to try next. Not only is this a lot of repetitive work that leans more on human intuition than on math, it is also extremely time consuming to retrain the models multiple times. All that said, it remains one of the two most widely used techniques for hyperparameter optimization (the other being exhaustive search, defined below), so much so that a term has been coined for manual hyperparameter optimization: graduate student descent.

From here on, we are going to break the topic down into three parts of a magic trick, as in the movie The Prestige!

The Pledge — Exhaustive Search

Here we enumerate an exhaustive list of hyperparameter configurations. Each configuration requires the model to be trained until convergence, and the configuration with the best performance on a held-out validation set is selected as the optimal hyperparameter setting. These methods belong to the class of “embarrassingly parallel” problems: the different configurations are independent of each other, so each training run can execute in parallel, constrained only by the resources we have.
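Because the runs are independent, they can be distributed with any parallel map. A minimal sketch using Python’s standard library follows; `evaluate` is a toy stand-in for a full training run.

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate(config):
    """Toy stand-in for training one model to convergence and reporting its
    validation loss for a fixed hyperparameter configuration."""
    lr, dropout = config
    return (lr - 0.01) ** 2 + (dropout - 0.3) ** 2

# Independent configurations: nothing about one run depends on another.
configs = [(lr, d) for lr in (0.001, 0.01, 0.1) for d in (0.1, 0.3, 0.5)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:          # run the trials in parallel
        losses = list(pool.map(evaluate, configs))
    best_loss, best_config = min(zip(losses, configs))
    print("best configuration:", best_config, "loss:", best_loss)
```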

Grid Search

Here we discretize the search space of each hyperparameter. The algorithm launches a training run for every combination of hyperparameter values, and we simply choose the best one at the end. Grid search suffers from the curse of dimensionality, since the number of configurations grows exponentially with the number of hyperparameters.
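A minimal grid search sketch; the objective is again a toy stand-in and the grid values are arbitrary illustrations.

```python
import itertools
import math

def objective(lr, dropout):
    """Toy validation loss; in practice this is a full training run."""
    return (math.log10(lr) + 2) ** 2 + (dropout - 0.3) ** 2

# Discretize each hyperparameter, then try every combination.
grid = {"lr": [1e-4, 1e-3, 1e-2, 1e-1], "dropout": [0.0, 0.2, 0.4]}

results = []
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    results.append((objective(**config), config))

print("best:", min(results, key=lambda r: r[0]))
# This grid has 4 * 3 = 12 points; every added hyperparameter multiplies the
# count, which is the curse of dimensionality mentioned above.
```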

Random Search¹

This is a small variation of the previous algorithm: instead of predefined values for each hyperparameter, the algorithm randomly samples configurations from the search space. There is therefore no natural end to the algorithm; instead we usually set a budget, such as a number of independent trials. Like grid search, it suffers from the curse of dimensionality and can be massively parallelized. In practice, however, random search has been found to get closer to the optimum of each parameter, as can be seen in the image below. This is partly because random search is not tied to a fixed discrete grid, so sampled values can land in regions closer to the optimum.

Since random search samples configurations freely rather than from a fixed grid, it has a greater chance of landing close to the optimal configuration
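A minimal random search sketch under the same toy objective; the log-uniform range for the learning rate and the trial budget are illustrative choices, not values from the paper.

```python
import math
import random

def objective(lr, dropout):
    """Toy validation loss standing in for a full training run."""
    return (math.log10(lr) + 2) ** 2 + (dropout - 0.3) ** 2

random.seed(0)
budget = 20                      # the "time budget": number of independent trials
best_loss, best_config = float("inf"), None

for _ in range(budget):
    # Sample each hyperparameter from its range instead of from a fixed grid;
    # log-uniform sampling is a common choice for learning rates.
    config = {"lr": 10 ** random.uniform(-5, -1),
              "dropout": random.uniform(0.0, 0.6)}
    loss = objective(**config)
    if loss < best_loss:
        best_loss, best_config = loss, config

print("best:", best_config, "loss:", best_loss)
```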

The Turn — Sequential Model Based Optimization²

In exhaustive search, you might have noticed that we did not keep track of past runs when choosing the next hyperparameter configuration. In fact, this is partly why the different training runs could be parallelized completely. In Sequential Model Based Optimization (SMBO), by contrast, we keep track of past performance to guide the choice of the next configuration to test.

Steps in Sequential Model Based Optimization

Some Terminologies

Surrogate Model — A cheap-to-evaluate model that approximates the true objective function, built from the hyperparameter configurations evaluated so far. With each iteration of the algorithm another evaluation is added, and the surrogate becomes a better approximation of the objective. The surrogate is much easier to optimize than the objective function itself, so by spending a little extra time identifying the best candidate for the next iteration, we need far fewer of the expensive training runs.

Surrogate Model after 1 Iteration
Surrogate Model after 8 Iterations

Expected Improvement Function (Selection Function) — the criterion by which the next set of hyperparameters is chosen using the surrogate model, given by

EI_{y*}(x) = ∫ from −∞ to y* of (y* − y) p(y | x) dy

where:

y* — threshold value of the objective function,

x — proposed set of hyperparameters,

y — actual value of the objective function using hyperparameters x, and

p(y | x) — surrogate probability model expressing the probability of y given x.
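The expected improvement has a closed form when the surrogate p(y | x) is assumed to be Gaussian, which is the case for the Gaussian process surrogate discussed in the next section. Below is a small sketch under that assumption; the function name and example values are mine, not from the papers.

```python
import math

def expected_improvement(mu, sigma, y_star):
    """EI_{y*}(x) = integral over y < y* of (y* - y) p(y | x) dy, evaluated in
    closed form assuming p(y | x) is Gaussian with mean `mu` and standard
    deviation `sigma` (minimization: smaller y is better)."""
    if sigma <= 0:
        return max(y_star - mu, 0.0)
    z = (y_star - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (y_star - mu) * cdf + sigma * pdf

# A configuration whose surrogate predicts a low mean or a high uncertainty
# gets a larger expected improvement and is tried next.
print(expected_improvement(mu=0.4, sigma=0.1, y_star=0.5))
print(expected_improvement(mu=0.6, sigma=0.3, y_star=0.5))
```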

Bayesian Optimization³

Gaussian Process

A Gaussian process models the objective y directly through the surrogate probability model P(y | x), for the various hyperparameter configurations x. Standard Gaussian processes work only on continuous hyperparameters, not on categorical ones. Their strength is that they can model the interaction between parameters.
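Below is a rough sketch of Gaussian-process-based Bayesian optimization on a toy 1-D problem, using scikit-learn’s GaussianProcessRegressor as the surrogate and expected improvement as the selection function. The toy objective, kernel choice, candidate count, and iteration budget are all arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Toy 1-D objective (think: dropout rate -> validation loss)."""
    return (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(3, 1))     # a few random initial trials
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(15):
    gp.fit(X, y)                                          # refit the surrogate
    candidates = rng.uniform(0.0, 1.0, size=(200, 1))     # candidate configurations
    mu, sigma = gp.predict(candidates, return_std=True)
    y_star = y.min()                                      # threshold: best loss so far
    z = (y_star - mu) / np.maximum(sigma, 1e-9)
    ei = (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = candidates[np.argmax(ei)].reshape(1, 1)        # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())             # expensive evaluation

print("best x:", X[np.argmin(y)].item(), "best loss:", y.min())
```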

Tree-Structure Parzen Estimator (TPE)

While the Gaussian process directly models P(y | x), TPE models P(x | y), given by

p(x | y) = l(x) if y < y*, and p(x | y) = g(x) if y ≥ y*,

and then uses Bayes’ rule to get P(y | x). Here l(x) and g(x) are two densities over configurations: l(x) is fit to the configurations whose objective values fell below the threshold y*, and g(x) to the rest. Essentially, TPE treats objective values greater than or equal to y* differently from values less than y*. Because each of the hyperparameters in x is modeled independently of the rest, TPE cannot capture interactions between parameters; however, it works on both continuous and categorical parameters.
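In practice one rarely implements TPE by hand; the hyperopt library (a commonly used TPE implementation, not mentioned in the original post) exposes it through `tpe.suggest`. A minimal sketch with a toy objective, assuming hyperopt is installed; the search space and trial budget are illustrative.

```python
import math
from hyperopt import fmin, tpe, hp, Trials

def objective(config):
    """Toy validation loss; note the mix of continuous and categorical parameters."""
    penalty = 0.0 if config["activation"] == "relu" else 0.1
    return (math.log10(config["lr"]) + 2) ** 2 + (config["dropout"] - 0.3) ** 2 + penalty

space = {
    # hp.loguniform takes its bounds in log space
    "lr": hp.loguniform("lr", math.log(1e-5), math.log(1e-1)),
    "dropout": hp.uniform("dropout", 0.0, 0.6),
    "activation": hp.choice("activation", ["relu", "tanh"]),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)  # for hp.choice parameters, fmin reports the index of the chosen option
```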

Below, we can see the difference between using random search and TPE for hyperparameter optimization. The TPE plot is more directed, since it has learned from past trials.

TPE vs Random

Summary

We discussed what hyperparameters are and why hyperparameter optimization is so important to the field of machine learning. We then covered some of the most prevalent hyperparameter optimization methods, starting from manual tuning (graduate student descent), through exhaustive search (grid and random), to sequential model based optimization (Bayesian optimization with Gaussian processes and TPE). There are other sequential methods as well, which are beyond the scope of this blog. More recently, sophisticated algorithms have appeared that use ideas from sequential model based optimization while trying to reduce the running time. Two important related concepts, Successive Halving and Asynchronous Successive Halving, are building blocks of one of the newest methods, Hyperband, a multi-armed-bandit approach to hyperparameter optimization.
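As a taste of those ideas, here is a minimal sketch of successive halving with a toy, budget-dependent loss; the halving rate, starting budget, and number of configurations are illustrative choices rather than settings from any particular paper.

```python
import math
import random

def partial_loss(config, budget):
    """Toy stand-in for "train this config for `budget` epochs and report the loss";
    more budget gives a less noisy estimate."""
    noise = random.gauss(0.0, 1.0 / budget)
    return (math.log10(config["lr"]) + 2) ** 2 + noise

random.seed(0)
configs = [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(16)]
budget = 1

# Successive halving: evaluate every configuration cheaply, keep the best half,
# double the per-configuration budget, and repeat until one configuration remains.
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: partial_loss(c, budget))
    configs = scored[: len(configs) // 2]
    budget *= 2

print("surviving configuration:", configs[0])
```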

References:

Random Search —

1. Bergstra & Bengio, “Random Search for Hyper-Parameter Optimization”, JMLR 13 (2012): https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a

SMBO —

2. Hutter, Hoos & Leyton-Brown, “Sequential Model-Based Optimization for General Algorithm Configuration”, LION 5 (2011): https://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf

3. “A Conceptual Explanation of Bayesian Model-Based Hyperparameter Optimization for Machine Learning”, Towards Data Science: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

Other Blogs —

4. “Hyper-parameter Optimization Algorithms”, Criteo Engineering: https://medium.com/criteo-engineering/hyper-parameter-optimization-algorithms-2fe447525903
