Probabilistic-Numerics.org

# Connections, Part III: Bayesian Optimization

### By Philipp Hennig, 2015-01-16 07:00:00 +0100

As I go around presenting the idea of probabilistic numerics to various audiences, certain questions about related areas come up repeatedly. This post explains how probabilistic numerics compares to the area of Bayesian Optimization. Previous posts discussed connections to uncertainty quantification and stochastic methods.

A disclaimer: Obviously, everyone has different opinions about the scope and purpose of certain concepts and academic fields. And I am not an expert in the areas discussed here. This post relates a part of my own personal justification, why I think certain ideas are novel and interesting. It is not intended as a holistic overview of a field. If anyone disagrees with characterizations I make here, I would be glad if you could relate your opinion in the comments section below.

One of the most frequent questions regarding PN from the machine learning community is “what about Bayesian Optimization? Shouldn’t it be part of PN, too?” Some of things I write here arose from an email thread mainly driven by Roman Garnett, John Cunningham, Mike Osborne and myself.

Bayesian Optimization (BO) is a probabilistic description of the task of finding the global extremum of a function that is not “directly” accessible, either because it is embodied in some physical process, or because it has very high evaluation cost. A good example is the search for good parameters of a robotic control problem (where each function evaluation involves a physical experiment), or finding good setups for a large machine learning algorithm, such as a deep net (where each function evaluation involves training the net to convergence, which may take weeks on a cluster).

BO is a relatively young community, but already very successful. A workshop on BO is now an annual fixture at NIPS. The recent 2014 installment attracted around 50 people (my own rough guess), making it one NIPS’s larger satellites. And there is some personal overlap between the PN and BO communities, for example through Roman and Mike (together with Christian Schuler, I also contributed a BO algorithm myself (Hennig & Schuler, 2012)). Nevertheless, in my personal opinion, there is a difference in focusses that makes BO rather different from what we are trying to achieve with probabilistic numerical methods.

There is a spectrum of algorithms in BO, but most Bayesian optimizers fit into the following scheme: We aim to find the minimum of $f(x):\mathbb{X}\to\mathbb{R}$. To decide at which $x_*$ to collect the next datapoint(s) $y_*$, the algorithm builds a regression posterior $p(f\mid X,Y)$ from previously collected evaluations $(X,Y)$. This is the “Bayesian” bit, and virtually always involves Gaussian process models. Said posterior is then transformed into a utility for the next evaluation, $u[x_*,p(f\mid X,Y)]$. For example, this functional may predict how likely $y_*$ is to be lower than previous function values, or the expected distance between $y_*$ and the previous lowest achieved value, or something altogether more complicated, like the expected information gain about the location of the minimum from an evaluation at $x_*$. Then – and this is crucial in this context – the Bayesian global optimizer performs a numerical global optimization to find the global minimum $\operatorname{arg min}_{x_*}(u)$. How exactly this is done varies across implementations. There are two main differences between the original optimization task of the Bayesian optimizer and this new numerical global optimization problem: The latter is free of noise, and the objective may be a lot cheaper, because it only involves computations on the surrogate, not physical experiments. The “structural complexity” (the shape) of the two problems, may be quite similar though.

In this sense BO “just” turns a tough physical optimization problem into a tough numerical optimization problem (the quotation marks are truly necessary, the models used in this are are anything but trivial). This doesn’t mean BO is pointless. It has been hugely successful recently at providing automated, surprisingly smart strategies saving time and money on design problems. But for our purposes it puts BO on a conceptual level “above” the base layer of numerics.

In our email thread, Roman gave a pointed criticism to this characterization: Don’t all numerical methods just turn a hard problem into an easier one? Quasi-Newton methods cheaply approximate expensive Hessian functions, for example. This is true, and one could have a philosophical debate about the boundaries between a numerical method and an algorithm merely using numerics (Bayesian quadrature methods, for example, may well use a numerical optimization algorithm to design their evaluation strategy). But what is more important from a practical standpoint is the difference in research focus: Having attended most of the recent NIPS workshop on BO, I would say the focus of work in this community is currently on more elaborate structured models (high-dimensional, heteroscedastic, with varying evaluation cost, etc.), and on identifying interesting application areas (molecular biology, automated machine learning, robotics, etc.). There is only limited debate about the computational cost of BO methods, and on the empirical robustness of these algorithms. This make sense, because BO algorithms are quite elaborate; so their behaviour and cost is difficult to analyse.

In contrast, numerical methods form the base layer, the inner loop of algorithms across many problem domains. They have to be parsimonious, and trustworthy. When building probabilistic numerical methods, we have to compare ourselves with the elegant, rugged algorithms built in the applied mathematical communities over the past century. In my opinion it would be a mistake to demand a large computational budget from our potential users. Instead we need to show that probabilistic functionality can deliver efficiency and expressivity gains at very limited overhead. In so far, I think it is a good strategy to keep the developments in BO and PN separate for the time being.

However, our tiny cottage industry can learn a lot from the great success that BO has had in recent years. In about half a decade, BO has started from an existing basis of just a handful of (much older) early papers to a thriving community with many different competing key players. There are about as many existing basic papers on PN ideas as there were on BO a few years ago. I would be thrilled if, in 5 years, PN were where BO is today.

### References

1. Hennig, P., & Schuler, C. J. (2012). Entropy Search for Information-Efficient Global Optimization. Journal of Machine Learning Research, 13, 1809–1837.

Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly adresses the decision problem of maximizing information gain from each evaluation.

@article{HennigS2012,
title = {Entropy Search for Information-Efficient Global Optimization},
author = {Hennig, P. and Schuler, CJ.},
month = jun,
volume = {13},
pages = {1809-1837},
journal = {Journal of Machine Learning Research},
year = {2012},
file = {http://jmlr.csail.mit.edu/papers/volume13/hennig12a/hennig12a.pdf},