- May 18, 2016 Probabilistic Numerics for Computer Scientists
- Dec 3, 2015 Probabilistic Integration
- Jan 16, 2015 Connections, Part III: Bayesian Optimization
- Jan 15, 2015 Connections, Part II: Stochastic numerical methods
- Jan 14, 2015 Connections, Part I: Uncertainty Quantification
- Sep 5, 2014 Tübingen Manifesto: Community
- Sep 3, 2014 Tübingen Manifesto: Priors and Prior Work
- Sep 1, 2014 Tübingen Manifesto: Probabilistic Numerics and Probabilistic Programming
- Aug 27, 2014 Tübingen Manifesto: Uncertainty
- Aug 22, 2014 Roundtable in Tübingen

*As I go around presenting the idea of probabilistic numerics to various
audiences, certain questions about related areas come up repeatedly. This post
explains how probabilistic numerics compares to existing ideas of using
stochasticity in numerical problems.
A previous post discussed connections to
uncertainty quantification. A subsequent one will look at
Bayesian Optimization.*

*A disclaimer: Obviously, everyone has different opinions about the scope and
purpose of certain concepts and academic fields. And I am not an expert in the
areas discussed here. This post relates a part of my own personal
justification, why I think certain ideas are novel and interesting. It is not
intended as a holistic overview of a field. If anyone disagrees with
characterizations I make here, I would be glad if you could relate your
opinion in the comments section below.*

Using **stochasticity** to solve deterministic problems is anything but a new idea
in numerical mathematics. Random numbers have been employed in at least two quite different
ways for numerical purposes, one lowering cost and precision, the other
increasing cost and robustness.

*Projecting* a large problem onto a randomly chosen smaller space can reduce
computational cost while introducing a new kind of imprecision. Such methods
are called *randomized* or, unfortunately, also *probabilistic* algorithms
(Liberty et al., 2007)
(Halko et al., 2011).
The key idea is that given a particular numerical problem spanning variables
from a certain high-dimensional space, one chooses a random projection into a
space of much lower dimensionality, and solves the problem in that space. The
surprising aspect, the strength of this approach, is that it tends to cause a
lot less deficiencies than one would intuitively assume; and this has been
studied quite rigorously.

Repeatedly solving randomly *perturbed* variants of a problem can help quantify
the robustness of a task. This is in some sense the counterpart to the above
projection approach: Instead of removing degrees of freedom to get drastically
lower cost at surprisingly low loss of precision, perturbation methods enrich
the description of a problem to probe new interesting aspects – at drastically
higher cost. (For clarity: I’m not talking about
“perturbation methods”
which are a theoretical tool, not a computational one, and random numbers only
play a conceptual role there).

A third, much-studied area that is of particular relevance for probabilistic
numerics are Monte Carlo
methods. In statistics and machine learning, MC methods are used primarily to
compute expectations and marginals – i.e., for quadrature. In contrast to the
two other uses of random numbers above, MC methods really solve an unmodified
given numerical integration task. Although they can be quite elaborate
algorithms, the basic idea is quickly explained: To compute the intractable
expectation , draw samples
, and approximate . This is an unbiased statistical estimator which converges at the
*statistically* optimal rate of . If you can’t draw
exact samples from , you have to compute them approximately, and this is
where a lot of the research (particularly in Markov Chain MC methods) has
focussed over the past decades.

The area currently of biggest interest for probabilistic numerics are complementary to the two former settings described above. With a probabilistic numerical method, we do not necessarily want to reduce a given problem to a less costly variant, but are interested in as good a solution to the actual problem at hand (however, it is interesting that recent results suggest that choosing projections in a non-random, “most informative” way can be done at reasonable cost and may improve performance (Garnett et al., 2014)). And while we may well wonder about the sensitivity of the found solution to perturbations of the task, our chief interest is in the error created by the approximate computation itself, and the sensitivity of our computation’s result to further steps.

The story is much more intricate—and exciting—with regards to MC
methods. Recently, I have found myself repeatedly in discussions with
colleagues about the helpfulness of random numbers for computations, in
connection with our recent NIPS paper on fast Bayesian quadrature
(Gunter et al., 2014). There is a famous
polemic by Tony O’Hagan (O’Hagan, 1987) pointing out that
randomness leads to serious inefficiencies when performing a deterministic
computation. It is well-known that, on low-dimensional problems, classic
quadrature rules can converge *much* faster than MC estimators: Depending on
the rule used, and the smoothness of the integrand, for
is not unusual. Classic quadrature rules are identified with maximum a
posteriori estimators under various Gaussian process priors over the integrand
(Minka, 2000). Extending on this insight, our
recent NIPS paper presents a general purpose quadrature method for strictly
positive integrands (such as the probability distribution ) which
empirically outperforms (in wall-clock time) established MCMC methods. Given
the ubiquity of MC methods, this kind of result is very exciting, and once again
suggests an area of research that is only just beginning to take shape.

- Garnett, R., Osborne, M., & Hennig, P. (2014). Active Learning of Linear Embeddings for Gaussian Processes. In N. L. Zhang & J. Tian (Eds.),
*Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence*(pp. 230–239). AUAI Press. http://auai.org/uai2014/proceedings/individuals/152.pdfWe propose an active learning method for discovering low-dimensional structure in high-dimensional Gaussian process (GP) tasks. Such problems are increasingly frequent and important, but have hitherto presented severe practical difficulties. We further introduce a novel technique for approximately marginalizing GP hyperparameters, yielding marginal predictions robust to hyperparameter misspecification. Our method offers an efficient means of performing GP regression, quadrature, or Bayesian optimization in high-dimensional spaces.

@inproceedings{GarnettOH2013, title = {Active Learning of Linear Embeddings for Gaussian Processes}, author = {Garnett, R. and Osborne, M. and Hennig, P.}, booktitle = {Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence}, editor = {Zhang, NL and Tian, J}, publisher = {AUAI Press}, pages = {230-239}, year = {2014}, url = {http://auai.org/uai2014/proceedings/individuals/152.pdf}, url2 = {https://github.com/rmgarnett/mgp}, department = {Department Sch{\"o}lkopf}, file = {http://auai.org/uai2014/proceedings/individuals/152.pdf}, code = {https://github.com/rmgarnett/mgp} }

- Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. (2014). Sampling for Inference in Probabilistic Models with Fast
Bayesian Quadrature. In C. Cortes & N. Lawrence (Eds.),
*Advances in Neural Information Processing Systems (NIPS)*.We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

@inproceedings{gunter14-fast-bayesian-quadrature, author = {Gunter, Tom and Osborne, Michael A. and Garnett, Roman and Hennig, Philipp and Roberts, Stephen}, title = {Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature}, booktitle = {Advances in Neural Information Processing Systems (NIPS)}, year = {2014}, editor = {Cortes, C. and Lawrence, N.}, code = {https://github.com/OxfordML/wsabi} }

- Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions.
*SIAM Review*,*53*(2), 217–288.@article{halko2011finding, title = {Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions}, author = {Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A}, journal = {SIAM review}, volume = {53}, number = {2}, pages = {217--288}, year = {2011}, publisher = {SIAM} }

- Liberty, E., Woolfe, F., Martinsson, P.-G., Rokhlin, V., & Tygert, M. (2007). Randomized algorithms for the low-rank approximation of
matrices.
*Proceedings of the National Academy of Sciences*,*104*(51), 20167–20172.@article{liberty2007randomized, title = {Randomized algorithms for the low-rank approximation of matrices}, author = {Liberty, Edo and Woolfe, Franco and Martinsson, Per-Gunnar and Rokhlin, Vladimir and Tygert, Mark}, journal = {Proceedings of the National Academy of Sciences}, volume = {104}, number = {51}, pages = {20167--20172}, year = {2007} }

- Minka, T. P. (2000).
*Deriving quadrature rules from Gaussian processes*. Statistics Department, Carnegie Mellon University.Quadrature rules are often designed to achieve zero error on a small set of functions, e.g. polynomials of specified degree. A more robust method is to minimize average error over a large class or distribution of functions. If functions are distributed according to a Gaussian process, then designing an average-case quadrature rule reduces to solving a system of 2n equations, where n is the number of nodes in the rule (O’Hagan, 1991). It is shown how this very general technique can be used to design customized quadrature rules, in the style of Yarvin & Rokhlin (1998), without the need for singular value decomposition and in any number of dimensions. It is also shown how classical Gaussian quadrature rules, trigonometric lattice rules, and spline rules can be extended to the average-case and to multiple dimensions by deriving them from Gaussian processes. In addition to being more robust, multidimensional quadrature rules designed for the average-case are found to be much less ambiguous than those designed for a given polynomial degree.

@techreport{minka2000deriving, author = {Minka, T.P.}, institution = {Statistics Department, Carnegie Mellon University}, title = {{Deriving quadrature rules from {G}aussian processes}}, year = {2000}, link = {http://research.microsoft.com/en-us/um/people/minka/papers/quadrature.html} }

- O’Hagan, A. (1987). Monte Carlo is fundamentally unsound.
*The Statistician*, 247–249.@article{o1987monte, title = {Monte Carlo is fundamentally unsound}, author = {O'Hagan, Anthony}, journal = {The Statistician}, pages = {247--249}, year = {1987}, link = {http://www.jstor.org/stable/2348519?seq=1#page_scan_tab_contents} }