Probabilistic-Numerics.org

Connections, Part II: Stochastic numerical methods

By Philipp Hennig, 2015-01-15 07:00:00 +0100

As I go around presenting the idea of probabilistic numerics to various audiences, certain questions about related areas come up repeatedly. This post explains how probabilistic numerics compares to existing ideas of using stochasticity in numerical problems. A previous post discussed connections to uncertainty quantification. A subsequent one will look at Bayesian Optimization.

A disclaimer: Obviously, everyone has different opinions about the scope and purpose of certain concepts and academic fields. And I am not an expert in the areas discussed here. This post relates a part of my own personal justification, why I think certain ideas are novel and interesting. It is not intended as a holistic overview of a field. If anyone disagrees with characterizations I make here, I would be glad if you could relate your opinion in the comments section below.

Using stochasticity to solve deterministic problems is anything but a new idea in numerical mathematics. Random numbers have been employed in at least two quite different ways for numerical purposes, one lowering cost and precision, the other increasing cost and robustness.

Randomized methods

Projecting a large problem onto a randomly chosen smaller space can reduce computational cost while introducing a new kind of imprecision. Such methods are called randomized or, unfortunately, also probabilistic algorithms (Liberty, Woolfe, Martinsson, Rokhlin, & Tygert, 2007) (Halko, Martinsson, & Tropp, 2011). The key idea is that given a particular numerical problem spanning variables from a certain high-dimensional space, one chooses a random projection into a space of much lower dimensionality, and solves the problem in that space. The surprising aspect, the strength of this approach, is that it tends to cause a lot less deficiencies than one would intuitively assume; and this has been studied quite rigorously.

Perturbations

Repeatedly solving randomly perturbed variants of a problem can help quantify the robustness of a task. This is in some sense the counterpart to the above projection approach: Instead of removing degrees of freedom to get drastically lower cost at surprisingly low loss of precision, perturbation methods enrich the description of a problem to probe new interesting aspects – at drastically higher cost. (For clarity: I’m not talking about “perturbation methods” which are a theoretical tool, not a computational one, and random numbers only play a conceptual role there).

Monte Carlo methods

A third, much-studied area that is of particular relevance for probabilistic numerics are Monte Carlo methods. In statistics and machine learning, MC methods are used primarily to compute expectations and marginals – i.e., for quadrature. In contrast to the two other uses of random numbers above, MC methods really solve an unmodified given numerical integration task. Although they can be quite elaborate algorithms, the basic idea is quickly explained: To compute the intractable expectation $\langle f\rangle_p= \int f(x) p(x) dx$, draw $N$ samples $x_i\sim p$, and approximate $\langle f\rangle_p\approx \frac{1}{N}\sum_i f(x_i)$. This is an unbiased statistical estimator which converges at the statistically optimal rate of $\mathcal{O}(N^{-1/2})$. If you can’t draw exact samples from $p$, you have to compute them approximately, and this is where a lot of the research (particularly in Markov Chain MC methods) has focussed over the past decades.

Uncertainty does not need random numbers

The area currently of biggest interest for probabilistic numerics are complementary to the two former settings described above. With a probabilistic numerical method, we do not necessarily want to reduce a given problem to a less costly variant, but are interested in as good a solution to the actual problem at hand (however, it is interesting that recent results suggest that choosing projections in a non-random, “most informative” way can be done at reasonable cost and may improve performance (Garnett, Osborne, & Hennig, 2014)). And while we may well wonder about the sensitivity of the found solution to perturbations of the task, our chief interest is in the error created by the approximate computation itself, and the sensitivity of our computation’s result to further steps.

The story is much more intricate—and exciting—with regards to MC methods. Recently, I have found myself repeatedly in discussions with colleagues about the helpfulness of random numbers for computations, in connection with our recent NIPS paper on fast Bayesian quadrature (Gunter, Osborne, Garnett, Hennig, & Roberts, 2014). There is a famous polemic by Tony O’Hagan (O’Hagan, 1987) pointing out that randomness leads to serious inefficiencies when performing a deterministic computation. It is well-known that, on low-dimensional problems, classic quadrature rules can converge much faster than MC estimators: Depending on the rule used, and the smoothness of the integrand, $\mathcal{O}(N^{-p})$ for $p\in\mathbb{N}$ is not unusual. Classic quadrature rules are identified with maximum a posteriori estimators under various Gaussian process priors over the integrand (Minka, 2000). Extending on this insight, our recent NIPS paper presents a general purpose quadrature method for strictly positive integrands (such as the probability distribution $p$) which empirically outperforms (in wall-clock time) established MCMC methods. Given the ubiquity of MC methods, this kind of result is very exciting, and once again suggests an area of research that is only just beginning to take shape.

References

1. Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. (2014). Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In C. Cortes & N. Lawrence (Eds.), Advances in Neural Information Processing Systems (NIPS).

We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

@inproceedings{gunter14-fast-bayesian-quadrature,
author = {Gunter, Tom and Osborne, Michael A. and Garnett, Roman and Hennig, Philipp and Roberts, Stephen},
title = {Sampling for Inference in Probabilistic Models with Fast
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2014},
editor = {Cortes, C. and Lawrence, N.},
code = {https://github.com/OxfordML/wsabi}
}

2. Garnett, R., Osborne, M., & Hennig, P. (2014). Active Learning of Linear Embeddings for Gaussian Processes. In N. L. Zhang & J. Tian (Eds.), Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (pp. 230–239). AUAI Press. Retrieved from http://auai.org/uai2014/proceedings/individuals/152.pdf

We propose an active learning method for discovering low-dimensional structure in high-dimensional Gaussian process (GP) tasks. Such problems are increasingly frequent and important, but have hitherto presented severe practical difficulties. We further introduce a novel technique for approximately marginalizing GP hyperparameters, yielding marginal predictions robust to hyperparameter misspecification. Our method offers an efficient means of performing GP regression, quadrature, or Bayesian optimization in high-dimensional spaces.

@inproceedings{GarnettOH2013,
title = {Active Learning of Linear Embeddings for Gaussian Processes},
author = {Garnett, R. and Osborne, M. and Hennig, P.},
booktitle = {Proceedings of the 30th Conference on Uncertainty in
Artificial Intelligence},
editor = {Zhang, NL and Tian, J},
publisher = {AUAI Press},
pages = {230-239},
year = {2014},
url = {http://auai.org/uai2014/proceedings/individuals/152.pdf},
url2 = {https://github.com/rmgarnett/mgp},
department = {Department Sch{\"o}lkopf},
file = {http://auai.org/uai2014/proceedings/individuals/152.pdf},
code = {https://github.com/rmgarnett/mgp}
}

3. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.
@article{halko2011finding,
title = {Finding structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions},
author = {Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A},
journal = {SIAM review},
volume = {53},
number = {2},
pages = {217--288},
year = {2011},
publisher = {SIAM}
}

4. Liberty, E., Woolfe, F., Martinsson, P.-G., Rokhlin, V., & Tygert, M. (2007). Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51), 20167–20172.
@article{liberty2007randomized,
title = {Randomized algorithms for the low-rank approximation of
matrices},
author = {Liberty, Edo and Woolfe, Franco and Martinsson, Per-Gunnar and Rokhlin, Vladimir and Tygert, Mark},
journal = {Proceedings of the National Academy of Sciences},
volume = {104},
number = {51},
pages = {20167--20172},
year = {2007}
}

5. Minka, T. P. (2000). Deriving quadrature rules from Gaussian processes. Statistics Department, Carnegie Mellon University.

Quadrature rules are often designed to achieve zero error on a small set of functions, e.g. polynomials of specified degree. A more robust method is to minimize average error over a large class or distribution of functions. If functions are distributed according to a Gaussian process, then designing an average-case quadrature rule reduces to solving a system of 2n equations, where n is the number of nodes in the rule (O’Hagan, 1991). It is shown how this very general technique can be used to design customized quadrature rules, in the style of Yarvin & Rokhlin (1998), without the need for singular value decomposition and in any number of dimensions. It is also shown how classical Gaussian quadrature rules, trigonometric lattice rules, and spline rules can be extended to the average-case and to multiple dimensions by deriving them from Gaussian processes. In addition to being more robust, multidimensional quadrature rules designed for the average-case are found to be much less ambiguous than those designed for a given polynomial degree.

@techreport{minka2000deriving,
author = {Minka, T.P.},
institution = {Statistics Department, Carnegie Mellon University},
title = {{Deriving quadrature rules from {G}aussian processes}},
year = {2000},

@article{o1987monte,