Uncertainty quantification. An often overlooked yet critical aspect of any analysis in both science and engineering. Theorists like to talk about fundamental bounds on information. Experimentalists like to place bounds on measurements by propagating various sources of error. Computationalists like to perform convergence tests and propagate floating point errors.
The three foundational pillars of science have long since developed methods to handle uncertainty and account for the ambiguity in results.
With the proliferation of the fourth pillar, machine learning, this crucial practice is often overlooked. Not by foundational researchers, who have developed a variety of methods for quantifying uncertainty. But by practitioners, who treat the additional overhead of the requirement as a nuisance… like unit tests. Believe me, I would never skip writing a unit test.
Many productized applications of ML can live in ignorant bliss of their uncertainty without affecting the bottom line. Safety-critical applications, on the other hand, cannot afford to be so naive. When turning life-saving decision making over to The Algorithm, any prediction must be accompanied by some measure that answers the question, “how confident are we in this estimate?”
Frequentist and Bayesian interpretations take different approaches to answering this question. My goal for this post isn’t to lay out the theoretical guarantees behind these methods or to compare the merits of each. Instead, I’ll survey the landscape and point interested readers toward further references that work through the details far better than I could.
Traditionally, ML methods and especially deep learning methods make point predictions. This means that we only produce a single prediction for any individual input. An example of this would be using a softmax function to transform logits into a discrete distribution across classes, think cat vs dog. While the result is a distribution, it doesn’t tell us anything about how certain we are in our prediction.
Concretely, if we send a picture of a cat through a classifier, the classifier may score the input as a cat with 0.9 and a dog with 0.1. This can be useful, but it doesn’t help us understand how certain the classifier is in the prediction. If instead it produced a cat with 0.88 and a dog with 0.12, would we be more or less confident in that result?
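To make the point prediction concrete, here is a minimal sketch of the softmax step. The logit values are made up purely to reproduce roughly the 0.9 vs 0.1 split above:

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits from a cat-vs-dog classifier.
logits = np.array([2.2, 0.0])
probs = softmax(logits)  # roughly [0.9, 0.1]
```

Note that the output is a single distribution over classes: one point in the probability simplex, with no indication of how much that point would move if we perturbed the input or the model.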
Broadly, uncertainty in statistics and ML can be grouped into two categories: Aleatoric and Epistemic. Aleatoric uncertainty is the irreducible uncertainty inherent to the data. This is a fact of life for any generative process. Epistemic uncertainty is reducible and describes how certain the model is in its predictions based on the data it has seen. This sort of uncertainty can be broken down into model uncertainty, can the chosen solution class nearly approximate the true solution. Optimization error, how close to the true solution can the optimization algorithm bring the chosen solution class. And finally statistical uncertainty, how thoroughly is our data distributed across the underlying manifold which relates to generalization.
To quantify and reduce epistemic uncertainty we have a variety of methods available to us. The first and most widely used of these are ensemble methods: simply train many classifiers with different random seeds, send the input through each of them, then construct a distribution over the results. Another approach is to use Bayesian neural networks, where we place a prior over the weights of the network and update our belief in the weights as training progresses. A third approach, known as Monte Carlo dropout, is to repeatedly apply dropout during inference and use the resulting predictions to approximate a distribution over the result.
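As a toy sketch of the ensemble idea (the data, the bootstrap resampling, and the polynomial members are my own stand-ins for real networks trained from different seeds):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: noisy observations (aleatoric) on a limited range,
# so anything far outside [-3, 3] is poorly constrained (epistemic).
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + rng.normal(0, 0.1, size=40)

def fit_member(x, y, seed, degree=3):
    # Each ensemble member sees a different bootstrap resample of the data.
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(x), size=len(x))
    return np.polyfit(x[idx], y[idx], degree)

ensemble = [fit_member(x, y, seed) for seed in range(20)]

# At test time every member votes; the spread of the votes approximates
# epistemic uncertainty. x = 5.0 lies far outside the training range.
x_test = np.array([0.0, 5.0])
preds = np.stack([np.polyval(coefs, x_test) for coefs in ensemble])
mean, std = preds.mean(axis=0), preds.std(axis=0)
```

The ensemble agrees near the data (small std at 0.0) and disagrees wildly in the extrapolation region (large std at 5.0), which is exactly the behavior we want from an epistemic uncertainty estimate.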
While I would argue that all of these are valid approaches, we can do better. I’d like to highlight two approaches specifically.
Evidential Deep Learning
This approach extends an obscure (to me at least) subfield of statistics called evidential statistics. Neither Bayesian nor frequentist, evidential statistics merges likelihoods, posteriors, and other machinery from both camps into a hybrid theory. This method interprets each input as weighted evidence towards its class label or regression target. To do this we simply reinterpret the predictions of our model as the parameters of a second-order distribution and make some changes to the loss function so that the math works out. The distribution over our solution may then be sampled from that second-order distribution.
To make this clearer, an analogy. Imagine flipping a coin. The outcome of a single toss is governed by a Bernoulli distribution (a series of tosses by a binomial). If the coin is fair it will be ½ heads and ½ tails, but if the coin is flawed it may be something more like 9/16 heads and 7/16 tails. Evidential statistics is essentially the coin factory which manufactured our coin: we predict the parameters of the distribution which governs the parameters of the Bernoulli distribution. This is known as the conjugate prior distribution, and in this case it is the Beta distribution.
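To make the factory concrete, a quick sketch. The Beta(9, 7) parameters are just an illustration, chosen because that Beta has mean 9/16, matching the flawed coin above:

```python
import numpy as np

rng = np.random.default_rng(42)

# The "coin factory": a Beta distribution over the coin's heads-probability.
# Beta(9, 7) has mean 9 / (9 + 7) = 9/16.
a, b = 9.0, 7.0
coin_biases = rng.beta(a, b, size=10_000)

mean_bias = coin_biases.mean()  # close to 9/16 = 0.5625
spread = coin_biases.std()      # how much individual coins vary
```

Each sample is one possible coin; the spread of the samples tells us how unsure we are about which coin we actually hold.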
To extend this idea, imagine an analogous situation where you are rolling a die. The die may be fair, with each side 1/6 probable, but a fair die is only one possible draw from the conjugate prior, which here is the Dirichlet distribution.
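The die version of the idea can be sketched in a few lines. This follows the common evidential formulation in which non-negative network outputs are read as evidence and the Dirichlet concentration parameters are evidence plus one; the evidence vectors below are made up for illustration:

```python
import numpy as np

def dirichlet_summary(evidence):
    # Evidential DL reads non-negative outputs as evidence per class;
    # the Dirichlet concentration parameters are alpha = evidence + 1.
    alpha = np.asarray(evidence, dtype=float) + 1.0
    strength = alpha.sum()
    k = len(alpha)
    expected_probs = alpha / strength  # mean of the Dirichlet
    uncertainty = k / strength         # high when total evidence is low
    return expected_probs, uncertainty

# Lots of evidence for class 0: confident prediction, low uncertainty.
p_confident, u_confident = dirichlet_summary([18.0, 1.0, 1.0])
# Barely any evidence at all: near-uniform prediction, high uncertainty.
p_unsure, u_unsure = dirichlet_summary([0.2, 0.1, 0.1])
```

Two inputs can produce the same expected class probabilities yet very different uncertainties, which is exactly the information a bare softmax throws away.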
By placing a distribution over our predictions we intrinsically gain access to the uncertainty in those predictions. The nice thing about this approach is that it doesn’t require any re-engineering of the upstream data or training pipelines; the only changes required are in the loss function.
I of course need to add the caveat here that we are relying on the universal function approximation theorem to approximate our solution class. As such this is only a heuristic, though a very successful one, as recent work has shown. Below I've included the seminal papers on evidential deep learning for classification and regression tasks.
https://arxiv.org/abs/1910.02600
https://arxiv.org/abs/1806.01768
Conformal Predictors
The last method of estimating epistemic uncertainty that I would like to highlight is the conformal predictor. A relatively new field, conformal prediction lives in the frequentist world. Conformal predictors work as bolt-on machinery that you install after constructing your model. Notice I didn’t say anything about which type of model: they work on any predictor, whether classification or regression, random forest, Bayesian neural network, SVM, or anything else you can imagine.
Just as you might calibrate a measurement apparatus, you can calibrate a statistical model. The calibration process aligns the scores predicted by the model with the uncertainty of the model. Let's revisit our classification problem where we apply a softmax across three classes: cat vs dog vs parrot. We first choose a confidence level, say 90%, then use a held-out calibration set to derive a score threshold. At prediction time, every class whose score clears the threshold goes into a prediction set, so the set may contain several classes. For example, after calibrating we may score the cat as 0.74, the parrot as 0.7, and the dog as 0.2, with only the cat and parrot clearing the threshold. The result is interpreted as a set prediction, {cat, parrot}, and the conformal guarantee is that, across many predictions, such sets contain the true class at least 90% of the time. I don’t want to go into too much detail here, just bring attention to the technique. Christoph Molnar has an amazing series of posts on the subject where he goes into more detail. I will include the link to the first of five blog posts below.
https://mindfulmodeler.substack.com/p/week-1-getting-started-with-conformal
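As a sketch of the standard split-conformal recipe for classification (the simulated "classifier" below is a stand-in so the example is self-contained; with a real model you would use its softmax outputs on a held-out calibration set):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simulate(n):
    # Stand-in for a trained 3-class classifier: random confidence levels,
    # with labels drawn from the predicted probabilities themselves.
    probs = softmax(rng.normal(size=(n, 3)) * 2)
    labels = np.array([rng.choice(3, p=p) for p in probs])
    return probs, labels

cal_probs, cal_labels = simulate(500)
test_probs, test_labels = simulate(200)

alpha = 0.1  # target: sets that contain the true class at least 90% of the time

# Nonconformity score: one minus the probability assigned to the true class.
scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
n = len(scores)
# Conservative quantile with the finite-sample correction.
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction set: every class whose score clears the calibrated threshold.
sets = [np.where(1.0 - p <= q_hat)[0] for p in test_probs]
coverage = np.mean([label in s for label, s in zip(test_labels, sets)])
```

The appeal is visible in the code: nothing about the model's internals is touched, only its output scores on a calibration split, which is why the same recipe bolts onto any predictor.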
Hopefully this post gave you a broader understanding of the landscape of uncertainty quantification techniques and where they fit into modern machine learning. Next time you build a model, try and include some measure of uncertainty and see how it goes.
