One Statistical Paradigm to Rule Them All?

Gaithersburg, Maryland - Now, if you’re familiar with either statistics or J.R.R. Tolkien, I know the title grabbed your attention. And if not, don’t worry, all will become clear in time, but I’ve always wanted to reference “The Lord of the Rings” in a title (and scientific papers provide little opportunity for that!).

Statistics is a way of learning and making decisions when faced with uncertainty—in other words, all the time. There are two primary types of statistical methods, Bayesian and frequentist, and some statisticians have strong opinions about both. I apologize if I have omitted someone’s preferred approach for dealing with uncertainty; I know there are others.

Despite the title above, I do not really think that one method is better than the other, and I think the world is better off by having both at our disposal. For those of you who have a different opinion, note that there is a comments section below. I’d like to hear your thoughts and if I am able to reply, I will. (Note also that I will be loose with terminology on purpose as this is not a technical paper, and it will make the examples a little easier to follow.)

A simplified version of the scientific method is 1) make an observation or ask a question; 2) form a hypothesis; 3) test the hypothesis through an experiment; 4) revise the hypothesis if necessary and go back to 3); 5) draw conclusions and publish your result; and, hopefully, 6) collect your Nobel Prize (when will they add a mathematics and statistics category, I wonder?). A well-designed experiment provides as much information as possible. But because real-world data are noisy, statistical techniques are important for comparing hypothesis to experiment. Did one measured property really increase while another decreased, or did the observed pattern just appear by chance?

Related techniques can be used for prediction, too. Think about warranties on consumer products, such as the TV I just bought. A TV company would probably not profit, or even stay in business, if it warrantied all the televisions it made for 50 years. (But who knows? Perhaps today’s televisions are that good.) Instead, the company will conduct experiments and use statistical models to make statements such as, “90 percent of Model XYZ televisions will work perfectly for two years.” This is an area of statistics known as reliability.

So, all that said, what are frequentism and Bayesianism? From a philosophical point of view, they are just different ways to interpret the probability of something happening. A frequentist philosopher interprets probability as how often an event will occur in the long run. For example, you generally can flip a coin over and over, and if you did you could determine what percentages of the flips would be heads or tails over time. But wait, I can’t be quite that cliché. I work at the National Institute of Standards and Technology (NIST), so let’s talk about Standard Reference Materials (SRMs) instead.

portraits of Bayes and Fisher. Bayes is dressed as a 19th century Protestant minister, and Fisher is in a 1930s suit.

The Reverend Thomas Bayes* (left) originally discovered the theorem for conditional probability that bears his name and forms the basis of Bayesian statistical methods. Sir Ronald Fisher (right) is considered one of the founders of frequentist statistical methods, and originally introduced maximum likelihood. *This portrait was identified as Bayes in a 1936 book, but no definitively known portrait of Bayes exists.

An SRM is a material that NIST scientists have carefully measured and certified as containing specific amounts of certain substances or having certain measurable properties. Industry and government organizations can use these materials to calibrate or perform quality-control checks on their measurement equipment.

For instance, if a food manufacturer wanted to know how much fat was in the peanut butter that they are producing, something they need to know in order to fill out the nutrition label accurately, they could buy our peanut butter SRM, which has a certified value for the amount of fat to test their measurement equipment. If they get a significantly different value than the one on the certificate when they measure the SRM, they know that something is wrong with their instrument or measurement process that will likely affect the measurement of their own product as well.

If you look at an SRM certificate, which lists all the certified values, you might find a phrase like this, “[T]he expanded uncertainty corresponds to a 95 percent coverage interval.” But what does “95 percent coverage interval” mean? A frequentist philosopher might say it means that “if we conduct the same measurement experiment 100 times and analyze the data produced in the same way, we expect that 95 out of the 100 experiments will produce intervals that include the true value of the amount of substance we are interested in knowing about.”

The Bayesian philosopher might reply, “That is a reasonable interpretation, so long as the measurement experiment can be perfectly repeated. What about events that are not repeatable? For example, NASA’s InSight spacecraft will land on Mars only once. But beforehand, engineers selected a target in which to land and quantified the uncertainty about hitting that target with a certain probability. A pure frequency interpretation is not rich enough to cover such situations.” Indeed, there are many situations like this where it is necessary to express uncertainty in the prediction of a one-off outcome beforehand. The Bayesian philosopher might say, “It is preferable to think of probability as a way for a person to express uncertainty in his or her knowledge of some true state of nature or of a future outcome.”

Resolving these differences is not the point of this post, nor do my thoughts dwell there very often. No matter how you interpret probability, its mathematical, axiomatic foundations are without question. Both frequentist and Bayesian philosophies base their techniques on the same foundations, and as we’ll see, the techniques, from a practical perspective, are quite similar. Thus, they often lead to the same conclusions.

The main ingredient in a frequentist analysis is the likelihood, which is a function of unknown parameters that represent scientific quantities of interest. For example, coming back to the peanut butter SRM, one of the parameters of the likelihood will be the true fat concentration of the peanut butter. Another might be the true variability of the measurements of fat concentration due to uncontrollable influences such as minor temperature variations in the laboratory.

We interpret the likelihood as the probability of getting the data you have observed (measurements of fat concentration) given specific values of the parameters, in this case, the actual concentration of fat and measurement variability. A single best estimate and interval for the parameters of the likelihood are typical outputs of a frequentist analysis, e.g., there are 51.6 grams plus or minus 1.4 grams of fat per 100 grams of peanut butter being measured. These would correspond to the certified value and coverage interval on an SRM certificate.

There are two primary components to a Bayesian analysis. The first is the same likelihood that appeared in the frequentist analysis. The Bayesian data analyst interprets the likelihood as the probability of the observed data for specific parameter values, just as the frequentist, even though they may interpret probability differently.

The second ingredient in a Bayesian approach is the prior probability distribution, which is commonly interpreted as the probability of belief in specific parameter values BEFORE observing data. In the peanut butter SRM example, BEFORE any measurements of fat concentration are made, the prior distribution would describe the true fat concentration of the peanut butter. This is perfectly sensible because the ingredients comprising the peanut butter as well as their relative amounts are known when the peanut butter is made. The prior distribution can be used to make statements like, “the probability that the fat concentration is between 45 and 55 grams per 100 grams is 99 percent,” for example. The likelihood and prior probability distribution are combined using Bayes’ rule to produce the posterior probability distribution, which gives the probability of the analyst’s belief in specific parameter values AFTER observing data. When summarizing the results of a Bayesian analysis, it’s typical to provide a single best estimate and an interval estimate, just as for the frequentist case.

How is it possible that Bayesians and frequentists can arrive at the same scientific conclusions? After all, frequentists do not choose a prior probability distribution. It turns out that if the observed data provide strong information relative to the information conveyed by the prior probability distribution, both approaches yield similar results. So, why might the observed data provide stronger information than the prior distribution? Recall how to interpret the prior probability distribution. It is the probability of specific parameter values BEFORE observing data. A common cautious approach is to spread prior probability over many parameter values, assigning zero or very low probabilities only to implausible values.

Why then should someone go to the trouble of choosing a prior distribution if it’s inconsequential? The answer, of course, is that it’s not always inconsequential! As with many scientific debates, Randall Munroe has weighed in via his XKCD webcomic “Frequentists vs. Bayesians”. The observed data, in this case, is the detector answering “yes,” which could either mean that the machine rolled two sixes and is lying, a little less than a 3 percent chance, or the Sun really has exploded. Because it’s so unlikely that the machine is lying, the frequentist cueball concludes that the sun has indeed gone nova. Bayesian cueball, however, acting upon prior knowledge about the sun’s projected lifespan and the fact that both statisticians remain in perfect health, has attached prior probabilities to the competing hypotheses, with the hypothesis that the sun has not exploded receiving a very large share of the prior probability. Upon combining the information contained in the yes answer with the information contained in the prior, Bayesian cueball decides to disagree with frequentist cueball. And is willing to bet on it!

Randall’s cartoon is obviously an exaggeration to make his point and be funny. A frequentist’s hypothesis test would never be constructed this way. However, there are many real problems for which the prior probability distribution can have a large positive impact as opposed to the negative impact staunchly frequentist statisticians may fear.

SRMs, again, provide a good context for discussion. For some SRMs, information from two or more measurement methods is combined to calculate the certificate value and coverage interval. Think of a ruler and a tape measure, both of which can be used to measure the length of something. The prior distribution can be used to quantify the idea that these methods should produce similar measurements. They are targeting the same quantity being measured, they have been vetted by NIST scientists and found to be reliable, etc. Using the prior distribution to quantify those ideas can have an important positive impact.

Frequentist and Bayesian data analysis methods often lead to equivalent scientific conclusions. When they don’t, I think the focus should be on understanding why the prior probability distribution has such a large influence, and if that influence is sensible. In the XKCD cartoon, the prior distribution saved the day (or at least earned the Bayesian $50). In other cases, it could lead you astray.

In the endeavor of learning about our universe through observation, experimentation and analysis of the data we collect, it does not seem wise to me to limit the set of tools that you are willing to employ. My mindset is to broaden my toolbox as much as I can and to the best of my ability to understand the strengths and weaknesses of each of the tools available for my use.