## What is “Bayesian” Statistical Inference?

**Warning: Dangerous Curves.** This entry presupposes you already know some math stats, such as how to manipulate joint, marginal, and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). You should also be familiar with the difference between model parameters (e.g., a regression coefficient or Poisson rate parameter) and observable samples (e.g., reported annual income or the number of fumbles in a quarter of a given American football game).

### Examples

I followed up this post with some concrete examples for binary and multinomial outcomes.

### Bayesian Inference is Based on Probability Models

Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.

### What’s a Prior?

The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.

### Non-Bayesian Statistics

Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as defining the probability that a model parameter (such as the mean height of an adult male American) falls in a given range (say 5′6″ to 6′0″).

Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.

### Bayesian Technical Apparatus

The sampling distribution, with probability function $p(y|\theta)$, models the probability of observable data $y$ given unobserved (and typically unobservable) model parameters $\theta$. (The sampling distribution probability function $p(y|\theta)$ is called the likelihood function when viewed as a function of $\theta$ for fixed $y$.)

The prior distribution, with probability function $p(\theta)$, models the probability of the parameters $\theta$.

The full joint distribution over parameters and data has a probability function given by the chain rule,

$p(y,\theta) = p(y|\theta) \times p(\theta)$.

The posterior distribution gives the probability of parameters $\theta$ given the observed data $y$. The posterior probability function $p(\theta|y)$ is derived from the sampling and prior distributions via Bayes’s rule,

$p(\theta|y)$

${} = \frac{\displaystyle p(y,\theta)}{\displaystyle p(y)}$       by the definition of conditional probability,

${} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\displaystyle p(y)}$       by the chain rule,

${} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y,\theta') \, d\theta'}$       by the rule of total probability,

${} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y|\theta') \times p(\theta') \, d\theta'}$       by the chain rule, and

${} \propto p(y|\theta) \times p(\theta)$       because $y$ is constant.
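The derivation above can be checked numerically with a grid approximation. The following Python sketch uses hypothetical numbers (7 heads in 10 coin flips, with a Beta(2, 2) prior): it evaluates the unnormalized posterior $p(y|\theta) \times p(\theta)$ on a grid of $\theta$ values and divides by the approximated normalizer $p(y)$.

```python
import math

# Hypothetical data: y = 7 heads in n = 10 coin flips.
n, y = 10, 7
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoint grid over theta in (0, 1)

def likelihood(theta):
    # Sampling distribution p(y | theta): binomial.
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

def prior(theta):
    # Prior p(theta): Beta(2, 2) density, which is 6 * theta * (1 - theta).
    return 6 * theta * (1 - theta)

# Unnormalized posterior p(y | theta) * p(theta) on the grid.
unnorm = [likelihood(t) * prior(t) for t in grid]

# Normalizer p(y) = integral of p(y | theta') p(theta') d theta',
# approximated by the midpoint rule (each cell has width 1/1000).
p_y = sum(unnorm) / len(grid)

# Normalized posterior density p(theta | y) on the grid.
posterior = [u / p_y for u in unnorm]

# The posterior density integrates to 1 (up to floating point):
print(sum(posterior) / len(grid))  # ≈ 1.0
```

Dividing by $p(y)$ is exactly what makes the posterior a proper density; dropping it, as in the final proportionality, changes nothing about the shape of the function over $\theta$.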

The posterior predictive distribution $p(\tilde{y}|y)$ for new data $\tilde{y}$ given observed data $y$ is the average of the sampling distribution defined by weighting the parameters proportional to their posterior probability,

$p(\tilde{y}|y) = \int_{\Theta} \, p(\tilde{y}|\theta) \times p(\theta|y) \, d\theta$.

The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. In particular, the posterior predictive distribution is an overdispersed variant of the sampling distribution. The extra dispersion arises from integrating over the posterior.
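As a sketch of the integral above (hypothetical numbers again: 7 heads in 10 flips under a uniform Beta(1, 1) prior, predicting a single new flip), the posterior predictive probability of heads is a posterior-weighted average of $\theta$:

```python
import math

# Hypothetical data: y = 7 heads in n = 10 flips.
n, y = 10, 7
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoint grid over theta

def lik(theta):
    # Sampling distribution p(y | theta): binomial.
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

# Posterior p(theta | y) on the grid; the prior is uniform, so the
# posterior is just the normalized likelihood.
unnorm = [lik(t) for t in grid]
norm = sum(unnorm)
post = [u / norm for u in unnorm]

# p(y_new = heads | y) = integral of p(y_new = heads | theta) * p(theta | y)
# d theta, where p(y_new = heads | theta) = theta, approximated as a
# weighted sum over the grid.
p_heads = sum(t * w for t, w in zip(grid, post))
print(round(p_heads, 3))  # ≈ 0.667
```

The result matches the closed-form answer $(y+1)/(n+2) = 8/12$, Laplace's rule of succession.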

### Conjugate Priors

Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribution, the posterior predictive distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the posterior predictive distribution is negative binomial, the overdispersed form of the Poisson.
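For example, the beta-binomial conjugate update has a closed form: a Beta$(\alpha, \beta)$ prior combined with $y$ successes in $n$ binomial trials gives a Beta$(\alpha + y, \beta + n - y)$ posterior. The numbers below are hypothetical.

```python
# Conjugate beta-binomial updating in closed form.
a, b = 2.0, 2.0   # Beta(2, 2) prior (hypothetical)
n, y = 10, 7      # hypothetical data: 7 successes in 10 trials

# Posterior is Beta(a + y, b + n - y); no integration required.
a_post, b_post = a + y, b + (n - y)
print(a_post, b_post)              # 9.0 5.0

# Posterior mean of theta for a Beta(a', b') is a' / (a' + b').
print(a_post / (a_post + b_post))  # ≈ 0.643
```

The same update applied repeatedly as data arrive gives sequential (online) inference for free, which is one reason conjugate families are so convenient.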

### Point Estimate Approximations

An approximate alternative to full Bayesian inference uses $p(\tilde{y}|\hat{\theta})$ for prediction, where $\hat{\theta}$ is a point estimate.

The maximum $\theta^{*}$ of the posterior distribution defines the maximum a posteriori (MAP) estimate,

$\theta^{*} = \arg\max_{\theta} p(y|\theta) p(\theta)$.

If the prior $p(\theta)$ is uniform, the MAP estimate is called the maximum likelihood estimate (MLE), because it maximizes the likelihood function $p(y|\theta)$. Because the MLE does not assume a proper prior, the posterior may be improper (i.e., it may not integrate to 1).
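For the binomial model with a beta prior, both estimates have closed forms, so the contrast is easy to sketch. The numbers below are hypothetical: the MLE is $y/n$, while the MAP estimate is $(y + \alpha - 1)/(n + \alpha + \beta - 2)$, shrunk toward the prior.

```python
# MLE vs. MAP for the binomial model with a Beta(a, b) prior
# (hypothetical numbers).
n, y = 10, 7      # data: 7 successes in 10 trials
a, b = 2.0, 2.0   # Beta(2, 2) prior

# MLE maximizes p(y | theta): the sample proportion.
mle = y / n

# MAP maximizes p(y | theta) * p(theta): the mode of the
# Beta(a + y, b + n - y) posterior.
map_est = (y + a - 1) / (n + a + b - 2)

print(mle)      # 0.7
print(map_est)  # ≈ 0.667, shrunk toward the prior mean 0.5
```

With a uniform Beta(1, 1) prior, the MAP formula reduces to $y/n$, recovering the MLE as the text describes.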

Another common point estimate under the Bayesian model is the posterior mean,

$\hat{\theta} = \int_{\Theta} \theta \times p(\theta|y) \, d\theta$.

This quantity is often used as a Bayesian point estimator because it minimizes expected squared loss between an estimate and the actual value. The posterior median may also be used as an estimate — it minimizes expected absolute error of the estimate.

Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.
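The underdispersion is easy to quantify in the beta-binomial case: the plug-in predictive, a binomial at the posterior mean, has strictly smaller variance than the full beta-binomial posterior predictive. The numbers below (a Beta(9, 5) posterior and 20 new trials) are hypothetical.

```python
# Plug-in predictive variance vs. full posterior predictive variance
# (hypothetical posterior and trial count).
a, b = 9.0, 5.0   # posterior Beta(a, b)
m = 20            # number of new trials

theta_hat = a / (a + b)  # posterior mean point estimate

# Plug-in predictive: Binomial(m, theta_hat), variance m * p * (1 - p).
plug_in_var = m * theta_hat * (1 - theta_hat)

# Full posterior predictive: BetaBinomial(m, a, b), with variance
# m * a * b * (a + b + m) / ((a + b)^2 * (a + b + 1)).
beta_binom_var = (m * a * b * (a + b + m)) / ((a + b) ** 2 * (a + b + 1))

print(round(plug_in_var, 3))     # ≈ 4.592
print(round(beta_binom_var, 3))  # ≈ 10.408, more than twice as large
```

The gap shrinks as the posterior concentrates (large $a + b$), which is the code-level counterpart of the claim that point estimates are reasonable when the posterior has low variance.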

### 12 Responses to “What is “Bayesian” Statistical Inference?”

1. John Says:

Regarding conjugate priors, here’s a diagram that summarizes the most common conjugate relationships.

2. lingpipe Says:

That’s really cool. So is John’s distribution chart.

3. Scott Says:

Bayesian statistics are used quite frequently in evolutionary analysis. Paul Lewis at the University of Connecticut is a brilliant lecturer on the basics of Bayesian statistics. He developed a small program called the “Bayesian Coin Tosser” to help illustrate many complex concepts in Bayesian statistics. It can be downloaded here: http://hydrodictyon.eeb.uconn.edu/people/plewis/software.php.

His educational MCMC software is also an excellent teaching tool. An example lecture of his on Bayesian statistics and how they are used in phylogenetic analysis can be found here: http://www.molecularevolution.org/si/people/faculty/lewis_paul.php

4. lingpipe Says:

There was a sudden spike in traffic, and it turns out it comes from Y Combinator Hacker News, where there’s a discussion of this post with seven comments as of today.

The criticisms were sound — it’s too technical (i.e. jargon filled) for someone to understand who doesn’t already get it. Ironically, I’ve been telling Andrew Gelman that about his Bayesian Data Analysis book for years.

Unix man pages are the usual exemplar of doc that only works if you mostly know the answer. They’re great once you already understand something, but terrible for learning.

I think Andrew’s BDA is that way — it’s clear, concise and it actually does explain everything from first principles. And there are lots of examples. So why is this so hard to understand?

I usually write with my earlier self in mind as an audience. Sorry for not targeting a far-enough-back version of myself this time! The jargon should be familiar to anyone who's taken math stats. I don't think it would have helped if I'd defined the sum for the prior as a marginal.

5. Bertok Says:

Hi lingpipe,

Just curious, what are you using to format your mathematics? It looks like html-based LaTeX. Is that what it is? Looks very nice.

• lingpipe Says:

I wish. I’m using Sitmo.com’s LaTeX editor, which is a pain. And I don’t trust its stability, so I save the source as the alt field in the image link in case I need to regenerate images at some point.

I’d rather use a built-in HTML-based solution, but WordPress is hosting and I haven’t figured out how to incorporate one of the LaTeX plugins. Any help appreciated!

P.S. If you right-click on images and look at their properties, you’ll see the link to sitmo.

• Bertok Says:

Thank you for that link to Sitmo. That’s a cool editor. To be honest, I’m not really sure that I would use any of the LaTeX plugins, even if WP did support them. I sort of like the idea of generating an image from an editor and just using that instead. I think it’s a cleaner, safer solution (assuming you get your equations right the first time).

Back on topic, I liked this brief synopsis of Bayesian stats. I recall the week that we spent on this in Stats I as the reason I continued my studies in the field of statistics. Such an elegant, powerful tool inspired me to really dig in to probability theory as an undergrad.

Thanks again!

• mattilyra Says:

Hi. This is quite an old post but looking at the more recent post on Settles EMNLP 2011 paper things haven’t changed with regards to the latex formatting. Just thought I’d point you to MathJax which is a JS script for rendering very very nice latex inline. Stackoverflow uses it and they appear to have a WordPress plugin.

http://www.mathjax.org/docs/1.1/platforms/wordpress.html

• Bob Carpenter Says:

I’ve just been using the WordPress LaTeX plugin since I found it. We’re hosted on wordpress.com, and alas, MathJax isn’t available there. See this discussion:

http://en.forums.wordpress.com/topic/using-mathjax-in-wordpresscom

6. abia Says:

what is the difference between classical and bayesian in statistics?

• lingpipe Says:

The answer’s clearer philosophically than it is in practice. Frequentists only allow probability statements about repeatable, observable variables. Bayesians let you talk about the probability of unobservable parameters.

In the canonical example of a biased coin flip, the outcome (heads or tails) of each trial is observable. The chance of landing heads or tails is not observable. In the example of human height, the height of a person is observable, but the mean height of the population is not.

In practice, this leads to the frequentist notion of hypothesis testing. A hypothesis test can either reject a hypothesis (such as that height and weight are independent) or fail to reject a hypothesis. They do not allow you to conclude that the alternative to the null hypothesis is correct! In probabilistic terms, the null hypothesis is usually a point hypothesis with zero probability of being correct in most continuous models (such as the chance of a coin landing heads or a normally distributed population of women's heights).

In practice, many people incorrectly interpret frequentist confidence intervals as Bayesian posterior intervals.

People misleadingly call Bayesian stats subjective, but this is a misunderstanding of the role of modeling and a misunderstanding of estimation. All of statistics, even frequentist statistics, is subjective in the sense that we’re using a mathematical model to approximate an unknown reality. Priors in the Bayesian sense may be estimated from data just like other parameters.

7. Bayesian Inference for LDA is Intractable « LingPipe Blog Says:

[…] LingPipe Blog: What is Bayesian Statistical Inference? […]