Warning: Dangerous Curves

This entry presupposes you already know some math stats, such as how to manipulate joint, marginal, and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). You should also be familiar with the difference between model parameters (e.g., a regression coefficient or Poisson rate parameter) and observable samples (e.g., reported annual income or the number of fumbles in a quarter of a given American football game).
I followed up this post with some concrete examples for binary and multinomial outcomes:
- Dirichlet-Multinomial: Bayesian Naive Bayes
Bayesian Inference is Based on Probability Models
Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.
What’s a Prior?
The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.
Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform the probabilistic inferences available to Bayesians, such as defining the probability that a model parameter (such as the mean height of an adult male American) lies in a given range (say 5′6″ to 6′0″).
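For example, if the posterior over the mean height $\mu$ were (hypothetically) normal, that range probability is just a difference of posterior CDF values. Here is a minimal sketch in Python; the posterior family and its parameters are made-up assumptions for illustration:

```python
from scipy.stats import norm

# Hypothetical posterior over mean adult male height in inches:
# normal with mean 69.5 and standard deviation 1.2 (made-up numbers).
posterior = norm(loc=69.5, scale=1.2)

# Pr[66 <= mu <= 72 | y] = F(72) - F(66), where F is the posterior CDF.
prob_in_range = posterior.cdf(72.0) - posterior.cdf(66.0)
print(f"Pr[66 <= mu <= 72 | y] = {prob_in_range:.3f}")
```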
Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.
Bayesian Technical Apparatus
The sampling distribution, with probability function $p(y \mid \theta)$, models the probability of observable data $y$ given unobserved (and typically unobservable) model parameters $\theta$. (The sampling distribution probability function $p(y \mid \theta)$ is called the likelihood function when viewed as a function of $\theta$ for fixed $y$.)
The prior distribution, with probability function $p(\theta)$, models the probability of the parameters $\theta$.
The full joint distribution over parameters and data has a probability function given by the chain rule,

$$p(y, \theta) = p(y \mid \theta) \, p(\theta).$$
The posterior distribution gives the probability of parameters $\theta$ given the observed data $y$. The posterior probability function $p(\theta \mid y)$ is derived from the sampling and prior distributions via Bayes's rule,

$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)}$$

by the definition of conditional probability,

$$= \frac{p(y \mid \theta) \, p(\theta)}{p(y)}$$

by the chain rule,

$$= \frac{p(y \mid \theta) \, p(\theta)}{\int p(y, \theta') \, d\theta'}$$

by the rule of total probability,

$$= \frac{p(y \mid \theta) \, p(\theta)}{\int p(y \mid \theta') \, p(\theta') \, d\theta'}$$

by the chain rule, and

$$\propto p(y \mid \theta) \, p(\theta)$$

because $p(y)$ is constant with respect to $\theta$.
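To see the proportionality at work numerically, here is a sketch that evaluates the unnormalized posterior $p(y \mid \theta) \, p(\theta)$ on a grid and normalizes by its sum; the binomial sampling distribution, uniform prior, and data are assumptions for illustration:

```python
import numpy as np
from scipy.stats import binom

# Made-up data: y = 7 successes in n = 10 binomial trials.
n, y = 10, 7

# Grid of parameter values theta in (0, 1), uniform prior p(theta) = 1.
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Unnormalized posterior: likelihood times prior.
unnorm = binom.pmf(y, n, theta) * 1.0

# Normalizing constant p(y), approximated by summing over the grid.
p_y = unnorm.sum() * dtheta
posterior = unnorm / p_y  # approximates the density p(theta | y)

print("posterior mean ~=", (theta * posterior * dtheta).sum())
```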
The posterior predictive distribution $p(\tilde{y} \mid y)$ for new data $\tilde{y}$ given observed data $y$ is the average of the sampling distribution defined by weighting the parameters in proportion to their posterior probability,

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta.$$
The key feature is the incorporation of the uncertainty in the posterior parameter estimate into predictive inference. In particular, the posterior predictive distribution is an overdispersed variant of the sampling distribution; the extra dispersion arises from integrating over the posterior.
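The integral can be approximated by simulation: draw parameters from the posterior, then draw new data from the sampling distribution at each draw. A sketch, assuming (purely for illustration) a Beta(8, 4) posterior over a binomial success probability:

```python
import numpy as np

rng = np.random.default_rng(1234)

# Assumed posterior (illustrative): theta | y ~ Beta(8, 4).
theta_draws = rng.beta(8, 4, size=100_000)

# New data from the sampling distribution at each posterior draw:
# y_new | theta ~ Binomial(10, theta). Marginally, the y_new draws
# follow the posterior predictive p(y_new | y).
y_new = rng.binomial(10, theta_draws)

print("predictive mean:", y_new.mean())
print("predictive variance:", y_new.var())
```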
Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribution, the posterior predictive distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the posterior predictive distribution is negative binomial, the overdispersed form of the Poisson.
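A sketch of the beta-binomial case, using made-up prior parameters and data; the conjugate update is just addition of counts:

```python
from scipy.stats import betabinom

# Beta(a, b) prior; observe y = 7 successes in n = 10 trials (made up).
a, b = 2.0, 2.0
n, y = 10, 7

# Conjugate update: the posterior is Beta(a + y, b + n - y).
a_post, b_post = a + y, b + n - y

# The posterior predictive for m new trials is beta-binomial.
m = 10
predictive = betabinom(m, a_post, b_post)
print("predictive mean:", predictive.mean())
print("predictive variance:", predictive.var())
```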
Point Estimate Approximations
An approximate alternative to full Bayesian inference uses $p(\tilde{y} \mid \theta^*)$ for prediction, where $\theta^*$ is a point estimate of the parameters.
The maximum of the posterior distribution defines the maximum a posteriori (MAP) estimate,

$$\theta^*_{\mathrm{MAP}} = \arg\max_\theta \, p(\theta \mid y) = \arg\max_\theta \, p(y \mid \theta) \, p(\theta).$$
If the prior is uniform, the MAP estimate is called the maximum likelihood estimate (MLE), because it maximizes the likelihood function $p(y \mid \theta)$. Because the MLE does not assume a proper prior, the posterior may be improper (i.e., it may not integrate to 1).
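Here is a sketch of both estimates for the beta-binomial model above; the numbers are again made up, and the MAP estimate is found by numerical optimization so the same code would work for non-conjugate models:

```python
from scipy.optimize import minimize_scalar
from scipy.stats import beta, binom

# Made-up data and prior: y = 7 of n = 10 trials, Beta(2, 2) prior.
n, y = 10, 7
a, b = 2.0, 2.0

def neg_log_posterior(theta):
    # -log[ p(y | theta) p(theta) ]; the normalizer p(y) is irrelevant.
    return -(binom.logpmf(y, n, theta) + beta.logpdf(theta, a, b))

map_est = minimize_scalar(neg_log_posterior,
                          bounds=(1e-6, 1 - 1e-6), method="bounded").x

# With a uniform prior, the MAP estimate reduces to the MLE, which for
# the binomial is just the sample proportion.
mle = y / n

print(f"MAP: {map_est:.4f}  (closed form: {(a + y - 1) / (a + b + n - 2):.4f})")
print(f"MLE: {mle:.4f}")
```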
By definition, the unbiased estimator for the parameter under the Bayesian model is the posterior mean,

$$\hat{\theta} = \mathbb{E}[\theta \mid y] = \int \theta \, p(\theta \mid y) \, d\theta.$$
This quantity is often used as a Bayesian point estimator because it minimizes the expected squared loss between the estimate and the actual value. The posterior median may also be used as an estimate; it minimizes the expected absolute error of the estimate.
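Both claims are easy to check by simulation: among candidate point estimates, the posterior mean minimizes the Monte Carlo estimate of expected squared loss, and the posterior median minimizes expected absolute loss. A sketch, again assuming an illustrative Beta(8, 4) posterior:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draws from an assumed Beta(8, 4) posterior (illustrative only).
draws = rng.beta(8, 4, size=100_000)

post_mean = draws.mean()
post_median = np.median(draws)

# Expected losses under the posterior, estimated over a grid of candidates.
grid = np.linspace(0.01, 0.99, 99)
sq_loss = [np.mean((draws - t) ** 2) for t in grid]
abs_loss = [np.mean(np.abs(draws - t)) for t in grid]

print("posterior mean:", post_mean,
      "; argmin squared loss:", grid[np.argmin(sq_loss)])
print("posterior median:", post_median,
      "; argmin absolute loss:", grid[np.argmin(abs_loss)])
```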
Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.
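The underdispersion is easy to see in the conjugate case: the plug-in binomial at a point estimate has strictly smaller variance than the beta-binomial posterior predictive, and the gap grows as the posterior gets more diffuse. A sketch with made-up posterior parameters:

```python
from scipy.stats import betabinom, binom

# Assumed diffuse posterior (illustrative): theta | y ~ Beta(3, 2).
a_post, b_post = 3.0, 2.0
m = 20  # number of new trials to predict

theta_hat = a_post / (a_post + b_post)  # posterior-mean point estimate

full = betabinom(m, a_post, b_post)  # full posterior predictive
plug_in = binom(m, theta_hat)        # plug-in predictive at theta_hat

print("full predictive variance:   ", full.var())     # 20.0
print("plug-in predictive variance:", plug_in.var())  # 4.8
```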