It’s American baseball season, and I’ve been itching to do more baseball stats. So in the interest of explaining Bayesian posterior inference with an example…
Batting Averages
A player’s batting average is his number of “hits” divided by his number of “at bats”.
Binomial Model of Batting
For simplicity, we’ll model a sequence of at bats using a binomial distribution with a probability $\theta$ of getting a hit in each at bat. This assumes the at-bat outcomes are independent and identically distributed, aka iid.
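As a minimal sketch of what this model looks like in R (the ability value here is made up for illustration), the likelihood of a hit total is just `dbinom`, and simulated stretches of at bats are `rbinom`:

    # hypothetical ability and at-bat count for illustration
    theta = 0.300   # probability of a hit in each at bat
    M = 10          # number of at bats

    # likelihood of m = 3 hits in M iid at bats under the binomial model
    dbinom(3, M, theta)

    # simulate hit totals for five stretches of M at bats at this ability
    rbinom(5, M, theta)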
Of course this is wrong. All models are wrong. It’s just that this one’s more wrong than usual, despite the popularity of the batting average as a summary statistic. (See the second-to-last section for a reference to a better model.)
Maximum Likelihood Estimate (MLE)
Given a sample (aka training data) consisting of a player with $m$ hits in $M$ at bats (where $0 \leq m \leq M$), the maximum-likelihood estimate (MLE) of the ability $\theta$ is $\theta^{*} = m/M$, aka the player’s batting average.
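As a quick sanity check that the batting average really is the MLE, here’s a sketch using the same counts as the worked example below ($m = 3$, $M = 10$); brute-force maximization of the binomial likelihood lands on $m/M$:

    m = 3    # hits
    M = 10   # at bats

    # closed-form MLE: the batting average
    m / M

    # brute-force check: maximize the binomial likelihood over theta in (0, 1)
    optimize(function(theta) dbinom(m, M, theta),
             interval = c(0, 1), maximum = TRUE)$maximum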
Uniform Prior
We assume a uniform (aka noninformative) prior distribution over abilities.
This is also wrong. We know major league position players don’t bat under 0.200 or they’d be fired, and no one’s had a batting average of over 0.400 since Ted Williams.
But we’ll have to come back to issues with the prior later. This post is just to clarify what integration over the posterior does for inference, not what an informative prior does.
Bayesian Posterior Estimate
The beta prior is conjugate to the binomial. Because the uniform distribution may be expressed as a beta distribution, $\mathrm{Beta}(\theta \mid 1, 1)$, the posterior is a beta distribution. Because the beta and binomial play nicely together, the posterior ability distribution $p(\theta \mid m, M)$ may be expressed analytically as $\mathrm{Beta}(\theta \mid m + 1, M - m + 1)$. The maximum of $\mathrm{Beta}(\theta \mid m + 1, M - m + 1)$ is $m/M$, so the maximum a posteriori (MAP) estimate of $\theta$ is also $m/M$, the same as the maximum likelihood estimate.
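The conjugacy claim is easy to check numerically. This sketch (same hypothetical counts, $m = 3$ hits in $M = 10$ at bats) evaluates the beta posterior on a grid and confirms the mode sits at $m/M$:

    m = 3    # hits
    M = 10   # at bats

    # posterior under the uniform Beta(1,1) prior is Beta(m + 1, M - m + 1)
    theta = seq(0.001, 0.999, by = 0.001)
    posterior = dbeta(theta, m + 1, M - m + 1)

    # grid approximation to the MAP estimate; equals m / M = 0.3
    theta[which.max(posterior)]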
Posterior Predictive Inference
Now let’s look at prediction. For concreteness, suppose we’ve observed M=10 at bats, of which m=3 were hits, for a batting average of 3/10 = 0.300. What we want to know is the likelihood of n hits in the next N=5 at bats.
There are six possible outcomes, 0, 1, 2, 3, 4, or 5 hits. Using the MLE or MAP point estimate, this distribution is $\mathrm{Binomial}(n \mid 5, 0.3)$.
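The point-estimate prediction is a one-liner in R; this sketch reproduces the MLE/MAP column of the table below:

    # probability of n = 0, 1, ..., 5 hits in N = 5 at bats,
    # plugging in the point estimate theta = 0.300
    dbinom(0:5, 5, 0.300)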
The Bayesian estimate requires integrating over the posterior, effectively averaging the binomial prediction for each ability by the ability’s probability. The posterior in this case is $\mathrm{Beta}(\theta \mid 4, 8)$, so the integral results in the well known beta-binomial distribution:

$p(n \mid N, m, M) = \int_0^1 \mathrm{Binomial}(n \mid N, \theta) \, \mathrm{Beta}(\theta \mid m + 1, M - m + 1) \, d\theta$
I solved the integral numerically using Monte Carlo integration, which is a fancy way of saying I took a bunch of samples from the beta posterior $\mathrm{Beta}(\theta \mid 4, 8)$, plugged them into $\mathrm{Binomial}(n \mid 5, \theta)$, and averaged the resulting predictions. It’s a few lines of R (see below). The beta-binomial may be expressed analytically, but that’s even more typing than the Monte Carlo solution. I’m sure there’s an R package that’d do it for me.
Comparing the Results
The important issue is to look at the outputs. Given that we’ve seen a batter get 3 hits in 10 at bats, here’s our prediction using a binomial model with uniform prior, shown with both the MLE/MAP point estimate and the average of the Bayesian predictions:
Point vs. Bayesian Estimates

| Hits | MLE/MAP | Bayesian |
|------|---------|----------|
| 0    | 0.168   | 0.180    |
| 1    | 0.360   | 0.302    |
| 2    | 0.309   | 0.275    |
| 3    | 0.132   | 0.165    |
| 4    | 0.028   | 0.064    |
| 5    | 0.002   | 0.012    |
The Bayesian estimate here is more dispersed (flatter, higher variance, higher entropy) than the MLE estimate. Of course, a true Bayesian estimate would provide a joint distribution over the outcomes 0, 1, 2, 3, 4, 5 (which lie on a simplex, because any 5 of the values determine the 6th) for downstream reasoning. The nice thing about passing uncertainty along by integration is that it’s very plug-and-play.
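To put numbers on “more dispersed,” here’s a sketch that computes the variance and entropy of the two predictive distributions straight from the (rounded) table values; the helper names are mine:

    hits  = 0:5
    point = c(0.168, 0.360, 0.309, 0.132, 0.028, 0.002)  # MLE/MAP column
    bayes = c(0.180, 0.302, 0.275, 0.165, 0.064, 0.012)  # Bayesian column

    # variance of the number of hits under each predictive distribution
    predVar = function(p) sum(p * hits^2) - sum(p * hits)^2
    c(point = predVar(point), bayes = predVar(bayes))

    # entropy in nats; higher means flatter, more uncertain
    predEntropy = function(p) -sum(p * log(p))
    c(point = predEntropy(point), bayes = predEntropy(bayes))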
Advanced Reading
Luckily for us baseball stats geeks, there’s some advanced reading hot off the press from Wharton Business School:
- Jensen, Shane T., Blake McShane, and Abraham J. Wyner. 2009. Hierarchical Bayesian Modeling of Hitting Performance in Baseball. Bayesian Analysis.
R Source
Here’s how I generated the table.
    # observed
    hits = 3
    atBats = 10

    # predicted
    nextAtBats = 5

    # y: MLE predictive distro
    average = hits/atBats
    y = rep(0,nextAtBats+1)
    for (i in 0:nextAtBats)
      y[i+1] = dbinom(i,nextAtBats,average)

    # z: Bayesian predictive distro
    numSamps = 20000
    z = rep(0,nextAtBats+1)  # initialize the accumulator
    for (k in 1:numSamps) {
      sampledAvg = rbeta(1,1+hits,1+atBats-hits)
      for (i in 0:nextAtBats)
        z[i+1] = z[i+1] + dbinom(i,nextAtBats,sampledAvg)
    }
    for (i in 0:nextAtBats)
      z[i+1] = z[i+1] / numSamps

    print("Point"); print(y, digits=3)
    print("Bayes"); print(z, digits=3)
February 2, 2016 at 12:38 pm
Hi, thank you for the post. But I got an error running your code: object ‘z’ not found.
Also, why did you want only one sample as the first argument in the line:
sampledAvg = rbeta(1,1+hits,1+atBats-hits)
February 2, 2016 at 2:45 pm
The reason is to get the proper predictive variance. If you took a bunch of draws and averaged them into a single ability before predicting, the resulting predictive distribution would understate the posterior variance. (See the sketch at the end of the comments.)
I wrote everything up in a much more sophisticated way statistically, and posted it as a knitr with examples running in Stan:
https://github.com/stan-dev/example-models/tree/master/knitr/pool-repeated-trials
I could mail the knitted HTML, but knitr’s very inefficient with the images and it’s something like 5MB.
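To illustrate the variance point in the reply above, here’s a sketch (same counts as the post; the object names are mine, chosen for illustration): predicting from each posterior draw and then averaging the predictions gives a wider predictive distribution than averaging the draws first and predicting once from that single number.

    hits = 3; atBats = 10; nextAtBats = 5; numSamps = 20000

    # integrate: predict from each posterior draw, then average the predictions
    draws = rbeta(numSamps, 1 + hits, 1 + atBats - hits)
    preds = sapply(draws, function(theta) dbinom(0:nextAtBats, nextAtBats, theta))
    zIntegrated = rowMeans(preds)

    # plug in: average the draws first, then predict once from that average
    zPlugIn = dbinom(0:nextAtBats, nextAtBats, mean(draws))

    # the plug-in version is visibly less dispersed
    round(rbind(zIntegrated, zPlugIn), 3)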