Continuing our characterization of Bayesian inference, we now turn to the use of informative priors. In my previous post on Bayesian batting averages, we saw that even with a noninformative uniform prior, Bayesian binomial inference led to more dispersed predictive distributions than inference with the point maximum likelihood/MAP estimate, because the posterior averages over uncertainty.

### Distribution of Hits vs. At-Bats

Here’s the scatterplot of hits vs. at-bats for all 308 of the 2006 American League position players.

The red diagonal line is where someone with a league-average 0.275 batting average would fall (league average is the total number of hits for all players divided by the total number of at bats for all players). We’ll soon see the effect of the apparent correlation between at-bats and average (better players get to bat more often and higher up in the lineup).

### Model Refresher

Just so we’re all on the same page, recall that we draw player j‘s batting average θ_j ~ Beta(α, β), where α is the prior number of hits plus 1 and β is the prior number of outs plus 1. Player j‘s number of hits y_j in n_j at bats is then modeled as y_j ~ Binomial(n_j, θ_j). In words, each at bat has an independent chance θ_j of being a hit. We call θ_j player j‘s ability. Part of our inference problem is to characterize a player’s ability as a posterior distribution over abilities. The more at bats we’ve seen, the tighter (smaller variance) the posterior estimate will be.
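As a concrete sketch of this generative story (a minimal Python version; the prior counts and the 400 at bats are just illustrative inputs, not values from the post):

```python
import random

def simulate_player(prior_hits, prior_outs, at_bats, rng):
    """Draw an ability theta ~ Beta(prior_hits + 1, prior_outs + 1),
    then simulate at_bats independent at bats, each a hit with
    probability theta.  Returns (theta, hits)."""
    theta = rng.betavariate(prior_hits + 1, prior_outs + 1)
    hits = sum(rng.random() < theta for _ in range(at_bats))
    return theta, hits

# A player under the uniform prior (0 prior hits, 0 prior outs), 400 at bats:
theta, hits = simulate_player(0, 0, 400, random.Random(1))
```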

We are interested in inferences about the number of hits y′ in the next n′ at bats given an observation of y hits in n at bats along with the beta prior parameters α and β. The Bayesian estimate integrates over the posterior estimate of ability:

p(y′ | n′, y, n, α, β) = ∫ Binomial(y′ | n′, θ) · Beta(θ | y + α, n − y + β) dθ

### Simulation from Uniform Prior

Last time, we assumed a noninformative uniform prior on batting averages (interpreted as 0 prior hits and 0 prior outs). If we generate simulated data from this model, we get a plot that looks as follows (drawn to the same scale).
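We can reproduce the flavor of that simulation in a few lines of Python. The at-bat counts below are hypothetical stand-ins for the real ones; the point is the spread of the simulated averages:

```python
import random
import statistics

rng = random.Random(42)

# One simulated "season" for 308 players under the uniform Beta(1, 1)
# prior on ability.
averages = []
for _ in range(308):
    theta = rng.betavariate(1, 1)   # ability drawn uniformly on [0, 1]
    n = rng.randint(100, 600)       # hypothetical at-bat count
    hits = sum(rng.random() < theta for _ in range(n))
    averages.append(hits / n)

spread = statistics.stdev(averages)
# spread lands near the uniform's deviation of about 0.29, several times
# the empirical 0.0755 -- nothing like the observed league.
```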

Uh-oh. That’s nothing at all like what we observe in the league.

### “Empirical” Bayes

One step short of the fully Bayesian approach to uncertainty is what’s known as “empirical Bayes”. The term “empirical” here is in contrast to assuming a non-informative prior, not in contrast to fully Bayesian models, which are also empirical in this sense.

An empirical Bayes estimate involves a point estimate of the prior (in this case α and β) that’s used as a hyperparameter in the model (that is, a parameter that’s treated as fixed, like data, rather than estimated).

### Moment Matching Point Estimates

Often the point estimate is derived by a so-called moment-matching estimate. For moment-matching, it’s usual to choose an estimate that has the same mean (1st moment) and variance (2nd moment around the mean) as the empirical data, though other moments may be used.

Moment matching for the beta prior is straightforward. The mean (1st moment) of is

.

The prior scale (prior at bats plus 2) may be determined from the mean and variance (second moment around the mean), by

.
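In code, the moment match is a one-liner (a minimal Python sketch; with the rounded inputs below the scale comes out near 31.7, which matches the post’s 31.8 up to rounding):

```python
def beta_moment_match(mean, sd):
    """Solve for Beta(alpha, beta) with the given mean and deviation:
    scale = alpha + beta = mean * (1 - mean) / sd**2 - 1."""
    scale = mean * (1.0 - mean) / sd ** 2 - 1.0
    return mean * scale, (1.0 - mean) * scale

# All 308 players: sample mean 0.248, sample deviation 0.0755
alpha, beta = beta_moment_match(0.248, 0.0755)
scale = alpha + beta   # roughly 31.7 with these rounded inputs
```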

In our case, the sample mean of the player batting averages is 0.248 with a sample deviation (positive square root of the variance, σ) of 0.0755. Note that the mean of averages, 0.248, is less than the league average of 0.275. This bias is common in macro-averages such as macro-F measure for search or classifier evaluations.

In this case, moment matching leads to a prior mean parameter of 0.248 and a prior scale of 31.8. For convenience, we reparameterized the usual Beta(α, β) parameterization (where α − 1 is prior hits and β − 1 prior outs) with the prior average φ = α / (α + β) and prior scale κ = α + β, so that α = φκ and β = (1 − φ)κ. A sample drawn from this model is:

Close, but still not quite like what we observe. With this model, there’s too much variance in averages and also everything’s a bit too low compared to the empirical data.

The extra variance in the sampled season is due to equally weighting players with low numbers of at bats. As we know from the central limit theorem, averages based on smaller counts have higher variance (deviation declines inversely proportionally to √n, the number of at bats).
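The 1/√n scaling is easy to see numerically (a small sketch; the 0.275 ability is just the league average used for illustration):

```python
import math

# Deviation of an observed average over n at bats for true ability p:
# sqrt(p * (1 - p) / n), which shrinks like 1 / sqrt(n).
def average_sd(p, n):
    return math.sqrt(p * (1.0 - p) / n)

sd_10 = average_sd(0.275, 10)    # roughly 0.141
sd_400 = average_sd(0.275, 400)  # roughly 0.022
```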

If we only consider the 88 batters with 400 at bats or more, the mean average is 0.284, higher than the league average. The prior scale derived by moment matching from this subset’s mean and variance is 133.9. A sample looks as follows.

The variance is closer to the empirical variance, but still a bit too high. And now the predicted mean is wrong in the opposite direction, above the league average, rather than below. On the other hand, given the bias toward higher averages with more at bats, this curve predicts the high end of the distribution better. Not surprising given that’s where it was estimated from.

(In a better model, we’d probably want to take number of at bats into account as a predictor in a regression-style model rather than use a simple beta-binomial setup.)

### Posterior Inferences

Now we can just plug that estimate into our example from last time. What if someone gets 3 out of 10 hits? Here’s the new table with the two empirical Bayes posterior predictions added.

**Point vs. Bayesian Estimates for Observed 3 for 10 Batter**

| Hits | MLE | NonInf Bayes | Empir. Bayes, All | Emp. Bayes, 400 AB |
|------|-------|--------------|-------------------|--------------------|
| 0 | 0.168 | 0.180 | 0.239 | 0.192 |
| 1 | 0.360 | 0.302 | 0.373 | 0.368 |
| 2 | 0.309 | 0.275 | 0.262 | 0.292 |
| 3 | 0.132 | 0.165 | 0.103 | 0.120 |
| 4 | 0.028 | 0.064 | 0.022 | 0.025 |
| 5 | 0.002 | 0.012 | 0.002 | 0.002 |
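The posterior predictive here is beta-binomial, so the table entries can be computed in closed form. A minimal Python sketch (the prior parameters are the moment-matched values from the text):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """P(k hits in n future at bats), integrating the ability over
    a Beta(a, b) distribution."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

# Posterior after observing 3 hits in 10 at bats is Beta(3 + a, 7 + b).
# Noninformative column: uniform prior a = b = 1.
noninf = [beta_binomial_pmf(k, 5, 3 + 1, 7 + 1) for k in range(6)]
# Empirical Bayes (all players): prior mean 0.248, prior scale 31.8.
a0, b0 = 0.248 * 31.8, 0.752 * 31.8
empir = [beta_binomial_pmf(k, 5, 3 + a0, 7 + b0) for k in range(6)]
# noninf[1] comes out near 0.302 and empir[0] near 0.239, as in the table
```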

I’ve been kind to the MLE and non-informative Bayes estimates by taking a fairly likely number of hits (3) for the number of at bats (10). The predictions are even more varied if you consider a batter going 5/10 or 1/10, neither of which is that unusual. Here, the high prior counts will dominate the estimates. Here’s the 5/10 case:

**Point vs. Bayesian Estimates for Observed 5 for 10 Batter**

| Hits | MLE | NonInf Bayes | Empir. Bayes, All | Emp. Bayes, 400 AB |
|------|-------|--------------|-------------------|--------------------|
| 0 | 0.031 | 0.058 | 0.174 | 0.175 |
| 1 | 0.156 | 0.173 | 0.342 | 0.358 |
| 2 | 0.313 | 0.269 | 0.298 | 0.303 |
| 3 | 0.313 | 0.269 | 0.143 | 0.132 |
| 4 | 0.156 | 0.174 | 0.038 | 0.030 |
| 5 | 0.031 | 0.058 | 0.004 | 0.003 |

Comparing these two tables makes it evident how the priors pull the estimates toward the league averages. The higher the prior at-bat count, the stronger the pull. If we up the count to 30/100 at bats, the data begins to dominate and pulls all the estimates closer together.

**Point vs. Bayesian Estimates for Observed 30 for 100 Batter**

| Hits | MLE | NonInf Bayes | Empir. Bayes, All | Emp. Bayes, 400 AB |
|------|-------|--------------|-------------------|--------------------|
| 0 | 0.168 | 0.180 | 0.189 | 0.182 |
| 1 | 0.360 | 0.352 | 0.366 | 0.365 |
| 2 | 0.309 | 0.304 | 0.294 | 0.299 |
| 3 | 0.132 | 0.138 | 0.122 | 0.125 |
| 4 | 0.028 | 0.033 | 0.026 | 0.026 |
| 5 | 0.002 | 0.003 | 0.002 | 0.002 |

After 100 observed at bats, the differences between the various Bayesian estimates begin to be dampened. The differences would be a bit greater if we’d chosen a 20/100 batter, who’d have been further away from the prior means.

### Censoring and Moment Matching are Ad Hoc

The problem here is that the “empirical” Bayes estimates based on moment matching and censoring (removing items from) the data are ad hoc. If we truncate the data to try to reduce the variance, we lose the contribution of the low-count players to the averages. Also, as we truncate, the prior average changes, which is not desirable. Finally, there’s no reason to believe matching moments makes sense for binomial data based on varying numbers of at bats.

In the next post, on our way to fully Bayesian inference, we’ll consider the so-called Bayes estimator based on squared loss, which provides a somewhat more principled point estimate for empirical Bayes.

December 1, 2009 at 4:25 am

When you derived your empirical Bayes estimates, you used the sample variance. Wouldn’t it be more accurate to use the population variance (which you can estimate using the random variance in binomials)? Although finding the population variance in a population with a wide range of observations (ABs) isn’t so straightforward…

December 2, 2009 at 2:39 am

I’m not sure how to estimate that other than with the methods I used in the followups. How can I calculate population variance with the set of hits and at-bats I have?

March 31, 2011 at 10:12 am

one way of finding a beta prior is to write a script to randomly select a bunch of beta curves and then simulate 10,000 “seasons” from each one where every batter’s hits (s) and outs (f) are simulated binomially according to his at-bats (n). then you measure the average spread (you can use sd, var, or probably most preferably median absolute deviation to temper the outliers) over the 10,000 seasons and see which beta curve resulted in the spread most similar to the empirical spread of batting averages.

jim albert taught me that.

March 31, 2011 at 12:23 pm

I’d really rather just put a prior on the beta parameters and then sample from the posterior. It fits in nicely with the rest of Bayesian inference that way. Jim Albert’s book on Monte Carlo methods in R gives you some idea of how to sample from the posterior given some kind of prior on the Beta parameters, or you could just use BUGS as I do in the follow-up to this post.

Won’t it be hard to sample the beta parameters densely enough to get a good estimate in the method you describe? I’m thinking that you probably don’t need to sample anyway, because you can analytically compute statistics like means and variances given that you have a beta-binomial form for the sampling distributions. So you could probably just optimize if you want a point estimate based on the kind of matching you suggest.