## Bayesian Estimators for the Beta-Binomial Model of Batting Ability

In my last post introducing Bayesian stats through the simplest non-trivial distribution, the binomial, I introduced moment-matching “empirical Bayes” point estimates of the beta priors. In this post, I’ll introduce the so-called “Bayesian estimator” point estimate for the beta priors. In addition, I’ll show why maximum a posteriori (MAP) estimates of batting average differ from the Bayesian estimates for batting average.

### Bayesian Estimators

Suppose we provide an estimate $\hat{\theta}$ for a parameter that has true value $\theta$. We define a loss function $\mathcal{L}(\hat{\theta},\theta)$, lower values of which indicate that $\hat{\theta}$ is a better estimate of $\theta$.

The Bayesian estimator minimizes the expected loss given a Bayesian posterior distribution $p(\theta|x)$ over parameter $\theta$:

$\theta^{\mathcal{L}} = \arg\min_{\theta} \mathbb{E}_{p(\theta|x)}[\mathcal{L}(\hat{\theta},\theta)]$

$= \arg\min_{\theta} \int_{\Theta} \mathcal{L}(\hat{\theta},\theta) \, p(\theta|x) \, d\theta$

### With Squared Loss, Bayesian Estimates are Posterior Means

The most common loss function, due to its convenience, is squared loss:

$\mathcal{L}(\hat{\theta},\theta) = L_2(\theta,\hat{\theta}) = (\theta - \hat{\theta})^2$

As stated, this assumes a single real-valued parameter; the natural extension is to square distance for parameter vectors.

With squared loss, the expected loss is minimized by the mean of the posterior. In other words, the Bayes estimator for a parameter under squared loss is the posterior mean of that parameter:

$\theta^{L_2} = \int_{\Theta} \theta \, p(\theta|x) \, d\theta$

### Beta-Binomial Batting Model

Our model for batting so far is very simple, with player $j$‘s ability being drawn from a beta prior with fixed hyperparameters $\alpha$ (prior hits plus 1) and $\beta$ (prior outs plus 1):

$\theta_j \sim \mbox{\sf Beta}(\alpha,\beta)$

The number of hits $n_j$ for player $j$ in $N_j$ at bats is drawn from a binomial sampling distribution:

$n_j \sim \mbox{\sf Binom}(\theta_j)$

The observed batting average is just $n_j/N_j$.

Recall that the binomial is defined by:

$\mbox{\sf Binom}(n|N,\theta) \propto \theta^n (1-\theta)^{N-n}$

and the beta prior by:

$\mbox{\sf Beta}(\theta|\alpha,\beta) \propto \theta^{\alpha} (1 - \theta)^{\beta}$

Given the convenient exponential form of the prior and likelihood function, the posterior is easily computed as:

$p(\theta_j|n_j,N_j,\alpha,\beta) \propto p(n_j|N_j,\theta_j) \, p(\theta_j|\alpha,\beta)$

$= \mbox{\sf Binom}(n_j|N_j,\theta_j) \, \mbox{\sf Beta}(\theta_j|\alpha,\beta)$

$\propto (\theta_j^{n_j}(1-\theta_j)^{N_j - n_j}) \, (\theta_j^{\alpha} (1 - \theta_j)^{\beta}$

$= \theta_j^{n_j + \alpha} \, (1 - \theta_j)^{N_j - n_j + \beta}$

Given this final form, which is proportional to a beta, we have

$\theta_j \sim \mbox{\sf Beta}(n_j+\alpha, N_j-n_j+\beta)$.

Thus the beta prior is conjugate to the binomial distribution, because when the prior over abilities is a beta density and the sampling distribution for the number of hits a binomial distribution, the posterior over abilities is also a beta density.

### Bayesian Estimator for Batting Ability

The posterior mean for batting ability $\theta_j$ for player $j$ is thus the mean of the posterior beta distribution. The mean of a beta distribution $\mbox{\sf Beta}(\alpha,\beta)$ is $\frac{\alpha}{\alpha+\beta}$. So for the posterior $\mbox{\sf Beta}(n_j+\alpha, N_j-n_j+\beta)$, the Bayes estimators for batting ability are just:

$\theta_j^{L_2} = \frac{n_j + \alpha}{N_j + \alpha + \beta}$

### Contrast with the MAP Estimate

Note that this is not the same estimate as the maximum a posteriori (MAP) estimate, because the mode (maximum point) and mean of a beta distribution are not the same. The MAP estimate is based on the mode,

$\theta_j^* = \frac{n_j + \alpha-1}{N_j + \alpha + \beta-2}$

### Diffuse Priors for the Beta Priors

In order to derive a Bayes estimate of the prior parameters $\alpha$ and $\beta$, we need to extend our model with a prior for the priors (as we said last time, it’s elephants all the way down). As is common in these priors-for-priors situations, we’ll put a very diffuse prior on the priors themselves.

For convenience, we’ll reparameterize the beta distribution using its mean $\alpha/(\alpha+\beta)$ and scale $\alpha + \beta$. Given the mean and scale, we can derive $\alpha$ and $\beta$ by:

$\alpha = (\alpha + \beta) (\frac{\alpha}{\alpha+\beta})$

$\beta = (\alpha + \beta) (1 - \frac{\alpha}{\alpha+\beta})$.

We then put priors directly on the mean and scale. Because the mean has a range of 0 to 1, we can use a uniform prior:

$\frac{\alpha}{\alpha+\beta} \sim \mbox{\sf Uniform}(0,1) = \mbox{\sf Beta}(1,1)$

Because the scale ranges from 0 to infinity, it’s impossible to have a (proper) uniform prior, because the integral would diverge. Instead, we use a Pareto prior,

$(\alpha + \beta) \sim \mbox{\sf Pareto}(1.5,1)$

The Pareto distribution $\mbox{\sf Pareto}(\gamma,k)$ is defined on the interval $[k,\infty)$ by

$\mbox{Pareto}(y|\gamma,k) \sim y^{-(\gamma+1)}$

In other words, the Pareto distribution is a power-law distribution with exponent $\gamma+1$. This is a fairly diffuse prior, slightly favoring smaller values of the scale.

### BUGS Code

Here’s the full model coded in BUGS:

model {
for (j in 1:J) {
ability[j] ~ dbeta(prior.hits,prior.outs)
hits[j] ~ dbin(ability[j],atbats[j])
}
prior.hits <- prior.avg * prior.atbats
prior.outs <- (1 - prior.avg) * prior.atbats
prior.avg ~ dbeta(1,1)
prior.atbats ~ dpar(1.5,1)
}


This model assumes J batters, with abilities $\theta_j$ coded as ability[j], hits $n_j$ as hits[j], at bats $N_j$ as atbats[j]. The prior parameter $\alpha$ is coded as prior.hits, and $\beta$ as prior.outs. The reparameterized prior mean $\alpha/(\alpha+\beta)$ is coded as prior.avg, and the reparameterized prior scale $\alpha+\beta$ as prior.atbats.

### Gibbs Sampling

With the model coded in BUGS, we can automatically run Gibbs sampling to collect a number of samples from the posterior distribution $p(\alpha,\beta,\theta|n,N)$ (we suppressed the conditioning on the hyperparameters).

We can then approximate the posterior marginal averages (to any number of decimal places) by taking the average of the samples for the respective variables.

It’s very easy to call BUGS from R using the >R2WinBUGS package. So we read the data into R, use it to call BUGS, then we collect the results back in R, where we can calculate their means.

data.tsv = scan("2006ALnonPitch.tsv")
N = length(data.tsv)/2
y = matrix(data.tsv, nrow=N, ncol=2, byrow=TRUE)
hits = y[,1]
atbats = y[,2]

library("R2WinBUGS")
data = list("atbats","hits","N")
parameters = c("ability","prior.avg","prior.atbats")

inits = function() {
list(ability=rep(N,0.30),
prior.hits=c(1),prior.outs=c(2),
prior.avg=c(1.0/3.0), prior.atbats=c(3))
}
anno = bugs(data, inits, parameters,
"c:/carp/devmitz/carp/baseball/src/baseballone/binomial.bug",
n.chains=3, n.iter=1000,
debug=TRUE,
clearWD=TRUE,
bugs.directory="c:\WinBUGS\WinBUGS14")

print(anno)
plot(anno)
attach.bugs(anno)


This presupposes that you’ve installed BUGS at the location specified by bugs.directory, have provided the BUGS program at the specified full path (I can’t get relative paths to work here), and have provided the data in the file 2006ALnonPitch.tsv (see the last section for a dump of the data in this file). The format of the tsv (tab-separated values) file is one player per line, with the number of hits, a tab, the number of at bats, and a newline.

The parameters specify three simultaneous Markov chains for 1000 iterations each. By default, BUGS will discard the first half of the iterations to remove noise from before the chains converge.

We initialize the ability values to $\theta_j = 0.300$, $\alpha=1$ and $\beta=2$. Better initialization helps models converge to the posterior samples more quickly, but aren’t even necessary for easy-to-fit models like these that run relatively quickly.

The data and parameters are specified. Data must be defined in R before calling BUGS, and parameters are returned as a two-dimensional array of samples (indexed by chain and sample number).

BUGS is not super-fast, taking about 15 minutes to take 3000 samples in 3 chains on my old notebook (it’s about twice as fast on my newish workstation).

### Bayesian Estimates

While I don’t have time to explain Gibbs sampling in detail in this context, let me just say that the three chains mixed very well with $\hat{R}$ values all having values very near 1.0.

The mean prior batting average is 0.271, and the mean prior at bats was 403. Note that this average is lower and the number of at bats higher than with the moment-matching estimates. I calculated these values in R after closing the BUGS window by:

> mean(prior.avg)
[1] 0.2709107
> mean(prior.atbats)
[1] 403.1252


This corresponds to beta density parameters $\alpha=109.2$ and $\beta=293.9$.

I’ll discuss more about the joint marginal average/at-bats posterior and the marginal ability posteriors in the next and (hopefully) final post in the series.

### Who Do We Think is Better?

Who do we think is a better batter, someone we’ve seen go 40/100 (a 0.400 average) or someone we’ve seen go 175/500 (a 0.350 average)?

For now, suppose we fix an empirical Bayes prior to the Bayesian estimate $\mbox{\sf Beta}(109.2,293.9)$, so that $\alpha=109.2$ and $\beta=293.9$ The Bayesian estimator (posterior mean) for a player’s ability $\theta$ when we observe the player getting get $n$ hits in $N$ at bats, is then the expected value of the posterior, which is:

$\theta^{L_2} = \frac{n+\alpha-1}{N + \alpha + \beta - 2}$

If we plug in $n=40, N=100$ we get an estimated ability of 0.295. If we plug in $n=175, N=500$, we get an estimated ability of 0.314! If we believe this model, we can’t just compare batting averages — the number of at bats is a critical element of our posterior estimate for low numbers of at bats. This is the effect of the prior, not the Bayesian inference per se. Of course, the scale of the prior being so large is the result of using the Bayesian estimator for the prior rather than the moment-matching estimate, so the effect is greater.

### In the Limit

In the limit, of course, the Bayesian estimate approaches the maximum likelihood estimate:

$\lim_{N \rightarrow \infty} \frac{n + \alpha + 1}{N + \alpha + \beta + 1} = \frac{n}{N}$

### Data from Baseball1

I gathered the data for the 2006 American League position players from Baseball1 (click on the “download database” menu item on the nav bar). It requires merging the player table (to get positions and exclude piters) and the batting table. For completeness, here’s the munged data:

161	558
156	555
17	89
103	333
10	39
22	60
0	2
3	20
41	183
57	201
2	8
9	30
16	74
65	237
8	35
54	185
121	423
82	365
159	546
5	22
3	9
12	56
143	491
147	506
1	13
12	45
2	13
1	4
43	175
11	39
155	596
3	14
59	251
46	212
141	556
2	14
55	251
3	10
114	393
54	193
76	267
10	40
129	460
31	154
60	229
0	8
27	110
2	3
16	60
6	11
0	1
0	9
15	99
3	20
6	27
8	55
5	29
165	482
37	136
37	164
47	193
1	11
43	146
2	10
170	603
16	70
61	268
16	43
14	49
5	12
11	50
8	24
103	413
1	6
103	352
29	130
27	115
0	1
190	655
30	115
67	286
2	7
53	220
11	45
10	27
40	126
136	450
59	246
91	371
18	65
33	131
0	3
159	569
12	79
76	333
21	87
7	24
102	373
35	127
181	521
83	385
145	491
82	358
75	234
3	32
8	40
6	27
146	569
190	592
53	234
18	87
4	13
10	98
147	461
72	243
52	172
110	364
5	25
181	581
62	224
161	604
5	18
140	454
109	413
97	365
0	6
1	4
14	76
51	156
2	17
158	557
1	7
89	348
21	82
11	50
3	39
18	75
28	103
122	444
24	96
62	189
138	541
91	279
110	441
133	459
157	591
18	80
17	87
53	222
30	99
3	13
132	494
154	520
87	294
56	235
17	81
59	226
10	41
185	611
51	174
4	16
1	5
52	197
23	80
217	691
177	628
164	547
166	572
91	320
70	308
33	130
131	437
68	218
65	230
51	221
45	132
7	49
183	600
151	527
17	64
136	461
9	44
80	325
224	695
5	28
137	524
174	543
62	251
14	67
38	176
138	502
8	42
4	17
22	104
163	552
74	255
7	36
77	343
69	209
214	648
9	33
6	36
138	501
170	539
117	485
43	152
12	48
165	557
6	37
5	50
118	420
83	337
139	448
129	465
214	623
113	446
71	320
14	56
200	607
1	9
171	607
150	509
160	558
7	40
89	314
73	270
155	557
113	401
2	10
156	591
102	350
32	153
163	548
166	620
56	217
13	65
181	626
111	474
1	13
79	290
89	289
5	28
61	179
117	430
163	573
99	388
173	584
43	172
137	542
109	463
37	156
141	490
58	260
181	572
7	27
34	169
97	351
9	61
18	79
83	276
0	4
36	156
45	184
39	164
86	268
89	354
128	463
0	2
28	128
126	466
38	146
177	593
76	236
177	566
7	45
59	220
154	544
2	9
10	46
144	449
87	365
102	381
31	161
4	19
89	296
169	593
161	563
171	624
194	620
25	121
18	88
95	343
103	389
42	178
45	151
12	63
74	279
82	288
88	344
56	211
54	225
123	433
123	451
136	540
21	95
152	543
5	27
46	197