Author Archive

Becky’s and My Annotation Paper in TACL

October 29, 2014

Finally something published after all these years working on the problem:

Rebecca J. Passonneau and Bob Carpenter. 2014. The Benefits of a Model of Annotation. Transactions of the Association for Comptuational Linguistics (TACL) 2(Oct):311−326. [pdf(Presented at EMNLP 2014.)

Becky just presented it at EMNLP this week. I love the TACL concept and lobbied Michael Collins for short reviewer deadlines based on the fact that (a) I saw the bioinformatics journals could do it and (b) I either do reviews right away or wait until I’m nagged, then do them right away — they never take even three weeks of real work. The TACL editors/reviewers really were quick, getting responses to us on the initial paper and revisions in under a month.

What you Get

There’s nothing new in the paper technically that I haven’t discussed before on this blog. In fact, the model’s a step back from what I consider best practices, mainly to avoid having to justify Bayesian hierarchical models to reviewers and hence avoid reviews like the last one I got. I’d highly recommend using a proper hierarchical model (that’s “proper” in the “proper English breakfast” sense) and using full Bayesian inference.

One of the main points of the paper is that high kappa values are not necessary for high quality corpora and even more surprisingly, high kappa values are not even sufficient.

Another focus is on why the Dawid and Skene type models that estimate a full categorical response matrix per annotator are preferable to anything that relies on weighted voting (hint: voting can’t adjust for bias).

We also make a pitch for using expected information gain (aka mutual information) to evaluate quality of annotators.

The Data

The paper’s very focused on the Mechanical Turk data we collected for word sense. That data’s freely available as part of the manually annotated subcorpus (MASC) of the American national corpus (ANC). That was 1000 instances of roughly 50 words (nouns, adjectives and verbs) and roughly 25 annotations per instance, for a total of around 1M labels. You can get it in easily digestible form from the repo linked below.

Open-Source Software and Data

I wrote a simple EM-based optimizer in R to perform maximum penalized likelihood (equivalently max a posteriori [MAP]) estimates. It’s not super fast, but it can deal with 25K labels over 200 annotators and 1000 items in a minute or so. The code is available from a

GitHub Repo: bob-carpenter/anno.

It includes a CSV-formatted version of all the data we collected in case you want to do your own analysis.

We collected so many labels per item so that we could try to get a handle on item difficulty. We just haven’t had time to do the analysis. So if you’re interested or want to collaborate, let me know (

High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gather NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values — it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

Kappa for varying prevalences and accuracies

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy run from 0 to 1 and kappa from 0 to 1. Here’s the whole plot for comparison.

Kappa for varying prevalences and accuracies

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definitions of kappa, to result in the following definition in R to compute expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracies.

kappa_fun = function(prev,acc) {
  agr = acc^2 + (1 - acc)^2;
  cat1 = acc * prev + (1 - acc) * (1 - prev);
  e_agr = cat1^2 + (1 - cat1)^2;
  return((agr - e_agr) / (1 - e_agr));

Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

  Cat1 Cat2
Cat1 170 90
Cat2 90 650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus one minus the accuracy squared (both annotators are wrong and hence agree — two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus one minus accuracy times one minus prevalence (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the prevalences. So that’s just the probability of category 1 squared (both annotators respond category 1) and the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.

Back to the plot. The white border is set at .66, the lower-end threshold established by Krippendorf for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorf was 0.8, and is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
  for (n in 1:(K + 1)) {
    prevalence[pos] = (m - 1) / K;
    accuracy[pos] = (n - 1) / K;
    kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
    pos = pos + 1;
df = data.frame(prevalence=prevalence,
kappa_plot = 
  ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
     labs(title = "Kappas for Binary Classification\n") +
     geom_tile() +
                        limits =c(0.5,1)) +
                        limits=c(0.85,1)) +
     scale_fill_gradient2("kappa", limits=c(0,1), midpoint=0.66,
                          low="orange", mid="white", high="blue",

Summing Out Topic Assignments in LDA

April 14, 2011

[Update: 20 April 2011. I replaced my original (correct, but useless) derivation with the derivation Matt Hoffman provided in the comments.]

In a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In fact, we could go all the way to deriving the probability of each word’s topic assignment given all other parameters, thus defining a natural collapsed Gibbs sampling strategy.

In this post, I’ll show how to turn this around, summing out the discrete parameters in order to produce a fully continuous density. The nice feature about such a density is that it’s amenable to Hamiltonian Monte Carlo methods. These may actually mix faster than the collapsed Gibbs sampler. Now that we have a working HMC implementation, it’ll be easy to test.

It’s always pretty straightforward to sum out discrete parameters. For instance, it’s commonly encountered in the textbook mixture model in the E step of EM (e.g., for soft K-means clustering or semi-supervised learning). It’s all based on the sum law, which says that if p(a,b) is discrete in b and b can take on values 1..N, then

p(a) = \sum_{b=1}^{N} p(a,b).

All we need to do is repeatedly apply this rule to each of LDA’s topic assignments to tokens.

Latent Dirichlet Allocation

The LDA model involves a combination of Dirichlet and multinomial distributions. We observe the words w_{d,i} for each document d \in 1{:}D, where document d is of length I_d. We let V be the set of all words observed. We assume there are K topics. Next, there are multinomial parameters \theta_d of dimension K for the distribution of topics in document d. We also assume multinomial parameters \phi_k of dimension V for the distribution of words in topic k. For simplicity, we assume fixed Dirichlet hyperpriors \alpha and \beta of dimensionality K and V over the multinomial parameters \theta_d and \phi_k.

The full probability model p(\theta,\phi,z,w|\alpha,\beta) is defined by

\theta_d \sim \mbox{\sf Dirichlet}(\alpha)

\phi_k \sim \mbox{\sf Dirichlet}(\beta)

z_{d,i} \sim \mbox{\sf Discrete}(\theta_d)

w_{d,i} \sim \mbox{\sf Discrete}(\phi_{z[d,i]})

Summing out the Topic Assignments

In our previous post, we showed how to simplify the expression for the distribution of topic assignments given words and hyperpriors, p(z|w,\alpha,\beta). Here, we are going to derive p(\theta,\phi|w,\alpha,\beta), the density of the document distributions and topic distributions given the words and hyperpriors.


{} \propto p(\theta,\phi,w|\alpha,\beta)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times p(w|\alpha,\beta)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} p(w_{d,i}|\theta_d, \phi)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z_{d,i} = 1}^K p(w_{d,i},z_{d,i}|\theta_d, \phi)

Next, by repeated application of the sum rule for total probability, we get

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i] = 1}^K p(z_{d,i}|\theta_d) \times p(w_{d,i}|\phi_{z[d,i]})

Expanding out the definitions, we are left with a nice tight expression that’s easy to calculate:


{} \propto \prod_{n=1}^N \mbox{\sf Dir}(\theta_n|\alpha)

{} \times \prod_{k=1}^K \mbox{\sf Dir}(\phi_k|\beta)

{} \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]}) .

Of course, we’ll probably be doing this on the log scale, where

\log p(\theta,\phi|w,\alpha,\beta)

{} = Z_{w,\alpha,\beta}

{} + \sum_{d=1}^D \log \mbox{\sf Dir}(\theta_d|\alpha)

{} + \sum_{k=1}^K \log \mbox{\sf Dir}(\phi_k|\beta)

{} + \sum_{d=1}^D \sum_{i=1}^{I_d} \log \left( \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]}) \right)

where Z_{w,\alpha,\beta} is a normalizing term that we can ignore for most optimization, expectation or sampling-based methods. Note that the \Gamma functions inside the Dirichlet terms go into the Z_{w,\alpha,\beta} term if we leave the priors \alpha and \beta constant.

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators would try hard, but the bias in their answers would give them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotator’s accuracies does just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, correspdonding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratios have no effect because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional the sum of the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}) and the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators who have \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain because as we showed in the last section, p(c_i|y_{i,j}) = p(c_i) if annotator j provides random responses, then \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalities for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose have have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice after L_{1,1}, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

July 13, 2010

[Update: 1 February 2014. David Bammam points out that there’s a mistake just before equation (62). The correct formula should be

\Gamma(x + q) = \Gamma(x) \times \prod_{i=1}^q (x + i - 1).

This has implications going forward replacing i with i - 1 which I don’t have time to change right now.]

[Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led to me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia’s calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

Inferring Splice-Variant mRNA Expression from RNA-Seq

February 5, 2010

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

The Model

The data from which we infer expression is

  • K \in \mathbb{N}^{+}, the number of splice variants,
  • N \in \mathbb{N}^{+}, the number of reads, and
  • y_1,\ldots,y_N, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

  • \varphi, genetic variation, sample preparation, and sequencing parameters, and
  • \alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}, the prior read counts (plus one).

The general-purpose parameter \varphi reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to end of the read, and sensitivity to context such as the previously read gene (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base).

We will infer two of the model parameters,

  • t_1,\ldots,t_N \in 1:K, the mapping of read to splice variant, and
  • \theta_1,\ldots,\theta_K \in [0,1] such that \sum_{k=1}^K \theta_k = 1, the read expression probabilities.

The model structure in sampling notation is

  • \theta \sim \mbox{\sf Dirichlet}(\alpha)
  • t_n \sim \mbox{\sf Discrete}(\theta) for n \in 1:N
  • y_n \sim \mbox{\sf Channel}(t_n,\varphi) for n \in 1:N.

We select the expression levels \theta based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings t_n based on the expression level parameter \theta. This is effectively a beta-binomial model of expression at the corpus level.

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y_n being observed given it arose from the splice variant t_n under the sequencing model parameterized by \varphi.

Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that \alpha_k = 1 for k \in 1:K, which renders \mbox{\sf Dirichlet}(\alpha) a uniform distribution over possible values of \theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate \alpha.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple reads, we will treat them all as equally likely; in a more articulated model, we would take into account both sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read y_n is mapped to splice variant t_n \in 1:K, we set

\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n)).

The denominator \mbox{len}(t_n) - \mbox{len}(y_n) reflects the number of positions a sequence of the given read length \mbox{len}(y_n) may occur in an mRNA of length \mbox{len}(t_n).

A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
    theta[1:K] ~ ddirch(alpha[1:K])
    for (n in 1:N) {
        t[n] ~ dcat(theta[1:K])
        y[n] ~ dcat(rho[t[n],1:K])

BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.


data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
        n.chains=3, n.iter=50000,
        debug=TRUE, clearWD=TRUE,"c:\\WinBUGS\\WinBUGS14",

The other debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute DIC.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C

There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and varient 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
# 2       4  2 | 6     6/19
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   18

There are three samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta (\theta) is defined by reads, not by transcripts.

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

Coding the Example in R

We set the Dirichlet prior to be uniform, whihc also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)

We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)

That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),

I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).

The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The \hat{R} values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

Bayeisan Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable \theta_1.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52

The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53

Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
 0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
 0.42  0.49  0.52  0.56  0.61

With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here’s the two-way interaction plots:

two-way plots of RNA-seq expression 1 vs 2

two-way plots of RNA-seq expression 1 vs 3

two-way plots of RNA-seq expression 2 vs 3

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:


and for the ordinary plots:

plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")

plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")

plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")

Hierarchical Bayesian Batting Ability, with Multiple Comparisons

November 4, 2009

Just in time for the last game(s) of this year’s World Series, the final installment of my example of Bayesian stats using baseball batting ability. I took us off-topic (text analytics) with a general intro to Bayesian stats (What is “Bayesian” Statistical Inference?). As a first example, I compared Bayesian calculations of binomial posteriors with point estimates (Batting Averages: Bayesian vs. MLE Estimates). In exploring the question of “where do priors come from?”, I started with simple non-Bayesian point estimates (Moment Matching for Empirical Bayes Beta Priors for Batting Averages). Finally, I realized I’d meant “ability” where I’d said “average” (former is a latent parameter, latter is a statistic calculated from at bats), when I considered Bayesian point estimates (Bayesian Estimators for the Beta-Binomial Model of Batting Ability).

Hierarchical Model Recap

For batter j, the number of hits n_j in N_j at bats is modeled as a binomial,

n_j \sim \mbox{\sf Binom}(\theta_j,N_j) for j \in 1:J,

where the ability, or chance of getting a hit, for batter j is \theta_j. Ability is modeled as having a beta distribution

\theta_j \sim \mbox{\sf Beta}(\alpha,\beta) for j \in 1:J

with prior number of hits \alpha-1 and prior number of outs \beta-1. These parameters, which act as priors for the for the binomial parameter, are themselves given priors. The mean of the beta, \alpha/(\alpha+\beta), is given a uniform prior,

\alpha/(\alpha+\beta) \sim \mbox{\sf Beta}(1,1).

The prior for the scale \alpha+\beta is modeled by a Pareto distribution, which has probability proportional to 1/(\alpha+\beta)^{2.5},

\alpha + \beta \sim \mbox{Pareto}(1.5).


I used the 2006 AL position player data (given in this previous blog post). That fixes the number of players J and the hits n_j and at-bats N_j for each player j \in 1:J.

I then used BUGS encoding of the model above (as shown in this previous post). BUGS automaticaly implements Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC) inference. I ran 3 chains of 1000 samples each, retaining the last 500 samples, for 1500 posterior samples total. All parameters had \hat{R} values very near 1, indicating convergence of the Markov chains. As usual, we read off statistics from the samples and used the sampled values for inference, where they allow the integrals involved in Bayesian inference (as descirbed in this blog post) to be computed.

Beta Posterior

We can inspect the posterior for the beta mean \alpha/(\alpha+\beta) and scale \alpha+\beta parameters. We plot them as a 2D scatterplot with their means picked out as lines.

beta parameters posterior

As usual, the scale is much harder to estimate than the mean.

Ability Posteriors

We can also plot the ability for particular players. Here’s histograms of sampled posteriors for the players with the best average posterior ability estimates, with their hits and at-bats provided as labels above the plot:

Notice how the variance of the posterior is determined by the number of at bats. The player with 60 at bats has a much wider posterior than the player with 695 at bats.

Multiple Comparisons: Who’s the Best Batter?

We’re going to do what Andrew, Jennifer and Masanao suggest, which is to use our hierarchical model to make a posterior comparison about player’s ability. In particular, we’ll estimate the posterior probabability that a particular player is the best player. We simply look at all 1500 posterior samples, which include ability samples as plotted above, and count how many times a player has the highest ability in a sample. Then divide by 1500, and we’re done. It’s a snap in R, and here’s the results, for the same batters as the plot above:

Average At-Bats Pr(best ability)
.367 60 0.02
.342 482 0.08
.347 521 0.12
.322 695 0.02
.330 648 0.04
.343 623 0.11
.330 607 0.04

The .347 batter with 521 at bats not only has the highest estimated chance in our model of having the highest ability, he also won the batting crown (Joe Mauer of Minnesota). The beta-binomial hierarchical model estimate is only 12% that this batter has the highest ability. The estimate is very close for the .343 batter with 623 at bats (Derek Jeter of the Yankees). [It turns out the race to the batting crown came down to the wire.]

The larger number of at bats provides more evidence that the batter has a higher ability than average, thus pulling the posterior ability estimate further away from the prior mean. Finally, note that we’re assigning some chance that the .367 batter with only 60 at bats is best in the league. That’s because when the samples are on the high side of the distribution, this batter’s best.

Bayesian Counterpart to Fisher Exact Test on Contingency Tables

October 13, 2009

I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.

Contingency Table

I’ll use the data in the example in the Wikipedia article on contingency tables:

Left-Handed Right-Handed TOTAL
Male 9 (y1) 43 52 (n1)
Female 4 (y2) 44 48 (n2)
Total 13 87 100


The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.

Beta-Binomial Model

The observed data has y_1 = 9 left-handed men of a total of n_1 = 52 men and y_2 = 4 left-handed women of a total of n_2 = 48 women.

Letting \theta_1 \in [0,1] be the proportion of left-handed men and \theta_2 \in [0,1] the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,

y_1 \sim \mbox{\sf Binom}(n_1,\theta_1), and

y_2 \sim \mbox{\sf Binom}(n_2,\theta_2).

He then suggested simple uniform priors on the proportions, which we know are equivalent to \mbox{\sf Beta}(1,1) priors:

\theta_1 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1), and

\theta_2 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1).

Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:

p(\theta_1|y_1,n_1) = \mbox{\sf Beta}(\theta_1|y_1+1, n_1-y_1+1), and

p(\theta_2|y_2,n_2) = \mbox{\sf Beta}(\theta_2|y_2+1, n_2-y_2+1)

We could try to compute the posterior density of \delta = \theta_1 - \theta_2 analytically by solving the integral

p(\delta|y,n) = \int_{-\infty}^{\infty} \mbox{\sf Beta}(\theta|y_1+1,n_1-y_1+1) \mbox{\sf Beta}(\theta-\delta|y_2+1,n_2-y_2+1) d\theta

and then evaluating \mbox{Pr}[\delta > 0].

Instead, we’ll just follow Andrew’s sage advice and estimate the integral by simulation. That is, we’ll take M samples from the joint posterior

p(\theta_1,\theta_2|y_1,n_1,y_2,n_2) = p(\theta_1|y_1,n_1) \times p(\theta_2|y_2,n_2),

which we write as

(\theta_1^{(1)},\theta_2^{(1)}), (\theta_1^{(2)},\theta_2^{(2)}), (\theta_1^{(3)},\theta_2^{(3)}), \ldots, (\theta_1^{(M)},\theta_2^{(M)})

and then use the Monte Carlo approximation,

\mbox{Pr}[\theta_1 > \theta_2]

\approx \frac{1}{M} \ \sum_{m=1}^{M} \mathbb{I}(\theta_1^{(m)} > \theta_2^{(m)}),

where \mathbb{I}() is the indicator function, taking on value 1 if its property is true and 0 otherwise.

In words, we estimate the probability that \theta_1 is greater than \theta_2 by the fraction of samples m \in M where \theta_1^{(m)} > \theta_2^{(m)}.

R Code

I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.

n1 = 52 # men
y1 = 9  # left-handed men
n2 = 48 # women
y2 = 4  # left-handed women

I = 10000 # simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1)  
theta2 = rbeta(I, y2+1, (n2-y2)+1)
diff = theta1-theta2  # simulated diffs

quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995))

print("Probability higher % men than women lefties:")

     xlab="theta1 - theta2",
     ylab="p(theta1 - theta2 | y, n)",
     main="Posterior Simulation of Male - Female Lefties",
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")

Running this code prints out:

> source("twoByTwoPosterior.R")
  0.5%   2.5%    50%  97.5%  99.5% 
-0.091 -0.046  0.084  0.218  0.262 

[1] "Probability higher % men than women lefties:"
[1] 0.901

[1] "Mean difference theta1-theta2"
[1] 0.08484979

The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The posterior central 95% quantile is thus (-0.046,0.218). Quantiles get the obvious interpretation in Bayesian stats — given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218).

The second number printed is the posterior probability that men have a higher percentage of left-handness than women, which is 0.901, or about 90%. R makes this simple by letting us write mean(theta1 > theta2). This is also the value you’d get from integrating the posterior density from 0 to \infty.

The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. This value is 0.0848, which is a bit higher than the median difference of 0.084.

The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference p(\theta_1 - \theta_2 | n, y).


Don’t ask. You’re a Bayesian now.

Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of lefthandedness among men than women, or that there’s a 95% chance the difference between male prevalence of lefthandness and female prevalence of lefthandedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]

Batting Averages: Bayesian vs. MLE Estimates

September 11, 2009

It’s American baseball season, and I’ve been itching to do more baseball stats. So in the interest of explaining Bayesian posterior inference with an example…

Batting Averages

A player’s batting average is his number of “hits” divided by his number of “at bats“.

Binomial Model of Batting

For simplicitly, we’ll model a sequence of N at bats using a binomial distribution with a \theta probability of getting a hit in each at bat. This assumes that the sequence of at bats is independent and identically distributed, aka iid.

Of course this is wrong. All models are wrong. It’s just that this one’s more wrong than usual, despite the popularity of the batting average as a summary statistic. (See the second last section for a reference to a better model.)

Maximum Likelihood Estimate (MLE)

Given a sample (aka training data) consisting of a player with m hits in M at bats (where m > 0,  (M-m) > 0), the maximum-likelihood estimate (MLE) of the ability \theta is m/M, aka the player’s batting average.

Uniform Prior

We assume a uniform (aka noninformative) prior distribution over abilities.

This is also wrong. We know major league position players don’t bat under 0.200 or they’d be fired, and no one’s had a batting average of over 0.400 since Ted Williams.

But we’ll have to come back to issues with the prior later. This post is just to clarify what integration over the posterior does for inference, not what an informative prior does.

Bayesian Posterior Estimate

The beta prior is conjugate to the binomial. Because the uniform distribution may be expressed as a beta distribution, \mbox{\sf Beta}(1,1), the posterior is a beta distribution. Because the beta and binomial play nicely together, the posterior ability distribution p(\theta|m,M) may be expressed analytically as \mbox{\sf Beta}(1+m, 1+M-m). The maximum of \mbox{\sf Beta}(1+m,1+M-m) is m/M, so the maximum a posteriori (MAP) estimate of \theta is also m/M, the same as the maximum likelihood estimate.

Posterior Predictive Inference

Now let’s look at prediction. For concreteness, suppose we’ve observed M=10 at bats, of which m=3 were hits, for a batting average of 3/10 = 0.300. What we want to know is the likelihood of n hits in the next N=5 at bats.

There are six possible outcomes, 0, 1, 2, 3, 4, or 5 hits. Using the MLE or MAP point estimate, this distribution is \mbox{\sf Binomial}(n|5,0.300).

The Bayesian estimate requires integrating over the posterior, effectively averaging the binomial prediction for each ability by the ability’s probability. The posterior in this case is \mbox{\sf Beta}(3+1,10-3+1)=\mbox{\sf Beta}(4,8), so the integral results in the well known beta-binomial distribution:

\mbox{\sf BetaBinomial}(n|5, 4,8) = \int_0^1 \mbox{\sf Binomial}(n|5,\theta) \mbox{\sf Beta}(\theta|4,8) d\theta

I solved the integral numerically using Monte Carlo integration, which is a fancy way of saying I took a bunch of samples \theta_n from the beta posterior \mbox{\sf Beta}(4,8) and plugged them into \mbox{\sf Binomial}(5,\theta_n) and averaged the resulting predictions. It’s a few lines of R (see below). The beta-binomial may be expessed analytically, but that’s even more typing than the Monte Carlo solution. I’m sure there’s an R package that’d do it for me.

Comparing the Results

The important issue is to look at the outputs. Given that we’ve seen a batter get 3 hits in 10 at bats, here’s our prediction using a binomial model with uniform prior, shown with both the MLE/MAP point estimate and the average of the Bayesian predictions:

Point vs. Bayesian Estimates
Hits MLE/MAP Bayesian
0 0.168 0.180
1 0.360 0.302
2 0.309 0.275
3 0.132 0.165
4 0.028 0.064
5 0.002 0.012

The Bayesian estimate here is more dispersed (flatter, higher variance, higher entropy) than the MLE estimate. Of course, a true Bayesian estimate would provide a joint distribution over outcomes 0,1,2,3,4,5 (which reduce to a simplex, because any 5 values determine the 6th) for downstream reasoning. The nice thing about passing uncertainty along by integration is that it’s very plug-and-play.

Advanced Reading

Luckily for us baseball stats geeks, there’s some advanced reading hot off the press from Wharton Business School:

R Source

Here’s how I generated the table.

# observed
hits = 3
atBats = 10

# predicted
nextAtBats = 5

# y: MLE predictive distro
average = hits/atBats
y = rep(0,nextAtBats+1);
for (i in 0:nextAtBats) y[i+1] = dbinom(i,nextAtBats,average)

# z: Bayesian predictive distro
numSamps = 20000
for (k in 1:numSamps) {
    sampledAvg = rbeta(1,1+hits,1+atBats-hits)
    for (i in 0:nextAtBats)
        z[i+1] = z[i+1] + dbinom(i,nextAtBats,sampledAvg)
for (i in 0:nextAtBats)
    z[i+1] = z[i+1] / numSamps

print("Point"); print(y, digits=3)
print("Bayes"); print(z, digits=3)

What is “Bayesian” Statistical Inference?

September 9, 2009

Warning: Dangerous Curves This entry presupposes you already know some math stats, such as how to manipulate joint, marginal and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). You should also be familiar with the difference between model parameters (e.g. a regression coefficient or Poisson rate parameter) and observable samples (e.g. reported annual income or the number of fumbles in a quarter of a given American football game).


I followed up this post with some concrete examples for binary and multinomial outcomes:

Bayesian Inference is Based on Probability Models

Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.

What’s a Prior?

The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.

Non-Bayesian Statistics

Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as definining the probability that a model parameter (such as the mean height of an adult male American) is in a defined range say (say 5’6″ to 6’0″).

Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.

Bayesian Technical Apparatus

The sampling distribution, with probability function p(y|\theta), models the probability of observable data y given unobserved (and typically unobservable) model parameters \theta. (The sampling distribution probability function p(y|\theta) is called the likelihood function when viewed as a function of \theta for fixed y.)

The prior distribution, with probability function p(\theta), models the probability of the parameters \theta.

The full joint distribution over parameters and data has a probability function given by the chain rule,

p(y,\theta) = p(y|\theta) \times p(\theta).

The posterior distribution gives the probability of parameters \theta given the observed data y. The posterior probability function p(\theta|y) is derived from the sampling and prior distributions via Bayes’s rule,


{} = \frac{\displaystyle p(y,\theta)}{\displaystyle p(y)}       by the definition of conditional probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\displaystyle p(y)}       by the chain rule,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y,\theta') \, d\theta'}       by the rule of total probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y|\theta') \times p(\theta') \, d\theta'}       by the chain rule, and

{} \propto p(y|\theta) \times p(\theta)       because y is constant.

The posterior predictive distribution p(\tilde{y}|y) for new data \tilde{y} given observed data y is the average of the sampling distribution defined by weighting the parameters proportional to their posterior probability,

p(\tilde{y}|y) = \int_{\Theta} \, p(\tilde{y}|\theta) \times p(\theta|y) \, d\theta.

The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. In particular, the posterior is an overdispersed variant of the sampling distribution. The extra dispersion arises by integrating over the posterior.

Conjugate Priors

Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribuiton, the predictive posterior distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the predictive posterior distribution is negative-binomial, the overdispersed form of the Poisson.

Point Estimate Approximations

An approximate alternative to full Bayesian inference uses p(\tilde{y}|\hat{\theta}) for prediction, where \hat{\theta} is a point estimate.

The maximum \theta^{*} of the posterior distribution defines the maximum a posteriori (MAP) estimate,

\theta^{*} = \arg\max_{\theta} p(y|\theta) p(\theta).

If the prior p(\theta) is uniform, the MAP estimate is called the maximum likelihood estimate (MLE), because it maximizes the likelihood function p(y|\theta). Because the MLE does not assume a proper prior; the posterior may be improper (i.e., not integrate to 1).

By definition, the unbiased estimator for the parameter under the Bayesian model is the posterior mean,

\hat{\theta} = \int_{\Theta} \theta \times p(\theta|y) \, d\theta.

This quantity is often used as a Bayesian point estimator because it minimizes expected squared loss between an estimate and the actual value. The posterior median may also be used as an estimate — it minimizes expected absolute error of the estimate.

Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.


Get every new post delivered to your Inbox.

Join 835 other followers