Author Archive

The Futility of Commenting Code

October 15, 2009

I often read about novice programmers’ aspirations for a greater density of code comments, because they think it’s “professional”. I have news for you. Professional coders don’t comment their own code much and never trust the comments of others they find in code. Instead, we try to learn to read code and write more readable code.

API Documentation vs. Code Comments

I’m a big believer in clear API documentation. Java makes this easy with javadoc. It even produces relatively spiffy results.

But if you look at the LingPipe source code, you’ll find very few code comments.

Comments Lie

The reason to be very suspicious of code comments is that they can lie. The code is what’s executed, so it can’t lie. Sure, it can be obfuscated, but that’s not the same as lying.

I don’t mean little white lies, I mean big lies that’ll mess up your code if you believe them. I mean comments like “verifies the integrity of the object before returning”, when it really doesn’t.

A typical reason for slippage between comments and code is patches to the code that are not accompanied by patches to the comments. (This can happen for API doc, too, if you fiddle with your APIs.)

Another common reason is that the code author didn’t actually understand what the code was doing, so wrote comments that were wrong. If you really mess up along these lines, your code’s likely to be called out on blogs like Coding Horror or Daily WTF.

Most Comments Considered Useless

The worst offenses in the useless category are things that simply repeat what the code says.

// Set x to 3
x = 3;

It’s even better when embedded in a bloated Fortran-like comment block:

/************** add1 ***********************
 *  A function to add 1 to integers        *
 *  arguments                              *
 *     + n, any old integer                *
 *******************************************/
public int add1(int n) { return n+1; }

Thanks. I didn’t know what int meant or what n+1 did. This is a classic rookie mistake, because rookies don’t know the language or its libraries, so often what they do seems mysterious to them. For instance, commenting n >>> 2 with “shift two bits to right and fill in on the left with zeroes” may seem less egregious, but your reader should know that >>> is the unsigned shift operator (or look it up).

There is a grey line in the sand (I love mixed metaphors) here. When you’re pulling out techniques like those in Bloch and Gafter’s Java Puzzlers, you might want to reconsider how you code something, or add a wee comment.

Eliminate, don’t Comment Out, Dead Code

I remember being handed a 30-page Tcl/Tk script at Bell Labs back in the 1990s. It ran some kind of speech recognition pipeline, because that’s what I was up to at the time. I picked it up and found dozens of methods with the same name, and lots of commented-out code. This makes it really hard to follow what’s going on, especially if whole blocks get commented out with /* ... */.

Please don’t do this. Learn a version control system instead, like SVN.

When do I Comment Code?

I add comments in code in two situations. The first is when I wrote something inefficiently, but I know a better way to do it. I’ll write something like “use dynamic programming to reduce quadratic to linear”. This is a bad practice, and I wish I could stop myself from doing it. I feel bad writing something inefficient when I know how to do it more efficiently, and I certainly don’t want people reading the code to think I’m clueless.

The second situation is the only compelling reason I know to leave comments: when you write code that’s not idiomatic in the language it’s written in, or when you do something obscure for efficiency. And even then, keep the comments telegraphic, like “C style strings for efficiency”, “unfolds the E step and M step of EM”, “first step unfolded for boundary checks”, or something along those lines.

Update: 13 Dec 2012. I’ve thought about this issue a bit more and wanted to add another reason to comment code: to associate the code with an algorithm. If you’re implementing a relatively complex algorithm, then you’re going to have the algorithm design somewhere and it can be helpful to indicate which parts of the code correspond to which parts of the algorithm. Ideally, though, you’d just write functions with good names to do the stages of the algorithm if they’re clear. But often that’s not really an option because of the shape of the algorithm, mutability of arguments, etc.

Also, I want to be clear that I’m not talking about API comments in code, which I really like. For instance, we do that in LingPipe using Javadoc. I think these comments are really really important, but I think of them somehow more as specifications than as comments.

Write Readable Code Instead

What you should be doing is trying to write code that’s more readable.

I don’t actually mean in Knuth’s literate programming sense; Knuth wants programs to look more like natural language, which is a Very Bad Idea. For one, Knuth has a terrible track record, bringing us TeX, which is a great typesetting language, but impossible to read, and a three-volume set of great algorithms written in some of the most impenetrable, quirky pseudocode you’re ever likely to see.

Instead, I think we need to become literate in the idioms and standard libraries of the programming language we’re using and then write literately in that language.

The biggest shock for me in moving from academia to industry is just how much of other people’s code I have to read. It’s a skill, and I got much better at it with practice. Just like reading English.

Unfortunately, effective programming is as hard as, if not harder than, effective essay writing. Writing understandable code that’s also efficient enough and general enough takes lots of revision, and ideally feedback from a good code reviewer.

Not everyone agrees about what’s most readable. Do I call out my member variables from local variables with a prefix? I do, but lots of perfectly reasonable programmers disagree, including the coders for Java itself. Do I use really verbose names for variables in loops? I don’t, but some programmers do. Do I use four-word-long class names, or abbreviations? The former, but it’s another judgment call.

However you code, try to do it consistently. Also, remember that coding standards and macro libraries are not good places to innovate. It takes a while to develop a consistent coding style, but it’s worth the investment.

Bayesian Counterpart to Fisher Exact Test on Contingency Tables

October 13, 2009

I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.

Contingency Table

I’ll use the data in the example in the Wikipedia article on contingency tables:

          Left-Handed   Right-Handed   TOTAL
Male        9 (y1)           43        52 (n1)
Female      4 (y2)           44        48 (n2)
Total      13                87       100


The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.

Beta-Binomial Model

The observed data has y_1 = 9 left-handed men of a total of n_1 = 52 men and y_2 = 4 left-handed women of a total of n_2 = 48 women.

Letting \theta_1 \in [0,1] be the proportion of left-handed men and \theta_2 \in [0,1] the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,

y_1 \sim \mbox{\sf Binom}(n_1,\theta_1), and

y_2 \sim \mbox{\sf Binom}(n_2,\theta_2).

He then suggested simple uniform priors on the proportions, which we know are equivalent to \mbox{\sf Beta}(1,1) priors:

\theta_1 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1), and

\theta_2 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1).

Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:

p(\theta_1|y_1,n_1) = \mbox{\sf Beta}(\theta_1|y_1+1, n_1-y_1+1), and

p(\theta_2|y_2,n_2) = \mbox{\sf Beta}(\theta_2|y_2+1, n_2-y_2+1)

We could try to compute the posterior density of \delta = \theta_1 - \theta_2 analytically by solving the integral

p(\delta|y,n) = \int_{-\infty}^{\infty} \mbox{\sf Beta}(\theta|y_1+1,n_1-y_1+1) \mbox{\sf Beta}(\theta-\delta|y_2+1,n_2-y_2+1) d\theta

and then evaluating \mbox{Pr}[\delta > 0].
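If you’re curious what that route looks like without doing the calculus, here’s a rough grid-approximation sketch in R (my own scratch code with an arbitrary grid size, not part of the script below): rather than solving the convolution, it conditions on \theta_2 and sums the beta posteriors over a grid.

# grid approximation of Pr[theta1 > theta2] (sketch; grid size is arbitrary)
n1 = 52; y1 = 9
n2 = 48; y2 = 4
grid = seq(0.0005, 0.9995, length.out=1000)
w = grid[2] - grid[1]                        # grid spacing
d2 = dbeta(grid, y2+1, (n2-y2)+1)            # posterior density of theta2
probGreater = sum(d2 * (1 - pbeta(grid, y1+1, (n1-y1)+1))) * w
print(probGreater)                           # roughly 0.90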

Instead, we’ll just follow Andrew’s sage advice and estimate the integral by simulation. That is, we’ll take M samples from the joint posterior

p(\theta_1,\theta_2|y_1,n_1,y_2,n_2) = p(\theta_1|y_1,n_1) \times p(\theta_2|y_2,n_2),

which we write as

(\theta_1^{(1)},\theta_2^{(1)}), (\theta_1^{(2)},\theta_2^{(2)}), (\theta_1^{(3)},\theta_2^{(3)}), \ldots, (\theta_1^{(M)},\theta_2^{(M)})

and then use the Monte Carlo approximation,

\mbox{Pr}[\theta_1 > \theta_2]

\approx \frac{1}{M} \ \sum_{m=1}^{M} \mathbb{I}(\theta_1^{(m)} > \theta_2^{(m)}),

where \mathbb{I}() is the indicator function, taking on value 1 if its property is true and 0 otherwise.

In words, we estimate the probability that \theta_1 is greater than \theta_2 by the fraction of samples m \in M where \theta_1^{(m)} > \theta_2^{(m)}.

R Code

I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.

n1 = 52 # men
y1 = 9  # left-handed men
n2 = 48 # women
y2 = 4  # left-handed women

I = 10000 # simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1)
theta2 = rbeta(I, y2+1, (n2-y2)+1)
diff = theta1 - theta2  # simulated diffs

quantiles = quantile(diff, c(0.005,0.025,0.5,0.975,0.995))
print(round(quantiles, 3))

print("Probability higher % men than women lefties:")
print(mean(theta1 > theta2))

print("Mean difference theta1-theta2")
print(mean(diff))

hist(diff, breaks=50, probability=TRUE,
     xlab="theta1 - theta2",
     ylab="p(theta1 - theta2 | y, n)",
     main="Posterior Simulation of Male - Female Lefties")
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")

Running this code prints out:

> source("twoByTwoPosterior.R")
  0.5%   2.5%    50%  97.5%  99.5% 
-0.091 -0.046  0.084  0.218  0.262 

[1] "Probability higher % men than women lefties:"
[1] 0.901

[1] "Mean difference theta1-theta2"
[1] 0.08484979

The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are the quantiles that determine the 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The central 95% posterior interval is thus (-0.046,0.218). Intervals get the obvious interpretation in Bayesian stats: given the model and data, we assign a 95% probability to the true difference lying in the interval (-0.046,0.218).

The second number printed is the posterior probability that men have a higher percentage of left-handedness than women, which is 0.901, or about 90%. R makes this simple by letting us write mean(theta1 > theta2). This is also the value you’d get from integrating the posterior density of the difference from 0 to \infty.

The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. This value is 0.0848, which is a bit higher than the median difference of 0.084.

The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference p(\theta_1 - \theta_2 | n, y).


Is the difference significant? Don’t ask. You’re a Bayesian now.

Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of left-handedness among men than women, or that there’s a 95% chance the difference between male and female prevalence of left-handedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]

Batting Averages: Bayesian vs. MLE Estimates

September 11, 2009

It’s American baseball season, and I’ve been itching to do more baseball stats. So in the interest of explaining Bayesian posterior inference with an example…

Batting Averages

A player’s batting average is his number of “hits” divided by his number of “at bats”.

Binomial Model of Batting

For simplicity, we’ll model a sequence of N at bats using a binomial distribution with a \theta probability of getting a hit in each at bat. This assumes that the sequence of at bats is independent and identically distributed, aka iid.

Of course this is wrong. All models are wrong. It’s just that this one’s more wrong than usual, despite the popularity of the batting average as a summary statistic. (See the second last section for a reference to a better model.)

Maximum Likelihood Estimate (MLE)

Given a sample (aka training data) consisting of a player with m hits in M at bats (where m > 0,  (M-m) > 0), the maximum-likelihood estimate (MLE) of the ability \theta is m/M, aka the player’s batting average.

Uniform Prior

We assume a uniform (aka noninformative) prior distribution over abilities.

This is also wrong. We know major league position players don’t bat under 0.200 or they’d be fired, and no one’s had a batting average of over 0.400 since Ted Williams.

But we’ll have to come back to issues with the prior later. This post is just to clarify what integration over the posterior does for inference, not what an informative prior does.

Bayesian Posterior Estimate

The beta prior is conjugate to the binomial. Because the uniform distribution may be expressed as a beta distribution, \mbox{\sf Beta}(1,1), the posterior is a beta distribution. Because the beta and binomial play nicely together, the posterior ability distribution p(\theta|m,M) may be expressed analytically as \mbox{\sf Beta}(1+m, 1+M-m). The maximum of \mbox{\sf Beta}(1+m,1+M-m) is m/M, so the maximum a posteriori (MAP) estimate of \theta is also m/M, the same as the maximum likelihood estimate.
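Here’s a tiny numerical sanity check of the MAP-equals-MLE claim (a throwaway sketch, not LingPipe code; the 3-for-10 numbers anticipate the example below):

# mode of Beta(1+m, 1+M-m) matches the MLE m/M
m = 3; M = 10
map = optimize(function(theta) dbeta(theta, 1+m, 1+M-m),
               interval=c(0,1), maximum=TRUE)$maximum
print(c(map=map, mle=m/M))   # both approximately 0.3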

Posterior Predictive Inference

Now let’s look at prediction. For concreteness, suppose we’ve observed M=10 at bats, of which m=3 were hits, for a batting average of 3/10 = 0.300. What we want to know is the likelihood of n hits in the next N=5 at bats.

There are six possible outcomes, 0, 1, 2, 3, 4, or 5 hits. Using the MLE or MAP point estimate, this distribution is \mbox{\sf Binomial}(n|5,0.300).

The Bayesian estimate requires integrating over the posterior, effectively averaging the binomial prediction for each ability by the ability’s probability. The posterior in this case is \mbox{\sf Beta}(3+1,10-3+1)=\mbox{\sf Beta}(4,8), so the integral results in the well known beta-binomial distribution:

\mbox{\sf BetaBinomial}(n|5, 4,8) = \int_0^1 \mbox{\sf Binomial}(n|5,\theta) \mbox{\sf Beta}(\theta|4,8) d\theta

I solved the integral numerically using Monte Carlo integration, which is a fancy way of saying I took a bunch of samples \theta_n from the beta posterior \mbox{\sf Beta}(4,8) and plugged them into \mbox{\sf Binomial}(5,\theta_n) and averaged the resulting predictions. It’s a few lines of R (see below). The beta-binomial may be expressed analytically, but that’s even more typing than the Monte Carlo solution. I’m sure there’s an R package that’d do it for me.
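For reference, here’s roughly what the analytic form looks like in base R (betaBinom is my own throwaway helper, not a package function); up to Monte Carlo error it should match the Bayesian column in the table below.

# analytic beta-binomial pmf: choose(N,n) * B(n+a, N-n+b) / B(a,b)
betaBinom = function(n, N, a, b)
    choose(N, n) * beta(n+a, N-n+b) / beta(a, b)
print(round(betaBinom(0:5, 5, 4, 8), 3))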

Comparing the Results

The important issue is to look at the outputs. Given that we’ve seen a batter get 3 hits in 10 at bats, here’s our prediction using a binomial model with uniform prior, shown with both the MLE/MAP point estimate and the average of the Bayesian predictions:

Point vs. Bayesian Estimates

Hits   MLE/MAP   Bayesian
  0     0.168      0.180
  1     0.360      0.302
  2     0.309      0.275
  3     0.132      0.165
  4     0.028      0.064
  5     0.002      0.012

The Bayesian estimate here is more dispersed (flatter, higher variance, higher entropy) than the MLE estimate. Of course, a true Bayesian estimate would provide a joint distribution over outcomes 0,1,2,3,4,5 (which reduce to a simplex, because any 5 values determine the 6th) for downstream reasoning. The nice thing about passing uncertainty along by integration is that it’s very plug-and-play.

Advanced Reading

Luckily for us baseball stats geeks, there’s some advanced reading hot off the press from Wharton Business School:

R Source

Here’s how I generated the table.

# observed
hits = 3
atBats = 10

# predicted
nextAtBats = 5

# y: MLE predictive distro
average = hits/atBats
y = rep(0,nextAtBats+1);
for (i in 0:nextAtBats) y[i+1] = dbinom(i,nextAtBats,average)

# z: Bayesian predictive distro
numSamps = 20000
z = rep(0,nextAtBats+1)
for (k in 1:numSamps) {
    sampledAvg = rbeta(1,1+hits,1+atBats-hits)
    for (i in 0:nextAtBats)
        z[i+1] = z[i+1] + dbinom(i,nextAtBats,sampledAvg)
}
for (i in 0:nextAtBats)
    z[i+1] = z[i+1] / numSamps

print("Point"); print(y, digits=3)
print("Bayes"); print(z, digits=3)

What is “Bayesian” Statistical Inference?

September 9, 2009

Warning: dangerous curves ahead. This entry presupposes you already know some math stats, such as how to manipulate joint, marginal and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). You should also be familiar with the difference between model parameters (e.g. a regression coefficient or Poisson rate parameter) and observable samples (e.g. reported annual income or the number of fumbles in a quarter of a given American football game).


I followed up this post with some concrete examples for binary and multinomial outcomes:

Bayesian Inference is Based on Probability Models

Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.

What’s a Prior?

The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.

Non-Bayesian Statistics

Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as defining the probability that a model parameter (such as the mean height of an adult male American) falls in a given range (say 5’6″ to 6’0″).

Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.

Bayesian Technical Apparatus

The sampling distribution, with probability function p(y|\theta), models the probability of observable data y given unobserved (and typically unobservable) model parameters \theta. (The sampling distribution probability function p(y|\theta) is called the likelihood function when viewed as a function of \theta for fixed y.)

The prior distribution, with probability function p(\theta), models the probability of the parameters \theta.

The full joint distribution over parameters and data has a probability function given by the chain rule,

p(y,\theta) = p(y|\theta) \times p(\theta).

The posterior distribution gives the probability of parameters \theta given the observed data y. The posterior probability function p(\theta|y) is derived from the sampling and prior distributions via Bayes’s rule,


p(\theta|y)

{} = \frac{\displaystyle p(y,\theta)}{\displaystyle p(y)}       by the definition of conditional probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\displaystyle p(y)}       by the chain rule,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y,\theta') \, d\theta'}       by the rule of total probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y|\theta') \times p(\theta') \, d\theta'}       by the chain rule, and

{} \propto p(y|\theta) \times p(\theta)       because y is constant.

The posterior predictive distribution p(\tilde{y}|y) for new data \tilde{y} given observed data y is the average of the sampling distribution defined by weighting the parameters proportional to their posterior probability,

p(\tilde{y}|y) = \int_{\Theta} \, p(\tilde{y}|\theta) \times p(\theta|y) \, d\theta.

The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. In particular, the posterior predictive distribution is an overdispersed variant of the sampling distribution; the extra dispersion arises from integrating over the posterior.

Conjugate Priors

Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribution, the predictive posterior distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the predictive posterior distribution is negative-binomial, the overdispersed form of the Poisson.
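To see the overdispersion concretely, here’s a small simulation sketch in R for the Poisson case; the Gamma(3,2) prior is an arbitrary choice purely for illustration.

# gamma-Poisson mixture vs. plain Poisson with the same mean
set.seed(1)
lambda = rgamma(100000, shape=3, rate=2)   # draw rates from the prior
yMix = rpois(100000, lambda)               # gamma-Poisson (negative binomial) draws
yPois = rpois(100000, 3/2)                 # plain Poisson with the same mean
print(c(mean(yMix), var(yMix)))            # variance exceeds the mean
print(c(mean(yPois), var(yPois)))          # variance roughly equals the mean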

Point Estimate Approximations

An approximate alternative to full Bayesian inference uses p(\tilde{y}|\hat{\theta}) for prediction, where \hat{\theta} is a point estimate.

The maximum \theta^{*} of the posterior distribution defines the maximum a posteriori (MAP) estimate,

\theta^{*} = \arg\max_{\theta} p(y|\theta) p(\theta).

If the prior p(\theta) is uniform, the MAP estimate is called the maximum likelihood estimate (MLE), because it maximizes the likelihood function p(y|\theta). Because the MLE does not assume a proper prior, the posterior may be improper (i.e., may not integrate to 1).

By definition, the unbiased estimator for the parameter under the Bayesian model is the posterior mean,

\hat{\theta} = \int_{\Theta} \theta \times p(\theta|y) \, d\theta.

This quantity is often used as a Bayesian point estimator because it minimizes expected squared loss between an estimate and the actual value. The posterior median may also be used as an estimate — it minimizes expected absolute error of the estimate.
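Here’s a brute-force check of those two loss-function claims in R, using the Beta(4,8) posterior from the batting-average post above as a stand-in; it’s an illustration only, not part of the exposition.

# which point estimate minimizes each expected loss, by grid search over candidates
draws = rbeta(50000, 4, 8)
cand = seq(0.01, 0.99, by=0.001)
sqLoss = sapply(cand, function(t) mean((draws - t)^2))
absLoss = sapply(cand, function(t) mean(abs(draws - t)))
print(c(minSqLoss=cand[which.min(sqLoss)], postMean=mean(draws)))
print(c(minAbsLoss=cand[which.min(absLoss)], postMedian=median(draws)))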

Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.

Deleting Values in Welford’s Algorithm for Online Mean and Variance

July 7, 2009

To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the variable is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:

Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes:

private long mN = 0L;
private double mM = 0.0;
private double mS = 0.0;

public void handle(double x) {
    ++mN;
    double nextM = mM + (x - mM) / mN;
    mS += (x - mM) * (x - nextM);
    mM = nextM;
}

public void unHandle(double x) {
    if (mN == 0) {
        throw new IllegalStateException();
    } else if (mN == 1) {
        mN = 0; mM = 0.0; mS = 0.0;
    } else {
        // roll the mean and sum of squares back to their values before x
        double mMOld = (mN * mM - x) / (mN - 1);
        mS -= (x - mM) * (x - mMOld);
        mM = mMOld;
        --mN;
    }
}

public double mean() {
    return mM;
}

public double varianceUnbiased() {
    return mN > 1 ? mS / (mN - 1) : 0.0;
}

Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.

The same technique can obviously be applied to covariance and correlation calculations.
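If you want to convince yourself the downdate algebra is right without firing up Java, here’s the same recipe sketched in R (my own function names, not the LingPipe API):

# carry (n, mean, S) along, add values, then delete one and compare to batch stats
welfordAdd = function(st, x) {
    n = st$n + 1
    nextM = st$m + (x - st$m) / n
    list(n=n, m=nextM, s=st$s + (x - st$m) * (x - nextM))
}
welfordDel = function(st, x) {
    if (st$n <= 1) return(list(n=0, m=0, s=0))
    mOld = (st$n * st$m - x) / (st$n - 1)
    list(n=st$n - 1, m=mOld, s=st$s - (x - st$m) * (x - mOld))
}
st = Reduce(welfordAdd, c(1.2, -0.5, 3.3, 2.7), list(n=0, m=0, s=0))
st = welfordDel(st, 3.3)
print(c(st$m, st$s / (st$n - 1)))                          # online mean and variance
print(c(mean(c(1.2, -0.5, 2.7)), var(c(1.2, -0.5, 2.7))))  # batch versions agree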

This’ll be in the next release of LingPipe.


  • Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
  • Cook, John. Accurately computing running variance. Web page.

Collapsed Gibbs Sampler for Hierarchical Annotation Model

July 6, 2009

The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler.

Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. While I was procrastinating on refactoring feature extraction for CRFs by reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their papers on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I’ve blogged about.

Hierarchical Model of Binomial Annotation

The model’s quite simple, at least in the binomial case. I’ve further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing):

Variable      Range    Status      Distribution      Description
I             > 0      input       fixed             number of items
J             > 0      input       fixed             number of annotators
π             [0,1]    estimated   Beta(1,1)         prevalence of category 1
c[i]          {0,1}    estimated   Bern(π)           category for item i
θ0[j]         [0,1]    estimated   Beta(α0,β0)       specificity of annotator j
θ1[j]         [0,1]    estimated   Beta(α1,β1)       sensitivity of annotator j
α0/(α0+β0)    [0,1]    estimated   Beta(1,1)         prior specificity mean
α0 + β0       (0,∞)    estimated   Pareto(1.5)*      prior specificity scale
α1/(α1+β1)    [0,1]    estimated   Beta(1,1)         prior sensitivity mean
α1 + β1       (0,∞)    estimated   Pareto(1.5)*      prior sensitivity scale
x[i,j]        {0,1}    input       Bern(c[i]==1 ? θ1[j] : 1-θ0[j])
                                                     annotation of item i by annotator j

The Collapsed Sampler

The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. With categories given, it’s easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors.

The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. The formula’s straightforwardly derived with Bayes’ rule:

p(c[i] | x, π, θ0, θ1) ∝ p(c[i] | π) * Πj in 1:J p(x[i,j] | c[i], θ0[j], θ1[j])
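Here’s what one category draw looks like under that formula, sketched in R rather than the Java in the sandbox; all the numerical values (π, θ0, θ1, and the row of annotations) are made up for illustration.

# one Gibbs draw of c[i] given point values for prevalence and annotator parameters
sampleCategory = function(xi, pi, theta0, theta1) {
    lp1 = log(pi) + sum(dbinom(xi, 1, theta1, log=TRUE))          # c[i] = 1
    lp0 = log(1 - pi) + sum(dbinom(xi, 1, 1 - theta0, log=TRUE))  # c[i] = 0
    p1 = 1 / (1 + exp(lp0 - lp1))   # normalize on the log scale to avoid underflow
    rbinom(1, 1, p1)
}
sampleCategory(xi=c(1,1,0,1), pi=0.3,
               theta0=c(0.90,0.80,0.85,0.95),
               theta1=c(0.70,0.90,0.60,0.80))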

Moment-Matching Beta Priors

*The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra).
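The moment-matching step itself is just a bit of algebra on the beta mean and variance; here’s a sketch in R with made-up specificities (not the named-entity data):

# match a Beta(a,b) to the empirical mean and variance of the specificities
specs = c(0.92, 0.88, 0.95, 0.81, 0.90, 0.85)
m = mean(specs)
v = var(specs)
scale = m * (1 - m) / v - 1          # alpha + beta
a = m * scale
b = (1 - m) * scale
print(c(alpha=a, beta=b))
# check: mean and variance of Beta(a,b) recover m and v
print(c(a/(a+b), a*b/((a+b)^2 * (a+b+1))))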

I’m not too worried about the Pareto scale prior — it’s pretty diffuse. I suppose I could’ve done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it’s not the ML estimate; check out Thomas Minka’s paper Estimating a Dirichlet Distribution and references therein.]).


The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. Then each epoch just resamples all the categories, then recomputes all the other estimates. This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is, that it hardly seemed worth the extra coding effort. I made sure to scale for underflow, and that’s about it.

It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman’s BDA book’s appendix), about two hours to code, and about forty-five minutes to write up for this blog entry.

Speed and Convergence

It’s very speedy and converges very very quickly compared to the full Gibbs sampler in BUGS. I spent another hour after everything was built and running writing the online sample handler that’d compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn’t want to store all of the samples (part of what’s slowing BUGS down).


I didn’t run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you’d get them here, in fact, if the initial sensitivities and specificities were worse than 0.5).

Open Question (for me)

My only remaining question is: why does this work? I don’t understand the theory of collapsed samplers. First, I don’t get nearly the range of possible samples for the priors given that they’re estimated from discrete sets. Second, I don’t apply beta-binomial inference in the main sampling equation — that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. Is this kosher?

Downloading from LingPipe Sandbox

You can find the code in the LingPipe Sandbox in the project hierAnno (the original R and BUGS code and the data are also in that project). It’s not documented at all yet, but the one Ant task should run out of the box; I’ll probably figure out how to move the application into LingPipe.

[Update: The evaluation looks fine for named entities; I’m going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it’s a bit worse than voting as is, and the estimates aren’t useful because the specificities so dominate the calculations. For Snow et al.’s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there’s a bug, the collapsed sampler isn’t stable, or I really do need a prior on those scales!]

[Update 2: The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). Voting after throwing out the two really noisy annotators is probably competitive as is. I still don’t know why the RTE data’s blowing out variance.]

Lacks a Convincing Comparison with Simple Non-Bayesian Approaches

January 23, 2009

That was just one comment from my 2009 NAACL rejection letter (reprinted in full with a link to the paper below). As Andrew says, I’m glad to hear they had so many great submissions.

This was for a paper on gold-standard inference that I’ve been blogging about. Here’s a link to the submission:

The only thing it adds to the white paper and graphs on the blog is another mechanical Turk experiment that recreated the MUC 6 PERSON entity annotations. As with the Snow et al. work, the Turkers did better than the gold standard.

I should’ve paid more attention to Simon Peyton Jones’s most excellent advice. Ironically, I used to give most of that advice to our own students. Always easier medicine to give than to take.

In particular, I needed to make the use cases much clearer to those who weren’t knee deep in Bayesian analysis and inter-annotator agreement statistics. The trap I fell into has a name. It’s called “the curse of knowledge”. The best general study of this phenomenon I know is Elizabeth Newton’s research, which is described in Heath and Heath’s great book Made to Stick (see the tappers and listeners section of the excerpt). Good psych experiments are so compelling they give me goosebumps.

It’s hard to compare the Bayesian posterior intervals with non-Bayesian estimates. The whole point of the Bayesian analysis is to analyze the posteriors. But if you’re not a Bayesian, what do you care?

The standard in NLP is first-best analyses on held-out data sets with a simple accuracy measure. No one ever talks about log loss except in model fitting and language modeling evaluation. So even a probabilistic model that’s not Bayesian can’t get evaluated on its own terms. This goes for simple posterior predictive inferences in Bayesian analyses. I guess the reviewer missed the comparison with simple voting, which is the de facto standard (coupled with censorship of uncertain cases).

What I really should’ve addressed is two sides of the issue of what happens with fewer annotators. The answer is that the Bayesian approach has an even stronger advantage over simple voting in agreeing with the gold standard annotations. I only did that analysis after the submission, in a “Doh!” moment anticipating the problem reviewers might have with 10 annotators. The second aspect is to push harder on the fool’s gold standard point that the state of the art produces very poor corpora in terms of consistency.

It is possible to see if the informative priors help with cross-validation. But even without cross-validation, it’s obvious with 20 samples when the annotator made no mistakes — they’re unlikely to have 100% accuracy on the rest of the corpus. You don’t estimate a player’s batting average (in baseball) to be .500 after he goes 10 for 20 in his first 20 at bats. Everyone in NLP knows low count estimates need to be smoothed. So now that we’ve decided we need a prior (call it regularization or smoothing if you like), we’re just haggling about price. So I just optimized that, too. It’s what any good Bayesian would do.

Here’s the full set of comments and ratings (on a 1-5 scale, with 5 being the best).

NAACL-HLT 2009 Reviews for Submission #138

Title: A Multilevel Bayesian Model of Categorical Data Annotation

Authors: Bob Carpenter
                            REVIEWER #1

Reviewer's Scores

                         Appropriateness: 3
                                 Clarity: 3
            Originality / Innovativeness: 2
                 Soundness / Correctness: 4
                   Meaningful Comparison: 4
                            Thoroughness: 4
              Impact of Ideas or Results: 2
                     Impact of Resources: 3
                          Recommendation: 2
                     Reviewer Confidence: 4
                                Audience: 2
                     Presentation Format: Poster
             Resubmission as short paper: recommended


The paper presents a simple Bayesian model of labeling error in multiple
annotations.  The goal is to evaluate and potentially clean-up errors of
sloppy/misguided annotation, for example, obtained via Amazon's Mechanical
Turk.  Standard Gibbs sampling is used to infer model parameters from observed
annotations and produce likely 'true' annotations.

I found section 2 confusing, even though the model is fairly simple.  Overall,
I didn't find the model or method very innovative or the results very
Nevertheless, I think this paper could be good as short one, since there is
some interest in empirical assessment of noisy annotation.

                            REVIEWER #2

Reviewer's Scores

                         Appropriateness: 5
                                 Clarity: 4
            Originality / Innovativeness: 3
                 Soundness / Correctness: 3
                   Meaningful Comparison: 3
                            Thoroughness: 3
              Impact of Ideas or Results: 2
                     Impact of Resources: 1
                          Recommendation: 2
                     Reviewer Confidence: 4
                                Audience: 3
                     Presentation Format: Poster
             Resubmission as short paper: recommended


This paper deals with modelling the annotation process using a fully specified
Bayesian model.

The paper itself is nicely written and I like the mixture of synthetic and real
experiments.  The problems I have however are:

--what does this actually buy us?  It would be nice to see exactly what is
gained through this process.  As it stands, it seems to tell us that annotators
can disagree and that some tasks are harder than other ones.  This is not
surprising really.

--it is highly unrealistic to have tens of annotators per item.  A far more
realistic situation is to have only one or possibly two annotators.  What would
this mean for the approach?

I would have liked to have seen some kind of baseline experiment, rather than
assuming that the various priors are actually warranted.  

Why assume binary labels?  Although it is possible to simulate non-binary
labels, this will introduce errors and it is not necessarily natural for

                            REVIEWER #3

Reviewer's Scores

                         Appropriateness: 5
                                 Clarity: 3
            Originality / Innovativeness: 4
                 Soundness / Correctness: 3
                   Meaningful Comparison: 3
                            Thoroughness: 4
              Impact of Ideas or Results: 3
                     Impact of Resources: 1
                          Recommendation: 3
                     Reviewer Confidence: 3
                                Audience: 4
                     Presentation Format: Poster
             Resubmission as short paper: recommended


Statistical approaches to NLP have been largely depending on human annotated
data. This paper uses a Bayesian approach to address the uncertainty in human
annotation process, which is an important and interesting problem that may
directly affect statistical learning performance. The proposed model is able to
infer true labels given only a set of human annotations. The paper is
in general sound; the experiments on both simulated data and real data are
provided and with detailed analysis. Overall the work seems a contribution to
the field which I recommend for acceptance, but I have a few suggestions for

My major concern is that the work lacks a convincing comparison with simple
non-Bayesian approaches in demonstrating the utility of the model. The paper
has a excessively sketchy description of a non-hierarchical version of the
model (which actually gives similar result in label inference). Moreover, it is
not very clear how the proposed approach is related to all previous works
listed in Section 9.

The authors should explain more on their evaluation metrics. What points is it
trying to make in Figure 6? Why using residual errors instead of ROC curves?   

Some minor comments:
-- Typo at end of Section 3: "residual error error"
-- Section 4: "39% of all annotators perform no better than chance, as
indicated by the diagonal green line in Figure 6". Maybe I missed something but
I don't see this in Figure 6.
-- Section 7: define "item-level" predictors.

Good Kappa’s not Enough

July 22, 2008

I stumbled across Reidsma and Carletta’s Reliability measurement without limits, which is pending publication as a Computational Linguistics journal squib (no, not a non-magical squib, but a linguistics squib).

The issue they bring up is that if we’re annotating data, a high value for the kappa statistic isn’t enough to guarantee what they call "reliability". The problem is that the disagreements may not be random. They focus on simulating the case where an annotator over-uses a label, which results in kappa way overestimating reliability when compared to performance versus the truth. The reason is that the statistical model will be able to pick up on the pattern of mistakes and reproduce them, making the task look more reliable than it is.

This discussion is similar to the case we’ve been worrying about here in trying to figure out how we can annotate a named-entity corpus with high recall. If there are hard cases that annotators miss (over-using the no-entity label), random agreement assuming equally hard problems will over-estimate the "true" recall.

Reidsma and Carletta’s simulation shows that there’s a strong effect from the relationship between true labels and features of the instances (as measured by Cramer’s phi).

Review of Cohen’s Kappa

Recall that kappa is a "chance-adjusted measure of agreement", which has been widely used in computational linguistics since Carletta’s 1996 squib on kappa, defined by:

kappa = (agreement - chanceAgreement)           
           / (1 - chanceAgreement)

where agreement is just the empirical percentage of cases on which annotators agree, and chanceAgreement is the percentage of cases on which they’d agree by chance. For Cohen’s kappa, chance agreement is measured by assuming annotators pick categories at random according to their own empirical category distribution (but there are lots of variants, as pointed out in this Artstein and Poesio tech report, a version of which is also in press at The Journal of Kappa Studies, aka Computational Linguistics). Kappa values will range between -1 and 1, with 1 only occurring if they have perfect agreement.
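For concreteness, here’s Cohen’s kappa computed from a 2×2 table of two annotators’ decisions in R (the counts are made up for illustration):

# Cohen's kappa from a 2x2 agreement table
tab = matrix(c(40,  5,
               10, 45), nrow=2, byrow=TRUE)
total = sum(tab)
agreement = sum(diag(tab)) / total
chanceAgreement = sum(rowSums(tab) * colSums(tab)) / total^2
kappa = (agreement - chanceAgreement) / (1 - chanceAgreement)
print(kappa)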

I (Bob) don’t like kappa, because it’s not estimating a probability (despite being an arithmetic combination of [maximum likelihood] probability estimates). The only reason to adjust for chance is that it allows one, in theory, to compare different tasks. The way this plays out in practice is that an arbitrary cross-task threshold is defined above which a labeling task is considered "reliable".

A final suggestion for those using kappa: confidence intervals from bootstrap resampling would be useful to see how reliable the estimate of kappa itself is.
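Here’s a sketch of what that bootstrap might look like in R; it assumes a hypothetical data frame anno with one row per item and columns a1 and a2 holding the two annotators’ labels (the names are my own).

# bootstrap interval for kappa itself by resampling items
kappaStat = function(a1, a2) {
    labs = sort(unique(c(as.character(a1), as.character(a2))))
    tab = table(factor(a1, levels=labs), factor(a2, levels=labs))
    agree = sum(diag(tab)) / sum(tab)
    chance = sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
    (agree - chance) / (1 - chance)
}
bootKappa = replicate(2000, {
    idx = sample(nrow(anno), replace=TRUE)
    kappaStat(anno$a1[idx], anno$a2[idx])
})
print(quantile(bootKappa, c(0.025, 0.975)))   # rough 95% interval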

