## All Bayesian Models are Generative (in Theory)

May 23, 2013

[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix $X$ and outcome vector $y$. In regression modeling, we assume a coefficient vector $\beta$ and explicitly model the outcome likelihood $p(y|\beta,X)$ and coefficient prior $p(\beta)$. But what about $X$? If we were modeling $X$, we’d have a full likelihood $p(X,y|\beta,\psi)$, where $\psi$ are the additional parameters involved in modeling $X$ and the prior is now joint, $p(\beta,\psi)$.

So how do we justify conditional modeling as Bayesians? We simply assume that $\psi$ and $\beta$ have independent priors, so that

$p(\beta,\psi) = p(\beta) \times p(\psi)$.

The posterior then neatly factors as

$p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|X,y)$.

Looking at just the inference for the regression coefficients $\beta$, we have the familiar expression

$p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X)$.

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for $\psi$. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given $k$ others than a joint distribution on all $k+1$ variables.

We knew that all along.

## The Latin1 Transcoding Trick for Ant

April 21, 2013

A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.

Another situation where this trick comes in real handy is dealing with the way that Ant compiles its logfiles.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into the problem that has given rise to many StackOverflow queries, that is, when you use an Ant task to run the program and instrument your code with print statements to standard out, Ant replaces non-ASCII characters with a question mark. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings . Do you know about Ant’s IO System? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStreams

Using the source code for Ant 1.9.0, in class org.apache.tools.ant.Main we see that System.In, System.Out, and System.Err are all reassigned to Ant’s DemuxInputStream and DemuxOutputStream, which extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:

public class GetDefaultEncoding {
public static void main(String[] args) {
System.out.println(System.getProperty("file.encoding"));
}
}

On my Mac running OS-X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1 which differs from ISO_8859-1 just enough to be noticeable).

At the point where Ant’s DemuxInputStream reads in the bytes sent to System.out by a Java task, any character outside of the default character set is replaced by a question mark. When Latin1 is the default encoding, all bytes are valid Latin1 characters and their Unicode code point value is the same as the byte value so the bytes from the Java task pass through the Ant IO system unchanged.

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.

## Generative vs. Discriminative; Bayesian vs. Frequentist

April 12, 2013

[There’s now a followup post, All Bayesian models are generative (in theory).]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

 Frequentist Bayesian Discriminative p(z ; w, β) p(z, β ; w) = p(z | β ; w) * p(β) Generative p(z, w ; β) p(z, w, β) = p(z, w | β) * p(β)

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

## Mean-Field Variational Inference Made Easy

March 25, 2013

I had the hardest time trying to understand variational inference. All of the presentations I’ve seen (MacKay, Bishop, Wikipedia, Gelman’s draft for the third edition of Bayesian Data Analysis) are deeply tied up with the details of a particular model being fit. I wanted to see the algorithm and get the big picture before being overwhelmed with multivariate exponential family gymnastics.

### Bayesian Posterior Inference

In the Bayesian setting (see my earlier post, What is Bayesian Inference?), we have a joint probability model $p(y,\theta)$ for data $y$ and parameters $\theta$, usually factored as the product of a likelihood and prior term, $p(y,\theta) = p(y|\theta) p(\theta)$. Given some observed data $y$, Bayesian predictive inference is based on the posterior density $p(\theta|y) \propto p(\theta, y)$ of the the unknown parameter vector $\theta$ given observed data vector $y$. Thus we need to be able to estimate the posterior density $p(\theta|y)$ to carry out Bayesian inference. Note that the posterior is a whole density function—we’re not just after a point estimate as in maximum likelihood estimation.

### Mean-Field Approximation

Variational inference approximates the Bayesian posterior density $p(\theta|y)$ with a (simpler) density $g(\theta|\phi)$ parameterized by some new parameters $\phi$. The mean-field form of variational inference factors the approximating density $g$ by component of $\theta = \theta_1,\ldots,\theta_J$, as

$g(\theta|\phi) = \prod_{j=1}^J g_j(\theta_j|\phi_j)$.

I’m going to put off actually defining the terms $g_j$ until we see how they’re used in the variational inference algorithm.

### What Variational Inference Does

The variational inference algorithm finds the value $\phi^*$ for the parameters $\phi$ of the approximation which minimizes the Kullback-Leibler divergence of $g(\theta|\phi)$ from $p(\theta|y)$,

$\phi^* = \mbox{arg min}_{\phi} \ \mbox{KL}[ g(\theta|\phi) \ || \ p(\theta|y) ]$.

The key idea here is that variational inference reduces posterior estimation to an optimization problem. Optimization is typically much faster than approaches to posterior estimation such as Markov chain Monte Carlo (MCMC).

The main disadvantage of variational inference is that the posterior is only approximated (though as MacKay points out, just about any approximation is better than a delta function at a point estimate!). In particular, variational methods systematically underestimate posterior variance because of the direction of the KL divergence that is minimized. Expectation propagation (EP) also converts posterior fitting to optimization of KL divergence, but EP uses the opposite direction of KL divergence, which leads to overestimation of posterior variance.

### Variational Inference Algorithm

Given the Bayesian model $p(y,\theta)$, observed data $y$, and functional terms $g_j$ making up the approximation of the posterior $p(\theta|y)$, the variational inference algorithm is:

• $\phi \leftarrow \mbox{random legal initialization}$
• $\mbox{repeat}$
• $\phi_{\mbox{\footnotesize old}} \leftarrow \phi$
• $\mbox{for } j \mbox{ in } 1:J$
• $\mbox{{} \ \ set } \phi_j \mbox{ such that } g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)]$.
• $\mbox{until } ||\phi - \phi_{\mbox{\footnotesize old}}|| < \epsilon$

The inner expectation is a function of $\theta_j$ returning a single non-negative value, defined by

$\mathbb{E}_{g_{-j}}[\log p(\theta|y)]$

$\begin{array}{l} \mbox{ } = \int_{\theta_1} \ldots \int_{\theta_{j-1}} \int_{\theta_{j+1}} \ldots \int_{\theta_J} \\[8pt] \mbox{ } \hspace*{0.4in} g(\theta_1|\phi_1) \times \cdots \times g(\theta_{j-1}|\phi_{j-1}) \times g(\theta_{j+1}|\phi_{j+1}) \times \cdots \times g(\theta_J|\phi_J) \times \log p(\theta|y) \\[8pt] \mbox{ } \hspace*{0.2in} d\theta_J \cdots d\theta_{j+1} \ d\theta_{j-1} \cdots d\theta_1 \end{array}$

Despite the suggestive factorization of $g$ and the coordinate-wise nature of the algorithm, variational inference does not simply approximate the posterior marginals $p(\theta_j|y)$ independently.

### Defining the Approximating Densities

The trick is to choose the approximating factors so that the we can compute parameter values $\phi_j$ such that $g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)]$. Finding such approximating terms $g_j(\theta_j|\phi_j)$ given a posterior $p(\theta|y)$ is an art form unto itself. It’s much easier for models with conjugate priors. Bishop or MacKay’s books and the Wikipedia present calculations for a wide range of exponential-family models.

### What if My Model is not Conjugate?

Unfortunately, I almost never work with conjugate priors (and even if I did, I’m not much of a hand at exponential-family algebra). Therefore, the following paper just got bumped to the top of my must understand queue:

It’s great having Dave down the hall on sabbatical this year — one couldn’t ask for a better stand in for Matt Hoffman. They are both insanely good at on-the-fly explanations at the blackboard (I love that we still have real chalk and high quality boards).

## Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of unicode, UCI for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players all of whom play the only position that fill the agent roll in a very popular play known as a “pass”.] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

## Bayesian Inference for LDA is Intractable

February 18, 2013

Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

and on computational intractability, which is a bit more involved than a program being slow,

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.

## Another Linguistic Corpus Collection Game

November 12, 2012

Johan Bos and his crew at University of Groningen have a new suite of games aimed at linguistic data data collection. You can find them at:

Wordrobe is currently hosting four games. Twins is aimed at part-of-speech tagging, Senses is for word sense annotation, Pointers for coref data, and Names for proper name classification.

One of the neat things about Wordrobe is that they try to elicit some notion of confidence by allowing users to “bet” on their answers.

They also discuss prizes, but I didn’t see any mention of what the prizes were.

The project is aimed at imrpoving the Groningen Meaning Bank. I hope they release the raw user data as well as their best guess at a gold standard. I had some background discussion with Johan about annotation models, but they’re going to go with something relatively simple, which means there’s an opportunity to compare a richer statistical models like the other ones I’ve cited on the Data Annotation category of this blog.

### Other Linguistic Games

The first linguistic game of which I was aware was Ahn’s reCAPTCHA. Although aimed at capturing OCR annotations as a side effect, it is more of a security wall aimed at filtering out bots than a game. Arguably, I’ve been played by it more than the other way around.

A more linguistically relevant game is Poesio et al.’s Phrase Detectives, which is aimed at elucidating coreference annotations. I played through several rounds of every aspect of it. The game interface itself is very nice for a web app. Phrase Detectives occassionally has cash prizes, but it looks like they ran out of prize money because the last reference to prizes was July 2011.

### Are they Really Games?

Phrase Detectives is more like an Amazon Mechanical Turk task with a backstory and leaderboard. I didn’t create a login for Wordrobe to try it, but I’m going out on a limb to guess it’s going to be similar given the descriptions of the games.

## Upgrading from Beta-Binomial to Logistic Regression

October 30, 2012

### Bernoulli Model

Consider the following very simple model of drawing the components of a binary random N-vector y i.i.d. from a Bernoulli distribution with chance of success theta.

data {
int N;  // number of items
int y[N];  // binary outcome for item i
}
parameters {
real theta;  // Prob(y[n]=1) = theta
}
model {
theta ~ beta(2,2); // (very) weakly informative prior
for (n in 1:N)
y[n] ~ bernoulli(theta);
}


The beta distribution is used as a prior on theta. This is the Bayesian equivalent to an “add-one” prior. This is the same model Laplace used in the first full Bayesian analysis (or as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model male-vs.-female birth ratio in France, with N being the number of samples and y[n] = 1 if the baby was male and 0 if female.

### Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

model {
theta ~ beta(2,2);
sum(y) ~ binomial(N,theta);
}


But it doesn’t generalize so well. What we want to do is let the prediction of theta vary by predictors (“features” in NLP speak, covariates to some statisticians) of the items n.

### Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

data {
int N;
int y[N];
}
parameters {
real alpha;  // inv_logit(alpha) = Prob(y[n]=1)
}
model {
alpha ~ normal(0,5);  // weakly informative
for (n in 1:N)
y[n] ~ bernoulli(inv_logit(alpha));
}


Recall that the logistic sigmoid (inverse of the logit, or log odds function) maps

$\mbox{logit}^{-1}:(-\infty,\infty)\rightarrow(0,1)$

by taking

$\mbox{logit}^{-1}(u) = 1 / (1 + \mbox{exp}(-u))$.

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

data {
int N;  // number of items
int y[N];  // binary outcome for item n
real x[N];  // predictive feature for item n
}
parameters {
real alpha;  // intercept
real beta;  // slope
}
model {
alpha ~ normal(0,5);  // weakly informative
for (n in 1:N)
y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}


### Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see:

Stan’s not a competitor for LingPipe, by the way. Stan scales well for full Bayesian inference, but doesn’t scale like LingPipe’s SGD-based point estimator for logistic regression. And Stan doesn’t do structured models like HMMs or CRFs and has no language-specific features like tokenizers built in. As I’ve said before, it’s horses for courses.

## Mystery Novel with Natural Language Processing

October 24, 2012

For those of you who like mystery novels, Mitzi’s just written one. The added bonus for readers of this blog is that there’s natural language processing involved in the detective work (I don’t want to give too much away, so I can’t tell you how).

Poetic Justice is in the cozy mystery sub-genre, where the focus is on the amateur sleuths and their milieu, not on grisly multiple homicides.

### The Back-Cover Blurb

Once you’ve made it in Manhattan, why would you be caught dead in Staten Island? That’s what Jay Alfred, editor-in-chief of Ars Longa Press, can’t understand. Jay and his partner Ken live on the best block in Chelsea. They’re an attractive pair of opposites. Jay would never stoop to snoop. Ken exercises his right to know every chance he gets.

When Sheba Miller, literary agent and downtown doyenne, is found dead in a bar on Staten Island, Ken can’t wait to investigate. He hustles Jay onto the Staten Island Ferry and into adventure. Then Sheba’s tell-all memoir surfaces. It’s a catalog of white nights with hot artists and liquid lunches with idiot publishers. Jay’s the idiot-in-chief, but he’s not alone. It’s a good thing that Sheba’s dead, because half of literary New York is ready to kill her.

That’s not the only book in town and Ken’s not the only amateur detective. Sheba’s old friends and lovers and the junior members of Ars Longa are all ready and willing to explore New York City and beyond in search of authors, books, killers, and a killer martini.

### Look Inside!

If you follow the Amazon link below, the first six chapters are available free online through Amazon’s “Look Inside!” feature.

### Early Reviews

For what it’s worth, at least twenty people have read it and said they enjoyed it. Some (including me) are already clamoring for the second book in the series. Other mystery writers told Mitzi this would happen; luckily for us fans, she already has two follow-ons in the pipeline.

### Details

• Publisher:  Colloquial Media
• Language:  English
• Pages:  324
• ISBN-10: 0-9882087-0-9 (Paperback)
• ISBN-13: 978-0-9882087-0-4 (Kindle)

### Ordering

On Amazon, it’s eligible for the 4-for-3 deal (order 4 books, get the cheapest one free).

It’s also available from Amazon UK.

If you want a review copy, add a comment to this post or send me e-mail at carp@alias-i.com.

## High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gather NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values — it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy run from 0 to 1 and kappa from 0 to 1. Here’s the whole plot for comparison.

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definitions of kappa, to result in the following definition in R to compute expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracies.

kappa_fun = function(prev,acc) {
agr = acc^2 + (1 - acc)^2;
cat1 = acc * prev + (1 - acc) * (1 - prev);
e_agr = cat1^2 + (1 - cat1)^2;
return((agr - e_agr) / (1 - e_agr));
}


Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

 Cat1 Cat2 Cat1 170 90 Cat2 90 650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus one minus the accuracy squared (both annotators are wrong and hence agree — two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus one minus accuracy times one minus prevalence (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the prevalences. So that’s just the probability of category 1 squared (both annotators respond category 1) and the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.

Back to the plot. The white border is set at .66, the lower-end threshold established by Krippendorf for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorf was 0.8, and is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
for (n in 1:(K + 1)) {
prevalence[pos] = (m - 1) / K;
accuracy[pos] = (n - 1) / K;
kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
pos = pos + 1;
}
}
library("ggplot2");
df = data.frame(prevalence=prevalence,
accuracy=accuracy,
kappa=kappa);
kappa_plot =
ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
labs(title = "Kappas for Binary Classification\n") +
geom_tile() +
scale_x_continuous(expand=c(0,0),
breaks=c(0,0.25,0.5,0.75,1),
limits =c(0.5,1)) +
scale_y_continuous(expand=c(0,0),
breaks=seq(0,10,0.1),
limits=c(0.85,1)) +