Author Archive

Ethics, logistic regression, and 0-1 loss

December 27, 2016

Andrew Gelman and David Madigan wrote a paper on why 0-1 loss is so problematic:

This is related to the issue of whether one should be training on an artificial gold standard. Suppose we have a bunch of annotators and we don’t have perfect agreement on items. What do we do? Well, in practice, machine learning evals tend to either (1) throw away the examples without agreement (e.g., the RTE evals, some biocreative named entity evals, etc.), or (2) go with the majority label (everything else I know of). Either way, we are throwing away a huge amount of information by reducing the label to artificial certainty. You can see this pretty easily with simulations, and Raykar et al. showed it with real data.

Yet 0/1 corpora and evaluation remain the gold standard (pun intended) in most machine learning evals. Kaggle has gone largely to log loss, but even that’s very flat around the middle of the range, as Andrew discusses in this blog post:
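
To see the flatness concretely, here's a quick R check (my own illustration, not anything from Andrew's post):

log_loss = function(y, p) -(y * log(p) + (1 - y) * log(1 - p));
round(log_loss(1, c(0.4, 0.5, 0.6)), 3);  # 0.916 0.693 0.511 -- nearly flat mid-range
round(log_loss(1, c(0.1, 0.01)), 3);      # 2.303 4.605 -- steep near the boundary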

The problem is that it’s very hard to train a model that’s well calibrated if you reduce the data to an artificial gold standard. If you don’t know what I mean by calibration, check out this paper:

It’s one of my favorites. Once you’ve understood it, it’s hard not to think of evaluating models in terms of calibration and sharpness.

Becky’s and My Annotation Paper in TACL

October 29, 2014

Finally something published after all these years working on the problem:

Rebecca J. Passonneau and Bob Carpenter. 2014. The Benefits of a Model of Annotation. Transactions of the Association for Computational Linguistics (TACL) 2(Oct):311−326. [pdf] (Presented at EMNLP 2014.)

Becky just presented it at EMNLP this week. I love the TACL concept and lobbied Michael Collins for short reviewer deadlines based on the fact that (a) I saw the bioinformatics journals could do it and (b) I either do reviews right away or wait until I’m nagged, then do them right away — they never take even three weeks of real work. The TACL editors/reviewers really were quick, getting responses to us on the initial paper and revisions in under a month.

What you Get

There’s nothing new in the paper technically that I haven’t discussed before on this blog. In fact, the model’s a step back from what I consider best practices, mainly to avoid having to justify Bayesian hierarchical models to reviewers and hence avoid reviews like the last one I got. I’d highly recommend using a proper hierarchical model (that’s “proper” in the “proper English breakfast” sense) and using full Bayesian inference.

One of the main points of the paper is that high kappa values are not necessary for high-quality corpora and, even more surprisingly, that high kappa values are not even sufficient.

Another focus is on why the Dawid and Skene type models that estimate a full categorical response matrix per annotator are preferable to anything that relies on weighted voting (hint: voting can’t adjust for bias).

We also make a pitch for using expected information gain (aka mutual information) to evaluate quality of annotators.

The Data

The paper’s very focused on the Mechanical Turk data we collected for word sense. That data’s freely available as part of the manually annotated subcorpus (MASC) of the American National Corpus (ANC). That was 1000 instances each of roughly 50 words (nouns, adjectives, and verbs), with roughly 25 annotations per instance, for a total of around 1M labels. You can get it in easily digestible form from the repo linked below.

Open-Source Software and Data

I wrote a simple EM-based optimizer in R to perform maximum penalized likelihood (equivalently, maximum a posteriori [MAP]) estimation. It’s not super fast, but it can deal with 25K labels over 200 annotators and 1000 items in a minute or so. The code is available from GitHub:

GitHub Repo: bob-carpenter/anno.

It includes a CSV-formatted version of all the data we collected in case you want to do your own analysis.
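
If you want a feel for what the optimizer is doing, here's a stripped-down sketch of EM for the binary Dawid and Skene model (my own toy version, not the repo code; it assumes every annotator labels every item and uses a simple add-one penalty in place of a real prior):

ds_em = function(y, iters = 50) {                    # y: I x J matrix of 0/1 labels
  p = rowMeans(y);                                   # initialize with per-item vote proportions
  for (t in 1:iters) {
    prev = mean(p);                                  # prevalence of category 1
    # add-one smoothing stands in for the penalty / prior
    sens = (colSums(p * y) + 1) / (sum(p) + 2);      # per-annotator sensitivity
    spec = (colSums((1 - p) * (1 - y)) + 1) / (sum(1 - p) + 2);  # per-annotator specificity
    # E step: posterior probability that each item is category 1
    log1 = log(prev) + y %*% log(sens) + (1 - y) %*% log(1 - sens);
    log0 = log(1 - prev) + y %*% log(1 - spec) + (1 - y) %*% log(spec);
    p = drop(1 / (1 + exp(log0 - log1)));
  }
  list(prevalence = prev, sens = sens, spec = spec, prob_cat1 = p);
}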

We collected so many labels per item so that we could try to get a handle on item difficulty. We just haven’t had time to do the analysis. So if you’re interested or want to collaborate, let me know (carp@alias-i.com).

Bayesian Inference for LDA is Intractable

February 18, 2013

Bayesian inference for LDA is intractable. And I mean really, really, deeply intractable, in a way that nobody has figured out how to solve and nobody is ever likely to.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

and on computational intractability, which is a bit more involved than a program being slow.

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference also approximates the posterior variance along with the posterior mean. But alas, it doesn’t work that way for LDA, where there is no meaningful posterior mean due to the lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend the Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.

High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gathering NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high-accuracy annotators, much less acceptable kappa values; it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you the expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that, all else being equal, higher kappa is better; but that’s also true of agreement, so kappa isn’t adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

Kappa for varying prevalences and accuracies

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy and prevalence each run from 0 to 1. Here’s the whole plot for comparison.

Kappa for varying prevalences and accuracies

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definition of kappa, resulting in the following R function to compute expected kappa for binary classification data with a given prevalence of category 1 items and a pair of annotators with the same accuracy.

kappa_fun = function(prev,acc) {
  agr = acc^2 + (1 - acc)^2;                   # chance both annotators agree (both right or both wrong)
  cat1 = acc * prev + (1 - acc) * (1 - prev);  # chance a single annotator responds category 1
  e_agr = cat1^2 + (1 - cat1)^2;               # chance agreement expected from marginal response rates
  return((agr - e_agr) / (1 - e_agr));         # kappa: chance-adjusted agreement
}

Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

              Annotator 2
               Cat1   Cat2
Annotator 1
     Cat1       170     90
     Cat2        90    650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.
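
You can check that against the function above:

kappa_fun(prev = 0.2, acc = 0.9);   # 0.532, matching the kappa above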

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus the square of one minus the accuracy (both annotators are wrong and hence agree; two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus (1 – accuracy) times (1 – prevalence) (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the agreement you would expect from annotators responding randomly according to their marginal response rates. So that’s just the probability of a category 1 response squared (both annotators respond category 1) plus the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.

Back to the plot. The white border is set at 0.66, the lower-end threshold established by Krippendorff for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorff was 0.8, which is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

# fill in a (K+1) x (K+1) grid of prevalence/accuracy combinations
# and compute the expected kappa at each grid point
pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
  for (n in 1:(K + 1)) {
    prevalence[pos] = (m - 1) / K;
    accuracy[pos] = (n - 1) / K;
    kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
    pos = pos + 1;
  }
}
library("ggplot2");
df = data.frame(prevalence=prevalence,
                accuracy=accuracy,
                kappa=kappa);
# heat map of kappa, zoomed to the upper-right corner of the grid,
# with the color scale centered at the 0.66 acceptability threshold
kappa_plot =
  ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
     labs(title = "Kappas for Binary Classification\n") +
     geom_tile() +
     scale_x_continuous(expand=c(0,0),
                        breaks=c(0,0.25,0.5,0.75,1),
                        limits=c(0.5,1)) +
     scale_y_continuous(expand=c(0,0),
                        breaks=seq(0,1,0.1),
                        limits=c(0.85,1)) +
     scale_fill_gradient2("kappa", limits=c(0,1), midpoint=0.66,
                          low="orange", mid="white", high="blue",
                          breaks=c(1,0.8,0.66,0));
kappa_plot;  # display the plot

Computing Autocorrelations and Autocovariances with Fast Fourier Transforms (using Kiss FFT and Eigen)

June 8, 2012

[Update 8 August 2012: We found that with Kiss FFT, if the size of the input does not factor into powers of 2, 3, and 5, then things really slow down. So now we’re padding the input to the next power of 2.]

[Update 6 July 2012: It turns out there’s a slight correction needed to what I originally wrote. The correction is described on this page:

I’m fixing the presentation below to reflect the correction. The change is also reflected in the updated Stan code using Eigen, but not the updated Kiss FFT code.]

Suppose you need to compute all the sample autocorrelations for a sequence of observations

x = x[0],...,x[N-1]

The most efficient way to do this is with the fast Fourier transform (FFT) and its inverse; it’s {\mathcal O}(N \log N) versus {\mathcal O}(N^2) for the naive approach. That much I knew. I had both experience with Fourier transforms from my speech reco days (think spectrograms) and an understanding of the basic functional analysis principles. What I didn’t know was how to code it given an FFT library. The web turned out to be not much help: the explanations I found were all over my head complex-analysis-wise, and I couldn’t find simple code examples.

Matt Hoffman graciously volunteered to give me a tutorial and wrote an initial prototype. It turns out to be really really simple once you know which way ’round the parts all go.

Autocorrelations via FFT

Conceptually, the input N-vector x is the time vector and the autocorrelations will be the frequency vector. Here’s the algorithm:

  1. create a centered version of x by setting x_cent = x - mean(x);
  2. pad x_cent at the end with entries of value 0 to get a new vector x_pad of length L = 2^ceil(log2(N));
  3. run x_pad through a forward discrete fast fourier transform to get an L-vector z of complex values;
  4. replace the entries in z with their norms (the norm of a complex number is the real number resulting from summing the squared real component and the squared imaginary component);
  5. run z through the inverse discrete FFT to produce an L-vector acov of (unadjusted) autocovariances;
  6. trim acov to size N;
  7. create an L-vector named mask consisting of N entries with value 1 followed by L-N entries with value 0;
  8. run mask through the same steps 3 through 5 (forward FFT, replace entries with their norms, inverse FFT) and put the result in the L-vector adj;
  9. to get adjusted autocovariance estimates, divide each entry acov[n] by adj[n]; and
  10. to get autocorrelations, set acorr[n] = acov[n] / acov[0] (acov[0], the autocovariance at lag 0, is just the variance).

The autocorrelation and autocovariance N-vectors are returned as acorr and acov respectively.
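
Here’s a minimal R sketch of the same idea (my own, not the Stan or Kiss FFT code; note that I pad past 2N to avoid circular wrap-around, and I compute the lag-count adjustment by running the mask through the same FFT pipeline):

acf_fft = function(x) {
  N = length(x);
  L = 2^ceiling(log2(2 * N));                      # pad enough to avoid wrap-around
  x_cent = x - mean(x);                            # center the series
  x_pad = c(x_cent, rep(0, L - N));                # zero-pad to length L
  z = fft(x_pad);                                  # forward FFT
  acov = Re(fft(Mod(z)^2, inverse = TRUE)) / L;    # inverse FFT of squared norms (R's inverse is unnormalized)
  mask = c(rep(1, N), rep(0, L - N));              # 1s over the data, 0s over the padding
  counts = Re(fft(Mod(fft(mask))^2, inverse = TRUE)) / L;  # number of terms contributing to each lag
  acov = acov[1:N] / counts[1:N];                  # adjusted autocovariances
  acov / acov[1];                                  # autocorrelations (lag 0 is the variance)
}

x = as.vector(arima.sim(list(ar = 0.8), 1000));
round(acf_fft(x)[1:5], 2);   # roughly 1, 0.8, 0.64, ... for an AR(1) with coefficient 0.8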

It’s really fast to do all of them in practice, not just in theory.

Depending on the FFT function you use, you may need to normalize the output (see the code sample below for Stan). Run a test case and make sure that you get the right ratios of values out in the end, then you can figure out what the scaling needs to be.

Eigen and Kiss FFT

For Stan, we started out with a direct implementation based on Kiss FFT.

  • Stan’s original Kiss FFT-based source code (C/C++) [Warning: this function does not have the correction applied; see the current Stan code linked below for an example]

At Ben Goodrich’s suggestion, I reimplemented using the Eigen FFT C++ wrapper for Kiss FFT. Here’s what the Eigen-based version looks like:

As you can see from this contrast, nice C++ library design can make for very simple work on the front end.

Hats off to Matt for the tutorial, Thibauld Nion for the nice blog post on the mask-based correction, Kiss FFT for the C implementation, and Eigen for the C++ wrapper.

Summing Out Topic Assignments in LDA

April 14, 2011

[Update: 20 April 2011. I replaced my original (correct, but useless) derivation with the derivation Matt Hoffman provided in the comments.]

In a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In fact, we could go all the way to deriving the probability of each word’s topic assignment given all other parameters, thus defining a natural collapsed Gibbs sampling strategy.

In this post, I’ll show how to turn this around, summing out the discrete parameters in order to produce a fully continuous density. The nice feature about such a density is that it’s amenable to Hamiltonian Monte Carlo methods. These may actually mix faster than the collapsed Gibbs sampler. Now that we have a working HMC implementation, it’ll be easy to test.

It’s always pretty straightforward to sum out discrete parameters. For instance, it’s commonly encountered in the textbook mixture model in the E step of EM (e.g., for soft K-means clustering or semi-supervised learning). It’s all based on the sum rule (the law of total probability), which says that if p(a,b) is discrete in b and b can take on values 1..N, then

p(a) = \sum_{b=1}^{N} p(a,b).

All we need to do is repeatedly apply this rule to each of LDA’s topic assignments to tokens.

Latent Dirichlet Allocation

The LDA model involves a combination of Dirichlet and multinomial distributions. We observe the words w_{d,i} for each document d \in 1{:}D, where document d is of length I_d. We let V be the set of all words observed. We assume there are K topics. Next, there are multinomial parameters \theta_d of dimension K for the distribution of topics in document d. We also assume multinomial parameters \phi_k of dimension V for the distribution of words in topic k. For simplicity, we assume fixed Dirichlet hyperpriors \alpha and \beta of dimensionality K and V over the multinomial parameters \theta_d and \phi_k.

The full probability model p(\theta,\phi,z,w|\alpha,\beta) is defined by

\theta_d \sim \mbox{\sf Dirichlet}(\alpha)

\phi_k \sim \mbox{\sf Dirichlet}(\beta)

z_{d,i} \sim \mbox{\sf Discrete}(\theta_d)

w_{d,i} \sim \mbox{\sf Discrete}(\phi_{z[d,i]})

Summing out the Topic Assignments

In our previous post, we showed how to simplify the expression for the distribution of topic assignments given words and hyperpriors, p(z|w,\alpha,\beta). Here, we are going to derive p(\theta,\phi|w,\alpha,\beta), the density of the document distributions and topic distributions given the words and hyperpriors.

p(\theta,\phi|w,\alpha,\beta)

{} \propto p(\theta,\phi,w|\alpha,\beta)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times p(w|\theta,\phi)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} p(w_{d,i}|\theta_d, \phi)

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z_{d,i} = 1}^K p(w_{d,i},z_{d,i}|\theta_d, \phi)

The sums come from repeated application of the sum rule to the topic assignments. Next, factoring each joint term using the model’s definitions, we get

{} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i] = 1}^K p(z_{d,i}|\theta_d) \times p(w_{d,i}|\phi_{z[d,i]})

Expanding out the definitions, we are left with a nice tight expression that’s easy to calculate:

p(\theta,\phi|w,\alpha,\beta)

{} \propto \prod_{d=1}^D \mbox{\sf Dir}(\theta_d|\alpha)

{} \times \prod_{k=1}^K \mbox{\sf Dir}(\phi_k|\beta)

{} \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]}) .

Of course, we’ll probably be doing this on the log scale, where

\log p(\theta,\phi|w,\alpha,\beta)

{} = Z_{w,\alpha,\beta}

{} + \sum_{d=1}^D \log \mbox{\sf Dir}(\theta_d|\alpha)

{} + \sum_{k=1}^K \log \mbox{\sf Dir}(\phi_k|\beta)

{} + \sum_{d=1}^D \sum_{i=1}^{I_d} \log \left( \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]}) \right)

where Z_{w,\alpha,\beta} is a normalizing term that we can ignore for most optimization, expectation or sampling-based methods. Note that the \Gamma functions inside the Dirichlet terms go into the Z_{w,\alpha,\beta} term if we leave the priors \alpha and \beta constant.
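
As a concreteness check, here’s a small R sketch of that log density (my own, computed up to the constant Z_{w,\alpha,\beta}); theta is a D x K matrix, phi is a K x V matrix, and doc/word are parallel integer vectors giving each token’s document and word (all my own representation, not from the post):

# log Dirichlet kernel for each row of a simplex matrix, dropping the
# normalizing constant (it gets absorbed into Z when the prior is fixed)
log_dirichlet_kernel = function(x, alpha) {
  rowSums(sweep(log(x), 2, alpha - 1, `*`));
}

lda_log_density = function(theta, phi, doc, word, alpha, beta) {
  lp = sum(log_dirichlet_kernel(theta, alpha)) + sum(log_dirichlet_kernel(phi, beta));
  for (n in seq_along(word)) {
    # sum over the K topic assignments for token n, then take the log
    lp = lp + log(sum(theta[doc[n], ] * phi[, word[n]]));
  }
  lp;
}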

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators try hard, but the bias in their answers gives them low raw accuracy. Their responses can still be useful if we correct for the bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotators’ accuracies do just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.
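
Here’s that update as a few lines of R (my own illustration, with made-up numbers):

posterior_prob_pos = function(prev, sens, spec, y) {
  odds = prev / (1 - prev);                                   # prior odds from prevalence
  lr = ifelse(y == 1, sens / (1 - spec), (1 - sens) / spec);  # each annotator's likelihood ratio
  odds = odds * prod(lr);
  odds / (1 + odds);                                          # back to a probability
}
# three annotators all labeling an item 1; the third is a spammer with sens = 1 - spec
posterior_prob_pos(prev = 0.3, sens = c(0.9, 0.8, 0.2), spec = c(0.85, 0.75, 0.8), y = c(1, 1, 1));
# the spammer's ratio is 0.2 / (1 - 0.8) = 1, so that annotator leaves the odds unchanged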

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on the input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, corresponding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratio has no effect, because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of each true category, the probability of each possible response, and the contribution each response would make to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional to the sum of the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}) and the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators with \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain: as we showed in the last section, if annotator j provides random responses then p(c_i|y_{i,j}) = p(c_i), so \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].
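
Here’s a rough R sketch of that expected information gain calculation (my own, with hypothetical numbers, following the formulas above):

entropy = function(p) -sum(ifelse(p > 0, p * log2(p), 0));
info_gain = function(phi, sens, spec) {
  q1 = phi * sens + (1 - phi) * (1 - spec);            # Pr[y = 1]
  q2 = (1 - phi) * spec + phi * (1 - sens);            # Pr[y = 0]
  post1 = phi * sens / q1;                             # Pr[c = 1 | y = 1]
  post0 = phi * (1 - sens) / q2;                       # Pr[c = 1 | y = 0]
  entropy(c(phi, 1 - phi)) -
    (q1 * entropy(c(post1, 1 - post1)) + q2 * entropy(c(post0, 1 - post0)));
}
info_gain(phi = 0.5, sens = 0.9, spec = 0.8);   # accurate annotator: positive gain
info_gain(phi = 0.5, sens = 0.3, spec = 0.7);   # spammy annotator (sens = 1 - spec): zero gain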

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalties for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently, maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose we have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice after L_{1,1}, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.
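
In R, the quantized decision is just a couple of lines (my own sketch, with made-up losses):

decide = function(phi, L) {
  # L[k, k'] = loss for a true category k item classified as k' (index 1 = category 0, index 2 = category 1)
  exp_loss_0 = L[1, 1] * (1 - phi) + L[2, 1] * phi;   # expected loss for assigning category 0
  exp_loss_1 = L[1, 2] * (1 - phi) + L[2, 2] * phi;   # expected loss for assigning category 1
  if (exp_loss_1 < exp_loss_0) 1 else 0;
}
# letting an X-rated image through (true 1 classified as 0) costs 10x the reverse error
L = matrix(c(0, 1, 10, 0), nrow = 2, byrow = TRUE);
decide(0.3, L);   # returns 1: even at 30% probability, the asymmetric cost says flag it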

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis et al. say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? The model does adjust for such spammers in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

July 13, 2010

[Update: 1 February 2014. David Bamman points out that there’s a mistake just before equation (62). The correct formula should be

\Gamma(x + q) = \Gamma(x) \times \prod_{i=1}^q (x + i - 1).

This has implications downstream (replacing i with i - 1) that I don’t have time to change right now.]
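
A quick numerical spot check of the corrected identity, in R:

x = 2.5; q = 4;
all.equal(gamma(x + q), gamma(x) * prod(x + (1:q) - 1));   # TRUE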

[Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of the Wikipedia derivation discussed below.

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above, which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

Contextual Effects and Read Quality in a Probabilistic Aligner

June 14, 2010

In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.

Context Plus Read Quality

Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina.

Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb files (with simple fastq format, where you only get the probability of the best base, we can either treat the others as uniformly likely or use knowledge of the confusion error profile of the sequencer).

We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep.

The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.

Modeling Alignments

What we’re interested in for our work on RNA-Seq based mRNA expression is modeling p(y|x), the probability of a read y given a reference sequence x. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment A at a (latent) start position i, p(y,A,i|x), and then marginalizing,

p(y|x) = \sum_{A,i} p(y,A,i|x).

The best alignment will be computable via a standard sequence max algorithm, and the marginalization through the standard sum-product algorithm. Both run in time proportional to the read length times the reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.

The Model

We assume there is a score \mbox{\sc start}(i,x) \in \mathbb{R} for each start position i in reference sequence x. For instance, this may be a log probability of starting at a position given the lab fractionation and priming/amplification protocol.

We will also assume there is a quality score on a (natural) log probability scale \mbox{\sc logprb}(y,b,j) \in \mathbb{R} for the probability of base b at position j in read y. For instance, these may be the Phred quality scores reported in Illumina’s prb files converted to (natural) log probability of match.

Our definition only normalizes for reads of fixed length, which we’ll write \mbox{\sc readlen}.

We will recursively define a score on an additive scale and then exponentiate to normalize. In symbols,

p(y,A,i|x) \propto \exp\big(\mbox{\sc start}(i,x) + f(A,i,0,x,y,\mbox{\rm start})\big)

where we define the alignment score function f(A,i,j,x,y,a), for alignment sequence A, reference sequence x and reference position i, read y and read position j, and previous alignment operation a, recursively by

\begin{array}{rcl}
f([],i,\mbox{\sc readlen},x,y,a) & = & 0.0
\\[18pt]
f(\mbox{\rm sub}(x_i,b)+A,i,j,x,y,a) & = & \mbox{\sc subst}(x,y,i,j,a) \\
 & + & \mbox{\sc logprb}(y,b,j) \\
 & + & f(A,i+1,j+1,x,y,\mbox{\rm sub}(x_i,b))
\\[12pt]
f(\mbox{\rm del}(x_i)+A,i,j,x,y,a) & = & \mbox{\sc del}(x,y,i,j,a) \\
 & + & f(A,i+1,j,x,y,\mbox{\rm del}(x_i))
\\[12pt]
f(\mbox{\rm ins}(b)+A,i,j,x,y,a) & = & \mbox{\sc ins}(x,y,i,j,a) \\
 & + & \mbox{\sc logprb}(y,b,j) \\
 & + & f(A,i,j+1,x,y,\mbox{\rm ins}(b))
\end{array}

Reminds me of my Prolog programming days.

The Edit Operations

The notation a + A is meant to be the edit operation a followed by edit sequence A. The notation [] is for the empty edit sequence.

The notation \mbox{\rm sub}(x_i,b) is for the substitution of an observed base b from the read for base x_i of the reference sequence; if x_i \neq b it’s a particular mismatch, and if x_i = b, it’s a match score. The notation \mbox{\rm ins}(b) is for the insertion of observed base b and \mbox{\rm del}(x_i) for the deletion of base x_i in the reference sequence. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced.

Note that the base case requires the entire read to be consumed by requiring j = \mbox{\sc readlen}.

Scoring Functions

Note that the score functions, \mbox{\sc subst}(x,y,i,j,a) \in \mathbb{R}, \mbox{\sc del}(x,y,i,j,a) \in \mathbb{R}, and \mbox{\sc ins}(x,y,i,j,a) \in \mathbb{R}, have access to the read y, the reference sequence x, the present position in the read j and reference sequence i, as well as the previous edit operation a.

Enough Context

This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place.

Further note that by defining the start position score by a function \mbox{\sc start}(i,x), which includes a reference not only to the position i but to the entire reference sequence x, it is possible to take not only position but also information in the bases of x at positions around i into account.

Dynamic Programming Same as Before

The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalization for p(y,A,i|x), which has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read.

In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length \mbox{\sc readlen} as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.)

In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.
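
Here’s a much-simplified R sketch of that sum-product computation (my own toy version, not the post’s full model: fixed log scores for match, mismatch, insertion, and deletion; no start-position score, no context, no per-base quality; the read is aligned against all of x), just to show the table-filling pattern:

log_sum_exp = function(v) { m = max(v); m + log(sum(exp(v - m))); }

read_log_prob = function(x, y, lp_match = log(0.96), lp_mismatch = log(0.01),
                         lp_ins = log(0.01), lp_del = log(0.02)) {
  I = length(x);  J = length(y);
  f = matrix(-Inf, I + 1, J + 1);   # f[i+1, j+1] = log sum over alignments of x[1:i] with y[1:j]
  f[1, 1] = 0;
  for (i in 0:I) {
    for (j in 0:J) {
      if (i == 0 && j == 0) next;
      terms = c();
      if (i > 0 && j > 0)           # substitution (match or mismatch)
        terms = c(terms, f[i, j] + ifelse(x[i] == y[j], lp_match, lp_mismatch));
      if (i > 0)                    # deletion of reference base x[i]
        terms = c(terms, f[i, j + 1] + lp_del);
      if (j > 0)                    # insertion of read base y[j]
        terms = c(terms, f[i + 1, j] + lp_ins);
      f[i + 1, j + 1] = log_sum_exp(terms);
    }
  }
  f[I + 1, J + 1];
}

read_log_prob(strsplit("ACGTAC", "")[[1]], strsplit("ACGAC", "")[[1]]);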

Inferring the Actual Read Sequence for SNP, Etc. Calling

Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can marginalize out the probability that the read sequence is z given that it was generated from reference sequence x with read y (remember the read’s non-deterministic here),

p(z|x,y) = \sum_{i,\mbox{yield}(A) = z} p(A,i|x,y),

where \mbox{yield}(A) is the read produced from alignment A, which itself is defined recursively by

\mbox{yield}([]) = \epsilon,

\mbox{\rm yield}(\mbox{\rm del}(x) + A) = \mbox{\rm yield}(A),

\mbox{\rm yield}(\mbox{\rm sub}(x_i,b) + A) = b \cdot \mbox{\rm yield}(A), and

\mbox{\rm yield}(\mbox{ins}(b)+A) = b \cdot \mbox{\rm yield}(A),

where \epsilon is the empty string and \cdot string concatenation.

I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).

Reference Sequences with Uncertainty

We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference-side non-determinism as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes a uniform prior \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses a splice variant z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called “noise isoform”, which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. Given how it’s defined, I don’t quite see how anything would get mapped to it if we restrict attention to reads that map within a few edits with BOWTIE.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and assume o_n is the strand, both of which are generated from the ref sequence z_n, by distributions p(o_n|z_n) and p(s_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).
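
For concreteness, here’s a toy R sketch of that EM computation (mine, not Li et al.’s code), taking as given an N x K matrix lik of read-given-isoform likelihoods p(y_n | z_n = k, \varphi):

em_theta = function(lik, iters = 100) {
  K = ncol(lik);
  theta = rep(1 / K, K);                       # start from uniform expression
  for (t in 1:iters) {
    resp = sweep(lik, 2, theta, `*`);          # E step: unnormalized responsibilities
    resp = resp / rowSums(resp);               # normalize per read
    theta = colMeans(resp);                    # M step: updated mixture weights
  }
  theta;
}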

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]

Simulations

They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that, according to the documentation, generates all matching reads, but there’s also a maximum-number-of-backtracks parameter that can eliminate some matches if 2 or more edits are allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facility at NYU, just presented the data at a SOLiD users’ group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to those estimated from parallel microarray data using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for calculating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.