Archive for the ‘Reviews’ Category

Twitter POS Tagging with LingPipe and ARK Tweet Data

November 4, 2011

The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

The relevant resources are:

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

You can get their code from their project home page, linked above.

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co https://aliasi.devguard.com/svn/sandbox/twitter-pos

The code’s in a single file, stored under the src subdirectory of the package:

package com.lingpipe.twpos;

import com.aliasi.classify.*;
import com.aliasi.corpus.*;
import com.aliasi.io.*;
import com.aliasi.hmm.*;
import com.aliasi.tag.*;
import java.io.*;
import java.util.*;

public class Eval {

    public static void main(String[] args) throws IOException {
        System.out.println("Reading Corpus");
        TwitterPosCorpus corpus
            = new TwitterPosCorpus(new File(args[0]));

        System.out.println("Training Tagger");
        HmmCharLmEstimator hmm = new HmmCharLmEstimator();
        corpus.visitTrain(hmm);
        HmmDecoder tagger = new HmmDecoder(hmm);

        System.out.println("Evaluating");
        boolean storeTokens = true;
        TaggerEvaluator evaluator
            = new TaggerEvaluator(tagger,storeTokens);
        corpus.visitTest(evaluator);
        System.out.println(evaluator.tokenEval());
    }

    static List<Tagging> parse(File f) throws IOException {
        List<Tagging> taggings
            = new ArrayList<Tagging>();
        FileLineReader reader = new FileLineReader(f,"UTF-8");
        List tokens = new ArrayList();
        List tags = new ArrayList();
        for (String line : reader) {
            String[] tokTag = line.split("\\s+");
            if (tokTag.length != 2) {
                taggings.add(new Tagging(tokens,tags));
                // System.out.println("tokens=" + tokens);
                // System.out.println("tags=" + tags);
                tokens = new ArrayList();
                tags = new ArrayList();
            } else {
                tokens.add(tokTag[0]);
                tags.add(tokTag[1]);
            }
        }
        return taggings;
    }

    static class TwitterPosCorpus extends ListCorpus<Tagging> {
        public TwitterPosCorpus(File path) throws IOException {
            for (Tagging t : parse(new File(path,"train")))
                addTrain(t);
            for (Tagging t : parse(new File(path,"dev")))
                addTrain(t);
            for (Tagging t : parse(new File(path,"test")))
                addTest(t);
        }
    }
}

LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.

Finally, here’s the confusion matrix for tags in the first-best output:

,D,E,#,!,G,&,@,A,$,L,N,O,,,U,T,V,P,S,R,~,^,X,Z
D,446,0,0,1,0,0,0,4,0,0,0,7,0,0,0,0,11,0,7,0,8,1,0
E,0,53,0,1,2,0,0,0,1,0,0,0,5,0,1,0,0,0,0,0,0,0,0
#,0,0,44,0,1,0,0,0,0,0,10,0,0,0,0,3,0,0,0,0,20,0,0
!,0,0,1,140,1,0,0,5,0,1,15,5,0,0,0,3,1,0,7,0,7,0,0
G,1,1,5,2,14,0,0,1,3,0,10,0,10,0,0,4,1,0,1,2,15,0,0
&,0,0,0,0,0,122,0,1,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0
@,0,0,0,0,0,0,328,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
A,0,0,0,1,0,1,0,248,3,0,44,0,0,0,2,30,2,0,24,0,12,0,0
$,0,0,0,0,0,0,1,0,79,0,2,0,0,0,0,0,3,0,0,0,0,0,0
L,2,0,0,0,1,0,0,0,0,120,3,1,0,0,0,2,0,0,0,0,0,0,0
N,1,0,1,5,1,0,0,49,1,1,783,2,0,0,2,52,6,0,14,0,63,0,0
O,4,0,0,0,1,0,0,2,0,0,2,456,0,0,1,0,0,0,2,0,4,0,0
,,0,4,0,0,2,0,0,0,0,0,0,0,861,0,0,2,0,0,0,11,0,0,0
U,0,0,0,1,0,0,0,0,0,0,0,0,1,114,0,0,0,0,0,0,1,0,0
T,0,0,0,0,0,0,0,0,0,0,0,1,0,0,24,0,9,0,1,0,1,0,0
V,0,1,0,0,0,0,0,21,0,1,69,1,0,0,0,921,9,0,7,2,21,0,0
P,2,0,0,1,0,0,0,4,1,0,1,0,0,0,11,6,571,0,12,0,4,0,0
S,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,2,0,0,1,0,0
R,4,0,0,1,0,0,0,13,0,0,20,1,0,0,1,6,15,0,269,0,8,1,0
~,0,0,0,1,1,0,0,0,0,0,0,0,32,0,0,1,0,0,0,177,0,0,0
^,1,0,4,1,2,0,0,29,2,0,101,0,2,0,0,16,4,0,1,0,331,0,1
X,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,3,0
Z,1,0,0,0,0,0,0,0,0,1,4,0,0,0,0,0,0,1,0,0,13,0,2

I should really figure out how to format that a bit more neatly.

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

September 12, 2011

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.

Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7,which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).

Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posterior (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.

Really Adversarial Turkers?

How in the world did they get a Mechanical turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Adding a Model

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is pairwise independent.

One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.

Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1]). In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.

Steyvers, Lee, Miller and Hemmer (2009) The Wisdom of Crowds in the Recollection of Order Information

June 23, 2011

I just found Mark Steyvers et al.’s work on models of annotation for rankings:

They also describe the model with a more psych/experimental slant with some more experimental data relating observed (and estimated) expertise to self-reported expertise in:

The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is to then induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotator’s accuracies (but not biases in this case).

Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurston in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

Steyvers et al.’s Model

The Items

The basic idea is to introuce a latent scalar \mu_i for each item i \in 1{:}I being ranked. The ordering of the latent scalars \mu_i induces a complete ordering of items.

The Annotators

Each annotator j \in 1{:}J is characterized by a single noise parameter \sigma_j > 0. These are given what seem like a rather arbitrary prior:

\sigma_i \sim \mbox{\sf Gamma}(\lambda,1/\lambda)

where \lambda is a constant hyperprior (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for \lambda, which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

The Annotations

The annotations that are observed are of the form of complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoneix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar x_{i,j} for each item i and annotator j (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

x_{i,j} \sim \mbox{\sf Norm}(\mu_i,\sigma_j^2).

That is, the latent position x_{i,j} assigned to item i by annotator j is drawn from a normal centered around the true latent location \mu_i for item i with noise determined by the annotator-specific deviation parameter \sigma_j.

The Sampling Trick

There’s only one problem in implemeting this model: the latent x_{i,j} must be consistent with the observed ranking y_{i,j}. As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position x_{i,j}, use a normal distribution truncated to the interval bounded by the next lowest-ranked and next highest-rank item than i (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the x_{i,j}, yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are (\mu, x, \sigma). Given samples for \mu, we can estimate the rank of each item by looking at the rank in each sample \mu^{(k)}. We also get a direct characterization of the noise of an annotator through \sigma_i. We probably don’t care at all about the annotator-specific latent positions x_{i,j}.

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots of histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or good general overview — you find introductions in any book on survey sampling). It’s not uncommon to have the basic data be measured with noise. Especially in an epidemiology setting or any kind of subjective coding by human subjects, like survey responses.

Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by t, we’d have an individual level error \sigma_i for each individual across tasks, and an error \sigma_{t,i} for each task for each user that’s sampled in the neighborhood of \sigma_i. It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the \sigma_i a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.

Resistance is Futile — I’ve been Assimilated (by Apple)

March 23, 2011

The second most popular post ever on this blog was my review of Lenovo’s Thinkpad W510. I used to really like the IBM Thinkpads. (The number one post is the link to 64-bit versions of Eclipse, and I don’t even use Eclipse except for an occassional refactoring or stepwise debug).

I just gave my 15-month old quad-core, high-res, 8GB Thinkpad W510 with a 128GB SSD to one of our grad students after Andrew bought me a shiny new Macbook Air 13″ with 4GB and a 256GB SSD. Apple only just a few weeks ago came out with similar hardware for the Macbook Pro, by the way, and Lenovo’s not exactly rushing out of the gate with new CPUs.

Evaluation Period Honeymoon

I thought at the very least, if I hated Mac OS X, I could just install Windows. If the Air was too small, I could just use it for travel.

Within about 24 hours, having watched Apple’s converting from Windows to Mac video, I’m a convert. So much so that Mitzi’s been getting sick of hearing me tell her how great the new Macs are. We’d both last used them with any frequency when I had a Mac Quadra in the early 1990s.

Among my favorite things is emacs commands in text windows. And I’m not sure how I lived without the gestures for scrolling and navigating the web. It really is a unix machine underneath, though I never had any problems with Cygwin on Windows. I also love the keyboards. That and the weight are what finally soured me on the Thinkpads — their new ones are terrible.

Did I mention the insane battery life?

Or just how nice it feels? It’s like moving from a Kodak Brownie to a Leica.

The last time I was this impressed with a piece of hardware was when my grad school department got its first Sun Workstation I way back in 1985 or 1986.

It’s not perfect. Does Microsoft own the patent on grabbing windows by their sides to resize them? It’s not possible in either Linux or the Mac [I was corrected in the comments; it is possible in Linux, just very trying as a test of hand-eye-mouse coordination]. And why do I have a delete key that’s really a backspace and no real delete key? (Yes, I realize fn-delete adds up to a delete.) And what’s the function key doing where my control key should be? Does anyone ever use caps lock for anything? Why do computers still have them?

Also, I don’t find Macs as blindingly obvious as everyone else says they are. For instance, I had no clue as to what to do with the “power cord”. I went online. I read the manual. I’d have thought it was the wrong piece, but it was listed in the parts. Turns out you need to take their power supply apart and then it clips in. Huh? It reminds me of getting an iPod many years ago and having no clue how to turn it off.

No Half Measures

Given that the money was turned on at Columbia, I ordered the Macbook Air for home and travel, as well as a 27″ iMac for the office, and an iPad 2. The iMac is very sweet, but somehow not nearly as cool as the Air. I can’t wait until the iPad shows up — I’m tired of printing out PDFs. As soon as the iPhone 5 shows up, I’m getting one from Verizon (I don’t even have a cellphone now).

Welcome to the New Borg

The old joke used to be that Microsoft was like The Borg, whose tagline was “you will be assimilated”. Slashdot even uses a Borged-out Bill Gates as an icon.

It turns out that as hard as Google’s trying to be the new Borg, the current Borg headgear apparatus rests on Apple’s head. After all, they’ve not just locked down their hardware, they’re also trying to take over the music and movie distribution business. That’s why I’m so surprised I couldn’t find a Steve-Jobs-as-Borg image. C’mon Slashdot, help me out here.

Tip of the Day

When your significant other says he or she thinks you like your new Macbook Air more than him or her, do NOT reply, “It’s so thin!”. Luckily, Mitzi has a sense of humor to go with her yoga-and-running-toned body.

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators would try hard, but the bias in their answers would give them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotator’s accuracies does just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, correspdonding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratios have no effect because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional the sum of the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}) and the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators who have \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain because as we showed in the last section, p(c_i|y_{i,j}) = p(c_i) if annotator j provides random responses, then \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalities for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose have have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice after L_{1,1}, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

Grammar as Science

September 29, 2010

I had an hour free time on the University of Chicago campus, so after strolling the quad and gawking at the Robie house, I toddled into the university bookstore. After marveling at the number of X and Philosophy titles*, I wandered over to the linguistics section, where I found:

I was struck by the beginning of the back-cover blurb:

This introductory text takes a novel approach to the study of syntax. Grammar as Science offers an introduction to syntax as an exercise in scientific theory construction.

If I’m not mistaken, this is implying that other introductions to syntax are exercises in something other than science. I’ll buy that. But I’m not sure I buy the next bit:

Syntax provides an excellent instrument for introducing students from a wide variety of backgrounds to the principles of scientific theorizing and scientific thought; it engages general intellectual themes present in all scientific theorizing as well as those arising specifically within the modern cognitive sciences. …

This book aside, I doubt Chomskyan linguistics is a good exemplar of science for reasons that (part of) Peggy Speas’s endorsement illuminates:

[Grammar as Science] shows students explicitly how to ‘think like a linguist.’ Students who use this book will come away with an extraordinarily strong grasp of the real underpinnings of linguistics.

The key here is “think like a linguist”. If you’ve ever tried to take Chomsky’s conception of grammar from Aspects and explain it to students (which I’m embarrassed to admit that I have done in the past), you’d have your doubts about the relation between “think like a scientist” and “think like a linguist”, too.

I was further amused by (the beginning of) Norbert Hornstein’s endorsement, also on the back cover:

What makes modern generative linguistics a science, and an interesting one at that, is a subtle interaction among several factors: the questions that it asks, the abstractions that it makes to answer these questions, the technical apparatus it uses to make these questions precise, the evidence it uses to adjudicate between different answers, and the aesthetic that animates it when putting all these ingredients together.

Silly me. Here I was thinking what makes something interesting science is predictive power.

Show, Don’t Tell

One piece of very good advice for writers in any genre is show, don’t tell.

Sci-fi writers and linguists seem to have a particularly hard time following this advice. I’m afraid the newbie/off-worlder/woken-up-from-cryosleep listener who has everything explained to him obeys the letter but not the spirt of the dictum. That’s similar to “serious” texts where thin characters have a “dialogue” that’s a proxy for the author explaining to the reader. Examples include Gödel, Escher, Bach in logic/AI (which I loved when I was in high school when it came out) and Rhyme and Reason in linguistics, which covers very similar “philosophy of science” ground as Grammar as Science.

I’ve never seen a physics or chemistry or even a biology book start off by asserting that what follows is really science. Most scientists, and most scientific texts, just do science.

Methinks the linguists do protest too much.

Been There, Done That

If you look at the first chapter of the second HPSG book, Carl Pollard (this was his axe to grind) preaches the same gospel, as in this excerpt from page 7:

 


* The following are real, though I’m too lazy to link past the first: Anime and Philosophy, Star Wars and Philosophy, Alice in Wonderland and Philosophy, The Matrix and Philosophy, Facebook and Philosophy, Transformers and Philosophy, Batman and Philosophy, The Undead and Philosophy and so on (and on and on). Is this some kind of works project for philosophers?

McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)

August 16, 2010

I’m very happy to see that the 2nd Edition is out:

  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.

Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code lingpipeluc40 when you check out.

You Need This Book

If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.

What if I have the First Edition?

Keep it for your projects that require 2.x versions of Lucene. So much has changed in version 3 that is not backward compatible with previous versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).

I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)

Is it a Good Book?

I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.

The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.

The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.

They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).

If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).

They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself.

Needs a Revision

Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)

The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.

There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.

There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).

There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).

I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.

What’s Missing?

There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.

From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.

JUnit Examples

Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.

What about the Code?

Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)

Many (but not all) methods that throw exceptions are just declared to throw Exception. I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.

The use of generics is spotty, with allocations like new TreeMap() that should be new TreeMap<String,Integer> and subsequent use of non-generic Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.

I’d recommend autoboxing, or its explicit result, Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf() being a factory can share instances).

I’d like to see more foreach and fewer explicit iterators.

They don’t like ternary operators or min/max operators, resulting in awkward fragments like:

int size = max;
if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;

rather than the simpler

int size = Math.min(max,hits.scoreDocs.length);

I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).

There is a more serious issue for floating point precision they try to deal with in listing 5.22 in the last if/else block. They try to avoid having cosines take on values outside of (-1,1), which would be illegal inputs to acos(). Their approach is insufficient. The only way to guarantee that your natural cosine computations fall in this range is to do:

double boundedX = Math.max(-1.0, Math.min(1.0, x));

That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.

Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y).

Layout, Etc.

I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.

The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).

The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).

How Much Math?

For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (Morgan Kaufmann text, so it’s prohibitively expensive). Mannnig, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.

In the cosine example in the “Leveraging Term Vectors” (section 5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short).

For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.

What about Solr?

Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.

Other Books on Lucene

I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.

Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.

About those Cover Illustrations

On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller] and found two copies of the book for sale at a Donald Heald Books (you may need to search again). Here’s the scoop

[DALVIMART, Octavien (illustrator) - William ALEXANDER (1767-1816)].

The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French

London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).

A good copy of this classic work on the costume of the Ottoman Empire.

Starting with the "Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.

Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.

Sierra & Bates. Head First Java 2nd Ed. (Review)

August 9, 2010

After the comments on the last post about relying more heavily on Java textbooks, I ordered the most popular ones from Amazon. I’ll start the reviews with the number one most popular book on Java (ranked by Amazon sales; it also has 269 reviews; I’m happy to see Effective Java is number 4):

  • Sierra, Kathy and Bert Bates. 2005. Head First Java. 2nd Edition. O’Reilly.

Mashup: Elementary School Workbook + Kitschy Humor

Image from Title Page.

I grew up with texts like The Little LISPer, but this book’s even stranger. It strikes me as a mashup of an elementary school math problem workbook and one of those cutesy humor books you see near the checkout counters of bookstores around the holidays.

Here’s an example of one of their quizzes, where you fill in missing pieces of a Java program; just like school writing exercises:

The Design and Layout

I think they were striving for a sense of the enthusiastic amateur. The “fun” fonts and askew alignments make it look non-linear. Code comments are added in a handwritten post-it note style.

You really need to head over to Amazon and “look inside”.

Who’s it For?

It’s aimed at Java novices. Other than that, I’m not sure. Maybe for potential Java developers who like crossword puzzles or true/false quizzes with titles like “Be the Compiler”. I’m more of a Java Puzzlers guy myself (thank my other childhood friend, Martin Gardner).

Unless you laughed at the photo included above, don’t buy it for the humor.

How Much Does it Cover?

It covers a startling amount of Java, including advanced topics like threads, sockets, RMI, Swing, audio, …

Of course, there can’t be much depth given this range. But you do get a wide range of hello-world-type programs.

Image from Page 101.

Is it Sound Java?

The authors clearly know what they’re doing. The advice and coverage are too basic if you want to program Java seriously, but it’ll get you started on the right track.

There’s even good high-level advice like telling you to write unit tests (though not using anything as complex as JUnit).

It’s an Industry

Like other popular series (i.e., …for Dummies, … in Action, … in a Nutshell), this one’s turned into an industry, with dozens of titles covering everything from HTML to statistics.

As such, they’ve created all the trappings of such an enterprise, such as “Head First learning principles”:

  • make it visual — put the words within or near graphics
  • use a conversational and personalized style
  • get the learner to think more deeply
  • get — and keep — the reader’s attention
  • touch their emotions

And they break out all sorts of “metacognitive” tips explaining why they wrote such a whacky book and how you should go about learning with it (like read it before bed and mix right-brain/left-brain activities for learning).

Many reviewers like the approach well enough to blurb about it.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes an uniform prior for \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses splice variants z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called "noise isoform", which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. I don't quite see given how it's defined how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and assume o_n is the strand, both of which are generated from the ref sequence z_n, by distributions p(o_n|z_n) and p(s_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don't know what I was thinking -- Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I'm not sure how the matrices are estimated.]

What they’re not using is the the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It's really pretty much like Gibbs sampling in terms of the application, if not philosophy.]

Simulations

They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that the documentation to generate all matching reads, but there's also a maximum number of backtracks parameter that can eliminate some matches if there are 2 or more edits allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there's no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facilty at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to that estimated from parallel microarray day using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for caclulating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Piecewise-Linear or Isotonic Regression for Calibrating Naive Bayes (or Other Predictors)

June 3, 2010

I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:

I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of Naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.

Failed Independence Assumptions in Naive Bayes

The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they’re not. This is, in fact, the naivete from which the name is derived, though it’s a misnomer, because the model’s right for some situations. For instance, it’s common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It’s what you’ll get from LingPipe’s NaiveBayesClassifier and TradNaiveBayesClassifier classes.

In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because of redundant tokens with high correlations being treated as independent. This is especially easy to see for repeated words. If we’re classifying sports documents and it’s 90% likely to be an article about basketball if it mentions “lebron” (actually it’s probably higher, due to the celebrity of LeBron James). So what if we see “lebron” twice? In the model, the probability of being about basketball goes up to 0.9^2 / (0.9^2 + 0.1^2) \approx 0.987. The problem is that two “lebron”s don’t tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the Salsa band The Lebrón Brothers is also likely to mention “lebron” more than once. Once you’ve seen one, the effect of the next one should go down. (The fact that all three uses of “lebron” are spelled differently is an orthogonal issue I shouldn’t have confused here; the same holds for a fully capitalized name like “Smith”.)

Measuring Term Correlation

One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church’s analysis of correlation in a paper titled One term or two? (but only available pay-per-view) and Martin Jansche’s careful follow up.

It’s traditional to hack calibrations by converting every document x to the same virtual document length \lambda, using p_{\lambda}(x) \propto p(x)^{\lambda/{\rm length}(x)}. This is what we do in our traditional naive Bayes class in LingPipe (it doesn’t change first-best results, but helps enormously for EM and presumably for n-best results ordering).

Binned Calibration

What Zadrozny and Elkan did for naive Bayes is to train a classifer and and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category. They then sort these into ten bins of equal size. For each bin, they calculate the proportion of true answers in the bin. For new items, they are assigned probabilities based on which bin their estimate for the probability of the best category falls into.

They talk about using held-out data, but don’t. I’m not sure why — naive Bayes is pretty easy to overfit. Ideally they’d have just done cross-validation on the whole data set to get predictions, or use a representative held-out data set of appropriate size.

Isotonic Regression

The Microsoft paper referred to Zadrozny and Elkan’s approach as isotonic regression, but on my reading, Zadrozny and Elkan didn’t enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn’t need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or simiarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don’t contain much data or you’re fitting a more complex model.

Binning versus Piecewise Linear Normalization

Binning may be useful for absolute probability prediction, but it’s not very useful for sorting results because it’s not a properly isotonic function (that is, we can have p(x_1) > p(x_2) and p'(x_1) = p'(x_2)).

Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).

Length, Anyone?

In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.

Category Effects?

I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.

Comaprative Effects

It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.


Follow

Get every new post delivered to your Inbox.

Join 151 other followers