Medical Subject Headings (MeSH) Parser

July 2, 2009 by lingpipe

I just finished up the doc on a complete parser and object model for the for the U.S. National Library of Medicine’s XML-based Medical Subject Headings (MeSH) distribution. MeSH contains over 25,000 controlled vocabulary items in the biomedical domain, organized into highly structured and cross-linked records. Here are links that explain what’s in MeSH:

What’s really neat is that MEDLINE is distributed with MeSH terms added by a dedicated staff of research librarians.

The parser and object representations follow the XML pretty closely, which may be familiar from LingPipe’s longstanding MEDLINE parsing and object package (which has its own tutorial). The parsers for MeSH and MEDLINE can take the compressed distribution files and parse out object-oriented representations of the associated MeSH or MEDLINE records.

The parsing pattern employed in LingPipe’s corpus package is very much like SAX’s parser/handler pattern. At one point, I wrote a whole blog entry on LingPipe’s parser/handler framework. A handler is attached to a parser, and as the parser parses the raw data, it sends object-oriented representations of records to the handler. There’s a simple demo command in the package that just prints them out to show how the code can be used.

As part of the preparations for LingPipe 4.0, I’m going to be moving the MEDLINE package (com.aliasi.medline) out of LingPipe proper and into the LingMed sandbox project. LingMed will probably graduate from the sandbox and start getting distributed on its own.

LingMed also includes parsers for distributions of NLM’s Entrez-Gene, NLM et al.’s Online Mendelian Inheritance in Man (OMIM) and the GO Consortium’s Gene Ontology (GO). These all work using the same basic SAX-like pattern.

LingMed also has data access layer implementations of search for many of these data sets (like Entrez-Gene and MEDLINE) integrated with Lucene. In particular, you can do fielded searches, yet retrieve object-oriented results (rather than Lucene documents) out the other side.

We’ll continue to add to the parsing and search base over time. Mitzi’s working on Wormbase as part of her job at NYU, and I’ve done the preliminary parsing for Uniprot.

Downloading LingMed

You can find instructions on the LingPipe site for anonymous subversion (SVN) access to our sandbox, which includes specific information about checking out the LingMed project.

Let us know if you find it useful or want us to add other features. We’d love to get some other folks using this. As is, the documentation’s a bit sketchy in places. Please feel free to send the LingPipe mailing list or me (carp@alias-i.com) questions about it directly.

DARPA grants BBN $30M for “Universal Reading System”

June 30, 2009 by lingpipe

Wow. DARPA just granted BBN a $30M contract (US dollars, of course), spread over 5 years, to develop a “universal reading system.”

I was amused by this quote from BBN’s press release, neatly worded in the subjunctive:

Prem Natarajan, vice president, Speech and Language Processing, BBN Technologies, said, “The machine reading system that DARPA envisions is not evolutionary, but revolutionary. Such a system could eliminate many of the impediments to stability that our military faces such as a lack of understanding of local customs, and give us the ability to assess global technology developments continuously.”

It’s nice to envision something revolutionary, but here’s what BBN says its going to do:

BBN will leverage its expertise in natural language processing and distillation to develop a universal text engine that captures knowledge from text and transforms it into the formal representations required by artificial intelligence systems.

DARPA’s been promoting this approach to “event extraction” (basically populating a Minsky-style frame from text) in pretty much the same way since the Message Understanding Conferences (MUC) and before. The DARPA call for proposals would be not unfamiliar to an AI researcher from the 1970s.

I found this announcement in

which linked to this

The titles are perhaps more revealing of the authors than the proposal and grant.

Convergence for (Stochastic) Gradient Descent (or EM)

June 29, 2009 by lingpipe

One of the most confusing aspects of the stochastic gradient descent (SGD) and expectation maximization (EM) algorithms as usually written is the line that says “iterate until convergence”. For some reason, most presentations don’t indicate how to measure convergence.

There are two common approaches, and perhaps more that I don’t know about:

1. Convergence by Likelihood + Prior

The first common approach to calculate the log likelihood

     log p(y|θ)

for the data y given current parameter settings θ, the log prior for the parameters

     log p(θ),

then monitor the sum

     log p(y|θ) log p(θ)

for convergence.

This approach makes sense intuitively, because it’s the value of the likelihood and prior that’s being maximized by the search algorithms (in loss terms, the negation of the log likelihood and prior is minimized).

2. Convergence by Vector Distance

The second approach is to monitor the coefficient vectors themselves. Typically, this would involve measuring Euclidean (or taxicab) distance between the parameter vectors across epochs.

Note that the vector-based approach is the only one that makes sense for a truly online algorithm.

3. Convergence by Held-Out Evaluation

I forgot to mention this in the first post. You can also use held-out labeled data to evaluate the classifier or tagger being produced, and stop training at the point the error is minimized. The main problem with this approach is that the search problem’s not convex. In practice, people often use this as a form of “early stopping”, whereby training is halted before convergence. As Bishop’s machine learning book shows, this can be viewed as a kind of regularization if the learning rate’s low enough. By stopping early, coefficients haven’t had time to reach their optimal value.

Relative Convergence and Rolling Averages

Algorithms like stochastic gradient (unlike EM) can get into situations where log likelihood actually increases across an epoch. SGD and EM both sometimes makes small steps, then start making bigger steps. So what I’ve done is taken a rolling average, as well as allowing a minimum number of epochs to be set.

Likelihood Converges, Vectors Diverge

If the vectors converge, the likelihood converges, but the converse doesn’t hold for two reasons. First, the vectors can plain old diverge. If the data’s separable and there’s no prior, there is no limit to how big the coefficients can get; the larger the cooefficients get, the higher the likelihood. (For logistic regression, this is because 1/(1+exp(-x)) approaches 0 as x approaches negative infinity and approaches 1 as x approaches infinity.)

The second problem is that the coefficients may not be identified. For instance, if I take maximum likelihood and use a different vector for each category (instead of one fewer than the number of categories), then the maximum a posteriori (MAP) solution isn’t identified in the statistical sense. That is, two different sets of vectors may produce the same result (subtract any vector from all the coefficient vectors and the results don’t change for logistic regression).

$1M Netflix Prize Qualifier BellKor Pragmatic Chaos

June 27, 2009 by lingpipe

Finally! The BellKor Pragmatic Chaos team, winner of the last two Netflix Progress Prizes just topped the 10% improvement on the development set required to be evaluated for the grand prize of US$ 1M. Check out the the Netflix Prize Leaderboard.

Given what the public’s said about this contest before, I can’t say I’m too shocked at the public’s comments to the NY Times’ Big Blog post And the Winner of the $1 Million Netflix Prize (Probably) Is …. They were overwhelmingly mocking and negative. They thought the academics were giving Netflix free work. They thought 10% better than lousy recommendations were still lousy.

I’ve said this before, but what the public doesn’t realize is just how valuable this kind of data is. They’d probably be horrified at how much time we “waste” on bakeoffs like this one where there is no reward. A lot of groups out there probably would’ve paid Netflix for this kind of data, just like we pay the partly publicly funded Linguistic Data Consortium.

Plain and simple, this is a relatively huge data regression problem with data missing nonrandomly. The top teams achieved a fairly amazing reduction in variance by pooling multiple predictors. Given the (root) mean-square error (MSE) evaluation metric, and the relation of the bias-variance decomposition of mean square error, it’s not surprising the best systems were all ensembles of classifiers.

The effort required to fit these models and dream up structures and parameterizations was a Herculean effort. Check out the details up to last year in:

Log Sum of Exponentials

June 25, 2009 by lingpipe

I’ve decided, at least in the first pass, to calculate forward/backward for CRFs on a log scale. The alternative way to maintain floating point stability, by keeping log multipliers for columns of outputs, was just making my head hurt too much in the nested loops.

The log computation involves taking the logarithm of a sum over the exponentiated values found in an array. In symbols:

The problem, of course, is that if any of the x values get sufficiently large (greater than log(Double.MAX_VALUE), which is about 710), they overflow to positive infinity when you calculate the exponent; if all of the values are tiny (less than -750 or so), they all underflow to 0.0 and the resulting log sum is zero.

Luckily, we can get more arithmetic stability with a little algebra. First, find the maximum value in x:

Then use it to normalize the largest value to 0:

\begin{array}{lll} \log \sum_i \exp(x_i) & = & \log \sum \frac{\exp(m)}{\exp(m)}\exp(x_i) \\[12pt] & = & \log \exp(m) \sum \frac{1}{\exp(m)} \exp(x_i) \\[12pt]& = & \log \exp(m) + \log \sum \exp(-m) \exp(x_i)\\[12pt]& = & m + \log \sum \exp(x_i - m) \end{array}

So the largest value passed to the exponential function is 0. If there are really tiny values after subtraction, they’ll become zero and drop out, as they should with limited precision arithmetic.

I’ve implemented a new static utility method to compute the log of the sum of the exponentiated values of an array:

com.aliasi.util.Math.logSumOfExponentials(double[]).

It’ll be out in the next release (hopefully along with CRFs).

In case you’re curious where this comes up, here’s the inductive step of the forward pass of the forward-backward algorithm for CRFs (φ(n,k’) is the feature vector for token position n and previous outcome k’; β[k] is the coefficient coefficient for outcome k):

\alpha_{n,k} = \sum_{k'} \alpha_{n-1,k'} \exp \big( \phi(n,k')^{\top} \beta_k \big)

and here it is on the log scale:

\log \alpha_{n,k} = \log \sum_{k'} \exp\big(\, \log \alpha_{n-1,k'} + \phi(n,k')^{\top} \beta_k\, \big)

Veeramachaneni and Kondadadi (2009) Surrogate Learning – From Feature Independence to Semi-Supervised Classification

June 23, 2009 by lingpipe

It’s more fun with classifier algebra today, as we look in our friends at Thomson-Reuters R&D:

The Algebra

The problem is binary classification into a class y in {0,1}, and the modeling assumption is that inputs consist of pairs of vectors x = (x1,x2) where x1 in X1 and x2 in X2, and where the x1 and x2 are conditionally independent given the class y:

p(x1,x2|y) = p(x1|y) * p(x2|y)

Note that we don’t have to make the co-training assumption that both x1 and x2 are sufficient to build a classifier in and of themselves. To prevent divides-by-zero, they also assume p(x1|x2) > 0 and p(x1|y) > 0, and that p(x1|y=0) != p(x1|y=1). With this, they derive a formula for p(y|x) involving only p(x1|y) and p(x1|x2).

Pr(y=0|x1,x2)
  = p(x1|y=0) / p(x1|x2)
  * [ p(x1|y=1) - p(x1|x2) ] / [ p(x1|y=1) - p(x1|y=0) ]

What’s neat about their analysis is that you can train p(x1|x2) without supervision about p(y|x1,x2). If you can get a good handle on p(x1|y) in some way, you can bring in all that information about x2 through p(x1|x2). As the authors note, their approach is in many ways similar to Ando and Zhang’s 2007 two view model.

Special Case Examples

They run two examples, both of which are very real problems for Thomson-Reuters: one on physician record linkage and one on discovering paraphrases for merger and acquisition sentences. In both experiments they make a further assumption that x1 is a binary variable with 100% recall for y=1, so that: Pr(y=1, x1=0) = 0, which lets them loosen the restriction on independence to cases where y=0, namely

p(x1,x2|y=0) = p(x1|y=0) * p(x2|y=0)

Physician Record Linkage

For the record-linkage case, they are classifying two physician records as to whether they are the same physician (y=1) or not. The 100% recall feature x1 is defined to be “have the same year of graduation”. That is, they’re assuming that if two records are identical, they have the same year of graduation. x2 is then the remaining information, and they estimate p(x1|x2) using logistic regression. They estimate p(x1|y=0) with a simple frequency estimate by taking random samples of pairs of physicians (some of these might actually match, but it’s sparse).

They show that their approach does a pretty good job compared to training a logistic regression model on a few hundred supervised instances (though not overwhelmingly better). The improvements over the supervised case all come in recall, which is not surprising. It’d be nice to see the whole precision-recall curves for the problem, which is typical for record linkage apps (see, e.g., one of the first examples in Gelman et al.’s Bayesian Data Analysis, which covers calibrating record linkage, a closely related problem).

They did a nice job of dealing with missing data, too, but that’s really a separate issue to some extent.

Paraphrasing Merger and Acquisition Sentences

This is a neat bootstrap (in the colloquial usage, not the statistical one) approach to finding sentences about mergers and acquisitions between pairs of companies. It’s similar to Agitchein and Gravano’s Snowball system. They used a named entity detector to tag the data, and then restricted attention to sentences containing pairs of companies. They then created high-precision patterns like “ORG bought ORG” and “offer for ORG” to create seeds (it’s not my fault that bootstrapping employs such a horribly mixed metaphor). They use the seeds to find relation sentences. They then go and find other sentences with the same pair of organizations within a time frame around the seed sentence of a month or two and consider those positive. They consider all other instances of one of the organizations and another organization negative (another noisy rule like graduation year).

The 100% recall feature x1 is that the two pair of organizations occur in a seed sentence (this is also noisy). They use a bag of unigram, bigram and trigram word features for x2. For some reason they use an SVM for p(x1=1|x2). They don’t actually mention how they estimate p(x1|y=0).

They give a few operating points, one of which has 99% recall at 79% precision, which is nice.

Conclusion

I’d have given this paper the thumbs-up, but the explanation of the paraphrase task could’ve been more clearly related back to the technique of the paper. Overall, I’d like to see more of this kind of innovative work on both technique and application in the main ACL.

Chain Conditional Random Fields: Implementation and Design Issues

June 18, 2009 by lingpipe

I’m starting to implement conditional random fields (CRF) for LingPipe. (See, I really don’t hate CRFs.) If you don’t know CRFs, here are three good tutorials:

Implementing CRFs brings up several design issues.

N-gram Order

Like HMMs, chain CRFs extract features over the sequence of tags. The issue is whether the features can span more than a single pair of tags (a bigram). For instance, can I use the previous label and label two back to help model the current label?

Although it’d be nice to implement everything very generally for arbitrary n-gram contexts, it’s actually a pain in practice in that everything gets deeper loops and allocating all the arrays is hard to do generically in Java with arrays. And then extracting answers with confidence and doing n-best also gets more complex.

I’m also not sure that any order beyond first is going to be tractable enough for training.

Feature Extraction

I plan to do the usual factoring of features into node features (depend only on inputs) and edge features (depend on input and also previous category). Even though edge features are more general, it’s much more efficient to implement the usually much more prevalent node features independently (it all goes through the forward-backward algorithm). Nodes and edges have the following interfaces:

interface ChainCrf.Node {
    int position();
    String[] words();
}

and

interface ChainCrf.Edge  {
    String previousTag();
    int position();
    String[] words();
}

Then we have two feature extractors required to estimate a CRF from a corpus of labeled tag data:

  • FeatureExtractor<ChainCrf.Edge>, and
  • FeatureExtractor<ChainCrf.Node>.

One or More Coefficient Vectors?

Given the feature extraction, I have to follow our logistic regression model in having different coefficient vectors per category rather than a single coefficient vector that essentially concatenates all the individual vectors. Having different vectors makes feature extraction much more economical. In the “standard” CRF presentation, a separate feature extraction is done for each possible tag, both for nodes and edges. That does allow some richer features (e.g. using a feature for labels that don’t change entity type, such as B_PERSON to I_PERSON), and also allows features to be structural zeroes for some tags.

Note that the output tag is implicit in the proposed feature extraction, which means it’s always included to distinguish both node and edge features across output tags.

K-1 Coefficient Vectors vs. K coefficient Vectors

That leaves the issue of whether to use (K-1) coefficient vectors (where K is the number of tags), as you see in typical statistics presentations of logistic regression. The first advantage is one less vector to store, and even more importantly, one fewer vector to update. The second is a system with an identified answer even for Laplace priors or maximum likelihood.

The alternative is to use K coefficient vectors, as you see in the typical “machine learning” presentation of CRFs or logistic regression. I’m planning to go with the stats approach, as I did for our logistic regression.

Order of Coefficient Vectors and Sparsity

There’s an issue of how to store the coefficient matrix vector, either tag dominant or feature dominant. That is, do we have a map from features to categories to numerical values, or from categories to features to numerical values. The latter is the natural way to do it if you look at the typical definitions. The former way would be useful if many features are non-zero for half the categories or fewer. I don’t think that’ll happen much given the way the regressions work, so I’m planning to go with (features to categories to numerical values).

I pretty much have to use sparse vectors for the edge and node features. But what about the model’s regression coefficients? I’ll be taking the coefficient vector’s dot-product with the sparse edge and node features. That’s much more efficient for dense coefficient vectors. I could save some space for really big models, but I think the time’s going to be more precious here. Luckily, this implementation detail will be easy to change under the hood if I change my mind later.

Arithmetic Precision

We won’t be able to estimate coefficients with more accuracy than is representable as a float. So for storing the models, floats make most sense. The problem is that at run time, we need to do double-precision arithmetic for decoding (in all three cases, Viterbi first best, Viterbi/A* n-best, or full forward-backward marginal decoding). It’s slower to cast floats to doubles every time you do a multiply, so I’ll probably keep the coefficients dense and doubles as long as they’ll fit. If they won’t fit, I’ll probably switch the implementation to sparse floats. It’ll all be transparent and backward compatible at the API level if I make the switch after the first release.

Start/End Tags

For HMMs, I added a BEGIN-OF-SENTENCE tag and an END-OF-SENTENCE tag. The start tag is not generated, but used as context to generate the first word in the sentence. The end-of-sentence tag is generated, because it’s required to get the normalizations right.

For CRFs, it seems to make most sense to include a BEGIN-OF-SENTENCE tag that doesn’t have an estimated coefficient, but is just used in defining the edge features for the first token in a sentence. The end marker is actually derivable from the node (word-only) features. So is the begin marker, but it seems simpler this way than to have a null value passed in to the edge feature extractor.

Stochastic Gradient Descent

I’m not even considering other implementations. I’ll parameterize it the same way as for logistic regression. This means I’ll allow coefficient priors (Laplace, Gaussian, Cauchy, with the option of non-zero means and different variances/scales per dimension). I’ll also allow an annealing schedule.

SGD Code Re-use

A serious question is whether I should try to re-use the SGD code in logistic regression. The basic algorithm’s the same, only now I have to do forward-backward in the inner loop to compute the gradient of the log likelihood function. I’m thinking I won’t share the code, but that’s mainly because the refactoring to something with the right callbacks in the inner loop looks daunting.

Forward/Backward Decoding and N-best

I need to do forward/backward for training, not just confidence-based extraction. What I’d like to do is provide a common format for the lattice output I can share between HMM taggers and CRFs.

I’ve already defined a new package com.aliasi.tag that has three interfaces for tagging: first-best, n-best (joint or conditional probabilities), and marginal (per tag) taggers matching those for HMMs.

Relation to Chunking

I’d also like to provide a general tagger to chunker implementation, which means more refactoring, but will immediately let me use CRFs to do named-entity extraction with n-best and confidence-based (per entity mention) extractions. I’d very much like to refactor the CharLmHmm chunker to rely on a generic tagger. This should be doable.

Threading and/or File Compilation

Unfortunately, storing in memory all of the features for an input task the size of the Penn Treebank POS data (about 50 tags and 1M words) is prohibitive. There are 50M edge feature extractions and 1M node feature extractions. Even if the edge feature extractions are tiny (say 40 bytes), that’s a lot of memory; if we add in previous tag plus word features, it won’t fit in memory even on a big workstation. If there are hundreds of word features per node, which is conservative if we use 5-word contexts, n-grams and subword features, and represent them tightly, we’re still looking at several KB/vector and thus several GB of memory.

So I’m inclined to do the feature extraction online. This is making me queasy, though, because it’ll be very expensive if I do it naively in a single thread (even measured relative to forward/backward, perhaps dominating training time).

Alternatively, I can run feature extraction in one thread and gradient computation in another. Then we probably won’t notice the feature extraction if there are enough CPUs free.

The other alternative, which matches what may CRF packages do, is to require all of the feature vectors for the training corpus to be serialized. Then they can be read back in much more efficiently than creating them from scratch. It won’t take that much disk space, but the requirement will be nontrivial (perhaps as much as 20 or 30GB with Treebank POS size corpora and subword features).

Generic Inputs

There’s really nothing word-specific about CRFs. In fact, they could take arbitrarily structured inputs instead of word tokens. This would require generifying the CRF interface to the type of objects being tagged, and similarly generifying feature extraction and the lattice representations. I could probably get away then with having the HMMs just instantiate the generic to String.

I may just go ahead and do this once it’s implemented for strings. It’ll probably be easier to refactor a string-based implementation than to start with something more generic. (Thanks again to those Java engineers for making things so easy to modify with backward compatibility.)

Overflow / Underflow

Overflow happens when a linear predictor gets too big and exponentiating it overflows. We handle this by computing softmax the right way.

It’s easy to underflow during forward-backward, which is on a linear scale. Sutton and McCallum outline two approaches to handling this. One is what we did for HMM forward/backward, which is to normalize each forward vector (the forward values for all tags at a particular token) to sum to 1.0. In this case, the usual computation for the normalizing constant Z just drops out.

The other approach Sutton and McCallum mention, which I’d never heard of before reading their paper, is to keep the forward/backward values on the log scale and then perform computations in the exponentiated domain and rescale. This can actually be done with reasonable precision (always a problem with logs/exps). As Sutton and McCallum note, this can be expensive because of the increased number of calls to log() and exp().

Unsupported Features

I have to admit I didn’t understand the discussion of “unsupported” features in Sutton and McCallum, which are defined to be features that don’t appear in the training data. This may be because I’ve already factored my features so that there are no output-specific features, so there may not be any such “unsupported” features. I’m definitely restricting to only features that appear in the training data, because I can’t train anything using stochastic gradient.

Boolean Feature Vectors

Many NLP applications use boolean feature vectors, which only allow values 0 or 1. By restricting to boolean feature vectors, many computations can be made more efficient. I think I may go partway there by including dense and sparse boolean feature vector representations. The sparse ones would require a very high degree of sparsity to be worth it in terms of space, but they may dominate in things like cross-products because they’d use very few comparisons.

Dasgupta and Hsu (2008) Hierarchical Sampling for Active Learning

June 17, 2009 by lingpipe

Once I got onto the ICML discussion site, I followed some of the papers that had discussions. I’d been meaning to circle back around to this paper, as it addressed several of the concerns I had about active learning:

The general topic’s still how to efficiently (in human effort) collect useful training data (e.g. for classifiers).

Update at 3:02 PM: As soon as I wrote this, I read a post on active learning in John Langford’s blog, which points to the following tutorial page, which among other things, has a great bibliography:

What’s worried me since the first time I heard about active learning is that the data’s likely to wind up biased, because it’s not being chosen randomly. For example, by choosing examples at the margin (e.g. always choosing the least confident annotation), you can wind up with a highly non-representative data set. Dasgupta and Hsu do some analysis along with providing a nice literature survey.

The basic idea of this paper is that you can cluster the underlying data, then sample from the clusters. If the underlying data clusters well w.r.t. the classification problem, this can be very helpful. Needless to say, there’s a whole lot of math explaining exactly what it means for data to cluster well w.r.t. the classification problem.

What struck me skimming through this paper is that K-means++ initialization for K-means clustering had the same motivation (it’s now implemented in LingPipe, and we’ve found it to work well in practice). K-means++ works by sampling a new centroid with a probablity proportional to its distance to the closest centroid. Outliers are by definition a long way away from a centroid, and thus have high individual selection probabilities. The idea is that by sampling, more representative items, by sheer numbers, can be chosen, even if they’re individually unlikely.

So what about choosing items for active learning proportional to their expected probability of being erroneously classified. I’m thinking this’d have the same kind of effect as K-means++.

Has anyone done this already?

PS: I also learned from Foster’s active learning paper that my academic hero and former colleague, Herb Simon, had the first citation in this field back in the early 1970s. As usual, he was decades ahead of his time.

Update: John Langford just posted a blog entry on active search, mentioning this:

Raykar et al. (2009) Supervised Learning from Multiple Experts: Whom to Trust when Everyone Lies a Bit

June 16, 2009 by lingpipe

ICML paper:

The scientific zeitgest says to assess inter-annotator agreement, infer gold standards, and use them to train classifiers.

Raykar et al. use EM for a binomial model of annotator sensitivity and specificity (like Dawid and Skene’s original multinomial approach from the 1970s paper and the Snow et al. EMNLP paper). My experiments showed full Bayesian models slightly outperform EM, which slightly outperforms naive voiting (the effects are stronger with fewer annotators).

The obvious thing to do is to take the output of the gold standard inference and use that to train a classifier. With EM, you can use the MAP estimate of category likelihoods (a fuzzy gold standard); with Bayesian models, you can sample from the posterior, which provides more dispersion. Smyth et al.’s 1995 NIPS paper showed EM-style training was effective for simulations.

I was just in San Francisco presenting this work to the Mechanical Turk Meetup, and Jenny Finkel opined that fuzzy training wouldn’t work well in practice. Even taking the discussion offline, I’m still not sure why she thinks that [update: see her comments below]. In some ways, if we use the fuzzy truth as the gold standard, then using it to train should perform better than quantizing the gold standard to 0/1. There’s not a problem with convexity; we just impute a big data set with Gibbs sampling and train on that. We could even train up an SVM or naive Bayes system that way.

The interesting twist in the Raykar et al. paper is to jointly estimate a logistic regression classifier along with the gold standard. That is, throw the regression coefficients into the model and estimate them along with everything else. That’s the same linkage as I suggested above. But Raykar et al. go further — they let the trained model vote on the gold standard just like another annotator.

Even though the annotation model corrects for individual annotator bias (or in this case, the logistic regression classifier’s bias as estimated), each annotator still affects the overall model through its bias-adjusted vote (if it didn’t, you couldn’t get off the ground at all). If you evaluate the classifier on a “gold standard” which was voted upon by a committee including the classifier itself, the classifier should perform better because it’s getting a vote on the truth!

The right question is whether Raykar et al.’s jointly estimated classifiers are “better” in some sense than ones trained on the imputed gold standard. For that, I’d think we’d need some kind of held-out eval, but that begs the question on inferring the gold standard. The gold standards behind Snow et al.’s work weren’t that pure after all (I have some commentary on discrepancies in the paper cited below).

I have considered using the trained classifier as another annotator when doing active learning of the kind proposed in Sheng et al.’s 2008 KDD paper on getting another label for an existing item vs. annotating a new item. In fact, there’s no reason in principle why you can’t have more than one classifier being trained along with annotator sensitivities and specificities.

Another nice idea in the Raykar et al. paper is the use of simulation from a known gold standard to create a fuzzy gold standard. That’s still questionable, in that it’s generating fake data that are known to follow the model. But everyone should do this in every way possible for all parts of their models, so you can bet I’ll be saving this one for my bag of tricks.

I’m a little unclear on why the numbers in the lefthand plots in figures 1 and 2 don’t have the same AUC value for the proposed algorithm. Figure 2 actually does evaluate the gold-standard estimation followed by classifier estimation. If I’m reading that figure right, then training on the imputed gold standard didn’t do measurably better than the majority voted baseline.

[Update with comment: The right hand side plot in figure 2 is of the inferred gold standard versus the "golden gold standard". It's possible to plot this because the inferred gold standard is actually a point probability estimate of the item being in category 1.]

If we’re lucky, Raykar et al. will share their data. [Update 2: no luck -- the data belongs to Siemens.]

P.S. All of these models assume the annotators don’t actually lie. Specifically, in order for the models to be identifiable, we need to assume the annotators are not adversarial (that is, they don’t know the right answer and intentionally lie, and thus perform worse than chance). There was, to reinforce the zeitgeist, also a paper about mixing in adversarial coders at ICML, Dekel and Shamir’s Good learners for evil teachers.

Greg Little’s Turkit: Multi-Pass Annotation Refinement

June 15, 2009 by lingpipe

I had the luxury of dashing out to California last week for the San Francisco Turk Meetup hosted by Dolores Labs (see the O’Reilly radar blog summary). The TurKit Application presented by Greg Little really caught my eye. You, too, can read the pre-publication paper:

TurKit’s an open-source package that lets you submit a single task to multiple Turkers in a novel way. Rather than having them all do the task independently, the Turkers work on the tasks sequentially. That is the first Turker does it from scratch, then the second Turker works to improve the first Turker’s answer. The example in the talk involved transcribing a very messily written note. It’s amazing to see the progression from the first Turker to the eighth Turker.

TurKit has a beautiful programming paradigm which lets you write jobs imperatively. The base language is JavaScript (with JSON-based serialization). The beautiful part is that TurKit lets the programmer write loops, the inner operations of which call out to Turk annotation jobs. This is completely natural, and completely unsupported by the Turk API, which is your usual REST-type stateless application. It reminds me of Python’s yield-based iterator implementations.