Archive for the ‘Reviews’ Category

Chang et al. (2009) Reading Tea Leaves: How Humans Interpret Topic Models

November 18, 2009

This forthcoming NIPS paper outlines a neat little idea for evaluating clustering output:

The question they pose is how to evaluate clustering output, specifically topic-word and document-topic coherence, for human interpretability.

Bag of Words

Everything’s in the bag of words setting, so a topic is modeled as a discrete distribution over words.

Multiple Topics per Document

They only consider models that allow multiple topics per document. Specifically, the clusterers all model a document as a discrete distribution over topics. The clusterers considered share strong family resemblances: probabilistic latent semantic indexing (pLSI) and two forms of latent Dirichlet allocation (LDA), the usual one and one in which per-document topic distributions are drawn from a logistic normal prior modeling topic correlations rather than from a Dirichlet.

Intrusion Tasks

To judge the coherence of a topic, they take the top six words from a topic, delete one of the words and insert a top word from a different topic. They then measure whether subjects can detect the “intruder”.

To judge the coherence of the topics assigned to a document, they do the same thing for document distributions: they take the top topics for a document, delete one and insert a topic not assigned with high probability to the document.
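
To make the setup concrete, here's a minimal sketch (mine, not the authors' code) of how a single word-intrusion item could be assembled from two topic-word distributions; the Map-based topic representation and the low-probability cutoff are assumptions for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class WordIntrusion {

    // Top-n words of a topic by probability.
    static List<String> topWords(Map<String, Double> topic, int n) {
        return topic.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // One intrusion item: the top words of the target topic with one of them
    // replaced by a high-probability word from another topic that has low
    // probability in the target topic (1e-4 is an arbitrary cutoff).
    static List<String> intrusionItem(Map<String, Double> target,
                                      Map<String, Double> other,
                                      int numWords,
                                      Random rng) {
        List<String> words = new ArrayList<>(topWords(target, numWords));
        String intruder = topWords(other, numWords).stream()
            .filter(w -> target.getOrDefault(w, 0.0) < 1e-4)
            .findFirst()
            .orElseThrow(() -> new IllegalArgumentException("no intruder candidate"));
        words.set(rng.nextInt(words.size()), intruder);
        Collections.shuffle(words, rng); // don't give the intruder's position away
        return words;
    }
}
```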

Analysis

They considered two small corpora of roughly 10K articles, 10K word types, and 1M tokens, one drawn from Wikipedia pages and one from NY Times articles. These documents are relatively long compared to tweets, customer support requests, or MEDLINE abstracts, but shorter than full-text research articles or corporate 10-K and 10-Q filings.

They only consider models with 50, 100, and 150 topics, and restrict parameterizations to add-1 smoothing (aka the Laplace form of the Dirichlet prior) for per-document topic distributions. I didn’t see any mention of what the prior was for the per-topic word distributions. I’ve found both of these parameters to have a huge effect on LDA output, with larger prior counts in both cases leading to more diffuse topic assignments to documents.

They only consider point estimates of the posteriors, which they compute using EM or variational inference. This is not surprising given the non-identifiability of topics in the Gibbs sampler.

Mechanical Turker Voting

They used 8 Mechanical Turkers per task (aka HIT) of 10 judgments (wildly overpaying at US$0.07 to US$0.15 per HIT).

(Pseudo Expected) Predictive Log Likelihood

They do the usual sample cross-entropy rate evaluations (aka [pseudo expected] predictive log likelihoods). Reporting these to four decimal places is a mistake, because the different estimation methods for the various models have more variance than the differences shown here. Also, there’s a huge effect from the priors. For both points, check out Asuncion et al.’s analysis of LDA estimation, which the first author, Jonathan Chang, blogged about.

Model Precision

Their evaluation for precision is the percentage of subjects who pick out the intruder. It’d be interesting to see the effect of adjusting for annotator bias and accuracy, which would be easy to do with any of our annotation models; for instance, it might reduce the variance in their figure 3.
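
In symbols (my notation, just restating that definition), model precision for topic k under model m is the fraction of the S subjects whose chosen intruder i matches the true intruder \omega:

\mbox{MP}_k^m = \frac{1}{S} \sum_{s=1}^S 1(i_{k,s}^m = \omega_k^m)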

There’s variation among the three models across numbers of topics and across the two corpora. I’m just not sure how far to trust their model precision estimates.

Their Take Home Message

The authors drive home the point that traditional measures such as expected predictive log likelihood are negatively correlated with their notion of human evaluated precision. As I said, I’m skeptical about the robustness of this inference given the variation in estimation techniques and the strong effect of priors.

The authors go so far as to suggest using humans in the model selection loop. Or developing an alternative estimation technique. If they’d been statisticians rather than computer scientists, my guess is that they’d be calling for better models, not a new model selection or estimation technique!

The Real Take Home Message

I think the real contribution here is the evaluation methodology. If you’re using clustering for exploratory data analysis, this might be a way to vet clusters for further consideration.

What They Could’ve Talked About

Although they mention Griffiths and Steyvers’ work on using LDA for traditional psychometrics, I think a more interesting result is Griffiths and Steyvers’ use of KL-divergence to measure the stability of topics across Gibbs samples (which I describe and recreate in the LingPipe clustering tutorial). Using KL divergence to compare different clusters may give you a Bayesian method to automatically assess Chang et al.’s notion of precision.

Phrase Detectives Linguistic Annotation Game

November 10, 2009

Massimo Poesio just sent me a pointer to the following awesome web application:

Annotation Game

Annotation games (aka “games with a purpose”) were popularized by von Ahn’s ESP game, though I first heard about them through David Stork’s Open Mind project.

Unlike Mechanical Turk, the games approach tries to make the task somewhat fun and competitive. It seems like making the users “detectives” is a thin veneer of “fun”, but they maintain the metaphor beautifully throughout the whole site, so it works.

As with many games, Phrase Detectives pays out in leader board bragging rights and cash prizes rather than directly for work completed as on Mechanical Turk.

Phrase Detective Tasks

The really brilliant part is how they break the coref annotation task into four easy-to-answer questions about a single highlighted phrase.

  1. Not Mentioned Before: yes/no question as to whether the referent of the highlighted phrase was previously mentioned in the text
  2. Mentioned Before: highlight previous mention of a given phrase
  3. Non-Referring: pick out non-referential nouns (like the “there” in “there is trouble brewing”)
  4. Property of Another Phrase: pick out other phrases that describe someone already mentioned (e.g. attributives or appositives)

The site also has nice clean, easy-to-follow graphics, and appears to still have users after two years.

Adjudication Phase

OK, they call it “Detectives Conference”, but the idea is you get to vote yes/no as to whether someone else’s answer is right. This is a good idea widely used on Mechanical Turk because it’s easier to check someone’s work than to create it from scratch.

Read All About It

It was developed by academics, so there are at least as many papers as contributors:

Coreference Annotation

There are “expert” annotated within-doc coref corpora for the MUC 7 and ACE 2005 evaluations (available from LDC, who charge an arm and a leg for this stuff, especially for commercial rights).

LingPipe does within-document coreference and we’ve worked on cross-document coreference.

More Like This

As soon as you find one of these things, you find more. Check out:

I’d love to hear about more of these if anyone knows any.

Mihalcea (2007) Using Wikipedia for Automatic Word Sense Disambiguation

October 22, 2009

In response to a recent thread on the LingPipe mailing list about using Wikipedia for a word-sense disambiguation corpus, I’d like to point out Rada Mihalcea’s 2007 NAACL paper:

The idea’s very simple. A Wikipedia page represents a single sense of an ambiguous phrase. Both the senses and text for training can be extracted from Wikipedia’s link structure.

Rada maintains the SenseEval web pages and serves on the advisory committee for the bakeoffs, which have pretty much defined the field recently.

Leaning on Wikitext

The wikitext markup used for Wikipedia pages contains explicit disambiguation and mention-level information. Consider these examples from Rada’s paper:

  • In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty three, and entered private practice in Boston.
  • It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].

The double brackets “[[ ]]” enclose links with the item on the left being the disambiguated reference, e.g. “bar (law)” or “bar (music)” and the item on the right the ambiguous word, e.g. “bar”. When viewing the page, the text to the right of the vertical bar shows up.
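
Here's a minimal sketch of pulling those sense-labeled mentions out of raw wikitext with a regular expression; the pattern and example are mine, not Rada's extraction code, and real wikitext has corner cases (nested links, templates) this ignores.

```java
import java.util.regex.*;

public class WikiSenseLinks {

    // Matches piped links of the form [[target|surface]]:
    // group 1 = the disambiguated link target, e.g. "bar (law)";
    // group 2 = the ambiguous surface form shown to readers, e.g. "bar".
    static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[([^\\]|]+)\\|([^\\]]+)\\]\\]");

    public static void main(String[] args) {
        String wikitext = "In 1834, Sumner was admitted to the"
            + " [[bar (law)|bar]] at the age of twenty three.";
        Matcher m = PIPED_LINK.matcher(wikitext);
        while (m.find())
            System.out.println(m.group(2) + " -> " + m.group(1));
    }
}
```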

The scale’s impressive — there were 1108 examples of “bar” used as link text with 40 different categories, and this was a few years ago.

Rada extracted paragraphs around the mentions for training data. These are clearly marked in the Wikitext, so nothing fancy needs to be done for paragraph extraction (beyond the pain of Wikitext parsing itself).

What About Disambiguation Pages?

Wikipedia is full of disambiguation pages, which Wikipedia itself describes as:

Disambiguation in Wikipedia is the process of resolving the conflicts that occur when articles about two or more different topics could have the same “natural” page title. This category contains disambiguation pages: non-article pages containing links to other Wikipedia articles and disambiguation pages.

For instance, the Wikipedia page for “bar” lists an astounding number of ambiguous meanings for the word. There are 11 popular usages (e.g. bar serving alcohol, bar of gold), 5 math and science usages (e.g. unit of pressure, bar chart), 4 uses in the law, and many other usages, some of which appear under other categories.

I’d think the pages linked to would be useful for training. But Rada didn’t use them, listing reasons:

  1. mismatch between disambiguation pages and usage, causing precision problems from links in a disambiguation page that are never referenced with the word and recall problems from some usages not showing up on the disambiguation page,
  2. inconsistencies in the naming of disambiguation pages

The first issue is still problematic, though the recall side of it seems to be improving. The second issue is also still problematic. Although the category helps find such pages, Wikipedia’s formatting is still more like natural language than a database.

Evaluation

As you might expect, it worked pretty well. Rada evaluated a naive Bayes classifier with features including contextual words with part-of-speech tags.
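
For a sense of what's involved, here's a toy add-one-smoothed naive Bayes sense classifier over bag-of-words contexts. It's a from-scratch sketch under those assumptions, not LingPipe's implementation and not Rada's feature set (no part-of-speech features here).

```java
import java.util.*;

public class NaiveBayesWsd {

    final Map<String, Integer> senseCounts = new HashMap<>();
    final Map<String, Map<String, Integer>> wordCountsBySense = new HashMap<>();
    final Set<String> vocabulary = new HashSet<>();
    int totalExamples = 0;

    // Train on one sense-labeled context, e.g. ("bar (law)", words around the mention).
    void train(String sense, List<String> contextWords) {
        ++totalExamples;
        senseCounts.merge(sense, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCountsBySense.computeIfAbsent(sense, s -> new HashMap<>());
        for (String w : contextWords) {
            counts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Return the sense maximizing log p(sense) + sum_w log p(w | sense),
    // with add-one smoothing over the observed vocabulary.
    String classify(List<String> contextWords) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String sense : senseCounts.keySet()) {
            Map<String, Integer> counts = wordCountsBySense.get(sense);
            int senseTotal = counts.values().stream().mapToInt(Integer::intValue).sum();
            double logProb = Math.log(senseCounts.get(sense) / (double) totalExamples);
            for (String w : contextWords) {
                int c = counts.getOrDefault(w, 0);
                logProb += Math.log((c + 1.0) / (senseTotal + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = sense;
            }
        }
        return best;
    }
}
```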

Coding it in LingPipe

This is something you could easily replicate in LingPipe. You could use one of the naive Bayes classifiers. Or any of the other classifiers we consider in our word-sense disambiguation tutorial, including K-nearest neighbors and character or token-based language model classifiers.

If you need part-of-speech tags, check out our part of speech tagging tutorial.

With naive Bayes and easily extracted unsupervised data, you could also use EM semi-supervised training, as described in our EM tutorial.

You could also use a discriminative linear classifier, as described in our logistic regression tutorial.

The hard part’s all the data munging for Wikipedia. If you build a parser in Java and want to share, I’d be happy to link to it from here or host it in our development sandbox.

Whitehill, Ruvolo, Wu, Bergsma, and Movellan (2009) Optimal Integration of Labels from Labelers of Unknown Expertise

October 5, 2009

[Update: 4:51 PM, 5 October 2009 after corrections from Jacob Whitehill (thanks!); they did use a prevalence estimate and did allow mixed panel fits, as the post now reflects.]

Thanks to Panos for pointing out this upcoming NIPS poster, which makes a nice addition to our running data annotation thread:

The authors’ knowledge of the epidemiology literature was limited when they stated:

To our knowledge BLOG [bilinear log odds generative] is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised manner.

Just check out the literature survey portion of my technical report for a range of different approaches to this problem, some of which have even been applied to binary image data classification such as Handelman’s X-ray data for dental caries diagnoses (see the tech report for the Handelman data).

Model Overview

In this case, the authors use a logistic scale (that’s the log-odds part) consisting of the product of an annotator accuracy term and an item (inverse) difficulty term (that’s the “bilinear” part). Although the authors mention item-response and Rasch models (see below), they do not exploit their full expressive power.

In particular, the authors model annotator accuracy, but do not break out sensitivity and specificity separately, and thus do not model annotator bias (a tendency to overlabel cases in one category or another). I and others have found huge degrees of annotator bias for real data (e.g. Handelman’s dentistry data and the Snow et al. NLP data).

The authors’ model also clips difficulty at a random coin flip, whereas in reality, some positive items may be so hard to find as to have less than a 50% chance of being labeled correctly.

They impose unit normal priors over annotator accuracy and normal priors over the log of item difficulty (ensuring item difficulties are non-negative). They fit the models with EM using conjugate gradient to solve the logistic regression in the M-step. Epidemiologists have fitted empirical Bayes priors by using other expert opinion, and I went further and actually fitted the full hierarchical Bayesian model using Gibbs sampling (in BUGS; the code is in the sandbox project).

Point estimates (like the EM-derived maximum a posteriori estimates the authors use) always underestimate posterior uncertainty compared to full Bayesian inference. Posterior uncertainty in item difficulty is especially difficult to estimate with only 10 annotators. In fact, we found the Bayesian posteriors for item difficulty to be so diffuse with only 10 annotators that using the full posterior effectively eliminated the item difficulty effect.

Synthetic Data

They run synthetic data and show fits. Not surprisingly given the results I’ve reported elsewhere about fitting item difficulty, they only report fits for difficulty with 50 annotators! (I found reasonable fits for a linear (non-multiplicative) model with 20 annotators, though recall the reviewers for my rejected ACL paper thought even 10 annotators was unrealistic for real data!)

They also simulate very low numbers of noisy annotators compared to the actual numbers found on Mechanical Turk (even with pre-testing, we had 1/10 noisy annotations and without testing, Snow et al. found 1/2 noisy labels). I was surprised they had such a hard time adjusting for the noisy labelers. I think this may be due to trying to model item difficulty. Without item difficulty, as in the Dawid and Skene-type models, there’s no problem filtering out bad annotators.

Pinning Values

The authors note that you can fix some values to known gold-standard values and thus improve accuracy. I noted this in my papers and in my comment on Dolores Labs’ CrowdControl, which only uses gold-standard values to evaluate labelers.

Real Data

They show some nice evaluations for image data consisting of synthetic images and classification of Duchenne smiles. As with other data sets of this kind (such as my results and Raykar et al.’s results), they show decreasing advantage of the model-based methods over pure voting as the number of annotators approaches 10. This is as we’d expect — the Bayesian priors and proper weighting are most important for sparse data.

Mathematical Presentation

The authors suppose J items (images in this case) and I annotators. The correct label for item j is c_j and the label provided by annotator i is x_{i,j}. They consider fitting for the case where not every annotator labels every item.

The authors model correctness of an annotation by:

\mbox{Pr}(x_{i,j} = c_j) = \mbox{logit}^{-1}(\alpha_i\beta_j)

where \alpha_i is a measure of an annotator’s ability and \beta_j > 0 a measure of inverse item difficulty. The authors observe some limits to help understand the parameterization. First, as inverse item difficulties approach 0, items become more difficult to label, and accuracy approaches chance (recall \mbox{logit}^{-1}(0) = 0.5):

\lim_{\beta_j \rightarrow 0} \mbox{Pr}(x_{i,j} = c_j) = 0.5.

As inverse item difficulties approach infinity, the item becomes easier to label:

\lim_{\beta_j \rightarrow \infty} \mbox{Pr}(x_{i,j} = c_j) = 1.0.

As annotator accuracy approaches infinity, accuracy approaches perfection:

\lim_{\alpha_i \rightarrow \infty} \mbox{Pr}(x_{i,j} = c_j) = 1.0.

As annotator accuracy approaches zero, accuracy approaches chance:

\lim_{\alpha_i \rightarrow 0} \mbox{Pr}(x_{i,j} = c_j) = 0.5.

If accuracy is less than zero, the annotator is adversarial. We didn’t find any adversarial annotators in any of our Mechanical Turk data, but there were lots of noisy ones, so some of the models I fit just constrained prior accuracies to be non-adversarial. Others have fixed priors to be non-adversarial. In some settings, I found initialization to non-adversarial accuracies in EM or Gibbs sampling led to the desired solution. Of course, when lots of annotators are adversarial and priors are uniform, the solution space is bimodal with a flip of every annotator’s adversarial status and every item’s label.

The authors also model prevalence with a p(c_j) term. If prevalence is 0.5, it drops out, but their Duchenne smile example was unbalanced, so the prevalence term is important.
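
Here's a tiny numeric sketch of the bilinear parameterization itself and the limits above (just the formula, not the authors' EM fitting code); the example values are arbitrary.

```java
public class BilinearLogOdds {

    // Inverse logit (logistic sigmoid).
    static double invLogit(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Probability that annotator i labels item j correctly, given
    // annotator ability alpha_i and inverse item difficulty beta_j > 0.
    static double probCorrect(double alpha, double beta) {
        return invLogit(alpha * beta);
    }

    public static void main(String[] args) {
        System.out.println(probCorrect(2.0, 0.001)); // very hard item: near chance (0.5)
        System.out.println(probCorrect(2.0, 10.0));  // very easy item: near 1.0
        System.out.println(probCorrect(-1.0, 1.0));  // adversarial annotator: below 0.5
    }
}
```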

Comparison with Item-Response and Ideal Point Models

The authors mention item-response theory (IRT), which is also where I got the idea to use these models in the first place (too bad the epidemiologists beat us all to the punch).

A basic IRT-like model for annotation would use the difference \delta_i - \beta_j between annotator accuracy \delta_i and item difficulty \beta_j. By allowing \beta_j to be positive or negative, you can model positive and negative items on the same scale. Or you can fit separate \delta_i and \beta_j for positive and negative items, thus independently modeling sensitivity and specificity.

Discriminativeness can be modeled by a multiplicative factor \alpha_i, producing a predictor \alpha_i \, (\delta_i - \beta_j). In this way, the \delta_i term models a positive/negative bias and the \alpha_i the sharpness of the decision boundary.
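
Putting those pieces together in the same notation as the model above, the IRT-style alternative I have in mind would look like this (my reconstruction, not a formula from the authors' paper):

\mbox{Pr}(x_{i,j} = c_j) = \mbox{logit}^{-1}(\alpha_i\,(\delta_i - \beta_j))

with \delta_i - \beta_j setting the bias and \alpha_i the sharpness of the decision boundary.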

I’m a big fan of the approach in (Uebersax and Grove 1993), which handles ordinal responses as well as discrete ones. Too bad I can’t find an open-access version of the paper.

Tsuruoka, Tsujii, Ananiadou (2009) Stochastic Gradient Descent Training for L1-regularized Log-Linear Models with Cumulative Penalty

September 18, 2009

I’m a big fan of descriptive titles and clearly stated goals, both of which today’s paper has:

SGD with Lazy Update and Clipping

The paper introduces a variant of truncated stochastic gradient, a perceptron-like online algorithm for estimating L1-regularized (what we’d call a Laplace or double-exponential prior in the Bayesian world) log-linear models (what we call logistic regression).

LingPipe employs the variant the authors evaluated as a baseline and dubbed “SGD-L1 (Clipping + Lazy-Update)” (see LingPipe’s logistic regression tutorial), which is the simplest form of truncated stochastic gradient where batch size is one (that is, it’s fully stochastic, updating stats on a per-item basis).

The Problem: Non-Sparse Solutions

The beauty of L1 regularization is that it induces sparse solutions (many coefficients zero). The problem with the stochastic approximation to the gradient is that it can leave many features non-zero solely because of their late position in the training sequence (their figure 1 is a nice illustration).

Tsuruoka et al. evaluate this effect on three CRF sequence-tagging problems (a form of logistic regression where you use the undirected graphical model form of forward-backward to compute the gradient and log loss), and propose a novel solution.

The Solution: Cumulative Penalty

Because the gradient of the L1 regularizer only depends on the sign of a feature and not its scale, the authors suggest buffering regularization penalties and then applying them right after a feature is estimated. Furthermore, if the regularization weight is greater than the feature (that is, applying it drives the feature past zero, so it’s clipped), the amount of clipping is buffered for future use.

LingPipe uses an almost identical algorithm, except that (1) the regularization gradient is applied before estimation and weight update, and at least once afterwards, at the end of an epoch at the latest; and (2) if there’s clipping, the buffered gradient is reduced to zero rather than buffered.

Another slight difference in their algorithm is that they randomly permute the items before each epoch.
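
For concreteness, here's a minimal sketch of the clipping-plus-lazy-update baseline for L1-regularized logistic regression over sparse binary features (roughly the LingPipe-style variant described above, not the paper's cumulative-penalty algorithm); the fixed learning rate and the ordering of the steps are simplifying assumptions.

```java
/** Sketch of L1-regularized SGD with clipping and lazy (just-in-time)
 *  regularization for sparse binary features.  This is the simple
 *  truncation baseline, not the paper's cumulative-penalty variant. */
public class LazyClippedSgd {

    final double[] weights;
    final long[] lastUpdate;   // step at which each weight was last regularized
    final double learningRate; // fixed rate for simplicity; annealing omitted
    final double l1Penalty;    // regularization strength per step

    long step = 0;

    LazyClippedSgd(int numFeatures, double learningRate, double l1Penalty) {
        this.weights = new double[numFeatures];
        this.lastUpdate = new long[numFeatures];
        this.learningRate = learningRate;
        this.l1Penalty = l1Penalty;
    }

    static double invLogit(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** One stochastic step for a binary-feature item with label y in {0,1}. */
    void update(int[] activeFeatures, int y) {
        ++step;
        // Lazily apply the L1 penalty owed for the steps this feature sat out.
        for (int f : activeFeatures) {
            long skipped = step - lastUpdate[f];
            regularize(f, skipped * learningRate * l1Penalty);
            lastUpdate[f] = step;
        }
        // Likelihood gradient for logistic regression with binary features.
        double linear = 0.0;
        for (int f : activeFeatures)
            linear += weights[f];
        double err = y - invLogit(linear);
        for (int f : activeFeatures)
            weights[f] += learningRate * err;
    }

    /** Shrink toward zero, clipping so the penalty never flips the sign. */
    void regularize(int f, double penalty) {
        if (weights[f] > 0.0)
            weights[f] = Math.max(0.0, weights[f] - penalty);
        else if (weights[f] < 0.0)
            weights[f] = Math.min(0.0, weights[f] + penalty);
    }

    /** Flush any outstanding penalties at the end of training. */
    void finish() {
        for (int f = 0; f < weights.length; ++f) {
            regularize(f, (step - lastUpdate[f]) * learningRate * l1Penalty);
            lastUpdate[f] = step;
        }
    }
}
```

The finish() method is exactly the end-of-training penalty flush discussed under "A Possible Bug?" below.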

Evaluation for Sparsity

They showed that their new method reduces the final number of non-zero features compared to the baseline truncated stochastic gradient from roughly 90K to roughly 30K. Exponential-decay annealing drives even more features to zero. Here’s where you want to see plots for different annealing schedules and numbers of iterations.

A standard quasi-Newton implementation resulted in roughly 20K features, so there’s still some slippage in their method compared to the “correct” result.

A Possible Bug?

What I didn’t see in their algorithm (in figure 2) was something that applied the penalties at the end of training. More penalty accumulates for variables that never get seen again, and that needs to be applied. This could actually account for the 20K vs. 30K feature discrepancy when compared to quasi-Newton methods.

I introduced this bug in the first version of logistic regression in LingPipe! Mike Ross caught this problem during code review.

Batch Size 1?

I’d have thought using larger batch sizes as proposed by Langford et al. in their truncated stochastic gradient paper would also help to drive features down. If you update on a batch of 100 items, you get a 100-times boost in the regularization gradient when it applies. Of course, this won’t necessarily have the same effect with the cumulative algorithm — it’ll probably help more in the simple clipping-based approach.

Evaluation for Log Loss

The quasi-Newton method did better in all of the evaluations (though the results were very close on their own NLPBA gene tagging data).

The authors’ cumulative penalty versions had log loss in between the simple truncation algorithm and the quasi-Newton method, but they were all pretty close (1/10th to 1/20th of a bit per example).

I’m thinking this could just be an issue of not running SGD long enough with low enough learning rates. I found in my experiments, and others have found the same thing, that it takes a long time to converge to several decimal places.

Differing estimates present another issue for speed evaluation. Without pegging log loss to be the same, you haven’t really found the minimum with SGD. You can see in their figure 2 of learning rates that they could’ve stopped after 20 passes and gotten to nearly the same log loss. What we’ve found is that held out performance is sometimes even better this way — it effectively gives you an overlay of extra regularization from the early stopping.

(In an aside, their figures 3 and 4 clip the last 130 epochs of the OWL-QN baseline, misleadingly making it look like the log loss was better for SGD than quasi-Newton.)

Oracle Evaluation for Speed

Evaluating SGD for speed is tricky, as I mentioned in my favorable review of Pegasos. The problem is that convergence speed varies dramatically w.r.t. the learning rate and annealing schedule.

If I’m reading the Tsuruoka et al. paper correctly, in section 4.1 the authors say they use 30 passes (the final number they use) to tune the learning rate for SGD, using base learning rates of 1, 0.5, 0.2, and 0.1 (eh? I’ve needed much smaller ones, as did Langford et al.) and exponential annealing rates of 0.9, 0.85 and 0.8 (a total of 12 settings, with no indication of how problem-specific they are).

The problem I have with this approach is that the authors are effectively using an oracle to choose learning rates and then claiming their system is faster. It’s exactly this tuning of learning rates and number of epochs that’s so cumbersome for SGD approaches.

To make matters worse, they use Microsoft’s research-use-only quasi-Newton system OWL-QN to tune the regularization parameter. This kind of empirical Bayes tuning of the priors interacts with the tuning of the learning rate and number of epochs in an unfortunate way. Or in a fortunate way if you use path following to evaluate a range of priors.

The simple method, as used in LingPipe, was just as fast to get through 30 epochs. That’s not surprising, but it’s the wrong evaluation. You can see that the cumulative versions get to the same log loss much faster than the simple approach.

Exponential vs. Inverse Annealing

They argue for exponential annealing over inverse annealing on practical rather than theoretical grounds; LingPipe implements both forms along with constant learning rates, as well as providing an annealing interface for custom implementations to be plugged in. Inverse annealing is required in theory, but I’ve found the same thing in practice — exponential works better.

Comparing Apples and Oranges

As the authors point out, another problem with evaluating for speed is that they compared with a system they didn’t write, so who knows what unknowns are lurking in that comparison. Even if a single person writes all the code, there’s no reason to believe their experience and skill is equal or the time they spend tuning for application speed is the same. Having more comparable implementations (and I don’t even know how to measure this) might drive the ratio in either direction.

Memory vs. CPU Speed!

Why does everyone only report CPU type and speed (e.g. “Xeon 2.13 GHz processor”)? Different Xeons have very different amounts of L2 cache and some have different numbers of cores.

Perhaps an even bigger variable is memory and front-side bus speed. We’ve found memory to be just as big a factor in these large models, where essentially every example needs to be moved into and out of the CPU. Error-correcting memory is slower but more robust. Etc., etc.

Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval

August 28, 2009

I’m about to implement the technique for assigning Medical Subject Heading (MeSH) terms to text described in this paper from the TLAs EBI, HMI and TNO ICT:

The same technique could be applied to assign Gene Ontology (GO) terms to texts, tags to tweets or blog posts or consumer products, or keywords to scientific articles.

k-NN via Search

To assign MeSH terms to a text chunk, they:

  1. convert the text chunk to a search query,
  2. run the query against a relevance-based MEDLINE index, then
  3. rank MeSH terms by frequency in the top k (k=10) hits.

In other words, k-nearest-neighbors (k-NN) where “distance” is implemented by a relevance-based search.
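
A minimal sketch of that counting-and-ranking step, assuming you already have a relevance-ranked list of hits carrying their MeSH terms; the SearchHit interface is a stand-in, not any particular search toolkit's API. Per the paper, the query would be the concatenated title and abstract and k would be 10.

```java
import java.util.*;
import java.util.stream.*;

public class MeshByKnnSearch {

    /** Stand-in for one hit from a relevance-ranked MEDLINE search. */
    interface SearchHit {
        List<String> meshTerms();
    }

    /** Rank MeSH terms by how many of the top-k hits carry them. */
    static List<Map.Entry<String, Long>> rankMeshTerms(List<SearchHit> topKHits) {
        Map<String, Long> counts = topKHits.stream()
            .flatMap(hit -> hit.meshTerms().stream())
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .collect(Collectors.toList());
    }
}
```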

Just the Text, Ma’am

Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation.

k-NN implemented as search could be faceted to include journals, years, authors, etc. For products, this could include all the facets seen on sites like Amazon or New Egg.

Language-Model-Based Search

They use language-model-based search, though I doubt that’s critical for success. Specifically, they estimate a maximum-likelihood unigram language model for the query and, for each document, a unigram model interpolated with a model trained on all documents, then rank documents against the query by the cross-entropy of the query model relative to the document model (with the MLE query model, this is just the log probability of the query under the document’s LM). Other LM-based search systems measure similarity by KL-divergence.
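
My reconstruction of that ranking score from the description above (the exact smoothing is whatever the interpolation with the collection model works out to):

\mbox{score}(q,d) = \sum_{w \in q} \hat{p}_{\mbox{\scriptsize MLE}}(w|q) \, \log p_{\mbox{\scriptsize doc}}(w|d)

which, with the MLE query model, is proportional to the log probability of the query under the document's interpolated LM.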

There weren’t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij’s Ph.D. thesis on LM-based IR.

Application to MEDLINE

The text being assigned MeSH terms was another MEDLINE title-plus-abstract. This may seem redundant given that MEDLINE citations are already MeSH annotated, but it’s useful if you’re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document).

It’s easy to apply the authors’ approach to arbitrary texts, such as paragraphs out of textbooks or full text articles or long queries of the form favored by TREC.

Efficient k-NN!

I did a double-take when I saw k-nearest-neighbors and efficiency together. As we all know, k-NN scales linearly with training set size and MEDLINE is huge. But, in this case, the search toolkit can do the heavy lifting. The advantage of doing k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as are found in MEDLINE itself.

Error Analysis

The authors did a nice job in the little space they devoted to error analysis, with more info in the supplements (PR curves and some more parameter evaluations and the top few hits for one example). They reported that k-NN was better than some other systems (e.g. thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudo-documents) at assigning the sparse set of MeSH terms found in actual MEDLINE citations.

Errors tended to be more general MeSH terms that just happened to show up in related documents. I’d also like to see how sensitive performance is to the parameter setting of k=10, as it was chosen to optimize F measure against the sparse terms in MEDLINE. (All results here are for optimal operating points (aka oracle results), which means the results are almost certainly over-optimistic.)

What You’ll Need for an Implementation

It should be particularly easy to reproduce for anyone with:

Look for a demo soon.

Discriminative Semi-Supervised Classifiers?

It’d be nice to see an evaluation with text generated from MeSH and articles referencing those terms that used any of the semi-supervised or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier like logistic regression or SVMs.

Improving IR with MeSH (?)

I didn’t quite follow this part of the paper as I wasn’t sure what exactly they indexed and what exactly they used as queries. I think they’re assigning MeSH terms to the query and then adding them to the query. Presumably they also index the MeSH terms for this.

I did get that only the best MeSH-term assigner improved search performance (quantized to a single number using TREC’s mean average precision metric).

Alternatives like generating a 24,000-category multinomial classifier are possible, but won’t run very fast (though if it’s our tight naive Bayes, it might be as fast as the authors’ search-based approach).

Brodley and Friedl (1999) Identifying Mislabeled Training Data

August 27, 2009

Today’s blast from the past is the descriptively titled:

It’s amazing this paper is so long given the simplicity of their approach — it’s largely careful evaluation on three approaches versus five data sets and a thorough literature survey. It’s also amazing just how many people have tackled this problem over time in more or less the same way.

The approach is almost trivial: cross-validate using multiple classifiers, throw away training instances on which the classifiers disagree with the gold standard, then train on the remaining items. They consider three forms of disagreement: at least one classifier disagrees, a majority of classifiers disagree, and all classifiers disagree with the gold label. They consider three classifiers: 1-nearest-neighbor with Euclidean distance, linear classifiers (using the “thermal training rule”, whatever that is), and C4.5-trained decision trees.
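
A minimal sketch of the majority-vote version of the filter; representing the cross-validated predictions as a list per item is my assumption, not Brodley and Friedl's code.

```java
import java.util.*;

public class ConsensusFilter {

    /** Majority filter: drop item i if a strict majority of the cross-validated
     *  predictions for it (each from a classifier trained on folds that exclude
     *  item i) disagree with its gold label; return the indexes kept. */
    static <Y> List<Integer> keptIndexes(List<Y> goldLabels,
                                         List<List<Y>> crossValidatedPredictions) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < goldLabels.size(); ++i) {
            List<Y> preds = crossValidatedPredictions.get(i);
            int disagree = 0;
            for (Y pred : preds)
                if (!pred.equals(goldLabels.get(i)))
                    ++disagree;
            if (2 * disagree <= preds.size()) // no strict majority disagrees
                kept.add(i);
        }
        return kept;
    }
}
```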

They conclude that filtering improves accuracy, though you’ll have to be more patient than me to dissect the dozens of detailed tables they provide for more insight. (Good for them for including confidence intervals; I still wish more papers did that today.)

The authors are careful to set aside 10% of each data set for testing; the cross-validation-based filtering is on the other 90% of the data, to which they’ve introduced simulated errors.

One could substitute annotators for algorithms, or even use a mixture of the two to help with active learning. And you could try whatever form of learners you like.

What I’d like to see is a comparison with probabilistic training on the posterior category assignments, as suggested by Padhraic Smyth et al. in their 1994 NIPS paper, Inferring ground truth from subjective labelling of Venus images. I’d also like to see more of an analysis of noisy training data on evaluation, along the lines of Lam and Stork’s 2003 IJCAI paper, Evaluating classifiers by means of test data with noisy labels. Because my experience is that gold standards are less pure than their audience of users imagines.

Rzhetsky, Shatkay and Wilbur (2009) How to Get the Most out of Your Curation Effort

August 25, 2009

Following the scientific zeitgeist, here’s another paper rediscovering epidemiological accuracy models for data coding, this time from Mitzi‘s former boss and GeneWays mastermind Andrey Rzhetsky, IR and bio text mining specialist Hagit Shatkay, and the co-creator of the Biocreative entity data, NLM/NCBI’s own John Wilbur:

Their motivations and models look familiar. They use a Dawid-and-Skene-like multinomial model and a “fixed effects” correlation-based model (to account for the overdispersion relative to the independence assumptions of the multinomial model).

Neither they nor the reviewers knew about any of the other work in this area, which is not surprising given that it’s either very new, quite obscure, or buried in the epidemiology literature under a range of different names.

Data Distribution

What’s really cool is that they’ve distributed their data through PLoS. And not just the gold standard, all of the raw annotation data. This is a great service to the community.

What they Coded

Özlem Uzuner‘s i2b2 Obesity Challenge and subsequent labeling we’ve done in house convinced me that modality/polarity is really important. (Yes, this should be obvious, but isn’t when you’ve spent years looking at newswire and encyclopedia data.)

Rzhetsky et al. used 8 coders (and 5 follow-up control coders) to triple code sentences (selected with careful distributions from various kinds of literature and paper sections) for:

  • Focus (Categorical): generic, methodological, scientific
  • Direct Evidence for Claim (Ordinal): 0, 1, 2, 3
  • Polarity (Ordinal): Positive/Negative with a 0, 1, 2, 3 scale for certainty

I’m not sure why they allowed Positive+0 and Negative+0, as they describe 0 certainty as completely uncertain.

Given the ordinal nature of their data, they could’ve used something like Uebersax and Grove’s 1993 model based on ordinal regression (and a really nice decomposition of sensitivity and specificity into accuracy and bias).

CiteXplore: MEDLINE with Citation Details and Entity Highlighting

August 9, 2009

[Update, 10 August 2009: check out the comment for some responses to feedback from EBI's development team.]

I found this site through the syllabus for EBI’s upcoming text mining course (October 2009):

It supports searches over MEDLINE with gene name, protein name, MeSH or protein-protein interaction highlighting. They’re sensibly exploiting external resources, such as iHOP (“info hyperlinked over proteins”), Entrez-PubMed and Entrez-Gene. I greatly prefer the link-out approach to trying to encapsulate everything on their own page.

What’s really neat is that CiteXplore mined the full-text articles for citation details, which they link back to MEDLINE. It looks very much like the ACM’s proprietary “Portal”, only over MEDLINE. One limitation is that CiteXplore only links back into MEDLINE (thus leaving some refs as “not found”), whereas ACM pulls out arbitrary references. This is something we couldn’t do, because we don’t have subscriptions to all the full text doc feeds (I have no idea how complete CiteXplore is in this regard, but they do have proprietary articles linked for citations.)

CiteXplore supports named-entity tagging for genes and/or proteins, though it looks like gene search also includes proteins. For instance, pull up their page and do a search for [serpina3], which is a gene that has the same name in six different mammals and birds. (Sorry I can’t link to the search — like Entrez itself, they used a web engine that doesn’t allow query bookmarking; they also don’t support “open in new window” or “open in new tab”, which is really frustrating.) Then pull down the “gene name highlighting” menu item and click on the “highlight” button. The gene names it finds are rendered in green.

CiteXplore finds some of the SERPINA3 mentions, but misses others. For instance, in the context “Alpha 1-antichymotrypsin/SerpinA3” it seems to have tokenization problems, because both are names of the same gene, but neither is highlighted. The mention of the Bos taurus (cow) gene SERPINA3 (PMID 18384666) isn’t highlighted, but Mus musculus (mouse) is (PMID 15638460). Unfortunately, the mouse mention is linked to the human SerpinA3 in iHOP. (Argh! iHOP’s all Javascripted up to intercept search on page and their search doesn’t jump to the matches. Why, oh why, do people try to innovate on things that already work?)

CiteXplore misses mentions with too many parens — they match “alpha-1-antichymotrypsin”, but miss “alpha(1)-antichymotrypsin”. And they have trouble with mixed case, matching “SERPINA3” and “serpina3”, but missing “SerpinA3”. Other misses seem inexplicable, as in “SERPINA3 (aka alpha-1-antichymotrypsin)”.

They also miss the alias “ACT” for SERPINA3 (it overlaps with a common word, so is probably just hard filtered out). That one’s tricky, as it requires context, and is very hard to recognize with high precision.

There are also other tokenization problems in that they seem to have inserted extra whitespace after open parens. I really wish people would take more care with text data. This is why you need tokenizers that give you info on whitespace (either directly or through start/end offsets for tokens).

I also tried CiteXplore’s “Proteins / Gene ontology” highlighting, but that adds precision problems to their recall problems. For instance, it matches the “Alpha 1″ in “Alpha1-antichymotrypsin” to the Uniprot entry for mating type protein alpha 1. They also mess up the whitespace in this application, inserting extra spaces after hyphens.

There are also bigger recall problems. The search is by exact string match, so overall, they only find 38 citations for the search [serpina3]. Entrez-PubMed itself only finds 35 citations for the same query.

CiteXplore’s advanced search allows expansion by synonym, raising the number of hits for query [serpina3] to 133. Judging from their display, they only add “aact” as a synonym for “serpina3”. iHOP, which now has greatly improved recall over its original version, finds what looks to be many more articles about SERPINA3 searching by Entrez-Gene ID (12 for human), but they don’t report hit numbers, and I’m not going to count them. They also deal with the precision problems involved in expanding with synonyms (and also skip the synonym “ACT”).

CiteXplore also has a frustratingly short fuse on session timeouts; I had to keep redoing all these searches as I wrote this post.

CiteXplore seems to be based on EBI’s text matching system Whatizit, which supports tagging arbitrary text, not just MEDLINE citations found by search. There’s also a freely downloadable paper explaining the system:

An application like this could be built with LingPipe’s exact dictionary matcher coupled with a normalizing tokenizer.

Zou and Hastie (2005) Regularization and Variable Selection via the Elastic Net [Prior]

July 30, 2009

I don’t know how the elastic net prior (aka regularizer) snuck by me unnoticed. It was introduced in:

I was reading about it in the technical report appearing as part of the documentation to their glmnet R package:

The idea’s basically an “interpolation” between the Laplace (L1, lasso) and Gaussian (L2, ridge) priors:

p_{\lambda,\alpha}(\beta) \propto \exp(-\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1))

where β is the coefficient vector, λ a factor for prior variance, and α the blending ratio between L2 and L1 priors, themselves represented by the squared L2 norm and (unsquared) L1 norm terms.

Note that the interpolation happens inside the exponential form, so it’s not a mixture of a Gaussian and Laplace prior, but rather a mixture of their bases. Thus on the log scale used in gradient methods, the prior takes on a pleasantly differentiable form:

\log p_{\lambda,\alpha}(\beta) = -\log Z_{\lambda,\alpha} -\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1)

where Z is the usual normalization term, here dependent on the variance and mixture parameters.

The later paper divides the L2 component by 2, which gives you a slightly different blend. In general, I suppose you could give the two distributions being blended arbitrary variances λ1 and λ2.
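
A quick sketch of what the prior contributes to a gradient-based fit, following the formula above where α weights the squared L2 norm and (1-α) the unsquared L1 norm; the zero-subgradient choice at β = 0 is my assumption.

```java
public class ElasticNetPrior {

    /** Gradient of the elastic-net log prior above with respect to a single
     *  coefficient beta: -lambda * (2 * alpha * beta + (1 - alpha) * sign(beta)).
     *  Math.signum(0.0) == 0.0, so the L1 subgradient at zero is taken to be 0. */
    static double logPriorGradient(double beta, double lambda, double alpha) {
        double l2Part = 2.0 * alpha * beta;                 // d/d beta of alpha * beta^2
        double l1Part = (1.0 - alpha) * Math.signum(beta);  // subgradient of (1 - alpha) * |beta|
        return -lambda * (l2Part + l1Part);
    }
}
```

That's all a stochastic or quasi-Newton fitter needs from the prior, which is why it should slot into the same framework as the existing L1 and L2 priors.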

The authors are particularly interested in two situations that are problematic for L1 or L2 alone:

  • the number of dimensions is larger than the number of training items
  • the features/coefficients are highly correlated

Note that bag-of-words and other lexically-based feature representations in natural language classification and tagging are of exactly this nature. In the former case, L1 breaks down with too few features selected. In the latter case, L2 tends to select all the features and L1 only one of the features. With the elastic net, with most of the weight on L1, you get the benefits of L1’s sparseness and fat tails, with better behavior in the face of correlated variables and large numbers of dimensions.

There’s a beautiful plot in the later paper with prior variance decreasing on the X-axis versus the fitted coefficient values on the Y-axis, with a line for each dimension/feature (and there are lots of non-zero coefficients overlaid on the Gaussian, which never pushes any feature to 0 through regularization).

The Zou and Hastie paper also discusses “bridge regression” (bad pun), which takes an arbitrary Ln penalty for n in the interval [1,2], and thus provides a slightly different way of blending L1 and L2 priors. The limits at n = 1 and n = 2 are the L1 and L2 priors. Although the gradient is still easy to calculate, the resulting solutions for n > 1 are not sparse for the same reason L2 isn’t sparse — the shrinkage is proportional to a fraction of the current value.

Given the general exponential form, the gradient of the elastic net’s log probability is easy to compute. It just reduces to interpolating between the L1 and L2 gradients. I’ll probably just go ahead and add elastic net priors to LingPipe as implementations of stats.RegressionPrior. That’ll let them plug-and-play in multinomial logistic regression and CRFs.

