Archive for the ‘LingPipe in Use’ Category

LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources–that is where LingPIpe fits in. The guys (Conall and Joe) fit a pattern that we see quite often–smart people with a good idea that presupposes NLP but they don’t have much experience with it.

The GetSeer guys were on the ball right away because they had a bunch of data annotated and had cobbled together a regular expression based classifier that served as an excellent starting point. We had machine learning based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.

The great thing about our working setup is that I get to teach/advise which keeps my time commitment manageable and they get to do most of the heavy lifting which is how they learn to build and maintain NLP systems. I have had to learn some Scala since I co-code most of the time when we work together and that is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

• There is a small amount of training data.
• Current results are awful (no surprise).
• We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.

Token Lumpers and Splitters: Spider-Man vs. Superman

August 21, 2013

One of the perennial problems in tokenization for search, classification, or clustering of natural language data is which words and phrases are spelled as compounds and which are separate words. For instance consider “dodgeball” (right) vs. “dodge ball” (wrong) and “golfball” (wrong) vs. “golf ball” (right)?

It’s a classic lumpers vs. splitters problem. Alas, the “right” answer can change over time and some words are listed as both lumped and split in some dictionaries.

This point was first brought home to me and Breck when we were browsing Comcast’s query logs for TV search queries, where users seemed to arbitrarily insert or delete spaces in queries. Who can blame them with the inconsistencies of English in this regard?

This token lumping vs. splitting problem is one of the reasons LingPipe’s spelling correction is not token-based like Lucene’s (though Lucene has other ways to handle combining and splitting compounds — more on that from Mitzi in the near future). And it’s one of the reasons (along with morphological affixes) that character n-grams provide a better basis for classification than hard tokenization.

My favorite examples are comic book superhero names. It turns out that the editors at Marvel Comics are splitters, whereas those at DC Comics are lumpers. Hence we have on the Marvel roster,

Ant-Man, Iron Man, and Spider-Man.

They’re not consistent on hyphenation, which is another issue if you don’t throw away punctuation.

DC, on the other hand, employs men, women, and children, but doesn’t have the budget for spaces:

Aquaman, Batman, Catwoman, Hawkman, Martian Manhunter, Superboy, Supergirl, and Superman.

The exception that proves the rule (I’m open-minded enough to use a phrase that’s a pet peeve of mine), DC gives us:

Wonder Woman

Bayesian Inference for LDA is Intractable

February 18, 2013

Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

and on computational intractability, which is a bit more involved than a program being slow,

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.

0/1 Loss Meaningless for Predicting Rare Events such as Exploding Manholes

June 14, 2012

[Update: 19 June 2012: Becky just wrote me to clarify which tools they were using for what (quoted with permission, of course -- thanks, Becky):

... we aren't using BART to rank structures, we use an independently learned ranked list to bin the structures before we apply BART. We use BART to do a treatment analysis where the y values represent whether there was an event, then we compute the role that the treatment variable plays in the prediction. Here's a journal paper that describes our initial ranking method

http://www.springerlink.com/content/3034h0j334211484/

and the pre-publication version

http://www1.ccls.columbia.edu/%7Ebeck/pubs/ConedPaperRevision-v5.pdf

The algorithm for doing the ranking was modified a few years ago, and now Cynthia is taking a new approach that uses survival analysis.]

Rare Events

Let’s suppose you’re building a model to predict rare events, like manhole explosions in the Con-Ed system in New York (this is the real case at hand — see below for more info). For a different example, consider modeling the probability of a driver getting into a traffic accident in the next week. The problem with both of these situations is that even with all the predictors in hand (last maintenance, number of cables, voltages, etc. in the Con-Ed case; driving record, miles driven, etc. in the driving case), the estimated probability for any given manhole exploding (any person getting into an accident next week) is less than 50%.

The Problem with 0/1 Loss

A typical approach in machine learning in general, and particularly in NLP, is to use 0/1 loss. This forces the system to make a simple yea/nay (aka 0/1) prediction for every manhole about whether it will explode in the next year or not. Then we compare those predictions to reality, assigning a loss of 1 if you predict the wrong outcome and 0 if you predict correctly, then summing these losses over all manholes.

The way to minimize expected loss is to predict 1 if the probability estimate of failure is greater than 0.5 and 0 otherwise. If all of the probability estimates are below 0.5, all predictions are 0 (no explosion) for every manhole. Consequently, the loss is always the number of explosions. Unfortunately, this is the best you can do if your loss is 0/1 and you have to make 0/1 predictions.

So we’ve minimized 0/1 loss and in so doing created a useless 0/1 classifier.

A Hacked Threshold?

There’s something fishy about a classifier that returns all 0 predictions. Maybe we can adjust the threshold for predicting explosions below 0.5. Equivalently, for 0/1 classification purposes, we could rescale the probability estimates.

Sure, it gives us some predicted explosions, but the result is a non-optimal 0/1 classifier. The reason it’s non-optimal in 0/1-loss terms is that each prediction of an explosion is likely to be wrong, but in aggregate some of them will be right.

It’s not a 0/1 Classification Problem

The problem in 0/1 classification arises from converting estimates of explosion of less than 50% per manhole to 0/1 predictions minimizing expected loss.

Suppose our probability estimates are close, at least in the sense that for any given manhole there’s only a very small chance it’ll explode no matter what its features are.

Some manholes do explode and the all-0 predictions are wrong for every exploding manhole.

What Con-Ed really cares about is finding the most at-risk properties in its network and supplying them maintenance (as well as understanding what the risk factors are). This is a very different problem.

A Better Idea

Take the probabilities seriously. If your model predicts a 10% chance of explosion for each of 100 manholes, you expect to see 10 explosions. You just don’t know which of the 100 manholes they’ll be. You can measure these marginal predictions (number of predicted explosions) to gauge how accurate your model’s probability estimates are.

We’d really like a general evaluation that will measure how good our probability estimates are, not how good our 0/1 predictions are. Log loss does just that. Suppose you have $N$ outcomes $y_1,\ldots,y_N$ with corresoponding predictors (aka features), $x_1,\ldots,x_N$, and your model has parameter $\theta$. The log loss for parameter (point) estimate $\hat{\theta}$ is

${\mathcal L}(\hat{\theta}) - \sum_{n=1}^N \, \log \, p(y_n|\hat{\theta};x_n)$

That is, it’s the negative log probability (the negative turns gain into loss) of the actual outcomes given your model; the summation is called the log likelihood when viewed as a function of $\theta$, so log loss is really just the negative log likelihood. This is what you want to optimize if you don’t know anything else. And it’s exactly what most probabilistic estimators optimize for classifiers (e.g., logistic regression, BART [see below]).

Decision Theory

The right thing to do for the Con-Ed case is to break out some decision theory. We can assign weights to various prediction/outcome pairs (true positive, false positive, true negative, false negative), and then try to optimize weights. If there’s a huge penalty for a false negative (saying there won’t be an explosion when there is), then you are best served by acting on low-probability information, such as servicing even low-probability manholes. For example, if there is a $100 cost for a manhole blowing up and it costs$1 to service a manhole so it doesn’t blow up, then even a 1% chance of blowing up is enough to send out the service team.

We haven’t changed the model’s probability estimates at all, just how we act on them.

In Bayesian decision theory, you choose actions to minimize expected loss conditioned on the data (i.e., optimize expected outcomes based on the posterior predictions of the model).

Ranking-Based Evaluations

Suppose we sort the list of manholes in decreasing order of estimated probability of explosion. We can line this up with the actual outcomes. Good system performance is reflected in having the actual explosions ranked high on the list.

Information retrieval supplies a number of metrics for this kind of ranking. The thing I like to see for this kind of application is a precision-recall curve. I’m not a big fan of single-number evaluations like mean average precision, though precision-at-N makes sense in some cases, such as if Con-Ed had a fixed maintenance budget and wanted to know how many potentially exploding manholes it could service.

There’s a long description of these kind evaluations in

Just remember there’s noise in this received curves and that picking an optimal point on them is unlikely to produce such good behavior on held-out data.

With good probability estimates for the events you will get good rankings (there’s a ton of theory around this I’ve never studied).

I’ve been hanging out at Columbia’s Center for Computational Learning Systems (CCLS) talking to Becky Passonneau, Haimanti Dutta, Ashish Tomar, and crew about their Con-Ed project of predicting certain kinds of events like exploding manholes. They built a non-parametric regression model using Bayesian additive regression trees with a fair amount of data and many features as predictors.

I just wrote a blog post on Andrew Gelman’s blog that’s related to issues they were having with diagnosing convergence:

But the real problem is that all the predictions are below 0.5 for manholes exploding and the like. So simple 0/1 loss just fails. I thought the histograms of residuals looked fishy until it dawned on me that it actually makes sense for all the predictions to be below 0.5 in this situation.

Moral of the Story

0/1 loss is not your real friend. Decision theory is.

This whole discussion reminds me of the lottery “paradox”. Each ticket holder is very unlikely to win a lottery, but one of them will win. The “paradox” arises from the inconsistency of the conjunction of beliefs that each person will lose and the belief that someone will win.

Oh, no! Henry Kyburg died in 2007. He was a great guy and decades ahead of his time. He was one of my department’s faculty review board members when I was at CMU. I have a paper in a book he edited from the 80s when we were both working on default logics.

May 18, 2012

From the Emailbox

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

Chunker Evaluation

The chunker evaluations built into LingPipe calculate the usual of precision and recall measures (see below). These evaluations compare a set of response chunkings to a set of reference chunkings. Usually the reference is drawn from a gold-standard corpus and the response from an automated system built to do chunking.

Precision (aka positive predictive accuracy) measures the proportion of chunks in the response that are also in the reference. Recall (aka sensitivity) measures the proportion of chunks in the reference that are in the response. If we swap the reference and response chunkings, we swap precision and recall.

True negatives aren’t really being counted here — theoretically there are a huge number of them — any possible span with any possible tag could have been labeled. LingPipe just sets the true negative count to zero, and as a result, specificity (TN/[TN+FP]) doesn’t make sense.

Interannotator Agreement

Suppose you have chunkings from two human annotators. Just treat one as the reference and one as the response and run a chunking evaluation. The precision and recall values will tell you which annotator returned more chunkings. For instance, if precision is .95 and recall .75, you know that the annotator assigned as the reference chunking had a whole bunch of chunks the other annotator didn’t think were chunks, but most of the chunks found by the response annotator were also chunks of the reference annotator.

You can use F-measure as an overall single-number score.

The base metrics are all explained in

and their application to chunking in

Examples of running chunker evaluations can be found in

LingPipe Annotation Tool

If you’re annotating entity data, you might be interested in our learn-a-little, tag-a-little tool.

Now that Mitzi’s brought it up to compatibility with LingPipe 4, we should move citationEntities out of the sandbox and into a tutorial.

All Aboard for Quasi-Productive Stemming

April 4, 2012

One of the words Becky and I are having annotated for word sense (collecting 25 non-spam Mechanical Turk responses per word) is the nominal (noun) use of “board”.

One of the examples was drawn from a text with a typo where “aboard” was broken into two words, “a board”. I looked at the example, and being a huge fan of nautical fiction, said “board is very productive — we should have the nautical sense”. Then I thought a bit longer and had to admit I didn’t know what “board” meant all by itself. I did know a whole bunch of terms that involved “board” as a stem:

• inboard
• outboard
• aboard, onboard
• overboard, “by the board”
• larboard (port)
• weatherboard (facing the weather [wind])
• starboard
• above board (on deck)

And what about “seaboard”? As in the “Eastern seaboard”.

The nautical meaning wasn’t listed in WordNet, but dictionary.com has an entry for board that lists it as one of two nautical senses. Words have a surprising number of meanings if you’re willing to go into low frequency, archaic/obsolete and domain-specific usages.

The nautical sense in play is “side of a ship”. It also lists an obsolete sense meaning edge or side of anything. So the nautical sense is just a specialization of this obsolete sense. That’s one way in which meaning drift occurs.

This is all consistent with “side”, cf., “inside”/”inboard”, “outside”/”outboard”, and “aside”/”aboard”. The “side” in question here seems to have drifted to something like “side of an enclosed structure”

This is the same problem we had with our morphological annotation project at LingPipe — there were words that seemed to be compounds, but one of the roots didn’t really stand alone in (common, everyday) English.

March 12, 2012

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.

The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs can be engineered to outperform the rescoring chunker with good feature engineering.

A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.

It Works!

Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).

That’s not Cricket!

According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.

So What?

If we can train a system to performa at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?

We have something better — we can train a system to match Mitzi’s annotation!

In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.

The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.

Error Analysis and Fixing Inconsistencies

Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.

What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back to correct to do it all consistently to include all of the time information in a phrase (makes sense to me).

After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!

That’s also not Cricket!

The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.

So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.

On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, then change the coding standard to make it more learnable, that seems OK to me.

Go Forth and Multiply

Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools

When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC 6 entity corpus.

Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.

Settles (2011): Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

February 23, 2012

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

The Big Picture

The easiest way to see what’s going on is with a screenshot of DUALIST, the system on which the paper is based:

It’s basically a tag-a-little, learn-a-little annotation tool for classifiers. I wrote something along these lines for chunk tagging (named entities, etc.) — you can find it in the LingPipe sandbox project citationEntities (called that because I originally used it to zone bibliogrphies in docs, citations in bibliographies and fields in citations). Mitzi just brought it up to date with the current LingPipe and generalized it for some large multi-part document settings.

In DUALIST, users provide two kinds of input:

1. category classifications for documents
2. words associated with categories

The left-hand-side of the interface presents a scrolling list of documents, with buttons for categories. There are then columns for categories with words listed under them. Users can highlight words in the lists that they believe are associated with the category. They may also enter new words that don’t appear on the lists.

Settles points out a difficult choice in the design. If you update the underlying model after every user choice, the GUI items are going to rearrange themselves. Microsoft tried this with Word, etc., arranging menus by frequency of use, and I don’t think anyone liked it. Constancy of where something’s located is very important. So what he did was let the user mark up a bunch of choices of categories and words, then hit the big submit button at the top, which would update the model. I did roughly the same thing with our chunk annotation interface.

There’s always a question in this kind of design whether to pre-populate the answers based on the model’s guesses (as far as I can tell, DUALIST does not pre-populate answers). Pre-populating answers makes the user’s life easier in that if the system is halfway decent, there’s less clicking. But it raises the possibility of bias, with users just going with what the system suggests without thinking too hard.

Naive-Bayes Classifier

The underlying classification model is naive Bayes with a Dirichlet prior. Approximate inference is carried out using a maximum a posteriori (MAP) estimate of parameters. It’s pretty straightforward to implement naive Bayes this in a way that’s fast enough to use in this setting. The Dirichlet is conjugate to the multinomial so the posteriors are analytically tractable and the sufficient statistics are just counts of documents in each category and the count of words in documents of each category.

The Innovations

The innovation here is twofold.

The first innovation is that Settles uses EM to create a semi-supervised MAP estimate. As we’ve said before, it’s easy to use EM or some kind of posterior sampling like Gibbs sampling over a directed graphical model with any subset of its parameters or labels being unknown. So technically, this is straightforward. But it’s still a really good idea. Although semi-supervised classifiers are very popular, I’ve never seen it used in this kind of active-learning tagging interface. I should probably add this to our chunking tagger.

The second (and in my opinion more important) innovation is in letting users single out some words as being important words in categories. The way this gets pushed through to the model is by setting the component of the Dirichlet prior corresponding to the word/category pair to a larger value. Settles fits this value using held-out data rather than rolling it into the model itself with a prior. The results seem oddly insensitive to it, which surprised me (but see below).

Gadzooks! Settles seems to be missing the single biggest tuning parameter typically applied to naive Bayes — the document length normalizer. Perhaps he did this because when you document-length normalize, you no longer have a properly generative model that corresponds to the naive Bayes paradigm. But it makes a huge difference.

The LingPipe EM tutorial uses naive Bayes and the same 20 Newsgroups corpus as (Nigam, McCallum and Mitchell 2000) used for evaluation, and I showed the effect of document length normalization is huge (the Nigam et al. article has been cited nearly 2000 times!). You can do way better than Nigam et al.’s reported results by setting the document length norm to a small value like 5. (What document length norm’s doing is trying to correct for the lack of covariance and overdispersion modeling in naive multinomial document model — it’s the same kind of shenanigans you see in speech recognition in weighting the acoustic and language models and the same trick I just saw Kevin Knight pull out during a talk last week about decoding encrypted documents and doing machine translation.)

I think one of the reasons that the setting of the prior for important words has so little effect (see the performance figures) is that all of the priors are too high. If Settles really is starting with the Laplace prior (aka add 1), then that’s already too big for naive Bayes in this setting. Even the uniform prior (aka add 0) is too big. We’ve found that we need very small (less than 1) prior parameters for word-in-topic models unless there’s a whole lot of data (and the quantity Settles is using hasn’t gotten there by a long shot — you need to get enough so that the low counts dominate the prior before the effect of the prior washes out, so we’re talking gigabytes of text, at least).

Also, this whole approach is not Bayesian. It uses point estimates. For a discussion of what a properly Bayesian version of naive Bayes would look like, check out my previous blog post, Bayesian Naive Bayes, aka Dirichlet Multinomial Classifiers. For a description of what it means to be Bayesian, see my post What is Bayesian Statistical Inference?.

Confusion with Dirichlet-Multinomial Parameterization and Inference

There’s a confusion in the presentation of the Dirichlet prior and consequent estimation. The problem is that the prior parameter for a Dirichlet is conventionally the prior count (amount you add to the usual frequency counts) plus one. That’s why a prior of 1 is uniform (you add nothing to the frequency counts) and why a prior parameter of 2 corresponds to Laplace’s approach (add one to all frequency counts). The parameter is constrained to be positive, so what does a prior of 0.5 mean? It’s sort of like subtracting 1/2 from all the counts (sound familiar from Kneser-Ney LM smoothing?).

Now the maximum a posteriori estimate is just the estimate you get from adding the prior counts (parameter minus one) to all the empirical counts. It doesn’t even exist if the counts are less than 1, which can happen with Dirichlet parameter components that are less than 1. But Settles says he’s looking at the posterior expectation (think conditional expectation of parameters given data — the mean of the posterior distribution). The posterior average always exist (it has to given the bounded support here), but it requires you to add another one to all the counts.

To summarize, the mean of the Dirichlet distribution $\mbox{Dir}(\theta|\alpha)$ is

$\bar{\theta} = \alpha/(\sum_k \alpha_k)$,

whereas the maximum (or mode) is

$\theta^{*} = (\alpha - 1) / (\sum_k (\alpha_k - 1))$.

where the $-1$ is read componentwise, so $\alpha - 1 = (\alpha_1-1,\ldots,\alpha_K-1)$. This only exists if all $\alpha_k \geq 0$.

That’s why a parameter of 1 corresponds to the uniform distribution and why a parameter of 2 (aka Laplace’s “add-one” prior) is not uniform.

Settles says he’s using Laplace and taking the posterior mean (which, by the way, is the “Bayesian point estimate” (oxymoron warning) minimizing expected square loss). But that’s not right. If he truly adds 1 to the empirical frequency counts, then he’s taking the posterior average with a Laplace prior (which is not uniform). This is equivalent to the posterior mode with a prior parameter of 3. But it’s not equivalent to either the posterior mode or mean with a uniform prior (i.e., prior parameter of 1).

Active Learning

Both documents and words are sorted by entropy-based active learning measures which the paper presents very clearly.

Documents are sorted by the conditional entropy of the category given the words in the model.

Word features are sorted by information gain, which is the reduction in entropy from the prevalence category distribution to the expected category distribution entropy conditoned on knowing the feature’s value.

Rather than sorting docs by highest classification uncertainty, as conditional entropy does, we’ve found it useful to sort docs by the lowest classification uncertainty! That is, we ask humans to label the docs about which the classifier is least uncertain. The motivation for this is that we’re often building high-precision (and relatively low recall) classifiers for customers and thus have a relatively high probability threshold to return a guess. So the higher ranked items are closer to the boundary we’re trying to learn. Also, we find in real world corpora that if we go purely by uncertainty, we get a long stream of outliers.

Settles does bring up the issue of whether using what’s effectively a kind of active learning mechanism trained with one classifier will be useful for other classifiers. We need to get someone like John Langford or Tong Zhang in here to prove some useful bounds. Their other work on active learning with weights is very cool.

I love the big submit button. What with Fitts’s law, and all.

I see a big problem with this interface for situations with more than a handful of categories. What would the full 20 Newsgroups look like? There aren’t enough room for more columns or a big stack of buttons.

Also, buttons seems like the wrong choice for selecting categories. These should probably be radio buttons to express the exclusivity and the fact that they don’t take action themselves. Typically, buttons cause some action.

Discriminative Classifiers

Given the concluding comments, Settles doesn’t seem to know that you can do pretty much exactly the same thing in a “discriminative” classifier setting. For instance, logistic regression can be cast as just another directed graphical model with parameters for each word/category pair. So we could do full Bayes with no problem.

There are also plenty of online estimation procedures for weighted examples; you’ll need the weighting to deal with EM (see, e.g., (Karampatziakis and Langford 2010) for an online weighted training method that adjusts neatly for curvature in the objective; it’s coincidentally the paper I’m covering for the next Columbia Machine Learning Reading Group meeting).

The priors can be carried over to this setting, too, only now they’re priors on regression coefficients. See (Genkin, Lewis and Madigan 2007) for guidance. One difference is that you get a mean and a variance to play with.

Building it in LingPipe

LingPipe’s traditional naive Bayes implementation contains all that you need to build a system like DUALIST. Semi-supervised learning with EM is covered in our EM tutorial with naive Bayes as an example.

To solve some of the speed issues Settles brings up in the discussion section, you can always thread the retraining in the background. That’s what I did in the chunk tagger. With a discriminative “online” method, you can just keep cycling through epochs in the background, which gives you the effect of a hot warmup for subsequent examples. Also, you don’t need to run to convergence — the background models are just being used to select instances for labeling.

How to Prevent Overflow and Underflow in Logistic Regression

February 16, 2012

Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

Logistic Regression Model

The basic model of an binary outcome $y_n \in \{ 0, 1\}$ with predictor or feature (row) vector $x_n \in \mathbb{R}^K$ and coefficient (column) vector $\beta \in \mathbb{R}^K$ is

$y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1$, and

$\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0$.

(Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-750,750)$ before underflowing to 0 or overflowing to positive infinity.

Potential Underflow and Overflow

The linear predictor at the heart of the regression,

$x_n \beta = \sum_{k = 0}^K x_{n,k} \beta_k$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector $\beta$ to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.

Log Sum of Exponents to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y,x$,

$\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)$

Working on the inner log probability term, we have

$\log p(y_n|\beta;x_n)$

${ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))$

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0$

Recalling that

$1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha)$,

we further simplify to

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0$

Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

${ } = \log \ \mbox{logit}^{-1}(\alpha)$

${ } = \log (1 / (1 + \exp(-\alpha)))$

${ } = - \log (1 + \exp(-\alpha))$

${ } = - \mbox{logSumExp}(0,-\alpha)$

Log Sum of Exponentials

Recall that the log sum of exponentials function is

$\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))$

If you’re not familiar with how it prevents underflow and overflow, check out my previous post:

In the logistic regression case, we have an even greater chance for optimization because the argument $a$ is a constant zero.

Logit-transformed Bernoulli

Putting it all together, we have the logit-transformed Bernoulli distribution,

$\mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))$

${ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1$
${ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0$

We can just think of this as an alternatively parameterized Bernoulli distribution,

$\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))$

with which our model can be expressed as

$y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta)$.

Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as $-1$ and thus take the outcome $y \in \{ -1, 1 \}$, where we have

$\mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)$

November 4, 2011

The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

The relevant resources are:

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co https://aliasi.devguard.com/svn/sandbox/twitter-pos


The code’s in a single file, stored under the src subdirectory of the package:

package com.lingpipe.twpos;

import com.aliasi.classify.*;
import com.aliasi.corpus.*;
import com.aliasi.io.*;
import com.aliasi.hmm.*;
import com.aliasi.tag.*;
import java.io.*;
import java.util.*;

public class Eval {

public static void main(String[] args) throws IOException {

System.out.println("Training Tagger");
HmmCharLmEstimator hmm = new HmmCharLmEstimator();
corpus.visitTrain(hmm);
HmmDecoder tagger = new HmmDecoder(hmm);

System.out.println("Evaluating");
boolean storeTokens = true;
TaggerEvaluator evaluator
= new TaggerEvaluator(tagger,storeTokens);
corpus.visitTest(evaluator);
System.out.println(evaluator.tokenEval());
}

static List<Tagging> parse(File f) throws IOException {
List<Tagging> taggings
= new ArrayList<Tagging>();
List tokens = new ArrayList();
List tags = new ArrayList();
for (String line : reader) {
String[] tokTag = line.split("\\s+");
if (tokTag.length != 2) {
// System.out.println("tokens=" + tokens);
// System.out.println("tags=" + tags);
tokens = new ArrayList();
tags = new ArrayList();
} else {
}
}
return taggings;
}

static class TwitterPosCorpus extends ListCorpus<Tagging> {
public TwitterPosCorpus(File path) throws IOException {
for (Tagging t : parse(new File(path,"train")))
for (Tagging t : parse(new File(path,"dev")))
for (Tagging t : parse(new File(path,"test")))
}
}
}


LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.

Finally, here’s the confusion matrix for tags in the first-best output:

,D,E,#,!,G,&,@,A,$,L,N,O,,,U,T,V,P,S,R,~,^,X,Z D,446,0,0,1,0,0,0,4,0,0,0,7,0,0,0,0,11,0,7,0,8,1,0 E,0,53,0,1,2,0,0,0,1,0,0,0,5,0,1,0,0,0,0,0,0,0,0 #,0,0,44,0,1,0,0,0,0,0,10,0,0,0,0,3,0,0,0,0,20,0,0 !,0,0,1,140,1,0,0,5,0,1,15,5,0,0,0,3,1,0,7,0,7,0,0 G,1,1,5,2,14,0,0,1,3,0,10,0,10,0,0,4,1,0,1,2,15,0,0 &,0,0,0,0,0,122,0,1,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0 @,0,0,0,0,0,0,328,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0 A,0,0,0,1,0,1,0,248,3,0,44,0,0,0,2,30,2,0,24,0,12,0,0$,0,0,0,0,0,0,1,0,79,0,2,0,0,0,0,0,3,0,0,0,0,0,0
L,2,0,0,0,1,0,0,0,0,120,3,1,0,0,0,2,0,0,0,0,0,0,0
N,1,0,1,5,1,0,0,49,1,1,783,2,0,0,2,52,6,0,14,0,63,0,0
O,4,0,0,0,1,0,0,2,0,0,2,456,0,0,1,0,0,0,2,0,4,0,0
,,0,4,0,0,2,0,0,0,0,0,0,0,861,0,0,2,0,0,0,11,0,0,0
U,0,0,0,1,0,0,0,0,0,0,0,0,1,114,0,0,0,0,0,0,1,0,0
T,0,0,0,0,0,0,0,0,0,0,0,1,0,0,24,0,9,0,1,0,1,0,0
V,0,1,0,0,0,0,0,21,0,1,69,1,0,0,0,921,9,0,7,2,21,0,0
P,2,0,0,1,0,0,0,4,1,0,1,0,0,0,11,6,571,0,12,0,4,0,0
S,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,2,0,0,1,0,0
R,4,0,0,1,0,0,0,13,0,0,20,1,0,0,1,6,15,0,269,0,8,1,0
~,0,0,0,1,1,0,0,0,0,0,0,0,32,0,0,1,0,0,0,177,0,0,0
^,1,0,4,1,2,0,0,29,2,0,101,0,2,0,0,16,4,0,1,0,331,0,1
X,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,3,0
Z,1,0,0,0,0,0,0,0,0,1,4,0,0,0,0,0,0,1,0,0,13,0,2


I should really figure out how to format that a bit more neatly.