Twitter POS Tagging with LingPipe and ARK Tweet Data


The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagged corpus for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

The relevant resources are:

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.
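To make the two numbers concrete, here is a minimal sketch of token-level accuracy and a per-tag breakdown. It is not LingPipe’s evaluator code. The class name `PerTagAccuracy` and the toy tags in `main` are made up for illustration, and "per-tag accuracy" is assumed here to mean the fraction of tokens with a given reference tag that the tagger got right.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Token-level and per-tag accuracy over aligned reference/response tags. */
class PerTagAccuracy {

    /** Fraction of positions where the response tag equals the reference tag. */
    static double accuracy(List<String> reference, List<String> response) {
        checkAligned(reference, response);
        int correct = 0;
        for (int i = 0; i < reference.size(); ++i)
            if (reference.get(i).equals(response.get(i)))
                ++correct;
        return correct / (double) reference.size();
    }

    /** Maps each reference tag to the fraction of its tokens tagged correctly. */
    static Map<String, Double> perTagAccuracy(List<String> reference,
                                              List<String> response) {
        checkAligned(reference, response);
        // value is {number correct, number of tokens with this reference tag}
        Map<String, int[]> counts = new TreeMap<>();
        for (int i = 0; i < reference.size(); ++i) {
            int[] c = counts.computeIfAbsent(reference.get(i), k -> new int[2]);
            if (reference.get(i).equals(response.get(i)))
                ++c[0];
            ++c[1];
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet())
            result.put(e.getKey(), e.getValue()[0] / (double) e.getValue()[1]);
        return result;
    }

    private static void checkAligned(List<String> reference, List<String> response) {
        if (reference.isEmpty() || reference.size() != response.size())
            throw new IllegalArgumentException("need non-empty lists of equal length");
    }

    public static void main(String[] args) {
        // Toy data, not drawn from the corpus.
        List<String> reference = Arrays.asList("N", "V", "N", "@");
        List<String> response = Arrays.asList("N", "V", "V", "@");
        System.out.println("accuracy=" + accuracy(reference, response));
        System.out.println("per tag=" + perTagAccuracy(reference, response));
    }
}
```

On the toy data, three of four tokens match, and only one of the two reference `N` tokens is tagged correctly, which is the kind of imbalance a per-tag table exposes and a single accuracy figure hides.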

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.
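For readers who haven’t seen additive smoothing written down, here is a rough sketch of the transition estimate, trained one tag sequence at a time. This is not LingPipe’s implementation. The class `AdditiveTransitions`, its constructor arguments, and the toy tags are assumptions for illustration. Adding a constant a to every count corresponds to the MAP estimate under a symmetric Dirichlet prior with concentration a + 1.

```java
import java.util.HashMap;
import java.util.Map;

/** Online bigram tag-transition counts with additive smoothing (illustrative only). */
class AdditiveTransitions {

    private final int numTags;     // size of the tag set
    private final double alpha;    // amount added to every transition count
    private final Map<String, Map<String, Integer>> pairCounts = new HashMap<>();
    private final Map<String, Integer> prevCounts = new HashMap<>();

    AdditiveTransitions(int numTags, double alpha) {
        if (numTags < 1 || alpha <= 0.0)
            throw new IllegalArgumentException("need numTags >= 1 and alpha > 0");
        this.numTags = numTags;
        this.alpha = alpha;
    }

    /** Online update: adds the transitions of one tag sequence to the counts. */
    void train(String[] tagSequence) {
        for (int i = 1; i < tagSequence.length; ++i) {
            String prev = tagSequence[i - 1];
            String next = tagSequence[i];
            pairCounts.computeIfAbsent(prev, k -> new HashMap<>())
                      .merge(next, 1, Integer::sum);
            prevCounts.merge(prev, 1, Integer::sum);
        }
    }

    /** p(next | prev) = (count(prev,next) + alpha) / (count(prev) + alpha * numTags). */
    double prob(String prev, String next) {
        Map<String, Integer> row = pairCounts.get(prev);
        int pair = (row == null) ? 0 : row.getOrDefault(next, 0);
        int total = prevCounts.getOrDefault(prev, 0);
        return (pair + alpha) / (total + alpha * numTags);
    }

    public static void main(String[] args) {
        AdditiveTransitions model = new AdditiveTransitions(2, 1.0);
        model.train(new String[] { "N", "V", "N" });   // toy sequence
        System.out.println("p(V|N)=" + model.prob("N", "V"));
        System.out.println("p(N|N)=" + model.prob("N", "N"));
    }
}
```

Because the estimate is just a ratio of running counts, nothing needs to be refit when a new training example arrives, which is the "online" property mentioned above. Unseen transitions still get nonzero probability, so decoding never hits a hard zero.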

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

You can get their code from their project home page, linked above.

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co

The code’s in a single file, stored under the src subdirectory of the package:

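package com.lingpipe.twpos;

import com.aliasi.classify.*;
import com.aliasi.corpus.*;
import com.aliasi.hmm.*;
import com.aliasi.io.FileLineReader;
import com.aliasi.tag.*;
import java.io.File;
import java.io.IOException;
import java.util.*;

public class Eval {

    public static void main(String[] args) throws IOException {
        System.out.println("Reading Corpus");
        TwitterPosCorpus corpus 
            = new TwitterPosCorpus(new File(args[0]));

        System.out.println("Training Tagger");
        HmmCharLmEstimator hmm = new HmmCharLmEstimator();
        corpus.visitTrain(hmm);  // the estimator handles one tagging at a time
        HmmDecoder tagger = new HmmDecoder(hmm);

        System.out.println("Evaluating");
        boolean storeTokens = true;
        TaggerEvaluator<String> evaluator
            = new TaggerEvaluator<String>(tagger,storeTokens);
        corpus.visitTest(evaluator);
        System.out.println(evaluator.tokenEval());
    }

    // One token and tag per line, whitespace separated;
    // any other line (e.g., a blank one) ends the current tweet.
    static List<Tagging<String>> parse(File f) throws IOException {
        List<Tagging<String>> taggings 
            = new ArrayList<Tagging<String>>();
        FileLineReader reader = new FileLineReader(f,"UTF-8");
        List<String> tokens = new ArrayList<String>();
        List<String> tags = new ArrayList<String>();
        try {
            for (String line : reader) {
                String[] tokTag = line.split("\\s+");
                if (tokTag.length != 2) {
                    if (!tokens.isEmpty())
                        taggings.add(new Tagging<String>(tokens,tags));
                    // System.out.println("tokens=" + tokens);
                    // System.out.println("tags=" + tags);
                    tokens = new ArrayList<String>();
                    tags = new ArrayList<String>();
                } else {
                    tokens.add(tokTag[0]);
                    tags.add(tokTag[1]);
                }
            }
        } finally {
            reader.close();
        }
        // keep the last tweet if the file doesn't end in a blank line
        if (!tokens.isEmpty())
            taggings.add(new Tagging<String>(tokens,tags));
        return taggings;
    }

    static class TwitterPosCorpus extends ListCorpus<Tagging<String>> {
        public TwitterPosCorpus(File path) throws IOException {
            for (Tagging<String> t : parse(new File(path,"train")))
                addTrain(t);
            // assumption: dev is folded into training, since nothing is tuned
            for (Tagging<String> t : parse(new File(path,"dev")))
                addTrain(t);
            for (Tagging<String> t : parse(new File(path,"test")))
                addTest(t);
        }
    }
}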

LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.
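The fair version of tuning can be sketched as follows: pick the setting on the dev set, then touch the test set exactly once. This is only an outline of the protocol, not code from the sandbox project. The class `DevSetTuning`, the candidate values, and the two scoring functions in `main` are hypothetical stand-ins for "train with this setting and score on dev (or test)".

```java
import java.util.function.DoubleUnaryOperator;

/** Select a hyperparameter on dev data, then evaluate once on test data. */
class DevSetTuning {

    /** Returns the candidate with the highest dev score; ties go to the earlier one. */
    static double selectOnDev(double[] candidates, DoubleUnaryOperator devScore) {
        if (candidates.length == 0)
            throw new IllegalArgumentException("need at least one candidate");
        double best = candidates[0];
        double bestScore = devScore.applyAsDouble(best);
        for (int i = 1; i < candidates.length; ++i) {
            double score = devScore.applyAsDouble(candidates[i]);
            if (score > bestScore) {
                best = candidates[i];
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical scoring functions. A real run would train a tagger with
        // the given setting and return its accuracy on the dev or test set.
        DoubleUnaryOperator devScore = x -> 1.0 - Math.abs(x - 0.5);
        DoubleUnaryOperator testScore = x -> 0.95 - Math.abs(x - 0.6);

        double[] candidates = { 0.1, 0.5, 1.0 };
        double chosen = selectOnDev(candidates, devScore);

        // The test set is consulted exactly once, after the setting is fixed.
        System.out.println("chosen=" + chosen
                           + " test score=" + testScore.applyAsDouble(chosen));
    }
}
```

The point of the structure is that `testScore` never appears inside the selection loop. Reporting the maximum of `testScore` over the candidates instead would be the post-hoc best-settings number warned about above.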

Finally, here’s the confusion matrix for tags in the first-best output:


I should really figure out how to format that a bit more neatly.

2 Responses to “Twitter POS Tagging with LingPipe and ARK Tweet Data”

  1. Brendan O'Connor Says:

    Nice! BTW it’s Apache License.

  2. Twitter POS Tagging with LingPipe and ARK Tweet Data « Another Word For It Says:

    [...] Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter. [...]
