Hierarchical Bayesian Batting Ability, with Multiple Comparisons

November 4, 2009 by lingpipe

Just in time for the last game(s) of this year’s World Series, the final installment of my example of Bayesian stats using baseball batting ability. I took us off-topic (text analytics) with a general intro to Bayesian stats (What is “Bayesian” Statistical Inference?). As a first example, I compared Bayesian calculations of binomial posteriors with point estimates (Batting Averages: Bayesian vs. MLE Estimates). In exploring the question of “where do priors come from?”, I started with simple non-Bayesian point estimates (Moment Matching for Empirical Bayes Beta Priors for Batting Averages). Finally, I realized I’d meant “ability” where I’d said “average” (former is a latent parameter, latter is a statistic calculated from at bats), when I considered Bayesian point estimates (Bayesian Estimators for the Beta-Binomial Model of Batting Ability).

Hierarchical Model Recap

For batter j, the number of hits n_j in N_j at bats is modeled as a binomial,

n_j \sim \mbox{\sf Binom}(\theta_j,N_j) for j \in 1:J,

where the ability, or chance of getting a hit, for batter j is \theta_j. Ability is modeled as having a beta distribution

\theta_j \sim \mbox{\sf Beta}(\alpha,\beta) for j \in 1:J

with prior number of hits \alpha-1 and prior number of outs \beta-1. These parameters, which act as priors for the for the binomial parameter, are themselves given priors. The mean of the beta, \alpha/(\alpha+\beta), is given a uniform prior,

\alpha/(\alpha+\beta) \sim \mbox{\sf Beta}(1,1).

The prior for the scale \alpha+\beta is modeled by a Pareto distribution, which has probability proportional to 1/(\alpha+\beta)^{2.5},

\alpha + \beta \sim \mbox{Pareto}(1.5).

Fitting

I used the 2006 AL position player data (given in this previous blog post). That fixes the number of players J and the hits n_j and at-bats N_j for each player j \in 1:J.

I then used BUGS encoding of the model above (as shown in this previous post). BUGS automaticaly implements Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC) inference. I ran 3 chains of 1000 samples each, retaining the last 500 samples, for 1500 posterior samples total. All parameters had \hat{R} values very near 1, indicating convergence of the Markov chains. As usual, we read off statistics from the samples and used the sampled values for inference, where they allow the integrals involved in Bayesian inference (as descirbed in this blog post) to be computed.

Beta Posterior

We can inspect the posterior for the beta mean \alpha/(\alpha+\beta) and scale \alpha+\beta parameters. We plot them as a 2D scatterplot with their means picked out as lines.

beta parameters posterior

As usual, the scale is much harder to estimate than the mean.

Ability Posteriors

We can also plot the ability for particular players. Here’s histograms of sampled posteriors for the players with the best average posterior ability estimates, with their hits and at-bats provided as labels above the plot:

Notice how the variance of the posterior is determined by the number of at bats. The player with 60 at bats has a much wider posterior than the player with 695 at bats.

Multiple Comparisons: Who’s the Best Batter?

We’re going to do what Andrew, Jennifer and Masanao suggest, which is to use our hierarchical model to make a posterior comparison about player’s ability. In particular, we’ll estimate the posterior probabability that a particular player is the best player. We simply look at all 1500 posterior samples, which include ability samples as plotted above, and count how many times a player has the highest ability in a sample. Then divide by 1500, and we’re done. It’s a snap in R, and here’s the results, for the same batters as the plot above:

Average At-Bats Pr(best ability)
.367 60 0.02
.342 482 0.08
.347 521 0.12
.322 695 0.02
.330 648 0.04
.343 623 0.11
.330 607 0.04

The .347 batter with 521 at bats not only has the highest estimated chance in our model of having the highest ability, he also won the batting crown (Joe Mauer of Minnesota). The beta-binomial hierarchical model estimate is only 12% that this batter has the highest ability. The estimate is very close for the .343 batter with 623 at bats (Derek Jeter of the Yankees). [It turns out the race to the batting crown came down to the wire.]

The larger number of at bats provides more evidence that the batter has a higher ability than average, thus pulling the posterior ability estimate further away from the prior mean. Finally, note that we’re assigning some chance that the .367 batter with only 60 at bats is best in the league. That’s because when the samples are on the high side of the distribution, this batter’s best.

LingPipe Classifiers and Chunkers for Endeca Extend Partner Program

November 3, 2009 by lingpipe

A couple weeks ago, Endeca made the following press release:

The “leading text analytics software vendors” are us (props to Breck for naming us with an “A”), Basis Technology, Lexalytics, MetaCarta, NetOwl, Nstein, Semantia and Temis. But wait, that’s not all. A slew of text analytics companies had either joined earlier or announced joining now, including ChoiceStream, BayNote, Lexalytics, Coremetrics, NStein, and Searchandise.

It’s no surprise that we’re all thinking Endeca has quite a bit of potential as a channel partner.

After the usual marketing blather (e.g. “leveraging the extensibility of the McKinley platform”, “lower cost of ownership”, “value-added capabilities”, etc.) and vague promises (e.g. “unrestricted exploration of unstructured content”), the third paragraph of Endeca’s press release explains what it’s all about in allowing Endeca’s search customers to

… run their data through an Endeca Extend partner solution, extract additional meta-data elements from the text, and append that meta-data to the original content

Endeca Records

Endeca stores documents in record data structures, which associate string keys with lists of string values. This is the same rought structure as is found in a Lucene Document.

One striking difference is that Endeca’s API is cleaner and better documented. Overall, I’m very impressed with Endeca’s API. Looking at their API reminds me of the APIs we built at SpeechWorks, where wiser heads prevailed on me to forego complex controls designed for control-freak grad students in favor of making easy things easy.

Another striking difference is that Lucene’s document structure is much richer, allowing for binary blobs to be stored by those trying to use Lucene as a database. Lucene also allows both documents as a whole and fields within a document to be boosted, adding a multiplier to their search scores for matching queries.

Manipulator Extensions

Endeca’s produced an API for extensions. An extension visits records, modifies them, and writes them back to the index. It can also write into its own scratch space on the file system and generate all new records.

An extension consists of three components: configuration, factory, and runtime.

Class 1. Configuration

The bean-like configuration class provides setters and getters for strings, booleans, integers, and doubles. These are labeled with attributes and accessed through reflection. There’s then a method to validate a configuration that returns a list of errors as structured objects. I’m a big fan of immutable objects, so working with beans drives me crazy. They could use some more doc on concurrency and lifecycle order; as is, I was conservative and programmed defensively against changes in config.

Configuration is handled through an administrator interface. As I said, it’s bean-like.

Class 2. Factory

There is then a factory class with a method that returns the config class (so the admin interface can tell what kind of config to build for it). It also contains a method that takes an Endeca application context and configuration and produces a runtime application. The context provides services like logging, a path to local file space, and a hook into a pipe into which modified records may be sent.

Class 3. Runtime

The runtime simply provides a record visitor method. To write out changes, you grab the output channel from the context provided to the factory. There are also some lifecycle methods used as callbacks: interrupt processing, processing of records is complete, and final cleanup. You can still write out answers during the completion callback.

Endeca’s Demo Manipulator Extension

Endeca has great programmers and their Java API design was really clear. I love it when vendors follow standard patterns and idioms in their API designs. Especially when they use generics usefully.

The PDF developer doc’s still in progress, but their Javadoc’s mostly in place. What was really sweet is that they gave us a working demo extension program with all of its configuration, tests, and even mock objects for use in JUnit testing the entire framework without a complete install of Endeca’s platform. I’m so happy when someone sends me a Java package that unpacks then compiles with Ant without griping.

LingPipe Classifier CAS Manipulator Extension

The first extension I wrote is configured with a path to a serialized text classifier on the classpath. I then configured a list of field names (only strings are available, so I went with comma-separated values) from which to collect text, and a field name into which to write the result of classification. [Correction: 5 Nov 2009, Endeca let me know that they had this covered; if I declare the variables in the bean-like configuration to be list-like values, the reflection-based config system will figure it out. This is awesome. I always hate rolling in ad-hoc little parsing "languages" like CSV in config. It's just sooo hard to doc and code correctly.]

LingPipe Chunker CAS Manipulator Extension

The second extension is a chunker. It requires a path to a chunker. Optionally, it allows a sentence detector to be configured for preprocessing (most of our chunkers work better at the sentence level). It also optionally allows a dictionary (and tokenizer factory) to be specified for overriding the chunks found by the chunker. Then, a list of field names from which to read text. The output gets written into chunk-specific fields. Because a given field name can contain multiple values, you can keep the resulting spans separate.

Endeca’s Faceting

Endeca’s big on faceted search. You may be familiar with it from two of the best online stores, NewEgg and Amazon.

It’s easy to treat our classifier plugin output as a facet. For instance, classify documents by sentiment and now sentiment’s a facet. Do a search, and you’ll get a summary of how many positive and how many negative documents, with an option to restrict search to either subset.

It’s also easy to treat our chunker output as a facet. For instance, if you include a company name chunker, you’ll be able to use companies as facets (e.g. as NewEgg does with manufacturers, though documents may contain references to more than one company).

Buying Plugins

Drop Breck a line.

Now that I have my head around the bigger picture, it’s pretty easy to build these kinds of extensions. So if there’s something you’d like integrated into Endeca and you’re willing to pay for it, let us know.

Mihalcea (2007) Using Wikipedia for Automatic Word Sense Disambiguation

October 22, 2009 by lingpipe

In response to a recent thread on the LingPipe mailing list about using Wikipedia for a word-sense disambiguation corpus, I’d like to point out Rada Mihalcea’s 2007 NAACL paper:

The idea’s very simple. A Wikipedia page represents a single sense of an ambiguous phrase. Both the senses and text for training can be extracted from Wikipedia’s link structure.

Rada maintains the SenseEval web pages and serves on the advising committee for the bakeoffs, which have pretty much defined the field recently.

Leaning on Wikitext

The wikitext markup used for Wikipedia pages contains explicit disambiguation and mention-level information. Consider these examples from Rada’s paper:

  • In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty three, and entered private practice in Boston.
  • It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].

The double brackets “[[ ]]” enclose links with the item on the left being the disambiguated reference, e.g. “bar (law)” or “bar (music)” and the item on the right the ambiguous word, e.g. “bar”. When viewing the page, the text to the right of the vertical bar shows up.

The scale’s impressive — there were 1108 examples of “bar” used as link text with 40 different categories, and this was a few years ago.

Rada extracted paragraphs around the mentions for training data. These are clearly marked in the Wikitext, so nothing fancy needs to be done for paragraph extraction (beyond the pain of Wikitext parsing itself).

What About Disambiguation Pages?

Wikipedia is full of disambiguation pages, which Wikipedia itself describes as:

Disambiguation in Wikipedia is the process of resolving the conflicts that occur when articles about two or more different topics could have the same “natural” page title. This category contains disambiguation pages: non-article pages containing links to other Wikipedia articles and disambiguation pages.

For instance, the Wikipedia page for “bar”, lists an astounding number ambiguous meanings for the word. There are 11 popular usages (e.g. bar serving alcohol, bar of gold), 5 math and science usages (e.g. unit of pressure, bar chart), 4 uses in the law, and many other usages, some which appear under other categories.

I’d think the pages linked to would be useful for training. But Rada didn’t use them, listing reasons:

  1. mismatch between disambiguation pages and usage, causing precision problems from links in a disambiguation page that are never referenced with the word and recall problems from some usages not showing up on the disambiguation page,
  2. inconsistencies in the naming of disambiguation pages

The first issue is still problematic, though the recall side of it seems to be improving. The second issues is also still problematic. Although the category helps find such pages, Wikipedia’s formatting is still more like natural language than a database.

Evaluation

As you might expect, it worked pretty well. Rada evaluated a naive Bayes classifier with features including contextual words with part-of-speech tags.

Coding it in LingPipe

This is something you could easily replicate in LingPipe. You could use one of the naive Bayes parsers. Or any of the other classifiers we consider in our word-sense disambiguation tutorial, including K-nearest neighbors and character or token-based language model classifiers.

If you need part-of-speech tags, check out our part of speech tagging tutorial.

With naive Bayes and easily extracted unsupervised data, you could also use EM semi-supervised training, as described in our EM tutorial.

You could also use a discriminative linear classifier, as described in our logistic regression tutorial.

The hard part’s all the data munging for Wikipedia. If you build a parser in Java and want to share, I’d be happy to link to it from here or host it in our development sandbox.

Learn Measure Theory for Under US$10

October 21, 2009 by lingpipe

My friend and former colleague Christer Samuelsson sent me a present a while back (if you send me presents, you may be mentioned here too, though I don’t have mugs for you):

Kolmogorov at blackboard

Yes, it’s written by the Kolmogorov, the man who invented (discovered, if you’re a Platonist) measure theory. Not content with laying the foundations for probability, Kolmogorov also kicked off the study of algorithmic information theory through what later became known as Kolmogorov complexity, and I can also highly recommend Li and Vitanyi’s book on that subject, though I’ve only read the introductory bits.

You can check out the contents by following the Amazon link above. I’ll summarize by saying it covers the basics, from set theory to metric topology, including a long discussion of linear spaces and operators, a chapter on measure, and sections on Lebesgue integration in the first integration chapter and the Stieltjes generaliation later (really cool, in that it lets you merge integration and summation).

It presupposes the requisite “mathematical sophistication”, which really means you’ll be hard pressed to read more than a page every 10 or 15 minutes. Seriously, though, if you didn’t understand calc in high school (or college), this probably isn’t the book for you. In fact, if you haven’t studied higher math at all, this probably isn’t the place to start. Having said that, I hated calc in school, and only felt I understood it after doing analysis and topology, which cranked the level of abstraction beyond {\mathbb R}^n and Euclidean distances. It’d help to have some background in abstract algebra and set theory, too, though everything is explained clearly (though quickly) from first principles.

I don’t know whether to thank Kolmogorov, his co-author, or his translator, but this book is wonderfully clear. The examples are direct, the notation’s clear, and it’s fairly easy to use modularly (e.g. just to look up what a measure or σ-algebra is).

Happy 95th Birthday, Martin Gardner

October 20, 2009 by lingpipe

Martin Gardner, who turns 95 today, was a huge influence on me.

I devoured his books and Scientific American “Mathematical Games” columns in junior high, high school, and even well into college.

There’s a nice profile of Martin Gardner by John Tierny in today’s New York Times. (Already cited in four places by the Wikipedia article linked above.)

Being now into my forties and trying to figure out what to do with my life, I’m fascinated by the fact that Gardner started at age 42 with no background in math other than a love of puzzles!

I’m not surprised a journalist would call “recreational mathematics” an oxymoron.

We already knew Gardner liked the A-HA style of puzzle — his books were full of them, requiring no real math knowledge to understand. I wonder what he’d have done if he grew up on programming puzzles rather than mathematical ones? We’d probably have more interesting, but equally ridiculous, programmer interview quizzes.

Examples of “Futility of Commenting Code”

October 19, 2009 by lingpipe

Continuing on from the previous blog entry, The Futility of Commenting Code, I’d like to address the dissenting comment of Sandman, who claims to have only seen the kind of caricature comments I made up in posts about commenting. Well, here goes reality.

A Real Example

Consider the following real example, which I chose because I knew the source for Apache projects was available online, and suspecting Apache Commons would be less heavily checked than more popular projects. I dove into the e-mail package, because it’s the first one I actually use. I found this example right away:

pieces of which I repeat here:

 /**
 * Create a datasource from a String.
 *
 * @param data A String.
 * @param aType A String.
 * @throws IOException IOException
 */
public ByteArrayDataSource(String data, String aType) throws IOException {

    this.type = aType;
    try {
        baos = new ByteArrayOutputStream();

        // Assumption that the string contains only ASCII
        // characters! Else just pass in a charset into this
        // constructor and use it in getBytes().
        baos.write(data.getBytes("iso-8859-1"));
        baos.flush();
        baos.close();
    } catch (UnsupportedEncodingException uex) {
        throw new IOException("The Character Encoding is not supported.");
    } finally {
        if (baos != null) {
            baos.close();
        }
    }
}

I wasn’t talking about API comments, but these display the same problem. This public constructor is documented with “Create a datasource from a String”, but in fact, there are two string parameters, both documented as “a String”. That’s what the delcaration says, so this is exactly the kind of redundant comment I was talking about.

Next, consider the one code comment. It starts on the wrong foot, with “Assumption that the string contains only ASCII characters!”. If you look at the method call, data.getBytes("iso-8859-1"), you’ll see that it’s actually more general than documented, in that it’ll work for any ISO-8859-1 characters (aka Latin1). The second part of the comment, “Else just pass in a charset into this constructor and use it in getBytes()” makes no sense, because the bytes are hard coded, and there is no char encoding argument to the constructor.

Furthermore, the catch block (with its throw) should just be removed. It’s catching an UnsupportedEncodingException, which extends IOException, then throwing a fresh IOException. It should just be removed, at which point an unsupported encoding throws an unsupported encoding exception; you don’t even need to change the signature, because unsupported encoding exceptions are kinds of I/O exceptions. There are two problems with the code as is. First, the the IOException reports an unhelpful message; the unsupported encoding exception has the info on what went wrong in its message. Second, it’s returning a more general type, making it harder for catchers to do the right thing. You might also consider the fact that the original stack trace is lost a problem.

Another instance of (almost) commenting the language is later in the same file:

try {
    int length = 0;
    byte[] buffer = new byte[ByteArrayDataSource.BUFFER_SIZE];

    bis = new BufferedInputStream(aIs);
    baos = new ByteArrayOutputStream();
    osWriter = new BufferedOutputStream(baos);

    //Write the InputData to OutputStream
    while ((length = bis.read(buffer)) != -1) {
        osWriter.write(buffer, 0, length);
    }
    osWriter.flush();
    osWriter.close();
} ...

I’d argue comments like “Write the InputData to OutputStream” are useless, because this is the idiom for buffered writes.

Aside on Bad Code

Maybe you think you should always buffer streams. In this case, that’s wrong. Both bufferings simply cause twice as many assignments as necessary.

Buffering the input stream is useless because you’re already reading into the byte array buffer.

Buffering the output stream is useless, because you’re writing to a byte array output stream baos.

Furthermore, you don’t need to close or flush a byte array output stream, because both are no-ops (see the ByteArrayOutputStream javadoc).

Finally, the naming is wonky, because osWriter is not a writer, it’s an output stream.

This isn’t an isolated case. Other files in this package have similar problems with doc.

Another Example

While we’re picking on Apache Commons, I checked out another package I’ve used, fileUpload.

public class DiskFileUpload extends FileUploadBase {
    // ------------------------------------ Data members

    /**
    * The factory to use to create new form items.
    */
   private DefaultFileItemFactory fileItemFactory;

    // ------------------------------------ Constructors

That's the kind of pre-formatted useless stuff I was talking about. We know what a member variable and constructor look like in Java. There weren't any other code comments in that file.

In that same package, the fileupload.ParameterParser class has some, though, and I'm guessing they're of the kind that others mentioned as "good" comments, such as:

    ...
    // Trim leading white spaces
    while ((i1 < i2) && (Character.isWhitespace(chars[i1]))) {
        i1++;
    }
    ...

I'd argue that perhaps a method called trimLeadingWhiteSpace() implemented the same way would be clearer. But if you're not going to define new methods, I'd say this kind of comment helps. But always verify that the code does what it says it does; don't take the comment's word (or method name's word) for it.

I couldn't leave that file without commenting on their return-only-at-the-end-of-a-function style:

String result = null;
if (i2 > i1) {
    result = new String(chars, i1, i2 - i1);
}
return result;

More Examples

I thought only a couple wouldn't be convincing. So here's some links and examples:

public boolean isFull() {
    // size() is synchronized
    return (size() == maxSize());
}

Documenting what's clear from the code. (And nice section divider comments.)

 // Return the parser we already created (if any)
if (parser != null) {
    return (parser);
}
...
 } catch (RuntimeException e) {
    // rethrow, after logging
    log.error(e.getMessage(), e);
    throw e;
}

Yes, that's what return, log, and throw do.

public LogFactoryImpl() {
    super();
    initDiagnostics(); // method on this object

Yes, it really says "method on this object".

There's lots more, so I'll leave finding them as an exercise.

It's Not All Bad

There were useful code comments in those packages that explained how the code corresponded to an algorithm in Knuth, or how complex invariants were being preserved in complex loops. I'm really not saying don't comment code at all.

The Futility of Commenting Code

October 15, 2009 by lingpipe

I often read about novice programmers’ aspirations for a greater density of code comments, because they think it’s “professional”. I have news for you. Professional coders don’t comment their own code much and never trust the comments of others they find in code. Instead, we try to learn to read code and write more readable code.

API Documentation vs. Code Comments

I’m a big believer in clear API documentation. Java makes this easy with javadoc. It even produces relatively spiffy results.

But if you look at the LingPipe source code, you’ll find very few code comments.

Comments Lie

The reason to be very suspicious of code comments is that they can lie. The code is what’s executed, so it can’t lie. Sure, it can be obfuscated, but that’s not the same as lying.

I don’t mean little white lies, I mean big lies that’ll mess up your code if you believe them. I mean comments like “verifies the integrity of the object before returning”, when it really doesn’t.

A typical reason for slippage between comments and code is patches to the code that are not accompanied by patches to the comments. (This can happen for API doc, too, if you fiddle with your APIs.)

Another common reason is that the code author didn’t actually understand what the code was doing, so wrote comments that were wrong. If you really mess up along these lines, your code’s likely to be called out on blogs like Coding Horror or Daily WTF.

Most Comments Considered Useless

The worst offenses in the useless category are things that simply repeat what the code says.

// Set x to 3
x = 3;

It’s even better when embedded in a bloated Fortran-like comment block:

/******************************************
 ************* add1 ***********************
 ******************************************
 *  A function to add 1 to integers       *
 *  arguments                             *
 *     + n, any old integer               *
 *****************************************/
public int add1(int n) { return n+1; }

Thanks. I didn’t know what int meant or what n+1 did. This is a classic rookie mistake, because rookies don’t know the language or its libraries, so often what they do seems mysterious to them. For instance, commenting n >>> 2 with “shift two bits to right and fill in on the left with zeroes” may seem less egregious, but your reader should know that >>> is the unsigned shift operator (or look it up).

There is a grey line in the sand (I love mixed metaphors) here. When you’re pulling out techniques like those in Bloch and Gafter’s Java Puzzlers, you might want to reconsider how you code something, or adding a wee comment.

Eliminate, don’t Comment Out, Dead Code

I remember being handed a 30-page Tcl/Tk script at Bell Labs back in the 1990s. It ran some kind of speech recognition pipeline, because that’s what I was up to at the time. I picked it up and found dozens of methods with the same name, and lots of commented-out code. This makes it really hard to follow what’s going on, especially if whole blocks get commented out with /* ... */.

Please don’t do this. You should lLearn any of thea version control systems instead, like SVN.

When do I Comment Code?

I add comments in code in two situations. The first is when I wrote something inefficiently, but I know a better way to do it. I’ll write something like “use dynamic programming to reduce quadratic to linear”. This is a bad practice, and I wish I could stop myself from doing it. I feel bad writing something inefficient when I know how to do it more efficiently, and I certainly don’t want people reading the code to think I’m clueless.

I know only one compelling reason to leave comments: when you write code that’s not idiomatic in the language it’s written in, or when you do something for efficiency that’s obscure. And even then, keep the comments telegraphic, like “C style strings for efficiency”, “unfolds the E step and M step of EM”, “first step unfolded for boundary checks” or something along those lines.

Write Readable Code Instead

What you should be doing is trying to write code that’s more readable.

I don’t actually mean in Knuth’s literate programming sense; Knuth wants programs to look more like natural language, which is a Very Bad Idea. For one, Knuth has a terrible track record, bringing us TeX, which is a great typesetting language, but impossible to read, and a three-volume set of great algorithms written in some of the most impenetrable, quirky pseudocode you’re ever likely to see.

Instead, I think we need to become literate in the idioms and standard libraries of the programming language we’re using and then write literately in that language.

The biggest shock for me in moving from academia to industry is just how much of other people’s code I have to read. It’s a skill, and I got much better at it with practice. Just like reading English.

Unfortunately, effective programming is as hard, if not harder, than effective essay writing. Writing understandable code that’s also efficient enough and general enough takes lots of revision, and ideally feedback from a good code reviewer.

Not everyone agrees about what’s most readable. Do I call out my member variables from local variables with a prefix? I do, but lots of perfectly reasonable programmers disagree, including the coders for Java itself. Do I use really verbose names for variables in loops? I don’t, but some programmers do. Do I use four-word-long class names, or abbreviations? The former, but it’s another judgment call.

However you code, try to do it consistently. Also, remember that coding standards and macro libraries are not good places to innovate. It takes a while to develop a consistent coding style, but it’s worth the investment.

Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+

October 14, 2009 by lingpipe

I’ve finished up the first order linear-chain CRF tagger implementation and a bunch of associated generalizations in the tagging interface. Now it’s time to code up chunkers using CRFs, and I’m faced with the usual problem of how to encode chunks, which are about character spans, as taggings, which are sequences of tokens (words or other units) and tags (aka labels, categories).

An Example

Here’s an example of a tokenized string and its encoding using several standards:

Tokens IO BIO BMEWO BMEWO+
Yesterday O O O BOS_O
afternoon 0 O O O
, 0 O O O_PER
John I_PER B_PER B_PER B_PER
J I_PER I_PER M_PER M_PER
. I_PER I_PER M_PER M_PER
Smith I_PER I_PER E_PER E_PER
traveled O 0 O PER_O
to O O O O_LOC
Washington I_LOC B_LOC W_LOC W_LOC
. O O O O_EOS

IO Encoding

The simplest encoding is the IO encoding, which tags each token as either being in (I_X) a particular type of named entity type X or in no entity (O). This encoding is defective in that it can’t represent two entities next to each other, because there’s no boundary tag.

BIO Encoding

The “industry standard” encoding is the BIO encoding (anyone know who invented this encoding?). It subdivides the in tags as either being begin-of-entity (B_X) or continuation-of-entity (I_X).

BMEWO Encoding

The BMEWO encoding further distinguishes end-of-entity (E_X) tokens from mid-entity tokens (M_X), and adds a whole new tag for single-token entities (W_X). I believe the BMEWO encoding was introduced in Andrew Borthwick’s NYU thesis and related papers on “max entropy” named entity recognition around 1998, following Satoshi Sekine’s similar encoding for decision tree named entity recognition. (Satoshi and David Nadeau just released their Survey of NER.)

BMEWO+ Encoding

I introduced the BMEWO+ encoding for the LingPipe HMM-based chunkers. Because of the conditional independence assumptions in HMMs, they can’t use information about preceding or following words. Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information. This allows a different model to generate words after person entities (e.g. John said), for example, than generates words before location entities (e.g. in Boston). The tag transition constraints (B_X must be followed by M_X or E_X, etc.) propagate decisions, allowing a strong location-preceding word to trigger a location.

Note that it also adds a begin and end of sequence subcategorization to the out tags. This helped reduce the confusion between English sentence capitalization and proper name capitalization.

Transition and Feature Tying

Consider the BMEWO notation. At least for regular named-entity recognition in text, we probably don’t want to distinguish being preceded by an end-of-entity token labeled E_X from a whole entity token labeled W_X.

In HMMs, we’d tie transitions, so p(T|E_X) = p(T|W_X) for all tags T.

In CRFs, we could tie the feature generation process, so having either preceding tag (end or whole entity) produces the same features, and hence the same probability estimates.

Complexity

First-best decoding in a first-order CRF or HMM is quadratic in the number of tags (and linear in the number of tokens). Here’s a rundown:

Encoding # Tags N=1 N=3 N=20
IO N+1 2 4 21
BIO 2N+1 3 7 41
BMEWO 4N+1 5 13 81
BMEWO+ 7N+3 10 24 143

Legal Transition Pruning

It’s not quite as bad as number of tags squared in practice, because not every tag can follow every other tag in the more complex codings. Even in BIO encoding, I_X can only follow B_X. In BMEWO encoding, M_X and E_X can only follow B_X or I_X, and O can only follow O, E_X, or W_X.

Of course, this is a major pain to compute and specify in the API, as you need some way of figuring out legal transitions. For HMMs, I used the fact that transition probabilities were zero to rule out consideration. I’ll have to add some way to specify legal transitions if I implement something general for CRF tagger decoding.

Necessary for CRFs?

Jenny Finkel recently told me she really thought IO encoding is all you need in most cases. It’s how I coded a single-entity problem for the mechanical Turkers for NE annotation. As far as encodability, it’s fine for MUC, almost OK for Biocreative [there are a couple contiguous entities, I think], but degenerate in general. The really nice thing about this encoding is that for a single entity type, we can get away with a single feature vector.

The recent Ratinov and Roth NE survey concluded BMEWO was superior to BIO for a second-order online logistic model (basically what’s known as a MEMM, but with best guess at each token enforced going forward).

I found BMEWO+ worked best for single-entity and small entity set tagging for our HMMs.

What to Do for First-Order CRFs?

What I don’t know is what I should do for CRFs. I’ve settled on only using first-order CRFs, and not having feature extraction depend on the output tag (or more precisely, every feature is crossed with every output tag at every position).

I’m tempted to try to punt and write up a flexible TaggingChunkingCodec interface and let the user specify it. The real problem is that evaluation’s not simple in this context, because results vary pretty dramatically based on features. That’s why I love and hate CRFs.

Bayesian Counterpart to Fisher Exact Test on Contingency Tables

October 13, 2009 by lingpipe

I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.

Contingency Table

I’ll use the data in the example in the Wikipedia article on contingency tables:

Left-Handed Right-Handed TOTAL
Male 9 (y1) 43 52 (n1)
Female 4 (y2) 44 48 (n2)
Total 13 87 100

Hypothesis

The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.

Beta-Binomial Model

The observed data has y_1 = 9 left-handed men of a total of n_1 = 52 men and y_2 = 4 left-handed women of a total of n_2 = 48 women.

Letting \theta_1 \in [0,1] be the proportion of left-handed men and \theta_2 \in [0,1] the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,

y_1 \sim \mbox{\sf Binom}(n_1,\theta_1), and

y_2 \sim \mbox{\sf Binom}(n_2,\theta_2).

He then suggested simple uniform priors on the proportions, which we know are equivalent to \mbox{\sf Beta}(1,1) priors:

\theta_1 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1), and

\theta_2 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1).

Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:

p(\theta_1|y_1,n_1) = \mbox{\sf Beta}(\theta_1|y_1+1, n_1-y_1+1), and

p(\theta_2|y_2,n_2) = \mbox{\sf Beta}(\theta_2|y_2+1, n_2-y_2+1)

We could try to compute the posterior density of \delta = \theta_1 - \theta_2 analytically by solving the integral

p(\delta|y,n) = \int_{-\infty}^{\infty} \mbox{\sf Beta}(\theta|y_1+1,n_1-y_1+1) \mbox{\sf Beta}(\theta-\delta|y_2+1,n_2-y_2+1) d\theta

or we could just follow Andrew’s sage advice and estimate the integral by simulation.

R Code

I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.

# DATA
n1 = 52 # men
y1 = 9  # left-handed men
n2 = 48 # women
y2 = 4  # left-handed women

# SIMULATION
I = 10000 # simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1)
theta2 = rbeta(I, y2+1, (n2-y2)+1)
diff = theta1-theta2  # simulated diffs

# OUTPUT
quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995))
print(quantiles,digits=2)

print("Probability higher % men than women lefties:")
print(mean(theta1>theta2))

# VISUALIZATION
#png(file="bayesContingency.png")
plot(density(diff),
     xlab="theta1 - theta2",
     ylab="p(theta1 - theta2 | y, n)",
     main="Posterior Simulation of Male - Female Lefties",
     ylim=c(0,8),
     frame.plot=FALSE,cex.lab=1.5,lwd=3,yaxt="no")
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")
#dev.off()

Running this code prints out:

> source("twoByTwoPosterior.R")
  0.5%   2.5%    50%  97.5%  99.5%
-0.091 -0.046  0.084  0.218  0.262 

[1] "Probability higher % men than women lefties:"
[1] 0.901

[1] "Mean difference theta1-theta2"
[1] 0.08484979

The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The posterior central 95% quantile is thus (-0.046,0.218). Quantiles get the obvious interpretation in Bayesian stats — given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218).

The second number printed is the posterior probability that men have a higher percentage of left-handness than women, which is 0.901, or about 90%. R makes this simple by letting us write mean(theta1 > theta2). This is also the value you’d get from integrating the posterior density from 0 to \infty.

The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. This value is 0.0848, which is a bit higher than the median difference of 0.084.

The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference p(\theta_1 - \theta_2 | n, y).

Significant?

Don’t ask. You’re a Bayesian now.

Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of lefthandedness among men than women, or that there’s a 95% chance the difference between male prevalence of lefthandness and female prevalence of lefthandedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]

What’s Wrong with Probability Notation?

October 13, 2009 by lingpipe

[Update 21 October 2009: Check out Michael Collins's comment and my reply. Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. I replied that it's really the practice that's problematic, and I'm talking about the notation in Gelman et al.'s Bayesian Data Analysis or Blei, Ng and Jordan's LDA paper.]

What’s Wrong?

What’s wrong with the probability notation used in Bayesian stats papers? The triple whammy of

  1. overloading p() for every probability function,
  2. using bound variables named after random variables, and
  3. using the bound variable names to distinguish probability functions.

Probabilty Notation is Bad

The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule,

p(x|y) = p(y|x)p(x) / p(y),

where each of the four uses of p() corresponds to a different probability function! In computer science, we’re used to using names to distinguish functions. So f(x) and f(y) are the same function f applied to different arguments. In probability notation, p(x) and p(y) are different probability functions, picked out by their arguments.

Random Variables Don’t Help

As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we use the capitalized random variables Y, X to distinguish the distributions and bound variables x, y in the usual mathematical sense, disambiguating our random variables with

p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_{Y}(y).

When we have dozens of parameters in our multivariate densities, this gets messy really quickly. So practitioners fall back to the unsubscripted notation.

Great Expectations

The third issue appears in expectation notation (and in information theory notation for entropy, mutual information, etc.). Here, statisticans write x and y for random variables with the probability function and sample space implicit. The way you then see expectation notation written in applied Bayesian modeling is often:

{\mathbb E}[x] = \sum_x x p(x).

What’s really sneaky here is the use of x as a global random variable on the left side of the equation and as a bound variable on the right hand side. Distinguishing random variables with capital letters, this would look like

{\mathbb E}[X] = \sum_x x p_X(x).

Continuous vs. Discrete

The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities.

In Bayes’s rule, if p(x,y) is continuous in x, we’re meant to understand integration

p(x|y) = p(y|x) p(x) / \int_{-\infty}^{\infty} p(y|x) p(x) dx,

in place of summation

p(x|y) = p(y|x) p(x) / \sum_x p(y|x) p(x).

Similarly, for expectations of continuous densities, we write

{\mathbb E}[x] = \int_{-\infty}^{\infty} x p(x) dx

or if we’re being very careful with random variables,

{\mathbb E}[X] = \int_{-\infty}^{\infty} x p_X(x) dx.

Intros to probability theory often use f(x) for continuous probability density functions and reserve p(n) for discrete probability mass functions. They’ll start with notation P(A) (or \mbox{Pr}(A)) for the event probability function.

Samples Spaces to the Rescue?

In applied work, we rarely, if ever, need to talk about the sample space \Omega, measures P from subsets of \Omega to [0,1], or actually consider a random variable X as a function from \Omega to {\mathbb R}. I don’t recall ever seeing a model defined in this way.

Instead, we typically construct multivariate densities modularly by combining simpler distributions. For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as:

p(y,\theta|n,\alpha,\beta) = \mbox{Beta}(\theta|\alpha,\beta) \prod_i \mbox{Binom}(y_i|n_i,\theta_i)

or in sampling notation by stating y_i \sim \mbox{Binom}(n_i,\theta) and \theta \sim \mbox{Beta}(\alpha,\beta). In fact, that’s just how it gets coded in the model interpreter BUGS (Bayesian inference using Gibbs sampling) and compiler HBC (hierarchical Bayes compiler).

Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? We could start with a joint density function p(y) over a vector y \in {\mathbb R}^n, and consider projections of {\mathbb R}^n in lieu of random variables. This we can fully specify with high-school calculus.

If we use consistent naming for the dimensions, we can get away without ever formally definining a random variable. In practice, it’s not really that hard to keep our sampling distributions p(y|\theta) separate from our posteriors p(\theta|y), even if we write them both as p(\cdot | \cdot).

Technically, we could take \Omega = {\mathbb R}^n as the sample space and then define the random variables as projections. But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities.

Lambda Calculus to the Rescue?

Maybe we can do better. We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus.

For instance, we could define a discrete bivariate p:({\mathbb N} \times {\mathbb N}) \rightarrow [0,1] for reference. We could then distinguish the marginals p_1,p_2:{\mathbb N} \rightarrow [0,1], using

p_1 = \lambda n. \sum_{m \in {\mathbb N}} p(n,m) instead of p(n), and

p_2 = \lambda m. \sum_{n \in {\mathbb N}} p(n,m) instead of p(m).

We could similarly distinguish the conditionals, writing

p_{1|2} = \lambda n . \lambda m. p(n,m) / p(m) instead of p(x|y), and

p_{2|1} = \lambda m . \lambda n. p(n,m) / p(n) instead of p(y|x).

Bayes’s rule now becomes:

p_{1|2} = \lambda n . \lambda m . p_{2|1}(n,m) p_1(m) / p_2(n).

Clear, no?

Perhaps Not

OK, so maybe the statisticians are onto something here with their sloppy notation in practice.

All attempts to distinguish function names in stats only seem to make matters worse. This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next.