Archive for the ‘LingPipe News’ Category

LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources, which is where LingPipe fits in. The guys (Conall and Joe) fit a pattern that we see quite often: smart people with a good idea that presupposes NLP, but without much experience with it.

The GetSeer guys were on the ball right away because they had a bunch of data annotated and had cobbled together a regular expression based classifier that served as an excellent starting point. We had machine learning based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.


The great thing about our working setup is that I get to teach/advise, which keeps my time commitment manageable, and they get to do most of the heavy lifting, which is how they learn to build and maintain NLP systems. I have had to learn some Scala, since I co-code most of the time when we work together and Scala is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

  • There is a small amount of training data.
  • Current results are awful (no surprise).
  • We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.
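For the record, "roughing in the solution with Naive Bayes" looks something like the sketch below. It leans on my memory of the LingPipe API (TradNaiveBayesClassifier, IndoEuropeanTokenizerFactory), and the categories and training strings are invented for illustration, so treat it as a shape rather than Seer's actual code.

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.TradNaiveBayesClassifier;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

import java.util.HashSet;
import java.util.Set;

public class NaiveBayesSketch {
    public static void main(String[] args) {
        Set<String> categories = new HashSet<String>();
        categories.add("actionable");   // made-up categories for illustration
        categories.add("other");

        // tokenized naive Bayes over character sequences
        TradNaiveBayesClassifier classifier
            = new TradNaiveBayesClassifier(categories,
                                           IndoEuropeanTokenizerFactory.INSTANCE);

        // training: one Classified<CharSequence> per labeled example
        classifier.handle(new Classified<CharSequence>(
            "set up a call with the team tomorrow",
            new Classification("actionable")));
        classifier.handle(new Classified<CharSequence>(
            "what a gorgeous sunset",
            new Classification("other")));

        // classification of new text
        System.out.println(classifier.classify("call tomorrow?").bestCategory());
    }
}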

“Sentiment Takes a LONG Time”, Said the Logistic Regression Classifier

August 12, 2011

Hello again, fellow LingPipers!  It is I, yet again, Sam The Intern.  In my last (and first) post I promised the masses that I would return with a study similar to the last one that I conducted, but this time doing the non-trivial case of sentiment.  It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream.  As my titular character, the logistic regression classifier, so concisely said before, sentiment takes a long time!  And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.

As you may remember from my first post, I absolutely love to annotate.  However, annotating for sentiment takes much more time than language identification, because the annotator needs to internalize and judge the text, not just glance at it.  Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern that has been product-conditioned to Coca-Cola, but no time to do anything about it (reading 1000 tweets about how much people love coke will do that to you).

I was able to start the study, with small amounts of success.  The following confusion matrix is the result of  a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”).  The top of the chart is the reference data which I annotated, and the side is the response data that the classifier provided.  The categories are: w = neutral, e = positive, q = negative.

       w      e      q
  w  1244    239     50
  e   313    312     23
  q   199     45     23
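Reading the matrix off directly: for the neutral category (w), precision is 1244/(1244+239+50) ≈ 0.81 and recall is 1244/(1244+313+199) ≈ 0.71, while the positive (e) and negative (q) rows fare considerably worse.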

I wish that I had more time to spend on the sentiment classification problem.  Perhaps I will be back next summer, to continue the effort to teach computers good from bad.  Until then, I will be at Rochester, college-ing!  And drinking massive quantities of Coke…

LingPipe and JDK 7

July 28, 2011

[Update: 29 July 2011: Thanks to Dean Jones for pointing out that our friends at Lucid Imagination have found a deal-breaking bug in the initial JDK 7's optimization of loops. Here's the scoop from the source:

In any case, we won't be using any of the JDK 7 features in LingPipe until JDK 6 is retired.]
 

Oracle (seems odd to say that) just released version 7 of the Java Development Kit, aka JDK 7.

What’s in JDK 7?

I haven’t been following what was going into JDK 7. I found this comprehensive list of changes:

This includes Project Coin, which involved changes to the language itself. These are summarized nicely with examples at:

There were also changes to add module structure, some concurrency changes, some new I/O (nio) changes, support for Unicode 6 in internationalization, support for the latest XML, a new version of JDBC, a bunch of changes to Swing, and some changes to things like cryptography that I didn’t even know were in Java in the first place.

LingPipe Compatibility

The good news is that LingPipe seems to work just fine with Java 7. At least all of our unit tests pass.

There are a couple of new compiler warnings. Variable-length argument lists (i.e., varargs) and generics don’t play nicely together. This should be obvious from the fact that varargs use an array and you can’t have generic arrays. So now I get very menacing warnings like

warning: [unchecked] Possible heap pollution from parameterized vararg type E

I’m a bit surprised they didn’t add this warning earlier, as it seems like a well understood and commonly made mistake. There’s a nice discussion of the underlying problem due to type erasure and arrays here:

And also a very nice discussion with great examples here:

From Kalanger’s discussion, I realized that everywhere I got these warnings was actually safe, as the arrays in question didn’t persist beyond the scope of the function body, so users couldn’t get at them.
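For concreteness, here is a small made-up example of the kind of declaration that triggers the warning, along with JDK 7’s new @SafeVarargs annotation for marking the provably safe cases:

import java.util.Arrays;
import java.util.List;

public class VarargsWarningDemo {

    // Compiling this with javac 7 produces:
    //   warning: [unchecked] Possible heap pollution from parameterized vararg type E
    // because the E... parameter is backed by a plain Object[] at runtime.
    static <E> List<E> toList(E... elements) {
        return Arrays.asList(elements);
    }

    // The usage above is actually safe, since the varargs array never escapes
    // the method body; JDK 7's new @SafeVarargs annotation documents that fact
    // and suppresses the warning.
    @SafeVarargs
    static <E> List<E> toListQuietly(E... elements) {
        return Arrays.asList(elements);
    }

    public static void main(String[] args) {
        System.out.println(toList("a", "b", "c"));      // [a, b, c]
        System.out.println(toListQuietly(1, 2, 3));     // [1, 2, 3]
    }
}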

“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

July 25, 2011

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously air-conditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performance of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try and explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case English and non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s question comes in: is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke”, “Clinton”, etc.)? What we found is that either classifier may do better at language identification, depending on the circumstances.

Data:

Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved, and duplicates were removed by requiring a Jaccard distance of at least .5 between any two accepted tweets. The effect of this is that any tweet with more than 50% token (word) overlap with a tweet already in the accepted collection was rejected from the corpus (our collection of tweets to be used in the experiment).
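The dedup criterion is simple enough to sketch in plain Java; this is just an illustration of the rule described above, not the exact code we ran (LingPipe also provides a Jaccard distance class, if I remember right).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDedup {

    // Jaccard distance = 1 - |A intersect B| / |A union B| over whitespace tokens
    static double jaccardDistance(String a, String b) {
        Set<String> tokensA = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<String>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<String>(tokensA);
        union.addAll(tokensB);
        return 1.0 - ((double) intersection.size()) / union.size();
    }

    // reject a tweet if it overlaps too much with anything already accepted
    static boolean isDuplicate(String tweet, Iterable<String> accepted) {
        for (String other : accepted)
            if (jaccardDistance(tweet, other) < 0.5)
                return true;
        return false;
    }
}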

After that came my absolute favorite part: annotation. I annotated 500 tweets per query as either English (‘e’) or non-English (‘x’), for 5 queries (a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries two weeks later and annotated those tweets as well to form the second epoch of data (for a cumulative total of 5,000 tweets). We checked for duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide whether they were identical.

To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data.  She annotated 400 tweets of each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01% with one annotator serving as truth and the other the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated resulting in complete agreement.

Process:

Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), we trained and tested on the data with nGram sizes 1 through 15.

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.

This procedure was repeated for each of the five Twitter queries.
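For readers who want to see roughly what that looks like in code, here is a stripped-down sketch of the Part 1 loop. The class and method names (DynamicLMClassifier, JointClassifierEvaluator) are from my memory of the LingPipe 4 classification API, the corpus plumbing is omitted, the labeled() helper is just for illustration, and it prints overall accuracy rather than the per-category and micro-averaged f-measures we actually recorded, so check the javadoc before copying it.

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.classify.JointClassifierEvaluator;
import com.aliasi.lm.NGramBoundaryLM;

import java.util.List;

public class CustomLanguageIdSketch {

    static final String[] CATEGORIES = { "e", "x" };  // English, non-English

    // train on one epoch of a query's tweets, test on the other epoch
    static void runQuery(List<Classified<CharSequence>> training,
                         List<Classified<CharSequence>> testing) {
        for (int nGram = 1; nGram <= 15; ++nGram) {
            DynamicLMClassifier<NGramBoundaryLM> classifier
                = DynamicLMClassifier.createNGramBoundary(CATEGORIES, nGram);
            for (Classified<CharSequence> tweet : training)
                classifier.handle(tweet);

            JointClassifierEvaluator<CharSequence> evaluator
                = new JointClassifierEvaluator<CharSequence>(classifier, CATEGORIES, false);
            for (Classified<CharSequence> tweet : testing)
                evaluator.handle(tweet);

            System.out.println("nGram=" + nGram
                + "  accuracy=" + evaluator.confusionMatrix().totalAccuracy());
        }
    }

    // wrapping an annotated tweet for training or evaluation looks like this
    static Classified<CharSequence> labeled(String text, String category) {
        return new Classified<CharSequence>(text, new Classification(category));
    }
}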

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, for the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training data consisted of all the other data minus the 500 tweets to be tested on and the 500 tweets retrieved at an earlier epoch with the same query. All of this data, a total of 4,000 tweets for any given test set, was put into a list. Five hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested using the same nGram looping structure as in the customized training process.
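The random selection step is nothing fancy; a sketch of the idea (not the exact code we used) is just a shuffle and a slice:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainingSampler {

    // pool = the 4,000 eligible training tweets for a given test query;
    // returns a random subset of the requested size (500 in our runs)
    static <T> List<T> sample(List<T> pool, int size, Random random) {
        List<T> copy = new ArrayList<T>(pool);
        Collections.shuffle(copy, random);
        return copy.subList(0, Math.min(size, copy.size()));
    }
}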

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.
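Structurally, Part 3 is just one more loop wrapped around the Part 1 and 2 procedure. In sketch form, where trainAndEvaluate() stands in for that procedure and the 500-tweet increment is only illustrative:

// trainAndEvaluate(corpusSize, nGram) is a placeholder for the train/test/evaluate
// procedure from Parts 1 and 2, returning the micro-averaged f-measure.
static void corpusSizeSweep() {
    for (int corpusSize = 500; corpusSize <= 4000; corpusSize += 500) {
        double bestFMeasure = 0.0;
        for (int nGram = 1; nGram <= 15; ++nGram)
            bestFMeasure = Math.max(bestFMeasure, trainAndEvaluate(corpusSize, nGram));
        System.out.println(corpusSize + "\t" + bestFMeasure);
    }
}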

Results:

Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-English. However, on the query “Mitsubishi”, the non-English classifications performed better. Of the four queries for which English performed better, two (“Clinton” (Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for English than for non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic


Figure 3 - Gaga Generic


Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%.

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was run again at nGram size 4, 50 times. The sample means of these experiments’ f-measures were graphed, with 95% confidence error bars (Figures 11 and 12).

Figure 11 - Non-English Classifier Comparison

Figure 12 - English Classifier Comparison

Part 3

When we graphed the generic classifiers’ performance versus the size of the training data, we found that the English classifications’ F-Measure increased slightly with more training data, while the non-English classifications’ F-Measure increased dramatically (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance

Discussion:

Parts 1+2

We came to the conclusion that neither generic nor customized training regimens are necessarily better than the other, because each one performed better in different circumstances. We believe that the non-English classifications for Mitsubishi scored higher than the English classifications because the non-English category was larger than the English category and also very coherent (mostly Asian writing, so that each category contains either Roman characters or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.

Part 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad Non-English category, it’s difficult for the classifier to identify what makes a tweet Non-English at low corpus-sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

Our data balance for each test is as follows (for the generic classifiers, the balance of the 4,000-tweet list is shown):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

Sorting Primitive Types in Java

July 14, 2011

One of the big pains of Java is the lack of built-in collection-like utilities for primitive types like int. As a result, you need to box all the integers, which leads to huge bloat in terms of size, and because of memory locality and dereferencing issues, a huge slowdown to boot.

Oddly, more than one person at ICML told me they were impressed with IBM’s tuning on Watson (their champion-level Jeopardy! playing system Breck blogged about recently) because they reimplemented search and sorting!

C’mon people, this isn’t that hard. In fact, as I was telling our intern today, it’s basic CS 101. At least it was back in the 1980s when I was an undergrad. I have no idea what the kids are up to these days.

Primitive int Sorts for LingPipe Suffix Arrays

As I was implementing suffix arrays for LingPipe 4.1, I realized I’d need to implement sorting over primitive int types for both efficiency and scalability. So I designed everything to let me plug in a better sort implementation when I got some time. Well, I got some time today, so here goes.

It’s also not that rare a problem. There are open-source primitive collection libs available from GNU Trove (GPL, of course). And while not geared toward sorting, there’s the Apache Primitives lib which gives you collection-like behavior with primitives (also something LingPipe could benefit from in several places).

Gift Horses and Strings

Ah, the GPL. It giveth and it taketh away.

In general, one shouldn’t look a gift horse in the mouth. But what if there are strings attached? (Sorry, but I just love mixed metaphors, even if they are rather fixed idioms; I laugh out loud at the New Yorker “block that metaphor” quotes from news sources.)

We can hardly drop GPL-ed code into LingPipe (in fact, as of many years ago, at the behest of our customers, we removed all code dependencies from LingPipe, except for Java itself). So that means it’s time to saddle up (hard to turn off that metaphor machine) and write some really low-level code.

I actually find this kind of thing more fun than a slap in the face with a wet fish. (Last one — I just had to drop in a bit of the old litotes).

Programming Pearls is Your Friend

The best book I’ve ever read on designing algorithms for invariants and then profiling and tuning is:

  • Bentley, Jon. 2000. Programming Pearls, 2nd Edition. Addison-Wesley.

It’s so good that Josh Bloch used it as the basis for the Java collections, as he outlines in this awesome blog post on a related topic, also related to Bentley’s search algorithms and his own experience as a CS student at CMU, where Bentley was teaching:

I just went for Bentley’s qsort4, which is dead easy. It took only half the afternoon to write and unit test the whole thing with random tests, leaving plenty of time for this blog entry (enough rope to hang myself, one might say).

Show, Don’t Tell

How easy is it? Easy peasy isn’t the half of it. Here’s the code for com.aliasi.util.Sort, other than the package declaration, imports, and doc.

public static void isort(int[] x, CompareInt compare) {
    for (int i = 0; i < x.length; ++i) {
        int t = x[i];
        int j = i;
        for ( ; j > 0 && compare.lessThan(t,x[j-1]); --j)
            x[j] = x[j-1];
        x[j] = t;
    }
}

public static void qsort(int[] x, CompareInt compare) {
    Random random = new Random();
    qsortPartial(x,0,x.length-1,compare,random);
    isort(x,compare);
}

static void qsortPartial(int[] x, int lower, int upper,
                         CompareInt compare,
                         Random random) {
    if (upper - lower < MIN_QSORT_SIZE)
        return;
    swap(x, lower, lower + random.nextInt(upper-lower+1));
    int t = x[lower];
    int i = lower;
    int j = upper + 1;
    while (true) {
        do {
            ++i;
        } while (i <= upper && compare.lessThan(x[i],t));
        do {
            --j;
        } while (compare.lessThan(t,x[j]));
        if (i > j)
            break;
        swap(x,i,j);
    }
    // put the pivot into its final position, then recurse on both sides,
    // leaving runs shorter than MIN_QSORT_SIZE for the final isort() pass
    swap(x, lower, j);
    qsortPartial(x, lower, j-1, compare, random);
    qsortPartial(x, j+1, upper, compare, random);
}

public static void swap(int[] xs, int i, int j) {
    int temp = xs[i];
    xs[i] = xs[j];
    xs[j] = temp;
}

public interface CompareInt {
    public boolean lessThan(int a, int b);
}

static int MIN_QSORT_SIZE = 7;

Thanks, Jon!

I don’t think this needs much explanation. There’s a comparator-like interface CompareInt to implement comparisons. Bentley’s quicksort uses Robert Sedgewick‘s speedups for worst case via randomization of the pivot. It also uses insertion sort, also shown, for small cases, as it’s faster with memory locality. This is more or less a symbol-by-symbol translation of Bentley’s code, with a general plug-in comparator.
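As a quick usage example (mine, not from the LingPipe distribution), here’s how you’d plug in a comparator to sort an array in descending order:

// sort in descending order by reversing the comparator's sense of "less than"
int[] xs = { 3, 1, 4, 1, 5, 9, 2, 6 };
Sort.qsort(xs, new Sort.CompareInt() {
        public boolean lessThan(int a, int b) {
            return a > b;
        }
    });
// xs is now { 9, 6, 5, 4, 3, 2, 1, 1 }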

Look for it wherever high quality software is sold (or given away).

Randomized Unit Tests

I’ve also started moving to more large-scale randomized unit tests. Here’s what I did for testing sorts.

@Test
public void testSort() {
    Random random = new Random();
    for (int i = 0; i < 129; ++i) {
        int[] xs = new int[i];
        int[] ys = new int[i];
        int[] zs = new int[i];
        for (int numTests = 1; numTests < 10; ++numTests) {
            randomFill(xs,random);
            copy(xs,ys);
            copy(xs,zs);
            Arrays.sort(xs);
            Sort.qsort(ys, Sort.NATURAL_INT_COMPARE);
            Sort.isort(zs, Sort.NATURAL_INT_COMPARE);
            assertArrayEquals(xs,ys);
            assertArrayEquals(xs,zs);
        }
    }
}

static void randomFill(int[] xs, Random random) {
    for (int i = 0; i < xs.length; ++i)
        xs[i] = random.nextInt();
}

static void copy(int[] xs, int[] ys) {
    // copy the source array xs into the target array ys
    for (int i = 0; i < xs.length; ++i)
        ys[i] = xs[i];
}

The built-in Sort.NATURAL_INT_COMPARE just implements Sort.CompareInt with the natural sort.

public static final CompareInt NATURAL_INT_COMPARE
    = new CompareInt() {
            public boolean lessThan(int a, int b) {
                return a < b;
            }
        };

That’s my afternoon in a nutshell (sorry, just one more (idiom) for the road).

Drafts 0.5 of Books on LingPipe and Java Text

June 27, 2011

We’ve released draft 0.5 of the LingPipe books. Get them while they’re hot from their home page:

Two Books for the Price of One

For version 0.5, we’ve broken out the bits about text processing in Java that aren’t LingPipe specific into their own book. And we’ve slightly changed the title of the LingPipe book to make this clearer.

Authors Piling On

Breck’s going to help out with both books.

Mitzi’s agreed to help us out with the text processing book now that she’s back in the bio(medical) text processing business after several years of working on genomics.

Comments, Please

As always, your comments are appreciated.

LingPipe 4.1.0 Released

June 27, 2011

We just rolled out LingPipe 4.1.0. It’s at the usual place:

Suffix Arrays

The reason for the big number change is that we added suffix arrays. You’ll find them in a new package, the doc for which is at:

We’re about to release the next version of the LingPipe book, which has a chapter on suffix arrays.

Patches and Extensions

All of the patches discussed on the mailing list have been applied. This includes making the language model serialization more general and consistent, fixing some bugs with distant points in hierarchical clustering, and providing access methods for TF/IDF statistics.

Price-is-Right Binary Search (for Suffix Arrays of Documents)

May 27, 2011

Suffix Arrays

Although this isn’t really the topic of the post, I’m adding suffix arrays to LingPipe at both the character and token level. Suffix arrays are one of those magic data structures with too-good-to-be-true-but-it-really-is-true performance. Here, we care about their ability to find arbitrarily long substring matches in text efficiently.

For instance, suffix arrays are useful if you’re looking for plagiarized passages in a pile of writing assignments, cut-and-paste code blocks in a large project, or just commonly repeated phrases on Twitter.

I’m representing a collection of documents as a single long text with distinguished break tokens (or characters) to separate the documents. In building the text, I also build an array holding the start positions of each document in the big text (in ascending order). Then I retrieve positions in the large text from the suffix array, leaving me with the problem of finding the document that spans a given position.

Price-is-Right Search: Largest without Going Over

I need what I think of as a Price-is-Right search — you want the index in the array of positions that is largest without going over.

For example, suppose I have an array of starting positions starts = { 0, 12, 19, 113, 492 }. And I want to know which document spans character 22. The answer is the document spanning characters 19 (inclusive) to 113 (exclusive), or index 2 in the start array.

The Algorithm

It’s surprisingly hard to get these trivial little algorithms right. It’s one of the things I learned to do much better when practicing on TopCoder, which is full of these kinds of algorithms. And these make great interview questions. The ability to code this kind of algorithm quickly is really the limiting factor in most coding projects. I’m still amazed thinking of how fast the TopCoder winners are at this kind of thing.

It took me about an hour to get it working and tested. The trick, as with all these algorithms, is to stay on top of your edge cases and handle the general case with a loop maintaining careful loop invariants. That actually makes it sound harder than it is.

/**
 * Given an increasing array of values and a specified value,
 * return the largest index into the array such that the
 * array's value at the index is smaller than or equal to the
 * specified value.  Returns -1 if there are no entries in the
 * array less than or equal to the specified value.
 *
 * Warning: No test is made that the values are in
 * increasing order.  If they are not, the behavior of this
 * method is not specified.
 *
 * @param vals Array of values, sorted in ascending order.
 * @param val Specified value to search.
 */
public static int largestWithoutGoingOver(int[] vals,
                                          int val) {
    if (vals.length == 0)
        return -1;
    if (val < vals[0])
        return -1;       // every entry goes over
    int start = 0;
    int end = vals.length;
    if (val >= vals[end-1])
        return end - 1;
    // invariant: vals[start] <= val < vals[end] (vals[end] taken as infinity)
    while (start + 1 < end) {
        int mid = (start + end) / 2;
        // invariant: start < mid < end
        if (val < vals[mid])
            end = mid;
        else if (val > vals[mid])
            start = mid;
        else
            return mid;
    }
    return start;
}

There are even two comments in the code, explaining the invariants.
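Plugging in the document-start example from above gives the expected answer:

// which document spans character position 22?
int[] docStarts = { 0, 12, 19, 113, 492 };
int docIndex = largestWithoutGoingOver(docStarts, 22);   // returns 2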

Where would we be without (J)unit tests?

@Test
public void testPriceIsRightEmpty() {
    int[] vs = { };
    assertEquals(-1,largestWithoutGoingOver(vs,3));
}

@Test
public void testPriceIsRight2() {
    int[] vs = { 0, 17, 23, 152, 153, 190 };
    assertEquals(-1, largestWithoutGoingOver(vs,-10));
    for (int k = 0; k + 1 < vs.length; ++k)
        for (int i = vs[k]; i < vs[k+1]; ++i)
            assertEquals(k, largestWithoutGoingOver(vs,i));
    assertEquals(vs.length-1,
                 largestWithoutGoingOver(vs,190));
    assertEquals(vs.length-1,
                 largestWithoutGoingOver(vs,195));
    assertEquals(vs.length-1,
                 largestWithoutGoingOver(vs,Integer.MAX_VALUE));
}

LingPipe 4.0.1 Patch Release

October 21, 2010

I just released LingPipe version 4.0.1. As usual, you can read more about and download the latest version from the

The latest version of the book is soon to follow, with much more LingPipe-specific content.

Release Notes

  1. fixed some bugs and brought definitions in line with standards for classify.ScoredPrecisionRecallEvaluation,
  2. added more documentation and utilities to classify.ConfusionMatrix,
  3. updated stat.LogisticRegression and classify.LogisticRegressionClassifier so that likelihood gradient updates are blocked along with prior updates,
  4. added a new implementation, tokenizer.TokenNGramTokenizerFactory,
  5. updated toString() methods in all tokenizer factories to expose nested structure,
  6. added a collection utilities class util.CollectionUtils,
  7. added some more utility methods to the utility classes util.Streams and util.Strings,
  8. added utility methods for corpus.ListCorpus and corpus.Parser,
  9. fixed a bug in sparse vector addition with other sparse vectors (which nothing else depended on), and
  10. brought the tokenized LM tutorial in line with LingPipe 4.

How to Define (Interpolated) Precision-Recall/ROC Curves and the Area under Them?

October 1, 2010

I’m revising LingPipe’s ScoredPrecisionRecallEvaluation class. My goal is to replace what’s currently there with more industry-standard definitions, as well as fixing some bugs (see the last section for a list).

Uninterpolated and Interpolated Curves?

I’ve seen multiple different definitions for uninterpolated and interpolated curves.

For the uninterpolated case, I’ve seen some authors include all the points on the curve and others include only the points corresponding to relevant results (i.e., true positives). LingPipe took the latter approach, but Manning et al.’s IR book took the former.

For the interpolated case, I’ve seen some authors (e.g., Manning et al.) set precision for a given recall point r to be the maximum precision at an operating point with recall r or above. I’ve seen others just drop points that were dominated (LingPipe’s current behavior). I seem to recall reading that others used the convex hull (which is what LingPipe’s doc says it does, but the doc’s wrong).
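For reference, the Manning et al. style of interpolation is easy to state in code. This is just a sketch of that definition, not LingPipe’s current behavior:

// Interpolated precision at recall level r is the maximum precision over all
// operating points whose recall is at least r.  precisions[i] and recalls[i]
// are the raw (uninterpolated) operating points in rank order.
static double interpolatedPrecision(double r, double[] precisions, double[] recalls) {
    double max = 0.0;
    for (int i = 0; i < precisions.length; ++i)
        if (recalls[i] >= r && precisions[i] > max)
            max = precisions[i];
    return max;
}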

Limit Points?

Should we always add the limit points corresponding to precision/recall values of (0,1) and (1,0)?

Break-Even Point?

If we use all the points on the precision-recall curve, the break-even point (BEP) will be at the rank equal to the number of relevant results (i.e., BEP equals R-precision). If we remove points that don’t correspond to true positives, there might not be a break-even operating point at all. For example, consider the results irrel, irrel, rel, rel with only two relevant results: with all points included, the BEP is at rank 2, where precision = recall = 0/2 = 0, but it is undefined if we only consider true-positive operating points.

Area Under the Curves?

For area under the curve (AUC) computations, I’ve also seen multiple definitions. In some cases, I’ve seen people connect all the operating points (however defined), and then calculate the actual area. Others define a step function more in keeping with interpolating to maximum precision at equal or higher recall, and then take the area under that.

Bugs in LingPipe’s Current Definitions

For the next version of LingPipe, I need to fix three bugs:

1. The ROC curves should have 1-specificity on the X axis and sensitivity on the Y axis (LingPipe currently returns sensitivity on the X axis and specificity on the Y axis, which doesn’t correspond to anyone’s definition of ROC curves.)

2. The break-even point calculation can return too low a value.

3. Area under curve calculations don’t match anything I can find in the literature, nor do they match the definitions in the javadoc.

