LingPipe Baseline for MITRE Name Matching Challenge

We’ve released a baseline system for MITRE’s name matching challenge. Unfortunately, there’s no training data, so I used a simple heuristic matcher that we’ve used before for this problem and related problems.

In First Place (for now)

As of right this second, we’re in first place on their leaderboard (team name “Agent Smith”). There aren’t many participants yet, so I don’t expect to enjoy this position for long.

Check out the LingPipe Baseline

You can anonymously check out the source for our entry from the LingPipe Sandbox with the Apache Subversion version control system via the command:

svn co https://aliasi.devguard.com/svn/sandbox/mitreChallenge2011

There’s an Ant build file that runs the eval and a README.txt to help you get started. I’ll repeat the contents of the README (as it stands now) below.

Mean Average Precision Scoring

The contest is being scored by mean average precision. This measure does not penalize you for returning long lists of irrelevant results. Our baseline returns lots of results; its scores on the variously balanced F measures that are also reported could probably be improved by tuning the thresholds and the maximum number of results returned.

What I’d like to see, as usual, is a precision-recall curve or ROC curve derived from setting a threshold on the scores. But it’s hard to rank bakeoff entries by ROC curve, so IR bakeoffs typically resort to area under one of these curves or a measure like mean average precision (sort of like area under the curve — see the LingPipe book’s section on classifier evaluation for more discussion).
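
To make the metric concrete, here is a minimal sketch (in Java, matching the baseline’s language) of average precision for a single query; mean average precision is just its mean over all queries. The relevance array and relevant-match count are placeholders you would get from gold-standard judgments.

    // Minimal sketch of average precision for one query's ranked result list.
    // rel[k] is true if the candidate at rank k+1 is a relevant match;
    // numRelevant is the total number of relevant matches for the query.
    static double averagePrecision(boolean[] rel, int numRelevant) {
        if (numRelevant == 0) return 0.0;
        double sum = 0.0;
        int hits = 0;
        for (int k = 0; k < rel.length; ++k) {
            if (rel[k]) {
                ++hits;
                sum += (double) hits / (k + 1);  // precision at rank k+1
            }
        }
        return sum / numRelevant;
    }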


README File

The Contest

MITRE is hosting an unsupervised name-matching challenge. The
home page is:

http://www.mitre.org/work/challenge/

From there, you can register for the challenge and download the data, which consists of two lists of names, one long “index” file (about 825K names), and one short “query” file (around 9K names). The task is to rank potential matches in the index file for each name in the query file. The names are broken into forenames and surnames, but there’s no guarantee this assignment is accurate.

There is no gold standard as to which names actually match, so MITRE’s using a liberal notion of matching corresponding to “requires further human review to reject”. Furthermore, this notion of matching may change as the task evolves.

Example Matches

The only way (I know of) to see some example matches is to download the distribution and look at the documentation (.docx format — I had to install the latest OpenOffice to read it). They list examples of what they take to be potential matches for further review.

The LingPipe Baseline

This directory contains baseline code for an entry based on character n-grams. The entry is set up so that the character n-gram scores are used as a filter, which should have high sensitivity (low false negative rate) for matches, though low specificity (high false positive rate). Parameters controlling its aggressiveness may be tuned.
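
For concreteness, here is a rough sketch of the kind of character n-gram extraction involved (an illustration, not the distributed code); padding with spaces lets word-boundary grams count as features.

    import java.util.ArrayList;
    import java.util.List;

    // Extract character n-grams from a name, padding with a leading and
    // trailing space so that word-boundary grams show up as features.
    static List<String> charNGrams(String name, int n) {
        String padded = " " + name + " ";
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= padded.length(); ++i)
            grams.add(padded.substring(i, i + n));
        return grams;
    }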

If you download the data and unpack it into directory $DATADIR, you can run the task as follows:

% ant -Ddata.dir=$DATADIR ngrams

This’ll write the output match rankings into a fresh file in the output subdirectory runs/. If you don’t change the output routines, the output will be in the appropriate format for submission to the contest.

You can turn verbose debugging on via the flag DEBUG in the code itself.

Three Pass Refinement

Our approach is based on three passes.

1. Indexing

Create a character n-gram index and select potential pairs based on having at least one n-gram in common. The n-gram length is parameterizable through the constant INDEX_NGRAM. Setting it lower increases run time but may increase sensitivity for obscure matches.
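
A rough sketch of this pass (an illustration, not the distributed code; in the real run the index is built once and reused for all queries):

    import java.util.*;

    // First-pass sketch: build a reverse index from character n-grams to
    // positions in the index-file name list, then take as candidates every
    // entry sharing at least one n-gram with the query.  charNGrams is the
    // space-padded n-gram extractor sketched earlier.
    static Set<Integer> firstPassCandidates(List<String> indexNames,
                                            String queryName, int n) {
        Map<String,List<Integer>> postings = new HashMap<String,List<Integer>>();
        for (int i = 0; i < indexNames.size(); ++i) {
            for (String gram : charNGrams(indexNames.get(i), n)) {
                List<Integer> hits = postings.get(gram);
                if (hits == null) {
                    hits = new ArrayList<Integer>();
                    postings.put(gram, hits);
                }
                hits.add(i);
            }
        }
        Set<Integer> candidates = new HashSet<Integer>();
        for (String gram : charNGrams(queryName, n)) {
            List<Integer> hits = postings.get(gram);
            if (hits != null) candidates.addAll(hits);
        }
        return candidates;
    }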

2. TF/IDF Scoring and Filtering

Rank the first-pass possible matches by TF/IDF distance over their character n-grams. The range of n-grams is parameterizable with MATCH_NGRAM_MIN and MATCH_NGRAM_MAX; setting these to 2 and 4 respectively produces matches based on 2-, 3-, and 4-grams. TF/IDF weighting will weight the less frequent n-grams more heavily. The maximum number of candidates surviving this stage may be bounded by setting the MAX_RESULTS_INDEX variable in the code.
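
The baseline scores with a TF/IDF distance over character n-grams; the following is only a rough from-scratch sketch of the general idea (not the code in the sandbox), with document-frequency counts taken over the index-file names:

    import java.util.*;

    // Sketch of TF/IDF-weighted cosine over character n-grams.  docFreq maps
    // each n-gram to the number of index names containing it; numNames is the
    // total number of index names.  Rare n-grams get higher IDF weight.
    static double tfIdfCosine(Map<String,Integer> queryGrams,
                              Map<String,Integer> candidateGrams,
                              Map<String,Integer> docFreq, int numNames) {
        double dot = 0.0, qNorm = 0.0, cNorm = 0.0;
        Set<String> grams = new HashSet<String>(queryGrams.keySet());
        grams.addAll(candidateGrams.keySet());
        for (String gram : grams) {
            Integer df = docFreq.get(gram);
            double idf = Math.log((double) numNames / (1.0 + (df == null ? 0 : df)));
            double qw = queryGrams.containsKey(gram) ? queryGrams.get(gram) * idf : 0.0;
            double cw = candidateGrams.containsKey(gram) ? candidateGrams.get(gram) * idf : 0.0;
            dot += qw * cw;
            qNorm += qw * qw;
            cNorm += cw * cw;
        }
        return (qNorm == 0.0 || cNorm == 0.0) ? 0.0 : dot / Math.sqrt(qNorm * cNorm);
    }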

3. Rescoring

The method

     double rescore(String[],String[],double);

takes the original fields (first name/last name) as string arrays and a double consisting of the n-gram match score, and allows an arbitrary score to be returned. As distributed, this method just returns the score passed in. The maximum number of results surviving the final ranking is determined by the variable MAX_RESULTS.
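
For example, a hypothetical rescorer that boosts candidates whose surname field exactly matches the query's might look like the following (which array position holds the surname is an assumption made for illustration):

     // Hypothetical rescorer: keep the n-gram score, but boost candidates
     // whose surname field matches the query's exactly.  (As distributed,
     // rescore() just returns ngramScore unchanged.)
     double rescore(String[] queryFields, String[] candidateFields,
                    double ngramScore) {
         boolean surnameMatch = queryFields.length > 1
             && candidateFields.length > 1
             && queryFields[1].equalsIgnoreCase(candidateFields[1]);
         return surnameMatch ? ngramScore + 1.0 : ngramScore;
     }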

This will write a system response ready for submission to the challenge in the default output directory runs/.

Questions and Comments

Let us know if you have questions or comments about our distribution. Feel free to post your uses here or give the rest of us suggestions on how to improve the baseline.

Responses to “LingPipe Baseline for MITRE Name Matching Challenge”

  2. Chris Mulligan Says:

    Fascinating, thank you for sharing! I’m one of your competitors, YouGov (you just beat us by 0.258!) Our matching system was built a few years ago for a very different purpose (and originally for names, birth dates and addresses, not just names).

    Our approach actually uses a similar 3 level methodology. First, we break the files into blocks. We don’t currently use ngrams, preferring things more like Soundex, NYSIIS or Double-Metaphone, but we’ve experimented with them. We then do a whole bunch of crunching. Then we take those raw inputs and coalesce them into a single score. Typically that score for names is just one component in an overall record match that includes name, address, birthdate, and a few other variables. For this challenge we just export that name score by itself.

    All the code already existed, so I’ve been pretty pleased at how well we’ve done so far. I’m excited to dig into your source and see how you’re doing it.

    Two notes – your command has -Ddata.dir, but it seems that you need -Ddata-dir. I also can’t figure out the output prefix.

    • Bob Carpenter Says:

      Thanks — I also patched the doc to show that you indeed need “-Ddata-dir=$DATADIR”. I’m bouncing among R, C, Python and Java these days, and the naming conventions are killing me!

      This is also pretty off the shelf for us. I’m about six or eight hours into the project, including all the doc and blogging and submissions. We join just about any competition of this character.

      We have general string-distance routines, of which character n-grams is the best. There’s also soundex, Jaccard distance on tokens, general weighted edit distance (which includes Levenshtein distance as a special case) and a properly implemented and parameterized Jaro-Winkler (based on the Census bureau’s C code).
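
      Jaccard distance on tokens is the simplest of those to illustrate; here’s a rough from-scratch sketch of the proximity (not the LingPipe class itself):

        import java.util.*;

        // Rough sketch of Jaccard proximity over whitespace-separated tokens:
        // |A intersect B| / |A union B|.  Distance is 1 minus this value.
        static double jaccardProximity(String s1, String s2) {
            Set<String> a = new HashSet<String>(Arrays.asList(s1.toLowerCase().split("\\s+")));
            Set<String> b = new HashSet<String>(Arrays.asList(s2.toLowerCase().split("\\s+")));
            Set<String> intersection = new HashSet<String>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }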

      Our system was really designed for exactly the task MITRE’s interested in — finding possible matches for human review. So the MAP scoring works to our advantage for the challenge. In practice, we find that calibrating a threshold across different names is a much harder task.

      Given that MITRE basically told us what they consider matches, I’m thinking a heuristic system that takes initials into account would help. And something using wider context that won’t match a name if it’s used discriminatively elsewhere. That is, if you think “John Smith” might match “John Smythe”, but you find a “Sally Smith” and “Sally Smythe” in the data, don’t match. This uses the info that the Smith/Smythe distinction is important. (This assumes the reference data is pretty clean, otherwise it won’t filter much.)
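
      Here’s a hypothetical sketch of that filter, assuming a map from each surname to the set of first names it occurs with in the reference data:

        import java.util.Map;
        import java.util.Set;

        // Hypothetical "discriminative elsewhere" check: before accepting a
        // fuzzy surname match, see whether some other first name occurs with
        // *both* surnames (e.g. "Sally Smith" and "Sally Smythe").  If it
        // does, the surname difference is being used contrastively, so veto
        // the match.  surnameToFirstNames would be built from the index file.
        static boolean distinctionUsedElsewhere(String surname1, String surname2,
                String queryFirstName, Map<String,Set<String>> surnameToFirstNames) {
            Set<String> firsts1 = surnameToFirstNames.get(surname1);
            Set<String> firsts2 = surnameToFirstNames.get(surname2);
            if (firsts1 == null || firsts2 == null) return false;
            for (String first : firsts1)
                if (!first.equalsIgnoreCase(queryFirstName) && firsts2.contains(first))
                    return true;
            return false;
        }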

      In practice, we’d always create a training set with our customers if only to evaluate the performance of the system we’ve built and predict how it’ll perform on held out data.

      In real apps, there’s almost always more fields than names. For instance, we’re interested in name matching in the context of bibliography citations, where you have dates, journals, etc. We’ve also done movie databases, where film names are linked to dates (with somewhat sketchy meanings) and actors and genres and so on.

      When we were working on gene name linkage from biomedical research articles to databases, we also used discriminative features of the whole set of possible matches in addition to string distance. In that context, the names appeared in the context of articles, so we could also use the other text to see if a gene was mentioned in the article and make the overall decisions more consistent. We could then blend all the info with logistic regression because we had training examples. I should note that when I say “we”, I mean Mike Ross and Tong Zhang and Mitzi Morris and Breck — I was pretty much just an observer on the applied side.

  3. Chris Mulligan Says:

    Looks like the real issue is actually an OutOfMemory error. My unfamiliarity with Ant is hurting me here though – I played around trying to increase the heap but couldn’t manage it. It always tops out around 210MB of real memory and 25-27 seconds.

    The ‘ characters around the executable and arguments are
    not part of the command.
    [java] outPrefix=runs/test-
    [java] parsing file=../mitre/index.txt
    [java] Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
    [java] at java.util.regex.Pattern.compile(Pattern.java:1451)
    [java] at java.util.regex.Pattern.<init>(Pattern.java:1133)
    [java] at java.util.regex.Pattern.compile(Pattern.java:823)
    [java] at java.lang.String.split(String.java:2292)
    [java] at java.lang.String.split(String.java:2334)
    [java] at com.lingpipe.mitre2011.Corpus.parseFile(Corpus.java:26)
    [java] at com.lingpipe.mitre2011.Corpus.<init>(Corpus.java:17)
    [java] at com.lingpipe.mitre2011.Ngrams.main(Ngrams.java:50)
    [java] Java Result: 1

    BUILD SUCCESSFUL
    Total time: 27 seconds
    Kotai:agentsmith chmullig$

    • Bob Carpenter Says:

      It needs a fair amount of memory — it used about 2.5GB on 64-bit Java 1.6 on Windows. Most of the memory’s for the character n-gram to name reverse index, but there’s also some for caching the name array to name mapping.

      The output prefix is sort of a hack. What I do is append numbers to the end of the prefix and then create a file from the name into which to write the results. That way, you never overwrite the output. I’m also time-stamping and dumping the parameters as comments in the output.
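
      Something along these lines (a sketch; the actual naming details in the distributed code may differ):

        import java.io.File;

        // Append increasing run numbers to the output prefix until a file
        // that doesn't exist yet is found, so runs never clobber each other.
        // With outPrefix "runs/test-" this yields runs/test-0.txt, runs/test-1.txt, ...
        static File freshOutputFile(String outPrefix) {
            for (int run = 0; ; ++run) {
                File file = new File(outPrefix + run + ".txt");
                if (!file.exists()) return file;
            }
        }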

      I patched the build.xml to request up to 3GB of memory, which should be enough.

      I also moved the README.txt to the top-level directory while I was at it.

      The Ant tasks translate pretty directly into java. For instance, you could use Ant to build the jar in location $JAR and then run:

      java -cp $JAR com.lingpipe.mitre2011.Ngrams $DATADIR $OUTPREFIX
      
  4. Eks Dev Says:

    Hi Bob,
    You could have done it, likely with less code, using plain Lucene NGramTokenizer. Proper inverted index (no memory limits), flexibility to tune TF-IDF a bit (e.g. token prefix boost & Co), possibility to x-check first/last name swaps with query… Something along Lucene SpellChecker lines, with two fields.

    Actually, I was just surprised, you did not recycle your lucene in action use case :)

    Best regards and thanks again for excellent blogs,
    eks dev

    • Bob Carpenter Says:

      Thanks for the feedback. I could’ve indeed built something roughly similar using Lucene’s n-gram tokenizer, index, and document scoring. But given that what I did fit in 2.5G of memory, I didn’t see the need to worry about a disk-based index.

      I’m thinking it’d be awfully slow if I used on-disk Lucene directories with a 2-4-gram tokenizer. I’d wind up with something on the order of 100K or 200K hits in the hits list from the 2-grams. Does Lucene deal with this in a reasonable way? There are huge numbers of 2-gram hits, especially if you literally copy the string preprocessing I did, which prefixes and postfixes a space to get boundary effects. I used two passes, a first-pass 3-gram match and a second pass 2-4-gram rescorer, which was reasonably fast. There may be better ways to do this “blocking” (coming up with potential matches).

      I could probably have used something like the RAM directory implementations in Lucene to run everything in memory — Lucene has a much tighter representation than my own crude index, so it would probably take much less memory, too.

      I don’t think I can do an exactly similar matching, though I might be able to get close by turning off some of the features in Lucene’s matcher (like the overlap term). But the real difference is that Lucene’s TF/IDF is asymmetric (which helps immensely with scale), so it can’t be used to compute the exact relation I used. Here’s a blog entry discussing why:

      https://lingpipe-blog.com/2007/06/06/tfidf-scaling-and-symmetry/

      Having said that, Lucene’s scoring might even work better than our TF/IDF scoring. It’s not like we tuned an approach to this problem.

      The plan’s to do more with rescoring using more bits of LingPipe string comparison like Jaro-Winkler, edit distance, exact token matching (though in either Lucene or LingPipe, I could build a compound tokenizer returning tokens, first letters, and n-grams in different name spaces).

      For much larger scale, a better index would definitely be in order. Perhaps even multiple passes as they do in speech recognition.

  5. Eks Dev Says:

    I do not think it would be slow; with these N-gram sizes, you would have a ridiculously small term dictionary in Lucene with a few dense postings lists. It would all be cached very quickly.

    You are right about asymmetric scoring, only matched terms contribute. Even that is doable with DocumentVectors. But honestly, I see no reason to use it.

    I would bet it would work better to throw all the smallish N-grams (2 and 3) into a plain OR query and just fiddle a bit with similarity. You would get the top M candidates (beam depth of your choice), ranked much better than by a plain, binary “at least one 3-gram match”. As you said, you could index both name fields as N-grams, full tokens, prefixes, synonyms. After that, you would need to tweak query generation and adjust boosting for different query components. Blocking can be tuned by setting minNumberShouldMatch, for example, or by requiring at least one 3-gram from the first name and at least one 2-gram from the last name… that would reduce your “search space” significantly, if speed becomes an issue.
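
    Roughly something like this, in Lucene 3.x style (the field names, the open IndexSearcher over the n-gram-indexed names, and the charNGrams extractor are all assumed for illustration):

      import java.io.IOException;
      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.*;

      // Sketch of the blocking query: OR the query name's n-grams together as
      // SHOULD clauses over hypothetical per-field n-gram fields, and require
      // a minimum number of them to match before a candidate is returned.
      static TopDocs blockingCandidates(IndexSearcher searcher,
              String queryFirstName, String queryLastName) throws IOException {
          BooleanQuery query = new BooleanQuery();
          for (String gram : charNGrams(queryLastName, 3))
              query.add(new TermQuery(new Term("last_3grams", gram)), BooleanClause.Occur.SHOULD);
          for (String gram : charNGrams(queryFirstName, 2))
              query.add(new TermQuery(new Term("first_2grams", gram)), BooleanClause.Occur.SHOULD);
          query.setMinimumNumberShouldMatch(2);   // tune for blocking aggressiveness
          return searcher.search(query, 200);     // top-M candidates passed on to rescoring
      }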

    Simply, this way you merge two phases into one to get very nicely ranked candidates (asymmetric TF-IDF with boosting as you define it, surname more/less relevant, prefix grams…), making it possible to spend more time iterating on “the hard part with rescoring” over only the top-ranked candidates.

    Lucene’s scoring asymmetry can be compensated for later, in rescoring. But I guess that is a minor effect, or none at all.

    But this part is easy to solve; I simply like it because it is easy to change the query formulation and experiment. It also scales for free.

    The hard part is indeed rescoring: string distances are important, and have to be made somehow conditional on other information. For example, you could loosen the distance threshold on the first name if the last name agrees, as having both first and last name modified is not likely, even when your favorite string distances, combined under an independence assumption, make it look like a good match.

    You are absolutely right; character level works much better than token level. Token level has huge problems with errors, where high-frequency tokens suddenly become low-frequency if they have one letter dropped.

    But the most promising part is to somehow extract semantics from the data (IMO, the roughly 1M-name MITRE list is kind of smallish if international names are to be covered): tagging tokens as “likely first name” or “likely last name” to get swaps or mixed fields right. It could be done semi-automatically if the collection is predominantly clean, by just counting how often each token appears as a first name versus a last name… If there is enough support, you could let it play nicely as model data. Also, dealing with missing values is typically an issue (I did not look at the MITRE data); e.g., a missing surname could be ugly…

    This is definitely an interesting problem; it looks much simpler than it really is. And this is only the “easy part”. Classifying records into “yes, that is the same person/family/whatever”, or as you said it, “defining good thresholds”, is an order of magnitude harder.

    PS: regarding measures, ROC & Co., I think David J. Hand’s H-measure would work perfectly for such competitions (normalized loss with an externally specified cost-ratio distribution for alpha/beta errors). See http://www2.imperial.ac.uk/~djhand/ for R code for the H-measure.

  7. Eks Dev Says:

    PS: Regarding measures, ROC & Co… I am convinced that in this case the H-measure, as defined by Prof. D. J. Hand (http://www2.imperial.ac.uk/~djhand/), would work like a charm. It is a replacement for AUC: effectively a normalized loss under an externally defined distribution over the ratio of misclassification costs (alpha and beta). This external distribution indirectly defines the optimum cut-off…

  8. Bob Carpenter Says:

    The kind of thing you’re talking about for first-name and last-name induction is like some of the stuff Mike Ross did when he was working on gene name linkage. There wasn’t a first-name/last-name issue, but there was the problem that there’d be “LONG-NAME alpha” and “LONG-NAME beta” and you’d take those last terms to be discriminative. To do that, we made exactly the assumption you’re talking about — assuming the data’s roughly clean. We do the same thing for spell checking, and that’d be another feature to add into the mix.

    It just occurred to me that if you treat this as a kind of clustering problem then you could perhaps get a bit more mileage in the matching. But it’s combinatorially tricky.

    I think the only real advantage of doing this in Lucene is scaling. You could maybe eke out a bit better blocking (finding potential matches), but the real effort’s in rescoring, which shouldn’t be that expensive, at least the way MITRE’s defining what a potential match is.

    There’s a lot of room to scale what I did in memory, but it’ll only go so far. After that, Lucene’s definitely the next step I’d take to scale. The density can be exploited if you really do go for a small index, so I’m sure you could do better than Lucene’s general index if you knew the structure ahead of time.

    The MITRE list is fairly small (real lists can easily be 10 times this size in terms of names — the MEDLINE author list alone is at least as big as MITRE’s), but the real problem is that it’s only in ASCII (other than a handful of instances of capital ‘E’ with different accents).

    The tricky part is matching names across languages, such as names written with Arabic characters and their transliteration into English or Chinese. We did some work on this problem almost 10 years ago, when we were working on a MITRE-sponsored system called MITAP. There’s been some neat work on it using both CRF-like approaches and HMM-like approaches.

    I was really hoping we’d get supervised data in order to tune thresholds and in order to tune weightings of different matching features. I’d basically develop a lot of little matchers and use their scores (and interactions of scores) in a logistic regression. That’d also have the nice property of calibrating the probabilities for free.

    As a challenge, the problem is that you could annotate the data by hand easily enough — it’s only 8K items! Also, they let you game the system by submitting as many runs as you want, which lets you do a crude form of parameter tuning.

    As to what’ll fit in memory, the cache is going to be pretty big. With bigrams, you’ll likely have 50K hits on average for about 1000 terms at this scale. That’s probably cacheable with Lucene’s efficient rep of postings list (the advantage of a merged index).

    Wow. That’s a bit more than I intended to write in a response!

  9. Chris Mulligan Says:

    Really interesting discussion, thanks for being so open Bob.

    I was playing with some R graphing code and decided to throw together a little automated graph of the leaderboard. Take a look: http://chmullig.com/2011/02/mitre-challenge-graph/

  10. Eks Dev Says:

    Matching under transliteration, huh, hats off!

    Also, I know you are interested in classifier performance measures. Something I use in my alter-ego life is the “H-measure” (David J. Hand, http://www.springerlink.com/content/y35743hp7010g354/).
    I am convinced solutions for problems like this one could be compared much better using it. The idea is simple: it is a properly scaled loss under an externally, a priori specified distribution over the ratio of the misclassification costs. This way, you avoid the problems with crossing ROC curves that are typical for AUC and friends. Here it would fit perfectly, as it forces one to specify a priori the optimal threshold (operating point on the ROC curve) at which you want to compare solutions. This prevents the cheating that is possible by implicitly setting different misclassification costs (alpha and beta), which is in turn equivalent to setting a different decision cut-off (match/mismatch).

    Huh, stop publishing topics I like, please write something boring. :)

  11. Brendan O'Connor Says:

    Where’s the leaderboard? I don’t see a link. How else will we make quantitative-based subjective and shallow judgments? j/k

    • Bob Carpenter Says:

      http://www.mitre.org/work/challenge/

      You need to hit their kitschy “start” button from the home page. It’s a really hard-to-navigate, stateful site. Once you log in, you get to an https domain that doesn’t link back to the top-level site.

      On top of that, I’m locked out right now. If you forget your password, you can e-mail them to have them change it. Then they phone you (yes, phone you) with a new password. It’s like we’re working for a bank or something. Unfortunately, the new password they gave me didn’t work, and I haven’t had the patience to try to get another one.

      We’re now soundly in third place :-) For reasons I outlined in the last comment reply, I don’t think this “challenge” makes much sense as a contest.

      • Kirk Roberts Says:

        Actually, Bob, as of right now you can consider the 1967 team to be LingPipe’s as well. We’re using your code as a baseline and while we’ve tried some of our own work, several submissions have yet to produce anything major. We’re working on beating your open source code though :-).

      • Kirk Roberts Says:

        I do agree with you on the metrics, however. Given the experiments we’ve run, I’ve come to doubt that their evaluation metrics are a reasonable measure of the usefulness of a system.

  12. Chris Mulligan Says:

    Incorporated some of your code/results into our latest run. Bumped us up a bit. Interesting work, thanks for being public with it.

  13. Dave Yelen Says:

    Does anyone here have access to the “results” calculator program that comes up with the metrics for the Name Challenge?
