We’ve released a baseline system for MITRE’s name matching challenge. Unfortunately, there’s no training data, so I used a simple heuristic matcher that we’ve used before for this and related problems.
In First Place (for now)
As of right this second, we’re in first place on their leaderboard (team name “Agent Smith”). There aren’t many participants yet, so I don’t expect to enjoy this position for long.
Check out the LingPipe Baseline
svn co https://aliasi.devguard.com/svn/sandbox/mitreChallenge2011
There’s an Ant build file that runs the eval and a README.txt to help you get started. I’ll repeat the contents of the README (as it stands now) below.
Mean Average Precision Scoring
The contest is being scored by mean average precision, which does not penalize you for returning long lists of irrelevant results. Our baseline returns lots of results; its scores on the variously balanced F measures that are also reported could probably be improved by tuning thresholds and the maximum number of results returned.
What I’d like to see, as usual, is a precision-recall curve or ROC curve derived from setting a threshold on the scores. But it’s hard to rank bakeoff entries by ROC curve, so IR bakeoffs typically resort to area under one of these curves or a measure like mean average precision (sort of like area under the curve; see the LingPipe book’s section on classifier evaluation for more discussion).
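For concreteness, here’s a minimal sketch of how average precision works for a single query’s ranked results; mean average precision just averages this value over all the queries. This is our own illustrative code, not part of the contest’s scoring tooling:

class AveragePrecision {
    // relevant[k] is true if the result at rank k+1 is a true match;
    // numRelevant is the total number of true matches for the query,
    // including any the system failed to return at all.
    static double averagePrecision(boolean[] relevant, int numRelevant) {
        double sum = 0.0;
        int hits = 0;
        for (int k = 0; k < relevant.length; ++k) {
            if (relevant[k]) {
                ++hits;
                sum += hits / (double) (k + 1); // precision at rank k+1
            }
        }
        return numRelevant == 0 ? 0.0 : sum / numRelevant;
    }
}

Note that tacking irrelevant results onto the end of a ranking never lowers the sum, which is why long candidate lists are safe under this measure; true matches the system misses entirely still drag the score down through the numRelevant denominator.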
The MITRE Challenge
MITRE is hosting an unsupervised name-matching challenge. The home page is:
From there, you can register for the challenge and download the data, which consists of two lists of names, one long “index” file (about 825K names), and one short “query” file (around 9K names). The task is to rank potential matches in the index file for each name in the query file. The names are broken into forenames and surnames, but there’s no guarantee this assignment is accurate.
There is no gold standard as to which names actually match, so MITRE’s using a liberal notion of matching corresponding to “requires further human review to reject”. Furthermore, this notion of matching may change as the task evolves.
The only way I know of to see some example matches is to download the distribution and look at the documentation (.docx format; I had to install the latest OpenOffice to read it). They list examples of what they take to be potential matches for further review.
The LingPipe Baseline
This directory contains baseline code for an entry based on character n-grams. The entry is set up so that the character n-gram scores are used as a filter, which should have high sensitivity (a low false-negative rate) for matches, though low specificity (a high false-positive rate). Parameters controlling its aggressiveness may be tuned.
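To make the filtering idea concrete, here’s a rough sketch of character n-gram extraction (our own illustrative helper, not code from the distribution); two names pass the first-stage filter if their n-gram sets share at least one element:

import java.util.HashSet;
import java.util.Set;

class NGrams {
    // Character n-grams of a string; charNGrams("smith", 3)
    // yields {"smi", "mit", "ith"}.
    static Set<String> charNGrams(String s, int n) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); ++i)
            grams.add(s.substring(i, i + n));
        return grams;
    }
}

Shorter n-grams are shared by more name pairs, which is why lowering the n-gram length trades run time for sensitivity in the indexing pass described below.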
If you download the data and unpack it into directory $DATADIR, you can run the task as follows:
% ant -Ddata.dir=$DATADIR ngrams
This’ll write the output match rankings into a fresh file in the output subdirectory /runs. If you don’t change the output routines, the output will be in the appropriate format for submission to the contest.
You can turn verbose debugging on via the flag DEBUG in the code itself.
Three Pass Refinement
Our approach is based on three passes; a code sketch of the scoring and rescoring follows the list.
1. Indexing
Create a character n-gram index and select potential pairs based on having at least one n-gram in common. The n-gram length is parameterizable through the constant INDEX_NGRAM. Setting it lower increases run time but may increase sensitivity for obscure matches.
2. TF/IDF Scoring and Filtering
Rank the first-pass possible matches by TF/IDF distance over their character n-grams. The range of n-grams is parameterizable with MATCH_NGRAM_MIN and MATCH_NGRAM_MAX; setting these to 2 and 4 respectively produces matches based on 2-, 3-, and 4-grams. TF/IDF weighting will weight the less frequent n-grams more heavily. The maximum number of candidates surviving this stage may be bounded by setting the MAX_RESULTS_INDEX variable in the code.
3. Rescoring
The final pass runs each surviving candidate through a rescoring method that takes the original fields (first name/last name) as string arrays and a double consisting of the n-gram match score, and allows an arbitrary score to be returned. As distributed, this method just returns the score passed in. The maximum number of results surviving the final ranking is determined by the variable MAX_RESULTS.
This will write a system response ready for submission to the challenge in the default output directory /runs.
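To tie the passes together, here’s a compact sketch of second- and third-pass scoring under our reading of the approach: idf-weighted cosine similarity over the n-gram sets from the helper above, followed by the pass-through rescoring hook. The method names, the smoothing, and the binary term frequencies are illustrative choices on our part; the actual implementation is in the sandbox code.

import java.util.Map;
import java.util.Set;

class TfIdfSketch {
    // docFreq maps each n-gram to the number of index names containing
    // it; numDocs is the total number of index names. Both can be
    // accumulated while building the first-pass index.
    static double tfIdfScore(Set<String> queryGrams, Set<String> candGrams,
                             Map<String,Integer> docFreq, int numDocs) {
        double dot = 0.0, qNorm = 0.0, cNorm = 0.0;
        for (String g : queryGrams) {
            double w = idf(g, docFreq, numDocs);
            qNorm += w * w;
            if (candGrams.contains(g))
                dot += w * w; // shared rare n-grams dominate the score
        }
        for (String g : candGrams) {
            double w = idf(g, docFreq, numDocs);
            cNorm += w * w;
        }
        return (qNorm == 0.0 || cNorm == 0.0)
            ? 0.0
            : dot / (Math.sqrt(qNorm) * Math.sqrt(cNorm));
    }

    // Smoothed inverse document frequency; rarer n-grams get
    // higher weights.
    static double idf(String gram, Map<String,Integer> docFreq, int numDocs) {
        Integer df = docFreq.get(gram);
        return Math.log((numDocs + 1.0) / ((df == null ? 0 : df) + 1.0));
    }

    // Third pass: as distributed, the hook just returns the n-gram
    // score unchanged. This is the natural place to plug in edit
    // distance, field swaps, nickname tables, and the like.
    static double rescore(String[] queryFields, String[] candidateFields,
                          double ngramScore) {
        return ngramScore;
    }
}

Candidates would then be sorted by the rescored value and truncated to MAX_RESULTS before being written out.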
Questions and Comments
Let us know if you have questions or comments about our distribution. Feel free to post your uses here or give the rest of us suggestions on how to improve the baseline.