We’re working on a database record linkage (also known as database deduplication) problem for a customer. The standard string comparison for this task is the Jaro-Winkler distance, named after Matt Jaro and Bill Winkler of the U.S. Census Bureau. Here are some quick links to what we’ve done:
- LingPipe Javadoc:
- Lingpipe Java Source:
- Lingpipe String Comparison Tutorial
Unfortunately, all of the definitions I could find were as vague as the Jaro-Winkler Wikipedia entry. And some definitions actually varied when considering the boundary conditions.
I next went and looked in two open-source string comparison packages, namely SecondString and Simmetrics. Here’s the Simmetrics Java source and the SecondString Java source. Interestingly, they contained almost identical definitions that matched neither the definitions in Winkler’s papers nor their own documentation (which is clearly wrong in the third term in Jaro distance Simmetrics case). Hmm. This makes me very skeptical of the widely cited results reported in (Cohen et al. 2003); more on their soft TF/IDF measure in a future blog post.
To get to the bottom of the matter, I composed a long e-mail missive to William Winkler at the U.S. Census Bureau. I outlined the vexing boundary conditions. He was not only kind enough to write back, he sent me a copy of the C code he actually used in his papers, along with pointers to tables in the paper that had sample output.
I implemented Jaro-Winkler as an instance of the interface
com.aliasi.spell.StringDistance. I implemented all of Winkler’s tests as unit tests along with a number of pure boundary condition tests. It’ll be out in the next release of LingPipe, along with some TF/IDF-based distances. Here’s the class javadoc, which contains links to all of the relevant papers and test cases.
JaroWinklerDistance class implements the original Jaro string comparison as well as Winkler’s modifications. As a distance measure, Jaro-Winkler returns values between
0 (exact string match) and
1 (no matching characters). Note that this is reversed from the original definitions of Jaro and Winkler in order to produce a distance-like ordering. The original Jaro-Winkler string comparator returned
1 for a perfect match and
0 for complete mismatch; our method returns one minus the Jaro-Winkler measure.
The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census. It is designed to compae surnames to surnames and given names to given names, not whole names to whole names. There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages.
The easiest way to understand the Jaro measure and the Winkler variants is procedurally. The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions". The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings.
Formal Definition of Jaro-Winkler Distance
Suppose we are comparing character sequences
cs2. The Jaro-Winkler distance is defined by the following steps. After the definitions, we consider some examples.
Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string. The maximum distance (measured by array index) at which characters may be matched is defined by:
matchRange = max(cs1.length(), cs2.length()) / 2 - 1
The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that, is reversing the order of arguments does not affect the result). For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match. If there is no such character within the match range window, the character is left unaligned.
Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted. These subsequences will be the same length. The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions". The total number of transpoisitons is the number of half transpositions divided by two, rounding down.
The Jaro distance is then defined in terms of the number of matching characters
matches and the number of transpositions,
jaroCompare(cs1,cs2) = ( matches(cs1,cs2) / cs1.length() + matches(cs1,cs2) / cs2.length() + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3 jaroDistance(cs1,cs2) = 1 - jaroCompare(cs1,cs2)
In words, the measure is the average of three values; (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed.
Step 3: Winkler Modification The Winkler modification to the Jaro comparison, resulting in the Jaro-Winkler comparison, boosts scores for strings that match character for character initially. Let
boostThreshold be the minimum score for a string that gets boosted. This value was set to
0.7 in Winkler’s papers (see references below). If the Jaro score is below the boost threshold, the Jaro score is returned unadjusted. The second parameter for the Winkler modification is the size of the initial prefix considered,
prefixSize. The prefix size was set to
4 in Winkler’s papers. Next, let
prefixMatch(cs1,cs2,prefixSize) be the number of characters in the prefix of
cs2 that exactly match (by original index), up to a maximum of
prefixSize. The modified distance is then defined to be:
jaroWinklerCompare(cs1,cs2,boostThreshold,prefixSize) = jaroMeasure(cs1,cs2) < = boostThreshold ? jaroMeasure(cs1,cs2) : jaroMeasure(cs1,cs2) + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2)) jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize) = 1 - jaroWinklerCompare(cs1,cs2,boostThreshold,prefixSize)
Examples: We will present the alignment steps in the form of tables, with offsets in the second string below the first string positions that match. For a simple example, consider comparing the given (nick)name
AL to itself. Both strings are of length 2. Thus the maximum match distance is
max(2,2)/2 - 1 = 0, meaning all matches must be exact. The matches are illustrated in the following table:
The notation in the matches row is meant to indicate that the
A at index
cs1 is matched to the
A at index
cs2. Similarlty for the
L at index 1 in
cs1, which matches the
L at index 1 in
cs2. This results in
matches(AL,AL) = 2. There are no transpositions, so the Jaro distance is just:
jaroCompare(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0
Applying the Winkler modification yields the same result:
jaroWinklerCompare(AL,AL) = 1 + 0.1 * 2 * (1.0 - 1) = 1.0
Next consider a more complex case, matching
MARHTA. Here the match distance is
max(6,6)/2 - 1 = 6/2 - 1 = 2, allowing matching characters to be up to two characters away. This yields the following alignment.
Note that the
T at index 3 in the first string aligns with the
T at index 4 in the second string, whereas the
H at index 4 in the first string alings with the
H at index 3 in the second string. The strings that do not directly align are rendered in bold. This is an instance of a transposition. The number of half transpositions is determined by comparing the subsequences of the first and second string matched, namely
MARHTA. There are two positions with mismatched characters, 3 and 4. This results in two half transpositions, or a single transposition, for a Jaro distance of:
jaroCompare(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944
Three initial characters match,
MAR, for a Jaro-Winkler distance of:
jaroWinklerCompare(MARTHA,MARHTA) = 1 + 0.1 * 3 * (1.0 - 0.944) = 0.961
Next, consider matching strings of different lengths, such as
The unmatched characters are rendered in italics. Here the subsequence of matched characters for the two strings are
JONS, so there are no transpositions. Thus the Jaro distance is:
jaroCompare(JONES,JOHNSON) = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790
JOHNSON only match on their first two characters,
JO, so the Jaro-Winkler distance is:
jaroWinklerCompare(JONES,JOHNSON) = .790 + 0.1 * 2 * (1.0 - .790) = 0.832
We will now consider some artificial examples not drawn from (Winkler 2006). First, compare
CABVWXYZ, which are of length 8, allowing alignments up to
8/4 - 1 = 3 positions away. This leads to the following alignment:
Here, there are 8/8 matches in both strings. There are only three half-transpositions, in the first three characters, because no position of
CAB has an identical character to
ABC. This yields a total of one transposition, for a Jaro score of:
jaroCompare(ABCVWXYZ,CABVWXYZ) = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958
There is no initial prefix match, so the Jaro-Winkler comparison produces the same result. Now consider matching
CBAWXYZ. Here, the initial alignment is
2, 1, 0, which yields only two half transpositions. Thus under the Jaro distance,
ABC is closer to
CBA than to
CAB, though due to integer rounding in computing the number of transpositions, this will only affect the final result if there is a further transposition in the strings.
Now consider the 10-character string
ABCDUVWXYZ. This allows matches up to
10/2 - 1 = 4 positions away. If matched against
DABCUVWXYZ, the result is 10 matches, and 4 half transposes, or 2 transposes. Now consider matching
DBCAUVWXYZ. Here, index 0 in the first string (
A) maps to index 3 in the second string, and index 3 in the first string (
D) maps to index 0 in the second string, but positions 1 and 2 (
C) map to themselves. Thus when comparing the output, there are only two half transpositions, thus making the second example
DBCAUVWXYZ closer than
DABCUVWXYZ to the first string
Note that the transposition count cannot be determined solely by the mapping. For instance, the string
BBBAUVWXYZ with alignment
4, 0, 1, 2, 5, 6, 7, 8, 9, 10. But there are only two half-transpositions, because only index 0 and index 3 mismatch in the subsequences of matching characters. Contrast this with
DABCUVWXYZ, which has the same alignment, but four half transpositions.
This class’s implementation is a literal translation of the C algorithm used in William E. Winkler’s papers and for the 1995 U.S. Census Deduplication. The algorithm is the work of multiple authors and available from the folloiwng link:
- Winkler, Bill, George McLaughlin, Matt Jaro and Marueen Lynch. 1994. strcmp95.c (ed. link deprecated by the census), Version 2. United States Census Bureau.
- Working link: CodeGuru strcmp95.c
Unlike the C version, the
compare(CharSequence,CharSequence) methods do not require its inputs to be padded with spaces. In addition, spaces are treated just like any other characters within the algorithm itself. There is also no case normalization in this class’s version. Furthermore, the boundary conditions are changed so that two empty strings return a score of
1.0 rather than zero, as in the original algorithm.
The algorithm, along with applications in record linkage, is sketched in the following highly readable survey article:
- Winkler, William E. 2006. Overview of Record Linkage and Current Research Directions. Statistical Research Division, U.S. Census Bureau.
This document provides test cases in Table 6, which are the basis for the unit tests for this class (though note the three 0.0 results in the table do not agree with the return results of
strcmp95.c or the results of this class, which matches
strcmp95.c). The description of the matching procedure above is based on the actual
strcmp95 code, the boundary conditions of which are not obvious from the text descriptions in the literature. An additional difference is that
strcmp95, but not the algorithms in Winkler’s papers nor the algorithm in this class, provides the possibility of partial matches with similar-sounding characters ( .g.
The greedy nature of the alignment phase in the Jaro-Winkler algorithm actually prevents the optimal alignments from being found in some cases. Consider the aligning
Here the first pair of
A characters are matched, leading to three half transposes (the first three matched characters). A better scoring, though illegal, alignment would be the following, because it has the same number of matches, but no transposes:
The illegal links are highlighted in yellow. Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply.
We’d like to thank Bill Winkler for helping us understand the versions of the algorithm and providing the
strcmp95.c code as a reference implementation.