Code Spelunking: Jaro-Winkler String Comparison

by Bob Carpenter

We’re working on a database record linkage (also known as database deduplication) problem for a customer. The standard string comparison for this task is the Jaro-Winkler distance, named after Matt Jaro and Bill Winkler of the U.S. Census Bureau. Here are some quick links to what we’ve done:

Unfortunately, all of the definitions I could find were as vague as the Jaro-Winkler Wikipedia entry. And some definitions actually varied when considering the boundary conditions.

I next went and looked in two open-source string comparison packages, namely SecondString and Simmetrics. Here’s the Simmetrics Java source and the SecondString Java source. Interestingly, they contained almost identical definitions that matched neither the definitions in Winkler’s papers nor their own documentation (which is clearly wrong in the third term of the Jaro distance in the Simmetrics case). Hmm. This makes me very skeptical of the widely cited results reported in (Cohen et al. 2003); more on their soft TF/IDF measure in a future blog post.

To get to the bottom of the matter, I composed a long e-mail missive to William Winkler at the U.S. Census Bureau, outlining the vexing boundary conditions. Not only was he kind enough to write back, he also sent me a copy of the C code he actually used in his papers, along with pointers to tables in the paper that had sample output.

With the C code in hand, GCC (the GNU C compiler), and a quick squint at Kernighan and Ritchie (the C version of Gosling et al.) to remind myself how to allocate an array in C, I was in business.

I implemented Jaro-Winkler as an instance of the interface com.aliasi.spell.StringDistance. I implemented all of Winkler’s tests as unit tests along with a number of pure boundary condition tests. It’ll be out in the next release of LingPipe, along with some TF/IDF-based distances. Here’s the class javadoc, which contains links to all of the relevant papers and test cases.

—————————————-

The JaroWinklerDistance class implements the original Jaro string comparison as well as Winkler’s modifications. As a distance measure, Jaro-Winkler returns values between 0 (exact string match) and 1 (no matching characters). Note that this is reversed from the original definitions of Jaro and Winkler in order to produce a distance-like ordering. The original Jaro-Winkler string comparator returned 1 for a perfect match and 0 for complete mismatch; our method returns one minus the Jaro-Winkler measure.

The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census. It is designed to compare surnames to surnames and given names to given names, not whole names to whole names. There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages.

The easiest way to understand the Jaro measure and the Winkler variants is procedurally. The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions". The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings.

Formal Definition of Jaro-Winkler Distance

Suppose we are comparing character sequences cs1 and cs2. The Jaro-Winkler distance is defined by the following steps. After the definitions, we consider some examples.

Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string. The maximum distance (measured by array index) at which characters may be matched is defined by the following, using integer arithmetic:

   matchRange = max(cs1.length(), cs2.length()) / 2 - 1

The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that is, reversing the order of arguments does not affect the result). For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match. If there is no such character within the match range window, the character is left unaligned.
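
To make the match phase concrete, here is a minimal Java sketch of the greedy alignment. This is not the LingPipe or strcmp95 source, just an illustration; the method name matchedSubsequences and the clamping of the match range at zero (so single-character strings can still match exactly) are my own choices, as the prose above does not cover strings shorter than two characters.

   // Greedy match phase: returns the matched subsequence of cs1 and the
   // matched subsequence of cs2 (always the same length).
   static String[] matchedSubsequences(String cs1, String cs2) {
       int matchRange =
           Math.max(0, Math.max(cs1.length(), cs2.length()) / 2 - 1);
       boolean[] aligned = new boolean[cs2.length()]; // cs2 positions used so far
       StringBuilder m1 = new StringBuilder();
       for (int i = 0; i < cs1.length(); i++) {
           int lo = Math.max(0, i - matchRange);
           int hi = Math.min(cs2.length() - 1, i + matchRange);
           for (int j = lo; j <= hi; j++) {
               if (!aligned[j] && cs1.charAt(i) == cs2.charAt(j)) {
                   aligned[j] = true; // first unaligned exact match wins
                   m1.append(cs1.charAt(i));
                   break;
               }
           }
           // if no match exists in the window, cs1.charAt(i) stays unaligned
       }
       StringBuilder m2 = new StringBuilder(); // matched chars of cs2, in cs2 order
       for (int j = 0; j < cs2.length(); j++)
           if (aligned[j])
               m2.append(cs2.charAt(j));
       return new String[] { m1.toString(), m2.toString() };
   }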

Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted. These subsequences will be the same length. The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions". The total number of transpositions is the number of half transpositions divided by two, rounding down.
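
Given the two matched subsequences, counting transpositions is a single pass over aligned positions. Again, a sketch rather than the released code:

   // Count half transpositions between the equal-length matched
   // subsequences; integer division by two rounds down.
   static int transpositions(String m1, String m2) {
       int halfTransposes = 0;
       for (int k = 0; k < m1.length(); k++)
           if (m1.charAt(k) != m2.charAt(k))
               halfTransposes++;
       return halfTransposes / 2;
   }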

The Jaro distance is then defined in terms of the number of matching characters, matches, and the number of transpositions, transposes:

   jaroCompare(cs1,cs2)
     = ( matches(cs1,cs2) / cs1.length()
         + matches(cs1,cs2) / cs2.length()
         + (matches(cs1,cs2) - transposes(cs1,cs2)) 
             / matches(cs1,cs2) ) / 3

   jaroDistance(cs1,cs2) = 1 - jaroCompare(cs1,cs2)

In words, the measure is the average of three values: (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed.
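
Putting the two sketches together, with explicit guards for the boundary cases (two empty strings score 1.0, as discussed in the implementation notes below; zero matches score 0.0), the Jaro comparison might be implemented as follows:

   // Sketch of the Jaro comparison score; jaroDistance is one minus this.
   static double jaroCompare(String cs1, String cs2) {
       if (cs1.isEmpty() && cs2.isEmpty())
           return 1.0; // boundary convention; see the implementation notes
       String[] m = matchedSubsequences(cs1, cs2);
       int matches = m[0].length();
       if (matches == 0)
           return 0.0; // no matching characters at all
       int transposes = transpositions(m[0], m[1]);
       return ((double) matches / cs1.length()
               + (double) matches / cs2.length()
               + (double) (matches - transposes) / matches) / 3.0;
   }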

Step 3: Winkler Modification: The Winkler modification to the Jaro comparison, resulting in the Jaro-Winkler comparison, boosts scores for strings that match character for character initially. Let boostThreshold be the minimum score for a string that gets boosted. This value was set to 0.7 in Winkler’s papers (see references below). If the Jaro score is at or below the boost threshold, the Jaro score is returned unadjusted. The second parameter for the Winkler modification is the size of the initial prefix considered, prefixSize. The prefix size was set to 4 in Winkler’s papers. Next, let prefixMatch(cs1,cs2,prefixSize) be the number of characters in the prefix of cs1 and cs2 that exactly match (by original index), up to a maximum of prefixSize. The modified comparison is then defined to be:

   jaroWinklerCompare(cs1,cs2,boostThreshold,prefixSize)
     = jaroCompare(cs1,cs2) <= boostThreshold
     ? jaroCompare(cs1,cs2)
     : jaroCompare(cs1,cs2)
       + 0.1 * prefixMatch(cs1,cs2,prefixSize)
               * (1.0 - jaroCompare(cs1,cs2))

   jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize)
     = 1 - jaroWinklerCompare(cs1,cs2,boostThreshold,prefixSize)
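
In code, the Winkler adjustment is a thin wrapper around jaroCompare; here is a sketch taking the boost threshold and prefix size as explicit arguments:

   // Sketch of the Winkler boost on top of the Jaro score; the distance
   // form is one minus this value.
   static double jaroWinklerCompare(String cs1, String cs2,
                                    double boostThreshold, int prefixSize) {
       double jaro = jaroCompare(cs1, cs2);
       if (jaro <= boostThreshold)
           return jaro; // no boost at or below the threshold
       int maxPrefix =
           Math.min(prefixSize, Math.min(cs1.length(), cs2.length()));
       int prefixMatch = 0;
       while (prefixMatch < maxPrefix
              && cs1.charAt(prefixMatch) == cs2.charAt(prefixMatch))
           ++prefixMatch;
       return jaro + 0.1 * prefixMatch * (1.0 - jaro);
   }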

Examples: We will present the alignment steps in the form of tables, with offsets in the second string below the first string positions that match. For a simple example, consider comparing the given (nick)name AL to itself. Both strings are of length 2. Thus the maximum match distance is max(2,2)/2 - 1 = 0, meaning all matches must be exact. The matches are illustrated in the following table:

cs1 A L
matches 0 1
cs2 A L

The notation in the matches row is meant to indicate that the A at index 0 in cs1 is matched to the A at index 0 in cs2. Similarly for the L at index 1 in cs1, which matches the L at index 1 in cs2. This results in matches(AL,AL) = 2. There are no transpositions, so the Jaro comparison is just:

   jaroCompare(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) 
                              = 1.0

Applying the Winkler modification yields the same result:

   jaroWinklerCompare(AL,AL) = 1 +  0.1 * 2 * (1.0 - 1) 
                                        = 1.0

Next consider a more complex case, matching MARTHA and MARHTA. Here the match distance is max(6,6)/2 - 1 = 6/2 - 1 = 2, allowing matching characters to be up to two characters away. This yields the following alignment.

cs1 M A R T H A
matches 0 1 2 4 3 5
cs2 M A R H T A

Note that the T at index 3 in the first string aligns with the T at index 4 in the second string, whereas the H at index 4 in the first string aligns with the H at index 3 in the second string. These characters, which do not line up directly by index, form an instance of a transposition. The number of half transpositions is determined by comparing the subsequences of the first and second string matched, namely MARTHA and MARHTA. There are two positions with mismatched characters, 3 and 4. This results in two half transpositions, or a single transposition, for a Jaro comparison of:

   jaroCompare(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) 
                                              = 0.944

Three initial characters match, MAR, for a Jaro-Winkler comparison of:

   jaroWinklerCompare(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) 
                                     = 0.961

Next, consider matching strings of different lengths, such as JONES and JOHNSON:

cs1 J O N E S
matches 0 1 3 - 4
cs2 J O H N S O N

The E at index 3 in JONES is left unmatched, as are the H, the second O, and the final N in JOHNSON. Here the subsequences of matched characters for the two strings are JONS and JONS, so there are no transpositions. Thus the Jaro comparison is:

   jaroCompare(JONES,JOHNSON)
     = 1/3 * (4/5 + 4/7 + (4 - 0)/4) 
     = 0.790

The strings JONES and JOHNSON only match on their first two characters, JO, so the Jaro-Winkler comparison is:

   jaroWinklerCompare(JONES,JOHNSON)
     = .790 + 0.1 * 2 * (1.0 - .790) 
     = 0.832
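
If the sketches above are collected into a class, a quick check should reproduce the three worked examples, using Winkler’s settings of boost threshold 0.7 and prefix size 4:

   public static void main(String[] args) {
       System.out.println(jaroWinklerCompare("AL", "AL", 0.7, 4));         // 1.0
       System.out.println(jaroWinklerCompare("MARTHA", "MARHTA", 0.7, 4)); // 0.9611...
       System.out.println(jaroWinklerCompare("JONES", "JOHNSON", 0.7, 4)); // 0.8323...
   }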

We will now consider some artificial examples not drawn from (Winkler 2006). First, compare ABCVWXYZ and CABVWXYZ, which are of length 8, allowing alignments up to 8/2 - 1 = 3 positions away. This leads to the following alignment:

cs1 A B C V W X Y Z
matches 1 2 0 3 4 5 6 7
cs2 C A B V W X Y Z

Here, all 8 characters in each string are matched. There are three half transpositions, one at each of the first three positions, because no position of CAB holds the same character as the corresponding position of ABC. This yields a total of one transposition, for a Jaro score of:

   jaroCompare(ABCVWXYZ,CABVWXYZ)
     = 1/3 * (8/8 + 8/8 + (8-1)/8) 
     = .958

There is no initial prefix match, so the Jaro-Winkler comparison produces the same result. Now consider matching ABCVWXYZ to CBAWXYZ. Here, the initial alignment is 2, 1, 0, which yields only two half transpositions. Thus under the Jaro distance, ABC is closer to CBA than to CAB, though due to integer rounding in computing the number of transpositions, this will only affect the final result if there is a further transposition in the strings.

Now consider the 10-character string ABCDUVWXYZ. This allows matches up to 10/2 - 1 = 4 positions away. If matched against DABCUVWXYZ, the result is 10 matches and 4 half transpositions, or 2 transpositions. Now consider matching ABCDUVWXYZ against DBCAUVWXYZ. Here, index 0 in the first string (A) maps to index 3 in the second string, and index 3 in the first string (D) maps to index 0 in the second string, but positions 1 and 2 (B and C) map to themselves. Thus when comparing the matched subsequences, there are only two half transpositions, making the second example DBCAUVWXYZ closer than DABCUVWXYZ to the first string ABCDUVWXYZ.

Note that the transposition count cannot be determined solely by the index mapping. For instance, the string ABBBUVWXYZ matches BBBAUVWXYZ with alignment 3, 0, 1, 2, 4, 5, 6, 7, 8, 9. But there are only two half transpositions, because only index 0 and index 3 mismatch in the subsequences of matching characters. Contrast this with DABCUVWXYZ matched against ABCDUVWXYZ, which has the same alignment, but four half transpositions.
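
The contrast is easy to check with the jaroCompare sketch from earlier:

   // Same index mapping, but different transposition counts, because
   // transpositions are counted over characters, not over the mapping.
   System.out.println(jaroCompare("ABCDUVWXYZ", "DABCUVWXYZ")); // 0.9333... (2 transpositions)
   System.out.println(jaroCompare("ABBBUVWXYZ", "BBBAUVWXYZ")); // 0.9666... (1 transposition)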

Implementation Notes

This class’s implementation is a literal translation of the C algorithm used in William E. Winkler’s papers and for the 1995 U.S. Census deduplication. The algorithm is the work of multiple authors and is available from the following links:

  • Winkler, Bill, George McLaughlin, Matt Jaro and Maureen Lynch. 1994. strcmp95.c (ed. link deprecated by the census), Version 2. United States Census Bureau.
  • Working link: CodeGuru strcmp95.c

Unlike the C version, the distance(CharSequence,CharSequence) and compare(CharSequence,CharSequence) methods do not require their inputs to be padded with spaces. In addition, spaces are treated just like any other characters within the algorithm itself. There is also no case normalization in this class’s version. Furthermore, the boundary conditions are changed so that two empty strings return a comparison score of 1.0, rather than the zero returned by the original algorithm.
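
These boundary conventions are easy to express as unit-test-style assertions against the sketch methods from earlier (not against the actual LingPipe API):

   // The distance form is one minus the comparison score throughout.
   assert 1.0 - jaroWinklerCompare("", "", 0.7, 4) == 0.0;       // two empty strings match
   assert 1.0 - jaroWinklerCompare("AL", "AL", 0.7, 4) == 0.0;   // exact match
   assert 1.0 - jaroWinklerCompare("ABC", "XYZ", 0.7, 4) == 1.0; // no matching characters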

The algorithm, along with applications in record linkage, is sketched in the following highly readable survey article:

  • Winkler, William E. 2006. Overview of Record Linkage and Current Research Directions. Statistical Research Division Research Report 2006-2. United States Census Bureau.

This document provides test cases in Table 6, which are the basis for the unit tests for this class (though note that the three 0.0 results in the table do not agree with the return results of strcmp95.c or the results of this class, which matches strcmp95.c). The description of the matching procedure above is based on the actual strcmp95 code, the boundary conditions of which are not obvious from the text descriptions in the literature. An additional difference is that strcmp95, but not the algorithms in Winkler’s papers nor the algorithm in this class, provides the possibility of partial matches with similar-sounding characters (e.g. c and k).

The greedy nature of the alignment phase in the Jaro-Winkler algorithm actually prevents the optimal alignment from being found in some cases. Consider aligning ABCAWXYZ with BCAWXYZ:

cs1 A B C A W X Y Z
matches 2 0 1 - 3 4 5 6
cs2 B C A W X Y Z  

Here the first pair of A characters is matched, leading to three half transpositions (at the first three matched positions). A better scoring, though illegal, alignment would be the following, because it has the same number of matches, but no transpositions:

cs1 A B C A W X Y Z
matches - 0 1 2 3 4 5 6
cs2 B C A W X Y Z  

The illegal link is the alignment of the second A in cs1 to the A at index 2 in cs2; the greedy procedure must instead match the first A in cs1 to that position. Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply.

Acknowledgements

We’d like to thank Bill Winkler for helping us understand the versions of the algorithm and providing the strcmp95.c code as a reference implementation.

34 Responses to “Code Spelunking: Jaro-Winkler String Comparison”

  1. lingpipe Says:

    This is by far our most popular blog entry, continuing to top our daily page view stats over all contenders. It’s the number two hit (as of 27 Aug 2008) for the Google query <jaro winkler>, so that explains how people are finding it.

    What I want to know is why are you here?

    Could someone drop me (carp@alias-i.com) a line and let me know? Is it getting assigned as a classroom exercise somewhere? Did someone ask you to de-duplicate their DB of ill-edited names and you stumbled your way to our site? Are you in the market for database de-duplication or record linkage tools?

    • Antonio Says:

      Answering your question: I’m here because I develop software for small offices and I often stumble over inconsistent DBs.

      Also, thank you very much! This information’s been very helpful!
      I’ll try it in conjunction with metaphone.

      Best regards,
      Antonio

  2. Hanif Says:

    I am interested in Jaro-Winkler Distance as it is important in my research.

  3. Terence Etchells Says:

    Hi, a great page to unravel the poor explanations found elsewhere for calculating the transpositions. I am implementing the algorithm in Matlab for a commercial project I am involved in, in Liverpool, UK. It involves cleaning up raw personal data and looking for aliases in very large databases. Many thanks to the author; my code works fine now.

  4. Jim Says:

    I have been trying to implement Jaro-Winkler in Common Lisp for a bibliometric project, but couldn’t find a clear, consistent, and unambiguous description until I came across your page.

    I’ve translated your java code to common lisp: it seems to work. Thank you.

  5. Snobber Says:

    what are the weaknesses that you encountered for this algorithm? is this the best algorithm for string comparison? =)

  6. lingpipe Says:

    It’s a good algorithm for comparing single English names. You need to put Jaro-Winkler together with something else just to compare full names, much less to do database linkage, which should take into account other fields like phone numbers, addresses, ages, relations to other people, etc.

    It’s not necessarily going to work well for other languages as parameterized.

    We’ve incorporated some of its insights into our spell checker (e.g. that errors are more prone to occur word-initially).

    Better models build in likely thinkos in terms of misspellings and misunderstandings as well as models of errors. And they do it using wider contexts and less deterministically.

    There are lots of papers out there comparing various string comparison algorithms, such as Cohen et al. (2003).

  7. Snobber Says:

    any other limitations of this algorithm?

  8. lingpipe Says:

    You should really try it for yourself and see if it does what you’re expecting it to do. You could grab our Java code or follow the link to the original C code from the census.

  9. Snobber Says:

    thx for the help … i really need this for my research …
    I appreciate your kindness =)

  10. NadMan Says:

    I for one am in the market for database de-duplication or record linkage tools, do you have a recommendation?
    I prefer open source, but one can’t always get what they want.

  11. NadMan Says:

    Oh never mind, googling answered my rather stupid question. I’ll check out LingPipe.
    All this time I thought LingPipe was a screen name. :)

  12. lingpipe Says:

    We don’t offer a de-duplication/linkage application per se, but do offer some string comparison algorithms (tutorial linked at top of blog entry) that can be used as the basis of such applications.

    You could check out some of the papers from Winkler and colleagues at the census, which explain how to build de-duplication applications. Typically, these are semi-automatic.

    Scaling’s an issue; if you have millions of records, you can’t really run all-versus-all comparisons on the records. For some algorithms, such as weighted edit distance, there are good dynamic programming all-versus-all algorithms that are linear in number of records (with a constant factor bounded by minimum distance and features of the distance measures).

    For most apps, field-based comparisons on names, dates, cities, and so on, need to be combined to compare whole records.

    I don’t know of any open-source packages to do deduplication, but would be interested to hear if anyone’s found and used any successfully.

  13. NadMan Says:

    I found this one, although I haven’t tried to use it yet.

    http://datamining.anu.edu.au/projects/linkage.html

    http://sourceforge.net/projects/febrl/

    Thought you might be interested.

  14. Ivo Ugrina Says:

    I’ve been using it (Jaro-Winkler) for my project (a database of mathematical definitions, in Croatian) since 2005, and I’m very satisfied.
    I’ve made a PHP implementation (under GPLv3.0 now) which is available for download at

    http://www.iugrina.com/files/JaroWinkler/JaroWinkler.php.IUGRINA

  15. itssmee Says:

    Thank you for posting this; it has certainly helped me understand the algorithm. I am currently researching different algorithms for record linkage. In my case I am trying to link anonymous data-sets, so I can’t link on surname, given name, or in fact any name.

    I intend to play around with this algorithm and try it on my datasets, but on simple string fields which reflect a location (e.g. office building codes), do you think it would work just as well, or would you suggest another algorithm?

    Thanks again for the detail.

    • lingpipe Says:

      It’s designed for single names, not even first-name last-name pairs, and weights errors in the first few characters more heavily.

      For general phrases that you’ll post-edit by hand, I like using character n-gram comparisons.

  16. Csaba Toth Says:

    I deal with record linkage as a part of my work. I use Java right now and I wanted to reproduce some results which were calculated by a Perl source code; it used Text::JaroWinkler::strcomp95. First I used SimMetrics’s JaroWinkler implementation, and the results are very different. I dug into the source code and realized, as you write, that some implementations are pretty different.
    As far as I could see there are three internal parameters/constants which are used by Jaro-Winkler:
    – weight threshold (SimMetrics terminology: ?)
    – num chars (SimMetrics terminology: MINPREFIXTESTLENGTH)
    – PREFIXADUSTMENTSCALE

    num chars is 4 in case of the original JaroWinkler, 6 in SimMetrics.
    PREFIXADUSTMENTSCALE is 0.1 in every implementation AFAICS.
    I cannot see the weight threshold in SimMetrics (I haven’t taken a really thorough look, though).

    My problem is that I still cannot get the same result as the Perl package. Probably I have to look at that too.

    It’s interesting that these parameters can create different algorithms!

    • lingpipe Says:

      strcmp95, as I link to in the post, is Jaro and Winkler’s own implementation, so I presume it’s the “right” one. You should check whether that gives you the same results. You may need to figure out what the parameters are.

      As I also point out, the SimMetrics and SecondString implementations don’t match strcmp95.

      I designed the LingPipe implementation to match strcmp95 and unit-tested it pretty extensively. If you’re getting different results, first verify that the params are the same, and then please let me know where they diverge.

  17. programming blog Says:

    well done tutorial. I am looking for the same…

  18. Davide Says:

    hi, i found this useful post with google.

    i have a question:
    i have to check that a name + surname is not into a list (anti-terrorism list).

    since you said JW is not optimized for a task like that (searching for name + surname), have you got anything that can help me?

    thanks in advance!

    • lingpipe Says:

      Jaro-Winkler was aimed at comparing single names. If you really do have first-name plus surname, you could compare both independently using JW.

      There are lots of problems in looking for terrorists. One is variant spellings of aliases, which is rather common. Foreign names often get misclassified as to given name plus family name (especially Chinese). And then there are problems with patronymics and locations. For instance, Saddam Hussein’s son Uday is often referred to as “Uday Saddam Hussein al-Tikriti”, with a link to the father (Saddam) and to location (Tikriti).

      You also get problems with diacritics and other marks. For instance, “al Tikriti” might be realized as “al-Tikrītī”.

      I don’t know of any method that has good sensitivity and specificity and is based just on names (as opposed to having other information like a social network or context in which it was used).

      Using character n-gram comparisons, perhaps after normalizing diacritics (e.g. by using a toolkit like ICU), can give you reasonable sensitivity if large parts of names aren’t dropped. But it won’t distinguish close names (e.g. “Jen” vs. “Jan”) from typos (“John” vs. “Jhon”) from variants. It’s really the problem that one letter sometimes makes a difference and sometimes you can drop whole phrases.

      Other people have tried to mix in token-level effects with character-level effects, making it costlier to drop sequences of letters within a name than a whole token in a name.

      P.S. I wouldn’t start blog responses so generically — there’s lately been a rash of “I found this post on search engine X” spam!

      • Davide Says:

        ok, now i’ll start this reply with:
        whoa, thanks a lot for taking time answering me :)

        i *shouldn’t* have problems with diacritics and stuff like that since i work on normalized lists that come from official organizations, and those include the aliases, too.
        Plus i work with a “standard” alphabet (italian) so i think there won’t be problems there, either.
        For typos.. well my users will have to pay attention :)

        About merging more algorithms:
        i have tried something like that, but i have a question: does this amplify or reduce my possible error?

        i have tried this:
        s1 = “SUKHARENKA STSIAPAN MIKALAEVICH”
        s2 = “SUKHORENKO NIKOLAEVICH STEPAN ”

        (yeah there are names like that -.-)

        doJaro(getSoundex(s1),getSoundex(s2)) => 0,8962
        doJaro(getCaverphone(s1),getCaverphone(s2)) => 0,7333

        what do you think about?
        thanks man!

      • lingpipe Says:

        Using pronunciation can help with recall. It’ll hurt precision unless you go with something much tighter than soundex. There are varying transliteration standards and names in English are pronounced differently depending on the source language. The more general way to deal with pronunciation is by training a corpus-specific transducer, as described in Toutanova and Moore’s 2002 ACL paper, Pronunciation Modeling for Improved Spelling Correction.

        All of this stuff works best in a search context. Or at least something that has an editorial stage for the close cases that can’t be assessed automatically. The real problem is that if you’re off by one letter, it can be a different name in some contexts and a typo or transliteration variation in another.

  19. moath Says:

    hi
    i understood the algorithm here better than from Wikipedia
    thank you so much

  20. Joerg Says:

    To answer your question, I stumbled upon the post researching record linkage. I need to perform a merge/purge of US contact information against our database. What makes sense to me is to first pre-process address data, ensuring the zip is valid and corresponds to the state and city. To compare first and last names it seems Jaro-Winkler is as good a choice as any. But how to handle the street address line(s)? I’m wavering between a hybrid approach such as a soft TF/IDF (did you ever write a follow-up post on that?) or a straight n-gram. The n-gram approach is attractive due to its simplicity, perhaps combined with some ad hoc rules that require two addresses to have the same set of numeric tokens if they are to match (i.e. ‘Apt 1 50 E4th’ is a possible match to ‘Apartment 1 50 4th Avenue East’ but not to ‘Apt1 51 E4th’).

    • Bob Carpenter Says:

      Addresses are a different kind of problem than typos in names. For one thing, they have structure. The example you give has an apartment number and a street address. For an address in my neighborhood, like 12 East 12th St, you get a ton of alternation at the token level, like “12” versus “Twelve”, “E” versus “East”, “12” versus “12th” versus “Twelfth”, and “St” versus “Street”. You really want a system that knows these kinds of things.

      I like the basic intuition, but I never really understood soft TF/IDF. So I never implemented it in LingPipe.

      There’s a neat paper by McCallum, Bellare and Pereira that I think approaches this kind of problem the right way. There’s a link and a description in a previous blog entry.

      This would allow you to do something like soft TF/IDF, only with a probabilistic basis and arbitrary generality for features. It would also let you emphasize numbers. But you’d need a training set to estimate the parameters.

      As a simple approximation, you could build a logistic regression classifier for match/non-match; that’d let you do lots of the same sorts of things. LingPipe does do that. But again, it’d need training data and you’d have to engineer a feature extractor.

  21. Joerg Says:

    Thank you for the feedback and paper pointer. I can see where it is going but it may take a bit before I get there.

    You pointed out the ’12’ vs ‘twelve’ issue in the case of addresses. That made me realize that first replacing all written-out numbers with digits in a pre-processing step would help a lot.

    The more I think about it, the more it seems it should be possible to find something ready-made for US addresses; it is not an uncommon problem.

  22. Xavier Says:

    Hello,

    I’ve read your article and it has helped me a great deal understanding the Jaro-Winkler algorithm.

    In the third point, I’m a bit confused about your pseudo code, shouldn’t it be:

    jaroWinklerCompare(cs1,cs2,boostThreshold,prefixSize)
    = jaroCompare(cs1,cs2) <= boostThreshold
    ? jaroCompare(cs1,cs2)
    : jaroCompare(cs1,cs2)
    + 0.1 * prefixMatch(cs1,cs2,prefixSize)
    * (1.0 - jaroCompare(cs1,cs2))

    Furthermore I'd like to point out a few mistakes in the MARTHA example you have provided:

    – The length of MARTHA is 6, not 5
    – in the Martha example: 1/3 * (5/5 + 5/5 + (5 – 1)/5) = 0.933 and not 0.944
    – still in the MARTHA example: it should be 0.944 + 0.1 * 3 * (1.0 – 0.944) if you want to obtain 0.961

    • Bob Carpenter Says:

      Thanks so much. You’re absolutely right. And this is especially important to us, because it’s reflected in our javadoc.

      The outcome was right, not the text. That should’ve been 1/3 * (6/6 + 6/6 + (6-1)/6) = 0.9444. I also changed the other length-related calculations for MARTHA — the max distance is 2.

  23. Lars Marius Garshol Says:

    I’m implementing Jaro-Winkler for my own deduplication engine and found your explanations and links very helpful.

    However, as far as I can tell, you are omitting two of Winkler’s modifications. The first one you note: his notion of similar-sounding characters is briefly referred to in your implementation notes, but not explained above.

    The second one I can find no mention of: the adjustment for longer strings. In Winkler’s source code it’s implemented in the last four lines before the return (two ifs and two lines of weight += …). Yancey 2005 has an explanation of this.

    And, finally, I’m confused by Winkler’s test cases. How can “HARDIN” and “MARTINEZ” have a Jaro distance of 0? As far as I can see, common characters is 4 and transpositions is 0. That gives 0.722. Similarly, for “ITMAN” and “SMITH” I get 0.467, not 0. I compiled your code and find that it gives the same result. Any ideas?

    • Bob Carpenter Says:

      I don’t know the papers to which you’re referring. I basically reverse engineered what the algorithm was doing from the code.

      There’s a huge amount of latitude to mess around with these kinds of ad hoc edit-distance-like approaches.

      I’d suggest getting some data with true answers labeled and then measuring different approaches. Jaro-Winkler only really works well for single person names — it’s not even designed for forename+surname combos, because there’s no token sensitivity.

  24. Lars Marius Garshol Says:

    Interesting. My code with the Jaro comparator + the prefix adjustment matches your code exactly, and also matches Winkler’s test cases exactly (except for the HARDIN and ITMAN rows). But Winkler’s source code includes two more adjustments (as mentioned in my previous comment), and these are faithfully described in Yancey 2005, but not included in Winkler’s 2006 test cases. What gives?

  25. Dominick Says:

    Hello, I could be wrong, but I believe the “JONES and JOHNSON” example has a typo. The matching S’s have the wrong index. Right now you have the index labeled as 5, but shouldn’t it be 4? The S in ‘JONES’ is in the 4th position, and the S in ‘JOHNSON’ is in the 4th position.

    I’m still trying to make sense of this algorithm, so forgive me if I’ve made a mistake.
