Where’s Georgia? DB Linkage isn’t Easy

In case you didn’t see the link on Slashdot, Google was supplying maps of the American state of Georgia when they should’ve been linking to the country of Georgia in the Caucasus. As Homer (the Simpson, not the classical Greek poet, the Alaskan city, the Illinois city, the American painter, or the tunnel in New Zealand) would say, “D’oh!”

Slashdot was just picking up the story from Valleywag. Google didn’t take this lying down; the current page shows a map centered on Vienna which, if you click on it, takes you to New York, saying “U.N.”

The problem is that linking textual mentions to database entries is non-trivial, even for the relatively simple problem of geo-location. This is the business Metacarta is in, and users we’ve talked to say they do a very good job of it.

You see similar problems for products, as in ShopWiki or the app formerly known as Froogle; the Slashdot story pointed out the case of yet another search engine mismatching product photos.

We ran into this problem trying to find rap artists in text, who have names like “The Game.” And we’re battling the problem in our ongoing NIH project on linking gene and protein mentions in text to Entrez-Gene.

I’ve noted before that the NY Times site runs into problems in cases such as distinguishing the Pittsburgh suburb named Mount Lebanon from the Middle Eastern country of Lebanon. In general, text matching doesn’t work very well by itself. You get false positives if you link the Pittsburgh suburb to the Middle East, but you get false negatives if you just don’t link “Mount Lebanon”. But the Times has full editorial control, so they can catch these problems and fix them manually during copy editing.
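To make the false positive concrete, here is a minimal sketch of a plain dictionary matcher in Python. The gazetteer, its entry IDs, and the function name `link_by_dictionary` are all made up for illustration; this is not how the Times or anyone else actually does it.

```python
import re

# Hypothetical gazetteer mapping surface strings to database entry IDs.
# Both the keys and the IDs are invented for this example.
GAZETTEER = {
    "lebanon": "country:Lebanon",
    "georgia": "country:Georgia",
}

def link_by_dictionary(text, gazetteer=GAZETTEER):
    """Link every word that exactly matches a gazetteer key, ignoring context."""
    links = []
    for match in re.finditer(r"[A-Za-z]+", text):
        entry = gazetteer.get(match.group().lower())
        if entry is not None:
            links.append((match.group(), entry))
    return links

suburb = "The school board in Mount Lebanon, a Pittsburgh suburb, met Tuesday."
# The matcher has no way to know this "Lebanon" is in Pennsylvania.
print(link_by_dictionary(suburb))
```

The suburb gets linked to the country, which is the false positive. Blocking the phrase “Mount Lebanon” outright would trade that for a false negative on any text where the phrase really does refer to a place in Lebanon.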

To solve this problem automatically with better results than plain text matching, we need context, which is available on the web in the form of both texts and links. We discuss this in our clustering tutorial, which uses context to disambiguate multiple John Smiths in the news, and in our white paper on gene linkage. This basically reduces the linkage problem to that of word sense disambiguation. The only real problem (besides the remaining difficult-to-disambiguate cases) is that using context is relatively storage and compute intensive compared to a simple dictionary matcher.
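As a rough sketch of what using context can look like, the Python below scores each candidate database entry by the cosine similarity between the words around the mention and a short description of the entry. The candidate names, their descriptions, and the function names are assumptions made up for this example, and bag-of-words cosine is just one simple stand-in for the context models discussed in the tutorial and white paper.

```python
import math
import re
from collections import Counter

# Hypothetical candidate entries with made-up descriptions, standing in
# for whatever text or links a real database would supply as context.
CANDIDATES = {
    "Georgia (U.S. state)": "state in the southeastern United States capital Atlanta",
    "Georgia (country)": "country in the Caucasus capital Tbilisi bordering Russia",
}

def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(context, candidates=CANDIDATES):
    """Return the candidate whose description best matches the mention's context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda name: cosine(ctx, bag_of_words(candidates[name])))

print(disambiguate("Russian troops moved toward Tbilisi, the capital of Georgia"))
print(disambiguate("Flights from Atlanta to other cities in the state of Georgia"))
```

Even this toy version shows where the cost comes from: it needs a stored description for every entry and a vector comparison per candidate per mention, where the dictionary matcher needs only a single lookup.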
