Google Squared: General Entity and Relation Extraction at a Web Scale

Wow!

When I came into the office yesterday, Mike was all revved up about Google Squared. If you’re not a natural language processing or search geek, you might not realize just how amazing this application is.

What is Google Squared?

Google Squared does arbitrary entity extraction and classification, relation extraction, and relation clustering and labeling, all in a very nice spreadsheet-like AJAX interface. It starts with the same old empty query box. Only now you type in a class of entities, such as <united states national parks>. What you get back is a table of results.

Each row represents an instance of the entity type, here a U.S. national park. The parks include Yellowstone, Yosemite and the Grand Canyon, so the most famous ones definitely show up in the list, along with oddball parks like the Channel Islands. The first column is the name. Just extracting a list of names of national parks given the entity type is a very hard problem to solve for arbitrary inputs at web scale.

The other columns are information about the entity in question, or at least that’s the idea. First, there’s a picture, which shows up in every “square” I’ve seen. For national parks, Google extracted three columns, “Nearest City”, “Established” and “Rooms”. So even for a softball query like this one, they’re pulling out oddball results like number of rooms. The nearest cities are good, though they point to a huge problem with this technology: granularity. What we really want is the “closest city we’ve heard of”, not the town of 10 people right on its doorstep. You can also see the lack of uniformity of results, with some cities being listed as just city names (e.g. “Santa Barbara”), and some with states (e.g. “San Francisco, California”).

One of the first things I asked is “what if there are multiple fillers?” (we were looking at baseball players and a column corresponding to their teams). Google’s got you covered, allowing multiple fillers (e.g. “Fredonia, Arizona (North Rim) and Grand Canyon, Arizona”). If you just threshold here, it’s easy to look stupid, such as by pulling out two names for the same team.
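Here’s a minimal Python sketch of why plain thresholding looks stupid and what merging aliases buys you. None of this is Google’s actual method; the function name `select_fillers`, the candidate scores, the 0.5 cutoff and the alias table are all made-up assumptions for illustration.

```python
def select_fillers(candidates, threshold=0.5, aliases=None):
    """Keep every candidate filler scoring at or above the threshold.

    If an alias table is given, variant names are collapsed to one
    canonical name first, keeping the best score for each.
    """
    aliases = aliases or {}
    best = {}
    for name, score in candidates:
        canonical = aliases.get(name, name)
        # Remember the highest score seen for each canonical name.
        if score > best.get(canonical, float("-inf")):
            best[canonical] = score
    kept = [(name, score) for name, score in best.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scored candidates for a baseball player's "team" column.
candidates = [
    ("New York Yankees", 0.9),
    ("Yankees", 0.8),
    ("Boston Red Sox", 0.6),
    ("Mets", 0.2),
]
aliases = {"Yankees": "New York Yankees"}  # hand-built, purely illustrative

naive = select_fillers(candidates)                    # threshold only
merged = select_fillers(candidates, aliases=aliases)  # threshold + alias merge
```

With the threshold alone, `naive` keeps both “New York Yankees” and “Yankees”, which is exactly the two-names-for-one-team embarrassment. With the alias table, `merged` keeps one Yankees entry plus the Red Sox. Of course, building that alias table automatically is the hard part.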

The “established” column makes sense, and it pulls out dates. I’m guessing there are some canonical entity types, such as dates and locations, that it’s looking for in a more structured way (rather than just pulling out arbitrary relations). Given their existing technology, they can geolocate a term and then use nearness on a map to pull out things like nearest cities. But how do they title it?
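The geolocation guess can be sketched in a few lines of Python. This is only my guess at the general idea, not anything Google has described: it picks the closest city by great-circle (haversine) distance, and a population floor stands in for “closest city we’ve heard of”. The place names, coordinates and populations are invented, and `nearest_city` and `min_population` are hypothetical names.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_city(lat, lon, cities, min_population=0):
    """Return the closest city with at least min_population people, or None."""
    eligible = [c for c in cities if c["population"] >= min_population]
    if not eligible:
        return None
    return min(eligible, key=lambda c: haversine_km(lat, lon, c["lat"], c["lon"]))


# Invented data: a park, a tiny town on its doorstep, and a bigger city farther off.
park = (36.0, -112.0)
cities = [
    {"name": "Tinytown", "lat": 36.1, "lon": -112.1, "population": 10},
    {"name": "Bigcity", "lat": 35.2, "lon": -111.6, "population": 60000},
]

doorstep = nearest_city(*park, cities)
famous = nearest_city(*park, cities, min_population=1000)
```

Without the floor you get the town of 10 people; with it you get the city a reader might recognize. Picking the floor is the granularity problem all over again, since the right cutoff depends on how remote the park is.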

The column for “rooms” is clearly an error. They’re confusing hotels with the park itself, which is, again, frighteningly easy to do on this kind of task. The amazing thing is that the whole chart’s not junk, not that there are some errors.

Expand the Chart

After it suggests columns for relations, you can delete them and add your own. So I tried “elevation”, which got mixed results, ranging from “a steep 1,100 foot climb” for Yellowstone, to a number “10500” with no units for Yosemite, to the answer “No” for Golden Gate Recreation Area, which has a zen-like ring of truth (it’s on the coast). The column “attraction” gets no useful results (I was hoping for “Old Faithful”, etc.). If I add “state”, it totally nails it. “Season” gets one useful result, and “Open” some times. But it’s hardly something I could use for trip planning.
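To see how messy those “elevation” fillers are, here’s a small Python sketch that tries to normalize them into a number and a unit. It’s an illustration under my own assumptions, not a description of what Google Squared does; `parse_elevation` and its regular expression are hypothetical.

```python
import re

# A number with optional thousands separators and decimals,
# optionally followed by a unit of length.
_NUMBER_WITH_UNIT = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(feet|foot|ft|meters|metres|m)?\b",
    re.IGNORECASE,
)


def parse_elevation(filler):
    """Return (value, unit) from a raw filler string, or None if no number.

    unit is "ft", "m", or None when the filler gives no unit.
    """
    match = _NUMBER_WITH_UNIT.search(filler)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2)
    if unit is None:
        return (value, None)
    unit = "ft" if unit.lower() in {"feet", "foot", "ft"} else "m"
    return (value, unit)


# The three fillers quoted above.
results = [parse_elevation(s) for s in ("a steep 1,100 foot climb", "10500", "No")]
```

The three fillers come back as `(1100.0, "ft")`, `(10500.0, None)` and `None`. Even a clean parse doesn’t help much: the first number is the height of a climb, not the park’s elevation, and the second still has no unit. Surface normalization can’t fix a relation that was extracted wrong in the first place.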

The table also provides suggested further columns, here “Children”, “Latitude”, “Longitude”, and “Local Climate”. Hmm. The column “Children” contained odd results like “0 1 2 3” and “0, 1, 2, 3, 4”.

You can also add more rows. So I tried “St. John” (U.S. Virgin Islands, with one of my fave parks), but wound up getting Acadia National Park (another incredible Rockefeller bequest). So I binged the real name using the query <st john national park> and found that its name is “Virgin Islands National Park”. I start typing “Vir..” and it autocompletes for me. Did I say this app was super-duper cool? I don’t know about “Charlotte Amalie” as the nearest city, though: it’s the capital of the Virgin Islands, but a quick bing for <st john cities virgin islands> returns a top hit with a snippet beginning with “Saint John: Largest city: Cruz Bay (2,743)…” Did I mention that Bing does a good job with those snippets?

Yes, but Does it Work for Genes?

Whew. At least we’re not out of a job (yet). If I type in “gene” as the type, I indeed get some genes, but what’s that “MHC class I” doing in row two? Google Squared introduces columns for “OMIM” (Online Mendelian Inheritance in Man database, which lists genes, known mutations, and phenotypes), “Uniprot” (the modern union of Swiss-Prot and TrEMBL, listing known and hypothesized proteins), and “Symbol”. The symbols were only of mediocre quality if what they wanted was the Entrez official symbol for a gene. I’m not sure why it proposed the column “Uniprot”, because it didn’t find a single value.

It’s easy to add rows for your favorite gene (aka YFG, the geneticists’ name for a dummy variable, which an English-speaking computer scientist would call “foo”). Sheesh, typing “YFG” just for laughs gives me an image of a topless woman tuning a TV? If I use “your favorite gene”, I get references to search engines that’ll search for it for you. It does a good job pulling back genes by official name. But it’s as confused as we are by families (e.g. it pulls back “auts-2” just fine, but is confused by the family “auts”; these are autism-related genes).

What’s interesting here is that it’s pulling information from all sorts of sources ranging from Wikipedia to GeneCards.

So the $1M question for us is whether it can list the relevant facets of a gene. For instance, I want to know the proteins it produces. No luck with “protein” or “product”. What about its position? Nope, adding columns like “position” or “location” returns amusing fillers like “Jerusalem”, and their suggested “start” also provides meaningless fillers.

The column “function” is much better, but it’s hardly like using Entrez. You get “tumor suppressor” for p53, and “signal transduction” for insulin.

I had no luck at all trying to find interactions (“interaction”, “regulation”, “methylation”, “binds”, etc.), which are all very nicely faceted in the Entrez database.

Some Nerve

It takes some nerve to roll out a technology as brittle as this. The enormity of the task and its difficulty are awe-inspiring. The labbers did a great job implementing this. The real question is: will the civilians be as impressed as I am, and more importantly, will they find use cases for it? Given the quality, I still can’t think of anywhere I’d use this over plain-old search. I can imagine some application where I need to discover members of some class I don’t already know.

Ironically, I think the real competition here is Wikipedia. It’s the old manual-labor versus automation battle, but with crowdsourcing on the manual side versus natural language processing for automation. For instance, check out Wikipedia’s National Park Service entry.

3 Responses to “Google Squared: General Entity and Relation Extraction at a Web Scale”

  1. lingpipe Says:

    There’s an about page for Google Squared hidden as a highly customized Easter egg:

       Magpie Team Squared

    I had the odd experience of going to Jeff Reynar’s place for dinner the day after I posted this. Jeff told me that Google Squared was the “top secret internal startup” he’d been working on before he quit Google.

    His story checks out. If you add two additional batches of suggestions, Jeff appears, with a picture, under the heading “manager emeritus, thought provoker and search engine wizard”.

  2. Nair Satheesh Says:

    Google Squared appears to be similar to my patent application:

    Frankly, I am getting a Déjà vu effect while going through the “Google Squared” application because it appears to be very similar in function to my United States patent application which was filed on April 12, 2007 and as publicly disclosed by the United States Patent and Trademark Office on October 16, 2008, when the patent application was published.

    My patent application is titled as “Method And System For Research Using Computer Based Simultaneous Comparison And Contrasting Of A Multiplicity Of Subjects Having Specific Attributes Within Specific Contexts” bearing Document Number “20080256023” and Inventor name “Nair Satheesh” which may be viewed at http://patft.uspto.gov/ upon Patent Applications: Quick Search.

    Google Squared appears to be using at least some if not many of the same methods and systems as set forth by me more than two years ago in my patent application. In fact there are many more methods and systems disclosed in my patent application which I believe will help resolve certain inaccuracies found in current Google Squared application.

    I have issued legal notices to Google through my Patent Attorney in the US but Google has not responded yet to any of my notices.

  3. lingpipe Says:

    @Nair You may be interested in my blog entry on the “scientific zeitgeist”.

    There’s lots and lots of previous work in attribute extraction that appeared in the scientific literature before Google Squared. Here’s one from a distinguished group of authors in 2007, some of whom work for Google:

    Bellare et al. 2007. Lightly supervised attribute extraction. NIPS.

    In fact, one of the earlier papers is from 1998 and authored by Sergey Brin, a co-founder of Google:

    Sergey Brin. 1998. Extracting patterns and relations from the world wide web. WWW and DB.
