IBM’s Watson and the State of NLP


Aditya Kalyanpur presented an overview of the Jeopardy!-winning Watson computer system on June 6 in New York at the New York Semantic Web Meetup. I was asked to present a three-minute overview of the state of Natural Language Processing (NLP). In this post I want to couch the Watson system in the context of the state of the art, since it didn’t make sense to do so at the meetup because I presented first.

The State of NLP According to Breck

Conveying the state of the art in three minutes is quite a challenge, so let’s run with an analogy to aviation for ease of comprehension. So where is NLP?

It Flies!

We have achieved the analog of basic powered flight. No doubt.

Yikes! and away

But in no sense have we gotten to this level of performance.

Amelia Earhart's Lockheed Vega

My best guess is that we are at the point of a reasonable commercial foundation as an industry with some changes to come that we don’t know about yet, not unlike aviation in the mid 1920’s. Perhaps the beginning of the Golden Age of NLP.

And in no sense are we in the reliable, high technology commercial space that modern air transport provides.

Boeing 777

Where Does Watson Fit in the Analogy?

Watson fits perfectly in the example of the red 1928 Lockheed Vega above for the following reasons:

  • The Vega is actually Amelia Earhart’s plane, used to break records (crossing the Atlantic solo) and generate publicity; it was a stunning success for a nascent industry.
  • While inspirational, the Vega’s success had little to do with advancing the underlying technology. What would I consider an advancement of technology? Frank Whittle patented the turbojet in 1930.
  • Watson shows how a 20-person team working 4 years can win a very challenging game with skill, effort and daring, much in the same way that aviation records were broken with the same. Don’t think some careers were not on the line with the Watson effort. I think IBM ceased termination by firing squad in the 70’s, so Earhart had more on the line. But what are the prospects of a mid-level ex-IBM exec in today’s economy? Perhaps the firing squad would be a kindness.

But Watson is Playing a Game

There is one issue that seriously concerns me: Watson won a question answering game with the trivial twist that the answers must be phrased as questions. So the clue “First President of the US” is answered with “Who is George Washington?” But Watson is not a general-purpose question answering system. What is the difference?

Another analogy: the game of chess is based on medieval battles, but even though Deep Blue beats the best human players, no one would consider using Deep Blue to manage an actual battle. Real war is messy, approximate and without clear rules, which makes chess algorithms totally inappropriate.

Real-world question answering has similar qualities to real war: messy, approximate and without clear rules. The game of Jeopardy! is based on the existence of a unique, easily understood and verified answer given the clue. Taking one of the examples from the talk:

In 1698, this comet discoverer took a ship called the Paramour Pink on the first purely scientific sea voyage

The correct “question” is “Who is Edmond Halley?” of Halley’s Comet fame. The example is used to work through an impressive system diagram that resembles a well-developed model train set (thanks to Prof. Mark Steedman for the simile). Much is done to generate the correct answer while avoiding distractors like Peter Sellers from the Pink Panther movies. But run the same clue past Google with “-watson -jeopardy” appended to eliminate pages that discuss this publicized example, and the first result is a page on Halley’s Comet stamps whose first sentence mentions the correct answer.
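The exclusion trick above is just string manipulation on the query; a minimal sketch (the `build_query` helper is hypothetical, not part of any search API) looks like this:

```python
def build_query(clue, exclude=("watson", "jeopardy")):
    """Append minus-terms so pages discussing the publicized example
    are dropped from the search results."""
    minus = " ".join("-" + term for term in exclude)
    return clue + " " + minus

clue = ("In 1698, this comet discoverer took a ship called the "
        "Paramour Pink on the first purely scientific sea voyage")
query = build_query(clue)
# query now ends with "-watson -jeopardy"
```

Any search engine that honors minus-term exclusion will then surface pages about the underlying fact rather than pages about the Watson demo itself.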

There is still an impressive amount of work in extracting the correct name, but the answer was there to be found precisely because it is a game: unambiguous, well known and well selected given the clue.

What Does Real-World Question Answering Look Like?

What kinds of questions have I approached a search engine with?

What is the current 30 year FHA mortgage rate?

This question is a disaster from the uniqueness-of-answer perspective. My initial search results were fairly low quality and did not provide the accurate rate information I knew the answer to be.

When is it best to ski in Chile?

This went better. There was a FAQ on the first page of results, but the answer just went on and on: “The season runs from mid-June to mid-October. Although every year is different, and it comes down to Mother Nature, the best time for dry powder is mid-June, July, August, and up to the 2nd week in September. After that,….” Again we have a non-unique answer because my question was not that specific in the first place.

What is the Reputation of LingPipe?

This is a question that a group of Columbia MBA students took on for us in their small business program, which, by the way, I recommend.

This question was hopeless in search because there is no page out there waiting to be found with our reputation nicely summarized. Answering the question requires distillation across many resources, even if the information were restricted to the web only.

Welcome to the real world; question answering is hell.

Where Might Watson Flourish Outside of Jeopardy! Tournaments?

Jeopardy! is a game of finding the uniquely obvious given indirect clues. Otherwise it is not a game that can be judged and played. What else in the world has this quality? The Watson team is now approaching medical diagnosis, which is a real-world use case that might match the Jeopardy! game format, with symptoms as clues and diagnosis as the answer. Uniqueness is not guaranteed in diagnosis, but Watson can handle multiple answers. This is an area where computer systems from the 1970’s, e.g. MYCIN, outperformed experts, but they didn’t have an NLP component. Medical diagnosis, once symptoms are recognized, is a game-like problem.
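The game-like structure of diagnosis (clues in, ranked candidates out) can be caricatured in a few lines. This is a hypothetical toy scorer for illustration only, not MYCIN’s certainty-factor calculus and certainly not Watson:

```python
# Toy symptom-overlap scorer: rank candidate diagnoses by the fraction
# of each disease's characteristic signs that the observed symptoms
# cover. Disease/sign lists are made-up examples.
DISEASES = {
    "influenza": {"fever", "cough", "fatigue", "aches"},
    "strep throat": {"fever", "sore throat", "swollen lymph nodes"},
    "common cold": {"cough", "sore throat", "runny nose"},
}

def rank_diagnoses(symptoms):
    symptoms = set(symptoms)
    scored = [(len(symptoms & signs) / len(signs), name)
              for name, signs in DISEASES.items()]
    # Highest fraction of explained signs first.
    return sorted(scored, reverse=True)

print(rank_diagnoses({"fever", "cough", "fatigue"}))
```

The point of the sketch is the shape of the problem, not the scoring rule: a closed candidate set and a verifiable mapping from clues to answers is exactly what makes the task game-like.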

In the end Watson is an engineering achievement, but in no way have the skills of a good reference librarian been replicated.

I came across an interesting article by Michael Lind on information technology and its role in productivity while writing this blog post. Interestingly, he puts information technology in the same time bracket as I do.


3 Responses to “IBM’s Watson and the State of NLP”

  1. John Lehmann Says:

    Great analogies!

    I also noted that Jeopardy!’s rich keyword sets result in correct “answers” being highly ranked in search engines. Even less realistic are the Jeopardy! “questions” which describe two of the correct answer’s senses. Those are nice problems but aren’t realistic for automatic Q/A (“This plain-weave, sheer fabric made with tightly twisted yarn is also used to describe a pie or a cake”).

    That being said, I don’t want to diminish IBM’s fantastic accomplishment, and I applaud the progress they made in this sort of “first flight”.

  2. Bob Carpenter Says:

    As I’ve said in other posts on other blogs and to many people in person, Watson’s very impressive as a train set even if you know how train sets work.

    0. I’m a bit more optimistic than Breck that it’s getting close to some of the kinds of questions people have. Part of the issue is whether the system can learn when it’s going to be wrong in a more heterogeneous question-answering environment. So far, NLP technologies are not very good at that (look, for instance, at the number of false positives from relatively easy problems like web query spell checking).

    1. Going back to Deep Blue (IBM’s chess-playing program), you could say the same thing about it in comparison to a human. IBM didn’t replicate a human chess player, they built something different with different abilities. Most specifically, the ability to beat a human at a game that seemed, at least until the point of IBM’s entry, to require intelligence.

    I’d say that in many ways Watson is better than a reference librarian. If you just allowed Watson to show you the source of its answer, it’d be more believable still.

    In fact, plain old search engines like Bing or Google work better than generic reference librarians for the kinds of queries I make (e.g., [hamiltonian monte carlo convergence rate]).

    2. The analogy to flight reminds me of Hynek Hermansky’s 1998 ICSLP talk. About that time, the funding agencies in the US and Europe were declaring speech a “solved problem”. Hynek put up images ranging from hot air balloons (perhaps a better analogy to where we’re at now) through to jets, and asked when the flight problem was solved, and if it was solved, why research into flight was such a large part of DARPA and other funding agencies’ research budgets.

    3. Temporal questions are a mess if you can’t add a specific date.

  3. Help Build a Watson Clone–Participants Sought for LingPipe Code Camp | LingPipe Blog Says:

    […] have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with […]
