Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

Code Camp is canceled. We are late on delivering a LingPipe recipes book to our publisher, and that will have to be our project for March. We could still have testers/reviewers come and hang out, though that would be less fun, I think.

Apologies.
Breck

—————————

 

Dates: To be absolutely clear, the dates are 3/2/2014 to 3/31/2014 in Driggs, Idaho. You can work remotely, and we will be doing setup work beforehand as well.

New: We have set up a GitHub repository. The URL is https://github.com/watsonclone/jeopardy-.git

————————

Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City, Utah; this year it is Driggs, Idaho, for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project is selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion-dollar effort. So how about assembling a scrappy bunch of developers and seeing how close we can come to recreating the Jeopardy beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. House has 3 bedrooms if you want to come stay. Prior arrangements must be made. House is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL, so maybe we use that license, or we could just Apache-license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We had a very fun time last year. We worked very hard on an email app that didn’t quite get launched, but the lesson learned was to start with the project defined.

If you are interested in participating, let us know at watsonclone@lingpipe.com. Tell us your background, what you want to do, whether you expect to stay with us, etc. No visa situations, please, and we don’t have any funds to support folks. Obviously we have limited space, physically and mentally, so we may say no, but we will do our best to be inclusive. Step 1–transcribe some Jeopardy shows.

Ask questions in the comments so all can benefit. Please check the comments before asking questions. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….

Breck

15 Responses to “Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp”

  1. breckbaldwin Says:

    Patrick was having problems posting so I am posting for him.

    The comment is as follows:

    *****
    Check out: http://www.j-archive.com/, 252,583 clues and counting!
    (with answers)
    *****

    Thinking that would be more efficient than transcribing the
    clues/questions.

    Hope you are having a great weekend!

    Patrick

    PS: I think the challenge sounds right on!

  2. webnetworkz Says:

    Crazy? No, this is perhaps one of the most inspirational Events I have heard of in a long time. Cheers to your future success!

  3. Mark Keller Says:

    Crazy? I think this is one of the most inspirational events I have heard of in a long time… Hats off to the Code Camp!

  4. Jack Park Says:

    As a developer of a similar open source project, (SolrSherlock) also Apache licensed, I am most eager to see your project progress. There is a very good project already in place: OAQA http://oaqa.github.io/ ; The field of Open DeepQA is starting to bloom! Perhaps it is time to create a portal around Open DeepQA?

    Still my question would be this: if your project is ostensibly to create a Watson-like agent on top of an AGPL-licensed core, what is the purpose of using the Apache license for the agent itself?

    • breckbaldwin Says:

      I’ll answer the license question first. I chose the Apache license to maximize attractiveness for contributors. LingPipe follows the dual licensing strategy of MongoDB and others, which is a fiscal reality for us. How much of the core is LingPipe depends on how development goes.

      SolrSherlock and OAQA look interesting. Personally I am not a big framework fan but I can be educated. Back in the day we wrapped some LingPipe components in UIMA and it was a huge pain.

      I would like to keep focused on lighter weight approaches which hopefully will merit conversion to more enterprise friendly approaches. At this point I am planning on picking an approachable class of Jeopardy! clues and seeing if we can do a reasonable job on it. If that works then we can consider how to scale which may well involve a framework like OAQA. Or I could be wrong and we should adopt OAQA now.

      • Jack Park Says:

        Good answer, especially if the Alias-i royalty-free license covers every component of the LingPipe stack necessary to play in this sandbox.

        It’s not clear to me what “big framework” means unless by that you mean that, say, the OAQA platform builds on layers of underlying code. They do create a kind of “framework” in that sense. SolrSherlock, which has been morphing into OpenSherlock since I migrated from Solr to ElasticSearch, is presently being prototyped more as a “society of agents”. But there is an AgentFramework, a kind of glue that binds, much in the same way that UIMA does.

        I’d vote for rolling your own; we will all learn from each other. The more experiments, the better. This field seems huge and ready for fresh ideas building on the contributions of others.

      • Bob Carpenter Says:

        I think the main reason wrapping in UIMA was a pain was that it was very heavy for what it did, requiring a ton of configuration just to get a tokenizer working. With Sasha Caskey (an IBM researcher at the time) sitting next to me, it was pretty easy to build wrappers for LingPipe tools in UIMA; it only took a couple of hours to do taggers and entity extractors. But then UIMA changed, so the simple interfaces we had became stale.

        The reason we never went anywhere with UIMA is that none of our customers ever cared about the glue layer. In particular, they didn’t care if their low-level components were plug and play. It would be like selling a car with the feature that any manufacturer’s engine could be plugged in. Such great flexibility requires even greater compromises on how well any instance is going to work. Sure, you can bolt on different wheels or tires, but even they have to be the right size; it would be silly to try to build a car that worked with any tire at all.

        Customers tend to want end-to-end solutions. Plug-and-play is more attractive to researchers and even more so to research evaluators like DARPA, who want to make their lives evaluating components easier.

        The bigger problem is that UIMA can’t solve the hard problems. What I find is that big frameworks make easy things hard (like pasting together compatible components in a pipeline), but don’t help at all with harder issues like what to do when two components use incompatible tokenizations. And tokenization is the tiniest and simplest of the natural language integration problems.
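The tokenization mismatch described above can be made concrete with a small sketch. The two tokenizers below are illustrative stand-ins, not LingPipe or UIMA components: one splits on whitespace, the other separates punctuation. Comparing their character spans shows that no token boundary pair lines up, which is the kind of disagreement a glue layer cannot resolve on its own.

```python
import re


def whitespace_tokens(text):
    """Return (start, end) character spans of whitespace-delimited tokens."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def word_punct_tokens(text):
    """Return spans of word-character runs and single punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def conflicting_spans(spans_a, spans_b):
    """Spans from spans_a that have no identical span in spans_b."""
    seen = set(spans_b)
    return [span for span in spans_a if span not in seen]


text = "Dr. Smith's co-star won."
coarse = whitespace_tokens(text)
fine = word_punct_tokens(text)

# Print every coarse token that the fine tokenizer splits differently.
for start, end in conflicting_spans(coarse, fine):
    print(repr(text[start:end]))
```

Annotations produced over one tokenization (say, an entity span covering "Smith's") have no exact counterpart in the other, so any downstream component has to decide how to align them.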

        Another Watson-relevant example is that if you want to combine movie databases (a project Breck and I tackled in the past with record linkage techniques), the issue is that one will use “actor” and “actress” and the other “supporting actor” and “lead actor”. The former distinguishes sex and the latter the scope of the role in the production. It’s hard to merge the two databases if they have different entries because the information they provide is different.

        Similarly, one database might list a title as “Star Wars IV” another as “Star Wars 4: A New Hope” and yet another just plain old “Star Wars”. Or a single database may contain both because it’s not deduplicated.

        The database join isn’t the problem, it’s the conceptual matching. And there’s just no way to define a framework that solves this problem in general — you need to apply domain knowledge.
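As a minimal sketch of why the matching needs domain knowledge, the code below normalizes the three title variants mentioned above. The function name, the Roman-numeral table, and the alias table are all assumptions made for illustration; the alias entry in particular is hand-written knowledge that no generic framework would supply.

```python
import re

# Roman numerals are converted only when they are the final token of a title.
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6"}

# Hand-curated domain knowledge (hypothetical): the bare title refers to
# the same film as the numbered one.
ALIASES = {"star wars": "star wars 4"}


def normalize_title(title):
    """Map a raw title string to a canonical matching key."""
    main = title.lower().split(":", 1)[0]       # drop any subtitle
    tokens = re.findall(r"[a-z0-9]+", main)     # strip punctuation
    if tokens and tokens[-1] in ROMAN:
        tokens[-1] = ROMAN[tokens[-1]]
    key = " ".join(tokens)
    return ALIASES.get(key, key)


for raw in ["Star Wars IV", "Star Wars 4: A New Hope", "Star Wars"]:
    print(raw, "->", normalize_title(raw))
```

The string cleanup handles the first two variants, but only the alias table joins the third, and a rule like dropping subtitles would wrongly merge distinct films in other series. That is the conceptual matching problem in miniature.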

        John Cook made a similar point on his blog recently, namely that combining formats such as XML, flat files, and databases isn’t the major problem; it’s combining conceptually different kinds of information, such as lab tests, doctors’ and nurses’ observational notes, and ongoing monitoring such as heart rate.

  5. Angus McIntyre Says:

    If you want to work on the real thing, The IBM Watson Group is hiring in locations throughout the United States. Search on Watson on the IBM Career site, or direct link to an opportunity. https://jobs3.netmedia1.com/cp/faces/job_summary?job_id=SWG-0598726

    • breckbaldwin Says:

      Cheeky bastards those IBMers. You guys can fund the beer budget….

    • Bob Carpenter Says:

      What exactly is Watson? I couldn’t tell from the job ad or the Watson home page.

      Is it now just a name used to refer to everything IBM does in natural language processing? Or is it even broader than that?

      I thought the original Watson project, which was a very Jeopardy-specific question answerer (or answer questioner, given the form of Jeopardy), was one of the coolest NLP applications ever.

  6. Jack Park Says:

    Regarding merging databases, I’m sure my friend Patrick Durusau would agree with me that this sounds like the kind of issue topic mappers enjoy tackling.

  7. Andrew Beam Says:

    “Step 1–transcribe some Jeopardy shows.”

    Shouldn’t be necessary: http://www.j-archive.com

    • breckbaldwin Says:

      Do you know if there is a structured format download for the site? Going to have to write a scraper if not. I couldn’t find anything.

      • Andrew Beam Says:

        I’m not aware of any structured download link. I think it’s run by Jeopardy “enthusiasts”, so they might be willing to share a structured archive, if they have one. I believe this message board is responsible for the archive:

        http://jboard.tv/

  8. Leonid Boytsov (@srchvrs) Says:

    Well, UIMA is, indeed, not an easy thing to use. It is supposed to simplify pipelining, as it helps you pass data along without thinking too much about serialization. However, it requires a tremendous amount of red tape in terms of XML configuration files.

    UIMA-ECD, which is being developed by the OAQA group, does, in my somewhat biased opinion, simplify pipeline construction. You still have to know some UIMA, but it is much easier to configure a UIMA-ECD pipeline. Actually, I do believe that it is easier to use than saving intermediate results in a textual form (say, in your own flavor of XML or TSV format). This comes at the expense of some (mostly rather exotic) functionality not being supported.

    A disadvantage of any framework, of course, is that it won’t help you much in solving a hard problem. It will not retrain your parser, and it will not come up with a clever idea for reconciling incompatible tokenization approaches. Or, to put it even more simply: suppose I have a standalone parser that produces TSV output. I need to invoke the parser, feed it a sentence, and parse its output. UIMA is not helpful here.
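The standalone-parser case above can be sketched in a few lines without any framework. Everything specific here is an assumption for illustration: the column layout (index, token, tag, head), the function names, and the idea that the parser reads a sentence on standard input and writes TSV to standard output.

```python
import csv
import io
import subprocess


def parse_tsv(output, columns=("index", "token", "tag", "head")):
    """Turn TSV parser output into one dict per non-empty row."""
    reader = csv.reader(io.StringIO(output), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def run_parser(command, sentence):
    """Feed a sentence to an external parser on stdin and parse its stdout.

    `command` is the argument list for a hypothetical parser executable.
    """
    result = subprocess.run(command, input=sentence, capture_output=True,
                            text=True, check=True)
    return parse_tsv(result.stdout)


# Sample output in the assumed four-column format.
sample = "1\tWatson\tNNP\t2\n2\twins\tVBZ\t0\n"
for row in parse_tsv(sample):
    print(row["token"], row["tag"], row["head"])
```

A call such as `run_parser(["my-parser", "--tsv"], "Watson wins")` would tie the two together, where `my-parser` and its flag are placeholders for whatever tool is actually installed.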
