Author Archive

Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

January 10, 2014

Code Camp is canceled. We are late delivering a LingPipe recipes book to our publisher, and that will have to be our project for March. But we could have testers/reviewers come and hang out. Less fun, I think.




Dates: To be absolutely clear, the dates are 3/2/2014 to 3/31/2014 in Driggs, Idaho. You can work remotely; we will be doing setup work beforehand as well.

New: We have set up a GitHub repository. URL is


Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City, Utah; this year it is Driggs, Idaho, for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion-dollar effort. So how about assembling a scrappy bunch of developers and seeing how close we can come to recreating the Jeopardy beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. House has 3 bedrooms if you want to come stay. Prior arrangements must be made. House is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL, so maybe that license, or we could just Apache-license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We had a very fun time last year. We worked very hard on an email app that didn’t quite get launched, but the lesson learned was to start with the project defined.

If you are interested in participating, let us know: your background, what you want to do, whether you expect to stay with us, etc. No visa situations please, and we don’t have any funds to support folks. Obviously we have limited space, physically and mentally, so we may say no, but we will do our best to be inclusive. Step 1–transcribe some Jeopardy shows.

Ask questions in the comments so all can benefit. Please check the comments before asking. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….


LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources–that is where LingPipe fits in. The guys (Conall and Joe) fit a pattern that we see quite often–smart people with a good idea that presupposes NLP, but without much experience in it.

The GetSeer guys were on the ball right away because they had a bunch of annotated data and had cobbled together a regular-expression-based classifier that served as an excellent starting point. We had machine-learning-based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.
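For flavor, here is what that kind of regex starting point plus a tiny evaluation harness can look like. This is a hypothetical sketch, not the Seer code: the `RegexBaseline` class, its labels, patterns, and the toy gold data are all invented for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex baseline classifier plus the kind of
// tiny evaluation harness that gets you a first accuracy number fast.
public class RegexBaseline {

    static final Pattern MEETING =
        Pattern.compile("\\b(meet|schedule|calendar)\\b", Pattern.CASE_INSENSITIVE);

    // Classify a text as "meeting" or "other" by pattern match.
    public static String classify(String text) {
        return MEETING.matcher(text).find() ? "meeting" : "other";
    }

    // Accuracy over gold-labeled {text, label} pairs.
    public static double accuracy(List<String[]> gold) {
        int correct = 0;
        for (String[] pair : gold) {
            if (classify(pair[0]).equals(pair[1])) ++correct;
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        List<String[]> gold = List.of(
            new String[] { "Can we schedule a call for Tuesday?", "meeting" },
            new String[] { "Here is the invoice you asked for.", "other" },
            new String[] { "Let's meet at noon.", "meeting" });
        System.out.println("accuracy=" + accuracy(gold));
    }
}
```

The point of the harness is that once `accuracy` exists, swapping the regex `classify` out for a machine-learning classifier is a small change, and the two can be compared on the same gold data.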


The great thing about our working setup is that I get to teach/advise, which keeps my time commitment manageable, and they get to do most of the heavy lifting, which is how they learn to build and maintain NLP systems. I have had to learn some Scala since I co-code most of the time when we work together and that is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

  • There is a small amount of training data.
  • Current results are awful (no surprise).
  • We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.
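As a rough illustration of what "roughing in" with Naive Bayes amounts to, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. It is a from-scratch sketch for exposition only; the class name and the training examples in the usage below are invented, and a real system would use LingPipe's classifiers rather than this.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal multinomial Naive Bayes with add-one smoothing: count words per
// class at training time, then pick the class with the highest log
// probability at classification time.
public class TinyNaiveBayes {

    final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    final Map<String, Integer> docCounts = new HashMap<>();
    final Set<String> vocab = new HashSet<>();
    int totalDocs = 0;

    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        ++totalDocs;
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestLp = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(label);
            int tokens = counts.values().stream().mapToInt(Integer::intValue).sum();
            double lp = Math.log((double) docCounts.get(label) / totalDocs);
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                int c = counts.getOrDefault(w, 0);
                // add-one (Laplace) smoothing so unseen words don't zero out
                lp += Math.log((c + 1.0) / (tokens + vocab.size()));
            }
            if (lp > bestLp) { bestLp = lp; best = label; }
        }
        return best;
    }
}
```

The appeal for roughing in is exactly this simplicity: training is just counting, so you can iterate on data and labels quickly before investing in logistic regression and richer feature extraction.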

VerbCorner: Another Game with Words

July 4, 2013

Games with Words continues to roll out new games, the latest release being VerbCorner.

VerbCorner contains a series of sub-games with titles like “Philosophical Zombie Hunter” and questions that sound like they’re straight out of an SAT reading comprehension test, starting with a long story I won’t repeat and then questions like:

Michelle smashed the gauble into quaps. Which of the following is *definitely* a real human?

  1. Michelle
  2. None of the above
  3. Can’t tell because in this context ‘smash’ has more than one meaning.
  4. Can’t tell because the sentence is ungrammatical.
  5. Can’t tell because I don’t know that verb.

Josh Hartshorne, one of the developers, writes that Verbcorner is “a new project I’m doing in collaboration with Martha Palmer at CU-Boulder, which is aimed at getting human judgments to inform the semantics assigned to verbs in VerbNet. You can find a press release about the project … You’ll see we’ve incorporated a lot of gamification elements in order to make the task more engaging (and also to improve annotation quality).”

This note from the press release harkens back to my philosophical semantics days, “If we do not want to define words in terms of other words, what should they be defined in terms of? This remains an open research question…”. I’ll say! The project has a generative semantics feel; in the press release, Josh is quoted as saying “There are a few dozen core components of meaning and there are tens of thousands of verbs in English.” Now I’m left hanging as to what a “core component of meaning” is!

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, encodings in HTML, XML and JSON, as well as in-depth coverage of regexes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer.)

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSI-approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

With Manning having passed on TPiJ, Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players, all of whom played the only position that fills the agent role in a very popular play known as a “pass.”] If you know why Yarowsky’s one-discourse-one-sense rule is violated by the background knowledge, and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

Natural Language Generation for Spam

March 31, 2012

In a recent comment on an earlier post on licensing, we got this spam comment. I know it’s spam because of the links and the URL.

It makes faculty adage what humans can do with it. We’ve approved to beacon bright of that with LingPipe’s authorization — we artlessly can’t allow the attorneys to adapt our own arbitrary royalty-free license! It was advised to accept some AGPL-like restrictions (though we’d never heard of AGPL). At atomic with the (A)GPL, there are FAQs that I can about understand.

ELIZA, all over Again

What’s cool is how they used ELIZA-like technology to read a bit of the post and insert it into some boilerplate-style generation. There are so many crazy and disfluent legitimate comments that, with a little more work, this would be hard to filter out automatically. Certainly the WordPress spam filter, Akismet, didn’t catch it, despite the embedded links.
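The mechanics appear to be little more than thesaurus spinning: grab a chunk of the target post and blindly swap words for "synonyms", which is exactly the kind of substitution that turns "steer clear of" into "beacon bright of". A hypothetical sketch (the `CommentSpinner` class and its synonym table are invented for illustration):

```java
import java.util.Map;

// Sketch of the thesaurus-spinning trick: swap each word for a "synonym"
// if one is in the table, otherwise pass it through. The result looks
// fluent at a glance but reads slightly off, like the spam above.
public class CommentSpinner {

    static final Map<String, String> SYNONYMS = Map.of(
        "allow", "accept",
        "lawyers", "attorneys",
        "simply", "artlessly",
        "steer", "beacon",
        "clear", "bright");

    public static String spin(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            out.append(SYNONYMS.getOrDefault(token.toLowerCase(), token)).append(' ');
        }
        return out.toString().trim();
    }
}
```

Because each substitution is locally plausible but globally blind to context, the output is disfluent in just the way the quoted comment is, which is also why a word-level spam filter has trouble with it.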

Black Hat NLP is Going to Get Worse

It would be really easy to improve on this technology with a little topic modeling, better word spotting (though they seem to do an OK job of that), and better language modeling for generation. Plus better filtering à la modern machine translation systems.

The really nasty applications of such light processing and random regeneration will be in auto-generating reviews and even entire social media presences. It’ll sure complicate sentiment analysis at scale. You can just create blogs full of this stuff, link them all up like a good SEO practitioner, and off you go.

YapMap: Breck’s Fun New Project to Improve Search

January 27, 2012

I have signed on as chief scientist at YapMap. It is a part-time position that grew out of my being on their advisory board for the past 3 years. Try the search interface for the forums below:

Automotive Forums

YapMap search for Low Carb Breakfast on Diabetes Daily

A screen shot of the interface:

UI for YapMap's search results

What I like about the user interface is that threads can be browsed easily–I have spent hours on remote-control airplane forums reading every post because it is quite difficult to find relevant information within a thread. The color coding and summary views are quite helpful in eliminating irrelevant posts.

My first job is to get query spell checking rolling. Next is search optimized for the challenges of thread-based postings. The fact that the relevance of a post to a query is a function of the whole thread is very interesting. I will hopefully get to do some discourse analysis as well.
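As a first cut, query spell checking can be as simple as suggesting the in-domain term with the smallest edit distance to the query token. The sketch below is hypothetical, with an invented dictionary; a production checker (LingPipe's, for instance) would instead use a character language model trained on the forum text together with a noisy-channel edit model.

```java
import java.util.List;

// First-cut query spell checking: classic dynamic-programming Levenshtein
// distance, then pick the closest term from a domain dictionary.
public class QuerySpellCheck {

    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); ++i) d[i][0] = i;
        for (int j = 0; j <= b.length(); ++j) d[0][j] = j;
        for (int i = 1; i <= a.length(); ++i)
            for (int j = 1; j <= b.length(); ++j)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    static String correct(String query, List<String> dictionary) {
        String best = query;
        int bestDist = Integer.MAX_VALUE;
        for (String term : dictionary) {
            int dist = editDistance(query, term);
            if (dist < bestDist) { bestDist = dist; best = term; }
        }
        return best;
    }
}
```

The reason a forum-derived dictionary matters is the domain point above: "carburetor" is a likely correction on an automotive forum in a way no general-purpose word list would rank it.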

I will continue to run Alias-i/LingPipe. The YapMap involvement is just too fun a project to pass up given that I get to build a fancy search and discovery tool.


Peter Jackson has Passed Away

August 4, 2011

I am very sorry to convey that Peter Jackson has passed away. Below is an email sent to Thomson Reuters staff by their CTO.

Peter was a great friend, early customer and advisor to LingPipe. I will miss him dearly.



From James Powell, CTO Thomson Reuters

Dear Colleagues,

It is with great sadness that I must inform you that Peter Jackson, our Chief Scientist and head of R&D, died suddenly this morning.

Peter was a colleague of immense talents, both professionally and personally. He started his career in academia, and after a post as a Lecturer in Artificial Intelligence at Edinburgh University in Scotland he moved to the U.S., where he went on to become Associate Professor in the Department of Electrical and Computer Engineering at Clarkson University, NY.

A true global citizen who was born in Barbados and educated in England, Peter also served as Visiting Professor at the Artificial Intelligence Laboratory for the School of Electrical & Electronic Engineering at Singapore Polytechnic.

Peter switched from the corridors of academia to the meeting rooms of Rochester when he joined Lawyers Cooperative Publishing (LCP) back in 1995, as Lead Software Engineer for the Natural Language Group. LCP was acquired by West Group the following year.

His penetrating intelligence, allied to his outgoing and immediately likeable personality, ensured that Peter moved quickly through the ranks at Thomson; next in the West Group and then at Thomson Legal & Regulatory, where he rose to become Chief Research Scientist & Vice-President, Technology in 2005.

A lasting legacy

Peter was the obvious choice for the position of Chief Scientist when we became Thomson Reuters in 2008, and it was a role he fulfilled with exceptional results, forming a 40-strong R&D group as a corporate-wide resource, and aligning our Research & Development capabilities with the major platform and strategy initiatives of recent years.

Peter’s legacy at Thomson Reuters is significant and lasting. He oversaw the advanced technologies, such as CaRE, Concord and Results Plus, that enabled us to launch radical and successful new product platforms such as WestlawNext and PeopleMap. He was also integral to the development of our innovative Reuters Insider channel.

On top of that he represented Thomson Reuters with great professionalism — and considerable personality and panache — in the wider world of conferences, academia and professional associations. He was a great ambassador for Thomson Reuters in his role overseeing university liaison with regards to joint research projects with institutions of the caliber of MIT, NYU and CMU.

These achievements give some sense of Peter’s abilities and drive. Peter, though, had many more strings to his bow. He published three books and about 40 papers on his fields of expertise (artificial intelligence and natural language processing), and invented four US patents.

Above all, when it came to his work, Peter was proudest of the group that he built and worked with side-by-side every day.

If this wasn’t enough for one man, he was also a talented musician and a serious music fan, serving for a number of years on the Board of Directors for the Saint Paul Chamber Orchestra. In recent times, he traveled to Chicago, New York and Ojai to support the SPCO on tour. Many people in Eagan will remember Peter’s “Jazzkickers” band playing at employee events around campus.

He even demonstrated his guitar playing prowess in this internal technology video filmed at Joe’s Pub in NYC earlier this summer.

The many of us who worked with Peter will miss greatly his enthusiasm, the clarity of his thinking, his wisdom and his wit.

He leaves behind his wife Sandy, also a former employee here. Peter was 62.

Quiz: Acknowledged Globally as the Leader…

July 12, 2011

This is actually a quiz. I found the following in a job post for an NLP engineer from a company that’s so far been rather shy on the hype surrounding NLP.

XXX is a provider of innovative software products for semantic search and text mining. Thanks to a great team, and sophisticated natural language processing technology, our software is acknowledged globally as the leader in its field, with a strong focus on the YYY and ZZZ industry.

I love marketing/sales talk. It’s like political speech in its combination of vagueness and grandiloquence.


Read the above passage and answer:

Question 1. The software in question is “acknowledged globally” by whom?

Question 2. How restrictive is “field”? Just the YYY and ZZZ industries or all of semantic search and text mining?

Putting yourself in the place of a consumer, answer:

Question 3. Does a “strong focus” on YYY and ZZZ put you off asking XXX to do ABC?

Question 4. What does “innovative” mean in this context? Provide a contrasting example of non-innovative text mining and semantic search.

Using your general world knowledge, answer the following:

Question 5. What is “semantic search” and “text mining”?

Question 6. Based on your answer to 5, who do you think is the leader in the field of “semantic search” and “text mining”?

Question 7. What counts as a “sophisticated” technology? What is the relevant comparison group?

Using web search, answer the following (to yourself, no spoilers right away, please):

Extra Credit. What are the values for XXX, YYY, and ZZZ?

Why is C++ So Fast?

July 1, 2011

I had to respond to John Cook’s blog post,

My response was so long, I thought I’d repost it here.

I didn’t understand the answer properly six months ago, despite spending a couple of years as a professional C coder (when I did speech recognition) and having spent the past ten or so years writing Java. C++ is different enough from either of these to be a completely different sort of beast.

C++ is Faster than C!

At least, it’s easier to write fast code in C++ than in C these days. In fact, C++ is now the language of choice for optimization, not plain old C. The reason it’s so efficient is twofold.

Reason 1: Tight Data Structures

First, C++ is intrinsically stingy with memory (unlike Java objects, a C++ struct has no memory overhead if there are no virtual functions [modulo word alignment issues]). Smaller things run faster due to caching, and are also more scalable. Of course, this is true of C, too.

Reason 2: Template Metaprogramming

What makes C++ both harder to debug and understand than C, and even more efficient, is the use of template metaprogramming.

In particular, there are huge gains to be made from the kind of lazy evaluation you can implement with expression templates, and from the memory and indirection savings you get by avoiding virtual functions with the curiously recurring template pattern.

[Update: 28 July 2011: I thought an example would help, and this was the original one motivating template expressions.]

Suppose you have a scalar x and two M-by-N matrices a and b. Now suppose you want to evaluate d = a + (x * b). The standard non-lazy approach would be to create an intermediate matrix c representing the scalar-matrix product (x * b) and then add that to a. Basically, what’d happen is the same as if you’d coded it like this:

matrix c(M, N);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        c(m, n) = x * b(m, n);
matrix d(M, N);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        d(m, n) = a(m, n) + c(m, n);
return d;

With clever template expression programming, you can get around allocating an intermediate value, so that the resulting computation would amount to the same thing as coding it as follows.

matrix d(M, N);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        d(m, n) = a(m, n) + x * b(m, n);
return d;

This saves the allocation and deallocation of matrix c. Memory contention is a huge problem in numerical code these days, as CPU speed and parallelism outstrip memory speed, so this is a very large savings. You also save intermediate index arithmetic and a number of sets and gets on arrays, which are not free.

The same thing can happen for transposition, and other operations that don’t really need to generate whole intermediate matrices. The trick’s figuring out when it’s more efficient to allocate an intermediate. The Eigen matrix package we’re using seems to do a good job of this.

[end Update: 28 July 2011]
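The fused-loop idea itself is not template magic; what templates buy you is getting it at zero overhead. For readers coming from Java, here is a rough analog of the same lazy a + x * b evaluation using a hypothetical `Matrix` interface (all the class names here are invented for illustration). In C++ the compiler inlines the whole thing away; in Java you pay a virtual call per element.

```java
// Rough Java analog of the expression-template idea: represent a + x * b
// as a lazy view so no intermediate matrix is ever allocated.
interface Matrix {
    int rows();
    int cols();
    double get(int i, int j);
}

// Plain materialized matrix backed by an array.
class Dense implements Matrix {
    final double[][] vals;
    Dense(double[][] vals) { this.vals = vals; }
    public int rows() { return vals.length; }
    public int cols() { return vals[0].length; }
    public double get(int i, int j) { return vals[i][j]; }
}

// Lazy view of a + x * b: each element is computed on demand, fusing the
// scalar product and the addition into a single expression per lookup.
class AxpyView implements Matrix {
    final Matrix a, b;
    final double x;
    AxpyView(Matrix a, double x, Matrix b) { this.a = a; this.x = x; this.b = b; }
    public int rows() { return a.rows(); }
    public int cols() { return a.cols(); }
    public double get(int i, int j) { return a.get(i, j) + x * b.get(i, j); }
}
```

Calling `get(i, j)` on the view computes `a.get(i, j) + x * b.get(i, j)` on the fly, so the intermediate matrix c from the first listing never exists.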

Voting with Their Code

I just visited Google NY last week, where some of my old friends from my speech recognition days work. Upon moving to Google, they reimplemented their (publicly available) finite state transducer library in C++ using expression templates (the OpenFST project).

You’ll also see expression templates with the curious recurrence in the efficient matrix libraries from Boost (BLAS level only) and Eigen (also provides linear algebra). They use templates to avoid creating intermediate objects for transposes or scalar-matrix products, just like a good Fortran compiler.

But Wait, C++ is also More Expressive

Templates are also hugely expressive. They lead to particularly neat implementations of techniques like automatic differentiation. This is what roused me from my life of Java and got me into C++. That, and the need for good probability and matrix libraries — why don’t these things exist in Java (or where are they if I missed them?).

The Future: Stochastic Optimization

What’s really neat is that there are now stochastic optimizers for C++ that can analyze your actual run-time performance. I really really really like that feature of the Java server compilers. Haven’t tried it in C++ yet.

PS: So Why is LingPipe in Java?

The ability to write fast programs isn’t everything. Java is much, and I mean much, easier to deploy. The whole jar thing is much easier than platform-specific executables.

Java’s code organization is much cleaner. None of this crazy header guard and .hpp versus .cpp stuff. No putting namespaces one place, code another place, and having them completely unrelated to includes (you can be more organized, but this stuff’s just much easier to figure out in Java because it’s conventionalized and everyone does it the same way). The interfaces are simpler — they’re explicit, not undocumented pure abstractions (where you have to look at the code of what they call an “algorithm” to see what methods and values it requires of its objects).

Java has support for threading built in. And a great concurrency library. There are web server plugins like servlets that make much of what we do so much easier.

There’s also programming time. I’m getting faster at writing C++, but there’s just so much more that you need to think about and take care of, from both a system author and client perspective (mainly memory management issues).

Many people don’t realize this, but C is pretty much the most portable language out there. As far as I know, there’s no place Java runs that C won’t run. C++ is a different story. The compilers are getting more consistent, but the texts are still fraught with “your mileage may vary due to compiler” warnings, as are the mailing lists for the different packages we’re using (Eigen, Boost Spirit, Boost Math), all of which lean heavily on templates.

Sometimes I Wish NLP Systems Occasionally Blew Up

June 28, 2011

The consequences of a badly performing NLP (Natural Language Processing) system tend to be pretty low-key, low-drama events. Bad things may happen, but not in a way that will get noticed in the same way as: Rocket Go Boom

The failure of a rocket launch highly motivates the engineers involved not to let that happen again. Miss that named entity, blam! If only NLP programmers had such explicit negative feedback. I think the field would be better for it.

NLP Systems are Easier to Sell than Build

Customers get the potential value of advanced NLP/Text Analytics/etc in the same way that people get the potential value of space flight.


The Dreaded Artist's Conception--picture credit NASA

It would be so cool to do sentiment analysis in low earth orbit! Sadly, the tremendous promise of the field is held back by a combination of overselling, under-delivering, and a lack of awareness of how to build well-performing systems. What contributes most to poor performance?

Be aware that you are selling to the best NLP systems out there: Humans

One of the greatest frustrations I face is severely underfunded projects. For the most part rockets get a much healthier dose of funding because people see the failures clearly and do not pretend to grasp how rockets work. Not so for NLP. Language processing is so easy for humans that selling it is like trying to sell cargo airplanes to eagles. They just don’t get what is hard, what is easy, and the necessity of infrastructure. “Mr. Eagle, um, well, we really need a runway to get the 20 tons of product into the air.” Mr. Eagle responds with “What are you talking about? I can take off, land and raise a family on a tree branch. Cargo planes are easy because flying is easy for me. So I will give you a fish to do the job.”

Don’t ask a banker from 1994 to understand your tweets

Another source of poor performance is reliance on general-purpose solutions that are not well suited or tuned to the domain. It is unrealistic to expect a named-entity model to perform well on Twitter if its training data is the 1994 Wall Street Journal. Modules customized for the domain can make a huge difference in performance. But customization costs money, takes time and requires a different mind set.

Understand the problem clearly with hand-coded examples

The #1 bit of advice I give customers is to emulate by hand what you expect the system to do. Then show that to stakeholders to make sure the problem being solved addresses something your business cares about. Also, Mr. Eagle will much better appreciate the need for another solution after ferrying 100 pounds of product 1 pound at a time. By doing this you will have cut the risk of failure in half, in my opinion.

NLP is hard not only because it is technically difficult, but because it seems so easy for humans that people under-appreciate the challenges. If systems blew up spectacularly, we might have a better appreciation of that.