LingPipe Classifiers and Chunkers for Endeca Extend Partner Program


A couple of weeks ago, Endeca issued the following press release:

The “leading text analytics software vendors” are us (props to Breck for naming us with an “A”), Basis Technology, Lexalytics, MetaCarta, NetOwl, Nstein, Semantia and Temis. But wait, that’s not all. A slew of text analytics companies had either joined earlier or announced that they’re joining now, including ChoiceStream, BayNote, Lexalytics, Coremetrics, Nstein, and Searchandise.

It’s no surprise that we’re all thinking Endeca has quite a bit of potential as a channel partner.

After the usual marketing blather (e.g. “leveraging the extensibility of the McKinley platform”, “lower cost of ownership”, “value-added capabilities”, etc.) and vague promises (e.g. “unrestricted exploration of unstructured content”), the third paragraph of Endeca’s press release explains what it’s all about in allowing Endeca’s search customers to

… run their data through an Endeca Extend partner solution, extract additional meta-data elements from the text, and append that meta-data to the original content

Endeca Records

Endeca stores documents in record data structures, which associate string keys with lists of string values. This is the same rough structure as is found in a Lucene Document.
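Here's a minimal sketch of that shape, assuming nothing more than "string keys mapped to lists of string values". The class `SimpleRecord` and its methods are invented for illustration; this is not Endeca's record class or Lucene's Document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record: string keys mapped to lists of string values.
// The class and method names are invented for illustration; this is not Endeca's API.
class SimpleRecord {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Append a value under a key, creating the value list on first use.
    void add(String key, String value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return an unmodifiable view of the values for a key (empty if the key is absent).
    List<String> values(String key) {
        List<String> values = fields.get(key);
        return values == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        SimpleRecord record = new SimpleRecord();
        record.add("title", "LingPipe Classifiers and Chunkers");
        record.add("company", "Endeca");
        record.add("company", "Basis Technology");
        System.out.println(record.values("company"));
    }
}
```

The point of the list-valued design is that a single key such as `company` can hold several values without any packing or delimiter tricks.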

One striking difference is that Endeca’s API is cleaner and better documented. Overall, I’m very impressed with Endeca’s API. Looking at their API reminds me of the APIs we built at SpeechWorks, where wiser heads prevailed on me to forgo complex controls designed for control-freak grad students in favor of making easy things easy.

Another striking difference is that Lucene’s document structure is much richer, allowing for binary blobs to be stored by those trying to use Lucene as a database. Lucene also allows both documents as a whole and fields within a document to be boosted, adding a multiplier to their search scores for matching queries.

Manipulator Extensions

Endeca’s produced an API for extensions. An extension visits records, modifies them, and writes them back to the index. It can also write into its own scratch space on the file system and generate entirely new records.

An extension consists of three components: configuration, factory, and runtime.

Class 1. Configuration

The bean-like configuration class provides setters and getters for strings, booleans, integers, and doubles. These are labeled with attributes and accessed through reflection. There’s then a method to validate a configuration that returns a list of errors as structured objects. I’m a big fan of immutable objects, so working with beans drives me crazy. Endeca could use some more doc on concurrency and lifecycle order; as is, I was conservative and programmed defensively against changes in config.
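To make the pattern concrete, here's a rough sketch of a bean with a validate method, plus one way to program defensively against later setter calls: copy the values into an immutable snapshot. Everything here (`ClassifierConfig`, `ConfigError`, `ClassifierSettings`, the property names, the example path) is hypothetical, and the attribute labeling and reflection that Endeca's real config system uses are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bean-like configuration; property names are invented for illustration.
class ClassifierConfig {
    private String modelPath = "";
    private String outputField = "";

    public String getModelPath() { return modelPath; }
    public void setModelPath(String modelPath) { this.modelPath = modelPath; }
    public String getOutputField() { return outputField; }
    public void setOutputField(String outputField) { this.outputField = outputField; }

    // Report every problem found rather than stopping at the first one.
    public List<ConfigError> validate() {
        List<ConfigError> errors = new ArrayList<>();
        if (modelPath == null || modelPath.isEmpty()) {
            errors.add(new ConfigError("modelPath", "a classifier path is required"));
        }
        if (outputField == null || outputField.isEmpty()) {
            errors.add(new ConfigError("outputField", "an output field name is required"));
        }
        return errors;
    }

    public static void main(String[] args) {
        ClassifierConfig config = new ClassifierConfig();
        config.setModelPath("/models/sentiment.model"); // made-up path
        for (ConfigError error : config.validate()) {
            System.out.println(error.property + ": " + error.message);
        }
    }
}

// Hypothetical structured error naming the offending property.
final class ConfigError {
    final String property;
    final String message;

    ConfigError(String property, String message) {
        this.property = property;
        this.message = message;
    }
}

// Immutable snapshot of the bean's values, taken once so that later setter
// calls on the bean cannot affect code that is already running.
final class ClassifierSettings {
    final String modelPath;
    final String outputField;

    ClassifierSettings(ClassifierConfig config) {
        this.modelPath = config.getModelPath();
        this.outputField = config.getOutputField();
    }
}
```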

Configuration is handled through an administrator interface. As I said, it’s bean-like.

Class 2. Factory

There is then a factory class with a method that returns the config class (so the admin interface can tell what kind of config to build for it). It also contains a method that takes an Endeca application context and configuration and produces a runtime application. The context provides services like logging, a path to local file space, and a hook into a pipe into which modified records may be sent.
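A sketch of the factory's two jobs, under heavy assumptions: all of the types below (`TaggerFactory`, `TaggerConfig`, `TaggerContext`, `TaggerRuntime`) are made up, records are reduced to plain strings, and the context carries only a logger and an output pipe (the file-space path is omitted).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical factory: tells an admin tool which config type to build,
// and turns a context plus a config into a runtime.
class TaggerFactory {
    Class<TaggerConfig> configClass() {
        return TaggerConfig.class;
    }

    TaggerRuntime create(TaggerContext context, TaggerConfig config) {
        // Copy the value once, since the bean may be changed later.
        final String tag = config.getTag();
        context.log.accept("creating runtime with tag " + tag);
        return record -> context.output.accept(record + " [" + tag + "]");
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        TaggerContext context = new TaggerContext(System.out::println, written::add);
        TaggerRuntime runtime = new TaggerFactory().create(context, new TaggerConfig());
        runtime.visit("record-1");
        System.out.println(written);
    }
}

// Hypothetical bean-like config with a single string property.
class TaggerConfig {
    private String tag = "visited";
    public String getTag() { return tag; }
    public void setTag(String tag) { this.tag = tag; }
}

// Stand-in for the application context: a logger and the pipe for modified records.
final class TaggerContext {
    final Consumer<String> log;
    final Consumer<String> output;

    TaggerContext(Consumer<String> log, Consumer<String> output) {
        this.log = log;
        this.output = output;
    }
}

// Stand-in for the runtime: a single record-visitor method.
interface TaggerRuntime {
    void visit(String record);
}
```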

Class 3. Runtime

The runtime simply provides a record visitor method. To write out changes, you grab the output channel from the context provided to the factory. There are also some lifecycle methods used as callbacks: interrupt processing, processing of records is complete, and final cleanup. You can still write out answers during the completion callback.
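Roughly, the lifecycle looks like the sketch below. The method names (`visit`, `interrupt`, `complete`, `cleanup`) and the class `CountingRuntime` are guesses at the shape, not Endeca's actual interface, and records are again plain strings. The one detail taken from the real behavior is that output is still allowed during the completion callback, which the sketch uses to emit a summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical runtime: passes records through, counts them, and writes a
// summary during the completion callback.
class CountingRuntime {
    private final Consumer<String> output;
    private volatile boolean interrupted = false;
    private int seen = 0;

    CountingRuntime(Consumer<String> output) {
        this.output = output;
    }

    // Called once per record; ignored after an interrupt.
    void visit(String record) {
        if (interrupted) return;
        seen++;
        output.accept(record);
    }

    // Callback: stop processing.
    void interrupt() {
        interrupted = true;
    }

    // Callback: all records have been visited, but output is still allowed.
    void complete() {
        output.accept("summary: " + seen + " records");
    }

    // Callback: final cleanup; a real extension would release models and files here.
    void cleanup() {
        seen = 0;
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        CountingRuntime runtime = new CountingRuntime(written::add);
        runtime.visit("record-1");
        runtime.visit("record-2");
        runtime.complete();
        runtime.cleanup();
        System.out.println(written);
    }
}
```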

Endeca’s Demo Manipulator Extension

Endeca has great programmers and their Java API design was really clear. I love it when vendors follow standard patterns and idioms in their API designs. Especially when they use generics usefully.

The PDF developer doc’s still in progress, but their Javadoc’s mostly in place. What was really sweet is that they gave us a working demo extension program with all of its configuration, tests, and even mock objects for use in JUnit testing the entire framework without a complete install of Endeca’s platform. I’m so happy when someone sends me a Java package that unpacks then compiles with Ant without griping.

LingPipe Classifier CAS Manipulator Extension

The first extension I wrote is configured with a path to a serialized text classifier on the classpath. I then configured a list of field names (only strings are available, so I went with comma-separated values) from which to collect text, and a field name into which to write the result of classification. [Correction: 5 Nov 2009, Endeca let me know that they had this covered; if I declare the variables in the bean-like configuration to be list-like values, the reflection-based config system will figure it out. This is awesome. I always hate rolling in ad-hoc little parsing "languages" like CSV in config. It's just sooo hard to doc and code correctly.]
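The core of that extension boils down to something like the following sketch. It is not the extension's actual code: the record is a plain map, `ClassifyFields.classifyInto` is an invented helper, and the classifier is a stand-in `Function<String, String>` where a deserialized LingPipe classifier would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: gather text from the configured fields, classify it,
// and write the category into the configured output field.
class ClassifyFields {
    static void classifyInto(Map<String, List<String>> record,
                             List<String> inputFields,
                             String outputField,
                             Function<String, String> classifier) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            List<String> values = record.getOrDefault(field, Collections.<String>emptyList());
            for (String value : values) {
                if (text.length() > 0) text.append(' ');
                text.append(value);
            }
        }
        String category = classifier.apply(text.toString());
        record.computeIfAbsent(outputField, k -> new ArrayList<>()).add(category);
    }

    public static void main(String[] args) {
        Map<String, List<String>> record = new HashMap<>();
        record.put("title", new ArrayList<>(Arrays.asList("A great phone")));
        record.put("body", new ArrayList<>(Arrays.asList("Battery lasts for days.")));

        // Toy stand-in for a trained classifier.
        Function<String, String> toy =
            text -> text.contains("great") ? "positive" : "negative";

        classifyInto(record, Arrays.asList("title", "body"), "sentiment", toy);
        System.out.println(record.get("sentiment"));
    }
}
```

Joining the field values with a space is just the simplest choice for the sketch; fields missing from a record are skipped.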

LingPipe Chunker CAS Manipulator Extension

The second extension is a chunker. It requires a path to a chunker. Optionally, it allows a sentence detector to be configured for preprocessing (most of our chunkers work better at the sentence level). It also optionally allows a dictionary (and tokenizer factory) to be specified for overriding the chunks found by the chunker. Finally, it takes a list of field names from which to read text. The output gets written into chunk-specific fields. Because a given field name can contain multiple values, you can keep the resulting spans separate.
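Here's a sketch of just the output step, assuming the chunker has already produced typed character spans. The `Span` class, the `ChunkFields.writeChunks` helper, and the `chunk.` field-name prefix are all invented for illustration; sentence detection and the dictionary override are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: write each chunk's text as its own value under a
// field named for the chunk's type, so spans stay separate.
class ChunkFields {
    static void writeChunks(Map<String, List<String>> record, String text, List<Span> spans) {
        for (Span span : spans) {
            String field = "chunk." + span.type; // made-up naming scheme
            record.computeIfAbsent(field, k -> new ArrayList<>())
                  .add(text.substring(span.start, span.end));
        }
    }

    public static void main(String[] args) {
        String text = "Endeca partnered with Basis Technology.";
        // Spans as a chunker might return them: [start, end) offsets plus a type.
        List<Span> spans = Arrays.asList(
            new Span(0, 6, "ORGANIZATION"),
            new Span(22, 38, "ORGANIZATION"));

        Map<String, List<String>> record = new HashMap<>();
        writeChunks(record, text, spans);
        System.out.println(record);
    }
}

// Hypothetical chunk: a typed character span over the input text.
final class Span {
    final int start;
    final int end;
    final String type;

    Span(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}
```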

Endeca’s Faceting

Endeca’s big on faceted search. You may be familiar with it from two of the best online stores, NewEgg and Amazon.

It’s easy to treat our classifier plugin output as a facet. For instance, classify documents by sentiment and now sentiment’s a facet. Do a search, and you’ll get a summary of how many positive and how many negative documents, with an option to restrict search to either subset.

It’s also easy to treat our chunker output as a facet. For instance, if you include a company name chunker, you’ll be able to use companies as facets (e.g. as NewEgg does with manufacturers, though documents may contain references to more than one company).
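To see why plugin output drops so naturally into facets, here's a toy in-memory illustration of the two facet operations: counting records per value and restricting to one value. This says nothing about how Endeca actually computes facets; `FacetCounts` and its methods are invented, and records are plain maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy facet operations over records represented as maps.
class FacetCounts {
    // Count how many records carry each value of a field; a record with
    // several values counts once under each distinct value.
    static Map<String, Integer> count(List<Map<String, List<String>>> records, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> record : records) {
            List<String> values = record.getOrDefault(field, Collections.<String>emptyList());
            Set<String> distinct = new LinkedHashSet<>(values);
            for (String value : distinct) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only the records that carry the given value in the field.
    static List<Map<String, List<String>>> restrict(
            List<Map<String, List<String>>> records, String field, String value) {
        List<Map<String, List<String>>> kept = new ArrayList<>();
        for (Map<String, List<String>> record : records) {
            List<String> values = record.getOrDefault(field, Collections.<String>emptyList());
            if (values.contains(value)) kept.add(record);
        }
        return kept;
    }

    // Convenience builder for a one-field record.
    static Map<String, List<String>> record(String field, String... values) {
        Map<String, List<String>> record = new HashMap<>();
        record.put(field, Arrays.asList(values));
        return record;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> records = new ArrayList<>();
        records.add(record("sentiment", "positive"));
        records.add(record("sentiment", "negative"));
        records.add(record("sentiment", "positive"));
        System.out.println(count(records, "sentiment"));
        System.out.println(restrict(records, "sentiment", "negative").size());
    }
}
```

The multi-valued case is the one that differs from a manufacturer facet: a document mentioning two companies shows up under both.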

Buying Plugins

Drop Breck a line.

Now that I have my head around the bigger picture, it’s pretty easy to build these kinds of extensions. So if there’s something you’d like integrated into Endeca and you’re willing to pay for it, let us know.
