Archive for the ‘Business’ Category

Java Specification Request (JSR) 247: Data Mining 2.0

February 19, 2009

I just stumbled on this Java Specification Request:

Is anyone using it? Are there public implementations? It seems to have become inactive around 2006 on the Java site.

The only references I can find are the Oracle javadocs for JSR 73 (Data Mining 1.0), which is part of the Oracle Data Mining kit (here’s their overview PDF).

I found it when looking through the source code distribution for Satnam Alag’s Collective Intelligence in Action (Manning Publications, 2008). The author included decompiled source for a version of someone’s JSR 247 implementation, rooted at package javax.datamining.

The number of interfaces defined is truly remarkable. For example, javax.datamining.algorithm includes

  • svm.classification.SVMClassificationSettingsFactory

as one of several SVM-specific interfaces. The package javax.datamining.supervised contains:

  • classification.ClassificationTestMetricsTaskFactory,

just one of dozens of interfaces dealing with classifier test configuration. I couldn’t actually find implementations of SVMs in the Manning distribution of the decompiled javax.datamining package.

Hunter/Gatherer vs Farming with Respect to Information Access

August 7, 2008

Antonio Valderrabanos gave a talk at NYU today in which he compared the current search/NLP strategies of information providers to humanity’s hunter/gatherer stage and offered a vision of information farming. I kept having images of Google’s web spider out digging roots and chasing animals with a pointed stick, wearing a grubby little loincloth. Then I would switch to images of a farm-stocked supermarket with well-organized shelves, helpful clerks, and lots of choice.

The analogy brought up a strong bias that I have in applying natural language processing (NLP) to real-world problems: I generally assume that the software must encounter text as it occurs in the “wild.” After all, that is what humans do so well, and we are in the business of emulating human language processing, right?

Nope, not on the farm we’re not. On the farm we use NLP to enhance information that was never part of standard written form. We use NLP to suggest and assign meta tags, connect entities to databases of concepts, and create new database entries for new entities. These are tasks that humans are horrible at, but humans are excellent at choosing from NLP-driven suggestions, and NLP is pretty good at making suggestions. So NLP is helping create the tools to index and cross-reference, at the concept level, all the information in the supermarket. Humans function as filters of what is correct. At least initially.

As the information supermarket gets bigger, the quality of the (machine-learning-based) NLP will get better, perhaps good enough to start automatically bringing in “wild” information with decent concept indexing and meta tagging. A keyword index is a crude yet effective tool, but it is not an inventory system, and an inventory system is what we need to advance to the next level of information access.

How Home Dentistry Kits and LingPipe are Similar

March 27, 2007

LingPipe is a tough sell to most commercial customers without professional services. Occasionally I will do a deal where all I do is cash a check, but almost all of our customers want lots of help. Suggest that they build it themselves and they look at me like I suggested a home root canal. Why?

Take one of our simplest capabilities, language model classification. There is a simple, runs-out-of-the-box tutorial that takes a developer line by line through what needs to be done to do some classification. It is really simple. Yet I cannot get certain customers working with it.

The sticking point, I believe, is that unfamiliarity plus the slightly loose nature of machine learning techniques makes for too great a conceptual jump. The DynamicLMClassifier needs the labels of the categories (easy), a boolean choice of whether to use a bounded or sequence-based language model (starting to feel a little woozy), and a character n-gram size (whoa, a ‘whackety n-germ thingy’). The tutorial suggests that 6 is a good initial n-gram size, but I think they are lost at this point. It gets worse, because the tutorial also suggests they try different n-gram sizes to see which produces the best score. The scoring is nicely provided as part of the tutorial as well. This only gets worse as we dig deeper into the LingPipe API.
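To make those knobs concrete, here is a minimal, self-contained sketch of the idea behind character n-gram language-model classification. To be clear, this is not LingPipe’s DynamicLMClassifier: the class name, the add-one smoothing, and the tiny n-gram size are my own simplifications for illustration. It just shows what “a set of category labels plus a character n-gram size” buy you.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy character n-gram language-model classifier (NOT LingPipe's API):
// per category, count character n-grams from training text, then score
// new text by summing smoothed log probabilities under each category.
public class NGramClassifier {
    private final int n;  // character n-gram size; the real tutorial suggests 6
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    public NGramClassifier(String[] categories, int nGram) {
        this.n = nGram;
        for (String c : categories) {
            counts.put(c, new HashMap<>());
            totals.put(c, 0);
        }
    }

    // Add the n-gram counts of one training text to a category's model.
    public void train(String category, String text) {
        Map<String, Integer> model = counts.get(category);
        for (String g : ngrams(text)) {
            model.merge(g, 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
        }
    }

    // Return the category whose model assigns the text the highest
    // total log probability, with add-one smoothing for unseen n-grams.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : counts.keySet()) {
            Map<String, Integer> model = counts.get(c);
            int total = totals.get(c);
            double score = 0.0;
            for (String g : ngrams(text)) {
                int k = model.getOrDefault(g, 0);
                score += Math.log((k + 1.0) / (total + model.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    private List<String> ngrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++)
            out.add(text.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        // n = 3 here only because the training data is a single sentence;
        // with real training sets a larger n works better.
        NGramClassifier clf =
            new NGramClassifier(new String[] {"english", "spanish"}, 3);
        clf.train("english", "the quick brown fox jumps over the lazy dog");
        clf.train("spanish", "el rapido zorro marron salta sobre el perro perezoso");
        System.out.println(clf.classify("the lazy dog jumps"));
        System.out.println(clf.classify("el perro salta"));
    }
}
```

The point of the sketch is that the “loose” parameters the tutorial asks for are exactly the ones that change the model’s behavior: the n-gram size trades specificity against sparseness, which is why the tutorial tells you to try several sizes and compare scores.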

Tuning these systems requires a particular mindset that is not part of a core computer science curriculum. It doesn’t require great intelligence, but experience is a must. Until we find a way to sort this out, such systems will stay out of general production. My mantra is “make computational linguistics as easy to use as a database.” We have a ways to go before our field sheds its black-art status.