Archive for the ‘Business’ Category

Affero Gnu Public License (AGPLv3) for LingPipe 4.0?

March 4, 2009

We’re busy preparing to launch LingPipe 4.0. We’re leaning toward switching from our “unauthorized” Alias-i Royalty Free License (version 1) to the GNU Affero General Public License (version 3) (AGPLv3).

The motivation for our license was very similar to Affero’s. Affero, the original license developer, wanted to close the so-called “Application Service Provider (ASP) loophole”: the GPL doesn’t count running software as a service as redistribution, so application service providers (ASPs) may run GNU General Public Licensed (GPLed) software as a service without distributing their own software linked to it. The AGPL is stronger (i.e., its viral terms are more contagious) in that it treats service providers as distributing the software.

We’re neither lawyers nor software freedom fighters (though I am getting tired of all this software patent nonsense). The motivation for our royalty-free license was partly to keep potential paying customers from running LingPipe on proprietary data without a commercial license, while keeping the system as “open” as possible, especially for research and teaching purposes. Our business model was thus something like MySQL’s business model. Here’s a nice survey of open-source business models; we’re a mix of 3, 4, and 5, plus 1 if you count DARPA/NIH grants as donations/subsidies.

Check out Bruce Perens’s quite sensible review of the plethora of OS licenses: How Many Open Source Licenses Do You Need? (Spoiler: Perens, who wrote the original version of OSI’s Open Source Definition, now thinks 4 are sufficient; I suppose you can’t blame a computer scientist for generalizing!)

We’re wondering what the ramifications of a switch to AGPLv3 would be. On the upside, we will potentially attract more users and perhaps even developers by having a license that plays nicely with others (specifically GPL and Apache/BSD). We also think it may be more attractive to funding agencies, like NIH. On the downside, we’re worried it may adversely affect our ability to sell our software, or perhaps even the business as a whole.

Are we crazy, or should we have done this ages ago?

Java Specification Request (JSR) 247: Data Mining 2.0

February 19, 2009

I just stumbled on this Java Specification Request, JSR 247: Data Mining 2.0.

Is anyone using it? Are there public implementations? It seems to have become inactive around 2006 on the Java site.

The only references I can find are the Oracle javadocs for JSR 73 (Data Mining 1.0), which is part of the Oracle Data Mining kit (here’s their overview PDF).

I found it when looking through the source code distribution for Satnam Alag’s Collective Intelligence in Action (Manning Publications, 2008). The author included decompiled source for a version of someone’s JSR 247 implementation, rooted at package javax.datamining.

The number of interfaces defined is truly remarkable. For example, javax.datamining.algorithm includes

  • svm.classification.SVMClassificationSettingsFactory

as one of several SVM-specific interfaces. The package javax.datamining.supervised contains:

  • classification.ClassificationTestMetricsTaskFactory,

just one of dozens of interfaces dealing with classifier test configuration. I couldn’t actually find implementations of SVMs in the Manning distribution of the decompiled javax.datamining package.

Hunter/Gatherer vs Farming with Respect to Information Access

August 7, 2008

Antonio Valderrabanos of Bitext.com gave a talk at NYU today in which he compared information providers’ current search/NLP strategies to humanity’s hunter/gatherer stage and offered a vision of information farming. I kept having images of Google’s web spider out digging roots and chasing animals with a pointed stick, wearing a grubby little loincloth. Then I would switch to images of a farm-stocked supermarket with well-organized shelves, helpful clerks, and lots of choice.

The analogy brought up a strong bias that I have in applying natural language processing (NLP) to real-world problems: I generally assume that the software must encounter text as it occurs in the “wild.” After all, that is what humans do so well, and we are in the business of emulating human language processing, right?

Nope, not on the farm we’re not. On the farm we use NLP to enrich documents with information that was never part of their standard written form. We use NLP to suggest and assign meta tags, connect entities to databases of concepts, and create new database entries for new entities. These are things that humans are horrible at doing from scratch, but humans are excellent at choosing from NLP-driven suggestions, and NLP is pretty good at making suggestions. So NLP is helping create the tools to index and cross-reference all the information in the supermarket at the concept level. Humans function as filters of what is correct, at least initially.
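As a concrete (and much simplified) sketch of that suggest-then-confirm loop: a LingPipe chunker proposes entity mentions as candidate tags, and a human reviewer accepts or rejects each one. The Reviewer interface here is hypothetical, standing in for whatever review UI or database sits behind the human filter, and the exact chunking API may differ slightly across LingPipe releases.

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;

// Sketch: an NLP chunker proposes entity mentions as candidate meta tags;
// a human reviewer confirms or rejects each one before it enters the index.
public class TagSuggester {

    // Hypothetical stand-in for the human review step (UI, queue, database, ...).
    public interface Reviewer {
        boolean accept(String type, String mention);
        void addTag(String type, String mention);
    }

    private final Chunker mChunker; // e.g. a pre-trained named-entity chunker

    public TagSuggester(Chunker chunker) {
        mChunker = chunker;
    }

    public void suggest(String docText, Reviewer reviewer) {
        Chunking chunking = mChunker.chunk(docText);
        for (Chunk chunk : chunking.chunkSet()) {
            String mention = docText.substring(chunk.start(), chunk.end());
            // Human in the loop: only confirmed suggestions become tags.
            if (reviewer.accept(chunk.type(), mention))
                reviewer.addTag(chunk.type(), mention);
        }
    }
}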

As the information supermarket gets bigger, the quality of the (machine-learning-based) NLP will get better, perhaps good enough to start automatically bringing in “wild” information with decent concept indexing and meta tagging. A keyword index is a crude yet effective tool, but an inventory system it is not, and that is what we need to advance to the next level of information access.

How Home Dentistry Kits and LingPipe are Similar

March 27, 2007

LingPipe is a tough sell to most commercial customers without professional services. Occasionally I will do a deal where all I do is cash a check, but almost all of our customers want lots of help. Suggest that they build it themselves, and they look at me like I suggested a home root canal. Why?

Take one of our simplest capabilities, language model classification. There is a simple, runs-out-of-the-box tutorial that takes a developer line by line through what needs to be done to do some classification. It is really simple. Yet I cannot get certain customers working with it.

The sticking point, I believe, is that unfamiliarity plus the slightly loose nature of machine learning techniques is too great a conceptual jump. The DynamicLMClassifier needs the labels of the categories (easy), a boolean choice of whether to use a bounded or sequence-based language model (starting to feel a little woozy), and a character n-gram size (whoa, a ‘whackety n-germ thingy’). The tutorial suggests that 6 is a good initial n-gram value, but they are lost at this point, I think. It gets worse because I suggest in the tutorial that they try different n-gram sizes to see what produces the best score. The scoring is nicely provided as part of the tutorial as well. This only gets worse as we dig deeper into the LingPipe API.
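For the record, here is roughly what those three decisions boil down to in code. This is a sketch, not the tutorial’s actual code: it assumes the LingPipe 3.x-era convenience methods (createNGramProcess / createNGramBoundary, train, classify) with made-up categories and training strings, so check the tutorial itself for the exact signatures in the current release.

import com.aliasi.classify.Classification;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;

public class SimpleLmClassify {
    public static void main(String[] args) {
        String[] categories = { "english", "french" }; // the category labels: the easy part
        int nGram = 6;                                 // the tutorial's suggested starting size

        // Process LMs model raw character sequences; createNGramBoundary(...)
        // is the "bounded" alternative mentioned above.
        DynamicLMClassifier<NGramProcessLM> classifier =
            DynamicLMClassifier.createNGramProcess(categories, nGram);

        // Training data is just (category, text) pairs; training calls have
        // changed in later LingPipe releases.
        classifier.train("english", "the cat sat on the mat");
        classifier.train("french", "le chat est sur le tapis");

        Classification c = classifier.classify("where is the cat?");
        System.out.println("best category = " + c.bestCategory());
    }
}

The tuning step is then just rerunning this with different n-gram sizes and comparing the evaluation scores the tutorial produces.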

Tuning these systems requires a particular mindset that is not part of a core computer science curriculum. It doesn’t require great intelligence, but experience is a must. Until we find a way to sort this out, we will continue to see such systems stay out of general production. My mantra is “make computational linguistics as easy to use as a database.” We have a ways to go before we move away from the black-art status of our field.

breck