“Academic” Licenses, GPL, and “Free” Software


[This post repeats a long comment I posted about licensing in response to Brendan O’Connor’s blog entry, End-to-End NLP Packages. Brendan’s post goes over some packages for NLP and singles out LingPipe as being only “quasi free.”]

Restrictive “Academic-Only” Licenses

Some of those other packages, like C&C Tools and Senna, are in the same “quasi free” category as LingPipe in the sense that they’re released under what their authors call “non-commercial” licenses. For instance, none of the Senna, C&C, or LingPipe licenses are compatible with GPL-ed code. Senna goes so far as to prohibit derived works altogether.

The LingPipe License

The intent for the LingPipe license was a little different from the “academic use only” licenses in that we didn’t single out academia as a special class of users. We do allow free use for research purposes for industrialists and academics alike. We also provide a “developers” license that explicitly gives you this right, which makes some users’ organizations feel better.

Truly Free NLP Software

The other tools, like NLTK, Mallet, OpenNLP, and GATE, are released under more flexible licenses (LGPL, Apache or BSD), which I really do think of as being truly “free”. Mahout’s also in this category, though not mentioned by Brendan, whereas packages like TreeTagger are more like Senna or C&C in their restrictive “academic only” licensing.

Stanford and the GPL

Stanford NLP’s license sounds like it was written by someone who didn’t quite understand the GPL. Their page says (the link is also theirs):

The Stanford CoreNLP code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software.

Technically, what they say is true. It would’ve been clearer if they’d replaced “research” with “research and non-research” and “free” with “free and for-profit”. Instead, their choice of examples suggests “free” or “research” have some special status under the GPL, which they don’t. With my linguist hat on, I’d say their text leads the reader to a false implicature. The terms “research” and “academia” don’t even show up in the GPL, and although “free” does, GNU and others clarify this usage elsewhere as “free as in free speech”, not “free as in free beer”.

Understanding the GPL

The key to understanding the GPL lies behind Stanford’s embedded link to “proprietary software”.

Here, proprietary doesn’t have to do with ownership, but rather with closed source. Basically, if you redistribute source code or an application based on GPL-ed code, you have to also release your code under the GPL, which is why it’s called a “copyleft” or “viral” license. In some cases, you can get away with using a less restrictive license like LGPL or BSD for your mods or interacting libraries, though you can’t change the underlying GPL-ed source’s license.

GPL Applies to Academics, Too

There’s no free ride for academics here: you can’t take GPL-ed code, use it to build a research project for your thesis, then give an executable away for free without also distributing your code under a compatible license. And you can’t restrict the license to something research-only. Similarly, you couldn’t roll a GPL-ed library into Senna or C&C or LingPipe and redistribute them under their own licenses. Academics often violate these terms because they somehow think “research use only” is special.

Services Based on GPL-ed Software and the AGPL

You can also set up a software service, for example on Amazon’s Elastic Compute Cloud (EC2) or on your own servers, that’s entirely driven by GPL-ed software, like say Stanford NLP or Weka, and then charge users for accessing it. Because you’re not redistributing the software itself, you can modify it any way you like and write code around it without releasing your own software. GNU introduced the Affero GPL (AGPL), a license even more restrictive than the GPL that tries to close this server loophole for the basic GPL.

Charging for GPL-ed Code

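You can charge for GPL-ed code if you can find someone to pay you. That’s what Red Hat’s doing with Linux and what Revolution R’s doing with R. Enthought’s doing much the same with Python, though Python itself is released under the GPL-compatible PSF license rather than the GPL.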

LingPipe’s Business Model is Like MySQL’s

Note that this is not what MySQL AB did with MySQL (before the company was sold to Sun, which Oracle later acquired) nor is it what we do with LingPipe. In both those cases, the company owns all the intellectual property and copyrights and thus is able to release the code under multiple licenses. This strategy’s explained on the

We license LingPipe under custom licenses as well as our royalty-free license. These licenses include all sorts of additional restrictions (like only using some of the modules on so many servers) and additional guarantees (like indemnification and maintenance). Don’t ask me about the details; that’s Breck’s bailiwick. Suffice it to say most companies don’t like to get involved with copyleft, be it from GPL or LingPipe’s royalty-free license. So we let them pay us extra and get an unencumbered license so they can do what they want with LingPipe and not have to share their code. We’ve had more than one customer buy a commercial license for LingPipe who wouldn’t even tell us what they were going to do with our software.

Free “Academic” Software

Also, keep in mind that as an academic, your university (or lab) probably has a claim to your intellectual property developed using their resources. Here’s some advice from GNU on that front:


12 Responses to ““Academic” Licenses, GPL, and “Free” Software”

  1. Brendan O'Connor Says:

    Hi Bob, thanks for the clarifications. Actually I was thinking more about “free-as-in-beer” when I said “quasi-free”… people at companies can use Stanford NLP without paying anything (despite some of the confusing statements you pointed out on Stanford’s site), compared to LingPipe. Sorry for any confusion.

    Academic licensing and GPL and everything is quite a mess. We like to just do Apache License now for code releases, which seems a little easier. Of course there are still potential issues (e.g. university intellectual property), but I think the most common use case for releasing research software might be more about getting the code out there for replicability sorts of reasons.

    [On the other hand, that’s less the case with academic software that underwent a ton of engineering effort, like the Stanford tools…]

    • Bob Carpenter Says:

      Agreed — licensing is a mess.

      You’re right that companies can use Stanford NLP if they abide by the GPL in their use of it. Most companies don’t want to release software that depends on other GPL software because it seriously undercuts their ability to sell their software when the license also requires them to open source it. But, as I mentioned, they could develop a service around it as long as it never got distributed.

      LingPipe’s royalty-free license lets companies use LingPipe for free services, or for evaluation and research. My point was that this is a bit more liberal than the “academic use only” licenses, and much more liberal than the “no derivative works” licenses.

      Apache licensing is great. The main hurdle is the University itself. I just checked out your ARK Tweet corpus (see the post scheduled for tomorrow), but the license says “other open source”, and then the data itself says “GPL” (which is an odd license to use for data — why not Creative Commons?) Who owns the copyright on Twitter posts themselves?

      • Fred Mailhot Says:

        I just checked out your ARK Tweet corpus (see the post scheduled for tomorrow), but the license says “other open source”, and then the data itself says “GPL” (which is an odd license to use for data — why not Creative Commons?)

        This raised an interesting issue/problem for me recently, one that I haven’t seen addressed online in the copious literature on the GPL (presumably I haven’t looked hard enough yet):

        Do machine learning algorithms trained on copyrighted data count as “derivative works” in some sense? My intuition is no, but it’s clear that in some sense the data are embodied within a trained model (even setting aside clear cases like memory-based models).

        Aside: thanks to Brendan et al for re-releasing the data under the CC-BY!

      • Bob Carpenter Says:

        Some people definitely intend to release their data for “research purposes only” and consider building models, etc., derivative works.

        In order to use the LDC data to build models that you sell, they require a rather expensive commercial license.

        I have no idea whether any of this has been tested in court. I’m not a lawyer.

        I don’t even know if there are restrictions related to the corpus being reconstructible from the model. For instance, if I record 10-gram counts for a language model, I’d have a good chance of reconstructing the corpus.

  2. Dr. Jochen L. Leidner Says:

    Tweets are by default copyrighted by their owners (if their limited length still makes them count as creative writing under the copyright acts of the respective jurisdictions, which to the best of my knowledge hasn’t been tested in any courts yet). I know of people who were redistributing tweet collections and were told by Twitter Inc. to refrain from doing so. Work with tweets at NIST (as part of TREC) has severely suffered from this legal limitation, and the workaround provided (sharing IDs and tweet download software) has proven not to be reliable enough, AFAIK.

  3. Christopher Manning Says:

    Hi Bob, I don’t think it was a matter of misunderstanding the GPL, it was just an attempt to say something positive about allowed uses before heading straight to things you can’t do. Nevertheless, you’re the second person that hasn’t liked this wording, and I agree with your objection that it is actually quite possible to make “research” use of these tools in violation of the GPL, and so for my new year spring cleaning, I have revised the wording. It now just says “which allows many free uses”, which hopefully will be okay by everyone….

    • Bob Carpenter Says:

      It makes sense saying what people can do with it. We’ve tried to steer clear of that with LingPipe’s license — we simply can’t afford the lawyers to interpret our own quirky royalty-free license! It was designed to have some AGPL-like restrictions (though we’d never heard of AGPL). At least with the (A)GPL, there are FAQs that I can almost understand.

      Many academics ignore licenses, but companies are nuts about them (for obvious reasons if you follow tech headlines).

      The project at Columbia’s going out as BSD. It’s a pain that R itself is GPL-ed, so the linkage of our project, R, and our R interface has to be GPL-ed, if I’m understanding the compatibility section of the GPL site properly.

      Licenses bring out the lawyers in everyone. Someone wrote to me asking to put a license on my arithmetic coder that had previously just said something like “do whatever you want with it”. They didn’t think that was enough of a license.

  4. John Bauer Says:

    As another maintainer of the Stanford CoreNLP package, I’d like to point out that the original text listed “software services” as an allowed option, which includes your example of using it on a for-profit Amazon cloud service. The one thing we cited as disallowed was “distributed proprietary software”.

    As the most common user support person, I’d go as far as to say I preferred the original wording. It is inevitable that in the next few weeks I will be answering questions such as “Is it okay to use Stanford CoreNLP in a research project?” or “Is it okay to use Stanford CoreNLP in a for-profit web service?”

    • Bob Carpenter Says:

      Why say more than that the software is GPL-ed? You could point people to the GPL FAQ.

      The word “distributed” is ambiguous between the redistribution of software sense and the distributed computing sense.

      We almost always just tell people to read our royalty-free license or have their lawyers read our royalty-free license.

      • John Bauer Says:

        Mostly because a lot of our online documentation is a response to a user question and an attempt to forestall having to answer the same user question again in the future.

  5. piper Says:

    Would you please explain clearly the procedure for installing LingPipe on Red Hat Linux 5?
