To Stem or Not to Stem?

I’ve been working for a customer on unsupervised and semi-supervised approaches to stemming for information retrieval. I’d never thought too hard about stemming per se, preferring to tackle the underlying problem in applications through character n-gram indexing (we solve the same problems facing tokenized classifiers using character language models). I have thought about computational morpho-phonology back in my logic programming/constraint grammar phase.
I found two very interesting evaluation papers in the same volume of JASIS, the first of which is:

Paice, C. D. 1996. Method for evaluation of stemming algorithms based on error counting. Journal of the American Society for Information Science 47(8):632–649.
Wiley Pay Link (I couldn’t find a free copy online)

The Paice paper introduces a stemming-as-clustering evaluation, treating the clustering evaluation as one of equivalence recall and precision (just like LingPipe’s clustering precision/recall evaluation). Paice assumes, quite rightly for stemming, if I may say so, that the issue in question is whether two words word1 and word2 have the same stem, not the actual identity of the stem. This makes sense because a reverse-indexed IR system doesn’t care what a stem looks like, only which words it conflates. Paice then draws a distinction between “light” and “heavy” stemming, but I found this unmotivated as a two-way classification.
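
To make the equivalence precision/recall view concrete, here is a minimal sketch. It is not Paice’s exact error-counting formulation (his paper defines understemming and overstemming indices); it just scores a stemmer as a clustering by pairwise equivalence. The class name, the tiny vocabulary, and the gold equivalence classes are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Pairwise (equivalence) precision/recall for a stemmer viewed as a
 *  clustering: two words are equivalent iff they map to the same stem. */
public class StemEquivalenceEval {

    /** Returns the set of unordered word pairs that share a cluster label. */
    static Set<String> equivalentPairs(Map<String, String> wordToCluster) {
        List<String> words = new ArrayList<>(wordToCluster.keySet());
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i), b = words.get(j);
                if (wordToCluster.get(a).equals(wordToCluster.get(b))) {
                    // canonical key so gold and response pairs line up
                    pairs.add(a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical gold equivalence classes over a tiny vocabulary.
        Map<String, String> gold = Map.of(
            "author", "AUTHOR", "authors", "AUTHOR", "authored", "AUTHOR",
            "authority", "AUTHORITY", "authorities", "AUTHORITY");
        // A deliberately over-stemming response on the same words.
        Map<String, String> stemmed = Map.of(
            "author", "author", "authors", "author", "authored", "author",
            "authority", "author", "authorities", "author");

        Set<String> goldPairs = equivalentPairs(gold);         // 4 pairs
        Set<String> responsePairs = equivalentPairs(stemmed);  // 10 pairs

        Set<String> correct = new HashSet<>(responsePairs);
        correct.retainAll(goldPairs);                          // 4 pairs

        System.out.printf("precision=%.2f recall=%.2f%n",
            (double) correct.size() / responsePairs.size(),    // 0.40
            (double) correct.size() / goldPairs.size());       // 1.00
    }
}
```

Over-stemming shows up as low equivalence precision, under-stemming as low equivalence recall, which is exactly the trade-off the rest of this post worries about.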

Extending Paice’s example, here are the words I found in the English Gigaword corpus with etymological root author:

antiauthoritarian, antiauthoritarianism, antiauthority, author, authoratative, authoratatively, authordom, authored, authoress, authoresses, authorhood, authorial, authoring, authorisation, authorised, authorises, authoritarian, authoritarianism, authoritarians, authoritative, authoritatively, authoritativeness, authorities, authoritory, authority, authorization, authorizations, authorize, authorized, authorizer, authorizers, authorizes, authorizing, authorless, authorly, authors, authorship, authorships, coauthor, coauthored, coauthoring, coauthors, cyberauthor, deauthorized, multiauthored, nonauthor, nonauthoritarian, nonauthorized, preauthorization, preauthorizations, preauthorized, quasiauthoritarian, reauthorization, reauthorizations, reauthorize, reauthorized, reauthorizes, reauthorizing, semiauthoritarian, semiauthorized, superauthoritarian, unauthorised, unauthoritative, unauthorized

As you can see, it’s rather difficult to draw a line here. The shared root of author and authorize is lost on most native speakers, even though -ize is a regular suffix. In contrast, the relation between notary and notarize feels regular, while the relation between note and notary feels more opaque. And we’re not even considering word sense ambiguity here, a closely related problem for search.

These uncertainties in derivational stem equivalence are why many people want to restrict stemming to inflectional morphology. Inflections are forms of a single word, like verb conjugations and noun number (e.g. -s makes a noun plural), whereas derivational morphology often converts one syntactic category to another (e.g. -ly converts an adjective to an adverb). The problem with restricting to inflectional morphology is that it’s not the right cut at the problem: sometimes derivational stemming helps and sometimes inflectional stemming hurts.

This brings us to the second interesting paper:

Hull, David A. 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science 47(1):70–84. [CiteSeer link to paper]

Hull goes through a couple hundred TREC queries with a fine-toothed comb, finding that it’s almost impossible to declare a single winner among stemmers. He compared the simple Porter and Lovins stemmers with more linguistically sophisticated algorithms developed at Xerox (now part of Inxight’s offerings), and with less sophisticated algorithms like reducing every word to its first 5 characters or just removing a final s. There was simply no clear winner. The beauty of the paper is in the careful case studies in which he contrasted stemmers. For example, query 21 contained the word superconductivity but matched documents containing only the form superconductors, clearly requiring derivational morphology for high recall. A similar problem occurs for surrogate mother and surrogate motherhood in query 70. Another case was genetic engineering vs. genetically engineered. Another example where stemming helped was failure versus fail in the context of bank failures.

An example where stemming hurt was the query client server, where stemming server down to serve led to numerous false positive matches; this is really a frequency argument, as servers in the computer sense do serve data. Another example was reducing fishing to fish, causing recall problems even though it’s a simple inflectional change. Other problems arose in reducing privatization to private, which hurt precision.

What’s a poor computational linguist to do? One thing I’d recommend is to index both the word and its stem. If you’re using a TF/IDF-based document ranker, such as the one underlying Apache Lucene, you can index both the raw word and any number of stemmed forms in the same position and hope IDF sorts them out in the long run. That seems to work for Chinese, where folks tend to index both words and unigrams and/or bigrams because n-grams provide high recall and words provide added precision.
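
Here is a minimal sketch of that indexing trick, assuming a reasonably recent Lucene analysis API and the Snowball English stemmer bundled with Lucene’s analyzers. The filter name WordPlusStemFilter is mine, not a stock Lucene class: it emits each raw token and, when the stem differs, also emits the stem at the same position (position increment 0), so a query can match either form and IDF weights each one separately.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

/** Emits each token unchanged and, if the stem differs, the stem at the
 *  same position, so both the raw word and its stem land in the index. */
public final class WordPlusStemFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final EnglishStemmer stemmer = new EnglishStemmer();
    private String pendingStem = null;

    public WordPlusStemFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingStem != null) {
            // Emit the buffered stem on top of the word we just emitted.
            termAtt.setEmpty().append(pendingStem);
            posIncAtt.setPositionIncrement(0); // same position as the raw word
            pendingStem = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String raw = termAtt.toString();
        stemmer.setCurrent(raw);
        stemmer.stem();
        String stem = stemmer.getCurrent();
        if (!stem.equals(raw)) {
            pendingStem = stem; // queue the stem for the next call
        }
        return true; // first pass the raw token through untouched
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingStem = null;
    }
}
```

If I recall correctly, stock Lucene components can get roughly the same effect by chaining KeywordRepeatFilter, a stemming filter, and RemoveDuplicatesTokenFilter, but the point is the same: keep the surface form for precision and add the stem for recall, and let term weighting arbitrate.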

If anyone knows of other evaluations like Hull’s that involve n-gram based stemmers, I’d love to hear about them.

2 Responses to “To Stem or Not to Stem?”

  1. Mike Schultz Says:

    There’s a problem comparing the search performance of stemmed and unstemmed systems to each other using aggregate, i.e. average, measures. Those results tend to say, oh, it’s a wash: for some queries stemming is better, for some no stemming is better, so let’s stem. The problem is, users’ perceptions weigh larger in the calculation than an average measure can capture. The downside of over-stemming is never made up by the upside. Once a user sees “Amenities” -> “Amen”, “Heine” -> “Hein”, “Amy” -> “Ami”, “productivity” -> “produce” or the other bazillion wretched side effects of the Porter stemmer, you have lost their trust. I think the prudent dividing line is the inflectional/derivational distinction. People expect plurals to conflate to the singular, but Noun->Adjective->Verb->Noun? Please, that was a bad idea 30 years ago when it was born.

    A great article on the efficacy of stemming is “One Term or Two?” by Ken Church. The way to motivate stemming is by being convinced that two terms should be represented by one in the first place. If not, then not.

    • lingpipe Says:

      Indeed — I love Church’s paper. It’s nice to see someone try to measure the naivete of naive Bayes. And I think correlation is a good measure; it’s one that’s picked up by things like SVD approaches.

      You may be right about the downside. I’m getting very frustrated by what seems like overstemming on Google’s part (I use an acronym that means one thing, they expand it into something else or vice-versa).

      The other thing your comment made me realize is that there’s no way to recover from overstemming. You can recover from understemming by adding in disjunctions or prefix queries.
