Building a Stemming Corpus: Coding Standards


We’d very much like to be able to offer stemmers for multiple languages, for applications in search and feature extraction for classifiers. We’d like to base them on pre-compiled dictionaries for common words and an automatic system trained on the dictionary. We only have research rights to CELEX 2, and it’s only available for English, Dutch and German anyway. It’s also small, with only about 50K English words.

We’ve had success with Amazon’s Mechanical Turk, so we thought we’d try to define a stemming standard. The legwork (well, Python-work and REST-work) is being done by Emily Jamison, who’s working as an intern for us.

I’ll discuss the results we got from Mechanical Turk in a later blog entry. Emily’s about to launch a third Mechanical Turk run based on what we’ve learned from the first two. The rest of this post is just about defining how to stem.

The Universe of Tokens

We started with the set of lowercased alphabetic tokens appearing at least twice in the English section of the Leipzig Corpus. Emily sampled 1000 of these at random and had each one stemmed by 5 Turkers. We’ve actually done this twice, trying out two different interfaces.
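As a rough illustration of the selection step, here is a minimal Python sketch. It assumes the corpus is available as an iterable of text strings, that whitespace splitting is an acceptable tokenizer, and that “alphabetic” means ASCII a-z; the function names `token_universe` and `sample_tokens` are hypothetical, not the scripts Emily actually used.

```python
import random
import re
from collections import Counter

# Assumption: "alphabetic" means ASCII letters only, checked after lowercasing.
ALPHA = re.compile(r"[a-z]+")

def token_universe(texts, min_count=2):
    """Collect lowercased alphabetic tokens seen at least min_count times."""
    counts = Counter()
    for text in texts:
        for tok in text.lower().split():  # whitespace tokenization is an assumption
            if ALPHA.fullmatch(tok):
                counts[tok] += 1
    return sorted(t for t, c in counts.items() if c >= min_count)

def sample_tokens(universe, k=1000, seed=0):
    """Draw up to k tokens uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(universe, min(k, len(universe)))
```

Calling `sample_tokens(token_universe(texts))` would then give the list of words to hand to the Turkers.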

What’s a Stem?

For technical reasons, I want a corpus consisting of a word and the next simpler word(s) out of which it’s made, not counting productive affixes — basically an unlabeled morphological parse tree. Thus it might take a sequence of such stemming steps to produce a root form (e.g. “restfully” to “restful” to “rest”). In the first run, we tried to collect the affixes, too (e.g. “ly”, “ful” and no-stem).

My first motivation is term (feature) extraction for search (or general classifiers). I want to include the word itself and its whole chain of stems in the index and in the result of parsing queries. That way, if the whole long form matches, you get a higher score than if just a stem matches. This is similar to indexing Chinese with n-grams and words and should have a similar advantage for precision and recall. Hopefully this will mitigate the main problem raised in my earlier post, To Stem or Not to Stem?.
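To make the indexing idea concrete, here is a small Python sketch that expands a word into itself plus its whole chain of stems. It assumes the corpus is stored as a dictionary from each word to its list of “next” stems (more than one for compounds); the name `stem_chain` and the tiny `NEXT` table are hypothetical. Each term comes back with its depth in the chain, which a search engine could use to weight full-form matches above stem matches.

```python
from collections import deque

def stem_chain(word, next_stems):
    """Return (term, depth) pairs for word and every stem reachable from it."""
    seen = {word}
    queue = deque([(word, 0)])
    chain = []
    while queue:
        term, depth = queue.popleft()
        chain.append((term, depth))
        for stem in next_stems.get(term, ()):
            if stem not in seen:  # skips words that stem to themselves
                seen.add(stem)
                queue.append((stem, depth + 1))
    return chain

# Hypothetical single-step table built from examples in this post.
NEXT = {
    "restfully": ["restful"],
    "restful": ["rest"],
    "headquarters": ["head", "quarters"],
}

print(stem_chain("restfully", NEXT))
```

The same expansion would be applied to query terms, so that “restfully” in a query can still match a document containing only “rest”, just with a lower score.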

The second motivation is to gather data to train an automatic stemmer. This is much easier to do a single stem at a time, because the data’s awfully sparse for long sequences of prefixes and suffixes and compounds.

A third motivation is to make the task easier for the Mechanical Turkers. The first run, which asked for all the subwords and affixes, was just too much work (4-5 minutes per 20 words for all stems versus 1 minute per 20 words for just the “next” stem).

Does it Work?

Emily and I each stemmed the 1000 words after discussing the standard (but not the examples). It took us both less than an hour, making me think we could, with reasonable precautions against RSI, label a corpus of the most frequent 100,000 words ourselves.

Our agreement rate was an abysmal 85%. Most of the errors were on compounds. Not counting compounds, our agreement was closer to 95%, with most of the errors either brainos or typos, not misunderstandings of the standard.
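For reference, the agreement figures are just the fraction of words on which both annotators gave the same answer. Here is a minimal Python sketch, assuming each annotator’s output is a dictionary from word to a tuple of stems and that a “compound” is any word for which either annotator gave more than one stem. The function name and the toy annotations are invented for illustration; they are not our actual data.

```python
def agreement(ann_a, ann_b, skip_compounds=False):
    """Fraction of shared words on which two annotators gave identical stems."""
    shared = ann_a.keys() & ann_b.keys()
    if skip_compounds:
        # Assumption: a compound is any word either annotator split in two or more.
        shared = {w for w in shared
                  if len(ann_a[w]) == 1 and len(ann_b[w]) == 1}
    if not shared:
        return 0.0
    agree = sum(ann_a[w] == ann_b[w] for w in shared)
    return agree / len(shared)

# Toy annotations, invented for illustration only.
a = {"unlocking": ("unlock",), "ultraleftist": ("ultra", "leftist"),
     "fitter": ("fit",), "recall": ("call",)}
b = {"unlocking": ("unlock",), "ultraleftist": ("ultraleft",),
     "fitter": ("fit",), "recall": ("recall",)}

print(agreement(a, b))
print(agreement(a, b, skip_compounds=True))
```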

Case Studies

In the rest of this post, I break down the cases which required some thoughtful adjudication. The raw word’s on the left and the stem in our gold standard on the right. I include notes on CELEX-2’s stemming in parentheses when they disagreed with our results.
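If you want to load the examples below into a program, the notation is easy to parse. This is only a sketch under the assumption that every entry has the shape “word: stem [stem ...] (optional note)”; the regular expression and the name `parse_entry` are hypothetical helpers, not part of any released tool.

```python
import re

# word, then a colon, then space-separated stems, then an optional
# parenthesized note (typically the CELEX analysis).
ENTRY = re.compile(
    r"^\s*(?:•\s*)?(?P<word>[^:\s]+):\s*"
    r"(?P<stems>[^()]*?)\s*(?:\((?P<note>.*)\))?\s*$"
)

def parse_entry(line):
    """Parse one case-study line into (word, stems, note); None if malformed."""
    m = ENTRY.match(line)
    if m is None:
        return None
    return m.group("word"), tuple(m.group("stems").split()), m.group("note")

print(parse_entry("cockfighting: cockfight (CELEX: cock fighting)"))
print(parse_entry("headquarters: head quarters"))
```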

Foreign/Historical Roots. Here, we often have multiple words in English that share a foreign or historical root that is not itself an English word. For instance, “hypo” and “epidemi” are not English words, nor is there a shared English stem for “euphoria” and “euphoric”.

  • odious: odious
  • euphoria: euphoria (CELEX-2 stems “euphoric” to “euphoria”)
  • mathematical: mathematics
  • epidemiology: epidemic (CELEX: missing)
  • hypocrisy: hypocrisy (CELEX: hypocrite)
  • maximize: maximum

Prefix/Suffix Ambiguity. A word with a productive prefix and suffix can often be analyzed in two ways. For instance, “unlocking” could be “un” + “locking” or “unlock” + “ing”. Here, because the action is an unlock action, we prefer that root. Similarly, we prefer “modernised” to “unmodern” as a reduction for “unmodernised”. The case of “overcollateralization” is trickier, with either “overcollateralize” or “collateralization” being a reasonable choice.

  • unmodernised: modernised (CELEX: missing)
  • prearrangement: prearrange
  • unlocking: unlock (CELEX: missing)
  • incorrigible: corrigible
  • disability: disable
  • inconvenience: inconvenient (CELEX: convenient)
  • resentencing: sentencing (CELEX: missing)
  • overcollateralization: collateralization (CELEX: missing)

Compound/Affix Ambiguity. Compounds provide a general case of ambiguity with prefixes or suffixes. Again, we want the natural word to be left behind after stemming, which is why “headquarters” must be analyzed as “head” + “quarters” rather than “headquarter” + “s” — it’s not the plural of “headquarter”. On the other hand, “cockfighting” becomes “cockfight” + “ing”, because the compound is a natural word. In contrast, “newsgathering” becomes “news” + “gathering”, because “newsgather” isn’t a typical compound. Cases like “ultraleftist” are harder, because they make sense either way. Also note that you get non-affixational morphology with “signalmen”, the plural of “signalman”.

  • cockfighting: cockfight (CELEX: cock fighting)
  • headquarters: head quarters
  • ultraleftist: ultra leftist (CELEX: missing)
  • omnipotence: omnipotent
  • steelmaking: steel making (CELEX: missing)
  • stockholder: stock holder
  • signalmen: signalman (-)
  • weatherbeaten: weather beaten (CELEX: weather-beaten)
  • supercomputer: super computer (CELEX: missing)
  • newsgathering: news gathering (CELEX: missing)

To be honest, I steamrolled my own opinions in a couple of cases here. Emily still prefers “ultraleft” + “ist” to “ultra” + “leftist”. Digging my old linguistic semantics hat out of the closet, I’d argue there are two ambiguous derivations (sort of like “the ball in the box on the table”). “ultraleft” + “ist” means anyone on the extreme left, whereas “ultra” + “leftist” means someone who is an extreme instance of a leftist. To see this subtle distinction, note that “ultra” doesn’t mean moving more to the left — it’s just extreme in general. So you have to read “ultra” + “leftist” the way you’d read “ultramoderate”.

Non-Standard Affixes. Next, we have non-standard affixes or “funny” words, like “freebie”, which is clearly related to “free”.

  • freebie: free (CELEX: freebie)

Fixed Affixes. Some prefixes seem to be awfully tightly bound; it’s not clear “recall” should stem to “call” even though they’re morphologically related.

  • enclose: close (CELEX: enclose)
  • recall: call

Comparatives/Superlatives. We were unclear what to do with comparatives, but decided they should reduce to their basic scalar adjectives (“fitter” and “fittest” to “fit”), as it makes the most sense linguistically:

  • fitter: fit
  • furthest: far (CELEX: missing)

For search and feature extraction, it might make sense to first reduce the superlative to the comparative (“fittest”: “fitter”), then the comparative to the scalar adjective (“fitter”: “fit”).

False versus Rare Roots. Sometimes words aren’t compounds even though they look like it, such as “bulldoze”, which is a back-derivation from the proper name “Bulldozer”. Who knew that “fangle”, “percept” and “aniline” were words? Or that “weary” isn’t related to “wear”?

And then there’s “incommunicado” where we might say “communicado” (73K Google hits), even though “communicado” is not in any of the online dictionaries.

Cases like “chloracne” are less clear, because “chlor”, one of the compounded nouns, isn’t an ordinary English word.

We break apart even the words that standard introductions to linguistic morphology tell you are not morphologically complex, like “gooseberry” and “strawberry”. This might argue that “bulldoze” should be broken into “bull” + “doze”, especially for search, where users often don’t know which words are compounds.

  • bulldoze: bulldoze (CELEX: bull doze)
  • fangled: fangle (CELEX: missing)
  • weary: weary
  • perceptual: percept (CELEX: missing)
  • incommunicado: incommunicado
  • doomsday: dooms day (CELEX: doomsday)
  • chloracne: chlor acne (CELEX: missing)
  • polyaniline: poly aniline (CELEX: missing)
  • gooseberry: goose berry

Semantic Drift. Some words don’t seem very closely related to their stems because the derived word has gathered a fixed meaning. We stemmed them anyway.

  • directory: direct

I don’t think the stem should be “director”. The examples surrounding “authority” versus “author” in my previous blog post were clearer.

Misspellings/Contractions/Slang. We just treat them as if they were spelled properly and did as well as we could.

  • oughta: ought (CELEX: missing)

Derivational Ambiguity. In some cases there are ambiguities. Our first example was better, leaving us asking whether “butcher” should stem to “butch” (imagine the gender-based usage, not the meat-cutting one). In our sample of 1000, we found this one:

  • sweater: sweat

More than One Issue. Some words have multiple issues (non-standard roots plus compound ambiguity; misspellings plus foreign roots).

  • sadomasochism: sado masochism (sadomasochism)
  • honoaria: honoarium (CELEX: missing but they stem “honourarium” to “honour”)

Google(fight) Bug

Emily found an interesting bug in Google’s results while exploring “honoaria”: Google reported 891K hits (“honoraria” reported 896K hits) but ran out of actual results after 66.

So much for Google-based lexicography. What’s the world coming to when you can’t trust a good old fashioned googlefight?
