One of the perennial problems in tokenization for search, classification, or clustering of natural language data is which words and phrases are spelled as compounds and which are separate words. For instance consider “dodgeball” (right) vs. “dodge ball” (wrong) and “golfball” (wrong) vs. “golf ball” (right)?
It’s a classic lumpers vs. splitters problem. Alas, the “right” answer can change over time and some words are listed as both lumped and split in some dictionaries.
This point was first brought home to me and Breck when we were browsing Comcast’s query logs for TV search queries, where users seemed to arbitrarily insert or delete spaces in queries. Who can blame them with the inconsistencies of English in this regard?
This token lumping vs. splitting problem is one of the reasons LingPipe’s spelling correction is not token-based like Lucene’s (though Lucene has other ways to handle combining and splitting compounds — more on that from Mitzi in the near future). And it’s one of the reasons (along with morphological affixes) that character n-grams provide a better basis for classification than hard tokenization.
My favorite examples are comic book superhero names. It turns out that the editors at Marvel Comics are splitters, whereas those at DC Comics are lumpers. Hence we have on the Marvel roster,
Ant-Man, Iron Man, and Spider-Man.
They’re not consistent on hyphenation, which is another issue if you don’t throw away punctuation.
DC, on the other hand, employs men, women, and children, but doesn’t have the budget for spaces:
Aquaman, Batman, Catwoman, Hawkman, Martian Manhunter, Superboy, Supergirl, and Superman.
The exception that proves the rule (I’m open-minded enough to use a phrase that’s a pet peeve of mine), DC gives us: