We’re now running into a problem we’ve run into before: so-called “intelligent” tokenization. The earliest version of this that I’m aware of is the Penn Treebank tokenization, which assumes sentence splitting has already been done. That way, end-of-sentence punctuation can be treated differently from other punctuation. Specifically, “Mr. Smith ran. Then he jumped.” gets split into two sentences, “Mr. Smith ran.” and “Then he jumped.” Now the fun starts. The period at the end of each sentence is split off into its own token. The period after “Mr” remains, so the tokens of the first sentence are “Mr.”, “Smith”, “ran” and “.”.
Note that the Treebank tokenizer also replaces double quotes with either left or right LaTeX-style quotes, so there’s no way to reconstruct the underlying text from the tokens. Like many other projects, they also throw away the whitespace information in the data, so there’s no way to train a whitespace-sensitive tokenizer because we just don’t have the whitespace. That’s what you get for releasing data like “John/PN ran/V ./PUNCT”: you just don’t know if there was a space between the final verb “ran” and the full stop “.”. You’ll also see that their script builds in all sorts of knowledge about English, such as a handful of contractions, so that “cannot” is split into two tokens, “can” and “not”.
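To make the information loss concrete, here’s a rough sketch in Python of a Treebank-like tokenizer for a single, already-split sentence. It is not the actual Penn Treebank script: the function name `treebank_like_tokenize` and the three rules it applies (quote rewriting, splitting the final period, splitting “cannot”) are simplifying assumptions based only on the behavior described above.

```python
import re


def treebank_like_tokenize(sentence):
    """Toy approximation of Treebank-style tokenization for ONE sentence.

    Assumption: only three rules are modeled; the real script has many more.
    """
    text = sentence.strip()
    # A double quote at the start of the text or after whitespace is treated
    # as an opening quote and rewritten as ``; any other double quote becomes ''.
    text = re.sub(r'(^|\s)"', r'\1`` ', text)
    text = text.replace('"', " '' ")
    # Split off a period only when nothing but whitespace or closing quotes
    # follows it, so the period in "Mr." stays attached.
    text = re.sub(r"\.(?=(\s|'')*$)", " . ", text)
    # One hard-coded piece of English knowledge: "cannot" -> "can not".
    text = re.sub(r"\b([Cc])annot\b", r"\1an not", text)
    # Splitting on whitespace discards the original spacing for good.
    return text.split()


if __name__ == "__main__":
    print(treebank_like_tokenize("Mr. Smith ran."))
    print(treebank_like_tokenize("Mr. Smith ran ."))
    print(treebank_like_tokenize('He said "I cannot."'))
```

In this sketch, “ran.” and “ran .” produce exactly the same token list, so nothing downstream can tell whether there was a space before the full stop.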
The most recent form of “intelligent” tokenization I’ve seen is perpetrated by UPenn, this time as part of their BioIE project. There, the data’s not even consistently tokenized, because they left it to annotators to decide on token boundaries. They then use statistical chunkers to uncover the tokenizations probabilistically.
Google’s n-gram data is also distributed without a tokenizer. The tokenization looks very simple, but there are lots of heuristic boundary conditions that make it very hard to run on new text. Practically speaking, the data’s out of bounds anyway because of its research-only license. Unlike with the Penn Treebank or French Treebank, there’s no option to buy a commercial license.
I’ve just been struggling with the French Treebank, which follows the Penn Treebank’s lead in using “intelligent” tokenization. The problem for us is that we don’t know French, so we can’t quite read the technical docs (in French), nor induce from the corpus how tokenization was done.
This is all terribly problematic for the “traditional” parsing model of first running tokenization, then running analysis. The problem is that the tokenization depends on the analysis and vice-versa. At least with the Penn approach, there’s code to do their ad hoc sentence splitting and then their ad hoc heuristic tokenization. It may not be coherent from an English point of view (a handful of contractions will be split; others won’t), but at least it’s reproducible.
Our own approach (in practice — in theory we can plug and play any tokenizer that can be coded) has been to take very fine-grained tokenizations so that the tokenization would be compatible with any old kind of tagger. Our HMM chunker pays attention to tokenization, but the rescoring chunker uses whitespace and longer-distance token information.
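A fine-grained tokenizer of this general kind can be sketched in a few lines of Python. This is an illustration only, not the tokenizer described above: the regular expression, the `Token` tuple, and the helper `whitespace_before` are all assumptions made for the example. The point it tries to show is that keeping character offsets lets later stages recover whitespace information even though the tokens themselves are tiny.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


# Assumed token classes: runs of letters, runs of digits,
# and every other non-whitespace character on its own.
_FINE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def fine_tokenize(text):
    """Return fine-grained tokens together with their character offsets."""
    return [Token(m.group(), m.start(), m.end()) for m in _FINE.finditer(text)]


def whitespace_before(tokens):
    """For each token, report whether anything was skipped before it."""
    flags = []
    prev_end = 0
    for tok in tokens:
        flags.append(tok.start > prev_end)
        prev_end = tok.end
    return flags


if __name__ == "__main__":
    toks = fine_tokenize("p-53 binds.")
    print([t.text for t in toks])
    print(whitespace_before(toks))
```

Because the offsets survive, “ran.” and “ran .” remain distinguishable, and a later stage is free to glue “p”, “-” and “53” back together if it decides they form one unit.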
At the BioNLP workshop at ACL 2008, Peter Corbett presented a paper (with Ann Copestake) on Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition. It’s a really neat paper that addresses issues of confidence estimation, and particularly trading precision for recall (or vice-versa). But they weren’t able to reproduce our 99.99% recall gene/protein name extraction. During the question period, we got to the bottom of what was going on, which turned out to be intelligent tokenization making mistakes so that entities weren’t extractable because they were only parts of tokens. I’m hoping Peter does the analysis to see how many entities are permanently lost due to tokenization errors.
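The kind of analysis suggested here is easy to sketch. Assuming tokens and gold-standard entities are both available as character spans, an entity is unrecoverable by any token-based tagger whenever its start or end falls inside a token. The function name `lost_entities` and the example spans are hypothetical, chosen only to illustrate the check.

```python
def lost_entities(token_spans, entity_spans):
    """Return entity spans that cannot be expressed as a run of whole tokens.

    Both arguments are lists of (start, end) character offsets, end exclusive.
    """
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    return [
        (start, end)
        for start, end in entity_spans
        if start not in starts or end not in ends
    ]


if __name__ == "__main__":
    # Hypothetical text: "anti-p53 antibody", with the entity "p53" at (5, 8).
    entities = [(5, 8)]
    coarse = [(0, 8), (9, 17)]                         # "anti-p53", "antibody"
    fine = [(0, 4), (4, 5), (5, 6), (6, 8), (9, 17)]   # "anti", "-", "p", "53", "antibody"
    print(lost_entities(coarse, entities))
    print(lost_entities(fine, entities))
```

With the coarse tokenization the entity is buried inside “anti-p53” and is reported as lost; with the fine-grained one it lines up with token boundaries and can still be found.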
So why do people do “intelligent” tokenization? The hope is that by making the token decisions more intelligently, downstream processing like part-of-speech tagging is easier. For instance, it’s difficult to even make sense of assigning part-of-speech tags to three tokens derived from “p-53”, namely “p”, “-” and “53”. Especially if you throw away whitespace information.
Tokenization is particularly vexing in the bio-medical text domain, where there are tons of words (or at least phrasal lexical entries) that contain parentheses, hyphens, and so on. (This turned out to be a problem for WordNet.)
In some sense, tokenization is even more vexing in Chinese, which isn’t written with spaces. To get around that problem, our named-entity detector just treats each character as a token; that worked pretty well for our entry in the SIGHAN 3 bakeoff. There were even two papers on jointly modeling segmentation and tagging for Chinese at this year’s ACL (Jiang et al. and Zhang et al.). Joint modeling of this kind seems like a promising approach to allowing “intelligent” tokenization; by extending the tokenization model far enough, we could even maintain high end-to-end recall, which is not possible with a state-of-the-art first-best probabilistic tokenizer.
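The character-as-token idea is simple enough to show directly. The sketch below is an assumption-laden illustration, not the detector described above: the function names, the `LOC` label, and the use of B/I/O tags are all choices made for the example. It shows that when every character is a token, any entity span lines up with token boundaries by construction.

```python
def char_tokens(text):
    """Treat every non-whitespace character as its own token, with offsets."""
    return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


def bio_tags(text, entities):
    """Assign a B-/I-/O tag to each character token.

    `entities` is a list of (start, end, label) character spans, end exclusive.
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    # Keep only the tags for positions that became tokens.
    return [tags[start] for _, start, _ in char_tokens(text)]


if __name__ == "__main__":
    text = "我住在北京"          # "I live in Beijing"
    entities = [(3, 5, "LOC")]  # hypothetical annotation for 北京
    for (ch, _, _), tag in zip(char_tokens(text), bio_tags(text, entities)):
        print(ch, tag)
```

No segmentation decision is made up front, so no entity can be lost to a segmentation error; the cost is that the tagger has to do all the work of finding word-like units itself.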