The standard approach to information retrieval in English involves tokenizing text into a sequence of words, stemming the words to their morphological roots, then tossing out stop words that are too frequent or non-discriminative. This is how the standard settings work in Apache’s Java-based Lucene search engine.
Taking this approach for Chinese would require being able to find the words. But the state of the art in Chinese word segmentation is only 97-98% or so, and that's on the relatively well-edited and homogeneous test sets making up the third SIGHAN segmentation bakeoff. Keep in mind that this is an average performance figure, with rarer words being harder to find; out-of-training-vocabulary performance is at best 85%. Unfortunately, it's those rarer words in the long tail, like names, that are often the key discriminative component of a query.
It turns out that you can go a very long way indexing Chinese using character unigrams and bigrams. That is, instead of breaking text into word tokens, pull out all the characters and index those as if they were words, then pull out all the two-character sequences and index those the same way. This is what we did for Technorati. The evaluation on real queries showed what you’d expect: very good recall, but often poor precision because of unintended cross-word and sub-word effects. Consider the English word sequence “and add it to your”, which would look like “andaddittoyour” without spaces. The problem is that without spaces, it also matches the words “dad”, “toy”, “ditto”, “you”, and “our”. (In swear-word filtering, naive replacement inside subwords leads to the famous “clbuttic” mistake.)
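As a concrete sketch, here's what unigram-plus-bigram term extraction looks like in plain Java. This is an illustration, not LingPipe or Lucene code, and it assumes every character fits in a single `char` (i.e., no supplementary code points):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: index every character and every two-character
// sequence as if each were a word.
public class CharNGrams {

    // Returns all character 1-grams of the input, followed by all 2-grams.
    public static List<String> unigramsAndBigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++)
            terms.add(text.substring(i, i + 1));   // unigrams
        for (int i = 0; i + 2 <= text.length(); i++)
            terms.add(text.substring(i, i + 2));   // bigrams
        return terms;
    }
}
```

Every unigram and bigram goes into the index as an ordinary term, so recall is high: any query term that appears anywhere in a document, even mid-word, will match.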
By the way, subword problems come up in English at an alarming rate. How many of you know that “Spider-man” is two words while “Batman” is only one? In looking at customer query logs for TV data, we saw that users often have no idea how many spaces to put in; we just expect our search engines to have solved that problem for us. They actually solve it in roughly the same way as we suggest for Chinese. (This is also a problem in noun-compounding languages like German and Danish, and more generally in highly agglutinative languages like Turkish.)
The cross-word effects would be eliminated entirely if we could teach Chinese query writers to insert spaces between words. I’m told this is impossible given the conventions of the language, but my reply is “where there’s a will, there’s a way”, and I think the adoption of Pinyin provides some direct support for my conjecture. But let’s suppose we can’t get Chinese searchers to insert spaces; spaces wouldn’t take care of the subword problem anyway.
So how can we improve precision for unigram and bigram indexing? Add words when we can find them and let TF/IDF sort out the weights. This is the idea behind this paper, which evaluates a number of such combinations:
Jian-Yun Nie, Jianfeng Gao, Jian Zhang, and Ming Zhou. 2001. On the Use of Words and N-grams for Chinese Information Retrieval. Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages.
The best indexing seems to arise from a mixture of character unigrams, character bigrams and words.
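A minimal sketch of such a mixture, under the assumption that the word list is just a plain `Set<String>` (the real pipeline would use a proper tokenizer and chunker): emit every character unigram, every character bigram, and every dictionary word found in the text as index terms, and let TF/IDF arbitrate among them at query time.

```java
import java.util.*;

// Sketch of mixed indexing: character 1-grams, 2-grams, plus any
// dictionary words found in the text, all emitted as ordinary index
// terms. Dictionary words that also show up as bigrams simply get
// extra term occurrences, which TF/IDF weighting then sorts out.
public class MixedIndexer {

    public static List<String> terms(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));          // unigram at i
            if (i + 2 <= text.length())
                out.add(text.substring(i, i + 2));      // bigram at i
            for (String w : dict)                       // dictionary words starting at i
                if (text.startsWith(w, i))
                    out.add(w);
        }
        return out;
    }
}
```

The brute-force scan over the dictionary is just for clarity; an Aho-Corasick matcher would find the same words in a single pass.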
So how do we get the words? Two ways, chosen for maximal precision and control. First, we use a dictionary. We can build an efficient tokenizer that finds dictionary words using LingPipe’s Aho-Corasick implementation, dict.ExactDictionaryChunker, whose use is described in its javadoc and in the named-entity tutorial. Just make sure you use the right tokenizer, which for Chinese would be tokenizer.CharacterTokenizerFactory. A dedicated implementation would be faster than going through tokenizers, given that we’re treating each character as a token.
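For readers who want to see the contract rather than the class, here's a brute-force stand-in for what the dictionary chunker computes over character tokens: every (start, end) span of the text that exactly matches a dictionary word, overlapping matches included. ExactDictionaryChunker does this in linear time via Aho-Corasick; this quadratic version is only meant to show the output.

```java
import java.util.*;

// Brute-force illustration of exact dictionary chunking: return the
// [start, end) span of every substring that is a dictionary word,
// including overlapping matches.
public class DictSpans {

    public static List<int[]> allMatches(String text, Set<String> dict) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < text.length(); i++)
            for (int j = i + 1; j <= text.length(); j++)
                if (dict.contains(text.substring(i, j)))
                    spans.add(new int[] { i, j });
        return spans;
    }
}
```

For example, with the dictionary {"and", "dad"}, the text "andad" yields the overlapping spans [0,3) and [2,5), which is exactly the cross-word ambiguity discussed above.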
As for where to find a dictionary, that’s up to you. Training data is still available on the old SIGHAN site or from the LDC.
The second approach to word spotting can be approximate, and for that, we recommend our noisy channel Chinese word segmenter, as described in our tutorial on Chinese word segmentation.
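LingPipe's segmenter scores hypotheses with character language models in a noisy-channel setup; the toy sketch below fakes the scoring with a hand-built map of log probabilities (an assumption for illustration only) just to show the shape of the decoding step: dynamic programming over all ways to split the input, keeping the highest-scoring segmentation.

```java
import java.util.*;

// Toy segmentation by dynamic programming: choose the split of the
// input that maximizes the sum of per-word log scores. The scores
// here come from a hand-built map; a real noisy-channel segmenter
// would score candidate words with character language models.
public class ToySegmenter {

    public static List<String> segment(String text, Map<String, Double> logProb) {
        int n = text.length();
        double[] best = new double[n + 1];   // best score of a split of text[0..end)
        int[] back = new int[n + 1];         // start of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++)
            for (int start = 0; start < end; start++) {
                Double lp = logProb.get(text.substring(start, end));
                if (lp != null && best[start] + lp > best[end]) {
                    best[end] = best[start] + lp;
                    back[end] = start;
                }
            }
        // Recover the best split; if nothing scored, back[] is all
        // zeros and the whole input comes back as one chunk.
        LinkedList<String> words = new LinkedList<>();
        for (int end = n; end > 0; end = back[end])
            words.addFirst(text.substring(back[end], end));
        return words;
    }
}
```

For instance, with scores {"ab": -1.0, "cd": -1.0, "abcd": -3.0}, the input "abcd" splits as ["ab", "cd"], since -2.0 beats -3.0.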
The only fiddly bit of coding is wrapping up our dictionary chunker and Chinese word segmenter as Lucene analyzers (see lucene.analysis.Analyzer). You can see how to adapt LingPipe tokenizer factories for use as Lucene analyzers in our new sandbox project LingMed; look for the top-level read-me.html file and the adapter implementation for LingPipe tokenizers.
We’ve sidestepped the fact that Chinese has two distinct ways of rendering characters: traditional Chinese characters (used mainly in Taiwan, Hong Kong, and among expats) and simplified Chinese characters (used primarily on the mainland). The Unicode standard, of course, contains both.
It’s a simple matter to load a single dictionary containing both types of characters and to train a single word segmenter operating over both. We’ve had good luck training multilingual (English/Hindi) named-entity extractors in the same fashion. It works because the model relies on local context to predict words, and local context is almost always homogeneous in character set. That way, you can handle documents with mixed character sets and don’t have to run a character-set identification pass up front (which is tricky, because the code points overlap in Unicode and queries are often short).
We should point out that this method can also work for Japanese, though you might want to increase the n-gram length for the syllabic characters in Hiragana and Katakana, while keeping 1-grams and 2-grams for the imported Chinese characters making up Kanji.
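One plausible way to make the n-gram order script-sensitive (an assumption about a reasonable design, not LingPipe code) is to key it off the Unicode block of each character, which Java exposes directly:

```java
// Sketch of script-sensitive n-gram order for Japanese: use longer
// n-grams over Hiragana/Katakana syllabics, but keep 1-grams and
// 2-grams for Kanji (CJK unified ideographs) and everything else.
public class ScriptNGrams {

    static boolean isSyllabic(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA;
    }

    // Maximum n-gram length to index for a run starting at this character;
    // the cutoff of 3 for kana is an illustrative choice, not a tuned value.
    public static int maxNGram(char c) {
        return isSyllabic(c) ? 3 : 2;
    }
}
```

Since kana runs behave more like alphabetic spellings than like ideograms, longer n-grams there recover more word-like units without blowing up the index the way long ideogram n-grams would.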