Chinese Information Retrieval with LingPipe and Lucene

The standard approach to information retrieval in English involves tokenizing text into a sequence of words, stemming the words to their morphological roots, then tossing out stop words that are too frequent or non-discriminative. This is how the standard settings work in Apache’s Java-based Lucene search engine.
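For concreteness, here’s a rough sketch of that pipeline as a Lucene analyzer: tokenize and lowercase, drop stop words, then Porter-stem. It’s written against Lucene 2.x-era APIs (current as of this post); later Lucene releases reworked the analysis interfaces, so take it as illustrative rather than definitive.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Tokenize, stop-list, then stem: the standard English pipeline.
    public class EnglishStemmingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new LowerCaseTokenizer(reader); // tokenize + lowercase
            stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS); // drop stop words
            return new PorterStemFilter(stream); // reduce words to morphological stems
        }
    }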

Taking this approach for Chinese would require being able to find the words. But the state of the art in Chinese word segmentation is only 97-98% or so on the relatively well-edited and homogeneous test sets making up the third SIGHan segmentation bakeoff. And keep in mind that this is the usual average performance figure, with rarer words being harder to find. Out-of-training-vocabulary performance is at best 85%. Unfortunately, it’s those rarer words in the long tail, like names, that are often the key discriminative component of a query.

It turns out that you can go a very long way indexing Chinese using character unigrams and bigrams. That is, instead of breaking text into tokens, pull out all the characters and index those as if they were words, then pull out all the two-character sequences and index those the same way. This is what we did for Technorati. The evaluation on real queries showed what you’d expect: very good recall, but often poor precision because of unintended cross-word and sub-word effects. Consider an English example of the word sequence “and add it to your”, which would look like “andaddittoyour” without spaces. The problem is that it then spuriously matches the words “dad”, “toy”, “ditto”, “you”, and “our”. (In swear-word filtering, naive replacement in subwords leads to the “clbuttic” mistake.)
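Here’s a minimal sketch of pulling out character 1-grams and 2-grams with LingPipe’s tokenizer.NGramTokenizerFactory; the class and variable names are just for illustration:

    import com.aliasi.tokenizer.NGramTokenizerFactory;
    import com.aliasi.tokenizer.Tokenizer;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class CharNGrams {
        public static void main(String[] args) {
            // 1-grams and 2-grams over characters; each n-gram gets
            // indexed as if it were a word
            TokenizerFactory factory = new NGramTokenizerFactory(1, 2);
            char[] cs = "andaddittoyour".toCharArray();
            Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
            String token;
            while ((token = tokenizer.nextToken()) != null)
                System.out.println(token); // each character and each character bigram
        }
    }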

By the way, subword problems come up in English at an alarming rate. How many of you know that “Spider-Man” is two words while “Batman” is only one? In looking at customer query logs for TV data, we saw that users often have no idea how many spaces to type. We just expect our search engines to have solved that one for us. They actually solve this problem roughly the same way as we suggest for Chinese. (This is also a problem in noun-compounding languages like German and Danish, and more generally in highly compounding languages like Turkish.)

The cross-word effects would be completely mitigated if we could teach Chinese query writers to insert spaces between words. I’m told this is impossible given the conventions of the language, but my reply is “where there’s a will, there’s a way”, and I think the adoption of Pinyin provides some direct support for my conjecture. But let’s suppose we can’t get Chinese searchers to insert spaces; spaces wouldn’t take care of the subword problem anyway.

So how can we improve precision for unigram and bigram indexing? Add words when we can find them, and let TF/IDF sort out the weights. This is the idea behind the following paper, which evaluates a number of such combinations:

Jian-Yun Nie, Jianfeng Gao, Jian Zhang, and Ming Zhou. 2000. On the Use of Words and N-grams for Chinese Information Retrieval. In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL 2000).

The best indexing seems to arise from a mixture of character unigrams, character bigrams and words.
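In Lucene, one way to realize such a mixture is to index the same text in parallel fields, one per token type, and route each field to its own analyzer. The field names and analyzer variables below are made up, and it’s again written against Lucene 2.x-era APIs:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MixedIndexer {
        // Index the same text three ways; TF/IDF scoring then
        // balances the evidence from the three fields.
        public static void index(IndexWriter writer, String text) throws Exception {
            Document doc = new Document();
            doc.add(new Field("unigrams", text, Field.Store.NO, Field.Index.ANALYZED));
            doc.add(new Field("bigrams",  text, Field.Store.NO, Field.Index.ANALYZED));
            doc.add(new Field("words",    text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        // Route each field to its own analyzer when constructing the IndexWriter.
        public static Analyzer analyzer(Analyzer unigrams, Analyzer bigrams, Analyzer words) {
            PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(unigrams);
            wrapper.addAnalyzer("bigrams", bigrams);
            wrapper.addAnalyzer("words", words);
            return wrapper;
        }
    }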

So how do we get the words? Two ways, for maximal precision and control. First, we use a dictionary. We can build an efficient tokenizer to find dictionary words using LingPipe’s Aho-Corasick implementation dict.ExactDictionaryChunker, the use of which is described in the javadoc and in the named-entity tutorial; a sketch follows below. Just make sure you use the right tokenizer, which for Chinese would be tokenizer.CharacterTokenizerFactory. A dedicated implementation would be faster than going through tokenizers, given that we’re treating each character as a token.
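Here’s a rough sketch of the dictionary-based chunking, with a toy two-word dictionary standing in for a real word list:

    import com.aliasi.chunk.Chunk;
    import com.aliasi.chunk.Chunking;
    import com.aliasi.dict.DictionaryEntry;
    import com.aliasi.dict.ExactDictionaryChunker;
    import com.aliasi.dict.MapDictionary;
    import com.aliasi.tokenizer.CharacterTokenizerFactory;

    public class DictWords {
        public static void main(String[] args) {
            MapDictionary<String> dict = new MapDictionary<String>();
            // toy entries; in practice, load a full word list
            dict.addEntry(new DictionaryEntry<String>("中国", "WORD", 1.0));
            dict.addEntry(new DictionaryEntry<String>("北京", "WORD", 1.0));

            // newer LingPipe versions expose the character tokenizer as a
            // singleton CharacterTokenizerFactory.INSTANCE instead
            ExactDictionaryChunker chunker
                = new ExactDictionaryChunker(dict,
                                             new CharacterTokenizerFactory(),
                                             true,   // return all matches, incl. overlaps
                                             true);  // case sensitive (moot for hanzi)

            Chunking chunking = chunker.chunk("北京是中国的首都");
            for (Chunk chunk : chunking.chunkSet())
                System.out.println(chunk.start() + "-" + chunk.end() + " " + chunk.type());
        }
    }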

As to where to find a dictionary, that’s up to you. Training data is still available on the old SIGHan site or from LDC.

The second approach to word spotting can be approximate, and for that, we recommend our noisy channel Chinese word segmenter, as described in our tutorial on Chinese word segmentation.
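As described in that tutorial, the segmenter is realized as a compiled spell checker whose only “corrections” insert spaces at word boundaries. Here’s a sketch, assuming a model trained per the tutorial (the model file name is made up):

    import java.io.File;
    import com.aliasi.spell.CompiledSpellChecker;
    import com.aliasi.util.AbstractExternalizable;

    public class Segment {
        public static void main(String[] args) throws Exception {
            // hypothetical model file, trained as in the segmentation tutorial
            File modelFile = new File("zh-segmenter.model");
            CompiledSpellChecker segmenter
                = (CompiledSpellChecker) AbstractExternalizable.readObject(modelFile);
            // the "correction" is the same text with spaces at word boundaries
            String segmented = segmenter.didYouMean("北京是中国的首都");
            System.out.println(segmented);
        }
    }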

The only fiddly bits of coding are wrapping up our dictionary chunker and Chinese word segmenter as Lucene analyzers (see org.apache.lucene.analysis.Analyzer). You can see how to adapt LingPipe tokenizer factories for use as Lucene analyzers in our new sandbox project LingMed; look for the top-level read-me.html file and the adapter implementation for LingPipe tokenizers in src/com/aliasi/lingmed/lucene/LuceneAnalyzer.java.
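The LingMed adapter is the reference, but the shape of such an adapter looks roughly like the following sketch against Lucene 2.x-era APIs (the token-at-a-time TokenStream contract used here was later replaced by Lucene’s attribute-based API):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import com.aliasi.tokenizer.Tokenizer;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class LingPipeAnalyzer extends Analyzer {
        private final TokenizerFactory mFactory;

        public LingPipeAnalyzer(TokenizerFactory factory) {
            mFactory = factory;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // slurp the reader; LingPipe tokenizers work over char arrays
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            try {
                int n;
                while ((n = reader.read(buf)) != -1)
                    sb.append(buf, 0, n);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            char[] cs = sb.toString().toCharArray();
            final Tokenizer tokenizer = mFactory.tokenizer(cs, 0, cs.length);
            return new TokenStream() {
                public Token next() {
                    String token = tokenizer.nextToken();
                    if (token == null)
                        return null; // end of stream
                    int start = tokenizer.lastTokenStartPosition();
                    return new Token(token, start, start + token.length());
                }
            };
        }
    }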

We’ve sidestepped the fact that Chinese has two distinct ways of rendering characters: traditional Chinese characters (used mainly in Taiwan and Hong Kong and among expats) and simplified Chinese characters (used primarily on the mainland). The Unicode standard, of course, contains both.

It’s a simple matter to load a single dictionary containing both types of characters, and to train a single word segmenter operating over both. We’ve had good luck training multilingual (English/Hindi) named-entity extractors this way. It works because segmentation relies on local context to predict words, and local context is almost always homogeneous in character set. That way, you can handle documents with mixed character sets and don’t have to do a character-set identification pass up front (which is tricky, because the code points overlap in Unicode and queries are often short).
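Concretely, with the MapDictionary sketched above, that just means adding an entry for each rendering:

    // both forms map to the same category, so one chunker handles mixed docs
    dict.addEntry(new DictionaryEntry<String>("中国", "WORD", 1.0)); // simplified
    dict.addEntry(new DictionaryEntry<String>("中國", "WORD", 1.0)); // traditional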

We should point out that this method can also work for Japanese, though you might want to up the n-gram order for the syllabic characters in Hiragana and Katakana, while keeping 1-grams and 2-grams for the imported Chinese ideograms making up Kanji.
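One hypothetical way to vary the n-gram order by script is to key off the Unicode block of each character; the particular orders below are guesses to be tuned:

    public class ScriptSensitiveNGrams {
        // Hypothetical helper: choose a max n-gram order per character by script.
        public static int maxNGramOrder(char c) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
            if (block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA)
                return 3; // longer n-grams for syllabic kana
            return 2;     // unigrams and bigrams for kanji, as for Chinese hanzi
        }
    }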

2 Responses to “Chinese Information Retrieval with LingPipe and Lucene”

  1. Greg Holmberg Says:

    There seems to be some work going on in the ICU project on word-breaking for Chinese. They’ve supported dictionary-based boundary analysis for some time, but as far as I can tell, only used it for Thai.

    I did find an enhancement request in their bug system, and it appears that someone has checked in some code for dictionary-based boundary analysis for Chinese. I have no idea if it’s any good. It is targeted to be released as part of ICU 4.2 (the current release is 4.0).

    See http://icu-project.org/userguide/boundaryAnalysis.html and http://bugs.icu-project.org/trac/ticket/2229

  2. lingpipe Says:

    Thanks for the link — I’ve seen that part of the Unicode API before, but didn’t know anyone was working on Chinese. The link you sent indicates they’re also doing Thai with dictionaries.

    Because of ambiguities, there’s a limit to how well a dictionary-based approach can work. They’re also limited to fairly crude heuristics when there aren’t dictionaries.

    For hyphenation and line-breaking, you can be conservative in finding breaks (high precision, low recall). That mitigates the problem of unknown words in the dictionaries. For other apps, like word counting, the exact counts may not be important. For imposing sentence capitalization, that’s not such a problem in Chinese.
