We’ve been using our exact dictionary matching chunker rather heavily lately for extracting gene and protein names from free text. We’ve been looking at BioCreative 2’s Protein-Protein Interaction (PPI) interacting pairs subtask (IPS). I’d be curious to hear how others dealt with the data, which is not annotated at the mention level (it’s full-text articles run through HTML and PDF converters, so there’s a huge amount of “noise” from figures and nav controls to boot).
Gene and protein dictionaries present quite a difficult matching problem, because the available dictionaries (e.g. Entrez-Gene and Uniprot-KB) are very spotty. One of the recurring problems is case and modifiers. For instance, it’s common to have a gene such as “Trp53” (Entrez-Gene 22059) referred to in text as “mP53wt” (mouse p53, wild type).
This is clearly problematic if you have a whitespace-based notion of tokenization. LingPipe lets you define flexible tokenizers. For this problem, we get rid of all the punctuation (treating it as whitespace), and break at case changes and letter/digit boundaries, using a regular expression tokenizer whose pattern matches maximal runs of lowercase letters, uppercase letters, or digits.
This’ll tokenize “mP53wt” into four tokens, “m”, “P”, “53”, and “wt”. We typically wrap the regex tokenizer in a lowercasing filter, so that the final matching is case-insensitive.
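Here’s a minimal Python sketch of this tokenization scheme. The pattern is an assumption chosen to reproduce the example above, not LingPipe’s exact regex:

```python
import re

# Assumed pattern: maximal runs of lowercase letters, uppercase letters,
# or digits. Punctuation and whitespace fall between matches and are dropped.
TOKEN_PATTERN = re.compile(r"[a-z]+|[A-Z]+|[0-9]+")

def tokenize(text):
    """Break at punctuation, case changes, and letter/digit boundaries."""
    return TOKEN_PATTERN.findall(text)

def tokenize_lower(text):
    """Tokenize, then lowercase each token for case-insensitive matching."""
    return [t.lower() for t in tokenize(text)]

print(tokenize("mP53wt"))        # ['m', 'P', '53', 'wt']
print(tokenize_lower("mP53wt"))  # ['m', 'p', '53', 'wt']
```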
LingPipe’s exact-match dictionary chunker is not sensitive to whitespace. It tokenizes the phrases in the dictionary and the text being chunked, and sees if there’s a token-sequence match. Hence a dictionary entry “p53” will match “mP53wt”, producing a chunk starting at position 1 and ending at position 4 (one past the last character).
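The matching logic can be sketched as follows. This is a toy re-implementation of the idea, not LingPipe’s `ExactDictionaryChunker`: tokenize the dictionary phrases and the text with the same tokenizer, slide over the text’s token sequence, and take chunk offsets from the character spans of the matched tokens.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-z]+|[A-Z]+|[0-9]+")

def tokens_with_spans(text):
    """Lowercased tokens paired with their (start, end) character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_PATTERN.finditer(text)]

def chunk(text, dictionary):
    """dictionary maps phrase -> category; phrases are tokenized the same
    way as the text, so whitespace and case differences don't matter."""
    entries = [(tuple(t for t, _, _ in tokens_with_spans(phrase)), cat)
               for phrase, cat in dictionary.items()]
    toks = tokens_with_spans(text)
    chunks = []
    for i in range(len(toks)):
        for entry_toks, cat in entries:
            j = i + len(entry_toks)
            if j <= len(toks) and tuple(t for t, _, _ in toks[i:j]) == entry_toks:
                # start of first matched token, end of last matched token
                chunks.append((toks[i][1], toks[j - 1][2], cat))
    return chunks

print(chunk("mP53wt", {"p53": "GENE"}))  # [(1, 4, 'GENE')]
```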
The tokenizer factories support stoplists, so it’s possible to filter out words like “protein” (allowing “Neuron-specific X11 protein” to match “Neuron-specific X11”), or like “related” (e.g. “transformation related protein 53”), or like “member” (e.g. “inwardly rectifying subfamily J member 2”).
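A sketch of stoplist filtering layered on the tokenizer above (the stoplist contents are just the examples from the text):

```python
import re

TOKEN_PATTERN = re.compile(r"[a-z]+|[A-Z]+|[0-9]+")
STOPLIST = {"protein", "related", "member"}

def tokenize_filtered(text):
    """Lowercased tokens with stoplisted words removed."""
    return [t.lower() for t in TOKEN_PATTERN.findall(text)
            if t.lower() not in STOPLIST]

# With "protein" stoplisted, the two phrases tokenize identically,
# so a dictionary entry for one matches the other:
print(tokenize_filtered("Neuron-specific X11 protein"))
print(tokenize_filtered("Neuron-specific X11"))
```

Note that in a real chunker the dropped tokens’ character positions still have to be tracked so that reported chunk offsets stay correct.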
The last example suggests we might want to stem, as well, so that “inwardly” would match “inward” and “rectifying” would match “rectify”.
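A toy suffix stripper shows the idea; a real system would use something like a Porter stemmer:

```python
def stem(token):
    """Toy stemmer: strip a couple of common suffixes (illustrative only)."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("ly") and len(token) > 4:
        return token[:-2]
    return token

print(stem("inwardly"))    # 'inward'
print(stem("rectifying"))  # 'rectify'
```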
It’s also possible to normalize words using a token modifying tokenizer factory to reduce “alpha”, “A”, and “α” to the same token.
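For example, a token-normalizing step might map the variants onto one canonical form (the mapping here is just the example from the text):

```python
# Assumes tokens were already lowercased by the tokenizer chain,
# so "A" arrives as "a".
NORMALIZE = {"a": "alpha", "α": "alpha"}

def normalize(token):
    """Map spelling variants onto a canonical token."""
    return NORMALIZE.get(token, token)

print([normalize(t) for t in ["alpha", "a", "α"]])  # ['alpha', 'alpha', 'alpha']
```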
Warning: In LingPipe 3.8.1, it’s not possible to write a dictionary chunker that uses a reductive tokenizer (like a stemmer), because the chunker computed character offsets by summing the lengths of the tokens and whitespaces, which breaks when a tokenizer changes token lengths. As of the next release, this problem has been fixed by converting the dictionary chunker to rely on the start and end positions the tokenizer reports for each token. If you’re curious to try before then, I can send you the patch.
The other big problems are species normalization (Entrez lists a whole host of organisms with TP53 genes), and word-order variation. Stay tuned for more details on what we’re doing for these.