## Flexible Dictionary-Based Chunking for Extracting Gene and Protein Names

We’ve been using our exact dictionary matching chunker rather heavily lately for extracting gene and protein names from free text. We’ve been looking at BioCreative 2’s Protein-Protein Interaction (PPI) interacting pairs subtask (IPS). I’d be curious to hear how others dealt with the data, which is not annotated at the mention level (it’s full-text articles run through HTML and PDF converters, so there’s a huge amount of “noise” from figures and nav controls to boot).

Gene and protein dictionaries present quite a difficult matching problem, because the available dictionaries (e.g. Entrez-Gene and Uniprot-KB) are very spotty. One recurring problem is case variation combined with attached modifiers. For instance, it’s common to have a gene such as “Trp53” (Entrez-Gene 22059) referred to in text as “mP53wt” (mouse p53, wild type).

This is clearly problematic if you have a whitespace-based notion of tokenization. LingPipe lets you define flexible tokenizers. For this problem, we get rid of all the punctuation (treating it as whitespace) and break at case changes and letter/digit boundaries, using a regular expression tokenizer with the pattern

([a-z]+)|([A-Z]+)|([0-9]+)

This’ll tokenize “mP53wt” into four tokens, “m”, “P”, “53”, and “wt”. We typically wrap the reg-ex tokenizer in a lowercasing filter, so that the final matching is case insensitive.

LingPipe’s exact-match dictionary chunker is not sensitive to whitespace. It tokenizes both the phrases in the dictionary and the text being chunked, and looks for matching token sequences. Hence a dictionary entry “p53” will match “mP53wt”, producing a chunk starting at character position 1 and ending at position 4 (one past the last character).
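That token-sequence matching can be sketched naively in Python as follows (a quadratic-time illustration of the idea; LingPipe’s implementation is far more efficient):

```python
import re

TOKEN_RE = re.compile(r"[a-z]+|[A-Z]+|[0-9]+")

def tokenize(text):
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]

def dictionary_chunks(text, dictionary):
    """Find dictionary phrases as token-sequence matches, ignoring
    whitespace and punctuation; results are (start, end, label) with
    character offsets into the original text."""
    toks = tokenize(text)
    # Tokenize the dictionary phrases with the same tokenizer.
    phrases = {tuple(t for t, _, _ in tokenize(p)): label
               for p, label in dictionary.items()}
    chunks = []
    for i in range(len(toks)):
        for j in range(i + 1, len(toks) + 1):
            key = tuple(t for t, _, _ in toks[i:j])
            if key in phrases:
                chunks.append((toks[i][1], toks[j - 1][2], phrases[key]))
    return chunks

print(dictionary_chunks("mP53wt", {"p53": "GENE"}))
# [(1, 4, 'GENE')]
```

Because both sides go through the same tokenizer, “p53” and “P53” reduce to the same token sequence (“p”, “53”), and the match falls out automatically.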

The tokenizer factories support stoplists, so it’s possible to filter out words like “protein” (allowing “Neuron-specific X11 protein” to match “Neuron-specific X11”), or like “related” (e.g. “transformation related protein 53”), or like “member” (e.g. “inwardly rectifying subfamily J member 2”).

The last example suggests we might want to stem, as well, so that “inwardly” would match “inward” and “rectifying” would match “rectify”.
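To make the idea concrete, here’s a deliberately toy suffix-stripper covering just these two cases; a real system would use something like the Porter stemmer instead:

```python
def toy_stem(token):
    """Strip a couple of common suffixes. Illustration only -- a real
    stemmer handles far more morphology than this."""
    for suffix in ("ing", "ly"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(toy_stem("inwardly"))    # 'inward'
print(toy_stem("rectifying"))  # 'rectify'
```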

It’s also possible to normalize words using a token modifying tokenizer factory to reduce “alpha”, “A”, and “α” to the same token.
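A token-normalizing step is just a lookup applied after lowercasing. Here’s a minimal sketch (the TNF example is illustrative, not from the original post):

```python
# After lowercasing, "A" is already "a"; map the remaining variants onto it.
NORMALIZE = {"alpha": "a", "α": "a"}

def normalize(token):
    """Reduce spelled-out and Greek-letter variants to one canonical token,
    e.g. so "TNF-alpha", "TNF-A", and "TNF-α" share a token sequence."""
    t = token.lower()
    return NORMALIZE.get(t, t)

print([normalize(t) for t in ["alpha", "A", "α"]])  # ['a', 'a', 'a']
```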

Warning: In LingPipe 3.8.1, it’s not possible to write a dictionary chunker that uses a reductive tokenizer (like a stemmer), because the chunker computed character offsets from the lengths of the tokens and the whitespace between them, which breaks as soon as a tokenizer shortens tokens. As of the next release, this has been fixed by having the dictionary chunker rely on the tokenizer’s reported start and end positions for the previous token. If you’re curious to try it before then, I can send you the patch.

The other big problems are species normalization (Entrez lists a whole host of organisms with TP53 genes), and word-order variation. Stay tuned for more details on what we’re doing for these.

### 2 Responses to “Flexible Dictionary-Based Chunking for Extracting Gene and Protein Names”

1. Wei Says:

One possibility is to allow certain errors when performing the dictionary matching. One recent algorithm is in “Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009”

To handle the last two cases, one can, in principle, modify the neighborhood generation methods to allow transformations such as ‘a’ to ‘\alpha’.

2. lingpipe Says:

Right. LingPipe already implements the linear-time version (see Dan Gusfield’s strings book, as usual) of inexact dictionary matching with respect to an arbitrary character-by-character weighted edit distance with an upper bound on distance (the upper bound’s what allows it to be linear): com.aliasi.dict.ApproxDictionaryChunker.

This strategy works OK for finding transliterations and typos, but not so well for things like gene names or company names, where you tend to get token-order variation and entirely dropped tokens.
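The reason the distance bound buys linear time is that the dynamic program only ever needs a band of width 2k+1 around the diagonal. Here’s a language-agnostic sketch of that banded test in Python (my illustration, not LingPipe’s weighted-edit-distance implementation):

```python
def within_edit_distance(a, b, k):
    """True iff the Levenshtein distance between a and b is <= k.
    Only a band of width 2k+1 per row is computed: cells farther than
    k off the diagonal can never yield a distance within the bound."""
    if abs(len(a) - len(b)) > k:
        return False
    INF = k + 1  # any value > k acts as "impossible"
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [INF] * len(b)
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1,                      # deletion
                         cur[j - 1] + 1,                   # insertion
                         prev[j - 1] + (a[i-1] != b[j-1])) # substitution
        prev = cur
    return prev[len(b)] <= k

print(within_edit_distance("trp53", "tp53", 1))  # True
print(within_edit_distance("abc", "xyz", 1))     # False
```

Character-by-character weights slot into the three min terms; the banding argument is the same.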

You can do something similar in the Apache Lucene search API using their fuzzy term matching (which I helped recode to only use linear space in matching), or with character n-gram matching (the latter’s covered in our case study in the first edition of the Lucene in Action book).