The Incompleteness of Alias Lists


Yes, Alias-i was named after the notion of alias, as in lists of aliases for persons, or for genes, or for consumer products. The problem with alias lists is that they’re notoriously incomplete.

The OpenCyc project lists the following aliases for ultra-thin flat panel display:

English Aliases: [ "ultra thin flat panel screen", "ultra thin flat panel screens", "ultra thin screen", "ultra thin screens", "ultra-thin display", "ultra-thin displays", "ultra-thin flat panel displays", "ultra-thin screen", "ultra-thin screens" ]

Lists like these also seem arbitrary. Why provide “ultra-thin” versus “ultra thin” variations in only some of the aliases? Is “ultra thin display”, which is not listed, an inferior alias? For some reason, the title isn’t included as an alias.

The temptation is to use a regular expression, but often there are dependencies that are hard to capture with a regular expression. For instance, if I want to be able to match “tall white coffee” in all six word orders, it’s pretty much impossible to characterize compactly with a regex, because a regex has no way to say “each of these words exactly once, in any order”; the tightest I could come up with was (tall (white coffee|coffee white))|(white (tall coffee|coffee tall))|(coffee (tall white|white tall)).
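To make the blow-up concrete, here is a minimal Python sketch that builds the any-order pattern mechanically by enumerating permutations. The helper name any_order_pattern is made up for illustration, and the generated pattern is the flat alternation of all orders rather than the factored form above. The number of alternatives grows as n factorial, which is why this stops being practical after a handful of words.

```python
import itertools
import re

def any_order_pattern(words):
    """Build a regex matching the given words in any order, each exactly once.

    Hypothetical helper for illustration: it simply lists every permutation
    as one alternative, so the pattern has len(words)! branches.
    """
    orders = (
        " ".join(re.escape(word) for word in perm)
        for perm in itertools.permutations(words)
    )
    return re.compile("|".join(orders))

pattern = any_order_pattern(["tall", "white", "coffee"])

# Every ordering of the three words matches in full...
all_orders = [" ".join(p) for p in itertools.permutations(["tall", "white", "coffee"])]
all_match = all(pattern.fullmatch(s) for s in all_orders)

# ...but a phrase that repeats one word and drops another does not.
repeated_word_matches = pattern.fullmatch("tall tall coffee") is not None
```

Three words give six alternatives, five words give 120, and ten words give over three million, so the regex route does not scale even when it is written by a program.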

So what we wind up doing instead is taking a set of aliases and then looking at some kind of approximate matching. Our goal right now is to get a high-recall extraction over gene names. We know we can get the mentions, but what about the Entrez-Gene-ID/mention pairs?
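As a rough illustration of what approximate matching against an alias list can look like, here is a small Python sketch using only the standard library. It is not the method we actually use; it just normalizes case and punctuation and then scores a mention against each alias with difflib's similarity ratio. The identifier "CONCEPT-1", the function names, and the 0.8 threshold are all assumptions made up for the example; the aliases are a few from the OpenCyc list above.

```python
import difflib
import re

# A few aliases from the OpenCyc list, mapped to a made-up identifier.
ALIAS_TO_ID = {
    "ultra thin flat panel screen": "CONCEPT-1",
    "ultra thin screen": "CONCEPT-1",
    "ultra-thin display": "CONCEPT-1",
    "ultra-thin flat panel displays": "CONCEPT-1",
}

def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def best_match(mention, alias_to_id, threshold=0.8):
    """Return (id, alias, score) for the most similar alias, or None.

    The threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    norm_mention = normalize(mention)
    best = None
    for alias, concept_id in alias_to_id.items():
        score = difflib.SequenceMatcher(None, norm_mention, normalize(alias)).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (concept_id, alias, score)
    return best

# "ultra thin display" is not in the list, but it normalizes to the same
# string as the listed alias "ultra-thin display".
hit = best_match("Ultra Thin Display", ALIAS_TO_ID)
miss = best_match("espresso machine", ALIAS_TO_ID)
```

Lowering the threshold trades precision for recall, which is the direction we care about here; the same shape of lookup would pair a gene mention with an Entrez Gene ID if the dictionary held gene aliases instead.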

