Genes are Generic, People are Specific


Although many reasons are cited for genes and product names being more difficult to find than person names in text, the main linguistic difference is explained by the syntactic and semantic distinction between proper names and common nouns.

Named entity recognizers for people look for names, like “Pervez Musharraf”. Syntactically, person named entity mentions take the form of proper names. Semantically, these names refer to a single person. By convention in English, they’re capitalized. Note that names may be ambiguous in that the name "John Smith" may be used to refer to many different people. But each use as a named entity will refer to a single individual. Of course, a name may be mentioned as a string without referring to an individual; I can say "There are many John Smiths." without referring to one. This is an example of the use-mention distinction.

Named entity recognizers for genes and proteins look for names of genes or proteins, like P53 or McrBC. Syntactically, gene mention names take the form of proper names. Semantically, they refer to a kind, and are thus more like common nouns. There are many instances of P53 and the use of "P53" doesn’t typically refer to a single sample, much less a single molecule. Another distinction is that “p53” can be used as a mass term, like “water”, or a countable term like “glass(es)” (compare “much water” vs. “many glasses”, or “water is everywhere” with “there are glasses everywhere”).

Besides not being conventionally capitalized in English, common nouns pose a fundamental problem: they can be subclassed or superclassed. For instance, we see "P53-wt" and "P53-m" for a wild type (natural) or mutated form of P53. P53 is also found in everything from mice to zebrafish, so the name may be qualified further for species. Similarly, there is more than one known human mutation of p53, so the particular mutant strain may be identified (often by a catalog number).
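To make the subclassing point concrete, here is a minimal sketch of how a qualified mention relates to its unqualified kind. The suffix table and both function names are hypothetical, made up for illustration; only the "wt" and "m" suffixes come from the examples above, and real gene nomenclature is much messier.

```python
# Hypothetical qualifier suffixes; only "wt" and "m" appear in the text.
QUALIFIERS = {"wt": "wild type", "m": "mutated"}


def parse_gene_mention(mention):
    """Split a mention like 'P53-wt' into (base kind, qualifier or None).

    The base is uppercased only so that 'p53' and 'P53' compare equal.
    """
    base, sep, suffix = mention.rpartition("-")
    if sep and suffix.lower() in QUALIFIERS:
        return base.upper(), QUALIFIERS[suffix.lower()]
    return mention.upper(), None


def subsumes(general, specific):
    """True if `general` names a kind that includes `specific`."""
    g_base, g_qual = parse_gene_mention(general)
    s_base, s_qual = parse_gene_mention(specific)
    return g_base == s_base and (g_qual is None or g_qual == s_qual)
```

Under these assumptions, `subsumes("p53", "P53-m")` holds but `subsumes("P53-wt", "P53-m")` does not. A person name has no analogous relation: "Clinton" does not subsume "Bill Clinton" as a kind.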

The genericity in a gene or protein name is different from the ambiguity in a proper name like "Clinton", which may refer to Bill, Hillary, the fort in New York’s Central Park, or any number of other people, towns, monuments or manufacturers. It’s not like these are all types of Clintons the way a cricket is a type of insect. Linguistically, you can’t use the word "Clinton" to refer to all of these things the way you can use P53 to refer to all of its variants.

Because gene “names” act like nouns, it is particularly difficult to draw the line between plain old common descriptive nouns and more proper-name-like gene names. For instance, consider the common noun “tumor suppressor gene”. This is a noun like “starting first-baseman”, not a name such as “Jason Giambi”.

Functional descriptors may also be added to gene names, as in “p53 tumor suppressor protein”. Semantically, “tumor suppressor protein” is a non-restrictive modifier whose whole meaning is “protein that suppresses tumors”. All by itself, the name “p53” picks out the type of protein. Syntactically, “protein” is the head (root noun) of the phrase (that is, what’s being referred to is a protein, not a tumor). With proper names, English syntax requires appositives, such as “Albany, capital of New York State”, or modifiers as in “New York State capital Albany”; with people these may be honorifics, as in the use of “Senator” in “Senator Hillary Clinton”, or appositives, as in “Hillary Clinton, Senator from New York.”

Most of the other problems with gene-name finding are shared with person-name finding. For instance, metonymy is a huge issue in both domains. The phrase “New York” may refer to the city, the government, or any of the many sports franchises calling the city home. So you find less ambiguous forms, like “New York Yankees”, used in the same way you find “p53 gene”. The main difference is that the former is still a proper name, whereas the latter is still a common noun syntactically.

As another example, string variation is highly prevalent in company names (e.g. “IBM”, “I.B.M.”, “International Business Machines, Inc”, “Big Blue”, etc.), which also have multiple subsidiaries (e.g. “IBM Europe”). Names transliterated from foreign languages are particularly problematic (e.g. “Al Jazirah”, “Al Jazeera”, “El Jazira”, etc.). Nicknames are also popular for celebrities (e.g. “Sean Combs”, “P. Diddy”, “Puff Daddy”; or “Reggie Jackson”, “Mr. October”).
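One common way to cope with this kind of string variation is an alias table keyed on a normalized form of each variant. The following is a minimal sketch under that assumption, not a description of any particular recognizer. The table contents are taken from the examples above, while the names `ALIASES`, `normalize`, and `canonical` are hypothetical.

```python
import re

# Hypothetical alias table built from the examples in the text.
ALIASES = {
    "IBM": ["IBM", "I.B.M.", "International Business Machines, Inc", "Big Blue"],
    "Al Jazeera": ["Al Jazirah", "Al Jazeera", "El Jazira"],
    "Sean Combs": ["Sean Combs", "P. Diddy", "Puff Daddy"],
    "Reggie Jackson": ["Reggie Jackson", "Mr. October"],
}


def normalize(s):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


# Map each normalized variant to its canonical name.
LOOKUP = {
    normalize(variant): canon
    for canon, variants in ALIASES.items()
    for variant in variants
}


def canonical(mention):
    """Return the canonical name for a known variant, else None."""
    return LOOKUP.get(normalize(mention))
```

Normalization absorbs punctuation and case differences (“I.B.M.” and “ibm” map to the same key), but nicknames and transliterations still have to be listed by hand, and a subsidiary such as “IBM Europe” is deliberately not collapsed into its parent.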

Product names, by the way, are just like gene names. A term such as “Canon EOS” refers to a whole range of different cameras such as “Canon EOS Rebel Xti”. There are multiple instances of these cameras in the world, just like P53. And that’s why sites like ShopWiki and Froogle have a hard time sorting out the results of a search for “canon eos”.

In contrast, there’s only one New York City, one New York Yankees, and one Senator Hillary Rodham Clinton.