Our NIH work and Autism


We are funded by NIH (thanks to all you taxpayers) to develop better ways to connect wet lab work to databases and text sources using software. I have blogged before about the importance of better linguistics in bioinformatics, and now we are in Phase II of taking this issue on. As a starting place we are developing software that links a database of genes to PubMed. We have further restricted ourselves (at least initially) to the slice of PubMed returned by the query “(gene or genes or DNA or mRNA) and (autism or autistic)”, which returned 1228 abstracts.

Why genes? Much of modern research is driven by studies that investigate the genomes of disease groups. These studies generate, via microarray experiments, large sets of candidate genes that must be winnowed down by text- and database-driven informatics techniques. This is our strong suit.

Why autism? We have a relationship with Harvard, and that is what they are working on as part of the Autism Consortium. We need a user base, a smaller slice of the medical world that we can get a handle on, and a genetic disease with difficult informatics problems.

Where do we stand now?

The short answer is, roughly speaking, that we are humbled. We cut our teeth in the intelligence world starting in 2000 and got NIH funding because we promised to deliver to medical researchers what we could do for tracking Osama bin Laden. While tracking bin Laden had its challenges, tracking genes has opened a world of tough problems. One dimension of the problem is obvious upon examination. Allow me to elaborate.

I decided that I wanted a sanity check on exact dictionary matches of genes from the Entrez Gene database into the autism literature. A good researcher always wants to establish a baseline system to get a sense of the scope of the problem. For our Phase I work we built a somewhat effective system for doing this for human genes, based on the work by William Hayes et al. reported for the AZuRE system. Now in Phase II I decided that we needed to move beyond our species to the very important animal models that are also in Entrez Gene, mouse being particularly significant for autism. Little did I know that genetics researchers name their genes with the explicit purpose of making my life difficult.

I wrote a simple exact dictionary matcher for known aliases of genes. Entrez Gene has half a million genes spanning more than 500 species, with more than a million unique aliases for genes. Nothing particularly concerning there yet. Looking up gene names in an autism abstract from PubMed, however, yields an interesting problem. Here is the abstract:

Note: Genes found with simple dictionary lookup are listed below the abstract, one alias per line. Species and Entrez Gene id listed per alias. Not all species shown.

1) 18252227

Structural variation of chromosomes in autism spectrum disorder.

Structural variation (copy number variation [CNV] including deletion and duplication, translocation, inversion) of chromosomes has been identified in some individuals with autism spectrum disorder (ASD), but the full etiologic role is unknown. We performed genome-wide assessment for structural abnormalities in 427 unrelated ASD cases via single-nucleotide polymorphism microarrays and karyotyping. With microarrays, we discovered 277 unbalanced CNVs in 44% of ASD families not present in 500 controls (and re-examined in another 1152 controls). Karyotyping detected additional balanced changes. Although most variants were inherited, we found a total of 27 cases with de novo alterations, and in three (11%) of these individuals, two or more new variants were observed. De novo CNVs were found in approximately 7% and approximately 2% of idiopathic families having one child, or two or more ASD siblings, respectively. We also detected 13 loci with recurrent/overlapping CNV in unrelated cases, and at these sites, deletions and duplications affecting the same gene(s) in different individuals and sometimes in asymptomatic carriers were also found. Notwithstanding complexities, our results further implicate the SHANK3-NLGN4-NRXN1 postsynaptic density genes and also identify novel loci at DPP6-DPP10-PCDH9 (synapse complex), ANKRD11, DPYD, PTCHD1, 15q24, among others, for a role in ASD susceptibility. Our most compelling result discovered CNV at 16p11.2 (p = 0.002) (with characteristics of a genomic disorder) at approximately 1% frequency. Some of the ASD regions were also common to mental retardation loci. Structural variants were found in sufficiently high frequency influencing ASD to suggest that cytogenetic and microarray analyses be considered in routine clinical workup.

Found genes (alias, then species, Entrez Gene id and abstract-gene key for each match):

we: house mouse 22389 (18252227-22389)
unbalanced: house mouse 21829 (18252227-21829)
a: house mouse 50518 (18252227-50518)
de: house mouse 21384 (18252227-21384)
or: house mouse 12677 (18252227-12677)
2: human 3692 (18252227-3692)
at: house mouse 11904 (18252227-11904)
s: house mouse 13618 (18252227-13618)
SHANK3: human 85358 (18252227-85358)
NLGN4: human 57502 (18252227-57502)
NRXN1: human 9378 (18252227-9378)
novel: house mouse 75125 (18252227-75125)
DPP6: chimpanzee 463835 (18252227-463835); chimpanzee 735585 (18252227-735585); human 1804 (18252227-1804)
DPP10: chimpanzee 459565 (18252227-459565); chimpanzee 748962 (18252227-748962); human 57628 (18252227-57628)
PCDH9: chimpanzee 452590 (18252227-452590); human 5101 (18252227-5101)
ANKRD11: chimpanzee 468078 (18252227-468078); chimpanzee 750284 (18252227-750284); human 29123 (18252227-29123)
DPYD: chimpanzee 457047 (18252227-457047); human 1806 (18252227-1806)
PTCHD1: chimpanzee 465535 (18252227-465535); human 139411 (18252227-139411)
p: house mouse 18431 (18252227-18431); Norway rat 308670 (18252227-308670)

You don’t need to be a doctor to know that the words ‘we’, ‘unbalanced’, ‘a’ and ‘or’ are not gene names in this context. But believe it or not, they are legitimate gene names in other contexts. As in ‘The mutant gene we is probably responsible for a disrupted induction signal from the dermal papilla towards ectodermal cells of a hair follicle.’ I am not kidding, and I feel like Dave Barry saying it. This is really a challenging problem, and unlike Dave Barry I have to solve it.
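For the curious, here is roughly what a matcher of this kind looks like. This is a minimal sketch in Python and not our actual code: the `ALIASES` table, the `find_genes` function and the example sentence are all made up for illustration, with the handful of aliases and Entrez Gene ids copied from the listing above. It also assumes case-sensitive, single-token matching, which may not be how the real matcher behaves.

```python
import re

# Hypothetical in-memory dictionary: alias -> list of (species, Entrez Gene id).
# Entries are copied from the listing above; a real run would load every
# alias from Entrez Gene instead of this handful.
ALIASES = {
    "we": [("house mouse", 22389)],
    "a": [("house mouse", 50518)],
    "or": [("house mouse", 12677)],
    "SHANK3": [("human", 85358)],
    "DPYD": [("chimpanzee", 457047), ("human", 1806)],
}

# Assumption: a token is a run of letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def find_genes(text, aliases=ALIASES):
    """Map each token that exactly matches a known alias to its gene entries."""
    hits = {}
    for match in TOKEN.finditer(text):
        token = match.group()
        if token in aliases:  # exact, case-sensitive lookup
            hits[token] = aliases[token]
    return hits


# Made-up sentence, not taken from the abstract.
sentence = "Our results implicate SHANK3 and DPYD, and we found a role for both."
for alias, genes in find_genes(sentence).items():
    print(alias, genes)
```

The lookup has no idea what a word is doing in the sentence, so ‘we’ and ‘a’ come back right alongside SHANK3 and DPYD.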

Soon I will post about our approaches to solving this overgeneration problem. In the world of computational linguistics it would be said that we have a precision problem. That is, we can’t accurately find examples of genes without finding a pile of junk in the process. It has the same quality as Paul Samuelson claiming that Wall Street indexes predicted 9 out of the past 5 recessions.
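To put a rough number on it: precision is the fraction of what you found that is actually right, TP / (TP + FP). The sketch below counts it per distinct alias for the one abstract above. Which aliases count as real gene mentions is my own reading of that abstract (the nine symbols it names explicitly), not a curated gold standard, so treat the result as a back-of-the-envelope figure.

```python
# Every distinct alias the dictionary lookup returned for the abstract above.
found = [
    "we", "unbalanced", "a", "de", "or", "2", "at", "s",
    "SHANK3", "NLGN4", "NRXN1", "novel",
    "DPP6", "DPP10", "PCDH9", "ANKRD11", "DPYD", "PTCHD1", "p",
]

# Assumption: only the gene symbols the abstract names explicitly are correct.
correct = {
    "SHANK3", "NLGN4", "NRXN1",
    "DPP6", "DPP10", "PCDH9",
    "ANKRD11", "DPYD", "PTCHD1",
}

true_positives = sum(1 for alias in found if alias in correct)
false_positives = len(found) - true_positives
precision = true_positives / (true_positives + false_positives)

print(f"{true_positives} of {len(found)} matched aliases are genes: "
      f"precision = {precision:.2f}")
```

Counted this way, fewer than half of the matched aliases are genes, and that is before worrying about which species each one belongs to.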

Wish us luck. We will all live healthier if we get this sorted out.

