Otis over at the Sematext blog wants people working in search to identify hard problems that PhD students can pick up and, hopefully, solve with a Lucene/Solr implementation. The students get a ‘real world problem’ and the world gets a concrete open source implementation of the solution. The call is: Lucene / Solr for Academia: PhD Thesis Ideas
My suggested PhD idea is dictionary matching of phrases that achieves tolerable precision at high recall. Mike Ross spent a good deal of time trying to match 100% of the gene mentions in MEDLINE abstracts given a dictionary of genes (Entrez Gene) and their aliases. The core of the problem is that not all mentions of a gene appear in the alias set for that gene. There are also huge efficiency issues on top of getting it working at all.
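
To make the problem concrete, here is a minimal Python sketch of normalized dictionary matching. It is not what Mike built and it is not a Lucene/Solr implementation. The gene IDs, the aliases, and the helper names (`normalize`, `build_index`, `find_mentions`) are all made up for illustration. The idea is just to strip case and punctuation on both sides and look up every short token window in the dictionary.

```python
import re

# Hypothetical toy dictionary: gene ID -> known aliases.
# These IDs and aliases are illustrative, not real Entrez Gene records.
GENE_ALIASES = {
    "GENE:1": ["tumor protein p53", "p53", "TP53"],
    "GENE:2": ["breast cancer 1", "BRCA1"],
}


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(aliases_by_gene):
    """Map each normalized alias to the set of gene IDs that use it."""
    index = {}
    for gene_id, aliases in aliases_by_gene.items():
        for alias in aliases:
            index.setdefault(normalize(alias), set()).add(gene_id)
    return index


def find_mentions(text, index, max_tokens=4):
    """Return (phrase, gene_id) pairs for every window of up to
    max_tokens whitespace-separated tokens whose normalized form
    is in the index."""
    tokens = text.split()
    hits = []
    for start in range(len(tokens)):
        last = min(start + max_tokens, len(tokens))
        for end in range(start + 1, last + 1):
            phrase = " ".join(tokens[start:end])
            key = normalize(phrase)
            if key and key in index:
                for gene_id in sorted(index[key]):
                    hits.append((phrase, gene_id))
    return hits


INDEX = build_index(GENE_ALIASES)
```

With this toy dictionary, `find_mentions("Mutations in p-53 and BRCA-1 were observed", INDEX)` picks up both hyphenated variants, but `find_mentions("tumour protein 53 was elevated", INDEX)` returns nothing, because that spelling is not in the alias set. Loosening the normalization to catch such cases drags in false positives, and the brute-force window scan is nowhere near fast enough for all of MEDLINE, which is exactly the precision/recall/efficiency squeeze that makes this thesis material.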