Archive for the ‘Breck’s Blog’ Category

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command

svn co

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance; I will instead focus on more ideal solutions that favor simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list we iterate over them to see whether the exact string is in the HashSet. Below is the recognition phase from src/

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	for (int i = 0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i + ": " + tweet);
	    } else {
		seenBefore.add(tweet);
	    }
	}

The code iterates over the CSV format search results, tests whether the HashSet already contains the string, and adds it if not. I could have just added all the strings and the HashSet would have behaved the same, but that is slightly automagical and might prove confusing. Below is the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile); 
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter 
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer 
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 
	List<String> row = new ArrayList<String>();
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row = new ArrayList<String>();
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close();
	System.out.println(seenCounter + " seen of " + texts.size()); //seenCounter incremented in the dedupe loop above

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml



     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500
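As an aside, the slightly automagical variant mentioned above (just adding every string) works because HashSet.add returns false when the element is already present. Here is a minimal self-contained sketch; the class and method names are mine, not from the tutorial code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class AddBasedDedupe {

    // Returns the unique tweets in original order, relying on
    // add() returning false for elements already in the set.
    public static List<String> dedupe(List<String> tweets) {
        HashSet<String> seen = new HashSet<String>();
        List<String> unique = new ArrayList<String>();
        for (String tweet : tweets) {
            if (seen.add(tweet))   // add() is false if seen before
                unique.add(tweet);
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> tweets
            = Arrays.asList("a tweet", "b tweet", "a tweet");
        System.out.println(dedupe(tweets)); // [a tweet, b tweet]
    }
}
```

The contains-then-add version in the post does the same thing but makes the membership test explicit.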

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good, but sorting the ‘unique’ tweets by text in a spreadsheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...

Note that your results will vary since we are not running the searches on the same day. These posts differ only in their URLs, which still makes them pretty much the same tweet as far as I am concerned. Looking further I see that tweets get re-tweeted with a prefix of the form ‘RT @user:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																												
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and retweet prefixes will find more duplicates. That suggests normalization to make the tweets more uniform by eliminating "trivial" differences. A look at src/ shows regular expressions that replace the url and the retweet prefix with the empty string.

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	for (int i = 0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i + ": " + tweet);
	    } else {
		seenBefore.add(tweet);
	    }
	}

Running the normalized approach, the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
     [java] Writing to: runs/Obama.dedupeNormalizedHash.csv
     [java] 454 seen of 1500

Both patterns apply to this tweet, removing ‘RT @rolandsmartin: ’ and the trailing url. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way, allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.
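To make the character n-gram remark concrete, here is a toy sketch of extracting character n-grams. This is illustrative only, not LingPipe's tokenizer API:

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {

    // Extracts every character n-gram of length n from the text.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); ++i)
            grams.add(text.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("walking", 4)); // [walk, alki, lkin, king]
        System.out.println(ngrams("walks", 4));   // [walk, alks]
    }
}
```

Note that ‘walking’ and ‘walks’ share the 4-gram "walk", so n-gram overlap relates the two forms without any language-specific stemming rules.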

Looking at the output of the normalized hash approach reveals more opportunities to eliminate near duplicates, and there may be more patterns to exploit, but eventually diminishing returns will bring the pattern driven normalization effort to a close. Let's move to more open ended approaches that better address the long tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair differs only in an additional ‘<--- indeed!’, which is not a pattern likely to be exploitable much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to decide what algorithm you will use to say that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.
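As a preview, a Jaccard-style comparison over token sets can be sketched as follows. This is my own toy version for illustration, not the implementation the next post will use:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {

    // Jaccard proximity: |intersection| / |union| over whitespace tokens.
    // 1.0 means identical token sets, 0.0 means no overlap.
    public static double proximity(String a, String b) {
        Set<String> tokensA
            = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB
            = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<String>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<String>(tokensA);
        union.addAll(tokensB);
        return union.isEmpty() ? 1.0
            : intersection.size() / (double) union.size();
    }

    public static void main(String[] args) {
        // 3 shared tokens, 4 total distinct tokens: 0.75
        System.out.println(proximity("Obama pardons turkey",
                                     "Obama pardons turkey today"));
    }
}
```

A threshold on this score (rather than exact hash equality) is what lets near duplicates like the examples above collapse together.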


Call for PhD Ideas for Lucene/Solr Implementation

November 29, 2010

Otis over at the Sematext blog wants people to identify hard problems faced by those working in search that PhD students can select from and hopefully solve with a Lucene/Solr implementation. The students get a ‘real world problem’ and the world gets a concrete open source implementation of the solution. The call is: Lucene / Solr for Academia: PhD Thesis Ideas

My suggested PhD idea is dictionary matching of phrases with tolerable precision at high recall. Mike Ross spent a good deal of time trying to get 100% matches of genes in MEDLINE abstracts given a dictionary of genes (Entrez Gene) and aliases. The core of the problem is that not all mentions of genes are in the alias set for the gene. There are huge issues around efficiency in addition to getting it working at all.

Processing Tweets with LingPipe #1: Search and CSV Data Structures

November 16, 2010

I have been focusing on teaching developers how to approach the challenges of computational linguistics. An interesting source to work with is Twitter since it is culturally relevant these days, fairly easy to work with from a data feed perspective and it is just so darned cute. Like many cute things, it is not long before it is eating your shoes, resisting toilet training and overall testing one’s patience. We will be using LingPipe to help tame the beast.

I will start with how to work with the search client and will move towards language ID, deduplication and a bit of sentiment classification in later blog entries. The code for this post is available in this tarball or from our subversion repository which can be accessed via the command:

svn co

Go ahead and download the software and make sure you can compile. I am developing with Ant so I just type ‘ant’ in the teaching/twitter directory:

$ ant
Buildfile: build.xml

    [mkdir] Created dir: /Users/breck/devguard/teaching/lingpipe4twitter/build/classes
    [javac] Compiling 8 source files to /Users/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /Users/breck/devguard/teaching/lingpipe4twitter/demo.jar

Total time: 3 seconds

Searching Twitter

We are using the Twitter4j library for accessing the Twitter API. The folks at Twitter apparently make contributions to it; while not an endorsement, that certainly increases credibility.

Getting some data on disk is the first item. Run a search on Obama as follows:

ant search -Dquery="Obama"
Buildfile: build.xml



     [java] [Mon Nov 15 10:20:45 EST 2010]Will use class twitter4j.internal.logging.StdOutLoggerFactory as logging factory.
     [java] [Mon Nov 15 10:20:45 EST 2010]Will use twitter4j.internal.http.HttpClientImpl as HttpClient implementation.
     [java] writing to disk in directory: searches

Total time: 6 seconds

The file name will be the query, searches/Obama.csv. Note that subsequent searches will not overwrite previous search results; rerunning the above query would create a file searches/Obama_1.csv.
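The non-overwriting naming convention could be implemented with a helper along these lines. This is a sketch of the behavior, not necessarily the tutorial's actual nextFile code:

```java

public class FileNamer {

    // Returns dir/filename.suffix, or dir/filename_1.suffix,
    // dir/filename_2.suffix, ... if earlier names are taken.
    public static File nextFile(File dir, String filename, String suffix) {
        File candidate = new File(dir, filename + "." + suffix);
        for (int i = 1; candidate.exists(); ++i)
            candidate = new File(dir, filename + "_" + i + "." + suffix);
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println(nextFile(new File("searches"), "Obama", "csv"));
    }
}
```

So a second run of the same query lands in Obama_1.csv rather than clobbering Obama.csv.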

Looking at the csv output in a spreadsheet like OpenOffice or Excel (import as comma separated UTF-8 text):

Search results in CSV format viewed in a spread sheet

The tweets are in column D. To make them more readable, select the entire column D, resize its width to around 5″ and modify the formatting to include text wrapping.

Pretty cute eh? Lots of entrepreneurs have gotten to this stage of perusing Twitter and see piles of customers clamoring for access to this feed that is effectively the id of the internet. Clean it up with some super ego and profit is sure to follow. Well, the cute gets ugly fast. Run a few queries in areas of interest and the messiness becomes apparent quickly. Below we go into some details and a bit of data analysis.


We will be using the CSV data structure to work with the Twitter search results. Spreadsheets provide an easy way to view, sort and manipulate the tweets and there is a simple parser that allows us to interface with LingPipe functionality.

The relevant bits of the search code in src/ are:

    static final int TWEETS_PER_PAGE = 100;
    static final int NUM_PAGES = 15;

    public static void main (String[] args) 
        throws IOException, TwitterException {
        String queryString = args[0];
        File outDir = new File(args[1]);
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(queryString);
        query.setRpp(TWEETS_PER_PAGE); //results per page
        List<Tweet> tweets = new ArrayList<Tweet>();
        for (int i = 1; i <= NUM_PAGES; ++i) {
            query.setPage(i);
            QueryResult result =;
            List<Tweet> resultTweets = result.getTweets();
            if (resultTweets.size() == 0) break;
            tweets.addAll(resultTweets);
        }
        System.out.println("writing to disk in directory= " + outDir);
        TweetWriter.writeCsv(tweets,outDir,queryString,"csv"); //our class for writing tweets
    }

The only tricky bit to this search interface is the two levels of indexing for search results, pages and
tweets per page. Otherwise we accumulate the tweets and send them off to our TweetWriter class for
writing to disk.

The code for writing is in src/

public class TweetWriter {    
    public static void writeCsv(List<Tweet> tweets, File dir, 
                                String filename, String suffix)
        throws IOException {
        File output = nextFile(dir,filename,suffix);
        OutputStream fileOut = null;
        Writer streamWriter = null;
        CsvListWriter writer = null;
        try {
            fileOut =  new FileOutputStream(output);
            streamWriter = new OutputStreamWriter(fileOut,Strings.UTF8);
            writer = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 

            List<String> headerRow 
                = Arrays.asList("Estimate", "Guessed Class", 
                                "True Class", "Text");
            writer.write(headerRow);
            for (Tweet tweet : tweets) {
                String whitespaceNormedTweet
                    = tweet.getText().replaceAll("[\\s]+"," ").trim();
                List<String> row 
                    = Arrays.asList("","","",whitespaceNormedTweet);
                writer.write(row);
            }
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } finally {
                try { 
                    Streams.closeQuietly(streamWriter); //lingpipe convenience utility to close without exceptions 
                } finally {
                    Streams.closeQuietly(fileOut);
                }
            }
        }
    }
}
This code uses the Super Csv package to handle reading and writing csv (comma separated value) formatted files. For this tutorial I am only interested in the text of the Tweet object, but much more information is available–see the Twitter4j documentation on the returned Tweet object.

The basic pattern is to populate a row with string values and write the row out. I have adopted the convention that the first line of a csv file contains headers which is handled before iterating over the tweets. Writing out the tweets involves 3 empty fields followed by the text of the tweet. Later those empty fields will be populated by human and machine annotations so this is our foundational data structure. The remaining odd bit is the whitespaceNormedTweet = tweet.getText().replaceAll("\\s+"," ").trim(); which replaces new lines and carriage returns with a single white space. This is in clear violation of my “do not adulterate the data” stance but for the purposes of this tutorial the improved readability of the csv format makes it worth it.
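The whitespace normalization is easy to see in isolation:

```java
public class WhitespaceNorm {

    // Collapses runs of whitespace (spaces, tabs, newlines,
    // carriage returns) to a single space and trims the ends.
    public static String norm(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "line one\r\n  line two\n";
        System.out.println("[" + norm(raw) + "]"); // [line one line two]
    }
}
```

Each tweet then occupies exactly one physical line of the csv file, which is what keeps the spreadsheet view readable.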

Also note that the IO has been wrapped for maximum paranoia regarding open file handles that might take out a JVM on repeated failures of writer, streamWriter or fileOut. I just know that someone is going to copy and paste the above code into a production system, so we might as well make it well behaved. BTW, a real implementation would stream tweets to disk to keep memory use down. Thanks to Bob for noting all this and rewriting the closes.

Exploring the Data

Blog entries are supposed to be short, so I will cut the presentation here so that I may blog another day in good conscience. But a little data analysis to whet the appetite seems appropriate.

Duplicates and Near Duplicates:

We now have an on disk data structure for storing searched-for tweets that we can browse with a spreadsheet and modify at will. Let's look at what our harvest has reaped. Select the entire sheet and sort by column D. Scrolling down a bit, I find for my Obama search some near duplicates that share prefixes:

#2: Crimes Against Liberty: An Indictment of President Barack Obama
#2: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama: Crimes Against Liberty: An Indictment of Pre...

Looking more closely we see that the repeats are minor variants, differing in the referred-to URL, the #2 vs. #5 rank, and some additional variation in the last example. For the purposes of this tutorial I assume that duplicates and near duplicates need to be eliminated–there are other use cases where retweets, which are near duplicates passed on by different users, are desirable. We will be handling duplicates and near duplicates in the next blog post.


I browsed the tweets and started to mark up what the language was. I didn’t have to look past around 100 tweets to get a pretty rich smattering of languages.

Some of the languages found in Twitter searches

Diversity of Languages Displayed.

In another blog post I will approach the issue of language identification–fortunately it is pretty easy for languages that are not near neighbors like British English and American English. If you want to get a heads up on language id look at our language ID tutorial.

That is all for now. My next posts will attend to deduplication, then language ID before moving on to classification for topic and sentiment.


LingPipe 4.0.0 Released

June 2, 2010

After a week of relentless deprecation and pushing every last change through the unit tests and tutorials, we’re simultaneously rolling out LingPipe 4.0.0 and LingPipe 3.9.3. As usual, the latest is available from the

There’s a link from there to the latest 3.9 release, 3.9.3, which is new but only very minimally different from 3.9.2. I expected more updates for deprecation, but the whole process went more smoothly than anticipated.

Migrating from LingPipe 3.9 to LingPipe 4.0

Models are Backward Compatible

The good news is that all of the models compiled in 3.9 will run as is in 4.0 (using the 4.0 methods, of course). You can read about the mild wizardry required to do this in a previous blog entry, Upgrading Java classes with backward-compatible serialization.

Remove Deprecation Warnings from Code

The migration process for existing code importing LingPipe classes is simple. Get your code to where it compiles in 3.9 with no deprecation warnings and you’re good to go for 4.0. All of the deprecated methods and classes in 3.9 indicate what to replace them with in their javadoc.

3.9 Branched and Supported

We’re expecting to have to support 3.9 for a while longer, because 4.0 isn’t backward compatible without code changes. We’re anticipating continuing to patch bugs for 3.9 and release subsequent versions on the 3.9 branch. At least until all of our support-paying customers upgrade all of their code to 4.0.

The Time to Make it Shorter

Other than updating all the jars, there are no new features in 4.0. In fact, it’s smaller than 3.9. As Pascal said (originally in French, referring to a letter),

I have only made this longer, because I have not had the time to make it shorter.

Well, we finally found the time to make LingPipe shorter.

Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application

March 9, 2010

Joshua Mason emailed us with a link to his (with a bunch of co-authors) recent ACM paper “English Shellcode” ( Shellcode attacks can attempt to seize control of a computer by masquerading as data. The standard defense is to look for tell-tale patterns in the data that reflect the syntax of assembly language instructions. It is sort of like spam filtering. The filter would have to reject strings that looked like:


which would not be too hard if you knew to expect language data.

Mason et al. changed the code generation process so that lots of variants of the injection are tried but filtered against a language model of English based on the text of Wikipedia and Project Gutenberg. The result is an injection attack that looks like:

“There is a major center of economic activity, such as Star Trek, including The Ed Sullivan Show. The former Soviet Union.”

This is way better than I would have thought possible and it is going to be very difficult to filter. It would be interesting to see how automatic essay grading software would score the above. It is gibberish, but sophisticated sounding gibberish.

And it used LingPipe for the language processing.

I am a firm believer in the white hats publicizing exploits before black hats deploy them surreptitiously. This one could be a real problem however.


How Breck approaches new projects in natural language processing

March 8, 2009

A skill developers typically don’t get in school is how to frame problems in terms of the messy, approximate world of heuristic and machine learning driven natural language processing. This blog entry should help shed some light on what remains a mostly self-taught black art. This is not the only way to do things, just my preferred way.

At the top level I seek three things:

  1. Human annotated data that directly encodes the intended output of your NLP program.
  2. A brain dead, completely simple instance of a program that connects all inputs to the intended output.
  3. An evaluation setup that takes 1) and 2) and produces a score for how good a job the system did. That score should map to a management approved objective.

Once I have the above I can then turn my attention to improving a score without worrying about whether I am solving the right problem (1 and 2 handle this) and whether I have sorted out access to the raw data and have a rough architecture that makes sense. Some more details on each point:

Human Annotated Data

If a human cannot carry out the task you expect the computer to do (given that we are doing NLP), then the project is extremely likely to fail. Humans are the best NLP systems in the world. Humans are just amazing at it and humans fail to appreciate the sophistication of what they do with zero effort. I almost always ask customers to provide annotated data before accepting work. What does this provide?

  • Disambiguation: Annotated data forces a decision on what the NLP system is supposed to do and it communicates it clearly to all involved parties. It also keeps the project from morphing away from what is being developed without an explicit negotiation over the annotation.
  • Buy in by relevant parties: It is amazing what happens when you sit management, UI developers and business development folks in a room and force them to annotate a text document together. Disagreements that would otherwise surface at the end of a project surface immediately; people know what they are buying and they get a sense that it might be hard. The majority of the hand waving “Oh, just have the NLP do the right thing” goes away. Bonus points if you have multiple people annotate the same document independently and compare the results. If the agreement rate is low, how can you expect a piece of software to do it?
  • Evaluation: The annotated data is a starting place for evaluation to take place. You have gold standard data to compare to. Without it you are flying blind.

Simple implementation that connects the bits

I am what Bob calls a "thin wire developer" because I really prefer to reduce project risk by making sure all the bits of software/information can talk to each other. I have been amazed at how difficult access to data/logs/programs can be in enterprise setups. Some judgement is required here: I want to hit the spots where there are likely blocks that may force completely different approaches (e.g. access to search engine logs for dynamic updates, or lists of names that should be tracked in data). Once again this forces decisions early in development rather than late. Unfortunately it takes experience to know which bits are likely to be difficult to get and valuable in the end system.


Evaluation setup

An evaluation setup will truly save the day. It is very frustrating to build a system where the evaluation consists of "eyeballing data by hand" (I actually said this at my PhD defense, to the teasing delight of Michael Niv, a fellow graduate student, who to this day ponders my ocularly enhanced appendages). Some of the benefits are:

  • Developers like a goal and like to see performance improve. It gets addictive and can be quite fun. You will get a better system as a result.
  • If the evaluation numbers map well to the business objective then the NLP efforts are well aligned with what the business wants. (Business objectives can be to win an academic bake-off for graduate students).
  • Looks great to management. Tuning systems for better performance can be a long and opaque process to management. I got some good advice to always link the quality of the GUI (Graphical User Interface) to the quality of the underlying software to communicate transparently the state of the project. An evaluation score that is better than last month communicates the same thing especially if management helped design the evaluation metric.

I will likely continue this blog thread picking up in greater detail the above points. Perhaps some use cases would be informative.


Hunter/Gatherer vs Farming with Respect to Information Access

August 7, 2008

Antonio Valderrabanos of gave a talk at NYU today and he compared the current search/NLP strategies by information providers to  humanity’s hunter/gatherer stage and offered a vision of information farming. I kept having images of Google’s web spider out digging roots and chasing animals with a pointed stick wearing a grubby little loin cloth. Then I would switch to images of a farm stocked supermarket with well organized shelves, helpful clerks and lots of choice.

The analogy brought up a strong bias that I have in applying natural language processing (NLP) to real world problems–I generally assume that the software must encounter text as it occurs in the “wild”–after all, that is what humans do so well, and we are in the business of emulating human language processing, right?

Nope, not on the farm we’re not. On the farm we use NLP to help enhance information that was never part of standard written form. We use NLP to suggest and assign meta tags, connect entities to databases of concepts and create new database entries for new entities. These are things that humans are horrible at, but humans are excellent at choosing from NLP driven suggestions–NLP is pretty good at suggestions. So NLP is helping create the tools to index and cross reference, at the concept level, all the information in the supermarket. Humans function as filters of what is correct. At least initially.

As the information supermarket gets bigger, the quality of the (machine learning based) NLP will get better, perhaps good enough to start automatically bringing in “wild” information with decent concept indexing and meta tagging. A keyword index is a crude yet effective tool, but an inventory system it is not, and that is what we need to advance to the next level of information access.

Spinning Straw into Gold: How to do gold standard data right

March 8, 2008

We have been struggling with how to evaluate whether we are finding ALL the genes in MEDLINE/PubMed abstracts. And we want to do it right. There has been a fair amount of work on how to evaluate natural language problems–search via TREC, BIOCREATIVE, MEDTAG–but nothing out there really covered what we consider to be a key problem in text based bioinformatics: coverage, or recall, with respect to an existing database like Entrez Gene.

What is the Rumpelstiltskin tie in?

From the Wikipedia:

“In order to make himself appear more important, a miller lied to the king that his daughter could spin straw into gold. The king called for the girl, shut her in a tower room with straw and a spinning wheel, and demanded that she spin the straw into gold by morning, for three nights, or be executed. ” Much drama ensues but in the end a fellow named Rumpelstiltskin saves the day.

The cast breaks down as follows:

  • The king: The National Institutes of Health (NIH) who really prefer that you deliver on what you promise on grants.
  • The miller: Our NIH proposal in which we say “We are committed to making all the facts, or total recall, available to scientists…” Even worse is that this is from the one paragraph summary of how we were going to spend $750,000 of the NIH’s money. They will be asking about this.
  • The daughter: I (Breck), who sees lots of straw and no easy path to developing adequate gold standard data to evaluate ourselves against.
  • The straw: 15 million MEDLINE/PubMed abstracts and 500,000 genes that need to be connected in order to produce the gold. Really just a subset of it.
  • The gold: A scientifically valid sample of mappings between genes and abstracts against which we can test our claims of total recall. This is commonly called “Gold Standard Data.”
  • Rumpelstiltskin: Bob, and lucky for me I do know his name.

Creating Gold from Straw

Creating linguistic gold standard data is difficult, detail oriented, frustrating and ultimately some of the most important work that one can do to take on natural language problems seriously. I was around when version 1 of the Penn Treebank was created and would chat with Beatrice Santorini about the difficulties they encountered for things as simple seeming as part-of-speech tagging. I annotated MUC-6 data for named entities and coreference, did the John Smith corpus of cross-document coref with Amit Bagga and have done countless customer projects. All of those efforts gave me insights that I would not have had otherwise about how language is actually used rather than the idealized version you get in standard linguistics classes.

The steps for creating a gold standard are:

  • Define what you are trying to annotate: We started with a very open ended “let’s see what looks annotatable” attitude for linking Entrez Gene to MEDLINE/PubMed. By the time we felt we had a sufficiently robust linguistic phenomenon, we had a standard that mapped abstracts as a whole to gene entries in Entrez Gene. The relevant question was: “Does this abstract mention anywhere a literal instance of the gene?” Gene families were not taken to mention the member genes, so “the EXT family of genes” would not count, but “EXT1 and EXT2 are not implicated in autism” would.
  • Validate that you can get multiple people to do the same annotation: Bob and I sat down, annotated the same 20 abstracts independently, and compared our results. We had 36 shared mappings from gene to abstract, with Bob finding 3 mappings that I did not and me finding 4 that Bob did not. In terms of recall, I found 92% (36/39) of what Bob did, and Bob found 90% (36/40) of what I found. Pretty good, eh? Not really; see below.
  • Annotate enough data to be statistically meaningful: Once we are convinced we have a reliable phenomenon, then we need to be sure we have enough examples to minimize chance occurrences.
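The validation step above comes down to plain set operations. Here is a minimal sketch, assuming each mapping is encoded as a hypothetical "geneId:pubmedId" string (a real system would use a proper pair type):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Agreement {
    // Recall of annotator a measured against annotator b: the
    // fraction of b's mappings that a also found.
    static double recallAgainst(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size() / (double) b.size();
    }
}
```

With sets of 36 shared mappings plus 3 found only by Bob and 4 found only by me, this yields the 92% (36/39) and 90% (36/40) figures reported above.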

The Tricky Bit

I (the daughter) need to stand in front of the king (the NIH) and say how good our recall is. Better if the number is close to 100% recall. But what is 100% recall?

Even a corpus annotation with an outrageously high 90% interannotator agreement leads to problems:

  • A marketing problem: We can report 99.99% recall against the corpus, but not against the truth, because the corpus itself misses some fraction of the real mappings. So even after being total rock stars and modeling Bob at 99.99%, I would still need a slide that says we can only claim recall of 85-95% on the data. I can throw out the 99.99% number and introduce a salad of footnotes and diagrams. I see congressional investigations in my future.
  • A scientific problem: It bugs me that I don’t have a handle on what truth looks like. We really do think recall is the key to text bioinformatics and that text bioinformatics is the key to curing lots of diseases.

Rumpelstiltskin Saves the Day

So, here we are in our hip Brooklyn office space, sun setting beautifully over the Williamsburg bridge, Bob and I are sitting around with a lot of straw. It is getting tense as I imagine the king’s reaction to the “standard approach” of working from an interannotator agreement validated data set. Phrases like “cannot be done in a scientifically robust way”, “we should just do what everyone else does” and “maybe we should focus on precision” were bandied about with increasing panic. But the next morning Rumpelstiltskin walked in with the gold. And it goes like this:

The problem is in estimating what the truth is given somewhat unreliable annotators. After adjudication (we both looked at where we differed and decided what the real errors were), we figured that each of us would miss 5% (1/20) of the abstract-to-gene mappings. If we take the union of our annotations, and assume that Bob and I make independent errors, we end up missing only 0.25% of mappings (1/20 * 1/20 = 1/400). Independence of errors is, of course, a big assumption.
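The arithmetic is simple enough to spell out. A sketch of the independence calculation, where the 5% inputs are our estimated per-annotator miss rates from adjudication:

```java
public class UnionRecall {
    // If two annotators miss mappings independently at rates pA and pB,
    // the union of their annotations misses only where both do.
    static double unionMissRate(double pA, double pB) {
        return pA * pB;
    }

    public static void main(String[] args) {
        double miss = unionMissRate(1.0 / 20.0, 1.0 / 20.0);
        // 1/20 * 1/20 = 1/400 = 0.25% missed, i.e. a recall bound near 99.75%
        System.out.println(miss + " missed; recall bound " + (1.0 - miss));
    }
}
```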

Now we have a much better upper limit, in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (the fraction of our mappings that are correct), but we are not in that business anyway.

An apology

At ISMB in Detroit, I stood up and criticized the BioCreative/GENETAG folks for adopting a crazy-seeming annotation standard that went something like this: “Annotate all the gene mentions in this data. Don’t get too worried about the phrase boundaries, but make sure the mention is specific.” I now see that approach as a sane way to increase recall. I see the error of my ways and feel much better since we have demonstrated 99.99% recall against gene mentions for that task–note this is a different, but related task to linking Entrez Gene ids to text abstracts. And thanks to the BioCreative folks for all the hard work pulling together those annotations and running the bakeoffs.

Our NIH work and Autism

February 28, 2008

We are funded by NIH (thanks to all you tax payers) to develop better ways to connect wet lab work to databases and text sources using software. I have blogged before about the importance of better linguistics in bioinformatics and now we are on Phase II of taking this issue on. As a starting place we are developing software that links a database of genes to PubMed. We have further restricted ourselves (at least initially) to working on the slice of PubMed that is returned by the query “(gene or genes or DNA or mRNA) and (autism or autistic)” which returned 1228 abstracts.

Why genes? Much of modern research is driven by studies that investigate the genomes of disease groups. This in turn generates, via microarray experiments, large sets of candidate genes that must be winnowed down by text- and database-driven informatics techniques, which is our strong suit.

Why autism? We have a relationship with Harvard and that is what they are working on as part of the Autism Consortium. We need a user base, a smaller slice of the medical world that we can get a handle on and a genetic disease with difficult informatics problems.

Where do we stand now?

The short answer is, roughly speaking, that we are humbled. We cut our teeth in the intelligence world starting in 2000, and we got NIH funding by promising to deliver to medical researchers what we could do for tracking Osama bin Laden. While tracking bin Laden had its challenges, tracking genes has opened up a world of tough problems. One dimension of the problem is obvious upon examination. Allow me to elaborate.

I decided that I wanted a sanity check on exact dictionary matches of genes from the Entrez Gene database into the autism literature. Being a good researcher, you always want to establish a baseline system to get a sense of the scope of the problem. We built a somewhat effective system for doing this for human genes in our Phase I work, based on work by William Hayes et al as reported in the AZuRE system. Now in Phase II I decided that we needed to move beyond our species to the very important animal models that are also in Entrez Gene, mouse being particularly significant for autism. Little did I know that genetics researchers name their genes with the explicit purpose of making my life difficult.

I wrote a simple exact dictionary matcher for known aliases of genes. Entrez Gene has half a million genes spanning more than 500 species with more than a million unique aliases for genes. Nothing particularly concerning there yet. Looking up gene names in an autism abstract from PubMed however yields an interesting problem. Here is the abstract:
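A minimal sketch of that kind of matcher, with a tiny hypothetical alias map standing in for the real million-alias Entrez Gene dictionary:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AliasMatcher {
    // alias (lowercased) -> "species geneId" entries; a real system
    // would load the full Entrez Gene alias table here.
    static final Map<String, List<String>> ALIASES = new HashMap<>();
    static {
        ALIASES.put("shank3", List.of("human 85358"));
        ALIASES.put("we", List.of("house mouse 22389")); // yes, 'we' really is an alias
    }

    // Exact, case-insensitive lookup over word tokens.
    static Map<String, List<String>> findGenes(String text) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (String token : text.split("\\W+")) {
            List<String> entries = ALIASES.get(token.toLowerCase());
            if (entries != null) {
                hits.put(token, entries);
            }
        }
        return hits;
    }
}
```

Run over the abstract below, even this toy dictionary fires on the pronoun 'we', which is exactly the problem.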

Note: Genes found with simple dictionary lookup. The abstract is shown first, followed by the matched aliases, with species and Entrez Gene id listed per alias. Not all species shown.

1) PMID 18252227: Structural variation of chromosomes in autism spectrum disorder.

Structural variation (copy number variation [CNV] including deletion and duplication, translocation, inversion) of chromosomes has been identified in some individuals with autism spectrum disorder (ASD), but the full etiologic role is unknown. We performed genome-wide assessment for structural abnormalities in 427 unrelated ASD cases via single-nucleotide polymorphism microarrays and karyotyping. With microarrays, we discovered 277 unbalanced CNVs in 44% of ASD families not present in 500 controls (and re-examined in another 1152 controls). Karyotyping detected additional balanced changes. Although most variants were inherited, we found a total of 27 cases with de novo alterations, and in three (11%) of these individuals, two or more new variants were observed. De novo CNVs were found in approximately 7% and approximately 2% of idiopathic families having one child, or two or more ASD siblings, respectively. We also detected 13 loci with recurrent/overlapping CNV in unrelated cases, and at these sites, deletions and duplications affecting the same gene(s) in different individuals and sometimes in asymptomatic carriers were also found. Notwithstanding complexities, our results further implicate the SHANK3-NLGN4-NRXN1 postsynaptic density genes and also identify novel loci at DPP6-DPP10-PCDH9 (synapse complex), ANKRD11, DPYD, PTCHD1, 15q24, among others, for a role in ASD susceptibility. Our most compelling result discovered CNV at 16p11.2 (p = 0.002) (with characteristics of a genomic disorder) at approximately 1% frequency. Some of the ASD regions were also common to mental retardation loci. Structural variants were found in sufficiently high frequency influencing ASD to suggest that cytogenetic and microarray analyses be considered in routine clinical workup.

Matched aliases:

  • we: house mouse 22389
  • unbalanced: house mouse 21829
  • a: house mouse 50518
  • de: house mouse 21384
  • or: house mouse 12677
  • 2: human 3692
  • at: house mouse 11904
  • s: house mouse 13618
  • SHANK3: human 85358
  • NLGN4: human 57502
  • NRXN1: human 9378
  • novel: house mouse 75125
  • DPP6: chimpanzee 463835, chimpanzee 735585, human 1804
  • DPP10: chimpanzee 459565, chimpanzee 748962, human 57628
  • PCDH9: chimpanzee 452590, human 5101
  • ANKRD11: chimpanzee 468078, chimpanzee 750284, human 29123
  • DPYD: chimpanzee 457047, human 1806
  • PTCHD1: chimpanzee 465535, human 139411
  • p: house mouse 18431, Norway rat 308670

You don’t need to be a doctor to know that the phrases ‘we’, ‘unbalanced’, ‘a’ and ‘or’ are not gene names in this context. But believe it or not, they are actually legitimate in other contexts. As in ‘The mutant gene we is probably responsible for a disrupted induction signal from the dermal papilla towards ectodermal cells of a hair follicle.’ I am not kidding and I feel like Dave Barry saying it. This is really a challenging problem, and unlike Dave Barry I have to solve it.

Soon I will post about our approaches to solving this overgeneration problem. In the world of computational linguistics it would be said that we have a precision problem. That is, we can’t accurately find examples of genes without finding a pile of junk in the process. It has the same quality as Paul Samuelson’s quip that Wall Street indexes predicted nine of the past five recessions.

Wish us luck, we all will live healthier if we get this sorted out….


How Home Dentistry Kits and LingPipe are Similar

March 27, 2007

LingPipe is a tough sell to most commercial customers without professional services. Occasionally I will do a deal where all I do is cash a check but almost all of our customers want lots of help. Suggest that they build it themselves and they look at me like I suggested a home root canal. Why?

Take one of our simplest capabilities, language model classification. There is a simple, runs-out-of-the-box tutorial that takes a developer line by line through what needs to be done to do some classification. It is really simple. Yet I cannot get certain customers working with it.

The sticking point, I believe, is that unfamiliarity plus the slightly loose nature of machine learning techniques is too great a conceptual jump. The DynamicLMClassifier needs the labels of the categories (easy), a boolean choice of whether to use a bounded or sequence-based language model (starting to feel a little woozy), and a character n-gram size (whoa, a ‘whackety n-germ thingy’). The tutorial suggests that 6 is a good initial n-gram value, but I think they are lost at this point. It gets worse because I suggest in the tutorial that they try different n-gram sizes to see what produces the best score. The scoring is nicely provided as part of the tutorial as well. This only gets worse as we dig deeper into the LingPipe API.
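For the curious, the n-gram size just controls how long the character windows are that the language model sees. A sketch of the slicing (illustrative only, not LingPipe's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {
    // Every character substring of length n; a character language
    // model estimates probabilities over these windows.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); ++i) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

Larger n captures longer character dependencies but needs more training data to estimate well, which is why the choice is an empirical one settled by the scoring the tutorial provides.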

Tuning these systems requires a particular mindset that is not part of a core computer science curriculum. It doesn’t require great intelligence, but experience is a must. Until we find a way to sort this out, such systems will stay out of general production. My mantra is “make computational linguistics as easy to use as a database.” We have a ways to go before our field moves away from black art status.