Google recently jumped on the American election-targeted site bandwagon by launching a Google Labs (aka not quite ready for a beta) demo called In Quotes. In their own words, “The ‘In Quotes’ feature allows you to find quotes from stories linked to from Google News.” As released today, it has a pulldown menu of 20 people whose quotes are being extracted and indexed; those 20 are chosen based on their hotness in Google News.
Google’s approach looks straightforward, and could be implemented in LingPipe.
Step 0 is to spider news sites, scrape the important text out of them, and then de-duplicate the results. I’m going to skip this step, as it really has nothing to do with the quote extraction problem per se.
Step 1 is to break the input into sentences. The Indo-European sentence detector that ships with LingPipe will work just fine for this problem. The main trickiness in sentence detection is dealing with quotes, parentheticals and abbreviations. We discuss all of this in our sentence detection tutorial.
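To give a feel for where the trickiness comes from, here's a minimal rule-based splitter in plain Python. It is not LingPipe's sentence model, just a sketch that assumes a tiny hand-picked abbreviation list (`ABBREVIATIONS` is hypothetical) and treats a period, question mark or exclamation point, optionally followed by closing quotes or brackets, as a candidate boundary.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "sen.", "gov.", "u.s."}

# Closing quotes/brackets that may trail the final punctuation mark,
# and opening ones that may start the next sentence.
CLOSERS = "”\"')]"
OPENERS = "“\"'(["

# Candidate boundary: end punctuation, optional closers, then whitespace.
BOUNDARY = re.compile(r"[.!?][" + re.escape(CLOSERS) + r"]*\s+")


def split_sentences(text):
    """Split text into sentences with a few hand-written rules."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower().lstrip(OPENERS)
        # Rule 1: a known abbreviation does not end a sentence.
        if last_token in ABBREVIATIONS:
            continue
        # Rule 2: the next sentence must start with a capital or an opener.
        following = text[match.end():match.end() + 1]
        if following and not (following.isupper() or following in OPENERS):
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Note that the closing quote stays attached to the sentence it ends, which is what you want for the quote handling in the later steps.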
Step 2 is to do named entity extraction. Google’s using a finite set of names, which can be done with our exact dictionary matching named entity extractor. We cover dictionary-based entity extraction in our named entity tutorial. Be sure to add pronouns in anticipation of step 3.
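As a rough illustration of the idea (and not of how LingPipe's dictionary matcher is implemented), exact dictionary matching can be sketched as a scan for each dictionary phrase at word boundaries. The names and categories in `DICTIONARY` are placeholders I made up; note the pronouns included for the coreference step.

```python
import re

# Hypothetical dictionary: surface phrase -> category.
DICTIONARY = {
    "John Smith": "PERSON",
    "Jane Doe": "PERSON",
    "he": "PRONOUN",
    "she": "PRONOUN",
}


def find_mentions(text, dictionary=DICTIONARY):
    """Return sorted (start, end, matched_text, category) tuples."""
    mentions = []
    for phrase, category in dictionary.items():
        # Names are matched case-sensitively; pronouns may be capitalized.
        flags = re.IGNORECASE if category == "PRONOUN" else 0
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", flags)
        for match in pattern.finditer(text):
            mentions.append((match.start(), match.end(), match.group(), category))
    return sorted(mentions)
```

The word-boundary anchors keep the pronoun "he" from firing inside words like "They". One pass per phrase is fine for 20 names; a large dictionary would want a proper multi-pattern matcher instead.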
Step 3 is to perform within-document coreference. LingPipe’s coref module will do this. There’s an example in the generic tutorials on the web.
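If you just want to see the shape of the problem, the crudest possible baseline links each pronoun to the most recently mentioned person. This is a deliberately naive heuristic of my own for illustration, not what LingPipe's coref module does; it ignores gender, number and syntax, and it assumes mentions arrive as hypothetical (offset, text, category) triples sorted by offset.

```python
def resolve_pronouns(mentions):
    """Link each pronoun to the nearest preceding PERSON mention.

    mentions: list of (offset, text, category) sorted by offset.
    Returns a list of (offset, text, resolved_name_or_None).
    """
    resolved = []
    last_person = None
    for offset, text, category in mentions:
        if category == "PERSON":
            last_person = text
            resolved.append((offset, text, text))
        elif category == "PRONOUN":
            # None means no person has been mentioned yet in this document.
            resolved.append((offset, text, last_person))
    return resolved
```

Even this baseline gets the common "Name said ... he said" pattern in news stories, but it will happily attach "he" to the wrong person when two people are discussed in the same paragraph.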
Step 4 is to pull out sentences with quote symbols and index them by speaker, as determined by within-document coreference. Within-doc coref is necessary to take a sentence like the following and link it to the actual speaker:
They are becoming more isolated in the world,” he said.
Step 5 is to extract contiguous sentences with quotes by the same speaker. This’ll let you get the actual content supplied in the above sentence:
“Syria and Iran continue to sponsor terror, yet their numbers are growing fewer. They are becoming more isolated in the world,” he said. “Like slavery and piracy, terrorism has no place in the modern world.”
As shown in this example, English punctuation does not include quotes at the beginning of a sentence that is continuing a quote that left off in the last sentence. That is, unless there’s non-quoted material like the attribution he said in between, or when a new paragraph is starting. So it’ll help if you can break inputs down into paragraphs before inputting them to the system.
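Here's one way the grouping in step 5 might be sketched, assuming you already have the sentences of a single paragraph paired with a resolved speaker (or `None` when a sentence carries no attribution). It tracks whether a quote is still open by counting curly open and close quotes, so a sentence with no opening quote of its own still gets attached to the run. The function name and input shape are my own invention.

```python
def quote_runs(sentences):
    """Group contiguous quoted sentences by speaker.

    sentences: list of (text, speaker_or_None) for one paragraph.
    Returns a list of (speaker_or_None, [sentence_text, ...]).
    """
    runs = []
    current, speaker, depth = [], None, 0
    for text, who in sentences:
        opens, closes = text.count("“"), text.count("”")
        # Quoted if a quote is still open or the sentence has quote marks.
        quoted = depth > 0 or opens > 0 or closes > 0
        speaker_changed = (
            who is not None and speaker is not None and who != speaker
        )
        # Close the current run on unquoted text or a new speaker.
        if current and (not quoted or speaker_changed):
            runs.append((speaker, current))
            current, speaker = [], None
        if quoted:
            current.append(text)
            speaker = speaker or who
        depth = max(0, depth + opens - closes)
    if current:
        runs.append((speaker, current))
    return runs
```

Run on the three sentences of the example above, with the middle one attributed by coreference, this yields a single run for that speaker. Resetting per paragraph matters because the open-quote count is only reliable within one paragraph.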
Step 6 is to de-duplicate quotes. This is quite easy for quoted material given that it’s (usually) not paraphrased, except in punctuation and attribution. The string comparison tutorial provides a few methods that’ll work for this, like character n-gram vector comparison. This would be tricky to scale, except that you can restrict the search to pairs of quotes by the same person. You’ll have to play around with precision and recall here by setting thresholds. It’s clear from Google News that even they can’t do this extremely accurately.
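To make the character n-gram idea concrete, here's a small stdlib-only sketch (not the LingPipe string comparison classes): quotes are normalized, turned into character trigram count vectors, and compared by cosine similarity, with comparisons restricted to quotes from the same speaker. The 0.9 threshold is an arbitrary starting point, not a tuned value.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts over lowercased text without punctuation."""
    kept = "".join(ch.lower() for ch in text if ch.isalnum() or ch.isspace())
    normalized = " ".join(kept.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def dedupe(quotes_by_speaker, threshold=0.9):
    """Keep the first of each group of near-duplicate quotes per speaker."""
    unique = {}
    for speaker, quotes in quotes_by_speaker.items():
        kept = []
        for quote in quotes:
            grams = char_ngrams(quote)
            if all(cosine(grams, char_ngrams(k)) < threshold for k in kept):
                kept.append(quote)
        unique[speaker] = kept
    return unique
```

Raising the threshold favors precision (fewer false merges); lowering it favors recall (more duplicates caught). The inner loop is quadratic in the number of quotes per speaker, which is why restricting to same-speaker pairs keeps it manageable.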
Step 7 is to index everything for searchability from the server side. Lucene’s a good tool to do this. Just create docs consisting of a field with the quote, a field with the speaker, a field with the source name and link, and a field for a unique ID per quote as determined in step 5.
Step 8 is to pull out topics for display using search over the free text indexes of the quotes. For instance, the Lucene query <iraq iraqi> would work (Lucene’s queries are implicitly disjunctive).
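For readers who haven't used Lucene, here's a toy in-memory stand-in that shows the same two ideas: one record per quote with quote, speaker and source fields keyed by ID, and a free-text query whose terms are OR-ed together. The class and method names are hypothetical, and none of this is Lucene's API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class QuoteIndex:
    """Toy inverted index over the quote field."""

    def __init__(self):
        self.docs = {}                    # quote_id -> stored fields
        self.postings = defaultdict(set)  # term -> set of quote_ids

    def add(self, quote_id, quote, speaker, source):
        self.docs[quote_id] = {"quote": quote, "speaker": speaker, "source": source}
        for term in tokenize(quote):
            self.postings[term].add(quote_id)

    def search(self, query):
        """Return sorted IDs of quotes containing ANY query term."""
        hits = set()
        for term in tokenize(query):
            hits |= self.postings.get(term, set())
        return sorted(hits)
```

With this, `index.search("iraq iraqi")` returns every quote mentioning either term, which is the behavior the disjunctive query above relies on. There is no stemming here, which is why the topic query lists both word forms.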
Step into uncharted territory by generalizing what Google’s up to with a statistical named entity detector. You can keep the dictionary, but you’ll need to do some kind of cross-document coreference resolution before you create the index.