I’m still digging into the Dolores Labs data from Amazon Mechanical Turk. That led me back through the PASCAL Recognizing Textual Entailment (RTE) challenge data, which I found more than a little confusing. If you want to look at the raw data on which the Mechanical Turk evaluations were run, here’s the official RTE 1 Test Data.
You can also download Ido Dagan et al.’s overview of RTE 1. Among other things, they told their annotators to ignore tense and to treat anything that’s not “fully entailed” but is “very probable” as true. The Mechanical Turk annotators, by contrast, saw only these instructions: “Assume that you do not know anything about the situation except what the Text itself says. Also, note that every part of the Hypothesis must be implied by the Text in order for it to be true.”
Here’s a FALSE case from the official training data in which the second sentence does not follow from the first:
- The chief of the European Central Bank, Vim Duesenburg, had spoken two days ago about "doubtless" signs of growth slowdown, following the financial crises that shook Asia then the rest of the developing countries in the last 18 months.
- The Asian economic crash affected the Euro.
And here’s a TRUE case from the official data:
- Azerbaijanis began voting this morning, Sunday, in the first round of the presidential elections, of which opinion polls are favoring current President Heydar Aliyev to win.
- Azerbaijanis are voting in the presidential elections, in which Heydar Aliyev is expected to be re-elected.
I compiled a list of inferences for which my model chose the wrong answer. You can find them in rte1-errors.doc (which is actually an .xml file, but %^&*! WordPress doesn’t allow XML docs; just save it and rename it to .xml, or go ahead and view it in OpenOffice). The XML file contains every example to which the model assigned the wrong category, along with some notes, whether I agreed with the gold standard, and the posterior mean of the category variable in the model.
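If you want to poke at RTE-style data programmatically rather than in OpenOffice, the pairs are easy to read with Python’s standard library. This is just a sketch over an inline sample in the RTE 1 distribution’s format (a `pair` element with `t` and `h` children and a `value` attribute); your copy of the data, or my errors file, may use slightly different attribute names.

```python
# Minimal sketch of reading RTE-style entailment pairs with the
# standard-library XML parser. The sample below is illustrative, not
# taken from the distributed files.
import xml.etree.ElementTree as ET

sample = """
<entailment-corpus>
  <pair id="1" value="TRUE" task="IR">
    <t>Azerbaijanis began voting this morning in the presidential elections.</t>
    <h>Azerbaijanis are voting in the presidential elections.</h>
  </pair>
</entailment-corpus>
"""

root = ET.fromstring(sample)
for pair in root.iter("pair"):
    text = pair.findtext("t").strip()   # the Text
    hyp = pair.findtext("h").strip()    # the Hypothesis
    gold = pair.get("value")            # gold-standard label
    print(pair.get("id"), gold, "|", hyp)
```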
Here’s an example that the gold standard marked TRUE, but that the Turkers were pretty sure was false:
- Seiler was reported missing March 27 and was found four days later in a marsh near her campus apartment.
- Abducted Audrey Seiler found four days after missing.
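To make “pretty sure was false” concrete: the simplest way to turn a pile of annotator votes into a posterior over the TRUE/FALSE category is Laplace-smoothed vote counting, i.e. a uniform Beta(1,1) prior on the probability the pair is TRUE. The actual model behind the posterior means in the errors file is more elaborate than this (it accounts for annotator accuracies), so treat this purely as an illustrative sketch; the vote counts below are made up.

```python
# Sketch, not the model from the post: posterior mean of a binary
# category under a uniform Beta(1,1) prior, given independent votes.
# This is Laplace's rule of succession: (k + 1) / (n + 2).
def posterior_mean_true(votes):
    """votes: list of booleans, True = annotator judged ENTAILED."""
    k = sum(votes)
    n = len(votes)
    return (k + 1) / (n + 2)

# Hypothetical tally: nine of ten Turkers vote FALSE, pushing the
# posterior mean for TRUE well below 0.5.
votes = [False] * 9 + [True]
print(posterior_mean_true(votes))  # 2/12, about 0.167
```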
Examples like this raise the problem of what it means to “probably” entail something. The following, for instance, was marked FALSE in the gold-standard data (as it should be):
- Kennedy had just won California's Democratic presidential primary when Sirhan shot him in Los Angeles on June 5, 1968.
- Sirhan killed Kennedy.
More subtle cases sit on the not-quite-entailed boundary. The following was marked TRUE, though the text doesn’t say they’d remove his name, just the accusatory paragraphs:
- The Supreme Court was due to decide today, Wednesday, on the appeal filed by De Klerk for the purpose of omitting the accusatory paragraphs against him in the report, especially the paragraphs on his responsibility for bloody attacks that were carried out in the eighties against anti-apartheid organizations.
- De Klerk awaits the Supreme court decision regarding the omission of his name in the incriminating report.
And this one was marked TRUE, even though filing for an IPO isn’t the same as going public:
- Google files for its long awaited IPO.
- Google goes public.
And here’s one marked TRUE that’s clearly FALSE, because “between now and 2009” doesn’t mean it’ll happen in 2009 (note that the annotation was done several years before 2009, so we can’t even blame hindsight on the part of the gold-standard annotators):
- Sonia Gandhi can be defeated in the next elections in India if between now and 2009, BJP can make Rural India Shine, if it can make every young man have a job that he can be proud of.
- Next elections in India will take place in 2009.
There are two morals to this story. First, always look at your data! Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature, and some gold-standard labels were just wrong. Second, we need to be very careful in explaining and clarifying the tasks we ask annotators to do. The user-interface component is very important.
I’ll end with one the Turkers got wrong, as it’s an interesting example:
- Microsoft was established in Italy in 1985.
- Microsoft was established in 1985.
Don’t get me started on scalar implicature and intersective modifiers!