Fool’s Gold Standard


I’m still digging into the Dolores Labs data from Amazon Mechanical Turk. As a result, I’ve dug back through the PASCAL Recognizing Textual Entailment (RTE) challenge data, and found it to be more than a little confusing. If you want to look at the raw data on which the Mechanical Turk evals were run, here’s the official RTE 1 Test Data.

You can also download Ido Dagan et al.’s overview of RTE1. Among other things, they told their annotators to ignore tense and to treat anything that’s not “fully entailed” but is “very probable” as being true. The Mechanical Turk annotators never saw those guidelines. The only instructions they got were to “Assume that you do not know anything about the situation except what the Text itself says. Also, note that every part of the Hypothesis must be implied by the Text in order for it to be true.”

Here’s a FALSE case from the official training data in which the second sentence does not follow from the first:

  • The chief of the European Central Bank, Vim Duesenburg, had spoken two days ago about "doubtless" signs of growth slowdown, following the financial crises that shook Asia then the rest of the developing countries in the last 18 months.
  • The Asian economic crash affected the Euro.

And here’s a TRUE case from the official data:

  • Azerbaijanis began voting this morning, Sunday, in the first round of the presidential elections, of which opinion polls are favoring current President Heydar Aliyev to win.
  • Azerbaijanis are voting in the presidential elections, in which Heydar Aliyev is expected to be re-elected.

I compiled a list of inferences for which my model chose the wrong answer. You can find them in rte1-errors.doc (which is actually an .xml file, but %^&*! WordPress doesn’t allow XML docs; just save it and rename it to .xml, or just go ahead and view it in OpenOffice). The xml file contains every example to which the model assigned the wrong category, along with some notes, whether I agreed with the gold standard, and the posterior mean of the category variable in the model.
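
If you’d rather poke at the file programmatically than open it in OpenOffice, something like the following Python sketch should work. It assumes the file keeps the RTE-1 layout (pair elements with t and h children, and a value attribute holding the gold label) and that conf holds a number. The two sample pairs, their ids, and their conf values are made up for illustration, so check them against the real file before trusting any of this.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the style of the RTE-1 <pair> format. The ids and conf
# values are invented; the real layout of the errors file is an assumption.
SAMPLE = """\
<entailment-corpus>
  <pair id="a" value="TRUE" conf="0.71">
    <t>Google files for its long awaited IPO.</t>
    <h>Google goes public.</h>
  </pair>
  <pair id="b" value="TRUE" conf="0.98">
    <t>Sonia Gandhi can be defeated in the next elections in India if between now and 2009, BJP can make Rural India Shine.</t>
    <h>Next elections in India will take place in 2009.</h>
  </pair>
</entailment-corpus>
"""


def load_errors(xml_text):
    """Parse <pair> elements and return them sorted by conf, highest first."""
    root = ET.fromstring(xml_text)
    rows = []
    for pair in root.iter("pair"):
        rows.append({
            "id": pair.get("id"),
            "gold": pair.get("value"),
            "conf": float(pair.get("conf")),
            "text": pair.findtext("t", default="").strip(),
            "hypothesis": pair.findtext("h", default="").strip(),
        })
    # The most confident disagreements with the gold standard come first.
    rows.sort(key=lambda r: r["conf"], reverse=True)
    return rows


rows = load_errors(SAMPLE)
for r in rows:
    print(r["id"], r["gold"], r["conf"], "|", r["hypothesis"])
```

Sorting by conf puts the cases where the annotators most firmly contradicted the gold label at the top, which is where I’d start reading.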

Here’s an example the gold standard marked TRUE, which the Turkers were pretty sure was false:

  • Seiler was reported missing March 27 and was found four days later in a marsh near her campus apartment.
  • Abducted Audrey Seiler found four days after missing.

Examples like this raise the question of what it means to probably entail something. Compare the following, which was marked FALSE in the gold standard data (as it should be):

  • Kennedy had just won California's Democratic presidential primary when Sirhan shot him in Los Angeles on June 5, 1968.
  • Sirhan killed Kennedy.

More subtle cases sit on the not-quite-inferred boundary. The following was marked TRUE, even though the text says only that the accusatory paragraphs would be omitted, not his name:

  • The Supreme Court was due to decide today, Wednesday, on the appeal filed by De Klerk for the purpose of omitting the accusatory paragraphs against him in the report, especially the paragraphs on his responsibility for bloody attacks that were carried out in the eighties against anti-apartheid organizations.
  • De Klerk awaits the Supreme court decision regarding the omission of his name in the incriminating report.

and this one, marked TRUE, even though filing for an IPO isn’t the same as going public:

  • Google files for its long awaited IPO.
  • Google goes public.

And here’s one marked TRUE that’s clearly FALSE, because between now and 2009 doesn’t mean it’ll be in 2009 (note that this was done several years ago, to boot, so we can’t blame inference on the part of the gold standard annotators):

  • Sonia Gandhi can be defeated in the next elections in India if between now and 2009, BJP can make Rural India Shine, if it can make every young man have a job that he can be proud of.
  • Next elections in India will take place in 2009.

There are two morals to this story. First, you should always look at your data! Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong. Second, we need to be very careful in explaining and clarifying the tasks we’re asking annotators to do. The user interface component is very important.
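
To make “look at your data” concrete, here’s a minimal Python sketch that pulls out the items where a simple majority vote disagrees with the gold standard, so you can read them one by one. The item ids, votes, and gold labels are toy values invented for illustration, and plain majority voting is a stand-in here, not the model I actually used.

```python
def majority_vote(votes):
    """Majority label among boolean votes; ties go to False (an arbitrary choice)."""
    return sum(votes) * 2 > len(votes)


def disagreements(turker_votes, gold):
    """Return the ids of items whose majority vote differs from the gold label."""
    return [item for item, votes in turker_votes.items()
            if majority_vote(votes) != gold[item]]


def accuracy(turker_votes, gold):
    """Fraction of items on which the majority vote matches the gold label."""
    wrong = len(disagreements(turker_votes, gold))
    return 1 - wrong / len(gold)


# Toy data: five invented votes per item (True = entailed, False = not).
turker_votes = {
    "item_a": [True, True, True, False, True],
    "item_b": [False, False, True, False, False],
    "item_c": [False, False, False, False, True],
}
gold = {"item_a": True, "item_b": True, "item_c": False}

to_review = disagreements(turker_votes, gold)
print("read these by hand:", to_review)
print("majority-vote accuracy:", round(accuracy(turker_votes, gold), 3))
```

The list of disagreements is the useful output. The accuracy number on its own can’t tell you whether the annotators or the gold standard got an item wrong.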

I’ll end with one the Turkers got wrong, as it’s an interesting example:

  • Microsoft was established in Italy in 1985.
  • Microsoft was established in 1985.

Don’t get me started on scalar implicature and intersective modifiers!

4 Responses to “Fool’s Gold Standard”

  1. Brendan O'Connor Says:

    Wow, great stuff. In the XML file, I assume the value="" attribute means the RTE gold label (since that’s how it is in the original RTE dataset’s xml). And Turker labels (MAP estimates) are opposite of value="" for this entire file? Does the conf="" attribute mean the Turker confidence (posterior P(c_i = NOT_GOLD | x) by your poster’s notation)? It’s not P(c_i=T|x) because I only see >0.5 probabilities.

  2. lingpipe Says:

    Yes, the value attribute is from the original RTE data. I added attribute bob and attribute note and attribute conf. And yes, that’s how the conf values were computed. I only looked at cases where the model’s best guess didn’t match the gold standard.

    If I actually cared about the RTE task, I’d also look at the ones where the mean category assignment was in the (0.01,0.99) range, since you’d probably want fewer than 1/100 errors in your final result. The model’s almost certainly over-confident, to boot, because of failed independence assumptions.

    The NYU folks (Victor Sheng, Foster Provost and Panos Ipeirotis) have a KDD ’08 Paper about dynamically assigning labelers by quantity and quality to achieve a desired corpus accuracy.

  3. Oren Glickman Says:

    Bob, great post! I am glad to see that the RTE-1 data is still live and kicking.
    I was wondering what score the “Turkers” would have got. How would they have been placed compared to the other competing systems?

  4. Bob Carpenter Says:

    Majority vote is 89.7%, whereas the model-based best-guess estimate is 92.6% accurate vs. the gold standard. The best accuracy for a system entering the contest was 60.6% according to the paper.

    The paper also says two annotators labeled each example with an average inter-annotator agreement of only 80%. The gold standard contains only items on which the two annotators agreed, with an additional 13% of the examples on which they agreed removed after further review.

    Some of the participants did their own annotation and found 91-96% accuracy with respect to the gold standard.

    So the bottom line is that the Turkers performed up to the task’s inter-annotator agreement rate.
