I’ve been trying to find suitable paper(s) that explain this to present at our group’s reading club – you give some references in the slides (e.g. Bruce and Wiebe 1999) but you don’t specify exactly which paper, which is making it hard to track them down. Is there any chance you could give a full reference list? Is there a single paper that explains the sensitivity/specificity concept and why it’s a better option than majority voting?

As Andrew Gelman’s always suggesting, it makes sense to do multiple comparisons in a hierarchical model to evaluate bakeoffs.

At the very least, I’d like to see the bootstrap used for variance estimates rather than making all the erroneous independence assumptions you get with other tests.
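To make the bootstrap suggestion concrete, here is a minimal sketch of one way to do it, assuming each system's output has been scored as correct (1) or incorrect (0) on the same test items. The function name `paired_bootstrap_diff` and its parameters are hypothetical, chosen for illustration. Whole items are resampled with replacement, so the two systems' outcomes on an item stay paired and no independence between systems is assumed.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Bootstrap the accuracy difference (A minus B) by resampling test items.

    correct_a, correct_b: parallel lists of 0/1 flags, one per test item.
    Returns (mean difference, variance, 95% percentile interval).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Draw n item indices with replacement; both systems share the draw.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    mean = sum(diffs) / n_resamples
    # Sample variance of the bootstrap replicates.
    var = sum((d - mean) ** 2 for d in diffs) / (n_resamples - 1)
    # Percentile interval taken from the sorted replicates.
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return mean, var, (lo, hi)
```

Calling `paired_bootstrap_diff(flags_a, flags_b)` on two per-item correctness lists gives an estimate of how much the accuracy gap would move under a different sample of test items. If the interval comfortably excludes zero, the bakeoff result is less likely to be an artifact of the particular test set. This only covers item-level sampling variation; it says nothing about annotator noise in the gold standard itself.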

What Massimo and I are worrying about now is how to take his Phrase Detectives coref data and create gold standards. It’s easy to move from binomial to multinomial or ordinal or scalar, but so far we haven’t figured out how to do it with coreference chains. The combinatorics of the sets are rather daunting.

How long until we have a Bayesian replacement for evalb? (smile)

