Lacks a Convincing Comparison with Simple Non-Bayesian Approaches

That was just one comment from my 2009 NAACL rejection letter (reprinted in full with a link to the paper below). As Andrew says, I’m glad to hear they had so many great submissions.

This was for a paper on gold-standard inference that I’ve been blogging about. Here’s a link to the submission:

The only thing it adds to the white paper and graphs on the blog is another mechanical Turk experiment that recreated the MUC 6 PERSON entity annotations. As with the Snow et al. work, the Turkers did better than the gold standard.

I should’ve paid more attention to Simon Peyton Jones’s most excellent advice. Ironically, I used to give most of that advice to our own students. Always easier medicine to give than to take.

In particular, I needed to make the use cases much clearer to those who weren’t knee deep in Bayesian analysis and inter-annotator agreement statistics. The trap I fell into has a name. It’s called “the curse of knowledge”. The best general study of this phenomenon I know is Elizabeth Newton’s research, which is described in Heath and Heath’s great book Made to Stick (see the tappers and listeners section of the excerpt). Good psych experiments are so compelling they give me goosebumps.

It’s hard to compare the Bayesian posterior intervals with non-Bayesian estimates. The whole point of the Bayesian analysis is to analyze the posteriors. But if you’re not a Bayesian, what do you care?

The standard evaluation in NLP is first-best analyses on held-out data sets with a simple accuracy measure. No one ever talks about log loss except in model fitting and language-modeling evaluation. So even a probabilistic model that's not Bayesian can't get evaluated on its own terms, and the same goes for simple posterior predictive inferences in Bayesian analyses. I guess the reviewer missed the comparison with simple voting, which is the de facto standard (coupled with censoring of uncertain cases), sketched below.
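
To make that baseline concrete, here's a minimal sketch (mine, not from the paper) of the de facto standard: majority voting per item, censoring items where the vote margin is too small, scored by first-best accuracy against a gold standard. The function names and toy data are made up for illustration.

```python
from collections import Counter

def vote(labels, min_margin=1):
    """Return the majority label, or None (censored) if the vote margin is too small."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] - counts[1][1] < min_margin:
        return None  # uncertain case: censor rather than guess
    return counts[0][0]

def first_best_accuracy(annotations, gold):
    """annotations: dict item -> list of labels; gold: dict item -> true label."""
    decisions = {item: vote(labels) for item, labels in annotations.items()}
    kept = [item for item, label in decisions.items() if label is not None]
    correct = sum(decisions[item] == gold[item] for item in kept)
    return correct / len(kept), len(kept) / len(gold)

# toy usage with made-up annotations
annotations = {"d1": [1, 1, 0], "d2": [0, 0, 0], "d3": [1, 0]}
gold = {"d1": 1, "d2": 0, "d3": 1}
print(first_best_accuracy(annotations, gold))  # (accuracy on kept items, coverage)
```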

What I really should’ve addressed is two sides of the issue of what happens with fewer annotators. The first is that the Bayesian approach has an even stronger advantage over simple voting in agreeing with the gold-standard annotations; I only ran that analysis after the submission, in a “Doh!” moment, anticipating the problem reviewers might have with 10 annotators. The second is to push harder on the fool’s gold standard point: the state of the art produces very poor corpora in terms of consistency.

It is possible to see whether the informative priors help using cross-validation. But even without cross-validation, it's obvious that an annotator who makes no mistakes on 20 samples is unlikely to have 100% accuracy on the rest of the corpus. You don't estimate a baseball player's batting average to be .500 after he goes 10 for 20 in his first 20 at bats. Everyone in NLP knows low-count estimates need to be smoothed. So now that we've decided we need a prior (call it regularization or smoothing if you like), we're just haggling over the price. So I just optimized that, too. It's what any good Bayesian would do.
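
For a toy illustration of the kind of shrinkage I mean (not the paper's actual hierarchical model), here's the beta-binomial version of the batting-average argument; the Beta(8, 2) prior is an arbitrary choice standing in for whatever informative prior you actually fit.

```python
def posterior_mean(successes, trials, alpha, beta):
    """Posterior mean of a binomial proportion under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (trials + alpha + beta)

n_correct, n_items = 20, 20
mle = n_correct / n_items                                 # 1.0, which no annotator sustains
shrunk = posterior_mean(n_correct, n_items, alpha=8.0, beta=2.0)
print(mle, shrunk)  # the 20/20 estimate gets pulled toward the prior mean of 0.8
```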

Here’s the full set of comments and ratings (on a 1-5 scale, with 5 being the best).

============================================================================ 
NAACL-HLT 2009 Reviews for Submission #138
============================================================================ 

Title: A Multilevel Bayesian Model of Categorical Data Annotation

Authors: Bob Carpenter
============================================================================
                            REVIEWER #1
============================================================================ 


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                         Appropriateness: 3
                                 Clarity: 3
            Originality / Innovativeness: 2
                 Soundness / Correctness: 4
                   Meaningful Comparison: 4
                            Thoroughness: 4
              Impact of Ideas or Results: 2
                     Impact of Resources: 3
                          Recommendation: 2
                     Reviewer Confidence: 4
                                Audience: 2
                     Presentation Format: Poster
             Resubmission as short paper: recommended


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

The paper presents a simple Bayesian model of labeling error in multiple
annotations.  The goal is to evaluate and potentially clean-up errors of
sloppy/misguided annotation, for example, obtained via Amazon's Mechanical
Turk.  Standard Gibbs sampling is used to infer model parameters from observed
annotations and produce likely 'true' annotations.

I found section 2 confusing, even though the model is fairly simple.  Overall,
I didn't find the model or method very innovative or the results very
illuminating.
Nevertheless, I think this paper could be good as short one, since there is
some interest in empirical assessment of noisy annotation.

============================================================================
                            REVIEWER #2
============================================================================ 


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                         Appropriateness: 5
                                 Clarity: 4
            Originality / Innovativeness: 3
                 Soundness / Correctness: 3
                   Meaningful Comparison: 3
                            Thoroughness: 3
              Impact of Ideas or Results: 2
                     Impact of Resources: 1
                          Recommendation: 2
                     Reviewer Confidence: 4
                                Audience: 3
                     Presentation Format: Poster
             Resubmission as short paper: recommended


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper deals with modelling the annotation process using a fully specified
Bayesian model.

The paper itself is nicely written and I like the mixture of synthetic and real
experiments.  The problems I have however are:

--what does this actually buy us?  It would be nice to see exactly what is
gained through this process.  As it stands, it seems to tell us that annotators
can disagree and that some tasks are harder than other ones.  This is not
surprising really.

--it is highly unrealistic to have tens of annotators per item.  A far more
realistic situation is to have only one or possibly two annotators.  What would
this mean for the approach?

I would have liked to have seen some kind of baseline experiment, rather than
assuming that the various priors are actually warranted.  

Why assume binary labels?  Although it is possible to simulate non-binary
labels, this will introduce errors and it is not necessarily natural for
annotators.

============================================================================
                            REVIEWER #3
============================================================================ 


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                         Appropriateness: 5
                                 Clarity: 3
            Originality / Innovativeness: 4
                 Soundness / Correctness: 3
                   Meaningful Comparison: 3
                            Thoroughness: 4
              Impact of Ideas or Results: 3
                     Impact of Resources: 1
                          Recommendation: 3
                     Reviewer Confidence: 3
                                Audience: 4
                     Presentation Format: Poster
             Resubmission as short paper: recommended


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

Statistical approaches to NLP have been largely depending on human annotated
data. This paper uses a Bayesian approach to address the uncertainty in human
annotation process, which is an important and interesting problem that may
directly affect statistical learning performance. The proposed model is able to
infer true labels given only a set of human annotations. The paper is
in general sound; the experiments on both simulated data and real data are
provided and with detailed analysis. Overall the work seems a contribution to
the field which I recommend for acceptance, but I have a few suggestions for
revision.

My major concern is that the work lacks a convincing comparison with simple
non-Bayesian approaches in demonstrating the utility of the model. The paper
has a excessively sketchy description of a non-hierarchical version of the
model (which actually gives similar result in label inference). Moreover, it is
not very clear how the proposed approach is related to all previous works
listed in Section 9.

The authors should explain more on their evaluation metrics. What points is it
trying to make in Figure 6? Why using residual errors instead of ROC curves?   

Some minor comments:
-- Typo at end of Section 3: "residual error error"
-- Section 4: "39% of all annotators perform no better than chance, as
indicated by the diagonal green line in Figure 6". Maybe I missed something but
I don't see this in Figure 6.
-- Section 7: define "item-level" predictors.

2 Responses to “Lacks a Convincing Comparison with Simple Non-Bayesian Approaches”

  1. lingpipe Says:

    It’s really heartwarming to read stories like this one by George A. Akerlof:

    http://nobelprize.org/nobel_prizes/economics/articles/akerlof/index.html

    on how he came about writing his Nobel-prizewinning paper that was rejected by three journals in a row for being “too trivial” (or perhaps too likely to change the way people do things).

    Of course, just because good papers are rejected for reason X, doesn’t mean that a paper rejected for reason X is good.

  2. Science for SEO» Blog Archive » Peer review for SEO Says:

    [...] called “Lacks a Convincing Comparison with Simple Non-Bayesian Approaches” and that’s what peer review concluded. He makes available the feedback, and it’s [...]
