Posterior Category Distributions from Annotations (without a Gold Standard)


Here’s Handelman’s dentistry data (see the paper cited below for the reference), in which 5 dentists evaluated X-rays for caries (a kind of pre-cavity), presented as a contingency table:

00000: 1880  00001: 789  00010: 43  00011: 75  00100: 23  00101: 63  00110: 8  00111:  22
01000:  188  01001: 191  01010: 17  01011: 67  01100: 15  01101: 85  01110: 8  01111:  56
10000:   22  10001:  26  10010:  6  10011: 14  10100:  1  10101: 20  10110: 2  10111:  17
11000:    2  11001:  20  11010:  6  11011: 27  11100:  3  11101: 72  11110: 1  11111: 100

For instance, there were 789 cases where annotators 1-4 said there was no caries, whereas annotator 5 said there was. There were 3 cases where annotators 1-3 said there was caries and annotators 4-5 said there was not.
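To make the encoding concrete, here's a small Python sketch that reads the table into a dictionary and looks up the two cases just mentioned. The names (TABLE, parse_table, positive_rate) are mine, purely for illustration; only the counts come from the data.

```python
# Handelman's table as printed above: annotation pattern -> number of X-rays.
# Digit j of a pattern is annotator j's call (1 = caries, 0 = no caries).
TABLE = """
00000:1880 00001:789 00010:43 00011:75 00100:23 00101:63 00110:8 00111:22
01000:188 01001:191 01010:17 01011:67 01100:15 01101:85 01110:8 01111:56
10000:22 10001:26 10010:6 10011:14 10100:1 10101:20 10110:2 10111:17
11000:2 11001:20 11010:6 11011:27 11100:3 11101:72 11110:1 11111:100
"""


def parse_table(text):
    """Turn whitespace-separated 'pattern:count' cells into a dict."""
    counts = {}
    for cell in text.split():
        pattern, count = cell.split(":")
        counts[pattern] = int(count)
    return counts


counts = parse_table(TABLE)
total = sum(counts.values())

# Fraction of X-rays each annotator marked as caries, annotator 1 first.
positive_rate = [
    sum(n for p, n in counts.items() if p[j] == "1") / total
    for j in range(5)
]

print(counts["00001"], counts["11100"], total)
print([round(r, 3) for r in positive_rate])
```

The per-annotator rates are a quick way to see how differently the five dentists behave before fitting any model.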

The goal is to infer the true categories given these annotations. We described a general latent-category approach in our previous post, Hierarchical Bayesian Models of Categorical Data Annotation. But the paper didn’t have room to display the derived category densities, and deriving them is one of the main inference tasks.
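To give a feel for what a latent-category model does with this table, here's a rough sketch. It is not the hierarchical Bayesian model from the paper: I've swapped in plain maximum-likelihood EM for a two-category latent-class model (in the style of Dawid and Skene), with one sensitivity and one specificity per annotator. It produces a single point estimate of P(caries | annotations) per pattern rather than a posterior density, so don't expect it to reproduce the plots. All function and variable names are hypothetical.

```python
from itertools import product

# Counts from the table above, in the order 00000, 00001, ..., 11111.
COUNTS = [1880, 789, 43, 75, 23, 63, 8, 22,
          188, 191, 17, 67, 15, 85, 8, 56,
          22, 26, 6, 14, 1, 20, 2, 17,
          2, 20, 6, 27, 3, 72, 1, 100]
PATTERNS = list(product((0, 1), repeat=5))  # same order as COUNTS


def fit_latent_class(patterns, counts, iterations=200):
    """EM for a two-category latent-class model (ML stand-in, not Bayesian)."""
    total = sum(counts)
    k = len(patterns[0])
    # Start from the vote fraction so category 1 stays aligned with "caries".
    post = [sum(y) / k for y in patterns]
    for _ in range(iterations):
        # M-step: re-estimate prevalence and per-annotator accuracies.
        pos = sum(n * q for n, q in zip(counts, post))
        neg = total - pos
        prevalence = pos / total
        sens = [sum(n * q * y[j] for y, n, q in zip(patterns, counts, post)) / pos
                for j in range(k)]
        spec = [sum(n * (1 - q) * (1 - y[j])
                    for y, n, q in zip(patterns, counts, post)) / neg
                for j in range(k)]
        # E-step: P(caries | pattern) by Bayes' rule.
        post = []
        for y in patterns:
            a = prevalence        # joint weight for "caries"
            b = 1 - prevalence    # joint weight for "no caries"
            for j, y_j in enumerate(y):
                a *= sens[j] if y_j else (1 - sens[j])
                b *= (1 - spec[j]) if y_j else spec[j]
            post.append(a / (a + b))
    return prevalence, sens, spec, post


prevalence, sens, spec, post = fit_latent_class(PATTERNS, COUNTS)
posterior = {"".join(map(str, y)): q for y, q in zip(PATTERNS, post)}

for pattern in ("00000", "00001", "11100", "11111"):
    print(pattern, round(posterior[pattern], 3))
```

Because each annotator gets their own sensitivity and specificity, two patterns with the same number of positive votes can end up with different estimates.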

Here are the posterior density estimates for the categories for each of the possible annotations, with horizontal axes all drawn on [0,1] and vertical axes to scale (the skewed distros are clipped).

[Figure: Posterior Category Densities on the Same Scale]

The thing to note here is that the results are rather different from those of a simple voting scheme. I’m hoping the model will actually perform much better. The natural baseline for comparison is simple voting, as evaluated for de-duplication by Dolores Labs and described in Brendan O’Connor’s blog post Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick). (I’d also highly recommend all their follow-up posts, especially the most recent, which links to Rion Snow et al.’s forthcoming EMNLP paper, AMT is fast, cheap, and good for machine learning data.)
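For reference, here's what I mean by a simple voting scheme, as a minimal Python sketch. The function names and the threshold parameter are my own illustration, not code from Dolores Labs. The point is that voting only looks at how many annotators said "caries", not which ones.

```python
from itertools import product


def vote_fraction(pattern):
    """Fraction of annotators who said 'caries' (digit 1) for this pattern."""
    return pattern.count("1") / len(pattern)


def majority_vote(pattern, threshold=0.5):
    """Call it caries when the vote fraction exceeds the threshold."""
    return vote_fraction(pattern) > threshold


# All 32 possible annotation patterns for 5 annotators.
patterns = ["".join(bits) for bits in product("01", repeat=5)]

# Group patterns by number of positive votes: voting cannot tell apart
# any two patterns that land in the same group.
by_votes = {}
for p in patterns:
    by_votes.setdefault(p.count("1"), []).append(p)

for votes in sorted(by_votes):
    example = by_votes[votes][0]
    print(votes, len(by_votes[votes]), majority_vote(example))
```

So the 32 patterns collapse to just 6 distinct scores under voting, whereas a model with per-annotator parameters can give each pattern its own estimate.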

I’m trying to follow Gelman and Hill’s advice (from their regression book) on displaying multiple graphs. Here’s what R draws by default (thanks to Brendan O’Connor for telling me how to get rid of the vertical axes), which has different horizontal and vertical scales per plot:

[Figure: Posterior Category Densities, R default scales]
