Here’s Handelman’s dentistry data (see the paper cited below for the full reference), in which 5 dentists evaluated X-rays for caries (a kind of pre-cavity), presented as a contingency table:
Each pattern gives annotators 1–5’s labels (1 = caries), followed by the number of X-rays receiving that pattern:

00000: 1880    00001:  789    00010:  43    00011:  75
00100:   23    00101:   63    00110:   8    00111:  22
01000:  188    01001:  191    01010:  17    01011:  67
01100:   15    01101:   85    01110:   8    01111:  56
10000:   22    10001:   26    10010:   6    10011:  14
10100:    1    10101:   20    10110:   2    10111:  17
11000:    2    11001:   20    11010:   6    11011:  27
11100:    3    11101:   72    11110:   1    11111: 100
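If you want to play with the numbers yourself, the table is easy to load as a dictionary from annotation pattern to count. A minimal sketch in Python (the pattern strings are annotators 1–5’s labels, 1 = caries):

```python
# Handelman's dentistry data: each key is the five dentists' labels
# (1 = caries, 0 = no caries); each value is the number of X-rays
# that received exactly that pattern of annotations.
counts = {
    "00000": 1880, "00001": 789, "00010": 43, "00011": 75,
    "00100": 23,   "00101": 63,  "00110": 8,  "00111": 22,
    "01000": 188,  "01001": 191, "01010": 17, "01011": 67,
    "01100": 15,   "01101": 85,  "01110": 8,  "01111": 56,
    "10000": 22,   "10001": 26,  "10010": 6,  "10011": 14,
    "10100": 1,    "10101": 20,  "10110": 2,  "10111": 17,
    "11000": 2,    "11001": 20,  "11010": 6,  "11011": 27,
    "11100": 3,    "11101": 72,  "11110": 1,  "11111": 100,
}

total = sum(counts.values())
print(total)  # 3869 X-rays in all
```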
For instance, there were 789 cases where annotators 1–4 said there was no caries, whereas annotator 5 said there was. And there were 3 cases where annotators 1–3 said there was caries, while annotators 4–5 said there was not.
The goal is to infer the true categories given these annotations. We described a general latent-category approach in our previous post, Hierarchical Bayesian Models of Categorical Data Annotation. But the paper didn’t have room to display the derived category densities, which is one of the main inference tasks.
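The hierarchical Bayesian model in that post is fit by sampling, which I won’t reproduce here. But a minimal sketch of the non-hierarchical ancestor of these models, a two-category Dawid-and-Skene-style EM estimator, shows the shape of the computation: each annotator gets a sensitivity and a specificity, and the E-step yields a per-pattern posterior probability of caries. The initial values below are made up for the sketch; this is not the post’s actual fit.

```python
# Two-category Dawid-Skene-style EM: a sketch of latent-category
# inference, not the hierarchical Bayesian model from the post.
counts = {
    "00000": 1880, "00001": 789, "00010": 43, "00011": 75,
    "00100": 23,   "00101": 63,  "00110": 8,  "00111": 22,
    "01000": 188,  "01001": 191, "01010": 17, "01011": 67,
    "01100": 15,   "01101": 85,  "01110": 8,  "01111": 56,
    "10000": 22,   "10001": 26,  "10010": 6,  "10011": 14,
    "10100": 1,    "10101": 20,  "10110": 2,  "10111": 17,
    "11000": 2,    "11001": 20,  "11010": 6,  "11011": 27,
    "11100": 3,    "11101": 72,  "11110": 1,  "11111": 100,
}

def em(counts, iters=200):
    J = len(next(iter(counts)))   # number of annotators (5 here)
    n = sum(counts.values())
    pi = 0.5                      # caries prevalence (initial guess)
    sens = [0.7] * J              # P(say 1 | true 1), initial guess
    spec = [0.9] * J              # P(say 0 | true 0), initial guess
    post = {}
    for _ in range(iters):
        # E-step: posterior probability of caries for each pattern
        for pat in counts:
            p1, p0 = pi, 1.0 - pi
            for j, label in enumerate(pat):
                if label == "1":
                    p1 *= sens[j]
                    p0 *= 1.0 - spec[j]
                else:
                    p1 *= 1.0 - sens[j]
                    p0 *= spec[j]
            post[pat] = p1 / (p1 + p0)
        # M-step: re-estimate prevalence, sensitivities, specificities
        exp1 = sum(counts[p] * post[p] for p in counts)
        pi = exp1 / n
        for j in range(J):
            sens[j] = sum(counts[p] * post[p]
                          for p in counts if p[j] == "1") / exp1
            spec[j] = sum(counts[p] * (1.0 - post[p])
                          for p in counts if p[j] == "0") / (n - exp1)
    return pi, sens, spec, post

pi, sens, spec, post = em(counts)
```

Even this stripped-down version can disagree with majority voting, because it weights each annotator by an estimated accuracy rather than counting all votes equally.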
Here are the posterior density estimates of the category for each of the 32 possible annotation patterns, with the horizontal axes all drawn on [0,1] and the vertical axes to scale (the highly skewed distributions are clipped). Click on the image to view it full size.
The thing to note here is that the results are rather different from those of a simple voting scheme. I’m hoping the model will actually perform much better. The natural point of comparison is the simple voting scheme, as evaluated for de-duplication by Dolores Labs and described in Brendan O’Connor’s blog post Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick). (I’d also highly recommend all their follow-up posts, especially the most recent, which links to Rion et al.’s forthcoming EMNLP paper, AMT is fast, cheap, and good for machine learning data.)
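To make the voting baseline concrete, here’s a sketch of the simple scheme: call an X-ray caries if at least some threshold number of the 5 dentists said so. The threshold of 3 below is just the ordinary majority rule; the calibration trick in the linked post amounts to tuning that threshold.

```python
# Majority-vote baseline over the contingency table.
counts = {
    "00000": 1880, "00001": 789, "00010": 43, "00011": 75,
    "00100": 23,   "00101": 63,  "00110": 8,  "00111": 22,
    "01000": 188,  "01001": 191, "01010": 17, "01011": 67,
    "01100": 15,   "01101": 85,  "01110": 8,  "01111": 56,
    "10000": 22,   "10001": 26,  "10010": 6,  "10011": 14,
    "10100": 1,    "10101": 20,  "10110": 2,  "10111": 17,
    "11000": 2,    "11001": 20,  "11010": 6,  "11011": 27,
    "11100": 3,    "11101": 72,  "11110": 1,  "11111": 100,
}

def vote(pattern, threshold=3):
    # simple voting: positive iff at least `threshold` of 5 say caries
    return sum(int(c) for c in pattern) >= threshold

flagged = sum(n for pat, n in counts.items() if vote(pat))
print(flagged, sum(counts.values()))  # 520 of 3869 X-rays flagged
```

Raising or lowering the threshold trades precision against recall, which is exactly the knob the calibration trick turns; the model-based posteriors instead give a graded probability per pattern.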
I’m trying to follow Gelman and Hill’s advice (from their regression book) on displaying multiple graphs. Here’s what R draws by default (thanks to Brendan O’Connor for telling me how to get rid of the vertical axes), which has different horizontal and vertical scales per plot: