I found a link from Slashdot of all places to this forthcoming paper:
- Mavandadi, Sam, Stoyan Dimitrov, Steve Feng, Frank Yu, Uzair Sikora, Oguzhan Yaglidere, Swati Padmanabhan, Karin Nielsen, and Aydogan Ozcan. (2012) Distributed Medical Image Analysis and Diagnosis through Crowd-Sourced Games: A Malaria Case Study. PLoS ONE.
The main body of the paper is about they reapplication to malaria diagnosis. But I’m more interested in the statistical techniques they used for crowd sourcing.
None of the nine authors, the reviewer(s) or editor(s) knew that their basic technique for analyzing crowd sourced data has been around for over 30 years. (I’m talking about the statistical technique here, not the application to distributed diagnosis of diseases, which I don’t know anything about.)
Of course, many of us reinvented this particular wheel over the past three decades, and the lack of any coherent terminology for the body of work across computer science, statistics, and epidemiology is part of the problem.
The authors should’ve cited the seminal paper in this field (at least it’s the earliest one I know — if you know earlier refs, please let me know):
- Dawid, A. P. and A. M. Skene. 1979. Maximum likelihood estimation of observer error rates using the EM algorithm. Applied Statistics 28(1):20–28.
Here’s a 20-year old paper on analyzing medical image data (dental X-rays) with similar models:
- Espeland, M. A. and S. L. Handelman. 1989. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 45:587–599.
Mavandadi et al.
Mavandadi et al. use an approach they call a “binary channel model for gamers”. On page 4 of part II of the supplement to their paper, they define a maximum a posteriori estimate that is the same as Dawid and Skene’s maximum likelihood estimate. It’s the same wheel I reinvented in 2008 (I added hierarchical priors because I was asking Andrew Gelman and Jennifer Hill for advice) and that several groups have subsequently reinvented.
I didn’t understand the section about “error control coding” (starting with whether they meant the same thing as what I know as an “error correcting code”). Why have an annotator annotate an item an odd number of times and then take a majority vote? You can build a probabilistic model for reannotation of any number of votes (that presumably would take into account the correlation (fixed effect) of having the same annotator).
Role of Automatic Classifiers
As in Raykar et al.’s 2009 JMLR paper, Mavandadi et al. also include a machine-based system. But it is not tightly linked as in the work of Raykar et al. It’s just trained from the data a la Padhraic Smyth’s mid-1990s model of crowdsourcing crater location data and then training image analysis models on the resulting crowdsourced data.
Mavandadi et al. instead run their automatic classifier first, then if it’s not confident, hand it over to the crowd. This is, by the way, the standard practice in speech-recognition-based automated call centers.
Mavandadi et al. should check out (Sheng et al. 2010), which analyzes when you need to find another label, also using a Dawid-and-Skene-type model of data annotation. It’s also a rather common topic in the epidemiology literature, because it’s the basis of the decision as to which diagnostic test to administer next, if any, in situations like breast cancer diagnosis (which involves notoriously false-positive-prone image tests and notoriously false-negative-prone tissue tests).
I didn’t see any attempt by Mavandadi et al. to calibrate (or even measure) their system’s confidence assessments. I’d wait for that analysis before trusting their output.