Today’s blast from the past is the descriptively titled:
- Brodley, Carla E. and Mark A. Friedl. 1999. Identifying mislabeled training data. JAIR.
It’s amazing this paper is so long given the simplicity of their approach; it’s largely a careful evaluation of three approaches on five data sets, plus a thorough literature survey. It’s also amazing just how many people have tackled this problem over time in more or less the same way.
The approach is almost trivial: cross-validate using multiple classifiers, throw away training instances on which the classifiers disagree with the gold standard, then train on the remaining items. They consider three forms of disagreement: at least one classifier disagrees, a majority of classifiers disagree, and all classifiers disagree with the gold label. They consider three classifiers: 1-nearest-neighbor with Euclidean distance, linear classifiers (using the “thermal training rule”, whatever that is), and C4.5-trained decision trees.
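To make the filtering step concrete, here is a minimal sketch in plain Python. It is not the authors' code: the three toy classifiers (1-nearest-neighbor, 3-nearest-neighbor, nearest centroid) stand in for their 1-NN, linear, and C4.5 learners, and the function names, the `rule` strings, and the five-fold default are all my own assumptions for illustration.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_nn(train, x):
    """Predict the label of the single closest training item."""
    return min(train, key=lambda ex: sq_dist(ex[0], x))[1]

def three_nn(train, x):
    """Predict the most common label among the three closest items."""
    nearest = sorted(train, key=lambda ex: sq_dist(ex[0], x))[:3]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def nearest_centroid(train, x):
    """Predict the label whose class mean is closest to x."""
    sums, counts = {}, Counter()
    for feats, label in train:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
    def dist(label):
        mean = [s / counts[label] for s in sums[label]]
        return sq_dist(mean, x)
    return min(counts, key=dist)

def filter_mislabeled(data, classifiers, rule="majority", n_folds=5, seed=0):
    """Return sorted indices of the items that survive filtering.

    data is a list of (features, gold_label) pairs. Each item is
    classified by every classifier trained on the other folds, and is
    dropped according to rule:
      "any"       - at least one classifier disagrees with the gold label
      "majority"  - more than half of the classifiers disagree
      "consensus" - every classifier disagrees
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    keep = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in order if i not in held_out]
        for i in fold:
            feats, gold = data[i]
            wrong = sum(clf(train, feats) != gold for clf in classifiers)
            if rule == "any":
                flagged = wrong >= 1
            elif rule == "majority":
                flagged = wrong > len(classifiers) / 2
            elif rule == "consensus":
                flagged = wrong == len(classifiers)
            else:
                raise ValueError("unknown rule: " + rule)
            if not flagged:
                keep.append(i)
    return sorted(keep)
```

Calling `filter_mislabeled(data, [one_nn, three_nn, nearest_centroid], rule="consensus")` gives back the indices to retrain on. The `"any"` rule is the most aggressive and will toss some clean items that sit next to a bad one; `"consensus"` is the most conservative.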
They conclude that filtering improves accuracy, though you’ll have to be more patient than me to dissect the dozens of detailed tables they provide for more insight. (Good for them for including confidence intervals; I still wish more papers did that today.)
The authors are careful to set aside 10% of each data set for testing; the cross-validation-based filtering is on the other 90% of the data, to which they’ve introduced simulated errors.
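The evaluation setup is easy to mock up too. The sketch below holds out a clean test set and corrupts the rest; flipping each training label to a uniformly random different class is my assumption, since all I've said above is that the errors were simulated, and `split_and_corrupt` and its parameters are hypothetical names.

```python
import random

def split_and_corrupt(data, labels, noise_rate, test_fraction=0.1, seed=0):
    """Hold out a clean test set, then inject label noise into the rest.

    data is a list of (features, label) pairs and labels is the list of
    all possible classes. Returns (train, test, flipped), where flipped
    is the set of positions in train whose label was changed.
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    n_test = round(len(data) * test_fraction)
    # The test items keep their original labels.
    test = [data[i] for i in order[:n_test]]
    train, flipped = [], set()
    for i in order[n_test:]:
        feats, label = data[i]
        if rng.random() < noise_rate:
            # Assumed noise model: swap in a different class uniformly at random.
            label = rng.choice([c for c in labels if c != label])
            flipped.add(len(train))
        train.append((feats, label))
    return train, test, flipped
```

Because `flipped` records which training items were corrupted, you can check not only downstream accuracy on the clean test set but also how many of the injected errors a filter actually catches.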
One could substitute annotators for algorithms, or even use a mixture of the two to help with active learning. And you could try whatever form of learners you like.
What I’d like to see is a comparison with probabilistic training on the posterior category assignments, as suggested by Padhraic Smyth et al. in their 1994 NIPS paper, Inferring ground truth from subjective labelling of Venus images. I’d also like to see more of an analysis of the effect of noisy labels on evaluation, along the lines of Lam and Stork’s 2003 IJCAI paper, Evaluating classifiers by means of test data with noisy labels. My experience is that gold standards are less pure than their audience of users imagines.