Brodley and Friedl (1999) Identifying Mislabeled Training Data


Today’s blast from the past is the descriptively titled Identifying Mislabeled Training Data, by Brodley and Friedl (1999).

It’s amazing this paper is so long given the simplicity of their approach; it’s largely a careful evaluation of three approaches on five data sets, plus a thorough literature survey. It’s also amazing just how many people have tackled this problem over time in more or less the same way.

The approach is almost trivial: cross-validate using multiple classifiers, throw away training instances on which the classifiers disagree with the gold standard, then train on the remaining items. They consider three forms of disagreement with the gold label: at least one classifier disagrees, a majority of classifiers disagree, or all classifiers disagree. They consider three classifiers: 1-nearest-neighbor with Euclidean distance, linear classifiers (using the “thermal training rule”, whatever that is), and C4.5-trained decision trees.
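
Here’s a minimal sketch of the filtering step in plain Python, just to make the three disagreement rules concrete. It is not the authors’ code. The function names, the `rule` strings, and the fold count are my own assumptions, and the classifiers are simple stand-ins (1-nearest-neighbor, nearest-centroid, and 3-nearest-neighbor) rather than the paper’s linear machine and C4.5 trees.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(k):
    """Build a k-nearest-neighbor predictor with signature (train, x) -> label."""
    def predict(train, x):
        nearest = sorted(train, key=lambda item: sq_dist(item[0], x))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def nearest_centroid(train, x):
    """Stand-in classifier: predict the label whose class mean is closest to x."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


def filter_mislabeled(data, classifiers, rule="majority", folds=5, seed=0):
    """Return the sorted indices of items that survive filtering.

    data is a list of (features, gold_label) pairs. Each item is judged only
    by classifiers trained on the folds that do not contain it.
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    kept = []
    for k in range(folds):
        held_out = set(order[k::folds])
        train = [data[i] for i in order if i not in held_out]
        for i in held_out:
            features, gold = data[i]
            # Count the classifiers whose prediction disagrees with the gold label.
            wrong = sum(clf(train, features) != gold for clf in classifiers)
            if rule == "any":          # at least one classifier disagrees
                drop = wrong >= 1
            elif rule == "majority":   # more than half disagree
                drop = 2 * wrong > len(classifiers)
            elif rule == "consensus":  # every classifier disagrees
                drop = wrong == len(classifiers)
            else:
                raise ValueError(f"unknown rule: {rule}")
            if not drop:
                kept.append(i)
    return sorted(kept)
```

Calling `filter_mislabeled(data, [knn(1), nearest_centroid, knn(3)], rule="consensus")` returns the indices to keep for the final training run. The "any" rule throws away the most data and "consensus" the least.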

They conclude that filtering improves accuracy, though you’ll have to be more patient than me to dissect the dozens of detailed tables they provide for more insight. (Good for them for including confidence intervals; I still wish more papers did that today.)

The authors are careful to set aside 10% of each data set for testing; the cross-validation-based filtering is on the other 90% of the data, to which they’ve introduced simulated errors.
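
The evaluation setup can be sketched the same way: hold out a clean test set first, then corrupt labels only in what’s left. This is a rough illustration, not the paper’s procedure. The function name, the `noise_rate` parameter, and the choice to draw a wrong label uniformly from the other classes are all assumptions on my part.

```python
import random


def split_and_corrupt(data, labels, noise_rate=0.2, test_fraction=0.1, seed=0):
    """Hold out a clean test set, then flip labels in the remaining items.

    Returns (noisy_train, test, flipped), where flipped holds the positions
    in noisy_train whose labels were changed.
    """
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    n_test = max(1, round(len(items) * test_fraction))
    test, train = items[:n_test], items[n_test:]

    noisy_train, flipped = [], set()
    for position, (features, label) in enumerate(train):
        if rng.random() < noise_rate:
            # Assumption: a corrupted label is drawn uniformly from the other classes.
            label = rng.choice([other for other in labels if other != label])
            flipped.add(position)
        noisy_train.append((features, label))
    return noisy_train, test, flipped
```

Keeping `flipped` around makes it possible to check afterwards how many of the filtered items were actually corrupted and how many were clean.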

One could substitute annotators for algorithms, or even use a mixture of the two to help with active learning. And you could try whatever form of learners you like.

What I’d like to see is a comparison with probabilistic training on the posterior category assignments, as suggested by Padhraic Smyth et al. in their 1994 NIPS paper, Inferring ground truth from subjective labelling of Venus images. I’d also like to see more of an analysis of the effect of noisy data on evaluation, along the lines of Lam and Stork’s 2003 IJCAI paper, Evaluating classifiers by means of test data with noisy labels, because in my experience gold standards are less pure than their audience of users imagines.

4 Responses to “Brodley and Friedl (1999) Identifying Mislabeled Training Data”

  1. Ken Williams Says:

    It seems like the real trick is to distinguish, within the outliers, between improperly coded examples and properly coded true outliers. I don’t know of any great way to do this besides just re-submitting them to the annotators for possible correction. But the true outliers can be exceedingly valuable, and in many cases it’s not worth throwing them away just to get a more homogeneous training set.

  2. Vikas Raykar Says:

    If I remember right, when I read the paper they claim that the outliers identified will be independent of a particular model. For this to be true I think the multiple classifiers will have to be as diverse as possible.

  3. Ramesh Nallapati Says:

    We had a recent publication at IJCAI 2009 workshop on Intelligence and interaction, where we discussed the same problem.

    The main idea of our paper, and a very simple one at that, was to present the misclassified examples to the user for potential label correction, in an active learning framework — something the author alluded to in this posting as well! Our real world as well as synthetic experiments (not very extensive since this is only preliminary work) showed potential benefits of this approach.

  4. lingpipe Says:

    @Ramesh You seem to have overlooked Sheng et al.’s 2008 KDD paper, Get Another Label?, which addresses the same problem as your paper. Sheng et al. model the decision between relabeling an existing item and labeling a new item in an active learning setting.

    And while there may be a quantitative distinction between expert and non-expert labelers, you can’t assume experts always make the right call, or even that all instances have a correct category for a given coding standard iteration. As I showed, the “gold standard” data for MUC, RTE-1, etc., had more errors than the data resulting from Mechanical Turk.

    We actually take the opposite tack to traditional active learning and have users verify the system’s most confident outputs rather than the least confident ones. Our hypothesis is that this helps with building very high precision classifiers, which many of our customers demand. It’d be interesting to see if the operating point (sensitivity/recall, specificity, precision) affects the best strategy for collecting new data.
