Archive for the ‘Data Annotation’ Category

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

September 12, 2011

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.

Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7,which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).

Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posterior (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.

Really Adversarial Turkers?

How in the world did they get a Mechanical turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Adding a Model

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is pairwise independent.

One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.

Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1]). In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.

Modeling Item Difficulty for Annotations of Multinomial Classifications

September 8, 2011

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item i by position \alpha_i and an annotator by position \beta_j and discriminativeness \delta_j, then generate a label for item i by annotator j whose probability of being correct is:

y_{i,j} \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\delta_j(\alpha_i - \beta_j)))

The higher the value of \delta_j, the sharper the distinctions made by annotator j. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).

If there’s no difficulty, equivalently all \alpha_i = 0, so we can reduce the logistic regression to a simple binomial parameter

\theta_j = \mbox{logit}^{-1}(\delta_j \beta_j)

which is the form of model I’ve been proposing for binary data with single response parameter \theta_j, with \theta_{0,j} for specificity and \theta_{1,j} for specificity.

Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is k, each annotator j has a response distribution paraemterized by a multinomial parameter \theta_{j,k}.

This is just Dawid and Skene‘s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this in the standard temperature-based analogy used in annealing. If the difficulty of item i of true category k is \alpha_i, the response of annotator j is flattened from \theta_{j,k} to being proportional to \theta_{j,k}^{1/\alpha_i}. The higher the value of \alpha, the greater the amount of flattening. A value of \alpha_i greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).

General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector \theta_{j,k} to a regression representation \log \theta_{j,k}. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label k' for an item i with true label k for annotator j is proportional to \exp(\theta_{j,k,k'})

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to \exp((1/\alpha_i) \theta_{j,k,k'}) = \exp(\theta_{j,k,k'})^{1/\alpha_i}. It would probably be more common to make the difficulty additive, so you’d get \exp(\alpha_i + \theta_{j,k,k'}).

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until \alpha_i approaches infinity, we’ll still confuse orange with lemon more than with the others.

Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to have the difficulty be one of limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might in the limit of a very hard problem, have a uniform response between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength of assertion ordinal scale. A very borderline case might be confused between a 2 or 3 (positive and very positive), but won’t be confused between a -3 (very negative) modality of assertion.

Genealized Multi-Parameter Approach

What I can do is convert everything to a regression again. And then the generalization’s obvious. Instead of a single difficulty parameter \alpha_i for an item, have a difficulty parameter for each response k, namely \alpha_{i,k}. Now the probablity of annotator j responding with category k' when the true category is k is taken to be proportional to \exp(\beta_{j,k,k'} + \alpha_{i,k'}).

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to \exp(\alpha_{i,k'} \beta_{j,k,k'}).

Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. Examples would be knowing the journal in which a paper came from when doing classification of article subjects. Some journals might be more interdisciplinary and have more confusable papers in general.

Steyvers, Lee, Miller and Hemmer (2009) The Wisdom of Crowds in the Recollection of Order Information

June 23, 2011

I just found Mark Steyvers et al.’s work on models of annotation for rankings:

They also describe the model with a more psych/experimental slant with some more experimental data relating observed (and estimated) expertise to self-reported expertise in:

The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is to then induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotator’s accuracies (but not biases in this case).

Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurston in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

Steyvers et al.’s Model

The Items

The basic idea is to introuce a latent scalar \mu_i for each item i \in 1{:}I being ranked. The ordering of the latent scalars \mu_i induces a complete ordering of items.

The Annotators

Each annotator j \in 1{:}J is characterized by a single noise parameter \sigma_j > 0. These are given what seem like a rather arbitrary prior:

\sigma_i \sim \mbox{\sf Gamma}(\lambda,1/\lambda)

where \lambda is a constant hyperprior (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for \lambda, which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

The Annotations

The annotations that are observed are of the form of complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoneix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar x_{i,j} for each item i and annotator j (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

x_{i,j} \sim \mbox{\sf Norm}(\mu_i,\sigma_j^2).

That is, the latent position x_{i,j} assigned to item i by annotator j is drawn from a normal centered around the true latent location \mu_i for item i with noise determined by the annotator-specific deviation parameter \sigma_j.

The Sampling Trick

There’s only one problem in implemeting this model: the latent x_{i,j} must be consistent with the observed ranking y_{i,j}. As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position x_{i,j}, use a normal distribution truncated to the interval bounded by the next lowest-ranked and next highest-rank item than i (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the x_{i,j}, yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are (\mu, x, \sigma). Given samples for \mu, we can estimate the rank of each item by looking at the rank in each sample \mu^{(k)}. We also get a direct characterization of the noise of an annotator through \sigma_i. We probably don’t care at all about the annotator-specific latent positions x_{i,j}.

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots of histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or good general overview — you find introductions in any book on survey sampling). It’s not uncommon to have the basic data be measured with noise. Especially in an epidemiology setting or any kind of subjective coding by human subjects, like survey responses.

Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by t, we’d have an individual level error \sigma_i for each individual across tasks, and an error \sigma_{t,i} for each task for each user that’s sampled in the neighborhood of \sigma_i. It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the \sigma_i a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.

TREC 2011 Crowdsourcing Track Data Released

June 16, 2011

Here’s the home page:

Clean and Small

The data’s super clean. I love it when task organizers take the time to release clean and easy to use data.

I was a bit surprised that there was so little of it, but I suppose this is just the dev data. There’s around 10K judgments from around 200 Mechanical Turkers for a single TREC topic (think query) with relevance labels. There appears to be the usual Turker quantity of noise annotations and a hugely skewed distribution of number of judgments per Turker (with the active Turkers looking more spammy, again as usual).

Mixed in Gold-Standard Data

One nice twist to the data is that there is also NIST gold-standard judgments for a subset of the data. In fact, the HITs (tasks assigned) for the Turkers had 5 relevance judgments, 1 of which was annotated previously by NIST.

As many of you may know, this is (among) the strategy(s) that CrowdFlower advocates for their customers.

What We Plan to Do

I don’t know how many submissions each team gets, but I’ll at least run the full Bayesian hierarchical model for a single topic (that is, the model Massimo and I described in our LREC tutorial on data annotation models). That’s a fully automatic system. I don’t know if we can remove the spam annotators by hand, but I might run that one, too. And probably something simpler with point estimates and simple voting, but that’d be four submissions. Hopefully they’ll evaluate voting themselves.

Probabilistic Evaluation

I have posted to their mailing list, but no luck getting anyone interested in log loss-based evaluations. Recall that in log loss, your score is the negation of the sum of the log probabilities your system assigned to the correct answers (it’s thus negated total probability, which is a loss function — I really wish optimization had settled on maximizing gain rather than minimizing loss).

Log loss the metric that’s being minimized by my (and any other statistical) model over the training data (that’s the M in maximum a posterior estimation, only we’re dealing with loss, not gain). This evaluation gets at how well calibrated a system’s probabilistic predictions are.

I’ll at least do that for our submissions once we get the “official gold standard” evaluation labels.

Hopefully, they’ll do the usual TREC bangup job of presenting things like precision-recall curves.

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011

Summary:

In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular the near duplicates really foul up any attempts at cross validation where one part of the data is used as training and the other as test data. If there are duplicates then we are training on testing for some examples and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance

Tokenization

A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token seperators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/Tokenize1.java. It produces output:

<ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try an define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " + matcher.group());//print what matched
	}
    }
}

This approach produces the following output:

<ant tokenize2
   [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text that separate the tokens are very different. Between ‘RT’ and ‘mparent77772′ is the separator ‘ @’. Depending on the needs of the application it can be easier to define tokens by what they look like rather than what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports on both the white spaces as well as the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending the ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put in into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/NormalizedTokenizerFactory.java’ there is a main method:


public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz"
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy sounding term that is really just percent word overlap in our usage. If you think it helps you can also call it by the term ‘coefficient de communauté’ but then you might have to reveal that it was popularized by a French botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc the proximity of two character sequences:

 proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ExploreJaccard.java and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List candidateTweets
	    = filterNormalizedDuplicates(texts,
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow.
		String comparisionTweet = candidateTweets.get(j);
		double thisProximity
		    = jaccardD.proximity(targetTweet,comparisionTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisionTweet;
			closestProximity = thisProximity;
		    }
		}
	    }

The goal of the above wildly inefficient program is to explore what the closest tweet is as determined by token overlap. The filterNormalizedDeduplicates(tweets,tokFactory) filters out duplicates as discussed in #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get (click to see larger image):

Jaccard Distance sorted for similarity, click on for larger image

Note that we get some values of 1 even though there are differences in Tweet1 and Tweet2. In row 2 the normalization filters out the url leaving the only difference being the phrase “replacing trains by high speed buses” in Tweet1 has an additional “high speed” in “replacing high speed trains by high speed busses” in Tweet 2. Since Jaccard does not count the number of words in computing distance the additional phrase adds no words to the set of words since it already exists in the tweet.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @
	Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the" finds less overlap between tweets. Only 3% of the tweets have another tweet with 50% overlap. The query "today" yields 14% of tweets overlap. The lesson here is that some queries yield more uniform tweets which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search result entropy is lower (less random) than results for "the" which is more random.

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/DedupeJaccard.java. The interesting bit is:

    public static List filterTweetsJaccard(List texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List filteredTweets = new ArrayList();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j = cutoff) {
		    addTweet = false;
		    break; //one nod to efficency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }

Deduplication along these lines is a big deal for web search as well as cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is http://research.microsoft.com/pubs/130851/ANDD.pdf.

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

<ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
compile:
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

jar:
      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

dedupeJaccard:
     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… http://goo.gl/fb/j5OUm #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. Next post will be creation of a language identification classifiers that will return to the notion of entropy discussed above.

Wrapup

In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will be still tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendancy.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluation the performance of our language id system against it.

Breck

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance, I will instead focus on more ideal solutions that focus on simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list we iterate over the tweets to see whether the the exact string is in the HashSet. Below is the recognition phase from src/DedupeSimpleHash.java:

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}

The code iterates over the CSV format search results, tests to see if a HashSet already contains the string and adds it if the set does not contain the string. I could have just added all the strings and the HashSet would have behaved the same but that is slightly automagical and might prove confusing. Below the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile);
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE);
	ArrayList row = new ArrayList();
	row.add("");
	row.add("");
	row.add("");
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row.clear();
	    row.add("");
	    row.add("");
	    row.add("");
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close();
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml

compile:

jar:

dedupeSimpleHash:
     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
....[snip]....
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good but looking at the ‘unique’ tweets and sorting on the texts in a spread sheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/dHWLod
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/geCufk
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/hp7In9
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/iggM61
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/ijwRNA

Note that your results will vary since we are not running the searches on the same day. These posts only differ in the url for bit.ly which is still pretty much the same tweet as far as I am concerned. Looking further I see that tweets get re-tweeted with a prefix of ‘RT@:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/DedupeNormalizedHash.java shows regular expressions that replace the url and retweets with the empty string.

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}	

Running the normalized approach the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
[snip]
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -
[snip]
     [java] Writing to: runs/Obama.dedupeNormalizedqHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet removing ‘RT @rolandsmartin: ‘ and ‘http://bit.ly/gXZfqN http://fb.me/NRKRnQWy’. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than the problems it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.

Looking at the output of the normalized hash approach reveals more opportunity to eliminate near duplicates and there may be more patterns to exploit but eventually diminishing returns will bring the pattern driven normalization effort to a close. Lets move to more open ended approaches that better address the long tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...

The first pair only differ in an additional ‘<—indeed!’ which is not a likely pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to deduce what algorithm you use to say that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.

Breck

Need Another Label(s)!

November 23, 2010

It occurred to me while working through the math for my last post that there are situations when you not only need another label for an existing item, but need more than one to achieve an expected positive payoff.

Prior and Posterior Odds Ratios

First, let’s summarize the result from adding another annotator. For a given image i, suppose the current odds (aka prior) of clean (c_i = 1) versus porn (c_i = 0) are V:1, which corresponds to a probability of being clean of \mbox{Pr}[c_i =1] = V/(1+V). For instance, if the odds are 4:1 the image is clean, the probability it is clean is 4/5.

Now suppose we get a label y_{i,j} from an annotator j for image i. We update the odds to include the annotator’s label to get new (posterior) odds V', by the following formula if the label is

V' =   V \times \mbox{Pr}[y_{i,j}|c_i=1] \, / \, \mbox{Pr}[y_{i,j}|c_i=0].

We just multiply the prior odds V by the likelihood ratio of the annotation y_{i,j} given that c_i = 1 or c_i = 0. The new probability of a clean image is thus V'/(1+V').

The Porn Payoff

Suppose the payoff for the porn task is as follows. A porn image classified as clean has a payoff of -100, a porn image classified as porn has a cost of 0, a clean image classified as clean has a payoff of 1, and a clean image classified as porn has a payoff of -1.

In this setup, when we have an image, we need odds of better than 100:1 odds (a bit more than 99% probability) that it is clean before we return the decision that it’s clean; otherwise we maximize our payoff saying it’s porn. Unless we are 100% sure of our decision, we always have an expected loss from returning a decision that an image is porn, because the payoff is zero for a true negative (classifying porn as porn), whereas the payoff is -1 for rejecting a non-porn image.

Now suppose that the prevalence of clean images is 20 in 21 (of course, we work with an estimate). So we start with odds of 20:1. The decision to classify the item as clean before annotation has an expected payoff of [20*1 + 1*(-100)]/21 or about -4 per decision. The decision to classify an image as porn before an annotation has a payoff of [20*(-1) + 1*0]/21 which is around -1 per decision.

Need Multiple Annotators

Clearly we need to annotate an item before we can have a positive expected payoff.

Suppose for the sake of argument (and easy arithmetic) that we have an annotator with sensitivity 0.9 (they correctly classify 90% of the clean images as clean and reject 10% of the clean images as porn) and specificity 0.8 (they correctly classify 80% of the porn images as porn and let 20% through as clean).

In this context, we actually need more than one annotator to annotate an item before we get a positive expected payoff. We first need to work through our expectations properly. First, we start with 20:1 odds (probability 20/21) an image is clean, so we can expect to see 20 clean images for each porn image. Then we have to look at the annotator’s expected response. If the image is clean, there’s a 80% chance the annotator says it’s clean and a 20% chance they say it’s porn. If the image is porn, there’s a 90% chance they say it’s porn and a 10% chance they say it’s clean. That let’s us work out the expectations for true category and response a priori. For instance, the chance the image is clean and the annotator says its porn is 20/21 * 0.2 = 4/21.

We then calculate the updated odds under both possible annotator responses and figure out the new odds and weight them by our expectations before seeing the label.

I’ll let you work out the arithmetic, but the upshot is that until you have more than one annotator, you can’t get 100:1 odds or more of an image not being porn. The key step is noting that we only get positive expected payoff if we return that the image is clean, but the posterior odds if the annotator provides a 1 label, which are 20/1 * 0.9/0.2 = 90:1. And don’t forget to factor in that we only land in the happy situation of a getting a non-porn label around 90% of the time, so the expected gain in payoff from the annotation is less than the improvement from 20:1 to 90:1 we get in the ideal case.

In reality, you’re likely to have slightly better porn/not-porn annotators than this because it’s a relatively easy decision problem. But you’re also likely to have spammers, as we mentioned last time, which is really a mixed pool of spammers and cooperators.

Unknown Annotators

I’ll say again that one of the pleasant properties of the hierarchical model extension of Dawid and Skene is that it allows us to predict the behavior for a new unknown annotator. This is particularly useful in a mechanical Turk setting where we can’t choose our annotators directly (though we can feed items to an annotator if we write our own web app and they volunteer to do more).

We just sum over predictions from each posterior sample of the hyperpriors, then sample an annotator from that, calculate what the outcome would be for them, and average the result. What’s extra cool is that this includes all the extra dispersion in posterior inference that’s warranted due to our uncertainty in the hyperpriors representing the population of annotators.

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators would try hard, but the bias in their answers would give them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotator’s accuracies does just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, correspdonding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratios have no effect because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional the sum of the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}) and the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators who have \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain because as we showed in the last section, p(c_i|y_{i,j}) = p(c_i) if annotator j provides random responses, then \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalities for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose have have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice after L_{1,1}, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

LREC 2010 Tutorial: Modeling Data Annotation

May 17, 2010

Massimo and I are giving our tutorial this afternoon in Malta at LREC 2010. Here’s a link to my slides:

Thanks to Massimo for lots of great suggestions and feedback from when I gave the talk in Rovereto Italy last week (U. Trento), when we talked about it at length between then and now, and on the first few drafts of the slides.

And here’s a link to Massimo’s slides:

Massimo’s part was based on his and Ron’s CL Journal review of kappa-like stats, and the tutorials they’ve given in the past. As such, they’re much more mature. The examples are really tight and to the point.

Stochastic Gradient Descent for Probabilistic Training Data

March 23, 2010

I’ve been talking for a while about using the Bayesian hierarchical models of data annotation to generate probabilistic corpora. It’s really easy to extend stochastic gradient descent (SGD) optimization algorithms to handle probabilistic corpora.

I’m going to consider maximum likelihood estimation for binary logistic regression, but the same thing can be done for conditional random field (CRF) taggers, support vector machine (SVM) classifiers, or perceptron classifiers. I’ll discuss priors and multinomials after the main algorithm.

The Training Data

Suppose we have N labeled training instances, each consisting of a D-dimensional input vector x_n and probabilistic outcome y \in [0,1]:

double[N][D] x;  // inputs
double[N] y;  // outputs

A non-probabilistic corpus constrains y_n to be either 0 or 1.

The Model

Binary logistic regression for D-dimensional inputs is parameterized by a D-dimensional vector \beta. The probability of a 1 outcome for a D-dimensional input x is

\mbox{Pr}(y_n = 1 | x_n, \beta)  = \mbox{logit}^{-1}(\beta^{\top}x_n),

where

\mbox{logit}^{-1}(z) = \exp(z)/(1 + \exp(z)) = 1/(1 + \exp(-z)).

Maximizing Log Likelihood

Our goal’s to find the parameter \beta that maximizes the training data likelihood

\mathcal{L}(\beta;x,y) = \sum_{n = 1}^N \log \mbox{Pr}(y=y_n | x=x_n, \beta).

The maximum likelihood estimate is thus

\beta^* = \arg\max_{\beta} \mathcal{L}(\beta;x,y).

Stochastic Gradient Descent

SGD finds the maximum likelihood estimate \beta^* given training data x,y. For simplicity, we’ll only consider a fixed learning rate (\lambda > 0), though the algorithm obviously generalizes to annealing schedules or other learning rate enhancement schemes. We’ll let E be the number of epochs for which to run training, though you could also measure convergence in the algorithm.

β = 0
for (e in 1:E)
    for (n in 1:N)
        yHat[n] = inverseLogit(dotProduct(β,x[n]))
        err = y[n] - yHat[n]
        β += λ * err * x[n]

Easy peasy, as they say in the old country.

Priors

Priors don’t actually interact with the probabilistic training. So you just add them in as usual. I’d recommend either blocking them (as I did for CRFs in LingPipe) to update every few hundred training instances, or using lazy updates (as I did for logistic regression in LingPipe).

Multinomial Classifiers

It’s also easy to generalize to multinomials, where the data consists of a probability for each of K outcomes. You just add an inner loop over the dimensions (minus 1 or not depending on whether you use K or K-1 parameter vectors for a K-dimensional problem), and the error term is calculated versus the probability for that outcome.


Follow

Get every new post delivered to your Inbox.

Join 149 other followers