Archive for the ‘Data Annotation’ Category

Steyvers, Lee, Miller and Hemmer (2009) The Wisdom of Crowds in the Recollection of Order Information

June 23, 2011

I just found Mark Steyvers et al.’s work on models of annotation for rankings:

They also describe the model with a more psych/experimental slant with some more experimental data relating observed (and estimated) expertise to self-reported expertise in:

The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is then to induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotators’ accuracies (but not their biases in this case).

Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurstone in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

Steyvers et al.’s Model

The Items

The basic idea is to introduce a latent scalar \mu_i for each item i \in 1{:}I being ranked. The ordering of the latent scalars \mu_i induces a complete ordering of items.

The Annotators

Each annotator j \in 1{:}J is characterized by a single noise parameter \sigma_j > 0. These are given what seems like a rather arbitrary prior:

\sigma_j \sim \mbox{\sf Gamma}(\lambda,1/\lambda)

where \lambda is a constant hyperparameter (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for \lambda, which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

The Annotations

The annotations that are observed are complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoenix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar x_{i,j} for each item i and annotator j (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

x_{i,j} \sim \mbox{\sf Norm}(\mu_i,\sigma_j^2).

That is, the latent position x_{i,j} assigned to item i by annotator j is drawn from a normal centered around the true latent location \mu_i for item i with noise determined by the annotator-specific deviation parameter \sigma_j.
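To make the generative story concrete, here is a minimal Java sketch (my own illustration, not the authors’ code) that simulates one annotator’s reported ranking by drawing the latent positions x_{i,j} and sorting the items by them; all names are mine.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class SimulateRanker {
    // Draw x[i] ~ Norm(mu[i], sigma^2) for each item and return the item
    // indices sorted by x, i.e., the annotator's reported ranking.
    public static Integer[] ranking(double[] mu, double sigma, Random rng) {
        final double[] x = new double[mu.length];
        for (int i = 0; i < mu.length; ++i)
            x[i] = mu[i] + sigma * rng.nextGaussian();
        Integer[] order = new Integer[mu.length];
        for (int i = 0; i < order.length; ++i)
            order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> x[i]));
        return order;
    }

    public static void main(String[] args) {
        // True latent positions; a noisier annotator scrambles the order more.
        double[] mu = {1.0, 2.0, 3.0, 4.0, 5.0};
        Random rng = new Random(42);
        System.out.println(Arrays.toString(ranking(mu, 0.5, rng)));
    }
}
```

With \sigma_j = 0 the annotator recovers the true order exactly; as \sigma_j grows, adjacent items (those with close \mu_i) get swapped first.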

The Sampling Trick

There’s only one problem in implementing this model: the latent x_{i,j} must be consistent with the observed ranking y_{i,j}. As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position x_{i,j}, sample from a normal distribution truncated to the interval bounded by the positions of the items ranked immediately below and immediately above item i (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).
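As a sketch of that conditional update (again my own illustration, not the authors’ implementation), one can resample a latent position with a naive rejection sampler; real implementations typically use the inverse-CDF method instead, which never rejects.

```java
import java.util.Random;

public class TruncatedNormalSampler {
    // Naive rejection sampler: redraw from Norm(mu, sigma^2) until the
    // draw lands in (lo, hi).  Fine for a sketch, but slow when the
    // interval has low probability under the untruncated normal.
    public static double sample(double mu, double sigma,
                                double lo, double hi, Random rng) {
        double x;
        do {
            x = mu + sigma * rng.nextGaussian();
        } while (x <= lo || x >= hi);
        return x;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Resample one latent position pinned between its rank neighbors.
        System.out.println(sample(0.0, 1.0, -0.5, 0.5, rng));
    }
}
```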

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the x_{i,j}, yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are (\mu, x, \sigma). Given samples for \mu, we can estimate the rank of each item by looking at its rank in each sample \mu^{(k)}. We also get a direct characterization of the noise of annotator j through \sigma_j. We probably don’t care at all about the annotator-specific latent positions x_{i,j}.
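For instance, a hypothetical post-processing helper (names are mine, not the paper’s) that turns posterior samples of \mu into per-item rank histograms:

```java
public class PosteriorRanks {
    // For each item, count how often it takes each rank across posterior
    // samples muSamples[k][i]; row i is item i's posterior rank histogram.
    public static int[][] rankCounts(double[][] muSamples) {
        int numItems = muSamples[0].length;
        int[][] counts = new int[numItems][numItems];
        for (double[] mu : muSamples)
            for (int i = 0; i < numItems; ++i) {
                int rank = 0;               // rank = number of items below
                for (int j = 0; j < numItems; ++j)
                    if (mu[j] < mu[i])
                        ++rank;
                counts[i][rank]++;
            }
        return counts;
    }
}
```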

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots or histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or general overview — you find introductions in any book on survey sampling). It’s not uncommon for the basic data to be measured with noise, especially in an epidemiology setting or in any kind of subjective coding by human subjects, like survey responses.

Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by t, we’d have an individual level error \sigma_i for each individual across tasks, and an error \sigma_{t,i} for each task for each user that’s sampled in the neighborhood of \sigma_i. It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the \sigma_i a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.

TREC 2011 Crowdsourcing Track Data Released

June 16, 2011

Here’s the home page:

Clean and Small

The data’s super clean. I love it when task organizers take the time to release clean and easy to use data.

I was a bit surprised that there was so little of it, but I suppose this is just the dev data. There are around 10K judgments from around 200 Mechanical Turkers for a single TREC topic (think query) with relevance labels. There appears to be the usual quantity of noisy Turker annotations and a hugely skewed distribution of number of judgments per Turker (with the most active Turkers looking more spammy, again as usual).

Mixed in Gold-Standard Data

One nice twist to the data is that there are also NIST gold-standard judgments for a subset of the data. In fact, the HITs (tasks assigned) for the Turkers had 5 relevance judgments, 1 of which was previously annotated by NIST.

As many of you may know, this is among the strategies that CrowdFlower advocates for its customers.

What We Plan to Do

I don’t know how many submissions each team gets, but I’ll at least run the full Bayesian hierarchical model for a single topic (that is, the model Massimo and I described in our LREC tutorial on data annotation models). That’s a fully automatic system. I don’t know if we can remove the spam annotators by hand, but I might run that one, too. And probably something simpler with point estimates and simple voting, but that’d be four submissions. Hopefully they’ll evaluate voting themselves.

Probabilistic Evaluation

I have posted to their mailing list, but no luck getting anyone interested in log loss-based evaluations. Recall that in log loss, your score is the negation of the sum of the log probabilities your system assigned to the correct answers (it’s thus a negated total log probability, which makes it a loss function — I really wish optimization had settled on maximizing gain rather than minimizing loss).

Log loss is the metric that’s being minimized over the training data by my model (and any other statistical model); that’s the M in maximum a posteriori estimation, only we’re dealing with loss, not gain. This evaluation gets at how well calibrated a system’s probabilistic predictions are.
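A minimal sketch of the metric (my own helper, not TREC code): given the probability the system assigned to each gold-standard label, log loss is just a negated sum of logs.

```java
public class LogLoss {
    // Log loss over a test set: negate the sum of the log probabilities
    // the system assigned to the correct answers.  Lower is better; 0
    // means probability 1 was assigned to every gold label.
    public static double logLoss(double[] probOfCorrectLabel) {
        double total = 0.0;
        for (double p : probOfCorrectLabel)
            total -= Math.log(p);
        return total;
    }
}
```

Note how calibration matters: a system that assigns probability 0 to a correct answer gets an infinite loss, so hedged probabilities beat overconfident ones.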

I’ll at least do that for our submissions once we get the “official gold standard” evaluation labels.

Hopefully, they’ll do the usual TREC bang-up job of presenting things like precision-recall curves.

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011


In Post #1 we covered how to search Twitter and get a useful on-disk representation of the search results. In Post #2 we covered the first level of deduplication, using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets, which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command

svn co

Despite the solutions above, we still have annoyingly similar tweets that are not useful for my goals in this series of posts. In particular, the near duplicates really foul up any attempt at cross validation, where one part of the data is used for training and the other as test data. If there are duplicates, then we are effectively training on test data for some examples, and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance


A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token separators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/ It produces output:

>ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try to define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher 
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " +;//print what matched
	}
    }
}

This approach produces the following output:

>ant tokenize2
   [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text separating the tokens varies quite a bit. Between ‘RT’ and ‘mparent77772’ the separator is ‘ @’. Depending on the needs of the application, it can be easier to define tokens by what they look like rather than by what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports both the whitespaces and the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/’ there is a main method:

public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
}

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy-sounding term for what, in our usage, is really just percent word overlap. If you think it helps, you can also call it by the term ‘coefficient de communauté’, but then you might have to reveal that it was popularized by a French botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc, the proximity of two character sequences:

   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))
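For concreteness, here is a plain-Java rendering of that formula using whitespace tokens and HashSets (a sketch of the computation, not the LingPipe implementation, which takes a TokenizerFactory):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    // Jaccard proximity: |intersection| / |union| of the two token sets.
    public static double proximity(String cs1, String cs2) {
        Set<String> s1 = new HashSet<>(Arrays.asList(cs1.split("\\s+")));
        Set<String> s2 = new HashSet<>(Arrays.asList(cs2.split("\\s+")));
        Set<String> intersection = new HashSet<>(s1);
        intersection.retainAll(s2);
        Set<String> union = new HashSet<>(s1);
        union.addAll(s2);
        return intersection.size() / (double) union.size();
    }
}
```

For example, "a b c" versus "a b d" shares 2 of 4 distinct tokens, for a proximity of 0.5.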

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List candidateTweets 
	    = filterNormalizedDuplicates(texts, 
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow. 
		String comparisionTweet = candidateTweets.get(j);
		double thisProximity 
		    = jaccardD.proximity(targetTweet,comparisionTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisionTweet;
			closestProximity = thisProximity;
		    }
		}
	    }
	}

The goal of the above wildly inefficient program is to explore what the closest tweet is, as determined by token overlap. The filterNormalizedDuplicates(texts, tokFactory) call filters out duplicates as discussed in post #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get:

Jaccard Distance sorted for similarity

Note that we get some values of 1 even though there are differences in Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving the only difference being that “replacing trains by high speed buses” in Tweet1 appears as “replacing high speed trains by high speed busses” in Tweet2. Since Jaccard operates on the set of words rather than word counts, the repeated “high speed” adds nothing to the set of words, so the proximity is still 1.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun via @
	Vet, 77, Busted For Obama Death Threat #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping, yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the" finds less overlap between tweets: only 3% of the tweets have another tweet with 50% overlap. For the query "today", 14% of the tweets have such an overlap. The lesson here is that some queries yield more uniform tweets, which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the search results varies with the query: the "obama" results have lower entropy (are less random) than the results for "the".
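As a rough illustration of that entropy observation, here is a hypothetical helper (not part of this post’s code) that computes the Shannon entropy of the unigram token distribution over a collection of tweets; more uniform result sets score lower.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenEntropy {
    // Shannon entropy (in bits) of the unigram token distribution over a
    // collection of texts, using naive whitespace tokenization.
    public static double entropy(Iterable<String> texts) {
        Map<String,Integer> counts = new HashMap<>();
        long total = 0;
        for (String text : texts)
            for (String tok : text.split("\\s+")) {
                counts.merge(tok, 1, Integer::sum);
                ++total;
            }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / (double) total;
            h -= p * Math.log(p) / Math.log(2.0);  // convert nats to bits
        }
        return h;
    }
}
```

A result set full of near-duplicate tweets concentrates probability on a few tokens and so has lower entropy than a set of unrelated tweets of the same size.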

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/ The interesting bit is:

    public static List filterTweetsJaccard(List texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List filteredTweets = new ArrayList();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = (String) texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j < filteredTweets.size(); ++j) {
		String keptTweet = (String) filteredTweets.get(j);
		if (jaccardD.proximity(targetTweet,keptTweet) >= cutoff) {
		    addTweet = false;
		    break; //one nod to efficiency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }

Deduplication along these lines is a big deal for web search as well as for cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before, which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets, as well as into more sophisticated similarity metrics. A recent paper with a decent overview of past work is

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

>ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. The next post will cover the creation of a language identification classifier that will return to the notion of entropy discussed above.


In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will be still tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.


Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command

svn co

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance; I will instead focus on more ideal solutions that emphasize simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list, we iterate over them to see whether the exact string is in the HashSet. Below is the recognition phase from src/

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	int seenCounter = 0;
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i +": " + tweet);
		++seenCounter;
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}
The code iterates over the CSV format search results, tests to see if the HashSet already contains the string, and adds it if not. I could have just added all the strings and the HashSet would have behaved the same, but that is slightly automagical and might prove confusing. Below is the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile); 
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter 
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer 
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 
	ArrayList<String> row = new ArrayList<String>();
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row.clear();
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close();
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml



     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good, but looking at the ‘unique’ tweets and sorting on the texts in a spreadsheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...

Note that your results will vary since we are not running the searches on the same day. These posts differ only in the url, which makes them pretty much the same tweet as far as I am concerned. Looking further, I see that tweets get re-tweeted with an ‘RT @<user>:’ prefix pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/ shows regular expressions that replace the url and retweets with the empty string.

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);	    
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}

Running the normalized approach the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
     [java] Writing to: runs/Obama.dedupeNormalizedHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet, removing ‘RT @rolandsmartin: ’ and the trailing url. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way, allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.

Looking at the output of the normalized hash approach reveals more opportunities to eliminate near duplicates, and there may be more patterns to exploit, but eventually diminishing returns will bring the pattern-driven normalization effort to a close. Let’s move to more open-ended approaches that better address the long-tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair only differ in an additional ‘<--- indeed!’, which is not a likely pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to decide what algorithm to use to say that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.


Need Another Label(s)!

November 23, 2010

It occurred to me while working through the math for my last post that there are situations when you not only need another label for an existing item, but need more than one to achieve an expected positive payoff.

Prior and Posterior Odds Ratios

First, let’s summarize the result from adding another annotator. For a given image i, suppose the current odds (aka prior) of clean (c_i = 1) versus porn (c_i = 0) are V:1, which corresponds to a probability of being clean of \mbox{Pr}[c_i =1] = V/(1+V). For instance, if the odds are 4:1 the image is clean, the probability it is clean is 4/5.

Now suppose we get a label y_{i,j} from an annotator j for image i. We update the odds to include the annotator’s label, getting new (posterior) odds V' by the following formula:

V' =   V \times \mbox{Pr}[y_{i,j}|c_i=1] \, / \, \mbox{Pr}[y_{i,j}|c_i=0].

We just multiply the prior odds V by the likelihood ratio of the annotation y_{i,j} given that c_i = 1 or c_i = 0. The new probability of a clean image is thus V'/(1+V').

The Porn Payoff

Suppose the payoff for the porn task is as follows. A porn image classified as clean has a payoff of -100, a porn image classified as porn has a payoff of 0, a clean image classified as clean has a payoff of 1, and a clean image classified as porn has a payoff of -1.

In this setup, when we have an image, we need odds of better than 100:1 odds (a bit more than 99% probability) that it is clean before we return the decision that it’s clean; otherwise we maximize our payoff saying it’s porn. Unless we are 100% sure of our decision, we always have an expected loss from returning a decision that an image is porn, because the payoff is zero for a true negative (classifying porn as porn), whereas the payoff is -1 for rejecting a non-porn image.

Now suppose that the prevalence of clean images is 20 in 21 (of course, we work with an estimate). So we start with odds of 20:1. The decision to classify the item as clean before annotation has an expected payoff of [20*1 + 1*(-100)]/21 or about -4 per decision. The decision to classify an image as porn before an annotation has a payoff of [20*(-1) + 1*0]/21 which is around -1 per decision.
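For concreteness, the arithmetic above can be checked with a few lines of Java; the payoff values and the 20:1 prior odds are the ones from the text:

```java
// Expected payoff per decision before any annotation, using the payoff
// matrix from the text: clean-as-clean = 1, clean-as-porn = -1,
// porn-as-clean = -100, porn-as-porn = 0, and prior odds 20:1 clean.
public class PriorPayoff {

    // expected payoff of deciding "clean" given probability pClean
    static double payoffClean(double pClean) {
        return pClean * 1.0 + (1.0 - pClean) * -100.0;
    }

    // expected payoff of deciding "porn" given probability pClean
    static double payoffPorn(double pClean) {
        return pClean * -1.0 + (1.0 - pClean) * 0.0;
    }

    public static void main(String[] args) {
        double pClean = 20.0 / 21.0;  // 20:1 odds
        System.out.println("decide clean: " + payoffClean(pClean)); // about -3.81
        System.out.println("decide porn : " + payoffPorn(pClean));  // about -0.95
    }
}
```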

Need Multiple Annotators

Clearly we need to annotate an item before we can have a positive expected payoff.

Suppose for the sake of argument (and easy arithmetic) that we have an annotator with sensitivity 0.9 (they correctly classify 90% of the clean images as clean and reject 10% of the clean images as porn) and specificity 0.8 (they correctly classify 80% of the porn images as porn and let 20% through as clean).

In this context, we actually need more than one annotator to annotate an item before we get a positive expected payoff. We first need to work through our expectations properly. We start with 20:1 odds (probability 20/21) that an image is clean, so we expect to see 20 clean images for each porn image. Then we have to look at the annotator’s expected response. If the image is clean, there’s a 90% chance the annotator says it’s clean and a 10% chance they say it’s porn. If the image is porn, there’s an 80% chance they say it’s porn and a 20% chance they say it’s clean. That lets us work out the a priori expectations for true category and response. For instance, the chance the image is clean and the annotator says it’s porn is 20/21 * 0.1 = 2/21.

We then calculate the updated odds under both possible annotator responses and figure out the new odds and weight them by our expectations before seeing the label.

I’ll let you work out the arithmetic, but the upshot is that with only one annotator, you can’t get to 100:1 odds or better that an image is not porn. The key step is noting that we only get a positive expected payoff if we return that the image is clean, but the posterior odds after the annotator provides a clean (1) label are only 20/1 * 0.9/0.2 = 90:1, which falls short of the 100:1 threshold. And don’t forget to factor in that we only land in the happy situation of getting a non-porn label around 90% of the time, so the expected gain in payoff from the annotation is less than the improvement from 20:1 to 90:1 we get in the ideal case.
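A minimal sketch of the odds update, with the sensitivity and specificity from the example above; running it shows a single clean label only gets us to 90:1, while a second clean label pushes us past the 100:1 threshold:

```java
// Posterior odds update V' = V * likelihood ratio, with label 1 = "clean".
// Sensitivity 0.9 and specificity 0.8 are the values from the text.
public class OddsUpdate {

    // V = prior odds clean:porn; returns updated odds after one label
    static double update(double V, double sens, double spec, int label) {
        return (label == 1)
            ? V * sens / (1.0 - spec)   // clean label: multiply by 0.9/0.2
            : V * (1.0 - sens) / spec;  // porn label: multiply by 0.1/0.8
    }

    public static void main(String[] args) {
        double V = update(20.0, 0.9, 0.8, 1);
        System.out.println("after one clean label : " + V + ":1");  // 90:1
        System.out.println("after two clean labels: "
                           + update(V, 0.9, 0.8, 1) + ":1");        // 405:1
    }
}
```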

In reality, you’re likely to have slightly better porn/not-porn annotators than this because it’s a relatively easy decision problem. But you’re also likely to have spammers, as we mentioned last time, which means your annotator pool is really a mix of spammers and cooperators.

Unknown Annotators

I’ll say again that one of the pleasant properties of the hierarchical model extension of Dawid and Skene is that it allows us to predict the behavior for a new unknown annotator. This is particularly useful in a mechanical Turk setting where we can’t choose our annotators directly (though we can feed items to an annotator if we write our own web app and they volunteer to do more).

We just sum over predictions from each posterior sample of the hyperpriors, then sample an annotator from that, calculate what the outcome would be for them, and average the result. What’s extra cool is that this includes all the extra dispersion in posterior inference that’s warranted due to our uncertainty in the hyperpriors representing the population of annotators.
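As a sketch of the idea (not the actual model; I’m simplifying to an assumed Beta population distribution over sensitivities purely for illustration), we can average the population mean over posterior samples of hypothetical hyperparameters a and b:

```java
// Sketch of posterior-predictive inference for a new annotator's
// sensitivity. ASSUMPTION for illustration: annotator sensitivities are
// distributed Beta(a, b) in the population, and we have posterior samples
// of (a, b). Averaging the population mean over samples propagates our
// uncertainty in the hyperparameters into the prediction.
public class NewAnnotator {

    static double predictedSensitivity(double[] a, double[] b) {
        double sum = 0.0;
        for (int s = 0; s < a.length; ++s)
            sum += a[s] / (a[s] + b[s]);  // mean of Beta(a[s], b[s])
        return sum / a.length;
    }

    public static void main(String[] args) {
        // two hypothetical posterior samples of the hyperparameters
        double[] a = {8.0, 10.0};
        double[] b = {2.0, 2.0};
        System.out.println(predictedSensitivity(a, b));
    }
}
```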

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators try hard, but the bias in their answers gives them low overall exact-match accuracy. Their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotators’ accuracies do just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.
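The odds-ratio form of the update is easy to code up; the prevalence, sensitivities, specificities, and labels below are made-up values for illustration:

```java
// Odds-ratio form of the posterior for one item's category: start from the
// prevalence odds pi/(1-pi) and multiply in each annotator's likelihood
// ratio, depending on whether their label was 1 or 0.
public class CategoryOdds {

    static double posteriorOdds(double pi, double[] sens, double[] spec, int[] y) {
        double odds = pi / (1.0 - pi);
        for (int j = 0; j < y.length; ++j)
            odds *= (y[j] == 1)
                ? sens[j] / (1.0 - spec[j])    // positive label
                : (1.0 - sens[j]) / spec[j];   // negative label
        return odds;
    }

    public static void main(String[] args) {
        double[] sens = {0.9, 0.8, 0.95};  // hypothetical annotator accuracies
        double[] spec = {0.85, 0.9, 0.7};
        int[] y = {1, 1, 0};               // their labels for one item
        double odds = posteriorOdds(0.5, sens, spec, y);
        System.out.println("Pr[c=1|y] = " + odds / (1.0 + odds));
    }
}
```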

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, corresponding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratios have no effect because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional to the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}), plus the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators with \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain: as we showed in the last section, p(c_i|y_{i,j}) = p(c_i) when annotator j provides random responses, so \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].
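The expected information gain is straightforward to compute from Q_1 and Q_2. This sketch uses base-2 entropy and illustrative accuracy values, and confirms numerically that a spammy annotator with \theta_{0,j} = 1 - \theta_{1,j} contributes zero gain:

```java
// Expected information gain (in bits) from one annotator's label,
// following the entropy formulas in the text; phi is the current Pr[c=1].
public class InfoGain {

    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2);
    }

    static double gain(double phi, double sens, double spec) {
        double q1 = phi * sens + (1 - phi) * (1 - spec);  // Pr[y=1]
        double q2 = (1 - phi) * spec + phi * (1 - sens);  // Pr[y=0]
        double post1 = phi * sens / q1;                   // Pr[c=1|y=1]
        double post0 = phi * (1 - sens) / q2;             // Pr[c=1|y=0]
        return entropy(phi) - (q1 * entropy(post1) + q2 * entropy(post0));
    }

    public static void main(String[] args) {
        System.out.println("good annotator: " + gain(0.5, 0.9, 0.8));
        // spec = 1 - sens, so this annotator is pure noise: zero gain
        System.out.println("spammer       : " + gain(0.5, 0.3, 0.7));
    }
}
```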

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalties for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose we have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice in the L_{1,1} term: once for the probability that the true category is 1 and once for the probability that 1 is the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.
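The quantized decision rule fits in a few lines. The loss matrix below is illustrative only (loss 100 for classifying porn as clean, 1 for the reverse, 0 on the diagonal), not values from the paper:

```java
// Quantized decision rule: return the category with the smaller expected
// loss. L[k][k'] is the loss for classifying true category k as k';
// phi is the current estimate of Pr[c=1] (1 = clean, 0 = porn here).
public class MinLossDecision {

    static int decide(double[][] L, double phi) {
        double loss1 = L[1][1] * phi + L[0][1] * (1 - phi);  // loss of returning 1
        double loss0 = L[0][0] * (1 - phi) + L[1][0] * phi;  // loss of returning 0
        return loss1 <= loss0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // illustrative losses: porn-as-clean costs 100, clean-as-porn costs 1
        double[][] L = {{0.0, 100.0}, {1.0, 0.0}};
        System.out.println(decide(L, 0.95));   // not confident enough: returns 0
        System.out.println(decide(L, 0.999));  // past the threshold: returns 1
    }
}
```

With these losses, the rule returns “clean” exactly when \phi > 100/101, matching the 100:1 odds threshold discussed in the earlier post.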

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis et al. say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? The model does adjust for such spammers in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

LREC 2010 Tutorial: Modeling Data Annotation

May 17, 2010

Massimo and I are giving our tutorial this afternoon in Malta at LREC 2010. Here’s a link to my slides:

Thanks to Massimo for lots of great suggestions and feedback from when I gave the talk in Rovereto Italy last week (U. Trento), when we talked about it at length between then and now, and on the first few drafts of the slides.

And here’s a link to Massimo’s slides:

Massimo’s part was based on his and Ron’s CL Journal review of kappa-like stats, and the tutorials they’ve given in the past. As such, they’re much more mature. The examples are really tight and to the point.

Stochastic Gradient Descent for Probabilistic Training Data

March 23, 2010

I’ve been talking for a while about using the Bayesian hierarchical models of data annotation to generate probabilistic corpora. It’s really easy to extend stochastic gradient descent (SGD) optimization algorithms to handle probabilistic corpora.

I’m going to consider maximum likelihood estimation for binary logistic regression, but the same thing can be done for conditional random field (CRF) taggers, support vector machine (SVM) classifiers, or perceptron classifiers. I’ll discuss priors and multinomials after the main algorithm.

The Training Data

Suppose we have N labeled training instances, each consisting of a D-dimensional input vector x_n and a probabilistic outcome y_n \in [0,1]:

double[][] x = new double[N][D];  // inputs
double[] y = new double[N];       // outputs

A non-probabilistic corpus constrains y_n to be either 0 or 1.

The Model

Binary logistic regression for D-dimensional inputs is parameterized by a D-dimensional vector \beta. The probability of a 1 outcome for a D-dimensional input x is

\mbox{Pr}(y_n = 1 | x_n, \beta)  = \mbox{logit}^{-1}(\beta^{\top}x_n),

where the inverse logit function is defined by

\mbox{logit}^{-1}(z) = \exp(z)/(1 + \exp(z)) = 1/(1 + \exp(-z)).

Maximizing Log Likelihood

Our goal’s to find the parameter \beta that maximizes the training data likelihood

\mathcal{L}(\beta;x,y) = \sum_{n = 1}^N \log \mbox{Pr}(y=y_n | x=x_n, \beta).

The maximum likelihood estimate is thus

\beta^* = \arg\max_{\beta} \mathcal{L}(\beta;x,y).

Stochastic Gradient Descent

SGD finds the maximum likelihood estimate \beta^* given training data x,y. For simplicity, we’ll only consider a fixed learning rate (\lambda > 0), though the algorithm obviously generalizes to annealing schedules or other learning rate enhancement schemes. We’ll let E be the number of epochs for which to run training, though you could also measure convergence in the algorithm.

β = 0
for (e in 1:E)
    for (n in 1:N)
        yHat = inverseLogit(dotProduct(β, x[n]))
        err = y[n] - yHat
        β += λ * err * x[n]

Easy peasy, as they say in the old country.
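Here’s the pseudocode as a self-contained Java sketch. The tiny data set in main is made up, but shows the point of the post: the same update works unchanged whether y[n] is a hard 0/1 label or a probabilistic label in [0,1]:

```java
// Fixed-rate SGD for binary logistic regression with probabilistic labels.
public class Sgd {

    static double inverseLogit(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // x: N x D inputs; y: N probabilistic labels in [0,1];
    // lambda: learning rate; epochs: number of passes over the data
    static double[] fit(double[][] x, double[] y, double lambda, int epochs) {
        double[] beta = new double[x[0].length];
        for (int e = 0; e < epochs; ++e)
            for (int n = 0; n < x.length; ++n) {
                double z = 0.0;
                for (int d = 0; d < beta.length; ++d)
                    z += beta[d] * x[n][d];
                double err = y[n] - inverseLogit(z);  // residual on probability scale
                for (int d = 0; d < beta.length; ++d)
                    beta[d] += lambda * err * x[n][d];
            }
        return beta;
    }

    public static void main(String[] args) {
        // two 1-D inputs with probabilistic labels
        double[][] x = {{1.0}, {-1.0}};
        double[] y = {0.9, 0.1};
        double[] beta = fit(x, y, 0.1, 1000);
        System.out.println(inverseLogit(beta[0]));  // approaches 0.9
    }
}
```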


Priors

Priors don’t actually interact with the probabilistic training. So you just add them in as usual. I’d recommend either blocking them (as I did for CRFs in LingPipe) to update every few hundred training instances, or using lazy updates (as I did for logistic regression in LingPipe).

Multinomial Classifiers

It’s also easy to generalize to multinomials, where the data consists of a probability for each of K outcomes. You just add an inner loop over the dimensions (minus 1 or not depending on whether you use K or K-1 parameter vectors for a K-dimensional problem), and the error term is calculated versus the probability for that outcome.
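A sketch of the multinomial update, using K parameter vectors and a softmax in place of the inverse logit; the error term for each outcome k is the difference between the label probability y[k] and the model’s current estimate:

```java
// One SGD update for one item in multinomial logistic regression with a
// probabilistic label y, which is a distribution over K outcomes.
public class MultinomialSgd {

    static double[] softmax(double[] z) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : z) max = Math.max(max, v);  // subtract max for stability
        double[] p = new double[z.length];
        double sum = 0.0;
        for (int k = 0; k < z.length; ++k) {
            p[k] = Math.exp(z[k] - max);
            sum += p[k];
        }
        for (int k = 0; k < z.length; ++k) p[k] /= sum;
        return p;
    }

    // beta: K x D parameter vectors, updated in place
    static void update(double[][] beta, double[] x, double[] y, double lambda) {
        double[] z = new double[beta.length];
        for (int k = 0; k < beta.length; ++k)
            for (int d = 0; d < x.length; ++d)
                z[k] += beta[k][d] * x[d];
        double[] yHat = softmax(z);
        for (int k = 0; k < beta.length; ++k)
            for (int d = 0; d < x.length; ++d)
                beta[k][d] += lambda * (y[k] - yHat[k]) * x[d];
    }
}
```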

Dekel and Shamir (2009) Vox Populi: Collecting High-Quality Labels from a Crowd

January 22, 2010

Aleks just sent me a pointer to this paper from last year’s COLT:

The paper presents, proves theorems about, and evaluates a very simple idea in the context of crowdsourcing binary classifier data coding. Unlike other approaches I’ve blogged about, Dekel and Shamir use only one coder per item, and often coders only label a handful of items.

They train an SVM on all the labels, then use the SVM to evaluate coders. They then prune out low-quality coders using the SVM as a reference.

This paper reminded me of (Brodley and Friedl 1999), which was also about using multiple trained classifiers to remove bad labels. Brodley and Friedl remove items on an item by item basis rather than on an annotator by annotator basis.

This paper is also somewhat reminiscent of (Raykar et al. 2009), who train a classifier and use it as another annotator.

Lots of Theory

There’s lots of theory saying why this’ll work. It is a COLT paper after all.

Evaluation: Web Page Query Relevance

They ran an eval on Amazon’s Mechanical Turk asking people to judge a pair consisting of a web query and web page as to whether the page is relevant to the query.

They used 15 coders per query/page pair in order to define a gold standard by voting. They then evaluated their approach using only one coder per item by subsampling their larger data set.

One thing I noticed is that as they lowered the maximum number of items labeled by an annotator, their average annotator accuracy went up. This is consistent with our findings and those in Snow et al. that the spammy Mechanical Turkers complete a disproportionate number of tasks.

With only one coder per item, Dekel and Shamir can’t evaluate the way everyone else evaluates crowdsourcing, because their resulting data will remain noisy; it’s only singly annotated.

Estimates of coder sensitivity and specificity could be made based on their performance relative to the SVM’s best guesses. That’d provide some handle on final corpus quality in terms of false positives and negatives relative to ground truth.

Rather than evaluating trained classifier performance, Dekel and Shamir measure the “noise level” of the resulting data set after pruning. What we really care about in practice is how much pruning bad annotators helps train a more reliable classifier (or help evaluate if that’s the goal). They discuss this kind of end-to-end evaluation under “future work”! It’s really striking how different the evaluation versus theory requirements are for different conferences.

LREC 2010 Tutorial: Statistical Models of the Annotation Process

January 4, 2010

17 May 2010, Malta

Massimo Poesio and I are going to be giving a half-day tutorial at LREC 2010 in the afternoon of Monday 17 May 2010. The title is Statistical Models of the Annotation Process. We’ll be covering most of the material I’ve discussed in this blog’s data annotation thread and more.

What’s really cool is that it’s in Malta!

I’ll be there for most of the LREC 2010 programme. I’m particularly excited about the 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining and the New Challenges for NLP Frameworks Workshop (where by “framework”, they mean tools like GATE and LingPipe). I’ve always liked workshops more than conferences.

5–6 May 2010, Edinburgh

Before Malta, I’ll be in Edinburgh 6–7 May 2010 for a reunion of the department in which I did my Ph.D., the Centre for Cognitive Science (originally School of Epistemics). I’ll be giving a talk on data annotation there, too.