Archive for the ‘Statistics’ Category

March 10, 2011

I was commenting on a Wall St. Journal blog post on NCAA bracket math and figured I could actually elaborate on the math here.

For those of you who don’t know what it is, “March Madness” is a tournament of college basketball teams in the United States. The same approach could be used for the World Cup, chess tournaments, baseball seasons, etc. Anything where bettors assign strengths to teams then some subset of those teams play each other. It can be round robin, single elimination, or anything in between.

The Problem: A Betting Pool

Now suppose we want to run an office betting pool [just for fun, of course, as we don’t want to get in trouble with the law]. We want everyone to express some preferences about which teams are better and then evaluate who made the best predictions.

The contest will consist of 63 games total in a single-elimination tournament. (First round is 32 games, next 16, next 8, then 4, 2, and 1, totalling 63.) The big problem is that who plays in the second round depends on who wins in the first round. With 63 games, each of which can go either way, there are $2^{63}$ possible brackets, a few too many to enumerate.

You could do something like have everyone rank the teams and then somehow try to line those up. See the WSJ blog post for more ad hoc suggestions.

The Bradley-Terry model is a model for predicting the outcome of pairwise comparisons. Suppose there are $N$ items (teams, in this case) being compared. Each item gets a coefficient $\alpha_n$ indicating how strong the team is, with larger numbers being better. The model then assigns the following probability to a matchup:

$\mbox{Pr}[\mbox{team } i \mbox{ beats team } j] = \mbox{logit}^{-1}(\alpha_i - \alpha_j)$

The inverse logit is $\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))$. For instance, if team $i$ and team $j$ have the same strength, that is $\alpha_i = \alpha_j$, then the probability is even. If $\alpha_i \gg \alpha_j$, the probability approaches 1 that team $i$ will defeat team $j$.

As an aside, there have been all kinds of adjustments to this model. For instance, you can add an intercept term $\beta$ for home team advantage, and perhaps have this vary by team. You can imagine adding all sorts of other random effects for games. We can also add a third outcome for ties if the sport allows it.

Scoring Predictions

Suppose we have each bettor assign a number $\alpha_i$ to each team $i$. This vector $\alpha$ of team strength coefficients determines the probability that team $i$ defeats team $j$ for any $i,j$.

Suppose the games are numbered 1 to 63 and that in game $k$, the result is that team $ii[k]$ defeated team $jj[k]$. Then the score assigned to ratings $\alpha$ of team strengths is:

$\mbox{score}(\alpha) = \sum_{k = 1}^{63} \log \mbox{logit}^{-1}(\alpha_{ii[k]} - \alpha_{jj[k]})$.

Higher scores are better. In words, for each game you get the log of the probability you predicted for the team that actually won. The total score is just the sum of these per-game log probabilities, so it's the log of the total probability your ratings assigned to what actually happened.

For instance, let’s suppose there are three teams playing each other round robin. Let’s suppose that team 1 beats team 2, team 1 beats team 3 and team 3 beats team 2. Now suppose we assigned strengths $\alpha_1 = 2.0$ to team 1, $\alpha_2 = 1.0$ to team 2 and $\alpha_3 = 0.25$ to team 3. The total score would be

$\mbox{score}(\alpha)$
$= \log \mbox{logit}^{-1}(2.0 - 1.0) + \log \mbox{logit}^{-1}(2.0 - 0.25) + \log \mbox{logit}^{-1}(0.25 - 1.0)$
$= -1.61$.

This corresponds to an $\exp(-1.61) = 0.20$ probability assigned by the coefficients $\alpha$ to the outcomes of the three games.

Breaking out the probabilities for the individual games, note that

$\mbox{logit}^{-1}(2.0 - 1.0) = 0.73$,
$\mbox{logit}^{-1}(2.0 - 0.25) = 0.85$, and
$\mbox{logit}^{-1}(0.25 - 1.0) = 0.32$.

Note that multiplying these probabilities together yields 0.20.

Fitting the Model Given Data

Given the outcomes, it's easy to optimize the coefficients: it's just the maximum likelihood estimate for the Bradley-Terry model! Of course, we can also compute Bayesian posteriors and go the full Bayesian inference route (Gelman et al.'s Bayesian Data Analysis goes over the chess case, where there may be draws).

An obvious way to do this would be to use the season’s games to fit the coefficients. You could add in all the other teams, too, which provide information on team strengths, then just use the coefficients for the teams in the tournament for prediction. A multilevel model would make sense in this setting, of course, to smooth the strengths. The pooling could be overall, by division, or whatever else made sense for the problem.

How to Explain it to the Punters?

No idea. You could have them assign numbers and then they could explore what the predictions are for any pair of teams.

The Matrix Cookbook

February 3, 2011

I was browsing through the doc for the Eigen matrix package (C++ templated, which I need for automatic differentiation) and they had a pointer to the following extended (70+ page) matrix “cheat sheet”, The Matrix Cookbook:

It contains all kinds of useful definitions and derivations, most relevantly for me, gradients of all the matrix operations (inverses, determinants, traces, eigenvalues). They even have the gradient for the multivariate normal density function (see section 8), which has a pleasingly simple form. Now if I could just get them to do the Wishart and multivariate-t, which they define, but don’t differentiate.

And how great is that pair of author names, “Petersen” and “Pedersen”? They’re reminiscent of Tintin’s Thomson and Thompson.

The authors even have a blog about The Matrix Cookbook, though it hasn’t been updated in almost a year.

Which Automatic Differentiation Tool for C/C++?

January 19, 2011

I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

Why Hamiltonian Monte Carlo?

Hamiltonian MC methods use gradient (derivative) information to guide Metropolis-Hastings proposals. Matt is finding that it works well for the multilevel deep interaction models we’re working on with Andrew. As the theory suggests, it works much better than Gibbs sampling when the variables are highly correlated, as they tend to be with multilevel regression coefficients. The basic idea goes back to the paper:

• Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. Hybrid Monte Carlo. Physics Letters B 195(2):216–222. doi:10.1016/0370-2693(87)91197-X

Because we want to model general directed graphical models a la BUGS, we need to compute gradients.

If you need general gradients, it sure seems like automatic differentiation (AD) deserves at least some consideration. AD is a technique that operates on arbitrary functions defined by source code, generating new source code that computes the derivative (thus it’s a kind of automatic code generation). At a high level, it does just what you might expect: computes the derivatives of built-in functions and constants then uses the chain rule to put them together. Because it generates code, you can even have a go at tuning the resulting derivative code further.

Here are some basic refs:

So Which One?

So what I want to know is: if I'm going to be coding in C, which implementation of AD should I be using? (There are implementations in everything from Fortran to Python to Haskell.)

Update 21 January 2011: So, two problems rear their ugly heads. One: almost all of the tools other than Tapenade require you to seriously modify the code with different data types and flags for the code generator. Two: most of these tools are for non-commercial use only and can't be redistributed. Not exactly how we'd like to roll our own academic software. This is especially annoying because they're almost all government-research funded (Tapenade in France, many of the others in the US), with copyright held tightly by institutions like the University of Chicago and INRIA. Many of the packages also require extensive secondary packages to be installed. I miss Java and Apache.

Update 21 January 2011: maybe people coming from the blog killed it, but the Tapenade Online app is no longer up, and hasn't been since an hour or two after this blog post went up.

You can try it on your own C, C++, or Fortran program using the Tapenade Online app.

For instance, consider the simple (intentionally non-optimal) C code to compute $f(x) = x^2 + 3x$:

double f(double x) {
  double y;
  y = x;
  y *= x;
  y += 3*x;
  return y;
}


Running Tapenade in forward mode, we get

double f_d(double x, double xd, double *f) {
  double y;
  double yd;
  yd = xd;
  y = x;
  yd = yd*x + y*xd;
  y *= x;
  yd = yd + 3*xd;
  y += 3*x;
  *f = y;
  return yd;
}


If you plug in 1.0 for xd (in general, this can be used for chaining functions), you can see that the function returns the derivative $f'(x)$ and also sets the value of argument f (given as a pointer) to $f(x)$.

In backward mode, Tapenade generates the following code:

void f_b(double x, double *xb, double fb) {
  double y;
  double yb;
  double f;
  y = x;
  yb = fb;
  *xb = *xb + (y+3)*yb;
  yb = x*yb;
  *xb = *xb + yb;
}


If you squint hard enough, you’ll realize this method also computes the derivative $f'(x)$ of $f(x)$. It’s set up to run the chain rule in reverse.

Citation Stats?

Yes, I know this is totally crass, but nevertheless, I went to autodiff.org and checked out the many systems for C/C++ they cite, and then did a Google Scholar query [SystemName "automatic differentiation"], restricted to results after (including?) 2005. Here’s what I found:

ADMB      73
AUTODIF   11
COSY INF  52
Rapsodia   9
Treeverse  7
YAO       88


This doesn’t seem to be expressing a particularly clear community preference. So I was hoping some of you may have suggestions. I should add that many of the papers are themselves surveys, so don’t actually correspond to someone using the package, which is what I’d really like to know.

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

January 12, 2011

Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.

The Book

Check out Chapter 3, Finding Similar Items, from:

It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it’s not surprising it’s at the perfect level for serious practitioners. Other presentations I’ve seen have been very theory heavy (though feel free to point out other refs in the comments).

The Drill

Here’s an overview of the scaling process, as currently understood. It may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is hairy, to a linear problem whose constant factor is manageable.

Step 1. Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. Of course, if all you need to do is the next steps, you don’t need to build string-based tokens — you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book).

Step 2. From each document, extract multiple shingles. Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof, which is where the name “shingle” comes from. Rajaraman and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items.

Step 3. Calculate minhash values for each of the shingles in a doc. This provides a compressed representation of sets with the nifty property that the chance that minhashes are equal is the same as the Jaccard distance itself (explained in the book cited above). There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code.

Step 4. Calculate locality-sensitive hashes of the minhash values. The point of locality-sensitive hashing is to map similar items to similar buckets. There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation.

Step 5. Test for equality using the locality-sensitive hashes (LSH). This reduces the quadratic problem of comparing all docs to a linear one with a manageable constant factor. You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is!

Step 6. You can then check the actual docs if you want to prevent false positive matches.

The book draft has nice examples of all of these things. It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math — it just sticks to what you need to know to understand the problem and build an implementation yourself. In fact, I may do just that.

Tunable Tradeoffs in Accuracy versus Speed

One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. By using more and more shingles and more and more LSH, you can get pretty much arbitrarily close to 1.0 in accuracy. Given that the problem’s not so well defined already, we can usually tolerate a bit of error on both the false positive and false negative side.

What Does Nutch Do?

The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). Does anyone have a pointer to how it works?

Monitoring Convergence of EM for MAP Estimates with Priors

January 4, 2011

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data $y$ and parameters $\theta$, the goal is to find the parameters $\theta^*$ that maximize the likelihood function $p(y|\theta)$.

Likelihood and Missing Data

Usually EM is used for latent parameter problems, where there are latent variables $z$ which are treated like missing data, so that the full likelihood function is actually $p(y,z|\theta)$. For instance, $z$ might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as $p(y,z|\theta) = p(z|\theta) \times p(y|z,\theta)$.

Even though the expectation (E) step of EM computes “expectations” for $z$ given current estimates of $\theta$ and the data $y$, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes $z$ away. Specifically, the maximum likelihood estimate $\theta^*$ is the one that maximizes the likelihood with $z$ marginalized out,

$p(y|\theta) = \int p(y,z|\theta) \ dz = \int p(y|z,\theta) \times p(z|\theta) \ dz$.

Monitoring Likelihood or Parameters

There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

Linearly Separable Data for Logistic Regression

In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter grows, the difference its growth makes to the log likelihood diminishes (we’re way out on the flat extremes of the logistic sigmoid at that point, where the slope’s nearly 0).

Convergence with MAP?

Textbooks often don’t mention, either for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posterior (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, maximum a posteriori (MAP) estimates will balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.

Maximization Step with Priors

In EM with priors, the maximization step needs to set $\theta^{(n)}$, the parameter estimate in the $n$-th epoch, to the value that maximizes the total probability, $\log p(y|\theta) + \log p(\theta)$, given the current “expectation” for the latent parameters $z$ based on the data and the previous epoch’s estimate of $\theta$. That is, you can’t just set $\theta^{(n)}$ to maximize the likelihood, $\log p(y|\theta)$. There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-multinomial or normal-normal, so this isn’t as hard as it may sound. And often you can get away with merely increasing the objective rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).

Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus prior, $\log p(y|\theta) + \log p(\theta)$, not just the log likelihood $p(y|\theta)$. What EM guarantees is that every iteration increases this sum. If you just monitor the likelihood term $p(y|\theta)$, you’ll see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.

Danger Zone! of Data Science

November 24, 2010

I’m not a big fan of links-only posts, but I’m still chuckling at the “Danger Zone!” quadrant from Drew Conway’s post The Data Science Venn Diagram on the Dataists Blog:

Venn diagram by Drew Conway.

I just realized I linked to a cross-post, which is one of the reasons I don’t like all this linking about. The original entry is on Drew Conway’s blog, Zero Intelligence Agents.

Need Another Label(s)!

November 23, 2010

It occurred to me while working through the math for my last post that there are situations when you not only need another label for an existing item, but need more than one to achieve an expected positive payoff.

Prior and Posterior Odds Ratios

First, let’s summarize the result from adding another annotator. For a given image $i$, suppose the current odds (aka prior) of clean ($c_i = 1$) versus porn ($c_i = 0$) are $V:1$, which corresponds to a probability of being clean of $\mbox{Pr}[c_i =1] = V/(1+V)$. For instance, if the odds are 4:1 the image is clean, the probability it is clean is 4/5.

Now suppose we get a label $y_{i,j}$ from an annotator $j$ for image $i$. We update the odds to include the annotator’s label, getting new (posterior) odds $V'$ by the following formula:

$V' = V \times \mbox{Pr}[y_{i,j}|c_i=1] \, / \, \mbox{Pr}[y_{i,j}|c_i=0]$.

We just multiply the prior odds $V$ by the likelihood ratio of the annotation $y_{i,j}$ given that $c_i = 1$ or $c_i = 0$. The new probability of a clean image is thus $V'/(1+V')$.

The Porn Payoff

Suppose the payoff for the porn task is as follows. A porn image classified as clean has a payoff of -100, a porn image classified as porn has a payoff of 0, a clean image classified as clean has a payoff of 1, and a clean image classified as porn has a payoff of -1.

In this setup, we need odds of better than 100:1 (a bit more than 99% probability) that an image is clean before we return the decision that it’s clean; otherwise we maximize our payoff by saying it’s porn. Unless we are 100% sure of our decision, a porn decision always carries an expected loss: the payoff is 0 for a true negative (porn classified as porn) and -1 for rejecting a clean image, so the expected payoff of a porn decision is never positive.

Now suppose that the prevalence of clean images is 20 in 21 (of course, we work with an estimate). So we start with odds of 20:1. The decision to classify the item as clean before annotation has an expected payoff of [20*1 + 1*(-100)]/21 or about -4 per decision. The decision to classify an image as porn before an annotation has a payoff of [20*(-1) + 1*0]/21 which is around -1 per decision.

Need Multiple Annotators

Clearly we need to annotate an item before we can have a positive expected payoff.

Suppose for the sake of argument (and easy arithmetic) that we have an annotator with sensitivity 0.9 (they correctly classify 90% of the clean images as clean and reject 10% of the clean images as porn) and specificity 0.8 (they correctly classify 80% of the porn images as porn and let 20% through as clean).

In this context, we actually need more than one annotator to annotate an item before we get a positive expected payoff. We first need to work through our expectations properly. First, we start with 20:1 odds (probability 20/21) that an image is clean, so we can expect to see 20 clean images for each porn image. Then we have to look at the annotator’s expected response. If the image is clean, there’s a 90% chance the annotator says it’s clean and a 10% chance they say it’s porn. If the image is porn, there’s an 80% chance they say it’s porn and a 20% chance they say it’s clean. That lets us work out the expectations for true category and response a priori. For instance, the chance that the image is clean and the annotator says it’s porn is 20/21 * 0.1 = 2/21.

We then calculate the updated odds under both possible annotator responses and figure out the new odds and weight them by our expectations before seeing the label.

I’ll let you work out the arithmetic, but the upshot is that until you have more than one annotation, you can’t get to 100:1 odds or better of an image not being porn. The key step is noting that we only get a positive expected payoff if we return that the image is clean, and the posterior odds after a single clean (1) label are only 20/1 * 0.9/0.2 = 90:1. And don’t forget to factor in that we only land in the happy situation of getting a clean label around 90% of the time, so the expected gain in payoff from the annotation is less than the improvement from 20:1 to 90:1 we get in the ideal case.

In reality, you’re likely to have slightly better porn/not-porn annotators than this, because it’s a relatively easy decision problem. But you’re also likely to have spammers, as we mentioned last time, so the pool is really a mix of spammers and cooperators.

Unknown Annotators

I’ll say again that one of the pleasant properties of the hierarchical model extension of Dawid and Skene is that it allows us to predict the behavior for a new unknown annotator. This is particularly useful in a mechanical Turk setting where we can’t choose our annotators directly (though we can feed items to an annotator if we write our own web app and they volunteer to do more).

We just sum over predictions from each posterior sample of the hyperpriors, then sample an annotator from that, calculate what the outcome would be for them, and average the result. What’s extra cool is that this includes all the extra dispersion in posterior inference that’s warranted due to our uncertainty in the hyperpriors representing the population of annotators.

Evaluating with Probabilistic Truth: Log Loss vs. 0/1 Loss

November 2, 2010

I’ve been advocating an approach to classifier evaluation in which the “truth” is itself a probability distribution over possible labels and system responses take the same form.

Is the Truth Out There?

The main motivation is that we don’t actually know what the truth is for some items being classified. Not knowing the truth can come about because our labelers are biased or inconsistent, because our coding standard is underspecified, or because the items being classified are truly marginal. That is, the truth may not actually be out there, at least in any practically useful sense.

For example, we can squint all we want at the use of a pronoun in text and still not be able to come to any consensus about its antecedent, no matter how many experts in anaphora we assemble. Similarly, we might look at a phrase like fragile X in a paper on autism and truly not know if it refers to the gene or the disease. We may not know a searcher’s intent when we try to do spelling correction on a query like Montona — maybe they want Montana the state, maybe they want the pop star, or maybe they really want the west central region of the island of São Vicente.

The truth’s not always out there. At least for language data collected from the wild.

Partial Credit for Uncertainty

The secondary motivation is that many classifiers, like most of LingPipe’s, are able to return probabilistic answers. The basic motivation of log loss is that we want to evaluate the probability assigned by the system to the correct answers. The less uncertain the system is about the true answer, the better.

If we want to be able to trade precision (or specificity) for recall (i.e., sensitivity), it’s nice to have probabilistically normalized outputs over which to threshold.

Probabilistic Reference Truth

Let’s just consider binary (two category) classification problems, where a positive outcome is coded as $1$ and a negative one as $0$. Suppose there are $N$ items being classified, $x_1, \ldots, x_N$. A probabilistic corpus then consists of a sequence of probabilistic categorizations $y_1, \ldots, y_N$, where $y_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic corpus is just an edge case, with $y_n \in \{ 0, 1 \}$ for $1 \leq n \leq N$.

Probabilistic System Responses

Next, suppose that the systems we’re evaluating are themselves allowed to return conditional probabilities of positive responses, which we will write as $\hat{y}_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic response is again an edge case, with $\hat{y}_n \in \{ 0, 1 \}$.

Log Loss

The log loss for response $\hat{y}$ for reference $y$ is the negative expected log probability of the response given the reference, namely

${\cal L}(y,\hat{y}) = - \sum_{n=1}^N \left( y_n \times \log \hat{y}_n + (1-y_n) \times \log (1 - \hat{y}_n) \right)$.

If we think of $y$ and $\hat{y}$ as parameters to Bernoulli distributions, it’s just the cross-entropy, which is the KL divergence plus the entropy of the reference $y$ (a constant in $\hat{y}$).

If the reference $y$ is restricted to being $0$ or $1$, this reduces to the usual definition of log loss for a probabilistic classifier (that is, for $\hat{y}$ taking any value in $[0,1]$).

Plotting a Single Response

Because we just sum over all the responses, we can see what the result is for a single item $n$. Suppose $y_n = 0.7$, which means the reference says there’s a 70% chance of item $n$ being positive and a 30% chance of it being negative. The plot below shows the loss for response $\hat{y}_n$ given reference $y_n = 0.7$.

It’s easy to see that the response $\hat{y} = 1$ is better than $\hat{y} = 0$ for the case where the reference is $y = 0.7$.

The red vertical dashed line is at the point $\hat{y} = 0.7$, which is where $y = \hat{y}$. The blue horizontal line is at ${\cal L}(y,\hat{y}) = 0.61$, which is the minimum loss for any $\hat{y}$ when $y = 0.7$.

Classifier Uncertainty Matches Reference Uncertainty

It’s easy to see from the graph (and prove via differentiation) that the log loss function is convex and the minimum loss occurs when $y = \hat{y} = 0.7$. That is, best performance occurs when the system response has the same uncertainty as the gold standard. In other words, we want to train the system to be uncertain in the same places as the reference (training) data is uncertain.

Log Loss Degenerates at the Edges

If the reference is $y = 0$ and the response $\hat{y} = 1$, the log loss is infinite. Similarly for $y = 1, \hat{y} = 0$. As long as the response is not 0 or 1, the results will be finite.

If the response is 0 or 1, the loss will only be finite if the reference category is the same value, $y = \hat{y}$. In fact, having the reference equal to the response, $y = \hat{y}$, with both being 0 or 1, is the only way to get zero loss. In all other situations, loss will be non-zero. Loss is never negative because we’re negating logs of values between 0 and 1.

This is why you don’t see log loss used for bakeoffs. If systems don’t have probabilistic output and the truth is binary 0/1, then any system that makes even a single hard error gets infinite log loss.

Relation to Regularization via Prior Data

This matching of uncertainty can be thought of as a kind of regularization. One common way to implement priors if you only have maximum likelihood estimators is to add some artificial data. If I add a new data point $x$ which has all features turned on, and give it one positive instance and one negative instance, and I’m doing logistic regression, it acts as a regularizer pushing all the coefficients closer to zero (the further they are away from zero, the worse they do on the combination of the one negative and one positive example). If I add more of this kind of data, the regularization effect is stronger. Here, we can think of adding 3 negative examples and 7 positive examples, or 0.3 and 0.7 if we allow fractional training data.

Comparison to 0/1 Loss

Consider what happens if instead of log loss, we consider the standard 0/1 loss as expressed in a confusion matrix (true positive, true negative, false positive, and false negative counts).

With a probabilistic reference category $y$ and probabilistic system response $\hat{y}$, the expected confusion matrix counts are

$\mathbb{E}[\mbox{TP}] = y \times \hat{y}$,

$\mathbb{E}[\mbox{TN}] = (1-y) \times (1-\hat{y})$,

$\mathbb{E}[\mbox{FP}] = (1-y) \times \hat{y}$, and

$\mathbb{E}[\mbox{FN}] = y \times (1 - \hat{y})$.

With this, we can compute expected accuracy, precision, recall (= sensitivity), specificity, and F-measure as

$\mathbb{E}[\mbox{acc}] = \mathbb{E}[(\mbox{TP}+\mbox{TN})/(\mbox{TP}+\mbox{TN}+\mbox{FP}+\mbox{FN})]$
${} = y \hat{y} + (1 - y)(1-\hat{y})$.

$\mathbb{E}[\mbox{recall}] = \mathbb{E}[\mbox{TP}/(\mbox{TP}+\mbox{FN})]$
${} = y \hat{y}/(y\hat{y} + y(1 - \hat{y})) = \hat{y}$,

$\mathbb{E}[\mbox{precision}] = \mathbb{E}[\mbox{TP}/(\mbox{TP}+\mbox{FP})]$
${} = y \hat{y}/(y\hat{y} + (1-y)(\hat{y})) = y$,

$\mathbb{E}[\mbox{specificity}] = \mathbb{E}[\mbox{TN}/(\mbox{TN}+\mbox{FP})]$
${} = (1-y)(1-\hat{y})/((1-y)(1-\hat{y}) + (1-y)\hat{y}) = 1-\hat{y}$, and

$\mathbb{E}[\mbox{F-measure}] = \mathbb{E}[2 \times \mbox{prec} \times \mbox{recall} / (\mbox{prec} + \mbox{recall})]$
${} = 2 y \hat{y} / (y + \hat{y})$.

All of these statistical measures are degenerate for probabilistic evaluation. For instance, if $y > 0.5$, then accuracy is maximized with $\hat{y} = 1$; if $y < 0.5$, accuracy is maximized with $\hat{y} = 0$; and if $y = 0.5$, accuracy is the same for all $\hat{y} \in [0,1]$! The rest of the measures are no better behaved.
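The closed forms above, and the accuracy degeneracy, can be verified numerically. A Python sketch (the helper is mine):

```python
def expected_measures(y, y_hat):
    """Expected accuracy, recall, precision, specificity, and F-measure
    computed from the expected confusion counts."""
    tp, tn = y * y_hat, (1 - y) * (1 - y_hat)
    fp, fn = (1 - y) * y_hat, y * (1 - y_hat)
    acc = tp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f = 2 * precision * recall / (precision + recall)
    return acc, recall, precision, specificity, f

y, y_hat = 0.7, 0.4
acc, recall, precision, specificity, f = expected_measures(y, y_hat)
assert abs(recall - y_hat) < 1e-12            # recall = y_hat
assert abs(precision - y) < 1e-12             # precision = y
assert abs(specificity - (1 - y_hat)) < 1e-12 # specificity = 1 - y_hat

# degeneracy: with y = 0.7, expected accuracy prefers the hard guess y_hat = 1
print(expected_measures(0.7, 1.0)[0])  # 0.7
print(expected_measures(0.7, 0.7)[0])  # ~0.58: the honest report scores worse
```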

R Source

Plotting functions is easy in R with the curve() function. Here’s how I generated the one above.

p <- 0.7  # reference probability y

# log loss of estimate q against reference p
f <- function(q) {
  - (p * log(q) + (1 - p) * log(1 - q))
}

curve(f, xlim=c(0, 1), ylim=c(0, 4),
      xlab="yHat", ylab="LogLoss(y, yHat)",
      main="Log Loss  ( y=0.7 )",
      lwd=2, cex.lab=1.5, cex.axis=1.5, cex.main=1.5,
      axes=FALSE, frame=FALSE)

axis(1, at=c(0, 0.5, 0.7, 1))
axis(2, at=c(0, f(0.7), 2, 4), labels=c("0", ".61", "2", "4"))
abline(h=f(0.7), col="blue", lty=3)
abline(v=0.7, col="red", lty=2)


Just be careful with R’s lexical scoping if you want to redefine the reference result p ($y$ in the definitions above): f picks up the current value of p each time it’s called, so redefining p changes what f computes.

How to Define (Interpolated) Precision-Recall/ROC Curves and the Area under Them?

October 1, 2010

I’m revising LingPipe’s ScoredPrecisionRecallEvaluation class. My goal is to replace what’s currently there with more industry-standard definitions, as well as fixing some bugs (see the last section for a list).

Uninterpolated and Interpolated Curves?

I’ve seen multiple different definitions for uninterpolated and interpolated curves.

For the uninterpolated case, I’ve seen some authors include all the points on the curve and others include only the points corresponding to relevant results (i.e., true positives). LingPipe took the latter approach, but Manning et al.’s IR book took the former.

For the interpolated case, I’ve seen some authors (e.g., Manning et al.) set precision for a given recall point r to be the maximum precision at an operating point with recall r or above. I’ve seen others just drop points that were dominated (LingPipe’s current behavior). I seem to recall reading that others used the convex hull (which is what LingPipe’s doc says it does, but the doc’s wrong).
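The Manning et al. convention is easy to state in code: raise each precision to the maximum precision at equal or higher recall. A small Python sketch (function name and toy curve are mine):

```python
def interpolate_max(points):
    """Interpolated precision a la Manning et al.: precision at recall r
    becomes the max precision at any operating point with recall >= r.
    points: (recall, precision) pairs."""
    return [(r, max(p2 for (r2, p2) in points if r2 >= r))
            for (r, _) in points]

# toy curve with a dominated dip at recall 0.4
curve = [(0.2, 1.0), (0.4, 0.5), (0.6, 0.66), (0.8, 0.4)]
print(interpolate_max(curve))
# -> [(0.2, 1.0), (0.4, 0.66), (0.6, 0.66), (0.8, 0.4)]
```

Note that this does more than just drop dominated points: the dip at recall 0.4 is raised to 0.66 rather than removed.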

Limit Points?

Should we always add the limit points corresponding to precision/recall values of (0,1) and (1,0)?

Break-Even Point?

If we use all the points on the precision-recall curve, the break-even point (BEP) will be at the rank equal to the number of relevant results (i.e., BEP equals R-precision). If we remove points that don’t correspond to true positives, there might not be a break-even operating point. For example, consider the results irrel, irrel, rel, rel in the case where there are only two relevant results: in the full case, precision and recall are both 0 at rank 2, so the curve breaks even there, but the BEP is undefined if we only consider true-positive operating points.
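That example can be worked out in a short Python sketch (helper name mine) that computes the operating point at each rank:

```python
def pr_points(results, num_relevant):
    """Recall/precision at each rank of a ranked result list
    (True = relevant)."""
    points, tp = [], 0
    for rank, rel in enumerate(results, start=1):
        tp += rel
        points.append((tp / num_relevant, tp / rank))
    return points

# the example from the text: irrel, irrel, rel, rel with 2 relevant docs
for recall, precision in pr_points([False, False, True, True], 2):
    print(recall, precision)
# at rank 2 (= number of relevant results), precision = recall = 0,
# so the full curve breaks even; the true-positive-only curve never does
```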

Area Under the Curves?

For area under the curve (AUC) computations, I’ve also seen multiple definitions. In some cases, I’ve seen people connect all the operating points (however defined), and then calculate the actual area. Others define a step function more in keeping with interpolating to maximum precision at equal or higher recall, and then take the area under that.
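To see how far apart these definitions can be, here is a Python sketch (function names and toy curve are mine) computing both the trapezoidal area over the connected points and the area under the max-interpolated step function:

```python
def auc_trapezoid(points):
    """Area under the connected operating points (trapezoidal rule).
    points: (recall, precision) pairs sorted by recall."""
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(points, points[1:]))

def auc_step(points):
    """Area under the step function that interpolates each recall bin
    to the max precision at equal or higher recall."""
    area, prev_r = 0.0, 0.0
    for r, _ in points:
        best = max(p for (r2, p) in points if r2 >= r)
        area += (r - prev_r) * best
        prev_r = r
    return area

curve = [(0.2, 1.0), (0.4, 0.5), (0.6, 0.66), (0.8, 0.4)]
print(auc_trapezoid(curve))  # ~0.372
print(auc_step(curve))       # ~0.544: the two definitions disagree
```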

Bugs in LingPipe’s Current Definitions

For the next version of LingPipe, I need to fix three bugs:

1. The ROC curves should have 1-specificity on the X axis and sensitivity on the Y axis (LingPipe currently returns sensitivity on the X axis and specificity on the Y axis, which doesn’t correspond to anyone’s definition of ROC curves.)

3. Area under curve calculations don’t match anything I can find in the literature, nor do they match the definitions in the javadoc.

R’s Scoping

September 9, 2010

[Update: 10 September 2010 I didn’t study Radford Neal’s example closely enough before making an even bigger mess of things. I’d like to blame it on HTML formatting, which garbled Radford’s formatting and destroyed everyone else’s examples, but I was actually just really confused about what was going on in R. So I’m scratching most of the blog entry and my comments, and replacing them with Radford’s example and a pointer to the manual.]

A Better Mousetrap

There’s been an ongoing discussion among computational statisticians about writing something better than R, in terms of both speed and comprehensibility. Here is Radford Neal’s example of R’s scoping:

> f = function () {
+     g = function () a+b
+     a = 10
+     g()
+ }

> h = function () {
+     a = 100
+     b = 200
+     f()
+ }

> b=3

> h()
[1] 13


This illustrates what’s going on, assuming you can parse R. I see it, I believe it. The thing to figure out is why a=10 was picked up in the call to g() in f, but b=200 was not picked up in the call to f() in h. Instead, the global assignment b=3 was picked up.
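For comparison, Python closures are lexically scoped in much the same way, so a direct translation of Radford’s example (my transcription) gives the same answer:

```python
b = 3  # global binding, eventually found when f's body looks up b

def f():
    def g():
        return a + b  # a from f's scope; b falls through to module globals
    a = 10            # bound in f's scope before g() is called
    return g()

def h():
    a = 100  # these locals are invisible to f:
    b = 200  # callers' environments are never searched
    return f()

print(h())  # 13, just as in R
```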

RTFM

Even after I RTFM-ed, I was still confused.

It has a section 10.7 titled “Scope”, but I found their example

cube <- function(n) {
    sq <- function() n * n
    n * sq()
}


and the following explanation confusing,

The variable n in the function sq is not an argument to that function. Therefore it is a free variable and the scoping rules must be used to ascertain the value that is to be associated with it. Under static scope (S-Plus) the value is that associated with a global variable named n. Under lexical scope (R) it is the parameter to the function cube since that is the active binding for the variable n at the time the function sq was defined. The difference between evaluation in R and evaluation in S-Plus is that S-Plus looks for a global variable called n while R first looks for a variable called n in the environment created when cube was invoked.

I was particularly confused by the “environment created when cube was invoked” part, because I couldn’t reconcile it with Radford’s example.

Let’s consider a slightly simpler example without nested function calls.

> j = 10
> f = function(x) j * x
> f(3)
[1] 30
> j = 12
> f(3)
[1] 36


This shows it can’t be the value of j at the time f is defined, because the result changes when I change j later. What’s fixed at definition time is where f will look for j: free variables are resolved in the environment where the function was defined, and then in its enclosing environments out to the global environment, never in the caller’s environment. That’s why, in Radford’s example, the b=200 local to h was invisible to f, while the global assignment b=3 was found.

Am I the only one who finds this confusing? At least with all your help, I think I finally understand what R’s doing.