[**Update: 19 June 2012**: Becky just wrote me to clarify which tools they were using for what (quoted with permission, of course -- thanks, Becky):

*... we aren't using BART to rank structures, we use an independently learned ranked list to bin the structures before we apply BART. We use BART to do a treatment analysis where the y values represent whether there was an event, then we compute the role that the treatment variable plays in the prediction. Here's a journal paper that describes our initial ranking method*

`http://www.springerlink.com/content/3034h0j334211484/`

and the pre-publication version

`http://www1.ccls.columbia.edu/%7Ebeck/pubs/ConedPaperRevision-v5.pdf`

*The algorithm for doing the ranking was modified a few years ago, and now Cynthia is taking a new approach that uses survival analysis.*]

### Rare Events

Let’s suppose you’re building a model to predict rare events, like manhole explosions in the Con-Ed system in New York (this is the real case at hand — see below for more info). For a different example, consider modeling the probability of a driver getting into a traffic accident in the next week. The problem with both of these situations is that even with all the predictors in hand (last maintenance, number of cables, voltages, etc. in the Con-Ed case; driving record, miles driven, etc. in the driving case), the estimated probability for any given manhole exploding (any person getting into an accident next week) is less than 50%.

### The Problem with 0/1 Loss

A typical approach in machine learning in general, and particularly in NLP, is to use 0/1 loss. This forces the system to make a simple yea/nay (aka 0/1) prediction for every manhole about whether it will explode in the next year or not. Then we compare those predictions to reality, assigning a loss of 1 if you predict the wrong outcome and 0 if you predict correctly, then summing these losses over all manholes.

The way to minimize expected loss is to predict 1 if the probability estimate of failure is greater than 0.5 and 0 otherwise. If all of the probability estimates are below 0.5, all predictions are 0 (no explosion) for every manhole. Consequently, the loss is always the number of explosions. Unfortunately, this is the best you can do if your loss is 0/1 and you have to make 0/1 predictions.

So we’ve minimized 0/1 loss and in so doing created a useless 0/1 classifier.
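Here's a minimal sketch of that failure mode. The probabilities and outcomes are made up for illustration (they are not from the Con-Ed data), and `classify` and `zero_one_loss` are hypothetical helper names.

```python
def classify(probs, threshold=0.5):
    """Predict 1 (explosion) only when the estimate exceeds the threshold."""
    return [int(p > threshold) for p in probs]


def zero_one_loss(preds, outcomes):
    """Count the predictions that disagree with what actually happened."""
    return sum(int(pred != obs) for pred, obs in zip(preds, outcomes))


# Hypothetical estimates and outcomes, invented for illustration only.
probs = [0.02, 0.10, 0.30, 0.05, 0.25, 0.01]
outcomes = [0, 0, 1, 0, 0, 0]  # one manhole actually exploded

preds = classify(probs)                 # every estimate is below 0.5
loss = zero_one_loss(preds, outcomes)   # so the loss is the explosion count
print(preds, loss)
```

Every prediction comes out 0, so the loss is just the number of explosions, no matter how well the model ranks the manholes.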

### A Hacked Threshold?

There’s something fishy about a classifier that returns all 0 predictions. Maybe we can adjust the threshold for predicting explosions below 0.5. Equivalently, for 0/1 classification purposes, we could rescale the probability estimates.

Sure, lowering the threshold gives us some predicted explosions, but the result is a non-optimal 0/1 classifier. It’s non-optimal in 0/1-loss terms because each individual prediction of an explosion is more likely to be wrong than right, even though in aggregate some of them will be right.
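To see why a lowered threshold costs you in 0/1-loss terms, here's a small sketch that computes the expected 0/1 loss assuming the probability estimates are taken at face value. The numbers and the function name `expected_zero_one_loss` are hypothetical.

```python
def expected_zero_one_loss(probs, threshold):
    """Expected 0/1 loss if the probability estimates are correct.

    Predicting 1 is wrong with probability 1 - p; predicting 0 is wrong
    with probability p.
    """
    return sum((1.0 - p) if p > threshold else p for p in probs)


# Hypothetical estimates, all below 0.5.
probs = [0.02, 0.10, 0.30, 0.05, 0.25, 0.01]

loss_at_half = expected_zero_one_loss(probs, 0.5)   # all-0 predictions
loss_at_fifth = expected_zero_one_loss(probs, 0.2)  # two predicted explosions
print(loss_at_half, loss_at_fifth)
```

With the threshold at 0.2, the two manholes flagged as explosions each contribute 1 - p rather than p to the expected loss, so the total goes up.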

### It’s not a 0/1 Classification Problem

The problem in 0/1 classification arises from converting per-manhole explosion estimates of less than 50% into the 0/1 predictions that minimize expected loss.

Suppose our probability estimates are close, at least in the sense that for any given manhole there’s only a very small chance it’ll explode no matter what its features are.

Some manholes do explode and the all-0 predictions are wrong for every exploding manhole.

What Con-Ed really cares about is finding the most at-risk properties in its network and supplying them maintenance (as well as understanding what the risk factors are). This is a very different problem.

### A Better Idea

Take the probabilities seriously. If your model predicts a 10% chance of explosion for each of 100 manholes, you expect to see 10 explosions. You just don’t know which of the 100 manholes they’ll be. You can measure these marginal predictions (number of predicted explosions) to gauge how accurate your model’s probability estimates are.
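As a rough sketch of that marginal check, the expected number of explosions is just the sum of the per-manhole probabilities. The standard deviation below additionally assumes the explosions are independent, which is my assumption for illustration and may well not hold in a real network. The function names are hypothetical.

```python
import math


def expected_count(probs):
    """Expected number of events: the sum of the per-item probabilities."""
    return sum(probs)


def count_sd(probs):
    """Standard deviation of the event count, ASSUMING independent events."""
    return math.sqrt(sum(p * (1.0 - p) for p in probs))


# The example from the text: 100 manholes, each with a 10% chance.
probs = [0.10] * 100

print(f"expected explosions: {expected_count(probs):.1f}")
print(f"standard deviation:  {count_sd(probs):.1f}")
```

Comparing the observed count to the expected count, on the scale of that standard deviation, gives a crude sense of whether the probabilities are in the right ballpark.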

We’d really like a general evaluation that will measure how good our probability estimates are, not how good our 0/1 predictions are. Log loss does just that. Suppose you have $N$ outcomes $y_1, \ldots, y_N$ with corresponding predictors (aka features), $x_1, \ldots, x_N$, and your model has parameter $\theta$. The log loss for parameter (point) estimate $\hat{\theta}$ is

$$-\sum_{n=1}^{N} \log p(y_n \mid x_n, \hat{\theta}).$$

That is, it’s the negative log probability (the negative turns gain into loss) of the actual outcomes given your model; the summation is called the log likelihood when viewed as a function of $\theta$, so log loss is really just the negative log likelihood. This is what you want to optimize if you don’t know anything else. And it’s exactly what most probabilistic estimators optimize for classifiers (e.g., logistic regression, BART [see below]).
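Here's a minimal sketch of log loss for binary outcomes, where each estimate is the model's probability that the outcome is 1. The two sets of estimates are invented for illustration. Both produce all-0 predictions at a 0.5 threshold, so 0/1 loss can't tell them apart, but log loss can.

```python
import math


def log_loss(probs, outcomes):
    """Negative log probability of the observed binary outcomes.

    probs[n] is the estimated probability that outcomes[n] == 1; every
    estimate must lie strictly between 0 and 1.
    """
    return -sum(
        math.log(p if y == 1 else 1.0 - p)
        for p, y in zip(probs, outcomes)
    )


outcomes = [0, 0, 1, 0]

# Two hypothetical models; both stay below 0.5 everywhere.
informative = [0.05, 0.05, 0.40, 0.05]  # puts more mass on the real event
flat = [0.25, 0.25, 0.25, 0.25]         # just the base rate

print(log_loss(informative, outcomes), log_loss(flat, outcomes))
```

The model that assigns higher probability to the manhole that actually exploded gets the lower log loss, even though neither model ever predicts an explosion under 0/1 classification.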

### Decision Theory

The right thing to do for the Con-Ed case is to break out some decision theory. We can assign weights to various prediction/outcome pairs (true positive, false positive, true negative, false negative), and then choose the actions that minimize the expected weighted loss. If there’s a huge penalty for a false negative (saying there won’t be an explosion when there is), then you are best served by acting on low-probability information, such as servicing even low-probability manholes. For example, if there is a $100 cost for a manhole blowing up and it costs $1 to service a manhole so it doesn’t blow up, then any chance of blowing up above 1% is enough to send out the service team (at exactly 1%, the two expected costs are equal).

We haven’t changed the model’s probability estimates at all, just how we act on them.

In Bayesian decision theory, you choose actions to minimize expected loss conditioned on the data (i.e., optimize expected outcomes based on the posterior predictions of the model).
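The $100/$1 example can be sketched as an expected-cost comparison. This assumes, as in the example, that servicing a manhole fully prevents the explosion; the constants and function names are just illustrative.

```python
EXPLOSION_COST = 100.0  # cost if a manhole blows up (from the example above)
SERVICE_COST = 1.0      # cost to service a manhole so it doesn't blow up


def expected_cost(action, p_explode):
    """Expected cost of an action given the estimated explosion probability.

    Assumes servicing removes the explosion risk entirely.
    """
    if action == "service":
        return SERVICE_COST
    if action == "skip":
        return p_explode * EXPLOSION_COST
    raise ValueError(f"unknown action: {action}")


def best_action(p_explode):
    """Choose the action with the lower expected cost."""
    return min(("service", "skip"), key=lambda a: expected_cost(a, p_explode))


for p in (0.001, 0.05, 0.30):
    print(p, best_action(p))
```

Note that the probability estimates go in unchanged; only the costs determine where the effective threshold for action lands.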

### Ranking-Based Evaluations

Suppose we sort the list of manholes in decreasing order of estimated probability of explosion. We can line this up with the actual outcomes. Good system performance is reflected in having the actual explosions ranked high on the list.

Information retrieval supplies a number of metrics for this kind of ranking. The thing I like to see for this kind of application is a precision-recall curve. I’m not a big fan of single-number evaluations like mean average precision, though precision-at-N makes sense in some cases, such as if Con-Ed had a fixed maintenance budget and wanted to know how many potentially exploding manholes it could service.

There’s a long description of these kinds of evaluations in the LingPipe doc for `classify.ScoredPrecisionRecallEvaluation`.

Just remember there’s noise in these curves and that picking an optimal point on them is unlikely to produce such good behavior on held-out data.

With good probability estimates for the events you will get good rankings (there’s a ton of theory around this I’ve never studied).
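To make the ranking evaluations in this section concrete, here's a small sketch that sorts by estimated probability and reports precision and recall at each cutoff N. It's plain Python rather than the LingPipe class, the data is invented, and it assumes at least one actual event so that recall is defined.

```python
def precision_recall_points(probs, outcomes):
    """Return (N, precision-at-N, recall-at-N) for every cutoff N.

    Items are ranked by decreasing estimated probability. Assumes the
    outcomes contain at least one positive.
    """
    ranked = sorted(zip(probs, outcomes), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(outcomes)
    points = []
    hits = 0
    for n, (_, outcome) in enumerate(ranked, start=1):
        hits += outcome
        points.append((n, hits / n, hits / total_positives))
    return points


# Hypothetical estimates and outcomes for six manholes.
probs = [0.30, 0.02, 0.25, 0.10, 0.05, 0.01]
outcomes = [1, 0, 0, 1, 0, 0]

points = precision_recall_points(probs, outcomes)
for n, precision, recall in points:
    print(f"N={n}  precision={precision:.2f}  recall={recall:.2f}")
```

Reading off a single row gives precision-at-N for a fixed maintenance budget of N manholes; plotting precision against recall over all rows gives the precision-recall curve.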

### About the Exploding Manholes Project

I’ve been hanging out at Columbia’s Center for Computational Learning Systems (CCLS) talking to Becky Passonneau, Haimonti Dutta, Ashish Tomar, and crew about their Con-Ed project of predicting certain kinds of events like exploding manholes. They built a non-parametric regression model using Bayesian additive regression trees (BART) with a fair amount of data and many features as predictors.

I just wrote a post on Andrew Gelman’s blog that’s related to issues they were having with diagnosing convergence.

But the real problem is that all the predictions are below 0.5 for manholes exploding and the like. So simple 0/1 loss just fails. I thought the histograms of residuals looked fishy until it dawned on me that it actually makes sense for all the predictions to be below 0.5 in this situation.

### Moral of the Story

0/1 loss is not your real friend. Decision theory is.

### The Lottery Paradox

This whole discussion reminds me of the lottery “paradox”. Each ticket holder is very unlikely to win a lottery, but one of them will win. The “paradox” arises from the inconsistency of the conjunction of beliefs that each person will lose and the belief that someone will win.

Oh, no! Henry Kyburg died in 2007. He was a great guy and decades ahead of his time. He was one of my department’s faculty review board members when I was at CMU. I have a paper in a book he edited from the 80s when we were both working on default logics.

June 15, 2012 at 12:43 am

Bob, great post. Can you post references to some Bayesian decision theory sources? Articles or fundamental texts would be great. The BART paper linked from your Statistical Modeling post seems a bit specialized.

Thanks!

June 15, 2012 at 1:27 pm

BART just does the Bayesian posterior inference. The classic reference for Bayesian decision theory is

James O. Berger. Statistical Decision Theory and Bayesian Analysis. 2nd Edition. Springer.

Some of the fundamental motivations for Bayesian statistics are decision theoretic, in the sense that if you don’t follow the proper Bayesian inferences you can be taken advantage of in a betting context, i.e., they can make Dutch book against you (ed. whenever “Dutch” appears as an adjective in English, you know it’s something distasteful; cf. “Dutch uncle”, “Dutch date”, “double Dutch”).

For a quick intro to what makes Bayesian stats Bayesian, along with an overview of the inferential apparatus, see my earlier post, What is Bayesian Statistical Inference?.

You can, of course, use decision theory in a frequentist setting or even in non-probabilistic settings with the right formulation.

June 15, 2012 at 7:15 pm

What about Dennis Lindley’s textbook Making Decisions? It is less mathematically rigorous, since it is aimed at a more general audience, but it is a more readable intro for those without a strong stats background. The Berger text is the most complete, rigorous text available, but it can be intimidating on a first read.

June 15, 2012 at 11:41 pm

I’d agree that Berger could be intimidating if you don’t already know math stats reasonably well. I actually found it useful in that regard — it was one of the first Bayesian stats books I studied.

There’s a short discussion in Bishop’s Pattern Recognition and Machine Learning, which is easier than Berger (and much shorter; it’s one introductory section). I also looked up the discussion of decision theory in MacKay’s info theory and machine learning book: there are two pages of examples after a discussion and a dismissal of the topic as “trivial” (though not unimportant, of course; it’s just that it follows pretty directly from everything else).

I don’t know Lindley’s book, but Lindley did some fundamental research in Bayesian stats. On the more philosophical side, he introduced what is now known as Lindley’s Paradox (Wikipedia).

It looks like Lindley also has a book for the general public called Understanding Uncertainty.

There are really two courses of study. There’s the whole philosophy of Bayesian statistics side, which is often motivated decision-theoretically. This is both about the philosophy of science and reasoning in general and about human reasoning and epistemology (this is where Kyburg is relevant).

Then there’s the actual apparatus to carry out model fitting given some notion of loss that you want to apply. Berger’s book covers both, but it’s basically a math book with some background philosophy. I assume Lindley would cover both the math and the philosophy — he was active on both sides.