Archive for the ‘Statistics’ Category

All Bayesian Models are Generative (in Theory)

May 23, 2013

[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix X and outcome vector y. In regression modeling, we assume a coefficient vector \beta and explicitly model the outcome likelihood p(y|\beta,X) and coefficient prior p(\beta). But what about X? If we were modeling X, we’d have a full likelihood p(X,y|\beta,\psi), where \psi are the additional parameters involved in modeling X and the prior is now joint, p(\beta,\psi).

So how do we justify conditional modeling as Bayesians? We simply assume that \psi and \beta have independent priors, so that

p(\beta,\psi) = p(\beta) \times p(\psi).

The posterior then neatly factors as

p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|X,y).

Looking at just the inference for the regression coefficients \beta, we have the familiar expression

p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X).

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for \psi. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given k others than a joint distribution on all k+1 variables.

We knew that all along.

Generative vs. Discriminative; Bayesian vs. Frequentist

April 12, 2013

[There's now a followup post, All Bayesian models are generative (in theory).]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

Frequentist Bayesian
Discriminative p(z ; w, β) p(z, β ; w) = p(z | β ; w) * p(β)
Generative p(z, w ; β) p(z, w, β) = p(z, w | β) * p(β)

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

Mean-Field Variational Inference Made Easy

March 25, 2013

I had the hardest time trying to understand variational inference. All of the presentations I’ve seen (MacKay, Bishop, Wikipedia, Gelman’s draft for the third edition of Bayesian Data Analysis) are deeply tied up with the details of a particular model being fit. I wanted to see the algorithm and get the big picture before being overwhelmed with multivariate exponential family gymnastics.

Bayesian Posterior Inference

In the Bayesian setting (see my earlier post, What is Bayesian Inference?), we have a joint probability model p(y,\theta) for data y and parameters \theta, usually factored as the product of a likelihood and prior term, p(y,\theta) = p(y|\theta) p(\theta). Given some observed data y, Bayesian predictive inference is based on the posterior density p(\theta|y) \propto p(\theta, y) of the the unknown parameter vector \theta given observed data vector y. Thus we need to be able to estimate the posterior density p(\theta|y) to carry out Bayesian inference. Note that the posterior is a whole density function—we’re not just after a point estimate as in maximum likelihood estimation.

Mean-Field Approximation

Variational inference approximates the Bayesian posterior density p(\theta|y) with a (simpler) density g(\theta|\phi) parameterized by some new parameters \phi. The mean-field form of variational inference factors the approximating density g by component of \theta = \theta_1,\ldots,\theta_J, as

g(\theta|\phi) = \prod_{j=1}^J g_j(\theta_j|\phi_j).

I’m going to put off actually defining the terms g_j until we see how they’re used in the variational inference algorithm.

What Variational Inference Does

The variational inference algorithm finds the value \phi^* for the parameters \phi of the approximation which minimizes the Kullback-Leibler divergence of g(\theta|\phi) from p(\theta|y),

\phi^* = \mbox{arg min}_{\phi} \ \mbox{KL}[ g(\theta|\phi) \ || \ p(\theta|y) ].

The key idea here is that variational inference reduces posterior estimation to an optimization problem. Optimization is typically much faster than approaches to posterior estimation such as Markov chain Monte Carlo (MCMC).

The main disadvantage of variational inference is that the posterior is only approximated (though as MacKay points out, just about any approximation is better than a delta function at a point estimate!). In particular, variational methods systematically underestimate posterior variance because of the direction of the KL divergence that is minimized. Expectation propagation (EP) also converts posterior fitting to optimization of KL divergence, but EP uses the opposite direction of KL divergence, which leads to overestimation of posterior variance.

Variational Inference Algorithm

Given the Bayesian model p(y,\theta), observed data y, and functional terms g_j making up the approximation of the posterior p(\theta|y), the variational inference algorithm is:

  • \phi \leftarrow \mbox{random legal initialization}
  • \mbox{repeat}
    • \phi_{\mbox{\footnotesize old}} \leftarrow \phi
    • \mbox{for } j \mbox{ in } 1:J
      • \mbox{{} \ \ set } \phi_j \mbox{ such that } g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)].
  • \mbox{until } ||\phi - \phi_{\mbox{\footnotesize old}}|| < \epsilon

The inner expectation is a function of \theta_j returning a single non-negative value, defined by

\mathbb{E}_{g_{-j}}[\log p(\theta|y)]

\begin{array}{l}  \mbox{ } = \int_{\theta_1} \ldots \int_{\theta_{j-1}} \int_{\theta_{j+1}} \ldots \int_{\theta_J}  \\[8pt] \mbox{ } \hspace*{0.4in}  g(\theta_1|\phi_1) \times \cdots \times g(\theta_{j-1}|\phi_{j-1}) \times  g(\theta_{j+1}|\phi_{j+1}) \times \cdots \times g(\theta_J|\phi_J)  \times  \log p(\theta|y)  \\[8pt] \mbox{ } \hspace*{0.2in}  d\theta_J \cdots d\theta_{j+1} \ d\theta_{j-1} \cdots d\theta_1  \end{array}

Despite the suggestive factorization of g and the coordinate-wise nature of the algorithm, variational inference does not simply approximate the posterior marginals p(\theta_j|y) independently.

Defining the Approximating Densities

The trick is to choose the approximating factors so that the we can compute parameter values \phi_j such that g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)]. Finding such approximating terms g_j(\theta_j|\phi_j) given a posterior p(\theta|y) is an art form unto itself. It’s much easier for models with conjugate priors. Bishop or MacKay’s books and the Wikipedia present calculations for a wide range of exponential-family models.

What if My Model is not Conjugate?

Unfortunately, I almost never work with conjugate priors (and even if I did, I’m not much of a hand at exponential-family algebra). Therefore, the following paper just got bumped to the top of my must understand queue:

It’s great having Dave down the hall on sabbatical this year — one couldn’t ask for a better stand in for Matt Hoffman. They are both insanely good at on-the-fly explanations at the blackboard (I love that we still have real chalk and high quality boards).

Bayesian Inference for LDA is Intractable

February 18, 2013

Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

and on computational intractability, which is a bit more involved than a program being slow,

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.

Upgrading from Beta-Binomial to Logistic Regression

October 30, 2012

Bernoulli Model

Consider the following very simple model of drawing the components of a binary random N-vector y i.i.d. from a Bernoulli distribution with chance of success theta.

data {
  int N;  // number of items
  int y[N];  // binary outcome for item i
}
parameters {
  real theta;  // Prob(y[n]=1) = theta
}
model {
  theta ~ beta(2,2); // (very) weakly informative prior
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

The beta distribution is used as a prior on theta. This is the Bayesian equivalent to an “add-one” prior. This is the same model Laplace used in the first full Bayesian analysis (or as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model male-vs.-female birth ratio in France, with N being the number of samples and y[n] = 1 if the baby was male and 0 if female.

Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

model {
  theta ~ beta(2,2);
  sum(y) ~ binomial(N,theta);
}

But it doesn’t generalize so well. What we want to do is let the prediction of theta vary by predictors (“features” in NLP speak, covariates to some statisticians) of the items n.

Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

data {
  int N;
  int y[N];
}
parameters {
  real alpha;  // inv_logit(alpha) = Prob(y[n]=1)
}
model {
  alpha ~ normal(0,5);  // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha));
}

Recall that the logistic sigmoid (inverse of the logit, or log odds function) maps

\mbox{logit}^{-1}:(-\infty,\infty)\rightarrow(0,1)

by taking

\mbox{logit}^{-1}(u) = 1 / (1 + \mbox{exp}(-u)).

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.

Adding Features

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

data {
  int N;  // number of items
  int y[N];  // binary outcome for item n
  real x[N];  // predictive feature for item n
}
parameters {
  real alpha;  // intercept
  real beta;  // slope
}
model {
  alpha ~ normal(0,5);  // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}

Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see:

Stan Home Page:  http://mc-stan.org/

Stan’s not a competitor for LingPipe, by the way. Stan scales well for full Bayesian inference, but doesn’t scale like LingPipe’s SGD-based point estimator for logistic regression. And Stan doesn’t do structured models like HMMs or CRFs and has no language-specific features like tokenizers built in. As I’ve said before, it’s horses for courses.

High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gather NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values — it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

Kappa for varying prevalences and accuracies

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy run from 0 to 1 and kappa from 0 to 1. Here’s the whole plot for comparison.

Kappa for varying prevalences and accuracies

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definitions of kappa, to result in the following definition in R to compute expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracies.

kappa_fun = function(prev,acc) {
  agr = acc^2 + (1 - acc)^2;
  cat1 = acc * prev + (1 - acc) * (1 - prev);
  e_agr = cat1^2 + (1 - cat1)^2;
  return((agr - e_agr) / (1 - e_agr));
}

Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

  Cat1 Cat2
Cat1 170 90
Cat2 90 650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus one minus the accuracy squared (both annotators are wrong and hence agree — two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus one minus accuracy times one minus prevalence (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the prevalences. So that’s just the probability of category 1 squared (both annotators respond category 1) and the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.

Back to the plot. The white border is set at .66, the lower-end threshold established by Krippendorf for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorf was 0.8, and is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
  for (n in 1:(K + 1)) {
    prevalence[pos] = (m - 1) / K;
    accuracy[pos] = (n - 1) / K;
    kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
    pos = pos + 1;
  }
}
library("ggplot2");
df = data.frame(prevalence=prevalence,
                accuracy=accuracy,
                kappa=kappa);
kappa_plot = 
  ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
     labs(title = "Kappas for Binary Classification\n") +
     geom_tile() +
     scale_x_continuous(expand=c(0,0), 
                        breaks=c(0,0.25,0.5,0.75,1),
                        limits =c(0.5,1)) +
     scale_y_continuous(expand=c(0,0), 
                        breaks=seq(0,10,0.1), 
                        limits=c(0.85,1)) +
     scale_fill_gradient2("kappa", limits=c(0,1), midpoint=0.66,
                          low="orange", mid="white", high="blue",
                          breaks=c(1,0.8,0.66,0));

More on the Terminology of Averages, Means, and Expectations

June 21, 2012

I missed Cosma Shalizi’s comment on my first post on averages versus means. Rather than write a blog-length reply, I’m pulling it out into its own little lexicographic excursion. To borrow stylistically from Cosma’s blog, I’ll warn you, the reader, that this post is more linguistics than statistics.

Three Concepts and Terminology

Presumably everyone’s clear on the distinctions among the three concepts,

1. [arithmetic] sample mean,

2. the mean of a distribution, and

3. the expectation of a random variable.

The relations among these concepts is very rich, which is what I conjecture is causing their conflation.

Let’s set aside the discussion of “average”, as it’s less precise terminologically. But even the very precision of the term “average” is debatable! The Random House Dictionary lists the meaning of “average” in statistics as the arithmetic mean. Wikipedia, on the other hand, agrees with Cosma, and lists a number of statistics of centrality (sample mean, sample median, etc.) as being candidates for the meaning of “average”.

Distinguishing Means and Sample Means

Getting to the references Cosma asked for, all of my intro textbooks (Ash, Feller, Degroot and Schervish, Larsen and Marx) distinguished sense (1) from senses (2) and (3). Even the Wikipedia entry for "mean" leads off with

In statistics, mean has two related meanings:

the arithmetic mean (and is distinguished from the geometric mean or harmonic mean).

the expected value of a random variable, which is also called the population mean.

Unfortunately, the invocation of the population mean here is problematic. Random variables aren’t intrinsically related to populations in any way (at least under the Bayesian conception of what can be modeled as random). Populations can be characterized by a set of (conditionally) independent and identically distributed (i.i.d.) random variables, each corresponding to a measureable quantity of a member of the population. And of course, averages of random variables are themselves random variables.

This reminds me to the common typological mistake of talking about “sampling from a random variable” (follow the link for Google hits for the phrase).

Population Means and Empirical Distributions

The Wikipedia introduces a fourth concept, population mean, which is just the arithmetic mean of a given population. This is related to the point Cosma brought up in his comment that you can think of a sample mean as the mean of a distribution with the same distribution as the empirically observed distribution. For instance, if you observe three heads and a tail in three coin flips, you create a discrete random variable X with p_X(1) = 0.75 and p_X(0) = 0.25, then the average number of heads is equal to the expectation of X or the mean of p_X.

Conflating Means and Expectations

I was surprised that like the Wikipedia, almost all the sources I consulted explicitly conflated senses (2) and (3). Feller’s 1950 Introduction to Probability Theory and Applications, Vol 1 says the following on page 221.

The terms mean, average, and mathematical expectation are synonymous. We also speak of the mean of a distribution instead of referring to a corresponding random variable.

The second sentence is telling. Distributions have means independently of whether we’re talking about a random variable or not. If one forbids talk of distributions as first-class objects with their own existence free of random variables, one might argue that concepts (2) and (3) should always be conflated.

Metonomy and Lexical Semantic Coercion

I think the short story about what’s going on in conflating (2) and (3) is metonymy. For example, I can use “New York” to refer to the New York Yankees or the city government, but no one will understand you if you try to use “New York Yankees” to refer to the city or the government. I’m taking one aspect of the team, namely its location, and using that to refer to the whole.

This can happen implicitly with other kinds of associations. I can talk about the “mean” of a random variable X by implicitly invoking its probability function p_X(x). I can also talk about the expectation of a distribution by implicitly invoking the appropriate random variable. Sometimes authors try to sidestep random variable notation by writing \mathbb{E}_{\pi(x)}[x], which to my type-sensitive mind appears ill-formed; what they really mean to write is \mathbb{E}[X] where p_X(x) = \pi(x).

I found it painfully difficult to learn statistics because of this sloppiness with respect to the types of things, especially among less careful authors. Bayesians, myself now included, often write x for both a random variable and a bound variable; see Gelman et al.’s Bayesian Data Analysis, for a typical example of this style.

Settling down into my linguistic armchair, I’ll conclude by noting that it seems to me more felicitous to say

the expectation of a random variable is the mean of its distribution.

than to say

the mean of a random variable is the expectation of its distribution.

0/1 Loss Meaningless for Predicting Rare Events such as Exploding Manholes

June 14, 2012

[Update: 19 June 2012: Becky just wrote me to clarify which tools they were using for what (quoted with permission, of course -- thanks, Becky):

... we aren't using BART to rank structures, we use an independently learned ranked list to bin the structures before we apply BART. We use BART to do a treatment analysis where the y values represent whether there was an event, then we compute the role that the treatment variable plays in the prediction. Here's a journal paper that describes our initial ranking method

http://www.springerlink.com/content/3034h0j334211484/

and the pre-publication version

http://www1.ccls.columbia.edu/%7Ebeck/pubs/ConedPaperRevision-v5.pdf

The algorithm for doing the ranking was modified a few years ago, and now Cynthia is taking a new approach that uses survival analysis.]

Rare Events

Let’s suppose you’re building a model to predict rare events, like manhole explosions in the Con-Ed system in New York (this is the real case at hand — see below for more info). For a different example, consider modeling the probability of a driver getting into a traffic accident in the next week. The problem with both of these situations is that even with all the predictors in hand (last maintenance, number of cables, voltages, etc. in the Con-Ed case; driving record, miles driven, etc. in the driving case), the estimated probability for any given manhole exploding (any person getting into an accident next week) is less than 50%.

The Problem with 0/1 Loss

A typical approach in machine learning in general, and particularly in NLP, is to use 0/1 loss. This forces the system to make a simple yea/nay (aka 0/1) prediction for every manhole about whether it will explode in the next year or not. Then we compare those predictions to reality, assigning a loss of 1 if you predict the wrong outcome and 0 if you predict correctly, then summing these losses over all manholes.

The way to minimize expected loss is to predict 1 if the probability estimate of failure is greater than 0.5 and 0 otherwise. If all of the probability estimates are below 0.5, all predictions are 0 (no explosion) for every manhole. Consequently, the loss is always the number of explosions. Unfortunately, this is the best you can do if your loss is 0/1 and you have to make 0/1 predictions.

So we’ve minimized 0/1 loss and in so doing created a useless 0/1 classifier.

A Hacked Threshold?

There’s something fishy about a classifier that returns all 0 predictions. Maybe we can adjust the threshold for predicting explosions below 0.5. Equivalently, for 0/1 classification purposes, we could rescale the probability estimates.

Sure, it gives us some predicted explosions, but the result is a non-optimal 0/1 classifier. The reason it’s non-optimal in 0/1-loss terms is that each prediction of an explosion is likely to be wrong, but in aggregate some of them will be right.

It’s not a 0/1 Classification Problem

The problem in 0/1 classification arises from converting estimates of explosion of less than 50% per manhole to 0/1 predictions minimizing expected loss.

Suppose our probability estimates are close, at least in the sense that for any given manhole there’s only a very small chance it’ll explode no matter what its features are.

Some manholes do explode and the all-0 predictions are wrong for every exploding manhole.

What Con-Ed really cares about is finding the most at-risk properties in its network and supplying them maintenance (as well as understanding what the risk factors are). This is a very different problem.

A Better Idea

Take the probabilities seriously. If your model predicts a 10% chance of explosion for each of 100 manholes, you expect to see 10 explosions. You just don’t know which of the 100 manholes they’ll be. You can measure these marginal predictions (number of predicted explosions) to gauge how accurate your model’s probability estimates are.

We’d really like a general evaluation that will measure how good our probability estimates are, not how good our 0/1 predictions are. Log loss does just that. Suppose you have N outcomes y_1,\ldots,y_N with corresoponding predictors (aka features), x_1,\ldots,x_N, and your model has parameter \theta. The log loss for parameter (point) estimate \hat{\theta} is

      {\mathcal L}(\hat{\theta}) - \sum_{n=1}^N \, \log \, p(y_n|\hat{\theta};x_n)

That is, it’s the negative log probability (the negative turns gain into loss) of the actual outcomes given your model; the summation is called the log likelihood when viewed as a function of \theta, so log loss is really just the negative log likelihood. This is what you want to optimize if you don’t know anything else. And it’s exactly what most probabilistic estimators optimize for classifiers (e.g., logistic regression, BART [see below]).

Decision Theory

The right thing to do for the Con-Ed case is to break out some decision theory. We can assign weights to various prediction/outcome pairs (true positive, false positive, true negative, false negative), and then try to optimize weights. If there’s a huge penalty for a false negative (saying there won’t be an explosion when there is), then you are best served by acting on low-probability information, such as servicing even low-probability manholes. For example, if there is a $100 cost for a manhole blowing up and it costs $1 to service a manhole so it doesn’t blow up, then even a 1% chance of blowing up is enough to send out the service team.

We haven’t changed the model’s probability estimates at all, just how we act on them.

In Bayesian decision theory, you choose actions to minimize expected loss conditioned on the data (i.e., optimize expected outcomes based on the posterior predictions of the model).

Ranking-Based Evaluations

Suppose we sort the list of manholes in decreasing order of estimated probability of explosion. We can line this up with the actual outcomes. Good system performance is reflected in having the actual explosions ranked high on the list.

Information retrieval supplies a number of metrics for this kind of ranking. The thing I like to see for this kind of application is a precision-recall curve. I’m not a big fan of single-number evaluations like mean average precision, though precision-at-N makes sense in some cases, such as if Con-Ed had a fixed maintenance budget and wanted to know how many potentially exploding manholes it could service.

There’s a long description of these kind evaluations in

Just remember there’s noise in this received curves and that picking an optimal point on them is unlikely to produce such good behavior on held-out data.

With good probability estimates for the events you will get good rankings (there’s a ton of theory around this I’ve never studied).

About the Exploding Manholes Project

I’ve been hanging out at Columbia’s Center for Computational Learning Systems (CCLS) talking to Becky Passonneau, Haimanti Dutta, Ashish Tomar, and crew about their Con-Ed project of predicting certain kinds of events like exploding manholes. They built a non-parametric regression model using Bayesian additive regression trees with a fair amount of data and many features as predictors.

I just wrote a blog post on Andrew Gelman’s blog that’s related to issues they were having with diagnosing convergence:

But the real problem is that all the predictions are below 0.5 for manholes exploding and the like. So simple 0/1 loss just fails. I thought the histograms of residuals looked fishy until it dawned on me that it actually makes sense for all the predictions to be below 0.5 in this situation.

Moral of the Story

0/1 loss is not your real friend. Decision theory is.

The Lottery Paradox

This whole discussion reminds me of the lottery “paradox”. Each ticket holder is very unlikely to win a lottery, but one of them will win. The “paradox” arises from the inconsistency of the conjunction of beliefs that each person will lose and the belief that someone will win.

Oh, no! Henry Kyburg died in 2007. He was a great guy and decades ahead of his time. He was one of my department’s faculty review board members when I was at CMU. I have a paper in a book he edited from the 80s when we were both working on default logics.

Computing Autocorrelations and Autocovariances with Fast Fourier Transforms (using Kiss FFT and Eigen)

June 8, 2012

[Update 8 August 2012: We found that for KissFFT if the size of the input is not a power of 2, 3, and 5, then things really slow down. So now we're padding the input to the next power of 2.]

[Update 6 July 2012: It turns out there's a slight correction needed to what I originally wrote. The correction is described on this page:

I'm fixing the presentation below to reflect the correction. The change is also reflected in the updated Stan code using Eigen, but not the updated Kiss FFT code.]

Suppose you need to compute all the sample autocorrelations for a sequence of observations

x = x[0],...,x[N-1]

The most efficient way to do this is with the discrete fast fourier transform (FFT) and its inverse; it’s {\mathcal O}(N \log N) versus {\mathcal O}(N^2) for the naive approach. That much I knew. I had both experience with Fourier transforms from my speech reco days (think spectograms) and an understanding of the basic functional analysis principles. I didn’t know how to code it given an FFT library. The web turned out to be not much help — the explanations I found were all over my head complex-analysis-wise and I couldn’t find simple code examples.

Matt Hoffman graciously volunteered to give me a tutorial and wrote an initial prototype. It turns out to be really really simple once you know which way ’round the parts all go.

Autocorrelations via FFT

Conceptually, the input N-vector x is the time vector and the autocorrelations will be the frequency vector. Here’s the algorithm:

  1. create a centered version of x by setting x_cent = x / mean(x);
  2. pad x_cent at the end with entries of value 0 to get a new vector of length L = 2^ceil(log2(N));
  3. run x_pad through a forward discrete fast fourier transform to get an L-vector z of complex values;
  4. replace the entries in z with their norms (the norm of a complex number is the real number resulting of summing the squared real component and squared imaginary component).
  5. run z through the inverse discrete FFT to produce an L-vector acov of (unadjusted) autocovariances;
  6. trim acov to size N;
  7. create a L-vector named mask consisting of N entries with value 1 followed by L-N entries with value 0;
  8. compute the forward FFT of mask and put the result in the L-vector adj
  9. to get adjusted autocovariance estimates, divide each entry acov[n] by norm(adj[n]), where norm is the complex norm defined above; and
  10. to get autocorrelations, set acorr[n] = acov[n] / acov[0] (acov[0], the autocovariance at lag 0, is just the variance).

The autocorrelation and autocovariance N-vectors are returned as acorn and acov respectively.

It’s really fast to do all of them in practice, not just in theory.

Depending on the FFT function you use, you may need to normalize the output (see the code sample below for Stan). Run a test case and make sure that you get the right ratios of values out in the end, then you can figure out what the scaling needs to be.

Eigen and Kiss FFT

For Stan, we started out with a direct implementation based on Kiss FFT.

  • Stan’s original Kiss FFT-based source code (C/C++) [Warning: this function does not have the correction applied; see the current Stan code linked below for an example]

At Ben Goodrich’s suggestion, I reimplemented using the Eigen FFT C++ wrapper for Kiss FFT. Here’s what the Eigen-based version looks like:

As you can see from this contrast, nice C++ library design can make for very simple work on the front end.

Hat’s off to Matt for the tutorial, Thibauld Nion for the nice blog post on the mask-based correction, Kiss FFT for the C implementation, and Eigen for the C++ wrapper.

Averages vs. Means (vs. Expectations)

May 29, 2012

Averages

Averages are statistics calculated over a set of samples. If you have a set of samples x = x_1,\ldots,x_N, their average, often written \bar{x}, is defined by

\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n.

Means

Means are properties of distributions. If p(x) is a discrete probability mass function over the natural numbers \mathbb{N}, then its mean is defined by

\sum_{x \in \mathbb{N}} \, x \times p(x).

If p(x) is a continuous probability density function over the real numbers \mathbb{R}, then its mean, if it exists, is defined by

\int_{\mathbb{R}} \, x \times p(x) \, dx.

This also shows how summations over discrete probability functions, \sum_{x \in \mathbb{N}} relate to integrals over continuous probability functions, \int_{\mathbb{R}} dx. (Distributions can also be mixed, like spike and slab priors, but the math gets more complicated due to the need to unify the notion of summation and integration.)

Expectations

To confuse matters further, there are expectations. Expectations are properties of (some) random varaibles. The expectation of a random variable is the mean of its distribution. If X is a discrete random variable with probability mass function p(x), then its expectation is defined to be

\mathbb{E}[X] = \sum_{x \in \mathbb{N}} \, x \times p(x).

If X is a continuous random variable with probability density function p(x), then

\mathbb{E}[X] = \int_{\mathbb{R}} \, x \times p(x) \, dx.

Look familiar?

Sample Means

Samples don’t have means per se. They have averages. But sometimes the average is called the “sample mean”. Just to confuse things.

Averages as Estimates of the Mean

Gauss showed that the average of a set of independent, identically distributed (i.i.d.) samples from a distribution p(x) is a good estimate of the mean.

What’s good about the average as an estimator of the mean? First, it’s unbiased, meaning the expectation of the average of a set of i.i.d. samples from a distribution is the mean of the distribution. Second, it has the lowest expected mean square error among all estimators of the mean. That’s why everyone likes square error (that, and its convexity, which I discussed in a previous blog post on Mean square error, or why committees won the Netflix Prize).

What about Medians?

The median is a good estimator too. Laplace proved that it has the lowest expected absolute error among estimators (I just learned it was Laplace from the Wikipedia entry on median unbiased estimators). It’s also more robust to outliers.

More on Estimators

The Wikipedia page on estimators is a good place to start.

Of course, in Bayesian statistics, we’re more concerned with a full characterization of posterior uncertainty, not just a point (or even interval) estimate.

Summary

  • Means are properties of distributions.
  • Expectations are properties of random variables.
  • Averages or sample means are statistics calculated from samples.

Follow

Get every new post delivered to your Inbox.

Join 312 other followers