## Archive for the ‘Statistics’ Category

### Computing Autocorrelations and Autocovariances with Fast Fourier Transforms (using Kiss FFT and Eigen)

June 8, 2012

[Update 8 August 2012: We found that for KissFFT if the size of the input is not a power of 2, 3, and 5, then things really slow down. So now we’re padding the input to the next power of 2.]

[Update 6 July 2012: It turns out there’s a slight correction needed to what I originally wrote. The correction is described on this page:

I’m fixing the presentation below to reflect the correction. The change is also reflected in the updated Stan code using Eigen, but not the updated Kiss FFT code.]

Suppose you need to compute all the sample autocorrelations for a sequence of observations

x = x[0],...,x[N-1]

The most efficient way to do this is with the discrete fast fourier transform (FFT) and its inverse; it’s ${\mathcal O}(N \log N)$ versus ${\mathcal O}(N^2)$ for the naive approach. That much I knew. I had both experience with Fourier transforms from my speech reco days (think spectograms) and an understanding of the basic functional analysis principles. I didn’t know how to code it given an FFT library. The web turned out to be not much help — the explanations I found were all over my head complex-analysis-wise and I couldn’t find simple code examples.

Matt Hoffman graciously volunteered to give me a tutorial and wrote an initial prototype. It turns out to be really really simple once you know which way ’round the parts all go.

### Autocorrelations via FFT

Conceptually, the input N-vector x is the time vector and the autocorrelations will be the frequency vector. Here’s the algorithm:

1. create a centered version of x by setting x_cent = x / mean(x);
2. pad x_cent at the end with entries of value 0 to get a new vector of length L = 2^ceil(log2(N));
3. run x_pad through a forward discrete fast fourier transform to get an L-vector z of complex values;
4. replace the entries in z with their norms (the norm of a complex number is the real number resulting of summing the squared real component and squared imaginary component).
5. run z through the inverse discrete FFT to produce an L-vector acov of (unadjusted) autocovariances;
6. trim acov to size N;
7. create a L-vector named mask consisting of N entries with value 1 followed by L-N entries with value 0;
8. compute the forward FFT of mask and put the result in the L-vector adj
9. to get adjusted autocovariance estimates, divide each entry acov[n] by norm(adj[n]), where norm is the complex norm defined above; and
10. to get autocorrelations, set acorr[n] = acov[n] / acov[0] (acov[0], the autocovariance at lag 0, is just the variance).

The autocorrelation and autocovariance N-vectors are returned as acorn and acov respectively.

It’s really fast to do all of them in practice, not just in theory.

Depending on the FFT function you use, you may need to normalize the output (see the code sample below for Stan). Run a test case and make sure that you get the right ratios of values out in the end, then you can figure out what the scaling needs to be.

### Eigen and Kiss FFT

For Stan, we started out with a direct implementation based on Kiss FFT.

• Stan’s original Kiss FFT-based source code (C/C++) [Warning: this function does not have the correction applied; see the current Stan code linked below for an example]

At Ben Goodrich’s suggestion, I reimplemented using the Eigen FFT C++ wrapper for Kiss FFT. Here’s what the Eigen-based version looks like:

As you can see from this contrast, nice C++ library design can make for very simple work on the front end.

Hat’s off to Matt for the tutorial, Thibauld Nion for the nice blog post on the mask-based correction, Kiss FFT for the C implementation, and Eigen for the C++ wrapper.

May 29, 2012

### Averages

Averages are statistics calculated over a set of samples. If you have a set of samples $x = x_1,\ldots,x_N$, their average, often written $\bar{x}$, is defined by

$\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$.

### Means

Means are properties of distributions. If $p(x)$ is a discrete probability mass function over the natural numbers $\mathbb{N}$, then its mean is defined by

$\sum_{x \in \mathbb{N}} \, x \times p(x)$.

If $p(x)$ is a continuous probability density function over the real numbers $\mathbb{R}$, then its mean, if it exists, is defined by

$\int_{\mathbb{R}} \, x \times p(x) \, dx$.

This also shows how summations over discrete probability functions, $\sum_{x \in \mathbb{N}}$ relate to integrals over continuous probability functions, $\int_{\mathbb{R}} dx$. (Distributions can also be mixed, like spike and slab priors, but the math gets more complicated due to the need to unify the notion of summation and integration.)

### Expectations

To confuse matters further, there are expectations. Expectations are properties of (some) random varaibles. The expectation of a random variable is the mean of its distribution. If $X$ is a discrete random variable with probability mass function $p(x)$, then its expectation is defined to be

$\mathbb{E}[X] = \sum_{x \in \mathbb{N}} \, x \times p(x)$.

If $X$ is a continuous random variable with probability density function $p(x)$, then

$\mathbb{E}[X] = \int_{\mathbb{R}} \, x \times p(x) \, dx$.

Look familiar?

### Sample Means

Samples don’t have means per se. They have averages. But sometimes the average is called the “sample mean”. Just to confuse things.

### Averages as Estimates of the Mean

Gauss showed that the average of a set of independent, identically distributed (i.i.d.) samples from a distribution $p(x)$ is a good estimate of the mean.

What’s good about the average as an estimator of the mean? First, it’s unbiased, meaning the expectation of the average of a set of i.i.d. samples from a distribution is the mean of the distribution. Second, it has the lowest expected mean square error among all estimators of the mean. That’s why everyone likes square error (that, and its convexity, which I discussed in a previous blog post on Mean square error, or why committees won the Netflix Prize).

The median is a good estimator too. Laplace proved that it has the lowest expected absolute error among estimators (I just learned it was Laplace from the Wikipedia entry on median unbiased estimators). It’s also more robust to outliers.

### More on Estimators

Of course, in Bayesian statistics, we’re more concerned with a full characterization of posterior uncertainty, not just a point (or even interval) estimate.

### Summary

• Means are properties of distributions.
• Expectations are properties of random variables.
• Averages or sample means are statistics calculated from samples.

May 18, 2012

### From the Emailbox

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

### Chunker Evaluation

The chunker evaluations built into LingPipe calculate the usual of precision and recall measures (see below). These evaluations compare a set of response chunkings to a set of reference chunkings. Usually the reference is drawn from a gold-standard corpus and the response from an automated system built to do chunking.

Precision (aka positive predictive accuracy) measures the proportion of chunks in the response that are also in the reference. Recall (aka sensitivity) measures the proportion of chunks in the reference that are in the response. If we swap the reference and response chunkings, we swap precision and recall.

True negatives aren’t really being counted here — theoretically there are a huge number of them — any possible span with any possible tag could have been labeled. LingPipe just sets the true negative count to zero, and as a result, specificity (TN/[TN+FP]) doesn’t make sense.

### Interannotator Agreement

Suppose you have chunkings from two human annotators. Just treat one as the reference and one as the response and run a chunking evaluation. The precision and recall values will tell you which annotator returned more chunkings. For instance, if precision is .95 and recall .75, you know that the annotator assigned as the reference chunking had a whole bunch of chunks the other annotator didn’t think were chunks, but most of the chunks found by the response annotator were also chunks of the reference annotator.

You can use F-measure as an overall single-number score.

The base metrics are all explained in

and their application to chunking in

Examples of running chunker evaluations can be found in

### LingPipe Annotation Tool

If you’re annotating entity data, you might be interested in our learn-a-little, tag-a-little tool.

Now that Mitzi’s brought it up to compatibility with LingPipe 4, we should move citationEntities out of the sandbox and into a tutorial.

March 12, 2012

### Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

### LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.

The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs can be engineered to outperform the rescoring chunker with good feature engineering.

A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.

### It Works!

Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).

### That’s not Cricket!

According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.

### So What?

If we can train a system to performa at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?

We have something better — we can train a system to match Mitzi’s annotation!

In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.

The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.

### Error Analysis and Fixing Inconsistencies

Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.

What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back to correct to do it all consistently to include all of the time information in a phrase (makes sense to me).

After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!

### That’s also not Cricket!

The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.

So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.

On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, then change the coding standard to make it more learnable, that seems OK to me.

### Go Forth and Multiply

Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools

When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC 6 entity corpus.

Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.

### Settles (2011): Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

February 23, 2012

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

### The Big Picture

The easiest way to see what’s going on is with a screenshot of DUALIST, the system on which the paper is based:

It’s basically a tag-a-little, learn-a-little annotation tool for classifiers. I wrote something along these lines for chunk tagging (named entities, etc.) — you can find it in the LingPipe sandbox project citationEntities (called that because I originally used it to zone bibliogrphies in docs, citations in bibliographies and fields in citations). Mitzi just brought it up to date with the current LingPipe and generalized it for some large multi-part document settings.

In DUALIST, users provide two kinds of input:

1. category classifications for documents
2. words associated with categories

The left-hand-side of the interface presents a scrolling list of documents, with buttons for categories. There are then columns for categories with words listed under them. Users can highlight words in the lists that they believe are associated with the category. They may also enter new words that don’t appear on the lists.

Settles points out a difficult choice in the design. If you update the underlying model after every user choice, the GUI items are going to rearrange themselves. Microsoft tried this with Word, etc., arranging menus by frequency of use, and I don’t think anyone liked it. Constancy of where something’s located is very important. So what he did was let the user mark up a bunch of choices of categories and words, then hit the big submit button at the top, which would update the model. I did roughly the same thing with our chunk annotation interface.

There’s always a question in this kind of design whether to pre-populate the answers based on the model’s guesses (as far as I can tell, DUALIST does not pre-populate answers). Pre-populating answers makes the user’s life easier in that if the system is halfway decent, there’s less clicking. But it raises the possibility of bias, with users just going with what the system suggests without thinking too hard.

### Naive-Bayes Classifier

The underlying classification model is naive Bayes with a Dirichlet prior. Approximate inference is carried out using a maximum a posteriori (MAP) estimate of parameters. It’s pretty straightforward to implement naive Bayes this in a way that’s fast enough to use in this setting. The Dirichlet is conjugate to the multinomial so the posteriors are analytically tractable and the sufficient statistics are just counts of documents in each category and the count of words in documents of each category.

### The Innovations

The innovation here is twofold.

The first innovation is that Settles uses EM to create a semi-supervised MAP estimate. As we’ve said before, it’s easy to use EM or some kind of posterior sampling like Gibbs sampling over a directed graphical model with any subset of its parameters or labels being unknown. So technically, this is straightforward. But it’s still a really good idea. Although semi-supervised classifiers are very popular, I’ve never seen it used in this kind of active-learning tagging interface. I should probably add this to our chunking tagger.

The second (and in my opinion more important) innovation is in letting users single out some words as being important words in categories. The way this gets pushed through to the model is by setting the component of the Dirichlet prior corresponding to the word/category pair to a larger value. Settles fits this value using held-out data rather than rolling it into the model itself with a prior. The results seem oddly insensitive to it, which surprised me (but see below).

Gadzooks! Settles seems to be missing the single biggest tuning parameter typically applied to naive Bayes — the document length normalizer. Perhaps he did this because when you document-length normalize, you no longer have a properly generative model that corresponds to the naive Bayes paradigm. But it makes a huge difference.

The LingPipe EM tutorial uses naive Bayes and the same 20 Newsgroups corpus as (Nigam, McCallum and Mitchell 2000) used for evaluation, and I showed the effect of document length normalization is huge (the Nigam et al. article has been cited nearly 2000 times!). You can do way better than Nigam et al.’s reported results by setting the document length norm to a small value like 5. (What document length norm’s doing is trying to correct for the lack of covariance and overdispersion modeling in naive multinomial document model — it’s the same kind of shenanigans you see in speech recognition in weighting the acoustic and language models and the same trick I just saw Kevin Knight pull out during a talk last week about decoding encrypted documents and doing machine translation.)

I think one of the reasons that the setting of the prior for important words has so little effect (see the performance figures) is that all of the priors are too high. If Settles really is starting with the Laplace prior (aka add 1), then that’s already too big for naive Bayes in this setting. Even the uniform prior (aka add 0) is too big. We’ve found that we need very small (less than 1) prior parameters for word-in-topic models unless there’s a whole lot of data (and the quantity Settles is using hasn’t gotten there by a long shot — you need to get enough so that the low counts dominate the prior before the effect of the prior washes out, so we’re talking gigabytes of text, at least).

Also, this whole approach is not Bayesian. It uses point estimates. For a discussion of what a properly Bayesian version of naive Bayes would look like, check out my previous blog post, Bayesian Naive Bayes, aka Dirichlet Multinomial Classifiers. For a description of what it means to be Bayesian, see my post What is Bayesian Statistical Inference?.

### Confusion with Dirichlet-Multinomial Parameterization and Inference

There’s a confusion in the presentation of the Dirichlet prior and consequent estimation. The problem is that the prior parameter for a Dirichlet is conventionally the prior count (amount you add to the usual frequency counts) plus one. That’s why a prior of 1 is uniform (you add nothing to the frequency counts) and why a prior parameter of 2 corresponds to Laplace’s approach (add one to all frequency counts). The parameter is constrained to be positive, so what does a prior of 0.5 mean? It’s sort of like subtracting 1/2 from all the counts (sound familiar from Kneser-Ney LM smoothing?).

Now the maximum a posteriori estimate is just the estimate you get from adding the prior counts (parameter minus one) to all the empirical counts. It doesn’t even exist if the counts are less than 1, which can happen with Dirichlet parameter components that are less than 1. But Settles says he’s looking at the posterior expectation (think conditional expectation of parameters given data — the mean of the posterior distribution). The posterior average always exist (it has to given the bounded support here), but it requires you to add another one to all the counts.

To summarize, the mean of the Dirichlet distribution $\mbox{Dir}(\theta|\alpha)$ is

$\bar{\theta} = \alpha/(\sum_k \alpha_k)$,

whereas the maximum (or mode) is

$\theta^{*} = (\alpha - 1) / (\sum_k (\alpha_k - 1))$.

where the $-1$ is read componentwise, so $\alpha - 1 = (\alpha_1-1,\ldots,\alpha_K-1)$. This only exists if all $\alpha_k \geq 0$.

That’s why a parameter of 1 corresponds to the uniform distribution and why a parameter of 2 (aka Laplace’s “add-one” prior) is not uniform.

Settles says he’s using Laplace and taking the posterior mean (which, by the way, is the “Bayesian point estimate” (oxymoron warning) minimizing expected square loss). But that’s not right. If he truly adds 1 to the empirical frequency counts, then he’s taking the posterior average with a Laplace prior (which is not uniform). This is equivalent to the posterior mode with a prior parameter of 3. But it’s not equivalent to either the posterior mode or mean with a uniform prior (i.e., prior parameter of 1).

### Active Learning

Both documents and words are sorted by entropy-based active learning measures which the paper presents very clearly.

Documents are sorted by the conditional entropy of the category given the words in the model.

Word features are sorted by information gain, which is the reduction in entropy from the prevalence category distribution to the expected category distribution entropy conditoned on knowing the feature’s value.

Rather than sorting docs by highest classification uncertainty, as conditional entropy does, we’ve found it useful to sort docs by the lowest classification uncertainty! That is, we ask humans to label the docs about which the classifier is least uncertain. The motivation for this is that we’re often building high-precision (and relatively low recall) classifiers for customers and thus have a relatively high probability threshold to return a guess. So the higher ranked items are closer to the boundary we’re trying to learn. Also, we find in real world corpora that if we go purely by uncertainty, we get a long stream of outliers.

Settles does bring up the issue of whether using what’s effectively a kind of active learning mechanism trained with one classifier will be useful for other classifiers. We need to get someone like John Langford or Tong Zhang in here to prove some useful bounds. Their other work on active learning with weights is very cool.

I love the big submit button. What with Fitts’s law, and all.

I see a big problem with this interface for situations with more than a handful of categories. What would the full 20 Newsgroups look like? There aren’t enough room for more columns or a big stack of buttons.

Also, buttons seems like the wrong choice for selecting categories. These should probably be radio buttons to express the exclusivity and the fact that they don’t take action themselves. Typically, buttons cause some action.

### Discriminative Classifiers

Given the concluding comments, Settles doesn’t seem to know that you can do pretty much exactly the same thing in a “discriminative” classifier setting. For instance, logistic regression can be cast as just another directed graphical model with parameters for each word/category pair. So we could do full Bayes with no problem.

There are also plenty of online estimation procedures for weighted examples; you’ll need the weighting to deal with EM (see, e.g., (Karampatziakis and Langford 2010) for an online weighted training method that adjusts neatly for curvature in the objective; it’s coincidentally the paper I’m covering for the next Columbia Machine Learning Reading Group meeting).

The priors can be carried over to this setting, too, only now they’re priors on regression coefficients. See (Genkin, Lewis and Madigan 2007) for guidance. One difference is that you get a mean and a variance to play with.

### Building it in LingPipe

LingPipe’s traditional naive Bayes implementation contains all that you need to build a system like DUALIST. Semi-supervised learning with EM is covered in our EM tutorial with naive Bayes as an example.

To solve some of the speed issues Settles brings up in the discussion section, you can always thread the retraining in the background. That’s what I did in the chunk tagger. With a discriminative “online” method, you can just keep cycling through epochs in the background, which gives you the effect of a hot warmup for subsequent examples. Also, you don’t need to run to convergence — the background models are just being used to select instances for labeling.

### How to Prevent Overflow and Underflow in Logistic Regression

February 16, 2012

Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

### Logistic Regression Model

The basic model of an binary outcome $y_n \in \{ 0, 1\}$ with predictor or feature (row) vector $x_n \in \mathbb{R}^K$ and coefficient (column) vector $\beta \in \mathbb{R}^K$ is

$y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1$, and

$\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0$.

### (Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-750,750)$ before underflowing to 0 or overflowing to positive infinity.

### Potential Underflow and Overflow

The linear predictor at the heart of the regression,

$x_n \beta = \sum_{k = 0}^K x_{n,k} \beta_k$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector $\beta$ to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.

### Log Sum of Exponents to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y,x$,

$\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)$

Working on the inner log probability term, we have

$\log p(y_n|\beta;x_n)$

${ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))$

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0$

Recalling that

$1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha)$,

we further simplify to

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0$

Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

${ } = \log \ \mbox{logit}^{-1}(\alpha)$

${ } = \log (1 / (1 + \exp(-\alpha)))$

${ } = - \log (1 + \exp(-\alpha))$

${ } = - \mbox{logSumExp}(0,-\alpha)$

### Log Sum of Exponentials

Recall that the log sum of exponentials function is

$\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))$

If you’re not familiar with how it prevents underflow and overflow, check out my previous post:

In the logistic regression case, we have an even greater chance for optimization because the argument $a$ is a constant zero.

### Logit-transformed Bernoulli

Putting it all together, we have the logit-transformed Bernoulli distribution,

$\mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))$

${ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1$
${ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0$

We can just think of this as an alternatively parameterized Bernoulli distribution,

$\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))$

with which our model can be expressed as

$y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta)$.

### Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as $-1$ and thus take the outcome $y \in \{ -1, 1 \}$, where we have

$\mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)$

### Bob’s ML Meetup Talk — Stan: A Bayesian Directed Graphical Model Compiler

January 6, 2012

I (Bob) am going to give a talk at the next NYC Machine Learning Meetup, on 19 January 2012 at 7 PM:

There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.

The official version 1 release is coming up soon, but until then, you can check out our work in progress at:

### “Smoothing” Measurement Errors

October 27, 2011

The issue came up in the last blog post, Discontinuities in ROC calculations, about smoothing out Area-under-the-curve (AUC) calculations for ROC and PR curves given observed data points that are very close together.

If you have a bunch of point measurements that you assume were taken with some error, what you can do is model the error itself. Suppose we don’t want to think too hard and just make the error normal, so that a measurement of $a$ is replaced with a distribution $\mbox{\sf Norm}(a,\sigma^2)$, where $\sigma$ is the standard deviation of the error.

Then, rather than computing a function $f(a)$ based on the empirical measurement $a$, instead compute the posterior of $f(X)$ given a random variable $X \sim \mbox{\sf Norm}(a,\sigma^2)$.

To get back to something on the same scale, it’s typical to use a Bayesian point estimate that minimizes expected square error, namely the expectation. Now, instead of a point $f(a)$, such as an AUC statistic, defined precisely by the observation $a$, we have a point defined by the expectation of $a$ characterized by our normal measurement error model.

$\mathbb{E}[f(X)] = \int_X f(x) \ \mbox{\sf Norm}(x|a,\sigma^2) \ dx$.

We can also compute even probabilities, such as comparing two systems with noisy measurements $a$ and $b$, with noise $\sigma$, by

$\mbox{Pr}[f(X) < f(Y)]$

${} = \int \int \mathbb{I}[f(X) < f(Y)] \ \mbox{\sf Norm}(x|a,\sigma^2) \ \mbox{\sf Norm}(y|b,\sigma^2) \ dx \ dy$.

Of course, this leaves the problem of how to estimate $\sigma$. Ideally, we'll have lots of measurements of the same thing and be able to estimate $\sigma$ directly, amounting in what's called an "empirical Bayes" estimate (note that "empirical Bayes" is not a full Bayesian method and is no more empirical than full Bayes, which is also estimated from data). Of course, we could go the full Bayes route integrate over our posterior $p(\sigma|...)$ for $\sigma$, giving us

$\mathbb{E}[f(X)] = \int_0^{\infty} \int_X f(x) \ \mbox{\sf Norm}(x|a,\sigma^2) \ p(\sigma|...) \ dx \ d\sigma$.

As Mike Ross pointed out, the result is going to be sensitive to $\sigma$. In particular, as Mike pointed out, as measurement error goes to zero, we get our original statistic back,

$\lim_{\sigma \rightarrow 0} \mathbb{E}[f(X)] = f(a)$.

Similarly, as $\sigma$ gets big, the limit of $f(X)$ and $f(Y)$ converge if $X \sim \mbox{\sf Norm}(a,\sigma)^2$ and $Y \sim \mbox{\sf Norm}(b,\sigma^2)$.

The bottom line is that you'd like a sensible model of measurement error.

Real AUC calculations depend on more than one point. If you have a vector points $\hat{x} \in [0,1]^N$ with discrete true values $x \in \{0, 1\}^N$, you just chain the integrals together, as in

$\mathbb{E}[\mbox{AUC}(\hat{x},x)]$

${} = \int \cdots \int \ \mbox{AUC}(y,x) \ \prod_{n=1}^N \ \mbox{\sf Norm}(y_n|\hat{x}_n,\sigma) \ dx_1 \cdots dx_N$.

With a normal noise model (or really any model you can sample from), it's easy to compute this integral using Monte Carlo methods. Just take a sequence of $M$ samples $y^{(m)}$, where $y^{(m)}_n$ is drawn independently from $\mbox{\sf Norm}(\hat{x}_n,\sigma^2)$, and approximate the integral required to compute the expectation by

$\frac{1}{M} \ \sum_{m=1}^M \mbox{AUC}(y^{(m)},x)$.

As the number of samples $M$ grows, this summation converges to the expectation,

$\lim_{M \rightarrow \infty} \frac{1}{M} \ \sum_{m=1}^M \mbox{AUC}(y^{(m)},x) = \mathbb{E}[\mbox{AUC}(\hat{x},x)]$.

### Discontinuities in ROC Calculations

October 26, 2011

Amaç Herdagdelen brought up some interesting examples on our mailing list that caused LingPipe’s ROC curve (and PR curve) implementation to provide some unexpected (and arguably wrong) results. (We then took the detailed discussion off line, and are now reporting back.)

#### Computing ROC Curves by Sorting

The basic problem is that the computation of an ROC curve involves first sorting the responses by the system’s estimated probability a response should be positive, then incrementally computing sensitivity versus 1 – specificty operating points by working down the ranked list. You can see a worked example in the class documentation for

#### What if there are Ties?

Mike Ross realized there was a problem when you had two examples with the same score, but different reference values. For instance, a system might be looking at two Tweets and estimate a 0.98 probability that both are positive sentiment. If one was classified as positive in the reference (i.e., the gold standard) and the other negative, then there’s a problem. With one order, you get one curve, and with the other order, you get a different curve. Mike figured out we might as well just leave off the intermediate point, because we get to the same operating point after handling both of them.

#### What if there are Near Ties?

Then Amaç wrote back after noticing he had some cases that were very close, but not quite, the same. This comes up with arithmetic precision all the time. In his case, it was different one-versus-all results for a multi-category classifier. He suggested treating two results as the same if they were within some small absolute value of each other. This is the typical thing to do when checking results are the same in unit tests (or code) — it’s built into junit for Java and googletest for C++.

#### Discretized ROC Curve Estimates are Discontinuous

That led me (Bob) to realize that the real problem is that these measures are discontinuous in the underlying scores. Suppose I have three items, A, B and C, with true values POS, NEG, and POS. Now if I have assigned scores to them of 0.97, 0.96, and 0.98, I get a curve based on the sequence TP, TP, FP, or a perfect score. I get the same reslt as the score for B moves from 0.96 to any value less than 0.97. But at the point it crosses 0.97, I wind up with a sequence TP, FP, TP, which is a whole different story.

The upshot is that area under the ROC curve (or PR curve) is not continuous in the underlying scores.

#### Help? Should we Probabilistically Smooth?

Does anyone have any idea what we should do here?

I’m thinking moving to some kind of probabilistic version might be able to smooth things out in a well-defined way. For instance, if I add normally-distributed noise around each point and then integrate it out, things are smooth again.

### Domain Adaptation with Hierarchical Logistic Regression

September 29, 2011

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

### Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup look more like naive Bayes.

#### The Data

We’ll use the same data representation as in the last post, with $D$ being the nubmer of domains, with $I_d$ docs in domain $d$. Document $i$ in domain $d$ will have $N[d,i]$ tokens. We’ll assume $V$ is the size of the vocabulary.

Raw document data provides a token $x[d,i,n] \in 1{:}V$ for word $n$ of document $i$ of domain $d$. Assuming a bag-of-words representation like we used in naive Bayes, it’ll be more convenient to convert each document into a word frequency vector $u[d,i]$ of dimensionality $V$. To match naive Bayes, define a word frequency vector $u[d,i]$ with value at word (or intercept) $w$ given by

$u[d,i,w] = \sum_{n = 1}^{N[d,i]} \mathbb{I}(x[d,i,n] = w)$,

where $\mathbb{I}(\phi)$ is the indicator function, taking on value 1 if $\phi$ is true and 0 otherwise. Now that we have continuous data, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor $1$ to each vector of predictors $u[d,i]$. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.

The labeled training data consists of a classification $z[d,i] \in \{ 0, 1 \}$ for each document $i$ in domain $d$.

#### The Model, Lower Level

The main parameter to estimate is a coefficient vector $\beta[d]$ of size $V + 1$ for each domain $d$. Here, the first dimension will correspond to the intercept and the other dimensions each correspond to a word in the vocabulary.

The probabilty that a given document is of category 1 (rather than category 0) is given by

$\mbox{Pr}[z[d,i] = 1] = \mbox{logit}^{-1}(\beta[d]^T u[d,i])$.

The inner (dot) product is just the sum of the dimensionwise products,

$\beta[d]^T u[d,i] = \sum_{v = 0}^V \beta[d,v] \times u[d,i,v]$.

This value, called the linear prediction, can take values in $(-\infty,\infty)$. The logistic sigmoid function

$\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))$

maps the unbounded linear prediction to the range $(0,1)$, where it is taken to represent the probabilty that the category $z[d,i]$ of document $i$ in domain $d$ is 1.

In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,

$z[d,i] \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\beta[d]^T u[d,i])$.

Unlike naive Bayes, logistic regression model does not model the word data $u$, instead treating it as constant.

#### The Model, Upper Levels

We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want here. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.

Specifically, we are going to pool the prior across domains by sharing the priors for words $v$ across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale coveriance to NLP-sized coefficient vectors), we draw the component for word $v$ of the coefficients for domain $d$ from a normal distribution,

$\beta[d,v] \sim \mbox{\sf Normal}(\mu[v],\sigma[v]^2)$.

Here $\mu[v]$ represents the mean value of the coefficient for word (or intercept) $v$ across domains and $\sigma[v]^2$ is the variance of the coefficient values across domains. A strong prior will have low variance.

All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically

$\mu[v] \sim \mbox{\sf Normal}(0,\tau^2)$,

where $\tau^2$ is a constant variance parameter. Typically, the variance $\tau^2$ is set to a constant to make the entire prior on the means weakly informative. Alternatively, we could put a prior on $\tau$ itself.

Next, we need priors for the $\sigma[v]$ terms. Commonly, these are chosen to be inverse $\chi^2$ (a specific kind of inverse gamma distribution) distributions for convenience because they’re (conditionally) conjugate. Thus the typical model would take the variance $\sigma[v]^2$ to be the parameter, giving it an inverse gamma prior,

$\sigma[v]^2 \sim \mbox{\sf InvGamma}(\alpha,\beta)$.

Gelman argues against inverse gamma priors on variance because they have pathological behavior when the hierarchical variance terms are near zero.
Gelman prefers the half-Cauchy prior, because it tends not to degenerate in hierarchical models like the inverse gamma can. The half Cauchy is just the Cauchy distribution restricted to positive values (thus doubling the usual density, but restricting support to non-negative values). In symbols, we generate the deviation $\sigma[v]$ (not the variance) using a (non-negative) Half-Cauchy distribution:

$\sigma[v] \sim \mbox{HalfCauchy}()$.

And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.

### Properties of the Hierarchical Logisitic Regression Model

Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.

One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector $\beta[d]$ for domain $d$ only needs to model words that discriminate positive (category 1) from negative (category 0) examples. Thus the correct value for $\beta[d,v]$ for a word $v$ is 0 if the word does not discriminate between positive and negative documents. Thus our hierarchical hyperprior has location 0 in order to pull the prior location $\mu[v]$ for each word $v$ to 0.

The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).

### Comparison to SAGE

If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,

$\beta[d,v] \sim \mbox{\sf Laplace}(\mu[d],\sigma[d]^2)$

then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).

Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set

$\beta[d,v] = \alpha[d,v] + \mu[d]$

and sample

$\alpha[d,v] \sim \mbox{Laplace}(0,\sigma[d]^2)$

you’ll see that if the variance term $\sigma[d]$ is low enough, many of the $\alpha[d,v]$ values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in $\alpha$, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than we have evidence for. Of course, resarchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.