## Archive for the ‘Statistics’ Category

### Domain Adaptation with Hierarchical Logistic Regression

September 29, 2011

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

### Hierarchical Logistic Regression

Today, we’ll look at what NLP calls the “discriminative” version of the domain adaptation problem, specifically using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.

#### The Data

We’ll use the same data representation as in the last post, with $D$ being the number of domains and $I_d$ the number of documents in domain $d$. Document $i$ in domain $d$ will have $N[d,i]$ tokens. We’ll assume $V$ is the size of the vocabulary.

Raw document data provides a token $x[d,i,n] \in 1{:}V$ for word $n$ of document $i$ of domain $d$. Assuming a bag-of-words representation like we used for naive Bayes, it’ll be more convenient to convert each document into a word frequency vector $u[d,i]$ of dimensionality $V$, with value at word $w$ given by

$u[d,i,w] = \sum_{n = 1}^{N[d,i]} \mathbb{I}(x[d,i,n] = w)$,

where $\mathbb{I}(\phi)$ is the indicator function, taking on value 1 if $\phi$ is true and 0 otherwise. Now that we have continuous data, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor $1$ to each vector of predictors $u[d,i]$. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.
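Building $u[d,i]$ from a document's tokens is a few lines of code; here's a minimal sketch (token ids assumed 1-based, with index 0 reserved for the intercept predictor):

```python
import numpy as np

def word_frequency_vector(tokens, V):
    """Convert a document's tokens (word ids in 1..V) into a frequency
    vector u of dimension V + 1, with u[0] = 1 as the intercept and
    u[w] = count of word w in the document."""
    u = np.zeros(V + 1)
    u[0] = 1.0  # intercept predictor prefixed to the word counts
    for w in tokens:
        u[w] += 1.0  # 1-based token ids land in u[1:]
    return u

# example: vocabulary of size 5, document with tokens [2, 3, 2, 5]
u = word_frequency_vector([2, 3, 2, 5], V=5)
```

From here you could apply TF/IDF or other transforms, or append further predictors such as the document's source.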

The labeled training data consists of a classification $z[d,i] \in \{ 0, 1 \}$ for each document $i$ in domain $d$.

#### The Model, Lower Level

The main parameter to estimate is a coefficient vector $\beta[d]$ of size $V + 1$ for each domain $d$. Here, the first dimension will correspond to the intercept and the other dimensions each correspond to a word in the vocabulary.

The probability that a given document is of category 1 (rather than category 0) is given by

$\mbox{Pr}[z[d,i] = 1] = \mbox{logit}^{-1}(\beta[d]^T u[d,i])$.

The inner (dot) product is just the sum of the dimensionwise products,

$\beta[d]^T u[d,i] = \sum_{v = 0}^V \beta[d,v] \times u[d,i,v]$.

This value, called the linear predictor, can take values in $(-\infty,\infty)$. The logistic sigmoid function

$\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))$

maps the unbounded linear predictor to the range $(0,1)$, where it is taken to represent the probability that the category $z[d,i]$ of document $i$ in domain $d$ is 1.

In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,

$z[d,i] \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\beta[d]^T u[d,i]))$.

Unlike naive Bayes, the logistic regression model does not model the word data $u$, instead treating it as constant.
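The lower-level prediction is just an inner product pushed through the inverse logit; a minimal sketch:

```python
import numpy as np

def inv_logit(x):
    """Logistic sigmoid, mapping (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def prob_category_1(beta, u):
    """Pr[z = 1] = logit^{-1}(beta^T u) for coefficient vector beta
    and predictor vector u (both of dimension V + 1)."""
    return inv_logit(beta @ u)

# a zero coefficient vector gives a zero linear predictor,
# which maps to probability 1/2
beta = np.zeros(4)
u = np.array([1.0, 2.0, 0.0, 1.0])
p = prob_category_1(beta, u)  # 0.5
```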

#### The Model, Upper Levels

We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.

Specifically, we are going to pool the prior across domains by sharing the priors for each word $v$ across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale covariance to NLP-sized coefficient vectors), we draw the component for word $v$ of the coefficients for domain $d$ from a normal distribution,

$\beta[d,v] \sim \mbox{\sf Normal}(\mu[v],\sigma[v]^2)$.

Here $\mu[v]$ represents the mean value of the coefficient for word (or intercept) $v$ across domains and $\sigma[v]^2$ is the variance of the coefficient values across domains. A strong prior will have low variance.

All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically

$\mu[v] \sim \mbox{\sf Normal}(0,\tau^2)$,

where $\tau^2$ is a fixed variance parameter, typically set to a constant large enough to make the entire prior on the means weakly informative. Alternatively, we could put a prior on $\tau$ itself.

Next, we need priors for the $\sigma[v]$ terms. Commonly, these are chosen to be inverse-$\chi^2$ distributions (a specific kind of inverse gamma) for convenience, because they’re (conditionally) conjugate. Thus the typical model would take the variance $\sigma[v]^2$ to be the parameter, giving it an inverse gamma prior,

$\sigma[v]^2 \sim \mbox{\sf InvGamma}(\alpha,\beta)$.

Gelman argues against inverse gamma priors on variance because they have pathological behavior when the hierarchical variance terms are near zero; he prefers the half-Cauchy prior, which tends not to degenerate in hierarchical models the way the inverse gamma can. The half-Cauchy is just the Cauchy distribution restricted to positive values (thus doubling the usual density, but restricting support to non-negative values). In symbols, we generate the deviation $\sigma[v]$ (not the variance) from a half-Cauchy distribution:

$\sigma[v] \sim \mbox{\sf HalfCauchy}(s)$, for some fixed scale $s > 0$.

And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.
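As a sanity check, the generative story above is easy to simulate forward. Here's a sketch in NumPy with toy sizes; the scale constants ($\tau$ and the half-Cauchy scale) are assumptions, not values from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

D, V = 3, 10             # domains, vocabulary size (toy sizes)
tau = 5.0                # assumed fixed scale for the hyperprior on means
hc_scale = 2.5           # assumed half-Cauchy scale for the deviations

# upper levels: per-word (plus intercept) location and scale
mu = rng.normal(0.0, tau, size=V + 1)              # mu[v] ~ Normal(0, tau^2)
sigma = np.abs(hc_scale * rng.standard_cauchy(V + 1))  # sigma[v] ~ HalfCauchy

# lower level: per-domain coefficient vectors, pooled through (mu, sigma)
beta = np.array([rng.normal(mu, sigma) for _ in range(D)])
# beta[d, v] ~ Normal(mu[v], sigma[v]^2)
```

Fitting (rather than simulating) the model would need MCMC or variational software, but the forward direction shows how pooling works: every domain's $\beta[d]$ is a noisy copy of the shared $\mu$.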

### Properties of the Hierarchical Logistic Regression Model

Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.

One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector $\beta[d]$ for domain $d$ only needs to model words that discriminate positive (category 1) from negative (category 0) examples. The correct value for $\beta[d,v]$ is therefore 0 if word $v$ does not discriminate between positive and negative documents, which is why our hierarchical hyperprior has location 0: it pulls the prior location $\mu[v]$ for each word $v$ toward 0.

The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).

### Comparison to SAGE

If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,

$\beta[d,v] \sim \mbox{\sf Laplace}(\mu[v],\sigma[d]^2)$

then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).

Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set

$\beta[d,v] = \alpha[d,v] + \mu[v]$

and sample

$\alpha[d,v] \sim \mbox{Laplace}(0,\sigma[d]^2)$

you’ll see that if the variance term $\sigma[d]$ is low enough, many of the $\alpha[d,v]$ values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in $\alpha$, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than we have evidence for. Of course, researchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.

Presumably, the values of the coefficients will have non-negligible covariance. That is, I expect words like “thrilling” and “scary” to covary. For thriller movies, they’re positive sentiment terms and for appliances, rather negative. The problem with modeling covariance among lexical items is that we usually have tens of thousands of them. Not much software can handle a 10K by 10K matrix.

Another way to capture covariance, instead of modeling it directly among the “fixed effects,” is to convert to random effects. For instance, a model like latent Dirichlet allocation (LDA) models covariance of words by grouping them into topics: two words with high probabilities in the same topic will have positive covariance.

I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis. Together, they’ll give you a much deeper understanding of the issues in hierarchical (or more generally, multilevel) modeling.

### Domain Adaptation with Hierarchical Naive Bayes Classifiers

September 22, 2011

This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).

Within natural language processing, the term “domain adaptation” has come to mean something like “using data in one domain to improve results in another domain”. Examples that have received attention include positive/negative sentiment for reviews of different kinds of products, with, say, DVDs being one domain and kitchen appliances another (Blitzer et al. 2007). Other examples include classifying sequences of tokens as names across different genres of text, such as newswire and transcribed speech (Daumé III 2007, Finkel and Manning 2009).

The work I’ve seen on adaptation has largely been in the supervised case, but what we’re going to do here can be unsupervised, in which case you get something like soft K-means clustering. It can also be semi-supervised, with some documents having category labels and some not having labels. I’ve talked about semi-supervised models before, too, as recently as the very last blog post, covering Tang and Lease’s semi-supervised annotation model and in our replication of Nigam et al.’s semi-supervised naive Bayes EM estimator in the LingPipe tutorial on semi-supervised naive Bayes classification.

### Hierarchical Models

The standard Bayesian approach to domain adaptation is to build a hierarchical model. The hierarchy in this case is the hierarchy of domains. We’ll just consider the flat case, where there’s a collection of $D$ “domains”. In the next post, we’ll generalize this assumption to richer organizations of domains with cross-cutting features.

### Naive Bayes

I’ve blogged before about properly Bayesian approaches to naive Bayes classifiers, specifically about Bayesian naive Bayes inference and collapsed Gibbs sampling for computing it.

### Pooling, Exchangeability and Hierarchy

What we’re going to do is let information in one domain, such as which words are positive or which phrases are people names, inform estimates for other domains. This is sometimes referred to as “pooling” information across sub-populations in statistics. A simple example from political science that illustrates the general flavor of models we’re working on at Columbia is allowing estimates of the effect of income on voting in one state to inform estimates in other states, while still allowing the states to differ (an example is given in Gelman et al.’s What’s wrong with Connecticut?).

Theoretically, what we’re going to do amounts to assuming that the domains themselves are exchangeable. We use exchangeable models when we don’t have any information to distinguish the domains. We’ll discuss the case where we do have such information in the next post.

In Bayesian models, we treat exchangeable variables as being drawn from a population distribution (aka, a prior). In hierarchical models, we model the population directly. The reason it’s called hierarchical is that in a proper Bayesian model, there must be a prior distribution on the domain model. So we get a prior on the parameters of the domain model, which then acts as a prior on the estimates for each domain. The pooling is achieved by information flow through the dependencies in the joint distribution implied by the graphical model. In practice, this is simpler than it sounds.

### Hierarchical Naive Bayes

Once we’ve got our heads around the Bayesian formulation of naive Bayes, extending it to hierarchical models is straightforward. We’re going to use the language of documents and tokens in describing naive Bayes, but it’s really a general multinomial model, so don’t assume this is only valid for text classifiers.

For simplicity, we’ll assume we have a binary classifier. I’ll say more about the multi-category case later and in the next post.

#### The Data

Let $D$ be the number of domains. Let $I_d$ be the number of documents in domain $d$ and $N_{d,i}$ the number of tokens in document $i \in 1{:}I_d$ of domain $d \in 1{:}D$. Let $V$ be the size of the vocabulary from which the tokens are drawn.

The raw document data we have provides a token $x[d,i,n] \in 1{:}V$ for each token $n \in 1{:}N_{d,i}$ for each document $i \in 1{:}I_d$ for each domain $d \in 1{:}D$.

The labeled training data we have is a classification $z[d,i] \in \{ 0, 1 \}$ for each document $i \in 1{:}I_d$ in each domain $d \in 1{:}D$.

#### The Model, Lower Levels

The hierarchical naive Bayes model is a directed graphical model and as such, easy to describe with sampling notation.

We have a parameter $\pi[d]$ for the prevalence of category 1 documents in domain $d \in 1{:}D$. It is sampled from a beta distribution with prior count parameters $\alpha_{\pi}, \beta_{\pi}$ (more about which later),

$\pi[d] \sim \mbox{\sf Beta}(\alpha_{\pi}, \beta_{\pi})$.

The categories assigned to each document are sampled from a Bernoulli distribution parameterized by prevalence in the document’s domain,

$z[d,i] \sim \mbox{\sf Bern}(\pi[d])$.

Each domain $d$ and outcome category $c \in \{ 0, 1 \}$ will have its own naive Bayes model with multinomial $V$-simplex parameter $\phi[d,c] \in [0,1]^V$, such that

$\sum_{v = 1}^V \phi[d,c,v] = 1$,

describing the distribution of vocabulary items $v \in 1{:}V$ in category $c \in \{ 0, 1 \}$ in domain $d \in 1{:}D$.

These category word distributions for each domain are drawn from a Dirichlet prior $\gamma[c]$ appropriate to the outcome category $c \in \{ 0, 1 \}$,

$\phi[d,c] \sim \mbox{\sf Dir}(\gamma[c])$.

We then model the tokens in the standard way for naive Bayes, as being drawn independently (that’s the “naive” part) from the discrete distribution associated with their domain $d$ and the category $z[d,i]$ of document $i$,

$x[d,i,n] \sim \mbox{\sf Disc}(\phi[d, z[d,i]])$.
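The lower level of the model is easy to simulate forward; here's a toy sketch in NumPy. The prior parameters ($\alpha_\pi$, $\beta_\pi$, and a flat $\gamma$) are assumed toy values, and token ids are 0-based here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(42)

D, I, N, V = 2, 4, 6, 8   # toy sizes: domains, docs/domain, tokens/doc, vocab
alpha_pi, beta_pi = 2.0, 2.0   # assumed prior counts for prevalence
gamma = np.ones((2, V))        # flat Dirichlet prior per category (assumed)

pi = rng.beta(alpha_pi, beta_pi, size=D)        # pi[d] ~ Beta(alpha_pi, beta_pi)
phi = np.array([[rng.dirichlet(gamma[c]) for c in (0, 1)]
                for _ in range(D)])             # phi[d,c] ~ Dir(gamma[c])

z = (rng.random((D, I)) < pi[:, None]).astype(int)  # z[d,i] ~ Bern(pi[d])
x = np.array([[rng.choice(V, size=N, p=phi[d, z[d, i]])
               for i in range(I)] for d in range(D)])  # x[d,i,n] ~ Disc(phi[d,z[d,i]])
```

Each token is drawn independently given its document's category, which is exactly the "naive" independence assumption.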

#### The Model, Upper Levels

So far, that’s only the lower level of the model. As I mentioned earlier, the priors characterize the populations of prevalences and vocabulary distributions for categories across domains. And we’re using hierarchical models on both sides of naive Bayes, for the categories themselves and for the topic distributions. There are many choices for how to model these populations. We’ll go with a very simple model that only characterizes prior means and variance, not prior covariance; as an example of a prior modeling covariance for multinomials, see the prior for topics in Lafferty and Blei’s correlated topic model.

It’s easier to start with the prevalences. The prior model here will model the mean prevalence of outcome 1 (say, the “positive” outcome for sentiment) across domains, as well as its variance. The mean of the density $\mbox{\sf Beta}(\alpha_{\pi},\beta_{\pi})$ is $\alpha_{\pi}/(\alpha_{\pi} + \beta_{\pi})$ and the variance is inversely related to the total prior count $\alpha_{\pi} + \beta_{\pi}$.

For convenience, we’ll reparameterize $\alpha_{\pi}, \beta_{\pi}$ in terms of total prior count $\kappa_{\pi} \in (0,\infty)$ and prior mean $\psi_{\pi} \in [0,1]$, by setting

$\alpha_{\pi} = \psi_{\pi} \kappa_{\pi}, \ \ \beta_{\pi} = (1 - \psi_{\pi}) \kappa_{\pi}$.

We’ll put a simple conjugate uniform prior on the prior mean $\psi_{\pi}$, taking

$\psi_{\pi} \sim \mbox{\sf Beta}(1,1)$.

(Note that $\mbox{\sf Beta}(1,1)$ is uniform on its support, $[0,1]$.)

We’ll take a very weakly informative prior favoring high variance (low prior count) on the total prior counts,

$\kappa_{\pi} \sim \mbox{\sf Pareto}(0.5, 1.5)$,

where for $x > 0.5$,

$\mbox{\sf Pareto}(x|0.5,1.5) \propto x^{-1.5}$.

The effect of this decision is that the estimate for prevalence $\pi[d]$ in a single domain $d$ will be somewhere between the overall average $\psi_{\pi}$ and the training data average, depending on the relative strength $\kappa_{\pi}$ estimated for the prior and the number of training examples for domain $d$. (A standard shortcut here would be to use either cross-validation or “empirical Bayes” to set the priors $\alpha_{\pi}, \beta_{\pi}$.)
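This shrinkage is easy to verify numerically with the conjugate beta-binomial update (toy numbers, not from the post):

```python
# reparameterized beta prior: alpha = psi * kappa, beta = (1 - psi) * kappa
psi, kappa = 0.3, 10.0            # prior mean and total prior count (assumed)
alpha_pi = psi * kappa            # 3.0
beta_pi = (1.0 - psi) * kappa     # 7.0

# conjugate update: say 8 of 10 docs in domain d are labeled category 1
n_pos, n = 8, 10
posterior_mean = (alpha_pi + n_pos) / (kappa + n)  # (3 + 8) / (10 + 10) = 0.55

# the estimate sits between the prior mean (0.3) and the data mean (0.8);
# a larger kappa would pull it further toward the prior mean
data_mean = n_pos / n
```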

We do the same thing for the Dirichlet prior, reparameterizing the prior count $V$-vector $\gamma[c]$ for vocabulary items for outcome $c \in \{0,1\}$ as a prior mean $V$-simplex $\theta[c]$ and prior total count $\rho[c]$, setting

$\gamma[c] = \rho[c] \, \theta[c]$.

We set a uniform prior on the prior vocabulary distribution $\theta[c]$,

$\theta[c] \sim \mbox{\sf Dir}(1)$,

and another weakly informative Pareto prior on the prior counts,

$\rho[c] \sim \mbox{\sf Pareto}(0.5,1.5)$.

As the mathematicians are wont to say, Q.E.D.

### Tell Us What We’ve Won

What we do get is sharing of discriminative terms across domains. For instance, with polarity, positive term weights in each domain affect the prior, and the prior affects the average in each domain. The prior’s effect on estimates goes by the name “smoothing” in NLP. More specifically, the prior pulls the estimate for each domain and category toward the prior mean for that category by an amount inversely related to the prior variance (low prior variance exerting a stronger smoothing effect) and inversely related to the data count (with higher data counts, the prior’s effect is weaker).

We also get sharing of the non-discriminative terms among category 0 instances across domains, and likewise among category 1 instances. This follows from the fact that the model has to estimate the whole vocabulary of non-discriminative terms in both the negative and positive vocabulary priors $\theta[0]$ and $\theta[1]$. For instance, we’d expect “awesome” and “crap” to have very different distributions in positive and negative sentiment product reviews, whereas “the” and “of” should be less discriminative.

### Turtles All the Way Up

The next level here would be to add a more informative prior for the prior per-category vocabulary distribution $\theta[c]$. A very good place to start would be a prior mean distribution estimated from a large corpus of text without category labels. That would put an empirical Bayes step into the mix and make the prior over priors much more informative (as is, we’ve set it to be symmetric). We could apply full Bayesian inference by building the unsupervised data into the model with what statisticians like to call “missing” labels.

We could do the same thing for the prior mean for prevalence of categories if we have samples of other domains we could use.

One nice feature of hierarchical models is that it’s easy to extend the hierarchy (for an extreme example, see Li and McCallum’s paper on their aptly named technique, pachinko allocation).

It’s trivial to extend this hierarchical naive Bayes model to the multi-category setting. We kept things simple here because it’ll greatly simplify the next post on multilevel modeling with logistic regression. In the naive Bayes setting, just replace the Bernoulli distribution with a discrete distribution and the beta distribution with a Dirichlet.

It’s also trivial to extend to the semi-supervised case. Any number of variables may be “missing” in a directed graphical model without hampering inference (theoretically it’s simple; computationally it requires more bookkeeping and perhaps more samplers). As we mentioned earlier, this is especially helpful when estimating hyperpriors (here the top-level vocabulary prior over per-category vocabulary priors over per-category/per-domain vocabulary distributions).

### Empirical Evaluations?

This will work. The hierarchical modeling strategy pretty much always works. The reason is that with the right hyper-hyper priors, it’ll reduce to the original model. The real question is: will it help?

We should see big gains in small count data.

I haven’t done any. If anyone else has evaluated hierarchical naive Bayes, please say so in the comments or e-mail me and I’ll write a link into the post itself.

A good data set would be Dredze et al.’s multi-domain sentiment data set. I’d at least like to see the results of this relatively simple and standard naive Bayes model and the one I’ll describe in the next post before delving into “deep” models a la (Glorot et al. 2011).

Wow, that was all only four lines of notes in my notebook. I wasn’t expecting this to run so long! The next post has six lines of notes, so watch out.

### Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

September 12, 2011

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track.

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

### Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.

### Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7, which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

### Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

### Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).

### Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posteriori (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.

How in the world did they get a Mechanical Turk worker to have an accuracy of 0 with nearly 100 responses? That’s very, very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

### Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

### Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

### Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not jointly independent even though each pair is independent.)

### One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound $i$ in the $p[i]$ term. Instead, $i$ needs to be bound to the true category for instance $m$.

### Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, $\pi[k,0,0] = \pi[k,1,1]$)? In my (and others’) experience, there’s usually a significant bias, so that sensitivity is not equal to specificity for most annotators.

### Modeling Item Difficulty for Annotations of Multinomial Classifications

September 8, 2011

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

### Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item $i$ by position $\alpha_i$ and an annotator by position $\beta_j$ and discriminativeness $\delta_j$, then generate a label for item $i$ by annotator $j$ whose probability of being correct is:

$y_{i,j} \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\delta_j(\alpha_i - \beta_j)))$

The higher the value of $\delta_j$, the sharper the distinctions made by annotator $j$. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).

If there’s no difficulty, i.e., all $\alpha_i = 0$, we can reduce the logistic regression to a simple binomial parameter

$\theta_j = \mbox{logit}^{-1}(-\delta_j \beta_j)$,

which is the form of model I’ve been proposing for binary data with single response parameter $\theta_j$, with $\theta_{0,j}$ for specificity and $\theta_{1,j}$ for sensitivity.
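The IRT-style response probability is a one-liner; here's a minimal sketch (parameter values are illustrative, not from the post):

```python
import math

def inv_logit(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(alpha_i, beta_j, delta_j):
    """Probability annotator j labels item i correctly under the
    IRT-style model: logit^{-1}(delta_j * (alpha_i - beta_j))."""
    return inv_logit(delta_j * (alpha_i - beta_j))

# a discriminating annotator (delta_j = 2) at position beta_j = -1,
# labeling an item of average difficulty (alpha_i = 0)
theta_j = p_correct(0.0, beta_j=-1.0, delta_j=2.0)
```

Easier items (higher $\alpha_i$ relative to $\beta_j$) yield higher probabilities of a correct label; a larger $\delta_j$ sharpens the annotator's transition between easy and hard items.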

### Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is $k$, each annotator $j$ has a response distribution parameterized by a multinomial parameter $\theta_{j,k}$.

This is just Dawid and Skene’s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this in the standard temperature-based analogy used in annealing. If the difficulty of item $i$ of true category $k$ is $\alpha_i$, the response of annotator $j$ is flattened from $\theta_{j,k}$ to being proportional to $\theta_{j,k}^{1/\alpha_i}$. The higher the value of $\alpha_i$, the greater the amount of flattening. A value of $\alpha_i$ greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).
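The temperature-style flattening is a two-line computation; a sketch with an illustrative response distribution:

```python
import numpy as np

def flatten(theta, alpha):
    """Flatten a response distribution theta toward uniform by raising
    it to the power 1/alpha and renormalizing (annealing temperature)."""
    p = theta ** (1.0 / alpha)
    return p / p.sum()

theta = np.array([0.7, 0.2, 0.05, 0.05])  # annotator's base response
easy = flatten(theta, alpha=1.0)   # unchanged
hard = flatten(theta, alpha=10.0)  # pushed toward uniform, ordering kept
```

Note the flattening preserves the ordering of the categories: the annotator's most likely response stays most likely, it just becomes less dominant as $\alpha_i$ grows.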

### General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector $\theta_{j,k}$ to a regression representation $\log \theta_{j,k}$. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label $k'$ for an item $i$ with true label $k$ for annotator $j$ is proportional to $\exp(\theta_{j,k,k'})$.

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to $\exp((1/\alpha_i) \theta_{j,k,k'}) = \exp(\theta_{j,k,k'})^{1/\alpha_i}$. It would probably be more common to make the difficulty additive, so you’d get $\exp(\alpha_i + \theta_{j,k,k'})$.

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until $\alpha_i$ approaches infinity, we’ll still confuse orange with lemon more than with the others.

### Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to model difficulty as limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might, in the limit of a very hard problem, have a uniform response split between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength-of-assertion ordinal scale. A very borderline case might be confused between a 2 and a 3 (positive and very positive), but won’t be confused with a -3 (very negative) modality of assertion.

### Generalized Multi-Parameter Approach

What I can do is convert everything to a regression again. And then the generalization’s obvious. Instead of a single difficulty parameter $\alpha_i$ for an item, have a difficulty parameter for each response $k$, namely $\alpha_{i,k}$. Now the probability of annotator $j$ responding with category $k'$ when the true category is $k$ is taken to be proportional to $\exp(\beta_{j,k,k'} + \alpha_{i,k'})$.

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to $\exp(\alpha_{i,k'} \beta_{j,k,k'})$.
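Here’s a small sketch of the additive version of this response model in Python; the annotator’s log-scale response weights and the item’s per-response difficulties are made-up numbers for illustration:

```python
import numpy as np

def response_probs(beta_jk, alpha_i):
    """Probability of each response k' for annotator j on item i
    whose true category is k.

    beta_jk : length-K log-scale response weights for true category k
              (row k of annotator j's log confusion matrix)
    alpha_i : length-K additive difficulty offsets for item i
    """
    logits = beta_jk + alpha_i
    e = np.exp(logits - logits.max())   # stabilized softmax
    return e / e.sum()

# lemon, orange, vanilla, chocolate; true category is orange
beta_jk = np.log(np.array([0.05, 0.80, 0.10, 0.05]))
alpha_i = np.array([2.0, 0.0, 0.0, 0.0])   # raise only lemon confusability
print(response_probs(beta_jk, alpha_i))
```

With a per-response difficulty, a hard citrus item can inflate the lemon response without ever touching chocolate, which is exactly the limited-confusability behavior the single-parameter model can’t express.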

### Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

### But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. Examples would be knowing the journal in which a paper came from when doing classification of article subjects. Some journals might be more interdisciplinary and have more confusable papers in general.

### Sampling, Modeling and Measurement Error in Inference from Clinical Text

August 18, 2011

Here’s a link to the slides of the talk I presented recently at ICML:

It’s basically a list of the kinds of things that can go wrong and introduce error (bias and noise) into inferences. Although the examples are mostly clinical (with one on baseball and one on cancer clusters), the point is generally applicable.

### Small, Focused Workshops

I really like small, focused workshops, and this one was very good, with lots of presentations on people’s practical experiences launching systems in hospitals and working on fascinating text mining problems from clinical notes.

### Thanks to the Organizers

Thanks again to the organizers, especially Faisal Farooq, who handled all the paperwork. It’s a pretty thankless job in my experience, but having done it myself, I can really appreciate how much work it is to run something that comes off smoothly.

I don’t know how long the page will last, but here’s a link to the workshop itself:

### Unintended (Beneficial) Consequences

When Noémie Elhadad invited me to give a talk, I met with her to see if there was a topic I could talk about. During that meeting, she mentioned how hard it had been to hire an NLP programmer in biomedical informatics (it’s just as hard if not harder at a small company). The upshot is that Mitzi got a new job at Columbia into the bargain. In a way, it’s too bad, because I miss talking to Mitzi about her work in genomics, about which I know relatively little compared to NLP.

### Steyvers, Lee, Miller and Hemmer (2009) The Wisdom of Crowds in the Recollection of Order Information

June 23, 2011

I just found Mark Steyvers et al.’s work on models of annotation for rankings:

They also describe the model with a more psych/experimental slant with some more experimental data relating observed (and estimated) expertise to self-reported expertise in:

### The Problem and the Data

The basic idea, which they describe as Thurstonian, has annotators rank-order a common set of items, such as the sizes of 10 cities. The goal is then to induce the true ranking (the so-called “wisdom of crowds”) and also to estimate the annotators’ accuracies (but not biases in this case).

### Some Background

The model they propose should be familiar to anyone who’s seen item-response models or Bradley-Terry models from the psychometrics literature on educational testing and preference ranking respectively. Somewhat surprisingly given Steyvers’s connection to cognitive science, they don’t seem to know (or don’t care to cite) sixty years worth of previous psychometrics literature on these kinds of problems. As Andrew Gelman is fond of saying, just about any model you invent was studied decades ago by a psychometrician.

Instead, they dig back even deeper to Thurston in the 1920s and also cite some work by Mallows in the 1950s, the latter of which is closer to what I’d have expected in the way of citations.

Their model reminds me most of Uebersax and Grove’s approach to ordinal ranking problems described in their 1993 Biometrics paper A latent trait finite mixture model for the analysis of rating agreement. Uebersax and Grove also use latent positions and normally distributed noise. The difference is that Uebersax and Grove looked at the case of multiple annotators evaluating multiple items on an ordinal scale. An example would be five doctors ranking 100 slides of potential tumors on a 0-4 scale of severity.

### Steyvers et al.’s Model

#### The Items

The basic idea is to introduce a latent scalar $\mu_i$ for each item $i \in 1{:}I$ being ranked. The ordering of the latent scalars $\mu_i$ induces a complete ordering of items.

#### The Annotators

Each annotator $j \in 1{:}J$ is characterized by a single noise parameter $\sigma_j > 0$. These are given what seems like a rather arbitrary prior:

$\sigma_j \sim \mbox{\sf Gamma}(\lambda,1/\lambda)$

where $\lambda$ is a constant hyperparameter (set to 3). I’m more used to seeing inverse gamma distributions used as priors for variance (or gammas used as priors for precision).

They mention that one could fit another level of hierarchy here for $\lambda$, which would account for population effects in the model; this is standard operating procedure for Bayesian modeling these days and usually results in a better model of posterior uncertainty than optimizing or arbitrarily setting hyperparameters.

#### The Annotations

The annotations that are observed are of the form of complete rankings. That is, if we had three cities to rank by population, Chicago, Houston and Phoenix, an annotator’s response might be

Houston > Chicago > Phoenix.

The model assumes these annotations are derived from a latent annotator-specific scalar $x_{i,j}$ for each item $i$ and annotator $j$ (it’d also be easy to allow an incomplete panel design in which not every annotator ranks every item). The model for this latent scalar is the obvious one:

$x_{i,j} \sim \mbox{\sf Norm}(\mu_i,\sigma_j^2)$.

That is, the latent position $x_{i,j}$ assigned to item $i$ by annotator $j$ is drawn from a normal centered around the true latent location $\mu_i$ for item $i$ with noise determined by the annotator-specific deviation parameter $\sigma_j$.
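The generative story is easy to simulate. Here’s a sketch in Python with made-up sizes, using the paper’s $\lambda = 3$ for the annotator noise prior (I’ve assumed the shape/scale parameterization of the gamma, which puts the prior mean at 1):

```python
import numpy as np

rng = np.random.default_rng(1234)

I, J = 10, 5                               # items, annotators
mu = rng.normal(0.0, 1.0, size=I)          # latent item locations
lam = 3.0
sigma = rng.gamma(lam, 1.0 / lam, size=J)  # annotator noise scales

# latent positions: x[i, j] ~ Normal(mu[i], sigma[j]^2)
x = mu[:, None] + sigma[None, :] * rng.normal(size=(I, J))

# each annotator's observed ranking is just the sort order of x[:, j]
rankings = np.argsort(x, axis=0)           # column j: items low to high
print(rankings[:, 0])
```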

#### The Sampling Trick

There’s only one problem in implementing this model: the latent $x_{i,j}$ must be consistent with the observed ranking $y_{i,j}$. As you can imagine, they follow Albert and Chib’s approach, which involves a truncated normal sampler. That is, conditional on all but a single position $x_{i,j}$, sample from a normal distribution truncated to the interval bounded by the positions of the items ranked immediately below and immediately above item $i$ (with no lower bound for the lowest-ranked item and no upper bound for the highest-ranked item).
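A sketch of that conditional update in Python, using inverse-CDF sampling for the truncated normal (the neighboring positions and parameter values here are hypothetical):

```python
import random
from statistics import NormalDist

def truncnorm_sample(mu, sigma, lo, hi, rng=random):
    """Inverse-CDF draw from Normal(mu, sigma^2) truncated to (lo, hi).

    Pass lo = -inf for the lowest-ranked item and hi = +inf for the
    highest-ranked item.
    """
    nd = NormalDist(mu, sigma)
    a = nd.cdf(lo) if lo > float("-inf") else 0.0
    b = nd.cdf(hi) if hi < float("inf") else 1.0
    u = a + (b - a) * rng.random()   # uniform over the truncated mass
    return nd.inv_cdf(u)

# Gibbs update for x[i, j], conditioned on the latent positions of
# the items annotator j ranked just below and just above item i
x_below, x_above = -0.3, 0.8    # hypothetical neighboring positions
mu_i, sigma_j = 0.1, 0.5        # item location, annotator noise
print(truncnorm_sample(mu_i, sigma_j, x_below, x_above))
```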

The whole model’s only a few lines of JAGS code, though they wrote their own implementation using a mix of Metropolis and Gibbs updating (this is a case where Gibbs is going to mix relatively slowly because of the interdependence of the $x_{i,j}$, yet this is where they use Gibbs). An advantage of using JAGS is that it’s trivial to explore the hierarchical extensions of the model.

#### The Posterior

The posterior distribution is modeled using samples. Here, the random variables being sampled are $(\mu, x, \sigma)$. Given samples for $\mu$, we can estimate the rank of each item by looking at the rank in each sample $\mu^{(k)}$. We also get a direct characterization of the noise of an annotator through $\sigma_j$. We probably don’t care at all about the annotator-specific latent positions $x_{i,j}$.
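Given posterior draws of $\mu$, each item’s rank distribution is a couple of argsorts away. A sketch, with simulated draws standing in for real MCMC output:

```python
import numpy as np

# mu_samples[s, i]: draw s of item i's latent location
# (simulated normal draws standing in for real posterior samples)
rng = np.random.default_rng(7)
mu_samples = rng.normal(loc=[0.0, 1.0, 2.5], scale=0.3, size=(1000, 3))

# rank of each item within each draw (0 = smallest)
ranks = np.argsort(np.argsort(mu_samples, axis=1), axis=1)

# posterior distribution over ranks for each item
for i in range(3):
    dist = [np.mean(ranks[:, i] == r) for r in range(3)]
    print("item", i, "rank distribution:", np.round(dist, 3))
```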

In the NIPS paper, we get posterior intervals. In general, I prefer more direct views of the posterior samples, like scatterplots or histograms. For instance, check out the alternative plots for similar data in the collection BUGS Examples, Volume I, which contains a closely related example model involving ranking hospitals based on pediatric surgery fatalities (p. 13, diagram p. 17). It’s basically a histogram of the item’s rank in the posterior samples.

### The Wisdom of Crowds

The result is a successful “wisdom of the crowds” aggregation of rankings. Each ranker is weighted by their estimated noise, so more reliable rankers have their rankings weighted more heavily. This is just like all the annotation models we’ve talked about, beginning with Dawid and Skene’s seminal 1979 paper, Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm (sorry for the paywalled link — I can’t find a pirated copy).

In the terminology of statistics, these kinds of models are called “measurement error” models (there doesn’t seem to be a good Wikipedia page or general overview — you find introductions in any book on survey sampling). It’s not uncommon for the basic data to be measured with noise, especially in an epidemiology setting or in any kind of subjective coding by human subjects, like survey responses.

### Future Directions

The authors point out in their second paper that it’d be natural to build hierarchical models for this task. But their suggestion for how to do it is odd. They suggest adding a single parameter for an individual across all tasks. Usually, you’d have this, then a domain-specific parameter that varied around the hierarchical parameter.

That is, if our tasks were indexed by $t$, we’d have an individual-level error $\sigma_j$ for each annotator across tasks, and an error $\sigma_{t,j}$ for each task for each annotator that’s sampled in the neighborhood of $\sigma_j$. It’d be common at this point to estimate random effects for things like task difficulty or annotator expertise. You see this all over the epidemiology and psychometrics literature when they extend these annotation models (for instance, in epidemiology, blood sample tests vs. doctor physical exam is an example of an annotator-level effect; in psychometrics, midterm versus final exam is an example of an item-level effect).

I’d at least start with giving the $\sigma_j$ a hierarchical prior.

I’m guessing since they also suggest multiple-choice questions as an extension, they really haven’t seen the item-response theory literature.

### Stochastic Gradient with Early Stopping and Low-Count Feature Up-Weighting

June 6, 2011

I’ve been meaning to blog about this topic ever since seeing David Chiang‘s talk on machine translation a few weeks ago at Columbia. (I don’t have a citation for this idea, but if you know one, I’m happy to insert it.)

What David proposed was largely a typical approach to fitting a large-scale, sparse linear model: run some kind of stochastic gradient updates and stop before you reach convergence. Even without a shrinkage/prior/regularization component to the model (and hence the gradient), early stopping results in a kind of poor-man’s regularization. The reason is that there’s only so far a coefficient can move in 10 or 20 iterations (all they could afford given the scale of their MT problem). Even a separable feature (one that co-occurs with only a single output category) will remain relatively bounded.

The novel (to me) twist was that David and crew up-weighted the learning rate for low-count features. This was counterintuitive enough to me and Jenny Finkel that we discussed it during the talk. Our intuitions, I believe, were formed by thinking of a model involving priors run to convergence. Actually, at that point, it probably doesn’t make a difference in the limit if you follow the Robbins-Monro update weights.

Of course, we don’t live in the limit.

It occurred to me well after the talk that the reason David’s strategy of up-weighting low-count features could help is exactly because of the early stopping. Features that don’t occur very often have even fewer chances to be moved by the gradient, especially if they’re (a) correlated with features that do occur frequently, or (b) occur in training examples that are well modeled by other features. We have ample experience in NLP that low-count features help with prediction — we usually only prune for the sake of memory.

By boosting the learning rate of low-count features, it helps them overcome the limited number of training iterations to some extent. The trick is to make sure it helps more than it hurts. It’d also be nice to have a better theoretical understanding of what’s going on.
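Here’s a toy sketch of the idea in Python — not David Chiang et al.’s actual recipe. It runs plain SGD for logistic regression, stopped early, with each feature’s learning rate scaled up for features that occur in few training examples (the $1/\sqrt{\mbox{count}}$ scaling is my own made-up choice):

```python
import numpy as np

def sgd_logistic(X, y, epochs=10, eta=0.1):
    """Logistic regression by SGD with early stopping and
    per-feature learning rates up-weighted for rare features.
    """
    n, m = X.shape
    counts = np.maximum((X != 0).sum(axis=0), 1)  # examples per feature
    rates = eta / np.sqrt(counts)                 # rare -> bigger steps
    w = np.zeros(m)
    for _ in range(epochs):                       # early stopping: few epochs
        for i in range(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += rates * (y[i] - p) * X[i]        # no regularization term
    return w

# toy data: feature 0 is frequent, feature 1 is rare but predictive
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
print(sgd_logistic(X, y))
```

The rare feature gets twice the step size of the frequent one here, partially compensating for how few gradient updates touch it before the iterations run out.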

For some reason, this basic approach reminds me of (a) normalization of feature (co)-variance, and (b) other ad-hoc weighting schemes, like Lewis and Madigan’s trick for logistic regression of weighting features in a document for classification by TF/IDF.

### Revised: Summing out Topic Assignments in LDA

April 20, 2011

I updated the LDA summing out topic assignments post so that the derivation’s actually useful. Thanks again to Matt Hoffman.

### Summing Out Topic Assignments in LDA

April 14, 2011

[Update: 20 April 2011. I replaced my original (correct, but useless) derivation with the derivation Matt Hoffman provided in the comments.]

In a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In fact, we could go all the way to deriving the probability of each word’s topic assignment given all other parameters, thus defining a natural collapsed Gibbs sampling strategy.

In this post, I’ll show how to turn this around, summing out the discrete parameters in order to produce a fully continuous density. The nice feature about such a density is that it’s amenable to Hamiltonian Monte Carlo methods. These may actually mix faster than the collapsed Gibbs sampler. Now that we have a working HMC implementation, it’ll be easy to test.

It’s always pretty straightforward to sum out discrete parameters. For instance, it’s commonly encountered in the textbook mixture model in the E step of EM (e.g., for soft K-means clustering or semi-supervised learning). It’s all based on the sum law, which says that if $p(a,b)$ is discrete in $b$ and $b$ can take on values $1{:}N$, then

$p(a) = \sum_{b=1}^{N} p(a,b)$.

All we need to do is repeatedly apply this rule to each of LDA’s topic assignments to tokens.
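In code, the sum rule is nothing more than a loop over the values of $b$ (the joint here is a made-up toy):

```python
# Summing out a discrete variable: p(a) = sum_b p(a, b).
# Toy joint distribution over a in {0, 1} and b in {0, 1, 2}:
p_joint = {
    (0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.05,
    (1, 0): 0.20, (1, 1): 0.15, (1, 2): 0.25,
}

# marginal p(a), summing out b
p_a = {a: sum(p_joint[(a, b)] for b in range(3)) for a in range(2)}
print(p_a)
```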

### Latent Dirichlet Allocation

The LDA model involves a combination of Dirichlet and multinomial distributions. We observe the words $w_{d,i}$ for each document $d \in 1{:}D$, where document $d$ is of length $I_d$. We let $V$ be the number of distinct words observed. We assume there are $K$ topics. Next, there are multinomial parameters $\theta_d$ of dimension $K$ for the distribution of topics in document $d$. We also assume multinomial parameters $\phi_k$ of dimension $V$ for the distribution of words in topic $k$. For simplicity, we assume fixed Dirichlet hyperpriors $\alpha$ and $\beta$ of dimensionality $K$ and $V$ over the multinomial parameters $\theta_d$ and $\phi_k$.

The full probability model $p(\theta,\phi,z,w|\alpha,\beta)$ is defined by

$\theta_d \sim \mbox{\sf Dirichlet}(\alpha)$

$\phi_k \sim \mbox{\sf Dirichlet}(\beta)$

$z_{d,i} \sim \mbox{\sf Discrete}(\theta_d)$

$w_{d,i} \sim \mbox{\sf Discrete}(\phi_{z[d,i]})$
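The generative story above translates directly into a simulation. A sketch in Python with made-up sizes and symmetric hyperparameters (indices are 0-based in the code):

```python
import numpy as np

rng = np.random.default_rng(42)

D, K, V = 3, 2, 6              # documents, topics, vocabulary size
alpha = np.full(K, 1.0)        # hyperparameter for theta_d
beta = np.full(V, 0.5)         # hyperparameter for phi_k
doc_lengths = [8, 5, 10]       # I_d for each document

theta = rng.dirichlet(alpha, size=D)   # theta[d]: topic mix of doc d
phi = rng.dirichlet(beta, size=K)      # phi[k]: word dist of topic k

docs = []
for d in range(D):
    z = rng.choice(K, size=doc_lengths[d], p=theta[d])   # z[d,i]
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w[d,i]
    docs.append(w)

print(docs[0])
```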

### Summing out the Topic Assignments

In our previous post, we showed how to simplify the expression for the distribution of topic assignments given words and hyperpriors, $p(z|w,\alpha,\beta)$. Here, we are going to derive $p(\theta,\phi|w,\alpha,\beta)$, the density of the document distributions and topic distributions given the words and hyperpriors.

$p(\theta,\phi|w,\alpha,\beta)$

${} \propto p(\theta,\phi,w|\alpha,\beta)$

${} = p(\theta|\alpha) \times p(\phi|\beta) \times p(w|\theta,\phi)$

${} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} p(w_{d,i}|\theta_d, \phi)$

${} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i] = 1}^K p(w_{d,i},z_{d,i}|\theta_d, \phi)$

Next, factoring the joint probability with the chain rule, we get

${} = p(\theta|\alpha) \times p(\phi|\beta) \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i] = 1}^K p(z_{d,i}|\theta_d) \times p(w_{d,i}|\phi_{z[d,i]})$

Expanding out the definitions, we are left with a nice tight expression that’s easy to calculate:

$p(\theta,\phi|w,\alpha,\beta)$

${} \propto \prod_{d=1}^D \mbox{\sf Dir}(\theta_d|\alpha)$

${} \times \prod_{k=1}^K \mbox{\sf Dir}(\phi_k|\beta)$

${} \times \prod_{d=1}^D \prod_{i=1}^{I_d} \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]})$.

Of course, we’ll probably be doing this on the log scale, where

$\log p(\theta,\phi|w,\alpha,\beta)$

${} = Z_{w,\alpha,\beta}$

${} + \sum_{d=1}^D \log \mbox{\sf Dir}(\theta_d|\alpha)$

${} + \sum_{k=1}^K \log \mbox{\sf Dir}(\phi_k|\beta)$

${} + \sum_{d=1}^D \sum_{i=1}^{I_d} \log \left( \sum_{z[d,i]=1}^K \mbox{\sf Disc}(z_{d,i}|\theta_d) \times \mbox{\sf Disc}(w_{d,i}|\phi_{z[d,i]}) \right)$

where $Z_{w,\alpha,\beta}$ is a normalizing term that we can ignore for most optimization, expectation or sampling-based methods. Note that the $\Gamma$ functions inside the Dirichlet terms go into the $Z_{w,\alpha,\beta}$ term if we leave the priors $\alpha$ and $\beta$ constant.
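Here’s a sketch of that log density in Python, with the inner sum over topics computed via log-sum-exp for numerical stability; the $\Gamma$ terms from the Dirichlets are folded into the ignored constant, and the parameter values at the bottom are made up just to exercise the function:

```python
import numpy as np

def log_sum_exp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def lda_log_density(theta, phi, docs, alpha, beta):
    """Log p(theta, phi | w, alpha, beta) up to the constant Z.

    theta : D x K document-topic probabilities
    phi   : K x V topic-word probabilities
    docs  : list of D arrays of word indices in 0:V
    """
    lp = np.sum((alpha - 1) * np.log(theta))   # Dirichlet kernels;
    lp += np.sum((beta - 1) * np.log(phi))     # Gamma terms dropped into Z
    log_theta, log_phi = np.log(theta), np.log(phi)
    for d, words in enumerate(docs):
        for w in words:
            # log sum_k theta[d, k] * phi[k, w], summing out z[d, i]
            lp += log_sum_exp(log_theta[d] + log_phi[:, w])
    return lp

rng = np.random.default_rng(0)
D, K, V = 2, 3, 5
theta = rng.dirichlet(np.ones(K), size=D)
phi = rng.dirichlet(np.ones(V), size=K)
docs = [np.array([0, 2, 4]), np.array([1, 1, 3])]
print(lda_log_density(theta, phi, docs, np.ones(K), np.ones(V)))
```

Since the density is smooth in $\theta$ and $\phi$, this is the form you’d hand (suitably transformed to unconstrained parameters) to a Hamiltonian Monte Carlo sampler.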

### What is a (Mathematical) Model?

March 11, 2011

I was shocked and dismayed when I heard from a reader that I’d used the term “model” over 200 times in the LingPipe book without ever saying what it meant. This kind of thing is why it’s so hard to write introductory material.

Perhaps I shouldn’t have been surprised at this comment, because other people had expressed some uncertainty about the term “model” to me in the past.

### What is a (Mathematical) Model?

In short, when I say “model”, I mean it in the bog-standard scientific sense, as explained on:

Quite simply, it’s just a bunch of math used to describe a phenomenon. Nothing interesting here either philosophically or conceptually, just the usual scientific method.

For instance, Newton’s equation $f = m \times a$ (force equals mass times acceleration) is a mathematical model that may be used to describe the motions of the planets, among other things. Newton derived his model from Kepler’s observation that the planets sweep out equal areas in equal times in their orbits. Newton realized that by introducing the notion of “gravity”, he could model the orbits of the planets. Of course, he had to invent calculus to do so, but that’s another story.

### Prediction vs. Postdiction

Typically models are used for predicting future events, but sometimes they’re used to retroactively try to understand past events (“backcasting” [“aftcasting” if you’re nautical] is the time opposite of “forecasting”, and “postdiction” the opposite of “prediction”). For instance, climate scientists attempt to postdict/backcast earth temperatures from data such as tree rings; we’re working on fitting models of such data with Matt Schofield as part of our Bayesian inference project at Columbia.

### All Models are Wrong, but …

As the statistician George E. P. Box said, “Essentially, all models are wrong, but some are useful.” For instance, Newton’s model is wrong in that it doesn’t correct for relativistic effects at very high velocities. But it proved useful at predicting everything from eclipses to the tides.

The models we’ve used in LingPipe, such as the HMM model of part-of-speech tagging, are also clearly wrong. Language just isn’t Markovian (meaning that the $n$-th word depends only on a fixed window of the previous few words). But we can still do pretty well at predicting part-of-speech tags with the simplified model.