Domain Adaptation with Hierarchical Naive Bayes Classifiers

September 22, 2011 by

This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).

Domain Adaptation

Within natural language processing, the term “domain adaptation” has come to mean something like “using data in one domain to improve results in another domain”. Examples that have received attention include positive/negative sentiment for reviews of different kinds of products, with, say, DVDs being one domain and kitchen appliances another (Blitzer et al. 2007). Other examples classifying sequences of tokens as names across different genres of text, such as newswire and transcribed speech (Daumé III 2007, Finkel and Manning 2009).

The work I’ve seen on adaptation has largely been in the supervised case, but what we’re going to do here can be unsupervised, in which case you get something like soft K-means clustering. It can also be semi-supervised, with some documents having category labels and some not having labels. I’ve talked about semi-supervised models before, too, as recently as the very last blog post, covering Tang and Lease’s semi-supervised annotation model and in our replication of Nigam et al.’s semi-supervised naive Bayes EM estimator in the LingPipe tutorial on semi-supervised naive Bayes classification.

Hiearchical Models

The standard Bayesian approach to domain adaptation is to build a hierarchical model. The hierarchy in this case is the hieararchy of domains. We’ll just consider the flat case, where there’s a collection D of what we’ll call “domains”. In the next post, we’ll generalize this assumption to richer organizations of domains with cross-cutting features.

Naive Bayes

I’ve blogged before about properly Bayesian approaches to naive Bayes classifiers, specifically about Bayesian naive Bayes inference and collapsed Gibbs sampling for computing it.

Pooling, Exchangeability and Hieararchy

What we’re going to do is let information in one domain, such as which words are positive or which phrases are people names, inform estimates for other domains. This is sometimes referred to as “pooling” information across sub-populations in statistics. A simple example from political science that illustrates the general flavor of models we’re working on at Columbia is allowing estimates of the effect of income on voting in one state to inform estimates in other states, while still allowing the states to differ (an example is given in Gelman et al.’s What’s wrong with Connecticut?).

Theoretically, what we’re going to do amounts to assuming that the domains themselves are exchangeable. We use exchangeable models when we don’t have any information to distinguish the domains. We’ll discuss the case where we do have such information in the next post.

In Bayesian models, we treat exchangeable variables as being drawn from a population distribution (aka, a prior). In hierarchical models, we model the population directly. The reason it’s called hierarchical is that in a proper Bayesian model, there must be a prior distribution on the domain model. So we get a prior on the parameters of the domain model, which then acts as a prior on the estimates for each domain. The pooling is achieved by information flow through the dependencies in the joint distribution implied by the graphical model. In practice, this is simpler than it sounds.

Hieararchical Naive Bayes

Once we’ve got our heads around the Bayesian formulation of naive Bayes, extending it to hieararchical models is straightforward. We’re going to use the language of documents and tokens in describing naive Bayes, but it’s really a general multinomial model, so don’t assume this is only valid for text classifiers.

For simplicity, we’ll assume we have a binary classifier. I’ll say more about the multi-category case later and in the next post.

The Data

Let D be the number of domains. Let I_d be the number of documents in domain D and N_{d,i} the number of tokens in document i \in 1{:}I_d of domain d \in 1{:}D. Let V be the size of the vocabulary from which the tokens are drawn.

The raw document data we have provides a token x[d,i,n] \in 1{:}V for each token n \in N_{d,i} for each document i \in 1{:}I_d for each domain d \in 1{:}D.

The labeled training data we have is a classification z[d,i] \in \{ 0, 1 \} for each document i \in 1{:}I_d in each domain d \in 1{:}D.

The Model, Lower Levels

The hierarchical naive Bayes model is a directed graphical model and as such, easy to describe with sampling notation.

We have a parameter \pi[d] for the prevalence of category 1 documents in domain d \in 1{:}D. It is sampled from a beta distribution with prior count parameters \alpha_{\pi}, \beta_{\pi} (more about which later),

\pi[d] \sim \mbox{\sf Beta}(\alpha_{\pi}, \beta_{\pi}).

The categories assigned to each document are sampled from a Bernoulli distribution parameterized by prevalence in the document’s domain,

z[d,i] \sim \mbox{\sf Bern}(\pi[d]).

Each domain d and outcome category c \in \{ 0, 1 \} will have its own naive Bayes model with multinomial V-simplex parameter \phi[d,c] \in [0,1]^V, such that

\sum_{v = 1}^V \phi[d,c,v] = 1,

describing the distribution of vocabulary items v \in V in category c \in \{ 0, 1 \} in domain d \in D.

These category word distributions for each domain are drawn from a Dirichlet prior \gamma[c] appropriate to the outcome category c \in \{ 0, 1 \},

\phi[d,c] \sim \mbox{\sf Dir}(\gamma[c]).

We then model the tokens in the standard way for naive Bayes, as being drawn independently (that’s the “naive” part) from the discrete distribution associated with their domain d and the category z[d,i] of document i,

x[d,i,n] \sim \mbox{\sf Disc}(\phi[d, z[d,i]]).

The Model, Upper Levels

So far, that’s only the lower level of the model. As I mentioned earlier, the priors characterize the populations of prevalences and vocabulary distributions for categories across domains. And we’re using hierarchical models on both sides of naive Bayes, for the categories themselves and for the topic distributions. There are many choices for how to model these populations. We’ll go with a very simple model that only characterizes prior means and variance, not prior covariance; as an example of a prior modeling covariance for multinomials, see the prior for topics in Lafferty and Blei’s correlated topic model.

It’s easier to start with the prevalences. The prior model here will model the mean prevalence of outcome 1 (say, the “positive” outcome for sentiment) across domains, as well as its variance. The mean of the density \mbox{\sf Beta}(\alpha_{\pi},\beta_{\pi} is \alpha_{\pi}/(\alpha_{\pi} + \beta_{\pi}) and the variance is inversely related to the total prior count \alpha_{\pi} + \beta_{\pi}.

For convenience, we’ll reparameterize \alpha_{\pi}, \beta_{\pi} in terms of total prior count \kappa_{\pi} \in (0,\infty) and prior mean \psi_{\pi} \in [0,1], by setting

\alpha_{\pi} = \psi_{\pi} \kappa_{\pi}, \ \ \beta_{\pi} = (1 - \psi_{\pi}) \kappa_{\pi}.

We’ll put a simple conjugate uniform prior on the prior mean \psi_{\pi}, taking

\psi_{\pi} \sim \mbox{\sf Beta}(1,1).

(Note that \mbox{\sf Beta}(1,1) is uniform on its support, [0,1].)

We’ll take a very weakly informative prior favoring high variance (low prior count) on the total prior counts,

\kappa_{\pi} \sim \mbox{\sf Pareto}(0.5, 1.5),

where for x > 0.5,

\mbox{\sf Pareto}(x|0.5,1.5)  \propto x^{-1.5}.

The effect of this decision is that the estimate for prevalence \pi[d] in a single domain d will be somewhere between the overall average \psi_{\pi} and the training data average, depending on the relative strength \kappa_{\pi} estimated for the prior and the number of training examples for domain d. (A standard shortcut here would be to use either cross-validation or “empirical Bayes” to set the priors \alpha_{\pi}, \beta_{\pi}.

We do the same thing for the Dirichlet prior, reparameterizating the prior count V-vector \gamma[c] for vocabularly items for outcome c \in \{0,1\} as a prior mean V-simplex \theta[b] and prior total count \rho[b], setting

\gamma[c] = \rho[b] \theta[b].

We set a uniform prior on the prior vocabulary distribution \theta[b],

\theta[b] \sim \mbox{Dir}(1),

and another weakly informative Pareto prior on the prior counts,

\rho \sim \mbox{Pareto}(0.5,1.5).

As the mathematicians are wont to say, Q.E.D..

Tell Us What We’ve Won

What we do get is sharing of discriminative terms across domains. For instance, with polarity, positive term weights in each domain affect the prior, and the prior affects the average in each domain. The prior’s effect on estimates goes by the name “smoothing” in NLP. More specifically, the prior pulls the estimate for each domain and category toward the prior mean for that category by an amount inversely related to the prior variance (low prior variance exerting a stronger smoothing effect) and inversely related to the data count (with higher data counts, the prior’s effect is weaker).

We also get sharing of the non-discriminative terms among category 0 instances across domains and sharing among non-discriminative terms among category 1 instances.

But what about…

… the fact that the model has to estimate the whole vocabulary of non-discriminative terms in both the negative and positive vocabulary priors \theta[0] and \theta[1]?

For instance, we’d expect “awesome” and “crap” to have a very different distribution in positive and negative sentiment product reviews, whereas “the” and “of” should be less discriminative.

Turtles All the Way Up

The next level here would be to add a more informative prior for the prior per-category vocabulary distribution \theta[d]. A very good place to start here would be with a prior mean distribution \theta[d] estimated from a large corpus of text without category labels. That would put an empirical Bayes step into the mix and make the prior over priors much more informative (as is, we’ve set it to be symmetric). We could apply full Bayesian inference by just building the unsupervised data into the model with what the statisticians like to call “missing” labels.

We could do the same thing for the prior mean for prevalence of categories if we have samples of other domains we could use.

One nice feature of hierarchical models is that it’s easy to extend the hierarchy (for an extreme example, see Li and McCallum’s paper on their aptly named technique, pachinko allocation).

It’s trivial to extend this hiearchical naive Bayes model to the multi-category setting. We kept things simple here because it’ll greatly simplify the next post on multilevel modeling with logistic regression. In the naive Bayes setting, just replace the Bernoulli distribution with a discrete distribution and the beta distribution with a Dirichlet.

It’s also trivial to extend the semi-supervised case. Any number of variables may be “missing” in a directed graphical model without hampering inference (theoretically it’s simple; computationally it requires more bookkeeping and perhaps more samplers). As we mentioned earlier, this is especially helpful in the setting where we’re estimating hyperpriors (here the top-level vocabulary prior over per-category vocabulary priors over per-category/per-domain vocabulary distributions).

Empirical Evaluations?

This will work. The hieararchical modeling strategy pretty much always works. The reason is that with the right hyper-hyper priors, it’ll reduce to the original model. The real question is will it help?

We should see big gains in small count data.

I haven’t done any. If anyone else has evaluated hierarchical naive Bayes, please say so in the comments or e-mail me and I’ll write a link into the post itself.

A good data set would be Dredze et al.’s multi-domain sentiment data set. I’d at least like to see the results of this relatively simple and standard naive Bayes model and the one I’ll describe in the next post before delving into “deep” models a la (Glorot et al. 2011).

Wow, that was all only four lines of notes in my notebook. I wasn’t expecting this to run so long! The next post has six lines of notes, so watch out.

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

September 12, 2011 by

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.

Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7,which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).

Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posterior (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.

Really Adversarial Turkers?

How in the world did they get a Mechanical turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Adding a Model

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is pairwise independent.

One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.

Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1]). In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.

Modeling Item Difficulty for Annotations of Multinomial Classifications

September 8, 2011 by

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item i by position \alpha_i and an annotator by position \beta_j and discriminativeness \delta_j, then generate a label for item i by annotator j whose probability of being correct is:

y_{i,j} \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\delta_j(\alpha_i - \beta_j)))

The higher the value of \delta_j, the sharper the distinctions made by annotator j. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).

If there’s no difficulty, equivalently all \alpha_i = 0, so we can reduce the logistic regression to a simple binomial parameter

\theta_j = \mbox{logit}^{-1}(\delta_j \beta_j)

which is the form of model I’ve been proposing for binary data with single response parameter \theta_j, with \theta_{0,j} for specificity and \theta_{1,j} for specificity.

Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is k, each annotator j has a response distribution paraemterized by a multinomial parameter \theta_{j,k}.

This is just Dawid and Skene‘s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this in the standard temperature-based analogy used in annealing. If the difficulty of item i of true category k is \alpha_i, the response of annotator j is flattened from \theta_{j,k} to being proportional to \theta_{j,k}^{1/\alpha_i}. The higher the value of \alpha, the greater the amount of flattening. A value of \alpha_i greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).

General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector \theta_{j,k} to a regression representation \log \theta_{j,k}. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label k' for an item i with true label k for annotator j is proportional to \exp(\theta_{j,k,k'})

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to \exp((1/\alpha_i) \theta_{j,k,k'}) = \exp(\theta_{j,k,k'})^{1/\alpha_i}. It would probably be more common to make the difficulty additive, so you’d get \exp(\alpha_i + \theta_{j,k,k'}).

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until \alpha_i approaches infinity, we’ll still confuse orange with lemon more than with the others.

Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to have the difficulty be one of limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might in the limit of a very hard problem, have a uniform response between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength of assertion ordinal scale. A very borderline case might be confused between a 2 or 3 (positive and very positive), but won’t be confused between a -3 (very negative) modality of assertion.

Genealized Multi-Parameter Approach

What I can do is convert everything to a regression again. And then the generalization’s obvious. Instead of a single difficulty parameter \alpha_i for an item, have a difficulty parameter for each response k, namely \alpha_{i,k}. Now the probablity of annotator j responding with category k' when the true category is k is taken to be proportional to \exp(\beta_{j,k,k'} + \alpha_{i,k'}).

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to \exp(\alpha_{i,k'} \beta_{j,k,k'}).

Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. Examples would be knowing the journal in which a paper came from when doing classification of article subjects. Some journals might be more interdisciplinary and have more confusable papers in general.

Sampling, Modeling and Measurement Error in Inference from Clinical Text

August 18, 2011 by

Here’s a link to the slides of the talk I presented recently at ICML:

It’s basically a list of the kinds of things that can go wrong and introduce error (bias and noise) into inferences. Although the examples are mostly clinical (with one on baseball and one on cancer clusters), the point is generally applicable.

Small, Focused Workshops

I really like small, focused workshops, and this one was very good, with lots of presentations on people’s practical experiences launching systems in hospitals and working on fascinating text mining problems from clinical notes.

Thanks to the Organizers

Thanks again to the organizers, especially Faisal Farooq, who handled all the paperwork. It’s a pretty thankless job in my experience, but having done it myself, I can really appreciate how much work it is to run something that comes off smoothly.

I don’t know how long the page will last, but here’s a link to the workshop itself:

Unintended (Beneficial) Consequences

When Noémie Elhadad invited me to give a talk, I met with her to see if there was a topic I could talk about. During that meeting, she mentioned how hard it had been to hire an NLP programmer in biomedical informatics (it’s just as hard if not harder at a small company). The upshot is that Mitzi got a new job at Columbia into the bargain. In a way, it’s too bad, because I miss talking to Mitzi about her work in genomics, about which I know relatively little compared to NLP.

Buffering Exceptions

August 16, 2011 by

OK, now that it’s just us API fanatics, imagine you need to call a stream read in an interface method implementation, but the interface doesn’t declare an I/O exception. For concreteness, suppose we have an interface:

interface Foo {
    public void bar();
}

and we want to implement a Foo that calls streaming I/O, as in:

public class IOFoo implements Foo {
    public void bar() {
        InputStream in = ...;
        ...
        int numBytes = in.read(buf); // throws IOException
    }
}

A common example is the Java interface Runnable, which has a run() method which does not throw exceptions. Given that this is the basis for threads, the issue comes up all the time.

This won’t compile, because the call to the read method throws an I/O exception, which is a checked exception that is not declared on the run method. In fact, it can’t be declared on the run method, because the interface doesn’t declare an I/O exception. What I too often see is this broken workaround:

        int numBytes = -1;
        try {
            numBytes = in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

What’s wrong here? There’s no way for anyone calling the bar() method to recover the exception and handle it. You could document that it throws a RuntimeException, but all kinds of problems can lead to a runtime exception being thrown, and there’s no way to tell which threw it. (Yes, you could encode it in the message, but that’s an even worse idea, as it’s very hard to maintain and code against.)

An alternative approach is to follow the pattern used by Java’s PrintWriter class and buffer the exception then make it available to the caller. Here’s a simplified example that buffers a single exception:

public class IOFoo implements Foo {
    private IOException mException;
    public void bar() {
        InputStream in = ...;
        ...
        try {
            in.read(); // throws IOException
        } catch (IOException e) {
            mException = e;
            return;
        }
    }
    public IOException getIOException() {
        return mException;
    }
}

That way, a caller can do this:

IOFoo f = new IOFoo(...);
f.run();
if (f.getIoException() != null)
    throw (f.getIOException());

Now the context around the call to IOFoo’s run can catch the IOException. For instance, if this is in a method call, you can just declare the method to throw the checked IOException. Note that we need to declare f as an IOFoo, not just a Foo, because we call its getIOException method.

Java’s PrintWriter keeps a list of exceptions that have been raised in calls to its methods. That’s because it keeps going after exceptions, assuming the client would rather they be silently ignored. I don’t think that’s such a great idea, for the obvious reason that there’s no telling what’s going to happen when I/O exceptions are just ignored.

It would be straightforward to set a flag that stops any further action once an exception has been raised.

The case I was actually using this in was for was writing a static method to serialize a tokenized language model. Visiting the n-grams requires an IntArrayHandler whose handle method is not declared to throw any checked exceptions. I wanted any embedded exception to be buffered and thrown by the top-level static method.

“Sentiment Takes a LONG Time”, Said the Logistic Regression Classifier

August 12, 2011 by

Hello again, fellow LingPipers!  It is I, yet again, Sam The Intern.  In my last (and first) post I promised the masses that I would return with a study similar to the last one that I conducted, but this time doing the non-trivial case of sentiment.  It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream.  As my titular character, the logistic regression classifier so concisely said before, sentiment takes a long time!  And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.

As you may remember from my first post, I absolutely love to annotate.  However, annotating for sentiment takes much more time than language identification, because the annotator needs to internalize and judge the text, not just glance at it.  Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern that has been product-conditioned to Coca-Cola, but no time to do anything about it (reading 1000 tweets about how much people love coke will do that to you).

I was able to start the study, with small amounts of success.  The following confusion matrix is the result of  a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”).  The top of the chart is the reference data which I annotated, and the side is the response data that the classifier provided.  The categories are: w = neutral, e = positive, q = negative.

   w    e   q
w 1244,239,50
e 313 ,312,23
q 199 ,45 ,23

I wish that I had more time to spend on the sentiment classification problem.  Perhaps I will be back next summer, to continue the effort to teach computers good from bad.  Until then, I will be at Rochester, college-ing!  And drinking massive quantities of Coke…

Peter Jackson has Passed Away

August 4, 2011 by

I am very sorry to convey that Peter Jackson has passed away. Below is an email sent to Thomson Reuters from their CTO.

Peter was a great friend, early customer and advisor to LingPipe. I will miss him dearly.

Breck

—————-

From James Powell, CTO Thomson Reuters

Dear Colleagues,

It is with great sadness that I must inform you that Peter Jackson, our Chief Scientist and head of R&D, died suddenly this morning.

Peter JacksonPeter was a colleague of immense talents, both professionally and personally. He started his career in academia, and after a post as a Lecturer in Artificial Intelligence at Edinburgh University in Scotland he moved to the U.S. where he went on to become Associate Professor in the Department of Electrical and Computer Engineering at Clarkson University, NY.

A true global citizen who was born in Barbados and educated in England, Peter also served as Visiting Professor at the Artificial Intelligence Laboratory for the School of Electrical & Electronic Engineering at Singapore Polytechnic.

Peter switched from the corridors of academia to the meeting rooms of Rochester when he joined Lawyers Cooperative Publishing (LCP) back in 1995, as Lead Software Engineer for the Natural Language Group. LCP was acquired by West Group the following year.

His penetrating intelligence, allied to his outgoing and immediately likeable personality, ensured that Peter moved quickly through the ranks at Thomson; next in the West Group and then at Thomson Legal & Regulatory, where he rose to become Chief Research Scientist & Vice-President, Technology in 2005.

A lasting legacy

Peter was the obvious choice for the position of Chief Scientist when we became Thomson Reuters in 2008, and it was a role he fulfilled with exceptional results, forming a 40-strong R&D group as a corporate-wide resource, and aligning our Research & Development capabilities with the major platform and strategy initiatives of recent years.

Peter’s legacy at Thomson Reuters is significant and lasting. He oversaw the advanced technologies , such as CaRE, Concord and Results Plus, that enabled us to launch radical and successful new product platforms such as WestlawNext and PeopleMap. He was also integral to the development of our innovative Reuters Insider channel.

On top of that he represented Thomson Reuters with great professionalism — and considerable personality and panache — in the wider world of conferences, academia and professional associations. He was a great ambassador for Thomson Reuters in his role overseeing university liaison with regards to joint research projects with institutions of the caliber of MIT, NYU and CMU.

These achievements give some sense of Peter’s abilities and drive. Peter, though, had many more strings to his bow. He published three books and about 40 papers on his fields of expertise (artificial intelligence and natural language processing), and invented four US patents.

Above all, when it came to his work, Peter was proudest of the group that he built and worked with side-by-side every day.

If this wasn’t enough for one man, he was also a talented musician and a serious music fan, serving for a number of years on the Board of Directors for the Saint Paul Chamber Orchestra. In recent times, he traveled to Chicago, New York and Ojai to support the SPCO on tour. Many people in Eagan will remember Peter’s “Jazzkickers” band playing at employee events around campus.

He even demonstrated his guitar playing prowess in this internal technology video filmed at Joe’s Pub in NYC earlier this summer.

The many of us who worked with Peter will miss greatly his enthusiasm, the clarity of his thinking, his wisdom and his wit.

He leaves behind his wife Sandy, also a former employee here. Peter was 62.

LingPipe and JDK 7

July 28, 2011 by

[Update: 29 July 2011: Thanks to Dean Jones for pointing out that our friends at Lucid Imagination have found a deal-breaking bug in the initial JDK 7's optimization of loops. Here's the scoop from the source:

In any case, we won't be using any of the JDK 7 features in LingPipe until JDK 6 is retired.]
 

Oracle (seems odd to say that) just released version 7 of the Java developers kit, aka JDK 7.

What’s in JDK 7?

I haven’t been following what was going into JDK 7. I found this comprehensive list of changes:

This includes Project Coin, which involved changes to the language itslef. These are summarized nicely with examples at:

There were also changes to add module structure, some concurrency changes, some new I/O (nio) changes, support for Unicode 6 in internationalization, support for the latest XML, a new version of JDBC, a bunch of changes to Swing, and some changes to things like cryptography that I didn’t even know were in Java in the first place.

LingPipe Compatibility

The good news is that LingPipe seems to work just fine with Java 7. At least all of our unit tests pass.

There are a couple new compiler warnings. Variable list arguments (i.e., varargs) and generics don’t play nicely together. This should be obvious from the fact that varargs use an array and you can’t have generic arrays. So now I get very menacing error messages like

warning: [unchecked] Possible heap pollution from parameterized vararg type E

I’m a bit surprised they didn’t add this warning earlier, as it seems like a well understood and commonly made mistake. There’s a nice discussion of the underlying problem due to type erasure and arrays here:

And also a very nice discussion with great examples here:

From Kalanger’s discussion, I realized everywhere I got these errors was actually safe as the arrays in question didn’t persist beyond the scope of the function body where the user couldn’t get at them.

“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

July 25, 2011 by

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously airconditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performances of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try and explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case English, and Non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s problem comes in; Is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke, “Clinton”, etc.)? What we found is that, based on the input and test data, either classifier may do better at language identification depending on the circumstances.

Data:

Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved from Twitter, and duplicates were removed from this data by requiring a Jaccard distance of .5 or less between any two tweets. The effect of this is that any tweet with more than 50% token (word) overlap with any other tweet in the accepted data collection was rejected from the corpus (our collection of tweets to be used in the experiment).

After that came my absolute favorite part; Annotation. I annotated 500 tweets for being written in English (‘e’) or non-English (‘x’), for 5 queries (for a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries again two weeks later, and annotated these as well to form the second epoch of data (for a cumulative total of 5000 tweets). We checked if there were duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide if they were identical.

To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data.  She annotated 400 tweets of each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01% with one annotator serving as truth and the other the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated resulting in complete agreement.

Process:

Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), it trained and tested on the data with nGram sizes 1 through 15.

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.

This procedure was repeated for each of the five Twitter queries.

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, in the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training data consisted of all the other data minus the 500 tweets of data to be tested on and 500 tweets of data that was retrieved at an earlier epoch with the same query. All of this data, which contained a total of 4000 tweets for any given test data, was put into a list. Five-hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested the corpus using the same nGram looping structure as we did in the customized training process.

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.

Results:

Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-english. However, on the query “Mitsubishi”, the non-english classifications performed better. Out of the four queries in which English performed better, two queries (“Clinton”(Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for the English than the non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic

Figure 2 New

Figure 3 - Gaga Generic

Figure 4 NEW

Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was run again (at nGram size 4),  50 times.  The sample mean of these experiments’ f-measures were graphed, with a 95% confidence error bar (Figures 11 and 12).

Figure 11 - Non-English Classifier Comparrison

Figure 12 - English Classifier Comparrison

Part 3

When we graphed the generic classifiers’ performance versus the size of the of the training data, we found that the English classifications’ F-Measure increased slightly over increased training data. Non-English classifications’ F-Measure increased dramatically over increased training data. (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance

Discussion:

Parts 1+2

We came to the conclusion that neither generic nor customized training regimens are necessarily better than the other because each one performed better in different circumstances. We believe that the Non-English classifications for Mitsubishi scored higher than the English classifications because the Non-English category was larger than the English category, and also it was very coherent (with mostly asian writing, so that each category has either Roman characters or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.

Parts 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad Non-English category, it’s difficult for the classifier to identify what makes a tweet Non-English at low corpus-sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

Our data balance for each test are as follows (for the generic classifiers, the balance of the 4000 tweet list is shown):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

Crowdsourcing, Pair Programming and Code Review for Unit Testing

July 18, 2011 by

That last post on sorting primitive arrays in Java succeeded beyond my wildest dreams. But not in a way I had anticipated.

CrowdSourcing

The best part was the replies with suggestions for improved unit testing, both at the specific and general level. That is, crowdsourced unit tests.

I got fantastic feedback from David R. MacIver, which I’m incorporating into our code.

I also got some good general advice. Particularly, “Did you look in the libs for something that does what you want?” (yes) and “Did you profile before setting off on this perhaps ill-advised specialization?” (not yet, but with qualifications).

Code Review

This reminds me of what many of my friends and former colleagues at Google have said is the very best part of Google’s engineering practice: having code review by a peer before any commit. That seems Draconian to me, but hey, so did source control when I first heard about it. The Google engineers say it works because of reciprocity, coders have to review if they want their code reviewed. I can’t even imagine how great it’d be to have Martin Jansche or Mike Riley reviewing my C++ code.

Pair Programming

I spent four or five afternoons pair programming with Ben Goodrich lately. He knows R back to front, and I’ve had some experience designing and coding GUIs. So we sat down together and built a GUI (using gWidgets, a really cool pacakge) for mi, a multiple imputation package that Ben completely rewrote (but hasn’t yet released).

The bottom line is that pair programming is fantastic in a number of situations. One is like we had, where there is some complentarity in skill sets. The second, which is what I usually had at SpeechWorks, was when a newbie (me) was assigned to work with a veteran (typically Sasha Caskey for me). At LingPipe, Breck and I often sit down for an hour or two to bang something out. A third situation in which pair programming shines is when the work is painstakingly detail oriented and it’s hard for one person to keep track of all the moving pieces and keep all the parts of what’s going on in their head.

In talking to Matt Hoffman about pair programming as we walked back to the A train one evening, he said that he felt pair programming was perhaps the best use of time, but perhaps not the best use of energy. Why’d he say that? Most people’s reaction is that pair programming would be relatively relaxing, but unproductive as measured in person hours.

Little do the uninitiated know that pair programming is right up there with qualifying exams and academic job interviews in the exhausting department. It’s exactly the opposite experience from in-the-zone debugging when you find it’s 6 AM and don’t know where the time went.

The thing about pair programming is that it only rarely gets interrupted, so it’s long hours of intense, continuous concentration. And you often have to explain what you’re doing as you go along. In essence, you have to give what the psych folks call “protocol“. Herb Simon, my idol, was a big proponent; Violetta Cavalli-Sforza, while working with Herb at CMU, had me give protocol while solving GRE analytical problems. As effortless as I find solving simple logical puzzles, doing it effectively while giving protocol was nearly impossible. Sort of like patting your head and rubbing your stomach.

In pair programming, you don’t get stuck and start web surfing or answering e-mail (or texting or whatever). You just steam through all the hard problems, which can require significant energy and lots of decision making. With two people looking at every line of code, you catch most of the syntax mistakes and little logical mistakes like off-by-one errors before you even compile. It’s very hard to cut corners with another skilled programmer sitting next to you.

Oddly, what doesn’t happen, is that you don’t get stuck like Dr. Seuss’s Northgoing and Southgoing Zax. Compromising on how to do things is the easy part.

If you haven’t ever pair programmed, I’d highly recommend it. It’s properties will almost certainly confound your expectations if you’re like everyone else I’ve ever talked to about it.


Follow

Get every new post delivered to your Inbox.

Join 149 other followers