Bob’s ML Meetup Talk — Stan: A Bayesian Directed Graphical Model Compiler

January 6, 2012 by

I (Bob) am going to give a talk at the next NYC Machine Learning Meetup, on 19 January 2012 at 7 PM:

There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.

The official version 1 release is coming up soon, but until then, you can check out our work in progress at:

  • Google Code: Stan.

How to Close a LinkedIn Account with a “Large Network of Connections”

November 28, 2011 by

Last week I shut my LinkedIn account down.

My first attempt resulted in a warning web page saying I couldn’t shut it down because I had over 250 contacts. With that hint, I just deleted contacts until I had fewer than 250. Then I could close my account through their web form.

Please Don’t Go!

The first close attempt resulted in an e-mail from LinkedIn to their “customer support” group, cc-ed to me, asking them to close my account because I was unable to, citing the reason:

The member has a large network of connections to close. Please close during non-peak hours.

LinkedIn seems to be suggesting their site is so fragile it can’t be trusted to delete during peak hours.

A week later (as in today), I got the predicted response from “customer support”, namely a plea to stay. Today’s e-mail started with:

I’m sorry it’s taken so long to get back to you.

The apology seems rather disengenuous given the rest of the e-mail, which continued with:

I noticed that you have put a lot of effort into growing your LinkedIn network. Because of this, I wanted to confirm that you want to close an account with such a large number of connections.

Only after this did the e-mail start outlining further steps I’d have to take to close the account I’d already closed last week.

Why did I close my LinkedIn account? On the “con” side, it was a hassle to go through invitations to connect from people I didn’t know or had met once. I felt bad if I said “no” or “yes”. On the “pro” side, I couldn’t come up with anything. It’s not like I’m going to use LinkedIn to find a job.

Roberto takes over ICSI

November 15, 2011 by

Roberto Pieraccini, who was my boss at SpeechWorks (among other notable accomplishments), is going to be the new director of U. C. Berkeley’s International Computer Science Institute (ICSI).

Now this is a succession event I can (and did) wholeheartedly endorse. They asked me to write a recommendation letter as a former employee (his new colleagues all report to him; he’ll report to the trustees). I can paraphrase what I said: I’d be more than happy to work for Roberto again.

Twitter POS Tagging with LingPipe and ARK Tweet Data

November 4, 2011 by

The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

The relevant resources are:

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

You can get their code from their project home page, linked above.

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co https://aliasi.devguard.com/svn/sandbox/twitter-pos

The code’s in a single file, stored under the src subdirectory of the package:

package com.lingpipe.twpos;

import com.aliasi.classify.*;
import com.aliasi.corpus.*;
import com.aliasi.io.*;
import com.aliasi.hmm.*;
import com.aliasi.tag.*;
import java.io.*;
import java.util.*;

public class Eval {

    public static void main(String[] args) throws IOException {
        System.out.println("Reading Corpus");
        TwitterPosCorpus corpus
            = new TwitterPosCorpus(new File(args[0]));

        System.out.println("Training Tagger");
        HmmCharLmEstimator hmm = new HmmCharLmEstimator();
        corpus.visitTrain(hmm);
        HmmDecoder tagger = new HmmDecoder(hmm);

        System.out.println("Evaluating");
        boolean storeTokens = true;
        TaggerEvaluator evaluator
            = new TaggerEvaluator(tagger,storeTokens);
        corpus.visitTest(evaluator);
        System.out.println(evaluator.tokenEval());
    }

    static List<Tagging> parse(File f) throws IOException {
        List<Tagging> taggings
            = new ArrayList<Tagging>();
        FileLineReader reader = new FileLineReader(f,"UTF-8");
        List tokens = new ArrayList();
        List tags = new ArrayList();
        for (String line : reader) {
            String[] tokTag = line.split("\\s+");
            if (tokTag.length != 2) {
                taggings.add(new Tagging(tokens,tags));
                // System.out.println("tokens=" + tokens);
                // System.out.println("tags=" + tags);
                tokens = new ArrayList();
                tags = new ArrayList();
            } else {
                tokens.add(tokTag[0]);
                tags.add(tokTag[1]);
            }
        }
        return taggings;
    }

    static class TwitterPosCorpus extends ListCorpus<Tagging> {
        public TwitterPosCorpus(File path) throws IOException {
            for (Tagging t : parse(new File(path,"train")))
                addTrain(t);
            for (Tagging t : parse(new File(path,"dev")))
                addTrain(t);
            for (Tagging t : parse(new File(path,"test")))
                addTest(t);
        }
    }
}

LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.

Finally, here’s the confusion matrix for tags in the first-best output:

,D,E,#,!,G,&,@,A,$,L,N,O,,,U,T,V,P,S,R,~,^,X,Z
D,446,0,0,1,0,0,0,4,0,0,0,7,0,0,0,0,11,0,7,0,8,1,0
E,0,53,0,1,2,0,0,0,1,0,0,0,5,0,1,0,0,0,0,0,0,0,0
#,0,0,44,0,1,0,0,0,0,0,10,0,0,0,0,3,0,0,0,0,20,0,0
!,0,0,1,140,1,0,0,5,0,1,15,5,0,0,0,3,1,0,7,0,7,0,0
G,1,1,5,2,14,0,0,1,3,0,10,0,10,0,0,4,1,0,1,2,15,0,0
&,0,0,0,0,0,122,0,1,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0
@,0,0,0,0,0,0,328,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
A,0,0,0,1,0,1,0,248,3,0,44,0,0,0,2,30,2,0,24,0,12,0,0
$,0,0,0,0,0,0,1,0,79,0,2,0,0,0,0,0,3,0,0,0,0,0,0
L,2,0,0,0,1,0,0,0,0,120,3,1,0,0,0,2,0,0,0,0,0,0,0
N,1,0,1,5,1,0,0,49,1,1,783,2,0,0,2,52,6,0,14,0,63,0,0
O,4,0,0,0,1,0,0,2,0,0,2,456,0,0,1,0,0,0,2,0,4,0,0
,,0,4,0,0,2,0,0,0,0,0,0,0,861,0,0,2,0,0,0,11,0,0,0
U,0,0,0,1,0,0,0,0,0,0,0,0,1,114,0,0,0,0,0,0,1,0,0
T,0,0,0,0,0,0,0,0,0,0,0,1,0,0,24,0,9,0,1,0,1,0,0
V,0,1,0,0,0,0,0,21,0,1,69,1,0,0,0,921,9,0,7,2,21,0,0
P,2,0,0,1,0,0,0,4,1,0,1,0,0,0,11,6,571,0,12,0,4,0,0
S,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,2,0,0,1,0,0
R,4,0,0,1,0,0,0,13,0,0,20,1,0,0,1,6,15,0,269,0,8,1,0
~,0,0,0,1,1,0,0,0,0,0,0,0,32,0,0,1,0,0,0,177,0,0,0
^,1,0,4,1,2,0,0,29,2,0,101,0,2,0,0,16,4,0,1,0,331,0,1
X,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,3,0
Z,1,0,0,0,0,0,0,0,0,1,4,0,0,0,0,0,0,1,0,0,13,0,2

I should really figure out how to format that a bit more neatly.

“Academic” Licenses, GPL, and “Free” Software

November 3, 2011 by

[This post repeats a long comment I posted about licensing in response to Brendan O'Connor's blog entry, End-to-End NLP Packages. Brendan's post goes over some packages for NLP and singles out LingPipe as being only “quasi free.”]

Restrictive “Academic-Only” Licenses

Some of those other packages, like C&C Tools and Senna, are in the same “quasi free” category as LingPipe in the sense that they’re released under what their authors call “non-commercial” licenses. For instance, none of the Senna, C&C, or LingPipe licenses are compatible with GPL-ed code. Senna goes so far as to prohibit derived works altogether.

The LingPipe License

The intent for the

was a little different from the “academic use only” licenses in that we didn’t single out academia as a special class of users. We do allow free use for research purposes for industrialists and academics alike. We also provide a “developers” license that explicitly gives you this right, which makes some users’ organizations feel better.

Truly Free NLP Software

The other tools, like NLTK, Mallet, OpenNLP, and GATE are released under more flexible licenses (LGPL, Apache or BSD), which I really do think of as being truly “free”. Mahout’s also in this category, though not mentioned by Brendan, whereas packages like TreeTagger are more like Senna or C&C in their restrictive “academic only” licensing.

Stanford and the GPL

Stanford NLP’s license sounds like it was written by someone who didn’t quite understand the GPL. Their page says (the link is also theirs):

The Stanford CoreNLP code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software.

Technically, what they say is true. It would’ve been clearer if they’d replaced “research” with “research and non-research” and “free” with “free and for-profit”. Instead, their choice of examples suggests “free” or “research” have some special status under the GPL, which they don’t. With my linguist hat on, I’d say their text leads the reader to a false implicature. The terms “research” and “academia” don’t even show up in the GPL, and although “free” does, GNU and others clarify this usage elswewhere as “free as in free speech”, not “free as in free beer”.

Understanding the GPL

The key to understanding the GPL lies behind Stanford’s embedded link to

Here, proprietary doesn’t have to do with ownership, but rather with closed source. Basically, if you redistribute source code or an application based on GPL-ed code, you have to also release your code under the GPL, which is why it’s called a “copyleft” or “viral” license. In some cases, you can get away with using a less restrictive license like LGPL or BSD for your mods or interacting libraries, though you can’t change the underlying GPL-ed source’s license.

GPL Applies to Academics, Too

There’s no free ride for academics here — you can’t take GPL-ed code, use it to build a research project for your thesis, then give an executable away for free without also distributing your code with a compatible license. And you can’t restrict the license to something research only. Similarly, you couldn’t roll a GPL-ed library into Senna or C&C or LingPipe and redistribute them under their own licenses. Academics are often violating these terms because they somehow think “research use only” is special.

Services Based on GPL-ed Software and the AGPL

You can also set up a software service, for example on Amazon’s Elastic Compute Cloud (EC2) or on your own servers, that’s entirely driven by GPL-ed software, like say Stanford NLP or Weka, and then charge users for accessing it. Because you’re not redistributing the software itself, you can modify it any way you like and write code around it without releasing your own software. GNU introduced the Affero GPL (AGPL), a license even more restrictive than the GPL that tries to close this server loophole for the basic GPL.

Charging for GPL-ed Code

You can charge for GPL-ed code if you can find someone to pay you. That’s what RedHat’s doing with Linux, what Revolution R’s doing with R, and what Enthought’s doing with Python.

LingPipe’s Business Model is Like MySQL’s

Note that this is not what MySQL did with MySQL (before they sold it to Oracle) nor is it what we do with LingPipe. In both those cases, the company owns all the intellectual property and copyrights and thus is able to release the code under multiple licenses. This strategy’s explained on the

We license LingPipe under custom licenses as well as our royalty-free license. These licenses include all sorts of additional restrictions (like only using some of the modules on so many servers) and additional guarantees (like indemnification and maintenance); don’t ask me about the details — that’s Breck’s bailiwick. Suffice it to say most companies don’t like to get involved with copyleft, be it from GPL or LingPipe’s royalty-free license. So we let them pay us extra and get an unencumbered license so they can do what they want with LingPipe and not have to share their code. We’ve had more than one customer buy commercial license for LingPipe who wouldn’t even tell us what they were going to do with our software.

Free “Academic” Software

Also, keep in mind that as an academic, your university (or lab) probably has a claim to your intellectual property developed using their resources. Here’s some advice from GNU on that front:

 

Oracle buys Endeca, HP buys Autonomy, Microsoft buys FAST

October 28, 2011 by

The news that Oracle’s buying Endeca sounds awfully familiar. But this time it cuts a little closer to home, because we’re an Endeca technology partner. Endeca has been a great customer to work with — we’ve been really impressed with their engineers at every turn.

Clean Sweep

I believe this makes it almost a clean sweep of the medium-to-medium-large-sized independent search companies. Maybe Vivisimo will be next. Of course, there are still small companies delivering search via Apache Lucene and SOLR, such as Sematext and Lucid Imagination. I imagine they will be delighted that yet another small competitor was snapped up by a tech giant.

2008: Microsoft buys FAST

By combining the innovation and agility of FAST with the discipline and resources of Microsoft, our customers get the best of both worlds: market-leading products from a trusted technology partner. … Enterprise Search from Microsoft offers best-in-class technologies…

from: Microsoft fact sheet

2011: HP buys Autonomy

Autonomy brings to HP higher value business solutions that will help customers manage the explosion of information. Together with Autonomy, we plan to reinvent how both unstructured and structured data is processed, analyzed, optimized, automated and protected. … this bold action will squarely position HP in software and information to create the next-generation Information Platform, and thereby, create significant value for our shareholders.

from: HP press release

2011: Oracle buys Endeca

Combination [of Oracle and Endeca] provides best-in-class technology and applications for unstructured data management, business intelligence, and web commerce. … The convergence of structured and unstructured information is driving the need for a common data management and analytics platform.

from: Oracle Press Release

Maybe the tough economic climate has made it hard for small-to-medium-sized tech companies to survive without the deep pockets of a successful large tech company. Maybe Oracle made them an offer that was too good to refuse. Far better to sell when you can get a good price than to suffer the fate of Yahoo! (or SpeechWorks, for that matter).

Make us an Offer?

On that note, feel free to make us an offer for LingPipe.

 

“Smoothing” Measurement Errors

October 27, 2011 by

The issue came up in the last blog post, Discontinuities in ROC calculations, about smoothing out Area-under-the-curve (AUC) calculations for ROC and PR curves given observed data points that are very close together.

If you have a bunch of point measurements that you assume were taken with some error, what you can do is model the error itself. Suppose we don’t want to think too hard and just make the error normal, so that a measurement of a is replaced with a distribution \mbox{\sf Norm}(a,\sigma^2), where \sigma is the standard deviation of the error.

Then, rather than computing a function f(a) based on the empirical measurement a, instead compute the posterior of f(X) given a random variable X \sim \mbox{\sf Norm}(a,\sigma^2).

To get back to something on the same scale, it’s typical to use a Bayesian point estimate that minimizes expected square error, namely the expectation. Now, instead of a point f(a), such as an AUC statistic, defined precisely by the observation a, we have a point defined by the expectation of a characterized by our normal measurement error model.

\mathbb{E}[f(X)] = \int_X f(x) \ \mbox{\sf Norm}(x|a,\sigma^2) \ dx.

We can also compute even probabilities, such as comparing two systems with noisy measurements a and b, with noise \sigma, by

\mbox{Pr}[f(X) < f(Y)]

{} = \int \int \mathbb{I}[f(X) < f(Y)] \ \mbox{\sf Norm}(x|a,\sigma^2) \ \mbox{\sf Norm}(y|b,\sigma^2) \ dx \ dy.

Of course, this leaves the problem of how to estimate \sigma. Ideally, we'll have lots of measurements of the same thing and be able to estimate \sigma directly, amounting in what's called an "empirical Bayes" estimate (note that "empirical Bayes" is not a full Bayesian method and is no more empirical than full Bayes, which is also estimated from data). Of course, we could go the full Bayes route integrate over our posterior p(\sigma|...) for \sigma, giving us

\mathbb{E}[f(X)] = \int_0^{\infty} \int_X f(x) \ \mbox{\sf Norm}(x|a,\sigma^2) \ p(\sigma|...) \ dx \ d\sigma.

As Mike Ross pointed out, the result is going to be sensitive to \sigma. In particular, as Mike pointed out, as measurement error goes to zero, we get our original statistic back,

\lim_{\sigma \rightarrow 0} \mathbb{E}[f(X)] = f(a).

Similarly, as \sigma gets big, the limit of f(X) and f(Y) converge if X \sim \mbox{\sf Norm}(a,\sigma)^2 and Y \sim \mbox{\sf Norm}(b,\sigma^2).

The bottom line is that you'd like a sensible model of measurement error.

Real AUC calculations depend on more than one point. If you have a vector points \hat{x} \in [0,1]^N with discrete true values x \in \{0, 1\}^N, you just chain the integrals together, as in

\mathbb{E}[\mbox{AUC}(\hat{x},x)]

{} = \int \cdots \int \ \mbox{AUC}(y,x) \ \prod_{n=1}^N \ \mbox{\sf Norm}(y_n|\hat{x}_n,\sigma) \  dx_1 \cdots dx_N.

With a normal noise model (or really any model you can sample from), it's easy to compute this integral using Monte Carlo methods. Just take a sequence of M samples y^{(m)}, where y^{(m)}_n is drawn independently from \mbox{\sf Norm}(\hat{x}_n,\sigma^2), and approximate the integral required to compute the expectation by

\frac{1}{M} \ \sum_{m=1}^M \mbox{AUC}(y^{(m)},x).

As the number of samples M grows, this summation converges to the expectation,

\lim_{M \rightarrow \infty} \frac{1}{M} \ \sum_{m=1}^M \mbox{AUC}(y^{(m)},x) = \mathbb{E}[\mbox{AUC}(\hat{x},x)].

Discontinuities in ROC Calculations

October 26, 2011 by

Amaç Herdagdelen brought up some interesting examples on our mailing list that caused LingPipe’s ROC curve (and PR curve) implementation to provide some unexpected (and arguably wrong) results. (We then took the detailed discussion off line, and are now reporting back.)

Computing ROC Curves by Sorting

The basic problem is that the computation of an ROC curve involves first sorting the responses by the system’s estimated probability a response should be positive, then incrementally computing sensitivity versus 1 – specificty operating points by working down the ranked list. You can see a worked example in the class documentation for

What if there are Ties?

Mike Ross realized there was a problem when you had two examples with the same score, but different reference values. For instance, a system might be looking at two Tweets and estimate a 0.98 probability that both are positive sentiment. If one was classified as positive in the reference (i.e., the gold standard) and the other negative, then there’s a problem. With one order, you get one curve, and with the other order, you get a different curve. Mike figured out we might as well just leave off the intermediate point, because we get to the same operating point after handling both of them.

What if there are Near Ties?

Then Amaç wrote back after noticing he had some cases that were very close, but not quite, the same. This comes up with arithmetic precision all the time. In his case, it was different one-versus-all results for a multi-category classifier. He suggested treating two results as the same if they were within some small absolute value of each other. This is the typical thing to do when checking results are the same in unit tests (or code) — it’s built into junit for Java and googletest for C++.

Discretized ROC Curve Estimates are Discontinuous

That led me (Bob) to realize that the real problem is that these measures are discontinuous in the underlying scores. Suppose I have three items, A, B and C, with true values POS, NEG, and POS. Now if I have assigned scores to them of 0.97, 0.96, and 0.98, I get a curve based on the sequence TP, TP, FP, or a perfect score. I get the same reslt as the score for B moves from 0.96 to any value less than 0.97. But at the point it crosses 0.97, I wind up with a sequence TP, FP, TP, which is a whole different story.

The upshot is that area under the ROC curve (or PR curve) is not continuous in the underlying scores.

Help? Should we Probabilistically Smooth?

Does anyone have any idea what we should do here?

I’m thinking moving to some kind of probabilistic version might be able to smooth things out in a well-defined way. For instance, if I add normally-distributed noise around each point and then integrate it out, things are smooth again.

Domain Adaptation with Hierarchical Logistic Regression

September 29, 2011 by

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup look more like naive Bayes.

The Data

We’ll use the same data representation as in the last post, with D being the nubmer of domains, with I_d docs in domain d. Document i in domain d will have N[d,i] tokens. We’ll assume V is the size of the vocabulary.

Raw document data provides a token x[d,i,n] \in 1{:}V for word n of document i of domain d. Assuming a bag-of-words representation like we used in naive Bayes, it’ll be more convenient to convert each document into a word frequency vector u[d,i] of dimensionality V. To match naive Bayes, define a word frequency vector u[d,i] with value at word (or intercept) w given by

u[d,i,w] = \sum_{n = 1}^{N[d,i]} \mathbb{I}(x[d,i,n] = w),

where \mathbb{I}(\phi) is the indicator function, taking on value 1 if \phi is true and 0 otherwise. Now that we have continuous data, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor 1 to each vector of predictors u[d,i]. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.

The labeled training data consists of a classification z[d,i] \in \{ 0, 1 \} for each document i in domain d.

The Model, Lower Level

The main parameter to estimate is a coefficient vector \beta[d] of size V + 1 for each domain d. Here, the first dimension will correspond to the intercept and the other dimensions each correspond to a word in the vocabulary.

The probabilty that a given document is of category 1 (rather than category 0) is given by

\mbox{Pr}[z[d,i] = 1] = \mbox{logit}^{-1}(\beta[d]^T u[d,i]).

The inner (dot) product is just the sum of the dimensionwise products,

\beta[d]^T u[d,i] = \sum_{v = 0}^V \beta[d,v] \times u[d,i,v].

This value, called the linear prediction, can take values in (-\infty,\infty). The logistic sigmoid function

\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))

maps the unbounded linear prediction to the range (0,1), where it is taken to represent the probabilty that the category z[d,i] of document i in domain d is 1.

In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,

z[d,i] \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\beta[d]^T u[d,i]).

Unlike naive Bayes, logistic regression model does not model the word data u, instead treating it as constant.

The Model, Upper Levels

We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want here. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.

Specifically, we are going to pool the prior across domains by sharing the priors for words v across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale coveriance to NLP-sized coefficient vectors), we draw the component for word v of the coefficients for domain d from a normal distribution,

\beta[d,v] \sim \mbox{\sf Normal}(\mu[v],\sigma[v]^2).

Here \mu[v] represents the mean value of the coefficient for word (or intercept) v across domains and \sigma[v]^2 is the variance of the coefficient values across domains. A strong prior will have low variance.

All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically

\mu[v] \sim \mbox{\sf Normal}(0,\tau^2),

where \tau^2 is a constant variance parameter. Typically, the variance \tau^2 is set to a constant to make the entire prior on the means weakly informative. Alternatively, we could put a prior on \tau itself.

Next, we need priors for the \sigma[v] terms. Commonly, these are chosen to be inverse \chi^2 (a specific kind of inverse gamma distribution) distributions for convenience because they’re (conditionally) conjugate. Thus the typical model would take the variance \sigma[v]^2 to be the parameter, giving it an inverse gamma prior,

\sigma[v]^2 \sim \mbox{\sf InvGamma}(\alpha,\beta).

Gelman argues against inverse gamma priors on variance because they have pathological behavior when the hierarchical variance terms are near zero.
Gelman prefers the half-Cauchy prior, because it tends not to degenerate in hierarchical models like the inverse gamma can. The half Cauchy is just the Cauchy distribution restricted to positive values (thus doubling the usual density, but restricting support to non-negative values). In symbols, we generate the deviation \sigma[v] (not the variance) using a (non-negative) Half-Cauchy distribution:

\sigma[v] \sim \mbox{HalfCauchy}().

And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.

Properties of the Hierarchical Logisitic Regression Model

Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.

One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector \beta[d] for domain d only needs to model words that discriminate positive (category 1) from negative (category 0) examples. Thus the correct value for \beta[d,v] for a word v is 0 if the word does not discriminate between positive and negative documents. Thus our hierarchical hyperprior has location 0 in order to pull the prior location \mu[v] for each word v to 0.

The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).

Comparison to SAGE

If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,

\beta[d,v] \sim \mbox{\sf Laplace}(\mu[d],\sigma[d]^2)

then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).

Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set

\beta[d,v] = \alpha[d,v] + \mu[d]

and sample

\alpha[d,v] \sim \mbox{Laplace}(0,\sigma[d]^2)

you’ll see that if the variance term \sigma[d] is low enough, many of the \alpha[d,v] values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in \alpha, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than we have evidence for. Of course, resarchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.

Adding Covariance

Presumably, the values of coefficients have will have non-negligible covariance. That is, I expect words like “thrilling” and “scary” to covary. For thriller movies, they’re positive sentiment terms and for appliances, rather negative. The problem with modeling covariance among lexical items is that we usually have tens of thousands of them. Not much software can handle a 10K by 10K matrix.

Another way to add covariance instead of using a covariance model for “fixed effects” is instead convert to random effects. For instance, a model like latent Dirichlet allocation (LDA) models covariance of words by grouping them into topics. For instance, two words with high probabilities in the same topic will have positive covariance.

Further Reading

I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis. Together, they’ll give you a much deeper understanding of the issues in hierarchical (or more generally, multilevel) modeling.

Domain Adaptation with Hierarchical Naive Bayes Classifiers

September 22, 2011 by

This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).

Domain Adaptation

Within natural language processing, the term “domain adaptation” has come to mean something like “using data in one domain to improve results in another domain”. Examples that have received attention include positive/negative sentiment for reviews of different kinds of products, with, say, DVDs being one domain and kitchen appliances another (Blitzer et al. 2007). Other examples classifying sequences of tokens as names across different genres of text, such as newswire and transcribed speech (Daumé III 2007, Finkel and Manning 2009).

The work I’ve seen on adaptation has largely been in the supervised case, but what we’re going to do here can be unsupervised, in which case you get something like soft K-means clustering. It can also be semi-supervised, with some documents having category labels and some not having labels. I’ve talked about semi-supervised models before, too, as recently as the very last blog post, covering Tang and Lease’s semi-supervised annotation model and in our replication of Nigam et al.’s semi-supervised naive Bayes EM estimator in the LingPipe tutorial on semi-supervised naive Bayes classification.

Hiearchical Models

The standard Bayesian approach to domain adaptation is to build a hierarchical model. The hierarchy in this case is the hieararchy of domains. We’ll just consider the flat case, where there’s a collection D of what we’ll call “domains”. In the next post, we’ll generalize this assumption to richer organizations of domains with cross-cutting features.

Naive Bayes

I’ve blogged before about properly Bayesian approaches to naive Bayes classifiers, specifically about Bayesian naive Bayes inference and collapsed Gibbs sampling for computing it.

Pooling, Exchangeability and Hieararchy

What we’re going to do is let information in one domain, such as which words are positive or which phrases are people names, inform estimates for other domains. This is sometimes referred to as “pooling” information across sub-populations in statistics. A simple example from political science that illustrates the general flavor of models we’re working on at Columbia is allowing estimates of the effect of income on voting in one state to inform estimates in other states, while still allowing the states to differ (an example is given in Gelman et al.’s What’s wrong with Connecticut?).

Theoretically, what we’re going to do amounts to assuming that the domains themselves are exchangeable. We use exchangeable models when we don’t have any information to distinguish the domains. We’ll discuss the case where we do have such information in the next post.

In Bayesian models, we treat exchangeable variables as being drawn from a population distribution (aka, a prior). In hierarchical models, we model the population directly. The reason it’s called hierarchical is that in a proper Bayesian model, there must be a prior distribution on the domain model. So we get a prior on the parameters of the domain model, which then acts as a prior on the estimates for each domain. The pooling is achieved by information flow through the dependencies in the joint distribution implied by the graphical model. In practice, this is simpler than it sounds.

Hieararchical Naive Bayes

Once we’ve got our heads around the Bayesian formulation of naive Bayes, extending it to hieararchical models is straightforward. We’re going to use the language of documents and tokens in describing naive Bayes, but it’s really a general multinomial model, so don’t assume this is only valid for text classifiers.

For simplicity, we’ll assume we have a binary classifier. I’ll say more about the multi-category case later and in the next post.

The Data

Let D be the number of domains. Let I_d be the number of documents in domain D and N_{d,i} the number of tokens in document i \in 1{:}I_d of domain d \in 1{:}D. Let V be the size of the vocabulary from which the tokens are drawn.

The raw document data we have provides a token x[d,i,n] \in 1{:}V for each token n \in N_{d,i} for each document i \in 1{:}I_d for each domain d \in 1{:}D.

The labeled training data we have is a classification z[d,i] \in \{ 0, 1 \} for each document i \in 1{:}I_d in each domain d \in 1{:}D.

The Model, Lower Levels

The hierarchical naive Bayes model is a directed graphical model and as such, easy to describe with sampling notation.

We have a parameter \pi[d] for the prevalence of category 1 documents in domain d \in 1{:}D. It is sampled from a beta distribution with prior count parameters \alpha_{\pi}, \beta_{\pi} (more about which later),

\pi[d] \sim \mbox{\sf Beta}(\alpha_{\pi}, \beta_{\pi}).

The categories assigned to each document are sampled from a Bernoulli distribution parameterized by prevalence in the document’s domain,

z[d,i] \sim \mbox{\sf Bern}(\pi[d]).

Each domain d and outcome category c \in \{ 0, 1 \} will have its own naive Bayes model with multinomial V-simplex parameter \phi[d,c] \in [0,1]^V, such that

\sum_{v = 1}^V \phi[d,c,v] = 1,

describing the distribution of vocabulary items v \in V in category c \in \{ 0, 1 \} in domain d \in D.

These category word distributions for each domain are drawn from a Dirichlet prior \gamma[c] appropriate to the outcome category c \in \{ 0, 1 \},

\phi[d,c] \sim \mbox{\sf Dir}(\gamma[c]).

We then model the tokens in the standard way for naive Bayes, as being drawn independently (that’s the “naive” part) from the discrete distribution associated with their domain d and the category z[d,i] of document i,

x[d,i,n] \sim \mbox{\sf Disc}(\phi[d, z[d,i]]).

The Model, Upper Levels

So far, that’s only the lower level of the model. As I mentioned earlier, the priors characterize the populations of prevalences and vocabulary distributions for categories across domains. And we’re using hierarchical models on both sides of naive Bayes, for the categories themselves and for the topic distributions. There are many choices for how to model these populations. We’ll go with a very simple model that only characterizes prior means and variance, not prior covariance; as an example of a prior modeling covariance for multinomials, see the prior for topics in Lafferty and Blei’s correlated topic model.

It’s easier to start with the prevalences. The prior model here will model the mean prevalence of outcome 1 (say, the “positive” outcome for sentiment) across domains, as well as its variance. The mean of the density \mbox{\sf Beta}(\alpha_{\pi},\beta_{\pi} is \alpha_{\pi}/(\alpha_{\pi} + \beta_{\pi}) and the variance is inversely related to the total prior count \alpha_{\pi} + \beta_{\pi}.

For convenience, we’ll reparameterize \alpha_{\pi}, \beta_{\pi} in terms of total prior count \kappa_{\pi} \in (0,\infty) and prior mean \psi_{\pi} \in [0,1], by setting

\alpha_{\pi} = \psi_{\pi} \kappa_{\pi}, \ \ \beta_{\pi} = (1 - \psi_{\pi}) \kappa_{\pi}.

We’ll put a simple conjugate uniform prior on the prior mean \psi_{\pi}, taking

\psi_{\pi} \sim \mbox{\sf Beta}(1,1).

(Note that \mbox{\sf Beta}(1,1) is uniform on its support, [0,1].)

We’ll take a very weakly informative prior favoring high variance (low prior count) on the total prior counts,

\kappa_{\pi} \sim \mbox{\sf Pareto}(0.5, 1.5),

where for x > 0.5,

\mbox{\sf Pareto}(x|0.5,1.5)  \propto x^{-1.5}.

The effect of this decision is that the estimate for prevalence \pi[d] in a single domain d will be somewhere between the overall average \psi_{\pi} and the training data average, depending on the relative strength \kappa_{\pi} estimated for the prior and the number of training examples for domain d. (A standard shortcut here would be to use either cross-validation or “empirical Bayes” to set the priors \alpha_{\pi}, \beta_{\pi}.

We do the same thing for the Dirichlet prior, reparameterizating the prior count V-vector \gamma[c] for vocabularly items for outcome c \in \{0,1\} as a prior mean V-simplex \theta[b] and prior total count \rho[b], setting

\gamma[c] = \rho[b] \theta[b].

We set a uniform prior on the prior vocabulary distribution \theta[b],

\theta[b] \sim \mbox{Dir}(1),

and another weakly informative Pareto prior on the prior counts,

\rho \sim \mbox{Pareto}(0.5,1.5).

As the mathematicians are wont to say, Q.E.D..

Tell Us What We’ve Won

What we do get is sharing of discriminative terms across domains. For instance, with polarity, positive term weights in each domain affect the prior, and the prior affects the average in each domain. The prior’s effect on estimates goes by the name “smoothing” in NLP. More specifically, the prior pulls the estimate for each domain and category toward the prior mean for that category by an amount inversely related to the prior variance (low prior variance exerting a stronger smoothing effect) and inversely related to the data count (with higher data counts, the prior’s effect is weaker).

We also get sharing of the non-discriminative terms among category 0 instances across domains and sharing among non-discriminative terms among category 1 instances.

But what about…

… the fact that the model has to estimate the whole vocabulary of non-discriminative terms in both the negative and positive vocabulary priors \theta[0] and \theta[1]?

For instance, we’d expect “awesome” and “crap” to have a very different distribution in positive and negative sentiment product reviews, whereas “the” and “of” should be less discriminative.

Turtles All the Way Up

The next level here would be to add a more informative prior for the prior per-category vocabulary distribution \theta[d]. A very good place to start here would be with a prior mean distribution \theta[d] estimated from a large corpus of text without category labels. That would put an empirical Bayes step into the mix and make the prior over priors much more informative (as is, we’ve set it to be symmetric). We could apply full Bayesian inference by just building the unsupervised data into the model with what the statisticians like to call “missing” labels.

We could do the same thing for the prior mean for prevalence of categories if we have samples of other domains we could use.

One nice feature of hierarchical models is that it’s easy to extend the hierarchy (for an extreme example, see Li and McCallum’s paper on their aptly named technique, pachinko allocation).

It’s trivial to extend this hiearchical naive Bayes model to the multi-category setting. We kept things simple here because it’ll greatly simplify the next post on multilevel modeling with logistic regression. In the naive Bayes setting, just replace the Bernoulli distribution with a discrete distribution and the beta distribution with a Dirichlet.

It’s also trivial to extend the semi-supervised case. Any number of variables may be “missing” in a directed graphical model without hampering inference (theoretically it’s simple; computationally it requires more bookkeeping and perhaps more samplers). As we mentioned earlier, this is especially helpful in the setting where we’re estimating hyperpriors (here the top-level vocabulary prior over per-category vocabulary priors over per-category/per-domain vocabulary distributions).

Empirical Evaluations?

This will work. The hieararchical modeling strategy pretty much always works. The reason is that with the right hyper-hyper priors, it’ll reduce to the original model. The real question is will it help?

We should see big gains in small count data.

I haven’t done any. If anyone else has evaluated hierarchical naive Bayes, please say so in the comments or e-mail me and I’ll write a link into the post itself.

A good data set would be Dredze et al.’s multi-domain sentiment data set. I’d at least like to see the results of this relatively simple and standard naive Bayes model and the one I’ll describe in the next post before delving into “deep” models a la (Glorot et al. 2011).

Wow, that was all only four lines of notes in my notebook. I wasn’t expecting this to run so long! The next post has six lines of notes, so watch out.


Follow

Get every new post delivered to your Inbox.

Join 40 other followers