This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).
Within natural language processing, the term “domain adaptation” has come to mean something like “using data in one domain to improve results in another domain”. Examples that have received attention include positive/negative sentiment for reviews of different kinds of products, with, say, DVDs being one domain and kitchen appliances another (Blitzer et al. 2007). Other examples classifying sequences of tokens as names across different genres of text, such as newswire and transcribed speech (Daumé III 2007, Finkel and Manning 2009).
The work I’ve seen on adaptation has largely been in the supervised case, but what we’re going to do here can be unsupervised, in which case you get something like soft K-means clustering. It can also be semi-supervised, with some documents having category labels and some not having labels. I’ve talked about semi-supervised models before, too, as recently as the very last blog post, covering Tang and Lease’s semi-supervised annotation model and in our replication of Nigam et al.’s semi-supervised naive Bayes EM estimator in the LingPipe tutorial on semi-supervised naive Bayes classification.
The standard Bayesian approach to domain adaptation is to build a hierarchical model. The hierarchy in this case is the hieararchy of domains. We’ll just consider the flat case, where there’s a collection of what we’ll call “domains”. In the next post, we’ll generalize this assumption to richer organizations of domains with cross-cutting features.
I’ve blogged before about properly Bayesian approaches to naive Bayes classifiers, specifically about Bayesian naive Bayes inference and collapsed Gibbs sampling for computing it.
Pooling, Exchangeability and Hieararchy
What we’re going to do is let information in one domain, such as which words are positive or which phrases are people names, inform estimates for other domains. This is sometimes referred to as “pooling” information across sub-populations in statistics. A simple example from political science that illustrates the general flavor of models we’re working on at Columbia is allowing estimates of the effect of income on voting in one state to inform estimates in other states, while still allowing the states to differ (an example is given in Gelman et al.’s What’s wrong with Connecticut?).
Theoretically, what we’re going to do amounts to assuming that the domains themselves are exchangeable. We use exchangeable models when we don’t have any information to distinguish the domains. We’ll discuss the case where we do have such information in the next post.
In Bayesian models, we treat exchangeable variables as being drawn from a population distribution (aka, a prior). In hierarchical models, we model the population directly. The reason it’s called hierarchical is that in a proper Bayesian model, there must be a prior distribution on the domain model. So we get a prior on the parameters of the domain model, which then acts as a prior on the estimates for each domain. The pooling is achieved by information flow through the dependencies in the joint distribution implied by the graphical model. In practice, this is simpler than it sounds.
Hieararchical Naive Bayes
Once we’ve got our heads around the Bayesian formulation of naive Bayes, extending it to hieararchical models is straightforward. We’re going to use the language of documents and tokens in describing naive Bayes, but it’s really a general multinomial model, so don’t assume this is only valid for text classifiers.
For simplicity, we’ll assume we have a binary classifier. I’ll say more about the multi-category case later and in the next post.
Let be the number of domains. Let be the number of documents in domain and the number of tokens in document of domain . Let be the size of the vocabulary from which the tokens are drawn.
The raw document data we have provides a token for each token for each document for each domain .
The labeled training data we have is a classification for each document in each domain .
The Model, Lower Levels
The hierarchical naive Bayes model is a directed graphical model and as such, easy to describe with sampling notation.
We have a parameter for the prevalence of category 1 documents in domain . It is sampled from a beta distribution with prior count parameters (more about which later),
The categories assigned to each document are sampled from a Bernoulli distribution parameterized by prevalence in the document’s domain,
Each domain and outcome category will have its own naive Bayes model with multinomial -simplex parameter , such that
describing the distribution of vocabulary items in category in domain .
These category word distributions for each domain are drawn from a Dirichlet prior appropriate to the outcome category ,
We then model the tokens in the standard way for naive Bayes, as being drawn independently (that’s the “naive” part) from the discrete distribution associated with their domain and the category of document ,
The Model, Upper Levels
So far, that’s only the lower level of the model. As I mentioned earlier, the priors characterize the populations of prevalences and vocabulary distributions for categories across domains. And we’re using hierarchical models on both sides of naive Bayes, for the categories themselves and for the topic distributions. There are many choices for how to model these populations. We’ll go with a very simple model that only characterizes prior means and variance, not prior covariance; as an example of a prior modeling covariance for multinomials, see the prior for topics in Lafferty and Blei’s correlated topic model.
It’s easier to start with the prevalences. The prior model here will model the mean prevalence of outcome 1 (say, the “positive” outcome for sentiment) across domains, as well as its variance. The mean of the density is and the variance is inversely related to the total prior count .
For convenience, we’ll reparameterize in terms of total prior count and prior mean , by setting
We’ll put a simple conjugate uniform prior on the prior mean , taking
(Note that is uniform on its support, .)
We’ll take a very weakly informative prior favoring high variance (low prior count) on the total prior counts,
where for ,
The effect of this decision is that the estimate for prevalence in a single domain will be somewhere between the overall average and the training data average, depending on the relative strength estimated for the prior and the number of training examples for domain . (A standard shortcut here would be to use either cross-validation or “empirical Bayes” to set the priors .
We do the same thing for the Dirichlet prior, reparameterizating the prior count -vector for vocabularly items for outcome as a prior mean -simplex and prior total count , setting
We set a uniform prior on the prior vocabulary distribution ,
and another weakly informative Pareto prior on the prior counts,
As the mathematicians are wont to say, Q.E.D..
Tell Us What We’ve Won
What we do get is sharing of discriminative terms across domains. For instance, with polarity, positive term weights in each domain affect the prior, and the prior affects the average in each domain. The prior’s effect on estimates goes by the name “smoothing” in NLP. More specifically, the prior pulls the estimate for each domain and category toward the prior mean for that category by an amount inversely related to the prior variance (low prior variance exerting a stronger smoothing effect) and inversely related to the data count (with higher data counts, the prior’s effect is weaker).
We also get sharing of the non-discriminative terms among category 0 instances across domains and sharing among non-discriminative terms among category 1 instances.
But what about…
… the fact that the model has to estimate the whole vocabulary of non-discriminative terms in both the negative and positive vocabulary priors and ?
For instance, we’d expect “awesome” and “crap” to have a very different distribution in positive and negative sentiment product reviews, whereas “the” and “of” should be less discriminative.
The next level here would be to add a more informative prior for the prior per-category vocabulary distribution . A very good place to start here would be with a prior mean distribution estimated from a large corpus of text without category labels. That would put an empirical Bayes step into the mix and make the prior over priors much more informative (as is, we’ve set it to be symmetric). We could apply full Bayesian inference by just building the unsupervised data into the model with what the statisticians like to call “missing” labels.
We could do the same thing for the prior mean for prevalence of categories if we have samples of other domains we could use.
One nice feature of hierarchical models is that it’s easy to extend the hierarchy (for an extreme example, see Li and McCallum’s paper on their aptly named technique, pachinko allocation).
It’s trivial to extend this hiearchical naive Bayes model to the multi-category setting. We kept things simple here because it’ll greatly simplify the next post on multilevel modeling with logistic regression. In the naive Bayes setting, just replace the Bernoulli distribution with a discrete distribution and the beta distribution with a Dirichlet.
It’s also trivial to extend the semi-supervised case. Any number of variables may be “missing” in a directed graphical model without hampering inference (theoretically it’s simple; computationally it requires more bookkeeping and perhaps more samplers). As we mentioned earlier, this is especially helpful in the setting where we’re estimating hyperpriors (here the top-level vocabulary prior over per-category vocabulary priors over per-category/per-domain vocabulary distributions).
This will work. The hieararchical modeling strategy pretty much always works. The reason is that with the right hyper-hyper priors, it’ll reduce to the original model. The real question is will it help?
We should see big gains in small count data.
I haven’t done any. If anyone else has evaluated hierarchical naive Bayes, please say so in the comments or e-mail me and I’ll write a link into the post itself.
A good data set would be Dredze et al.’s multi-domain sentiment data set. I’d at least like to see the results of this relatively simple and standard naive Bayes model and the one I’ll describe in the next post before delving into “deep” models a la (Glorot et al. 2011).
Wow, that was all only four lines of notes in my notebook. I wasn’t expecting this to run so long! The next post has six lines of notes, so watch out.