Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.
Hierarchical Logistic Regression
Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem, specifically using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.
Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.
We’ll use the same data representation as in the last post, with $D$ being the number of domains and $I_d$ docs in domain $d$. Document $i$ in domain $d$ will have $N_{d,i}$ tokens. We’ll assume $V$ is the size of the vocabulary.
Raw document data provides a token $x_{d,i,n} \in \{1, \ldots, V\}$ for word $n$ of document $i$ of domain $d$. Assuming a bag-of-words representation like we used in naive Bayes, it’ll be more convenient to convert each document into a word frequency vector $u_{d,i}$ of dimensionality $V$. To match naive Bayes, define the word frequency vector $u_{d,i}$ with value at word $v$ given by
$$u_{d,i,v} = \sum_{n=1}^{N_{d,i}} \mathrm{I}(x_{d,i,n} = v),$$
where $\mathrm{I}(\phi)$ is the indicator function, taking on value 1 if $\phi$ is true and 0 otherwise. Now that we have numeric data rather than raw tokens, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor $u_{d,i,0} = 1$ to each vector of predictors $u_{d,i}$. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.
The labeled training data consists of a classification $z_{d,i} \in \{0, 1\}$ for each document $i$ in domain $d$.
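As a rough illustration of this data representation, here is a minimal Python sketch that turns one tokenized document into a count vector with the intercept predictor in position 0. The function name `count_vector` and the encoding of words as integer ids 1 through the vocabulary size are assumptions made for the example, not anything from the last post.

```python
from collections import Counter


def count_vector(tokens, vocab_size):
    """Return [1, c_1, ..., c_V]: the intercept predictor, then word counts.

    Assumes (hypothetically) that tokens are integer word ids in 1..vocab_size.
    """
    counts = Counter(tokens)
    # Position 0 is the constant intercept predictor; position v is the
    # number of times word v occurs in the document.
    return [1] + [counts.get(v, 0) for v in range(1, vocab_size + 1)]


# Toy document with five tokens over a four-word vocabulary.
doc = [2, 4, 2, 1, 2]
u = count_vector(doc, vocab_size=4)
print(u)  # [1, 1, 3, 0, 1]
```

The counts in positions 1 and up sum to the document length, which is a handy sanity check on real data.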
The Model, Lower Level
The main parameter to estimate is a coefficient vector $\beta_d$ of size $V + 1$ for each domain $d$. Here, the first dimension $\beta_{d,0}$ will correspond to the intercept and the other dimensions $\beta_{d,v}$ each correspond to a word $v$ in the vocabulary.
The probability that a given document is of category 1 (rather than category 0) is given by
$$\Pr[z_{d,i} = 1] = \mathrm{logit}^{-1}(\beta_d^{\top} u_{d,i}).$$
The inner (dot) product is just the sum of the dimensionwise products,
$$\beta_d^{\top} u_{d,i} = \sum_{v=0}^{V} \beta_{d,v} \, u_{d,i,v}.$$
This value, called the linear prediction, can take values in $(-\infty, \infty)$. The logistic sigmoid function
$$\mathrm{logit}^{-1}(x) = \frac{1}{1 + \exp(-x)}$$
maps the unbounded linear prediction to the range $(0, 1)$, where it is taken to represent the probability that the category $z_{d,i}$ of document $i$ in domain $d$ is 1.
In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,
$$z_{d,i} \sim \mathrm{Bern}(\mathrm{logit}^{-1}(\beta_d^{\top} u_{d,i})).$$
Unlike naive Bayes, the logistic regression model does not model the word data $u$, instead treating it as constant.
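To make the lower level concrete, here is a small Python sketch of the linear prediction, the logistic sigmoid, and the Bernoulli log likelihood for a single document. The function names and the toy coefficient and predictor values are made up for illustration.

```python
import math


def inv_logit(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear_prediction(beta, u):
    """Inner product of a coefficient vector and a predictor vector."""
    return sum(b * x for b, x in zip(beta, u))


def prob_category_1(beta, u):
    """Probability that a document with predictors u is of category 1."""
    return inv_logit(linear_prediction(beta, u))


def bernoulli_log_lik(z, p):
    """Log probability of label z in {0, 1} given Pr[z = 1] = p."""
    return math.log(p) if z == 1 else math.log(1.0 - p)


# Hypothetical values: intercept coefficient first, then three words.
beta_d = [-1.0, 0.5, 0.0, 2.0]
u_di = [1, 2, 3, 0]  # intercept predictor 1, then word counts

eta = linear_prediction(beta_d, u_di)  # -1 + 1 + 0 + 0 = 0
p = prob_category_1(beta_d, u_di)      # inv_logit(0) = 0.5
print(eta, p, bernoulli_log_lik(1, p))
```

Note that the word with coefficient 0 contributes nothing to the prediction no matter how often it occurs.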
The Model, Upper Levels
We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want here. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.
Specifically, we are going to pool the prior across domains by sharing the priors for words across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale covariance to NLP-sized coefficient vectors), we draw the component for word $v$ of the coefficients for domain $d$ from a normal distribution,
$$\beta_{d,v} \sim \mathrm{Norm}(\mu_v, \sigma_v^2).$$
Here $\mu_v$ represents the mean value of the coefficient for word (or intercept) $v$ across domains and $\sigma_v^2$ is the variance of the coefficient values across domains. A strong prior will have low variance.
All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically
$$\mu_v \sim \mathrm{Norm}(0, \tau^2),$$
where $\tau^2$ is a constant variance parameter. Typically, the variance is set to a constant to make the entire prior on the means weakly informative. Alternatively, we could put a prior on $\tau$ itself.
Next, we need priors for the $\sigma_v$ terms. Commonly, these are chosen to be inverse $\chi^2$ (a specific kind of inverse gamma distribution) distributions for convenience because they’re (conditionally) conjugate. Thus the typical model would take the variance $\sigma_v^2$ to be the parameter, giving it an inverse gamma prior,
$$\sigma_v^2 \sim \mathrm{InvGamma}(a, b),$$
with constant shape $a$ and scale $b$.
Gelman argues against inverse gamma priors on variance because they have pathological behavior when the hierarchical variance terms are near zero.
Gelman prefers the half-Cauchy prior, because it tends not to degenerate in hierarchical models like the inverse gamma can. The half-Cauchy is just the Cauchy distribution restricted to non-negative values (thus doubling the usual density on that support). In symbols, we generate the deviation $\sigma_v$ (not the variance) using a (non-negative) half-Cauchy distribution with a constant scale $s$:
$$\sigma_v \sim \mathrm{HalfCauchy}(s).$$
And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.
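To show how the pieces fit together, here is a hedged Python sketch of the log joint density of the labels and parameters under normal coefficient priors, centered normal priors on the locations, and half-Cauchy priors on the deviations. This is the quantity a sampler or optimizer would work with; the function names, the nested-list data layout, and the default hyperparameter values `tau=2.0` and `cauchy_scale=1.0` are all assumptions for illustration, not recommendations.

```python
import math


def log_normal(x, mean, sd):
    """Log density of Norm(mean, sd^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)


def log_half_cauchy(x, scale):
    """Log density of a half-Cauchy with the given scale at x >= 0."""
    return (math.log(2.0) - math.log(math.pi * scale)
            - math.log1p((x / scale) ** 2))


def log_bernoulli_logit(z, eta):
    """log Bern(z | inv_logit(eta)), computed stably as z*eta - log(1 + e^eta)."""
    softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return z * eta - softplus


def log_joint(beta, mu, sigma, u, z, tau=2.0, cauchy_scale=1.0):
    """Log joint density of labels z and parameters (beta, mu, sigma).

    beta[d][v]: coefficient for domain d, dimension v (0 is the intercept)
    mu[v], sigma[v]: prior location and deviation shared across domains
    u[d][i]: predictor vector for document i of domain d
    z[d][i]: label in {0, 1} for document i of domain d
    """
    if any(s <= 0.0 for s in sigma):
        return -math.inf  # deviations must be strictly positive
    lp = 0.0
    for v in range(len(mu)):
        lp += log_normal(mu[v], 0.0, tau)               # prior on location
        lp += log_half_cauchy(sigma[v], cauchy_scale)   # prior on deviation
        for d in range(len(beta)):
            lp += log_normal(beta[d][v], mu[v], sigma[v])  # pooled prior
    for d in range(len(beta)):
        for u_di, z_di in zip(u[d], z[d]):
            eta = sum(b * x for b, x in zip(beta[d], u_di))
            lp += log_bernoulli_logit(z_di, eta)        # likelihood
    return lp


# Toy problem: two domains, intercept plus two words, two docs per domain.
u = [[[1, 2, 0], [1, 0, 3]], [[1, 1, 1], [1, 0, 2]]]
z = [[1, 0], [1, 0]]
beta = [[0.0, 0.8, -0.5], [0.1, 0.6, -0.7]]
print(log_joint(beta, mu=[0.0, 0.7, -0.6], sigma=[1.0, 0.5, 0.5], u=u, z=z))
```

The only place the domains interact is through the shared `mu` and `sigma`, which is exactly where the pooling comes from.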
Properties of the Hierarchical Logistic Regression Model
Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.
One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector $\beta_d$ for domain $d$ only needs to model words that discriminate positive (category 1) from negative (category 0) examples. Thus the correct value for $\beta_{d,v}$ for a word $v$ is 0 if the word does not discriminate between positive and negative documents. That is why our hierarchical hyperprior has location 0: it pulls the prior location $\mu_v$ for each word toward 0.
The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).
Comparison to SAGE
If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,
$$\beta_{d,v} \sim \mathrm{Laplace}(\mu_v, \sigma_v),$$
then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).
Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set
$$\beta_{d,v} = \mu_v + \alpha_{d,v}, \qquad \alpha_{d,v} \sim \mathrm{Laplace}(0, \sigma_v),$$
you’ll see that if the scale term $\sigma_v$ is low enough, many of the $\alpha_{d,v}$ values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in $\alpha_{d,v}$, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than we have evidence for. Of course, researchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.
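One standard way to see why an L1 penalty produces exact zeros in a point estimate is the soft-thresholding operator used in proximal gradient methods. The Python sketch below applies it to some made-up per-domain offsets. It assumes a unit-curvature quadratic loss around each unpenalized offset, so the threshold is the reciprocal of the Laplace scale. It is meant as an illustration of the sparsity effect only, not as a description of how SAGE or LingPipe actually fit their models.

```python
def soft_threshold(a, lam):
    """Minimizer of 0.5 * (alpha - a)**2 + lam * abs(alpha) over alpha."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


# Hypothetical unpenalized offsets of one domain from the shared means.
raw_offsets = [0.05, -0.3, 1.2, -0.02, 0.6]
mu = [0.0, 0.4, -0.2, 0.0, 1.0]  # hypothetical shared locations

for scale in (2.0, 0.5):
    lam = 1.0 / scale  # smaller Laplace scale means a stronger penalty
    alpha = [soft_threshold(a, lam) for a in raw_offsets]
    beta = [m + a for m, a in zip(mu, alpha)]  # domain coefficients
    n_zero = sum(1 for a in alpha if a == 0.0)
    print(scale, n_zero, beta)

weak = [soft_threshold(a, 1.0 / 2.0) for a in raw_offsets]
strong = [soft_threshold(a, 1.0 / 0.5) for a in raw_offsets]
```

With the larger scale, three of the five offsets are zeroed out; with the smaller scale, all five are, so the domain’s coefficients collapse to the shared means.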
Presumably, the values of the coefficients will have non-negligible covariance. That is, I expect words like “thrilling” and “scary” to covary. For thriller movies, they’re positive sentiment terms and for appliances, rather negative. The problem with modeling covariance among lexical items is that we usually have tens of thousands of them. Not much software can handle a 10K by 10K matrix.
Another way to add covariance, rather than using a covariance model for “fixed effects,” is to convert to random effects. For instance, a model like latent Dirichlet allocation (LDA) models covariance of words by grouping them into topics: two words with high probabilities in the same topic will have positive covariance.
I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis. Together, they’ll give you a much deeper understanding of the issues in hierarchical (or more generally, multilevel) modeling.