## Domain Adaptation with Hierarchical Logistic Regression

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

### Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem, specifically using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.

#### The Data

We’ll use the same data representation as in the last post, with $D$ being the number of domains and $I_d$ docs in domain $d$. Document $i$ in domain $d$ will have $N[d,i]$ tokens. We’ll assume $V$ is the size of the vocabulary.

Raw document data provides a token $x[d,i,n] \in 1{:}V$ for word $n$ of document $i$ of domain $d$. Assuming a bag-of-words representation like we used in naive Bayes, it’ll be more convenient to convert each document into a word frequency vector $u[d,i]$ of dimensionality $V$. To match naive Bayes, define its value at word $w$ by

$u[d,i,w] = \sum_{n = 1}^{N[d,i]} \mathbb{I}(x[d,i,n] = w)$,

where $\mathbb{I}(\phi)$ is the indicator function, taking on value 1 if $\phi$ is true and 0 otherwise. Now that we have continuous data, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor $1$ to each vector of predictors $u[d,i]$. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.
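As a rough illustration of the conversion just described, here is a minimal sketch in plain Python. The function name `count_vector` and the toy document are made up for this example; the only thing taken from the setup above is that tokens are word ids in $1{:}V$ and that index 0 holds the intercept.

```python
def count_vector(tokens, vocab_size):
    """Return [1, u_1, ..., u_V]: an intercept followed by word counts.

    tokens are word ids in 1..vocab_size, matching x[d,i,n] in the text.
    """
    u = [0] * (vocab_size + 1)
    u[0] = 1  # intercept predictor lives at index 0
    for w in tokens:
        if not 1 <= w <= vocab_size:
            raise ValueError(f"token id {w} outside 1..{vocab_size}")
        u[w] += 1  # add one count for each occurrence of word w
    return u


# Hypothetical toy document over a vocabulary of size V = 5
doc = [2, 5, 2, 1]
u = count_vector(doc, 5)
print(u)
```

Any reweighting such as TF/IDF would be applied to the entries after the intercept, leaving index 0 fixed at 1.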

The labeled training data consists of a classification $z[d,i] \in \{ 0, 1 \}$ for each document $i$ in domain $d$.

#### The Model, Lower Level

The main parameter to estimate is a coefficient vector $\beta[d]$ of size $V + 1$ for each domain $d$. Here, the first dimension will correspond to the intercept and the other dimensions each correspond to a word in the vocabulary.

The probability that a given document is of category 1 (rather than category 0) is given by

$\mbox{Pr}[z[d,i] = 1] = \mbox{logit}^{-1}(\beta[d]^T u[d,i])$.

The inner (dot) product is just the sum of the dimensionwise products,

$\beta[d]^T u[d,i] = \sum_{v = 0}^V \beta[d,v] \times u[d,i,v]$.

This value, called the linear prediction, can take values in $(-\infty,\infty)$. The logistic sigmoid function

$\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))$

maps the unbounded linear prediction to the range $(0,1)$, where it is taken to represent the probability that the category $z[d,i]$ of document $i$ in domain $d$ is 1.
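Here is a small sketch of the linear prediction and the inverse logit in plain Python. The function names are my own, and the branch on the sign of `x` is just one common way to avoid overflow in `exp`; it is not the only way to do it.

```python
import math


def inv_logit(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), arranged so exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_category_1(beta_d, u_di):
    """Pr[z = 1] for one document, given domain coefficients and predictors.

    Both arguments are sequences of length V + 1 with the intercept at index 0.
    """
    linear_prediction = sum(b * u for b, u in zip(beta_d, u_di))
    return inv_logit(linear_prediction)


# Hypothetical coefficients and predictors: intercept plus two words
print(prob_category_1([0.5, 1.0, -1.0], [1, 2, 2]))
```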

In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,

$z[d,i] \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\beta[d]^T u[d,i]))$.

Unlike naive Bayes, the logistic regression model does not model the word data $u$, instead treating it as constant.
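Because only the labels are modeled, the log likelihood for a domain is just a sum of Bernoulli log probabilities, one per document. The sketch below works on the log-odds scale, using the identity $\log \mbox{Pr}[z \mid \eta] = z \eta - \log(1 + \exp(\eta))$ for linear prediction $\eta$; the helper names are hypothetical.

```python
import math


def log1p_exp(x):
    """log(1 + exp(x)) computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bernoulli_logit_log_lik(z, eta):
    """Log probability of label z in {0, 1} given linear prediction eta."""
    return z * eta - log1p_exp(eta)


def domain_log_lik(beta_d, docs, labels):
    """Sum of label log probabilities for one domain.

    docs is a list of predictor vectors u[d,i] (intercept at index 0) and
    labels the matching list of z[d,i]. The predictors are treated as
    constants, so they contribute nothing to the likelihood themselves.
    """
    total = 0.0
    for u, z in zip(docs, labels):
        eta = sum(b * x for b, x in zip(beta_d, u))
        total += bernoulli_logit_log_lik(z, eta)
    return total
```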

#### The Model, Upper Levels

We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want here. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.

Specifically, we are going to pool the prior across domains by sharing the priors for words $v$ across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale covariance to NLP-sized coefficient vectors), we draw the component for word $v$ of the coefficients for domain $d$ from a normal distribution,

$\beta[d,v] \sim \mbox{\sf Normal}(\mu[v],\sigma[v]^2)$.

Here $\mu[v]$ represents the mean value of the coefficient for word (or intercept) $v$ across domains and $\sigma[v]^2$ is the variance of the coefficient values across domains. A strong prior will have low variance.

All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically

$\mu[v] \sim \mbox{\sf Normal}(0,\tau^2)$,

where $\tau^2$ is a constant variance parameter. Typically, the variance $\tau^2$ is set to a constant to make the entire prior on the means weakly informative. Alternatively, we could put a prior on $\tau$ itself.
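One way to get a feel for the pooling is to simulate coefficients forward from these two levels. The sketch below holds the per-word deviations fixed at made-up values rather than giving them a prior, and the function name, seed, and settings are all assumptions for illustration.

```python
import random


def sample_coefficients(num_domains, sigma, tau, seed=0):
    """Draw mu[v] ~ Normal(0, tau^2), then beta[d][v] ~ Normal(mu[v], sigma[v]^2).

    sigma is a list of per-word standard deviations (index 0 is the
    intercept), held fixed here instead of being estimated.
    """
    rng = random.Random(seed)
    mu = [rng.gauss(0.0, tau) for _ in sigma]
    beta = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
            for _ in range(num_domains)]
    return mu, beta


# Hypothetical settings: 3 domains, an intercept plus 4 words
mu, beta = sample_coefficients(num_domains=3,
                               sigma=[0.1, 0.1, 2.0, 2.0, 0.0],
                               tau=5.0)
for d, beta_d in enumerate(beta):
    print(d, [round(b, 2) for b in beta_d])
```

Words with a small deviation end up with nearly the same coefficient in every domain (complete pooling in the limit of zero, as for the last word here), while words with a large deviation are free to differ across domains.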

Next, we need priors for the $\sigma[v]$ terms. Commonly, these are chosen to be inverse $\chi^2$ (a specific kind of inverse gamma distribution) distributions for convenience because they’re (conditionally) conjugate. Thus the typical model would take the variance $\sigma[v]^2$ to be the parameter, giving it an inverse gamma prior,

$\sigma[v]^2 \sim \mbox{\sf InvGamma}(\alpha,\beta)$.

Gelman argues against inverse gamma priors on variance because they behave pathologically when the hierarchical variance terms are near zero. He prefers the half-Cauchy prior, which tends not to degenerate in hierarchical models the way the inverse gamma can. The half-Cauchy is just the Cauchy distribution restricted to non-negative values, which doubles the usual density on that support. In symbols, we generate the deviation $\sigma[v]$ (not the variance) using a half-Cauchy distribution:

$\sigma[v] \sim \mbox{HalfCauchy}()$.

And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.
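To make the pieces concrete, here is a sketch of the unnormalized log posterior that such software would work with, using the half-Cauchy option for the deviations. The values of `tau` and `cauchy_scale` are placeholders I picked for illustration, since the text above doesn’t fix them, and the function names are hypothetical.

```python
import math


def normal_lpdf(x, mean, sd):
    """Log density of Normal(mean, sd^2) at x; requires sd > 0."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)


def half_cauchy_lpdf(x, scale):
    """Log density of a Cauchy(0, scale) restricted to x >= 0."""
    if x < 0:
        return -math.inf
    return math.log(2.0 / (math.pi * scale)) - math.log1p((x / scale) ** 2)


def log1p_exp(x):
    """log(1 + exp(x)) computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_joint(beta, mu, sigma, docs, labels, tau=10.0, cauchy_scale=2.5):
    """Unnormalized log posterior of (beta, mu, sigma) given the labels.

    beta[d][v], mu[v], and sigma[v] follow the text, with sigma[v] > 0.
    docs[d][i] is the predictor vector u[d,i] (intercept at index 0) and
    labels[d][i] is z[d,i]. tau and cauchy_scale are assumed values.
    """
    lp = 0.0
    for m, s in zip(mu, sigma):
        lp += normal_lpdf(m, 0.0, tau)            # mu[v] ~ Normal(0, tau^2)
        lp += half_cauchy_lpdf(s, cauchy_scale)   # sigma[v] ~ HalfCauchy
    for d, beta_d in enumerate(beta):
        for b, m, s in zip(beta_d, mu, sigma):
            lp += normal_lpdf(b, m, s)            # beta[d,v] ~ Normal(mu[v], sigma[v]^2)
        for u, z in zip(docs[d], labels[d]):
            eta = sum(b * x for b, x in zip(beta_d, u))
            lp += z * eta - log1p_exp(eta)        # z[d,i] ~ Bern(inv_logit(eta))
    return lp
```

A sampler or optimizer would then explore this function over `beta`, `mu`, and `sigma`, with `sigma` constrained to be positive.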

### Properties of the Hierarchical Logistic Regression Model

Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.

One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector $\beta[d]$ for domain $d$ only needs to model words that discriminate positive (category 1) from negative (category 0) examples. Thus the correct value for $\beta[d,v]$ for a word $v$ is 0 if the word does not discriminate between positive and negative documents. Thus our hierarchical hyperprior has location 0 in order to pull the prior location $\mu[v]$ for each word $v$ to 0.

The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).

### Comparison to SAGE

If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,

$\beta[d,v] \sim \mbox{\sf Laplace}(\mu[v],\sigma[v]^2)$

then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).

Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set

$\beta[d,v] = \alpha[d,v] + \mu[v]$

and sample

$\alpha[d,v] \sim \mbox{\sf Laplace}(0,\sigma[v]^2)$

you’ll see that if the variance term $\sigma[v]^2$ is low enough, many of the estimated $\alpha[d,v]$ values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in $\alpha$, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than you have evidence for. Of course, researchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.
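To see why a Laplace prior yields exact zeros under point estimation, it helps to look at the simplest possible case: a single deviation with a quadratic loss standing in for the log likelihood. That stand-in is an assumption made purely for illustration (the logistic loss isn’t quadratic), and `lam` is a hypothetical penalty strength that grows as the prior scale shrinks. In this toy case the penalized estimate has a closed form, the soft-thresholding function.

```python
def soft_threshold(x, lam):
    """Minimizer over a of 0.5 * (a - x)**2 + lam * abs(a).

    x plays the role of the unpenalized estimate and lam >= 0 the strength
    of the L1 penalty; anything within lam of zero is set exactly to zero.
    """
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical unpenalized deviations for four words in one domain
unpenalized = [1.5, -0.2, 0.05, -2.0]
for lam in (0.1, 1.0):
    print(lam, [soft_threshold(x, lam) for x in unpenalized])
```

With the stronger penalty, the two small deviations are zeroed out and only the words that differ substantially from the cross-domain average keep a nonzero value. A normal (L2) prior would shrink the small deviations toward zero but not set them exactly to zero.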

Presumably, the values of coefficients will have non-negligible covariance. That is, I expect words like “thrilling” and “scary” to covary. For thriller movies, they’re positive sentiment terms and for appliances, rather negative. The problem with modeling covariance among lexical items is that we usually have tens of thousands of them. Not much software can handle a 10K by 10K matrix.

Another way to add covariance, instead of using a covariance model for “fixed effects”, is to convert to random effects. For instance, a model like latent Dirichlet allocation (LDA) models covariance of words by grouping them into topics, so that two words with high probabilities in the same topic will have positive covariance.

I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis. Together, they’ll give you a much deeper understanding of the issues in hierarchical (or more generally, multilevel) modeling.

### 3 Responses to “Domain Adaptation with Hierarchical Logistic Regression”

1. Dave Says:

“I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis.”

Interesting. I have bought both books for self-study, but I had assumed that BDA should come first, since hierarchical models and regression come in chapters 5 and 14 of BDA, and assumed the other book would expand on these topics in depth.

• Bob Carpenter Says:

Gelman and Hill’s regression book is written with much less math and computation than Gelman et al.’s BDA.

Gelman and a different et al. (including Daniel Lee, who’s sitting next to me right now) are writing a very introductory textbook on stats that they’ve been using for a very basic intro to stats.
