If, like me, you learned Bayesian regression from Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models, you’re going to love:
- Finkel, Jenny Rose and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In EMNLP.
In a Nutshell
Finkel and Manning apply the standard hierarchical model for the parameters of a generalized linear model to structured natural language problems. Although they call it “domain adaptation”, it’s really just a standard hierarchical model. In particular, they focus on CRFs (a kind of structured logistic regression), applying them to named entity recognition and dependency parsing. Their empirical results showed a modest gain from partial pooling over complete pooling or no pooling.
The Standard Model
In the standard non-hierarchical Bayesian regression model (including generalized linear models like logistic regression), each regression coefficient has a prior, usually in the form of a Gaussian (normal, L2, ridge) or Laplace (L1, lasso) distribution. The maximum likelihood solution corresponds to an improper prior with infinite variance.
In most natural language applications, the prior is assumed to have zero mean and a diagonal covariance matrix with shared variance, so that each coefficient has an independent normal prior with zero mean and the same variance. LingPipe allows arbitrary means and diagonal covariance matrices, as does Genkin, Madigan and Lewis’s package Bayesian Logistic Regression.
With data of the same type (e.g. person, location and organization named-entity data in English) from multiple sources (e.g. CoNLL, MUC6, MUC7), there are two standard approaches. First, train individual models, one for each source, then apply the models in-source. That’s the unpooled result, where the data in each domain isn’t used to help train other domains.
The second approach is to completely pool the data into one big training set and then apply the resulting model to each domain.
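The two baselines are easy to sketch. Here’s a minimal numpy illustration, assuming a plain MAP logistic regression with a zero-mean normal prior fit by gradient descent; the function name `fit_logreg` and the toy two-domain data are made up for the example, not anything from the paper.

```python
import numpy as np

def fit_logreg(X, y, var=1.0, lr=0.1, steps=2000):
    """MAP logistic regression with a Normal(0, var) prior on each
    coefficient (an L2 penalty), fit by plain gradient descent."""
    n, f = X.shape
    beta = np.zeros(f)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        grad = X.T @ (p - y) + beta / var     # log-loss gradient + prior term
        beta -= lr * grad / n
    return beta

# Toy data: two "domains" with similar but not identical true coefficients.
rng = np.random.default_rng(0)
domains = []
for b in (np.array([2.0, -1.0]), np.array([1.5, -0.5])):
    X = rng.normal(size=(200, 2))
    y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ b))).astype(float)
    domains.append((X, y))

# No pooling: one independent model per domain.
unpooled = [fit_logreg(X, y) for X, y in domains]

# Complete pooling: concatenate all domains into one training set.
X_all = np.vstack([X for X, _ in domains])
y_all = np.concatenate([y for _, y in domains])
pooled = fit_logreg(X_all, y_all)
```

The pooled model averages over the domains’ differences; the unpooled models ignore each other entirely.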
The Hierarchical Model
The hierarchical model generalizes these two approaches and allows a middle ground with partial pooling across domains. A hierarchical model fits separate coefficients for each domain, drawn from a per-feature prior that is shared across domains. That is, we might have a feature f and coefficients β[1,f], β[2,f] and β[3,f] for three different domains. We assume the β[i,f] are drawn from a shared prior (normal in Finkel and Manning’s case), say Normal(μ[f], σ[f]²).
The prior mean μ[f] and variance σ[f]² for feature f can now be fit in the model along with the per-domain coefficients.
In BUGS-like sampling notation, for a simple classification problem:
DATA
  F : number of features
  D : number of domains
  I[d] : number of items in domain d, d in 1:D
  x[d,i] : F-dimensional vector being classified in domain d, d in 1:D, i in 1:I[d]
  c[d,i] : discrete category of x[d,i]
  σ[f]² : variance of feature f, f in 1:F
  ν : prior mean for feature means
  τ² : prior variance for feature means

PARAMETERS
  μ[f] : hierarchical mean for feature f, f in 1:F
  β[d,f] : coefficient for feature f in domain d, f in 1:F, d in 1:D

MODEL
  β[d,f] ~ Norm(μ[f], σ[f]²), d in 1:D, f in 1:F
  μ[f] ~ Norm(ν, τ²), f in 1:F
  c[d,i] ~ Bern(logit⁻¹(β[d]ᵀ x[d,i])), d in 1:D, i in 1:I[d]
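The joint MAP estimate for this model can be sketched directly in numpy. This is my own illustration, not the paper’s implementation: the function name `hier_map` and the toy data are invented, and I use plain gradient descent on the negative log posterior (the paper uses a standard convex optimizer) to keep the sketch dependency-free.

```python
import numpy as np

def hier_map(domains, sigma2=1.0, nu=0.0, tau2=1.0, lr=0.1, steps=3000):
    """Joint MAP estimate for the hierarchical model:
         beta[d,f] ~ Norm(mu[f], sigma2),  mu[f] ~ Norm(nu, tau2),
         c[d,i]    ~ Bern(logit^-1(beta[d] . x[d,i])).
    Optimizes the betas and mu together by gradient descent on the
    negative log posterior, which is jointly convex in (beta, mu)."""
    F = domains[0][0].shape[1]
    D = len(domains)
    betas = np.zeros((D, F))
    mu = np.zeros(F)
    for _ in range(steps):
        # Gradient of the prior terms with respect to mu.
        grad_mu = (mu - nu) / tau2 + (mu[None, :] - betas).sum(axis=0) / sigma2
        for d, (X, y) in enumerate(domains):
            p = 1.0 / (1.0 + np.exp(-X @ betas[d]))
            # Log-loss gradient plus pull toward the shared mean mu.
            grad_b = X.T @ (p - y) + (betas[d] - mu) / sigma2
            betas[d] -= lr * grad_b / len(y)
        mu -= lr * grad_mu / D
    return betas, mu

# Toy data: two domains whose true coefficients are related but unequal.
rng = np.random.default_rng(1)
domains = []
for b in ([1.5, -1.0], [1.0, -0.5]):
    X = rng.normal(size=(150, 2))
    y = (rng.random(150) < 1.0 / (1.0 + np.exp(-X @ np.array(b)))).astype(float)
    domains.append((X, y))

betas, mu = hier_map(domains, sigma2=0.5, tau2=10.0)
```

Each domain’s coefficients are shrunk toward the shared mean μ, which is itself estimated from the data: exactly the partial-pooling middle ground.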
Finkel and Manning fix the prior variances σ[f]² to a single shared value σ² as a hyperparameter, which makes sense because they don’t really have enough data points to fit them meaningfully. This is too bad, because the variance is often the most useful part of setting hierarchical priors (see, e.g., Genkin et al.’s papers). I suspect the prior variance is going to be a very sensitive hyperparameter in this model.
Of course, μ[f] itself requires a prior, which Finkel and Manning fix as a hyperparameter to have mean ν = 0 and variance τ². I suspect τ will also be a pretty sensitive parameter in this model.
The completely pooled result arises as σ² approaches zero, so that the β[d,f] are all equal to μ[f].
The completely unpooled result arises when ν = 0 and τ² approaches zero, so that μ[f] = 0 for all f, and each domain’s coefficients get the usual independent prior β[d,f] ~ Normal(0, σ[f]²).
Relation to Daumé (2007)
Turns out Hal Daumé III did the same thing in his 2007 ACL paper, Frustratingly easy domain adaptation, only with a different presentation and slightly different parameter tying. I read Hal’s paper and didn’t see the connection; Finkel and Manning thank David Vickrey for pointing out the relation.
Of course, as Finkel and Manning mention, it’s easy to add more hierarchical structure. For instance, with several genres, such as newspapers, television, e-mail, blogs, etc., there could be an additional level where the genre priors were drawn from fixed high level priors (thus pooling across genres), and then within-genre coefficients can be drawn from those.
More Predictors (Random Effects)
There’s no reason these models need to remain strictly hierarchical. They can be extended to general multilevel models by adding predictors at levels above the individual instances, or by crossing groupings, for example pooling by year and by genre at the same time. Check out Gelman and Hill’s book for lots of examples.
The really cool part is how easy the estimation is. The joint loss function is still convex, so all our favorite optimizers work just fine. The hierarchical prior terms are also easy to differentiate: normals are in the exponential family and the domains are assumed exchangeable, so the big product of exponentials becomes a simple sum in log space, where the gradient is computed (see the paper for the formula).
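Concretely, under the model above, the joint log posterior (up to an additive constant) decomposes into a sum of per-domain log likelihoods and per-feature Gaussian penalty terms:

```latex
\log p(\beta, \mu \mid c, x)
  = \sum_{d=1}^{D} \sum_{i=1}^{I_d}
      \log \mathrm{Bern}\!\left(c_{d,i} \,\middle|\, \mathrm{logit}^{-1}(\beta_d^{\top} x_{d,i})\right)
  \;-\; \sum_{d=1}^{D} \sum_{f=1}^{F} \frac{(\beta_{d,f} - \mu_f)^2}{2\sigma_f^2}
  \;-\; \sum_{f=1}^{F} \frac{(\mu_f - \nu)^2}{2\tau^2}
  \;+\; \mathrm{const},
```

so, for instance, the gradient with respect to a shared mean is just

```latex
\frac{\partial}{\partial \mu_f} \log p(\beta, \mu \mid c, x)
  = \sum_{d=1}^{D} \frac{\beta_{d,f} - \mu_f}{\sigma_f^2}
  \;-\; \frac{\mu_f - \nu}{\tau^2}.
```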