Instead of drawing your coefficient β for a feature from a hierarchical prior with mean α, draw it from a prior with mean α1 + α2, where α1 is the effect for level 1 (e.g., Twitter) and α2 the effect for level 2 (e.g., baseball). With enough data (e.g., Twitter posts about baseball), you can even estimate the covariance of the priors.
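Here's a minimal numpy sketch of that additive crossed prior; the level names and effect sizes are made up for illustration, and in practice α1, α2, and the prior scale would themselves be estimated from data rather than fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical crossed level effects: alpha1 for the medium, alpha2 for the topic.
alpha1 = {"twitter": 0.5, "blog": -0.2}       # assumed, not estimated
alpha2 = {"baseball": 0.3, "politics": -0.4}  # assumed, not estimated
tau = 0.1  # prior standard deviation around the combined mean

def draw_beta(medium, topic, n=1):
    """Draw coefficient(s) beta ~ Normal(alpha1[medium] + alpha2[topic], tau)."""
    return rng.normal(alpha1[medium] + alpha2[topic], tau, size=n)

# A coefficient for a Twitter-baseball feature is centered at 0.5 + 0.3 = 0.8.
betas = draw_beta("twitter", "baseball", n=10_000)
```

The point of the additive mean is that a feature's prior borrows strength from both its medium and its topic, rather than from a single parent in a tree.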

You can also add arbitrary additional hierarchical structure.

Check out section 13.5 of Gelman and Hill’s regression book for more info.

You say that “there’s no reason these models need to remain hierarchical”. I’m wondering how this can be done. In their model, the prior is centered around top-level parameters. What would the prior look like if there were many “parent” parameters?

McCallum, Rosenfeld, Mitchell, and Ng. 1998. Improving text classification by shrinkage in a hierarchy of classes. In *ICML*.

It actually uses interpolation rather than a more standard hierarchical shrinkage model such as the one Finkel and Manning used. It’s always interesting to see how the same ideas keep coming back in different forms.

Given their naive Bayes basis, they could’ve easily put this in a hierarchical Bayesian setting by using Dirichlet priors. Then they’d have prior concentrations to estimate instead of interpolation parameters.
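To see why Dirichlet priors subsume the interpolation, here is a sketch (with made-up counts and probabilities) of the standard identity: the posterior mean of a child class's word distribution under a Dirichlet prior centered on the parent distribution, with concentration c, equals a linear interpolation between the child's MLE and the parent, with interpolation weight n / (n + c):

```python
import numpy as np

def posterior_mean(counts, parent_probs, c):
    """Posterior mean of a multinomial under a Dirichlet(c * parent_probs) prior."""
    counts = np.asarray(counts, dtype=float)
    return (counts + c * parent_probs) / (counts.sum() + c)

parent = np.array([0.5, 0.3, 0.2])  # parent class's word distribution (assumed)
counts = np.array([8.0, 1.0, 1.0])  # child class's observed word counts (assumed)
c = 10.0                            # prior concentration, the thing to estimate

post = posterior_mean(counts, parent, c)

# The same estimate written as interpolation between child MLE and parent:
lam = counts.sum() / (counts.sum() + c)
interp = lam * (counts / counts.sum()) + (1 - lam) * parent
```

So estimating the concentration c plays exactly the role of estimating the interpolation weight, but it comes with a proper generative story.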

Oops, should have said: great post! Waiting on the next post!