Check out the Stanford NLP implementation page, which also points to a tutorial:

http://nlp.stanford.edu/software/classifier.shtml

Like most applications in natural language processing, they allow different numbers of features per category.

The priors work exactly the same way, with one per dimension.

Although it doesn’t allow features to vary by category, the BMR package and its related papers show some uses of priors for natural language that vary by dimension:

http://www.bayesianregression.org/

For much earlier refs with applications in economics, check out my blog post:

Cheers.

I can do better than that and give you a reference with theoretical and practical evaluations! But first, the answer is simple: the Laplace prior is differentiable everywhere but at 0. If a feature’s coefficient is already 0, there’s no need to regularize it toward 0, so the non-differentiability isn’t a problem in practice. The trick is that if the regularization gradient would move a coefficient past zero (that is, change its sign), you truncate the move at zero.
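To make the truncation concrete, here's a minimal sketch in Python of just the regularization part of one update. This is not LingPipe's actual code; the function name `clipped_l1_step` and the parameters `learning_rate` and `l1_strength` are made up for illustration, and the likelihood gradient step is assumed to happen separately.

```python
def clipped_l1_step(weights, learning_rate, l1_strength):
    """Shrink each weight toward zero, truncating at zero on a sign change.

    Hypothetical helper for illustration only: `learning_rate` and
    `l1_strength` are assumed names, not parameters from any library.
    """
    shrink = learning_rate * l1_strength
    updated = []
    for w in weights:
        if w > 0.0:
            # Move down toward zero, but never past it.
            updated.append(max(0.0, w - shrink))
        elif w < 0.0:
            # Move up toward zero, but never past it.
            updated.append(min(0.0, w + shrink))
        else:
            # Already at zero: nothing to regularize.
            updated.append(0.0)
    return updated


if __name__ == "__main__":
    print(clipped_l1_step([0.5, -0.5, 0.05, -0.05, 0.0], 0.1, 1.0))
```

Weights smaller in magnitude than the shrink amount land exactly on zero rather than flipping sign, which is how the prior ends up producing sparse coefficient vectors.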

It turns out that I rediscovered the “truncated stochastic gradient” method of Langford, Li and Zhang, for which there’s a NIPS paper and an arXiv paper:

NIPS2008_0280.pdf

http://arxiv.org/abs/0806.4686

I thought I was just implementing the industry-standard stochastic gradient for Laplace priors. In fact, I thought I was borrowing the idea from Genkin, Lewis and Madigan. They told me they were truncating Laplaces for other reasons, and that the truncated gradient idea had surfaced even earlier, in Zhang and Oles’ IR Journal paper:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.5553

I called it “clipped regularization” in the white paper (section 10.6).

Could you give a few words on how the Laplace prior works in LingPipe? You mention you use gradient descent, but the Laplace prior isn’t differentiable at every point. How did you work it out?

Thanks!
