### Bernoulli Model

Consider the following very simple model, in which the components of a binary N-vector y are drawn i.i.d. from a Bernoulli distribution with chance of success theta.

```
data {
  int N;                        // number of items
  int y[N];                     // binary outcome for item n
}
parameters {
  real<lower=0,upper=1> theta;  // Prob(y[n] = 1) = theta
}
model {
  theta ~ beta(2,2);            // (very) weakly informative prior
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}
```

The beta distribution is used as a prior on theta. This is the Bayesian equivalent of an “add-one” prior. It’s the same model Laplace used in the first full Bayesian analysis (or, as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model the male-vs.-female birth ratio in France, with N being the number of births and y[n] = 1 if baby n was male and 0 if female.
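Because the beta prior is conjugate to the Bernoulli likelihood, the posterior for theta is available in closed form, which makes the “add-one” connection easy to see. A quick sketch in Python (the counts here are made up for illustration):

```python
# Conjugacy: beta(2,2) prior + k successes in N Bernoulli trials
# gives a beta(2 + k, 2 + N - k) posterior.
N, successes = 10, 7                       # hypothetical data

a_post = 2 + successes                     # posterior shape a
b_post = 2 + (N - successes)               # posterior shape b

post_mean = a_post / (a_post + b_post)             # (k + 2) / (N + 4)
post_mode = (a_post - 1) / (a_post + b_post - 2)   # (k + 1) / (N + 2)
```

The posterior mode is exactly the “add-one” smoothed estimate (k + 1) / (N + 2).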

### Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

```
model {
  theta ~ beta(2,2);
  sum(y) ~ binomial(N, theta);
}
```
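The inferences match because the binomial likelihood for sum(y) differs from the product of Bernoulli likelihoods only by the binomial coefficient, which is constant in theta. A small Python check (the data vector is made up):

```python
import math

def bernoulli_loglik(y, theta):
    # log of the product of independent Bernoulli(theta) terms
    return sum(math.log(theta) if yn else math.log(1 - theta) for yn in y)

def binomial_loglik(k, n, theta):
    # log binomial pmf for k successes in n trials
    return (math.log(math.comb(n, k))
            + k * math.log(theta) + (n - k) * math.log(1 - theta))

y = [1, 0, 1, 1, 0]                # hypothetical outcomes
# the gap between the two is log C(N, k), independent of theta
diff1 = binomial_loglik(sum(y), len(y), 0.3) - bernoulli_loglik(y, 0.3)
diff2 = binomial_loglik(sum(y), len(y), 0.8) - bernoulli_loglik(y, 0.8)
```

Since the two log densities differ by a constant, the posteriors for theta are identical.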

But it doesn’t generalize so well. What we want to do is let the prediction of theta vary by predictors (“features” in NLP speak, covariates to some statisticians) of the items n.

### Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

```
data {
  int N;
  int y[N];
}
parameters {
  real alpha;             // inv_logit(alpha) = Prob(y[n] = 1)
}
model {
  alpha ~ normal(0,5);    // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha));
}
```

Recall that the logistic sigmoid (the inverse of the logit, or log odds, function) maps the real line (-∞, ∞) to the probability scale (0, 1) by taking

inv_logit(alpha) = 1 / (1 + exp(-alpha)).
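For concreteness, the sigmoid and its inverse can be written out in a few lines of Python (mirroring Stan’s inv_logit):

```python
import math

def inv_logit(alpha):
    # logistic sigmoid: maps (-inf, inf) to (0, 1)
    return 1.0 / (1.0 + math.exp(-alpha))

def logit(theta):
    # log odds: maps (0, 1) back to (-inf, inf)
    return math.log(theta / (1 - theta))
```

An intercept of alpha = 0 corresponds to theta = 0.5, and the two functions invert each other.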

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.

### Adding Features

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

```
data {
  int N;                  // number of items
  int y[N];               // binary outcome for item n
  real x[N];              // predictive feature for item n
}
parameters {
  real alpha;             // intercept
  real beta;              // slope
}
model {
  alpha ~ normal(0,5);    // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}
```
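To make the generative story concrete, here is a sketch in Python (not Stan) that simulates data from this model and recovers alpha and beta by maximum likelihood with plain gradient ascent. The sample size, true parameter values, learning rate, and iteration count are all arbitrary choices for illustration, not part of the model above.

```python
import math
import random

def inv_logit(a):
    return 1.0 / (1.0 + math.exp(-a))

# simulate N items from the model (hypothetical true parameters)
random.seed(7)
N = 500
true_alpha, true_beta = 0.5, -1.0
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [1 if random.random() < inv_logit(true_alpha + true_beta * xn) else 0
     for xn in x]

# maximum-likelihood fit by gradient ascent on the mean log-likelihood
alpha, beta = 0.0, 0.0
lr = 1.0
for _ in range(1000):
    grad_a = grad_b = 0.0
    for xn, yn in zip(x, y):
        err = yn - inv_logit(alpha + beta * xn)  # residual on probability scale
        grad_a += err
        grad_b += err * xn
    alpha += lr * grad_a / N
    beta += lr * grad_b / N
```

With a few hundred items, the fitted alpha and beta land close to the values used to simulate the data; Stan would additionally give full posterior uncertainty rather than a point estimate.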

### Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see:

**Stan Home Page:** http://mc-stan.org/

Stan’s not a competitor for LingPipe, by the way. Stan scales well for full Bayesian inference, but doesn’t scale like LingPipe’s SGD-based point estimator for logistic regression. And Stan doesn’t do structured models like HMMs or CRFs and has no language-specific features like tokenizers built in. As I’ve said before, it’s horses for courses.

October 30, 2012 at 1:47 pm

If all predictors are categorical and there aren’t too many category combinations, it can be useful to go in the other direction too. Collapse the samples into counts and use a logistic regression to parameterize the means of beta-binomial distributions for each category combination. This can be advantageous computationally and also allows for overdispersion.
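One common way to set this up is the mean/precision parameterization of the beta-binomial: each cell gets a mean mu (e.g., from the logistic regression) and a shared precision phi, with shape parameters a = mu * phi and b = (1 - mu) * phi. A minimal Python sketch, with made-up values for mu and phi:

```python
import math

def beta_binomial_logpmf(k, n, mu, phi):
    # mean/precision parameterization: a = mu * phi, b = (1 - mu) * phi,
    # so the mean success probability is mu and phi controls overdispersion
    # (smaller phi means more dispersion than a plain binomial)
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.log(math.comb(n, k))
            + math.lgamma(k + a) + math.lgamma(n - k + b)
            - math.lgamma(n + a + b)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

# hypothetical cell: n = 5 items, mu = 0.3 (say, from inv_logit of the
# linear predictor for that category combination), precision phi = 4
probs = [math.exp(beta_binomial_logpmf(k, 5, 0.3, 4.0)) for k in range(6)]
```

The probabilities sum to one and the mean count is n * mu, as expected for this parameterization.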