## Upgrading from Beta-Binomial to Logistic Regression

### Bernoulli Model

Consider the following very simple model of drawing the components of a binary random N-vector y i.i.d. from a Bernoulli distribution with chance of success theta.

data {
int N;  // number of items
int y[N];  // binary outcome for item i
}
parameters {
real theta;  // Prob(y[n]=1) = theta
}
model {
theta ~ beta(2,2); // (very) weakly informative prior
for (n in 1:N)
y[n] ~ bernoulli(theta);
}


The beta distribution is used as a prior on theta. This is the Bayesian equivalent to an “add-one” prior. This is the same model Laplace used in the first full Bayesian analysis (or as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model male-vs.-female birth ratio in France, with N being the number of samples and y[n] = 1 if the baby was male and 0 if female.

### Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

model {
theta ~ beta(2,2);
sum(y) ~ binomial(N,theta);
}


But it doesn’t generalize so well. What we want to do is let the prediction of theta vary by predictors (“features” in NLP speak, covariates to some statisticians) of the items n.

### Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

data {
int N;
int y[N];
}
parameters {
real alpha;  // inv_logit(alpha) = Prob(y[n]=1)
}
model {
alpha ~ normal(0,5);  // weakly informative
for (n in 1:N)
y[n] ~ bernoulli(inv_logit(alpha));
}


Recall that the logistic sigmoid (inverse of the logit, or log odds function) maps

$\mbox{logit}^{-1}:(-\infty,\infty)\rightarrow(0,1)$

by taking

$\mbox{logit}^{-1}(u) = 1 / (1 + \mbox{exp}(-u))$.

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

data {
int N;  // number of items
int y[N];  // binary outcome for item n
real x[N];  // predictive feature for item n
}
parameters {
real alpha;  // intercept
real beta;  // slope
}
model {
alpha ~ normal(0,5);  // weakly informative
for (n in 1:N)
y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}


### Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see: