## How to Prevent Overflow and Underflow in Logistic Regression

Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

### Logistic Regression Model

The basic model of a binary outcome $y_n \in \{ 0, 1\}$ with predictor or feature (row) vector $x_n \in \mathbb{R}^K$ and coefficient (column) vector $\beta \in \mathbb{R}^K$ is

$y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1$, and

$\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0$.
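To make the definition concrete, here is the inverse logit written directly from the formula above, as a minimal Java sketch (the method name is mine, not part of LingPipe’s API):

```java
// Logistic sigmoid (inverse logit), straight from the definition.
// This naive form is exactly what runs into floating-point trouble below.
static double inverseLogit(double alpha) {
    return 1.0 / (1.0 + Math.exp(-alpha));
}
```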

### (Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE 754) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-745, 710)$ before underflowing to 0 or overflowing to positive infinity.
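The cutoffs are easy to probe directly. This little program shows where Java’s Math.exp gives out:

```java
public class ExpDomain {
    public static void main(String[] args) {
        System.out.println(Math.exp(709.7));   // ~1.65e308, just under Double.MAX_VALUE
        System.out.println(Math.exp(710.0));   // Infinity: overflow
        System.out.println(Math.exp(-745.0));  // ~4.9e-324, a subnormal barely above zero
        System.out.println(Math.exp(-750.0));  // 0.0: underflow
    }
}
```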

### Potential Underflow and Overflow

The linear predictor at the heart of the regression,

$x_n \beta = \sum_{k = 1}^K x_{n,k} \beta_k$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector $\beta$ to zero. It can become a problem if we have even a moderately sized coefficient and then see a predictor that is very large in magnitude. The probability estimate then rounds to 1 (or underflows to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.
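Here’s the failure mode in miniature, using the naive inverseLogit sketch from above (the numbers are invented for illustration):

```java
double alpha = 50.0 * -20.0;          // linear predictor x_n * beta = -1000
double theta = inverseLogit(alpha);   // exp(1000) overflows, so theta underflows to 0.0
System.out.println(Math.log(theta)); // -Infinity: zero probability assigned if y_n = 1
```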

### Log Sum of Exponentials to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y,x$, which factors (assuming independent observations) as

$\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)$
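As a toy illustration of why we work on the log scale at the data-set level: multiplying many per-observation probabilities underflows, while summing their logs stays finite.

```java
double prob = 1.0, logProb = 0.0;
for (int n = 0; n < 2000; ++n) {
    prob *= 0.5;               // underflows to exactly 0.0 past ~1075 factors
    logProb += Math.log(0.5);  // stays finite
}
System.out.println(prob);      // 0.0
System.out.println(logProb);   // about -1386.29
```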

Working on the inner log probability term, we have

$\log p(y_n|\beta;x_n)$

${ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))$

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0$

Recalling that

$1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha)$,

we further simplify to

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0$

Now we’re in good shape if we can evaluate the log of the inverse logit without intermediate overflow or underflow. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

${ } = \log \ \mbox{logit}^{-1}(\alpha)$

${ } = \log (1 / (1 + \exp(-\alpha)))$

${ } = - \log (1 + \exp(-\alpha))$

${ } = - \mbox{logSumExp}(0,-\alpha)$
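For example, with $\alpha = -800$ the direct route fails: $\exp(800)$ overflows, $\mbox{logit}^{-1}(-800)$ evaluates to $0$, and its log is $-\infty$. The rearranged form has no such problem: $-\mbox{logSumExp}(0, 800) = -(800 + \log(1 + \exp(-800))) = -800$ to double-precision accuracy.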

### Log Sum of Exponentials

Recall that the log sum of exponentials function is

$\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))$

If you’re not familiar with how it prevents underflow and overflow, check out my previous post on the log sum of exponentials.

In the logistic regression case, we have an even greater opportunity for optimization because the first argument $a$ is the constant zero, so that $\mbox{logSumExp}(0,-\alpha) = \log(1 + \exp(-\alpha))$.
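Here is what both pieces look like in Java, as a sketch (method names are mine): the general logSumExp pulls the max out front so that the remaining exponents are nonpositive, and the specialized case with a zero first argument reduces to a stable log(1 + exp(x)):

```java
// General case: factor out the max so neither exponentiation can overflow.
static double logSumExp(double a, double b) {
    double m = Math.max(a, b);
    return m + Math.log(Math.exp(a - m) + Math.exp(b - m));
}

// Specialized case: logSumExp(0, x) = log(1 + exp(x)), computed stably with log1p.
static double log1pExp(double x) {
    return x > 0.0
        ? x + Math.log1p(Math.exp(-x))  // factor out x; remaining exponent <= 0
        : Math.log1p(Math.exp(x));      // exp(x) <= 1, so log1p is accurate
}
```

With this in hand, $\log \ \mbox{logit}^{-1}(\alpha)$ is just -log1pExp(-alpha).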

### Logit-transformed Bernoulli

Putting it all together, we have the log probability for the logit-transformed Bernoulli distribution,

$\log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))$

${ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1$
${ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0$

We can just think of this as an alternatively parameterized Bernoulli distribution,

$\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))$

with which our model can be expressed as

$y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta)$.
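A sketch of the corresponding log probability function in Java, building on the log1pExp method above (again, the names are mine):

```java
// log BernoulliLogit(y | alpha) for y in {0, 1}.
static double logBernoulliLogit(int y, double alpha) {
    return y == 1
        ? -log1pExp(-alpha)  // log inverse-logit(alpha)
        : -log1pExp(alpha);  // log inverse-logit(-alpha)
}
```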

### Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as $-1$ and thus take the outcome $y \in \{ -1, 1 \}$, where we have

$\log \mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)$
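In code, the recoding collapses the two cases into a single expression (same caveats as before):

```java
// log BernoulliLogit(y | alpha) for y in {-1, +1}.
static double logBernoulliLogit(double y, double alpha) {
    return -log1pExp(-y * alpha);
}
```

Summing this over all $n$ with $\alpha = x_n\beta$ yields the full log likelihood with no underflow or overflow anywhere along the way.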
