How to Prevent Overflow and Underflow in Logistic Regression

Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

Logistic Regression Model

The basic model of a binary outcome $y_n \in \{ 0, 1\}$ with predictor or feature (row) vector $x_n \in \mathbb{R}^K$ and coefficient (column) vector $\beta \in \mathbb{R}^K$ is

$y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1$, and

$\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0$.

(Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE 754) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-745, 710)$; below that range $\exp(\alpha)$ underflows to 0, and above it overflows to positive infinity.
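A quick check in Python shows where the limits bite (the exact cutoffs can shift by a hair across platforms):

```python
import math

# exp() overflows just above 709.78 because the largest double is
# about 1.8e308; it underflows to zero just below -745.1, where the
# result drops under the smallest positive subnormal, about 4.9e-324.
print(math.exp(709.0))     # ~8.2e307, still representable
print(math.exp(-745.0))    # ~5e-324, a subnormal: almost gone
print(math.exp(-746.0))    # 0.0: underflow
try:
    math.exp(710.0)
except OverflowError:
    print("exp(710) overflows")
```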

Potential Underflow and Overflow

The linear predictor at the heart of the regression,

$x_n \beta = \sum_{k = 1}^K x_{n,k} \beta_k$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector $\beta$ to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.
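To see the failure mode concretely, here is a sketch of the naive inverse logit in Python (the function name is mine, not LingPipe's API):

```python
import math

def inv_logit(alpha):
    # naive inverse logit: saturates for even moderately large alpha
    return 1.0 / (1.0 + math.exp(-alpha))

theta = inv_logit(40.0)
print(theta)  # exactly 1.0: exp(-40) ~ 4e-18 is below one ulp of 1.0
# If the observed outcome is y_n = 0, its probability is 1 - theta = 0,
# and taking the log of that blows up:
try:
    math.log(1.0 - theta)
except ValueError:
    print("log(0): zero probability assigned to the observed outcome")
```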

Log Sum of Exponents to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y,x$,

$\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)$
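The log domain matters even before any single term misbehaves: a product of many moderate probabilities underflows, while the sum of their logs stays comfortably in range. A small illustrative sketch:

```python
import math

probs = [0.5] * 2000          # 2000 fair coin flips, each p = 0.5
prod = 1.0
for p in probs:
    prod *= p                  # 0.5**2000 ~ 1e-602: underflows to 0.0
print(prod)                    # 0.0

log_lik = sum(math.log(p) for p in probs)
print(log_lik)                 # -2000 * log(2) ~ -1386.29: no trouble
```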

Working on the inner log probability term, we have

$\log p(y_n|\beta;x_n)$

${ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))$

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0$

Recalling that

$1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha)$,

we further simplify to

${ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1$
${ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0$

Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

${ } = \log \ \mbox{logit}^{-1}(\alpha)$

${ } = \log (1 / (1 + \exp(-\alpha)))$

${ } = - \log (1 + \exp(-\alpha))$

${ } = - \mbox{logSumExp}(0,-\alpha)$
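In code, this gives a stable log inverse logit. The sketch below (my function name, not LingPipe's API) splits on the sign of $\alpha$ so the argument passed to exp is never positive:

```python
import math

def log_inv_logit(alpha):
    """Stable log(logit^{-1}(alpha)) = -logSumExp(0, -alpha)."""
    if alpha >= 0.0:
        # exp(-alpha) <= 1, so 1 + exp(-alpha) cannot overflow, and
        # log1p keeps full precision when exp(-alpha) is tiny
        return -math.log1p(math.exp(-alpha))
    # for negative alpha, pull the max term -alpha out front:
    # -logSumExp(0, -alpha) = alpha - log(1 + exp(alpha))
    return alpha - math.log1p(math.exp(alpha))

print(log_inv_logit(0.0))      # -log(2) ~ -0.6931
print(log_inv_logit(-800.0))   # -800.0, no overflow in sight
print(log_inv_logit(800.0))    # -0.0: log of a probability ~ 1
```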

Log Sum of Exponentials

Recall that the log sum of exponentials function is

$\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))$

If you’re not familiar with how it prevents underflow and overflow, check out my previous post on the log sum of exponentials trick.

In the logistic regression case, we have an even better opportunity for optimization, because the first argument $a$ is the constant zero.
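Concretely, the general two-argument function needs the max trick and two calls to exp, while the specialized version with a constant zero first argument reduces to a single exp plus a log1p (a sketch; the names are mine):

```python
import math

def log_sum_exp(a, b):
    # general case: subtract the max so neither exp can overflow
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log1p_exp(b):
    # specialization of log_sum_exp(0.0, b): one exp, one log1p
    if b > 0.0:
        return b + math.log1p(math.exp(-b))
    return math.log1p(math.exp(b))

print(log_sum_exp(0.0, 3.0))   # same value as log1p_exp(3.0)
print(log1p_exp(1000.0))       # 1000.0, no overflow
```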

Logit-transformed Bernoulli

Putting it all together, we have the log of the logit-transformed Bernoulli distribution,

$\log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))$

${ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1$
${ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0$
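Folding the two cases into one function gives a stable log pmf for the 0/1-coded outcome (a sketch; `bernoulli_logit_lpmf` is my name for it, not an existing API):

```python
import math

def bernoulli_logit_lpmf(y, alpha):
    """Stable log Bernoulli(y | logit^{-1}(alpha)) for y in {0, 1}."""
    # negate the linear predictor for the failure case
    z = alpha if y == 1 else -alpha
    # return -logSumExp(0, -z) = -log(1 + exp(-z)), computed stably
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

print(bernoulli_logit_lpmf(1, 0.0))      # -log(2)
print(bernoulli_logit_lpmf(0, 1000.0))   # -1000.0, not -inf
```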

We can just think of this as an alternatively parameterized Bernoulli distribution,

$\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))$

with which our model can be expressed as

$y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta)$.

Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as $-1$ and thus take the outcome $y \in \{ -1, 1 \}$, where we have

$\log \mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)$
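With the signed coding, the whole log pmf collapses to a single expression in $y\alpha$ (again a sketch with my own naming):

```python
import math

def bernoulli_logit_lpmf_signed(y, alpha):
    """Stable log BernoulliLogit(y | alpha) for y in {-1, +1}."""
    z = y * alpha   # the signed coding folds both cases into one product
    # -logSumExp(0, -z), computed without overflow or underflow
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

print(bernoulli_logit_lpmf_signed(-1, 1000.0))  # -1000.0 rather than -inf
```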
