How to Prevent Overflow and Underflow in Logistic Regression


Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

Logistic Regression Model

The basic model of a binary outcome y_n \in \{ 0, 1\} with predictor or feature (row) vector x_n \in \mathbb{R}^K and coefficient (column) vector \beta \in \mathbb{R}^K is

y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))

where the logistic sigmoid (i.e., the inverse logit function) is defined by

\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))

and where the Bernoulli distribution is defined over support y \in \{0, 1\} so that

\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1, and

\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0.

(Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a limited domain for \exp(\alpha): the result overflows to positive infinity once \alpha exceeds roughly 709.8 and underflows to 0 once \alpha drops below roughly -745.
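Here’s a quick way to see those limits for yourself. This is a minimal sketch in Python, chosen only for illustration (it is not LingPipe code). One caveat: Python’s math.exp raises OverflowError on overflow rather than returning positive infinity, whereas underflow quietly returns 0.0.

```python
import math

# Largest integer argument for which exp() is still a finite double.
largest_ok = math.exp(709.0)

# Far enough below zero that the result rounds to exactly 0.0.
underflowed = math.exp(-746.0)

# One step past the finite range. Python signals overflow with an
# exception instead of returning +inf.
try:
    math.exp(710.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(largest_ok, underflowed, overflowed)
```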

Potential Underflow and Overflow

The linear predictor at the heart of the regression,

x_n \beta = \sum_{k = 1}^K x_{n,k} \beta_k

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector \beta to zero. It could be a problem if we have even a moderately sized coefficient and then see a predictor with very large magnitude (positive or negative). Our probability estimate will round up to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.
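To make the failure concrete, here’s a minimal sketch of the naive computation in Python. The function names inv_logit_naive and log_lik_naive are made up for this illustration. A linear predictor of 40 is already enough to make the estimate exactly 1.0, because \exp(-40) is smaller than the spacing between doubles near 1.

```python
import math

def inv_logit_naive(alpha):
    """Inverse logit computed directly from its definition."""
    return 1.0 / (1.0 + math.exp(-alpha))

def log_lik_naive(y, alpha):
    """Log Bernoulli probability of y in {0, 1}, computed naively."""
    theta = inv_logit_naive(alpha)
    return math.log(theta) if y == 1 else math.log(1.0 - theta)

# 1 + exp(-40) rounds to exactly 1.0, so theta is exactly 1.0.
theta = inv_logit_naive(40.0)

# The opposite outcome now has probability exactly 0, so taking its log
# fails (math.log(0.0) raises ValueError in Python).
try:
    log_lik_naive(0, 40.0)
    failed = False
except ValueError:
    failed = True

print(theta, failed)
```

The true log probability of y = 0 here is about -40, which is perfectly representable. The naive route just never gets there.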

Log Sum of Exponentials to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set y,x,

\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)
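As a small illustration of why the log scale matters, here’s a sketch in Python using made-up per-item probabilities of 0.1. The product of a thousand of them underflows to zero, while the sum of their logs is an unremarkable number.

```python
import math

# Hypothetical per-item likelihoods, purely for illustration.
probs = [0.1] * 1000

# Direct product: 0.1^1000 is far below the smallest positive double.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on the log scale.
log_lik = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_lik)   # roughly -2302.6
```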

Working on the inner log probability term, we have

\log p(y_n|\beta;x_n)

{ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))

{ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1
{ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0

Recalling that

1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha),

we further simplify to

{ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1
{ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0
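For anyone who wants the identity spelled out, it follows directly from the definition of the inverse logit:

1 - \mbox{logit}^{-1}(\alpha)

{ } = 1 - 1 / (1 + \exp(-\alpha))

{ } = \exp(-\alpha) / (1 + \exp(-\alpha))

{ } = 1 / (\exp(\alpha) + 1)

{ } = \mbox{logit}^{-1}(-\alpha)

where the second-to-last step multiplies the numerator and denominator by \exp(\alpha).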

Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let \alpha stand in for the linear predictor (or its negation), we have

{ } = \log \ \mbox{logit}^{-1}(\alpha)

{ } = \log (1 / (1 + \exp(-\alpha)))

{ } = - \log (1 + \exp(-\alpha))

{ } = - \mbox{logSumExp}(0,-\alpha)

Log Sum of Exponentials

Recall that the log sum of exponentials function is

\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))
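One common way to evaluate this stably is to factor out the larger of the two arguments, so that every call to \exp sees a non-positive value. Here’s a minimal sketch in Python, assuming that approach. The name log_sum_exp is my own choice, and this is not necessarily how LingPipe implements it.

```python
import math

def log_sum_exp(a, b):
    """Compute log(exp(a) + exp(b)) without overflow.

    Uses the identity log(exp(a) + exp(b)) = m + log(exp(a - m) + exp(b - m))
    with m = max(a, b), so both exponents are <= 0.
    """
    m = max(a, b)
    if math.isinf(m):
        # Both arguments are -inf (result -inf), or one is +inf (result +inf).
        return m
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Arguments well outside the safe range of exp() are handled fine.
print(log_sum_exp(1000.0, 1000.0))    # 1000 + log(2)
print(log_sum_exp(-1000.0, -1000.0))  # -1000 + log(2)
```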

If you’re not familiar with how it prevents underflow and overflow, check out my previous post on the topic.

In the logistic regression case, we have an even greater chance for optimization because the argument a is a constant zero.
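With a fixed at zero, the function reduces to \log(1 + \exp(b)), and we can branch on the sign of b and lean on log1p for accuracy when the exponential is tiny. The following Python is a sketch of one way to do that. The names log1p_exp and log_inv_logit are assumptions for this example.

```python
import math

def log1p_exp(b):
    """Compute log(1 + exp(b)), i.e. logSumExp(0, b), stably."""
    if b > 0.0:
        # Factor out exp(b): log(1 + exp(b)) = b + log(1 + exp(-b)).
        return b + math.log1p(math.exp(-b))
    # Here exp(b) <= 1, so it cannot overflow.
    return math.log1p(math.exp(b))

def log_inv_logit(alpha):
    """Compute log(logit^{-1}(alpha)) = -logSumExp(0, -alpha)."""
    return -log1p_exp(-alpha)

print(log_inv_logit(0.0))     # log(0.5)
print(log_inv_logit(-800.0))  # -800.0; the naive version would overflow
```

Either branch only ever exponentiates a non-positive number, so the worst that can happen is a harmless underflow to zero inside log1p.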

Logit-transformed Bernoulli

Putting it all together, we have the log of the logit-transformed Bernoulli distribution,

\log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))

{ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1
{ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0

We can just think of this as an alternatively parameterized Bernoulli distribution,

\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))

with which our model can be expressed as

y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta).
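In code, the log probability mass function for this parameterization might look like the sketch below, written in Python for outcomes coded as 0 and 1. The function name bernoulli_logit_log_pmf and the helper log1p_exp are hypothetical names for this example.

```python
import math

def log1p_exp(b):
    """Compute log(1 + exp(b)), i.e. logSumExp(0, b), stably."""
    if b > 0.0:
        return b + math.log1p(math.exp(-b))
    return math.log1p(math.exp(b))

def bernoulli_logit_log_pmf(y, alpha):
    """Log of BernoulliLogit(y | alpha) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    # y = 1: -logSumExp(0, -alpha);  y = 0: -logSumExp(0, alpha)
    return -log1p_exp(-alpha) if y == 1 else -log1p_exp(alpha)

# The case that breaks the naive computation: a confident prediction
# of 1 when the outcome is 0. The result is about -40, not an error.
print(bernoulli_logit_log_pmf(0, 40.0))
```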

Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as -1 and thus take the outcome y \in \{ -1, 1 \}, where we have

\log \mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)
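With the {-1,1} coding, the whole data-set log likelihood collapses to a single expression per item. Here’s a minimal Python sketch under that coding. The function names and the toy data are made up for illustration, and a real implementation would likely use vectorized arithmetic rather than plain lists.

```python
import math

def log1p_exp(b):
    """Compute log(1 + exp(b)), i.e. logSumExp(0, b), stably."""
    if b > 0.0:
        return b + math.log1p(math.exp(-b))
    return math.log1p(math.exp(b))

def log_likelihood(ys, xs, beta):
    """Sum of log BernoulliLogit(y_n | x_n beta) for y_n in {-1, 1}."""
    total = 0.0
    for y, x in zip(ys, xs):
        if y not in (-1, 1):
            raise ValueError("outcomes must be coded as -1 or 1")
        # Linear predictor x_n beta.
        alpha = sum(x_k * b_k for x_k, b_k in zip(x, beta))
        # One expression covers both outcomes: -logSumExp(0, -y * alpha).
        total += -log1p_exp(-y * alpha)
    return total

# Toy data; the last row has a huge predictor and the "wrong" outcome.
xs = [[1.0, 2.0], [1.0, -3.0], [1.0, 500.0]]
ys = [1, -1, -1]
beta = [0.5, 2.0]

ll = log_likelihood(ys, xs, beta)
print(ll)  # large and negative, but finite
```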
