Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

### Logistic Regression Model

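The basic model of a binary outcome $y_n \in \{0, 1\}$ with predictor or feature (row) vector $x_n \in \mathbb{R}^K$ and coefficient (column) vector $\beta \in \mathbb{R}^K$ is

$$y_n \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(x_n \beta)),$$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$$\mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)},$$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$$\mathrm{Bernoulli}(1 \mid \theta) = \theta,$$

and

$$\mathrm{Bernoulli}(0 \mid \theta) = 1 - \theta.$$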

### (Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-745, 710)$ before underflowing to 0 or overflowing to positive infinity.
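
To see where these limits sit, here is a minimal Python sketch using only the standard `math` and `sys` modules. Python is my choice for illustration only, and the helper name `exp_or_inf` is made up for this example. Note that Python's `math.exp` raises `OverflowError` rather than returning infinity, so the sketch maps that error to `math.inf`.

```python
import math
import sys

# Largest argument for which exp() is still finite in double precision.
max_exp_arg = math.log(sys.float_info.max)   # about 709.78

def exp_or_inf(alpha):
    """exp(alpha), mapping Python's OverflowError to positive infinity."""
    try:
        return math.exp(alpha)
    except OverflowError:
        return math.inf

largest_finite = exp_or_inf(709.0)     # still finite
overflowed = exp_or_inf(710.0)         # positive infinity
smallest_positive = math.exp(-745.0)   # tiny subnormal, still above zero
underflowed = math.exp(-746.0)         # rounds to 0.0
```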

### Potential Underflow and Overflow

The linear predictor at the heart of the regression,

$$x_n \beta = \sum_{k=1}^K x_{n,k} \, \beta_k$$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.
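
The failure mode is easy to reproduce. The sketch below is a direct Python transcription of the inverse logit definition, with made-up values for the coefficient and predictor; the function name `inv_logit_naive` is mine, not LingPipe's.

```python
import math

def inv_logit_naive(alpha):
    """Inverse logit computed directly from its definition."""
    return 1.0 / (1.0 + math.exp(-alpha))

# A moderately sized coefficient times a large predictor (made-up values).
beta = 2.0
x = 20.0
theta = inv_logit_naive(x * beta)   # 1 + exp(-40) rounds to 1, so theta is 1.0

# If the observed outcome is 0, its probability is 1 - theta, which is now 0.0,
p_failure = 1.0 - theta
# and its log probability is negative infinity.
log_p_failure = math.log(p_failure) if p_failure > 0.0 else -math.inf
```

The exact log probability here is about -40, a large but perfectly representable penalty, so the infinite value comes purely from rounding.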

### Log Sum of Exponents to the Rescue

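Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y, x$,

$$\log p(y \mid \beta; x) = \log \prod_{n=1}^N p(y_n \mid \beta; x_n) = \sum_{n=1}^N \log p(y_n \mid \beta; x_n).$$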

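Working on the inner log probability term, we have

$$\log p(y_n \mid \beta; x_n) = \log \mathrm{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)) = \begin{cases} \log \mathrm{logit}^{-1}(x_n \beta) & \text{if } y_n = 1 \\ \log \bigl(1 - \mathrm{logit}^{-1}(x_n \beta)\bigr) & \text{if } y_n = 0. \end{cases}$$

Recalling that

$$1 - \mathrm{logit}^{-1}(\alpha) = \frac{\exp(-\alpha)}{1 + \exp(-\alpha)} = \frac{1}{1 + \exp(\alpha)} = \mathrm{logit}^{-1}(-\alpha),$$

we further simplify to

$$\log p(y_n \mid \beta; x_n) = \begin{cases} \log \mathrm{logit}^{-1}(x_n \beta) & \text{if } y_n = 1 \\ \log \mathrm{logit}^{-1}(-x_n \beta) & \text{if } y_n = 0. \end{cases}$$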

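Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

$$\log \mathrm{logit}^{-1}(\alpha) = \log \frac{1}{1 + \exp(-\alpha)} = -\log \bigl(1 + \exp(-\alpha)\bigr) = -\mathrm{logSumExp}(0, -\alpha).$$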

### Log Sum of Exponentials

Recall that the log sum of exponentials function is

$$\mathrm{logSumExp}(a, b) = \log \bigl(\exp(a) + \exp(b)\bigr).$$

If you’re not familiar with how it prevents underflow and overflow, check out my previous post:

- LingPipe Blog: Log Sum of Exponentials.

In the logistic regression case, we have an even greater chance for optimization because the argument $a$ is a constant zero, so that $\mathrm{logSumExp}(0, b) = \log(1 + \exp(b))$.
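
As a rough sketch of both versions in Python (my choice of language; this is not LingPipe's code, and all three function names are made up for illustration), the general function factors out the larger argument so that `exp` only ever sees a non-positive value, and the specialized one folds in the constant zero using the standard library's `math.log1p`.

```python
import math

def log_sum_exp(a, b):
    """log(exp(a) + exp(b)), computed by factoring out the larger argument."""
    hi, lo = max(a, b), min(a, b)
    if math.isinf(hi):
        return hi  # avoids inf - inf; covers both -inf and +inf
    return hi + math.log1p(math.exp(lo - hi))

def log1p_exp(b):
    """log(1 + exp(b)), i.e. log_sum_exp(0, b) with the zero folded in."""
    if b > 0.0:
        return b + math.log1p(math.exp(-b))
    return math.log1p(math.exp(b))

def log_inv_logit(alpha):
    """log(logit^{-1}(alpha)) = -log_sum_exp(0, -alpha)."""
    return -log1p_exp(-alpha)
```

With this arrangement, `log_inv_logit(-1000.0)` evaluates to `-1000.0` rather than failing, because the only call to `exp` receives `-1000.0`.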

### Logit-transformed Bernoulli

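Putting it all together, we have the logit-transformed Bernoulli distribution on the log scale,

$$\log \mathrm{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)) = \begin{cases} -\mathrm{logSumExp}(0, -x_n \beta) & \text{if } y_n = 1 \\ -\mathrm{logSumExp}(0, x_n \beta) & \text{if } y_n = 0. \end{cases}$$

We can just think of this as an alternatively parameterized Bernoulli distribution,

$$\mathrm{BernoulliLogit}(y \mid \alpha) = \mathrm{Bernoulli}(y \mid \mathrm{logit}^{-1}(\alpha)),$$

with which our model can be expressed as

$$y_n \sim \mathrm{BernoulliLogit}(x_n \beta).$$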
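
A minimal Python sketch of this parameterization follows, assuming outcomes coded as 0 or 1 and plain lists for the vectors. The names `log1p_exp`, `bernoulli_logit_log_prob`, and `log_likelihood` are hypothetical, chosen for this example rather than taken from any library.

```python
import math

def log1p_exp(b):
    """log(1 + exp(b)) without overflow, i.e. logSumExp(0, b)."""
    if b > 0.0:
        return b + math.log1p(math.exp(-b))
    return math.log1p(math.exp(b))

def bernoulli_logit_log_prob(y, alpha):
    """log BernoulliLogit(y | alpha) for y in {0, 1}."""
    if y == 1:
        return -log1p_exp(-alpha)
    if y == 0:
        return -log1p_exp(alpha)
    raise ValueError("y must be 0 or 1")

def log_likelihood(ys, xs, beta):
    """Sum of log BernoulliLogit(y_n | x_n beta) over the data set."""
    total = 0.0
    for y, x in zip(ys, xs):
        alpha = sum(x_k * b_k for x_k, b_k in zip(x, beta))  # x_n beta
        total += bernoulli_logit_log_prob(y, alpha)
    return total
```

For a linear predictor of 40 and an observed outcome of 0, `bernoulli_logit_log_prob(0, 40.0)` comes out at about -40 instead of negative infinity.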

### Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as $-1$ and thus take the outcome $y \in \{-1, 1\}$, where we have

$$\mathrm{BernoulliLogit}(y \mid \alpha) = \mathrm{logit}^{-1}(y \alpha).$$
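
With the recoded outcome, the log probability collapses to a single expression, $-\log(1 + \exp(-y \alpha))$. One way this might look in Python is sketched below; the function name is made up, and the branch on the sign is my own choice so that `exp` never receives a positive argument.

```python
import math

def bernoulli_logit_log_prob_signed(y, alpha):
    """log logit^{-1}(y * alpha) for an outcome y coded as -1 or 1."""
    if y not in (-1, 1):
        raise ValueError("y must be -1 or 1")
    z = y * alpha
    # Evaluate -log(1 + exp(-z)) so that exp() only sees non-positive input.
    if z < 0.0:
        return z - math.log1p(math.exp(z))
    return -math.log1p(math.exp(-z))
```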
