Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.
Logistic Regression Model
The basic model of a binary outcome $y_n \in \{0, 1\}$ with predictor or feature (row) vector $x_n$ and coefficient (column) vector $\beta$ is

$$p(y_n \mid x_n, \beta) = \mathsf{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)),$$

where the logistic sigmoid (i.e., the inverse logit function) is defined by

$$\mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)},$$

and where the Bernoulli distribution is defined over support $y \in \{0, 1\}$ so that

$\mathsf{Bernoulli}(1 \mid \theta) = \theta$, and

$\mathsf{Bernoulli}(0 \mid \theta) = 1 - \theta$.
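For concreteness, here is a minimal Java sketch of the model exactly as defined above; the class and method names (`NaiveLogistic`, `inverseLogit`, `prob`) are illustrative, not LingPipe's API.

```java
// A minimal sketch of the model as defined above; names are illustrative.
public class NaiveLogistic {

    // Inverse logit (logistic sigmoid): 1 / (1 + exp(-alpha)).
    static double inverseLogit(double alpha) {
        return 1.0 / (1.0 + Math.exp(-alpha));
    }

    // Bernoulli(y | theta) for y in {0, 1}.
    static double bernoulli(int y, double theta) {
        return y == 1 ? theta : 1.0 - theta;
    }

    // p(y | x, beta) computed directly from the definitions.
    static double prob(int y, double[] x, double[] beta) {
        double linearPredictor = 0.0;
        for (int i = 0; i < x.length; ++i)
            linearPredictor += x[i] * beta[i];
        return bernoulli(y, inverseLogit(linearPredictor));
    }
}
```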
(Lack of) Floating-Point Precision
Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a domain for $\exp(\alpha)$ of roughly $\alpha \in (-745, 710)$ before underflowing to 0 or overflowing to positive infinity.
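Those limits are easy to probe directly in Java; the values in the comments below are approximate, and the class name is just for illustration.

```java
// Probing the usable domain of exp() for 64-bit doubles.
public class ExpDomain {
    public static void main(String[] args) {
        System.out.println(Math.exp(709.0));   // ~8.2e307, still finite
        System.out.println(Math.exp(710.0));   // Infinity (overflow)
        System.out.println(Math.exp(-745.0));  // ~4.9e-324, a subnormal
        System.out.println(Math.exp(-746.0));  // 0.0 (underflow)
    }
}
```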
Potential Underflow and Overflow
The linear predictor at the heart of the regression,

$$x_n \beta,$$

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.
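To make the failure mode concrete, here is a sketch using the naive implementation above; the value 40 is just an illustrative linear predictor.

```java
// Demonstration of the failure mode with the naive sketch above.
public class SaturationDemo {
    public static void main(String[] args) {
        // A moderately large linear predictor saturates the inverse logit:
        double theta = NaiveLogistic.inverseLogit(40.0); // 1 + exp(-40) rounds to 1
        System.out.println(theta);                       // 1.0 exactly
        // If the observed outcome is 0, it gets probability exactly 0,
        // so its log probability is -Infinity:
        System.out.println(Math.log(1.0 - theta));       // -Infinity
    }
}
```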
Log Sum of Exponents to the Rescue
Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set $y = y_1, \ldots, y_N$,

$$\log p(y \mid x, \beta) = \log \prod_{n=1}^N \mathsf{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)) = \sum_{n=1}^N \log \mathsf{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)).$$
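As an aside, here is a tiny sketch of why the log scale matters for long products of probabilities; the factor 1e-6 and the class name are arbitrary.

```java
// Products of many probabilities underflow; sums of their logs stay finite.
public class LogDomainDemo {
    public static void main(String[] args) {
        double product = 1.0;
        double sumOfLogs = 0.0;
        for (int n = 0; n < 1000; ++n) {
            product *= 1e-6;
            sumOfLogs += Math.log(1e-6);
        }
        System.out.println(product);    // 0.0 (underflowed)
        System.out.println(sumOfLogs);  // about -13815.5
    }
}
```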
Working on the inner log probability term, we have

$$\log \mathsf{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)) = \begin{cases} \log \mathrm{logit}^{-1}(x_n \beta) & \text{if } y_n = 1, \\ \log \left(1 - \mathrm{logit}^{-1}(x_n \beta)\right) & \text{if } y_n = 0. \end{cases}$$
Recalling that

$$1 - \mathrm{logit}^{-1}(\alpha) = \mathrm{logit}^{-1}(-\alpha),$$

we further simplify to

$$\log \mathsf{Bernoulli}(y_n \mid \mathrm{logit}^{-1}(x_n \beta)) = \begin{cases} \log \mathrm{logit}^{-1}(x_n \beta) & \text{if } y_n = 1, \\ \log \mathrm{logit}^{-1}(-x_n \beta) & \text{if } y_n = 0. \end{cases}$$
Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let $\alpha$ stand in for the linear predictor (or its negation), we have

$$\log \mathrm{logit}^{-1}(\alpha) = \log \frac{1}{1 + \exp(-\alpha)} = -\log\left(1 + \exp(-\alpha)\right) = -\mathrm{logSumExp}(0, -\alpha).$$
Log Sum of Exponentials
Recall that the log sum of exponentials function is

$$\mathrm{logSumExp}(a, b) = \log\left(\exp(a) + \exp(b)\right).$$
If you’re not familiar with how it prevents underflow and overflow, check out my previous post:
- LingPipe Blog: Log Sum of Exponentials.
In the logistic regression case, we have an even greater chance for optimization because one of the arguments is a constant zero.
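Here is a minimal Java sketch of that specialization; `StableLogistic`, `logSumExpZero`, and `logInverseLogit` are hypothetical names for illustration, not LingPipe methods. With one argument pinned to zero, $\mathrm{logSumExp}(0, x) = \log(1 + \exp(x))$, which `Math.log1p` evaluates stably after factoring out $\max(0, x)$.

```java
// Hypothetical helper names, not LingPipe methods.
public class StableLogistic {

    // logSumExp(0, x) = log(1 + exp(x)), computed by factoring out
    // max(0, x) so that exp() never sees a large positive argument.
    static double logSumExpZero(double x) {
        return x > 0.0
            ? x + Math.log1p(Math.exp(-x))   // exp(-x) <= 1: no overflow
            : Math.log1p(Math.exp(x));       // exp(x) <= 1: no overflow
    }

    // log logit^{-1}(alpha) = -logSumExp(0, -alpha), as derived above.
    static double logInverseLogit(double alpha) {
        return -logSumExpZero(-alpha);
    }
}
```

The point of factoring out the max is that exp is only ever applied to non-positive arguments, where it can underflow harmlessly to 0 but never overflow.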
Logit-transformed Bernoulli
Putting it all together, we have the logit-transformed Bernoulli distribution,

$$\log \mathsf{BernoulliLogit}(y \mid \alpha) = \begin{cases} -\mathrm{logSumExp}(0, -\alpha) & \text{if } y = 1, \\ -\mathrm{logSumExp}(0, \alpha) & \text{if } y = 0. \end{cases}$$

We can just think of this as an alternatively parameterized Bernoulli distribution,

$$\mathsf{BernoulliLogit}(y \mid \alpha) = \mathsf{Bernoulli}(y \mid \mathrm{logit}^{-1}(\alpha)),$$

with which our model can be expressed as

$$p(y_n \mid x_n, \beta) = \mathsf{BernoulliLogit}(y_n \mid x_n \beta).$$
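In code, continuing the hypothetical `StableLogistic` sketch above, a minimal version might look like this:

```java
// Log probability of the logit-parameterized Bernoulli for y in {0, 1},
// added to the StableLogistic sketch above.
static double logBernoulliLogit(int y, double alpha) {
    return y == 1
        ? -logSumExpZero(-alpha)   // log logit^{-1}(alpha)
        : -logSumExpZero(alpha);   // log logit^{-1}(-alpha)
}
```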
Recoding Outcomes {0,1} as {-1,1}
The notation’s even more convenient if we recode the failure outcome $0$ as $-1$ and thus take the outcome $y \in \{-1, 1\}$, where we have

$$\log \mathsf{BernoulliLogit}(y \mid \alpha) = -\mathrm{logSumExp}(0, -y\alpha) = -\log\left(1 + \exp(-y\alpha)\right).$$
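A final sketch, again with illustrative names added to `StableLogistic`, collapses both cases into one expression and sums over the data set to get a stable log likelihood.

```java
// With y recoded to -1.0 or +1.0, both cases collapse to one expression.
static double logBernoulliLogitSigned(double y, double alpha) {
    return -logSumExpZero(-y * alpha);
}

// Stable log likelihood of the whole data set, given precomputed
// linear predictors x_n * beta.
static double logLikelihood(double[] ys, double[] linearPredictors) {
    double total = 0.0;
    for (int n = 0; n < ys.length; ++n)
        total += logBernoulliLogitSigned(ys[n], linearPredictors[n]);
    return total;
}
```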