Softmax without Overflow


Softmax is another name for the generalization of the logistic sigmoid function to n outcomes. It’s popular as an activation function for neural networks and as an (inverse) link function for generalized linear models like logistic regression.

Softmax converts an arbitrary real-valued vector into a multinomial probability vector. Except when it doesn’t.

LingPipe uses softmax in two places. First, to construct a classify.JointClassification from joint log probabilities, which requires properly scaled linear probabilities in order to construct its superclass classify.ConditionalClassification. Second, inside multinomial logistic regression, where it's used to convert the linear predictors into probabilities.

Suppose we have a vector of finite numbers:

x = \langle x_1, \ldots, x_n \rangle

The softmax operation converts it into a vector whose entries are finite, non-negative, and sum to 1:

\mbox{\rm softmax}(x) = \langle \exp(x_1)/Z,\ldots,\exp(x_n)/Z \rangle

where the normalizing term is the usual sum:

Z = \exp(x_1) + \cdots + \exp(x_n)

so that by construction:

 \exp(x_1)/Z + \cdots + \exp(x_n)/Z  = 1

The goal is to actually compute the terms in the softmax vector. The problem? Overflow. The terms in x might be so large that Math.exp(x[i]) overflows. They don't even need to be that big, just Math.log(Double.MAX_VALUE), or about 710. That's especially easy to hit in logistic regression under many conditions (e.g. large absolute feature values, large variance within a dimension of features, large predictive differences, weak priors with separable features).
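To see the failure mode concretely, here's a minimal sketch (class and method names are mine, not LingPipe's) of the naive computation blowing up on large inputs:

```java
public class NaiveSoftmaxDemo {

    // Naive softmax: overflows whenever some xs[i] exceeds about 709.78.
    static double[] naiveSoftmax(double[] xs) {
        double Z = 0.0;
        for (double x : xs)
            Z += Math.exp(x);  // Math.exp(1000.0) is already Infinity
        double[] ps = new double[xs.length];
        for (int i = 0; i < xs.length; ++i)
            ps[i] = Math.exp(xs[i]) / Z;  // Infinity/Infinity = NaN
        return ps;
    }

    public static void main(String[] args) {
        double[] ps = naiveSoftmax(new double[] { 1000.0, 999.0 });
        // Z overflows to Infinity, so every entry comes out NaN
        System.out.println(ps[0] + " " + ps[1]);
    }
}
```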

It’s algebra month here at Alias-i, so you shouldn’t be surprised when I note that the vector x can be shifted dimensionwise without changing the softmax result:

\mbox{\rm softmax}(\langle x_1,\ldots,x_n\rangle) = \mbox{\rm softmax}(\langle x_1 + a, \ldots, x_n+a\rangle)

= \langle \exp(x_1 + a)/Z',\ldots, \exp(x_n+a)/Z' \rangle

= \langle \exp(x_1)\exp(a)/Z',\ldots,\exp(x_n)\exp(a)/Z'\rangle

where the additive term factors through the normalizing constant:

Z' = \exp(x_1+a) + \cdots + \exp(x_n+a)

= \exp(a)(\exp(x_1) + \cdots + \exp(x_n))

and thus drops out, establishing the equivalence.
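The identity is easy to check numerically on small inputs, where the naive computation is still safe. A quick sketch (illustrative code, not LingPipe's):

```java
public class ShiftInvarianceDemo {

    // Plain softmax; fine here because the inputs are small.
    static double[] softmax(double[] xs) {
        double Z = 0.0;
        for (double x : xs)
            Z += Math.exp(x);
        double[] ps = new double[xs.length];
        for (int i = 0; i < xs.length; ++i)
            ps[i] = Math.exp(xs[i]) / Z;
        return ps;
    }

    public static void main(String[] args) {
        double[] p = softmax(new double[] { 1.0, 2.0, 3.0 });
        double[] q = softmax(new double[] { 6.0, 7.0, 8.0 });  // shifted by a = 5
        for (int i = 0; i < p.length; ++i)
            System.out.println(p[i] + " vs " + q[i]);  // equal up to rounding
    }
}
```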

What we do is rescale so there's no overflow, by subtracting the maximum of the x values from each x[i]. That converts potential overflow into underflow. Underflow is no problem, because it just rounds off to zero, which is a well-behaved floating point number. The code is roughly as follows:

double[] softmax(double[] xs) {

    double a = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < xs.length; ++i)
        if (xs[i] > a)
            a = xs[i];

    double Z = 0.0;
    for (int i = 0; i < xs.length; ++i)
        Z += Math.exp(xs[i] - a);

    double[] ps = new double[xs.length];
    for (int i = 0; i < xs.length; ++i)
        ps[i] = Math.exp(xs[i] - a)/Z;

    return ps;
}

The real code’s less clear because it caches the exponent computation Math.exp(xs[i] - a) in ps while computing the sum for Z and then just rescales ps by Z.
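A sketch of that cached variant (my reconstruction of the idea, not the actual LingPipe source) looks like this:

```java
public class Softmax {

    // Overflow-safe softmax that computes each exponent exactly once:
    // exp(xs[i] - a) is stored in ps while accumulating Z, then ps is
    // rescaled by Z in place.
    public static double[] softmax(double[] xs) {
        double a = Double.NEGATIVE_INFINITY;
        for (double x : xs)
            if (x > a)
                a = x;

        double[] ps = new double[xs.length];
        double Z = 0.0;
        for (int i = 0; i < xs.length; ++i) {
            ps[i] = Math.exp(xs[i] - a);  // cache the exponent
            Z += ps[i];
        }

        for (int i = 0; i < xs.length; ++i)
            ps[i] /= Z;  // rescale in place

        return ps;
    }

    public static void main(String[] args) {
        // The worked example from the post: no overflow despite inputs > 710's reach
        double[] ps = softmax(new double[] { 305.0, 300.0, 150.0 });
        for (double p : ps)
            System.out.println(p);
    }
}
```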

In practice, suppose x = (305, 300, 150). We subtract the max (305) from each entry to get x' = (0, -5, -155). We then calculate the normalizer Z = exp(0) + exp(-5) + exp(-155) ~ 1 + 0.0067 + 0 = 1.0067. Note how the exponentiation of the big negative value (-155) comes out vanishingly small (on the order of 1e-67), effectively zero; with a gap of more than 745 or so, it would underflow to zero exactly. We then compute the linear probabilities as p = (exp(0)/1.0067, exp(-5)/1.0067, exp(-155)/1.0067) ~ (0.9933, 0.0067, 0).

The old classifier implementation used the min rather than the max to rescale, which isn't aggressive enough: it still allows overflow when the inputs may be large and positive. That wasn't a problem for classifiers, where the numbers were log probabilities and hence all negative; there the scaling prevents catastrophic underflow, and we already had that right.

This method is guaranteed not to have any overflow. And it's guaranteed not to produce all zero values, too. I'm afraid it's simply not possible to represent two numbers more than a factor of exp(711) or so apart using IEEE double-precision floating point arithmetic. And frankly, once events are this relatively unlikely, we're not too worried about rounding to zero.
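The thresholds themselves are easy to verify directly (a quick sketch; the exact cutoffs are Math.log(Double.MAX_VALUE) ~ 709.78 for overflow and roughly -745 for underflow past the smallest subnormal):

```java
public class ExpRangeDemo {
    public static void main(String[] args) {
        // Past the overflow threshold: exp returns Infinity
        System.out.println(Math.exp(710.0));   // Infinity

        // Past the underflow threshold: exp rounds to a well-behaved 0.0
        System.out.println(Math.exp(-800.0));  // 0.0
    }
}
```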

Look for the new implementation in LingPipe 4.0’s logistic regression classification function. Coming soon to a web site near you.

5 Responses to “Softmax without Overflow”

  1. sth4nth Says:

    People who deal with Bayesian estimation and inference often have a logsumexp(x) function (x=[x1,…,xn]) in their library. It’s like (again in matlab language):

    function s = logsumexp(x)
    a = max(x);
    s = a + log(sum(exp(x - a)));

    Then the softmax(x) is simply:

    exp(x - logsumexp(x))

    Which seems easier.
    Have fun :)

  2. Ken Williams Says:

    Your math expressions seem to be broken here, did the LaTeX server go away?

    • Bob Carpenter Says:

      Thanks for the heads up. I’ve switched to doing everything in WordPress itself, where it’s easy to include \LaTeX.

      I’ll see what I can do about fixing the old links. They luckily have the raw LaTeX still in the anchor tag.

  3. steven Varga Says:

    double a = Double.POSITIVE_INFINITY;
    for (int i = 0; i < xs.length; ++i)
        if (xs[i] > a)
            a = xs[i];

    did you mean to type:
    double a = Double.NEGATIVE_INFINITY;
    or am I missing a point? it seems the above codeblock sets all ‘a’-s to max_double then keeps it that way.

  4. Akin C. Says:

    Hello, thank you for your article! I am new in Machine Learning and it helped me a lot!

    I think there is a calculation error:
    exp(-5) = 0.0067379469990855 and 1/32 = 0.03125

    What if it happens that x’ = (0,0,0)? Is that even possible? Does that mean I have to tweak my weights through learning?

    Thanks again

    Akin C.
    (Sorry if I posted the comment twice)
