Log Sum of Exponentials

by

I’ve decided, at least in the first pass, to calculate forward/backward for CRFs on a log scale. The alternative way to maintain floating point stability, by keeping log multipliers for columns of outputs, was just making my head hurt too much in the nested loops.

The log computation involves taking the logarithm of a sum over the exponentiated values found in an array. In symbols:

z= \log \sum_i \exp(x_i)

The problem, of course, is that if any of the x values get sufficiently large (greater than log(Double.MAX_VALUE), which is about 710), they overflow to positive infinity when you calculate the exponent; if all of the values are tiny (less than -750 or so), they all underflow to 0.0 and the resulting log sum is zero.

Luckily, we can get more arithmetic stability with a little algebra. First, find the maximum value in x:

m = \max_i x_i

Then use it to normalize the largest value to 0:

\begin{array}{lll} \log \sum_i \exp(x_i) & = & \log \sum \frac{\exp(m)}{\exp(m)}\exp(x_i) \\[12pt] & = & \log \exp(m) \sum \frac{1}{\exp(m)} \exp(x_i) \\[12pt]& = & \log \exp(m) + \log \sum \exp(-m) \exp(x_i)\\[12pt]& = & m + \log \sum \exp(x_i - m) \end{array}

So the largest value passed to the exponential function is 0. If there are really tiny values after subtraction, they’ll become zero and drop out, as they should with limited precision arithmetic.

I’ve implemented a new static utility method to compute the log of the sum of the exponentiated values of an array:

com.aliasi.util.Math.logSumOfExponentials(double[]).

It’ll be out in the next release (hopefully along with CRFs).

In case you’re curious where this comes up, here’s the inductive step of the forward pass of the forward-backward algorithm for CRFs (φ(n,k’) is the feature vector for token position n and previous outcome k'; β[k] is the coefficient coefficient for outcome k):

\alpha_{n,k} = \sum_{k'} \alpha_{n-1,k'} \exp \big( \phi(n,k')^{\top} \beta_k \big)

and here it is on the log scale:

\log \alpha_{n,k} = \log \sum_{k'} \exp\big(\, \log \alpha_{n-1,k'} + \phi(n,k')^{\top} \beta_k\, \big)

Here’s our Java source from com.aliasi.util.Math:

    public static double logSumOfExponentials(double[] xs) {
        if (xs.length == 1) return xs[0];
        double max = maximum(xs);
        double sum = 0.0;
        for (int i = 0; i < xs.length; ++i)
            if (xs[i] != Double.NEGATIVE_INFINITY)
                sum += java.lang.Math.exp(xs[i] - max);
        return max + java.lang.Math.log(sum);
    }

5 Responses to “Log Sum of Exponentials”

  1. Jonathan Graehl Says:

    Many of us have defined the same operation pairwise so as to have (with operator overloading) numbers which have the usual operations. In adding pairwise you can use a special function to compute log(1+x) which is more accurate for small x – “log1p”. You could use that in your vector operation as well, although it would only be relevant if all the numbers were MUCH smaller than the maximum.

  2. lingpipe Says:

    Thanks for reminding me about the special log calculation.

    For everyone else who’s not knee deep in these derivations, the binary case is:

    log(exp(a) + exp(b)) = a + log(1 + exp(b-a))

    and this is where the special calculation for log(1+x) comes into play.

    Check out John Cook’s discussion of log(1+x), which uses a Taylor approximation for x < 1e-4 (that is, x < 0.0001), and simply makes a lib call otherwise.

  3. sth4nth Says:

    That’s what I said in the replay of the logistic regression post.

  4. lingpipe Says:

    sth4nth is talking about his or her comment on my softmax without overflow blog entry.

    Sorry I’m so slow with all this arithmetic precision stuff; I should’ve definitely put exp(2) and exp(2) together in this case.

  5. Martin Says:

    In my limited experience, the more useful function to approximate is f(x) = log(1+exp(x)). You only need to do that for x>=0 or x<=0. For sufficiently large x, log(1+exp(x)) ~= x; and for sufficiently small x (e.g. x < 750 in IEEE double precision), log(1+exp(x)) ~= 0. Moreover, the difference between the function and its asymptotes is symmetric about 0, i.e., f(x) – x == f(-x). So you only need an approximation for f near 0 and only on one side of 0. A rational approximation of f can usually be evaluated much faster than log (or log1p) and exp, since each one of those is actually computed by a polynomial or rational approximation in turn.

    (Aside: f should be familiar, though what usually shows up is df/dx.)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 811 other followers