I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

### Why Hamiltonian Monte Carlo?

Hamiltonian MC methods use gradient (derivative) information to guide Metropolis-Hastings proposals. Matt is finding that it works well for the multilevel deep interaction models we’re working on with Andrew. As the theory suggests, it works much better than Gibbs sampling when the variables are highly correlated, as they tend to be with multilevel regression coefficients. The basic idea goes back to the paper:

- Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. Hybrid Monte Carlo. *Physics Letters B* 195(2):216–222. doi:10.1016/0370-2693(87)91197-X
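To make the gradient-guided proposal concrete, here’s a minimal sketch of a single HMC transition for a one-dimensional standard-normal target (the function names and tuning values are mine, not from any package discussed here; real samplers would get the gradient from AD rather than hand-coding it):

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Toy target: -log p(q) = q^2/2, so the gradient is just q.
static double neg_log_p(double q) { return 0.5 * q * q; }
static double grad_neg_log_p(double q) { return q; }

// Uniform(0,1) draw; crude but fine for a sketch.
static double unif() { return (std::rand() + 1.0) / (RAND_MAX + 2.0); }

// Standard-normal draw via Box-Muller for the momentum refresh.
static double randn() {
  const double PI = 3.14159265358979323846;
  return std::sqrt(-2.0 * std::log(unif())) * std::cos(2.0 * PI * unif());
}

// One HMC transition: refresh the momentum, simulate the Hamiltonian
// dynamics with L leapfrog steps of size eps, then do a Metropolis
// accept/reject to correct for discretization error.
double hmc_step(double q, double eps, int L) {
  double p = randn();
  double H0 = neg_log_p(q) + 0.5 * p * p;  // initial Hamiltonian
  double qn = q, pn = p;
  pn -= 0.5 * eps * grad_neg_log_p(qn);    // half step in momentum
  for (int i = 0; i < L; ++i) {
    qn += eps * pn;                        // full step in position
    if (i < L - 1) pn -= eps * grad_neg_log_p(qn);
  }
  pn -= 0.5 * eps * grad_neg_log_p(qn);    // final half step in momentum
  double H1 = neg_log_p(qn) + 0.5 * pn * pn;
  return (unif() < std::exp(H0 - H1)) ? qn : q;  // Metropolis correction
}
```

Because the whole parameter vector moves at once along the gradient, correlated parameters don’t cause the slow zig-zag random walk that plagues one-at-a-time Gibbs updates.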

### Why AD?

Because we want to model general directed graphical models *a la* BUGS, we need to compute gradients.

If you need general gradients, it sure seems like automatic differentiation (AD) deserves at least some consideration. AD is a technique that operates on arbitrary functions defined by source code, generating new source code that computes the derivative (thus it’s a kind of automatic code generation). At a high level, it does just what you might expect: computes the derivatives of built-in functions and constants then uses the chain rule to put them together. Because it generates code, you can even have a go at tuning the resulting derivative code further.
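To see how little machinery the chain-rule bookkeeping requires, here’s a toy “dual number” sketch of forward-mode AD (a hypothetical illustration, not any particular package’s API): every value carries its derivative along, and each operation applies the chain rule.

```cpp
#include <cassert>
#include <cmath>

// A dual number: a value and its derivative w.r.t. the chosen input.
// This is essentially what operator-overloading AD tools do under the hood.
struct Dual {
  double v;  // value
  double d;  // derivative
};

// Each built-in operation also propagates derivatives via the chain rule.
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// Any function written against Dual yields its derivative for free.
// Here f(x) = x^2 + sin(x), so f'(x) = 2x + cos(x).
Dual f(Dual x) { return x * x + sin(x); }
```

Seeding the input with `d = 1` (i.e., `Dual x{x0, 1.0}`) makes the output’s `d` field the derivative at `x0`.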

Here are some basic refs:

- Wikipedia: automatic differentiation
- Community Site with Software Links: Autodiff.org
- Survey Paper: Bischof, C. H. and P. D. Hovland. 2008. On the implementation of automatic differentiation tools. *Higher-Order and Symbolic Computation* 21(3):311–331.

### So Which One?

So what I want to know is, if I’m going to be coding in C, which implementation of AD should I be using? (There are implementations in everything from Fortran to Python to Haskell.)

**Update** 21 January 2010: So, two problems rear their ugly heads. One: almost all of the tools other than Tapenade require you to seriously modify the code with different data types and flags for the code generator. Two: most of these tools are for non-commercial use only and can’t be redistributed. Not exactly how we’d like to roll together our own academic software. This is especially annoying because they’re almost all government-research funded (Tapenade in France, many of the others in the US, with copyright held tightly by institutions like the University of Chicago and INRIA). Many of the packages require extensive secondary packages to be installed. I miss Java and Apache.

### Tapenade’s Online AD for C, C++ and Fortran

**Update** 21 January 2010: maybe people coming from the blog killed it, but the Tapenade Online app is no longer up and hasn’t been since an hour or two after this blog post.

You can try it on your own C, C++, or Fortran program using the Tapenade online interface.

For instance, consider the simple (intentionally non-optimal) C code to compute f(x) = x^2 + 3x:

    double f(double x) {
      double y;
      y = x;
      y *= x;
      y += 3*x;
      return y;
    }

Running Tapenade in forward mode, we get

    double f_d(double x, double xd, double *f) {
      double y;
      double yd;
      yd = xd;
      y = x;
      yd = yd*x + y*xd;
      y *= x;
      yd = yd + 3*xd;
      y += 3*x;
      *f = y;
      return yd;
    }

If you plug in 1.0 for xd (in general, this seed can be used for chaining functions), you can see that the function returns the derivative 2x + 3 and also sets the value of argument f (given as a pointer) to f(x) = x^2 + 3x.
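Here’s a small sanity check (the driver is mine, not Tapenade’s) around the generated forward-mode code: at x = 2 with seed xd = 1, we should get f(2) = 10 and f′(2) = 7.

```cpp
#include <cassert>
#include <cmath>

// Tapenade's forward-mode output, reproduced verbatim.
double f_d(double x, double xd, double *f) {
  double y;
  double yd;
  yd = xd;   // seed the derivative with xd
  y = x;
  yd = yd*x + y*xd;  // product rule for y *= x
  y *= x;
  yd = yd + 3*xd;    // derivative of the 3*x term
  y += 3*x;
  *f = y;            // function value returned through the pointer
  return yd;         // derivative returned directly
}
```

Calling `f_d(2.0, 1.0, &val)` returns 7.0 (the derivative 2x + 3 at x = 2) and sets `val` to 10.0 (the value x^2 + 3x at x = 2).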

In backward mode, Tapenade generates the following code:

    void f_b(double x, double *xb, double fb) {
      double y;
      double yb;
      double f;
      y = x;
      yb = fb;
      *xb = *xb + (y+3)*yb;
      yb = x*yb;
      *xb = *xb + yb;
    }

If you squint hard enough, you’ll realize this method also computes the derivative of f(x) = x^2 + 3x. It’s set up to run the chain rule in reverse.
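A small driver of my own makes the reverse-mode calling convention explicit: the caller zero-initializes the adjoint xb, seeds fb = 1, and the derivative accumulates into xb.

```cpp
#include <cassert>
#include <cmath>

// Tapenade's reverse-mode output, reproduced verbatim
// (the unused local f is Tapenade's, left in place).
void f_b(double x, double *xb, double fb) {
  double y;
  double yb;
  double f;
  y = x;
  yb = fb;               // seed the adjoint of the output
  *xb = *xb + (y+3)*yb;  // contributions from y *= x and += 3*x
  yb = x*yb;             // chain rule back through y = x
  *xb = *xb + yb;        // accumulate into the input's adjoint
}
```

After `double xb = 0.0; f_b(2.0, &xb, 1.0);`, xb holds 7.0, again f′(2) = 2·2 + 3.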

### Citation Stats?

Yes, I know this is totally crass, but nevertheless, I went to autodiff.org and checked out the many systems for C/C++ they cite, and then did a Google Scholar query `["SystemName" "automatic differentiation"]`, restricted to results after (including?) 2005. Here’s what I found:

| System | Hits |
|---|---|
| ADMB | 73 |
| ADC | 16 |
| ADIC | 274 |
| ADOL-C | 265 |
| AUTODIF | 11 |
| COSY INF | 52 |
| CppAD | 24 |
| FAD | 46 |
| fadbad | 65 |
| ffadlib | 3 |
| OpenAD | 91 |
| Rapsodia | 9 |
| Sacado | 8 |
| Tapenade | 244 |
| Treeverse | 7 |
| YAO | 88 |

This doesn’t seem to be expressing a particularly clear community preference. So I was hoping some of you may have suggestions. I should add that many of the papers are themselves surveys, so don’t actually correspond to someone using the package, which is what I’d really like to know.

January 19, 2011 at 9:56 pm

Is Hamiltonian MC sort of like multilevel multiple imputation?

January 20, 2011 at 1:49 pm

Hamiltonian MC is an instance of the Metropolis-Hastings algorithm, which is an MCMC algorithm for sampling from distributions. Unlike Gibbs sampling, it samples all the parameters at once. Either could be used to sample from the posterior of a Bayesian model, but Hamiltonian MC tends to work better if there is a lot of correlation among the parameters. You can also mix them, as you tend to do when the parameters or missing data are a mix of categorical and continuous variables.

The term “imputation” is usually used for missing data, such as a non-response in a survey. You can certainly set up a Bayesian model to do multilevel multiple imputation and then use Hamiltonian MC to sample from the posterior. In fact, it’s one of the applications we’re exploring. There’s a nice discussion in Gelman and Hill’s multilevel regression book, with BUGS used for sampling (it’s a general Gibbs sampler).

January 21, 2011 at 2:29 pm

that’s awesome, thanks.

January 24, 2011 at 8:27 pm

I’ve started playing around with CppAD, and using Rcpp I’ve been able to rudimentarily (sic) link it with R. I don’t yet have any feeling for how well it performs, but I intend to try out the ideas of Girolami and Calderhead (JRSSB read paper late last year) using it.

January 24, 2011 at 11:38 pm

CppAD’s notion of “taping” seems really clunky. Isn’t it going to be very inefficient? Here’s what it looks like in CppAD:

CppAD reverse automatic differentiation example

I really like the clean integration of operator overloading in Sacado:

Sacado reverse automatic differentiation example

But then how inefficient is the operator overloading in Sacado and other such packages? I just don’t know C++ that well.

The function is cleanly separated from any of the Sacado bits. Maybe that’s achievable with CppAD with a little encapsulation — I’m still not sure exactly how all these things work.

It makes me wonder if I can just install something like the templated Boost C++ libraries and be done with it.

Or if I can just pass in a vector to forward mode, like you see in the Sacado forward example Fad_demo.

Installing this stuff’s such a pain compared to Java — it’s like the 1980s all over again with path-guessing and scripts to configure scripts to build platform-specific executables.

Linking with R isn’t the problem. We (and by that I mean the people I’m working with) are all over that. We just don’t want to spend our lives hand-coding derivatives.

January 25, 2011 at 4:25 am

I’m just about to do some MCMC using CppAD, so then I’ll know the penalties. It was easy enough to install and one of my colleagues had prior experience with it, so it seemed a good place to start, and I couldn’t face trying to differentiate the likelihood I was constructing by hand. But I’m always interested in doing better, so I’d like to keep in touch with how you are progressing.

April 3, 2011 at 3:09 pm

I’m writing to follow up on this post: http://www.stat.columbia.edu/~cook/movabletype/archives/2011/02/handy_matrix_ch.html

eigen is awesome, but linking eigen to AD packages has proved to be very tricky (even their unsupported/AutoDiff package). Do you have any updates about which set of tools you settled on? I’m eager to find something that can play nicely with eigen.

October 10, 2011 at 3:01 am

Just thought I’d revisit this issue and see how things have progressed with you in the preceding 9 months. CppAD has done what I wanted, but really only as a test case; I think I need a more efficient approach than one based on a ‘tape’.

October 10, 2011 at 8:24 am

I’ve so far had good luck with ADOL-C. eigen is flexible enough that it doesn’t mind working with objects (adoubles) instead of integrals (doubles), and therefore integration between the two (ADOL-C and eigen) has not been an issue. ADOL-C’s major drawback is its large memory footprint (I force it to save the tape in memory rather than on disk), but I recently found a patch at http://www.sc.rwth-aachen.de/willkomm/ that replaces ADOL-C’s memory manager with something a bit more modern. The site I linked to also has a presentation (PDF) benchmarking a few AD packages, which might be of interest.

As for the performance hit that comes with using a tape, I’ve noticed that ADOL-C, and perhaps the operator-overloading / tape method in general (I’m not a computer scientist), has a high fixed cost in terms of speed, but relatively low cost for calculating additional higher-order tensors. For example, I’m using Girolami/Calderhead’s MALA, which relies on the third derivative of the log probability function. G&C’s paper includes an approximation (constant curvature) that only requires the Fisher information matrix, so I implemented that first. However, there appears to be very little speed difference between stopping at the second derivative and going all the way to full MALA.

Any other success stories out there?

October 10, 2011 at 2:12 pm

The tape’s a huge memory sink. I don’t see how else you can do reverse-mode, though.

We’re about ready to release the first version of our HMC sampler with auto-dif. We wound up writing our own auto-dif that’s both faster (not by much) and more flexible (by a fair bit) than either RAD/Sacado or CppAD. We based it on an extensible OO design that overrides operator new to manage the auto-dif variables on a stack.
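The tape/stack idea can be sketched roughly like this (a toy with global state, not our actual implementation): each overloaded operation pushes a record of its operands and local partial derivatives, and a single reverse sweep propagates adjoints back down the tape.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One tape record per operation: operand slots and local partials.
struct Record {
  int a, b;       // operand tape indices (-1 if unused)
  double da, db;  // local partial derivatives w.r.t. each operand
};

std::vector<Record> tape;  // one record per variable, in creation order
std::vector<double> adj;   // adjoint (accumulated derivative) per variable

struct Var {
  int idx;     // position on the tape
  double val;  // forward value
};

// Create a variable, recording how it was computed.
Var make_var(double v, int a = -1, double da = 0.0,
             int b = -1, double db = 0.0) {
  tape.push_back({a, b, da, db});
  adj.push_back(0.0);
  return {(int)tape.size() - 1, v};
}

// Overloaded ops record their local partials as they compute values.
Var operator+(Var x, Var y) {
  return make_var(x.val + y.val, x.idx, 1.0, y.idx, 1.0);
}
Var operator*(Var x, Var y) {
  return make_var(x.val * y.val, x.idx, y.val, y.idx, x.val);
}

// Reverse sweep: seed the output's adjoint, then walk the tape backwards,
// pushing each adjoint onto its operands via the chain rule.
void backward(Var out) {
  adj[out.idx] = 1.0;
  for (int i = (int)tape.size() - 1; i >= 0; --i) {
    if (tape[i].a >= 0) adj[tape[i].a] += tape[i].da * adj[i];
    if (tape[i].b >= 0) adj[tape[i].b] += tape[i].db * adj[i];
  }
}
```

Running f = x*x + c*x with x = 2 and c = 3 and then `backward(f)` leaves 7 (= 2x + 3 at x = 2) in x’s adjoint slot, matching the Tapenade example above.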

It’s very special purpose, though. We also didn’t template values out, so there’s not a good way to do second-order difs. But reverse-reverse isn’t ideal, at least the way any of the standard packages go about it. For second-order (i.e., Hessians), it’d be better to do N forward-reverse mode passes.

We went with Eigen, whose developers have been really helpful. For Eigen 3.0.1, they removed explicit namespaces on some of their internals, which allows all of the functions to be used (we use Cholesky, eigenvalues/vectors, determinants and inverses, as well as all the basic arithmetic). It all works very well if all the variables are auto-dif variables. There’s no way in Eigen to deal with multiplying a vector of doubles by a matrix of auto-dif variables. I’ve put in a request and they’ve sketched a solution for mixed-type operations, but no one’s implemented it and it’s not something that’s a high priority for the developers. (They specially hard-coded the complex-double case.)

For now, we’re just wrapping our double matrices in auto-dif variables operation by operation so we could template everything and easily code generate (we’re generating code from model specs like BUGS). Wrapping is a huge waste of space (and hence time), but it does the right thing in terms of behavior. The real bummer is that the way I wrapped it, it defeats their clever template expressions. But the real speed gain will not be from laziness through templates, but by short-circuiting the auto-dif to make it more efficient (who knew so many matrix ops had analytical derivatives?).

Luckily, we don’t have too many matrix ops. They only show up in multivariate distros. It was a huge pain to get an unconstrained representation of correlation and covariance matrices. We used the Lewandowski, Kurowicka and Joe vines construction, which gives you (k choose 2) parameters for a correlation matrix and another k for a covariance matrix. I’m becoming quite adept at complex Jacobians (we also needed to finesse simplexes to k-1 unconstrained values and ordered scalars for ordinal logit cutpoints).

We also had to implement our own log probability functions. We needed something like normal(y,mu,sigma) to have any or all of y, mu and sigma be auto-dif variables. The standard libs just have a single template variable, which is again wasteful. We also added lots of other useful functions like log_sum_exp, softmax, log gamma, and so on.
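The sort of signature we wanted can be sketched with independently templated arguments (illustrative only, not our actual code), so any mix of plain doubles and auto-dif types works without promoting the others:

```cpp
#include <cassert>
#include <cmath>

// Normal log density with y, mu, and sigma templated separately, so each
// can independently be a double or an auto-dif type. A single template
// parameter would force all three to the auto-dif type, which is wasteful.
template <typename T1, typename T2, typename T3>
auto normal_log(const T1& y, const T2& mu, const T3& sigma) {
  using std::log;  // lets auto-dif types supply their own log via ADL
  const double PI = 3.14159265358979323846;
  auto z = (y - mu) / sigma;
  return -0.5 * z * z - log(sigma) - 0.5 * std::log(2.0 * PI);
}
```

With all-double arguments, `normal_log(0.0, 0.0, 1.0)` is just -0.5·log(2π) ≈ -0.9189; an auto-dif type in any slot propagates derivatives only through that argument.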