Cross-Validation isn’t Enough for “Empirical” Bayes Prior Estimation

Andrew set me straight the other day on why cross-validation isn't enough to estimate priors in hierarchical models.

I naively thought you could just run cross-validation and choose the prior that works best overall. By best overall, I meant the one that leads to the lowest cross-validation error, or equivalently, the highest held-out predictive density. Suppose we have K folds of data d[1],…,d[K]. We are interested in finding the prior parameter vector β that maximizes

    Σk log p(d[k] | d[1..k-1, k+1..K], β),

where the probability density function p is the usual predictive inference, which integrates over model parameters θ given the held-out data d[1..k-1, k+1..K] and prior β to predict d[k],

    p(d[k] | d[1..k-1, k+1..K], β)
        = ∫_Θ p(d[k] | θ) p(θ | d[1..k-1, k+1..K], β) dθ
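In a conjugate setting the integral has a closed form, which makes the objective concrete. Here's a minimal sketch for a toy model I'm assuming for illustration (not from the post): observations are Normal(θ, σ²) with known σ² = 1, the prior on θ is Normal(m, τ²), and each fold is a single observation, so the fold-k predictive is itself a normal density. The function name and data are hypothetical.

```python
import math

def log_predictive(x_k, rest, m, tau2, sigma2=1.0):
    """Log of p(d[k] | rest, beta) for a conjugate normal model:
    x ~ Normal(theta, sigma2), theta ~ Normal(m, tau2), where the
    held-out fold d[k] is the single point x_k."""
    # Posterior over theta given the other folds' data (conjugate update).
    prec = 1.0 / tau2 + len(rest) / sigma2
    post_mean = (m / tau2 + sum(rest) / sigma2) / prec
    post_var = 1.0 / prec
    # Integrating the likelihood against this posterior gives
    # Normal(x_k; post_mean, sigma2 + post_var).
    v = sigma2 + post_var
    return -0.5 * math.log(2 * math.pi * v) - (x_k - post_mean) ** 2 / (2 * v)

data = [0.3, -1.2, 0.8, 1.5, -0.4]  # made-up data for illustration
print(log_predictive(data[0], data[1:], m=0.0, tau2=1.0))
```

Summing this over the K folds gives the cross-validation objective above, now as an explicit function of the prior parameters (m, τ²).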

Andrew pointed out that the degenerate prior with mean set to the corpus maximum likelihood estimate (let’s call it θ*) and zero variance works the best. That’s because θ* is the parameter that maximizes the corpus likelihood:

    Σk log p(d[k] | θ*)

and with a zero-variance prior β with mean θ*, the integral over parameters θ works out to

    p(d[k] | d[1..k-1, k+1..K], β) = p(d[k] | θ*)
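This collapse can be checked numerically in a toy conjugate setting I'm assuming for illustration (not from the post): Normal(θ, 1) observations, prior θ ~ Normal(m, τ²), one observation per fold. With τ² = 0 the posterior is a point mass at m, so setting m to the corpus MLE θ* makes the cross-validation score equal the corpus log likelihood exactly. All names and data below are hypothetical.

```python
import math

def cv_score(data, m, tau2, sigma2=1.0):
    """Sum of leave-one-out log predictive densities under the prior
    theta ~ Normal(m, tau2); tau2 = 0 is the degenerate prior."""
    total = 0.0
    for k, x in enumerate(data):
        rest = data[:k] + data[k + 1:]
        if tau2 == 0.0:
            # Degenerate prior: the posterior is a point mass at m,
            # so the data in the other folds are ignored entirely.
            post_mean, post_var = m, 0.0
        else:
            prec = 1.0 / tau2 + len(rest) / sigma2
            post_mean = (m / tau2 + sum(rest) / sigma2) / prec
            post_var = 1.0 / prec
        v = sigma2 + post_var  # predictive variance
        total += -0.5 * math.log(2 * math.pi * v) - (x - post_mean) ** 2 / (2 * v)
    return total

data = [0.3, -1.2, 0.8, 1.5, -0.4]      # made-up data for illustration
theta_star = sum(data) / len(data)      # corpus MLE of the mean
corpus_ll = sum(-0.5 * math.log(2 * math.pi) - (x - theta_star) ** 2 / 2
                for x in data)
# The degenerate prior at theta* reproduces the corpus log likelihood.
print(cv_score(data, m=theta_star, tau2=0.0), corpus_ll)
```

Because the zero-variance prior ignores the held-out folds, every fold is scored by the same plug-in density p(d[k] | θ*), which is exactly the corpus likelihood term.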

The problem is clear here. But I'm not sure the prior with mean θ* and zero variance is always optimal. Isn't it possible that with a different prior β′, and by carefully carving up the folds, we could estimate a better θ[k] for each fold k and hence better overall performance? I think it may be possible because we're only trying to predict one fold at a time; collectively, though, we may not be able to win. I could try to work out some examples, but instead I'll ask: are there any statisticians in the audience?

One Response to “Cross-Validation isn’t Enough for “Empirical” Bayes Prior Estimation”

  1. sth4nth Says:

    I’m not a statistician, but I am interested in the model selection problem. From my understanding, the conclusion that the prior with mean θ* is optimal can be deduced using the Laplace approximation. Therefore, as long as the posterior more or less resembles a Gaussian, the conclusion is valid.

    Actually, cross-validation has a similar effect to AIC. One disadvantage is that neither is consistent. Therefore, when you have a lot of data, cross-validation may not select the right model.

    I recommend this excellent tutorial on the model selection problem. Hope it helps.
    http://videolectures.net/uai08_grunwald_cup/
