Andrew set me straight the other day on why cross-validation isn’t enough to estimate priors in hierarchical models.

I naively thought you could just run cross-validation and choose the prior that works best overall. By best overall, I meant the one that leads to the lowest cross-validation error, measured here as held-out log loss. Suppose we have K folds of data d[1],…,d[K]. We are interested in finding the prior parameter vector β that maximizes

Σ_{k} log p(d[k] | d[1..k-1, k+1..K], β),

where the probability density function p is the usual posterior predictive density, which integrates over model parameters θ given the data d[1..k-1, k+1..K] and prior β to predict d[k],

p(d[k] | d[1..k-1, k+1..K], β)

= ∫_{Θ} p(d[k] | θ) p(θ | d[1..k-1, k+1..K], β) dθ

Andrew pointed out that the degenerate prior with mean set to the corpus maximum likelihood estimate (let’s call it θ^{*}) and zero variance works best. That’s because θ^{*} is, by definition, the parameter that maximizes the corpus log likelihood:

Σ_{k} log p(d[k] | θ^{*})

and with a zero-variance prior β with mean θ^{*}, the integral over parameters θ works out to

p(d[k] | d[1..k-1, k+1..K], β) = p(d[k] | θ^{*})
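As a quick sanity check of this collapse, here is a minimal sketch in Python. It assumes a toy conjugate model that is not from the discussion above: observations x ~ Normal(θ, 1) with known unit variance, a Normal(prior_mean, prior_var) prior on θ, and leave-one-out folds. The data values and the helper names (`normal_logpdf`, `loo_log_score`) are made up for illustration.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of Normal(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def loo_log_score(data, prior_mean, prior_var):
    """Sum over k of log p(d[k] | other folds, prior), with one point per fold.

    Assumed toy model: x ~ Normal(theta, 1), theta ~ Normal(prior_mean, prior_var).
    The posterior predictive is then Normal(post_mean, 1 + 1 / post_prec).
    """
    n = len(data)
    data_sum = sum(data)
    total = 0.0
    for x in data:
        train_sum = data_sum - x                      # held-out point removed
        post_prec = 1.0 / prior_var + (n - 1)         # posterior precision of theta
        post_mean = (prior_mean / prior_var + train_sum) / post_prec
        total += normal_logpdf(x, post_mean, 1.0 + 1.0 / post_prec)
    return total

# Hypothetical data, chosen only for illustration.
data = [0.3, -1.2, 2.1, 0.8, 1.5, -0.4]
theta_star = sum(data) / len(data)                    # MLE of theta on the full corpus

# Corpus log likelihood at theta*, i.e. the zero-variance limit.
plug_in = sum(normal_logpdf(x, theta_star, 1.0) for x in data)

# Shrink the prior variance toward zero while keeping the prior mean at theta*.
prior_vars = [100.0, 1.0, 0.01, 1e-8]
scores = [loo_log_score(data, theta_star, v) for v in prior_vars]

for v, s in zip(prior_vars, scores):
    print(f"prior variance {v:g}: leave-one-out log score {s:.4f}")
print(f"plug-in log likelihood at theta*: {plug_in:.4f}")
```

In this toy setup the cross-validation score climbs toward the plug-in log likelihood as the prior variance shrinks, which is the degenerate behavior Andrew described. It is one conjugate example with leave-one-out folds, not an argument about hierarchical models in general.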

The problem is clear here. But I’m not sure the prior with mean θ^{*} and zero variance is always optimal. Isn’t it possible that we could use a different prior β’ and by carefully carving up the folds find that we could estimate a better θ_{k} for a fold k and hence a better overall performance? I think it may be possible because we’re only trying to predict one fold at a time. But then collectively we may not be able to win. I could try to work out some examples, but instead I’ll ask, are there any statisticians in the audience?

May 5, 2009 at 8:25 pm

I’m not a statistician, but I am interested in the model selection problem. As I understand it, the conclusion that the prior with mean θ^{*} is optimal can be derived using the Laplace approximation. So as long as the posterior is roughly Gaussian, the conclusion holds.

Actually, cross-validation has a similar effect to AIC. One disadvantage is that neither is consistent, so even when you have a large amount of data, cross-validation may not select the right model.

I recommend this excellent tutorial on the model selection problem. I hope it helps.

http://videolectures.net/uai08_grunwald_cup/