Andrew set me straight the other day on why cross-validation isn’t enough to estimate priors in hierarchical models.
I naively thought you could just run cross-validation and choose the prior that works the best overall. By best overall, I meant the one that leads to the lowest cross-validation error, that is, the highest held-out log predictive density. Suppose we have K folds of data d[1],…,d[K]. We are interested in finding the prior parameter vector β that maximizes
Σk log p(d[k] | d[1..k-1, k+1..K], β),
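To make that objective concrete, here’s a minimal sketch of the procedure I had in mind, in Python; the `log_predictive` argument is a hypothetical helper standing in for the predictive integral spelled out below.

```python
# Sketch of the cross-validation objective for choosing prior parameters beta:
# sum the held-out log predictive densities over folds.  The log_predictive
# argument is a hypothetical stand-in for the integral defined below.
import numpy as np

def cv_objective(folds, beta, log_predictive):
    """Return the sum over k of log p(d[k] | d[1..k-1, k+1..K], beta)."""
    total = 0.0
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        total += log_predictive(held_out, train, beta)
    return total
```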
where the probability density function p is the usual predictive inference, which integrates over model parameters θ given the data d[1..k-1, k+1..K] and prior β to predict d[k],
p(d[k] | d[1..k-1, k+1..K], β)
= ∫Θ p(d[k] | θ) p(θ | d[1..k-1, k+1..K], β) dθ
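In general that integral has no closed form, but it can be approximated by averaging the held-out likelihood over posterior draws. Here’s a sketch for a toy conjugate model I’m making up purely for illustration: each data point is Normal(θ, 1) and the prior on θ is Normal(μ0, τ0²), so β = (μ0, τ0) and the posterior is easy to sample. The function name `log_predictive` is my own.

```python
# Monte Carlo estimate of log p(held_out | train, beta) for a toy conjugate
# model: x_i ~ Normal(theta, sigma^2) with prior theta ~ Normal(mu0, tau0^2).
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def log_predictive(held_out, train, beta, num_draws=5000, sigma=1.0, seed=0):
    mu0, tau0 = beta
    rng = np.random.default_rng(seed)
    # conjugate posterior p(theta | train, beta) is normal
    post_prec = 1.0 / tau0**2 + len(train) / sigma**2
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu0 / tau0**2 + train.sum() / sigma**2)
    theta = rng.normal(post_mean, np.sqrt(post_var), size=num_draws)
    # joint log likelihood of the held-out fold under each posterior draw
    log_lik = stats.norm.logpdf(held_out[:, None], loc=theta, scale=sigma).sum(axis=0)
    # average the likelihoods over draws, on the log scale
    return logsumexp(log_lik) - np.log(num_draws)
```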
Andrew pointed out that the degenerate prior with mean set to the corpus maximum likelihood estimate (let’s call it θ*) and zero variance works the best. That’s because θ* is the parameter value that maximizes the corpus likelihood
Σk log p(d[k] | θ),
and with a zero-variance prior β with mean θ*, the integral over parameters θ works out to
p(d[k] | d[1..k-1, k+1..K], β) = p(d[k] | θ*),

so the cross-validation objective at this degenerate β is exactly the maximized corpus likelihood.
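For the toy normal model above, this collapse is easy to check numerically by reusing the two sketches (the data here are synthetic, just for illustration): center a nearly zero-variance prior at the corpus mean, which is the MLE θ* in that model, and the held-out log score matches the corpus log likelihood at θ*.

```python
# Numerical check that a near-degenerate prior at theta* recovers the corpus
# log likelihood (reuses cv_objective and log_predictive from the sketches).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
folds = [rng.normal(0.3, 1.0, size=20) for _ in range(5)]
corpus = np.concatenate(folds)
theta_star = corpus.mean()          # corpus MLE in the toy normal model

direct = stats.norm.logpdf(corpus, loc=theta_star, scale=1.0).sum()
beta = (theta_star, 1e-6)           # tiny tau0 stands in for zero variance
cv = cv_objective(folds, beta, log_predictive)
print(direct, cv)                   # should agree up to Monte Carlo error
```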
The problem is clear here. But I’m not sure the prior with mean θ* and zero variance is always optimal. Isn’t it possible that with a different prior β’ and a careful carving up of the folds, we could estimate a better θ[k] for each fold k and hence get better overall performance? I think it may be possible because we’re only trying to predict one fold at a time, but collectively we may not be able to win. I could try to work out some examples, but instead I’ll ask: are there any statisticians in the audience?