## Monitoring Convergence of EM for MAP Estimates with Priors

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimtation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data $y$ and parameters $\theta$, the goal is to find the parameters $\theta^*$ that maximize the likelihood function $p(y|\theta)$.

### Likelihood and Missing Data

Usually EM is used for latent parameter problems, where there are latent variables $z$ which are treated like missing data, so that the full likelihood function is actually $p(y,z|\theta)$. For instance, $z$ might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as $p(y,z|\theta) = p(z|\theta) \times p(y|z,\theta)$.

Even though the expectation (E) step of EM computes “expectations” for $z$ given current estimates of $\theta$ and the data $y$, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes $z$ away. Specifically, the maximum likelihood estimate $\theta^*$ is the one that maximizes the likelihood with $z$ marginalized out, $p(y|\theta) = \int p(y,z|\theta) \times p(z|\theta) \ dz$.

### Monitoring Likelihood or Parameters

There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

### Linearly Separable Data for Logistic Regression

In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter approaches infinity, the difference its (absolute) growth makes to log likelihood diminishes (we’re way out on the extremes of the logistic sigmoid at this point, where the slope’s nearly 0).

### Convergence with MAP?

Textbooks often don’t mention, either for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posterior (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, maximum a posteriori (MAP) estimates will balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.

### Maximization Step with Priors

In EM with priors, the maximization step needs to set $\theta^{(n)}$, the parameter estimate in the $n$-th epoch, to the value that maximizes the total probability, $\log p(y|\theta) + \log p(\theta)$, given the current “expectation” for the latent parameters $z$ based on the the data and previous epoch’s estimate of $\theta$. That is, you can’t just set $\theta^{(n)}$ to maximize the likelihood, $\log p(y|\theta)$. There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-Multinomial or Normal-Normal, so this isn’t as hard as it may sound. And often you can get away with increasing it rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).

### Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus prior, $\log p(y|\theta) + \log p(\theta)$, not just the log likelihood $p(y|\theta)$. What EM guarantees is that every iteration increases this sum. If you just monitor the likelihood term $p(y|\theta)$, you’ll see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.

### 4 Responses to “Monitoring Convergence of EM for MAP Estimates with Priors”

1. Mark Johnson Says:

Hi Bob,

Your right, of course. With hidden variables the likelihood surface can have lots of local maxima, and different “tricks” work best on different problems.

I suspect most people think of the likelihood surface as several mountain peaks, but it can be much more complicated! (Or perhaps the right comment is that 10,000-dimensional mountain ranges are more complicated!). Many of our problems are non-identifiable (or close to non-identifiable) which means the likelihood surface contains ridges, plateaus and mesas — the EM theorems guarantee convergence in likelihood, not parameters, and for these problems you shouldn’t expect convergence in parameter space!

Another observation is that most people don’t run EM for enough iterations to get anything like convergence. In some experiments on EM for HMM POS tagging we showed that EM may need hundreds of iterations to approach convergence.

Best,

Mark

• Bob Carpenter Says:

One is the Very Important Problem, which is that even if you converge, there’s no guarantee your local maximum is a global maximum (MLE or MAP). I was just trying to point out what it is you’re measuring for converging to a LOCAL optimum. [This can also be a problem for Bayesian sampling with Gibbs or other MCMC approaches — you need to get in the neighborhood of all the modes to cover the posterior well.]

I really like the discussion in David MacKay’s book, which points out very clearly that in the standard Gaussian mixture example, there is no maximum likelihood point, because likelihood is unbounded when there is a single element in a cluster and variance goes to zero.

The second point, that you only get convergence in likelihood, was news to me. I see where it’s a problem with non-identified problems, like perfectly correlated predictors for a regression problem. I’m now wondering if the coefficients will converge along with the likelihood at a local maximum in practice, but they just won’t do so stably. EM’s deterministic once you have inits and have started to climb.

Lots of people stop EM short to do a kind of poor-man’s regularization. For some reason, this always annoys my inner neat freak. I’d rather they fix the model then fit it rather than mixing half model, half heuristic.

I’m in the midst of learning Python and am about to roll out a Python-based annotation inference package. I implemented EM first because I’m working with Andrey Rzhetsky at U. Chicago on this and he wanted to see EM. It’s turning out to be unstable for ordinal annotation problems in that if two items are close on the ordinal scale and hence easily confusible, EM tends to drive prevalence in a winner-take-all direction. I’m hoping the Gibbs sampler I write next will be a bit more stable in spreading out the wealth.

2. Robert Chow Says:

An interesting read, and definately a lot more easier to understand than reading articles. In regards to the last section, the convergence with priors, is there a particular source article(s) that you got this information from?

• Bob Carpenter Says:

Pretty much anything that covers the theory gets you there, such as Gelman et al.’s Bayesian Data Analysis. But even the Wikipedia entry for EM covers MAP estimation. And it has links to other papers with the theory. Or you can read Rubin et al.’s original paper, which isn’t that hard to follow.

I have since learned that log likelihood is guaranteed to converge, even if you have something like a separable logistic regression problem where the coefficients don’t converge. As the coefficients head off to positive (or negative) infinity [because the problem’s separable], the log likelihood converges.

It’s also important to keep in mind that if you’re doing something like a latent mixture model (the usual poster child app for EM), the latent parameters are not considered data — they get marginalized out. Again, the Wikipedia’s presentation is pretty good in the theory section here.