I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.
Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data $y$ and parameters $\theta$, the goal is to find the parameters $\theta^*$ that maximize the likelihood function $p(y \mid \theta)$.
Likelihood and Missing Data
Usually EM is used for latent parameter problems, where there are latent variables $z$ which are treated like missing data, so that the full likelihood function is actually $p(y, z \mid \theta)$. For instance, $z$ might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as $p(y, z \mid \theta) = p(z \mid \theta) \, p(y \mid z, \theta)$.
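To make the factoring concrete, here’s a small sketch (not from the original discussion) of the full log likelihood for a two-component Gaussian mixture, where $z$ holds the component indicators and the weights, means, and scales stand in for $\theta$:

```python
import numpy as np
from scipy.stats import norm

def full_log_likelihood(y, z, weights, means, sds):
    """log p(y, z | theta) = log p(z | theta) + log p(y | z, theta)
    for a Gaussian mixture with component indicators z."""
    log_p_z = np.sum(np.log(weights[z]))                        # log p(z | theta)
    log_p_y_given_z = np.sum(norm.logpdf(y, means[z], sds[z]))  # log p(y | z, theta)
    return log_p_z + log_p_y_given_z

# toy data and made-up parameters
y = np.array([-1.2, 0.1, 3.8, 4.1])
z = np.array([0, 0, 1, 1])   # which component generated each observation
theta = dict(weights=np.array([0.5, 0.5]),
             means=np.array([0.0, 4.0]),
             sds=np.array([1.0, 1.0]))
print(full_log_likelihood(y, z, **theta))
```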
Even though the expectation (E) step of EM computes “expectations” for $z$ given current estimates of $\theta$ and the data $y$, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes $z$ away. Specifically, the maximum likelihood estimate $\theta^*$ is the one that maximizes the likelihood with $z$ marginalized out, $\theta^* = \arg\max_\theta p(y \mid \theta) = \arg\max_\theta \int p(y, z \mid \theta) \, dz$ (a sum rather than an integral when $z$ is discrete, as with mixture indicators).
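For the same kind of mixture, the quantity that matters for convergence is the marginal log likelihood, with the indicators summed out rather than plugged in. A minimal sketch, again with made-up parameters:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def marginal_log_likelihood(y, weights, means, sds):
    """log p(y | theta) = sum_i log sum_k weights[k] * Normal(y[i] | means[k], sds[k]),
    i.e. the likelihood with the component indicators z marginalized out."""
    # log(weights[k] * Normal(y[i] | means[k], sds[k])) for every i and k
    log_joint = np.log(weights) + norm.logpdf(y[:, None], means, sds)
    return logsumexp(log_joint, axis=1).sum()

y = np.array([-1.2, 0.1, 3.8, 4.1])
theta = dict(weights=np.array([0.5, 0.5]),
             means=np.array([0.0, 4.0]),
             sds=np.array([1.0, 1.0]))
print(marginal_log_likelihood(y, **theta))
```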
Monitoring Likelihood or Parameters
There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …
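Either check is easy to code up. Here’s a minimal sketch, assuming hypothetical `em_step` and `log_likelihood` functions for whatever model you’re fitting, with the parameters packed into a numpy array:

```python
import numpy as np

def run_em(theta, y, em_step, log_likelihood,
           tol_ll=1e-6, tol_param=1e-6, max_epochs=1000):
    """Iterate EM until either the change in (marginal) log likelihood or
    the Euclidean change in the parameter vector drops below its tolerance."""
    ll = log_likelihood(y, theta)
    for epoch in range(max_epochs):
        theta_new = em_step(y, theta)          # one E step followed by one M step
        ll_new = log_likelihood(y, theta_new)  # log p(y | theta), z marginalized out
        ll_converged = abs(ll_new - ll) < tol_ll
        param_converged = np.linalg.norm(theta_new - theta) < tol_param
        theta, ll = theta_new, ll_new
        if ll_converged or param_converged:
            break
    return theta, ll
```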
Linearly Separable Data for Logistic Regression
In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter grows toward infinity, the difference its growth makes to the log likelihood diminishes (we’re way out on the extremes of the logistic sigmoid at this point, where the slope’s nearly 0).
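A quick numerical illustration (a toy example, not from the original discussion): with separable data, doubling the coefficient keeps increasing the log likelihood, but by less and less.

```python
import numpy as np

# separable toy data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(beta):
    """Bernoulli log likelihood for logistic regression with a single coefficient."""
    p = 1.0 / (1.0 + np.exp(-beta * x))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

for beta in [1, 2, 4, 8, 16, 32]:
    print(beta, log_likelihood(beta))
# the log likelihood keeps rising as beta grows (it never peaks),
# but the increments shrink toward zero
```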
Convergence with MAP?
Textbooks often don’t mention, whether for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posteriori (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.
Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, MAP estimates balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or to “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.
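Continuing the toy separable example above, adding even a fairly weak normal prior on the coefficient (equivalently, an L2/ridge penalty) gives the objective a finite maximizer; the prior scale here is made up:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_posterior(beta, prior_sd=5.0):
    """Log likelihood plus a Normal(0, prior_sd) log prior (up to a constant)."""
    p = 1.0 / (1.0 + np.exp(-beta * x))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * (beta / prior_sd) ** 2
    return log_lik + log_prior

# the MAP estimate is finite even though the MLE is infinite
result = minimize_scalar(lambda b: -log_posterior(b), bounds=(0.0, 100.0), method="bounded")
print(result.x)   # a finite coefficient balancing prior and likelihood
```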
Maximization Step with Priors
In EM with priors, the maximization step needs to set $\theta^{(n)}$, the parameter estimate in the $n$-th epoch, to the value that maximizes the total probability, $p(y \mid \theta) \, p(\theta)$, given the current “expectation” for the latent parameters $z$ based on the data and the previous epoch’s estimate $\theta^{(n-1)}$. That is, you can’t just set $\theta^{(n)}$ to maximize the likelihood, $p(y \mid \theta)$. There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-Multinomial or Normal-Normal, so this isn’t as hard as it may sound. And often you can get away with merely increasing the objective rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).
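For example, in the Dirichlet-Multinomial case the MAP maximizer has a closed form: the M step just adds the Dirichlet prior counts (minus one) to the expected counts from the E step before normalizing. A sketch, where `expected_counts` stands in for hypothetical E-step output:

```python
import numpy as np

def m_step_multinomial_map(expected_counts, alpha):
    """MAP update for multinomial parameters with a Dirichlet(alpha) prior:
    theta[k] proportional to expected_counts[k] + alpha[k] - 1
    (assumes alpha[k] >= 1 so the posterior mode lies inside the simplex).
    With alpha[k] = 1 (a uniform prior) this reduces to the usual MLE update."""
    unnormalized = expected_counts + alpha - 1.0
    return unnormalized / unnormalized.sum()

# expected counts from a hypothetical E step, plus a weak Dirichlet(2, 2, 2) prior
expected_counts = np.array([10.2, 3.7, 0.1])
alpha = np.array([2.0, 2.0, 2.0])
print(m_step_multinomial_map(expected_counts, alpha))
```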
Convergence with Priors
Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus the log prior, $\log p(y \mid \theta) + \log p(\theta)$, not just the log likelihood $\log p(y \mid \theta)$. What EM guarantees is that every iteration increases this sum. If you just monitor the likelihood term $\log p(y \mid \theta)$, you’ll see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.
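In code, the quantity to track is just that sum. A sketch, with the same kind of hypothetical `em_step`, `log_likelihood`, and `log_prior` functions as before:

```python
def run_map_em(theta, y, em_step, log_likelihood, log_prior,
               tol=1e-6, max_epochs=1000):
    """Monitor log p(y | theta) + log p(theta): EM guarantees this sum never
    decreases, while the likelihood term alone may bounce around."""
    objective = log_likelihood(y, theta) + log_prior(theta)
    for epoch in range(max_epochs):
        theta = em_step(y, theta)
        new_objective = log_likelihood(y, theta) + log_prior(theta)
        assert new_objective >= objective - 1e-12   # sanity check on monotonicity
        if new_objective - objective < tol:
            break
        objective = new_objective
    return theta, objective
```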