## Author Archive

### Monitoring Convergence of EM for MAP Estimates with Priors

January 4, 2011

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data $y$ and parameters $\theta$, the goal is to find the parameters $\theta^*$ that maximize the likelihood function $p(y|\theta)$.

### Likelihood and Missing Data

Usually EM is used for latent parameter problems, where there are latent variables $z$ which are treated like missing data, so that the full likelihood function is actually $p(y,z|\theta)$. For instance, $z$ might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as $p(y,z|\theta) = p(z|\theta) \times p(y|z,\theta)$.

Even though the expectation (E) step of EM computes “expectations” for $z$ given current estimates of $\theta$ and the data $y$, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes $z$ away. Specifically, the maximum likelihood estimate $\theta^*$ is the one that maximizes the likelihood with $z$ marginalized out,

$p(y|\theta) = \int p(y,z|\theta) \ dz = \int p(z|\theta) \times p(y|z,\theta) \ dz$.
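To make the marginalization concrete, here's a minimal sketch for a one-dimensional Gaussian mixture, where $z$ is a discrete component indicator and the integral becomes a sum over components. The function names (`log_sum_exp`, `normal_log_pdf`, `marginal_log_likelihood`) and the choice of a Gaussian mixture are assumptions made for illustration, not something taken from a particular library.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normal_log_pdf(y, mean, sd):
    """Log density of Normal(mean, sd) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((y - mean) / sd) ** 2)


def marginal_log_likelihood(ys, weights, means, sds):
    """log p(y | theta) with the component indicators z summed out.

    For each data point: log sum_k p(z = k | theta) * p(y | z = k, theta).
    """
    total = 0.0
    for y in ys:
        total += log_sum_exp([
            math.log(w) + normal_log_pdf(y, m, s)
            for w, m, s in zip(weights, means, sds)
        ])
    return total
```

This is the quantity whose change from one epoch to the next you'd track, e.g. `marginal_log_likelihood(ys, weights, means, sds)` evaluated after each M step.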

### Monitoring Likelihood or Parameters

There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

### Linearly Separable Data for Logistic Regression

In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter approaches infinity, the difference its (absolute) growth makes to log likelihood diminishes (we’re way out on the extremes of the logistic sigmoid at this point, where the slope’s nearly 0).
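Here's a small sketch of that behavior, assuming a made-up four-point data set that's separable at zero, a single coefficient with no intercept, and plain gradient ascent with a fixed step size. All of the names and numbers are hypothetical; the point is only that the coefficient keeps creeping upward while the gains in log likelihood shrink.

```python
import math

# Hypothetical toy data: every negative x has y = 0, every positive x has y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]


def log_likelihood(beta):
    """Log likelihood of a one-coefficient logistic regression (no intercept)."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = beta * x if y == 1 else -beta * x
        total += -math.log1p(math.exp(-t))  # log sigmoid(t)
    return total


def gradient(beta):
    """Derivative of the log likelihood with respect to beta."""
    g = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        g += (y - p) * x
    return g


def ascend(num_steps, step_size=1.0):
    """Run gradient ascent from beta = 0, recording (beta, log likelihood)."""
    beta = 0.0
    history = [(beta, log_likelihood(beta))]
    for _ in range(num_steps):
        beta += step_size * gradient(beta)
        history.append((beta, log_likelihood(beta)))
    return history
```

A stopping rule based on the change in `beta` keeps seeing movement here, whereas one based on the change in log likelihood will eventually fire even though the coefficient is still growing.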

### Convergence with MAP?

Textbooks often don’t mention, either for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posteriori (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, maximum a posteriori (MAP) estimates will balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.
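As a sketch of the stabilizing effect, assume a zero-mean normal prior on a single logistic regression coefficient (the ridge case) and a hypothetical four-point data set that's separable at zero. The log posterior is concave in the coefficient, so its gradient crosses zero exactly once, and simple bisection finds the MAP estimate. The function names, the data, and the bisection bounds are all illustrative assumptions.

```python
import math

# Hypothetical toy data, linearly separable at x = 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]


def sigmoid(t):
    """Logistic sigmoid, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_posterior_gradient(beta, prior_sd):
    """Derivative of log p(y | beta) + log Normal(beta | 0, prior_sd)."""
    lik_grad = sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys))
    prior_grad = -beta / prior_sd ** 2
    return lik_grad + prior_grad


def map_coefficient(prior_sd, lo=0.0, hi=1000.0, num_halvings=200):
    """Bisect on the gradient's sign change; assumes the root lies in [lo, hi]."""
    for _ in range(num_halvings):
        mid = 0.5 * (lo + hi)
        if log_posterior_gradient(mid, prior_sd) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `map_coefficient(10.0)` versus `map_coefficient(1.0)` shows the balance at work: the weaker prior lets the coefficient run further out, but both estimates are finite.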

### Maximization Step with Priors

In EM with priors, the maximization step needs to set $\theta^{(n)}$, the parameter estimate in the $n$-th epoch, to the value that maximizes the total probability, $\log p(y|\theta) + \log p(\theta)$, given the current “expectation” for the latent parameters $z$ based on the data and the previous epoch’s estimate of $\theta$. That is, you can’t just set $\theta^{(n)}$ to maximize the likelihood, $\log p(y|\theta)$. There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-Multinomial or Normal-Normal, so this isn’t as hard as it may sound. And often you can get away with increasing it rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).
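For the Dirichlet-Multinomial case, the analytic maximizer is just the mode of a Dirichlet with the expected counts added to the prior parameters. Here's a minimal sketch for mixture weights; the function name `map_mixture_weights` and its arguments are hypothetical, and it assumes the prior is strong enough that the mode doesn't fall outside the simplex (for example, every $\alpha_k \geq 1$).

```python
def map_mixture_weights(expected_counts, alpha):
    """MAP M step for mixture weights under a Dirichlet(alpha) prior.

    expected_counts[k] is the E step's expected number of items assigned
    to component k. The result is the mode of
    Dirichlet(expected_counts + alpha), i.e. weights proportional to
    expected_counts[k] + alpha[k] - 1.
    """
    numerators = [n + a - 1.0 for n, a in zip(expected_counts, alpha)]
    total = sum(numerators)
    if any(v < 0.0 for v in numerators) or total <= 0.0:
        raise ValueError("mode is not in the interior formula's range; "
                         "this sketch assumes counts + alpha - 1 >= 0")
    return [v / total for v in numerators]
```

With a uniform prior (`alpha = [1.0, 1.0]`) this reduces to the usual maximum likelihood update of normalized expected counts; larger `alpha` values pull the weights toward each other.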

### Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus log prior, $\log p(y|\theta) + \log p(\theta)$, not just the log likelihood $\log p(y|\theta)$. What EM guarantees is that no iteration decreases this sum. If you just monitor the likelihood term $\log p(y|\theta)$, you may see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.
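Putting the pieces together, here's a hedged sketch of MAP EM that stops when the gain in log likelihood plus log prior falls below a tolerance. To keep it short, it assumes a Gaussian mixture whose component means and standard deviations are known, so the only parameters are the mixture weights, with a Dirichlet prior whose parameters are all greater than 1. The names, the tolerance, and the setup are illustrative assumptions rather than a reference implementation.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normal_log_pdf(y, mean, sd):
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((y - mean) / sd) ** 2)


def log_joint_terms(y, weights, means, sds):
    """log p(z = k | theta) + log p(y | z = k, theta) for each component k."""
    return [math.log(w) + normal_log_pdf(y, m, s)
            for w, m, s in zip(weights, means, sds)]


def log_posterior(ys, weights, means, sds, alpha):
    """log p(y | theta) + log p(theta), dropping the prior's constant term."""
    log_lik = sum(log_sum_exp(log_joint_terms(y, weights, means, sds))
                  for y in ys)
    log_prior = sum((a - 1.0) * math.log(w) for a, w in zip(alpha, weights))
    return log_lik + log_prior


def em_map_weights(ys, means, sds, alpha, tol=1e-8, max_epochs=1000):
    """MAP EM for mixture weights; assumes every alpha[k] > 1."""
    k = len(means)
    weights = [1.0 / k] * k
    trace = [log_posterior(ys, weights, means, sds, alpha)]
    for _ in range(max_epochs):
        # E step: expected component counts given the current weights.
        counts = [0.0] * k
        for y in ys:
            terms = log_joint_terms(y, weights, means, sds)
            norm = log_sum_exp(terms)
            for j in range(k):
                counts[j] += math.exp(terms[j] - norm)
        # M step: mode of the Dirichlet(counts + alpha) posterior.
        numerators = [c + a - 1.0 for c, a in zip(counts, alpha)]
        total = sum(numerators)
        weights = [v / total for v in numerators]
        # Monitor log likelihood plus log prior, not the likelihood alone.
        trace.append(log_posterior(ys, weights, means, sds, alpha))
        if trace[-1] - trace[-2] < tol:
            break
    return weights, trace
```

The returned `trace` holds the monitored value after each epoch, so you can check that it never goes down, which is a handy sanity test for an EM implementation.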

### Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application

March 9, 2010

Joshua Mason emailed us with a link to his (with a bunch of co-authors) recent ACM paper “English Shellcode” (http://www.cs.jhu.edu/~sam/ccs243-mason.pdf). Shell code attacks can attempt to seize control of a computer by masquerading as data. The standard defense is to look for tell-tale patterns in the data that reflect the syntax of assembly language instructions. It is sort of like spam filtering. The filter would have to reject strings that looked like:

“7IIQZjJX0B1PABkBAZB2BA2AA0AAX8BBPux”

which would not be too hard if you knew to expect language data.

Mason et al. changed the code generation process so that lots of variants of the injection are tried but filtered against a language model of English based on the text of Wikipedia and Project Gutenberg. The result is an injection attack that looks like:

“There is a major center of economic activity, such as Star Trek, including The Ed Sullivan Show. The former Soviet Union.”

This is way better than I would have thought possible and it is going to be very difficult to filter. It would be interesting to see how automatic essay grading software would score the above. It is gibberish, but sophisticated sounding gibberish.

And it used LingPipe for the language processing.

I am a firm believer in the white hats publicizing exploits before black hats deploy them surreptitiously. This one could be a real problem, however.

Breck