## Convergence for (Stochastic) Gradient Descent (or EM)

One of the most confusing aspects of the stochastic gradient descent (SGD) and expectation maximization (EM) algorithms as usually written is the line that says “iterate until convergence”. For some reason, most presentations don’t indicate how to measure convergence.

There are two common approaches, and perhaps more that I don’t know about:

1. Convergence by Likelihood + Prior

The first common approach to calculate the log likelihood

log p(y|θ)

for the data y given current parameter settings θ, the log prior for the parameters

log p(θ),

then monitor the sum

log p(y|θ) log p(θ)

for convergence.

This approach makes sense intuitively, because it’s the value of the likelihood and prior that’s being maximized by the search algorithms (in loss terms, the negation of the log likelihood and prior is minimized).

2. Convergence by Vector Distance

The second approach is to monitor the coefficient vectors themselves. Typically, this would involve measuring Euclidean (or taxicab) distance between the parameter vectors across epochs.

Note that the vector-based approach is the only one that makes sense for a truly online algorithm.

3. Convergence by Held-Out Evaluation

I forgot to mention this in the first post. You can also use held-out labeled data to evaluate the classifier or tagger being produced, and stop training at the point the error is minimized. The main problem with this approach is that the search problem’s not convex. In practice, people often use this as a form of “early stopping”, whereby training is halted before convergence. As Bishop’s machine learning book shows, this can be viewed as a kind of regularization if the learning rate’s low enough. By stopping early, coefficients haven’t had time to reach their optimal value.

Relative Convergence and Rolling Averages

Algorithms like stochastic gradient (unlike EM) can get into situations where log likelihood actually increases across an epoch. SGD and EM both sometimes makes small steps, then start making bigger steps. So what I’ve done is taken a rolling average, as well as allowing a minimum number of epochs to be set.

Likelihood Converges, Vectors Diverge

If the vectors converge, the likelihood converges, but the converse doesn’t hold for two reasons. First, the vectors can plain old diverge. If the data’s separable and there’s no prior, there is no limit to how big the coefficients can get; the larger the cooefficients get, the higher the likelihood. (For logistic regression, this is because 1/(1+exp(-x)) approaches 0 as x approaches negative infinity and approaches 1 as x approaches infinity.)

The second problem is that the coefficients may not be identified. For instance, if I take maximum likelihood and use a different vector for each category (instead of one fewer than the number of categories), then the maximum a posteriori (MAP) solution isn’t identified in the statistical sense. That is, two different sets of vectors may produce the same result (subtract any vector from all the coefficient vectors and the results don’t change for logistic regression).