Stochastic Gradient Descent for Probabilistic Training Data

I’ve been talking for a while about using Bayesian hierarchical models of data annotation to generate probabilistic corpora. It’s really easy to extend stochastic gradient descent (SGD) optimization algorithms to handle such corpora.

I’m going to consider maximum likelihood estimation for binary logistic regression, but the same thing can be done for conditional random field (CRF) taggers, support vector machine (SVM) classifiers, or perceptron classifiers. I’ll discuss priors and multinomials after the main algorithm.

The Training Data

Suppose we have N labeled training instances, each consisting of a D-dimensional input vector x_n and a probabilistic outcome y_n \in [0,1]:

double[][] x = new double[N][D];  // inputs
double[] y = new double[N];       // outputs, each y[n] in [0,1]

A non-probabilistic corpus constrains y_n to be either 0 or 1.
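
For concreteness, here’s a toy corpus in this representation; the numbers are invented, standing in for the posterior probabilities an annotation model might produce:

double[][] x = { { 1.0, 2.0 }, { 1.0, -0.5 }, { 1.0, 0.3 } };  // N = 3 items, D = 2
double[] y = { 0.9, 0.2, 0.55 };  // Pr(y[n] = 1), not just 0 or 1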

The Model

Binary logistic regression for D-dimensional inputs is parameterized by a D-dimensional vector \beta. The probability of a 1 outcome for a D-dimensional input x_n is

\mbox{Pr}(y_n = 1 | x_n, \beta)  = \mbox{logit}^{-1}(\beta^{\top}x_n),

where

\mbox{logit}^{-1}(z) = \exp(z)/(1 + \exp(z)) = 1/(1 + \exp(-z)).
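
In code, the first form is safe for negative z and the second for non-negative z, so exp() never overflows. A minimal sketch:

// Numerically stable inverse logit: each branch only ever calls
// exp() on a non-positive argument, so it cannot overflow.
static double inverseLogit(double z) {
    if (z >= 0.0)
        return 1.0 / (1.0 + Math.exp(-z));  // second form above
    double e = Math.exp(z);
    return e / (1.0 + e);                   // first form above
}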

Maximizing Log Likelihood

Our goal’s to find the parameter \beta that maximizes the training data log likelihood

\mathcal{L}(\beta;x,y) = \sum_{n = 1}^N \log \mbox{Pr}(y=y_n | x=x_n, \beta).

With probabilistic outcomes y_n \in [0,1], the per-item term generalizes to the expected log likelihood,

\mathcal{L}(\beta;x,y) = \sum_{n = 1}^N \left( y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right), \qquad \hat{y}_n = \mbox{logit}^{-1}(\beta^{\top}x_n),

which reduces to the ordinary log likelihood when y_n is 0 or 1. The maximum likelihood estimate is thus

\beta^* = \arg\max_{\beta} \mathcal{L}(\beta;x,y).
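
If you want to watch this objective converge across epochs, it takes only a few lines to evaluate (a sketch, using the inverseLogit above and a plain dotProduct helper named to match the pseudocode below):

// Expected log likelihood of a probabilistic corpus under beta.
static double logLikelihood(double[][] x, double[] y, double[] beta) {
    double ll = 0.0;
    for (int n = 0; n < x.length; ++n) {
        double yHat = inverseLogit(dotProduct(beta, x[n]));
        ll += y[n] * Math.log(yHat) + (1.0 - y[n]) * Math.log(1.0 - yHat);
    }
    return ll;
}

// Dense dot product.
static double dotProduct(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; ++i)
        sum += a[i] * b[i];
    return sum;
}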

Stochastic Gradient Descent

SGD finds the maximum likelihood estimate \beta^* given training data x, y. For simplicity, we’ll only consider a fixed learning rate \lambda > 0, though the algorithm obviously generalizes to annealing schedules and other learning-rate schemes. We’ll let E be the number of epochs for which to run training, though you could also test for convergence inside the loop.

β = 0                                            // initialize parameters to zeros
for (e in 1:E)                                   // epochs
    for (n in 1:N)                               // training items
        yHat = inverseLogit(dotProduct(β, x[n])) // predicted Pr(y[n] = 1)
        err = y[n] - yHat                        // outcome probability minus prediction
        β += λ * err * x[n]                      // gradient step, scaled by learning rate

Easy peasy, as they say in the old country.
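
For the Java-inclined, here’s the same loop spelled out as a method (a sketch, reusing the inverseLogit and dotProduct helpers sketched above; none of this is LingPipe’s actual implementation):

// SGD for binary logistic regression with probabilistic outcomes.
// lambda is the fixed learning rate; numEpochs plays the role of E.
static double[] train(double[][] x, double[] y, double lambda, int numEpochs) {
    int D = x[0].length;
    double[] beta = new double[D];  // Java zero-initializes the array
    for (int e = 0; e < numEpochs; ++e) {
        for (int n = 0; n < x.length; ++n) {
            double yHat = inverseLogit(dotProduct(beta, x[n]));
            double err = y[n] - yHat;  // valid for any y[n] in [0,1]
            for (int d = 0; d < D; ++d)
                beta[d] += lambda * err * x[n][d];
        }
    }
    return beta;
}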

Priors

Priors don’t interact with the probabilistic training at all, so you just add them in as usual. I’d recommend either blocking the prior updates so they apply every few hundred training instances (as I did for CRFs in LingPipe), or using lazy updates (as I did for logistic regression in LingPipe).
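
To give the flavor of the lazy scheme (a sketch of the idea only, not LingPipe’s code; the method and variable names, and the Gaussian prior variance sigma2, are my own inventions):

// Lazily apply a Gaussian (L2) prior's gradient to dimension d just
// before that feature is used. lastUpdate[d] holds the global item
// count at which dimension d was last regularized; N is the corpus
// size, so the per-item prior gradient is -beta[d] / (sigma2 * N).
// This uses the linear approximation discussed in the comments below.
static void lazyRegularize(double[] beta, long[] lastUpdate, long itemCount,
                           int d, double lambda, double sigma2, int N) {
    long skipped = itemCount - lastUpdate[d];
    if (skipped > 0) {
        beta[d] -= lambda * skipped * beta[d] / (sigma2 * N);
        lastUpdate[d] = itemCount;
    }
}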

Multinomial Classifiers

It’s also easy to generalize to multinomials, where the data consist of a probability for each of K outcomes. You just add an inner loop over the output categories (K or K-1 of them, depending on whether you use K or K-1 parameter vectors for a K-outcome problem), and the error term for each category is the outcome’s probability minus the model’s predicted probability for that outcome, as the sketch below shows.
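
Concretely, here’s what one per-item update looks like with K parameter vectors (a sketch; with K-1 vectors you’d pin one category’s vector at zero, and dotProduct is the helper from above):

// One SGD step for a K-category multinomial logistic model with
// probabilistic labels yn[k] that sum to 1. beta[k] is the parameter
// vector for category k; the softmax is computed inline with the
// usual max-subtraction for numerical stability.
static void multinomialUpdate(double[][] beta, double[] xn, double[] yn,
                              double lambda) {
    int K = beta.length;
    double[] p = new double[K];
    double max = Double.NEGATIVE_INFINITY;
    for (int k = 0; k < K; ++k) {
        p[k] = dotProduct(beta[k], xn);
        if (p[k] > max) max = p[k];
    }
    double sum = 0.0;
    for (int k = 0; k < K; ++k) {
        p[k] = Math.exp(p[k] - max);
        sum += p[k];
    }
    for (int k = 0; k < K; ++k) {
        double err = yn[k] - p[k] / sum;  // label prob minus predicted prob
        for (int d = 0; d < xn.length; ++d)
            beta[k][d] += lambda * err * xn[d];
    }
}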

4 Responses to “Stochastic Gradient Descent for Probabilistic Training Data”

  1. Joseph Turian Says:

    Bob,

    If you are using an L2 prior (Gaussian), why not apply the L2 penalty at every stochastic update? I do that all the time.

    • lingpipe Says:

      The reason not to apply L2 at each instance at which the likelihood gradient is updated is that it will dominate computation if the instance vectors are at all sparse.

      In a prior version of LingPipe, I used lazy updates, where you keep track of how many instances it has been since the last regularization for a feature you’re about to use, then apply the prior before you use it (multiplied by the number of instances since the last update, divided by the total number of instances).

      It’s an exact solution equivalent to the stochastic update for L1. You need to do something more clever to do L2 or other non-constant gradient priors because of the non-additivity of consecutive updates (you need to account for the equivalent of compound interest). So we just approximate by applying the number of instances since last update divided by the total number of instances times the gradient (instead of 1 over the number of instances times gradient).

      Recently, we’ve been scaling to problems with large numbers of categories (200 plus) and medium-sized corpora (150K items), and found that the sparse updates completely dominate compute time with lots of categories. So I just swapped out the lazy update for a blocked one, where every N steps you just apply all the priors (again with a linear approximation: block size divided by number of training instances, times learning rate, times gradient).

  2. Joseph Turian Says:

    Ah, got it.

    Yes, you are right, when the instance is sparse, the penalty leads to a dense gradient.

    I don’t think you are correct about the equivalence to the stochastic update for L1, because the result can change depending on whether you cross zero and perhaps clip the value to 0.

    I am not sure if you are talking about this technique, but you can also try applying the penalty when the feature value is non-zero. This leads to more frequent applications of the penalty for more frequent features (scaled by the number of time steps since the last penalty update, so that more frequent features are not overpenalized).

    I don’t understand the difference between your “lazy” approach and the “blocked” one. I also don’t understand what the number of outputs (categories) has to do with the input sparsity.

    • lingpipe Says:

      By “blocked”, I mean the same thing as in the Langford et al. paper on truncated gradient. That is, every N (say 500) items or so, I apply the gradient at 500 times its stochastic strength to all features. But I only block the prior. They also block the likelihood gradient by evaluating all N items in parallel and summing their gradients.

      By “lazy”, I mean only applying regularization/prior gradient to features that are in the item about to be handled. I keep track of how many items it’s been since the gradient was applied and scale accordingly. Or at least we used to.

      If the L1 penalty clips in a stochastic update, it’ll clip in the lazy update. I update all the features’ priors before using an item with that feature.

      The reason lots of categories (say K=200) matters is that when you do the stochastic update, it touches 200 parameters per feature. When features are very common (as they are with character n-grams), you wind up doing lots more work even with the lazy approach than with the blocked approach.

      Of course, your mileage (kilometerage?) may vary depending on the sparsity structure of the problem.
