I’ve been talking for a while about using Bayesian hierarchical models of data annotation to generate probabilistic corpora. It’s really easy to extend stochastic gradient descent (SGD) optimization algorithms to handle probabilistic corpora.
I’m going to consider maximum likelihood estimation for binary logistic regression, but the same thing can be done for conditional random field (CRF) taggers, support vector machine (SVM) classifiers, or perceptron classifiers. I’ll discuss priors and multinomials after the main algorithm.
The Training Data
Suppose we have $N$ labeled training instances, each consisting of a $D$-dimensional input vector $x_n$ and a probabilistic outcome $y_n \in [0,1]$:
double[N][D] x;  // inputs
double[N] y;     // outputs
A non-probabilistic corpus constrains $y_n$ to be either 0 or 1.
Binary logistic regression for $D$-dimensional inputs is parameterized by a $D$-dimensional coefficient vector $\beta$. The probability of a 1 outcome for a $D$-dimensional input $x$ is

$\mathrm{Pr}(y = 1 \mid x, \beta) = \mathrm{logit}^{-1}(\beta^{\top} x) = \frac{1}{1 + \exp(-\beta^{\top} x)}.$
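In code, that’s just the inverse logit (sigmoid) of the linear predictor. Here’s a minimal Java sketch, with static helpers assumed to live in one class; the names inverseLogit and prob1 are mine for illustration, not LingPipe’s API:

// Sketch (illustrative names): binary logistic regression probability.
static double inverseLogit(double z) {
    return 1.0 / (1.0 + Math.exp(-z));   // sigmoid, maps a real to (0,1)
}

static double prob1(double[] beta, double[] x) {
    double dot = 0.0;
    for (int d = 0; d < beta.length; ++d)
        dot += beta[d] * x[d];           // beta . x
    return inverseLogit(dot);            // Pr(y = 1 | x, beta)
}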
Maximizing Log Likelihood
Our goal is to find the parameter $\beta$ that maximizes the training data likelihood

$\mathcal{L}(\beta) = \prod_{n=1}^{N} \mathrm{Pr}(y = 1 \mid x_n, \beta)^{y_n} \, \mathrm{Pr}(y = 0 \mid x_n, \beta)^{1 - y_n}.$

With a 0/1 corpus each instance contributes one of the two factors; with probabilistic outcomes each instance contributes a weighted combination of both, so the log likelihood is a cross-entropy. The maximum likelihood estimate is thus

$\beta^{*} = \arg\max_{\beta} \, \mathcal{L}(\beta).$
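For reference, here’s the log likelihood of a probabilistic corpus in code, using the prob1() helper sketched above. All names are illustrative:

// Sketch: log likelihood of probabilistic training data under
// binary logistic regression; prob1() as sketched above.
static double logLikelihood(double[] beta, double[][] x, double[] y) {
    double ll = 0.0;
    for (int n = 0; n < y.length; ++n) {
        double yHat = prob1(beta, x[n]);               // Pr(y = 1 | x[n], beta)
        ll += y[n] * Math.log(yHat)                    // weight on outcome 1
            + (1.0 - y[n]) * Math.log(1.0 - yHat);     // weight on outcome 0
    }
    return ll;
}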
Stochastic Gradient Descent
SGD finds the maximum likelihood estimate $\beta^{*}$ given training data $x$ and $y$. For simplicity, we’ll only consider a fixed learning rate $\lambda > 0$, though the algorithm obviously generalizes to annealing schedules or other learning rate enhancement schemes. We’ll let $E$ be the number of epochs for which to run training, though you could also measure convergence within the algorithm.
β = 0                                                 // initialize coefficients to zero
for (e in 1:E)                                        // epochs
    for (n in 1:N)                                    // training instances
        yHat[n] = inverseLogit(dotProduct(β, x[n]))   // predicted Pr(y = 1)
        err = y[n] - yHat[n]                          // residual; y[n] may be fractional
        β += λ * err * x[n]                           // gradient step
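Written out as plain Java, assuming the prob1() helper from above (a sketch, not LingPipe’s implementation):

// Sketch: SGD for binary logistic regression with probabilistic outcomes.
// lambda is the fixed learning rate, epochs (E) the number of passes.
static double[] sgd(double[][] x, double[] y, double lambda, int epochs) {
    int D = x[0].length;
    double[] beta = new double[D];                    // beta = 0
    for (int e = 0; e < epochs; ++e)
        for (int n = 0; n < x.length; ++n) {
            double yHat = prob1(beta, x[n]);          // predicted Pr(y = 1)
            double err = y[n] - yHat;                 // may be fractional
            for (int d = 0; d < D; ++d)
                beta[d] += lambda * err * x[n][d];    // gradient step
        }
    return beta;
}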
Easy peasy, as they say in the old country.
Priors don’t actually interact with the probabilistic training. So you just add them in as usual. I’d recommend either blocking them (as I did for CRFs in LingPipe) to update every few hundred training instances, or using lazy updates (as I did for logistic regression in LingPipe).
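For example, a Gaussian (L2) prior with variance sigma2 just shrinks each coefficient toward zero on every update. Here’s a per-instance version, with the prior gradient scaled by 1/N so the full prior is applied once per epoch; blocked or lazy updates amortize this cost, as mentioned above. A sketch with illustrative names, not LingPipe’s code:

// Sketch: one SGD step with a Gaussian (L2) prior, applied per instance.
static void stepWithPrior(double[] beta, double[] xn, double yn,
                          double lambda, double sigma2, int N) {
    double err = yn - prob1(beta, xn);                   // residual
    for (int d = 0; d < beta.length; ++d)
        beta[d] += lambda * (err * xn[d]                 // likelihood gradient
                             - beta[d] / (sigma2 * N));  // prior gradient / N
}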
It’s also easy to generalize to multinomials, where the data consists of a probability for each of K outcomes. You just add an inner loop over the outcome categories (all K of them, or K-1 if you fix one parameter vector at zero), and the error term for each category is calculated against that category’s probability.
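Concretely, here’s what one update step looks like for the multinomial (softmax) case with K parameter vectors and probabilistic outcomes yn[k] that sum to 1 over k; drop one vector (fix it at zero) for the K-1 parameterization. Again a sketch with illustrative names:

// Sketch: one SGD step for multinomial logistic regression with
// probabilistic outcomes yn[k] summing to 1 over the K categories.
static void multinomialStep(double[][] beta, double[] xn, double[] yn,
                            double lambda) {
    int K = beta.length, D = xn.length;
    double[] score = new double[K];
    double max = Double.NEGATIVE_INFINITY;
    for (int k = 0; k < K; ++k) {
        for (int d = 0; d < D; ++d)
            score[k] += beta[k][d] * xn[d];              // linear predictor
        max = Math.max(max, score[k]);
    }
    double sum = 0.0;
    for (int k = 0; k < K; ++k) {
        score[k] = Math.exp(score[k] - max);             // stabilized softmax
        sum += score[k];
    }
    for (int k = 0; k < K; ++k) {
        double err = yn[k] - score[k] / sum;             // probabilistic error term
        for (int d = 0; d < D; ++d)
            beta[k][d] += lambda * err * xn[d];          // gradient step
    }
}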