NYC NLP/ML Group

December 28, 2009 by lingpipe

Joseph Turian’s back in town and has started a mailing list/newsgroup and Twitter feed for natural language processing (NLP) and machine learning (ML) in New York City (NYC):

Joseph’s trying to organize something like the Foo Camps for NLP and ML programming in NY. If you’re interested, drop him a line, or wait for the announcement.

P.S. Thanks to Joseph for not requiring registration for reading!

P.P.S. Check out some of Joseph’s latest work surveying features for named entity recognition.

Multimodal Priors for Binomial Data

December 23, 2009 by lingpipe

Andrey Rzhetsky gave me some great feedback on my talk at U. Chicago. I want to address one of the points he rasied in this blog post:

What if the prior is multimodal?

This is a reasonable question. When you look at the Mechanical Turk annotation data we have, sensitivities and specificities are suggestively multimodal, with the two modes representing the spammers, and what I’ve decided to call “hammers” (because the opposite of “spam” is “ham” in common parlance).

Here’s a replay from an earlier post, Filtering Out Bad Annotators:

The center of each circle represents the sensitivity and specificity of an annotator relative to the gold standard from Snow et al.’s paper on Mech Turk for NLP.

Beta Priors are Unimodal or Amodal

But a beta distribution \mbox{\sf Beta}(\alpha,\beta) has a single mode at (\alpha-1)/(\alpha + \beta - 2) in the situation where \alpha, \beta > 1, but has no modes if \alpha \leq 1 or \beta \leq 1.

Mixture Priors

One possibility would be to use a mixture of two beta distributions, where for \alpha_1,\beta_1,\alpha_2,\beta_2 > 0 and 0 \leq \lambda \leq 1, we define:

p(\theta|\alpha_1,\beta_1,\alpha_2,\beta_2,\lambda) = \lambda \ \mbox{\sf Beta}(\theta|\alpha_1,\beta_1) + (1 - \lambda) \ \mbox{\sf Beta}(\theta|\alpha_2,\beta_2).

Hierarchical Fit

We can fit the mixture component \lambda along with the parameters of the beta mixture components in the usual way. It’s just one more variable to sample in the Gibbs sampler.

Hierarchical Priors

On the other hand, if you consider baseball batting data, there would be at least two modes in the prior for batting average corresponding to fielding position (i.e. shortstops and second-basemen typically don’t bat as well as outfielders and I imagine there’s more variance in batting ability for infielders). If we didn’t know the fielding position, it’d make sense to use a mixture prior. But if we knew the fielding position, we’d want to create a hierarchical model with a position prior nested inside of a league prior.

Coding Inference Talk at University of Chicago, Wed 16 Dec 2009

December 13, 2009 by lingpipe

Update: It’s still Wednesday, but the right day is 16 December.

This Wednesday, I (Bob) will be giving a talk at the brand-spanking new Knapp Center for Biomedical Discovery at the University of Chicago.

When:
2 PM, Wed 16 Dec 2009

Where:
Knapp Center for Biomedical Discovery (KCBD)
10th Floor, South Conference Room
900 East 57th Street
University of Chicago

Multilevel Models of Coding and Diagnosis
with Multiple Tests and No Gold Standards

Bob Carpenter, Alias-i

I’ll introduce multilevel generalizations of some well known models drawn from the epidemiology literature and evaluate their fit to diagnostic testing and linguistic annotation data. The analogy is that data annotation for machine learning (or, e.g., ICD-9 coding) is the same kind of process as diagnostic testing.

The observed data consists of multiple testers supplying results for multiple tested units. True labels for some units may be known (perhaps with selection bias), and not all units need to undergo every test.

In all models, there are parameters for outcome prevalence, the result for each unit, and some form of test accuracy. I’ll also consider models with parameters for individual test accuracy and bias (equivalently sensitivity and specificity in the binary case) and item difficulty/severity.

I’ll focus on the priors for annotator accuracy/bias and item difficulty, showing how diffuse hyperpriors allow them to be effectively inferred along with the other parameters using Gibbs sampling. The posterior samples may be used for inferences for diagnostic precision, multiple comparisons of test accuracies, population prevalence, unit-level labels, etc.

I’ll show that the resulting multilevel models can be fit using data simulated according to the model. I will then fit the model to a range of clinical and natural language data. I’ll discuss their advantages for inference with epidemiology data ranging from dentists diagnosing caries based on x-rays, oncologists diagnosing tumors based on slides, infection diagnosis based on exams and serum tests, and with natural language data including name spotting, word stemming, and classification.

I’ll conclude by discussing extensions to further pooling through random effects for different testing facilities, different kinds of annotators (e.g. doctors vs. ICD-9 specialists), different kinds of subjects/units (e.g. genetic predisposition to diseases, or articles drawn from different journals), etc.

All the software (Java, R, BUGS) and data discussed in this talk are freely available from the LingPipe sandbox in the hierAnno project.

You may already be familiar with all this from the data annotation thread on this blog.

Probability Measures and Random Variables

December 11, 2009 by lingpipe

I introduced measure theory in my last post, Measure Theory in Two Definitions. With that out of the way, we can move on to probability measures and random variables.

Probability Measure

A probability measure is a measure taking values in the closed interval [0,1] and mapping the unit to 1, \mu(\mbox{I}) = 1. It follows that \mu(\emptyset) = 0.

In a probability measure, the sets A \in \mathcal S are called events and we write \mbox{Pr}(A) for \mu(A).

The sample space for a probability measure is \Omega = \bigcup_{A \in \mathcal S} A.

Those Pesky Random Variables

Given a probability measure \mu, a random variable X is a real-valued function over the measure’s sample space such that \{ \omega \in \Omega | X(\omega) < y \} \in \mathcal S (so that it has a measure) for all y.

The cumulative density function F_X for X is defined by

F_X(y) = \mu(\{ \omega \in \Omega | X(\omega) < y \}).

The probability that random variable X is less than a fixed value y is

\mbox{Pr}(X < y) = F_X(y).

The probability that the variable X falls in the interval (a,b) is defined by

\mbox{Pr}(X \in (a,b)) = \mbox{Pr}(X < b) - \mbox{Pr}(X < a) = F_X(b) - F_X(a).

When F_X is differentiable, the density function f_X is

f_X = F'_X.

When the density is defined,

\mbox{Pr}(X < a) \ = \ \int_{-\infty}^a f_X(x) dx \ = \ F_X(a)

and

\mbox{Pr}(X \in (a,b)) \ = \ \int_a^b f_X(x) dx \ = \ F_X(b) - F_X(a).

End of Story

This still seems like a lot of work just to get going with random variables. We haven’t even defined multivariate densities or conditional and joint probabilities.

In applied work, we define integrable multivariate density functions explicitly and reason through integration. For instance, in the hierarchical batting ability model I discussed as an example of Bayesian inference, the model is fully described by the density

f(\alpha,\beta,n,\theta)

{} \ \ \ = \mbox{\sf Beta}(\alpha/(\alpha+\beta)|1,1) \ \mbox{\sf Pareto}(\alpha+\beta|1.5) \ \prod_{j = 1}^J \mbox{\sf Beta}(\theta_j|\alpha,\beta) \ \mbox{\sf Binom}(n_j|\theta_j,N_j)

Technically, we could construct the measure from the density function as follows. First, construct the sample space from which the parameters are drawn. Continuing the baseball example, we draw our parameter assignment (\alpha,\beta,n,\theta) from the sample space

\Omega = {\mathbb R}^+ \times {\mathbb R}^+ \times {\mathbb N}^J \times [0,1]^J.

We then take the Lebesgue measurable events A \subseteq \Omega with measure \mbox{Pr}(A) defined by (multivariate) Lebesgue-Stieltjes integration,

\mbox{Pr}(A) = \int_A f(x) \ dx

where, of course, x \in \Omega. If A is a simple hypercube in \Omega, this works out to the usual sum/integral:

\mbox{Pr}(A)

= \int_{x_1}^{y_1} \int_{x_2}^{y_2} \sum_{n_1=a_1}^{b_1} \cdots \sum_{n_J=a_J}^{b_J} \int_{u_1}^{v_1} \cdots \int_{u_J}^{v_J} f(\alpha,\beta,n,\theta) \ d\alpha \ d\beta \ d\theta_1 \cdots d\theta_n.

Measure Theory in Two Definitions

December 10, 2009 by lingpipe

Measure theory generalizes the notion of length, area and volume to a very general topological setting. It also forms the basis of probability theory.

Definition 1

A \sigma-algebra is a set \mathcal S of sets which

  • is closed under difference, A - B \in {\mathcal S} if A, B \in \mathcal S,
  • is closed under countable unions, \bigcup_{n = 0}^{\infty} A_n \in \mathcal S if A_n \in \mathcal S, and
  • has a unit \mbox{I} \in \mathcal S such that A \cap \mbox{I} = A for all A \in \mathcal S.

Definition 2

A \sigma-additive measure is a function \mu over a \sigma-algebra \mathcal S taking non-negative values such that

\mu(\bigcup_{n=0}^{\infty} A_n) = \sum_{n=0}^{\infty} \mu(A_n) for pairwise disjoint sets A_n \in \mathcal S.

Immediate Consequences

The closure of a \sigma-algebra under countable intersections follows from de Morgan’s law,

\bigcap_{n=0}^{\infty} A_n = \mbox{I} - \bigcup_{n = 0}^{\infty} (\mbox{I} - A_n).

The empty set \emptyset = \mbox{I} - \mbox{I} \in \mathcal S is the unique zero element such that \emptyset \cap A = \emptyset for all A \in \mathcal S. Further, \emptyset \cup A = A and \mbox{I} \cup A = \mbox{I} for all A \in \mathcal S.

A \sigma-algebra forms a ring, with \cap as addition, \cup as multiplication, \emptyset as the multiplicative identity, \mbox{I} as the additive identity, and -A = \mbox{I} - A as additive difference.

Constructing Lebesgue Measures

It’s not entirely trivial to show that interesting measures exist, even working over the real numbers. The easiest useful measure to construct is the Lebesgue measure over the unit interval (0,1).

Start by defining the measure of a subinterval to be its length. Extend the definition to countable unions of subintervals by summation, following the definition of measure.

It remains to tidy up the some boundary cases producing zero-measure results, which isn’t entirely trivial (analysis is built on counterexamples and edge cases), but the basic idea pretty much works as is.

Breck in Time Out NY’s Public Eye

December 9, 2009 by lingpipe

I (Bob) love Time Out New York’s feature Public Eye. Each week, it’s a different New Yorker, usually in some kind of wild outfit.

This week they’re featuring our Breck, Alias-i’s self-described “president, founder, whatever — chief janitor”.

Check out the full feature:

The photo’s a block or two from our office, which is next to McCarren Park, where you’ll often find Breck flying his planes. It’s too bad they didn’t catch him in his orange boiler suit, or one of his more stylin’ vintage or designer jackets. In Breck’s defense, he’s been wearing hats long before the trend among Williamsburg hipsters. Me, I’m a jeans and t-shirt and baseball cap kind of guy.

The feature includes a plug for the Brooklyn Aerodrome; the site contains movie clips taken from cameras mounted on Breck’s RC planes, such as the transparent jobby featured in the photo above.

Computing Marginal Channel Probabilities with Generalized Needleman-Wunsch

December 8, 2009 by lingpipe

In my last post on Probabilistic Edit Channel Models, I tried to motivate a properly calculated marginal probability of an observation (e.g., a read) given a source (e.g., a reference sequence).

In this post, I’ll show you how to build a properly normalized probabilistic generalization of the well-known Needleman-Wunsch algorithm for global alignment that computes full marginal probabilities rather than best path probabilities. Global alignment matches all of its two input sequences.

This approach is easily generalized to the Smith-Waterman algorithm for local alignments or to fully matching a read sequence against a subsequence of a reference sequence. See the last section below.

Edit Operations

We’ll assume a simple alphabet of bases, {A,C,G,T}. We’ll also assume four possible actions given a reference sequence and a position in that reference sequence:

  • delete the next symbol from the reference (advance the ref position)
  • insert a symbol into the read (don’t advance the ref position)
  • generate a matched symbol for the read (advance ref)
  • generate a mismatched symbol for the read (advance ref)

There’s only one way to delete, which we’ll write D. There are four possiblities for insertion depending on the base inserted, which we’ll write IA, IC, IG, IT. There are a total of four possibilities for substitution and matching, depending on the base generated in the read, SA, SC, SG, ST; the one with the symbol corresponding to the next reference symbol is the match case.

Probabilistic Edits

Given a reference sequence, we generate a read sequence by iterating over the reference sequence base by base, and for each base, performing one of the above operations. If the operation is not a delete, the position in the reference sequence advances.

For instance, consider the alignment we introduced in the previous post,

A C G T - A C G
| D | | I | S |
A - G T T A G G

which is generated by the sequence of operations:

Reference Position Read Generated Operation Prob
^ACGTACG
A^CGTACG A SA p(SA | A)
AC^GTACG A D p(D | C)
ACG^TACG AG SG p(SG | G)
ACGT^ACG AGT SG p(ST | T)
ACGT^ACG AGTT IT p(IT | T)
ACGTA^CG AGTTA SA p(SA | A)
ACGTAC^G AGTTAG SA p(SG | C)
ACGTACG^ AGTTAGG SG p(SG | G)

What we need to estimate is the conditional probability of one of the nine edit operations given the next base in the input. We’ll assume the edit probabilities are known. Typically, these edits are inferred using little or no training data through the EM algorithm.

The sequence of edit operations is: σ = [SA, D, SG, SG, IT, SA, SA, SG]. Given the reference sequence ACGTACG and the edit operations σ, the read AGTTAGG is uniquely determined. With this clean generative model, given an input reference sequence, the probability of all read sequences sums to 1.0. Of course, we will know the read in most of what we do and only need to compute its probability.

The probability of that sequence of edit operations given the reference sequence ACGTACG is given by the product of all the probabilities in the rightmost column above:

p(σ|ACGTACG)
  = p(SA|A) p(D|C) p(SG|G) p(ST|T) p(IT|T) p(SA|A) p(SG|C) p(SG|G)

What we’re interested in calculating here is p(AGTTAGG | ACGTACG), namely the probability of the read given the reference sequence. This is done by summing over all possible alignments:

p(\mbox{\tt AGTTAGG}|\mbox{\tt ACGTACG}) = \sum_\sigma p(\mbox{\tt AGTTAGG},\sigma | \mbox{\tt ACGTACG})

Note that this probability will be zero if the edit sequence \sigma and reference \mbox{\tt ACGTACG} don’t generate the read \mbox{\tt AGTTAGG}.

The usual alignment programs compute the best alignment, namely:

\arg\max_{\sigma} p(\mbox{\tt AGTTAGG},\sigma | \mbox{\tt ACGTACG})

We could use a regular \max_{\sigma} in place of \arg\max_{\sigma} to compute the probability of the best alignment instead of the alignment itself.

Note that we take the maximum probability; usually the algorithm is formulated with edit costs and it takes the minimum cost. We could convert to the usual setup with minimization by reformulating in terms of negative log probabilities.

Needleman-Wunsch Algorithm

The Needlman-Wunsch algorithm is a classic instance of dynamic programming where the distances computed for shorter substrings are extended in the inner loop to distances for longer strings. The first two loops deal with the boundary conditions of all-insertions and all-deletions

double nw(byte[] reads, byte[] refs) {

    int[][] d = new int[refs.length+1][reads.length+1];

    d[0][0] = 0.0;

    for (int i = 1; i < refs.length; ++i)
        d[0][i] = d[0][i-1] + log p(D|refs[i]);

    for (int j = 1; j < refs.length; ++j)
        d[j][0] = d[j-1][0] + log p(I_reads[j] | refs[0])

    for (int i = 1; i < refs.length; ++i)
        for (int j = 1; j < reads.length; ++j)
            d[j][i] = max( d[i][j-1] + log p(D | refs[i]),
                  d[j-1][i] + log p(I_reads[j] | refs[i]),
                  d[j-1][i-1] + log p(S-reads[j] | refs[i]) );

    return d[refs.length][reads.length];
}

Given its simple loop, this algorithm clearly takes time proportional to read time reference length.

Marginal Needleman-Wunsch

I’ll call this the marginal Needleman-Wunsch, but if someone knows a better name, let me know. It bears the same relation to the standard N-W algorithm as the forward algorithm for HMMs compares to the Viterbi algorithm.

double nw(byte[] reads, byte[] refs) {

    int[][] d = new int[refs.length+1][reads.length+1];

    d[0][0] = 0.0;

    for (int i = 1; i < refs.length; ++i)
        d[0][i] = d[0][i-1] + log p(D|refs[i]);

    for (int j = 1; j < refs.length; ++j)
        d[j][0] = d[j-1][0] + log p(I_reads[j] | refs[0])

    for (int i = 1; i < refs.length; ++i)
        for (int j = 1; j < reads.length; ++j)
            d[j][i] = logSumExps( d[i][j-1] + log p(D | refs[i]),
                  d[j-1][i] + log p(I_reads[j] | refs[i]),
                  d[j-1][i-1] + log p(S-reads[j] | refs[i]) );

    return d[refs.length][reads.length];
}

The only change is that we’re calling the familiar log sum of exponentials function here rather than max. Recall the definition:

\mbox{logSumExp}(x,y,z) = \log (e^x + e^y + e^z)

We use this because in our case, x, y, and z are log probabilities, so that exp(x), exp(y), and exp(z) are linear probabilities, so that their sum is a linear probability, and hence the final result is the total probability of the paths converted back to the log scale. We don’t just maintain linear probabilities because we’re likely to hit underflow problems. (This may not actually be an issue for the limited number of errors allowed in these models; if not, then we just need a sum of the linear probabilities, which will be much faster to execute computationally.)

That’s it. We’ve computed the (log of) the sum of all path probabilities.

When the Reference is Exhausted

One small detail: we need to be able to generate symbols in the read after the reference is exhausted, so we have to include that option in the algorithm. That is, we need five discrete distributions over edit operations, one for each next base {A,C,G,T} in the reference, and one to use when the reference is exhausted.

Although not conceptually difficult or expensive to compute, it complicates the boundary conditions for the algorithm and the conceptualization of the algorithm, so the algorithm isn’t technically correct as stated. To correc it, the probability terms need to be fiddled to account for the boundary.

Generalizing to Local Alignments

The Smith-Waterman-like generalization to local alignments requires allowing zero-cost deletes on the first and last row of the edit matrix, and zero-cost inserts on the first and last column of the matrix.

This isn’t quite properly normalized, but can be if we assume a uniform model of read start and end positions.

We can also restrict local alignments in the usual way to the case where the read completely matches a substring of the reference sequence by only allowing zero-cost deletes on the first and last row of the distance matrix.

Multiple Edits and Correlated Errors

The model above is Markovian; only the current reference symbol determines the output symbol.

This assumption may be relaxed to allow compound edit operations instead of the simple Markovian insertion, substitution, and deletion operations of the Needleman-Wunsch algorithm.

For example, University of Toronto’s SHRiMP alignment package considers pairwise operations generated due to the AB SOLiD platform’s colorspace reads (there’s a shortage of capital “I”s for names, hence “Alias-i”). With this approach, SHRiMP can model the the correlated errors from the SOLiD colorspace reads relative to a reference sequence due to to singular nucleotide polymorphisms (SNPs) in the samples relative to the reads (which induce two sequential, correlated, color space read errors).

SHRiMP’s generalization of Needleman-Wunsch is similar both algorithmically and conceptually to the phonetic extensions to he edit distance component of the noisy channel model for natural language spelling correction proposed by Kristina Toutanova and Bob Moore in their 2002 ACL paper, Pronunciation Modeling for Improved Spelling Correction.

Probabilistic Edit Channel Models

December 4, 2009 by lingpipe

This post is about probabilistic edit distances used as noisy channel models with applications to biological short-read sequencing and natural language spelling correction. (Spoiler: Normalization is over channel outputs and the forward (sum) algorithm is required instead of the usual Viterbi (max) algorithm.)

Spelling Correction and Chinese Tokenization

LingPipe models both spelling correction and Chinese tokenization (word splitting) using noisy channel models. We have two tutorials:

The two components of a noisy channel model are the source p(\sigma) from which signals \sigma consisting of strings are generated probabilistically, and the channel p(\tau|\sigma) which models the probability of observing the signal \tau when the message is \sigma. For spell checking, the source model of what people are likely to say may be estimated using document data and/or query log, and the channel model of possible brainos and typos may be estimated using matched query/correction data (which itself may be inferred via EM from query logs and/or document data). For Chinese, the channel deterministically deletes spaces, which requires probabilistic reconstruction for the observer who doesn’t get word boundary information.

LingPipe’s spelling package implements (confusion-matrix weighted) edit distance. And other models, such as n-gram comparison. This is all described in the spelling tutorial as it relates to spell checking, and also in:

LingPipe doesn’t enforce a probabilistic interpretation of edit distance. Furthermore, decoding of the intended signal given the observed message is implemented using the Viterbi approximation, which estimates the channel probability based on the single best alignment between the message \sigma and the observed signal \tau.

Mature mRNA Sequencing

mRNA diagram

Mitzi’s lab’s working on sequencing the untranslated end regions of mature mRNA (aka the 3′ UTR) in C. elegans (aka the nematode worm), a widely studied model organism. The 3′ UTR is the sequence of bases after the stop codon and before the poly-A tail found on mature mRNA.

Short Read Sequencers

I’ve been talking with Mitzi lately about the next generation short-read DNA sequencers, like Illumina and SOLiD. Perhaps it’s not surprising, given that it’s locally coherent string data, that the same techniques are used in bioinformatics as in computational linguistics.

Isoform Expression Models

I’ve been formulating Bayesian models for transcriptome analysis. In particular, isoform (aka alternative splicing) expression. As usual, there’s an underlying generative model of the process. One of the components of this model is how the read \tau that’s observed from the device (the observed data) is generated given the biological sequence \sigma from which it was generated (the estimand of interest).

The noisy channel model is appropriate for this setting; I just need to make it probabilistic.

Levenshtein Distance

The simplest models of spelling correction or genetic sequencing involve simple Levenshtein distance calculations. The Levenshtein distance is the minimum number of insert, delete or substitution operations required to turn one sequence into the other. It’s symmetric. But it’s not probabilistic.

Probabilistic Channels

It’s actually quite simple to model p(\tau|\sigma). Just make sure all the moves from a given state sum to 1.0. A state corresponds to a prefix of the reference sequence \tau. Legal moves are to skip (delete) the next symbol from the message \sigma, to add (insert) a spurious symbol into the observation \tau, match the next symbol from \sigma and \tau, or to substitute the next symbols.

Viterbi versus Forward Algorithm

The usual approach to calculation is to make the so-called Viterbi assumption, and look at the probability of the best sequence of moves, to consume the message sequence and observed sequence. Given a sequence of edit operations, we can generate an alignment between sequences. For instance, consider matching the message ACGTACG to the observed signal AGTTAGG. The following alignment consists of a match of A, a delete of C, two matches GT, an insertion of a T, a match on A, a substitution of a G for a C, and a final match on G:

ACGT-ACG
|D||I|S|
A-GTTAGG

To make the probabilities add up to 1.0 for the moves from a particular state, we need to subdivide the inserts and substitutions. With the genetic alphabet of symbols {A, C, G, T}, the 9 possible moves are subdivided by output, I(A), I(C), I(G), I(T), D, S(A), S(C), S(G), S(T). We have probabilities for these outputs for each symbol. In the genetic case, that’s four multinomials of order 9.

Note that we’ve included the match among the substitution operations for simplicity. If we know the next output symbol from a state, only one of the four substitutions will be legal — the one that outputs the next output symbol.

But there are many more alignments for the same pair of strings. Some even have the same Levenshtein distance. For instance, we can align the two sequences using three substitutions:

ACGTACG
|SS||S|
AGTTAGG

Edit distance assigns a probability to a path of operations, which corresponds to an alignment a, say p(a,\tau|\sigma). Given the underlying message (or sequence) \sigma, the observed sequence \tau is determined by the edit operations a.

What we need to do is integrate (or sum) out the alignments in the usual way in order to calculate the probability of the observed signal (query/read) \tau given the message (meaning/reference sequence) \sigma:

p(\tau|\sigma) = \sum_a p(\tau,a|\sigma)

The Viterbi approximation is:

p(\tau|\sigma) \approx \max_a p(\tau,a|\sigma)

The approximation is only reasonable from a probabilistic point of view if the best alignment accounts for most of the probability mass. Or, if we are doing calculations proportionally and the best alignment probability is a constant fraction of the total generation probability.

Simply using the forward algorithm instead of the Viterbi algorithm computes this in {\mathcal O}(m \times n) time (and space) where n = |\tau| is the length of \tau and m = |\sigma| the length of \sigma.

We can actually tighten up the space requirements to \mbox{max}(m,n); this is how I did things for LingPipe’s implementation and what I contributed for Lucene’s fuzzy matcher. We can tighten even further if we have an upper bound on the maximum number of edits to consider, say d. Then the time complexity reduces to {\mathcal O}(d \times \max(m,n)). (As for all things string algorithm related, see Gusfield’s string algorithms book and Durbin, Eddy, Krogh and Mithcison’s probabilistic string matching book).

Of course, we’ll compute it all on a log scale so the products become sums. This will require our old friend the log sum of exponentials operation.

Filtering

For the biological sequence matching problem, we’re facing on the order of 100M reads of 50 bases each, matching against a set of 30K isoforms of 1.5K bases each. This is prohibitive for the forward algorithm. There are various ways to share the prefix operations (again, see Gusfield’s book).

Typically, though, practitioners use n-gram sequence indexes to spot potential matches, only considering sequences with good n-gram overlap, thus eliminating most of the work up front. This can be done either heuristically, as in BLAST (described in both books), or exactly, as in SHRiMP (the latter link is to a paper with a good description of the SOLiD color space reads and uses automaton composition to compose noisy channels, one for color read errors, one for sequencing errors, and one for mutations).

This is like the notion of blocking in database record linkage (including database deduplication), which we also tend to carry out using character n-grams for natural language sequences.

The transcendental math’s still going to dominate the calculations, so typically a Viterbi pass using Levenshtein distance is done after blocking with n-grams to further filter out unlikely alignments.

Biological sequence analysis sure looks a lot like natural language sequence analysis!

Chang et al. (2009) Reading Tea Leaves: How Humans Interpret Topic Models

November 18, 2009 by lingpipe

This forthcoming NIPS paper outlines a neat little idea for evaluating clustering output:

The question they pose is how to evaluate clustering output, specifically topic-word and document-topic coherence, for human interpretability.

Bag of Words

Everything’s in the bag of words setting, so a topic is modeled as a discrete distribution over words.

Multiple Topics per Document

They only consider models that allow multiple topics per document. Specifically, the clusterers all model a document as discrete distribution over topics. The clusterers considered share strong family resemblances: probabilistic latent semantic indexing (pLSI) and two forms of latent Dirichlet allocation (LDA), the usual one and one in which topics for documents are drawn from a logistic prior modeling topic correlations rather than a uniform Dirichlet.

Intrusion Tasks

To judge the coherence of a topic, they take the top six words from a topic, delete one of the words and insert a top word from a different topic. They then measure whether subjects can detect the “intruder”.

To judge the coherence of the topics assigned to a document, they do the same thing for document distributions: they take the top topics for a document, delete one and insert a topic not assigned with high probability to the document.

Analysis

They considered two small corpora of roughly 10K articles, 10K word types and 1M tokens, one from Wikipedia pages and one from NY Times articles. These can be relatively long documents compared to tweets, customer support requests, MEDLINE abstracts, etc, but are shorter than full-text research articles or corporate 10K or 10Q statements.

They only consider 50, 100, and 150 topic models, and restrict parameterizations to add-1 smoothing (aka the Laplace form of the Dirichlet prior) for per-document topic distributions. I didn’t see any mention of what the prior was for the per-topic word distributions. I’ve found both of these parameters to have a huge effect on LDA output, with larger prior counts in both cases leading to more diffuse topic assignments to documents.

They only consider point estimates of the posteriors, which they compute using EM or variational inference. This is is not surprising given the non-identifiability of topics in the Gibbs sampler.

Mechanical Turker Voting

They used 8 mechanical Turkers per task (aka HIT) of 10 judgments (wildly overpaying at US$0.07 to US$0.15 per HIT).

(Pseudo Expected) Predictive Log Likelihood

They do the usual sample cross-entropy rate evaluations (aka [pseudo expected] predictive log likelihoods). Reporting these to four decimal places is a mistake, because the different estimation methods for the various models have more variance than the differences shown here. Also, there’s a huge effect from the priors. For both points, check out Asuncion et al.’s analysis of LDA estimation, which the first author, Jonathan Chang, blogged about.

Model Precision

Their evaluation for precision is the percentage of subjects who pick out the intruder. It’d be interesting to see the effect of adjusting for annotator bias and accuracy. This’d be easy to evaluate with any of our annotation models. For instance, it’d be interesting to see if it reduced the variance in their figure 3.

There’s variation among the three models at different topics over the different corpora. I’m just not sure how far to trust their model precision estimates.

Their Take Home Message

The authors drive home the point that traditional measures such as expected predictive log likelihood are negatively correlated with their notion of human evaluated precision. As I said, I’m skeptical about the robustness of this inference given the variation in estimation techniques and the strong effect of priors.

The authors go so far as to suggest using humans in the model selection loop. Or developing an alternative estimation technique. If they’d been statisticians rather than computer scientists, my guess is that they’d be calling for better models, not a new model selection or estimation technique!

The Real Take Home Message

I think the real contribution here is the evaluation methodology. If you’re using clustering for exploratory data analysis, this might be a way to vet clusters for further consideration.

What They Could’ve Talked About

Although they mention Griffiths and Steyvers’ work on using LDA for traditional psychometrics, I think a more interesting result is Griffith and Steyvers’ use of KL-divergence to measure the stability of topics across Gibbs samples (which I describe and recreate in the LingPipe clustering tutorial). Using KL divergence to compare different clusters may give you a Bayesian method to automatically assess Chang et al.’s notion of precision.

Phrase Detectives Linguistic Annotation Game

November 10, 2009 by lingpipe

Massimo Poesio just sent me a pointer to the following awesome web application:

Annotation Game

Annotation games (aka “games with a purpose”) were popularized by van Ahn’s ESP game, though I first heard about them through David Stork’s Open Mind project.

Unlike the mechanical Turk, the games approach tries to make the task somewhat fun and competitive. It seems like making the users “detectives” is a thin veneer of “fun”, but they maintain the metaphor beautifully throughout the whole site, so it works.

As with many games, Phrase Detectives pays out in leader board bragging rights and cash prizes rather than directly for work completed as on Mechanical Turk.

Phrase Detective Tasks

The really brilliant part is how they break the coref annotation task into four easy-to-answer questions about a single highlighted phrase.

  1. Not Mentioned Before: yes/no question as to whether the referent of highlighted phrase was previously mentioned in the text
  2. Mentioned Before: highlight previous mention of a given phrase
  3. Non-Referring: pick out non-referential nouns (like the “there” in “there is trouble brewing”)
  4. Property of Another Phrase: pick out other phrases that describe someone already mentioned (e.g. attributives or apositives)

The site also has nice clean, easy-to-follow graphics, and appears to still have users after two years.

Adjudication Phase

OK, they call it “Detectives Conference”, but the idea is you get to vote yes/no as to whether someone else’s answer is right. This is a good idea widely used on Mechanical Turk because it’s easier to check someone’s work than to create it from scratch.

Read All About It

It was developed by academics, so there are at least as many papers as contributors:

Coreference Annotation

There are “expert” annotated within-doc coref corpora for the MUC 7 and ACE 2005 evaluations (available from LDC, who charge an arm and a leg for this stuff, especially for commercial rights).

LingPipe does within-document coreference and we’ve worked on cross-document coreference.

More Like This

As soon as you find one of these things, you find more. Check out:

I’d love to hear about more of these if anyone knows any.