Finding Text with One Language Model


At our Java SIG presentation this past Tuesday, we had several
interesting questions about classifiers. One of the things people
wanted to be able to do was pull the natural language text out of a
document (as compared to, say, ASCII art, tables, lists of menu titles
derived from HTML to text conversion, etc.). That’s a two-class
classification problem, where a sequence of characters is classified
as either “GOOD” or “BAD”

Not a problem, was our answer, but as usual, the devil’s in the
details. One person framed the question beautifully:

> Anyway, I think I see where you're going with hierarchical
> classification, but I think what I'm describing is slightly
> different. In the cases you describe, you're trying to distinguish
> between two known cases, e.g., in natural text vs. non-natural
> text, I imagine that works well if you have a fairly representative
> collection of non-natural text. But what happens if there is total
> garbage in the paper--noise that is not representative of either
> the known natural text vs. the known non-natural text?

> What I'm after is that you might be able to train for detecting that
> you are not in a trained section by tracking the confidence of the
> classifier. Perhaps I'm stretching the metaphor, but the idea is
> that I recognize when I'm in an unfamiliar place not because it
> looks like other unfamiliar places, but rather because I recognize
> the feeling of unfamiliarity.

Spot on. The really cool part is how that’s actually implemented with
our classification interface.

A simple two-class classifier will operate by classifying to the
category with the highest probability given the text being classified,
P(cat|text). That’s what we’re calling “confidence” — the
probability of the category given the text.

As usual, it’s much easier to build a joint model P(cat,text) of the
probablity of the category AND the text. Then we just note that:

P(cat|text) = P(cat,text) / P(text)
            = P(cat,text) / SUM_cat P(cat,text)

where we’ve applied the usual Bayesian inversion to compute the
marginal as a sum of joint estimates:

P(text) = SUM_cat P(cat,text)

If we’re just choosing the first-best category, then we don’t even
need to compute the conditional, because:.

ARGMAX_cat P(cat|text)
= ARGMAX_cat P(cat,text) / P(text)
= ARGMAX_cat P(cat,text)
= ARGMAX_cat P(text|cat) P(cat)

Now here’s the question: what if we only have training examples of one
category, say that of natural language text in English? As a good
statististician, all I want is a good estimate of the probability of
the text given the category.

The best approach is to get some data. The second best approach is to
take a uniform estimate of P(text|cat) [which will depend on the
length of the text if it’s to be properly normalized].

The first-best approach is our normal language model-based classifier.

The second-best approach is implemented as the classify.BinaryLMClassifier.

The constructor takes a “cross-entropy threshold” and what happens in
first-best classification is that the input is classified as being of
the category if its entropy-rate is lower than the threshold.

But how do we do this by inheritance? By making the competing model
P(cat2|text) with a constant character language model with a
per-character estimate of the cross-entropy rate. Voila.

Let’s see an example. I build a language model P(text|GOOD) using
data about what’s “good” text to process. This can probably just be
run of the mill training data you get from anywhere, but the closer it
matches your application good data, the better. Then I look at it’s
cross-entropy rate (log prob / length) on unseen text. I can actually
go so far as to compute a variance, as I did in the paper at the ACL
software workshop last year. This’ll let me estimate where I need to
set the threshold to get a given recall value (which will then
determine precision). In essence, anything that looks too “far out”,
that is, too high entropy against the training model, will be
classified as non-text.

That’s how you “recognize the feeling of unfamiliarity”.
Unfamiliarity is just high cross-entropy in a statistical sense.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s