At our Java SIG presentation this past Tuesday, we had several

interesting questions about classifiers. One of the things people

wanted to be able to do was pull the natural language text out of a

document (as compared to, say, ASCII art, tables, lists of menu titles

derived from HTML to text conversion, etc.). That’s a two-class

classification problem, where a sequence of characters is classified

as either “GOOD” or “BAD.”

Not a problem, was our answer, but as usual, the devil’s in the

details. One person framed the question beautifully:

> Anyway, I think I see where you're going with hierarchical
> classification, but I think what I'm describing is slightly
> different. In the cases you describe, you're trying to distinguish
> between two known cases, e.g., in natural text vs. non-natural
> text. I imagine that works well if you have a fairly representative
> collection of non-natural text. But what happens if there is total
> garbage in the paper--noise that is not representative of either
> the known natural text vs. the known non-natural text?
>
> What I'm after is that you might be able to train for detecting that
> you are not in a trained section by tracking the confidence of the
> classifier. Perhaps I'm stretching the metaphor, but the idea is
> that I recognize when I'm in an unfamiliar place not because it
> looks like other unfamiliar places, but rather because I recognize
> the feeling of unfamiliarity.

Spot on. The really cool part is how that’s actually implemented with

our classification interface.

A simple two-class classifier will operate by classifying to the

category with the highest probability given the text being classified,

P(cat|text). That’s what we’re calling “confidence” — the

probability of the category given the text.

As usual, it’s much easier to build a joint model P(cat,text) of the

probability of the category AND the text. Then we just note that:

P(cat|text) = P(cat,text) / P(text) = P(cat,text) / SUM_cat P(cat,text)

where we’ve applied the usual Bayesian inversion to compute the

marginal as a sum of joint estimates:

P(text) = SUM_cat P(cat,text)
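To make the marginalization concrete, here's a minimal Java sketch (toy numbers, not LingPipe code) that recovers the conditional P(cat|text) from joint estimates:

```java
// Toy sketch: recover P(cat|text) from joint estimates P(cat,text)
// by normalizing with the marginal P(text) = SUM_cat P(cat,text).
public class JointToConditional {

    // joint[i] holds P(cat_i, text) for one fixed text
    public static double[] conditional(double[] joint) {
        double pText = 0.0;
        for (double p : joint) {
            pText += p;                  // P(text) = SUM_cat P(cat,text)
        }
        double[] cond = new double[joint.length];
        for (int i = 0; i < joint.length; ++i) {
            cond[i] = joint[i] / pText;  // P(cat_i|text) = P(cat_i,text) / P(text)
        }
        return cond;
    }

    public static void main(String[] args) {
        // e.g. P(GOOD,text) = 0.03, P(BAD,text) = 0.01
        double[] cond = conditional(new double[] { 0.03, 0.01 });
        System.out.println(cond[0] + " " + cond[1]);  // roughly 0.75 and 0.25
    }
}
```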

If we’re just choosing the first-best category, then we don’t even

need to compute the conditional, because:

ARGMAX_cat P(cat|text) = ARGMAX_cat P(cat,text) / P(text)
                       = ARGMAX_cat P(cat,text)
                       = ARGMAX_cat P(text|cat) P(cat)
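In code, the same point looks like this (a toy sketch): picking the winner from the joint estimates alone, with no normalization, gives the same first-best answer as picking it from the conditionals.

```java
// Toy sketch: first-best classification needs only the joint P(cat,text);
// dividing every entry by the same positive P(text) can't change the argmax.
public class FirstBest {

    public static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; ++i) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] joint = { 0.03, 0.01 };     // P(cat,text) per category
        double pText = joint[0] + joint[1];  // marginal P(text)
        double[] cond = { joint[0] / pText, joint[1] / pText };
        // same winning index whether we use joint or conditional estimates
        System.out.println(argmax(joint) == argmax(cond));
    }
}
```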

Now here’s the question: what if we only have training examples of one

category, say that of natural language text in English? As a good

statistician, all I want is a good estimate of the probability of

the text given the category.

The best approach is to get some data. The second-best approach is to

take a uniform estimate of P(text|cat) [which will depend on the

length of the text if it’s to be properly normalized].
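The length dependence of that uniform estimate is easy to see in a sketch. Assuming, purely for illustration, a 64-character alphabet, every character costs log2(1/64) = -6 bits, so the total log probability just scales with length:

```java
// Sketch of a uniform character model: every character gets probability
// 1/ALPHABET_SIZE, so log2 P(text|cat) is a constant number of bits per
// character times the length of the text.
public class UniformCharModel {

    static final int ALPHABET_SIZE = 64;  // assumption for illustration

    public static double log2Prob(String text) {
        double bitsPerChar = Math.log(1.0 / ALPHABET_SIZE) / Math.log(2.0);
        return text.length() * bitsPerChar;
    }

    public static void main(String[] args) {
        // 4 chars at -6 bits each, so about -24 bits total
        System.out.println(log2Prob("abcd"));
    }
}
```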

The first-best approach is our normal language model-based classifier.

The second-best approach is implemented as the classify.BinaryLMClassifier.

The constructor takes a “cross-entropy threshold,” and in first-best
classification the input is classified as being of the category if its
cross-entropy rate is lower than the threshold.

But how do we do this by inheritance? By making the competing model
for P(text|cat2) a constant character language model whose
per-character cross-entropy is the threshold. Voilà.
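Here's how that equivalence works out in a sketch (this is not LingPipe's actual implementation): a competing model that charges a constant number of bits per character accepts exactly the same inputs as a direct threshold on the cross-entropy rate, assuming equal priors.

```java
// Sketch (not LingPipe's implementation): thresholding the cross-entropy
// rate directly is equivalent to letting the GOOD model compete against a
// constant model that charges 'threshold' bits per character.
public class ThresholdAsCompetingModel {

    // direct form: accept if the rate under the GOOD model beats the threshold
    public static boolean acceptByThreshold(double rate, double threshold) {
        return rate < threshold;
    }

    // competing-model form: compare total log probs (equal priors assumed)
    public static boolean acceptByCompetition(double rate, double threshold,
                                              int length) {
        double logProbGood = -rate * length;      // GOOD language model
        double logProbBad = -threshold * length;  // constant competing model
        return logProbGood > logProbBad;
    }

    public static void main(String[] args) {
        // both forms agree for any length > 0
        System.out.println(acceptByThreshold(3.5, 4.0));         // true
        System.out.println(acceptByCompetition(3.5, 4.0, 100));  // true
        System.out.println(acceptByCompetition(4.5, 4.0, 100));  // false
    }
}
```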

Let’s see an example. I build a language model P(text|GOOD) using

data about what’s “good” text to process. This can probably just be

run-of-the-mill training data you get from anywhere, but the closer it
matches your application’s good data, the better. Then I look at its
cross-entropy rate (negative log prob / length) on unseen text. I can actually

go so far as to compute a variance, as I did in the paper at the ACL

software workshop last year. This’ll let me estimate where I need to

set the threshold to get a given recall value (which will then

determine precision). In essence, anything that looks too “far out”,

that is, too high entropy against the training model, will be

classified as non-text.
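A back-of-the-envelope version of that threshold-setting recipe (with made-up numbers, not the estimator from the paper): measure the model's cross-entropy rate on held-out good text, then put the threshold a few standard deviations above the mean.

```java
// Sketch of picking the threshold from held-out data: estimate the mean
// and variance of the GOOD model's cross-entropy rate on unseen good text,
// then set the threshold numStdDevs above the mean so that most good text
// stays under it (raising the threshold raises recall, lowers precision).
public class ThresholdFromHeldOut {

    public static double threshold(double[] rates, double numStdDevs) {
        double mean = 0.0;
        for (double r : rates) {
            mean += r;
        }
        mean /= rates.length;
        double var = 0.0;
        for (double r : rates) {
            var += (r - mean) * (r - mean);
        }
        var /= rates.length;
        return mean + numStdDevs * Math.sqrt(var);
    }

    public static void main(String[] args) {
        // made-up cross-entropy rates in bits/char for held-out good text
        double[] rates = { 3.0, 3.2, 2.8, 3.1, 2.9 };
        System.out.println(threshold(rates, 2.0));  // roughly 3.28 bits/char
    }
}
```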

That’s how you “recognize the feeling of unfamiliarity”.

Unfamiliarity is just high cross-entropy in a statistical sense.
