Document Classification with Lucene

by

As promised in my last post, this post shows you how to use Lucene’s ranked search results and document store to build a simple classifier. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene. The data and source code for this example are contained in the source bundle distributed with this book, which can be downloaded from https://github.com/colloquial/javabook.

A classifier takes some input, which could be just about anything, and returns a classification of the input over a finite number of discrete categories. The goal of text classification is to assign some text to some category. Given a set of texts that contain discrete category labels, we can use that can assign these labels to new documents based on the similarity of the new documents to the labeled documents.

Example Dataset: The 20 Newsgroups Corpus

For this example, we use the data from the 20 Newsgroups corpus, a set of roughly 20,000 messages posted to 20 different newsgroups.  The classification task is to build a system that can assign a newsgroup name (the category label) to a newsgroup message (the text). Since all the messages in the corpus are already labeled, we can use one set of newsgroup posts to train our classifier and use the other set to test the classifier. To test the system, we compare the newsgroup name generated by the classifier against the name of the newsgroup that the message was posted to.

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.  The names of the newsgroups in this corpus are:

comp.graphics rec.autos sci.crypt
comp.os.ms-windows.misc rec.motorcycles sci.electronics
comp.sys.ibm.pc.hardware rec.sport.baseball sci.med
comp.sys.mac.hardware rec.sport.hockey sci.space
comp.windows.x
talk.politics.misc talk.religion.misc
misc.forsale talk.politics.guns alt.atheism
talk.politics.mideast soc.religion.christian

This dataset is maintained by Jason Rennie. There are no licensing terms listed and the data may be downloaded directly from: http://people.csail.mit.edu/jrennie/20Newsgroups/.  The bundle 20news-bydate.tar.gz is divided into two sets of posts, a training set comprising roughly 60% of the posts to each group and a test set containing the remaining 40% of the posts. This version of the corpus doesn’t include cross-posts (messages posted to more than one newsgroup) and the newsgroup-identifying header lines have been removed.  20news-bydate.tar.gz unpacks into two top-level directories: 20news-bydate-train and 20news-bydate-test. Both of these contain 20 subdirectories of newsgroup data, where the directory name is the same as the newsgroup name, and each of these contains a set of newsgroup messages, roughly 400—600 posts for the training sets and roughly 250—400 posts for the test sets.

The Program ClassifyNewsgroups.java

The program ClassifyNewsgroups.java constructs the Lucene index from the training data and then uses this index to do a first-best classification of the test data. The classification results are presented as a confusion matrix that shows the full response profile of the classifiers. This is a 20-by-20 matrix where both the rows and columns are newsgroup names. Each row represents the performance of the classifier on posts from a given newsgroup, and each column represents the classifier response, so that reading across a row, we can see the number of posts given a particular category label. The classifier also reports the percentage of posts in the test set that were correctly labeled.

The source bundle contains the source code and the Lucene jar files as well as the file 20news-bydate.tar.gz.  The source code is in directory src/applucene which contains an Ant build.xml file. The ClasifyNewsgroups.java program is in subdirectory src/applucene/src/com/colloquial/applucene. The Ant target classify runs the classification program over the 20 newsgroups data.

Building the Index

We’ve already discussed the method buildIndex in previous post on Lucene 4. This method builds a Lucene index over all the documents in the training directory. It takes two java.io.File arguments: the directory for the Lucene index and a the directory containing the training data newsgroup posts.

void buildIndex(File indexDir, File trainDir) 
    throws IOException, FileNotFoundException {
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf 
        = new IndexWriterConfig(VERSION,mAnalyzer);

    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,iwConf);

    File[] groupsDir = trainDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            Document d = new Document();
            d.add(new StringField("category",
                                  post.group(),Store.YES));
            d.add(new TextField("text",
                                post.subject(),Store.NO));
            d.add(new TextField("text",
                                post.body(),Store.NO));
            indexWriter.addDocument(d);
        }
    }
    indexWriter.commit();
    indexWriter.close();
}

The first four statements in buildIndex create a oal.store.FSDirectory that stores the index as a file-system directory and an IndexWriter which does the work of analyzing documents and adding them to the index. The behavior of the IndexWriter is specified by an IndexWriterConfig object. Documents are indexed via the IndexWriter‘s default Analyzer. At the outset of the program we define a static final Lucene Version enum constant named VERSION. In order to ensure that the tokenization applied to the query matches the tokenization used to build the index, we use the member variable mAnalyzer to hold the reference to a LuceneAnalyzer. We supply these as arguments to the IndexWriterConfig constructor. We call the setOpenMode method with the enum constant IndexWriterConfig.OpenMode.CREATE, which causes the index writer to create a new index or overwrite an existing one.

For each newsgroup sub-directory, all posts in that directory are parsed into a NewsPost object, which is a simple domain model of a newsgroup post that holds the newsgroup name, subject, body, and filename (a string of numbers). The NewsPost object is processed into a Lucene Document that has two fields: a StringField named category for the newsgroup name and a TextField named text that holds the message subject and message body.

Using the Index for Classification

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.

The method buildQuery creates a Lucene query from the subject and body text of the test message by tokenizing the message subject and body. The Lucene query ORs together all terms (up to a maximum of 1,024) into a BooleanQuery using the BooleanClause enum value SHOULD.

BooleanQuery buildQuery(String text) throws IOException {
    BooleanQuery termsQuery = new BooleanQuery();
    Reader textReader = new StringReader(text);
    TokenStream tokStream 
        = mAnalyzer.tokenStream("text",textReader);
    try {
        tokStream.reset();
        CharTermAttribute terms = 
            tokStream.addAttribute(CharTermAttribute.class);
        int ct = 0;
        while (tokStream.incrementToken() 
               && ct++ < 1024) {
            termsQuery.
                add(new TermQuery(new Term("text",
                                           terms.toString())), 
                    Occur.SHOULD);
        }
        tokStream.end();
    } finally {
        tokStream.close();
        textReader.close();
    }
    return termsQuery;
}

As we process each newsgroup in the test set, we fill in the confusion matrix, one row at a time. To index the rows and columns of the matrix by newsgroup name, we define a String array of newsgroup names called NEWSGROUPS:

public static final String[] NEWSGROUPS
    = { "alt.atheism",
        "comp.graphics",
        "comp.os.ms-windows.misc",
...

The method testIndex iterates over the sets of newsgroup posts in the test set and tabulates the results in a 20-by-20 confusion matrix.

void testIndex(File indexDir, File testDir)
    throws IOException, FileNotFoundException {
    Directory fsDir = FSDirectory.open(indexDir);
    DirectoryReader reader = DirectoryReader.open(fsDir);
    IndexSearcher searcher = new IndexSearcher(reader);

    int[][] confusionMatrix
        = new int[NEWSGROUPS.length][NEWSGROUPS.length];
    File[] groupsDir = testDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        int rowIdx = Arrays.binarySearch(NEWSGROUPS,groupName);

        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            BooleanQuery termsQuery 
                = buildQuery(post.subject() 
                             + " " + post.body());
            // only get first-best result
            TopDocs hits = searcher.search(termsQuery,1);
            ScoreDoc[] scoreDocs = hits.scoreDocs;
            for (int n = 0; n < scoreDocs.length; n++) {
                ScoreDoc sd = scoreDocs[n];
                int docId = sd.doc;
                Document d = searcher.doc(docId);
                String category = d.get("category");
                // record result in confusion matrix
                int colIdx 
                    = Arrays.binarySearch(NEWSGROUPS,category);
                confusionMatrix[rowIdx][colIdx]++;
            }
        }
        System.out.print(groupName);
        for (int i=0; i<NEWSGROUPS.length; i++) 
             System.out.printf("| %4d ", confusionMatrix[rowIdx][i]);
        System.out.println("|");
    }
}

Evaluating the Classifier and Comparing Tokenization Strategies

The ClassifyNewsgroups program allows the user to specify the analyzer used. This allows us to see how different tokenization strategies affect classification. The three analyzers available are: the Lucene StandardAnalyzer, which chains together a StandardTokenizer that breaks the text into a stream of words, discarding whitespace and punctuation, followed by a StandardFilter, a LowerCaseFilter, and then a StopFilter, which uses a list of English stop words; an analyzer that chains together just a StandardTokenizer and a LowerCaseFilter; and an Analyzer that uses only a Lucene NGramTokenizer (from package oal.analysis.ngram) to index the data using 4-grams.

Here is the top-left quadrant of the confusion matrix, with the results for the first 10 newsgroups of running the classifier using the Lucene StandardAnalyzer:

ath grphx oswin ibm mac win-x sale auto cycle bb
ath 247 1 0 2 0 1 0 1 2 0
grphx 1 301 14 4 4 28 2 1 2 1
oswin 3 54 182 28 43 42 3 0 1 1
ibm 1 25 27 249 36 9 9 1 3 0
mac 0 31 10 25 257 6 7 3 3 0
win-x 0 43 7 1 7 313 1 2 4 3
sale 0 17 5 24 40 9 230 12 13 1
auto 1 7 5 0 1 0 6 324 9 0
cycle 1 0 1 0 0 3 0 13 341 3
bb 0 3 1 0 0 1 0 2 2 336

The cells on the diagonal show that in general, this classifier performs well on the data. Posts to alt.atheism are correctly classified in 247 out of 319 cases, at a rate of roughly 75 percent, whereas a classifier which always guessed at random, choosing uniformly among the newsgroups, would be correct only 5 percent of the time. In comparing the classification rates for the posts to the different computer newsgroups, we see a certain amount of confusion between the different kinds of computers and operating systems, as well as confusion with the listings in the misc.forsale newsgroup however, there is little confusion between the posts to the computer newsgroup and alt.atheism or with rec.sports.baseball.

To evaluate how well this program is able to identify the newsgroup to which a post belongs, we did several runs over the 20 Newsgroups dataset using different tokenization strategies and tabulated the results. The table below compares the correct classification rates for the three different analyzers for all newsgroups.

Newsgroup name Standard Analyzer Lowercase only N-Grams length 4
alt.atheism 0.77 0.80 0.74
comp.graphics 0.77 0.77 0.62
comp.os.ms-windows.misc 0.46 0.52 0.60
comp.sys.ibm.pc.hardware 0.63 0.66 0.59
comp.sys.mac.hardware 0.67 0.66 0.57
comp.windows.x 0.79 0.79 0.57
misc.forsale 0.59 0.60 0.58
rec.autos 0.82 0.82 0.70
rec.motorcycles 0.86 0.86 0.83
rec.sport.baseball 0.85 0.86 0.78
rec.sport.hockey 0.91 0.92 0.86
sci.crypt 0.91 0.89 0.85
sci.electronics 0.61 0.64 0.61
sci.med 0.67 0.70 0.66
sci.space 0.79 0.81 0.74
soc.religion.christian 0.64 0.65 0.78
talk.politics.guns 0.74 0.75 0.76
talk.politics.mideast 0.79 0.81 0.74
talk.politics.misc 0.56 0.55 0.62
talk.religion.misc 0.59 0.59 0.66

All classifiers work pretty well. No one classifier works best for all categories, reflecting the difference in the kinds of vocabulary and writing styles used across the different newsgroups.

N-Grams can be very effective for both regular search and classification, however the storage requirements for n-grams are considerably greater than for word-based tokenization. The size of index over the 20 Newsgroups training data is 7 MB when the Lucene StandardAnalyzer is used, 8 MB when the lowercase only analyzer is used, and 24 MB for the length-4 n-grams.

One Response to “Document Classification with Lucene”

  1. Bob Carpenter Says:

    The classifier Mitzi describes in the post is a flavor of K-nearest-neighbor (KNN) classifier. The idea behind KNN is that you have a way to measure “distance” from an item to classify to training items. You take the K closest training set items to the item you’re classifying and have them vote based on their category. The approach in this blog post takes K = 1 and uses the default scoring function from Lucene, which is itself an enriched TF/IDF approach with lots of knobs and dials to twiddle.

    LingPipe would allow you to build a similar classifier using the KnnClassifier class:

    http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/KnnClassifier.html

    It’s configured with a feature extractor and distance measure on feature vectors.

    An alternative that throws all of the training docs of each category into a single meta-document is available through the TfIdfClassifierTrainer class:

    http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/TfIdfClassifierTrainer.html

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 811 other followers