Archive for the ‘Java’ Category

Document Classification with Lucene

April 10, 2014

As promised in my last post, this post shows you how to use Lucene’s ranked search results and document store to build a simple classifier. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene. The data and source code for this example are contained in the source bundle distributed with this book, which can be downloaded from https://github.com/colloquial/javabook.

A classifier takes some input, which could be just about anything, and returns a classification of the input over a finite number of discrete categories. The goal of text classification is to assign some text to some category. Given a set of texts that carry discrete category labels, we can build a classifier that assigns these labels to new documents based on the similarity of the new documents to the labeled documents.

Example Dataset: The 20 Newsgroups Corpus

For this example, we use the data from the 20 Newsgroups corpus, a set of roughly 20,000 messages posted to 20 different newsgroups.  The classification task is to build a system that can assign a newsgroup name (the category label) to a newsgroup message (the text). Since all the messages in the corpus are already labeled, we can use one set of newsgroup posts to train our classifier and use the other set to test the classifier. To test the system, we compare the newsgroup name generated by the classifier against the name of the newsgroup that the message was posted to.

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.  The names of the newsgroups in this corpus are:

comp.graphics             rec.autos             sci.crypt
comp.os.ms-windows.misc   rec.motorcycles       sci.electronics
comp.sys.ibm.pc.hardware  rec.sport.baseball    sci.med
comp.sys.mac.hardware     rec.sport.hockey      sci.space
comp.windows.x            misc.forsale          alt.atheism
talk.politics.misc        talk.politics.guns    soc.religion.christian
talk.politics.mideast     talk.religion.misc

This dataset is maintained by Jason Rennie. There are no licensing terms listed and the data may be downloaded directly from: http://people.csail.mit.edu/jrennie/20Newsgroups/.  The bundle 20news-bydate.tar.gz is divided into two sets of posts, a training set comprising roughly 60% of the posts to each group and a test set containing the remaining 40%. This version of the corpus doesn’t include cross-posts (messages posted to more than one newsgroup), and the newsgroup-identifying header lines have been removed.  20news-bydate.tar.gz unpacks into two top-level directories: 20news-bydate-train and 20news-bydate-test. Both contain 20 subdirectories of newsgroup data, where the directory name is the same as the newsgroup name, and each of these contains a set of newsgroup messages, roughly 400-600 posts per group for the training sets and roughly 250-400 posts per group for the test sets.

The Program ClassifyNewsgroups.java

The program ClassifyNewsgroups.java constructs the Lucene index from the training data and then uses this index to do a first-best classification of the test data. The classification results are presented as a confusion matrix that shows the full response profile of the classifier. This is a 20-by-20 matrix where both the rows and columns are newsgroup names. Each row represents the performance of the classifier on posts from a given newsgroup, and each column represents the classifier response, so that reading across a row, we can see how many posts were given each category label. The classifier also reports the percentage of posts in the test set that were correctly labeled.

The source bundle contains the source code and the Lucene jar files as well as the file 20news-bydate.tar.gz.  The source code is in the directory src/applucene, which contains an Ant build.xml file. The ClassifyNewsgroups.java program is in the subdirectory src/applucene/src/com/colloquial/applucene. The Ant target classify runs the classification program over the 20 newsgroups data.

Building the Index

We’ve already discussed the method buildIndex in a previous post on Lucene 4. This method builds a Lucene index over all the documents in the training directory. It takes two java.io.File arguments: the directory for the Lucene index and the directory containing the training data newsgroup posts.

void buildIndex(File indexDir, File trainDir) 
    throws IOException, FileNotFoundException {
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf 
        = new IndexWriterConfig(VERSION,mAnalyzer);

    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,iwConf);

    File[] groupsDir = trainDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            Document d = new Document();
            d.add(new StringField("category",
                                  post.group(),Store.YES));
            d.add(new TextField("text",
                                post.subject(),Store.NO));
            d.add(new TextField("text",
                                post.body(),Store.NO));
            indexWriter.addDocument(d);
        }
    }
    indexWriter.commit();
    indexWriter.close();
}

The first four statements in buildIndex create an oal.store.FSDirectory that stores the index as a file-system directory and an IndexWriter which does the work of analyzing documents and adding them to the index. The behavior of the IndexWriter is specified by an IndexWriterConfig object. Documents are indexed via the IndexWriter‘s default Analyzer. At the outset of the program we define a static final Lucene Version enum constant named VERSION. In order to ensure that the tokenization applied to the query matches the tokenization used to build the index, we use the member variable mAnalyzer to hold the reference to a Lucene Analyzer. We supply these as arguments to the IndexWriterConfig constructor. We call the setOpenMode method with the enum constant IndexWriterConfig.OpenMode.CREATE, which causes the index writer to create a new index or overwrite an existing one.

For each newsgroup sub-directory, all posts in that directory are parsed into a NewsPost object, which is a simple domain model of a newsgroup post that holds the newsgroup name, subject, body, and filename (a string of numbers). The NewsPost object is processed into a Lucene Document that has two fields: a StringField named category for the newsgroup name and a TextField named text that holds the message subject and message body.

Using the Index for Classification

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.

The method buildQuery creates a Lucene query from the subject and body text of the test message by tokenizing the message subject and body. It ORs together all terms (up to a maximum of 1,024) into a BooleanQuery using the BooleanClause.Occur value SHOULD.

BooleanQuery buildQuery(String text) throws IOException {
    BooleanQuery termsQuery = new BooleanQuery();
    Reader textReader = new StringReader(text);
    TokenStream tokStream 
        = mAnalyzer.tokenStream("text",textReader);
    try {
        tokStream.reset();
        CharTermAttribute terms = 
            tokStream.addAttribute(CharTermAttribute.class);
        int ct = 0;
        while (tokStream.incrementToken() 
               && ct++ < 1024) {
            termsQuery.
                add(new TermQuery(new Term("text",
                                           terms.toString())), 
                    Occur.SHOULD);
        }
        tokStream.end();
    } finally {
        tokStream.close();
        textReader.close();
    }
    return termsQuery;
}

As we process each newsgroup in the test set, we fill in the confusion matrix, one row at a time. To index the rows and columns of the matrix by newsgroup name, we define a String array of newsgroup names called NEWSGROUPS:

public static final String[] NEWSGROUPS
    = { "alt.atheism",
        "comp.graphics",
        "comp.os.ms-windows.misc",
...

The method testIndex iterates over the sets of newsgroup posts in the test set and tabulates the results in a 20-by-20 confusion matrix.

void testIndex(File indexDir, File testDir)
    throws IOException, FileNotFoundException {
    Directory fsDir = FSDirectory.open(indexDir);
    DirectoryReader reader = DirectoryReader.open(fsDir);
    IndexSearcher searcher = new IndexSearcher(reader);

    int[][] confusionMatrix
        = new int[NEWSGROUPS.length][NEWSGROUPS.length];
    File[] groupsDir = testDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        int rowIdx = Arrays.binarySearch(NEWSGROUPS,groupName);

        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            BooleanQuery termsQuery 
                = buildQuery(post.subject() 
                             + " " + post.body());
            // only get first-best result
            TopDocs hits = searcher.search(termsQuery,1);
            ScoreDoc[] scoreDocs = hits.scoreDocs;
            for (int n = 0; n < scoreDocs.length; n++) {
                ScoreDoc sd = scoreDocs[n];
                int docId = sd.doc;
                Document d = searcher.doc(docId);
                String category = d.get("category");
                // record result in confusion matrix
                int colIdx 
                    = Arrays.binarySearch(NEWSGROUPS,category);
                confusionMatrix[rowIdx][colIdx]++;
            }
        }
        System.out.print(groupName);
        for (int i=0; i<NEWSGROUPS.length; i++) 
             System.out.printf("| %4d ", confusionMatrix[rowIdx][i]);
        System.out.println("|");
    }
}
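
The percent-correct figure mentioned earlier can be read off the diagonal of the confusion matrix. Here is a sketch of that arithmetic; the example program’s actual reporting code may differ:

int correct = 0;
int total = 0;
for (int i = 0; i < NEWSGROUPS.length; ++i) {
    for (int j = 0; j < NEWSGROUPS.length; ++j) {
        total += confusionMatrix[i][j];
        if (i == j)
            correct += confusionMatrix[i][j];
    }
}
System.out.printf("correctly labeled: %.1f%%%n",
                  100.0 * correct / total);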

Evaluating the Classifier and Comparing Tokenization Strategies

The ClassifyNewsgroups program allows the user to specify the analyzer used. This allows us to see how different tokenization strategies affect classification. The three analyzers available are:

  • the Lucene StandardAnalyzer, which chains together a StandardTokenizer that breaks the text into a stream of words, discarding whitespace and punctuation, followed by a StandardFilter, a LowerCaseFilter, and a StopFilter that uses a list of English stop words;
  • an analyzer that chains together just a StandardTokenizer and a LowerCaseFilter;
  • an analyzer that uses only a Lucene NGramTokenizer (from package oal.analysis.ngram) to index the data as 4-grams.
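
As a sketch of what the third, n-gram-based analyzer might look like, here is the anonymous-Analyzer pattern from the Lucene post below with an NGramTokenizer producing 4-grams. The exact NGramTokenizer constructor signature (with or without the Version argument) has changed across Lucene 4.x releases, so treat the arguments here as an assumption rather than the book’s code:

Analyzer nGramAnalyzer = new Analyzer() {
    @Override protected TokenStreamComponents
    createComponents(String fieldName, Reader reader) {
        // index the raw character stream as 4-grams, no filters
        Tokenizer source =
            new NGramTokenizer(VERSION, reader, 4, 4);
        return new TokenStreamComponents(source);
    }
};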

Here is the top-left quadrant of the confusion matrix, showing the results for the first 10 newsgroups when the classifier is run with the Lucene StandardAnalyzer:

        ath  grphx  oswin   ibm   mac  win-x  sale  auto  cycle    bb
ath     247      1      0     2     0      1     0     1      2     0
grphx     1    301     14     4     4     28     2     1      2     1
oswin     3     54    182    28    43     42     3     0      1     1
ibm       1     25     27   249    36      9     9     1      3     0
mac       0     31     10    25   257      6     7     3      3     0
win-x     0     43      7     1     7    313     1     2      4     3
sale      0     17      5    24    40      9   230    12     13     1
auto      1      7      5     0     1      0     6   324      9     0
cycle     1      0      1     0     0      3     0    13    341     3
bb        0      3      1     0     0      1     0     2      2   336

The cells on the diagonal show that in general, this classifier performs well on the data. Posts to alt.atheism are correctly classified in 247 out of 319 cases, a rate of roughly 77 percent, whereas a classifier that always guessed at random, choosing uniformly among the newsgroups, would be correct only 5 percent of the time. In comparing the classification rates for the posts to the different computer newsgroups, we see a certain amount of confusion between the different kinds of computers and operating systems, as well as confusion with the listings in the misc.forsale newsgroup; however, there is little confusion between the posts to the computer newsgroups and alt.atheism or rec.sport.baseball.

To evaluate how well this program is able to identify the newsgroup to which a post belongs, we did several runs over the 20 Newsgroups dataset using different tokenization strategies and tabulated the results. The table below compares the correct classification rates for the three different analyzers for all newsgroups.

Newsgroup name             Standard Analyzer   Lowercase only   N-Grams length 4
alt.atheism                      0.77               0.80              0.74
comp.graphics                    0.77               0.77              0.62
comp.os.ms-windows.misc          0.46               0.52              0.60
comp.sys.ibm.pc.hardware         0.63               0.66              0.59
comp.sys.mac.hardware            0.67               0.66              0.57
comp.windows.x                   0.79               0.79              0.57
misc.forsale                     0.59               0.60              0.58
rec.autos                        0.82               0.82              0.70
rec.motorcycles                  0.86               0.86              0.83
rec.sport.baseball               0.85               0.86              0.78
rec.sport.hockey                 0.91               0.92              0.86
sci.crypt                        0.91               0.89              0.85
sci.electronics                  0.61               0.64              0.61
sci.med                          0.67               0.70              0.66
sci.space                        0.79               0.81              0.74
soc.religion.christian           0.64               0.65              0.78
talk.politics.guns               0.74               0.75              0.76
talk.politics.mideast            0.79               0.81              0.74
talk.politics.misc               0.56               0.55              0.62
talk.religion.misc               0.59               0.59              0.66

All three classifiers work pretty well. No one classifier works best for all categories, reflecting the differences in vocabulary and writing style across the newsgroups.

N-grams can be very effective for both regular search and classification; however, the storage requirements for n-grams are considerably greater than for word-based tokenization. The size of the index over the 20 Newsgroups training data is 7 MB when the Lucene StandardAnalyzer is used, 8 MB when the lowercase-only analyzer is used, and 24 MB for the length-4 n-grams.

Lucene 4 Essentials for Text Search and Indexing

March 8, 2014

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Lucene Overview

Apache Lucene is a search library written in Java. It’s popular in both academic and commercial settings due to its performance, configurability, and generous licensing terms. The Lucene home page is http://lucene.apache.org.

Lucene provides search over documents. A document is essentially a collection of fields. A field consists of a field name that is a string and one or more field values. Lucene does not in any way constrain document structures. Fields are constrained to store only one kind of data, either binary, numeric, or text data. There are two ways to store text data: string fields store the entire item as one string; text fields store the data as a series of tokens. Lucene provides many ways to break a piece of text into tokens as well as hooks that allow you to write custom tokenizers. Lucene has a highly expressive search API that takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score.

The Lucene API consists of a core library and many contributed libraries. The top-level package is org.apache.lucene, which is abbreviated as oal in this article. As of Lucene 4, the Lucene distribution contains approximately two dozen package-specific jars, e.g.: lucene-core-4.7.0.jar, lucene-analyzers-common-4.7.0.jar, lucene-misc-4.7.0.jar. This cuts down on the size of an application at a small cost to the complexity of the build file.

A Lucene Index is an Inverted Index

Lucene manages an index over a dynamic collection of documents and provides very rapid updates to the index as documents are added to and deleted from the collection. An index may store a heterogeneous set of documents, with any number of different fields that may vary by document in arbitrary ways. Lucene indexes terms, which means that Lucene search is search over terms. A term combines a field name with a token. The terms created from the non-text fields in the document are pairs consisting of the field name and field value. The terms created from text fields are pairs of field name and token.

The Lucene index provides a mapping from terms to documents.  This is called an inverted index because it reverses the usual mapping of a document to the terms it contains.  The inverted index provides the mechanism for scoring search results: if a number of search terms all map to the same document, then that document is likely to be relevant.

Here are three entries from an index over part of The Federalist Papers, a collection of 85 political essays which contains roughly 190,000 word instances over a vocabulary of about 8,000 words.  A field called text holds the contents of each essay, which have been tokenized into words, all lowercase, no punctuation.  The inverted index entries for the terms consisting of field name text and tokens abilities, able, and abolish are:

term document
text:abilities 2, 17, 21, 22, 35, 64, 76
text:able 1, 3, 4, 7, 8, 9, 10, 11, 12, 15, 16, 21, 23, 24,
25, 27, 29, 30, 33, 34, 35, 36, 37, 38, 41, 43, 46, 49,
51, 52, 58, 62, 63, 64, 67, 68, 70, 71, 75, 78, 85
text:abolish 10, 14, 26, 31, 39, 49, 45, 70, 71, 78, 81

Note that the document numbers here are Lucene’s internal references to the document.  These ids are not stable; Lucene manages the document id as it manages the index and the internal numbering may change as documents are added to and deleted from the index.

Lucene Versions and Version Numbers

The current Apache Lucene Java release is version 4.7, where 4 is the major version number and 7 is the minor version number.  The Apache odometer rolled over to 4.6.0 in November 2013 and just hit 4.7.0 on February 26, 2014.  At the time that I wrote the Lucene chapter of Text Processing in Java, the current version was 4.5.  Minor version updates maintain backwards compatibility for the given major version; therefore, all the example programs in the book compile and run under version 4.7 as well.

The behavior of many Lucene components has changed over time.  In particular, the index file format is subject to change from release to release as different methods of indexing and compressing the data are implemented.   To address this, the Enum class oal.util.Version was introduced in Lucene 3.  A Version instance identifies the major and minor versions of Lucene. For example, LUCENE_45 identifies version 4.5.  The Lucene version is supplied to the constructor of the components in an application.  As of Lucene 4.7, older versions have been deprecated, so the compiler issues a warning when older versions are specified.  This means that the examples from the book, which specify version 4.5, generate a compiler warning when compiled under version 4.7.

There’s no requirement that all components in an application be of the same version; however, for components used for both search and indexing, it is critical that the Lucene version is the same in the code that is called at indexing time and the code that is called at search time.  A Lucene index knows what version of Lucene was used to create it (via the Lucene Version enum constant).  Lucene is backward compatible with respect to searching and maintaining old index versions because it includes classes that can read and write all versions of the index up through the current release.

Lucene Indexes Fields

Conceptually, Lucene provides indexing and search over documents, but implementation-wise, all indexing and search is carried out over fields.  A document is a collection of fields.  Each field has three parts: name, type, and value.  At search time, the supplied field name restricts the search to particular fields.

For example, a MEDLINE citation can be represented as a series of fields: one field for the name of the article, another field for the name of the journal in which it was published, another field for the authors of the article, a pub-date field for the date of publication, a field for the text of the article’s abstract, and another field for the list of topic keywords drawn from Medical Subject Headings (MeSH).  Each of these fields is given a different name, and at search time, the client could specify that it was searching for authors or titles or both, potentially restricting the search to a date range and a set of journals by constructing search terms for the appropriate fields and values.
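
As a concrete illustration, here is a hypothetical sketch of how such a citation might be packaged as a Lucene Document using the datatype-specific field classes described in the next paragraph; the field names and values are invented for this example.

Document citation = new Document();
citation.add(new TextField("title",
                           "An example article title", Field.Store.YES));
citation.add(new TextField("journal",
                           "Journal of Examples", Field.Store.YES));
citation.add(new TextField("authors",
                           "Smith J; Jones K", Field.Store.YES));
// a bare year is enough for simple date-range filtering in this sketch
citation.add(new IntField("pub-year", 2014, Field.Store.YES));
citation.add(new TextField("abstract",
                           "We describe an example ...", Field.Store.NO));
citation.add(new StringField("mesh",
                             "Information Storage and Retrieval",
                             Field.Store.YES));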

The Lucene API for fields has changed across the major versions of Lucene as the functionality and organization of the underlying Lucene index have evolved.  Lucene 4 introduces a new interface oal.index.IndexableField, which is implemented by class oal.document.Field.  Lucene 4 also introduces datatype-specific subclasses of Field that encapsulate indexing and storage details for common use cases.  For example, to index integer values, use class oal.document.IntField, and to index simple unanalyzed strings (keywords), use oal.document.StringField.  These so-called sugar subclasses are all final subclasses.

The field type is an object that implements oal.index.IndexableFieldType.  Values may be text, binary, or numeric.  The value of a field can be indexed for search or stored for retrieval or both.  The value of an indexed field is processed into terms that are stored in the inverted index.  The raw value of a stored field is stored in the index in a non-inverted manner.  Storing the raw values allows you to retrieve them at search time but may consume substantial space.

Indexing and storage options are specified via setter methods on oal.document.FieldType.  These include the method setIndexed(boolean), which specifies whether or not to index a field, and the method setTokenized(boolean), which specifies whether or not the value should be tokenized.  The method setOmitNorms(boolean) controls how Lucene computes term frequency.  Lucene’s default behavior is to represent term frequency as a proportion by computing the ratio of the number of times a term occurs to the total number of terms in the document, instead of storing a simple term frequency count.  To do this calculation it stores a normalizing factor for each field that is indexed.  Calling method setOmitNorms with value true turns this off and the raw term frequency is used instead.

Some indexing choices are interdependent.  Lucene checks the values on a FieldType object at Field construction time and throws an IllegalArgumentException if the FieldType has inconsistent values.  For example, a field must be either indexed or stored or both, so indexed and/or stored must be true.  If indexed is false, then stored must be true and all other indexing options should be set to false.  The following code fragment defines a custom FieldType and then creates a Field of this type:

FieldType myFieldType = new FieldType();
myFieldType.setIndexed(true);
myFieldType.setOmitNorms(true);
myFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
myFieldType.setStored(false);
myFieldType.setTokenized(true);
myFieldType.freeze();
Field myField = new Field("field name",
                          "field value",
                          myFieldType);

The source code of the subclasses of Field provides good examples of how to define a FieldType.

Analyzing Text into Tokens

Search and indexing over text fields require processing text data into tokens.  The package oal.analysis contains the base classes for tokenizing and indexing text.  Processing may consist of a sequence of transformations, e.g., whitespace tokenization, case normalization, stop-listing, and stemming.

The abstract class oal.analysis.TokenStream breaks the incoming text into a sequence of tokens that are retrieved using an iterator-like pattern.  TokenStream has two subclasses: oal.analysis.Tokenizer and oal.analysis.TokenFilter.   A Tokenizer takes a java.io.Reader as input whereas a TokenFilter takes another oal.analysis.TokenStream as input.  This allows us to chain together tokenizers such that the initial tokenizer gets its input from a reader and the others operate on tokens from the preceding TokenStream in the chain.

An oal.analysis.Analyzer supplies the indexing and searching processes with TokenStreams on a per-field basis.  It maps field names to tokenizers and may also supply a default analyzer for unknown field names.  Lucene includes many analysis modules that provide concrete implementations of different kinds of analyzers.  As of Lucene 4, these modules are bundled into separate jarfiles.  There are several dozen language-specific analysis packages, from oal.analysis.ar for Arabic to oal.analysis.tr for Turkish.  The package oal.analysis.core provides several general-purpose analyzers, tokenizers, and tokenizer factory classes.
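
For example, one way to map field names to different analyzers is the PerFieldAnalyzerWrapper from the analyzers-common module (package oal.analysis.miscellaneous). This is a quick sketch, not code from the book:

Map<String,Analyzer> fieldAnalyzers = new HashMap<String,Analyzer>();
// treat the category field as a single keyword token
fieldAnalyzers.put("category", new KeywordAnalyzer());
// fall back to the standard analyzer for every other field
Analyzer analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_45),
                                fieldAnalyzers);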

The abstract class oal.analysis.Analyzer contains methods used to extract terms from input text.  Concrete subclasses of Analyzer must override the method createComponents, which returns an object of the nested class TokenStreamComponents that defines the tokenization process and provides access to the initial and final components of the processing pipeline.  The initial component is a Tokenizer that handles the input source.  The final component is an instance of TokenFilter and is the TokenStream returned by the method Analyzer.tokenStream(String,Reader).  Here is an example of a custom Analyzer that tokenizes its inputs into individual words with all letters lowercase.

Analyzer analyzer = new Analyzer() { 
    @Override protected TokenStreamComponents 
    createComponents(String fieldName, Reader reader) {
        Tokenizer source = 
            new StandardTokenizer(VERSION,reader);
        TokenStream filter = 
            new LowerCaseFilter(VERSION,source);
        return new TokenStreamComponents(source, filter);
    }
};

Note that the constructors for the oal.analysis.standard.StandardTokenizer and oal.analysis.core.LowerCaseFilter objects require a Version argument.  Further note that package oal.analysis.standard is distributed in the jarfile lucene-analyzers-common-4.x.y.jar, where x and y are the minor version and release number.

Indexing Documents

Document indexing consists of first constructing a document that contains the fields to be indexed or stored, then adding that document to the index.  The key classes involved in indexing are oal.index.IndexWriter, which is responsible for adding documents to an index, and oal.store.Directory, which is the storage abstraction used for the index itself.  Directories provide an interface that’s similar to an operating system’s file system.  A Directory contains any number of sub-indexes called segments.  Maintaining the index as a set of segments allows Lucene to rapidly update and delete documents from the index.

The following example shows how to create a Lucene index given a directory containing a set of data files.  The data in this example is taken from the 20 Newsgroups corpus, a set of roughly 20,000 messages posted to 20 different newsgroups.  This code is excerpted from the example program in section 7.11 of Text Processing in Java that shows how to do document classification using Lucene.  We’ll get to document classification in a later post.  Right now we just want to go over how to build the index.

void buildIndex(File indexDir, File dataDir) 
    throws IOException, FileNotFoundException {

    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf 
        = new IndexWriterConfig(VERSION,mAnalyzer);
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,iwConf);

    File[] groupsDir = dataDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            Document d = new Document();
            d.add(new StringField("category",
                                  post.group(),Store.YES));
            d.add(new TextField("text",
                                post.subject(),Store.NO));
            d.add(new TextField("text",
                                post.body(),Store.NO));
            indexWriter.addDocument(d);
        }

    }
    indexWriter.commit();
    indexWriter.close();
}

The method buildIndex walks over the training data directory and parses the newsgroup messages into a NewsPost object, which is a simple domain model of a newsgroup post, consisting of the newsgroup name, subject, body, and filename (a string of numbers).  We treat each post as a Lucene document consisting of two fields: a StringField named category for the newsgroup name and a TextField named text that holds the message subject and message body.  Method buildIndex takes two java.io.File arguments: the directory for the Lucene index and a directory for one set of newsgroup posts where the directory name is the same as the Usenet newsgroup name, e.g.: rec.sport.baseball.

The first four statements in buildIndex create an oal.store.FSDirectory that stores the index as a file-system directory and an IndexWriter which does the work of analyzing documents and adding them to the index.  The behavior of the IndexWriter is specified by an IndexWriterConfig object.  Documents are indexed via the IndexWriter‘s default Analyzer.  At the outset of the program we define a static final Lucene Version enum constant named VERSION and a Lucene Analyzer named mAnalyzer.  We supply these as arguments to the IndexWriterConfig constructor.  We call the setOpenMode method with the enum constant IndexWriterConfig.OpenMode.CREATE, which causes the index writer to create a new index or overwrite an existing one.

A for loop iterates over all files in the data directory.  Each file is first parsed into a Java NewsPost, then a corresponding Lucene document is created and fields are added to the document.  This document has two fields: category and text.  Newsgroup names are stored as simple unanalyzed strings.  The field named text stores both the message subject and message body.  A document may have multiple values for a given field.  Search over that field will be over all values for that field; however, phrase search over that field will not match across different field values.  When the IndexWriter.addDocument(Document) method is called, the document is added to the index and the constituent fields are indexed and stored accordingly.

The final two statements invoke the IndexWriter‘s commit and close methods, respectively.  The commit method writes all pending changes to the directory and syncs all referenced index files so that the changes are visible to index readers.  Lucene now implements a two-phase commit so that if the commit succeeds, the changes to the index will survive a crash or power loss.  A commit may be expensive, and part of performance tuning is determining when to commit as well as how often and how aggressively to merge the segments of the index.

Document Search and Search Ranking

The Lucene search API takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score.  Lucene provides a highly configurable hybrid form of search that combines exact boolean searches with softer, more relevance-ranking-oriented vector-space search methods.  All searches are field specific, because Lucene indexes terms and a term is composed of a field name and a token.

Search Queries

Lucene specifies a language in which queries may be expressed.  For instance, computer NOT java produces a query that specifies the term computer must appear in the default field and the term java must not appear.  Queries may specify fields, as in text:java, which requires the term java to appear in the text field of a document.

The query syntax includes basic term and field specifications, modifiers for wildcard, fuzzy, proximity, or range searches, and boolean operators for requiring a term to be present, absent, or for combining queries with logical operators.  Finally, sub-queries may be boosted by providing numeric values to raise or lower their prominence relative to other parts of the query. The full syntax specification is in the package level javadoc for package oal.queryparser.classic.  A bonus feature of Text Processing in Java is a quick one-page reference guide to Lucene’s search query syntax (see Figure 7.2).

Queries may be constructed programmatically using the dozen or so built-in implementations of the oal.search.Query abstract base class.  The most basic kind of query is a search for a single token on a single field, i.e., a single term.  This query is implemented in Lucene’s oal.search.TermQuery class.  A term query is constructed from a Term object, which is constructed from a field name and text for the term, both specified as strings.
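
For example, the query computer NOT java from above could be assembled programmatically along these lines (a minimal sketch using the classes named in this section):

BooleanQuery query = new BooleanQuery();
// the term computer must appear in the text field ...
query.add(new TermQuery(new Term("text", "computer")),
          BooleanClause.Occur.MUST);
// ... and the term java must not appear
query.add(new TermQuery(new Term("text", "java")),
          BooleanClause.Occur.MUST_NOT);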

Search Scoring

Lucene’s default search scoring algorithm weights results using TF-IDF, term frequency-inverse document frequency.  Term frequency means that high-frequency terms within a document have higher weight than do low-frequency terms.  Inverse document frequency means that terms that occur frequently across many documents in a collection of documents are less likely to be meaningful descriptors of any given document in a corpus and are therefore down-weighted.  This filters out common words.

As of Lucene 4, the API provides alternative scoring algorithms and a pluggable architecture that allows developers to build their own custom scoring models.
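
For example, the package oal.search.similarities ships with alternatives such as BM25Similarity, and an index searcher can be pointed at one of them. This is just a sketch; for scoring models that also depend on index-time statistics, the same Similarity should be set on the IndexWriterConfig as well.

IndexSearcher searcher = new IndexSearcher(reader);
// swap out the default TF-IDF scoring for BM25 at search time
searcher.setSimilarity(new BM25Similarity());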

Search Example

The following program illustrates the basic sequence of search operations.  This program is a simplified version of the Lucene search example program included with the book.  When run from the command line, this program takes three arguments: the path to a Lucene index; a query string; the maximum number of results to return.  It runs the specified query against the specified index and prints out the rank, score, and internal document id for all documents that match the query.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class LuceneSearch {
    public static void main(String[] args) 
        throws ParseException, CorruptIndexException,
               IOException {

        File indexDir = new File(args[0]);
        String query = args[1];
        int maxHits = Integer.parseInt(args[2]);

        Directory fsDir = FSDirectory.open(indexDir);
        DirectoryReader reader = DirectoryReader.open(fsDir);
        IndexSearcher searcher = new IndexSearcher(reader);

        Analyzer stdAn 
            = new StandardAnalyzer(Version.LUCENE_45);
        QueryParser parser 
            = new QueryParser(Version.LUCENE_45,"text",stdAn);
        Query q= parser.parse(query);

        TopDocs hits = searcher.search(q,maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;
        System.out.println("hits=" + scoreDocs.length);
        System.out.println("Hits (rank,score,docId)");
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            float score = sd.score;
            int docId = sd.doc;
            System.out.printf("%3d %4.2f %d\n",
                              n, score, docId);
        }
        reader.close();
    }
}

The main method is declared to throw a Lucene corrupt index exception if the index isn’t well formed, a Lucene parse exception if the query isn’t well formed, and a general Java I/O exception if there is a problem reading from or writing to the index directory.  It starts off by reading in command-line arguments, then it creates a Lucene directory, index reader, index searcher, and query parser, and then it uses the query parser to parse the query.

Lucene uses instances of the aptly named IndexReader to read data from an index; in this example, we use an instance of the class oal.index.DirectoryReader.  The oal.search.IndexSearcher class performs the actual search.  Every index searcher wraps an index reader to get a handle on the indexed data.  Once we have an index searcher, we can supply queries to it and enumerate results in order of their score.

The class oal.queryparser.classic.QueryParserBase provides methods to parse a query string into an oal.search.Query object, throwing an oal.queryparser.classic.ParseException if the query string cannot be parsed.  All tokens are analyzed using the specified Analyzer.  This example is designed to search an index for Lucene version 4.5 that was created using an analyzer of class oal.analysis.standard.StandardAnalyzer and that contains a text field named text.

It is important to use the same analyzer in the query parser as is used in the creation of the index.  If they don’t match, queries that should succeed will fail whenever the two analyzers produce different tokenizations of the same input.  For instance, if we apply stemming to the contents of the field text during indexing, reducing the word codes to code, but we don’t apply stemming to the query, then a search for the word codes in the field text will fail. The search term is text:codes but the index contains only the term text:code.  Note that package oal.queryparser.classic is distributed in the jarfile lucene-queryparser-4.x.y.jar.

The actual search is done via a call to the IndexSearcher‘s search method, which takes two arguments: the query and an upper bound on the number of hits to return.  It returns an instance of the Lucene class oal.search.TopDocs.  The TopDocs result provides access to an array of search results.  Each result is an instance of the Lucene class oal.search.ScoreDoc, which encapsulates a document reference with a floating point score.  The array of search results is sorted in decreasing order of score, with higher scores representing better matches.  For each ScoreDoc object, we get the score from the public member variable score.  We get the document reference number (Lucene’s internal identifier for the doc) from the public member variable doc.  The document identifier is used to retrieve the document from the searcher.  Here, we’re just using the id as a diagnostic to illustrate the basics of search and so we just print the number, even though it is only an internal reference.  Finally, we close the IndexReader.

Discussion and Further Reading

This post tries to cover the essentials of Lucene 4 in a very short amount of space.  To do so, it contains only minimal examples, and links to the Lucene javadocs substitute for detailed explanations of how classes and methods behave.  Putting these code fragments together into a full application is left as an exercise to the reader.

Back when Lucene was at version 2.4 I wrote my first blog post titled “Lucene 2.4 in 60 Seconds.”  A more accurate title would have been “Lucene 2.4 in 60 Seconds or More.”  Likewise, a more accurate title for this post would be “The Essential Essentials of Text Search and Indexing with Lucene 4,” but that’s just not very snappy.  For more on the essentials of search and indexing with Lucene, please check out Chapter Seven of Text Processing in Java (and all the other chapters as well).

To cover all of Lucene would take an entire book, and there are many good ones out there. My favorite is Lucene in Action, Second Edition, which, alas, only covers version 3.

New Book: Text Processing in Java

January 31, 2014

I’m pleased to announce the publication of Text Processing in Java!

This book teaches you how to master the subtle art of multilingual text processing and prevent text data corruption.  Data corruption is the all-too-common problem of words that are garbled into strings of question marks, black diamonds, or random glyphs.  In Japanese this is called mojibake (“character change”), written 文字化け, but on your browser it might look like this: �����  or this: 文字化ã‘. When this happens, pinpointing the source of error can be surprisingly difficult and time-consuming. The information and example programs in this book make it easy.
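
As a small taste of the kind of problem the book diagnoses, here is a tiny demo of how mojibake can arise when UTF-8 bytes are decoded with the wrong character encoding; the exact garbled output you see depends on the decoding charset and on what your terminal does with the result.

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        String original = "文字化け";
        byte[] utf8Bytes = original.getBytes("UTF-8");
        // decode the UTF-8 bytes with the wrong charset
        String garbled = new String(utf8Bytes, "windows-1252");
        System.out.println(original + " -> " + garbled);
    }
}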

This book also provides an introduction to natural language processing using Lucene and Solr. It covers the tools and techniques necessary for managing large collections of text data, whether they come from news feeds, databases, or legacy documents. Each chapter contains executable programs that can also be used for text data forensics.

Topics covered:

  • Unicode code points
  • Character encodings from ASCII and Big5 to UTF-8 and UTF-32LE
  • Character normalization using International Components for Unicode (ICU)
  • Java I/O, including working directly with zip, gzip, and tar files
  • Regular expressions in Java
  • Transporting text data via HTTP
  • Parsing and generating XML, HTML, and JSON
  • Using Lucene 4 for natural language search and text classification
  • Search, spelling correction, and clustering with Solr 4

Other books on text processing presuppose much of the material covered in this book. They gloss over the details of transforming text from one format to another and assume perfect input data. The messy reality of raw text will have you reaching for this book again and again.

Buy Text Processing in Java on Amazon

Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

January 10, 2014

Code Camp is canceled. We are late on delivering a LingPipe recipes book to our publisher and that will have to be our project for March. But we could have testers/reviewers come and hang out. Less fun I think.

Apologies.
Breck

—————————

 

Dates: To be absolutely clear, the dates are 3/2/2014 to 3/31/2014 in Driggs, Idaho. You can work remotely; we will also be doing setup work beforehand.

New: We have setup a github repository. URL is https://github.com/watsonclone/jeopardy-.git

————————

Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City, Utah; this year it is Driggs, Idaho, for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion-dollar effort. So how about assembling a scrappy bunch of developers and seeing how close we can come to recreating the Jeopardy beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. House has 3 bedrooms if you want to come stay. Prior arrangements must be made. House is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL so maybe that license or we could just apache license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We had a very fun time last year. We worked very hard on an email app that didn’t quite get launched, but the lesson learned was to start with the project defined.

If you are interested in participating let us know at watsonclone@lingpipe.com. Let us know your background, what you want to do, whether you expect to stay with us, etc. No visa situations please and we don’t have any funds to support folks. Obviously we have limited space physically and mentally so we may say no, but we will do our best to be inclusive. Step 1–transcribe some Jeopardy shows.

Ask questions in the comments so all can benefit. Check comments before asking questions pls. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….

Breck

The Latin1 Transcoding Trick for Ant

April 21, 2013

A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.

Another situation where this trick comes in real handy is dealing with the way that Ant handles the output of the tasks it runs.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into a problem that has given rise to many StackOverflow queries: when you use an Ant task to run a program instrumented with print statements to standard out, Ant replaces non-ASCII characters with a question mark. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here is a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings. But do you know about Ant’s IO system? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStreams.

Using the source code for Ant 1.9.0, in class org.apache.tools.ant.Main we see that System.in, System.out, and System.err are all reassigned to Ant’s DemuxInputStream and DemuxOutputStream, which extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:

public class GetDefaultEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));
    }
}

On my Mac running OS X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1, which differs from ISO-8859-1 just enough to be noticeable).

At the point where Ant’s DemuxInputStream reads in the bytes sent to System.out by a Java task, any character outside of the default character set is replaced by a question mark. When Latin1 is the default encoding, all bytes are valid Latin1 characters and their Unicode code point value is the same as the byte value so the bytes from the Java task pass through the Ant IO system unchanged.
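
A few lines of standalone Java (not part of Ant) are enough to convince yourself of the byte-for-byte round trip:

import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; ++i)
            bytes[i] = (byte) i;
        // decode as Latin1, then encode back to bytes
        String chars = new String(bytes, "ISO-8859-1");
        byte[] roundTrip = chars.getBytes("ISO-8859-1");
        System.out.println("unchanged="
                           + Arrays.equals(bytes, roundTrip));
    }
}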

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players, all of whom play the only position that fills the agent role in a very popular play known as a "pass".] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

Standard Output Ruins Everything!

May 9, 2012

The title is a paraphrase of Dirk Eddelbuettel on the Rcpp mailing list (Rcpp is an interface tool for R and C++), but the lesson also applies to Java.

Don’t Write to Standard Output!

One of the first lessons of writing an API (as opposed to something that only runs from the command line) is that you never ever ever write to standard output in an API.

The reasons are that (1) you never know how someone might configure standard output around you (it’s resettable in Java), and (2) you never know what context your API will run in: it may be running in a servlet or in a Swing GUI, where standard output is invisible to your user (but does clog up the logs and the shell from which the Swing GUI was invoked).

So What do You do Instead?

1. Throw an exception if there’s some kind of error. See my previous post, “When to catch, pass, or throw exceptions?”

Of course you have to be careful here about the context things are running in, too, especially if you try throwing a runtime exception instead of a checked exception. This is why the Google style guide for C++ forbids exceptions!

2. If there’s no error, the common advice is to write to a logger. We didn’t do that in LingPipe because we didn’t want any dependencies on other code built into LingPipe. We also didn’t want every user of LingPipe to have to configure a logger like log4j or Java’s built-in logger. The other issue with loggers is that they have one top-level config, so it gets confusing with multiple packages running if you use high-level config in the properties files (I know you can configure per-package, but people often don’t and get surprised).

Alternatively, you can write messages into something like a string builder. Then they can be sent to whatever output source you want.

The com.aliasi.io.Reporter class may look like a standard logger, but it’s only configurable programmatically and is set by default to just accumulate results. Note how it’s passed into logistic regression fitting, not just there by default in the background.

A second alternative is to pass an OutputStream into the function that might want to write and write to that. In a command-line setting it can be set to the standard output. In an embedded context, it might be set to a byte array output stream wrapped in a PrintStream, which will just accumulate the results until they can be dealt with. For instance, they might be written into a servlet output stream for use in a web app.
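
Here is a minimal sketch of that second alternative; report is a hypothetical method name, not LingPipe API:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class Reporting {
    // the caller decides where the output goes
    static void report(PrintStream out, double loss) {
        out.printf("loss=%.4f%n", loss);
    }

    public static void main(String[] args) {
        // command-line context: standard output is fine
        report(System.out, 0.1234);

        // embedded context: accumulate into a buffer and hand it
        // off to a servlet response, logger, etc.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        report(new PrintStream(buf), 0.1234);
        String accumulated = buf.toString();
    }
}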

Buffering Exceptions

August 16, 2011

OK, now that it’s just us API fanatics, imagine you need to call a stream read in an interface method implementation, but the interface doesn’t declare an I/O exception. For concreteness, suppose we have an interface:

interface Foo {
    public void bar();
}

and we want to implement a Foo that calls streaming I/O, as in:

public class IOFoo implements Foo {
    public void bar() {
        InputStream in = ...;
        ...
        int numBytes = in.read(buf); // throws IOException
    }
}

A common example is the Java interface Runnable, which has a run() method which does not throw exceptions. Given that this is the basis for threads, the issue comes up all the time.

This won’t compile, because the call to the read method throws an I/O exception, which is a checked exception that is not declared on the bar() method. In fact, it can’t be declared on the bar() method, because the interface doesn’t declare an I/O exception. What I too often see is this broken workaround:

        int numBytes = -1;
        try {
            numBytes = in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

What’s wrong here? There’s no way for anyone calling the bar() method to recover the exception and handle it. You could document that it throws a RuntimeException, but all kinds of problems can lead to a runtime exception being thrown, and there’s no way to tell which threw it. (Yes, you could encode it in the message, but that’s an even worse idea, as it’s very hard to maintain and code against.)

An alternative approach is to follow the pattern used by Java’s PrintWriter class and buffer the exception then make it available to the caller. Here’s a simplified example that buffers a single exception:

public class IOFoo implements Foo {
    private IOException mException;
    public void bar() {
        InputStream in = ...;
        ...
        try {
            in.read(); // throws IOException
        } catch (IOException e) {
            mException = e;
            return;
        }
    }
    public IOException getIOException() {
        return mException;
    }
}

That way, a caller can do this:

IOFoo f = new IOFoo(...);
f.bar();
if (f.getIOException() != null)
    throw f.getIOException();

Now the context around the call to IOFoo’s bar can catch the IOException. For instance, if this is in a method call, you can just declare the method to throw the checked IOException. Note that we need to declare f as an IOFoo, not just a Foo, because we call its getIOException method.

Java’s PrintWriter keeps a list of exceptions that have been raised in calls to its methods. That’s because it keeps going after exceptions, assuming the client would rather they be silently ignored. I don’t think that’s such a great idea, for the obvious reason that there’s no telling what’s going to happen when I/O exceptions are just ignored.

It would be straightforward to set a flag that stops any further action once an exception has been raised.
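
For example, a variant of IOFoo along the following lines (assuming the Foo interface and the java.io imports from the examples above) just checks the buffered exception before doing any more work:

public class StopOnErrorFoo implements Foo {
    private final InputStream mIn;
    private IOException mException;
    public StopOnErrorFoo(InputStream in) {
        mIn = in;
    }
    public void bar() {
        if (mException != null) return; // earlier failure, do nothing
        try {
            mIn.read(); // throws IOException
        } catch (IOException e) {
            mException = e;
        }
    }
    public IOException getIOException() {
        return mException;
    }
}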

The case I was actually using this for was writing a static method to serialize a tokenized language model. Visiting the n-grams requires an IntArrayHandler whose handle method is not declared to throw any checked exceptions. I wanted any embedded exception to be buffered and thrown by the top-level static method.

“Sentiment Takes a LONG Time”, Said the Logistic Regression Classifier

August 12, 2011

Hello again, fellow LingPipers!  It is I, yet again, Sam The Intern.  In my last (and first) post I promised the masses that I would return with a study similar to the last one that I conducted, but this time doing the non-trivial case of sentiment.  It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream.  As my titular character, the logistic regression classifier so concisely said before, sentiment takes a long time!  And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.

As you may remember from my first post, I absolutely love to annotate.  However, annotating for sentiment takes much more time than language identification, because the annotator needs to internalize and judge the text, not just glance at it.  Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern who has been product-conditioned to Coca-Cola but has no time to do anything about it (reading 1,000 tweets about how much people love Coke will do that to you).

I was able to start the study, with small amounts of success.  The following confusion matrix is the result of  a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”).  The top of the chart is the reference data which I annotated, and the side is the response data that the classifier provided.  The categories are: w = neutral, e = positive, q = negative.

                   reference
              w       e      q
response  w   1244    239    50
          e   313     312    23
          q   199     45     23
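Reading off the diagonal, that works out to (1244 + 312 + 23) / 2448 ≈ 64.5% of the tweets getting the same label I gave them, with most of the correct calls landing in the neutral category.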

I wish that I had more time to spend on the sentiment classification problem.  Perhaps I will be back next summer, to continue the effort to teach computers good from bad.  Until then, I will be at Rochester, college-ing!  And drinking massive quantities of Coke…

“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

July 25, 2011

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously air-conditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performances of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try to explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case, English and non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s question comes in: is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke”, “Clinton”, etc.)? What we found is that, depending on the training and test data, either classifier may do better at language identification.

Data:

Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved from Twitter, and duplicates were removed by rejecting any tweet whose Jaccard distance to an already accepted tweet was .5 or less. The effect of this is that any tweet with more than 50% token (word) overlap with any other tweet in the accepted data collection was rejected from the corpus (our collection of tweets to be used in the experiment).
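Here’s a rough, self-contained sketch of that kind of filter; the whitespace tokenization is an assumption about the actual pipeline, which may tokenize differently.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JaccardDedupe {

    // Jaccard distance = 1 - |intersection| / |union| over the tweets' token sets.
    static double jaccardDistance(String s1, String s2) {
        Set<String> tokens1 = new HashSet<String>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> tokens2 = new HashSet<String>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<String>(tokens1);
        intersection.retainAll(tokens2);
        Set<String> union = new HashSet<String>(tokens1);
        union.addAll(tokens2);
        return 1.0 - (double) intersection.size() / union.size();
    }

    // Keep a tweet only if it is more than minDistance away from every tweet kept so far.
    static List<String> dedupe(List<String> tweets, double minDistance) {
        List<String> accepted = new ArrayList<String>();
        for (String tweet : tweets) {
            boolean duplicate = false;
            for (String kept : accepted) {
                if (jaccardDistance(tweet, kept) <= minDistance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate)
                accepted.add(tweet);
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "I love Lady Gaga",
            "I love Lady Gaga so much",
            "totally different tweet about Coke");
        System.out.println(dedupe(tweets, 0.5)); // the near-duplicate second tweet is dropped
    }
}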

After that came my absolute favorite part: annotation. I annotated 500 tweets for being written in English (‘e’) or non-English (‘x’), for 5 queries (for a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries two weeks later and annotated these as well to form the second epoch of data (for a cumulative total of 5,000 tweets). We checked for duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide whether they were identical.

To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data. She annotated 400 tweets for each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. With one annotator serving as truth and the other as the response, the agreement rate was 95% precision and 97% recall, ±.01% (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes, the annotations were adjudicated, resulting in complete agreement.
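Here’s a minimal sketch of that agreement computation, treating ‘e’ (English) as the positive label and one annotator’s labels as truth; the toy label arrays are made up for illustration.

public class AgreementRate {

    public static void main(String[] args) {
        // Toy stand-ins for the two annotators' labels on the same tweets.
        char[] truth =    {'e', 'e', 'x', 'e', 'x', 'e'};
        char[] response = {'e', 'e', 'x', 'x', 'x', 'e'};

        int truePositive = 0, falsePositive = 0, falseNegative = 0;
        for (int i = 0; i < truth.length; ++i) {
            if (response[i] == 'e' && truth[i] == 'e') ++truePositive;
            else if (response[i] == 'e' && truth[i] == 'x') ++falsePositive;
            else if (response[i] == 'x' && truth[i] == 'e') ++falseNegative;
        }
        double precision = truePositive / (double) (truePositive + falsePositive);
        double recall = truePositive / (double) (truePositive + falseNegative);
        System.out.println("precision=" + precision + " recall=" + recall);
    }
}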

Process:

Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier actually sees), we trained and tested on the data with nGram sizes 1 through 15.

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.

This procedure was repeated for each of the five Twitter queries.
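The LingPipe evaluator reports the one-versus-all and micro-averaged numbers directly; purely to unpack what the micro-averaged f-measure is measuring, here is a rough stand-alone sketch computed from a confusion matrix (the class and the toy counts are made up, not part of the experiment).

public class MicroAveragedF {

    // Micro-averaged f-measure from a confusion matrix whose rows are
    // reference categories and whose columns are response categories.
    static double microF(int[][] confusion) {
        int truePositive = 0, falsePositive = 0, falseNegative = 0;
        for (int i = 0; i < confusion.length; ++i) {
            for (int j = 0; j < confusion[i].length; ++j) {
                if (i == j) {
                    truePositive += confusion[i][j];
                } else {
                    falseNegative += confusion[i][j]; // missed for reference category i
                    falsePositive += confusion[i][j]; // wrongly assigned to category j
                }
            }
        }
        double precision = truePositive / (double) (truePositive + falsePositive);
        double recall = truePositive / (double) (truePositive + falseNegative);
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        int[][] toy = { { 450, 50 }, { 30, 470 } }; // e.g. English vs. non-English counts
        System.out.println("micro-averaged F = " + microF(toy));
    }
}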

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, for the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the candidate training data consisted of all the data minus the 500 tweets to be tested on and the 500 tweets retrieved at the earlier epoch with that same query. This pool, which contained a total of 4,000 tweets for any given test set, was put into a list. Five hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested on the corpus using the same nGram looping structure as in the customized training process.
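Here’s a minimal sketch of that random selection step; the class and variable names are just for illustration, and the pooled tweets stand in for the cross-query data described above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainingSampler {

    // Randomly select numTraining tweets from the pooled cross-query data.
    static List<String> sampleTraining(List<String> pool, int numTraining, long seed) {
        List<String> shuffled = new ArrayList<String>(pool);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed for reproducibility
        return new ArrayList<String>(shuffled.subList(0, Math.min(numTraining, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> pool = new ArrayList<String>();
        for (int i = 0; i < 4000; ++i)
            pool.add("tweet-" + i); // stand-ins for the pooled tweets
        System.out.println(sampleTraining(pool, 500, 42L).size() + " training tweets selected");
    }
}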

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.
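Schematically, the nesting looks something like the sketch below; trainAndEvaluate() is a hypothetical placeholder for the train/test/evaluate steps described above, and the 500-tweet increments are an assumption rather than the actual step size.

public class CorpusSizeSweep {

    // Hypothetical placeholder: train on corpusSize tweets with the given nGram
    // size, test, and return the resulting micro-averaged f-measure.
    static double trainAndEvaluate(int corpusSize, int nGram) {
        return 0.0; // stand-in
    }

    public static void main(String[] args) {
        for (int corpusSize = 500; corpusSize <= 4000; corpusSize += 500) {
            double bestF = 0.0;
            for (int nGram = 1; nGram <= 15; ++nGram)
                bestF = Math.max(bestF, trainAndEvaluate(corpusSize, nGram));
            System.out.println(corpusSize + "\t" + bestF); // best f-measure vs. corpus size
        }
    }
}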

Results:

Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-English. However, on the query “Mitsubishi”, the non-English classifications performed better. Of the four queries for which English performed better, two (“Clinton” (Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for English than for non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic

Figure 3 - Gaga Generic

Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%.

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was run again 50 times at nGram size 4. The sample means of these runs’ f-measures were graphed with 95% confidence error bars (Figures 11 and 12).

Figure 11 - Non-English Classifier Comparison

Figure 12 - English Classifier Comparison

Part 3

When we graphed the generic classifiers’ performance versus the size of the training data, we found that the English classifications’ F-Measure increased slightly with more training data, while the non-English classifications’ F-Measure increased dramatically (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance

Discussion:

Parts 1+2

We came to the conclusion that neither the generic nor the customized training regimen is necessarily better than the other, because each performed better in different circumstances. We believe that the non-English classifications for “Mitsubishi” scored higher than the English classifications because the non-English category was larger than the English category and also very coherent (mostly Asian scripts, so that each category contains either Roman or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.

Part 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad non-English category, it is difficult for the classifier to identify what makes a tweet non-English at low corpus sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

The data balance for each test is as follows (for the generic classifiers, the balance of the 4,000-tweet list is shown):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

