Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.
I’ve been interning at LingPipe’s luxuriously air-conditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performance of generic and customized classifiers for language identification.
And then he told me this is the trivial case.
Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try to explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case, English and Non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s problem comes in: is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke”, “Clinton”, etc.)? What we found is that either classifier may do better at language identification, depending on the training and test data.
Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved from Twitter, and near-duplicates were removed from this data by requiring a Jaccard similarity of .5 or less (equivalently, a Jaccard distance of at least .5) between any two tweets. The effect of this is that any tweet with more than 50% token (word) overlap with any other tweet in the accepted data collection was rejected from the corpus (our collection of tweets to be used in the experiment).
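For the curious, the overlap filter can be sketched in a few lines of plain Java. This is only an illustration, not the code we actually ran: the class and method names are made up, and splitting on whitespace after lowercasing is an assumption about tokenization. It works with Jaccard similarity (shared tokens divided by total distinct tokens), which is one minus the Jaccard distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class JaccardDedup {

    // Assumption: lowercase and split on whitespace; the real tokenizer may differ.
    static Set<String> tokens(String tweet) {
        String trimmed = tweet.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccardSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty token sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Keeps a tweet only if it overlaps no already-accepted tweet by more than maxSimilarity.
    static List<String> dedupe(List<String> tweets, double maxSimilarity) {
        List<String> accepted = new ArrayList<>();
        List<Set<String>> acceptedTokens = new ArrayList<>();
        for (String tweet : tweets) {
            Set<String> candidate = tokens(tweet);
            boolean nearDuplicate = false;
            for (Set<String> seen : acceptedTokens) {
                if (jaccardSimilarity(candidate, seen) > maxSimilarity) {
                    nearDuplicate = true;
                    break;
                }
            }
            if (!nearDuplicate) {
                accepted.add(tweet);
                acceptedTokens.add(candidate);
            }
        }
        return accepted;
    }
}
```

With a threshold of 0.5, "gaga is great tonight" and "gaga is great today" share 3 of 5 distinct tokens (similarity 0.6), so the second one would be thrown out. Note that the result depends on the order in which tweets are seen, since each tweet is only compared against the ones already accepted.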
After that came my absolute favorite part: annotation. I annotated 500 tweets for being written in English (‘e’) or non-English (‘x’), for 5 queries (for a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries two weeks later, and annotated these as well to form the second epoch of data (for a cumulative total of 5,000 tweets). We checked for duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide if they were identical.
To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data. She annotated 400 tweets of each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01% with one annotator serving as truth and the other the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated resulting in complete agreement.
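Here is a rough sketch of what such an agreement check looks like, treating one annotator's labels as truth and the other's as the response. It is not the program we used; the class and method names are hypothetical, and the labels are the ‘e’ and ‘x’ characters from the annotation step.

```java
class AnnotatorAgreement {

    // Returns {precision, recall} for one category, with `truth` as the reference
    // annotator and `response` as the annotator being scored.
    static double[] precisionRecall(char[] truth, char[] response, char category) {
        if (truth.length != response.length) {
            throw new IllegalArgumentException("annotations must cover the same tweets");
        }
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        for (int i = 0; i < truth.length; i++) {
            boolean inTruth = truth[i] == category;
            boolean inResponse = response[i] == category;
            if (inTruth && inResponse) {
                truePositives++;
            } else if (!inTruth && inResponse) {
                falsePositives++;
            } else if (inTruth && !inResponse) {
                falseNegatives++;
            }
        }
        int predicted = truePositives + falsePositives;
        int actual = truePositives + falseNegatives;
        double precision = predicted == 0 ? 0.0 : (double) truePositives / predicted;
        double recall = actual == 0 ? 0.0 : (double) truePositives / actual;
        return new double[] {precision, recall};
    }
}
```

Swapping which annotator counts as truth simply swaps precision and recall, which is why it only matters that one of us is fixed as the reference.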
Part 1: Evaluation of Customized Language Classifier
We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.
As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), we trained and tested it on the data with nGram sizes 1 through 15.
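To make "nGram size" concrete, here is a tiny plain-Java sketch that lists the character chunks of exactly length n in a string. It is only an illustration with a made-up class name: the actual language model also handles boundaries and shorter contexts, which this leaves out.

```java
import java.util.ArrayList;
import java.util.List;

class CharNGrams {

    // Returns every substring of exactly n characters, sliding one character at a time.
    static List<String> charNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

For "Gaga" and n = 2 this yields "Ga", "ag", "ga". Small n sees little more than letter frequencies, while large n sees whole words and word fragments, which is why the size matters.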
After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.
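As a sketch of what the micro-averaged f-measure means, the snippet below pools the one-versus-all true positives, false positives, and false negatives over all categories and computes a single F1 from the pooled counts. This is the textbook definition written in plain Java with hypothetical names; I am assuming, not claiming, that the evaluator computes it the same way.

```java
class MicroF {

    // F1 from raw counts: 2*TP / (2*TP + FP + FN).
    static double f1(int tp, int fp, int fn) {
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : (2.0 * tp) / denominator;
    }

    // One-versus-all F1 for a single category.
    static double categoryF(String[] truth, String[] response, String category) {
        int[] c = counts(truth, response, category);
        return f1(c[0], c[1], c[2]);
    }

    // Micro-average: sum the one-versus-all counts over all categories, then take F1 once.
    static double microAveragedF(String[] truth, String[] response, String[] categories) {
        int tp = 0;
        int fp = 0;
        int fn = 0;
        for (String category : categories) {
            int[] c = counts(truth, response, category);
            tp += c[0];
            fp += c[1];
            fn += c[2];
        }
        return f1(tp, fp, fn);
    }

    // Returns {TP, FP, FN} for one category.
    private static int[] counts(String[] truth, String[] response, String category) {
        if (truth.length != response.length) {
            throw new IllegalArgumentException("truth and response must be the same length");
        }
        int tp = 0;
        int fp = 0;
        int fn = 0;
        for (int i = 0; i < truth.length; i++) {
            boolean inTruth = truth[i].equals(category);
            boolean inResponse = response[i].equals(category);
            if (inTruth && inResponse) {
                tp++;
            } else if (inResponse) {
                fp++;
            } else if (inTruth) {
                fn++;
            }
        }
        return new int[] {tp, fp, fn};
    }
}
```

When every tweet gets exactly one label, as here, every mistake counts once as a false positive for one category and once as a false negative for the other, so the pooled score comes out equal to plain accuracy.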
This procedure was repeated for each of the five Twitter queries.
Part 2: Evaluation of Generic Language Classifier
We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, in the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training data consisted of all of the data except the 500 tweets to be tested on and the 500 tweets that were retrieved at an earlier epoch with the same query. All of this data, which contained a total of 4,000 tweets for any given test data, was put into a list. Five hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested the corpus using the same nGram looping structure as we did in the customized training process.
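The pool-and-sample step looks roughly like the sketch below. The `Tweet` class, the method names, and the fixed random seed are all hypothetical; the point is just the arithmetic (5,000 tweets minus both epochs of the test query leaves 4,000) and the random draw of 500.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class GenericTrainingSample {

    // Hypothetical container for one annotated tweet.
    static final class Tweet {
        final String query;
        final int epoch;
        final String text;
        final char label; // 'e' for English, 'x' for non-English

        Tweet(String query, int epoch, String text, char label) {
            this.query = query;
            this.epoch = epoch;
            this.text = text;
            this.label = label;
        }
    }

    // Drops every tweet retrieved with the test query, in either epoch.
    static List<Tweet> poolExcluding(List<Tweet> all, String testQuery) {
        List<Tweet> pool = new ArrayList<>();
        for (Tweet tweet : all) {
            if (!tweet.query.equals(testQuery)) {
                pool.add(tweet);
            }
        }
        return pool;
    }

    // Draws sampleSize tweets at random without replacement; the seed makes runs repeatable.
    static List<Tweet> sample(List<Tweet> pool, int sampleSize, long seed) {
        if (sampleSize > pool.size()) {
            throw new IllegalArgumentException("sample is larger than the pool");
        }
        List<Tweet> shuffled = new ArrayList<>(pool);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, sampleSize));
    }
}
```

Keeping the training sample at 500 tweets matches the size of the custom classifier's training set, so the comparison is about where the data came from rather than how much of it there was.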
After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.
Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier
A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.
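The nested looping structure can be sketched as follows. The `Experiment` interface is a hypothetical stand-in for "train on this many tweets at this nGram size, test, and return the micro-averaged f-measure", and the corpus sizes are whatever increments you choose; neither comes from our actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CorpusSizeSweep {

    // Hypothetical hook: trains and tests one classifier, returns its micro-averaged F.
    interface Experiment {
        double microF(int corpusSize, int nGramSize);
    }

    // Outer loop over corpus sizes, inner loop over nGram sizes 1..maxNGram.
    // Records the best F seen for each corpus size, in the order the sizes were given.
    static Map<Integer, Double> bestFByCorpusSize(Experiment experiment,
                                                  int[] corpusSizes,
                                                  int maxNGram) {
        Map<Integer, Double> best = new LinkedHashMap<>();
        for (int corpusSize : corpusSizes) {
            double bestF = Double.NEGATIVE_INFINITY;
            for (int n = 1; n <= maxNGram; n++) {
                bestF = Math.max(bestF, experiment.microF(corpusSize, n));
            }
            best.put(corpusSize, bestF);
        }
        return best;
    }
}
```

The resulting map of corpus size to best f-measure is exactly what gets plotted. One caveat worth keeping in mind: picking the best nGram size on the test data itself makes each point slightly optimistic.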
For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the Non-English. However, on the query “Mitsubishi”, the Non-English classifications performed better. Out of the four queries in which English performed better, two queries (“Clinton” (Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for the English than the Non-English.
For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%.
We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.
The experiment was run again 50 times at nGram size 4. The sample mean of these experiments’ f-measures was graphed, with a 95% confidence error bar (Figures 11 and 12).
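For anyone who wants to reproduce that kind of error bar, here is one common way to get a sample mean and a 95% interval from repeated runs. It is a sketch under an assumption: it uses the normal approximation (mean ± 1.96 standard errors), which may not be exactly how the bars in the figures were computed. The class and method names are made up.

```java
class MeanWithErrorBar {

    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    // Half-width of an approximate 95% confidence interval for the mean.
    // Assumption: normal approximation with z = 1.96.
    static double halfWidth95(double[] values) {
        return 1.96 * sampleStdDev(values) / Math.sqrt(values.length);
    }
}
```

The bar for each classifier then runs from `mean - halfWidth95` to `mean + halfWidth95`. With 50 runs the normal approximation is reasonable; with only a handful of runs a t-distribution multiplier would give a wider, more honest bar.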
When we graphed the generic classifiers’ performance versus the size of the training data, we found that the English classifications’ F-Measure increased slightly as the training data grew, while the Non-English classifications’ F-Measure increased dramatically (Fig. 13).
We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).
We came to the conclusion that neither generic nor customized training regimens are necessarily better than the other because each one performed better in different circumstances. We believe that the Non-English classifications for “Mitsubishi” scored higher than the English classifications because the Non-English category was larger than the English category, and also because it was very coherent (with mostly Asian writing, so that each category has either Roman characters or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.
We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad Non-English category, it’s difficult for the classifier to identify what makes a tweet Non-English at low corpus-sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.
Our data balance for each test is as follows (for the generic classifiers, the balance of the 4,000-tweet list is shown):
And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!