“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

by Sam Brown

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously air-conditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea for me to publish a three-part Java-based computational linguistics experiment exploring the relative performance of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try to explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case, English and non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s question comes in: is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke”, “Clinton”, etc.)? What we found is that, depending on the training and test data, either classifier may do better at language identification.

Data:

Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1,500 tweets were retrieved from Twitter, and near-duplicates were removed by rejecting any tweet whose Jaccard distance to an already-accepted tweet was .5 or less. The effect of this is that any tweet with 50% or more token (word) overlap with any tweet already in the accepted collection was rejected from the corpus (our collection of tweets to be used in the experiment).
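To make that de-duplication step concrete, here is a minimal sketch, assuming whitespace tokenization and the 0.5 threshold described above (the class and method names are mine, not part of LingPipe):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class JaccardDedup {

        // Tokenize a tweet into a set of lower-cased word types.
        static Set<String> tokens(String tweet) {
            return new HashSet<String>(Arrays.asList(tweet.toLowerCase().split("\\s+")));
        }

        // Jaccard distance = 1 - |intersection| / |union| over the two token sets.
        static double jaccardDistance(String a, String b) {
            Set<String> tokensA = tokens(a);
            Set<String> tokensB = tokens(b);
            Set<String> union = new HashSet<String>(tokensA);
            union.addAll(tokensB);
            if (union.isEmpty()) return 0.0;   // treat two empty tweets as duplicates
            Set<String> intersection = new HashSet<String>(tokensA);
            intersection.retainAll(tokensB);
            return 1.0 - (double) intersection.size() / union.size();
        }

        // Accept a tweet only if its distance to every already-accepted tweet exceeds the threshold.
        static List<String> deduplicate(List<String> tweets, double minDistance) {
            List<String> accepted = new ArrayList<String>();
            for (String tweet : tweets) {
                boolean isDuplicate = false;
                for (String kept : accepted) {
                    if (jaccardDistance(tweet, kept) <= minDistance) {
                        isDuplicate = true;
                        break;
                    }
                }
                if (!isDuplicate)
                    accepted.add(tweet);
            }
            return accepted;
        }
    }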

After that came my absolute favorite part: annotation. I annotated 500 tweets per query as either English (‘e’) or non-English (‘x’), for 5 queries (a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries two weeks later and annotated those tweets as well to form the second epoch of data (for a cumulative total of 5,000 tweets). We also checked for duplicates between the two Twitter retrievals, again using the Jaccard distance between any two given tweets to decide whether they were effectively identical.

To form a baseline to compare our final results against, I needed to find an inter-annotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data. She annotated 400 tweets for each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01%, with one annotator serving as truth and the other as the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated, resulting in complete agreement.

Process:

Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), we trained and tested on the data with nGram sizes 1 through 15.
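Here’s a rough sketch of that training step, assuming the annotated tweets live in parallel lists of texts and ‘e’/‘x’ labels (those list names and the helper method are mine; the LingPipe calls follow my reading of the 4.x javadoc, so double-check them against your version):

    import java.util.List;

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramBoundaryLM;

    public class CustomLangIdSketch {

        static final String[] CATEGORIES = { "e", "x" };   // English, non-English

        // Train one boundary-LM classifier of the given n-gram size on the annotated tweets.
        static DynamicLMClassifier<NGramBoundaryLM> train(List<String> texts,
                                                          List<String> labels,
                                                          int nGram) {
            DynamicLMClassifier<NGramBoundaryLM> classifier
                = DynamicLMClassifier.createNGramBoundary(CATEGORIES, nGram);
            for (int i = 0; i < texts.size(); ++i) {
                Classification label = new Classification(labels.get(i));
                classifier.handle(new Classified<CharSequence>(texts.get(i), label));
            }
            return classifier;
        }

        // The experiment sweeps nGram from 1 through 15, evaluating each trained
        // classifier on the 500 held-out test tweets (see the evaluation sketch below).
    }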

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.
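Continuing the sketch above, the evaluation step would look roughly like this; testTexts and testLabels are assumed parallel lists, and the evaluator method names (oneVersusAll(), confusionMatrix().microAverage()) reflect my reading of the LingPipe javadoc rather than the exact code we ran:

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.JointClassifierEvaluator;
    import com.aliasi.classify.PrecisionRecallEvaluation;

    // ... given a classifier trained as in the previous sketch ...
    JointClassifierEvaluator<CharSequence> evaluator
        = new JointClassifierEvaluator<CharSequence>(classifier, CATEGORIES, false);

    // Feed each annotated test tweet to the evaluator.
    for (int i = 0; i < testTexts.size(); ++i) {
        Classification reference = new Classification(testLabels.get(i));
        evaluator.handle(new Classified<CharSequence>(testTexts.get(i), reference));
    }

    // One-versus-all precision, recall, and f-measure for each category.
    for (String category : CATEGORIES) {
        PrecisionRecallEvaluation pr = evaluator.oneVersusAll(category);
        System.out.printf("%s: P=%.3f R=%.3f F=%.3f%n",
                          category, pr.precision(), pr.recall(), pr.fMeasure());
    }

    // Micro-averaged f-measure, used to compare classifiers.
    double microF = evaluator.confusionMatrix().microAverage().fMeasure();
    System.out.printf("micro-averaged F = %.3f%n", microF);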

This procedure was repeated for each of the five Twitter queries.

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as in the customized experiment. The difference between the custom and generic classifiers is that, for the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training pool consisted of all the other data minus the 500 tweets to be tested on and the 500 tweets retrieved at the earlier epoch with the same query. All of this data, a total of 4,000 tweets for any given test set, was put into a list. Five hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested on the corpus using the same nGram looping structure as in the customized training process.
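The random selection is just a shuffle-and-take over that 4,000-tweet pool; a small sketch (the method name is mine, and the fixed seed is only there to make runs repeatable):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // pool holds the roughly 4,000 annotated tweets left after excluding both
    // epochs of the query being tested; size is 500 in our experiment.
    static <E> List<E> sampleTrainingSet(List<E> pool, int size, long seed) {
        List<E> shuffled = new ArrayList<E>(pool);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<E>(shuffled.subList(0, size));
    }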

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the training corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside the corpus-size looping structure. The evaluator returned the best micro-averaged f-measure at each corpus size, and this was graphed against corpus size.
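Schematically, the nesting looks like the fragment below. The corpus sizes listed are placeholders (the post doesn’t record the actual increments), and trainAndEvaluate() stands in for the training and evaluation sketches above:

    // Outer loop grows the randomly sampled training set; inner loop sweeps n-gram size.
    int[] corpusSizes = { 25, 50, 100, 200, 300, 400, 500 };   // hypothetical increments
    for (int size : corpusSizes) {
        List<String> trainTexts = sampledTexts.subList(0, size);
        List<String> trainLabels = sampledLabels.subList(0, size);
        double bestMicroF = 0.0;
        for (int nGram = 1; nGram <= 15; ++nGram) {
            double microF = trainAndEvaluate(trainTexts, trainLabels,
                                             testTexts, testLabels, nGram);
            bestMicroF = Math.max(bestMicroF, microF);
        }
        System.out.printf("corpus size %d -> best micro-averaged F = %.3f%n",
                          size, bestMicroF);
    }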

Results:

Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-English. However, on the query “Mitsubishi”, the non-English classifications performed better. Of the four queries in which English performed better, two (“Clinton” (Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for English than for non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic

Figure 3 - Gaga Generic

Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%.

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was then rerun 50 times at an nGram size of 4. The sample means of these experiments’ f-measures were graphed with 95% confidence error bars (Figures 11 and 12).
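For reference, here is one standard way to turn those 50 runs into a mean and an error bar; the exact formula we used isn’t spelled out above, so treat this as an assumption (sample mean with a normal-approximation 95% interval):

    // fMeasures holds the 50 per-run f-measures for one classifier and category.
    static double[] meanAndHalfWidth95(double[] fMeasures) {
        int n = fMeasures.length;
        double mean = 0.0;
        for (double f : fMeasures) mean += f;
        mean /= n;
        double sumSq = 0.0;
        for (double f : fMeasures) sumSq += (f - mean) * (f - mean);
        double sd = Math.sqrt(sumSq / (n - 1));       // sample standard deviation
        double halfWidth = 1.96 * sd / Math.sqrt(n);  // 95% confidence half-width
        return new double[] { mean, halfWidth };
    }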

Figure 11 - Non-English Classifier Comparison

Figure 12 - English Classifier Comparison

Part 3

When we graphed the generic classifiers’ performance versus the size of the training data, we found that the English classifications’ F-Measure increased slightly as the training data grew, while the Non-English classifications’ F-Measure increased dramatically (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance

Discussion:

Parts 1+2

We came to the conclusion that neither the generic nor the customized training regimen is necessarily better than the other, because each one performed better in different circumstances. We believe that the Non-English classifications for “Mitsubishi” scored higher than the English classifications because the Non-English category was larger than the English category, and also because it was very coherent (mostly Asian-language tweets, so that each category consisted almost entirely of either Roman or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.

Part 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad non-English category, it’s difficult for the classifier to identify what makes a tweet non-English at low corpus sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

Our data balance for each test is as follows (for the generic classifiers, the balance of the 4,000-tweet list is shown):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

4 Responses to ““Language ID”, Said the NGram Process Dynamic Language Model Classifier…”

  1. Bob Carpenter Says:

    Nice! Thanks for blogging this, Sam. And since no one ran this by me first, my comments are going to be public:

    1. It’s always a better idea to train on specialized data if it represents the test conditions. It’s always better to train on more data. What you’re seeing is a tradeoff between the two, because you have more of what you call “generic” data (I’d choose a different term because “generic” is already heavily overloaded in both language and computer science).

    2. You’d be better off using a process character LM and padding the input with spaces on either end. I’m guessing normalizing for space (all sequences of whitespace go to single space char) would help, too.

    3. Jaccard distance measures word types, not count. So if you have “a b b b” versus “a a a b”, that’s a perfect Jaccard match.

    4. Throwing away exact matches is not kosher statistically, as you’re monkeying with the data distribution by hand. For this context, this probably won’t matter, but it’s not a good idea in general. Don’t be afraid of duplicate training data if the data you’re going to run on is duplicated.

    5. If you swap reference and response, you just swap precision and recall. Most people just report agreement, not precision and recall. With 400 items at 95%, you can’t get 0.01% confidence intervals. Back of the envelope, that should be about 1% = sqrt(.95 * 0.05 / 400).

    That’s pretty good agreement. Did the adjudication make it look like you just made errors in the first pass, or were there hard decisions to make?

    6. Micro-averaged F measure is fine for a DARPA style “we just want one number” eval, but does it correspond at all to anything we care about the classifier’s performance in an application?

    Overall, I find just looking at the raw confusion matrices the best indicator of performance. Or looking at precision/recall or ROC curves if we want to choose an operating point.

    7. As to results, it’d be nice to see the uncertainty there. The results will depend on data selected for the fold and I’m guessing there’ll be lots of variance based on this. Even more so at low counts.

    8. The “right” approach from a stat point of view is to build a hierarchical model. There’s one “generic” top-level model for language ID, then submodels for different domains. Then you learn the difference between the subdomains and general language instead of just learning each subdomain from scratch. Unfortunately, there’s nothing in LingPipe to let you do that very easily.

    8. Usually you want log scale on sizes. 20, 40, 80, 160, 320, 640, etc. And why stop at 500?

    9. Those post-hoc “best F measure” values you report are very dangerous. They are biased to overreport the actual value by their very selection process. This is a very common “mistake” in the literature.

    10. Longer n-grams will almost always work better if there’s enough smoothing. LingPipe’s smoothing isn’t very nuanced, but you should still be able to find values such that the graphs don’t start tanking in performance for longer n-grams.

    11. What are the e- vs. x- labels on the first few graphs? Were you just doing “English” versus “not English”? If so, there’s no reason to report two sets of results as one’s just the complement of the other.

    12. What you’re seeing is just variance. You can tell which performs better on average by doing more experiments. But you won’t necessarily get the same kind of behavior at run time with different categories unless you collect enough categories that are enough like you care about that you can start doing stats across the topics/queries.

    13. Be careful using terms like “character set”. You probably used a single character set, Unicode, with a single encoding, UTF-8. Japanese and Roman characters are in different parts of the character set, but our programs don’t know anything at all about that. (They could be told that if you used something like a logistic regression classifier — then the range of characters from which it came can be a feature — we’ve done that before with language ID.)

    14. What was the “balance” of the training data? Does it represent the natural occurrence of data? There’s a big problem when one category is tiny relative to another. You get very little training data and the performance on the small category winds up having very little effect on micro-averaged measures (and arguably too much effect on macro-averaged measures).

    Final comments on form.

    A. Never use underlining on the web — it’s confusing with underlined links.

    B. That’s “LingPipe” with a capital “P”. Like in the blog title. We’re a product of our era.

    C. I love the term “rising sophomore”. It’s so, dare I say it, sophomoric.

    D. Learn R. The graphs from Excel are awful.

    E. Always label the axes of plots. Some of your plots have them and some not. I can’t tell what’s on the horizontal axis (size, but in what units?) or what’s on the vertical axis (performance on a 0-1 scale, but what?).

  2. Bob Carpenter Says:

    I forgot to add that I agree with your final conclusion (Part 3 discussion) and error analysis. The reason it works better for English is that the training data and test data match better.

    Language ID’s pretty easy. Especially with lots of training data and very different languages.

    Where it gets hard is discriminating Dutch/German/Swedish or Spanish/Catalan/Italian. They use the same character sets and even have very similar lexical patterns.

  3. samueltheintern Says:

    Thanks for your input Bob! I’ll try to address each of your comments individually:

    1. Under real-world circumstances with large amounts of data, that idea for training absolutely works. However, what I’m reporting here are simply the results of my experiment. Thank you for adding that.

    2. My new experiments (see 7) use a process character LM. The input is also now padded on either side with white space, as well as normalized for white space. Thank you for the suggestion.

    3. For Twitter, Jaccard is a good filter, because there are many retweets. Other options could be a “longest substring” method, etc. Duplicates within the same data were eliminated because the original testing was cross-validation. Duplicates in data during cross-validation would lead to training on testing data, messing up the classifier. Another reason is to not annoy the annotators, in general, so that they are not classifying the same Tweet again and again.

    4. Two weeks is probably not a good enough buffer between training and testing data, so we wanted to eliminate the possibility of retweets, which would lead to training on test data, which classifiers are very sensitive to.

    5. The adjudication consisted of both hard decisions and errors in the first pass, although mostly errors. The decisions between English and Non-English were not extremely hard to make; however, some problems arose when dealing with certain cases, like tweets that contained both English and Non-English.
    The percent agreement that the two annotators had was 96.7% ± 1%, before adjudication. As far as the confidence interval in the post goes, the percent sign was a typo. I accidentally put the decimal instead of the percentage.

    6. Agreed. We were trying to keep the graphs simple, with only one number to compare classifiers with. However, in the future I will include the confusion matrices.

    7. After running the experiment again, the uncertainty did turn out to be significant, although not enough to disprove our findings. New graphs are included in the post (see Figures 11-13). Figure 13 details, on a logarithmic scale, the results of a slightly altered experiment three. Instead of choosing just one set of 500 random tweets from the 4000-tweet corpus, 50 sets of 500 tweets were selected randomly (with replacement) and tested on. The sample means of their one-versus-all F-Measures were found, as well as the standard deviations. On the graph, their values are given with a 95% confidence error bar.

    8 (1). I’ll try to do that next summer.

    8 (2). We stopped at 500 because it was an even, reasonable count that fell under the size of all of our data.

    9. In the new experiment (mentioned in 7.), only an nGram of four was used. This eliminates one level of post-hoc analysis (choosing only the best ngram size for each run, no matter what it is), but introduced a new one (deciding to only use an ngram of 4 because it is, in general, the best performing). However, because we are simply doing an exploratory study, as opposed to saying that our classifier will absolutely perform at the level shown, I don’t believe that the choice of ngram size will skew our results. Because ngram size should affect generic and custom classifiers about equally, which exact ngram size we use shouldn’t play a huge role in what our results are.

    10. Perhaps next summer I can use smoothing.

    11. Yes, the “e” and “x” are “English” and “Non-English” respectively. I’m not entirely sure what you mean by your statement that they are complementary of each other. In my experiment, the English and Non-English classifications have varied widely relative to each other, enough that you cannot guess the performance of one based on the other (see the Generic vs. Custom graphs for Mitsubishi versus Wiener).

    12. (See 7) We also posted a new set of graphs, which details the generic versus custom classifiers for nGram size 4. It supports our original conclusion that either classifier may do better in certain circumstances.

    13. Thank you for your input on word choice. I’ll try not to confuse anyone else.

    14: Our data balance represents the natural occurrence of data, minus the duplicates that we removed. Figures 14 and 15 show the relationship between the English and non-English corpus sizes and their performance. Also included at the end of the article is a table showing the data balance of each query, for the generic and custom cases.

  4. Bob Carpenter Says:

    Thanks for posting the reply.

    1. What we’d probably do if push came to shove is just throw both types of data together (or maybe interpolate the predictions). Crude, but easy to implement. And probably better than either model separately unless there’s something odd going on with the data collection over time (which is a real possibility — language data’s very squirrelly that way and there’s just not much we can do about it other than acknowledge the fact).

    3. The retweet issue’s interesting. Twitter is unusual as a language source, though not that unusual (other examples include newswire repeating AP stories and e-mails or message boards with quotes). As long as we keep in mind that we’re getting performance numbers on the deduplicated data, we should be OK. The real data will be higher variance because the successes and failures get correlated (if there are five copies of a tweet, we get five right answers or five wrong answers all at once if you count by tweet).

    4. That’s an empirical issue you probably have a better sense of than me. Keep in mind that in every training set used in natural language there’s a degree of this. The word “president” shows up as a noun phrase in both training and testing.

    5. This information about the domain in question is actually very interesting. Again, more from a customer point of view and a linguist’s point of view than a machine learning point of view. Or from the point of view of someone trying to understand the problem (as opposed to trying to understand two different training conditions, two different learning schemes, etc. — there are lots of reasonable goals).

    6. The question is how to present the data, not how to keep the graphs simple. The overall field’s with you as far as conference and journal submissions go, but the customers not so much. They’re often rightly interested in the error patterns and which categories are getting messed up and how.

    7. The new figure 13 is exactly what I was asking for. I’d not say that it didn’t disprove your findings (I just had to one-up your negative chain), but rather that you can’t really draw any conclusions until you’ve characterized the uncertainty of your tests. That’s what all the p-value stuff’s about in frequentist hypothesis testing.

    8 (1). Glad I haven’t scared you off yet.

    9. You definitely want to try all the options and choose which one’s best. It’s just that you need to keep in mind that by so doing, you’re likely to get slightly less performance in a new round of testing simply in virtue of choosing the best performing option. I don’t think removing that bias is worth fixing a single n-gram length. This is really just a matter of presenting and interpreting results.

    10. Character language models (almost) always use smoothing. If not, you run into zero probability estimates on test data, which tend to sink the whole method. It’s controlled through the interpolation ratio parameter in the LM constructor. By default it’s set to the n-gram length because I found that worked well in the past. It’s not that sensitive. All the formulas are in the javadoc and in the LM section of the LingPipe book.

    11. They’re complementary in the sense that it’s a binary classification problem. So a true positive for English is a true negative for not-English, a false positive for English is a false negative for not-English, and vice-versa.

    This means that using F-measure, which is a kind of average of precision and recall, gets all intertwined.

    For this task, it’d probably be better to report accuracy for truly English and accuracy for truly not-English posts. These two statistics are completely independent.

    14. Perfect. I wish more studies would be this careful in reporting results. Seriously, it’s hard to know what’s going on with lots of data when the collection, labeling and adjudication are usually somewhat of a black box.

    C. C’mon, no old man jokes? My grey beard and I can take it.

    D. I now realize these are OpenOffice, and they’re better than I expected. I think I was more reacting to the content. That, and I completely failed in an hour of trying to recreate the graph you created in two minutes. R’s graphing is really painful, so I’ll have to retract that piece of bad advice, at least until you need more complex stats to go along with your graphing.
