Hello again, fellow LingPipers! It is I, yet again, Sam The Intern. In my last (and first) post I promised the masses that I would return with a study similar to the one I had already conducted, but this time tackling the non-trivial case of sentiment. It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream. As my titular character, the logistic regression classifier, so concisely said before, sentiment takes a long time! And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.
As you may remember from my first post, I absolutely love to annotate. However, annotating for sentiment takes much more time than annotating for language identification, because the annotator needs to internalize and judge the text, not just glance at it. Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern who has been product-conditioned to Coca-Cola but has no time to do anything about it (reading 1000 tweets about how much people love Coke will do that to you).
I was able to start the study, with a small amount of success. The following confusion matrix is the result of a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”). The columns are the reference data, which I annotated, and the rows are the response data that the classifier provided. The categories are: w = neutral, e = positive, q = negative.
                   reference
                   w      e      q
  response   w   1244    239     50
             e    313    312     23
             q    199     45     23
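If you want to run this kind of evaluation yourself, here is a minimal sketch of a 10-fold cross-validated sentiment classifier that prints a confusion matrix in the same rows-are-response orientation. It uses scikit-learn's logistic regression rather than LingPipe's, and the file name, tab-separated layout, and feature choices are assumptions for illustration, not what I actually ran.

# Minimal sketch of a 10-fold cross-validated sentiment classifier.
# scikit-learn's logistic regression stands in for LingPipe's; the file
# "tweets.tsv" and its "label<TAB>text" layout are illustrative assumptions.
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

LABELS = ["w", "e", "q"]  # neutral, positive, negative (as in the post)

# Load the annotated tweets: one "label<TAB>text" pair per line.
texts, labels = [], []
with open("tweets.tsv", encoding="utf-8") as f:
    for label, text in csv.reader(f, delimiter="\t"):
        labels.append(label)
        texts.append(text)

# Word unigram/bigram tf-idf features feeding a logistic regression model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# 10-fold cross-validation: every tweet is predicted by a model that
# never saw it during training, mirroring the evaluation above.
predicted = cross_val_predict(model, texts, labels, cv=10)

# confusion_matrix puts reference on the rows, so transpose it to get
# rows = response (classifier), columns = reference (annotator).
matrix = confusion_matrix(labels, predicted, labels=LABELS).T
print("      " + "  ".join(f"{l:>5}" for l in LABELS))
for row_label, row in zip(LABELS, matrix):
    print(f"{row_label:>4}  " + "  ".join(f"{n:5d}" for n in row))

With roughly 3000 annotated tweets this runs in seconds; the hard part, as I learned, is producing the annotations in the first place.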
I wish that I had more time to spend on the sentiment classification problem. Perhaps I will be back next summer, to continue the effort to teach computers good from bad. Until then, I will be at Rochester, college-ing! And drinking massive quantities of Coke…