Processing Tweets with LingPipe #1: Search and CSV Data Structures

by Breck Baldwin

I have been focusing on teaching developers how to approach the challenges of computational linguistics. Twitter is an interesting source to work with since it is culturally relevant these days, fairly easy to work with from a data-feed perspective, and just so darned cute. Like many cute things, it is not long before it is eating your shoes, resisting toilet training and overall testing one’s patience. We will be using LingPipe to help tame the beast.

I will start with how to work with the search client and will move towards language ID, deduplication and a bit of sentiment classification in later blog entries. The code for this post is available in this tarball or from our subversion repository which can be accessed via the command:

svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Go ahead and download the software and make sure you can compile. I am developing with Ant, so I just type ‘ant’ in the checked-out lingpipe4twitter directory:

$ ant
Buildfile: build.xml

compile:
    [mkdir] Created dir: /Users/breck/devguard/teaching/lingpipe4twitter/build/classes
    [javac] Compiling 8 source files to /Users/breck/devguard/teaching/lingpipe4twitter/build/classes

jar:
      [jar] Building jar: /Users/breck/devguard/teaching/lingpipe4twitter/demo.jar

BUILD SUCCESSFUL
Total time: 3 seconds

Searching Twitter

We are using the Twitter4j library for accessing the Twitter API. The folks at Twitter apparently contribute to it; while that is not an endorsement, it certainly increases the library's credibility.

Getting some data on disk is the first item. Run a search on Obama as follows:

ant search -Dquery="Obama"
Buildfile: build.xml

compile:

jar:

search:
     [java] [Mon Nov 15 10:20:45 EST 2010]Will use class twitter4j.internal.logging.StdOutLoggerFactory as logging factory.
     [java] [Mon Nov 15 10:20:45 EST 2010]Will use twitter4j.internal.http.HttpClientImpl as HttpClient implementation.
     [java] writing to disk in directory: searches

BUILD SUCCESSFUL
Total time: 6 seconds

The file name will be the query, searches/Obama.csv. Note that subsequent searches will not overwrite previous search results. Rerunning the above query would create a file searches/Obama_1.csv.
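
That non-clobbering naming is the job of a small nextFile helper inside TweetWriter.java (it shows up only as a call in the code below). Here is a minimal sketch of how such a helper could be written; the method name is illustrative and the actual implementation in the tarball may differ in its details.

import java.io.File;

// Sketch only: returns Obama.csv, then Obama_1.csv, Obama_2.csv, ...,
// whichever is the first name that does not already exist on disk.
static File nextAvailableFile(File dir, String baseName, String suffix) {
    File candidate = new File(dir, baseName + "." + suffix);
    for (int i = 1; candidate.exists(); ++i)
        candidate = new File(dir, baseName + "_" + i + "." + suffix);
    return candidate;
}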

Looking at the CSV output in a spreadsheet like OpenOffice or Excel (import as comma-separated UTF-8 text):

Search results in CSV format viewed in a spreadsheet

The tweets are in column D. To make them more readable, select the entire column D, resize the width to around 5″ and change the formatting to wrap the text.

Pretty cute, eh? Lots of entrepreneurs have gotten to this stage of perusing Twitter and seen piles of customers clamoring for access to this feed that is effectively the id of the internet. Clean it up with some superego and profit is sure to follow. Well, the cute gets ugly fast. Run a few queries in areas of interest and the messiness becomes apparent quickly. Below we go into some details and a bit of data analysis.

Details

We will be using the CSV data structure to work with the Twitter search results. Spreadsheets provide an easy way to view, sort and manipulate the tweets and there is a simple parser that allows us to interface with LingPipe functionality.

The relevant bits of the search code in src/SearchTwitter4j.java are:

    static final int TWEETS_PER_PAGE = 100;
    static final int NUM_PAGES = 15;

    public static void main (String[] args) 
        throws IOException, TwitterException {
        String queryString = args[0];
        File outDir = new File(args[1]);
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(queryString);
        query.setRpp(TWEETS_PER_PAGE);
        List<Tweet> tweets = new ArrayList<Tweet>();
        for (int i = 1; i <= NUM_PAGES; ++i) {
            query.setPage(i);
            QueryResult result = twitter.search(query);
            List<Tweet> resultTweets = result.getTweets();
            if (resultTweets.size() == 0) break; // no more results, stop paging
            tweets.addAll(resultTweets);
        }
        System.out.println("writing to disk in directory= " + outDir);
        TweetWriter.writeCsv(tweets,outDir,queryString,"csv"); //our class for writing tweets
    }

The only tricky bit to this search interface is the two levels of indexing for search results, pages and
tweets per page. Otherwise we accumulate the tweets and send them off to our TweetWriter class for
writing to disk.

The code for writing is in src/TweetWriter.java:

public class TweetWriter {    
    public static void writeCsv(List<Tweet> tweets, File dir, 
                                String filename, String suffix)
        throws IOException {
        File output = nextFile(dir,filename,suffix);
        OutputStream fileOut = null;
        Writer streamWriter = null;
        CsvListWriter writer = null;
        try {
            fileOut =  new FileOutputStream(output);
            streamWriter = new OutputStreamWriter(fileOut,Strings.UTF8);
            writer = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 

            List<String> headerRow 
                = Arrays.asList("Estimate", "Guessed Class", 
                                "True Class", "Text");
            writer.write(headerRow);
            List<String> row = new ArrayList<String>();
            for (Tweet tweet : tweets) {
                row.clear();
                row.add("");
                row.add("");
                row.add("");
                String whitespaceNormedTweet
                    = tweet.getText().replaceAll("[\\s]+"," ").trim();
                row.add(whitespaceNormedTweet);
                writer.write(row);
            }
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } finally {
                try { 
                    Streams.closeQuietly(streamWriter); //lingpipe convenience utility to close without exceptions 
                } finally {
                    Streams.closeQuietly(fileOut);
                }
            }
        }
    }

This code uses the Super Csv package to handle reading and writing CSV (comma-separated value) formatted files. For this tutorial I am only interested in the text of the Tweet object, but much more information is available; see the Twitter4j documentation on the returned Tweet object.

The basic pattern is to populate a row with string values and write the row out. I have adopted the convention that the first line of a csv file contains headers, which is handled before iterating over the tweets. Writing out the tweets involves 3 empty fields followed by the text of the tweet. Later those empty fields will be populated by human and machine annotations, so this is our foundational data structure. The remaining odd bit is the whitespaceNormedTweet = tweet.getText().replaceAll("[\\s]+"," ").trim(); which replaces newlines and carriage returns with a single space. This is in clear violation of my “do not adulterate the data” stance, but for the purposes of this tutorial the improved readability of the csv format makes it worth it.
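
Since the same Super Csv package reads these files back in, here is a minimal sketch of a reader that prints the tweet text from column D. The class name is illustrative; it assumes Super Csv's CsvListReader and the four-column convention just described.

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

// Sketch only: prints the tweet text column of a search CSV.
public class ReadTweetsCsv {
    public static void main(String[] args) throws Exception {
        CsvListReader reader
            = new CsvListReader(new InputStreamReader(new FileInputStream(args[0]), "UTF-8"),
                                CsvPreference.EXCEL_PREFERENCE);
        reader.read(); // skip the header row
        List<String> row;
        while ((row = reader.read()) != null)
            System.out.println(row.get(3)); // column D: the tweet text
        reader.close();
    }
}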

Also note that the IO has been wrapped for maximum paranoia regarding open file handles that might take out a JVM on repeated failures of writer, streamWriter or fileOut. I just know that someone is going to copy and paste the above code into a production system so might as well make it well behaved. BTW a real implementation would be streaming tweets to disk to keep down memory use. Thanks to Bob for noting all this and re-writing the closes.
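
For the curious, here is a rough sketch of what such a streaming variant might look like: each page of results goes straight to the CsvListWriter rather than being accumulated in memory. It reuses the Twitter4j and Super Csv calls shown above, but the class name and hard-coded constants are illustrative rather than part of the tarball, and the paranoid closing is omitted for brevity.

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import org.supercsv.io.CsvListWriter;
import org.supercsv.prefs.CsvPreference;
import twitter4j.*;

// Sketch only: write each page of search results as it arrives.
public class StreamingSearchSketch {
    public static void main(String[] args) throws Exception {
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(args[0]);
        query.setRpp(100);
        Writer out = new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        CsvListWriter writer = new CsvListWriter(out, CsvPreference.EXCEL_PREFERENCE);
        writer.write(Arrays.asList("Estimate", "Guessed Class", "True Class", "Text"));
        for (int page = 1; page <= 15; ++page) {
            query.setPage(page);
            List<Tweet> tweets = twitter.search(query).getTweets();
            if (tweets.isEmpty()) break;
            for (Tweet tweet : tweets)   // one row per tweet, written immediately
                writer.write(Arrays.asList("", "", "",
                    tweet.getText().replaceAll("\\s+", " ").trim()));
        }
        writer.close(); // error handling / paranoid closing omitted for brevity
    }
}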

Exploring the Data

Blog entries are supposed to be short, so I will cut the presentation here so that I may blog another day in good conscience. But a little data analysis to whet the appetite seems appropriate.

Duplicates and Near Duplicates:

We now have an on-disk data structure for storing searched-for tweets that we can browse with a spreadsheet and modify at will. Let’s look at what our harvest has reaped. Select the entire sheet and sort by column D. Scrolling down a bit, I find for my Obama search some near duplicates that share prefixes:

#2: Crimes Against Liberty: An Indictment of President Barack Obama http://amzn.to/aMahMd
#2: Crimes Against Liberty: An Indictment of President Barack Obama http://amzn.to/aMahMd
#5: Crimes Against Liberty: An Indictment of President Barack Obama http://amzn.to/a4Z4VT
#5: Crimes Against Liberty: An Indictment of President Barack Obama http://amzn.to/a4Z4VT
#5: Crimes Against Liberty: An Indictment of President Barack Obama http://amzn.to/a4Z4VT
#5: Crimes Against Liberty: An Indictment of President Barack Obama: Crimes Against Liberty: An Indictment of Pre... http://amzn.to/bj2dIb

Looking more closely, we see that the repeats are minor variants: the referred-to URL changes, #2 vs. #5, and the last example adds a bit more variation. For the purposes of this tutorial I assume that duplicates and near duplicates need to be eliminated; there are other use cases where retweets, which are near duplicates passed on by different users, are desirable. We will be handling duplicates and near duplicates in the next blog post.
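
As a stopgap, exact duplicates (like the repeated rows above) can be filtered with nothing fancier than a HashSet over the whitespace-normalized text. The helper below is only a sketch, not part of the tarball; near duplicates need the real treatment in the next post.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: keep the first occurrence of each exact tweet text.
// Near duplicates (different URLs, "#2" vs "#5") survive this filter.
public class ExactDedupe {
    public static List<String> dropExactDuplicates(List<String> texts) {
        Set<String> seen = new HashSet<String>();
        List<String> unique = new ArrayList<String>();
        for (String text : texts)
            if (seen.add(text))   // add() returns false if already present
                unique.add(text);
        return unique;
    }
}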

Languages:

I browsed the tweets and started to mark up the language of each. I didn’t have to look past around 100 tweets to get a pretty rich smattering of languages.

Diversity of languages displayed: some of the languages found in Twitter searches.

In another blog post I will approach the issue of language identification; fortunately it is pretty easy for languages that are not near neighbors like British English and American English. If you want to get a heads-up on language ID, look at our language ID tutorial.
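
As a teaser for that post, here is a rough sketch of character n-gram language ID with LingPipe's DynamicLMClassifier, along the lines of the language ID tutorial. The category labels and training examples are purely illustrative and far too few to be useful; real training data would be tweets labeled by language.

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;

// Sketch only: train a character 5-gram classifier per language,
// then pick the best-scoring language for a new tweet.
public class LangIdSketch {
    public static void main(String[] args) {
        String[] categories = { "en", "es" };  // illustrative labels
        DynamicLMClassifier<NGramProcessLM> classifier
            = DynamicLMClassifier.createNGramProcess(categories, 5);
        // train: one handle() call per labeled example
        classifier.handle(new Classified<CharSequence>(
            "yes we can", new Classification("en")));
        classifier.handle(new Classified<CharSequence>(
            "sí se puede", new Classification("es")));
        // classify a new tweet
        String lang = classifier.classify("gracias presidente Obama").bestCategory();
        System.out.println(lang);
    }
}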

That is all for now. My next posts will attend to deduplication, then language ID before moving on to classification for topic and sentiment.

Breck

4 Responses to “Processing Tweets with LingPipe #1: Search and CSV Data Structures”

  1. Brendan O'Connor Says:

    Language identification in tweets is interesting. People often switch between different languages inside one message. And even among English, there are all the SMS-like dialects that have incredibly alternate spellings, and I bet very different character n-gram distributions — for example, vowel dropping.

  2. breckbaldwin Says:

    Switching languages mid-tweet, or code-switching as they call it in linguistics, is indeed an issue. One approach is to do binary classifiers for each language (English vs not-English), (Spanish vs not-Spanish) and so on and tag the tweet for all languages that pass a threshold. For n+1 way language identifiers where n is the number of languages with an additional category of not-in-n-languages I suppose some sort of zoning would be in order. Got any good ideas for this?

    As for the dialects tweets are mutating rapidly into alternate orthographies with properties of other writing systems, vowel drop is like Arabic, dropped word spacing like Chinese and even glyphs ;). Very cool, but a real pain from the text processing perspective. Makes me appreciate the good ol’days of MUC-6 with well edited Wall Street Journal articles.

    I have some faith that tokenizing with character n-grams will provide needed flexibility for simple versions of language identification and classification. But things like number/letter substitution and word boundary drift are going to be difficult.

    • Bob Carpenter Says:

      One always needs training data that matches test conditions. So training on the Brown corpus and testing on tweets is unlikely to work well from a language modeling perspective.

      Semi-supervised learning would probably go a long way toward adaptation. Messages will be mixed.

      As to segmenting multilingual text, I built just such a system when I worked at SpeechWorks to preprocess multilingual emails into languages as a precursor to speech synthesis. I don’t know if it ever went into production, as the whole TTS side of the operation was in flux when I left in 2002.

      I penalized switching languages at any given point and handled the stretches within languages using 5th or 6th-order character LMs with Witten-Bell smoothing. I don’t remember the exact details, but I think I put some kind of hard constraint on not changing languages in the middle of a token.

      You could implement almost exactly the same thing in LingPipe using a rescoring LM chunker. It’s also very easy to implement directly via search once you have the character LMs implemented.

      Even character unigrams work pretty well for language ID with more than a couple dozen characters. Other confusing aspects, as always, are punctuation and URLs, which need to be balanced across languages. I still remember the lesson from Chris Manning’s stat NLP class for language ID — only the Finnish data (if I recall) had sequences of hyphens, so it was convinced any message with lots of hyphens was Finnish. Same deal with numbers.

      By tokenizing with character n-grams, I think Breck’s referring to our noisy channel approach to Chinese word segmentation (supervised). The source is a character language model and the channel is weighted edit distance.

      Speaking of spell checking (which is the API implementing the noisy channel on which our segmenters are based), you could use that to back-translate L33T and vowel-drop back into “standard” English. As Breck says, it’s like inserting vowels into Arabic.

  3. Weekly Search & Social News: 11/23/2010 | Search Engine Journal Says:

    [...] Processing Tweets with LingPipe #1: Search and CSV Data Structure [...]
