## Another Linguistic Corpus Collection Game

November 12, 2012

Johan Bos and his crew at the University of Groningen have a new suite of games aimed at linguistic data collection. You can find them at:

Wordrobe is currently hosting four games. Twins is aimed at part-of-speech tagging, Senses is for word sense annotation, Pointers for coref data, and Names for proper name classification.

One of the neat things about Wordrobe is that they try to elicit some notion of confidence by allowing users to “bet” on their answers.

They also discuss prizes, but I didn’t see any mention of what the prizes were.

The project is aimed at improving the Groningen Meaning Bank. I hope they release the raw user data as well as their best guess at a gold standard. I had some background discussion with Johan about annotation models, but they’re going to go with something relatively simple, which means there’s an opportunity to compare richer statistical models like the other ones I’ve cited in the Data Annotation category of this blog.

### Other Linguistic Games

The first linguistic game of which I was aware was von Ahn’s reCAPTCHA. Although aimed at capturing OCR annotations as a side effect, it is more of a security wall aimed at filtering out bots than a game. Arguably, I’ve been played by it more than the other way around.

A more linguistically relevant game is Poesio et al.’s Phrase Detectives, which is aimed at eliciting coreference annotations. I played through several rounds of every aspect of it. The game interface itself is very nice for a web app. Phrase Detectives occasionally has cash prizes, but it looks like they ran out of prize money because the last reference to prizes was July 2011.

### Are they Really Games?

Phrase Detectives is more like an Amazon Mechanical Turk task with a backstory and leaderboard. I didn’t create a login for Wordrobe to try it, but I’m going out on a limb to guess it’s going to be similar given the descriptions of the games.

## Upgrading from Beta-Binomial to Logistic Regression

October 30, 2012

### Bernoulli Model

Consider the following very simple model of drawing the components of a binary random N-vector y i.i.d. from a Bernoulli distribution with chance of success theta.

    data {
      int<lower=0> N;  // number of items
      int<lower=0,upper=1> y[N];  // binary outcome for item n
    }
    parameters {
      real<lower=0,upper=1> theta;  // Prob(y[n]=1) = theta
    }
    model {
      theta ~ beta(2,2); // (very) weakly informative prior
      for (n in 1:N)
        y[n] ~ bernoulli(theta);
    }


The beta distribution is used as a prior on theta. This is the Bayesian equivalent to an “add-one” prior. This is the same model Laplace used in the first full Bayesian analysis (or as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model male-vs.-female birth ratio in France, with N being the number of samples and y[n] = 1 if the baby was male and 0 if female.
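Because the beta prior is conjugate to the Bernoulli likelihood, the posterior for theta in this model has a closed form, which makes a handy sanity check on sampler output. Here is a minimal Python sketch of that calculation; the function names and the toy data are mine, not part of Stan.

```python
def beta_bernoulli_posterior(y, a=2.0, b=2.0):
    """Return (a_post, b_post) for the Beta posterior on theta.

    y is a sequence of 0/1 outcomes; a and b are the Beta prior
    parameters (2, 2 matches the prior in the Stan model above).
    """
    successes = sum(y)
    failures = len(y) - successes
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    y = [1, 0, 1, 1]  # made-up data: three successes, one failure
    a_post, b_post = beta_bernoulli_posterior(y)
    print(a_post, b_post, beta_mean(a_post, b_post))
```

Note that the posterior depends on y only through the number of successes and N.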

### Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

    model {
      theta ~ beta(2,2);
      sum(y) ~ binomial(N,theta);
    }


But it doesn’t generalize so well. What we want to do is let the prediction of theta vary by predictors (“features” in NLP speak, covariates to some statisticians) of the items n.

### Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

    data {
      int<lower=0> N;
      int<lower=0,upper=1> y[N];
    }
    parameters {
      real alpha;  // inv_logit(alpha) = Prob(y[n]=1)
    }
    model {
      alpha ~ normal(0,5);  // weakly informative
      for (n in 1:N)
        y[n] ~ bernoulli(inv_logit(alpha));
    }


Recall that the logistic sigmoid (inverse of the logit, or log odds function) maps

$\mbox{logit}^{-1}:(-\infty,\infty)\rightarrow(0,1)$

by taking

$\mbox{logit}^{-1}(u) = 1 / (1 + \mbox{exp}(-u))$.

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.
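For readers who want to poke at the transform itself, here is a small Python sketch of the inverse logit and its inverse. The branching on the sign of u is my own choice to avoid overflow for large negative inputs; it is not a claim about how Stan implements `inv_logit`.

```python
import math


def inv_logit(u):
    """Logistic sigmoid 1 / (1 + exp(-u)), arranged to avoid overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    z = math.exp(u)  # safe: u < 0, so exp(u) is at most 1
    return z / (1.0 + z)


def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    for u in (-2.0, 0.0, 1.3):
        p = inv_logit(u)
        print(u, p, logit(p))  # logit undoes inv_logit up to rounding
```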

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

    data {
      int<lower=0> N;  // number of items
      int<lower=0,upper=1> y[N];  // binary outcome for item n
      real x[N];  // predictive feature for item n
    }
    parameters {
      real alpha;  // intercept
      real beta;  // slope
    }
    model {
      alpha ~ normal(0,5);  // weakly informative
      for (n in 1:N)
        y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
    }
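To make the sampling statement in the loop concrete, here is a rough Python sketch of the log likelihood it contributes for given values of alpha and beta. The function names are hypothetical and the priors are left out; it is only meant to show what `bernoulli(inv_logit(alpha + beta * x[n]))` evaluates to on the log scale.

```python
import math


def log1p_exp(u):
    """Numerically stable log(1 + exp(u))."""
    if u > 0:
        return u + math.log1p(math.exp(-u))
    return math.log1p(math.exp(u))


def bernoulli_logit_log_lik(alpha, beta, x, y):
    """Sum of log Bernoulli(y[n] | inv_logit(alpha + beta * x[n]))."""
    total = 0.0
    for x_n, y_n in zip(x, y):
        eta = alpha + beta * x_n
        # log P(y=1) = -log(1 + exp(-eta)); log P(y=0) = -log(1 + exp(eta))
        total += -log1p_exp(-eta) if y_n == 1 else -log1p_exp(eta)
    return total


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]  # made-up predictor values
    y = [1, 0, 1]         # made-up outcomes
    print(bernoulli_logit_log_lik(0.0, 1.0, x, y))
```

With beta fixed at zero this reduces to the intercept-only model.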


### Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see:

Stan’s not a competitor for LingPipe, by the way. Stan scales well for full Bayesian inference, but doesn’t scale like LingPipe’s SGD-based point estimator for logistic regression. And Stan doesn’t do structured models like HMMs or CRFs and has no language-specific features like tokenizers built in. As I’ve said before, it’s horses for courses.

## Mystery Novel with Natural Language Processing

October 24, 2012

For those of you who like mystery novels, Mitzi’s just written one. The added bonus for readers of this blog is that there’s natural language processing involved in the detective work (I don’t want to give too much away, so I can’t tell you how).

Poetic Justice is in the cozy mystery sub-genre, where the focus is on the amateur sleuths and their milieu, not on grisly multiple homicides.

### The Back-Cover Blurb

Once you’ve made it in Manhattan, why would you be caught dead in Staten Island? That’s what Jay Alfred, editor-in-chief of Ars Longa Press, can’t understand. Jay and his partner Ken live on the best block in Chelsea. They’re an attractive pair of opposites. Jay would never stoop to snoop. Ken exercises his right to know every chance he gets.

When Sheba Miller, literary agent and downtown doyenne, is found dead in a bar on Staten Island, Ken can’t wait to investigate. He hustles Jay onto the Staten Island Ferry and into adventure. Then Sheba’s tell-all memoir surfaces. It’s a catalog of white nights with hot artists and liquid lunches with idiot publishers. Jay’s the idiot-in-chief, but he’s not alone. It’s a good thing that Sheba’s dead, because half of literary New York is ready to kill her.

That’s not the only book in town and Ken’s not the only amateur detective. Sheba’s old friends and lovers and the junior members of Ars Longa are all ready and willing to explore New York City and beyond in search of authors, books, killers, and a killer martini.

### Look Inside!

If you follow the Amazon link below, the first six chapters are available free online through Amazon’s “Look Inside!” feature.

### Early Reviews

For what it’s worth, at least twenty people have read it and said they enjoyed it. Some (including me) are already clamoring for the second book in the series. Other mystery writers told Mitzi this would happen; luckily for us fans, she already has two follow-ons in the pipeline.

### Details

• Publisher:  Colloquial Media
• Language:  English
• Pages:  324
• ISBN-10: 0-9882087-0-9 (Paperback)
• ISBN-13: 978-0-9882087-0-4 (Kindle)

### Ordering

On Amazon, it’s eligible for the 4-for-3 deal (order 4 books, get the cheapest one free).

It’s also available from Amazon UK.

If you want a review copy, add a comment to this post or send me e-mail at carp@alias-i.com.

## High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gathering NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values; it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).
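As a rough illustration of how annotator accuracy alone drives corpus accuracy, here is a Python sketch that uses plain majority voting as a much simpler stand-in for a Dawid and Skene style model. It assumes a binary task, an odd number of annotators, independent errors, and the same accuracy for everyone; all of those are simplifying assumptions of mine, and the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_annotators, acc):
    """Probability that a majority of independent annotators is correct."""
    if n_annotators % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    need = n_annotators // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_annotators, k) * acc**k * (1 - acc) ** (n_annotators - k)
        for k in range(need, n_annotators + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, majority_vote_accuracy(n, 0.7))
```

Nothing about chance-adjusted agreement enters the calculation; the only inputs are the number of annotators and their accuracy.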

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy run from 0 to 1 and kappa from 0 to 1. Here’s the whole plot for comparison.

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definitions of kappa, to result in the following definition in R to compute expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracies.

    kappa_fun = function(prev,acc) {
      agr = acc^2 + (1 - acc)^2;
      cat1 = acc * prev + (1 - acc) * (1 - prev);
      e_agr = cat1^2 + (1 - cat1)^2;
      return((agr - e_agr) / (1 - e_agr));
    }


Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

|  | Cat1 | Cat2 |
|---|---|---|
| **Cat1** | 170 | 90 |
| **Cat2** | 90 | 650 |

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus the square of one minus the accuracy (both annotators are wrong and hence agree; two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus the product of one minus the accuracy and one minus the prevalence (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the prevalences. So that’s just the probability of category 1 squared (both annotators respond category 1) and the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.
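For anyone who would rather check the numbers outside of R, here is a Python port of the function described above, plus a helper of my own (`expected_table` is a hypothetical name) that rebuilds the expected contingency table under the same assumptions: two annotators with equal accuracy whose errors are independent.

```python
def kappa_fun(prev, acc):
    """Expected kappa for two annotators with equal accuracy acc."""
    agr = acc**2 + (1 - acc) ** 2               # both right or both wrong
    cat1 = acc * prev + (1 - acc) * (1 - prev)  # P(annotator says category 1)
    e_agr = cat1**2 + (1 - cat1) ** 2           # agreement expected by chance
    return (agr - e_agr) / (1 - e_agr)


def expected_table(prev, acc, n):
    """Expected 2x2 table of counts (annotator 1 in rows, annotator 2 in columns)."""
    both1 = prev * acc**2 + (1 - prev) * (1 - acc) ** 2
    both2 = prev * (1 - acc) ** 2 + (1 - prev) * acc**2
    split = acc * (1 - acc)  # one annotator right, the other wrong
    return [
        [round(n * both1), round(n * split)],
        [round(n * split), round(n * both2)],
    ]


if __name__ == "__main__":
    print(expected_table(0.2, 0.9, 1000))
    print(round(kappa_fun(0.2, 0.9), 2))
```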

Back to the plot. The white border is set at 0.66, the lower-end threshold established by Krippendorff for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorff was 0.8, and is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

    # Evaluate kappa_fun on a (K + 1) x (K + 1) grid of prevalence and accuracy.
    pos = 1;
    K = 200;
    prevalence = rep(NA,(K + 1)^2);
    accuracy = rep(NA,(K + 1)^2);
    kappa = rep(NA,(K + 1)^2);
    for (m in 1:(K + 1)) {
      for (n in 1:(K + 1)) {
        prevalence[pos] = (m - 1) / K;
        accuracy[pos] = (n - 1) / K;
        kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
        pos = pos + 1;
      }
    }
    library("ggplot2");
    df = data.frame(prevalence=prevalence,
                    accuracy=accuracy,
                    kappa=kappa);
    kappa_plot =
      ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
      labs(title = "Kappas for Binary Classification\n") +
      geom_tile() +
      scale_x_continuous(expand=c(0,0),
                         breaks=c(0,0.25,0.5,0.75,1),
                         limits =c(0.5,1)) +
      scale_y_continuous(expand=c(0,0),
                         breaks=seq(0,10,0.1),
                         limits=c(0.85,1)) +
      # The opening of this call was missing from the listing; scale_fill_gradient2
      # with midpoint=0.66 is a reconstruction that matches the white border at 0.66.
      scale_fill_gradient2(name="kappa", midpoint=0.66,
                           low="orange", mid="white", high="blue",
                           breaks=c(1,0.8,0.66,0));


## Refactoring in the Zone

September 17, 2012

I remember very clearly when I first started to work as a professional programmer. I was tasked with first designing and then integrating a new semantic interpreter for the grammars in SpeechWorks’s speech recognizer.

I didn’t know my ass from my elbow and pretty much couldn’t get off the ground on my own.

### Get Adopted by a Great Mentor

Luckily, Sasha Caskey pretty much saved my professional programming life by pair programming with me until I “got it” and on a continuing basis after that so I didn’t forget.

One of Sasha’s memorable early lessons involved refactoring or adding new features. After everything was designed (something I’m still more comfortable with than coding), there was the integration. This involved a C implementation of JavaScript interpreting user programs with high-level dialog control and low-level speech integration. I just couldn’t see how the ends could be made to meet.

### Use the Force

Sasha said something along the lines of “use the force”. What he really said is probably more along the lines of “when you’re dealing with good code like this you just have a sense of where everything should go and if you stick to the plan, it usually works”. It sounded awfully reckless to me, but then I was having trouble seeing how version control could work with 20 programmers sharing a code base.

He applied this philosophy on percolating arguments through call chains, propagating return codes, dealing with exceptions (all hacked up with gotos to end-of-function cleanup blocks because we were using straight-up C), and even figuring out what the bounds of a loop should be.

It works. But only once you get to a certain level of expertise where you know what to expect and how things look if they’re “right”. And only if the other people you work with write idiomatic code.

I was reminded of Sasha’s early lessons on two occasions recently.

### Sometimes it Works

First, I just added print statements to Stan’s modeling language. I needed to pass a standard output stream as well as a standard error stream to be the target of code writes. The error stream was already getting propagated. Even though I didn’t write a lot of the code I’m dealing with, the other people I work with are well-trained C++ coders, so everything just works as expected. It was like the current code was sprinkled with bread crumbs I could follow to do what I needed.

The force worked!

### Sometimes it Doesn’t

Second, I was refactoring some student-written Java code, and it was so non-idiomatic I couldn’t make heads or tails of it. (Think Daily WTF levels of insanity here.) I had no idea what the original programmer intended or how the code was supposed to implement those intentions.

The force completely failed me.

### Chess Memory in Experts and Novices

This all reminds me of some of my favorite psychology experiments ever, by de Groot in the 1950s with followups by Simon and Chase in the 1970s. (I heard about them in Herb Simon’s wonderful cog psych class/seminar at CMU.) It also relates to the seminal short-term memory experiments of George Miller in the 1950s.

The main takeaway is that chess experts have great memories for board positions if and only if the boards make sense tactically. They’re no better than amateurs at remembering random board positions. Even the errors they make in reconstructing board positions tend to preserve the tactical arrangement, even if not all the pieces end up in the right places. Herb’s takeaway message was that “memory” is very tied up with expectations and the ability to “chunk” information into bundles. Miller’s experiments and followups showed we could remember as many bundles of information as single random pieces of information (the famous “7 +/- 2” figure for the number of items humans can hold in short-term memory).

Here’s a nice survey of the psychology of chess that covers the above experiments and more.

## Dilbert Meets Big Unstructured Data … and Builds a Framework

September 5, 2012

Best Dilbert ever. Or at least the most relevant to this blog:

Dilbert, 5 September 2012

I’ll give you the setup. Dilbert walks into a bar and strikes up a conversation with a woman who asks him what he does for a living. Dilbert replies, “I’m working on a framework to allow construction of large-scale analytical queries on unstructured data.”

I’ll leave the punchline to the strip.

## Using Luke the Lucene Index Browser to develop Search Queries

July 24, 2012

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Downloads are available from http://code.google.com/p/luke/. You can download the source or just the standalone binary .jar file. Right now we’re using the just-released Luke 4.0.0-ALPHA standalone binary. To run Luke, just fire up the jar file from the command line:

> java -jar lukeall-4.0.0-ALPHA.jar

Once launched, Luke opens with a dialog menu to open an index. We’re using an index called fed-papers.idx, which is an index over the Federalist Papers, a set of 85 op-ed pieces written by James Madison, Alexander Hamilton, and/or John Jay. This is taken from the example in our recently updated Lucene 3 tutorial. To build the index yourself, download the data and accompanying source code and follow the instructions in the tutorial. Each of the 85 papers is treated as a document. The text of the document has been indexed using Lucene’s StandardAnalyzer and is stored in a field named text.

The following is a screenshot of the Luke GUI upon first opening the index. We’ve circled the main controls on the top left corner of the GUI. Luke opens with the Overview tab pane.

All 85 documents have a similar structure. They start with an identifying number, e.g. FEDERALIST No. 1. The opening salutation is: To the People of the State of New York:. Looking at the top terms in the index, we see that there are 85 documents which contain the words {federalist, people, state, york} as well as common function words. The StandardAnalyzer has stop-listed other function words, including {to, of, the, no}. Otherwise, these too would be among the top terms with a document frequency of 85.

### Exploring Document Indexing

The Document tab pane lets you examine individual documents in the index. In the screenshot below we have selected the 47th document in the index, FEDERALIST No. 48. Circled in red are the series of controls used to get to this point. First we chose the Document tab pane, and then used the “Browse by document number” control to choose document 47. The bottom half of the window shows the document fields.

Clicking on the “Reconstruct & Edit” control opens a new pop-up window which allows us to inspect the document contents on a field-by-field basis. Below we show side-by-side screenshots of the text field. On the left we see the raw text as stored; on the right we see the result of tokenization and indexing via the Lucene StandardAnalyzer. Lucene’s StandardAnalyzer includes a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. Tokenization happens first. This removes punctuation and assigns each token a position. Then the tokens are lower-cased and stop-listed.

We have circled the text of the opening salutation: To the People of the State of New York:. It consists of 9 words followed by 1 punctuation symbol. The tokenizer strips out the final colon. The stop-list filter removes the tokens at positions {1, 2, 4, 5, 7}. The text of tokens which have been indexed is displayed as is. Tokens which were stop-listed are missing from the index, so Luke displays the string null followed by the number of consecutive stop-listed tokens.

### Exploring Search

The Search tab pane packs a lot of controls. The annotated screenshot below shows the results of running a search over the index.

In the top right quadrant is an embedded tab pane (tabs circled in red). On the Analysis tab, we have selected the StandardAnalyzer from a pull-down list of Analyzers and selected the default search field from a pull-down list of all fields in the index. Given this, Luke constructs the QueryParser used to parse queries entered into the text box in the upper left quadrant. We entered the words: “Powers of the Judiciary” into this text box, circled in red. Directly below it is the parsed query, also circled in red. Clicking on the “Search” control runs this search over the index. The bottom pane displays the ranked results.

Luke passes the input string from the search text box to the QueryParser which parses it using the specified analyzer and creates a set of search terms, one term per token, over the specified default field. In this example, the StandardAnalyzer tokenizes, lowercases, and stop-lists these words, resulting in two search terms text:powers and text:judiciary. The result of this search pulls back all documents which contain the words powers and/or judiciary.

To drill down on Lucene search, we use the Explain Structure control, circled in red on the following screenshot.

When clicked, this pops up another pane which shows all details of the query constructed from the search expression. The structure of the query used in this example is:

lucene.BooleanQuery
clauses=2, maxClauses=1024
Clause 0: SHOULD
lucene.TermQuery
Term: field='text' text='powers'
Clause 1: SHOULD
lucene.TermQuery
Term: field='text' text='judiciary'

The search expression can be any legal search string according to the Lucene Query Parser Syntax. To search for the phrase powers of the judiciary we need to enclose it in double quotes. But this new search produces no results, which is clearly wrong, since this phrase occurs in the title of FEDERALIST No. 80 (see search results above).

In the query details we see that the input has been parsed into text:"powers judiciary". The Explain Structure feature returns the following:

lucene.PhraseQuery, slop=0
pos: [0,1]
Term 0: field='text' text='powers'
Term 1: field='text' text='judiciary'

The problem here is that the token positions specified don’t allow for the intervening stop-listed words. To correct this, we need to adjust Luke’s default settings on the QueryParser. The screenshot below shows both the controls changed and the query results.

First we go to the QueryParser tab in the set of search controls in the top right quadrant, then we click the checkbox labeled Enable positionIncrements. Now the parsed query looks like this: text:"powers ? ? judiciary", which translates to the following programmatic query:

lucene.PhraseQuery, slop=0
pos: [0,3]
Term 0: field='text' text='powers'
Term 1: field='text' text='judiciary'

Finally, we select the single search result, and click on the Explain control (circled in red). This pops up another window which explains the query score (outlined in red). Here is the text of the explanation:

0.0847  weight(text:"powers ? ? judiciary" in 79) [DefaultSimilarity], result of:
0.0847  fieldWeight in 79, product of:
1.0000  tf(freq=1.0), with freq of:
1.0000  phraseFreq=1.0
3.6129  idf(), sum of:
1.3483  idf(docFreq=59, maxDocs=85)
2.2646  idf(docFreq=23, maxDocs=85)
0.0234  fieldNorm(doc=79)
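The numbers in this explanation can be reproduced by hand. Here is a Python sketch using the classic Lucene TF-IDF formulas as I understand them for DefaultSimilarity (tf is the square root of the frequency, idf is 1 + ln(maxDocs / (docFreq + 1))). The field norm of 0.0234375 is my assumption for the unrounded value behind the displayed 0.0234, and the function names are mine.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term (or phrase) frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def phrase_field_weight(phrase_freq, doc_freqs, max_docs, field_norm):
    """fieldWeight = tf * (sum of the idfs of the phrase terms) * fieldNorm."""
    idf_sum = sum(idf(df, max_docs) for df in doc_freqs)
    return tf(phrase_freq) * idf_sum * field_norm


if __name__ == "__main__":
    FIELD_NORM = 0.0234375  # assumed unrounded value of the displayed 0.0234
    print(round(idf(59, 85), 4), round(idf(23, 85), 4))
    print(round(phrase_field_weight(1.0, [59, 23], 85, FIELD_NORM), 4))
```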

### Using the Lucene XML Query Parser

The Lucene API contains many kinds of queries beyond those generated by the QueryParser. You can use Luke to develop these queries as well, via the Lucene XML Query Parser. It is almost impossible to find any on-line documentation on this most excellent contribution to Lucene by Mark Harwood. Luckily, it is distributed with Lucene source tgz files. For Lucene 3.6 the path to this documentation is: lucene-3.6.0/contrib/xml-query-parser/docs/index.html.

Recast as a Lucene XML query, our original search text:powers text:judiciary becomes:

 <BooleanQuery fieldName="text">
<Clause occurs="should">
<TermQuery>powers</TermQuery>
</Clause>
<Clause occurs="should">
<TermQuery>judiciary</TermQuery>
</Clause>
</BooleanQuery>

The screenshot below shows the results of pasting the above XML into the search expression textbox with the QueryParser control Use XML Query Parser turned on. (Note: currently this works in Luke 3.5 but not in Luke 4.0-ALPHA. We hate the bleeding edge).

The XML rewrites to exactly the same query as our first query.

Big deal, you might be saying right about now. That’s a whole lotta chopping for not much kindling. Au contraire! One of the reasons for developing the XML Query Syntax is this:

To bridge the growing gap between Lucene query/filtering functionality and the set of functionality accessible through the standard Lucene QueryParser syntax.

For example, a while back, Mark Miller blogged about Lucene SpanQueries. This is similar to but not the same as PhraseQuery. We can use Luke to compare and contrast the difference between the two. Here’s our earlier phrase query Powers of the Judiciary recast as a span query.

  <SpanNear slop="3" inOrder="true" fieldName="text">
<SpanTerm>powers</SpanTerm>
<SpanTerm>judiciary</SpanTerm>
</SpanNear>

The screenshot below shows the results of running this query.

FEDERALIST No. 48 scores slightly better on this query than does FEDERALIST No. 80. Understanding why is left as an exercise to the reader.

## MacBook Pro 15″ Retina Display Awesomeness

July 14, 2012

I just received my new MacBook Pro 15″ with the Retina display.

First, I have to mention how blown away I was that Apple has a feature (the “Migration Assistant”) that lets you clone your last computer. An hour or two after setting up, the new MacBook Pro had all the software, data, and settings (well, almost all) from my previous computer, a MacBook Air. All done over my home wireless network (though our sysadmin here at Columbia strongly recommended a wired connection, my Air doesn’t have a port and I didn’t have a dongle).

Yes, text is just as beautiful as on the iPad3. So are photos and images. Everything else I use is looking awfully pixelated in comparison (such as this blog post I’m typing into Safari on my 27″ iMac).

The biggest downside is that it’s big (15″ diagonal screen vs. 13″ on my MacBook Air) and heavy (4.5 lbs vs. 2.9 lbs for the Air). Though big isn’t so bad — the 15″ screen seems luxurious after the Air’s rather cramped confines. Some software’s not up to the display, so the text looks really bad on the new MacBook Pro. Firefox and Thunderbird, for instance, look terrible. Overall, it’s just not as nice to handle as the Air. (Not to mention Columbia slapping the ugliest anti-theft stickers ever on it. Now I look like both a hipster clone and a corporate drone at the same time.) The magsafe cable has a very strong magnet compared to the Air’s and sticks out a bit more. And to add insult to injury, they’re not interchangeable, so we had to throw more money toward Cupertino.

I’d say the price is a downside (mine came out to about \$2700 before Columbia discounts, including AppleCare). Even if I were buying this myself it’d be worth it, because I’ll average at least 20 hours/week use for two or more years.

Additional upsides are 16GB of memory and four cores. With that, it runs the Stan C++ unit tests in under 3 minutes (it takes around 12 minutes on the Air and the Air starts buzzing like an angry fly). The HDMI port saves a dongle, but then the change to Thunderbolt meant buying another one. I don’t know that I’ll get much use out of USB 3.0 (the iPad 3 is only USB 2.0). I also get 256GB of SSD, though I never filled the 128GB I had on the Air. The ethernet port and HDMI port are handy: two fewer dongles compared to the MacBook Air if you need either of these ports.

I haven’t heard the fan. I’ve heard about it — it’s asymmetrical, which according to my signal processing geek friends, reduces the noise tremendously. It’s either super quiet or the machine’s so powerful the Stan unit tests don’t stress it out.

## grammar why ! matters

July 11, 2012

For all those of you who might have read things like this, this, or this, I want to explain why the answer is “yes”, spelling and grammar matter.

### Language is a Tool

Language is a tool used for many purposes.

If your goal is to entertain, there are different conventions. Singers like Bob Dylan can be highly entertaining while remaining nearly incomprehensible. If your goal is to connect to friends or loved ones, yet other conventions come into play.

Sometimes language is used for multiple things at once.

### Language is a Convention

Language is a matter of convention. We simply cannot write or say whatever we want to however we want to and be understood.

If your purpose is communication, it behooves you to make your message clear. There are exceptions to this, too. I might be trying to communicate how worldly I am by using French or Italian food terms or pronunciations instead of English, even knowing the audience won’t understand them.

Communicating means using shared conventions.

### Word Order

For instance, consider word order. Consider the following: “understood be and to want we however to want we whatever say or write cannot simply we”. You’ve seen that sentence above, only in reverse. In reverse, it’s pretty much impossible to understand.

Even in the CBS piece by Steve Tobak, the author mocks bad grammar with “me want food”. Well, that has a subject, verb and object, in perfect English order, which is why it’s so easy to understand. It even has the tense of “want” and the number and lack of determiner for “food” right. The only mistake is the object/subject distinction in “me” vs. “I”!

### Spelling?

Tobak goes on to quote a comment, “I jus read your article; ___. Very interesting!” What’s wrong with bad spelling? It’s unpleasant because it slows us down as readers. If it gets bad enough, it can block understanding. I had no problem detangling the last example, but how about “I js rd y ar — int!!!!!!!”?

Spelling used to be even more chaotic in English. It’s better in some other languages.

### Disclaimers

I’m all for telegraphic speech. It works best in shared contexts. It’s a little harder with a bare Tweet. Language is incredibly tied up with context. Enough world knowledge can get you by, too. I might be able to refer to a TV show by “ST:TNG”, but my mom would have no idea what I was talking about.

For some purposes, precision and clarity matter much less. Consider drafting legislation vs. planning to meet at a restaurant vs. saying hello. Telegraphic speech can be very precise. Doctors’ notes to each other are a prime example. You don’t need a verb if everyone knows there’s only one thing to do with a device or a noun if there’s only one device to use.

Saying language is conventional and conventions should be followed is a subtly different stance from traditional linguistic prescriptivism. Languages change. If they didn’t, English wouldn’t even exist. I’m not railing against split infinitives, dangled prepositions, a complete failure to understand “who”/“whom” or even “I”/“me”, abandoning adverbial morphology, using “ain’t”, pronouncing “ask” like “axe”, etc. etc. I think these all have a good chance of achieving “proper” English status one day.

## Lucene Tutorial updated for Lucene 3.6

July 5, 2012

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods (and my, there are a lot of them). Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

### Significant changes since version 3.0

• IndexReader delete methods are deprecated and will be removed entirely in Lucene 4. All deletes and updates are done via an IndexWriter.
• There is a single IndexWriter constructor that takes two arguments: the index directory and an IndexWriterConfig object. The latter was introduced in Lucene 3.1. It holds configuration information that was previously specified directly as additional arguments to the constructor.
• IndexWriter optimize methods are deprecated. The merge method(s) supply this functionality.

### Building the Source

The ant build file is in the file src/applucene/build.xml and should be run from that directory. The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, but they are able to share libs. There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated. As an added bonus, the source code now includes the data used in the examples throughout the tutorial, the venerable Federalist Papers from Project Gutenberg.