Author Archive

Upgrading from Beta-Binomial to Logistic Regression

October 30, 2012

Bernoulli Model

Consider the following very simple model of drawing the components of a binary random N-vector y i.i.d. from a Bernoulli distribution with chance of success theta.

data {
  int<lower=0> N;             // number of items
  int<lower=0,upper=1> y[N];  // binary outcome for item n
}
parameters {
  real<lower=0,upper=1> theta;  // Prob(y[n]=1) = theta
}
model {
  theta ~ beta(2,2); // (very) weakly informative prior
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

The beta distribution is used as a prior on theta. This is the Bayesian equivalent to an “add-one” prior. This is the same model Laplace used in the first full Bayesian analysis (or as some would have it, Laplacian inference) back in the Napoleonic era. He used it to model male-vs.-female birth ratio in France, with N being the number of samples and y[n] = 1 if the baby was male and 0 if female.

Beta-Binomial Model

You can get the same inferences for theta here by replacing the Bernoulli distribution with a binomial:

model {
  theta ~ beta(2,2);
  sum(y) ~ binomial(N,theta);
}
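Since the beta prior is conjugate to both likelihoods, you can check the equivalence directly; either way the posterior is beta(2 + sum(y), 2 + N - sum(y)). Here’s a quick R sketch with made-up data (mine, not part of the original post):

# R sketch (not Stan): the Bernoulli and binomial formulations give the
# same posterior for theta under a beta(2,2) prior.
set.seed(1234)
N = 50
y = rbinom(N, 1, 0.3)   # simulated binary data
theta = seq(0.001, 0.999, by = 0.001)
log_post_bern = dbeta(theta, 2, 2, log = TRUE) +
  sapply(theta, function(t) sum(dbinom(y, 1, t, log = TRUE)))
log_post_binom = dbeta(theta, 2, 2, log = TRUE) +
  dbinom(sum(y), N, theta, log = TRUE)
# The two log posteriors differ only by the constant log choose(N, sum(y)),
# so they normalize to the same density.
range(log_post_bern - log_post_binom)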

But it doesn’t generalize so well. What we want to do is let theta vary with predictors (“features” in NLP speak, covariates to some statisticians) of the individual items n.

Logistic Regression Model

A roughly similar model can be had by moving to the logistic scale and replacing theta with a logistic regression with only an intercept coefficient alpha.

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real alpha;  // inv_logit(alpha) = Prob(y[n]=1)
}
model {
  alpha ~ normal(0,5);  // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha));
}

Recall that the logistic sigmoid (inverse of the logit, or log odds function) maps

\mbox{logit}^{-1}:(-\infty,\infty)\rightarrow(0,1)

by taking

\mbox{logit}^{-1}(u) = 1 / (1 + \mbox{exp}(-u)).

The priors aren’t quite the same in the Bernoulli and logistic models, but that’s no big deal. In more flexible models, we’ll move to hierarchical models on the priors.
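If you want to see how different the two priors really are, here’s a quick R sketch (mine, not from the post) comparing the beta(2,2) prior on theta with the prior on theta implied by a normal(0,5) prior on alpha, using R’s plogis as the inverse logit:

# Compare the beta(2,2) prior on theta with the prior on theta implied by
# alpha ~ normal(0,5) pushed through the inverse logit (plogis).
theta_beta = rbeta(1e5, 2, 2)
theta_logit = plogis(rnorm(1e5, 0, 5))
par(mfrow = c(1, 2))
hist(theta_beta, breaks = 50, main = "beta(2,2) on theta", xlab = "theta")
hist(theta_logit, breaks = 50, main = "normal(0,5) on alpha", xlab = "inv_logit(alpha)")

The normal(0,5) prior on the log odds piles up far more mass very close to 0 and 1 than beta(2,2) does, which is one reason the two models’ inferences won’t match exactly.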

Adding Features

Now that we have the inverse logit transform in place, we can replace theta with a regression on predictors for y[n]. You can think of the second model as an intercept-only regression. For instance, with a single predictor x[n], we could add a slope coefficient beta and write the following model.

data {
  int<lower=0> N;             // number of items
  int<lower=0,upper=1> y[N];  // binary outcome for item n
  real x[N];                  // predictive feature for item n
}
parameters {
  real alpha;  // intercept
  real beta;  // slope
}
model {
  alpha ~ normal(0,5);  // weakly informative
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}
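As a quick sanity check (an R sketch of mine, not part of the post), you can simulate data from this model and confirm that an ordinary maximum likelihood logistic regression recovers the coefficients:

# Simulate from the logistic regression above and recover alpha and beta
# by maximum likelihood; plogis is R's inverse logit.
set.seed(42)
N = 1000
alpha = -1
beta = 2
x = rnorm(N)
y = rbinom(N, 1, plogis(alpha + beta * x))
fit = glm(y ~ x, family = binomial(link = "logit"))
coef(fit)   # estimates should land near alpha = -1 and beta = 2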

Stan

I used Stan’s modeling language — Stan is the full Bayesian inference system I and others have been developing (it runs from the command line or from R). For more info on Stan, including a link to the manual for the modeling language, see:

Stan Home Page:  http://mc-stan.org/

Stan’s not a competitor for LingPipe, by the way. Stan scales well for full Bayesian inference, but doesn’t scale like LingPipe’s SGD-based point estimator for logistic regression. And Stan doesn’t do structured models like HMMs or CRFs and has no language-specific features like tokenizers built in. As I’ve said before, it’s horses for courses.

High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gathering NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high-accuracy annotators, much less acceptable kappa values — it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

Kappa for varying prevalences and accuracies

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy and prevalence each run from 0 to 1. Here’s the whole plot for comparison.

Kappa for varying prevalences and accuracies

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definition of kappa, which yields the following R function for computing the expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracy.

kappa_fun = function(prev,acc) {
  agr = acc^2 + (1 - acc)^2;                   # observed agreement: both right or both wrong
  cat1 = acc * prev + (1 - acc) * (1 - prev);  # probability an annotator responds category 1
  e_agr = cat1^2 + (1 - cat1)^2;               # chance-expected agreement
  return((agr - e_agr) / (1 - e_agr));         # kappa
}

Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

                  Annotator 2
                  Cat1   Cat2
Annotator 1 Cat1   170     90
            Cat2    90    650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.
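You can reproduce both the table and the kappa value in a few lines of R using the kappa_fun defined above (the cell probabilities here are mine, worked out from the same assumptions):

# Reproduce the prevalence = 0.2, accuracy = 0.9, N = 1000 example.
prev = 0.2
acc = 0.9
N = 1000
both_cat1 = prev * acc^2 + (1 - prev) * (1 - acc)^2               # both respond Cat1
both_cat2 = prev * (1 - acc)^2 + (1 - prev) * acc^2               # both respond Cat2
disagree = prev * acc * (1 - acc) + (1 - prev) * (1 - acc) * acc  # each off-diagonal cell
round(N * c(both_cat1, disagree, disagree, both_cat2))            # 170  90  90 650
kappa_fun(prev, acc)                                              # 0.532...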

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus the square of one minus the accuracy (both annotators are wrong and hence agree — two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus the product of one minus the accuracy and one minus the prevalence (true category is negative, incorrect response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the agreement you’d get if each annotator responded at random according to that marginal response rate. So that’s just the probability of a category 1 response squared (both annotators respond category 1) plus the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.
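In symbols, writing A for the observed agreement (agr in the code) and E for the chance-expected agreement (e_agr), that’s

\kappa = \frac{A - E}{1 - E}.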

Back to the plot. The white border is set at 0.66, the lower-end threshold established by Krippendorff for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorff was 0.8, which is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into the acceptable kappa range on a binary classification problem, especially if prevalence is high, because as prevalence goes up, kappa goes down.

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
# evaluate kappa_fun on a (K + 1) x (K + 1) grid of prevalence and accuracy
for (m in 1:(K + 1)) {
  for (n in 1:(K + 1)) {
    prevalence[pos] = (m - 1) / K;
    accuracy[pos] = (n - 1) / K;
    kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
    pos = pos + 1;
  }
}
library("ggplot2");
df = data.frame(prevalence=prevalence,
                accuracy=accuracy,
                kappa=kappa);
kappa_plot = 
  ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
     labs(title = "Kappas for Binary Classification\n") +
     geom_tile() +
     scale_x_continuous(expand=c(0,0), 
                        breaks=c(0,0.25,0.5,0.75,1),
                        limits =c(0.5,1)) +
     scale_y_continuous(expand=c(0,0), 
                        breaks=seq(0,1,0.1), 
                        limits=c(0.85,1)) +
     scale_fill_gradient2("kappa", limits=c(0,1), midpoint=0.66,
                          low="orange", mid="white", high="blue",
                          breaks=c(1,0.8,0.66,0));
print(kappa_plot);

Refactoring in the Zone

September 17, 2012

I remember very clearly when I first started to work as a professional programmer. I was tasked with first designing and then integrating a new semantic interpreter for the grammars in SpeechWorks’s speech recognizer.

I didn’t know my ass from my elbow and pretty much couldn’t get off the ground on my own.

Get Adopted by a Great Mentor

Luckily, Sasha Caskey pretty much saved my professional programming life by pair programming with me until I “got it” and on a continuing basis after that so I didn’t forget.

One of Sasha’s memorable early lessons involved refactoring or adding new features. After everything was designed (something I’m still more comfortable with than coding), there was the integration. This involved a C implementation of JavaScript interpreting user programs with high-level dialog control and low-level speech integration. I just couldn’t see how the ends could be made to meet.

Use the Force

Sasha said something along the lines of “use the force”. What he really said is probably more along the lines of “when you’re dealing with good code like this you just have a sense of where everything should go and if you stick to the plan, it usually works”. It sounded awfully reckless to me, but then I was having trouble seeing how version control could work with 20 programmers sharing a code base.

He applied this philosophy to percolating arguments through call chains, propagating return codes, dealing with exceptions (all hacked up with gotos to end-of-function cleanup blocks because we were using straight-up C), and even figuring out what the bounds of a loop should be.

It works. But only once you get to a certain level of expertise where you know what to expect and how things look if they’re “right”. And only if the other people you work with write idiomatic code.

I was reminded of Sasha’s early lessons on two occasions recently.

Sometimes it Works

First, I just added print statements to Stan’s modeling language. I needed to pass a standard output stream as well as a standard error stream to be the target of code writes. The error stream was already getting propagated. Even though I didn’t write a lot of the code I’m dealing with, the other people I work with are well-trained C++ coders, so everything just works as expected. It was like the current code was sprinkled with bread crumbs I could follow to do what I needed.

The force worked!

Sometimes it Doesn’t

Second, I was refactoring some student-written Java code and it’s so non-idiomatic I can’t make heads or tails of it. (Think Daily WTF levels of insanity here.) I had no idea what the original programmer intended or how the code was supposed to implement those intentions.

The force completely failed me.

Chess Memory in Experts and Novices

This all reminds me of some of my favorite psychology experiments ever, by de Groot in the 1950s with followups by Simon and Chase in the 1970s. (I heard about them in Herb Simon’s wonderful cog psych class/seminar at CMU.) It also relates to the seminal short-term memory experiments of George Miller in the 1950s.

The main takeaway is that chess experts have great memories for board positions if and only if the boards make sense tactically. They’re no better than amateurs at remembering random board positions. Even the errors they make in reconstructing board positions tend to preserve the tactical arrangement, even if the individual pieces aren’t all in exactly the right places. Herb’s takeaway message was that “memory” is very tied up with expectations and the ability to “chunk” information into bundles. Miller’s experiments and followups showed we can remember about as many bundles of information as single random pieces of information (the famous “7 +/- 2” figure for the number of items humans can hold in short-term memory).

Here’s a nice survey of the psychology of chess that covers the above experiments and more.

Dilbert Meets Big Unstructured Data … and Builds a Framework

September 5, 2012

Best Dilbert ever. Or at least the most relevant to this blog:

Dilbert, 5 September 2012

I’ll give you the setup. Dilbert walks into a bar and strikes up a conversation with a woman who asks him what he does for a living. Dilbert replies, “I’m working on a framework to allow construction of large-scale analytical queries on unstructured data.”

I’ll leave the punchline to the strip.

MacBook Pro 15″ Retina Display Awesomeness

July 14, 2012

I just received my new MacBook Pro 15″ with the Retina display.

First, I have to mention how blown away I was that Apple has a feature (the “Migration Assistant”) that lets you clone your last computer. An hour or two after setting up, the new MacBook Pro had all the software, data, and settings (well, almost all) from my previous computer, a MacBook Air. All done over my home wireless network (though our sysadmin here at Columbia strongly recommended a wired connection, my Air doesn’t have an ethernet port and I didn’t have a dongle).

Yes, text is just as beautiful as on the iPad3. So are photos and images. Everything else I use is looking awfully pixelated in comparison (such as this blog post I’m typing into Safari on my 27″ iMac).

The biggest downside is that it’s big (15″ diagonal screen vs. 13″ on my MacBook Air) and heavy (4.5 lbs vs. 2.9 lbs for the Air). Though big isn’t so bad — the 15″ screen seems luxurious after the Air’s rather cramped confines. Some software’s not up to the display, so its text looks really bad on the new MacBook Pro; Firefox and Thunderbird, for instance, look terrible. Overall, it’s just not as nice to handle as the Air. (Not to mention Columbia slapping the ugliest anti-theft stickers ever on it. Now I look like both a hipster clone and a corporate drone at the same time.) The MagSafe cable has a very strong magnet compared to the Air’s and sticks out a bit more. And to add insult to injury, they’re not interchangeable, so we had to throw more money toward Cupertino.

I’d say the price is a downside (mine came out to about $2700 before Columbia discounts, including AppleCare). Even if I were buying this myself it’d be worth it, because I’ll average at least 20 hours/week use for two or more years.

Additional upsides are 16GB of memory and four cores. With that, it runs the Stan C++ unit tests in under 3 minutes (it takes around 12 minutes on the Air, and the Air starts buzzing like an angry fly). The HDMI port saves a dongle, but then the change to Thunderbolt meant buying another one. I don’t know that I’ll get much use out of USB 3.0 (the iPad 3 is only USB 2.0). I also get 256GB of SSD, though I never filled the 128GB I had on the Air. The ethernet port and HDMI port are handy — two fewer dongles compared to the MacBook Air if you need either of these ports.

I haven’t heard the fan. I’ve heard about it — it’s asymmetrical, which, according to my signal processing geek friends, reduces the noise tremendously. It’s either super quiet or the machine’s so powerful the Stan unit tests don’t stress it out.

grammar why ! matters

July 11, 2012

For all those of you who might have read things like this, this, or this, I want to explain why the answer is “yes”, spelling and grammar matter.

Language is a Tool

Language is a tool used for many purposes.

If your goal is to entertain, there are different conventions. Singers like Bob Dylan can be highly entertaining while remaining nearly incomprehensible. If your goal is to connect to friends or loved ones, yet other conventions come into play.

Sometimes language is used for multiple things at once.

Language is a Convention

Language is a matter of convention. We simply cannot write or say whatever we want to however we want to and be understood.

If your purpose is communication, it behooves you to make your message clear. There are exceptions to this, too. I might be trying to communicate how worldly I am by using French or Italian food terms or pronunciations instead of English, even knowing the audience won’t understand them.

Communicating means using shared conventions.

Word Order

For instance, consider word order. Consider the following “understood be and to want we however to want we whatever say or write cannot simply we”. You’ve seen that sentence above, only in reverse. In reverse, it’s pretty much impossible to understand.

Even in the CBS piece by Steve Tobak, the author mocks bad grammar with “me want food”. Well, that has a subject, verb and object, in perfect English order, which is why it’s so easy to understand. It even has the tense of “want” and the number and lack of determiner for “food” right. The only mistake is the object/subject distinction in “me” vs. “I”!

Spelling?

Tobak goes on to quote a comment, “I jus read your article; ___. Very interesting!” What’s wrong with bad spelling? It’s unpleasant because it slows us down as readers. If it gets bad enough, it can block understanding. I had no problem detangling the last example, but how about “I js rd y ar — int!!!!!!!”?

Spelling used to be even more chaotic in English. It’s better in some other languages.

Disclaimers

I’m all for telegraphic speech. It works best in shared contexts. It’s a little harder with a bare Tweet. Language is incredibly tied up with context. Enough world knowledge can get you by, too. I might be able to refer to a TV show by “ST:TNG”, but my mom would have no idea what I was talking about.

For some purposes, precision and clarity matter much less. Consider drafting legislation vs. planning to meet at a restaurant vs. saying hello. Telegraphic speech can be very precise. Doctors’ notes to each other are a prime example. You don’t need a verb if everyone knows there’s only one thing to do with a device or a noun if there’s only one device to use.

Saying language is conventional and conventions should be followed is a subtly different stance from traditional linguistic prescriptivism. Languages change. If they didn’t, English wouldn’t even exist. I’m not railing against split infinitives, dangled prepositions, a complete failure to understand “who”/“whom” or even “I”/“me”, abandoning adverbial morphology, using “ain’t”, pronouncing “ask” like “axe”, etc. etc. I think these all have a good chance of achieving “proper” English status one day.

More on the Terminology of Averages, Means, and Expectations

June 21, 2012

I missed Cosma Shalizi’s comment on my first post on averages versus means. Rather than write a blog-length reply, I’m pulling it out into its own little lexicographic excursion. To borrow stylistically from Cosma’s blog, I’ll warn you, the reader, that this post is more linguistics than statistics.

Three Concepts and Terminology

Presumably everyone’s clear on the distinctions among the three concepts,

1. [arithmetic] sample mean,

2. the mean of a distribution, and

3. the expectation of a random variable.

The relations among these concepts are very rich, which is what I conjecture is causing their conflation.

Let’s set aside the discussion of “average”, as it’s less precise terminologically. But even the very precision of the term “average” is debatable! The Random House Dictionary lists the meaning of “average” in statistics as the arithmetic mean. Wikipedia, on the other hand, agrees with Cosma, and lists a number of statistics of centrality (sample mean, sample median, etc.) as being candidates for the meaning of “average”.

Distinguishing Means and Sample Means

Getting to the references Cosma asked for, all of my intro textbooks (Ash, Feller, DeGroot and Schervish, Larsen and Marx) distinguished sense (1) from senses (2) and (3). Even the Wikipedia entry for “mean” leads off with

In statistics, mean has two related meanings:

the arithmetic mean (and is distinguished from the geometric mean or harmonic mean).

the expected value of a random variable, which is also called the population mean.

Unfortunately, the invocation of the population mean here is problematic. Random variables aren’t intrinsically related to populations in any way (at least under the Bayesian conception of what can be modeled as random). Populations can be characterized by a set of (conditionally) independent and identically distributed (i.i.d.) random variables, each corresponding to a measurable quantity of a member of the population. And of course, averages of random variables are themselves random variables.

This reminds me of the common typological mistake of talking about “sampling from a random variable” (follow the link for Google hits for the phrase).

Population Means and Empirical Distributions

The Wikipedia introduces a fourth concept, population mean, which is just the arithmetic mean of a given population. This is related to the point Cosma brought up in his comment that you can think of a sample mean as the mean of the empirical distribution of the observations. For instance, if you observe three heads and a tail in four coin flips and create a discrete random variable X with p_X(1) = 0.75 and p_X(0) = 0.25, then the average number of heads is equal to the expectation of X, or the mean of p_X.
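That identity is easy to check numerically; here’s a tiny R sketch of my own for the coin-flip example:

# Three heads and a tail: the sample average equals the mean of the
# empirical distribution p_X(1) = 0.75, p_X(0) = 0.25.
y = c(1, 1, 1, 0)              # observed flips, heads coded as 1
mean(y)                        # sample average: 0.75
p = table(y) / length(y)       # empirical distribution
sum(as.numeric(names(p)) * p)  # mean of the empirical distribution: 0.75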

Conflating Means and Expectations

I was surprised that, like the Wikipedia, almost all the sources I consulted explicitly conflated senses (2) and (3). Feller’s 1950 An Introduction to Probability Theory and Its Applications, Vol. 1 says the following on page 221.

The terms mean, average, and mathematical expectation are synonymous. We also speak of the mean of a distribution instead of referring to a corresponding random variable.

The second sentence is telling. Distributions have means independently of whether we’re talking about a random variable or not. If one forbids talk of distributions as first-class objects with their own existence free of random variables, one might argue that concepts (2) and (3) should always be conflated.

Metonymy and Lexical Semantic Coercion

I think the short story about what’s going on in conflating (2) and (3) is metonymy. For example, I can use “New York” to refer to the New York Yankees or the city government, but no one will understand me if I try to use “New York Yankees” to refer to the city or the government. I’m taking one aspect of the team, namely its location, and using that to refer to the whole.

This can happen implicitly with other kinds of associations. I can talk about the “mean” of a random variable X by implicitly invoking its probability function p_X(x). I can also talk about the expectation of a distribution by implicitly invoking the appropriate random variable. Sometimes authors try to sidestep random variable notation by writing \mathbb{E}_{\pi(x)}[x], which to my type-sensitive mind appears ill-formed; what they really mean to write is \mathbb{E}[X] where p_X(x) = \pi(x).

I found it painfully difficult to learn statistics because of this sloppiness with respect to the types of things, especially among less careful authors. Bayesians, myself now included, often write x for both a random variable and a bound variable; see Gelman et al.’s Bayesian Data Analysis for a typical example of this style.

Settling down into my linguistic armchair, I’ll conclude by noting that it seems to me more felicitous to say

the expectation of a random variable is the mean of its distribution.

than to say

the mean of a random variable is the expectation of its distribution.

Computing Autocorrelations and Autocovariances with Fast Fourier Transforms (using Kiss FFT and Eigen)

June 8, 2012

[Update 8 August 2012: We found that for KissFFT, if the size of the input is not a power of 2, 3, or 5, then things really slow down. So now we’re padding the input to the next power of 2.]

[Update 6 July 2012: It turns out there’s a slight correction needed to what I originally wrote. The correction is described on this page:

I’m fixing the presentation below to reflect the correction. The change is also reflected in the updated Stan code using Eigen, but not the updated Kiss FFT code.]

Suppose you need to compute all the sample autocorrelations for a sequence of observations

x = x[0],...,x[N-1]

The most efficient way to do this is with the discrete fast Fourier transform (FFT) and its inverse; it’s {\mathcal O}(N \log N) versus {\mathcal O}(N^2) for the naive approach. That much I knew. I had both experience with Fourier transforms from my speech reco days (think spectrograms) and an understanding of the basic functional analysis principles. What I didn’t know was how to code it given an FFT library. The web turned out to be not much help — the explanations I found were all over my head complex-analysis-wise and I couldn’t find simple code examples.

Matt Hoffman graciously volunteered to give me a tutorial and wrote an initial prototype. It turns out to be really really simple once you know which way ’round the parts all go.

Autocorrelations via FFT

Conceptually, the input N-vector x lives in the time domain; the FFT takes it to the frequency domain, where the squared norms are computed, and the inverse FFT brings the result back as autocovariances indexed by lag. Here’s the algorithm:

  1. create a centered version of x by setting x_cent = x - mean(x);
  2. pad x_cent at the end with entries of value 0 to get a new vector x_pad of length L = 2^ceil(log2(N));
  3. run x_pad through a forward discrete fast Fourier transform to get an L-vector z of complex values;
  4. replace the entries in z with their norms (the norm of a complex number is the real number obtained by summing the squared real component and the squared imaginary component);
  5. run z through the inverse discrete FFT to produce an L-vector acov of (unadjusted) autocovariances;
  6. trim acov to size N;
  7. create an L-vector named mask consisting of N entries with value 1 followed by L - N entries with value 0;
  8. compute the forward FFT of mask and put the result in the L-vector adj;
  9. to get adjusted autocovariance estimates, divide each entry acov[n] by norm(adj[n]), where norm is the complex norm defined above; and
  10. to get autocorrelations, set acorr[n] = acov[n] / acov[0] (acov[0], the autocovariance at lag 0, is just the variance).

The autocorrelation and autocovariance N-vectors are returned as acorr and acov respectively.

It’s really fast to do all of them in practice, not just in theory.

Depending on the FFT function you use, you may need to normalize the output (see the code sample below for Stan). Run a test case and make sure that you get the right ratios of values out in the end, then you can figure out what the scaling needs to be.
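Here’s a compact R sketch of the recipe using base R’s fft (it’s not the Stan or Kiss FFT code). Two simplifications of mine: I pad to a power of two that’s at least 2N so the circular convolution can’t wrap into the lags we keep, and I apply the overlap counts N - k directly instead of computing them via the mask FFT.

# R sketch of FFT-based autocovariance/autocorrelation (illustration only).
acorr_fft = function(x) {
  N = length(x)
  L = 2^ceiling(log2(2 * N))        # pad to >= 2N: no circular wrap-around
  x_cent = x - mean(x)              # center by subtracting the mean
  x_pad = c(x_cent, rep(0, L - N))  # zero-pad to length L
  z = fft(x_pad)                    # forward FFT
  acov = Re(fft(Mod(z)^2, inverse = TRUE)) / L  # inverse FFT of squared norms
  acov = acov[1:N] / (N - 0:(N - 1))            # adjust lag k by its overlap count N - k
  list(acov = acov, acorr = acov / acov[1])     # normalize by the lag-0 autocovariance
}
x = as.numeric(arima.sim(list(ar = 0.8), n = 200))  # example series
round(acorr_fft(x)$acorr[1:5], 3)

For comparison, R’s built-in acf() uses the 1/N normalization at every lag, so these estimates are larger than acf()’s by a factor of N / (N - k) at lag k.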

Eigen and Kiss FFT

For Stan, we started out with a direct implementation based on Kiss FFT.

  • Stan’s original Kiss FFT-based source code (C/C++) [Warning: this function does not have the correction applied; see the current Stan code linked below for an example]

At Ben Goodrich’s suggestion, I reimplemented using the Eigen FFT C++ wrapper for Kiss FFT. Here’s what the Eigen-based version looks like:

As you can see from this contrast, nice C++ library design can make for very simple work on the front end.

Hats off to Matt for the tutorial, Thibauld Nion for the nice blog post on the mask-based correction, Kiss FFT for the C implementation, and Eigen for the C++ wrapper.

Ranks in Academia vs. Nelson’s Navy

June 5, 2012

I’m a huge fan of nautical fiction. And by that, I mean age-of-sail stuff, not WWII submarines (though I loved Das Boot). The literature is much deeper than Hornblower and Aubrey/Maturin (though it doesn’t get better than O’Brian). I’ve read hundreds of these books. If you want to join me, you might find the following helpful.

I think I’ve pretty much read every nautical fiction book published in the last 50 years. I had to go back to sci-fi and even fantasy (thank you, Patrick Rothfuss, for making my life better a book at a time).

Officer Grades

Given that nautical fiction almost always focuses on the officers, I’ve come to realize that the books are really about organizational structure and management. I see a strong relation to the academic pecking order, which I summarize in the following table.

Academia                 Navy
undergrad                nipper
grad student             midshipman
post-doc                 lieutenant
junior faculty           commander
tenured faculty          post captain
department head, dean    admiral

Non-Commissioned and Warrant Officers

What about the rest of us?

Academia               Navy
research scientist     sailing master
research programmer    boatswain (aka ‘bosun’)
grants officer         Admiralty bureaucrat

Sailing master, because we research scientists know the technical bits of being an officer, namely navigation and how the ship works. Programmers are bosuns because they’re the most technically adept at the low-level functionality of academia. I guess if you weren’t in computer science, the research programmer would be a lab tech.

Averages vs. Means (vs. Expectations)

May 29, 2012

Averages

Averages are statistics calculated over a set of samples. If you have a set of samples x = x_1,\ldots,x_N, their average, often written \bar{x}, is defined by

\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n.

Means

Means are properties of distributions. If p(x) is a discrete probability mass function over the natural numbers \mathbb{N}, then its mean is defined by

\sum_{x \in \mathbb{N}} \, x \times p(x).

If p(x) is a continuous probability density function over the real numbers \mathbb{R}, then its mean, if it exists, is defined by

\int_{\mathbb{R}} \, x \times p(x) \, dx.

This also shows how summations over discrete probability functions, \sum_{x \in \mathbb{N}} relate to integrals over continuous probability functions, \int_{\mathbb{R}} dx. (Distributions can also be mixed, like spike and slab priors, but the math gets more complicated due to the need to unify the notion of summation and integration.)

Expectations

To confuse matters further, there are expectations. Expectations are properties of (some) random variables. The expectation of a random variable is the mean of its distribution. If X is a discrete random variable with probability mass function p(x), then its expectation is defined to be

\mathbb{E}[X] = \sum_{x \in \mathbb{N}} \, x \times p(x).

If X is a continuous random variable with probability density function p(x), then

\mathbb{E}[X] = \int_{\mathbb{R}} \, x \times p(x) \, dx.

Look familiar?

Sample Means

Samples don’t have means per se. They have averages. But sometimes the average is called the “sample mean”. Just to confuse things.

Averages as Estimates of the Mean

Gauss showed that the average of a set of independent, identically distributed (i.i.d.) samples from a distribution p(x) is a good estimate of the mean.

What’s good about the average as an estimator of the mean? First, it’s unbiased, meaning the expectation of the average of a set of i.i.d. samples from a distribution is the mean of the distribution. Second, it has the lowest expected square error among unbiased linear estimators of the mean (that’s the Gauss-Markov theorem). That’s why everyone likes square error (that, and its convexity, which I discussed in a previous blog post on Mean square error, or why committees won the Netflix Prize).
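Here’s a tiny simulation sketch (mine, just to illustrate): averages of i.i.d. samples scatter around the distribution’s mean, their own average is close to that mean (unbiasedness), and the scatter shrinks as the sample size grows.

# Averages of i.i.d. samples estimate the mean of the distribution.
# Example: exponential with rate 2, whose mean is 1/2.
set.seed(7)
avgs = replicate(10000, mean(rexp(50, rate = 2)))
mean(avgs)  # close to the true mean 0.5 (unbiasedness)
sd(avgs)    # roughly 0.5 / sqrt(50), and it shrinks as N grows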

What about Medians?

The median is a good estimator too. Laplace proved that it has the lowest expected absolute error among estimators (I just learned it was Laplace from the Wikipedia entry on median unbiased estimators). It’s also more robust to outliers.

More on Estimators

The Wikipedia page on estimators is a good place to start.

Of course, in Bayesian statistics, we’re more concerned with a full characterization of posterior uncertainty, not just a point (or even interval) estimate.

Summary

  • Means are properties of distributions.
  • Expectations are properties of random variables.
  • Averages or sample means are statistics calculated from samples.