Absolutely, I want to analyze the variance, again for exactly the reasons you say: it’s what you need to predict the variability of the algorithm’s behavior in the future. And that matters in an application!

The other major reason to analyze variance in finance is that if you’re a market maker, you need to set a spread.

The reason funds report beta (correlation to the S&P 500 or some other market proxy) is that you need it for first-order adjustments to the effects of diversification (i.e., very rudimentary portfolio theory). Of course, what you really need is the full covariance matrix of your assets: if you buy two instruments with very little correlation to the S&P 500 but very high correlation to each other, that’s not as good for diversification. And covariance is difficult to estimate for the usual reason that variance is hard to estimate, except that now you have a quadratic number of (co)variances to estimate.
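As a toy illustration of that point, here is a sketch in plain Python with made-up factor loadings (the 0.05 market loading and 0.9 shared loading are arbitrary assumptions, not estimates from any data): two assets that each look nearly uncorrelated with the index can still be so correlated with each other that a 50/50 mix barely reduces variance.

```python
import random

random.seed(0)
n = 5000

def corr(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    return cov / (vx * vy) ** 0.5

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

market = [random.gauss(0, 1) for _ in range(n)]
shared = [random.gauss(0, 1) for _ in range(n)]  # a non-market common factor

# Two hypothetical assets: tiny loading on the market, heavy loading
# on the shared factor, plus independent idiosyncratic noise.
a = [0.05 * m + 0.9 * s + 0.3 * random.gauss(0, 1) for m, s in zip(market, shared)]
b = [0.05 * m + 0.9 * s + 0.3 * random.gauss(0, 1) for m, s in zip(market, shared)]

port = [(x + y) / 2 for x, y in zip(a, b)]  # "diversified" 50/50 portfolio

print(corr(a, market))     # near 0: each asset looks diversifying vs. the index
print(corr(a, b))          # ~0.9: but they move together
print(var(port) / var(a))  # ~0.95: the mix removes almost no variance
```

If the two assets really were uncorrelated, that last ratio would be about 0.5; beta alone cannot distinguish the two situations, which is exactly why the full covariance matrix matters.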

As far as the purely statistical issue of resampling variability goes, I’ve been wondering at what level the resampling assumption should apply. For example, when evaluating token-level tagging accuracy, it’s straightforward to use a token-level resampling model (like a binomial test or McNemar’s test where each token is an event; a bootstrap or the like would work too). I have done this many times. However, accuracy might cluster/overdisperse at the sentence or document level; for example, some documents are much harder than others. (Imagine an additive logit model for the Bernoulli event “is this token correct?” with variables at the sentence and document levels.) Do we need to resample at the sentence or document level when evaluating token-level accuracy? That would greatly diminish statistical power, but it might more accurately reflect the variability of accuracy rates on new data, and so it should arguably be taken into account when estimating the accuracy mean, which is implicitly what everyone thinks is the important quantity in NLP evaluation.
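To make the contrast concrete, here is a sketch on synthetic data (the beta-distributed per-document accuracies are an assumption for illustration, not anything measured) comparing a token-level bootstrap with a document-level cluster bootstrap for the standard error of the accuracy mean:

```python
import random

random.seed(0)

# Synthetic corpus: 30 documents of 100 tokens each, where per-document
# difficulty varies, so token-level errors cluster within documents.
docs = []
for _ in range(30):
    p = random.betavariate(8, 2)  # this document's true accuracy
    docs.append([1 if random.random() < p else 0 for _ in range(100)])

tokens = [t for doc in docs for t in doc]

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_se(resample, reps=500):
    """Standard error of the accuracy mean under a given resampling scheme."""
    means = [mean(resample()) for _ in range(reps)]
    m = mean(means)
    return (sum((x - m) ** 2 for x in means) / (reps - 1)) ** 0.5

# Token-level resampling: every token treated as an i.i.d. event.
token_se = bootstrap_se(
    lambda: [random.choice(tokens) for _ in range(len(tokens))])

# Document-level resampling: draw whole documents with replacement,
# preserving the within-document clustering of errors.
doc_se = bootstrap_se(
    lambda: [t for d in random.choices(docs, k=len(docs)) for t in d])

print(token_se, doc_se)  # the document-level SE comes out much larger
```

The document-level interval is wider precisely because it respects the clustering; the loss of apparent power is the price of honesty about where the variability actually lives.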

Beyond the question of how to correctly assess the standard error of the accuracy mean, I wonder if directly analyzing variance might be a useful direction for NLP evaluation. In finance, they evaluate algorithms by explicitly looking at *both* the mean and variance of returns, since a small improvement to the mean with a big variance isn’t that useful (you might go broke before you get rich). Cross-document or cross-domain variance reduction seems useful, to me at least, for NLP. An algorithm that gives 87% accuracy on average, but is only 20% accurate on some documents, is less useful to me than one that averages 85% but is always at least 80% on every document. The former could destroy the accuracy of a question-answering system, for example.
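For instance, with made-up per-document accuracies mirroring the 87%-vs-85% scenario above, reporting variance and worst case alongside the mean makes the trade-off visible:

```python
# Made-up per-document accuracies for two hypothetical systems.
sys_a = [0.95, 0.93, 0.20, 0.96, 0.94, 0.92, 0.95, 0.96, 0.93, 0.96]  # mean 0.87, one disaster
sys_b = [0.85, 0.84, 0.86, 0.85, 0.84, 0.86, 0.85, 0.84, 0.86, 0.85]  # mean 0.85, stable

def summarize(accs):
    """Mean, sample variance, and worst-case accuracy across documents."""
    m = sum(accs) / len(accs)
    v = sum((a - m) ** 2 for a in accs) / (len(accs) - 1)
    return m, v, min(accs)

for name, accs in [("A", sys_a), ("B", sys_b)]:
    m, v, worst = summarize(accs)
    print(f"{name}: mean={m:.3f}  var={v:.4f}  worst={worst:.2f}")
# System A wins on the mean; system B wins on variance and worst case.
```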

In a simpler case, if I toss a coin and see 40 heads and 10 tails, I ask whether I can reject the null hypothesis that this sample is an i.i.d. sample from a population with, say, equal proportions of heads and tails.
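This case has a clean exact answer. A sketch of the two-sided exact binomial test in plain Python (no stats library needed):

```python
from math import comb

n, heads = 50, 40  # observed: 40 heads out of 50 tosses

# Exact two-sided binomial test of p = 0.5: sum the probability of every
# outcome at least as improbable as the one observed.
pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
p_value = sum(p for p in pmf if p <= pmf[heads])
print(p_value)  # far below 0.05, so the fair-coin null is rejected
```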

But the t-test does try to answer this question. The problem I have with its use in most NLP applications is that the i.i.d. and normality assumptions don’t hold. So lots of people have moved to non-parametric tests, but those make the same i.i.d. assumption, which causes them to underestimate uncertainty.

I don’t think the bootstrap is very widely used. I find I lose people at the point where I try to explain that it resamples with replacement in order to handle variance properly.
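For what it’s worth, the mechanics fit in a few lines. A minimal sketch on toy data:

```python
import random

random.seed(1)

data = [12, 15, 9, 14, 11, 13, 10, 16]  # toy observations

# One bootstrap replicate: draw len(data) items *with replacement*,
# so some originals repeat and some drop out entirely.
print([random.choice(data) for _ in data])

# Repeat many times: the spread of the replicated statistic (here the
# mean) estimates its sampling variability without any distributional
# assumptions beyond i.i.d. sampling.
boot_means = sorted(
    sum(random.choice(data) for _ in data) / len(data)
    for _ in range(5000))
lo, hi = boot_means[124], boot_means[4874]  # ~95% percentile interval
print(lo, hi)
```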

http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
