[Update 21 October 2009: Check out Michael Collins's comment and my reply. Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. I replied that it's really the practice that's problematic, and I'm talking about the notation in Gelman et al.'s Bayesian Data Analysis or Blei, Ng and Jordan's LDA paper.]
What’s wrong with the probability notation used in Bayesian stats papers? The triple whammy of
- overloading for every probability function,
- using bound variables named after random variables, and
- using the bound variable names to distinguish probability functions.
Probability Notation is Bad
The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule,
$$p(a|b) = \frac{p(b|a) \, p(a)}{p(b)},$$
where each of the four uses of $p$ corresponds to a different probability function! In computer science, we’re used to using names to distinguish functions. So $f(x)$ and $f(y)$ are the same function $f$ applied to different arguments. In probability notation, $p(a)$ and $p(b)$ are different probability functions, picked out by their arguments.
Random Variables Don’t Help
As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we use the capitalized random variables to distinguish the distributions and bound variables in the usual mathematical sense, disambiguating our random variables with
$$p_{A|B}(a|b) = \frac{p_{B|A}(b|a) \, p_A(a)}{p_B(b)}.$$
When we have dozens of parameters in our multivariate densities, this gets messy really quickly. So practitioners fall back to the unsubscripted notation.
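The contrast between the two conventions can be sketched in code. The following is only an illustration under assumed names: the joint table `JOINT`, its values, and the function names `p_A`, `p_B`, `p_A_given_B`, `p_B_given_A` and `p` are all made up for this sketch. The subscripted style gives each probability function its own name; the unsubscripted style amounts to a single `p` that dispatches on the names of its arguments.

```python
from fractions import Fraction

# Hypothetical joint distribution over (a, b); the numbers are made up.
JOINT = {
    ("x0", "y0"): Fraction(1, 8),
    ("x0", "y1"): Fraction(3, 8),
    ("x1", "y0"): Fraction(2, 8),
    ("x1", "y1"): Fraction(2, 8),
}
A_VALUES = ("x0", "x1")
B_VALUES = ("y0", "y1")

# Subscripted style: one name per probability function.
def p_A(a):
    return sum(JOINT[(a, b)] for b in B_VALUES)

def p_B(b):
    return sum(JOINT[(a, b)] for a in A_VALUES)

def p_A_given_B(a, b):
    return JOINT[(a, b)] / p_B(b)

def p_B_given_A(b, a):
    return JOINT[(a, b)] / p_A(a)

# Unsubscripted style: a single overloaded p, where the *name* of the
# keyword argument picks out which function is meant.
def p(**named):
    names = frozenset(named)
    if names == {"a"}:
        return p_A(named["a"])
    if names == {"b"}:
        return p_B(named["b"])
    raise TypeError(f"no probability function for arguments {sorted(names)}")

# Bayes's rule with every function named explicitly.
lhs = p_A_given_B("x0", "y0")
rhs = p_B_given_A("y0", "x0") * p_A("x0") / p_B("y0")
assert lhs == rhs
```

Here `p(a="x0")` and `p(b="y0")` call different functions even though both are spelled `p`, which is exactly the move the paper notation makes.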
The second issue appears in expectation notation (and in information theory notation for entropy, mutual information, etc.). Here, statisticians write $X$ and $Y$ for random variables with the probability function and sample space implicit. The way you then see expectation notation written in applied Bayesian modeling is often:
$$\mathbb{E}[x] = \sum_x x \, p(x).$$
What’s really sneaky here is the use of $x$ as a global random variable on the left side of the equation and as a bound variable on the right-hand side. Distinguishing random variables with capital letters, this would look like
$$\mathbb{E}[X] = \sum_x x \, p_X(x).$$
Continuous vs. Discrete
The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities.
In Bayes’s rule, if $p(a)$ is continuous in $a$, we’re meant to understand integration
$$p(b) = \int p(b|a) \, p(a) \, da$$
in place of summation
$$p(b) = \sum_a p(b|a) \, p(a).$$
Similarly, for expectations of continuous densities, we write
$$\mathbb{E}[x] = \int x \, p(x) \, dx,$$
or if we’re being very careful with random variables,
$$\mathbb{E}[X] = \int x \, p_X(x) \, dx.$$
Intros to probability theory often use $f$ for continuous probability density functions and reserve $p$ for discrete probability mass functions. They’ll start with notation $P$ (or $\mathrm{Pr}$) for the event probability function.
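To make the overloading concrete, here is a small sketch of the two computations that the single expectation notation stands for. The function names, the fair-die example, and the Uniform(0, 1) density are assumptions chosen for illustration, and the midpoint rule is just one simple stand-in for the integral.

```python
from fractions import Fraction

def expectation_discrete(pmf):
    """E[X] as a sum of x * p_X(x), for a pmf given as {value: probability}."""
    return sum(x * prob for x, prob in pmf.items())

def expectation_continuous(pdf, lo, hi, steps=10_000):
    """E[X] as an integral of x * f_X(x) over [lo, hi], via the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width  # midpoint of the i-th subinterval
        total += x * pdf(x) * width
    return total

# Hypothetical examples: a fair six-sided die and a Uniform(0, 1) density.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def uniform_pdf(x):
    return 1.0  # density of Uniform(0, 1) on its support

die_mean = expectation_discrete(die_pmf)                      # exact sum
uniform_mean = expectation_continuous(uniform_pdf, 0.0, 1.0)  # numerical integral
```

The same symbol on paper turns into a sum in one case and an integral in the other; in code the two cannot share an implementation without some extra signal about which is meant.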
Sample Spaces to the Rescue?
In applied work, we rarely, if ever, need to talk about the sample space $\Omega$, measures from subsets of $\Omega$ to $[0,1]$, or actually consider a random variable $X$ as a function from $\Omega$ to $\mathbb{R}$. I don’t recall ever seeing a model defined in this way.
Instead, we typically construct multivariate densities modularly by combining simpler distributions. For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as:
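In generic form, such a model has the following shape. The symbols here are assumptions of this sketch and not necessarily those of the batting-average post: $y_j$ counts successes in $n_j$ trials for unit $j$, $\theta_j$ is that unit’s success probability, and the hyperprior on $(\alpha, \beta)$ is left unspecified.

$$
\begin{aligned}
y_j &\sim \mathrm{Binomial}(n_j, \theta_j) \\
\theta_j &\sim \mathrm{Beta}(\alpha, \beta) \\
(\alpha, \beta) &\sim p(\alpha, \beta)
\end{aligned}
$$

Each line names a simpler distribution, and the joint density is their product; no sample space or random-variable machinery appears anywhere.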
Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? We could start with a joint density function $p(x)$ over a vector $x = (x_1, \ldots, x_N)$, and consider projections of $x$ in lieu of random variables. This we can fully specify with high-school calculus.
If we use consistent naming for the dimensions, we can get away without ever formally defining a random variable. In practice, it’s not really that hard to keep our sampling distributions $p(y|\theta)$ separate from our posteriors $p(\theta|y)$, even if we write them both as $p$.
Technically, we could take $\mathbb{R}^N$ as the sample space and then define the random variables as projections. But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities.
Lambda Calculus to the Rescue?
Maybe we can do better. We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus.
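For instance, we could define a discrete bivariate $p(a,b)$ for reference. We could then distinguish the marginals $p(a)$ and $p(b)$, using
$$\lambda a. \sum_b p(a,b)$$
instead of $p(a)$, and
$$\lambda b. \sum_a p(a,b)$$
instead of $p(b)$.
We could similarly distinguish the conditionals, writing
$$\lambda a. \lambda b. \frac{p(a,b)}{\sum_{a'} p(a',b)}$$
instead of $p(a|b)$, and
$$\lambda b. \lambda a. \frac{p(a,b)}{\sum_{b'} p(a,b')}$$
instead of $p(b|a)$.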
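Bayes’s rule now becomes:
$$\lambda a. \lambda b. \frac{p(a,b)}{\sum_{a'} p(a',b)} = \lambda a. \lambda b. \frac{\left( \frac{p(a,b)}{\sum_{b'} p(a,b')} \right) \left( \sum_{b'} p(a,b') \right)}{\sum_{a'} p(a',b)}$$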
OK, so maybe the statisticians are onto something here with their sloppy notation in practice.
All attempts to distinguish function names in stats only seem to make matters worse. This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next.