[Update 21 October 2009: Check out Michael Collins's comment and my reply. Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. I replied that it's really the practice that's problematic, and I'm talking about the notation in Gelman et al.'s Bayesian Data Analysis or Blei, Ng and Jordan's LDA paper.]
What’s Wrong?
What’s wrong with the probability notation used in Bayesian stats papers? The triple whammy of
- overloading $p(\cdot)$ for every probability function,
- using bound variables named after random variables, and
- using the bound variable names to distinguish probability functions.
Probability Notation is Bad
The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule,
$p(x \mid y) = p(y \mid x) \, p(x) / p(y)$,
where each of the four uses of $p$ corresponds to a different probability function! In computer science, we’re used to using names to distinguish functions. So $f(x)$ and $f(y)$ are the same function $f$ applied to different arguments. In probability notation, $p(x)$ and $p(y)$ are different probability functions, picked out by their arguments.
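To see how odd this is from a programmer’s point of view, here’s a minimal Python sketch of what it takes to mimic the convention: a single function `p` that inspects the names of its arguments to decide which probability function to compute. The joint table `joint` and the function `p` are made up for illustration; they are not part of any library.

```python
from fractions import Fraction

# Hypothetical joint mass function over pairs (x, y), each in {0, 1}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def p(**named):
    """Pick a probability function by argument *name*, as the notation does.

    Sums the joint over every (x, y) cell consistent with the named values,
    so p(x=0) is a marginal of x and p(y=0) is a marginal of y.
    """
    unknown = set(named) - {"x", "y"}
    if unknown:
        raise TypeError(f"unknown variable names: {sorted(unknown)}")
    return sum(
        mass
        for (x, y), mass in joint.items()
        if named.get("x", x) == x and named.get("y", y) == y
    )

p_x0 = p(x=0)          # marginal mass of x = 0
p_y0 = p(y=0)          # marginal mass of y = 0, a different function of 0
p_x0_y0 = p(x=0, y=0)  # joint mass
```

The same numeric argument, 0, gives different answers depending only on what it is called, which is exactly the behavior we’d flag in a code review.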
Random Variables Don’t Help
As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we use the capitalized random variables to distinguish the distributions and bound variables $x$ and $y$ in the usual mathematical sense, disambiguating our random variables with
$p_{X|Y}(x \mid y) = p_{Y|X}(y \mid x) \, p_X(x) / p_Y(y)$.
When we have dozens of parameters in our multivariate densities, this gets messy really quickly. So practitioners fall back to the unsubscripted notation.
Great Expectations
The third issue appears in expectation notation (and in information theory notation for entropy, mutual information, etc.). Here, statisticians write $X$ and $Y$ for random variables with the probability function and sample space implicit. The way you then see expectation notation written in applied Bayesian modeling is often:
$E[x] = \sum_x x \, p(x)$.
What’s really sneaky here is the use of $x$ as a global random variable on the left side of the equation and as a bound variable on the right-hand side. Distinguishing random variables with capital letters, this would look like
$E[X] = \sum_x x \, p_X(x)$.
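In code the distinction is hard to miss, because the bound variable only exists inside the sum. Here’s a small Python sketch, assuming a made-up mass function `p_X` for a fair six-sided die; the names `p_X` and `expectation` are illustrative only.

```python
from fractions import Fraction

# Hypothetical mass function for a fair six-sided die.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf):
    """E[X] = sum over values x of x * p_X(x).

    The name x is bound inside the generator expression and is not
    visible to the caller, unlike the x on the left of E[x].
    """
    return sum(x * prob for x, prob in pmf.items())

e_X = expectation(p_X)  # exact arithmetic via Fraction
```

The caller hands over the mass function, not a value, so there is nothing for a stray lowercase $x$ on the left-hand side to refer to.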
Continuous vs. Discrete
The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities.
In Bayes’s rule, if $p$ is continuous in $x$, we’re meant to understand integration
$p(y) = \int p(y \mid x) \, p(x) \, dx$,
in place of summation
$p(y) = \sum_x p(y \mid x) \, p(x)$.
Similarly, for expectations of continuous densities, we write
$E[x] = \int x \, p(x) \, dx$,
or if we’re being very careful with random variables,
$E[X] = \int x \, p_X(x) \, dx$.
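The two readings of the Bayes’s rule denominator really are different computations, which a short Python sketch makes plain. Everything here is an assumption for illustration: a Bernoulli likelihood, a two-point discrete prior, a uniform continuous prior on $[0,1]$, and a crude midpoint rule standing in for the integral.

```python
def likelihood(y, x):
    """p(y | x): Bernoulli mass for y in {0, 1} with success probability x."""
    return x if y == 1 else 1.0 - x

def marginal_discrete(y, prior):
    """p(y) = sum_x p(y | x) p(x), with the prior given as a dict of masses."""
    return sum(likelihood(y, x) * mass for x, mass in prior.items())

def marginal_continuous(y, prior_density, lo=0.0, hi=1.0, n=1000):
    """p(y) = integral of p(y | x) p(x) dx, approximated by the midpoint rule."""
    h = (hi - lo) / n
    mids = (lo + (i + 0.5) * h for i in range(n))
    return sum(likelihood(y, x) * prior_density(x) for x in mids) * h

# Two-point prior on x versus a uniform density on [0, 1].
p_y_discrete = marginal_discrete(1, {0.25: 0.5, 0.75: 0.5})
p_y_continuous = marginal_continuous(1, lambda x: 1.0)
```

Both calls answer the question written as $p(y)$, yet the code has to commit to one function or the other; the notation leaves that choice to the reader.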
Intros to probability theory often use $f$ for continuous probability density functions and reserve $p$ for discrete probability mass functions. They’ll start with notation $P$ (or $\mathrm{Pr}$) for the event probability function.
Sample Spaces to the Rescue?
In applied work, we rarely, if ever, need to talk about the sample space $\Omega$, measures $P$ from subsets of $\Omega$ to $[0,1]$, or actually consider a random variable $X$ as a function from $\Omega$ to $\mathbb{R}$. I don’t recall ever seeing a model defined in this way.
Instead, we typically construct multivariate densities modularly by combining simpler distributions. For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as:
$p(y, \theta \mid n, \alpha, \beta) = \prod_j \mathrm{Beta}(\theta_j \mid \alpha, \beta) \, \mathrm{Binomial}(y_j \mid n_j, \theta_j)$,
or in sampling notation by stating $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$ and $y_j \sim \mathrm{Binomial}(n_j, \theta_j)$. In fact, that’s just how it gets coded in the model interpreter BUGS (Bayesian inference using Gibbs sampling) and compiler HBC (hierarchical Bayes compiler).
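Sampling notation reads almost directly as a simulation. Here’s a rough Python sketch of a beta-binomial model, using only the standard library; this is not BUGS or HBC code. The symbol names are my assumptions for illustration: `theta_j` is a per-unit success probability, `n_j` a number of trials, `y_j` a number of successes, and the hyperparameter values are arbitrary.

```python
import random

def simulate_beta_binomial(alpha, beta, trials, rng):
    """Draw theta_j ~ Beta(alpha, beta), then y_j ~ Binomial(n_j, theta_j)."""
    thetas, ys = [], []
    for n_j in trials:
        theta_j = rng.betavariate(alpha, beta)
        # A binomial draw as the number of successes in n_j Bernoulli trials.
        y_j = sum(rng.random() < theta_j for _ in range(n_j))
        thetas.append(theta_j)
        ys.append(y_j)
    return thetas, ys

rng = random.Random(42)  # seeded so the sketch is repeatable
thetas, ys = simulate_beta_binomial(alpha=2.0, beta=5.0, trials=[10, 50, 200], rng=rng)
```

Each line of the sampling notation becomes one line of the loop, and no sample space or explicit random variable ever has to be mentioned.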
Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? We could start with a joint density function over a vector $x = (x_1, \ldots, x_n)$, and consider projections of $x$ in lieu of random variables. This we can fully specify with high-school calculus.
If we use consistent naming for the dimensions, we can get away without ever formally defining a random variable. In practice, it’s not really that hard to keep our sampling distributions $p(y \mid \theta)$ separate from our posteriors $p(\theta \mid y)$, even if we write them both as $p$.
Technically, we could take $\mathbb{R}^n$ as the sample space and then define the random variables as projections. But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities.
Lambda Calculus to the Rescue?
Maybe we can do better. We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus.
For instance, we could define a discrete bivariate $p(x,y)$ for reference. We could then distinguish the marginals, using $\lambda x. \sum_y p(x,y)$ instead of $p(x)$, and $\lambda y. \sum_x p(x,y)$ instead of $p(y)$.
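We could similarly distinguish the conditionals, writing $\lambda x. \lambda y. \, p(x,y) / \sum_{x'} p(x',y)$ instead of $p(x \mid y)$, and $\lambda y. \lambda x. \, p(x,y) / \sum_{y'} p(x,y')$ instead of $p(y \mid x)$.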
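Bayes’s rule now becomes:
$\left(\lambda x. \lambda y. \, \frac{p(x,y)}{\sum_{x'} p(x',y)}\right)(x)(y) = \frac{\left(\lambda y. \lambda x. \, \frac{p(x,y)}{\sum_{y'} p(x,y')}\right)(y)(x) \cdot \left(\lambda x. \sum_y p(x,y)\right)(x)}{\left(\lambda y. \sum_x p(x,y)\right)(y)}$.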
Clear, no?
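For what it’s worth, the construction is less painful in a programming language, where functions are values with names. Here’s a minimal Python sketch, assuming a made-up discrete bivariate stored in `table`; all the names (`marginal_x`, `cond_x_given_y`, and so on) are illustrative, and the curried lambdas are there only to mirror the lambda-calculus style.

```python
from fractions import Fraction

# Hypothetical discrete bivariate p(x, y) over x, y in {0, 1}.
table = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}
xs, ys = (0, 1), (0, 1)

p = lambda x, y: table[(x, y)]

# Marginals: sum out the other coordinate.
marginal_x = lambda x: sum(p(x, y2) for y2 in ys)
marginal_y = lambda y: sum(p(x2, y) for x2 in xs)

# Conditionals, curried in the same order as the lambda terms.
cond_x_given_y = lambda x: lambda y: p(x, y) / marginal_y(y)
cond_y_given_x = lambda y: lambda x: p(x, y) / marginal_x(x)

# Bayes's rule, checked exactly at every point of the table.
bayes_ok = all(
    cond_x_given_y(x)(y) == cond_y_given_x(y)(x) * marginal_x(x) / marginal_y(y)
    for x in xs
    for y in ys
)
```

Four probability functions, four names, and the summation variables stay local. The cost is that every model needs its own set of names.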
Perhaps Not
OK, so maybe the statisticians are onto something here with their sloppy notation in practice.
All attempts to distinguish function names in stats only seem to make matters worse. This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next.
October 13, 2009 at 5:39 pm |
And as if “p” weren’t overloaded enough, statisticians turn around and use “p” to indicate the number of parameters in a model.
October 13, 2009 at 9:02 pm |
Distinguishing between random variables and their values may help (remember that random variables are in fact functions from the event space).
In Bayesian approaches the domain of all the random variables is the same event space, so you don’t even need to write the “P” at all. This leads to the “distributed according to” notation, e.g.,
X | mu, sigma ~ Normal(mu,sigma^2)
When I first encountered it this notation seemed strange, but now I find it easier to read than one with lots of “P”s.
October 21, 2009 at 1:45 pm |
+1 on your comment regarding expectation.
Conventionally, X is the r.v., and x is a value in the range space of x. So it makes literally no sense to write E[x]. And you have to really understand that X is a measurable function. If you do this, the notation is actually airtight.
The convention of dropping subscripts on individual densities, in most cases, aids understanding. Putting the subscripts in just adds to the alphabet soup.
A final point about why this came to pass — in measure-theoretic probability, the odds of every random variable is determinable (in principle) from a single measure, which may be a product measure, and it is conventionally called P. With this understanding, it’s always OK to have a single “P” for every event. I think this is another behind-the-scenes reason why people just use one “P”.
October 21, 2009 at 1:46 pm
typo, of course, that’s “…value in the range space of X.”
October 14, 2009 at 10:45 am |
@John I’m particularly amused by one paper where I saw
(you know who you are!).
@Mark I wish I’d gotten your comment a year ago. I actually prefer writing models out in sampling notation, ideally using something like BUGS to make sure they’re precisely specified enough to compile.
I’m trying to figure out how to write this all for an intro to Bayesian classifiers I’m writing. I started out with the traditional notion of random variables (sample space with prob distro plus mapping to values). But the combo of a shared sample space and distribution with functions to values determining the random variables’ joint distributions is kind of confusing to try to write down.
Andrew Gelman and Jennifer Hill described random variables succinctly in their applied regression book. There’s a giant urn, you pull a ball out, and it has the value of every random variable written on it. Of course, you have to be happy with uncountably many balls in the urn. But to make the continuous urn notion precise, I’d need to pull out Lebesgue integrals. (I’d really like to pull out the even more general Stieltjes integrals to do away with the awkward alternations between summations and integrals.)
Instead, I’m leaning toward just defining joint densities on a per-model basis and working from there. You don’t ever really need the sample space and I’m not going to be able to define general integration in an intro text.
October 14, 2009 at 1:12 pm |
[Michael sent me this note via e-mail, but said I could post it. - Bob]
Last spring I was teaching a large undergrad class on probability using this book:
http://athenasc.com/probbook.html
I like the notation in the book — in general I think it’s a great book — and I think it avoids some of the pitfalls you’ve described in your blog.
(Ignore the later chapters on statistics though, particularly frequentist statistics, where the notation becomes a little more controversial; this led to long discussions with the authors and my co-lecturer about notation once the move from probability to statistics is made).
I think writing $p(x)$, $p(y)$, etc. is sloppy, although we do it in research papers all the time. The book always uses $p_X(x)$ etc. for PMFs (discrete random variables), and $f_X(x)$ for PDFs (continuous random variables). So the subscripting clarifies things here.
You would never write $E[x]$, always write $E[X]$ (assuming a convention that capitalization is used to denote random variables; otherwise be careful about what is a r.v. as opposed to the value of a r.v.). Actually writing $E[x]$ leads to real notational problems I think, particularly when you get to conditional expectation. You don’t need subscripting on expectations, assuming that whenever you introduce a random variable you’ve been careful to specify its pmf/pdf.
[Michael then went on in a second post to add the following. - Bob]
The book is really great. The first chapter just goes over sample spaces, probability measures, events etc., with no mention of random variables. The second chapter introduces random variables, and that notation $p_X(x)$ is used throughout the book; in fact once r.v.s are introduced you rarely need to refer back to the underlying sample space (although it’s useful to have that in mind).
One thing I like about the book is that it’s basically a book about probability (although it covers statistics in later chapters) — as a consequence it gets the basics of probability and probability notation down very solidly, before even mentioning statistics, which I think is really important. The mathematical probabilists are very precise.
… let all the work be done by the r.v.
– if you’ve taken care to define the set of random variables in the first place, then there will be no problems.
$p_X(x)$ is better because you really do want to be able to write things like $p_X(1)$, i.e., you do want to treat this as a function. The $X$/$x$ thing is a convention but is unnecessary at that point. The Greek letter thing may be an inconvenience, but simply using capital Greek letters may work I think.
October 14, 2009 at 1:15 pm |
That’s a very good point about writing $p_X(1)$ for particular probabilities (density values); so the subscript does help. It’s really unclear without the subscript, $p(1)$.
I’d really like to define a Bernoulli random variable
with the usual sampling notation:
if
It’s basically eliding the random variable subscript from the more “proper” random variable notation:
if
Ack, the plugin for LaTeX won’t let me use \Bigg{. I’m guessing it’s a bracket-matching bug, because they allow \Bigg|.
October 21, 2009 at 3:12 am |
Lambda notation clashes with standard mathematics, but you can use the |-> (\mapsto) operator instead.
October 21, 2009 at 5:39 am |
I agree; introducing a friend to Bayesian statistics was harder than it should have been (he has a programming background).
Another thing I don’t like is the pipe. p(A|B) reads as “probability of A given B”, or more precisely how much B-ness implies A-ness. So it would make more sense to have something like p(A A)…
October 22, 2009 at 4:38 am |
[...] Still, most people struggle with them. Could it be that the notation is just hard to swallow? What’s Wrong with Probability Notation? is a magnificent post that gives some basic reasons: The first two issues arise in the usual [...]
October 22, 2009 at 2:19 pm |
I agree that probability notation is tricky and maybe more so than necessary. But I totally disagree that THIS is the obstacle to understanding probability that most people face.
December 4, 2009 at 2:12 pm |
I sometimes wonder why we don’t write:
This seems less heinous than
. At least, I think it’s more natural to imply bindings from underlying density measures than vice versa…
(sorry if this doesn’t come out nicely latex-ified—I’m no wordpress expert) [ed. I added the latex escape for you; all you needed was to put latex after the $ and before the first bit of LaTeX.]
December 4, 2009 at 3:32 pm |
Indeed, that’s how the notation is used to distinguish random variables $X$ from regular old bound variables $x$. But you typically only see this in careful discussions of probability theory or intro stats texts.
In practical modeling papers, where there are parameters, matrices, etc., the upper-case/lower-case thing gets difficult to maintain. And it’s so rare to see anything other than
that it seems awfully pedantic to include all those subscripts.
The real kicker is that you almost never see random variables defined as maps $X : \Omega \to \mathbb{R}$. In fact, you never see the sample space $\Omega$
even mentioned. Instead, you’ll see statements like “assume
is a random variable distributed as
.”
December 4, 2009 at 3:42 pm
Wouldn’t they write
when being careful? I intentionally left-off the function arguments.
Thanks for the wordpress/latex tip :)