## What’s Wrong with Probability Notation?

[Update 21 October 2009: Check out Michael Collins's comment and my reply. Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. I replied that it's really the practice that's problematic, and I'm talking about the notation in Gelman et al.'s Bayesian Data Analysis or Blei, Ng and Jordan's LDA paper.]

### What’s Wrong?

What’s wrong with the probability notation used in Bayesian stats papers? The triple whammy of

1. overloading $p()$ for every probability function,
2. using bound variables named after random variables, and
3. using the bound variable names to distinguish probability functions.

The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule,

$p(x|y) = p(y|x)p(x) / p(y)$,

where each of the four uses of $p()$ corresponds to a different probability function! In computer science, we’re used to using names to distinguish functions. So $f(x)$ and $f(y)$ are the same function $f$ applied to different arguments. In probability notation, $p(x)$ and $p(y)$ are different probability functions, picked out by their arguments.
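In code, the fix is obvious: give every function its own name. A toy discrete sketch of Bayes’s rule with all four probability functions named explicitly (the function names and numbers here are mine, not from any paper):

```python
# Four distinct, explicitly named probability functions over x, y in {0, 1},
# instead of four uses of an overloaded p().  All numbers are made up.

def p_x(x):                      # prior over x
    return {0: 0.3, 1: 0.7}[x]

def p_y_given_x(y, x):           # likelihood of y given x
    table = {(0, 0): 0.9, (1, 0): 0.1,
             (0, 1): 0.2, (1, 1): 0.8}
    return table[(y, x)]

def p_y(y):                      # marginal: sum over x of p(y|x) * p(x)
    return sum(p_y_given_x(y, x) * p_x(x) for x in (0, 1))

def p_x_given_y(x, y):           # Bayes's rule
    return p_y_given_x(y, x) * p_x(x) / p_y(y)

# the posterior sums to one for each observed y
assert abs(p_x_given_y(0, 1) + p_x_given_y(1, 1) - 1.0) < 1e-12
```

Here `p_x(0)` and `p_y(0)` are unambiguously different values because the functions have different names, which is exactly what the overloaded notation obscures.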

### Random Variables Don’t Help

As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we can use the capitalized random variables $X, Y$ as subscripts to distinguish the distributions, keeping $x, y$ as bound variables in the usual mathematical sense and disambiguating with

$p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_{Y}(y)$.

When we have dozens of parameters in our multivariate densities, this gets messy really quickly. So practitioners fall back to the unsubscripted notation.

### Great Expectations

The third issue appears in expectation notation (and in information-theory notation for entropy, mutual information, etc.). Here, statisticians write $x$ and $y$ for random variables, with the probability function and sample space left implicit. In applied Bayesian modeling, you often see expectation notation written as:

${\mathbb E}[x] = \sum_x x p(x)$.

What’s really sneaky here is the use of $x$ as a global random variable on the left side of the equation and as a bound variable on the right hand side. Distinguishing random variables with capital letters, this would look like

${\mathbb E}[X] = \sum_x x p_X(x)$.
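With the density explicitly named, the discrete expectation is a one-liner; a minimal sketch using a made-up pmf (a fair die, my choice of example):

```python
# Discrete expectation E[X] = sum over x of x * p_X(x), with the pmf
# p_X named explicitly.  The fair-die distribution is an arbitrary example.

def p_X(x):                        # pmf of a fair six-sided die
    return 1.0 / 6.0 if x in range(1, 7) else 0.0

def expectation(support, pmf):     # E[X] over a finite support
    return sum(x * pmf(x) for x in support)

E_X = expectation(range(1, 7), p_X)   # 3.5 for a fair die
```

Note that `x` inside `expectation` is a genuinely bound variable (the loop variable), while `p_X` carries the random variable’s name, mirroring the subscripted notation.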

### Continuous vs. Discrete

The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities.

In Bayes’s rule, if $p(x,y)$ is continuous in $x$, we’re meant to understand integration

$p(x|y) = p(y|x) p(x) / \int_{-\infty}^{\infty} p(y|x) p(x) dx$,

in place of summation

$p(x|y) = p(y|x) p(x) / \sum_x p(y|x) p(x)$.

Similarly, for expectations of continuous densities, we write

${\mathbb E}[x] = \int_{-\infty}^{\infty} x p(x) dx$

or if we’re being very careful with random variables,

${\mathbb E}[X] = \int_{-\infty}^{\infty} x p_X(x) dx$.
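The continuous case is the same recipe with the sum traded for an integral; a sketch approximating ${\mathbb E}[X]$ with a midpoint Riemann sum, for an exponential density (the density and truncation bounds are my choices):

```python
import math

# Continuous expectation E[X] = integral of x * f_X(x) dx, approximated
# by a midpoint Riemann sum.  The density is Exponential(rate=2), whose
# true mean is 1/2; the upper truncation at 20 loses only ~e^{-40} mass.

def f_X(x):                        # pdf of Exponential(rate=2) on [0, inf)
    return 2.0 * math.exp(-2.0 * x)

def expectation(pdf, lo, hi, n=100_000):
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h     # midpoint of the i-th subinterval
        total += x * pdf(x) * h
    return total

E_X = expectation(f_X, 0.0, 20.0)  # close to 1/rate = 0.5
```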

Intros to probability theory often use $f(x)$ for continuous probability density functions and reserve $p(n)$ for discrete probability mass functions. They’ll start with notation $P(A)$ (or $\mbox{Pr}(A)$) for the event probability function.

### Sample Spaces to the Rescue?

In applied work, we rarely, if ever, need to talk about the sample space $\Omega$, measures $P$ from subsets of $\Omega$ to $[0,1]$, or actually consider a random variable $X$ as a function from $\Omega$ to ${\mathbb R}$. I don’t recall ever seeing a model defined in this way.

Instead, we typically construct multivariate densities modularly by combining simpler distributions. For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as:

$p(y,\theta|n,\alpha,\beta) = \prod_i \mbox{Beta}(\theta_i|\alpha,\beta) \, \mbox{Binom}(y_i|n_i,\theta_i)$

or in sampling notation by stating $y_i \sim \mbox{Binom}(n_i,\theta_i)$ and $\theta_i \sim \mbox{Beta}(\alpha,\beta)$. In fact, that’s just how it gets coded in the model interpreter BUGS (Bayesian inference using Gibbs sampling) and compiler HBC (hierarchical Bayes compiler).
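The sampling notation translates almost line for line into a generative program; a sketch of the hierarchical beta-binomial (the hyperparameters and trial counts below are made up for illustration):

```python
import random

# Generative sketch of the beta-binomial in sampling notation:
#   theta_i ~ Beta(alpha, beta)
#   y_i     ~ Binom(n_i, theta_i)
# Hyperparameters and per-player trial counts are invented for illustration.

random.seed(42)
alpha, beta = 2.0, 5.0
n = [45, 50, 38]                       # e.g., at-bats per player

theta = [random.betavariate(alpha, beta) for _ in n]
y = [sum(random.random() < th for _ in range(n_i))   # Binomial(n_i, theta_i)
     for n_i, th in zip(n, theta)]                   # as a sum of Bernoullis
```

This is essentially what BUGS or HBC compiles the sampling statements into: a program that draws each variable conditional on its parents.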

Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? We could start with a joint density function $p(y)$ over a vector $y \in {\mathbb R}^n$, and consider projections of ${\mathbb R}^n$ in lieu of random variables. This we can fully specify with high-school calculus.

If we use consistent naming for the dimensions, we can get away without ever formally defining a random variable. In practice, it’s not really that hard to keep our sampling distributions $p(y|\theta)$ separate from our posteriors $p(\theta|y)$, even if we write them both as $p(\cdot | \cdot)$.

Technically, we could take $\Omega = {\mathbb R}^n$ as the sample space and then define the random variables as projections. But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities.

### Lambda Calculus to the Rescue?

Maybe we can do better. We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus.

For instance, we could define a discrete bivariate $p:({\mathbb N} \times {\mathbb N}) \rightarrow [0,1]$ for reference. We could then distinguish the marginals $p_1,p_2:{\mathbb N} \rightarrow [0,1]$, using

$p_1 = \lambda n. \sum_{m \in {\mathbb N}} p(n,m)$ instead of $p(n)$, and

$p_2 = \lambda m. \sum_{n \in {\mathbb N}} p(n,m)$ instead of $p(m)$.

We could similarly distinguish the conditionals, writing

$p_{1|2} = \lambda n . \lambda m. p(n,m) / p_2(m)$ instead of $p(n|m)$, and

$p_{2|1} = \lambda m . \lambda n. p(n,m) / p_1(n)$ instead of $p(m|n)$.

Bayes’s rule now becomes:

$p_{1|2} = \lambda n . \lambda m . p_{2|1}(m)(n) \, p_1(n) / p_2(m)$.
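For what it’s worth, the lambda-calculus definitions do translate directly into any language with first-class functions; a Python sketch over a toy joint pmf (the joint table is my own example):

```python
# Lambda-calculus-style marginals and conditionals over a toy joint pmf
# on {0,1} x {0,1}.  The joint table is invented for illustration.

joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
p = lambda n, m: joint[(n, m)]
N = M = (0, 1)

p1 = lambda n: sum(p(n, m) for m in M)             # marginal of coordinate 1
p2 = lambda m: sum(p(n, m) for n in N)             # marginal of coordinate 2
p1_given_2 = lambda n: lambda m: p(n, m) / p2(m)   # conditional p_{1|2}
p2_given_1 = lambda m: lambda n: p(n, m) / p1(n)   # conditional p_{2|1}

# Bayes's rule in curried form: both sides agree for every (n, m)
bayes = lambda n: lambda m: p2_given_1(m)(n) * p1(n) / p2(m)
assert all(abs(p1_given_2(n)(m) - bayes(n)(m)) < 1e-12
           for n in N for m in M)
```

Every function has a distinct name and the bound variables are genuinely bound, so there is no overloading left to resolve; whether the result is more readable than the overloaded notation is another question.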

Clear, no?

### Perhaps Not

OK, so maybe the statisticians are onto something here with their sloppy notation in practice.

All attempts to distinguish function names in stats only seem to make matters worse. This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next.

### 20 Responses to “What’s Wrong with Probability Notation?”

1. John Says:

And as if “p” weren’t overloaded enough, statisticians turn around and use “p” to indicate the number of parameters in a model.

2. Mark Says:

Distinguishing between random variables and their values may help (remember that random variables are in fact functions on the sample space).

In Bayesian approaches the domain of all the random variables is the same event space, so you don’t even need to write the “P” at all. This leads to the “distributed according to” notation, e.g.,

X | mu, sigma ~ Normal(mu,sigma^2)

When I first encountered it this notation seemed strange, but now I find it easier to read than one with lots of “P”s.

• Michael Says:

+1 on your comment regarding expectation.

Conventionally, X is the r.v., and x is a value in the range space of X. So it makes literally no sense to write E[x]. And you have to really understand that X is a measurable function. If you do this, the notation is actually airtight.

The convention of dropping subscripts on individual densities, in most cases, aids understanding. Putting the subscripts in just adds to the alphabet soup.

A final point about why this came to pass — in measure-theoretic probability, the distribution of every random variable is determinable (in principle) from a single measure, which may be a product measure, and it is conventionally called P. With this understanding, it’s always OK to have a single “P” for every event. I think this is another behind-the-scenes reason why people just use one “P”.

3. lingpipe Says:

@John I’m particularly amused by one paper where I saw $p_p(p)$ (you know who you are!).

@Mark I wish I’d gotten your comment a year ago. I actually prefer writing models out in sampling notation, ideally using something like BUGS to make sure they’re precisely specified enough to compile.

I’m trying to figure out how to write this all up for an intro to Bayesian classifiers I’m writing. I started out with the traditional notion of random variables (sample space with prob distro plus mapping to values). But the combo of a shared sample space and distribution with functions to values determining the random variables’ joint distributions is kind of confusing to try to write down.

Andrew Gelman and Jennifer Hill described random variables succinctly in their applied regression book. There’s a giant urn, you pull a ball out, and it has the value of every random variable written on it. Of course, you have to be happy with uncountably many balls in the urn. But to make the continuous urn notion precise, I’d need to pull out Lebesgue integrals. (I’d really like to pull out the even more general Stieltjes integrals to do away with the awkward alternations between summations and integrals.)

Instead, I’m leaning toward just defining joint densities on a per-model basis and working from there. You don’t ever really need the sample space and I’m not going to be able to define general integration in an intro text.

4. Michael Collins Says:

[Michael sent me this note via e-mail, but said I could post it. - Bob]

Last spring I was teaching a large undergrad class on probability
using this book:

http://athenasc.com/probbook.html

I like the notation in the book — in general I think it’s a great book — and I think it avoids some of the pitfalls you’ve described in your blog.

(Ignore the later chapters on statistics though, particularly frequentist
statistics, where the notation becomes a little more controversial — this led to long discussions with the authors and my co-lecturer about notation once the move from probability to statistics is made).

I think writing $p(x)$, $p(y|x)$ etc. is sloppy, although we do it in research papers all the time. The book always uses

$p_{Y}(y)$

$p_{X}(x)$

$p_{Y|X}(y | x)$

etc. for PMFs (discrete random variables), and

$f_{Y}(y)$

$f_{X}(x)$

$f_{Y|X}(y | x)$

for PDFs (continuous random variables). So the subscripting clarifies things here.

You would never write $\mbox{E}[x]$, always write $\mbox{E}[X]$ (assuming a convention that capitalization is used to denote random variables — otherwise be careful about what is a r.v. as opposed to the value of a r.v.). Actually writing $\mbox{E}[x]$ leads to real notational problems I think. Particularly when you get to conditional expectation. You don’t need subscripting on expectations, assuming that whenever you introduce a random variable you’ve been careful to specify its pmf/pdf.

[Michael then went on in a second post to add the following. - Bob]

The book is really great. The first chapter just goes over sample spaces, probability measures, events etc., with no mention of random variables. The second chapter introduces random variables — and that notation
is used throughout the book — in fact once r.v.s are introduced you rarely need to refer back to the underlying sample space (although it’s useful to have that in mind).

One thing I like about the book is that it’s basically a book about probability (although it covers statistics in later chapters) — as a consequence it gets the basics of probability and probability notation down very solidly, before even mentioning statistics, which I think is really important. The mathematical probabilists are very precise.

… let all the work be done by the r.v. $X$: if you’ve taken care to define the set of random variables in the first place, then there will be no problems. $p_X(x)$ is better because you really do want to be able to write things like $p_X(2)$ — i.e., you do want to treat this as a function. The $X$/$x$ thing is a convention but is unnecessary at that point. The Greek letter thing may be an inconvenience, but simply using capital Greek letters may work I think.

• lingpipe Says:

That’s a very good point about writing $p_X(2)$ for particular probabilities (density values); so the subscript does help. It’s really unclear without the subscript, as $p(2)$.

I’d really like to define a Bernoulli random variable $y$ with the usual sampling notation:

$y \sim \mbox{\sf Bern}(\theta)$

if

$p(y) = \Bigg| \begin{array}{ll}\theta & \mbox{ if } y = 1 \\ (1-\theta) & \mbox{ if } y = 0\end{array}$

It’s basically eliding the random variable subscript from the more “proper” random variable notation:

$Y \sim \mbox{\sf Bern}(\theta)$

if

$p_Y(y) = \Bigg| \begin{array}{ll}\theta & \mbox{ if } y = 1 \\ (1-\theta) & \mbox{ if } y = 0\end{array}$

Ack — the plugin for LaTeX won’t let me use \Bigg{. I’m guessing it’s a bracket-matching bug, because they allow \Bigg|.

5. anon Says:

Lambda notation clashes with standard mathematics, but you can use the |-> (\mapsto) operator instead.

6. Jach Says:

I agree; introducing a friend to Bayesian statistics was harder than it should have been (he has a programming background).

Another thing I don’t like is the pipe. p(A|B) reads as “probability of A given B”, or more precisely how much B-ness implies A-ness. So it would make more sense to have something like p(A A)…

7. Academic Productivity » What’s Wrong with Probability Notation? Says:

[...] Still, most people struggle with them. Could it be that the notation is just hard to swallow? What’s Wrong with Probability Notation? is a magnificent post that gives some basic reasons: The first two issues arise in the usual [...]

8. yolio Says:

I agree that probability notation is tricky, and maybe more so than necessary. But I totally disagree that THIS is the obstacle to understanding probability that most people face.

9. Jason Rennie Says:

I sometimes wonder why we don’t write:

$p_{X|Y} = \frac{p_{Y|X} \ p_X}{p_Y}$

This seems less heinous than $p(x|y)$. At least, I think it’s more natural to imply bindings from underlying density measures than vice versa…

(sorry if this doesn’t come out nicely latex-ified—I’m no wordpress expert) [ed. I added the latex escape for you; all you needed was to put latex after the \$ and before the first bit of LaTeX.]

• lingpipe Says:

Indeed, that’s how the notation is used to distinguish random variables $X,Y$ from regular old bound variables $x,y$. But you typically only see this in careful discussions of probability theory or intro stats texts.

In practical modeling papers, where there are parameters, matrices, etc., the upper-case/lower-case thing gets difficult to maintain. And it’s so rare to see anything other than $p_{X|Y}(x|y)$ that it seems awfully pedantic to include all those subscripts.

The real kicker is that you almost never see random variables defined as maps $X:\Omega \rightarrow {\mathbb R}$. In fact, you never see the sample space $\Omega$ even mentioned. Instead, you’ll see statements like “assume $X$ is a random variable distributed as $\mbox{Binomial}(\theta,N)$.”

• Jason Rennie Says:

Wouldn’t they write $p_{X|Y}(x|y)$ when being careful? I intentionally left off the function arguments.

Thanks for the wordpress/latex tip :)

• Julien Diard Says:

Leaving the sample space implicit really should be prohibited. It’s like crossing the beams, it’s bad. ;)

I have read way too many papers where, for instance, Normal distributions were happily applied over R+, or even over circular spaces. Consider, oh, “almost” all of the robotics navigation and mapping literature of the 90′s: positions x, y and orientations \theta of robots were put together in a single “pose” variable, and linear Gaussian models were applied directly to this 3D variable… :)

10. Problem with notation in applied Bayesian work Says:

[...] One hurdle newcomers have to applied Bayesian work is understand the notation at work. Understanding that p(x) is not the same function as p(y). Typically these refer to the marginal density (or mass) function for x and y, respectively. Similarly p(x|y) is not the same function as p(y|x), but instead the first is the function describing the conditional density (or mass) function of x given y and the second is the conditional density (or mass) function of y given x. Attempts to rectify this notation seem to make the notation overly complicated and therefore, the differences are made implicit. For more discussion of this, please see this post. [...]

11. Alan Mainwaring Says:

Can somebody help me here? I started off trying to understand the concept of nested designs in ANOVA, then moved into trying to clarify what was meant by the term variate, bivariate, etc. Then the term random variable as a real-valued function defined on a sample space (only dealing with discrete ones at the moment). I then tried to sort out the term independent variables from the term statistically independent random variables. All of the definitions I have seen define two or more independent random variables over the SAME sample space. Why could we not have different random variables defined over different sample spaces (discrete ones)? The more I think about probability theory the more confused I get. Something has got to be done to bring about some sort of simplicity and consistency in this very important area.
Hope there is someone out there who can help.

• lingpipe Says:

Once you understand the notation, it’s consistent. That’s not to say that some people don’t use it inconsistently. The problem for beginners is just how much the notation’s overloaded.

I’m afraid it doesn’t make sense to talk about the independence of random variables over different sample spaces. Keep in mind that it’s OK to have different outcomes in the same sample space — the sample space itself is abstract. For instance, if you have two discrete distros with outcomes {A,B} and {X,Y,Z}, then you can have a sample space with six points, {AX, AY, AZ, BX, BY, BZ}. The value of the first variable is A for samples AX, AY and AZ, and B for samples BX, BY and BZ. Similarly, the value of the second variable is X for AX and BX, and so on.

For instance, writing independence out in full, variables $X$ and $Y$ over a sample space are independent if and only if $p_{X,Y}(x,y) = p_X(x) \times p_Y(y)$ for all $x, y$. The reason you need this to be over a single sample space is that you can’t define the joint distribution $p_{X,Y}$ otherwise.

In practice, the sample space is rarely mentioned and random variables are defined by their distributions. It’s just assumed everything comes from the same sample space. Often you define $p_{X,Y}$ by first defining $p_X$ and $p_Y$ and then stating that they’re independent.
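Here’s a quick numerical check of that six-point construction (the marginal probabilities are made up): defining the joint as a product measure over the shared sample space makes the independence factorization come out exactly.

```python
from itertools import product

# The six-point sample space {AX, ..., BZ} from the example above, with a
# product measure so the two coordinate variables come out independent.
# The marginal probabilities are invented for illustration.

pA, pB = 0.4, 0.6
pX, pY, pZ = 0.2, 0.3, 0.5
P = {(a, b): p * q
     for (a, p), (b, q) in product([('A', pA), ('B', pB)],
                                   [('X', pX), ('Y', pY), ('Z', pZ)])}

def p1(a):   # first variable's pmf, summing out the second coordinate
    return sum(v for (a2, _), v in P.items() if a2 == a)

def p2(b):   # second variable's pmf, summing out the first coordinate
    return sum(v for (_, b2), v in P.items() if b2 == b)

# independence: the joint factors into the marginals at every sample point
assert all(abs(P[(a, b)] - p1(a) * p2(b)) < 1e-12 for (a, b) in P)
```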

12. Julien Diard Says:

Concerning your general issue with classical probabilistic notation: I could not agree more. However, I believe an elegant solution already exists, and was proposed by Jaynes.

Its basis is simple: p(x), or P(x), is an object which does not exist. The correct tool in probability is the conditional probability: one should always specify the preliminary knowledge that conditions the state of uncertainty about a quantity x. Therefore, p(X | c) is the distribution over variable X under the assumption that c holds. If the preliminary knowledge is different, say c’, then p(X | c’) can be another mathematical distribution. The function p(. | .), in this case, is not overloaded (contrary to what many reviewers of my papers asserted, but that’s not the point. :) ).

Complement this with a few conventions: for instance, use p(. | .) in the continuous case and P(. | .) in the discrete case. Use capitalized symbols when referring to variables (i.e., domains), and lower-case symbols for values, so that P([X=x] | c) (or P(X=x | c), when you’re lazy) is a probability value, whereas P(X | c) is a probability distribution. And you should be all set.

The use of right-hand side symbols to make probability distributions non-ambiguous does not even need to be tied to “subjectivist” stances about the meaning of probabilities as states of knowledge of agents. Indeed, it is also the basis of (purely Bayesian statistics or machine learning inspired) methods of model selection: both P(X | c) and P(X | c’) being defined, a new variable C can then be introduced, with domain C={c, c’}, and the model P(C | D) P(X | C D) can then be introduced to carry out model selection by computing P(C | X D).

• Bob Carpenter Says:

That introduces several new problems.

1. Do we then use $\mbox{Pr}(A)$ for event probs?

2. And what about mixtures, like spike and slab or Dirichlet process which are part discrete and part continuous?

3. The objection to the cap/lowercase convention is about matrices and to a lesser extent Greek letters. If we also want to capitalize matrices or bold them, we run into conflicting conventions. Not all the Greek letters have easily distinguishable caps.

4. While writing $p(X=x|C=c)$ may clear up some confusions, it also runs head-on into the notation used for events, where $X = x$ is shorthand for the event $\{\omega \in \Omega \,|\, X(\omega) = x\}$. And obviously $p(X=x|C=c) = 0$ if we’re talking events and $X$ is continuous.

5. I don’t see how using conditionals cleans up the distinction between $p(x|y)$ and $p(y|x)$, which would be written in probability theory as $p_{X|Y}(x|y)$ and $p_{Y|X}(y|x)$. This notation gets cumbersome when we have a dozen parameters.

I’ve never heard anyone say that the problem is purely Bayesian or frequentist — this is just about probability theory, about which everyone is in agreement. The frequentist/Bayesian debate is about what can be the object of a probability distribution, not how the laws of probability work.