We start with a probability space $(\Omega,\mathcal F, \mathbb P)$, where

– $\Omega \neq \emptyset$ the sample space (which is indeed just a set),

– $\mathcal F$ is a $\sigma$-algebra on $\Omega$; for countable $\Omega$ this is usually just the power set $\mathcal P(\Omega) := \{ A \mid A \subset \Omega \}$ of $\Omega$,

– $\mathbb P$ is a measure ($\mathbb P(\emptyset) = 0$, $\sigma$-additive) on $(\Omega, \mathcal F)$ with total mass 1, i.e. $\mathbb P(\Omega) = 1$, called a probability measure.

An $\omega \in \Omega$ is called an outcome, a measurable set $A \in \mathcal F$ is called an event.

A random variable $X$ is just a measurable map $X : (\Omega, \mathcal F) \to (\mathbb R \cup \{\pm\infty\}, \mathcal B(\mathbb R \cup \{\pm\infty\}))$.

Since $\mathbb P$ is a measure, for any $A \in \mathcal F$, $\mathbb P(A)$ actually means that

$\mathbb P(A) \equiv \int_A \mathrm d\mathbb P \equiv \int \mathbf 1_A(\omega) \mathbb P(\mathrm d\omega)$.

So, note that $\mathbb P : \mathcal F \to [0,1]$ actually is a set function: it always takes events as arguments, never “just a random variable”!
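To see that $\mathbb P$ really is a set function, here is a minimal Python sketch (a fair die; the names `Omega` and `P` are just illustrative):

```python
from fractions import Fraction

# Sample space of a fair die; F is the power set, so every subset is an event.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def P(A):
    """Probability measure: takes an event (a set), returns a number in [0, 1]."""
    A = frozenset(A) & Omega
    return Fraction(len(A), len(Omega))

# P is evaluated on events, never on outcomes directly:
print(P({2, 4, 6}))   # 1/2
print(P(Omega))       # 1 (total mass)
print(P(set()))       # 0 (measure of the empty set)
```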

For events, we introduce some further notation to describe real life phenomena/events in a natural way, e.g.

$\{X \in B\} := X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$,

$\{X \leq x\} := \{X \in [-\infty, x]\} = \left\{\omega \in \Omega : X(\omega) \in [-\infty, x]\right\}$,

$\{X = x\} := \{X \in \{x\}\}$

etc. Now, you may set $p_X(x) := \mathbb P(X=x)$ which is not ambiguous at all.
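These event notations are literally preimages, which a toy example makes concrete (the random variable `X` below is made up for illustration):

```python
from fractions import Fraction

Omega = frozenset({1, 2, 3, 4, 5, 6})   # fair die
P = lambda A: Fraction(len(frozenset(A) & Omega), len(Omega))
X = lambda w: w % 3                     # a toy random variable Omega -> {0, 1, 2}

def event(pred):
    """The event {X in B} as a preimage: outcomes whose X-value satisfies pred."""
    return {w for w in Omega if pred(X(w))}

p_X = lambda x: P(event(lambda v: v == x))   # p_X(x) = P(X = x)

print(sorted(event(lambda v: v <= 1)))   # [1, 3, 4, 6] -- a set of outcomes
print(p_X(0))                            # 1/3, since X(w) = 0 exactly for w in {3, 6}
```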

But: a whole different object is the law (or distribution) of a random variable $X$, which is defined as the image measure of $\mathbb P$ under $X$ by, for any $B \in \mathcal B(\mathbb R)$,

$B \mapsto \mathbb P_X(B) := \mathbb P(X \in B) = \mathbb P\circ X^{-1}(B)$.

In particular, for discrete (countable) $\Omega$ we now see that

$\mathbb P(B) = \sum_{\omega \in B} \mathbb P(\{\omega\}) = \sum_{\omega \in B} p(\omega)$,

which is commonly used as a definition in engineering courses.

So mind: $p: \Omega \to [0,1]$ is a point function, whereas $\mathbb P: \mathcal P(\Omega) \to [0,1]$ is a set function. Hence $p(\omega)$, but $\mathbb P(\{\omega\})$!
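Returning to the law $\mathbb P_X = \mathbb P \circ X^{-1}$: it is a new measure living on the value space, not on $\Omega$. A sketch with the same toy die (names are illustrative):

```python
from fractions import Fraction

Omega = frozenset({1, 2, 3, 4, 5, 6})   # fair die
P = lambda A: Fraction(len(frozenset(A) & Omega), len(Omega))
X = lambda w: w % 3                     # a toy random variable Omega -> {0, 1, 2}

def P_X(B):
    """Image measure: P_X(B) = P(X^{-1}(B)) -- pushes P forward to the value space."""
    return P({w for w in Omega if X(w) in B})

print(P_X({0}))        # 1/3, equals P({3, 6})
print(P_X({0, 1, 2}))  # 1, since X only takes the values 0, 1, 2
```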

The expectation of (a function $u$ of) a random variable $X$ is defined as

$\mathbb E u(X) := \int u(X) \,\mathrm d\mathbb P = \int_\Omega u(X(\omega)) \,\mathbb P(\mathrm d\omega) = \int_{\mathbb R} u(x) \, \mathbb P(X \in \mathrm dx)$,

where $u : \mathbb R \to \mathbb R$ is a measurable map and the last equality comes from a Lebesgue integration analogue of the change of variables theorem in Riemann integration (the transformation rule for image measures).

The measurable function $u$ can be quite simple, e.g. $u(x) = x$, which gives just the expectation of $X$, or $u(x) = (x - \mathbb E X)^2$, which gives the variance of $X$.

From the abstract definition we now recover the discrete and continuous random variables like this:

(1) A discrete random variable only sees isolated points, i.e. we want to end up with a distribution such as $X \sim \mathrm{Poi}(\lambda)$: $\mathbb P(X = n) = e^{-\lambda} \frac{\lambda^n}{n!}$ for $n = 0, 1, 2, \dots$ and $\lambda > 0$.

Therefore, we use that for the so-called Dirac measure

$\delta_x(A) := \mathbf 1_A(x) = \begin{cases} 1 & x \in A,\\ 0 & x \notin A, \end{cases}$

we get

$\int f(y) \delta_x(\mathrm dy) = f(x)$, i.e. we define in the example above

$\mathbb P(X \in \mathrm dx) = \sum_{n=0}^\infty e^{-\lambda} \frac{\lambda^n}{n!} \delta_n(\mathrm dx)$.

Hence,

$\mathbb Eu(X) = \int u(x) \sum_{n=0}^\infty e^{-\lambda} \frac{\lambda^n}{n!} \delta_n(\mathrm dx) = \sum_{n=0}^\infty u(n)\, e^{-\lambda} \frac{\lambda^n}{n!}$.

More generally, if $X \sim \sum_{n\in\mathbb N} p_n \delta_{x_n}$, where $(p_n)$ is the probability mass function of the Bernoulli, binomial, Poisson etc. distribution, then

$\mathbb Eu(X) = \sum_{n\in\mathbb N} u(x_n) p_n = \sum_{n\in\mathbb N} u(x_n) \mathbb P(X = x_n)$.
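Numerically, this discrete expectation is just a (truncated) sum. A sketch for the Poisson case, using the pmf recursion $p_{n+1} = p_n \,\lambda/(n+1)$ to avoid huge factorials (the truncation point $N$ is an ad-hoc choice that makes the neglected tail negligible):

```python
import math

lam = 2.5   # Poisson parameter

def E(u, N=100):
    """E u(X) = sum_n u(n) p_n for X ~ Poi(lam), truncated at N."""
    total, pn = 0.0, math.exp(-lam)   # p_0 = e^{-lam}
    for n in range(N):
        total += u(n) * pn
        pn *= lam / (n + 1)           # recursion: p_{n+1} = p_n * lam / (n + 1)
    return total

mean = E(lambda x: x)                 # should be close to lam
var = E(lambda x: (x - mean) ** 2)    # should also be close to lam
print(mean, var)                      # both ≈ 2.5
```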

(2) A continuous random variable is defined by its density $f(x)$, i.e. again by the transformation rule, for any $B \in \mathcal B(\mathbb R)$,

$\mathbb P(X \in B) = \int_B \mathrm d\mathbb P_X = \int_B f(x) \mathrm dx$.

Hence,

$\mathbb Eu(X) = \int u(x) \mathbb P(X \in \mathrm dx) = \int u(x) f(x) \mathrm dx$.
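Again, this can be checked numerically; a sketch with an exponential density and a plain midpoint Riemann sum (the cutoff `b` is an ad-hoc choice that makes the neglected tail tiny):

```python
import math

rate = 2.0
f = lambda x: rate * math.exp(-rate * x)   # density of Exp(rate) on [0, inf)

def E(u, b=20.0, n=200_000):
    """Midpoint Riemann sum for E u(X) = integral of u(x) f(x) dx over [0, b]."""
    h = b / n
    return sum(u((i + 0.5) * h) * f((i + 0.5) * h) for i in range(n)) * h

print(E(lambda x: 1.0))   # ≈ 1.0, the density integrates to one
print(E(lambda x: x))     # ≈ 0.5, the mean of Exp(2) is 1/rate
```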

Finally, the conditional probability and conditional expectation are two more general (but again completely different!) objects, denoted in a similar fashion to emphasise their strong connection to the probability and the expectation of a function.

The conditional expectation $\mathbb E(X \mid \mathcal A)$ (for some sub-$\sigma$-algebra $\mathcal A \subset \mathcal F$) is actually the $\mathsf L^1(\mathbb P)$ extension of an orthogonal Hilbert space projection in $\mathsf L^2(\mathbb P)$ (whatever this means…). The conditional probability (with respect to $\mathcal A$) can then be defined as, for $A \in \mathcal F$,

$\mathbb P(A \mid \mathcal A) := \mathbb E(\mathbf 1_A \mid \mathcal A)$,

which is again a random variable. So actually one should write $\mathbb P(A \mid \mathcal A)(\omega)$. In particular, the existence of both notions is not obvious at all; there is quite some work to be done. Secondly, for fixed $\omega$, $A \mapsto \mathbb P(A \mid \mathcal A)(\omega)$ is in general not a measure. But the good news is, two particular cases are exactly what we want:

(1) $A \in \mathcal A \Longrightarrow \mathbb P(A \mid \mathcal A) = \mathbb E(\mathbf 1_A \mid \mathcal A) = \mathbf 1_A$ and $A \mapsto \mathbf 1_A(\omega)$ is a measure.

(2) $A$ independent of $\mathcal A \Longrightarrow \mathbb P(A \mid \mathcal A) = \mathbb E(\mathbf 1_A \mid \mathcal A) = \mathbb E \mathbf 1_A = \mathbb P(A)$ and $A \mapsto \mathbb P(A)$ is a measure.

A special case is the classical conditional probability, where $\mathcal A$ is generated by a single event $B$: let $A, B \in \mathcal F$ be events. We define the conditional probability to be

$\mathbb P(A \mid B) = \begin{cases} \frac{\mathbb P(A \cap B)}{\mathbb P(B)} & \mathbb P(B) > 0,\\ 0 & \mathbb P(B) = 0, \end{cases}$

and call it the probability of $A$ given (under the condition) $B$.
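A quick sanity check of this definition on the fair-die example (names are illustrative):

```python
from fractions import Fraction

Omega = frozenset({1, 2, 3, 4, 5, 6})   # fair die
P = lambda A: Fraction(len(frozenset(A) & Omega), len(Omega))

def P_cond(A, B):
    """P(A | B) = P(A ∩ B) / P(B), with the convention 0 when P(B) = 0."""
    A, B = frozenset(A), frozenset(B)
    return P(A & B) / P(B) if P(B) > 0 else Fraction(0)

even, big = {2, 4, 6}, {4, 5, 6}
print(P_cond(even, big))    # 2/3: given a face > 3, two of the three faces are even
print(P_cond(even, set()))  # 0 by convention
```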

A little computation also shows that from the above it follows that

$\mathbb E(X \mid B) = \begin{cases} \frac{\mathbb E(X \mathbf 1_B)}{\mathbb P(B)} & \mathbb P(B) > 0,\\ 0 & \mathbb P(B) = 0. \end{cases}$
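This formula is easy to verify on the same toy die (the identity random variable is just for illustration):

```python
from fractions import Fraction

Omega = frozenset({1, 2, 3, 4, 5, 6})   # fair die
P = lambda A: Fraction(len(frozenset(A) & Omega), len(Omega))
X = lambda w: w                          # identity: the face shown

def E_cond(X, B):
    """E(X | B) = E(X 1_B) / P(B), with the convention 0 when P(B) = 0."""
    B = frozenset(B) & Omega
    if P(B) == 0:
        return Fraction(0)
    EX1B = sum(Fraction(X(w), len(Omega)) for w in B)  # E(X 1_B) = sum_{w in B} X(w) P({w})
    return EX1B / P(B)

print(E_cond(X, {4, 5, 6}))   # 5: the average face, given the face exceeds 3
```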

Now, we can also start to look at specific events. If $X$ and $Y$ are discrete, e.g. $A = \{X = x\}, B = \{Y = y\}$, we can define $\mathbb P(A \mid B) = \mathbb P(X=x \mid Y=y) =: p_{X\mid Y}(x,y)$.

If $X$ and $Y$ are (jointly) continuous, they admit a joint (!) density $f_{X,Y}(x,y)$, i.e.

$\mathbb P(X \leq a, Y \leq b) = \int_{(-\infty,a]}\int_{(-\infty,b]} f_{X,Y}(x,y) \,\mathrm dy \, \mathrm dx.$

The marginal distributions can be recovered from the joint distribution by

$\mathbb P(X \leq a) = \mathbb P(X \leq a, Y < \infty) = \int_{(-\infty,a]} \underbrace{\int_{\mathbb R} f_{X,Y}(x,y) \mathrm dy}_{=: f_X(x)} \mathrm dx.$
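Marginalising is plain integration over the other variable. A numerical sketch with a made-up joint density, namely that of two independent Exp(1) variables:

```python
import math

# Joint density of two independent Exp(1) variables (an illustrative choice):
f_joint = lambda x, y: math.exp(-(x + y)) if x >= 0 and y >= 0 else 0.0

def f_X(x, b=30.0, n=100_000):
    """Marginal density f_X(x) = integral of f_{X,Y}(x, y) over y (midpoint rule)."""
    h = b / n
    return sum(f_joint(x, (i + 0.5) * h) for i in range(n)) * h

print(f_X(1.0))   # ≈ e^{-1} ≈ 0.3679, i.e. the Exp(1) density at x = 1
```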

The conditional density of $X$ given $Y$ is said to be

$f_{X\mid Y}(x,y) = \begin{cases} \frac{f_{X,Y}(x,y)}{f_Y(y)} & f_Y(y) \neq 0,\\ 0 & \text{else}. \end{cases}$

Finally, conditional densities allow us to compute conditional expectations and probabilities, namely

$\mathbb E(h(X,Y) \mid Y = y) = \int_{\mathbb R} h(x,y) f_{X\mid Y}(x,y) \mathrm dx$,

and, for any $B \in \mathcal B(\mathbb R)$,

$\mathbb P(X \in B \mid \sigma(Y)) = \int_B f_{X\mid Y}(x\mid Y)\mathrm dx$.
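These formulas can be checked numerically; a sketch with the made-up joint density $f_{X,Y}(x,y) = x + y$ on the unit square:

```python
# An illustrative joint density on the unit square: f_{X,Y}(x, y) = x + y.
f_joint = lambda x, y: x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, a=0.0, b=1.0, n=20_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def E_cond(h, y):
    """E(h(X, Y) | Y = y) = integral of h(x, y) f_{X|Y}(x | y) dx, with f_{X|Y} = f_{X,Y}/f_Y."""
    f_y = integrate(lambda x: f_joint(x, y))   # f_Y(y) by marginalising over x
    if f_y == 0:
        return 0.0
    return integrate(lambda x: h(x, y) * f_joint(x, y) / f_y)

print(E_cond(lambda x, y: x, 0.0))   # ≈ 2/3, since f_{X|Y}(x | 0) = 2x on [0, 1]
```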

To wrap up: it might be tempting to use some weird notation like $p(X)$ or $\mathbb E[x]$, but those make absolutely no sense.

Even though this seems like a non-issue for statisticians, many non-statisticians can experience significant confusion due to the notion that a density function is just another static function as in ordinary calculus.

I am particularly comfortable with using p all the time, relying on the actual variables inside the parentheses to signify the corresponding density function.

However, we often find that some densities in our application are significant enough to justify giving them their own names like f, g or h. Another situation where we genuinely need specific names for density functions is when a single random variable has multiple possible densities depending on the context.

This practice, however, leads many to think that f, g or h are just ordinary calculus functions, which is not true, because density functions can be marginalized to reduce their dimension, or conditioned upon another random variable, while ordinary functions cannot. For example, say f(a | b) is the conditional density of a given b. Then the shape or location of f is affected by b, which is itself a random variable, so f has no static shape or location. What we really want to communicate by f is a law of dependence between a and b, and this law can be static, which is why we are motivated to use f instead of the generic notation p in the first place.

For statisticians, it is quite natural to also write that integrating b out of g(a,b) will give g(a), which is to use g for both the joint and marginal density. This is justifiable because the joint density implies its marginal densities. Or sometimes people also write h(a,b) = f(a)g(b | a) which looks utterly confusing to most outsiders. However, this notation immediately makes sense when we realize that replacing f, g and h with p will give the familiar formula of the joint density of a and b.

In conclusion, I think the confusion is genuine and significant among many researchers. However, as I mentioned, this confusion seems to have legitimate reasons and a simple solution is not yet available. My current solution to this situation is to keep reminding myself that f, g or h are nothing else but specific variants of p, and hence they are not just calculus functions.

As for the “continuous versus discrete” overloading, I think it is not a big problem, as there is no mathematical controversy here: sum is a special case of integration (sum is the Lebesgue integral with respect to the natural measure on the integers, the counting measure). When doing computations, one must be aware of the codomains of the variables (are the variables integer-valued, real-valued, vector-valued, unit-circle-valued etc.), but that also applies if we have only continuous variables (Julien says something similar in his comment https://lingpipe-blog.com/2009/10/13/whats-wrong-with-probability-notation/#comment-16022 ).

1. Do we then use for event probs?

2. And what about mixtures, like spike and slab or Dirichlet process which are part discrete and part continuous?

3. The objection to the cap/lowercase convention is about matrices and to a lesser extent Greek letters. If we also want to capitalize matrices or bold them, we run into conflicting conventions. Not all the Greek letters have easily distinguishable caps.

4. While writing may clear up some confusions, it also runs head-on into the notation used for events, where is shorthand for the event . And obviously if we’re talking events and is continuous.

5. I don’t see how using conditionals cleans up the distinction between and , which would be written in probability theory as and . This notation gets cumbersome when we have a dozen parameters.

I’ve never heard anyone say that the problem is purely Bayesian or frequentist — this is just about probability theory, about which everyone is in agreement. The frequentist/Bayesian debate is about what can be the object of a probability distribution, not how the laws of probability work.

Its basis is simple: p(x), or P(x), is an object which does not exist. The correct tool in probability theory is conditional probabilities: one should always specify the preliminary knowledge that conditions the state of uncertainty about a quantity x. Therefore, p(X | c) is the distribution of the variable X under the assumption that c holds. If the preliminary knowledge is different, say c’, then p(X | c’) can be another mathematical distribution. The function p(. | .), in this case, is not overloaded (contrary to what many reviewers of my papers asserted, but that’s not the point. :) ).

Complement this with a few conventions, for instance, use p(. | .) in the continuous case, P(. | .) in the discrete case. Use capitalized symbols when referring to variables (i.e., domains), and lowercase symbols for values. So that P([X=x] | c) (or P(X=x | c), when you’re lazy) is a probability value, whereas P(X | c) is a probability distribution. And you should be all set.

The use of right-hand side symbols to make probability distributions non-ambiguous does not even need to be tied to “subjectivist” stances about the meaning of probabilities as states of knowledge of agents. Indeed, it is also the basis of (purely Bayesian statistics or machine learning inspired) methods of model selection: both P(X | c) and P(X | c’) being defined, a new variable C can then be introduced, with domain C={c, c’}, and the model P(C | D) P(X | C D) can then be introduced to carry out model selection by computing P(C | X D).
