## Marginalizing Latent Variables in EM

After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I “turned the crank” (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

### Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads $r$, gene or splice variant sources $g$, and mRNA expression $\theta$, setting

$p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta)$.

### Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

### What’s a Parameter?

So that’s what I did, calculating

$g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta)$.

The problem is that there’s an implicit “except the latent parameters” inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources $g$, taking

$\theta^* = \arg \max_{\theta} p(r|\theta)$.

That requires marginalizing out the latent mapping of reads $r_n$ to gene or splice variant sources $g_n$,

$p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta)$.

On a log scale, that’s

$\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta)$.

### The Example

I’ll repeat the example here for convenience:

### Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

### Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

$\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)$

${} \ \ \ \ = 70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta) + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)$

$= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})$

$= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2)$.

The reduction on the third line is because the probability of a read in the second exon, $p(2|k)$, is only non-zero for isoform $k=1$, and similarly for a read in the third exon, where $p(3|k)$ is nly non-zero for the second gene or splice variant, $k=2$. The final reduction follows because $\theta_1 + \theta_2 = 1$.

### Not so Biased After All

After marginalizing out the read mappings $g$, the MLE for expression is right where we’d expect it, at the solution of

$\frac{d}{d\theta} \log p(r|\theta) = 0$,

which is

$\theta^*=20/70$.

It turns out EM isn’t so biased for these discrete mixtures, at least when you turn the crank the right way.