I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.
Contingency Table
I’ll use the data in the example in the Wikipedia article on contingency tables:
| Left-Handed | Right-Handed | TOTAL | |
|---|---|---|---|
| Male | 9 (y1) | 43 | 52 (n1) |
| Female | 4 (y2) | 44 | 48 (n2) |
| Total | 13 | 87 | 100 |
Hypothesis
The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.
Beta-Binomial Model
The observed data has left-handed men of a total of
men and
left-handed women of a total of
women.
Letting be the proportion of left-handed men and
the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,
, and
.
He then suggested simple uniform priors on the proportions, which we know are equivalent to priors:
, and
.
Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:
, and
We could try to compute the posterior density of analytically by solving the integral
or we could just follow Andrew’s sage advice and estimate the integral by simulation.
R Code
I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.
# DATA
n1 = 52 # men
y1 = 9 # left-handed men
n2 = 48 # women
y2 = 4 # left-handed women
# SIMULATION
I = 10000 # simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1)
theta2 = rbeta(I, y2+1, (n2-y2)+1)
diff = theta1-theta2 # simulated diffs
# OUTPUT
quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995))
print(quantiles,digits=2)
print("Probability higher % men than women lefties:")
print(mean(theta1>theta2))
# VISUALIZATION
#png(file="bayesContingency.png")
plot(density(diff),
xlab="theta1 - theta2",
ylab="p(theta1 - theta2 | y, n)",
main="Posterior Simulation of Male - Female Lefties",
ylim=c(0,8),
frame.plot=FALSE,cex.lab=1.5,lwd=3,yaxt="no")
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")
#dev.off()
Running this code prints out:
> source("twoByTwoPosterior.R")
0.5% 2.5% 50% 97.5% 99.5%
-0.091 -0.046 0.084 0.218 0.262
[1] "Probability higher % men than women lefties:"
[1] 0.901
[1] "Mean difference theta1-theta2"
[1] 0.08484979
The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The posterior central 95% quantile is thus (-0.046,0.218). Quantiles get the obvious interpretation in Bayesian stats — given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218).
The second number printed is the posterior probability that men have a higher percentage of left-handness than women, which is 0.901, or about 90%. R makes this simple by letting us write mean(theta1 > theta2). This is also the value you’d get from integrating the posterior density from to
.
The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. This value is 0.0848, which is a bit higher than the median difference of 0.084.
The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference .
Significant?
Don’t ask. You’re a Bayesian now.
Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of lefthandedness among men than women, or that there’s a 95% chance the difference between male prevalence of lefthandness and female prevalence of lefthandedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]
October 14, 2009 at 10:12 am |
Very nice analysis!
October 25, 2009 at 3:45 pm |
This is a clear example of a beyesian analysis for the difference o f two proportions… However, the original problem is related with a test of independence. With your former analysis, how could you conclude that Y_1 and Y_2 are independent?
October 26, 2009 at 2:49 pm |
Exactly — it’s the Bayesian equivalent of testing the hypothesis of whether there is a higher prevalence of left-handedness among men than among women.
I don’t see the connection to independence of Y_1 and Y_2. But then I would have never written this kind of data out in a contingency table format in the first place, which suggests you’re going to do things like test independence of the rows and columns.
October 26, 2009 at 5:49 pm
OK, Its clear that independence of Y_1 and Y_2 is not a matter of interest. However, regarding your post, I dont see clearly the mathematical connection between the analysis of the difference of these proportions with the indepence between rows and columns. I mean… Do you know about some bayesian theoretical approach that concludes that if theta_1 > theta_2, then rows and columns are independet?
October 27, 2009 at 12:26 pm
You can compute Bayesian counterparts to everything from chi-square to correlation stats. Just look at the posterior over the stat. There’s a section in Gelman et al.’s Bayesian Data Analysis that compares the Bayesian posteriors to usual confidence intervals and hypothesis tests.
I don’t know about other Bayesian independence tests, but don’t take that as a definitive word — I’ve never looked.
(Hmm — only nests two deep on replies.)
October 27, 2009 at 8:42 pm |
Thanks a lot, i just wrote a post in my blog (spanish) about this interesting connection between independence on rows and columns and the difference of two proportions…
http://predictive.files.wordpress.com/2009/10/binder2.pdf
July 20, 2010 at 6:16 am |
Hey there,
I am currently trying to figure out how to apply Bayesian analyses to a problem: I want to compare two relative frequency distributions (i.e. each sample/distribution sums to 1) with 4 categories each. Following the two-category example discussed above, I thought I should treat the problem similarly and compare probabilities of belonging to, say, category i, between both samples. Unfortunately, I do not have counts but only proportions: One sample actually consists of counts but with fractions (i.e. an event may occurr 1.5 times), the other sample really only consists of relative indices, no counts at all. So I cannot use a multinomial model, analogous to the binomial in the example above. Instead, I thought about using a Dirichlet model with the parameter alpha, calculating expected proportions of each category i as alpha(i)/sum(alpha()). Is this approach correct? What kind of prior should I put on alpha? And how should I go about proportions that are sampled, i.e. have a mean and SD? Should I simulate new proportions based on their distributions for each iteration of the MCMC simulation? I’d really appreciate any advice,
cheers!
July 22, 2010 at 2:51 pm |
What question are you trying to answer in comparing the two distributions? For instance, in my post, the question is whether men are more likely to be left handed than women.
And why use multinomial models for continuous data?
You could use your fractional counts in the way you suggest as data for a Dirichlet. That is, you observe a bunch of
and the likelihood function is
. You then need a prior
on the Dirichlet parameters. You can use something diffuse like a Pareto prior or more informative like a logistic normal, and you get a posterior
. It’s easy to build these models in BUGS and run posterior inference by gathering samples
from the posterior
.
Which prior to use depends on how much you know. If you know something about the problem that lets you use a tighter prior, use it, or build the information into the model itself and maintain diffuse priors.
Some people like to fudge fractional data and treat it like multinomial data with a different normalizer. That’s what we do in our traditional naive Bayes class in LingPipe.
July 23, 2010 at 1:20 am |
First of all, thanks so much for the advice and the different ways to go about using a Dirichlet.
As for my problem – I am looking at dietary preferences of a predator. On the one hand I have counts of prey items taken (these are the actual but sometimes fractional counts), with a total n=33 and four prey categories. On the other hand I have a measure of prey availability, in the form of an abundance index, as I cannot actually count or estimate abundance for the prey species. I scaled the indices for the four prey categories so they sum to 1, and that’s why I only have proportions. The indices – and so the proportions – are estimates so they have an associated SE. I am trying to determine prey selection by comparing observed frequencies of prey i with expected frequencies based on the hypothesis of no selection, i.e. that use of prey species i is proportional to availability.
Is the a web resource you can recommend where I can find out more about these issues?
Again, thanks so much. I am rather new at this and any advice really helps me.
Cheers.
July 23, 2010 at 12:44 pm |
This must be a commonly modeled problem in biology. I’d first go look up what other people have done.
I don’t know about a web resource, but I’d highly recommend Gelman and Hill’s multilevel/hierarchical regression book for learning how to think about these problems.
For online resources, I’d recommend looking at the sample models that come with BUGS. Models of the size you’re looking at are really easy to code and run inference for, especially from R. The Gelman and Hill book has tons of examples.
Proportions seem to have a natural interpretation as “probability next meal is X”. The usual kind of model for this is something like multinomial logistic regression if the available prey is fixed (or something like discrete choice analysis if it isn’t). You’d use the availability as predictors, perhaps with interactions. You presumably need the identity of prey (e.g. some are harder to catch than others, or maybe that’s already rolled into the index you have).
You can push the uncertainty of the predictors through the predictions by integration, which you can estimate with sampling. It’s really easy to build this into bugs — you just make the availability predictor a random sample from its distribution as approximated by your normal with mean and standard error.
I’d still like to see something that produces the multinomial data that you see rather than trying to predict proportions. Naively, it seems like the total amount of prey items taken will vary by their size as well as their availability (e.g. you need to eat more small prey than large prety to get the same amount of nutrition). It also seems that this’d vary a lot by season for some animals and their overall activity level.
July 27, 2010 at 7:32 am |
Thanks again, I’ll try and get a hold of the book you mention. My problem with regression models are my small and extremely skewed counts – basically one category with an n of 26, and all the other somewhere between 1 and 2. There is not really any biological justification for further grouping of species. So while looking further I might think about settling for something more descriptive for this particular data set. Thanks so much for your time, it really helps to get some input and even if I am unable to apply any of your advice now, I will certainly remember it in further ananlyses.