I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.

### Contingency Table

I’ll use the data in the example in the Wikipedia article on contingency tables:

Left-Handed | Right-Handed | TOTAL | |
---|---|---|---|

Male | 9 (y1) | 43 | 52 (n1) |

Female | 4 (y2) | 44 | 48 (n2) |

Total | 13 | 87 | 100 |

### Hypothesis

The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.

### Beta-Binomial Model

The observed data has left-handed men of a total of men and left-handed women of a total of women.

Letting be the proportion of left-handed men and the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,

, and

.

He then suggested simple uniform priors on the proportions, which we know are equivalent to priors:

, and

.

Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:

, and

We could try to compute the posterior density of analytically by solving the integral

and then evaluating .

Instead, we’ll just follow Andrew’s sage advice and estimate the integral by simulation. That is, we’ll take samples from the joint posterior

,

which we write as

and then use the Monte Carlo approximation,

,

where is the indicator function, taking on value 1 if its property is true and 0 otherwise.

In words, we estimate the probability that is greater than by the fraction of samples where .

### R Code

I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.

# DATA n1 = 52 # men y1 = 9 # left-handed men n2 = 48 # women y2 = 4 # left-handed women # SIMULATION I = 10000 # simulations theta1 = rbeta(I, y1+1, (n1-y1)+1) theta2 = rbeta(I, y2+1, (n2-y2)+1) diff = theta1-theta2 # simulated diffs # OUTPUT quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995)) print(quantiles,digits=2) print("Probability higher % men than women lefties:") print(mean(theta1>theta2)) # VISUALIZATION #png(file="bayesContingency.png") plot(density(diff), xlab="theta1 - theta2", ylab="p(theta1 - theta2 | y, n)", main="Posterior Simulation of Male - Female Lefties", ylim=c(0,8), frame.plot=FALSE,cex.lab=1.5,lwd=3,yaxt="no") abline(v=quantiles[2], col="blue") abline(v=quantiles[4], col="blue") #dev.off()

Running this code prints out:

> source("twoByTwoPosterior.R") 0.5% 2.5% 50% 97.5% 99.5% -0.091 -0.046 0.084 0.218 0.262 [1] "Probability higher % men than women lefties:" [1] 0.901 [1] "Mean difference theta1-theta2" [1] 0.08484979

The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The posterior central 95% quantile is thus (-0.046,0.218). Quantiles get the obvious interpretation in Bayesian stats — given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218).

The second number printed is the posterior probability that men have a higher percentage of left-handness than women, which is 0.901, or about 90%. R makes this simple by letting us write `mean(theta1 > theta2)`

. This is also the value you’d get from integrating the posterior density from to .

The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. This value is 0.0848, which is a bit higher than the median difference of 0.084.

The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference .

### Significant?

Don’t ask. You’re a Bayesian now.

Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of lefthandedness among men than women, or that there’s a 95% chance the difference between male prevalence of lefthandness and female prevalence of lefthandedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]

October 14, 2009 at 10:12 am |

Very nice analysis!

October 25, 2009 at 3:45 pm |

This is a clear example of a beyesian analysis for the difference o f two proportions… However, the original problem is related with a test of independence. With your former analysis, how could you conclude that Y_1 and Y_2 are independent?

October 26, 2009 at 2:49 pm |

Exactly — it’s the Bayesian equivalent of testing the hypothesis of whether there is a higher prevalence of left-handedness among men than among women.

I don’t see the connection to independence of Y_1 and Y_2. But then I would have never written this kind of data out in a contingency table format in the first place, which suggests you’re going to do things like test independence of the rows and columns.

October 26, 2009 at 5:49 pm

OK, Its clear that independence of Y_1 and Y_2 is not a matter of interest. However, regarding your post, I dont see clearly the mathematical connection between the analysis of the difference of these proportions with the indepence between rows and columns. I mean… Do you know about some bayesian theoretical approach that concludes that if theta_1 > theta_2, then rows and columns are independet?

October 27, 2009 at 12:26 pm

You can compute Bayesian counterparts to everything from chi-square to correlation stats. Just look at the posterior over the stat. There’s a section in Gelman et al.’s

Bayesian Data Analysisthat compares the Bayesian posteriors to usual confidence intervals and hypothesis tests.I don’t know about other Bayesian independence tests, but don’t take that as a definitive word — I’ve never looked.

(Hmm — only nests two deep on replies.)

October 27, 2009 at 8:42 pm |

Thanks a lot, i just wrote a post in my blog (spanish) about this interesting connection between independence on rows and columns and the difference of two proportions…

http://predictive.files.wordpress.com/2009/10/binder2.pdf

July 20, 2010 at 6:16 am |

Hey there,

I am currently trying to figure out how to apply Bayesian analyses to a problem: I want to compare two relative frequency distributions (i.e. each sample/distribution sums to 1) with 4 categories each. Following the two-category example discussed above, I thought I should treat the problem similarly and compare probabilities of belonging to, say, category i, between both samples. Unfortunately, I do not have counts but only proportions: One sample actually consists of counts but with fractions (i.e. an event may occurr 1.5 times), the other sample really only consists of relative indices, no counts at all. So I cannot use a multinomial model, analogous to the binomial in the example above. Instead, I thought about using a Dirichlet model with the parameter alpha, calculating expected proportions of each category i as alpha(i)/sum(alpha()). Is this approach correct? What kind of prior should I put on alpha? And how should I go about proportions that are sampled, i.e. have a mean and SD? Should I simulate new proportions based on their distributions for each iteration of the MCMC simulation? I’d really appreciate any advice,

cheers!

July 22, 2010 at 2:51 pm |

What question are you trying to answer in comparing the two distributions? For instance, in my post, the question is whether men are more likely to be left handed than women.

And why use multinomial models for continuous data?

You could use your fractional counts in the way you suggest as data for a Dirichlet. That is, you observe a bunch of and the likelihood function is . You then need a prior on the Dirichlet parameters. You can use something diffuse like a Pareto prior or more informative like a logistic normal, and you get a posterior . It’s easy to build these models in BUGS and run posterior inference by gathering samples from the posterior .

Which prior to use depends on how much you know. If you know something about the problem that lets you use a tighter prior, use it, or build the information into the model itself and maintain diffuse priors.

Some people like to fudge fractional data and treat it like multinomial data with a different normalizer. That’s what we do in our traditional naive Bayes class in LingPipe.

July 23, 2010 at 1:20 am |

First of all, thanks so much for the advice and the different ways to go about using a Dirichlet.

As for my problem – I am looking at dietary preferences of a predator. On the one hand I have counts of prey items taken (these are the actual but sometimes fractional counts), with a total n=33 and four prey categories. On the other hand I have a measure of prey availability, in the form of an abundance index, as I cannot actually count or estimate abundance for the prey species. I scaled the indices for the four prey categories so they sum to 1, and that’s why I only have proportions. The indices – and so the proportions – are estimates so they have an associated SE. I am trying to determine prey selection by comparing observed frequencies of prey i with expected frequencies based on the hypothesis of no selection, i.e. that use of prey species i is proportional to availability.

Is the a web resource you can recommend where I can find out more about these issues?

Again, thanks so much. I am rather new at this and any advice really helps me.

Cheers.

July 23, 2010 at 12:44 pm |

This must be a commonly modeled problem in biology. I’d first go look up what other people have done.

I don’t know about a web resource, but I’d highly recommend Gelman and Hill’s multilevel/hierarchical regression book for learning how to think about these problems.

For online resources, I’d recommend looking at the sample models that come with BUGS. Models of the size you’re looking at are really easy to code and run inference for, especially from R. The Gelman and Hill book has tons of examples.

Proportions seem to have a natural interpretation as “probability next meal is X”. The usual kind of model for this is something like multinomial logistic regression if the available prey is fixed (or something like discrete choice analysis if it isn’t). You’d use the availability as predictors, perhaps with interactions. You presumably need the identity of prey (e.g. some are harder to catch than others, or maybe that’s already rolled into the index you have).

You can push the uncertainty of the predictors through the predictions by integration, which you can estimate with sampling. It’s really easy to build this into bugs — you just make the availability predictor a random sample from its distribution as approximated by your normal with mean and standard error.

I’d still like to see something that produces the multinomial data that you see rather than trying to predict proportions. Naively, it seems like the total amount of prey items taken will vary by their size as well as their availability (e.g. you need to eat more small prey than large prety to get the same amount of nutrition). It also seems that this’d vary a lot by season for some animals and their overall activity level.

July 27, 2010 at 7:32 am |

Thanks again, I’ll try and get a hold of the book you mention. My problem with regression models are my small and extremely skewed counts – basically one category with an n of 26, and all the other somewhere between 1 and 2. There is not really any biological justification for further grouping of species. So while looking further I might think about settling for something more descriptive for this particular data set. Thanks so much for your time, it really helps to get some input and even if I am unable to apply any of your advice now, I will certainly remember it in further ananlyses.

October 26, 2010 at 9:22 pm |

Great post, I coded this up in Java if anyone wants Java source. I also discovered that pretty much the same method is described at http://en.wikipedia.org/wiki/Pearson's_chi-square_test#Bayesian_method and http://en.wikipedia.org/wiki/Categorical_distribution#Bayesian_statistics , though there are no refs given there. Of course this method is obvious in hindsight, I don’t think I’ll use Fisher’s Exact Test, Chi Squared or odds ratios again…

I wish there were a closed form for the integral though, the sampling step slows down my analysis code by a factor of about 100 over calculating Fisher’s Exact Test (with only 100*2 beta distribution samples per contingency table). I would love to use the mean and stddev to of the two beta functions to get a closed-form estimate of the z-score of the difference without having to numerically sample from the beta distributions (the mean of the difference is the difference between the means, and the variance of the difference is the sum of variances minus twice the covariance), but I don’t know how to calculate the covariance between the two beta-distributed RVs. Can anybody help with this please?

October 27, 2010 at 11:31 pm |

I’m curious as to what the use case where 100 times slower than pretty much instant is going to be much of a bottleneck.

Despite it being a fairly standard kind of convolution, I’m terrible at integrals, so I have no idea how to help.

I’m afraid the PDF referenced in your next comment is behind a paywall and I’m on the road right now, so I can’t see it. Given that the title suggests you have known correlations, how do you generate them? And what variables are you talking about being correlated?

For approximations like this, it’d be nice to know what range they’re accurate for. The problems usually arise near boundaries of bounded distributions when the normal approximation begins to break down.

The int to float conversions are automatic in Java. If you have (A op B), the type with smaller range will undergo what Java calls “numeric promotion”. Using lots of float-based arithmetic is dangerous, though, because it doesn’t give you many significant digits.

October 28, 2010 at 10:28 am

The use case is performing 5 billion of these tests (literally). Speed makes a difference. The high number of tests comes from taking the cross product of variables in three domains of interest (genes expressed vs. cell attributes vs. another more esoteric domain), and I’m looking for pairs of variables that have a high z-score. Of course I’ll have problems with false positives due to multiple hypothesis testing, and I would be interested to hear of the best way to do multiple hypothesis testing correction on top of this Bayesian contingency test.

As far as accuracy, I generated a thousand z-scores from the approximation and your sampling approach for random hyperparams in [5,1000] for n1 and n2 and [0,ni] for the yi values, and plotted a scatterplot of the results. I didn’t calculate the correlation coefficient but it was a nearly perfect straight line, eyeballing it the correlation was probably R=0.999 or something. So it’s a really close approximation if all you need is z-score. And z-score is useful enough if the posterior is “nearly” normal, and (again just from eyeballing) it is “usually” pretty close. If your z-score is over 2-3 or so then it’s a pretty significant value whether or not the posterior is really close to normal, so for cases where you’re looking for the most significant contingencies among many contingency tables, i.e. measuring significance of correlation between many pairs of variables, the z-score is a reasonable measure because you’re effectively generating these scores for all variables and then sorting by decreasing absolute z-score, so the z-scores close to zero (where whether the distribution is normal or not is more important) are not the ones you’re looking closely at.

Yes, I’m aware of automatic conversion rules and the limits of using float-based arithmetic. Actually some of the expressions like “float sum1 = (float) (aa1 + bb1)” have superfluous casts because the variable definitions were automatically extracted by Java using the refactoring tools (and the original subexpression had to be cast to float to avoid loss of accuracy dividing an int by an int or overflow in multiplying three large ints), I didn’t notice the unneeded cast. However at least one of the two casts “(float) aa1 * (float) bb1″ is needed to avoid integer overflow during multiplication. As far as the accuracy issues with floats, the number of significant digits in a float is in the right ballpark for the number of significant digits present in y1, y2, n1 and n2, so the only loss of precision is in the “(float) Math.sqrt()” statement, and since it’s just a number of standard deviations from the mean, I don’t mind losing precision there, the extra precision is unnecessary.

October 26, 2010 at 11:48 pm |

If all you need is a z-test statistic (i.e. number of std deviations of zero from the mean) and you don’t need quantiles or the “Probability higher % men than women lefties” statistic, you don’t need to do the simulation, you just need the mean and std dev of the distribution of delta. (This assumes delta is roughly normal, of course…)

Mean is easy, it is mean1 – mean2. Stddev is sqrt(var1 + var2 + 2*cov(theta1, theta2)).

var1 and var2 are the standard values for the beta distribution. The covariance term is harder. I *think* that the 1st order Taylor series approximation of the covariance is given by \frac{\alpha_1 \times \alpha_2 + (1 + \beta_1) \times (1 + \beta_2)}{(\alpha_1 + \beta_1) \times (\alpha_2 + \beta_2) \times (1 + \alpha_1 + \beta_1) \times (1 + \alpha_2 + \beta_2)} as shown in Eq (4) of http://linkinghub.elsevier.com/retrieve/pii/S0167947303001695 : “An algorithm for generating positively correlated Beta-distributed random variables with known marginal distributions and a specified correlation”. (I think you can just drop the delta1 and delta2 terms?)

If that is the case, then it turns out (after comparing to the simulation method) that the covariance term is usually very small compared to the variance terms, and because it is only a first-order approximation, it has large error with extreme counts. It’s worth dropping it and using stddev = sqrt(var1 + var2) as an estimate of the stddev of the posterior. The estimate is usually quite close to the simulated value.

So the code for computing the z-score without doing the numeric simulation is:

int aa1 = y1 + 1, bb1 = (n1 – y1) + 1;

int aa2 = y2 + 1, bb2 = (n2 – y2) + 1;

float sum1 = (float) (aa1 + bb1);

float sum2 = (float) (aa2 + bb2);

float mean1 = aa1 / sum1;

float mean2 = aa2 / sum2;

float var1 = (float) aa1 * (float) bb1 / (sum1 * sum1 * (sum1 + 1));

float var2 = (float) aa2 * (float) bb2 / (sum2 * sum2 * (sum2 + 1));

float meanAnalyt = mean1 – mean2;

float stddevAnalyt = (float) Math.sqrt(var1 + var2);

float zScoreAnalyt = meanAnalyt / stddevAnalyt;

Expanded out and without the int-to-float conversions, this z-score statistic is:

zScoreAnalyt = ((y1 + 1) / (n1 + 2) – (y2 + 1) / (n2 + 2)) / sqrt((y1 + 1) * ((n1 – y1) + 1) / ((n1 + 2)^2 * (n1 + 3)) + (y2 + 1) * ((n2 – y2) + 1) / ((n2 + 2)^2 * (n2 + 3)))

This agrees quite closely with the z-score from the full Bayesian simulation that draws from beta distributions, but can be computed much faster.

October 30, 2010 at 12:05 am |

For a quick look at Bayesian multiple comparisons, you can check out my baseball batting ability analysis:

http://lingpipe-blog.com/2009/11/04/hierarchicalbayesian-batting-ability-with-multiple-comparisons/

For more depth, check out Andrew, Jennifer and Masanao’s paper and presentation, linked from this blog post:

http://www.stat.columbia.edu/~cook/movabletype/archives/2008/03/why_i_dont_usua_1.html

I’ve been spending lots of time with RNA-Seq data lately, specifically differential gene or splice-variant of gene expression, as described here:

http://lingpipe-blog.com/2010/02/05/inferring-splice-variant-mrna-expression-rna-seq/

Of course, I had to reimplment the sampler — BUGS won’t scale to RNA-Seq size data sets!

November 19, 2010 at 3:47 am |

Very useful analysis and R code.

Is there an extension of it to handle larger r x c tables – e.g. 2 x 3?

November 19, 2010 at 3:25 pm |

I didn’t write one, but it’d be pretty simple. Just translate all the binomials to multinomials and betas to Dirichlets.

November 30, 2010 at 9:44 am |

This is super-useful… enough so that I’d like to use this Bayesian method to replace the Fisher exact test method I have been using in the paper I am currently writing (in this particular astronomical application, I’m comparing binarity fractions between two different samples of brown dwarfs)

How should I cite you?

November 30, 2010 at 2:26 pm |

No! Don’t cite me. I didn’t make this up — it’s just the standard Bayesian posterior inference.

I was just working through the details for Andrew Gelman’s blog post.

Andrew didn’t make this up, either, though his (co-authored) book

Bayesian Data Analysisis my favorite general reference for the basics of Bayesian inference.In general, you want to build a hierarchical model of the population, which sensibly adjusts for data size relative to a population and can handle multiple comparisons. See Andrew, Masanao and Jennifer’s paper and my blog post.

November 30, 2010 at 4:43 pm |

Hi

I like this example, and have incorporated it into a book

I am writing called “Machine learning: a probabilistic approach”.

I thought readers of this blog might be interested to see

the relevant pages, which can be accessed here

http://people.cs.ubc.ca/~murphyk/MLbook/pmlBFcoin.pdf

(A version with all the cross references fixed should

be ready in a year or so :)

In particular, I compare Gelman’s p(delta|D) method

to a more standard Bayes factor analysis.

The latter has a simple closed form solution which might be of

interest to Luke Hutchinson and others.

Please let me know your comments.

Kevin

PS. would “psirusteam” please translate his Spanish posting

(at http://predictive.files.wordpress.com/2009/10/binder2.pdf )

to English?

November 30, 2010 at 5:51 pm |

I’m toying with writing such a book myself. But I’d just steer clear of the whole frequentist notion of hypothesis testing. I got distracted with the LingPipe book, which is much less mathematical.

I’ve never really studied Bayes factors or model selection problems.

I like the analysis of the actual Fisher exact test with the oddball improper priors Beta(0,1) and Beta(1,0). I’ll have to look that up and work through it.

There must be better things to cite than Andrew’s and my blog posts here! There’s pretty extensive discussion in Andrew’s book and in the multiple comparisons paper I linked in the previous comment.

May 31, 2011 at 3:02 am |

Your method sounds great! I wonder If I can use your method in my problem or not.

I am working on a data set with a binary response variable and discrete predictor variables (each has three state: 0,1,2).

I am willing to test independence between response variable and all combinations of predictor variable in a Bayesian framework.

I think it would be logical to assign a similarity between two adjacent categories (e.g. x1=0 , x2=0, x3=0, x4=0, x5=0 & x1=0 , x2=0, x3=0, x4=0, x5=1).

Thus, I want to use the property of similarity of probabilities in adjacent categories, when the categories are ordered.

Kindly give me a clue.

June 1, 2011 at 4:27 pm |

I don’t see how it relates to this post per se, but what you’re looking for is perhaps ordinal logistic regression. If you compute a Bayesian posterior via sampling, you can then do whatever you want with statistics over the posterior samples.

June 4, 2011 at 6:05 am |

It is definitly related to my problem.

What I’m looking for is a 2*L version of the method you presented here. But I have an other prior information which considers ordered probabilities for celles in contingency table (i.e. P1<=P2<=P3…<=Pl).

June 3, 2012 at 6:52 am |

after finding the posteriors of theta1 and theta2,,, the distribution of the delta (theta1-theta2) should be normal in this case so one can easily find the posterior probabiltx for other cases with out using simulations

June 4, 2012 at 12:49 pm |

Isn’t that only in the limit of increasing sample sizes? I believe distributions approach normality as , but I can’t find a reference at the moment. That would mean the difference of the two betas would approach normality in the limit.

August 10, 2012 at 3:25 pm |

Thanks for providing this useful example.

One difference between this and a chi-squared test is that the “mean difference” statistic will depend on which categorical variable corresponds to the rows and which corresponds to the columns. So in cases where there’s a not a preference for a variable to condition on, what’s the standard practice? Run the model both ways?

Along the same lines of not having a well-defined directionality to conditioning, why use this approach over, say a saturated log-linear model of the contingency table (besides that the credible intervals in a bayesian poisson model would probably be too narrow because of the fixed variance)?

August 13, 2012 at 4:46 pm |

Maybe I’m misunderstanding, but for the first question, isn’t it just a matter of the question you’re trying to answer, such as whether left-handedness is more prevalent among women than men? Whether femaleness is more prevalent among left-handers than right-handers is a very different question.

They do both relate to how well the cell counts are predicted given the row and column identifiers, which brings us to question two.

For the second question, I’m not sure what you mean by a “saturated log-linear model of the contingency table”. Let’s say we fix the count of men and count of women, then fit a logistic regression based on an intercept and a predictor for sex. Without a prior, the model will fit the data perfectly. So do we also get rid of the preictor for sex? In which case, we’re back to where we started, I think.

August 16, 2012 at 12:03 pm

Re 1st question – that’s true in some domains, but in some problems the question is just whether the two variables are independent. I think this tends to come up in cases where you have less of a specific hypothesis regarding a mechanism or direction of causality and are only trying to determine whether there’s a correlation (for example, between measurements related to genes in an organism or words in text).

I should have been more specific regarding log-linear models. I had in mind the log-linear approach to modeling tabular correlations where you start with a saturated model and hierarchically compare it to simpler independence models, so that the association question is a problem of model selection with a likelihood ratio tests (I’m thinking along the lines of agresti’s categorical data analysis books). A saturated model will fit perfectly, but you can ask whether a likelihood ratio test will favor it over models with more independence relationships.

In Kruschke’s “Doing Bayesian Data Analysis” chapter 23 he describes the Bayesian analogue to a log-linear model as basically a saturated model with priors on the batches of coefficients (β0, βr, βc, βrc in log(λ) = β0+βr+βc+βrc, granted this probably better suited for contingency tables with more than 2 categories along each dimension). One way to look at association in this context is to ask whether the posterior distribution on the interaction parameter in the saturated model is nonzero (i.e. βrc in log(λ) = β0+βr+βc+βrc).

August 16, 2012 at 12:06 pm |

By “hierarchically compare” I was referring to simpler models of independence which are nested in the saturated model not andrew’s multilevel models. Overloaded term…