David Goldstein Proves Himself Wrong

A recent paper in PLoS Biology by David Goldstein’s group is being seen as another ‘death of GWAS’ moment (again?). I have a lot of issues with this paper, but I will be brief and stick to my main objection: the authors attempt to demonstrate that common associations can be caused by sets of rare variants, and in doing so inadvertently show that most of them are not.

The Paper and the Press

This is another example of a scientific paper being careful to make no solid, controversial claims, but being surrounded by a media story that is not justified by the paper itself. The only real solid claim in the paper is that, if you do not include rare SNPs in your genome-wide association study, and rare SNPs of large effect are contributing to disease, then you will sometimes pick up more common SNPs as associated, because they are in Linkage Disequilibrium with the rare SNPs. Pretty uncontroversial, in so far as it goes. The paper makes no attempt to say whether this IS happening, just says that it CAN happen, and that we should be AWARE of it.

However, in the various articles around the internet, this paper is being received as if it makes some fundamental claim about complex disease genetics; that this somehow undermines Genome-Wide Association Studies, or shows their results to be spurious. David Goldstein is quoted on Nature News:

…many of the associations made so far don’t seem to have an explanation. Synthetic associations could be one factor at play. Goldstein speculates that, “a lot, and possibly the majority [of these unexplained associations], are due to, or at least contributed to, by this effect”.

Another author is quoted here as saying

We believe our analysis will encourage genetics researchers to reinterpret findings from genome-wide association studies

Much of the coverage conflates this paper with the claim that rare variants may explain ‘missing heritability’, which is an entirely different question; Nature News opens with the headline “Hiding place for missing heritability uncovered”. Other coverage can be found on Science Daily, Gene Expression and GenomeWeb.

Does this actually happen?

Is all this fuss justified? How common is this ‘synthetic hit’ effect; are a lot of GWAS hits caused by it, or hardly any? There are many ways that you could test this; for instance, you could make some predictions about what distribution of risk you’d expect to see in the many fine mapping experiments that have been done as follow ups to Genome-Wide Association Studies (this would be trivially easy to do using the paper’s simulations).

However, there is an even easier way to test the prevalence of the effect. If most GWAS hits are tagging relatively common variants, then you would expect to see most disease associated SNPs with a frequency in the 10% to 90% range (the range for which GWAS are best powered). However, a SNP with a frequency of 50% is less likely than one with a frequency of 10% to tag a SNP with frequency 0.5%, so if most GWAS hits are tagging rare variants, then you would expect to see most associated SNPs with a frequency skewed towards the very rare or the very common.
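
To put a rough number on that intuition, here is a minimal sketch in Python. It is not the paper's simulation, which involves several rare variants per locus; it only computes the standard upper bound on pairwise r² between two biallelic sites given their allele frequencies. The function name `max_r2` and the example frequencies are illustrative assumptions.

```python
def max_r2(p, q):
    """Upper bound on r^2 between two biallelic sites.

    p and q are the frequencies of the two alleles that travel together.
    D is largest when every copy of the rarer allele sits on a haplotype
    carrying the commoner one, giving r^2 = lo*(1-hi) / (hi*(1-lo)).
    """
    lo, hi = sorted((p, q))
    return (lo * (1 - hi)) / (hi * (1 - lo))


rare = 0.005  # assumed frequency of a single rare causal variant (0.5%)

# Compare how well a 10% marker and a 50% marker could possibly tag it.
for marker in (0.10, 0.50):
    print(f"marker frequency {marker:.2f}: max r^2 = {max_r2(rare, marker):.4f}")
```

Under these assumptions, the 10% marker can capture roughly nine times as much of the rare variant's signal as the 50% marker, which is the skew towards the extremes described above.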

In fact, the paper makes an explicit calculation of the expected frequency distribution of GWAS hits, under their synthetic model. In-double-fact, the paper plots this distribution against the distribution of known GWAS hits. And here is that plot, taken directly from the paper (Figure 5):

The green line is the expected frequency distribution of ‘synthetic’ associations; the red line is the actual distribution. We can see that the GWAS hits we do see fail to follow the distribution for synthetic associations; in fact, they follow pretty much exactly the distribution we’d expect if most common associations are tagging common causal SNPs.

The paper manages to show pretty conclusively both that synthetic associations can occur and that they rarely do.


Dickson, S., Wang, K., Krantz, I., Hakonarson, H., & Goldstein, D. (2010). Rare Variants Create Synthetic Genome-Wide Associations PLoS Biology, 8 (1) DOI: 10.1371/journal.pbio.1000294

7 Responses to David Goldstein Proves Himself Wrong

  1. Pingback: Tweets that mention David Goldstein Proves Himself Wrong « Genetic Inference -- Topsy.com

  2. Interesting point, but it looks like they used the distribution of all simulated variants rather than the distribution that you would see on the types of chips that are used in GWAS, which have an increasing bias against lower frequency SNPs. I bet if they ran the analysis to take into account that bias that it would pull that green line towards the center and make it look more like GWAS.

  3. Yeah, the authors mention this. We can work through the numbers (all values extracted from the graph using GIMP), to see how it changes things.

    The ‘rare’ peak at 0.085 on the green line (the common synthetic associations) is at 2.8, and at a frequency of 0.5 it is 0.3 - the rare peak is 9.2 times higher than the mid point. For the known associations, the value at 0.085 is 0.8, and at 0.5 is 1.2, giving a ratio of 0.68.

    The question is, will correcting for sampling bias bring the predicted rare:common ratio (9.2) into line with the actual rare:common ratio (0.68)?

    We can calculate how much the predicted ratio will change once we take into account bias, by noting that a) we can take the ratio of 8.5% to 50% SNPs for the Illumina 1M chip from the graph (it is 1.37), and b) we can predict the actual ratio of 8.5% to 50% SNPs using Kimura’s neutral allele frequency ratio (see here):

    (f2/f1)*((1-f1)/(1-f2))^(M-1)

    We use M=0.1, as you generally do, but changing this value doesn’t significantly change our results. The ratio is 3.4, giving an overall sampling bias of 3.4/1.37 = 2.5.

    This sampling bias changes the predicted rare:common ratio from 9.2 to 3.7; this is still significantly higher than the observed ratio of 0.68, and thus we can still conclude that there are very few, if any, synthetic associations amongst the known GWAS hits.

    Quite why the authors didn’t do this simple analysis, given that they already had all the data available, is an open question.

  4. D. Goldstein has, in the present paper, constructed a logical sophism. In brief, he says that if a causative rare variant is creating a signal at a given locus, then the common variant at this locus with which the rare variant hitchhikes will also be associated. This first point is true, but it has very rarely happened in the case of well-confirmed, replicated variants. The second step in his logic, and the one that the media coverage is conveying, is that when a common variant is associated, it is because of rare variants. This second step is unfounded, and is at least unlikely to be a general scenario.
    Yes, sequencing will bring data, but one cannot assume in advance that one knows what the results will be.

  5. I agree with Patrick. Of course we know that some common SNPs found in GWAS will be in LD with highly penetrant variants, some common and some rare (see AMD); but in advance of sequencing data we have no idea how much of the SNP associations, and in which disorders, can be attributed to these phenomena. D. Goldstein’s graph/sums suggest a minority.

  6. Your post was selected as part of my Picks of the Week, of molbio-related blog posts aggregated to RB.

    http://amontenegro.blogspot.com/2010/02/gwas-under-attack-historical.html

  7. Pingback: Rare variants versus common variants in complex disease is a political, not a scientific, debate -Gene Expression
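
The arithmetic in response 3 can be reproduced with a few lines of Python. This is only a sketch of that back-of-envelope calculation: the inputs are the values the commenter read off the paper's figure, and the function and variable names are made up for illustration.

```python
def neutral_ratio(f1, f2, M):
    """Ratio of SNPs at frequency f1 to SNPs at frequency f2, using the
    formula quoted in response 3: (f2/f1) * ((1-f1)/(1-f2))**(M-1)."""
    return (f2 / f1) * ((1 - f1) / (1 - f2)) ** (M - 1)


# Values read off the figure by the commenter (approximate by nature).
predicted_ratio = 9.2   # synthetic model, 8.5% vs 50% frequency
observed_ratio = 0.68   # known GWAS hits, 8.5% vs 50% frequency
chip_ratio = 1.37       # Illumina 1M chip, 8.5% vs 50% frequency

expected = neutral_ratio(0.085, 0.5, M=0.1)  # ratio without chip bias
bias = expected / chip_ratio                 # under-sampling of rarer SNPs
corrected = predicted_ratio / bias           # synthetic prediction after bias

print(f"neutral ratio:      {expected:.2f}")
print(f"sampling bias:      {bias:.2f}")
print(f"corrected ratio:    {corrected:.2f}")
print(f"observed ratio:     {observed_ratio:.2f}")

# Sensitivity check: try other values of M (chosen arbitrarily here).
for M in (0.01, 0.1, 0.5):
    c = predicted_ratio / (neutral_ratio(0.085, 0.5, M) / chip_ratio)
    print(f"M = {M}: corrected ratio = {c:.2f}")
```

To one decimal place this gives the same 3.4, 2.5 and 3.7 quoted in the response, and the corrected ratio stays well above the observed 0.68 for each value of M tried.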
