Archive for January, 2010

David Goldstein Proves Himself Wrong

Wednesday, January 27th, 2010

A recent paper in PLoS Biology by David Goldstein’s group is being seen as another ‘death of GWAS’ moment (again?). I have a lot of issues with this paper, but I will be brief and stick to my main objection: the authors attempt to demonstrate that common associations can be caused by sets of rare variants, and in doing so inadvertently show that most of them are not.

The Paper and the Press

This is another example of a scientific paper being careful to make no solid, controversial claims, but being surrounded by a media story that is not justified by the paper itself. The only real solid claim in the paper is that, if you do not include rare SNPs in your genome-wide association study, and rare SNPs of large effect are contributing to disease, then you will sometimes pick up more common SNPs as associated, because they are in Linkage Disequilibrium with the rare SNPs. Pretty uncontroversial, as far as it goes. The paper makes no attempt to say whether this IS happening; it just says that it CAN happen, and that we should be AWARE of it.

However, in the various articles around the internet, this paper is being received as if it makes some fundamental claim about complex disease genetics; that this somehow undermines Genome-Wide Association Studies, or shows their results to be spurious. David Goldstein is quoted on Nature News:

…many of the associations made so far don’t seem to have an explanation. Synthetic associations could be one factor at play. Goldstein speculates that, “a lot, and possibly the majority [of these unexplained associations], are due to, or at least contributed to, by this effect”.

Another author is quoted here as saying

We believe our analysis will encourage genetics researchers to reinterpret findings from genome-wide association studies

Much of the coverage conflates this paper with the claim that rare variants may explain ‘missing heritability’, which is an entirely different question; Nature News opens with the headline “Hiding place for missing heritability uncovered”. Other coverage can be found on Science Daily, Gene Expression and GenomeWeb.

Does this actually happen?

Is all this fuss justified? How common is this ‘synthetic hit’ effect; are a lot of GWAS hits caused by it, or hardly any? There are many ways that you could test this; for instance, you could make some predictions about what distribution of risk you’d expect to see in the many fine mapping experiments that have been done as follow ups to Genome-Wide Association Studies (this would be trivially easy to do using the paper’s simulations).

However, there is an even easier way to test the prevalence of the effect. If most GWAS hits are tagging relatively common variants, then you would expect to see most disease associated SNPs with a frequency in the 10% to 90% range (the range for which GWAS are best powered). However, a SNP with a frequency of 50% is less likely than one with a frequency of 10% to tag a SNP with frequency 0.5%, so if most GWAS hits are tagging rare variants, then you would expect to see most associated SNPs with a frequency skewed towards the very rare or the very common.
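To put rough numbers on that frequency argument, here is a minimal sketch of the textbook upper bound on r² between two biallelic SNPs, given only their allele frequencies. It is a single-variant simplification, not the paper’s simulation (which looks at sets of rare variants), and the function name `max_r2` and the example frequencies are just illustrative assumptions.

```python
def max_r2(p_a, p_b):
    """Largest possible r^2 between two biallelic SNPs whose tracked
    alleles have frequencies p_a and p_b (both strictly between 0 and 1).

    Illustrative helper, not taken from the paper.
    """
    # The LD coefficient D is bounded by the allele frequencies:
    # positive D is at most min(pA(1-pB), (1-pA)pB),
    # negative D is at most min(pA*pB, (1-pA)(1-pB)) in magnitude.
    d_pos = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    d_neg = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_max = max(d_pos, d_neg)
    # r^2 = D^2 / (pA(1-pA) pB(1-pB))
    return d_max ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))


rare = 0.005  # the 0.5% variant from the text
for common in (0.10, 0.50):
    bound = max_r2(rare, common)
    print(f"best-case r^2 between a {rare:.1%} and a {common:.0%} SNP: {bound:.4f}")
```

Under this bound, a 0.5% variant can reach an r² of about 0.045 with a 10% SNP but only about 0.005 with a 50% SNP, which is the sense in which the 50% SNP is the worse tag.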

In fact, the paper makes an explicit calculation of the expected frequency distribution of GWAS hits, under their synthetic model. In-double-fact, the paper plots this distribution against the distribution of known GWAS hits. And here is that plot, taken directly from the paper (Figure 5):

The green line is the expected frequency distribution of ‘synthetic’ associations; the red line is the actual distribution. We can see that the GWAS hits we do see fail to follow the distribution for synthetic associations; in fact, they follow pretty much exactly the distribution we’d expect if most common associations are tagging common causal SNPs.

The paper manages to show pretty conclusively both that synthetic associations can occur, and that they rarely do.


Dickson, S., Wang, K., Krantz, I., Hakonarson, H., & Goldstein, D. (2010). Rare Variants Create Synthetic Genome-Wide Associations. PLoS Biology, 8(1). DOI: 10.1371/journal.pbio.1000294

Cargo Cult Science and NT Factor®

Friday, January 22nd, 2010

A recent blog post on Chronic Fatigue Syndrome linked in passing to a ‘treatment’ called Mitochondria Ignite™ with NT Factor®. This product caught my attention as an example of what Richard Feynman called ‘Cargo Cult Science’; a company dressing up like scientists, using chemical names and precise sounding figures, without actually having any science underlying it.

However, the product is arguably not exactly pure Cargo Cult Science; there is a small amount of science content present. The product page contains a number of references, some of which point to peer-reviewed journals, and some of which are actually studies of the effect of some of the contents of the drug on humans. Of course, taking apart the studies shows that the product is still unproven, despite the thin glaze of real science; I can’t help but feel that this sort of thing has slightly grim implications for the future of accurate consumer information.


The Future of Second Generation Sequencing

Wednesday, January 13th, 2010

Illumina, the major player in high-throughput sequencing these days, have announced the newest version of their second generation sequencing platform, the HiSeq 2000. The machine can produce a lot more sequence, and at lower cost, than the previous Genome Analyzer II.

I’m not going into much detail about the machine: for that, see posts at Genomics Law Report, Genome Web, Genetic Future, Pathogenomics and PolITiGenomics. What I really care about is what this machine implies for the future of sequencing, and specifically what we can predict about the coming 2nd versus 3rd generation sequencing battles that will be kicking off later this year.

PacBio’s 3rd generation machine, which will be arriving later this year, will have an initial throughput of around 3 Gb a day, at a price of around $1.40 per Mb in consumable costs. I don’t know the specs for Oxford Nanopore’s machine; my guess is that it will be similar, but we’ll know soon.

Compare PacBio’s capacity to the HiSeq 2000, which will produce 25 Gb per day, at a claimed consumables price of $0.11 per Mb ($10 000 for a 30X genome). In short, the Illumina 2nd gen machine is going to be able to pump out much more sequence per day, and at a much lower cost per base, than PacBio. Both will rapidly increase the power of their machines after release, but we don’t know who will push faster (Dave Dooling thinks Illumina could push the HiSeq to 450 Gb per run with existing technology).
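For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch of what those figures imply for a single 30X genome. The throughput and per-Mb prices are the ones quoted above; the 3 Gb genome size, the 1 Gb = 1000 Mb conversion and the function name are my assumptions, and the result ignores everything except consumables and raw machine time.

```python
# Throughput and price figures are the ones quoted in the post.
# GENOME_GB is an assumption (roughly the size of a human genome).
GENOME_GB = 3.0
COVERAGE = 30       # 30X, as in the post
MB_PER_GB = 1000    # assumed conversion

PLATFORMS = {
    "HiSeq 2000": {"gb_per_day": 25.0, "usd_per_mb": 0.11},
    "PacBio (initial)": {"gb_per_day": 3.0, "usd_per_mb": 1.40},
}


def genome_cost_and_days(gb_per_day, usd_per_mb,
                         genome_gb=GENOME_GB, coverage=COVERAGE):
    """Return (consumables cost in USD, days of machine time) for one genome."""
    total_gb = genome_gb * coverage            # total sequence needed
    cost = total_gb * MB_PER_GB * usd_per_mb   # consumables only
    days = total_gb / gb_per_day               # at the quoted daily throughput
    return cost, days


for name, spec in PLATFORMS.items():
    cost, days = genome_cost_and_days(**spec)
    print(f"{name}: ~${cost:,.0f} in consumables, ~{days:.1f} days of machine time")
```

With these assumptions the HiSeq comes out at roughly $9,900 and 3.6 days per genome, in line with the $10 000 figure above, against roughly $126,000 and 30 days for the initial PacBio machine.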

Of course, the competition isn’t just based on pure throughput. Read length and error rates are also important; the 3rd gen machines will also have much longer read lengths than Illumina and SOLiD, and we expect that the quality of sequence will be higher as well, giving the possibility of some real Gold Standard genomes being produced from these machines, rather than the somewhat messy genomes we get from Illumina.

This all ties in to the conversation I had with the Illumina people at ASHG; Illumina think that it’ll be a good few years before 3rd Gen sequencing can catch up with their current machines. I expect that, between now and 2014 (when PacBio release v2 of their machine), the major sequencing centres will keep a combination of 2nd and 3rd gen machines. The 2nd gen machines will be used when a very large amount of low-quality sequence is required, such as for Genome-Wide Association Studies or RNA-seq. The 3rd gen machines will be used for assembling genomes, looking for copy-number variations and studying the genetics and epigenetics of non-coding and repetitive regions.

I guess what I’m trying to say is that, as exciting and cool as the single-molecule technologies of PacBio and Oxford Nanopore are, it is far too soon to announce the death of Second Gen sequencing. If Illumina continues to push its throughput as hard as it is doing now, 2nd generation machines will be widely used for a long while yet.

The future will become a bit clearer at the AGBT conference, where we should see some big announcements from PacBio, Oxford Nanopore, Complete Genomics, ABI and Illumina. A host of other bloggers and I will be there to cover them.