ASHG: Epistasis and the Missing Heritability

As if no time has passed at all since the sunny shores and lost laptops of the American Society of Human Genetics 2009 meeting, ASHG 2010 has rolled around, this time in Washington DC. As always, I’m going to be trying to write a few thoughts on the conference every day, though this year it may be split between here and Genomes Unzipped.

I’ll also be semi-live-tweeting (wifi coverage is patchy), so you can get up-to-the-minute details of all the talks on my twitter feed (@lukejostins), or from other tweeps via the hashtag #ASHG2010.

Epistasis and Missing Heritability

As Daniel observed on Twitter, I very nearly had a heart attack when Eric Lander, in his Distinguished Speaker’s talk about the Human Genome Project, said that the “missing heritability” is probably all down to epistasis (i.e. interactions between variants). His argument was that GWAS had low power to detect gene-gene interactions, and therefore there could be lots of them hanging around that could account for the unexplained variance.

This is a fallacy, and a big one.

Introducing: Genomes Unzipped

I’d like to introduce a new group blog, called Genomes Unzipped.

Genomes Unzipped is written by a number of major genetics bloggers, including Daniel MacArthur of Genetic Future, Dan Vorhaus of Genomics Law Report, Caroline Wright of the PHG Foundation, and Jan Aerts of Saaien Tist, as well as a number of other geneticists who are new to blogging. And me, of course.

The really exciting thing about this blog is the diversity of people who are working on it. We have people from a range of fields, including science, law and public health, who hold a range of opinions. Some of us are very gung-ho about direct-to-consumer genetics, and enthusiastic about the benefits, while others are sceptical, and more in favour of regulation. While we’ve been putting together the blog, we’ve had many meetings between contributors, and with outside advisers (such as ethicists), and I have already learned so much before anything has been published. So, needless to say, I am pretty excited about the whole project.

The blog is currently in beta; what you see now is not the finished version. Soon, the project will expand in scope; stay tuned for more announcements.

Crunching The Data on Human Brain Evolution

Lee, S., & Wolpoff, M. (2003). The pattern of evolution in Pleistocene human brain size. Paleobiology, 29(2), 186-196. DOI: 10.1666/0094-8373(2003)0292.0.CO;2


There has been a bit of debate around the biology blogoverse recently about the evolution of human brain size. It started off as an “idle speculation” type argument, but then took a swerve into the evidencey regions of science, which is always satisfying. The upshot of this is that I’ve spent today riffing statistically on some fossil data from a relatively old paper on the evolution of human brain size, and seeing if I can tease some interesting tidbits out of it.

Some background

It all kicked off when neuroscientist Colin Blakemore made some comments on the evolution of the human brain. He argued that the large increase in brain size we see around 200k years ago may have been a useless “macromutation” that was tolerated due to the abundance of food. The evolutionary implausibility of this was evaluated unfavourably by Jerry Coyne, among others.

I’ve heard lots of plausible reasons that human brain size may have seen an increase 200k years ago or so; my personal favourite is that it was due to the discovery of fire, which made food easier to eat, giving more calories and allowing children to be born without the fully connected skull required to attach the jaw muscles for chewing tough food. This would loosen up constraints on selection for brain size, allowing it to rush ahead.

Either way, none of the hypotheses proposed have had much strong evidence put forward for them yet (maybe when we bring some Evo-Devo to the party?), and this is particularly true of Blakemore’s theory.

More interesting, and more answerable, is the question of whether there was a sudden increase in brain size at all, and if so when exactly it happened. John Hawks put up a graph of data from a review of human brain evolution by Lee and Wolpoff, and used it to argue that brain size has been increasing gradually for millions of years, with no recent “jump”. In response, Ciarán Brewster did some basic number crunching to establish that, even if there wasn’t a sudden macromutation 200 kya, the human brain seems to have been increasing in size faster over the last 200 kya, compared to before that.

If and when did brain growth speed up?

I decided to delve a little bit deeper into the data from the paper. The problem with Ciarán’s analysis is that it assumes that if there was a speed-up, it started 200k years ago. This is a slightly problematic assumption, mostly because of the winner’s curse (oh go look it up).

To do a more rigorous test for the existence of a speed-up, and to estimate when the speed-up happened if there was one, I fitted a least-squares break-point model (a model where the slope of a trend line changes on either side of some point). I compared this to the basic linear trend-line, to see if it explains the data significantly better.
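For the curious, here is a minimal sketch of how a fit like this can be done in R. This is not the code I actually used (which, as I mention at the end of the post, is in too nasty a state to share); it runs on entirely synthetic stand-in data, and the fossils/age/capacity names are placeholders rather than anything from the paper.

    # Entirely synthetic stand-in data (94 points), just so the sketch runs end to end
    set.seed(42)
    fossils <- data.frame(age = runif(94, 10, 1800))                  # kya before present
    fossils$capacity <- 900 + 500 * exp(-fossils$age / 400) + rnorm(94, 0, 80)

    # Least-squares break-point fit: grid-search the break point, allowing the
    # slope to change on either side of it
    fit_breakpoint <- function(data, candidates = seq(100, 1500, by = 1)) {
      rss <- sapply(candidates, function(b) {
        fit <- lm(capacity ~ age + pmax(age - b, 0), data = data)
        sum(resid(fit)^2)
      })
      best <- candidates[which.min(rss)]
      list(breakpoint = best,
           model = lm(capacity ~ age + pmax(age - best, 0), data = data))
    }

    linear <- lm(capacity ~ age, data = fossils)
    broken <- fit_breakpoint(fossils)

    # F-test: does the extra slope significantly improve on the simple trend line?
    anova(linear, broken$model)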

Here is what I get (the points are the fossils, the dashed line is a linear fit, and the coloured lines are the break-point model):

The model shows a definite speed-up of brain size increase recently, and fits the data significantly better than a simple trend line (F(1,90) = 15.8, p < 10^-5). I estimate that the speed-up occurred 252kya, and can say with 95% confidence that it lies between 203 and 377 kya. This result is pretty robust to exactly what model we use; I also tried using a model where brain size grew exponentially with time, and this gave a similar break-point: 250kya, with a 95% interval of 167-402 kya (see this graph).

If you prefer non-parametric statistics, here is a loess smoothing of the data, showing a clear kink around 280kya:
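A smooth like this is only a couple of lines of R; again, this is a sketch using the synthetic stand-in data frame from above, not the real dataset:

    # Non-parametric loess smooth of cranial capacity against age
    lo <- loess(capacity ~ age, data = fossils)
    ord <- order(fossils$age)
    plot(fossils$age, fossils$capacity, xlab = "Age (kya)", ylab = "Cranial capacity (cc)")
    lines(fossils$age[ord], predict(lo)[ord])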

Colin Blakemore’s theory of a single, sudden macromutation is, of course, inconsistent with the data; given that each sample is an individual human, if large brain size were a macromutation we’d see each sample either having a massive brain, or a small one. But there does appear to be a change in the processes driving brain evolution somewhere between 200 and 400 kya.

Appendix 1: Adding sex into the mix

Examining the literature (and by “examining the literature” I mean “googling the sample names”), of the 94 samples in the dataset, 30 are probably male, 24 are probably female, and 40 have not been sexed, or I couldn’t find out the sex (my analysis was pretty haphazard, so don’t take this as gospel). Both sexes have the same age distribution (see here), so it is unlikely that sex would confound the break-point. However, just to be on the safe side, here is the analysis using just males:

And just females:

Both show a similar break-point. The break-point model explains the data significantly better than the linear fit for males (F(1,26) = 20.4, p < 10^-3) but not for females (F(1,20) = 2.9, p = 0.105). Maybe some hint that the change in brain growth was greater for men than for women? [Insert fruitless and subconsciously sexist speculation here]. We don't have much data, and you'd want to get an expert to classify the sexes before you drew any conclusions.
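In terms of the sketch above, this is just the same break-point fit run within each subset (the sex column below is synthetic; the real classifications are in the data file linked at the end of the post):

    # Add a synthetic sex column, then re-run the break-point fit within each sex
    fossils$sex <- sample(c("male", "female", "unknown"), 94, replace = TRUE)
    broken_m <- fit_breakpoint(subset(fossils, sex == "male"))
    broken_f <- fit_breakpoint(subset(fossils, sex == "female"))
    c(male = broken_m$breakpoint, female = broken_f$breakpoint)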

Appendix 2: Lee and Wolpoff’s Slightly Weird Model

In the original paper, Lee and Wolpoff argue that there is in fact no discontinuity anywhere, essentially because if you plot the log of brain size against the log of time, you get a straight line all the way back (Figure 3 in the paper). A log-log relationship corresponds to fitting an exponent model of Y = Ax^b (where Y is brain size and x is the number of years before present), which looks like this:

And indeed, under this model, you cannot find any break points that will significantly improve the fit. The model is pretty weird though; in our case, it basically corresponds to brain size being inversely proportional to the fifth root of time (b = -0.2). Linear growth makes sense, and exponential growth corresponds to the growth in brain size being proportional to the current brain size, but what does inverse-quintic-root growth correspond to? Maybe there is something fundamental going on here, but I expect the goodness of the fit in the graph above is just down to the flexibility of the exponent model, and I’d consider any conclusions drawn from a log-log transformation of the data to be somewhat dodgy (or at least, very underpowered, as the flexibility of the model will tend to obscure true discontinuities).

The authors’ strange choice of model appears to stem from a slight misunderstanding:

A logarithmic transformation may help avoid the problem of interdependence within the data set because it can be derived from the assumption that the rate of change of cranial capacity at any particular time is proportional to the cranial capacity of the sample at that time

But of course this describes an exponential model, which corresponds to a log-linear transformation (the analysis I did above, which still showed a break-point); they performed a log-log transformation, which corresponds to an exponent model.
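The distinction is easy to see in R (again a sketch on the synthetic stand-in data from above; all the ages are positive, so the logs behave):

    # Exponential model: rate of change of capacity proportional to capacity.
    # This is the log-linear fit, which still shows a break-point.
    exp_model <- lm(log(capacity) ~ age, data = fossils)

    # Exponent (power-law) model: capacity = A * age^b, i.e. the log-log fit
    # Lee and Wolpoff use; the coefficient on log(age) is the exponent b.
    power_model <- lm(log(capacity) ~ log(age), data = fossils)
    coef(power_model)["log(age)"]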


The data I used, including sex classification, is all here. I’d normally put the code I’d used here as well, but in this case it is in a pretty nasty state. If anyone wants it, I can clean it up.

International Women’s Day

Few tragedies can be more extensive than the stunting of life, few injustices deeper than the denial of an opportunity to strive or even to hope, by a limit imposed from without, but falsely identified as lying within.

- Stephen Jay Gould, The Mismeasure of Man

Today is International Women’s Day; today, we remember the all-too-quickly forgotten contributions of women to society, to industry, and to intellectual thought. It is also, importantly, a day to remember the hardships that women have had to fight through, and how far we still have to go to conquer the biases and prejudices that still exist at every level of society. In particular, I think that this should be a time for us scientists to consider how our field and our institutions have treated women, and how science can be used to hurt or help the cause of women.

Biologists have a far from untainted history when it comes to the treatment of women. The late Stephen Jay Gould, in his essay Women’s Brains, documented how the fathers of anthropometry, led by Paul Broca, used bad statistics on brain measurements to infer a general lack of intelligence in women, going on to argue against their access to higher education. Milder versions of this sort of thinking still go on today; in recent years, we’ve seen Lawrence Summers, then president of Harvard, comment that there may be a general lack of highly intelligent women compared to men, because of a (speculated) innate difference in the spread of intelligence, and that this may drive the lack of women in high-up academic positions. For better or for worse, the public still have a lot of trust in scientists, and unfounded and speculative statements such as these can do a lot to reinforce existing prejudices.

Despite dodgy patches in our past, science is far from an inherently sexist process. As Gould made clear in his take-down of Broca, the opinions the anthropometricists held were bad science; they were based on a shallow understanding of the statistics, and a failure to look at the factors that contributed to brain differences. It is bad science (fueled by unacknowledged biases) that leads people to make these statements, and ultimately it will be good data that counteracts it.

In that spirit, I have read a nice crop of blog posts over the last few months. In a series of posts collected together for Women’s Day, Ed Yong explains how the lack of female grandmasters is entirely explained by sex differences in the number of people taking up chess, that differences in maths ability are driven by gender inequality, that the ‘larger variation’ hypothesis falls down when you look across different cultures, and that unconscious but identifiable gender stereotypes in a society correlate strongly with sex differences in ability.

Another, unexpected source for challenging gender stereotypes is OkTrends, the data blog of the dating site OkCupid. A recent post was on “Older Women” (in this case, defined as women in their 30s and 40s); many men refuse to date women in this age range, citing beliefs that they are neurotic, non-sexual, unattractive, and too serious about relationships. However, none of these things are true; the data show that women in their 30s and 40s are more self-confident, less sexually conservative, and less concerned about finding a relationship that leads to marriage; and while some women whose looks are particularly youth-dependent may lose their attractiveness with age, the majority (90% or more) of women look just as attractive at 40 as they did at 18. Men’s opinions of these women are not based on any actual experience of them; they are based on stereotypes about ‘older women’, which are inventions, propagating through a youth-obsessed culture.

These studies, and dozens of others, are building up an accurate picture of how society, through the subtle effects of institutions, expectations and stereotypes, partitions men and women into different roles; and once we know how it happens, we can attempt to counteract it.

AGBT: Speculating on Third Gen Tech

So, AGBT is over. I’ve reported on the existing tech in my previous post; one thing that I haven’t covered so far is 3rd Generation sequencing. Time to rectify this.

We have three major players, two of which had a strong presence at AGBT. Pacific Biosciences had a major launch (covered extensively elsewhere), and Life Technologies gave a surprisingly awesome presentation on their new Quantum Dot sequencing technology, QDot. Left out was Oxford Nanopore, the other major player in the 3rd Gen sphere; they did not present anything at AGBT, and I hope they all know that I am very angry about this.

We now have some information about these technologies; we know, in broad terms, how they work, and we can make some guesses about how they’ll compare. Based on the extremely limited amount of data we have at the moment, and a few speculative computer simulations (the R code for which can be read here), I’m going to draw some overall conclusions about how each tech will perform in terms of read length, yield, and accuracy.

To get a lot of this data, I’ve made “educated guesses” at the parameters (mostly based on what we know from the PacBio machine). This is an ‘all things being equal’ analysis; I assume that PacBio, QDot and Nanopore all have the same read density, and the same enzyme efficiency; i.e. that the probability of the QDot polymerase dying is the same as for the PacBio polymerase, which are both the same as the probability of the DNA strand falling off the nanopore. If any of the companies feel like providing me with their molecule densities and decay parameters for their enzymes (ha!), I’ll happily fix the plots.
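To give a flavour of the sort of model I mean, here is a toy sketch in R (not the actual simulation code linked above); the death rate is a made-up illustrative number, not a real PacBio, QDot or Nanopore parameter:

    # Toy read-length simulation: a read ends when the enzyme "dies" (or the strand
    # falls off the pore), with a constant per-base probability of doing so.
    simulate_read_lengths <- function(n_reads, per_base_death_rate) {
      rgeom(n_reads, prob = per_base_death_rate)
    }
    summary(simulate_read_lengths(1e5, per_base_death_rate = 1e-3))  # mean read length ~1kb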

I really must repeat: all of these graphs and figures are guesses. I have no actual data beyond what has been publicly announced by the companies. I fully expect much of this to be proved wrong over the next year: this is just my guess at what the machines may look like. I should especially emphasise that I have not used any information sourced via my employer about any of these technologies.


AGBT: Sequencing Tech Lowdown

Alright, it’s time to address the meat of the matter of AGBT; the state of play of sequencing technology. I’ll go through each of the major companies in turn, and talk about what they’ve brought to the table, and what the future holds for them.

As usual, for more in depth information you can follow me on twitter (@lukejostins). Other coverage can be found on Genetic Future, MassGenomics, Fejes, GenomeWeb and Bio IT World.

Illumina

I covered Illumina on day zero. Basically, the GAIIx can now generate 7Gb/day, with 2x150bp, and error rates universally under 2%. The HiSeq generates 31Gb/day, 2x100bp, with error rates under 1%; this will soon be pushed to 43Gb/day with a slight decrease in accuracy. For sheer volume of sequence, no-one can match Illumina.

454

As I said yesterday, 454‘s median read lengths are climbing into the 700-800bp range, but error rates get pretty high beyond 600bp or so. Not bad, but after all the fuss over 1000bp reads, also a little disappointing.

454 have been pushing their work on assembly; they’ve worked pretty hard to make an easy-to-follow recipe, involving both single-end and paired-end sequencing, and the program Newbler. Many interesting critters have had this treatment, including bonobo, panda and Desmond Tutu (in order of majesty).

SOLiD

I found the SOLiD content of this conference very cool. Focusing more on the medical genomics side of things, SOLiD is involved in various clinical trials to see whether genomic information can increase cancer survival times, and is emphasizing the importance of accuracy in a clinical setting.

Lots of cool new tech too: for instance, mixing 2-base and 1-base encoding, apparently making error rates of 1 in 10^6 possible. Apparently library prep errors now dominate, so SOLiD has been working on finding gentler enzymes for amplification. Particularly cool was a throw-away slide on running the ligase on single molecules and actually getting signal (though actual single-molecule sequencing probably isn’t economic).

Pacific Biosciences

Pacific Biosciences have produced an extremely interesting product; it is a game-changer, though exactly what it means for sequencing is not immediately obvious. I am going to hold back on writing about PacBio right now, because I have a more in-depth post on the exact specs and implications of the PacBio, in comparison to their nearest equivalent Oxford Nanopore, in the works.

Complete Genomics

Complete Genomics have gone from “interesting idea” to “thriving technology” in a very short period of time. They’re scaling up their sequencing centre as we speak; they’ll have 16 machines in the next few months, generating 500 40X genomes a month. Over the year, provided they get more orders, they’ll scale up to 96 machines, with a predicted 5X increase in capacity per machine as well. If this all goes well, in theory they are on target for their 5000 genomes by the end of the year.

Complete also have some very interesting new technologies on the horizon, which they will be discussing tomorrow; check the twitter feed for coverage. A lot of people underestimate Complete Genomics, but it is starting to become evident that they are as much game-changers as the flashier technologies.

Ion Torrent

Ion Torrent wins both my major awards this year: the “most surprising release” award and the “sounds most like a Soviet weapons project” award. Ion Torrent burst onto the scene with its tiny machine (GS Junior sized); the first major non-fluorescence-based method in a long time, using the emission of hydrogen ions from the DNA polymerase reaction to measure incorporation in a 454 stylee.

The machine can produce a rapid 150Mb or so in a single hour run, for about $500 in disposables. The machine itself costs a tiny $50k. From what I’ve heard, a lot of people are interested in a machine like this for fast library validation, though it also has applications in diagnostics and microbiology. Unfortunately, it looks like the error rates are currently high, though they claim these will drop by release time.

Summary

Overall, we are starting to see a divergence in sequencing technologies, as each tech concentrates on having clearly defined advantages and potential applications that differ from all others. This means that the scientists themselves can more closely tailor their choice of tech to fit their situation. Are you a small lab that needs 10 high-quality genomes on a budget? Go to Complete. Want a cheap, fast machine for library validation? Use Ion Torrent. Setting up a pipeline for sequencing thousands of genomes? Go Illumina.

I suppose this was all driven by the fact that Illumina’s machine has such high yield that chasing them is a fool’s game, so everyone else is concentrating on what they can do that Illumina doesn’t. This is pretty good for science as a whole; we are moving away from the One-Size-Fits-All approach to high-throughput sequencing, and moving into a time of more mature, application-based methods.

AGBT: Taking the Statistics out of Statistical Genetics

The second day (or the first day, depending on whether you count yesterday’s pre-sessions) of AGBT is nearly done. There has been a lot going on today, but I’m only going to cover one thing; once again, you can get more detail on all the talks I’ve seen on my Twitter feed (@lukejostins).

Other things that I’ve done: I had a very interesting talk with Geoff Nilsen at Complete Genomics, in which I got to ask various questions, including “Why don’t you use color-space?” (“It confuses customers, and the error model is good enough already”) and “In what sense is Complete ‘3rd Gen’?” (“Because it’s cheaper”). I also saw a set of presentations from 454 on de novo assembly, and on the new Titanium 1k kit, which actually contains virtually no 1kb reads: mean read length is about 800bp, and beyond 600bp the error rates get very high.

There has been some other blog coverage of AGBT from our army of bloggers: MassGenomics has some first impressions, and Anthony Fejes is uploading his detailed notes about all the talks. You can also follow a virtual rain of tweets on the #AGBT hashtag.

Fun with Exome Sequencing

Debbie Nickerson (again!) gave a talk about sequencing genomes to hunt down the genes underlying Mendelian disorders. The process is very simple; you sequence 4-10 exomes of sufferers, look for non-synonymous mutations shared between them, and then apply filters (such as presence in HapMap exomes) to find SNPs that are likely to be causal. Debbie is in the process of sequencing 200 exomes for 20 diseases, and has a real success story under her belt in tracking down the genes for 2 disorders. She raised the interesting question of how to validate the discovered genes, given that Mendelian disorders tend to have a large number of independent mutations.
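The core filtering step really is that simple; here is a sketch of the logic in R, with entirely hypothetical object and column names (not anything from Debbie’s actual pipeline):

    # `exomes` is a hypothetical list of per-patient variant tables, each with a
    # variant identifier and a consequence annotation; `hapmap_variants` is a
    # hypothetical vector of variants seen in HapMap exomes.
    nonsyn <- lapply(exomes, function(x) x$variant_id[x$consequence == "non-synonymous"])
    shared <- Reduce(intersect, nonsyn)             # variants shared by all sufferers
    candidates <- setdiff(shared, hapmap_variants)  # drop anything seen in HapMap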

Stacey Gabriel gave a related talk on exome sequencing, focusing on using the method Debbie described to track down rare variants for complex traits. To do that, you ‘Mendelianise’ the trait by only picking extreme individuals; she did this for high and low LDL-cholesterol, giving some candidate genes, but no smoking gun.

Let’s look slightly closer at this; you sequence a number of individuals with extreme traits, look for genes with shared non-synonymous mutations, and look for functional effects. This is a linkage study! A very small and underpowered linkage study, with a variant-to-gene collapse method (like a poor man’s lasso), and some sort of manual pathway/functional analysis (a poor man’s GRAIL), but linkage all the same. This is really re-inventing the wheel, without really learning any of the lessons that the first round of linkage analysis taught us (or even stopping to ask whether, if such variants existed, they would have been picked up by linkage in the first place).

It is not that Stacey Gabriel is doing anything wrong; it is just that she is failing to consider that she is attempting to solve non-statistically a problem that statisticians have worked on for decades. In short, she is risking taking the statistics out of statistical genetics.

AGBT: Running Sequencing Facilities and Illumina’s Ever-Growing Capacity

The first day of the Advances in Genome Biology and Technology conference is not until tomorrow, but today there were a couple of pre-sessions. These were on pretty much opposite ends of the spectrum; one was a series of general, high level talks by the users of high-throughput sequencing, and another was a series of technical talks by a manufacturer of sequencing machines.

As usual, this blog post is just a summary of a few aspects that I found interesting. More in depth coverage can be found on my Twitter feed, @lukejostins.

Running a Sequencing Facility

The subject of how to build, run and scale up a sequencing facility may seem, at first glance, a little dry; but I found two talks by two heads of sequencing labs fascinating.

Debbie Nickerson gave a talk about her experience with scaling from a few machines to a major operation. Debbie seems to keep it running smoothly mostly through collecting crazy amounts of data on samples, libraries and runs, and producing a range of tools to quickly examine this data to locate problems, and to keep information flowing between the different parts of the lab. A nice example was a tool to flag up common library failures from sequence data, and automatically e-mail the library prep team to re-prep the sample.

Susan Lucas gave an overview of the kind of questions you need to think about when planning a genome center. There are obvious things you need to consider; for example, making sure that you can transfer, store and process data. However, she also talked about some more interesting questions: Can your pipeline incorporate new technologies, or new platforms? Can it handle plant DNA, or E. coli? Have you considered the ergonomics of the space; will repetitive tasks cause repetitive strain injury? Do you have an emergency strategy? What will you do if the sprinklers go off?

Sequencing center logistics is right at that interesting intersection between data management, tool development, statistical inference and decision theory; I love the contrast between the statistically high-flying (“detect signs of quality degradation in sequencing imaging”) and the mundane (“tell Bob to clean the lenses”).

Illumina’s Presentations

The late afternoon was taken up by a series of talks by Illumina on new developments in their sequencing tech. It was interesting to see what they would present, given that Illumina have already broken their big story with the HiSeq 2000; rather than any single big announcement, the talks were all about how extra sequencing capacity is being squeezed out of the existing technologies (though ‘squeezing’ is probably not the right word for Illumina’s 12X increase in GAIIx sequencing yield over the last year).

Particularly interesting was Sheila Fisher‘s talk on the performance of the Broad Institute’s Illumina pipeline; 60% of their machines now run the 2X100bp protocol, producing over 5Gb per day per machine, with the other machines running 2X150bp at a higher cluster density, giving 7Gb/day. For the latter, a single machine could produce a 30X genome in a single 2-week run, or all the sequence produced in Pilot 1 of the 1000 Genomes Project in 9 months. Once the HiSeq is brought up to the same cluster density, it will produce 43Gb per day; enough to generate all the Pilot 1 data in 6 weeks.
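A quick back-of-envelope check of those numbers, assuming a roughly 3Gb human genome and taking the Pilot 1 total to be whatever nine months at 7Gb/day implies (I don’t have the official figure to hand):

    # Back-of-envelope arithmetic for the claims above
    gaiix_rate <- 7                      # Gb per day per machine (2x150bp protocol)
    hiseq_rate <- 43                     # Gb per day (projected)
    gaiix_rate * 14 / 3                  # coverage from a two-week run: ~33X
    pilot1_total <- gaiix_rate * 9 * 30  # implied Pilot 1 yield: ~1.9Tb
    pilot1_total / hiseq_rate / 7        # weeks for the HiSeq to match it: ~6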

The scale of this sequencing production is staggering; the HiSeq could get to 43Gb a day without any new innovations, and I expect that there is another 2-4X increase in capacity that Illumina could bring in from incremental changes to cluster density and image processing over the next year. As I’ve said before, second generation sequencing still has a lot of room to grow.

Off to AGBT

Tomorrow morning I head off to the Advances in Genome Biology and Technology conference (AGBT for short) on Marco Island, Florida; as someone who loves the cold and hates warm places, this is not as exciting for me as you may think. One thing that is exciting, however, is the nice genomics blog presence at the conference; me, Daniel MacArthur, Anthony Fejes, Dan Koboldt and David Dooling will all be playing our parts as Ambassadors to the Blogosphere. Interestingly, assuming independence, there is a 68% chance that at least one of us will get eaten by an alligator; watch this space!

The high blog coverage is justified; we are expecting to get a feel for how the field of DNA sequencing tech will advance over the next year. I will be particularly interested in seeing what Complete Genomics have to report, as well as the 3rd gen sequencing presentations from Pacific Biosciences and Life Technologies (ABI). One group notable for their lack of a presentation is Oxford Nanopore, which is a shame; I’m sure Nanopore will be talked about plenty anyway.

I have recently got a brand new laptop to replace the brand new laptop that I lost at ASHG last year, and I’m going to keep up the same schedule I did then; a daily blog post summing up the day’s highlights, and more detailed, up-to-the-minute coverage of every talk I see on my twitter feed (@lukejostins). For more AGBT twittering, I think people are going to be using the hashtag #AGBT.

As is traditional when I go away, I will also be sending a daily e-mail with amusing things that have occurred at the conference, but that is promised to my girlfriend Hannah, so you will, alas, not get to read it.

David Goldstein Proves Himself Wrong

A recent paper in PLoS Biology by David Goldstein’s group is being seen as another ‘death of GWAS’ moment (again?). I have a lot of issues with this paper, but I will be brief and stick to my main objection: the authors attempt to demonstrate that common associations can be caused by sets of rare variants, and in doing so inadvertently show that most of them are not.

The Paper and the Press

This is another example of a scientific paper being careful to make no solid, controversial claims, but being surrounded by a media story that is not justified by the paper itself. The only real solid claim in the paper is that, if you do not include rare SNPs in your genome-wide association study, and rare SNPs of large effect are contributing to disease, then you will sometimes pick up more common SNPs as associated, because they are in Linkage Disequilibrium with the rare SNPs. Pretty uncontroversial, in so far as it goes. The paper makes no attempt to say whether this IS happening, just says that it CAN happen, and that we should be AWARE of it.

However, in the various articles around the internet, this paper is being received as if it makes some fundamental claim about complex disease genetics; that this somehow undermines Genome-Wide Association Studies, or shows their results to be spurious. David Goldstein is quoted on Nature News:

…many of the associations made so far don’t seem to have an explanation. Synthetic associations could be one factor at play. Goldstein speculates that, “a lot, and possibly the majority [of these unexplained associations], are due to, or at least contributed to, by this effect”.

Another author is quoted here as saying

We believe our analysis will encourage genetics researchers to reinterpret findings from genome-wide association studies

Much of the coverage conflates this paper with the claim that rare variants may explain ‘missing heritability’, which is an entirely different question; Nature News opens with the headline “Hiding place for missing heritability uncovered”. Other coverage can be found on Science Daily, Gene Expression and GenomeWeb.

Does this actually happen?

Is all this fuss justified? How common is this ‘synthetic hit’ effect; are a lot of GWAS hits caused by it, or hardly any? There are many ways that you could test this; for instance, you could make some predictions about what distribution of risk you’d expect to see in the many fine mapping experiments that have been done as follow ups to Genome-Wide Association Studies (this would be trivially easy to do using the paper’s simulations).

However, there is an even easier way to test the prevalence of the effect. If most GWAS hits are tagging relatively common variants, then you would expect to see most disease associated SNPs with a frequency in the 10% to 90% range (the range for which GWAS are best powered). However, a SNP with a frequency of 50% is less likely than one with a frequency of 10% to tag a SNP with frequency 0.5%, so if most GWAS hits are tagging rare variants, then you would expect to see most associated SNPs with a frequency skewed towards the very rare or the very common.
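One way to see the intuition (my own back-of-envelope illustration, not something from the paper): if a rare causal allele with frequency q only ever occurs on the background of a tag allele with frequency p, the maximum possible r² between the two is q(1-p)/(p(1-q)), which is tiny for common tags and only becomes respectable as the tag itself gets rare:

    # Maximum possible r^2 between a tag SNP (allele frequency p) and a rare causal
    # variant (frequency q), when the rare allele occurs only on the tag background.
    max_r2 <- function(p, q) q * (1 - p) / (p * (1 - q))

    # A 0.5% causal variant is far better tagged by a rare SNP than by a common one
    # (all frequencies purely illustrative)
    max_r2(p = 0.5,  q = 0.005)   # ~0.005
    max_r2(p = 0.1,  q = 0.005)   # ~0.045
    max_r2(p = 0.01, q = 0.005)   # ~0.5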

In fact, the paper makes an explicit calculation of the expected frequency distribution of GWAS hits under their synthetic model. In-double-fact, the paper plots this distribution against the distribution of known GWAS hits. And here is that plot, taken directly from the paper (Figure 5):

The green line is the expected frequency distribution of ‘synthetic’ associations; the red line is the actual distribution. We can see that the GWAS hits we do see fail to follow the distribution for synthetic associations; in fact, they follow pretty much exactly the distribution we’d expect if most common associations are tagging common causal SNPs.

The paper manages to pretty conclusively demonstrate both that synthetic associations can occur, and that they rarely do.


Dickson, S., Wang, K., Krantz, I., Hakonarson, H., & Goldstein, D. (2010). Rare Variants Create Synthetic Genome-Wide Associations. PLoS Biology, 8(1). DOI: 10.1371/journal.pbio.1000294