Author Archives: Luke

AGBT: Sequencing Tech Lowdown

Alright, it’s time to address the meat of the matter of AGBT: the state of play of sequencing technology. I’ll go through each of the major companies in turn, and talk about what they’ve brought to the table and what the future holds for them.

As usual, for more in-depth information you can follow me on Twitter (@lukejostins). Other coverage can be found on Genetic Future, MassGenomics, Fejes, GenomeWeb and Bio IT World.


Illumina

I covered Illumina on day zero. Basically, the GAIIx can now generate 7Gb/day, with 2x150bp reads and error rates universally under 2%. The HiSeq generates 31Gb/day at 2x100bp, with error rates under 1%; this will soon be pushed to 43Gb/day with a slight decrease in accuracy. For sheer volume of sequence, no-one can match Illumina.


454

As I said yesterday, 454’s median read lengths are climbing into the 700-800bp range, but with error rates getting pretty high beyond 600bp or so. Not bad, but after all the fuss over 1000bp reads, also a little disappointing.

454 have been pushing their work on assembly; they’ve worked pretty hard to make an easy-to-follow recipe, involving both single-end and paired-end sequencing, and the program Newbler. Many interesting critters have had this treatment, including bonobo, panda and Desmond Tutu (in order of majesty).


SOLiD

I found the SOLiD content of this conference very cool. Focusing more on the medical genomics side of things, SOLiD is involved in various clinical trials to see whether genomic information can increase cancer survival times, and is emphasizing the importance of accuracy in a clinical setting.

Lots of cool new tech too: for instance, mixing 2-base and 1-base encoding, apparently making error rates of 1 in 10^6 possible. Library prep errors now apparently dominate, so SOLiD has been working on finding gentler enzymes for amplification. Particularly cool was a throw-away slide on running the ligase on single molecules and actually getting signal (though actual single-molecule sequencing probably isn’t economic).

Pacific Biosciences

Pacific Biosciences have produced an extremely interesting product; it is a game-changer, though exactly what it means for sequencing is not immediately obvious. I am going to hold back on writing about PacBio right now, because I have a more in-depth post in the works on the exact specs and implications of the PacBio, in comparison to their nearest equivalent, Oxford Nanopore.

Complete Genomics

Complete Genomics have gone from “interesting idea” to “thriving technology” in a very short period of time. They’re scaling up their sequencing centre as we speak; they’ll have 16 machines in the next few months, generating 500 40X genomes a month. Over the year, provided they get more orders, they’ll scale up to 96 machines, with a predicted 5X increase in capacity per machine as well. If this all goes well, they are in theory on target for their 5000 genomes by the end of the year.

Complete also have some very interesting new technologies on the horizon, which they will be discussing tomorrow; check the twitter feed for coverage. A lot of people underestimate Complete Genomics, but it is starting to become evident that they are as much game-changers as the flashier technologies.

Ion Torrent

Ion Torrent wins both my major awards this year: the “most surprising release” award and the “sounds most like a Soviet weapons project” award. Ion Torrent burst onto the scene with its tiny machine (GS Junior sized); the first major non-fluorescence-based method in a long time, using the emission of hydrogen ions from the DNA polymerase reaction to measure incorporation, in a 454 stylee.

The machine can produce a rapid 150Mb or so in a single hour run, for about $500 in disposables. The machine itself costs a tiny $50k. From what I’ve heard, a lot of people are interested in a machine like this for fast library validation, though it also has applications in diagnostics and microbiology. Unfortunately, it looks like the error rates are currently high, though they claim these will drop by release time.


Overall, we are starting to see a divergence in sequencing technologies, as each tech concentrates on having clearly defined advantages and potential applications that differ from all others. This means that the scientists themselves can more closely tailor their choice of tech to fit their situation. Are you a small lab that needs 10 high-quality genomes on a budget? Go to Complete. Want a cheap, fast machine for library validation? Use Ion Torrent. Setting up a pipeline for sequencing thousands of genomes? Go Illumina.

I suppose this was all driven by the fact that Illumina’s machine has such high yield that chasing them is a fool’s game, so everyone else is concentrating on what they can do that Illumina doesn’t. This is pretty good for science as a whole; we are moving away from the One-Size-Fits-All approach to high-throughput sequencing, and moving into a time of more mature, application-based methods.

AGBT: Taking the Statistics out of Statistical Genetics

The second day (or the first day, depending on whether you count yesterday’s pre-sessions) of AGBT is nearly done. There has been a lot going on today, but I’m only going to cover one thing; once again, you can get more detail on all the talks I’ve seen on my Twitter feed (@lukejostins).

Other things that I’ve done: I had a very interesting talk with Geoff Nilsen at Complete Genomics, in which I got to ask various questions. “Why don’t they use color-space?” “It confuses customers, and the error model is good enough already.” “In what sense is Complete ‘3rd Gen’?” “Because it’s cheaper.” I also saw a set of presentations from 454 on de novo assembly, and the new Titanium 1k kit, which actually contains virtually no 1kb reads: mean read length is about 800bp, and beyond 600bp the error rates get very high.

There has been some other blog coverage of AGBT from our army of bloggers: MassGenomics has some first impressions, and Anthony Fejes is uploading his detailed notes about all the talks. You can also follow a virtual rain of tweets on the #AGBT hashtag.

Fun with Exome Sequencing

Debbie Nickerson (again!) gave a talk about sequencing genomes to hunt down the genes underlying Mendelian disorders. The process is very simple: you sequence the exomes of 4-10 sufferers, look for non-synonymous mutations shared between them, and then apply filters (such as presence in HapMap exomes) to find SNPs that are likely to be causal. Debbie is in the process of sequencing 200 exomes for 20 diseases, and has a real success story under her belt in tracking down the genes for 2 disorders. She raised the interesting question of how to validate the discovered genes, given that Mendelian disorders tend to have a large number of independent mutations.
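The filtering strategy above can be sketched in a few lines, under the assumption that each exome is just a set of (gene, variant) calls; the variant names and data below are entirely illustrative, not anyone’s actual pipeline:

```python
# Toy sketch of the shared-variant filtering strategy described above.
# Each exome is a set of (gene, variant) pairs for non-synonymous calls;
# hapmap_variants stands in for a panel of known (hence likely benign)
# common variation. All specific variants here are made up.

def candidate_genes(exomes, hapmap_variants):
    """Return genes with a non-synonymous variant in every sufferer,
    after removing variants seen in the reference panel."""
    filtered = [
        {(gene, var) for (gene, var) in exome
         if (gene, var) not in hapmap_variants}
        for exome in exomes
    ]
    # Keep genes hit in *all* sufferers; different variants per person
    # are fine, since Mendelian disorders often show allelic heterogeneity.
    gene_sets = [{gene for (gene, _) in exome} for exome in filtered]
    return set.intersection(*gene_sets)

exomes = [
    {("DHODH", "R135C"), ("ACTB", "E117K")},
    {("DHODH", "G202A"), ("TTN", "A2T")},
    {("DHODH", "R135C"), ("ACTB", "E117K")},
]
hapmap = {("ACTB", "E117K")}  # seen in HapMap, so filtered out
print(candidate_genes(exomes, hapmap))  # {'DHODH'}
```

The real analyses involve far more careful variant calling and annotation, but the core logic really is this simple: filter, then intersect.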

Stacey Gabriel gave a related talk on exome sequencing, focusing on using the method Debbie described to track down rare variants for complex traits. To do that, you ‘Mendelianise’ the trait by only picking extreme individuals; she did this for high and low LDL-cholesterol, giving some candidate genes, but no smoking gun.

Let’s look slightly closer at this: you sequence a number of individuals with extreme traits, look for genes with shared non-synonymous mutations, and look for functional effects. This is a linkage study! A very small and underpowered linkage study, with a variant-to-gene collapse method (like a poor man’s lasso) and some sort of manual pathway/functional analysis (a poor man’s GRAIL), but linkage all the same. This is really re-inventing the wheel, without learning any of the lessons that the first round of linkage analysis taught us (or even stopping to ask whether, if such variants existed, they would have been picked up by linkage in the first place).
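For what it’s worth, the collapse step amounts to a simple burden-style count per gene; the sketch below uses made-up gene sets and skips the association test a real analysis would need:

```python
# Toy burden-style collapse: per gene, count carriers of rare
# non-synonymous variants in the extreme-high vs extreme-low groups.
# Gene names and data are illustrative only; a real analysis would use
# a proper test (e.g. Fisher's exact) plus heavy QC.
from collections import Counter

def collapse_counts(individuals):
    """individuals: list of sets of genes in which that person carries
    at least one rare non-synonymous variant. Returns carriers per gene."""
    counts = Counter()
    for genes in individuals:
        counts.update(genes)
    return counts

high_ldl = [{"PCSK9"}, {"PCSK9", "GENE2"}, {"PCSK9"}]
low_ldl = [{"GENE3"}, set(), {"GENE2"}]

high, low = collapse_counts(high_ldl), collapse_counts(low_ldl)
for gene in sorted(set(high) | set(low)):
    print(gene, high[gene], low[gene])
```

Collapsing variants to the gene level is exactly the sort of dimension-reduction that statistical geneticists have formal machinery for; doing it by eye is where the statistics leaks out.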

It is not that Stacey Gabriel is doing anything wrong; it is just that she is failing to consider that she is attempting to solve non-statistically a problem that statisticians have worked on for decades. In short, she is risking taking the statistics out of statistical genetics.

David Goldstein Proves Himself Wrong

A recent paper in PLoS Biology by David Goldstein’s group is being seen as another ‘death of GWAS’ moment (again?). I have a lot of issues with this paper, but I will be brief and stick to my main objection: the authors attempt to demonstrate that common associations can be caused by sets of rare variants, and in doing so inadvertently show that most of them are not.

The Paper and the Press

This is another example of a scientific paper being careful to make no solid, controversial claims, but being surrounded by a media story that is not justified by the paper itself. The only real solid claim in the paper is that, if you do not include rare SNPs in your genome-wide association study, and rare SNPs of large effect are contributing to disease, then you will sometimes pick up more common SNPs as associated, because they are in Linkage Disequilibrium with the rare SNPs. Pretty uncontroversial, in so far as it goes. The paper makes no attempt to say whether this IS happening, just says that it CAN happen, and that we should be AWARE of it.

However, in the various articles around the internet, this paper is being received as if it makes some fundamental claim about complex disease genetics; that this somehow undermines Genome-Wide Association Studies, or shows their results to be spurious. David Goldstein is quoted on Nature News:

…many of the associations made so far don’t seem to have an explanation. Synthetic associations could be one factor at play. Goldstein speculates that, “a lot, and possibly the majority [of these unexplained associations], are due to, or at least contributed to, by this effect”.

Another author is quoted here as saying

We believe our analysis will encourage genetics researchers to reinterpret findings from genome-wide association studies

Much of the coverage conflates this paper with the claim that rare variants may explain ‘missing heritability’, which is an entirely different question; Nature News opens with the headline “Hiding place for missing heritability uncovered”. Other coverage can be found on Science Daily, Gene Expression and GenomeWeb.

Does this actually happen?

Is all this fuss justified? How common is this ‘synthetic hit’ effect; are a lot of GWAS hits caused by it, or hardly any? There are many ways that you could test this; for instance, you could make some predictions about what distribution of risk you’d expect to see in the many fine mapping experiments that have been done as follow ups to Genome-Wide Association Studies (this would be trivially easy to do using the paper’s simulations).

However, there is an even easier way to test the prevalence of the effect. If most GWAS hits are tagging relatively common variants, then you would expect to see most disease associated SNPs with a frequency in the 10% to 90% range (the range for which GWAS are best powered). However, a SNP with a frequency of 50% is less likely than one with a frequency of 10% to tag a SNP with frequency 0.5%, so if most GWAS hits are tagging rare variants, then you would expect to see most associated SNPs with a frequency skewed towards the very rare or the very common.
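This intuition follows from standard LD algebra: the maximum possible r² between a causal variant at frequency q and a tag SNP at frequency p ≥ q is q(1−p)/(p(1−q)). A quick check (my own calculation, not taken from the paper):

```python
# Maximum possible r^2 between a causal variant at frequency q and a tag
# SNP at frequency p >= q, attained when every copy of the rare allele
# sits on a tag-allele haplotype. Standard LD algebra.
def max_r2(q, p):
    assert 0 < q <= p < 1
    return (q * (1 - p)) / (p * (1 - q))

rare = 0.005  # a 0.5% causal variant
print(round(max_r2(rare, 0.10), 3))  # tag SNP at 10% frequency
print(round(max_r2(rare, 0.50), 3))  # tag SNP at 50% frequency
# The 10% SNP can tag the rare variant roughly nine times better than
# the 50% SNP, which is why synthetic associations should show up
# preferentially at rarer tag SNPs.
```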

In fact, the paper makes an explicit calculation of the expected frequency distribution of GWAS hits, under their synthetic model. In-double-fact, the paper plots this distribution against the distribution of known GWAS hits. And here is that plot, taken directly from the paper (Figure 5):

The green line is the expected frequency distribution of ‘synthetic’ associations; the red line is the actual distribution. We can see that the GWAS hits we do see fail to follow the distribution for synthetic associations; in fact, they follow pretty much exactly the distribution we’d expect if most common associations are tagging common causal SNPs.

The paper manages to pretty conclusively demonstrate both that synthetic associations can occur, and that they rarely do.

Dickson, S., Wang, K., Krantz, I., Hakonarson, H., & Goldstein, D. (2010). Rare Variants Create Synthetic Genome-Wide Associations PLoS Biology, 8 (1) DOI: 10.1371/journal.pbio.1000294

Cargo Cult Science and NT Factor®

A recent blog post on Chronic Fatigue Syndrome linked in passing to a ‘treatment’ called Mitochondria Ignite™ with NT Factor®. This product caught my attention as an example of what Richard Feynman called ‘Cargo Cult Science’; a company dressing up like scientists, using chemical names and precise sounding figures, without actually having any science underlying it.

However, the product is arguably not exactly pure Cargo Cult Science; there is a small amount of science content present. The product page contains a number of references, some of which point to peer-reviewed journals, and some of which are actually studies of the effect of some of the contents of the drug on humans. Of course, taking apart the studies shows that the product is still unproven, despite the thin glaze of real science; I can’t help but feel that this sort of thing has slightly grim implications for the future of accurate consumer information.

Continue reading

The Future of Second Generation Sequencing

Illumina, the major player in high-throughput sequencing these days, have announced the newest version of their second generation sequencing platform, the HiSeq 2000. The machine can produce a lot more sequence, and at lower cost, than the previous Genome Analyzer II.

I’m not going into much detail about the machine: for that, see posts at Genomics Law Report, Genome Web, Genetic Future, Pathogenomics and PolITiGenomics. What I really care about is what this machine implies for the future of sequencing, and specifically what we can predict about the coming 2nd versus 3rd generation sequencing battles that will be kicking off later this year.

PacBio’s 3rd generation machine, which will be arriving later this year, will have an initial throughput of around 3Gb a day, at a price of around $1.40 per Mb in consumable costs. I don’t know the specs for Oxford Nanopore’s machine; my guess is that it will be similar, but we’ll know soon.

Compare PacBio’s capacity to the HiSeq 2000, which will produce 25 Gb per day, at a claimed consumables price of $0.11 per Mb ($10,000 for a 30X genome). In short, the Illumina 2nd gen machine is going to be able to pump out much more sequence at a much higher rate than PacBio. Both will rapidly increase the power of their machines after release, but we don’t know who will push faster (Dave Dooling thinks Illumina could push the HiSeq to 450 Gb per run with existing technology).
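Putting the quoted figures side by side (a back-of-envelope sketch; taking ~90 Gb of raw sequence for a 30X human genome is my assumption, and the per-Mb prices are vendor claims):

```python
# Back-of-envelope comparison using the figures quoted above.
# Treat everything as rough: prices are vendor claims, and 90 Gb
# for a 30X human genome is an assumption (30 x ~3 Gb genome).
GENOME_GB = 90

platforms = {
    "HiSeq 2000": {"gb_per_day": 25, "usd_per_mb": 0.11},
    "PacBio (initial)": {"gb_per_day": 3, "usd_per_mb": 1.40},
}

for name, p in platforms.items():
    days = GENOME_GB / p["gb_per_day"]
    cost = GENOME_GB * 1000 * p["usd_per_mb"]  # Gb -> Mb
    print(f"{name}: {days:.1f} days, ${cost:,.0f} in consumables")
```

On these numbers the HiSeq does a 30X genome in under four days for about $10k of consumables, while the initial PacBio machine would take a month and over $100k; hence the throughput argument below.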

Of course, the competition isn’t just based on pure throughput. Read length and error rates are also important; the 3rd gen machines will also have much longer read lengths than Illumina and SOLiD, and we expect that the quality of sequence will be higher as well, giving the possibility of some real Gold Standard genomes being produced from these machines, rather than the somewhat messy genomes we get from Illumina.

This all ties in to the conversation I had with the Illumina people at ASHG; Illumina think that it’ll be a good few years before 3rd Gen sequencing can catch up with their current machines. I expect that, between now and 2014 (when PacBio release v2 of their machine), the major sequencing centres will keep a combination of 2nd and 3rd gen machines. The 2nd gen machines will be used when a very large amount of low-quality sequence is required, such as for Genome-Wide Association Studies or RNA-seq. The 3rd gen machines will be used for assembling genomes, looking for copy-number variations and studying the genetics and epigenetics of non-coding and repetitive regions.

I guess what I’m trying to say is that, as exciting and cool as the single-molecule technologies of PacBio and Oxford Nanopore are, it is far too soon to announce the death of Second Gen sequencing. If Illumina continues to push its throughput as hard as it is doing now, 2nd generation machines will be widely used for a long while yet.

The future will become a bit clearer at the AGBT conference, where we should see some big announcements from PacBio, Oxford Nanopore, Complete Genomics, ABI and Illumina. A host of other bloggers and I will be there to cover them.

Christmas Thoughts

A late Merry Christmas, and a marginally early new year to all. A few Christmas-based observations, from various lines of thought that have been knocking around my head over the festive period.

Cooking and the Internet

The internet can sometimes birth amazingly useful things in unexpected fields. The one I am thinking of at the moment is in cooking; I have been cooking a lot of food this Christmas, and I have built up a reputation in my family for being someone who can make traditionally ‘complicated’ things (pie crust, Yorkshire pudding, etc). To clarify, I am not a good cook, or at least the food I invent myself is not liked by others (I like my mustard and chili pasta sauce, or my three-mushroom fried rice, but no-one else seems to). However, some more food-literate friends have shown me how the internet can turn someone like me into a competent chef.

The hidden secret is the BBC Good Food website (which is distinct from the BBC Food website, presumably dedicated to bad food). The website consists of a crazily large number of recipes written by professionals; however, the real secret is that there is a very dedicated readership of amateur cooks who report their experiences, and rate the recipes on a scale of 1-5. It is this latter part that really makes the website great; while a random recipe from a chef will often be relatively good, those that have been rated 5 stars by the community are, virtually without fail, excellent.

The interesting thing is that many of the recipes look very weird at first, but turn out to work amazingly well. Cut the skin off gammon, and cut slits before roasting? Add flour to the filling of an apple pie? Pastry that should ‘look like scrambled egg’? These are the sort of thing that make you look to your family like an expert cook; you end up doing things that to them (and indeed you) look like madness, despite actually working.

Keynesian Christmas

I read a fun and insightful essay, reprinted at OpenDemocracy, about the Keynesian economic bases of a Christmas Carol. The idea is that the early 1840s were a time of deflation: prices were falling, so money grew in value simply by being held, while goods lost value year on year. The economy was sinking into recession; deflation meant that investment wasn’t worthwhile, but because goods would be worth less each year, people avoided buying things as well. Added to this was a Malthusian attitude that the world was too full to support population growth, and that saving and parsimony were the order of the day.

These fears combine to make a villain that is both indicative of, and the cause of, the recession. The miserly rich man, fearful of financial uncertainty, who hoards money without spending it either on himself or on others. And when Scrooge learns the spirit of Christmas, he also learns to be the sort of person that the economy needs for recovery; someone who gives and spends without thought for the cost, who buys things for the sheer joy of doing so, not because they are good value or even needed.

There is a similar feel to the carol Good King Wenceslas, which was also written in the 1840s; the Saint, upon seeing a poor man in the cold, on a kind-hearted whim calls out for flesh, wine and firewood to make a feast for the peasant. It is the spontaneity, the lack of economic calculation, that makes him a Saint; he spends on others for the sheer joy of doing so.

Oddly, these values are close to what we now call consumerism; buying things for the sake of it, not because they make your life better. This ties nicely into a post by Ed Yong; consuming goods, spending on yourself, does not give you happiness (most of us have more than we need anyway). However, spending on others, like Scrooge or Saint Wenceslas, can bring you happiness.

Saint Nicholas

A final Christmas thought, before I put away childish things for the year: has anyone ever considered going to the Basilica di San Nicola on Christmas Day, in order to visit Santa Claus’ grave? One for the kids, perhaps.

The Economist Mangles Disease Genetics

The Economist has a rather distressingly bad article by the evolutionary psychologist Geoffrey Miller, about the supposed general failure in human disease genetics over the last 5 years. The thesis is that Genome-Wide Association Studies (GWAS) for common diseases have been a failure that geneticists are trying to keep hidden, and that the new techniques required to solve the problem of disease genetics will raise ‘politically awkward and morally perplexing facts’ about the different traits and evolutionary histories of races. The former claim is pretty much the same as Steve Jones’ Telegraph article earlier this year, and is just as specious. I will look at both claims separately.

A quick point of terminology: Miller uses ‘GWAS’ to refer to studies that look for disease association in common variants using a genotyping chip, and acts as if sequencing studies are not, in fact, GWAS. In fact, a sequencing association study is just another type of GWAS, just looking at a larger set of variants.
Continue reading

How Many Ancestors Share Our DNA?

This post was written four years ago, using a quick-and-dirty model of recombination to answer the question in the title. Since then a more detailed and rigorously tested model has been developed by Graham Coop and colleagues to answer this same question. You can read more about the results of this model on the Coop Lab blog here and here. Graham’s model is based on more accurate data, more careful tracking of multiple ancestors and a more realistic model of per-chromosome recombination, and thus his results should be considered to have superseded mine.

Over at the Genetic Genealogist, Blaine Bettinger has a Q&A post up about the difference between a genetic tree and a genealogical tree. The distinction is that your genealogical tree is the family tree of all your ancestors, but your genetic tree only contains those ancestors that actually left DNA to you. Just by chance, an individual may not leave any DNA to a distant descendant (like a great-great-great-grandchild), and as a result they would not appear on their descendant’s genetic tree, even though they are definitely their genealogical ancestor.

At the end of his post, Blaine asks a couple of questions that he would like to be able to answer in the future:

  • At 10 generations, I have approximately 1024 ancestors (although I know there is some overlap). How many of these ancestors are part of my Genetic Tree? Is it a very small number? A surprisingly large number?
  • What percentage, on average, of an individual’s genealogical tree at X generations is part of their genetic tree?

I think that I can answer those questions, or at least predict what the answers will be, using what we already know about sexual reproduction.
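As a taste of the quick-and-dirty approach (my own toy model; as noted above, Graham Coop’s later work supersedes this sort of calculation): suppose the autosomes break into roughly 22 + 33k segments after k generations (22 chromosomes plus ~33 crossovers per meiosis, both rough assumptions), and suppose each segment descends from one of the 2^k ancestors independently and uniformly (in reality neighbouring segments are correlated). Then:

```python
# Toy model of how many generation-k ancestors contribute DNA.
# Assumptions (all rough): the autosomes split into 22 + 33*k segments
# after k generations, and each segment lands on a uniformly random
# ancestor, independently of its neighbours.
def expected_genetic_ancestors(k, crossovers_per_meiosis=33, chromosomes=22):
    segments = chromosomes + crossovers_per_meiosis * k
    ancestors = 2 ** k
    # Expected number of ancestors receiving at least one segment.
    return ancestors * (1 - (1 - 1 / ancestors) ** segments)

for k in (5, 10, 15):
    print(k, 2 ** k, round(expected_genetic_ancestors(k)))
```

Under these assumptions nearly all 32 of your generation-5 ancestors are genetic ancestors, but by generation 10 only around 300 of the 1024 are; the number of segments grows linearly while the number of ancestors grows exponentially, so the fraction keeps falling.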
Continue reading

ASHG: Quantifying Relatedness and Active Subjects in Genome Research

Well, the American Society of Human Genetics Annual Meeting is coming to a close for another year. My talk is done and dusted, so I no longer have to lie awake at night worrying that I will forget everything other than the words to “Stand By Your Man” when confronted by the crowd. My white suit is now more of an off-white suit, with regions of very-off-white and pretty-much-entirely-out-of-sight-of-white. I’m looking forward to getting back home to catch up on my sleep.

For the last time, I’m going to give a little summary of talks today that I thought were interesting, or gave some indication of where genetics may be heading in the future. I will write up some more general thoughts about the meeting in the next few days, as soon as the traveling is out of the way and my mind has recharged.

If you would like some second opinions on the conference, GenomeWeb has a number of articles, including a couple of short summaries, as well as a nice mid-length article about the 1000 Genomes session; there are also a number of articles over at In The Field, the Nature network conference blog.
Continue reading

ASHG: Statistical Genomics and Beyond GWAS in Complex Disease

The second day of the American Society of Human Genetics Annual Meeting is drawing to a close; here’s a lowdown of what talks I’ve enjoyed today.

Remember, follow @lukejostins on Twitter if you want more up-to-the-minute details on the ASHG talks.
Continue reading