Archive for February, 2010

AGBT: Sequencing Tech Lowdown

Saturday, February 27th, 2010

Alright, it’s time to address the meat of the matter at AGBT: the state of play of sequencing technology. I’ll go through each of the major companies in turn, and talk about what they’ve brought to the table and what the future holds for them.

As usual, for more in-depth information you can follow me on Twitter (@lukejostins). Other coverage can be found on Genetic Future, MassGenomics, Fejes, GenomeWeb and Bio IT World.

Illumina

I covered Illumina on day zero. Basically, the GAIIx can now generate 7Gb/day, with 2x150bp, and error rates universally under 2%. The HiSeq generates 31Gb/day, 2x100bp, with error rates under 1%; this will soon be pushed to 43Gb/day with a slight decrease in accuracy. For sheer volume of sequence, no-one can match Illumina.

454

As I said yesterday, 454’s median read lengths are climbing into the 700-800bp range, but with error rates getting pretty high beyond 600bp or so. Not bad, but after all the fuss over 1000bp reads, also a little disappointing.

454 have been pushing their work on assembly; they’ve worked pretty hard to make an easy-to-follow recipe, involving both single-end and paired-end sequencing, and the program Newbler. Many interesting critters have had this treatment, including bonobo, panda and Desmond Tutu (in order of majesty).

SOLiD

I found the SOLiD content of this conference very cool. Focusing more on the medical genomics side of things, SOLiD is involved in various clinical trials to see whether genomic information can increase cancer survival times, and is emphasizing the importance of accuracy in a clinical setting.

Lots of cool new tech too: for instance, mixing 2-base and 1-base encoding, which they claim makes error rates of 1 in 10^6 possible. Apparently library prep errors now dominate, so SOLiD has been working on finding gentler enzymes for amplification. Particularly cool was a throw-away slide on running the ligase on single molecules and actually getting signal (though actual single-molecule sequencing probably isn’t economical).

Pacific Biosciences

Pacific Biosciences have produced an extremely interesting product; it is a game-changer, though exactly what it means for sequencing is not immediately obvious. I am going to hold back on writing about PacBio right now, because I have a more in-depth post in the works on the exact specs and implications of the PacBio, in comparison to their nearest equivalent, Oxford Nanopore.

Complete Genomics

Complete Genomics have gone from “interesting idea” to “thriving technology” in a very short period of time. They’re scaling up their sequencing centre as we speak; they’ll have 16 machines in the next few months, generating 500 40X genomes a month. Over the year, provided they get more orders, they’ll scale up to 96 machines, with a predicted 5X increase in capacity per machine as well. If this all goes well, in theory they are on target for their 5000 genomes by the end of the year.
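As a rough sanity check on those numbers (my own back-of-envelope arithmetic from the figures above, not anything Complete have published), the scaling works out like this:

```python
# Back-of-envelope scaling of Complete Genomics' stated targets.
# All figures come from the numbers quoted above; the arithmetic is mine.

current_machines = 16
current_genomes_per_month = 500
per_machine_now = current_genomes_per_month / current_machines  # ~31 genomes/machine/month

target_machines = 96
capacity_multiplier = 5  # predicted per-machine improvement over the year

full_scale_per_month = target_machines * per_machine_now * capacity_multiplier
print(f"~{per_machine_now:.0f} genomes/machine/month now")
print(f"~{full_scale_per_month:.0f} genomes/month at full scale")
# ~31 now, ~15,000/month at full scale; even a gradual ramp-up over the
# year makes a cumulative 5000 genomes by December look achievable.
```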

Complete also have some very interesting new technologies on the horizon, which they will be discussing tomorrow; check the Twitter feed for coverage. A lot of people underestimate Complete Genomics, but it is starting to become evident that they are as much game-changers as the flashier technologies.

Ion Torrent

Ion Torrent wins both my major awards this year: the “most surprising release” award and the “sounds most like a Soviet weapons project” award. Ion Torrent burst onto the scene with its tiny machine (GS Junior sized); the first major non-fluorescence-based method in a long time, using the emission of hydrogen ions from the DNA polymerase reaction to measure incorporation in a 454 stylee.

The machine can produce a rapid 150Mb or so in a single one-hour run, for about $500 in disposables. The machine itself costs a tiny $50k. From what I’ve heard, a lot of people are interested in a machine like this for fast library validation, though it also has applications in diagnostics and microbiology. Unfortunately, it looks like the error rates are currently high, though they claim these will drop by release time.

Summary

Overall, we are starting to see a divergence in sequencing technologies, as each tech concentrates on having clearly defined advantages and potential applications that differ from all others. This means that the scientists themselves can more closely tailor their choice of tech to fit their situation. Are you a small lab that needs 10 high-quality genomes on a budget? Go to Complete. Want a cheap, fast machine for library validation? Use Ion Torrent. Setting up a pipeline for sequencing thousands of genomes? Go Illumina.

I suppose this was all driven by the fact that Illumina’s machines have such high yield that chasing them is a fool’s game, so everyone else is concentrating on what they can do that Illumina can’t. This is pretty good for science as a whole; we are moving away from the one-size-fits-all approach to high-throughput sequencing, and into a time of more mature, application-based methods.

AGBT: Taking the Statistics out of Statistical Genetics

Friday, February 26th, 2010

The second day (or the first day, depending on whether you count yesterday’s pre-sessions) of AGBT is nearly done. There has been a lot going on today, but I’m only going to cover one thing; once again, you can get more detail on all the talks I’ve seen on my Twitter feed (@lukejostins).

Other things that I’ve done: I had a very interesting talk with Geoff Nilsen at Complete Genomics, in which I got to ask various questions. Why don’t they use color-space? Because it confuses customers, and the error model is good enough already. In what sense is Complete “3rd Gen”? Because it’s cheaper. I also saw a set of presentations from 454 on de novo assembly, and on the new Titanium 1k kit, which actually contains virtually no 1kb reads: mean read length is about 800bp, but beyond 600bp the error rates get very high.

There has been some other blog coverage of AGBT from our army of bloggers: MassGenomics has some first impressions, and Anthony Fejes is uploading his detailed notes about all the talks. You can also follow a virtual rain of tweets on the #AGBT hashtag.

Fun with Exome Sequencing

Debbie Nickerson (again!) gave a talk about exome sequencing to hunt down the genes underlying Mendelian disorders. The process is very simple: you sequence 4-10 exomes of sufferers, look for non-synonymous mutations shared between them, and then apply filters (such as removing anything present in HapMap exomes) to find SNPs that are likely to be causal. Debbie is in the process of sequencing 200 exomes for 20 diseases, and has real success stories under her belt, having tracked down the genes for 2 disorders. She raised the interesting question of how to validate the discovered genes, given that Mendelian disorders tend to have a large number of independent mutations.
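For the flavour of it, here is a toy sketch of that filtering strategy in Python; the variant names, data structures and specific filters are made up for illustration, and real pipelines are of course far more involved:

```python
# Toy version of the shared-variant filtering strategy described above.
# Each affected exome is represented as a set of non-synonymous variants;
# the HapMap set plays the role of the "already seen in healthy exomes" filter.

def candidate_variants(affected_exomes, hapmap_variants):
    """Variants shared by every affected individual and absent from HapMap exomes."""
    shared = set.intersection(*affected_exomes)  # present in all sufferers
    return shared - hapmap_variants              # drop anything already seen in HapMap

# Entirely made-up example data:
exome1 = {"GENE1:p.R50W", "GENE2:p.A10T", "GENE3:p.G77D"}
exome2 = {"GENE1:p.R50W", "GENE2:p.A10T", "GENE4:p.L5F"}
exome3 = {"GENE1:p.R50W", "GENE2:p.A10T"}
hapmap = {"GENE2:p.A10T"}

print(candidate_variants([exome1, exome2, exome3], hapmap))
# {'GENE1:p.R50W'} -> GENE1 becomes the candidate gene to follow up
```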

Stacey Gabriel gave a related talk on exome sequencing, focusing on using the method Debbie described to track down rare variants for complex traits. To do that, you ‘Mendelianise’ the trait by only picking extreme individuals; she did this for high and low LDL-cholesterol, giving some candidate genes, but no smoking gun.

Let’s look slightly more closely at this; you sequence a number of individuals with extreme traits, look for genes with shared non-synonymous mutations, and look for functional effects. This is a linkage study! A very small and underpowered linkage study, with a variant-to-gene collapse method (like a poor man’s lasso) and some sort of manual pathway/functional analysis (a poor man’s GRAIL), but linkage all the same. This is re-inventing the wheel, without learning any of the lessons that the first round of linkage analysis taught us (or even stopping to ask whether, if such variants existed, they would have been picked up by linkage in the first place).

It is not that Stacey Gabriel is doing anything wrong; it is just that she is failing to consider that she is attempting to solve non-statistically a problem that statisticians have worked on for decades. In short, she is risking taking the statistics out of statistical genetics.

AGBT: Running Sequencing Facilities and Illumina’s Ever-Growing Capacity

Thursday, February 25th, 2010

The first day of the Advances in Genome Biology and Technology conference is not until tomorrow, but today there were a couple of pre-sessions. These were at pretty much opposite ends of the spectrum: one was a series of general, high-level talks by users of high-throughput sequencing, and the other was a series of technical talks by a manufacturer of sequencing machines.

As usual, this blog post is just a summary of a few aspects that I found interesting. More in-depth coverage can be found on my Twitter feed, @lukejostins.

Running a Sequencing Facility

The subject of how to build, run and scale up a sequencing facility may seem, at first glance, a little dry, but I found the two talks by heads of sequencing labs fascinating.

Debbie Nickerson gave a talk about her experience with scaling from a few machines to a major operation. Debbie seems to keep it running smoothly mostly through collecting crazy amounts of data on samples, libraries and runs, and producing a range of tools to quickly examine this data to locate problems and to keep information flowing between the different parts of the lab. A nice example was a tool to flag up common library failures from sequence data, and automatically e-mail the library prep team to re-prep the sample.
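Something in the spirit of that failure-flagging tool might look like the sketch below; the metrics, thresholds and addresses are all invented for illustration, and I have no idea what Debbie’s actual implementation looks like.

```python
# Hypothetical sketch of a library-failure flagger: check per-sample QC
# metrics against thresholds and e-mail the library prep team about any
# samples that need re-prepping. Metric names and thresholds are made up.
import smtplib
from email.message import EmailMessage

THRESHOLDS = {"duplicate_rate": 0.30, "adapter_fraction": 0.10}

def flag_failures(run_metrics):
    """Return (sample, reason) pairs for libraries that look like failures."""
    problems = []
    for sample, m in run_metrics.items():
        if m["duplicate_rate"] > THRESHOLDS["duplicate_rate"]:
            problems.append((sample, "high duplicate rate - low-complexity library?"))
        if m["adapter_fraction"] > THRESHOLDS["adapter_fraction"]:
            problems.append((sample, "adapter read-through - insert size too short?"))
    return problems

def email_prep_team(problems, to_addr="library-prep@example.org"):
    """Send one e-mail listing everything that needs a re-prep."""
    msg = EmailMessage()
    msg["Subject"] = "Libraries flagged for re-prep"
    msg["From"] = "qc-pipeline@example.org"
    msg["To"] = to_addr
    msg.set_content("\n".join(f"{sample}: {reason}" for sample, reason in problems))
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if __name__ == "__main__":
    metrics = {"sampleA": {"duplicate_rate": 0.45, "adapter_fraction": 0.02},
               "sampleB": {"duplicate_rate": 0.05, "adapter_fraction": 0.01}}
    failures = flag_failures(metrics)
    if failures:
        email_prep_team(failures)
```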

Susan Lucas gave an overview of the kind of questions you need to think about when planning a genome center. There are obvious things you need to consider; for example, making sure that you can transfer, store and process data. However, she also talked about some more interesting questions: Can your pipeline incorporate new technologies, or new platforms? Can it handle plant DNA, or E. coli? Have you considered the ergonomics of the space; will repetitive tasks cause repetitive strain injury? Do you have an emergency strategy? What will you do if the sprinklers go off?

Sequencing center logistics is right at that interesting intersection between data management, tool development, statistical inference and decision theory; I love the contrast between the statistically high-flying (“detect signs of quality degradation in sequencing imaging”) and the mundane (“tell Bob to clean the lenses”).

Illumina’s Presentations

The late afternoon was taken up by a series of talks by Illumina on new developments in their sequencing tech. It was interesting to see what they would cover, given that Illumina have already broken their big story with the HiSeq 2000; rather than any single big announcement, the talks were all about how extra sequencing capacity is being squeezed out of the existing technologies (though ‘squeezing’ is probably not the right word for Illumina’s 12X increase in GAIIx sequencing yield over the last year).

Particularly interesting was Sheila Fisher‘s talk on the performance of the Broad Institute’s Illumina pipeline: 60% of their machines now run the 2x100bp protocol, producing over 5Gb per day per machine, with the other machines running 2x150bp at a higher cluster density, giving 7Gb/day. For the latter, a single machine could produce a 30X genome in a single 2-week run, or all the sequence produced in Pilot 1 of the 1000 Genomes Project in 9 months. Once the HiSeq is brought up to the same cluster density, it will produce 43Gb per day; enough to generate all the Pilot 1 data in 6 weeks.
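Those per-day yields translate into project timescales quite directly; here is the quick arithmetic (mine, not the Broad’s, and assuming a roughly 3.1Gb human genome):

```python
# Quick arithmetic behind the run-time claims above (assumes a ~3.1Gb genome
# and the stated per-day yields; my own back-of-envelope, not Broad figures).

genome_size_gb = 3.1
gaiix_per_day = 7.0    # Gb/day, 2x150bp at higher cluster density
hiseq_per_day = 43.0   # Gb/day, projected at the same cluster density

two_week_run = gaiix_per_day * 14
print(f"2-week GAIIx run: {two_week_run:.0f}Gb ~= {two_week_run / genome_size_gb:.0f}x coverage")
# ~98Gb, or roughly a 30x human genome

pilot1_gb = gaiix_per_day * 30 * 9  # "all of Pilot 1 in 9 months" at 7Gb/day
print(f"Same data on a HiSeq: ~{pilot1_gb / hiseq_per_day / 7:.0f} weeks")
# ~6 weeks, which is where the figure above comes from
```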

The scale of this sequencing production is staggering; the HiSeq could get to 43Gb a day without any new innovations, and I expect that there is another 2-4X increase in capacity that Illumina could bring in from incremental changes to cluster density and image processing over the next year. As I’ve said before, second generation sequencing still has a lot of room to grow.

Off to AGBT

Tuesday, February 23rd, 2010

Tomorrow morning I head off to the Advances in Genome Biology and Technology conference (AGBT for short) on Marco Island, Florida; as someone who loves the cold and hates warm places, this is not as exciting for me as you may think. One thing that is exciting, however, is the nice genomics blog presence at the conference; me, Daniel MacArthur, Anthony Fejes, Dan Koboldt and David Dooling will all be playing our parts as Ambassadors to the Blogosphere. Interestingly, assuming independence, there is a 68% chance that at least one of us will get eaten by an alligator; watch this space!
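(For the statistically curious, that figure assumes a made-up per-blogger risk of about 20%; the back-of-envelope is below, and should be taken exactly as seriously as the claim itself.)

```python
# The entirely spurious arithmetic behind the alligator figure:
p_per_blogger = 0.2   # assumed (invented) chance that any one blogger gets eaten
n_bloggers = 5
p_at_least_one = 1 - (1 - p_per_blogger) ** n_bloggers
print(f"{p_at_least_one:.0%}")  # ~67%, close enough to 68% for blogging purposes
```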

The high blog coverage is justified; we are expecting to get a feel for how the field of DNA sequencing tech will advance over the next year. I will be particularly interested in seeing what Complete Genomics have to report, as well as the 3rd gen sequencing presentations from Pacific Biosciences and Life Technologies (ABI). One group notable by their lack of a presentation is Oxford Nanopore, which is a shame; I’m sure Nanopore will be talked about plenty anyway.

I have recently got a brand new laptop to replace the brand new laptop that I lost at ASHG last year, and I’m going to keep up the same schedule I did then: a daily blog post summing up the day’s highlights, and more detailed, up-to-the-minute coverage of every talk I see on my Twitter feed (@lukejostins). For more AGBT twittering, I think people are going to be using the hashtag #AGBT.

As is traditional when I go away, I will also be sending a daily e-mail with amusing things that have occurred at the conference, but that is promised to my girlfriend Hannah, so you will, alas, not get to read it.