This post follows on from my previous post on Sanger sequencing, and is part of an ongoing series that looks at how we take DNA, hidden away in our cell nuclei, and read the sequence of base pairs that make up our genetic code. In this post, we look at the Second Generation Sequencing machines that are currently sequencing thousands of genomes-worth of DNA per year throughout the world.
Recap: What are we trying to do?
Previously, we saw how DNA is made up of little strings of nucleotides, and we used different shapes to represent different base pairs (A = triangle, C = diamond, G = circle, T = pentagon). For instance, circle-diamond-triangle-pentagon is GCAT.
We looked at how the polymerase enzyme can be used to amplify up DNA, using the Polymerase Chain Reaction, and how we can determine the sequence of DNA using ddNTPs; nucleotides that, when incorporated into DNA, stop the polymerase working.
In first generation Sanger sequencing, we ran a PCR reaction in the presence of a bunch of ddNTPs, with each different base dyed a different colour. We then measured the length and colour of the resulting fragments of DNA, and used that to work out the sequence; a bit of DNA 35 base pairs long ending in a blue ddNTP told us that the original sequence had a “C” at the 35th position.
The problem with this method is that it requires a lot of space; you need a place to run the reaction, and then you need lengths of capillary tubes or gels to determine the length of the DNA. As a result, you could only run perhaps a hundred of these reactions at any one time. There are 3 billion base pairs of DNA in the human genome, meaning about 6 million 500-base pair fragments of DNA; it would take a very long time to sequence all of these if you had to do them one hundred at a time.
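To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The genome size, fragment length and batch size are the figures from the paragraph above; the rate of one batch per day is a hypothetical value of mine, chosen only to make the scale concrete, and the sum ignores the overlapping coverage you would need in practice.

```python
GENOME_SIZE_BP = 3_000_000_000   # base pairs in the human genome (from the text)
FRAGMENT_LENGTH_BP = 500         # length of one Sanger fragment (from the text)
REACTIONS_PER_BATCH = 100        # reactions run at any one time (from the text)
BATCHES_PER_DAY = 1              # hypothetical rate, for illustration only

# Number of non-overlapping fragments needed to tile the genome once.
fragments = GENOME_SIZE_BP // FRAGMENT_LENGTH_BP

# Number of batches if only a hundred reactions can run together.
batches = fragments // REACTIONS_PER_BATCH

# Time taken at the assumed rate.
years = batches / BATCHES_PER_DAY / 365

print(f"{fragments:,} fragments")
print(f"{batches:,} batches of {REACTIONS_PER_BATCH}")
print(f"roughly {years:.0f} years at {BATCHES_PER_DAY} batch per day")
```

Whatever rate you plug in, the batch count is the bottleneck: the only way to bring the time down is to run far more reactions at once.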
Second generation sequencing techniques overcome this restriction by finding ways to sequence the DNA without having to move it around. You stick the bit of DNA you want to sequence in a little dot, called a cluster, and you do the sequencing there; as a result, you can pack many millions of clusters into one machine. Sequencing a strand of DNA while keeping it held in place is tricky, and requires a lot of cleverness. I am only going to talk in detail about Illumina’s reversible termination sequencing, partly because it is the most similar to Sanger sequencing, but mostly because I too am part of the shadowy cabal of people trying to destroy ABI.
Reversible Terminator Sequencing
Just as Sanger sequencing relies on ddNTPs to stop the PCR reaction, Illumina’s reversible terminator sequencing depends on reversible terminator bases (RT-bases). Just like ddNTPs, these stop PCR reactions when they are incorporated; they have additional molecules, including a base-specific dye, attached to the standard base, which stop the PCR enzyme adding more bases (A bases have red dye, C bases have blue dye, G yellow and T green):
However, they have an additional, very useful property: there exists a cleavage enzyme that chops all the extra molecules off, and turns the RT-base into a normally functioning nucleotide. This is hugely useful, and gives us a method of sequencing that doesn’t require moving the DNA.
We multiply up the template strand, i.e. the bit of DNA that we are sequencing, and stick on a few bases of ‘adaptor sequence’; this sequence sticks on to complementary bits of DNA stuck to a surface, which holds the DNA in place while we sequence it:
We then flood the DNA with RT-bases:
We also add a polymerase enzyme, which incorporates the RT-base into the new strand that is complementary to the template strand:
We then wash away all the RT-bases, leaving just those that were incorporated into the new strand; we can read off what base this is by looking at the colour of the dye:
In this case, the dye is green, meaning that the base at the first position is a T.
Finally, we send in the cleavage enzyme, which cuts off the terminator region and the dye, leaving a normal base pair. We can then start again to sequence the next base pair.
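The flood, incorporate, wash, read, cleave cycle described above can be sketched as a toy simulation in Python. This is an idealised illustration rather than how any real machine's software works: it assumes the polymerase never makes a mistake, and it ignores strand direction and everything about the chemistry. The function names are mine; the dye colours are the ones given earlier in the post.

```python
# Dye colours as given in the post: A red, C blue, G yellow, T green.
DYE_COLOUR = {"A": "red", "C": "blue", "G": "yellow", "T": "green"}
BASE_FOR_COLOUR = {colour: base for base, colour in DYE_COLOUR.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(template):
    """Return (new_strand, colours_seen) for an idealised, error-free run."""
    new_strand = []
    colours_seen = []
    for template_base in template:
        # Flood with RT-bases: the polymerase incorporates the one that
        # pairs with the current template base, and then stalls.
        rt_base = COMPLEMENT[template_base]
        # Wash away the unincorporated RT-bases and photograph the cluster.
        colour = DYE_COLOUR[rt_base]
        colours_seen.append(colour)
        # Call the base from the colour alone, as the camera would.
        new_strand.append(BASE_FOR_COLOUR[colour])
        # Cleave off the dye and terminator, so the next cycle can begin.
    return "".join(new_strand), colours_seen


def read_template(new_strand):
    """Recover the template by complementing each called base."""
    return "".join(COMPLEMENT[base] for base in new_strand)


new_strand, colours = sequence_by_synthesis("ACGT")
print(colours)                    # one colour per cycle
print(new_strand)                 # bases called from those colours
print(read_template(new_strand))  # the template we started with
```

The point of the sketch is that the only thing the machine ever observes is one colour per cycle per cluster; the sequence is reconstructed entirely from that list of colours.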
In a single Illumina machine we have hundreds of millions of these clusters; cameras look at all of these dots and record how they change colour over time, allowing you to determine the sequence of bases of millions of bits of DNA at once. This animation illustrates how the process works over time; the main image shows the base pairs being incorporated into the DNA, and the little box shows what the camera sees; each dot is a reaction, with our reaction circled.
This system is exactly what we were looking for. Note that the sequencing method is pretty inefficient; for each base you read, you have to flood the DNA with RT-bases, wash them off again, and use a cleavage enzyme. This is very slow, and in fact it takes about an hour to read each base. However, this doesn’t really matter; each individual bit of DNA may be slow to sequence, but you can sequence millions of DNA fragments at once. In fact, the way we do sequencing these days is to cut up an entire genome, and sequence all the fragments. The real state-of-the-art machines can produce a pretty high quality human genome in less than a week.
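Here is a rough sketch of why the slowness doesn't matter. The hour-per-base figure comes from the paragraph above; the read length of 100 bases and the cluster count of 200 million are assumptions of mine (the latter picked from the "hundreds of millions" range), and the sum pretends that every cluster gives a usable read, which real runs do not.

```python
HOURS_PER_BASE = 1               # from the text: about an hour per base
READ_LENGTH_BP = 100             # assumed read length, for illustration
CLUSTERS = 200_000_000           # hypothetical count of clusters in one run
GENOME_SIZE_BP = 3_000_000_000   # base pairs in the human genome

# Every cluster is read in parallel, so the run time depends only on
# the number of cycles (one per base), not on the number of clusters.
run_days = HOURS_PER_BASE * READ_LENGTH_BP / 24

# Total raw bases read across all clusters in that time.
raw_bases = CLUSTERS * READ_LENGTH_BP

# How many times over the genome that raw output would cover.
fold_coverage = raw_bases / GENOME_SIZE_BP

print(f"run time: about {run_days:.1f} days")
print(f"raw output: {raw_bases:,} bases")
print(f"raw coverage of a human genome: about {fold_coverage:.1f}x")
```

Under these assumptions, adding more clusters increases the output without adding a single hour to the run, which is exactly the trade that Sanger sequencing could not make.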
Illumina sequencing is not the only second generation technology, and it has many disadvantages. Firstly, because it takes so long to produce a single base pair, and because the different molecules in the cluster can get out of sync, it is impossible to sequence long bits of DNA. Mostly, the read lengths of the machine are under 100 bp, much less than the 500-1000 bp that you can get from Sanger sequencing. 454 sequencing is another second generation sequencing method that gets around this: instead of using dyes, it uses nucleotides that flash when the polymerase adds them to the DNA; it can get read lengths of up to 400 bp, getting close to Sanger sequencing. However, while 454 has a) longer read lengths b) a cooler name and c) a cooler sequencing method, it cannot rival Illumina for sheer amount of DNA sequenced per unit time.
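The "out of sync" problem can be illustrated with a very simple model. Suppose that, in each cycle, every molecule in a cluster independently fails to extend with some small probability; any molecule that has missed a cycle is now showing the wrong base's colour. The sketch below assumes a failure rate of 1% per cycle, which is a hypothetical number chosen for illustration and not a measured property of any machine, and the function name is mine.

```python
def in_phase_fraction(cycles, lag_probability):
    """Fraction of molecules in a cluster that have never missed a cycle.

    Each molecule is assumed to fail independently with probability
    lag_probability per cycle, so the fraction still in step after
    n cycles is (1 - lag_probability) ** n.
    """
    return (1.0 - lag_probability) ** cycles


LAG_PROBABILITY = 0.01  # hypothetical per-cycle failure rate

for cycles in (10, 100, 500):
    fraction = in_phase_fraction(cycles, LAG_PROBABILITY)
    print(f"after {cycles:>3} cycles: {fraction:.1%} of molecules still in sync")
```

Because the in-sync fraction shrinks geometrically, the true colour of a cluster gets steadily drowned out by the stragglers, and past some read length the signal is no longer worth reading.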
Secondly, because it is very easy for the polymerase enzyme to add in the wrong RT-base, Illumina sequencing has a relatively high error rate (1-2% per base). ABI’s SOLiD sequencing adds bases in pairs, rather than singly, and thus sequences pairs of bases rather than single bases. E.g. while Illumina will read “GACT” as “G”, then “A”, then “C”, then “T”, SOLiD reads it as “GA”, “AC”, “CT” etc. Because you sequence each base twice, the error rates are much lower (0.1-0.2% per base)*. SOLiD has come along a bit late to the scene, and its data is somewhat difficult to handle, and as a result ABI hasn’t really managed to get the market share it was hoping for (plus, the whole shadowy cabal thing).
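The overlapping-pairs idea in the GACT example can be sketched in a few lines of Python. This only shows the sliding window of pairs described above; it is not SOLiD's actual colour encoding, and the function names are mine.

```python
def overlapping_pairs(sequence):
    """Split a sequence into overlapping pairs of adjacent bases."""
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


def times_covered(sequence):
    """Count how many of the overlapping pairs include each position."""
    counts = [0] * len(sequence)
    for i in range(len(sequence) - 1):
        counts[i] += 1      # the pair starting at position i
        counts[i + 1] += 1  # the same pair also covers position i + 1
    return counts


print(overlapping_pairs("GACT"))
print(times_covered("GACT"))
```

In this simplified picture every interior base appears in two pairs, so a single misread pair disagrees with its neighbour and can be spotted; the two end bases are only covered once here.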
Next Next Gen Sequencing
Just as the first generation of automated capillary sequencing machines changed the way we thought about DNA, from something that we could, with hard work, glimpse to something we could sequence en masse, second generation sequencing has changed it again, from something we sequence once to something we sequence again and again. Second gen sequencing has allowed us to re-sequence many human individuals; the 1000 Genomes Project is using 454, SOLiD and Illumina machines to sequence hundreds of individuals. This sort of thing has allowed us to get an idea, not just of what a single genome looks like, but how the genome changes from person to person; we can look at how much variation there is in the genome, how different populations differ in their genome structure, and even what makes a cancer genome different from a healthy genome.
However, second gen sequencing is not without its flaws. While it has got cheap ($40-50k to sequence a human genome these days), it still requires a lot of reagents, a lot of work and a lot of cost (the RT-bases aren’t cheap, and neither are the enzymes used). The low read lengths are still a problem, as they make it hard to know precisely where a bit of DNA came from (especially if you want to know which of two paired chromosomes it came from), and it takes a very long time to run the machines to completion. In the third part of this series, I will talk about the new technologies that are appearing on the horizon; the so-called Third Generation Sequencing machines, which promise to make sequencing an entire human genome cost a couple of hundred pounds and take a few hours.
* Edit 18/08/09: Thomas Keller asked in the comments whether I can give some references for these raw error rates. The 0.1-0.2% SOLiD dibase encoded error rate comes from Supplementary Table 1 of McKernan et al. For Illumina error rates: the SOLiD whole genome paper references Hillier et al (Figure 2 gives an error rate of 2.3% per base), however, the Supplementary information for the first Illumina whole genome paper, Bentley et al gives raw error rates in the 0.5-1% range (the truth is presumably somewhere in the middle, 1-2%).