BG2011: Things to do with a genome

Biology of Genomes 2011 is keeping up the momentum, and the third day had some great talks, along with some sort-of-great burgers. The live coverage continues apace on the #BG2011 hashtag.

Taking a slightly different approach from the last two days, all the talks I am going to report on today are based on one concept. Suppose you have sequenced a new genome, and you want to get the best out of it. Maybe you want to find a Mendelian mutation or a large-effect risk factor, or just look at a personal genome for an overall assessment. Today’s Computational Genomics session was full of great ideas for what you can do to get the most out of your shiny new genome sequence.

Idea 1: Run a scan for systematic errors

Illumina’s high-throughput sequencing can be prone to certain systematic errors, which no amount of extra sequencing can correct. 95% of the time, these errors are only on one strand, so you can filter them out by requiring every site you call to have the alternative allele on both strands. However, this will cut out a lot of true sites as well; even in a 30X genome, 1 in 8 truly heterozygous sites will be represented only on one strand.
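To make the simple strand filter concrete, here is a minimal sketch in Python. It is not taken from any of the talks: the `SiteCounts` record, the function names and the `min_per_strand` threshold are all made up for illustration, and a real pipeline would pull per-strand counts out of its own variant caller's output.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Alternative-allele read counts at one candidate site, split by strand.

    Hypothetical record type, used only for this illustration.
    """
    chrom: str
    pos: int
    alt_forward: int
    alt_reverse: int


def passes_strand_filter(site, min_per_strand=1):
    """Return True if the alternative allele is seen on both strands."""
    return (site.alt_forward >= min_per_strand
            and site.alt_reverse >= min_per_strand)


def strand_filter(sites, min_per_strand=1):
    """Keep only the sites supported by reads on both strands."""
    return [s for s in sites if passes_strand_filter(s, min_per_strand)]


# Toy data: one site with support on both strands, two with support on one.
sites = [
    SiteCounts("chr1", 1000, alt_forward=7, alt_reverse=8),
    SiteCounts("chr1", 2000, alt_forward=12, alt_reverse=0),
    SiteCounts("chr2", 3000, alt_forward=0, alt_reverse=5),
]

kept = strand_filter(sites)
print([(s.chrom, s.pos) for s in kept])  # [('chr1', 1000)]
```

The filter has no way of telling a one-stranded systematic error from a real heterozygous site that happened to be sampled on only one strand, which is exactly why it throws away true sites as well.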

Meromit Singer presented a more sophisticated approach to the problem, which she has developed in collaboration with Frazer Meacham and Lior Pachter. They have discovered that systematic errors are more likely to occur at a certain sequence motif, and to have a particular distribution of quality scores in the bases on either side. This combination of factors is very effective at finding systematic errors, correctly classifying more than 90% of heterozygous sites as real variants or errors.

The method has been implemented as downloadable software, ready to be run on your data.

Idea 2: Do some assembly

There are far more variants in the genome than are dreamed of in your SNP lists, and many of them look pretty dramatic. The most reliable way of finding them is with a good assembly.

David Jaffe, an author of the ALLPATHS-LG assembler, showed that sequencing of a 1000 Genomes individual with diverse libraries found many tens of thousands of structural variants between 100 and 1000bp that were missed by the 1000 Genomes Project. In fact, with the right insert distribution you can now perform assemblies that rival the capillary assemblies of old for accuracy and coverage, for a thousandth of the cost.

If you are looking for a crazily good assessment of assembly methods, Keith Bradnam gave a report on the recent Assemblathon, which compared assemblies of a synthetic genome from 17 different teams. There is a lot of variation in assembly quality under different metrics, and no one assembler was better than the rest on all measures. However, the conclusion was that a few assemblers consistently performed well across most metrics, including SOAPdenovo, ALLPATHS-LG and SGA.

Idea 3: Annotate and prioritise your variants

So you have called some error-free SNPs and assembled structural variants, but now you want to know what they do. For that, you need to start prioritising variants by their likely functional effect.

Yesterday’s talks by John Stamatoyannopoulos and Ewan Birney pointed us to some of the extensive and powerful annotations produced by the ENCODE project, and showed how you can use them to find variants that disrupt regulatory elements. On a similar note, today Jacob Degner presented CENTIPEDE, a piece of software that can tie some of this together, to predict which elements are likely to be bound by which transcription factors in your cell-type of interest.

For a more extensive overview, Mark Yandell presented the software package VAAST, for annotating and selecting functional variants. It uses functional annotation and allele frequency, and collapses information within genes, to pull out genes and variants that modify function. He showed that this method is more effective than the annotation tools SIFT and ANNOVAR at correctly classifying known Mendelian mutations, and was even more effective than genome-wide association for picking out the Crohn’s gene NOD2.
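The general idea of combining functional annotation with allele frequency and collapsing within genes can be sketched in a few lines of Python. To be clear, this is a toy illustration and not VAAST's actual statistic: the scoring rule (functional weight multiplied by rarity), the function names and the gene labels are all assumptions invented for the example.

```python
from collections import defaultdict


def variant_score(allele_frequency, functional_weight):
    """Toy score: rarer and more damaging variants score higher.

    Illustrative rule only; it is NOT the test that VAAST uses.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    return functional_weight * (1.0 - allele_frequency)


def collapse_by_gene(variants):
    """Sum variant scores within each gene and rank genes, highest first.

    `variants` is an iterable of (gene, allele_frequency, functional_weight).
    """
    totals = defaultdict(float)
    for gene, allele_frequency, functional_weight in variants:
        totals[gene] += variant_score(allele_frequency, functional_weight)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical variants: (gene, population allele frequency, weight in [0, 1]).
variants = [
    ("GENE_A", 0.001, 1.0),  # rare, strongly damaging annotation
    ("GENE_A", 0.010, 0.5),  # rare, moderately damaging annotation
    ("GENE_B", 0.300, 1.0),  # common, so down-weighted
    ("GENE_C", 0.001, 0.1),  # rare but probably benign
]

for gene, score in collapse_by_gene(variants):
    print(gene, round(score, 3))
```

Summing within a gene means that two modest hits in the same gene can outrank a single common variant elsewhere, which is the intuition behind collapsing information at the gene level.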

Thank you to Meromit Singer, David Jaffe, Keith Bradnam, Jacob Degner and Mark Yandell for their permission to write about their talks, and for their comments on the post. The image above is by Adrian Cousins, and is taken from the Wellcome Images collection.
