Another interesting day at ASHG so far (and not over yet). As with last year, genotype imputation (using reference sets to infer the genotypes of untyped variants in your samples) has been a major subject of the meeting. In particular, the idea of using large sequencing reference sets like the 1000 Genomes Project to infer lower-frequency variation in existing Genome-Wide Association Study datasets has been raising people’s hopes for accessing new types of variation “for free” (i.e. without having to regenotype samples).
Getting at Low-Frequency Variation
The “Genome-Wide Association Studies and Imputation” session started off with Vasyl Pihur’s somewhat provocatively titled talk “Neither common nor rare variation can explain much of phenotypic variation”. His point (backed up with model fitting to existing datasets) was that very rare variation struggles to explain much heritability, because so few people carry any particular variant, while very common variation has been surveyed thoroughly and still leaves much heritability unexplained. Our best bet for filling in the “missing heritability”, then, is variants of intermediate frequency: the neither-common-nor-rare “low-frequency” band between 0.5% and 5%.
Bryan Howie gave a somewhat familiar-sounding talk on using very large HapMap reference sets to impute these low-frequency variants, and Matthew Zawistowski illustrated some potential pitfalls of low-frequency variant imputation (including a tendency to slightly overestimate the frequency of rare variation). Both concluded that, with a few caveats, imputation can work down to even very low frequencies, but only with very large, diverse reference sets (hundreds of samples for variants below 5% frequency, and around a thousand for those below 1%).
Jeff Barrett presented some of our recent work, both at the 1000 Genomes tutorial and the Illumina workshop, showing that imputation into the WTCCC Crohn’s Disease samples using the 1000 Genomes pilot haplotypes can replicate a low-frequency signal that was missed in the original paper but picked up in later meta-analyses. However, imputation into Kenyan samples using the Nigerian Yoruba haplotypes from the pilot was far less successful (low-frequency imputation being basically impossible with the current data), so easy come, easy go. In a similar vein, Yun Li presented an imputation of 1000 Genomes data into 6,000 Swiss samples to look for associations with metabolic traits, which picked up and replicated four new associations, as well as giving greater functional resolution to existing ones.
It looks like 1000 Genomes pilot imputation will squeeze a few new associations out of many association datasets, mostly by allowing better coverage of the low-frequency end of the spectrum. This will become more pronounced after the 1000 Genomes Phase 1 release at the Biology of Genomes conference in May of next year, when we’ll be putting out a large dataset of around 1,100 samples from about a dozen populations. It will be based on low-coverage whole-genome and high-coverage exome data for every sample, along with >2M genotypes from the Omni2.5 chip, to create a very high-quality resource. Lots of work is going into putting together combined SNP, indel and CNV calls as nicely phased haplotypes. This dataset should be a massive boon to association studies, since all signs point towards it being able to impute a massive chunk of variation above 1% frequency with high certainty. Exciting times!