ASHG: Doing it with exomes

ASHG 2010 is now over, and I am back on Albion. Either Daniel or I, or both of us (or, indeed, any of the other GNZers), will have a personal genomics roundup over at Genomes Unzipped sometime this week.

For the last of these posts, however, I thought I would just report on the entirety of the Exome Sequencing session on the final day of the conference. I loved this session for the diversity, the number of different projects that are using exome sequencing to address old questions. It shows how much biology is tech-limited: the moment a powerful new technology becomes available at a low price it is used in every field by a flood of researchers who have been waiting for exactly this sort of data.

Beyond that, there wasn’t an overall theme to the session (or to this blog post), other than Exome Sequencing Is Cool.

ASHG: Getting At Low-Frequency Variants

Another interesting day at ASHG so far (and not over yet). As with last year, genotype imputation (using reference sets to infer the genotype of untyped variants in your samples) has been a pretty major subject of the meeting. In particular, the idea of using large sequencing reference sets like the 1000 Genomes Project to infer lower frequency variation in existing Genome-Wide Association Study datasets has been raising people’s hopes for accessing new types of variation “for free” (i.e. without having to regenotype samples).

Getting at Low-Frequency Variation

The “Genome-Wide Association Studies and Imputation” session started off with Vasyl Pihur’s somewhat provocatively titled talk “Neither common nor rare variation can explain much of phenotypic variation”. The point he was making (and confirmed with some model fitting to existing datasets) was that it is hard for very rare variation to explain much heritability, because so few people carry any particular variant, while very common variation has still left much heritability unexplained. Our best bet for filling in “missing heritability” is therefore variants of intermediate frequency: the neither-common-nor-rare “low frequency” band that lives between 0.5% and 5%.
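To make the frequency argument concrete, here is a minimal sketch under a standard additive model (an assumption of this example, not something taken from the talk): a biallelic variant with allele frequency p and per-allele effect beta, measured in phenotypic standard deviations, explains 2p(1-p)beta^2 of the phenotypic variance under Hardy-Weinberg proportions. The frequencies and the effect size below are made up for illustration.

```python
def variance_explained(freq, beta):
    """Fraction of phenotypic variance explained by one additive variant.

    Assumes Hardy-Weinberg proportions and a phenotype scaled to unit
    variance, so the result is 2 * p * (1 - p) * beta^2.
    """
    return 2.0 * freq * (1.0 - freq) * beta ** 2


# Illustrative (made-up) allele frequencies, all with the same effect size.
examples = {"rare": 0.001, "low-frequency": 0.02, "common": 0.3}
beta = 0.2  # hypothetical per-allele effect, in standard deviations

for label, freq in examples.items():
    print(f"{label:>14} (p = {freq}): {variance_explained(freq, beta):.6f}")
```

For a fixed effect size, the rare variant here explains roughly 200 times less variance than the common one, so rare variants need much larger effects to matter individually.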

ASHG: Genewise Association and Sequencing Families

All the ASHG talks that I have had to do analysis for have now been given, so today I’ve managed to dedicate my full attention to the sessions. Also a good day for tweeting; I managed to live-tweet quite a few talks on @lukejostins, and the #ASHG2010 hashtag has been totally rammed.

Larry Parnell over at Varigenome has been putting his ASHG notes up, if you are still hungry for details. Daniel MacArthur has promised a post on the “Identifiability in the Era of Genome-Scale Research” session for Genomes Unzipped, and I saw him getting pretty worked up about Jim Evans’s talk on his twitter feed, so hopefully we’ll see something from him as soon as he’s done being dead of plague.

Two sessions today, “Statistical Analysis of Human Sequence Variation” and “Finding High-Risk Susceptibility Gene Variants”, seem to encapsulate the cutting-edge of disease gene association, and illuminate where disease genetics is heading in the immediate future.

ASHG: Diploid Assembly and Low-Key Personal Genomics

Unfortunately, I’ve had a bit of a distracted day today; some analysis that is being presented tomorrow failed, and as a result I dropped off the radar somewhat trying to put it right.

I sat in the “Lessons from High-Throughput Sequencing” session, and picked up bits of it, but was generally distracted. Zam Iqbal presented some very impressive work on assembling genome sequence as diploid individuals, which will be extremely important in the future. The main reason for this is that it allows accurate HLA typing from whole-genome sequencing; HLA typing costs hundreds of dollars, so getting it for free as part of the genome is a major win for personal genomics.

Low-Key Personal Genomics

An interesting, though not new, story came in the form of back-to-back talks by Euan Ashley and Russ Altman on the disease and pharmacogenomic work on Steve Quake’s genome.

ASHG: Epistasis and the Missing Heritability

As if no time has passed at all since the sunny shores and lost laptops of the American Society of Human Genetics 2009 meeting, ASHG2010 has rolled around, this time in Washington DC. As always, I’m going to be trying to write a few thoughts on the conference every day, though this year it may be split between here and Genomes Unzipped.

I’ll also be semi-live-tweeting (wifi coverage is patchy), so you can get up-to-the-minute details of all the talks on my twitter feed (@lukejostins), or from other tweeps via the hashtag #ASHG2010.

Epistasis and Missing Heritability

As Daniel observed on Twitter, I very nearly had a heart attack when Eric Lander, in his Distinguished Speaker’s talk about the Human Genome Project, said that the “missing heritability” is probably all down to Epistasis (i.e. interactions between variants). His argument was that GWAS had low power to detect gene-gene interactions, and therefore there could be lots hanging around that could account for the unexplained variance.

This is a fallacy, and a big one.

Introducing: Genomes Unzipped

I’d like to introduce a new group blog, called Genomes Unzipped.

Genomes Unzipped is written by a number of major genetics bloggers, including Daniel MacArthur of Genetic Future, Dan Vorhaus of Genomics Law Report, Caroline Wright of the PHG Foundation, and Jan Aerts of Saaien Tist, as well as a number of other geneticists who are new to blogging. And me, of course.

The really exciting thing about this blog is the diversity of people who are working on it. We have people from a range of fields, including science, law and public health, who hold a range of opinions. Some of us are very gung-ho about direct-to-consumer genetics, and enthusiastic about the benefits, while others are sceptical, and more in favour of regulation. While we’ve been putting together the blog, we’ve had many meetings between contributors, and with outside advisers (such as ethicists), and I have already learned so much before anything has been published. So, needless to say, I am pretty excited about the whole project.

The blog is in the beta stage at the moment; what you see now is not yet the finished version. Soon, the project will expand in scope; stay tuned for more announcements.

Crunching The Data on Human Brain Evolution

Lee, S., & Wolpoff, M. (2003). The pattern of evolution in Pleistocene human brain size. Paleobiology, 29 (2), 186-196. DOI: 10.1666/0094-8373(2003)0292.0.CO;2


There has been a bit of debate around the biology blogoverse recently about the evolution of human brain size. It started off as an “idle speculation” type argument, but then took a swerve into the evidencey regions of science, which is always satisfying. The upshot of this is that I’ve spent today riffing statistically on some fossil data from a relatively old paper on the evolution of human brain size, and seeing if I can tease some interesting tidbits out of it.

Some background

It all kicked off when neuroscientist Colin Blakemore made some comments on the evolution of the human brain. He argued that the large increase in brain size we see around 200k years ago may have been a useless “macromutation” that was tolerated due to the abundance of food. The evolutionary plausibility of this was evaluated unfavourably by, among others, Jerry Coyne.

I’ve heard lots of plausible reasons that human brain size may have seen an increase 200k years ago or so; my personal favourite is that it was due to the discovery of fire, which made food easier to eat, giving more calories and allowing children to be born without the fully connected skull required to attach the jaw muscles for chewing tough food. This would loosen up constraints on selection for brain size, allowing brain size to rush ahead.

Either way, none of the hypotheses proposed have had much strong evidence put forward for them yet (maybe when we bring some Evo-Devo to the party?), and this is particularly true of Blakemore’s theory.

More interesting, and more answerable, is the question of whether there was a sudden increase in brain size at all, and if so when exactly it happened. John Hawks put up a graph of data from a review of human brain evolution by Lee and Wolpoff, and used it to argue that brain size has been increasing gradually for millions of years, with no recent “jump”. In response, Ciarán Brewster did some basic number crunching to establish that, even if there wasn’t a sudden macromutation 200 kya, the human brain seems to have been increasing in size faster over the last 200 thousand years, compared to before that.

If and when did brain growth speed up?

I decided to delve a little bit deeper into the data from the paper. The problem with Ciarán’s analysis is that it assumes that if there was a speed-up, it started 200k years ago. This is a slightly problematic assumption, mostly because of the winner’s curse (oh go look it up).

To do a more rigorous test for the existence of a speed-up, and to estimate when the speed-up happened if there was one, I fitted a least-squares break-point model (a model where the slope of a trend line changes on either side of some point). I compared this to the basic linear trend-line, to see if it explains the data significantly better.
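For readers who want to try something similar, here is a minimal sketch of one way to fit such a model. It is not the code used for this post: it assumes a continuous "hinge" form (one intercept, one slope, plus an extra slope beyond the break), picks the break-point by grid search over candidate ages, and runs on made-up data. All function names are hypothetical. Because the break-point is itself chosen from the data, the F statistic should be read as approximate.

```python
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(columns, y):
    """Ordinary least squares via the normal equations.

    Returns (coefficients, residual sum of squares).
    """
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * v for p, v in zip(columns[i], y)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = 0.0
    for r, value in enumerate(y):
        fitted = sum(coef[i] * columns[i][r] for i in range(k))
        rss += (value - fitted) ** 2
    return coef, rss


def breakpoint_fit(x, y, candidates):
    """Try each candidate break-point; keep the one with the smallest RSS.

    Candidates must lie strictly inside the range of x.
    """
    ones = [1.0] * len(x)
    best = None
    for t in candidates:
        hinge = [max(0.0, xi - t) for xi in x]  # extra slope applies beyond t
        coef, rss = fit([ones, x, hinge], y)
        if best is None or rss < best[2]:
            best = (t, coef, rss)
    return best


def f_statistic(rss_linear, rss_break, n):
    """F for break-point vs straight line.

    Counts 2 extra parameters (slope change and break location), which is
    an assumption of this sketch.
    """
    extra, resid_df = 2, n - 4
    return ((rss_linear - rss_break) / extra) / (rss_break / resid_df)


# Made-up data: age in kya against cranial capacity, true break at 250 kya.
rng = random.Random(1)
ages = [float(a) for a in range(10, 1000, 15)]
sizes = [1350 - 1.0 * a + 0.7 * max(0.0, a - 250) + rng.gauss(0, 30) for a in ages]

_, rss_line = fit([[1.0] * len(ages), ages], sizes)
t_hat, _, rss_bp = breakpoint_fit(ages, sizes, range(100, 600, 10))
print("estimated break-point (kya):", t_hat)
print("F statistic:", round(f_statistic(rss_line, rss_bp, len(ages)), 1))
```

Estimating the break-point from the data, rather than fixing it at 200 kya in advance, is what avoids baking the answer into the test. A confidence interval for the break-point would need something extra (a bootstrap, say), which this sketch leaves out.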

Here is what I get (the points are the fossils, the dashed line is a linear fit, and the coloured lines are the break-point model):

The model shows a definite speed-up of brain size increase recently, and fits the data significantly better than a simple trend line (F(1,90) = 15.8, p < 10^-5). I estimate that the speed-up occurred 252 kya, and can say with 95% confidence that it lies between 203 and 377 kya. This result is pretty robust to exactly what model we use; I also tried using a model where brain size grew exponentially with time, and this gave a similar break-point: 250 kya, with a 95% interval of 167-402 kya (see this graph).

If you prefer non-parametric statistics, here is a loess smoothing of the data, showing a clear kink around 280kya:

Colin Blakemore’s theory of a single, sudden macromutation is, of course, inconsistent with the data; given that each sample is an individual human, if large brain size was a macromutation we’d see each sample either having a massive brain, or a small one. But, there does appear to be a change in the processes driving brain evolution somewhere between 200 and 400 kya.

Appendix 1: Adding sex into the mix

Examining the literature (and by “examining the literature” I mean “googling the sample names”), of the 94 samples in the dataset, 30 are probably male, 24 are probably female, and 40 have not been sexed, or I couldn’t find out the sex (my analysis was pretty haphazard, so don’t take this as gospel). Both sexes have the same age distribution (see here), so it is unlikely that sex would confound the break-point. However, just to be on the safe side, here is the analysis using just males:

And just females:

Both show a similar break-point. The break-point model explains the data significantly better than the linear fit for males (F(1,26) = 20.4, p < 10^-3) but not for females (F(1,20) = 2.9, p = 0.105). Maybe some hint that the change in brain growth was greater for men than for women, perhaps? [Insert fruitless and subconsciously sexist speculation here]. We don't have much data, and you'd want to get an expert to classify the sexes before you drew any conclusions.

Appendix 2: Lee and Wolpoff’s Slightly Weird Model

In the original paper, Lee and Wolpoff argue that there is in fact no discontinuity anywhere, essentially because if you plot the log of brain size against the log of time, you get a straight line all the way back (Figure 3 in the paper). A log-log relationship corresponds to fitting an exponent model of y = Ax^b (where y is brain size and x is number of years before present), which looks like this:

And indeed, under this model, you cannot find any break points that will significantly improve the fit. The model is pretty weird though; in our case, it basically corresponds to brain size being inversely proportional to the fifth root of time (b = -0.2). Linear growth makes sense, and exponential growth corresponds to growth in brain size being proportional to the current brain size, but what does inverse-fifth-root growth correspond to? Maybe there is something fundamental going on here, but I expect the goodness of the fit in the graph above is just down to the flexibility of the exponent model, and I’d consider any conclusions drawn from a log-log transformation of the data to be somewhat dodgy (or at least, very underpowered, as the flexibility of the model will tend to obscure true discontinuities).

The authors’ strange choice of model appears to stem from a slight misunderstanding:

A logarithmic transformation may help avoid the problem of interdependence within the data set because it can be derived from the assumption that the rate of change of cranial capacity at any particular time is proportional to the cranial capacity of the sample at that time

But of course this is an exponential model, which corresponds to a log-linear transformation (the analysis I did above, which still showed a breakpoint); they performed a log-log, which corresponds to an exponent model.
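To spell out the distinction (this is standard calculus rather than anything taken from the paper), write t for time running forwards, x for years before present and y for brain size, with k, y_0, A and b as constants:

```latex
% Growth rate proportional to current size gives an exponential model,
% which becomes a straight line after a log-linear transformation:
\frac{dy}{dt} = k\,y
\;\Longrightarrow\; y(t) = y_0\, e^{kt}
\;\Longrightarrow\; \log y = \log y_0 + k\,t

% A straight line on log-log axes is instead a power law (exponent model):
\log y = \log A + b \log x
\;\Longrightarrow\; y = A\,x^{b},
\qquad \frac{dy}{dx} = \frac{b\,y}{x}
```

In the second case the rate of change depends on y/x, that is, on how long ago the sample lived as well as on its size, which is not the assumption stated in the quoted passage.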


The data I used, including sex classification, is all here. I’d normally put the code I’d used here as well, but in this case it is in a pretty nasty state. If anyone wants it, I can clean it up.

International Women’s Day

Few tragedies can be more extensive than the stunting of life, few injustices deeper than the denial of an opportunity to strive or even to hope, by a limit imposed from without, but falsely identified as lying within.

- Stephen Jay Gould, The Mismeasure of Man

Today is International Women’s Day; today, we remember the all-too-quickly forgotten contributions of women to society, to industry, and to intellectual thought. It is also, importantly, a day to remember the hardships that women have had to fight through, and how far we still have to go to conquer the biases and prejudices that still exist at every level of society. In particular, I think that this should be a time for us scientists to consider how our field and our institutions have treated women, and how science can be used to hurt or help the cause of women.

Biologists have a far from untainted history when it comes to the treatment of women. The late Stephen Jay Gould, in his essay Women’s Brains, documented how the fathers of anthropometry, led by Paul Broca, used bad statistics on brain measurements to infer a general lack of intelligence in women, going on to argue against their access to higher education. Milder versions of this sort of thinking still go on today; in recent years, we’ve seen Lawrence Summers, the president of Harvard, comment that there may be a general lack of highly intelligent women compared to men, because of a (speculated) innate difference in the spread of intelligence, and that this may drive the lack of women in high-up academic positions. For better or for worse, the public still have a lot of trust for scientists, and unfounded and speculative statements such as these can do a lot to reinforce existing prejudices.

Despite dodgy patches in our past, science is far from an inherently sexist process. As Gould made clear in his take-down of Broca, the opinions the anthropometricists held were bad science; they were based on a shallow understanding of the statistics, and a failure to look at the factors that contributed to brain differences. It is bad science (fueled by unacknowledged biases) that leads people to make these statements, and ultimately it will be good data that counteracts it.

In that spirit, I have read a nice crop of blog posts over the last few months. In a series of posts collected together for Women’s Day, Ed Yong explains how the lack of female grandmasters is entirely explained by sex differences in the number of people taking up chess, that differences in maths ability are driven by gender inequality, that the ‘larger variation’ hypothesis falls down when you look across different cultures, and that unconscious but identifiable gender stereotypes in a society correlate strongly with sex differences in ability.

Another, unexpected source for challenging gender stereotypes is OkTrends, the datablog of the dating site OkCupid. A recent post was on “Older Women” (in this case, defined as women in their 30s and 40s); many men refuse to date women in this age range, citing beliefs that they are neurotic, non-sexual, unattractive, and too serious about relationships. However, none of these things are true; the data shows that women in their 30s and 40s are more self-confident, less sexually conservative, and less concerned about finding a relationship that leads to marriage; and while some women, with very youth-based looks, may lose their attractiveness with age, the majority (90% or more) of women look just as attractive at 40 as they did at 18. Men’s opinions on these women are not based on any actual experience of them; they are based on stereotypes about ‘older women’, which are inventions, propagating through a youth-obsessed culture.

These studies, and dozens of others, are building up an accurate picture of how society, through subtle effects of institutions, expectations and stereotypes, partitions men and women into different roles; and when we know how it happens, we can attempt to counteract it.

AGBT: Speculating on Third Gen Tech

So, AGBT is over. I’ve reported on the existing tech in my previous post; one thing that I haven’t covered so far is 3rd Generation sequencing. Time to rectify this.

We have three major players, two of which had a strong presence at AGBT. Pacific Biosciences had a major launch (covered extensively elsewhere), and Life Technologies gave a surprisingly awesome presentation on their new Quantum Dot sequencing technology, QDot. Left out was Oxford Nanopore, the other major player in the 3rd Gen sphere; they did not present anything at AGBT, and I hope they all know that I am very angry about this.

We now have some information about these technologies; we know, in broad terms, how they work, and we can make some guesses about how they’ll compare. Based on the extremely limited amount of data we have at the moment, and a few speculative computer simulations (the R code for which can be read here), I’m going to draw some overall conclusions about how each tech will perform in terms of read length, yield, and accuracy.

To get a lot of this data, I’ve made “educated guesses” at the parameters (mostly based on what we know from the PacBio machine). This is an ‘all things being equal’ analysis; I assume that PacBio, Nanopore and QDot all have the same read density, and the same enzyme efficiency; i.e. that the probability of the QDot polymerase dying is the same as for the PacBio polymerase, which are both the same as the DNA strand falling off the nanopore. If any of the companies feel like providing me with their molecule densities and decay parameters for their enzymes (ha!), I’ll happily fix the plots.
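As an illustration of what the "same decay parameter" assumption implies, here is a minimal Python sketch (not the R code linked above). It assumes a constant per-base probability that a read terminates, whether that means the polymerase dying or the strand leaving the pore, which makes read lengths geometrically distributed. The name `p_stop` and its value are hypothetical.

```python
import math
import random
import statistics


def simulate_read_lengths(n_reads, p_stop, seed=0):
    """Draw read lengths when every base has the same chance of being the last.

    p_stop is a hypothetical per-base termination probability; the lengths
    are then geometric with mean 1 / p_stop.
    """
    rng = random.Random(seed)
    log_survive = math.log(1.0 - p_stop)
    lengths = []
    for _ in range(n_reads):
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        lengths.append(int(math.log(u) / log_survive) + 1)
    return lengths


# Same made-up decay parameter for every platform, as assumed above.
P_STOP = 0.001  # on average one termination per 1000 bases
reads = simulate_read_lengths(10_000, P_STOP)
print("mean read length:  ", round(statistics.mean(reads)))
print("median read length:", statistics.median(reads))
print("fraction over 3 kb:", sum(r > 3000 for r in reads) / len(reads))
```

Under this model the median (about ln 2 / p_stop) sits well below the mean (1 / p_stop), so a long tail of reads carries much of the yield. Differences between platforms would then come from everything other than the decay parameter.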

I really must repeat; all of these graphs or figures are guesses. I have no actual data beyond what has publicly been announced by the companies. I fully expect much of this to be proved wrong over the next year: this is just my guess at what the machines may look like. I should especially emphasise that I have not used any information sourced via my employer about any of these technologies.

AGBT: Sequencing Tech Lowdown

Alright, it’s time to address the meat of the matter of AGBT; the state of play of sequencing technology. I’ll go through each of the major companies in turn, and talk about what they’ve brought to the table, and what the future holds for them.

As usual, for more in depth information you can follow me on twitter (@lukejostins). Other coverage can be found on Genetic Future, MassGenomics, Fejes, GenomeWeb and Bio IT World.

Illumina

I covered Illumina on day zero. Basically, the GAIIx can now generate 7Gb/day, with 2x150bp, and error rates universally under 2%. The HiSeq generates 31Gb/day, 2x100bp, with error rates under 1%; this will soon be pushed to 43Gb/day with a slight decrease in accuracy. For sheer volume of sequence, no-one can match Illumina.

454

As I said yesterday, 454’s median read lengths are climbing into the 700-800 range, but with error rates being pretty high beyond 600 or so. Not bad, but after all the fuss over 1000bp reads, also a little disappointing.

454 have been pushing their work on assembly; they’ve worked pretty hard to make an easy-to-follow recipe, involving both single-end and paired-end sequencing, and the program Newbler. Many interesting critters have had this treatment, including bonobo, panda and Desmond Tutu (in order of majesty).

SOLiD

I found the SOLiD content of this conference very cool. Focusing more on the medical genomics side of things, SOLiD is involved in various clinical trials to see whether genomic information can increase cancer survival times, and is emphasizing the importance of accuracy in a clinical setting.

Lots of cool new tech too: for instance, mixing 2-base and 1-base encoding, apparently making error rates of 1 in 10^6 possible. Library prep errors now dominate, it seems, so SOLiD has been working on finding more gentle enzymes for amplification. Particularly cool was a throw-away slide on running the ligase on single molecules and actually getting signal (though actual single-molecule sequencing probably isn’t economic).

Pacific Biosciences

Pacific Biosciences have produced an extremely interesting product; it is a game-changer, though exactly what it means for sequencing is not immediately obvious. I am going to hold back on writing about PacBio right now, because I have a more in-depth post on the exact specs and implications of the PacBio, in comparison to their nearest equivalent Oxford Nanopore, in the works.

Complete Genomics

Complete Genomics have gone from “interesting idea” to “thriving technology” in a very short period of time. They’re scaling up their sequencing centre as we speak; they’ll have 16 machines in the next few months, generating 500 40X genomes a month. Over the year, providing they get more orders, they’ll scale up to 96 machines, with a predicted 5X increase in capacity per machine as well. If this all goes well, in theory they are on target for their 5000 genomes by the end of the year.
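A quick back-of-the-envelope check of those figures, as a sketch only: it assumes capacity scales linearly with the number of machines and that the 5X gain applies to every machine, neither of which is stated above.

```python
machines_now = 16
genomes_per_month_now = 500   # 40X genomes per month, as quoted above
machines_planned = 96
per_machine_speedup = 5       # the predicted 5X capacity increase

# Linear scaling with machine count is an assumption of this sketch.
per_machine_now = genomes_per_month_now / machines_now
genomes_per_month_planned = per_machine_now * per_machine_speedup * machines_planned

print(f"Per machine now: {per_machine_now:.2f} genomes/month")
print(f"Full build-out:  {genomes_per_month_planned:.0f} genomes/month")
```

On these assumptions the full build-out would be a 30-fold increase in monthly capacity. Even the 16-machine rate, if sustained for ten months, would reach 5000 genomes.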

Complete also have some very interesting new technologies on the horizon, which they will be discussing tomorrow; check the twitter feed for coverage. A lot of people underestimate Complete Genomics, but it is starting to become evident that they are as much game-changers as more flash technology.

Ion Torrent

Ion Torrent wins both my major awards this year: the “most surprising release” award and the “sounds most like a soviet weapons project” award. Ion Torrent burst onto the scene with its tiny machine (GS Junior sized); the first major non-fluorescence-based method in a long time, using the emission of hydrogen ions from the DNA polymerase reaction to measure incorporation in a 454 stylee.

The machine can produce a rapid 150Mb or so in a single hour run, for about $500 in disposables. The machine itself costs a tiny $50k. From what I’ve heard, a lot of people are interested in a machine like this for fast library validation, though it also has applications in diagnostics and microbiology. Unfortunately, it looks like the error rates are currently high, though they claim these will drop by release time.

Summary

Overall, we are starting to see a divergence in sequencing technologies, as each tech concentrates on having clearly defined advantages and potential applications that differ from all others. This means that the scientists themselves can more closely tailor their choice of tech to fit their situation. Are you a small lab that needs 10 high-quality genomes on a budget? Go to Complete. Want a cheap, fast machine for library validation? Use Ion Torrent. Setting up a pipeline for sequencing thousands of genomes? Go Illumina.

I suppose this was all driven by the fact that Illumina’s machine has such high yield that chasing them is a fool’s game, so everyone else is concentrating on what they can do that Illumina doesn’t. This is pretty good for science as a whole; we are moving away from the One-Size-Fits-All approach to high-throughput sequencing, and moving into a time of more mature, application-based methods.