On Excel-Damaged Genes

Hello again; it has been a while. In the half an hour I have before an afternoon seminar, I thought I’d share an interesting and amusing paper that came out a few years back. It is entitled “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”, and it is available for free on PubMed Central (three cheers for Open Access!).

The paper is about a distressing clash between the sublime and the mundane. The first element of the two is the DNA microarray, a technology that allows you to measure the expression of a very large number of genes (a technology that is now reaching the end of its lifespan, a point that I may discuss another day). The output of these experiments tends to be large text tables, in which rows correspond to genes and columns correspond to different individuals, with each entry giving an indication of the level of gene expression. Often, this data will be processed and analysed with a variety of high-tech algorithms to discover genes that differ among classes of people (say diseased and healthy), or to model the expression mathematically, or to reconstruct the networks that underlie expression.

However, sometimes this data lands somewhere altogether less exciting, which leads us to our second element: Microsoft Excel. Researchers will often import their data into Excel to examine it, or to sort through and reorder it. Zeeberg et al. noticed, while testing some software, that genes that went into Excel tended not to come out the other side (or to come out… changed…). They traced the problem to Excel making guesses as to what certain gene names were trying to say. The typical examples were things like DEC1, which Excel converts to 1-DEC (i.e. the date 1st of December), and RIKEN clone IDs such as “2310009E13”, which Excel reads as a number in scientific notation and converts to the floating-point number “2.31E13” (i.e. 2.31 x 10^13).
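To make the failure mode concrete, here is a minimal Python sketch (not taken from the paper) that flags identifiers a spreadsheet might reinterpret before you import them. The two regular expressions are my own rough assumptions about what looks like a date or like scientific notation; they are not Excel's actual parsing rules, and the function name and the example gene list are purely illustrative.

```python
import re

# Illustrative patterns only: assumptions about "date-like" and
# "scientific-notation-like" names, NOT Excel's real conversion rules.
MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"
DATE_LIKE = re.compile(rf"^({MONTHS})-?\d{{1,2}}$", re.IGNORECASE)
SCI_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def risky_identifiers(names):
    """Return (name, reason) pairs for names a spreadsheet might reinterpret."""
    flagged = []
    for name in names:
        if DATE_LIKE.match(name):
            flagged.append((name, "date-like"))
        elif SCI_LIKE.match(name):
            flagged.append((name, "scientific-notation-like"))
    return flagged


# Hypothetical input list mixing the two examples above with ordinary symbols.
genes = ["DEC1", "TP53", "2310009E13", "SEPT2", "BRCA1"]
for name, reason in risky_identifiers(genes):
    print(name, reason)
```

A check like this only tells you which names are at risk; it does not repair a table that has already been through a spreadsheet, since the conversion cannot be reliably undone after the fact.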

My first reaction to this finding was ‘well, who uses Excel to do bioinformatics anyway?’. Then I remembered that work I am currently intending to put into a paper used Excel at one point (in order to turn a table copied from a PDF into a text table). It is surprising how often you do something like this, something that you do not give much thought to but that may have been causing problems. And, in fact, Zeeberg et al. found a number of examples in the NCBI databases of gene names that had been changed by this effect.

This is one of the general pitfalls of doing things with massive datasets. When you have large amounts of data, problems that do not occur with simple data start to hound you. When you are examining tens of thousands of genes it is impossible to check each one by hand, and the chance of noticing errors falls dramatically; in addition, rare events (like having a gene name that looks like a date) become almost inevitable when they have hundreds of thousands of chances to happen. There are plenty of technical examples, such as long-branch attraction when making DNA trees, where lots of data causes small but systematic errors to become big problems, but this example is nice because it demonstrates in a very simple way the potential quirks that can creep into big datasets.
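The "almost inevitable" point is just arithmetic: if each identifier independently has a small chance p of being mangled, the chance of at least one mangled identifier among n is 1 - (1 - p)^n. Here is a small Python sketch of that calculation. The value of p is made up for illustration and is not an estimate from the paper, and the independence assumption is a simplification.

```python
def chance_of_at_least_one(p, n):
    """Probability that an event with per-item chance p happens at least once in n items."""
    return 1 - (1 - p) ** n


# p is a made-up illustrative value, NOT a figure from Zeeberg et al.
p = 1e-4
for n in (10, 1_000, 100_000):
    print(n, round(chance_of_at_least_one(p, n), 4))
```

With a handful of genes the event is vanishingly unlikely; with a hundred thousand it is all but guaranteed, which is why nobody notices the problem on small tables.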

I heard about this via fejes.ca

——————————————————————————————

Reference:

Barry R Zeeberg, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett, John N Weinstein (2004). Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics, 5 (1). DOI: 10.1186/1471-2105-5-80

6 Responses to On Excel-Damaged Genes

Brandon | March 10, 2009 at 6:42 pm |

Does the same thing happen in OpenOffice?
Luke | March 10, 2009 at 7:06 pm |

Yeah, it isn’t a bug in Excel, and OpenOffice does the same - it is just using the wrong tool for the job.
Olaf Davis | March 10, 2009 at 10:20 pm |

Do you think this, in light of our conversation with those leaders of the finance industry last year, has anything to do with the current crisis?
Lab Rat | March 11, 2009 at 4:42 pm |

Computer programs trying to be helpful can cause the most maddening problems! I’ve added all my bacteria names to the dictionary in Word in an attempt to stop the machine making me recheck them every time I spellcheck (through paranoia) or, in extreme cases, attempting to correct them itself.
Hannah | March 13, 2009 at 3:27 pm |

I have a similar but much more easily remedied problem that Word persists in ‘correcting’ Latin to French. It seems to be attempting to draw me into the 21st century.
shpo | June 6, 2009 at 10:33 pm |

My version of Excel has Tools->Options->Spelling->Autocorrect. Hopefully, if these were all switched off, the problem would just disappear… I’d never use Excel though - Perl or general Linux tools are much better.
