On Excel-Damaged Genes

Hello again, it has been a while. In the half an hour I have before an afternoon seminar, I thought I’d share an interesting and amusing paper that came out a few years back. It is entitled Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. It is available for free on PubMed Central (three cheers for Open Access!).

The paper is about a distressing clash between the sublime and the mundane. The first element of the two is the DNA microarray, a technology that allows you to measure the expression of a very large number of genes (a technology that is now reaching the end of its lifespan, a point that I may discuss another day). The output of these experiments tends to be large text tables, in which rows correspond to genes and columns correspond to different individuals, with each entry giving an indication of the level of gene expression. Often, this data will be processed and analysed with a variety of high-tech algorithms to discover genes that differ among classes of people (say diseased and healthy), or to model the expression mathematically, or to reconstruct the networks that underlie expression.

However, sometimes this data lands somewhere altogether less exciting, which leads us to our second element: Microsoft Excel. Researchers will often import their data into Excel, to examine it, or to sort through and reorder it. Zeeberg et al. noticed, while testing some software, that genes that went into Excel tended not to come out the other side (or to come out… changed…). They traced the problem to Excel making guesses as to what certain gene names were trying to say. The typical examples were things like DEC1, which Excel converts to 1-DEC (i.e. the date 1st of December), and RIKEN clone IDs, which look like “2310009E13”, for which Excel gives the floating-point number “2.31E13” (i.e. 2.31 x 10^13).
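To make the two failure modes concrete, here is a minimal Python sketch that flags identifiers which look like the examples above before they go anywhere near a spreadsheet. The patterns are my own rough approximation of what Excel appears to be guessing at, not Excel's actual rules, and the list of month prefixes is an assumption that is probably incomplete. The function name `excel_risk` is made up for illustration.

```python
import re

# Month-like prefixes that a spreadsheet might read as the start of a date.
# This list is an assumption for illustration, not Excel's real rule set.
MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN",
    "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)

# e.g. DEC1: a month-like prefix followed by one or two digits.
DATE_LIKE = re.compile(
    r"(?:%s)[-/ ]?\d{1,2}" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)

# e.g. 2310009E13: digits, then E, then digits, which reads as scientific notation.
NUMBER_LIKE = re.compile(r"\d+(?:\.\d+)?E[+-]?\d+", re.IGNORECASE)


def excel_risk(identifier):
    """Return 'date', 'number', or None for a single identifier."""
    text = identifier.strip()
    if DATE_LIKE.fullmatch(text):
        return "date"
    if NUMBER_LIKE.fullmatch(text):
        return "number"
    return None


if __name__ == "__main__":
    for name in ["DEC1", "2310009E13", "TP53"]:
        print(name, excel_risk(name))
```

Run over a column of gene names, something like this would at least tell you which rows deserve a second look after a round trip through a spreadsheet.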

My first reaction to this finding was ‘well, who uses Excel to do bioinformatics anyway?’. Then I remembered that work I am currently intending to put into a paper used Excel at one point (in order to turn a table copied from a pdf into a text table). It is surprising how often you do something like this, something that you do not give much thought to, but that may have been causing problems. And, in fact, Zeeberg et al. found a number of examples in the NCBI databases of gene names that had been changed by this effect.
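If you want to check a table of your own for names that have already been through the wringer, a sketch along these lines might do as a first pass. It looks for values shaped like the damaged forms mentioned above (“1-DEC” and “2.31E13”). The patterns and the function name `find_damaged` are my own assumptions, this is not the method Zeeberg et al. used, and it will happily miss anything that was mangled into some other shape.

```python
import re

# Three-letter month abbreviations, as in the "1-DEC" example.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# A day number, a hyphen, then a month abbreviation (e.g. 1-DEC).
DAMAGED_DATE = re.compile(r"\d{1,2}-(?:%s)" % MONTHS, re.IGNORECASE)

# A single leading digit, optional decimals, then an exponent (e.g. 2.31E13 or 2.31E+13).
DAMAGED_NUMBER = re.compile(r"\d(?:\.\d+)?E\+?\d+", re.IGNORECASE)


def find_damaged(names):
    """Return the entries that look like spreadsheet-converted identifiers."""
    suspicious = []
    for name in names:
        text = name.strip()
        if DAMAGED_DATE.fullmatch(text) or DAMAGED_NUMBER.fullmatch(text):
            suspicious.append(name)
    return suspicious


if __name__ == "__main__":
    column = ["DEC1", "1-DEC", "2.31E13", "2310009E13", "TP53"]
    print(find_damaged(column))
```

Note that the untouched forms (DEC1, 2310009E13) pass through unflagged; only the converted-looking ones are reported.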

This is one of the general pitfalls of doing things with massive datasets. When you have large amounts of data, problems that do not occur with simple data start to hound you. When you are examining tens of thousands of genes it is impossible to check each one by hand, and the chance of noticing errors falls dramatically; in addition, rare events (like having a gene name that looks like a date) become almost inevitable when they have hundreds of thousands of chances to happen. There are plenty of technical examples, such as long-branch attraction when making DNA trees, where lots of data causes small but systematic errors to become big problems, but this example is nice because it demonstrates in a very simple way the potential quirks that can creep into big datasets.

I heard about this via fejes.ca

——————————————————————————————

Reference:

Barry R Zeeberg, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett, John N Weinstein (2004). Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics, 5 (1). DOI: 10.1186/1471-2105-5-80

6 Responses to On Excel-Damaged Genes

  1. Does the same thing happen in OpenOffice?

  2. Yeah, it isn’t a bug in Excel, and OpenOffice does the same - it is just using the wrong tool for the job.

  3. Do you think this, in light of our conversation with those leaders of the finance industry last year, has anything to do with the current crisis?

  4. Computer programs trying to be helpful can cause the most maddening problems! I’ve added all my bacteria names onto the dictionary in word in an attempt to stop the machine making me recheck them every time i spellcheck (through paranoia) or, in extreme cases, attempting to correct them itself.

  5. I have a similar but much more easily remedied problem that Word persists in ‘correcting’ Latin to French. It seems to be attempting to draw me into the 21st century.

  6. My version of Excel has Tools->Options->Spelling->Autocorrect . Hopefully, if these were all switched off, the problem would just disappear… I’d never use excel though - perl or general linux tools are much better
