Computing lessons from DNA analysis experiments

I’ve been working with my colleagues in Earlham’s Icelandic Field Science program on a workflow for DNA analysis, about which I hope to have other content to share later. (I’ve previously shared my work with them on the Field Day Android app.)

My focus has been heavily experimental and computational: run one workflow using one dataset, check the result, adjust a few “dials”, and run it again. When we’re successful, we can often automate the work through a series of scripts.

At the same time, we’ve been trying to get our new “phat node” working to handle jobs like this faster in the future.

Definitions vary by location, context, etc. but we define a “phat node” or “fat node” as a server with a very high ratio of (storage + RAM)/(CPU). In other words, we want to load a lot of data into RAM and plow through it on however many cores we have. A lot of the bioinformatics work we do lends itself to such a workflow.

All this work should ultimately redound to the research and educational benefit of the college.

It’s also been invaluable for me as a learning experience in software engineering and systems architecture. Here are a few of the deep patterns that experience illustrated most clearly to me:

  • Hardware is good: If you have more RAM and processing power, you can run a job in less time! Who knew?
  • Work locally: Locality is an important principle of computer science – basically, keep your data as close to your processing power as you can given system constraints. In this case, we got a 36% performance improvement just by moving data from NFS mounts to local storage.
  • Abstractions can get you far: To wit, define a variable once and reuse it. We have several related scripts that refer to the same files, for example, and for a while we had to update each script with every iteration to keep them consistent. We took a few hours to build and test a config file, which resolved a lot of silly errors like that. This doesn’t help time for any one job, but it vastly simplifies scaling and replicability.
  • Work just takes a while: The actual time Torque (our choice of scheduler) spends running our job is a small percentage of the overall time we spend shaping the problem:
    • buying and provisioning machines
    • learning the science
    • figuring out what questions to ask
    • consulting with colleagues
    • designing the workflow
    • developing the data dictionary
    • fiddling with configs
    • testing – over, and over, and over again
    • if running a job at a bigger supercomputing facility, you may also have to consider things like waiting for CPU cycles to become available; we are generally our systems’ only users, so this wasn’t a constraint for us

A lot of this is (for computer scientists, software engineers, etc.) common sense, but taking care to apply that common sense can be critical for doing big interesting work.

The punchline of it all? We managed to reduce the time – walltime, for fellow HPC geeks – required to run this example workflow from a little over 8 hours to 3.5 hours. Just as importantly we developed a bunch of new knowledge in the process. (I’ve said almost nothing here about microbiology, for example, and learning a snippet of that has been critical to this work.) That lays a strong foundation for the next several steps in this project.

If you read all this, here’s a nice picture of some trees as a token of my thanks (click for higher-resolution version):

Image of trees starting to show fall color
Relevance: a tree is a confirmed DNA-based organism.