Computing lessons from DNA analysis experiments

I’ve been working with my colleagues in Earlham’s Icelandic Field Science program on a workflow for DNA analysis, about which I hope to share more later. (I’ve previously shared my work with them on the Field Day Android app.)

My focus has been heavily experimental and computational: run one workflow using one dataset, check the result, adjust a few “dials”, and run it again. When we’re successful, we can often automate the work through a series of scripts.

At the same time, we’ve been trying to get our new “phat node” working to handle jobs like this faster in the future.

Definitions vary by location, context, etc., but we define a “phat node” or “fat node” as a server with a very high ratio of (storage + RAM) to CPU. In other words, we want to load a lot of data into RAM and plow through it on however many cores we have. A lot of the bioinformatics work we do lends itself to such a workflow.

All this work should ultimately redound to the research and educational benefit of the college.

It’s also been invaluable for me as a learning experience in software engineering and systems architecture. Here are a few of the deep patterns that experience illustrated most clearly to me:

  • Hardware is good: If you have more RAM and processing power, you can run a job in less time! Who knew?
  • Work locally: Locality is an important principle of computer science – basically, keep your data as close to your processing power as system constraints allow. In this case, we got a 36% performance improvement just by moving data from NFS mounts to local storage (the job script sketched after this list shows the trick).
  • Abstractions can get you far: To wit, define a variable once and reuse it. We have several related scripts that refer to the same files, for example, and for a while we had to update each script with every iteration to keep them consistent. We took a few hours to build and test a config file, which resolved a lot of silly errors like that (a sketch follows this list). This doesn’t shorten any one job, but it vastly simplifies scaling and replicability.
  • Work just takes a while: The actual time Torque (our choice of scheduler) spends running our job is a small percentage of the overall time we spend shaping the problem:
    • buying and provisioning machines
    • learning the science
    • figuring out what questions to ask
    • consulting with colleagues
    • designing the workflow
    • developing the data dictionary
    • fiddling with configs
    • testing – over, and over, and over again
    • if running a job at a bigger supercomputing facility, you may also have to consider things like waiting for CPU cycles to become available; we are generally our systems’ only users, so this wasn’t a constraint for us
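
Here is the config-file idea in miniature. This is a hedged sketch: the file name, variable names, and paths below are invented for illustration, not our actual setup.

```bash
# config.sh -- one shared home for the paths and knobs every script needs.
# (All names and paths here are illustrative.)
DATA_DIR="/scratch/dna-workflow/data"
REF_GENOME="${DATA_DIR}/reference/ref.fasta"
RESULTS_DIR="/scratch/dna-workflow/results"
THREADS=16
```

Each script in the workflow then sources it instead of hard-coding its own copies:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pull in the shared definitions; edit config.sh once and every script follows.
source "$(dirname "$0")/config.sh"

mkdir -p "${RESULTS_DIR}"
echo "Processing against ${REF_GENOME} with ${THREADS} threads"
```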
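And here, roughly, is the shape of a job as Torque sees it, local staging included. Again a sketch: the resource requests and paths are invented, and per-job scratch behavior (the $TMPDIR below) varies by site configuration.

```bash
#!/bin/bash
#PBS -N dna-workflow-test
#PBS -l nodes=1:ppn=16
#PBS -l mem=64gb
#PBS -l walltime=08:00:00
#PBS -j oe

# Stage input off the NFS mount onto node-local scratch first;
# moving I/O off NFS was worth a large chunk of our speedup.
cp -r /mounts/nfs/dna-data "${TMPDIR}/data"
cd "${TMPDIR}"

# Run the workflow script from the directory the job was submitted from.
"${PBS_O_WORKDIR}/run-workflow.sh" --input data --threads 16

# Copy results back to shared storage before the job's scratch space vanishes.
cp -r results /mounts/nfs/dna-results/
```

Submit it with qsub, then go work on one of the many other items in the list above while it runs.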

A lot of this is (for computer scientists, software engineers, etc.) common sense, but taking care to apply that common sense can be critical for doing big interesting work.

The punchline of it all? We managed to reduce the time – walltime, for fellow HPC geeks – required to run this example workflow from a little over 8 hours to 3.5 hours. Just as importantly, we developed a bunch of new knowledge in the process. (I’ve said almost nothing here about microbiology, for example, and learning a snippet of that has been critical to this work.) That lays a strong foundation for the next several steps in this project.

If you read all this, here’s a nice picture of some trees as a token of my thanks:

[Image: trees starting to show fall color]
Relevance: a tree is a confirmed DNA-based organism.

A -> AB -> B

I was reading a recent Rachel By The Bay post in my RSS reader and this struck me:

Some items from my “reliability list”

It should not be surprising that patterns start to emerge after you’ve dealt with enough failures in a given domain. I’ve had an informal list bouncing around inside my head for years. Now and then, something new to me will pop up, and that’ll mesh up with some other recollections, and sometimes that yields another entry.

Item: Rollbacks need to be possible

This one sounds simple until you realize someone’s violated it. It means, in short: if you’re on version 20, and then start pushing version 21, and for some reason can’t go back to version 20, you’ve failed. You took some shortcut, or forgot about going from A to AB to B, or did break-before-make, or any other number of things.

That paragraph struck me because I’m about one week removed from making that very mistake.

Until last week, we’d been running a ten-year-old version of the pfSense firewall software on a ten-year-old server (32-bit architecture CPU! in a server!). I made a firewall upgrade one of our top summer priorities.

The problem was that I got in a hurry. We tried to upgrade without taking careful enough notes on how to restore our previous configuration. Compounding that, we had lost years’ worth of knowledge about how the Computer Science Department’s subnets interoperate with the Earlham ITS network. The result was a couple of days of downtime and added stress.

We talked with ITS. We did research. I sat in a server room till late at night. Ultimately we reverted to the old firewall, allowing our mail and other queues to be processed while we figured out what had gone wrong in the new system.

The day after that we started our second attempt. We set up and configured the new one alongside the old, checking and double-checking every network setting. Then we simply swapped network cables. It was almost laughably anticlimactic.

In short, attempting to move directly from A to B generated hours of downtime, but when we went from A to AB, and then from AB to B, it was mere seconds.
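
The same shape applies well beyond network cables. Here is a generic sketch of make-before-break for a software deploy; the paths and version numbers are invented for illustration:

```bash
# A: clients follow the symlink /srv/app -> /srv/app-v20

# A -> AB: stand up v21 alongside v20 and test it before anything points at it
cp -r /srv/staging/app-v21 /srv/app-v21
/srv/app-v21/selftest.sh

# AB -> B: the cutover itself is one small, fast step (our "swap the cables")
ln -sfn /srv/app-v21 /srv/app

# Rollback stays possible for as long as v20 stays on disk:
#   ln -sfn /srv/app-v20 /srv/app
```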

We learned a lot from the experience:

  1. We internalized the A->AB->B pattern.
  2. ECCS and ITS now understand our network connections much more deeply than we did three weeks ago.
  3. Said network knowledge is distributed across students, staff, and faculty.
  4. We were vindicated in our wisest decision: trying this in July, when only a handful of people had a day-to-day dependence on our network and we had time to recover.

A more big-picture lesson is this: we in tech often want to get something done real fast, and it’s all too easy to conflate that with getting it done in a hurry. If you’re working on something like this, take a little time to plan in advance. Make sure to allow yourself an A->AB->B path. A little work upfront can save you a lot later.

Or, as one mentor of mine has put it in the context of software development:

Days of debugging can save you from hours of design!

Misadventures in source control

Or, what did Present!Me ever do to Past!Me?

I observed a while ago on Twitter that learning git (for all its headaches) was valuable for me.

This was on my mind because I recently set about curating (and selecting for either skill review or public presentation) all the personal software projects I worked on as a student. It was a vivid reminder of how much I learned then and in the few years since.

Every day since, I’ve noticed more of my old version control errors, and at some point I thought it worth gathering my observations into one post. Here is a non-comprehensive list of the mistakes I found in my workflows from years past:

  • a bunch of directories called archive, sometimes nested two or three deep
  • an inconsistent naming scheme, such that archive and old in multiple capitalization flavors sat side by side
  • combinations of the first two: I kid you not, cs350/old-string-compare/archive/archive/old is a path to some files in my (actual, high-level, left-as-it-was-on-final-exam-day) archive
  • multiple versions OF THE SAME REPO with differing levels of completion, features, etc. (sure, branching is tricky but… really?)
  • no apparent rhyme or reason to the sorting at all – a program to find the area under a curve by dividing it into trapezoids and summing their areas sat next to a program to return a list of all primes less than X, and next to both of those was a project entirely about running software through CUDA, which is a platform, not a problem
  • timestamps long since lost because I copied files through various servers without preserving metadata when I was initially archiving
  • inconsistent use of READMEs that would inform me of, say, how to compile a program with mpicc rather than gcc or how to submit a job with qsub
  • files stored on different servers with no real reason for any of them to be in any particular place
  • binaries in some directories but not others
  • Makefiles in some directories but not others

(You may have noticed that parallelism is a recurring theme here, and that’s because it was in a parallel and distributed computing course that I realized my workflows weren’t right. I didn’t learn how to fix the problem in time to go from a B to an A in the course, but after that class I did start improving my efficiency and consistency.)

To be fair to myself and to anyone who might find this eerily familiar: I never learned programming before college, so much of my college years were spent catching up on basics that a lot of people already knew when they arrived. Earlham is a place that values experimentation and learning by doing – jumping into the pool rather than (or above and beyond) reading a book about swimming. Which is good! I learned vastly more that way than I might have otherwise.

What’s more, I understand that git isn’t easy to pick up quickly and poses accessibility problems for newcomers. Still, I can’t help but look back at my own work and conclude that version control is vastly superior to making a system up as you go. It’s well worth the time to learn.
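
If that list looks eerily familiar, here is a sketch of how a handful of basic git commands replace nearly every item on it (the project and branch names are invented):

```bash
# One repo, tracked from day one: no archive/archive/old nesting.
git init string-compare && cd string-compare
# ...write some code...
git add .
git commit -m "Initial working version"

# Experiments live on branches, not in a second copy of the whole repo.
git checkout -b faster-compare
# ...hack on the experiment...
git commit -am "Try a faster comparison loop"

# Milestones get tags instead of frozen directory trees.
git checkout -    # back to the branch we started on
git tag final-exam-day

# And history preserves the dates I lost copying files between servers.
git log
```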

Git and related software carpentry weren’t things I learned until quite a while into my education. And that’s a bit of a shame, to me: if you’re trying to figure out (as I clearly was) how to manage a workflow, name files sensibly, and so on concurrently with learning to code, you end up in a thicket of barely-sorted, unhelpfully-named, badly-organized code.

And then neither becomes especially fun, frankly.

I’ve enjoyed the coding I’ve done since about my junior year in college much more than before that, because I finally learned to get out of my own way.