I was reading a recent Rachel By The Bay post in my RSS reader and this struck me:
Some items from my “reliability list”
It should not be surprising that patterns start to emerge after you’ve dealt with enough failures in a given domain. I’ve had an informal list bouncing around inside my head for years. Now and then, something new to me will pop up, and that’ll mesh up with some other recollections, and sometimes that yields another entry.
…Item: Rollbacks need to be possible
This one sounds simple until you realize someone’s violated it. It means, in short: if you’re on version 20, and then start pushing version 21, and for some reason can’t go back to version 20, you’ve failed. You took some shortcut, or forgot about going from A to AB to B, or did break-before-make, or any other number of things.
That paragraph struck me because I’m about one week removed from making that very mistake.
Until last week, we’d been running a ten-year-old version of the pfSense firewall software on a ten-year-old server (32-bit architecture CPU! in a server!). I made a firewall upgrade one of our top summer priorities.
The problem was that I got in a hurry. We tried to upgrade without taking careful enough notes about how to reset to our previous configuration. We combined that with years’ worth of lost knowledge about the interoperability of the Computer Science Department’s subnets with the Earlham ITS network. That produced a couple of days of downtime and added stress.
We talked with ITS. We did research. I sat in a server room till late at night. Ultimately we reverted to the old firewall, allowing our mail and other queues to be processed while we figured out what went wrong in the new system.
The day after that, we started our second attempt. We set up and configured the new firewall alongside the old one, checking and double-checking every network setting. Then we simply swapped network cables. It was almost laughably anticlimactic.
In short, attempting to move directly from A to B generated hours of downtime, but when we went from A to AB, and then from AB to B, the downtime was mere seconds.
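For readers who think better in code, here is a rough sketch of that A->AB->B idea. It is a toy model, not what we actually ran: the `Deployment` class, its methods, and `mismatched_settings` are all hypothetical names invented for illustration, and a real firewall swap involves cables and config files rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical model of a service with one active system and idle standbys."""
    active: str = "A"                               # system currently carrying traffic
    standing_by: set = field(default_factory=set)   # systems installed but idle

    def stage(self, name: str) -> None:
        """A -> AB: bring the new system up next to the old one."""
        self.standing_by.add(name)

    def switch_to(self, name: str) -> None:
        """Swap traffic to a staged system; the previous one stays on standby."""
        if name not in self.standing_by:
            # Switching to something that was never staged is break-before-make.
            raise RuntimeError(f"{name} is not staged; refusing to switch")
        self.standing_by.remove(name)
        self.standing_by.add(self.active)
        self.active = name


def mismatched_settings(old: dict, new: dict) -> list:
    """Return the names of settings whose values differ between two configs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


if __name__ == "__main__":
    d = Deployment()
    d.stage("B")        # A -> AB
    d.switch_to("B")    # AB -> B, with A still available
    d.switch_to("A")    # rolling back is the same cheap operation
```

The point of the sketch is that the old system is never torn down before the new one is proven, so going back costs the same as going forward.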
We learned a lot from the experience:
- The A->AB->B pattern works.
- ECCS and ITS now understand our network connections much more deeply than we did three weeks ago.
- Said network knowledge is distributed across students, staff, and faculty.
- We were vindicated in our wisest decision: trying this in July, when only a handful of people had a day-to-day dependence on our network and we had time to recover.
The bigger-picture lesson is this: We in tech often want to get something done real fast, and it’s all too easy to conflate that with getting it done in a hurry. If you’re working on something like this, take some time to plan a little bit in advance. Make sure to allow yourself an A->AB->B path. A little work up front can save you a lot later.
Or, as one mentor of mine has put it in the context of software development:
Days of debugging can save you from hours of design!