A -> AB -> B

I was reading a recent Rachel By The Bay post in my RSS reader and this struck me:

Some items from my “reliability list”

It should not be surprising that patterns start to emerge after you’ve dealt with enough failures in a given domain. I’ve had an informal list bouncing around inside my head for years. Now and then, something new to me will pop up, and that’ll mesh up with some other recollections, and sometimes that yields another entry.

Item: Rollbacks need to be possible

This one sounds simple until you realize someone’s violated it. It means, in short: if you’re on version 20, and then start pushing version 21, and for some reason can’t go back to version 20, you’ve failed. You took some shortcut, or forgot about going from A to AB to B, or did break-before-make, or any other number of things.

That paragraph struck me because I’m about one week removed from making that very mistake.

Until last week, we’d been running a ten-year-old version of the pfSense firewall software on a ten-year-old server (32-bit architecture CPU! in a server!). I made a firewall upgrade one of our top summer priorities.

The problem was that I got in a hurry. We tried to upgrade without taking careful enough notes about how to revert to our previous configuration. Compounding that was years’ worth of lost knowledge about how the Computer Science Department’s subnets interoperate with the Earlham ITS network. The result was a couple of days of downtime and added stress.

We talked with ITS. We did research. I sat in a server room till late at night. Ultimately we reverted to the old firewall, allowing our mail and other queues to be processed while we figured out what had gone wrong in the new system.

The day after that we started our second attempt. We set up and configured the new one alongside the old, checking and double-checking every network setting. Then we simply swapped network cables. It was almost laughably anticlimactic.

In short, attempting to move directly from A to B generated hours of downtime, but when we went from A to AB, and then from AB to B, it was mere seconds.
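
Here’s the shape of that pattern as a minimal Python sketch. Every function in it is a hypothetical stand-in (for us, “cut over” meant literally swapping network cables), but the structure is the same for any make-before-break migration:

```python
"""A minimal sketch of the A -> AB -> B pattern (make-before-break).

All four helpers are hypothetical stand-ins for whatever "bring up,"
"verify," and "swap" mean in your environment.
"""

def bring_up(system: str) -> None:
    print(f"bringing up {system} alongside the current one")

def health_check(system: str) -> bool:
    print(f"checking {system}")
    return True  # stand-in: run your real checks here

def cut_over(from_sys: str, to_sys: str) -> None:
    print(f"cutting over: {from_sys} -> {to_sys}")

def retire(system: str) -> None:
    print(f"retiring {system}")

def migrate(old: str, new: str) -> None:
    bring_up(new)                 # A -> AB: the old system keeps serving
    if not health_check(new):     # verify B while A is still live;
        retire(new)               # bailing out here costs nothing
        raise RuntimeError(f"{new} failed checks; still running {old}")
    cut_over(old, new)            # AB -> B: a small, fast, reversible step
    if not health_check(new):     # rollback is just cutting back to A
        cut_over(new, old)
        raise RuntimeError(f"cutover to {new} failed; rolled back to {old}")
    retire(old)                   # only now does A actually go away

migrate("old-firewall", "new-firewall")
```

Nothing before the final retire(old) is irreversible: at every step there is still a working configuration to fall back to.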

We learned a lot from the experience:

  1. We learned the A->AB->B pattern firsthand.
  2. ECCS and ITS now understand our network connections much more deeply than either group did three weeks ago.
  3. Said network knowledge is distributed across students, staff, and faculty.
  4. Our wisest decision was vindicated: trying this in July, when only a handful of people had a day-to-day dependence on our network and we had time to recover.

A more big-picture lesson is this: we in tech often want to get something done fast, and it’s all too easy to conflate speed with haste. If you’re working on something like this, take a little time to plan in advance. Make sure to allow yourself an A->AB->B path. A little work upfront can save you a lot later.

Or, as one mentor of mine has put it in the context of software development:

Days of debugging can save you from hours of design!

Fixing mail as a troubleshooting case study

We recently upgraded our firewall, and after much ado we’re in good shape again with regard to network traffic and basic security. The most recent bit of cleanup was that our mail stack wasn’t working off-campus. This post is the text of the message I sent to the students in the sysadmin group after fixing it today. I’ve anonymized it as best I can but otherwise left it unaltered.

tl;dr: the firewall rule for DNS lookups on the CS subnet allowed only TCP requests, not both TCP and UDP. Now it allows both.

Admins, here’s how I diagnosed the problem:

  • Using a VPN, I connected to an off-campus network. (VPNs are overrated as privacy instruments, but they’re a handy tool for a sysadmin for other reasons.)
  • I verified what $concernedParty had observed: with my traffic apparently coming from off-campus, mail was down.
  • I checked whether other services were also unavailable. While pinging cs dot earlham dot edu worked, nothing else seemed to (Jupyter was down, website down, etc.)
  • I tried pinging and ssh-ing to those same services via IP address instead of FQDN. That worked, which made me think of DNS.
  • I checked the firewall rules carefully. I observed that our other subnet, the cluster subnet, had a DNS pass rule set to allow both TCP and UDP traffic, so I tried ssh-ing to cluster (by FQDN, not IP address) and found that it worked.
  • I noticed that, strangely, the firewall rule allowing DNS lookups on the CS subnet via our DNS server permitted only TCP connections, not TCP/UDP. (I say “strange” not because the rule covered a single protocol but because the protocol it covered was TCP rather than UDP, DNS’s usual transport. A probe like the sketch after this list makes that kind of mismatch easy to confirm.)
  • I updated the appropriate firewall rule to allow both TCP and UDP.
  • It seemed to work, so I sent a follow-up message to $concernedParty. And now here we are.
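
If you ever want to confirm a protocol mismatch like this yourself, here’s a rough Python sketch that sends the same DNS query over UDP and then over TCP. The resolver address and hostname below are placeholders, not our real ones; if one transport answers and the other times out, a firewall rule somewhere in between is a likely suspect.

```python
"""Send the same DNS A-record query over UDP and over TCP.

RESOLVER and NAME are placeholders; substitute your own DNS server
and a name it should be able to resolve.
"""
import socket
import struct

RESOLVER = "192.0.2.53"   # placeholder resolver address
NAME = "www.example.edu"  # placeholder hostname

def build_query(name: str) -> bytes:
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Question: length-prefixed labels, then QTYPE=A (1), QCLASS=IN (1).
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)

def probe_udp(query: bytes) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(3)
        s.sendto(query, (RESOLVER, 53))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False

def probe_tcp(query: bytes) -> bool:
    # DNS over TCP prefixes the message with a two-byte length field.
    try:
        with socket.create_connection((RESOLVER, 53), timeout=3) as s:
            s.sendall(struct.pack("!H", len(query)) + query)
            return len(s.recv(2)) > 0
    except OSError:
        return False

query = build_query(NAME)
print("UDP lookup:", "answered" if probe_udp(query) else "no answer (filtered?)")
print("TCP lookup:", "answered" if probe_tcp(query) else "no answer (filtered?)")
```

In our case the surprising part was the direction of the failure: the UDP probe was the one being dropped, and since ordinary lookups go over UDP first, name resolution on the subnet failed.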

This approach – searching for patterns to understand the scope of the problem, narrowing down to a few specific options, and making small changes to minimize external consequences – has often served me well in both my sysadmin work and my work developing software.