Fixing mail as a troubleshooting case study

We recently upgraded our firewall, and after much ado we’re in good shape again with regard to network traffic and basic security. The most recent bit of cleanup was that our mail stack wasn’t working off-campus. This post is the text of the message I sent to the students in the sysadmin group after fixing it today. I’ve anonymized it as best I can but otherwise left it unaltered.

tl;dr the firewall rule allowing DNS lookups on the CS subnet allowed only TCP requests, not TCP/UDP. Now it allows both.

Admins, here’s how I deduced this problem:

  • Using a VPN, I connected to an off-campus network. (VPNs as a privacy instrument are overrated, but they’re a handy tool for a sysadmin for other reasons.)
  • I verified what $concernedParty observed, that mail was down when I was on that network and thus apparently not on-campus.
  • I checked whether other services were also unavailable. While pinging cs dot earlham dot edu worked, nothing else seemed to (Jupyter was down, the website was down, etc.).
  • I tried pinging and ssh-ing to tools via IP address instead of FQDN. That worked, which made me think of DNS.
  • I checked the firewall rules, carefully. I observed that our other subnet, the cluster subnet, had a DNS pass rule that was set to allow both TCP and UDP traffic, so I tried ssh-ing to cluster (by FQDN, not IP address) and found that it worked.
  • I noticed that, strangely, the firewall rule allowing DNS lookups on the CS subnet via our DNS server allowed only TCP connections, not TCP/UDP. (I say “strangely” not because it didn’t allow both protocols but because, of the two, it accepted TCP instead of UDP, the protocol DNS uses more commonly.)
  • I updated the appropriate firewall rule to allow both TCP and UDP.
  • It seemed to work, so I sent a followup message to $concernedParty. And now here we are.
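
The UDP-versus-TCP distinction at the heart of these steps can also be checked directly. What follows is an illustrative Python sketch, not the procedure used above: it sends the same DNS query to a server over each protocol and reports which one gets an answer. The server address and hostname are placeholders, and the helper names (build_query, query_udp, query_tcp) are hypothetical, chosen only for this example.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query asking for the A record of `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is sent as a length byte followed by the label itself.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Zero byte ends the name; then QTYPE=A (1) and QCLASS=IN (1).
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)
    return header + question


def query_udp(server: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if `server` answers the query over UDP port 53."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # A genuine reply echoes the query ID in its first two bytes.
    return reply[:2] == query[:2]


def query_tcp(server: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if `server` answers the query over TCP port 53."""
    query = build_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            # Over TCP, a DNS message is preceded by a two-byte length.
            sock.sendall(struct.pack("!H", len(query)) + query)
            reply = sock.recv(4096)
    except OSError:
        return False
    # Skip the two-byte length prefix, then compare the query ID.
    return len(reply) >= 4 and reply[2:4] == query[:2]


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute the real DNS server
    # and a hostname it should be able to resolve.
    server, name = "192.0.2.53", "cs.example.edu"
    print("UDP answered:", query_udp(server, name))
    print("TCP answered:", query_tcp(server, name))
```

Run from the off-campus side of the firewall, a result where TCP answers and UDP does not would point to the same kind of protocol mismatch described above.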

This approach – searching for patterns to understand the scope of the problem, followed by narrowing down to a few specific options, and making small changes to minimize external consequences – has often served me well in both my sysadmin work and my work developing software.