Responding to emergencies in the Earlham CS server room

A group of unrelated problems overlapped in time last week and redirected my entire professional energy. It was the most informative week I've had in months, and I'm committing a record of it here for posterity. Each problem we admins encountered last week is briefly described below.

DNS on the CS servers

It began Tuesday morning with a polite request from a colleague to investigate why the CS servers were down. Unable to ping anything, I walked to the server room and found a powered-on but unresponsive server. I forced it off with the power button and brought it back up, but we still couldn't reach anything over ssh.

That still didn’t restore network access to our virtual machines, so I began investigating, starting by perusing /var/log.

An hour or so later that morning, two other admins joined me. One had an idea about where to look. We discovered that one of our admins had, innocently enough, used underscores in the hostnames assigned to two computers used by sysadmins-in-training. Underscores are not valid characters in hostnames under the DNS standards, so we fixed the names and restarted bind. That resolved the problem.

The long-term solution is to train our sysadmin students more thoroughly in DNS, DHCP, and related services, and to remind them consistently to RTFM.
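
As a concrete guardrail, we could also validate names before they ever reach a zone file. Here is a minimal sketch in Python, assuming hostnames should follow the RFC 1123 rules (letters, digits, and hyphens in each label, nothing else); the function name and the example hostnames are illustrative, not taken from our actual systems.

    import re

    # RFC 1123 hostname label: 1-63 characters, letters/digits/hyphens only,
    # and it may not start or end with a hyphen. Underscores are not allowed.
    LABEL_RE = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

    def is_valid_hostname(name: str) -> bool:
        """Return True if every dot-separated label in name is a legal hostname label."""
        if len(name) > 253:
            return False
        return all(LABEL_RE.match(label) for label in name.rstrip(".").split("."))

    # Illustrative names only: the underscored one would be rejected.
    for host in ("ws01.cs.example.edu", "ws_01.cs.example.edu"):
        print(host, "->", "ok" if is_valid_hostname(host) else "rejected")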

Fumes

Another issue materialized at the same time and worried me more: a foul smell in the server room, like some mix of burning and melting plastic. Was it related to the server failure? At the time, we didn't know. Using a fan and an open door, we ventilated the server room and investigated.

We were lucky.

By way of the sniff test, we discovered the smell came from components that had melted inside an out-of-use, but still energized, security keypad control box. I unplugged the box. The smell lingered but faded after a few days, at which point we filed a work order to have the box removed from the room altogether.

I want to emphasize our good fortune here: had the smell pointed to a problem in the servers or the power supply, we would have faced worse problems that might have lasted a long time and cost us a great deal. Our long-term fix should be to detect such problems automatically, at least well enough that someone can respond to them quickly.

Correlated outages

Finally, the day after those two events, and while we were still investigating them, we experienced a series of simultaneous outages across both subdomains we manage. Fortunately, each of these servers is mature and well configured, and in every case pushing the power button restored the system to normal.

Solving this problem turned out to be entertaining and enlightening as a matter of digital forensics.

Another admin student and I sat in my office for an hour and looked for clues. The system and security logs on each affected server consistently pointed us to the log for a script that cron runs once per minute.

This particular script checks, via an SNMP query, whether our UPS battery level is getting low and staying low, which would mean we've lost power and need to shut down the servers before the batteries are fully drained and we take a hard crash. The script behaved as designed, but we had made it too sensitive: it allowed only two minutes to elapse at a battery level below 80% before concluding that every server running it needed to shut down. Every server running the script shut down; the servers not running it stayed up.
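
For illustration, here is roughly what that logic looks like, sketched in Python rather than quoted from our actual script. The SNMP details are placeholders (the UPS hostname, community string, OID, and file path below are assumptions, not our real configuration); the part that matters is the threshold-plus-grace-period state machine, which a once-per-minute cron job has to keep on disk between runs.

    import subprocess
    import time
    from pathlib import Path

    THRESHOLD = 80        # percent; below this the battery counts as "low"
    GRACE_MINUTES = 2     # how long it must stay low before we shut down
    STATE_FILE = Path("/var/run/ups-low-since")  # placeholder path

    def battery_percent() -> int:
        """Placeholder for the real SNMP query against the UPS."""
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", "public", "-Ovq", "ups.example.edu",
             "1.3.6.1.2.1.33.1.2.4.0"],  # UPS-MIB estimated charge remaining (assumed)
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())

    def main() -> None:
        level = battery_percent()
        now = time.time()
        if level >= THRESHOLD:
            # Battery is healthy: forget any earlier "low" observation.
            STATE_FILE.unlink(missing_ok=True)
            return
        if not STATE_FILE.exists():
            # First minute we've seen it low: record when the dip started.
            STATE_FILE.write_text(str(now))
            return
        if now - float(STATE_FILE.read_text()) >= GRACE_MINUTES * 60:
            # Low for longer than the grace period: time to shut down cleanly.
            # The real script would invoke something like: shutdown -h +1
            print("battery low past grace period; would shut down now")

    if __name__ == "__main__":
        main()

In this framing, the fix is mostly a matter of raising GRACE_MINUTES so that a brief dip can't take the whole fleet down.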

We’re fixing the code now to allow more time before shutting down. We’re also investigating why the batteries started draining at that time: they never drained so much as to cut power to the entire system, but they clearly dipped for a time.

To my delight (if anything in this process can be a delight), a colleague who's across the ocean for a few weeks independently reached the same conclusion we did from my office.

Have a process

Details varied, but we walked through the same process to solve each problem:

  1. Observe the issue.
  2. Find a non-destructive short-term fix.
  3. Immediately, while it’s fresh on the mind and the log files are still current, gather data. Copy relevant log files for reference but look at the originals if you still can. If there’s a risk a server will go down again, copy relevant files off that server. Check both hardware and software for obvious problems.
  4. Look for patterns in the data. Time is a good way to make an initial cut: according to the timestamps in the logs, what happened around the time we started observing the problem? (A small script can help with this first pass; see the sketch after this list.)
  5. Based on the data and context you have, exercise some common sense and logic. Figure it out. Ask for help.
  6. Based on what you learn, implement the long-term fix.
  7. Update stakeholders. Be as specific as is useful – maybe a little more so. [Do this earlier if there are larger, longer-lasting failures to address.]
  8. Think of how to automate fixes for the problem and how to avoid similar problems in the future.
  9. Implement those changes.
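
For step 4, a small script can make that first time-based cut for you. The sketch below, in Python, prints the lines from a set of log files whose timestamps fall within a window around the time the problem was first observed. It assumes traditional syslog timestamps ("Mar 14 09:26:53") and fills in the incident's year, which is a simplification; the window size and filenames are arbitrary examples.

    import sys
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=15)  # how far on either side of the incident to look

    def lines_near(logfile: str, incident: datetime):
        """Yield lines from logfile whose syslog-style timestamp falls
        within WINDOW of the incident time. Unparseable lines are skipped."""
        with open(logfile, errors="replace") as fh:
            for line in fh:
                try:
                    # Traditional syslog prefix: "Mar 14 09:26:53" (no year).
                    stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S")
                    stamp = stamp.replace(year=incident.year)
                except ValueError:
                    continue
                if abs(stamp - incident) <= WINDOW:
                    yield line.rstrip("\n")

    if __name__ == "__main__":
        # Usage: python log_window.py "YYYY-MM-DD HH:MM" /var/log/syslog /var/log/auth.log
        incident = datetime.strptime(sys.argv[1], "%Y-%m-%d %H:%M")
        for path in sys.argv[2:]:
            for line in lines_near(path, incident):
                print(f"{path}: {line}")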

For us, the early stages will go faster once we finish work on our monitoring and notification systems, though better monitoring would not have helped much in this case. Even with incomplete tooling, we discovered each problem I've described within minutes or hours, because of how frequently and intensively the Earlham community uses these systems.

I would add that, based on my observations, it's easy to become sloppy about what goes into a log file and what doesn't. Cleaning up our logging (carefully and conservatively) will be added to the student sysadmins' task list.

Work together

We admins left for the weekend with no outstanding disasters on the table, after a week in which three unrelated, time-consuming problems surfaced. That's tiring but incredibly satisfying, and it's to the credit of all the student admins and my colleagues on the CS faculty, whose collective patience, insight, and persistence made it work.