Responding to emergencies in the Earlham CS server room

A group of unrelated problems overlapped in time last week and redirected my entire professional energy. It was the most informative week I’ve had in months, and I’m committing a record of it here for posterity. Each problem we admins encountered is briefly described below.

DNS on the CS servers

It began Tuesday morning with a polite request from a colleague to investigate why the CS servers were down. Unable to ping anything, I walked to the server room and found a powered-on but non-responsive server. I crashed it with the power button and brought it back up, but we still couldn’t reach anything over ssh.

The reboot hadn’t restored network access to our virtual machines, so I began investigating, starting by perusing /var/log.

An hour or so later that morning, I was joined by two other admins, one of whom had an idea about where to look. We discovered that one of our admins had, innocently enough, used underscores in the hostnames assigned to two computers used by sysadmins-in-training. Underscores are not valid in hostnames (hostname labels may contain only letters, digits, and hyphens), so we fixed the names and restarted bind. That resolved the problem.
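
A small pre-flight check would have caught this before it bit us. Below is a minimal sketch, not our actual tooling, that flags zone-file owner names violating the letters-digits-hyphens rule; the zone-file parsing is deliberately simplified, and BIND’s own named-checkzone can do stricter validation than this.

```python
# Hedged sketch of a hostname sanity check for a zone file.
# The parsing is simplified (it ignores $INCLUDE, multi-line records, etc.)
# and the file path is hypothetical.
import re
import sys

# RFC 952/1123 hostname labels: letters, digits, and hyphens,
# with no leading or trailing hyphen. Underscores don't qualify.
VALID_LABEL = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

def bad_hostnames(zone_path):
    with open(zone_path) as zone:
        for lineno, raw in enumerate(zone, 1):
            line = raw.split(";", 1)[0].rstrip()   # strip comments
            if not line or line[0].isspace() or line.startswith(("$", "@")):
                continue  # blank line, continuation record, or directive
            owner = line.split()[0]                # first field is the owner name
            labels = owner.rstrip(".").split(".")
            if not all(VALID_LABEL.match(label) for label in labels):
                yield lineno, owner

if __name__ == "__main__":
    # e.g. python check_hostnames.py /etc/bind/db.cs.example.edu
    for lineno, owner in bad_hostnames(sys.argv[1]):
        print(f"line {lineno}: suspicious hostname {owner!r}")
```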

The long-term solution to this is to train our sysadmin students in DNS, DHCP, etc. more thoroughly — and to remind them consistently to RTFM.

Fumes

Another issue materialized at the same time and worried me more: a foul smell in the server room, like some mix of burning and melting plastic. Was it related to the server failure? At the time, we didn’t know. Using a fan and an open door, we ventilated the server room and investigated.

We were lucky.

By way of the sniff test, we discovered the smell came from components that had melted in an out-of-use, but still energized, security keypad control box. I unplugged the box. The smell lingered but faded after a few days, at which point we filed a work order to have it removed from the room altogether.

I want to emphasize our good fortune here: had the smell pointed to a problem in the servers or the power supply, we would have faced much worse trouble, the kind that might have lasted a long time and cost us a great deal. Our long-term fix should be to detect such problems automatically, at least well enough that someone can respond to them quickly.

Correlated outages

Finally, the day after those two events, and while we were still investigating them, we experienced a cluster of near-simultaneous outages across both subdomains we manage. Fortunately, each of these servers is mature and well-configured, and in every case pushing the power button restored the system to normal.

Solving this problem turned out to be entertaining and enlightening as a matter of digital forensics.

Another admin student and I sat in my office for an hour and looked for clues. The system and security logs on each affected server consistently pointed us toward the log file for a script that runs once per minute under cron.

This particular script checks (via an SNMP query) whether our UPS battery level is getting low and staying low – i.e., whether we’ve lost power and need to shut down the servers before the batteries drain completely and we take a hard crash. The script acted properly, but we’d made it too sensitive: it allowed only two minutes below an 80% charge level before concluding that every server running it needed to shut down. That is exactly what happened on each server running the script – and didn’t happen on the servers not running it.

We’re fixing the script now to allow more time before shutting down. We’re also investigating why the batteries started draining in the first place: they never drained enough to cut power to the whole system, but they clearly dipped for a while.
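
For the curious, the shape of the check looks roughly like the sketch below. This is hedged and illustrative, not the script we actually run under cron: the OID, UPS hostname, marker-file path, and thresholds are all invented. The marker file is what lets a once-per-minute job notice that the charge has stayed low across several runs rather than reacting to a single reading.

```python
# Hypothetical once-a-minute cron job: shut down if the UPS battery has been
# below a threshold for longer than a grace period. Names, OID, host, and
# numbers are illustrative, not our real configuration.
import os
import subprocess
import time

STATE_FILE = "/var/tmp/ups_low_since"   # marker recording when the low reading began
CHARGE_THRESHOLD = 80                   # percent
GRACE_PERIOD = 15 * 60                  # seconds to wait below threshold before acting

def battery_percent():
    # Illustrative snmpget call; the community, host, and OID would differ in practice.
    out = subprocess.run(
        ["snmpget", "-v1", "-c", "public", "-Oqv",
         "ups.example.edu", "1.3.6.1.2.1.33.1.2.4.0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())

def main():
    if battery_percent() >= CHARGE_THRESHOLD:
        if os.path.exists(STATE_FILE):
            os.remove(STATE_FILE)        # charge has recovered; forget the low streak
        return
    if not os.path.exists(STATE_FILE):
        with open(STATE_FILE, "w") as f:
            f.write(str(time.time()))    # first reading below the threshold
        return
    with open(STATE_FILE) as f:
        low_since = float(f.read())
    if time.time() - low_since >= GRACE_PERIOD:
        # Battery has stayed low for the whole grace period: shut down cleanly.
        subprocess.run(["/sbin/shutdown", "-h", "+1", "UPS battery low"], check=False)

if __name__ == "__main__":
    main()
```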

To my delight (if anything in this process can be a delight), a colleague who’s across the ocean for a few weeks independently reached the same conclusion that we had in my office.

Have a process

Details varied, but we walked through the same process to solve each problem:

  1. Observe the issue.
  2. Find a non-destructive short-term fix.
  3. Immediately, while it’s fresh in your mind and the log files are still current, gather data. Copy relevant log files for reference, but look at the originals while you still can. If there’s a risk a server will go down again, copy the relevant files off that server. Check both hardware and software for obvious problems.
  4. Look for patterns in the data. Time is a good way to make an initial cut: according to the timestamps in the logs, what happened around the time we first noticed the problem? (See the sketch after this list.)
  5. Based on the data and context you have, exercise some common sense and logic. Figure it out. Ask for help.
  6. Based on what you learn, implement the long-term fix.
  7. Update stakeholders. Be as specific as is useful – maybe a little more so. [Do this earlier if there are larger, longer-lasting failures to address.]
  8. Think of how to automate fixes for the problem and how to avoid similar problems in the future.
  9. Implement those changes.
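
For step 4, the first cut usually looks something like this: pull every syslog-style line within a window around the moment we first noticed trouble. The filename, incident time, window size, and year assumption below are all made up for illustration.

```python
# Hypothetical first-pass filter over a copied log: keep only lines whose
# syslog timestamp falls within a window around the incident.
from datetime import datetime, timedelta

YEAR = 2017                                 # syslog timestamps omit the year
INCIDENT = datetime(YEAR, 10, 24, 9, 15)    # when we first observed the outage
WINDOW = timedelta(minutes=30)              # look 30 minutes to either side

def lines_near_incident(path):
    with open(path, errors="replace") as log:
        for line in log:
            # Classic syslog prefix: "Oct 24 09:05:31 hostname service: ..."
            try:
                stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(year=YEAR)
            except ValueError:
                continue                    # skip lines without a parseable timestamp
            if abs(stamp - INCIDENT) <= WINDOW:
                yield line.rstrip()

if __name__ == "__main__":
    for line in lines_near_incident("copies/syslog"):
        print(line)
```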

For us, the early stages will go faster once we finish our monitoring and notification systems, though that would not have helped much in this case: even with incomplete monitoring, we discovered each of the problems I’ve described within minutes or hours, simply because of how frequently and intensively the Earlham community uses these systems.

I would add that, based on my observations, it’s easy to get sloppy about what goes into a log file and what doesn’t. Cleaning those up (carefully and conservatively) is going onto the list of tasks for the student sysadmins to work on.

Work together

We admins left for the weekend with no outstanding disasters on the table, after a week in which three unrelated, time-consuming problems surfaced. That’s tiring but incredibly satisfying. The credit belongs to all the student admins and my colleagues on the CS faculty, whose collective patience, insight, and persistence made it work.

Misadventures in source control

Or, what did Present!Me ever do to Past!Me?

I observed a while ago on Twitter that learning git (for all its headaches) was valuable for me.

This was on my mind because I recently set about curating (and selecting for either skill review or public presentation) all the personal software projects I worked on as a student. It was a vivid reminder of how much I learned then and in the few years since.

Every day since then I’ve noticed more of my old version control errors, and at some point it seemed worth gathering them into one post. Here is a non-comprehensive list of the mistakes I found in my workflows from years past:

  • a bunch of directories called archive, sometimes nested two or three deep
  • an inconsistent naming scheme, so that archive and old (in several capitalization flavors) sat side by side
  • combinations of the first two: I kid you not, cs350/old-string-compare/archive/archive/old is a path to some files in my (actual, high-level, left-as-it-was-on-final-exam-day) archive
  • multiple versions OF THE SAME REPO with differing levels of completion, features, etc. (sure, branching is tricky but… really?)
  • no apparent rhyme or reason in the sorting at all – a program to find the area under a curve by dividing it into trapezoids and summing their areas sat next to a program to return a list of all primes less than X, and next to both of those was a project entirely about running software through CUDA, which is a platform, not a problem
  • timestamps long since lost because, when I was initially archiving, I copied files through various servers without preserving metadata (see the sketch after this list)
  • inconsistent use of READMEs that would inform me of, say, how to compile a program with mpicc rather than gcc, or how to submit a job with qsub
  • files stored on different servers with no real reason for any of them to be in any particular place
  • binaries in some directories but not others
  • Makefiles in some directories but not others
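
The timestamp item has a simple fix worth spelling out, as promised above: preserve metadata when you copy. Here is a hedged sketch with made-up paths (shutil.copy2 keeps modification times; a bare copy, or scp without -p, does not):

```python
# Hedged sketch: archive a project directory while keeping file timestamps.
# Paths are invented; shutil.copy2 preserves modification times and
# permission bits, unlike a bare byte-for-byte copy.
import shutil

shutil.copytree(
    "cs350/old-string-compare",
    "/mnt/archive/cs350/old-string-compare",
    copy_function=shutil.copy2,  # copytree's default, made explicit here
)
```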

(You may have noticed that parallelism is a recurring theme here; that’s because it was in a parallel and distributed computing course that I realized my workflows weren’t right. I didn’t learn how to fix the problem in time to go from a B to an A in the course, but after that class I did start improving my efficiency and consistency.)

To be fair to myself and to anyone who might find this eerily familiar: I had never programmed before college, so much of my time there was spent catching up on basics that a lot of people already knew when they arrived. Earlham is a place that values experimentation, learning by doing, jumping into the pool rather than (or above and beyond) reading a book about swimming. Which is good! I learned vastly more that way than I might have otherwise.

What’s more, I understand that git isn’t easy to pick up quickly and isn’t especially accessible to newcomers. Still, I can’t help comparing my version-controlled work with the old make-it-up-as-you-go approach, and the difference is stark. Git is well worth the time to learn.

Git and related software carpentry were not things I learned until quite a while into my education, and that’s a bit of a shame: if you’re trying to figure out (as I clearly was) how to manage a workflow, name files sensibly, and so on at the same time that you’re learning to code, you end up in a thicket of barely-sorted, unhelpfully-named, badly-organized code.

And then neither becomes especially fun, frankly.

I’ve enjoyed the coding I’ve done since about my junior year in college much more than before that, because I finally learned to get out of my own way.

The perks of being a VM

Several of the CS department’s servers are virtual machines. While running VMs adds complexity, it also lets you do things like octuple* a system’s RAM in five minutes from your laptop.

For context, Earlham CS runs a Jupyterhub server for first- and second-semester CS students. We want to provide a programming environment (in this case Python, a terminal, and a few other languages) so students can focus on programming instead of administration and environment setup. Jupyter is handy for that purpose.

The issue: each notebook takes a relatively large amount of RAM, and there are 60 or so intro CS students here. The Xen virtual machine hosting Jupyter was simply not equipped for that load. So, at the request of my colleagues teaching the course, I visited a lab today. After observing the problem, we took five minutes to shut the server down, destroy the volume, change a single number in a single config file, and bring it all back to life with a boosted configuration. We’ve had no additional problems – so far. 🙂
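
For a sense of how small the change was: in a Xen guest’s xl config file, the RAM allocation is one line. The excerpt below is purely illustrative, with invented numbers rather than our real configuration.

```
# Hypothetical excerpt from the Jupyter VM's xl config file; numbers invented.
memory = 32768    # RAM in MiB assigned to the guest; the single number we changed
vcpus  = 8
```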

Running a VM is frequently more complex than running on bare hardware. But the alternative is this:

I wish I had some of the “upcoming maintenance” email notifications we sent out in my ECCS sysadmin days for comparison. They were basically “no email, no websites for several days while we rebuild this server from parts, mmmkay?”

@chrishardie

Because we do so much of our administration in software, we’ve mostly avoided that problem in recent years. The closest we’ve come to scrambling over hardware lately was recovering from disk failures after a power outage over the summer, which meant sending a lot of “sorry, X is down” emails. I wouldn’t want that to be our approach to managing every server all the time.

(Of course there are many other alternatives, but running Xen VMs serves our purposes nicely. It’s also, for many reasons, good practice for our student system administrators.)

*When I originally tweeted about this, I said we had quadrupled the RAM. In fact, a previously arranged doubling had been specified in the VM’s config file but never applied. Before restarting the machine, we decided to boost it further still, quadrupling the doubled amount, for eight times the original RAM.