Why Earlham CS restarts its servers once per semester

Last weekend, the CS sysadmin students performed a complete shutdown and restart of the servers we manage. We do this in the last month of every semester, and it’s a consistently valuable experience for us.

The department manages two subnets full of servers: cs.earlham.edu and cluster.earlham.edu.

  • On the cs.earlham.edu side, we are (funnily enough) mostly CS department-focused: the website is there, as are the software tools students use in their intro courses, the wiki we use for information management, and the tools for senior projects. These services mostly run on virtual machines.
  • In the cluster domain, we support scientific and high-performance computing in other departments, most commonly chemistry, physics, and biology. That includes parallel processing across a tightly linked “cluster” of small servers as well as “fat nodes” that provide large amounts of RAM, storage, and CPU power on a single machine. In contrast to cs, there are no virtual machines in the cluster domain.

Manually shutting down all the servers in both domains is complex. It requires grappling with the particulars of each: the bare metal/virtual machine distinction, file system mounting, network configuration, and which services start at boot (and whether they should). There are no “trick questions,” but there are plenty of places where problems can appear.
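To make the ordering problem concrete, here is a minimal sketch in Python of a dependency-aware shutdown sequence: guests go down before their hypervisor hosts, and NFS clients before the NFS server. The hostnames and groupings are hypothetical; this illustrates the kind of ordering we walk through by hand, not the procedure we actually run.

```python
import subprocess

# Hypothetical hosts, grouped into the order they must go down:
# guests before their hypervisor hosts, NFS clients before the NFS server.
SHUTDOWN_WAVES = [
    ["web-vm.cs.earlham.edu", "wiki-vm.cs.earlham.edu"],           # virtual machines first
    ["vmhost1.cs.earlham.edu"],                                    # then the bare metal hosting them
    ["node01.cluster.earlham.edu", "node02.cluster.earlham.edu"],  # cluster compute nodes
    ["fatnode1.cluster.earlham.edu"],                              # fat nodes
    ["files.cluster.earlham.edu"],                                 # the NFS server goes last
]

def shut_down(host: str) -> None:
    """Ask a host to power off cleanly over SSH."""
    subprocess.run(["ssh", f"root@{host}", "shutdown", "-h", "now"], check=False)

if __name__ == "__main__":
    for wave in SHUTDOWN_WAVES:
        for host in wave:
            print(f"shutting down {host}")
            shut_down(host)
        # In practice we would confirm each wave is really down before moving on.
```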

Since the systems are complicated, we like to keep the process orderly. We look for basic system health indicators. Do all of our virtual servers come back? (Yep.) Does everything that should launch at startup actually launch? (Mostly!) Do the NFS mounts in each domain work as we expect? (Mostly!) Are we backing up everything we need? (No, but we’re fixing that.)
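For illustration, the checks we do by eye could be approximated with a sketch like the one below (again in Python; the hostnames, mount points, and service names are placeholders, not our real configuration). It pings hosts to see whether they came back, stats NFS mount points, and asks systemd whether the expected services are active.

```python
import subprocess

# Placeholder names -- not our real inventory.
HOSTS = ["web-vm.cs.earlham.edu", "node01.cluster.earlham.edu"]
NFS_MOUNTS = ["/clients/home", "/clients/data"]
SERVICES = ["httpd", "nfs-server"]

def host_up(host: str) -> bool:
    """Single ping with a short timeout: did the machine come back?"""
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

def mount_ok(path: str) -> bool:
    """Does the mount point respond to a simple stat (i.e. is NFS answering)?"""
    return subprocess.run(["stat", "-t", path], capture_output=True).returncode == 0

def service_active(name: str) -> bool:
    """Did the service launch at startup as expected?"""
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

if __name__ == "__main__":
    for h in HOSTS:
        print(f"{h}: {'up' if host_up(h) else 'DOWN'}")
    for m in NFS_MOUNTS:
        print(f"{m}: {'mounted' if mount_ok(m) else 'MISSING'}")
    for s in SERVICES:
        print(f"{s}: {'active' if service_active(s) else 'NOT RUNNING'}")
```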

We enforce this simplicity with two tools:

  1. A clear and unambiguous plan, communicated from the very start, that does not change except by necessity.
  2. One of the best note-taking tools ever invented: a yellow legal pad with a cheap pen. This lets us take notes on the fly, separating the capture of notes from their aggregation and curation, which is better done outside the heat of a major operation.

In doing this, we always detect some problems. Some are system problems, but just as often they’re problems of knowledge transfer: no one wrote down that the VMs have extra dependencies to manage at startup, for example, so we have a cascade of minor failures across the CS domain to fix. We add any issues to a project list in a local instance of RequestTracker.

As usual, we booked three hours to do this last weekend, and almost all of it was done in that time. There’s always something left over at the end, but most systems were up and running again on schedule.

The coders and admins of the world may, at this point, wonder why we would go through all this and why (if we must do it) we don’t just have a script for it.

We definitely could, but the technical outcome of the shutdown is orthogonal to its purpose for us. We don’t do the server shutdown because the servers strictly need to be powered off and back on every six months. We do it because it’s one of the few projects that…

  • exposes the logic and structure of the entire server system to the students managing it,
  • provides rich opportunities to learn about both computing and teamwork,
  • forces us to be accountable for what we’ve installed and how we’ve configured it,
  • involves every sysadmin student from first-timers to seniors,
  • and yet is tightly constrained in time and scope.

I like the way this works so much that I’m engineering other projects that meet these criteria and can be implemented more readily throughout the regular academic calendar.