A -> AB -> B

I was reading a recent Rachel By The Bay post in my RSS reader and this struck me:

Some items from my “reliability list”

It should not be surprising that patterns start to emerge after you’ve dealt with enough failures in a given domain. I’ve had an informal list bouncing around inside my head for years. Now and then, something new to me will pop up, and that’ll mesh up with some other recollections, and sometimes that yields another entry.

Item: Rollbacks need to be possible

This one sounds simple until you realize someone’s violated it. It means, in short: if you’re on version 20, and then start pushing version 21, and for some reason can’t go back to version 20, you’ve failed. You took some shortcut, or forgot about going from A to AB to B, or did break-before-make, or any other number of things.

That paragraph struck me because I’m about one week removed from making that very mistake.

Until last week, we’d been running a ten-year-old version of the pfSense firewall software on a ten-year-old server (32-bit architecture CPU! in a server!). I made a firewall upgrade one of our top summer priorities.

The problem was that I got in a hurry. We attempted the upgrade without taking careful enough notes on how to restore our previous configuration. Compounding that was years’ worth of lost knowledge about how the Computer Science Department’s subnets interoperate with the Earlham ITS network. The result was a couple of days of downtime and added stress.

We talked with ITS. We did research. I sat in a server room till late at night. Ultimately we reverted to the old firewall, which let our mail and other queues be processed while we figured out what had gone wrong in the new system.

The day after that we started our second attempt. We set up and configured the new one alongside the old, checking and double-checking every network setting. Then we simply swapped network cables. It was almost laughably anticlimactic.

In short, attempting to move directly from A to B generated hours of downtime, but when we went from A to AB, and then from AB to B, it was mere seconds.
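For the software-minded, the same idea shows up as “parallel change” (or “expand/contract”): stand up the new thing alongside the old, verify it, cut over, and keep the old path alive until you’re sure. Here’s a minimal sketch of that shape; the names and checks are made up for illustration, not taken from our firewall work.

```python
# A -> AB -> B as code. Everything here is illustrative: "old" and "new" stand in
# for two versions of a service, and the health check is a placeholder.

class Service:
    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def healthy(self):
        # Placeholder health check; in real life, probe the thing that matters.
        return self.running


def cut_over(old, new, point_traffic_at):
    """Make-before-break: A -> AB -> B, with A kept around as the rollback path."""
    new.start()                 # A -> AB: both versions now exist side by side
    if not new.healthy():
        new.stop()              # stay on A; nothing user-facing broke
        return False
    point_traffic_at(new)       # AB -> B: move traffic (for us, a cable swap)
    # Deliberately do NOT stop `old` here -- rollback stays one step away
    # until you've watched B run for a while.
    return True


if __name__ == "__main__":
    active = []
    ok = cut_over(Service("firewall-v20"), Service("firewall-v21"), active.append)
    print("cut over" if ok else "stayed on the old version")
```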

We learned a lot from the experience:

  1. The A->AB->B pattern
  2. ECCS and ITS now understand our network connections much more deeply than we did three weeks ago.
  3. Said network knowledge is distributed across students, staff, and faculty.
  4. We were vindicated in our wisest decision: trying this in July, when only a handful of people had a day-to-day dependence on our network and we had time to recover.

A more big-picture lesson is this: we in tech often want to get something done fast, and it’s all too easy to conflate that with doing it in a hurry. If you’re working on something like this, take some time to plan in advance. Make sure to allow yourself an A->AB->B path. A little work up front can save you a lot later.

Or, as one mentor of mine has put it in the context of software development:

Days of debugging can save you from hours of design!

Fixing mail as a troubleshooting case study

We recently upgraded our firewall, and after much ado we’re in good shape again with regard to network traffic and basic security. The most recent bit of cleanup was that our mail stack wasn’t working off-campus. This post is the text of the message I sent to the students in the sysadmin group after fixing it today. I’ve anonymized it as best I can but otherwise left it unaltered.

tl;dr the firewall rule allowing DNS lookups on the CS subnet allowed only TCP requests, not TCP/UDP. Now it allows both.

Admins, here’s how I deduced this problem:

  • Using a VPN, I connected to an off-campus network. (VPNs are overrated as a privacy tool, but they’re handy for a sysadmin for other reasons.)
  • I verified what $concernedParty observed, that mail was down when I was on that network and thus apparently not on-campus.
  • I checked whether other services were also unavailable. While pinging cs dot earlham dot edu worked, nothing else seemed to (Jupyter was down, website down, etc.)
  • I tried pinging and ssh-ing tools via IP address instead of FQDN. That worked. That made me think of DNS.
  • I checked the firewall rules, carefully. I observed that our other subnet, the cluster subnet, had a DNS pass rule that was set to allow both TCP and UDP traffic, so I tried ssh’ing to cluster (by FQDN, not IP address) and found that it worked.
  • I noticed that, strangely, the firewall rule allowing DNS lookups on the CS subnet via our DNS server allowed only TCP connections, not TCP/UDP. (I say “strange” not because the rule didn’t allow both protocols but because, of the two, it accepted TCP rather than UDP, DNS’s more common protocol of choice.) A quick way to test for that kind of asymmetry is sketched after this list.
  • I updated the appropriate firewall rule to allow both TCP and UDP.
  • It seemed to work so I sent a followup message to $concernedParty. And now here we are.
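For future admins who want to confirm that kind of asymmetry themselves, here’s a minimal sketch using the third-party dnspython library. The resolver address below is a documentation-range placeholder, not our real DNS server; if the UDP query times out while the TCP one answers, you’re looking at the same class of problem.

```python
# Minimal DNS transport check -- assumes `pip install dnspython`.
# 192.0.2.53 is a placeholder; substitute the subnet's actual resolver.
import dns.message
import dns.query

RESOLVER = "192.0.2.53"
query = dns.message.make_query("cs.earlham.edu", "A")

for label, transport in (("UDP", dns.query.udp), ("TCP", dns.query.tcp)):
    try:
        transport(query, RESOLVER, timeout=3)
        print(f"{label}: got an answer")
    except Exception as exc:
        # A timeout on UDP but not TCP points at a firewall rule, not the resolver.
        print(f"{label}: failed ({type(exc).__name__})")
```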

This approach – searching for patterns to understand the scope of the problem, followed by narrowing down to a few specific options, and making small changes to minimize external consequences – has often served me well in both my sysadmin work and my work developing software.

A summer in upgrade mode

Much like facilities and maintenance, CS/IT work is often best done when services are most lightly used. At a college, that’s in the summer. For that reason we spent May and June performing maintenance and upgrades.

Our biggest achievement is the near-complete rebuild of two of our computing clusters. They’re modest, and both are a few years old, but we gave one a complete OS upgrade and reconfigured the other with a new head node that will let us better use the systems we already have.

We’re running CentOS 7 on all nodes of the newly upgraded cluster (up from CentOS 5!), and to configure it we’re adopting Ansible. It fills a role similar to the c3 tools we’ve previously run on our three clusters, but it’s vastly more powerful. We’re all learning its vocabulary and syntax, but it’s already paying dividends in time and labor.
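For the newer admins: the baseline idea behind both the c3 tools and Ansible is “run this on every node.” Here’s a rough sketch of that baseline in plain Python (hostnames are placeholders); Ansible layers inventories, idempotent modules, and templating on top of it.

```python
# "Run a command on every node" -- the bare-bones version of what cexec does
# and what Ansible generalizes. Hostnames below are placeholders.
import subprocess

NODES = [f"node{i}.cluster.example.edu" for i in range(4)]

for node in NODES:
    print(f"== {node} ==")
    # check=False: keep going even if one node is down or unreachable.
    subprocess.run(["ssh", node, "uptime"], check=False)
```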

In addition to those upgrades:

  • We’ve racked and de-racked some servers, a task handled largely by the students, who should get that experience.
  • Several bits of software that were long overdue for updates finally got attention.
  • We’re ready to upgrade to a new server to host our firewall, also long overdue.

Beyond sysadmin work, I lent support to the Icelandic field studies group while they were on-site for a few weeks. Development on Field Day tends to slow once the crew returns, but I found building the app enjoyable and fulfilling, so I hope to have the chance to continue developing it (alongside some other thickets of code I’ve wandered into).

Finally, to my great relief, an annual scheduled power outage didn’t induce downtime, let alone the hard crash of last year. That’s thanks to some fixes we made to hardware and software in the wake of the last incident, one I hope no one in this department ever repeats.

It’s been a successful first half of the summer. I’ve been supervising all of it, but we wouldn’t be half as far without the hard work of the summer admin students. I continue to be optimistic that we’re setting ourselves up for some interesting things in the next academic year.

Phase 2 for the Admins

I entered my current position last June. Before me, it hadn’t existed for a year or so. That gap gave me a lot to do immediately as faculty supervisor of the sysadmins: upgrades, problem-solving, server migrations, deprecation of old hardware, etc., all on top of maintaining existing services. The admins spent a lot of time on those issues, to good effect.

Starting my second year in the role, I’m basically satisfied that we’ve finished that work. Maintenance and tracking will keep us up to speed for a while.

We’re now exploring options for how to make better use of the great resources we have. I see opportunities for growth in three areas:

  • Virtualization, containers, and other cloud-style features
  • High-performance computing (HPC) and specifically scientific computing, a longstanding strength of ours
  • Security

We’re working out the specifics as a group over the next few months, but I’m pretty excited about what we can accomplish at this point.

Why Earlham CS restarts its servers once per semester

Last weekend, the CS sysadmin students performed a complete shutdown and restart of the servers we manage. We do this in the last month of every semester, and it’s a consistently valuable experience for us.

The department manages two subnets full of servers: cs.earlham.edu and cluster.earlham.edu.

  • On the cs.earlham.edu side, we are (funnily enough) mostly CS-department-focused: the website lives there, as do the software tools we use for students in their intro courses, the wiki we use for information management, and tools for senior projects. These services mostly run on virtual machines.
  • In the cluster domain, we support scientific and high-performance computing in other departments, most commonly chemistry, physics, and biology. That includes parallel processing across a tightly-linked “cluster” of small servers as well as “fat nodes” that provide large amounts of RAM, storage, and CPU power on a single machine. In contrast to cs, there are no virtual machines in the cluster domain.

Manually shutting down and restarting every server in both domains is an involved process. It means grappling with the quirks of each: the bare-metal/virtual-machine distinction, filesystem mounts, network configuration, which services start at boot and whether they should. There are no “trick questions,” but there are plenty of places where problems can appear.

Since the systems are complicated, we like to keep the process orderly. We look for basic system health indicators. Do all of our virtual servers come back? (Yep.) Does everything launch at startup that should? (Mostly!) Do NFS mounts in each domain work as we expect? (Mostly!) Are we backing up everything we need? (No, but we’re fixing that.)
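To give a flavor of those checks, here’s a rough sketch of the kind of post-restart checklist a script could run. The hostnames, ports, and mount points are placeholders, not our real inventory, and the ping flags assume Linux.

```python
# Post-restart spot checks: hosts answer pings, services listen, NFS mounts exist.
# All names below are placeholders.
import os
import socket
import subprocess

HOSTS = ["web.cs.example.edu", "jupyter.cs.example.edu"]
SERVICES = [("web.cs.example.edu", 443), ("jupyter.cs.example.edu", 8000)]
NFS_MOUNTS = ["/mounts/home", "/mounts/data"]

def pings(host):
    """One ping with a two-second timeout; True if the host answers."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def listening(host, port):
    """True if a TCP connection to host:port succeeds within three seconds."""
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

for h in HOSTS:
    print(f"{h}: {'up' if pings(h) else 'DOWN'}")
for h, p in SERVICES:
    print(f"{h}:{p}: {'listening' if listening(h, p) else 'NOT listening'}")
for m in NFS_MOUNTS:
    print(f"{m}: {'mounted' if os.path.ismount(m) else 'NOT mounted'}")
```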

We enforce this simplicity with two tools:

  1. A clear and unambiguous plan, communicated from the very start, that does not change except by necessity.
  2. One of the best note-taking tools ever invented: a yellow legal pad and a cheap pen. It lets us take notes on the fly, separating the capture and accumulation of notes from their aggregation and curation, which are better done outside the heat of a major operation.

In doing this, we always detect some problems. Some are system problems, but just as often they’re problems of knowledge transfer: no one wrote down that the VMs have extra dependencies to manage at startup, for example, so we have a cascade of minor failures across the CS domain to fix. We add any issues to a project list in a local instance of RequestTracker.

As usual, we booked three hours to do this last weekend. Almost all of it was done in that time. There’s always something left over at the end, but most systems were running again on schedule.

The coders and admins of the world may, at this point, wonder why we would go through all this and why (if we must do it) we don’t just have a script for it.

We definitely could, but the technical value of the shutdown is orthogonal to its purpose for us. We don’t do the server shutdown because the servers strictly need to be powered off and back on every six months. We do it because it’s one of the few projects that…

  • exposes the logic and structure of the entire server system to the students managing it,
  • provides opportunities to learn a lot in terms of both computing and teamwork,
  • forces us to be accountable for what we’ve installed and how we’ve configured it,
  • involves every sysadmin student from first-timers to seniors,
  • and yet is tightly constrained in time and scope.

I like the way this works so much that I’m engineering other projects that meet these criteria and can be implemented more readily throughout the regular academic calendar.

Some reflections on guiding a student sysadmin team

How does a team of students administer a powerful data center for education and research at a small undergraduate liberal arts college?

Success at my job is largely dependent on how well I can answer that question and implement the answer.

Earlham CS, under the banner of the Applied Groups, has a single team of students running our servers:

The Systems Admin Group’s key functions include the maintenance of the physical machines used by the Earlham Computer Science Department. They handle both the hardware and software side of Earlham’s Computer Science systems. The students in the sysadmin group configure and manage the machines, computational clusters, and networks that are used in classes, for our research, and for use by other science departments at Earlham. 

The students in that group are supervised by me, with the invaluable cooperation of a tenured professor.

The students are talented, and they have a range of experience levels spanning from beginner to proficient. Every semester there’s some turnover because of time, interest, graduations, and more.

And students are wonderful and unpredictable. Some join with a specific passion in mind: “I want to learn about cybersecurity.” “I want to administer the bioinformatics software for the interdisciplinary CS-bio class and Icelandic field research.” Others have a vague sense that they’re interested in computing – maybe system administration but maybe not – but no specific focus yet. (In my experience the latter group is consistently larger.)

In addition to varieties of experience and interest, consider our relatively small labor force. To grossly oversimplify:

  • Say I put 20 hours of my week into sysadmin work, including meetings, projects, questions, and troubleshooting.
  • Assume a student works 8 hours per week, the minimum for a regular work-study position. We have a budget for 7 students. (I would certainly characterize us as two-pizza-compliant.)
  • There are other faculty who do some sysadmin work with us, but it’s not their only focus. Assume they put in 10 hours.
  • Ignore differences in scheduling during winter and summer breaks. Also ignore emergencies, which are rare but can consume more time.

That’s a total of 86 weekly person-hours to manage all our data, computation, networking, and sometimes power. That number itself limits the amount we can accomplish in a given week.
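Spelled out, the back-of-the-envelope arithmetic from the list above is simply:

```python
# The weekly tally (all figures approximate, as noted above).
my_hours = 20
student_hours = 7 * 8      # seven work-study students at ~8 hours/week each
faculty_hours = 10
print(my_hours + student_hours + faculty_hours)   # 86 weekly person-hours
```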

Because of all those factors, we have to make tradeoffs all the time:

  • interests versus needs
  • big valuable projects versus system stability/sustainability
  • work getting done versus documenting the work so future admins can learn it
  • innovation versus fundamentals
  • continuous service versus a momentary unplanned disruption, because someone finally had time this week to look at the problem and made an error on the first try

I’ve found ways to turn some of those tradeoffs into “both/and”, but that’s not always possible. When I have to make a decision, I tend to err on the side of education and letting the students learn, rather than getting it done immediately. The minor headache of today is a fair price to pay for student experience and a deepened knowledge base in the institution.

In some respects, this is less pressure than at a traditional company or startup. The point is education, so failure is expected and almost always manageable. We’re not worried about reporting good quarterly earnings numbers back to shareholders. When something goes wrong, we have several layers of backups to prevent real disaster.

On the other hand, I am constantly turning the dials on my management and technical work to maximize for something – it’s just that instead of profit, that something is the educational mission of the college. Some of that happens by teaching the admins directly, some by continuing our support for interesting applications like genome analysis, data visualization, web hosting, and image aggregation. If students aren’t learning, I’m doing something wrong.

In the big picture, what impresses me about the group I work with is that we have managed to install, configure, troubleshoot, upgrade, retire, protect, and maintain a somewhat complex computational environment with relatively few unplanned interruptions – and we’ve done it for quite a few years now. This is a system with certain obvious limitations, and I’m constantly learning to do my job better, but in aggregate I would consider it an ongoing success. And at a personal level, it’s deeply rewarding.

Responding to emergencies in the Earlham CS server room

A group of unrelated problems overlapped in time last week and redirected my entire professional energy. It was the most informative week I’ve had in months, so I’m committing a record of it here for posterity. Each problem we in the admins encountered last week is briefly described below.

DNS on the CS servers

It began Tuesday morning with a polite request from a colleague to investigate why the CS servers were down. Unable to ping anything, I walked to the server room and found a powered-on but non-responsive server. I crashed it with the power button and brought it back up, but we still couldn’t get to anything by ssh.

That still didn’t restore network access to our virtual machines, so I began investigating, starting by perusing /var/log.

An hour or so later that morning, I was joined by two other admins. One had an idea for where we might look for problems. We discovered that one of our admins had, innocently enough, used underscores in the hostnames assigned to two computers used by sysadmins-in-training. Underscores are not generally acceptable in DNS hostnames, so we fixed that and restarted bind. That resolved the problem.

The long-term solution to this is to train our sysadmin students in DNS, DHCP, etc. more thoroughly — and to remind them consistently to RTFM.
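A lightweight guard against repeating that mistake is a pre-flight check on hostnames before they go into a zone file. Here’s a rough sketch; the rule below is a simplification of the RFC 952/1123 hostname grammar, not a full validator.

```python
import re

# Letters, digits, and hyphens only; no leading/trailing hyphen; labels <= 63 chars.
# (A simplification of the RFC 952/1123 hostname rules -- underscores are out.)
LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def valid_hostname(name):
    return bool(name) and all(LABEL.match(label) for label in name.split("."))

assert valid_hostname("lab-machine-01")
assert not valid_hostname("lab_machine_01")   # the underscore case that bit us
```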

Fumes

Another issue materialized at the same time and worried me more: we discovered a foul smell in the server room, like some mix of burning and melting plastic. Was it related to the server failure? At the time, we didn’t know. Using a fan and an open door, we ventilated the server room and investigated.

We were lucky.

By way of the sniff test, we discovered the smell came from components that had melted in an out-of-use, but still energized, security keypad control box. I unplugged the box. The smell lingered but faded after a few days, at which point we filed a work order to have it removed from the room altogether.

I want to emphasize our good fortune in this case: had the smell pointed to a problem in the servers or the power supply, we would have faced worse problems that might have lasted a long time and cost us a lot. Our long-term fix should be to put measures in place to detect such problems automatically, at least well enough that someone can respond to them quickly.

Correlated outages

Finally, the day after those two events, and while we were still investigating them, we experienced a cluster of outages nearly all at once, across both subdomains we manage. Fortunately, each of these servers is mature and well-configured, and in every case pushing the power button restored systems to normal.

Solving this problem turned out to be entertaining and enlightening as a matter of digital forensics.

Another admin student and I sat in my office for an hour and looked for clues. Examining the system and security log files on each affected server consistently pointed us toward the log file for a script run once per minute under cron.

This particular script checks (by an SNMP query) if our battery level is getting low and staying low – i.e., that we’ve lost power and need to shut down the servers before the batteries are fully drained and we experience a hard crash. The script acted properly, but we’d made it too sensitive: it allowed only 2 minutes to elapse at a <80% battery level before concluding that every server running the script needed to shut down. This happened on each server running the script – and it didn’t happen on servers not running the script.

We’re fixing the code now to allow more time before shutting down. We’re also investigating why the batteries started draining at that time: they never drained so much as to cut power to the entire system, but they clearly dipped for a time.
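For the curious, here’s a rough sketch of the shape of that kind of check, written the way we want it to behave after the fix (a longer grace window). It is not our production script; the SNMP OID, community string, thresholds, and paths are all stand-ins.

```python
#!/usr/bin/env python3
"""Sketch of a cron-driven UPS watchdog. Placeholder values throughout;
assumes net-snmp's snmpget is installed and root privileges for shutdown."""
import subprocess
import sys

UPS_HOST = "ups.example.edu"               # placeholder UPS management address
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"      # upsEstimatedChargeRemaining (UPS-MIB)
LOW_THRESHOLD = 80                         # percent
MAX_LOW_CHECKS = 10                        # with a one-minute cron, ~10 minutes of grace
STATE_FILE = "/var/tmp/ups_low_count"

def battery_percent():
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", "public", "-Oqv", UPS_HOST, CHARGE_OID])
    return int(out.decode().strip())

def read_count():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0

count = read_count() + 1 if battery_percent() < LOW_THRESHOLD else 0
with open(STATE_FILE, "w") as f:
    f.write(str(count))

if count >= MAX_LOW_CHECKS:
    # Only act after the battery has stayed low across many consecutive checks.
    # The bug described above was making this window far too short.
    subprocess.run(["shutdown", "-h", "+1", "UPS battery low; shutting down"])
    sys.exit(0)
```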

To my delight (if anything in this process can be a delight), a colleague who’s across the ocean for a few weeks independently reached the same conclusion we did from my office.

Have a process

Details varied, but we walked through the same process to solve each problem:

  1. Observe the issue.
  2. Find a non-destructive short-term fix.
  3. Immediately, while it’s fresh in your mind and the log files are still current, gather data. Copy relevant log files for reference, but look at the originals if you still can. If there’s a risk a server will go down again, copy relevant files off that server. Check both hardware and software for obvious problems.
  4. Look for patterns in the data. Time is a good way to make an initial cut: according to the timestamps in the logs, what happened around the time we started observing a problem?
  5. Based on the data and context you have, exercise some common sense and logic. Figure it out. Ask for help.
  6. Based on what you learn, implement the long-term fix.
  7. Update stakeholders. Be as specific as is useful – maybe a little more so. [Do this earlier if there are larger, longer-lasting failures to address.]
  8. Think of how to automate fixes for the problem and how to avoid similar problems in the future.
  9. Implement those changes.

For us, the early stages will be sped up once we finish work on our monitoring/notification systems, though that wouldn’t have helped much in this case. Even with incomplete monitoring software, we discovered each of the problems I’ve described within minutes or hours, because of how frequently and intensively the Earlham community uses these systems.

I would add that, based on my observations, it’s easy to become sloppy about what goes into a log file and what doesn’t. Cleaning those up (carefully and conservatively) will be added to the tasks for the student sysadmins to work on.

Work together

We in the admins left for the weekend with no outstanding disasters on the table after a week in which three unrelated time-consuming problems surfaced. That’s tiring but incredibly satisfying. It’s to the credit of all the student admins and my colleagues in the CS faculty, whose collective patience, insights, and persistence made it work.

The perks of being a VM

Several of the CS department’s servers are virtual machines. While running VMs adds complexity, it also lets you do things like octuple* a system’s RAM in five minutes from your laptop.

For context, Earlham CS runs a JupyterHub server for the first- and second-semester CS students. We want to provide a programming environment (in this case Python, a terminal, and a few other languages) so students can focus on programming instead of administration, environment setup, etc. Jupyter is handy for that purpose.

The issue: Each notebook takes a relatively large amount of RAM. There are 60 or so intro CS students here. The Xen virtual machine hosting Jupyter was simply not equipped for that load. So at the request of my colleagues teaching the course, I visited a lab today. After observing the problem, we took five minutes to shut the server down, destroy the volume, change a single number in a single config file, and bring it all back to life with a boosted configuration. We’ve had no additional problems – so far. 🙂

Running a VM is frequently more complex than running on bare hardware. But the alternative is this:

I wish I had some of the “upcoming maintenance” email notifications we sent out in my ECCS sysadmin days for comparison. They were basically “no email, no websites for several days while we rebuild this server from parts, mmmkay?”

@chrishardie

Because we do so much of our administration in software, we’ve mostly avoided that problem in recent years. The closest we’ve come to that kind of hardware scramble lately was recovering from disk failures after a power outage over the summer. We had to send a lot of “sorry, X is down” emails. I wouldn’t want that to be our approach to managing all servers all the time.

(Of course there are many other alternatives, but running Xen VMs serves our purposes nicely. It’s also, for many reasons, good practice for our student system administrators.)

*I tweeted about this originally and said we quadrupled the RAM. In fact, a previously arranged RAM doubling had been specified in the VM’s config file but never applied. Before we restarted the machine, we decided to boost it even more. Ultimately we quadrupled the doubled amount – eight times what the VM had been running with.
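For anyone counting along, the asterisk math works out like this (the starting figure is made up for illustration):

```python
running = 8                  # GiB actually in use before the change (illustrative)
configured = 2 * running     # a doubling was in the config file but never applied
new = 4 * configured         # the single number we changed
print(new // running)        # 8 -> hence "octuple"
```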

Make your documentation specific

Programmers grasp the importance of precision when they/we write code.

Then too often it seems they/we forget it when writing documentation.

Somewhat ironically, I’m not going to include many specifics here. I don’t begrudge any individual for being too busy to leave notes or edit them too meticulously. But it’s a pattern I’ve now observed in several places I’ve worked: a ton of relevant information exists in two-person instant messages, months-old emails, or one person’s head (so we hope they remember it) and not in an established location that any person in the group can use as a reference.

The tech community developed its preferences against overdoing documentation and toward lean startups/Agile methodology for good reason. The global flow of paperwork is excessive. Small teams shouldn’t need to maintain complete logs of every action by anyone at anytime for any reason. All things being equal, less documentation is probably better than more.

All the same, writing down important information for reference goes a long way to not wasting the time and energy of new people – and by extension, not wasting the money of the organization.

I actually think my workplace, the Earlham CS Department, has a structure to handle this better than most: a wiki that any employee or student in the program can edit with minimal training. Teaching people the habit of updating the wiki after major project updates helps preserve institutional memory (funny enough, teams of students have a high turnover rate) and helps new people integrate into the system faster.

Minimizing bureaucracy is good. But making concise, specific, actionable, relatively frequent updates to community-accessible notes is also good. Teams can find a balance.