If you spend any time at all in the tech chatter space, you have probably heard a lot of discontent about the quality of software these days.
I can’t do anything about the cultural, economic, and social environment that cultivates these issues. (So maybe I shouldn’t say anything at all? 🙂 )
I can say that, if you’re in a position to do something about it, you should treat yourself to quality control.
The case I’d like to briefly highlight is about our infrastructure rather than a software package, but I think this principle can be generalized.
Case study: bringing order to a data center
After a series of (related) service outages in the spring of 2020, shortly before the onset of the COVID-19 crisis, we cut back on some expansionary ambitions to get our house in order.
Here’s a sample (by no means a comprehensive list) of the things we’ve fixed in the last couple of months:
- updated every OS we run such that most of our systems will need only incremental upgrades for the next few years
- transitioned to the Slurm scheduler for all of our clusters and compute nodes, which has already made it easier to track and troubleshoot batch jobs (see the first sketch after this list)
- modernized hardware across the board, including upgraded storage and network cards
- retired unreliable nodes
- implemented comprehensive monitoring and alerts
- replaced our old LDAP server and map with new ones that will better suit our authentication needs across many current and future services
- fixed the configuration of our JupyterHub instances for efficiency (see the second sketch after this list)
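To give a flavor of what “easier to track and troubleshoot batch jobs” means in practice, here’s a minimal sketch of the kind of report Slurm’s accounting makes easy. It assumes only that Slurm’s standard sacct command is on the PATH; it’s illustrative, not our exact tooling.

```python
#!/usr/bin/env python3
"""List recently failed Slurm jobs so admins can follow up.

Illustrative sketch: assumes Slurm's standard `sacct` accounting
command is installed and on the PATH.
"""
import subprocess


def failed_jobs_since(start="now-7days"):
    """Return (jobid, user, state, elapsed) rows for failed jobs."""
    out = subprocess.run(
        [
            "sacct",
            "--allusers",
            "--starttime", start,
            "--state", "FAILED,TIMEOUT,NODE_FAIL",
            "--format", "JobID,User,State,Elapsed",
            "--parsable2",  # pipe-delimited output, no trailing pipe
            "--noheader",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split("|") for line in out.splitlines() if line]
    # Skip per-step records like "12345.batch"; keep whole jobs only.
    return [r for r in rows if "." not in r[0]]


if __name__ == "__main__":
    for jobid, user, state, elapsed in failed_jobs_since():
        print(f"{jobid}\t{user}\t{state}\t{elapsed}")
```

Run from cron every morning, a report like this surfaces flaky jobs and nodes before anyone has to file a ticket.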
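As for the JupyterHub fixes: the details vary by site, but “configured for efficiency” tends to come down to settings like the following. This is a sketch with illustrative values; the idle culler assumes the separately installed jupyterhub-idle-culler package.

```python
# jupyterhub_config.py -- illustrative efficiency settings, not our exact config.
import sys

c = get_config()  # injected by JupyterHub when it loads this file

# Cap per-user resources; enforcement depends on the spawner in use.
c.Spawner.mem_limit = "2G"
c.Spawner.cpu_limit = 2.0

# Shut single-user servers down on logout instead of letting them idle.
c.JupyterHub.shutdown_on_logout = True

# Cull servers idle for more than an hour, via the add-on
# jupyterhub-idle-culler package (an assumption; any culler would do).
c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "admin": True,  # needs admin API access to stop servers
        "command": [
            sys.executable, "-m", "jupyterhub_idle_culler", "--timeout=3600",
        ],
    }
]
```

Even modest caps like these keep one runaway notebook from starving everyone else on a shared node.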
Notice: None of those are “let’s add a new server” or “let’s support 17 new software packages”. It’s all about improving the things we already support.
There are a lot of institutional reasons our systems needed this work, chief among them the staffing shortages that affect a lot of small colleges. But from a pragmatic perspective, to me and to the student admins, those reasons don’t matter. What matters is that we were in a position to fix the problems.
By consciously choosing to do so, we think we’ve substantially reduced future overhead and downtime risk. Quantitatively, we’ve gone from a few dozen open issue tickets to 19 as of this writing, six of which are advancing rapidly toward resolution.
How we did it and what’s next
I don’t have a dramatic reveal here. We just made the simple (if not always easy) decision to confront our issues and make quality a priority.
Time is an exhaustible, non-renewable resource. We decided to spend ours on making existing systems work much, much better, rather than on adding new features. This kind of focus can be boring, because it means strictly blocking out distractions, but the results speak for themselves.
After all that work, we can now pivot to the shiny new things: installing, supporting, and using new software. We’ve been revving up support for virtual machines and containers for a long time, and HPC continues to advance into new application areas. The freedom to explore these domains will open up a lot of room for student and faculty research over time. It may also help as we prepare to move into our first full semester under COVID-19, which is likely to have (at minimum) a substantial remote component.