I’m proud of something from this year – a real 2020 success story.
To give some backstory: January through March were maybe the roughest three months of my tech life to date. We had a cascade of server hardware failures that caused a lot of downtime. Total catastrophe. I’m grateful for my institution’s patience.
After a lot of extra hours in windowless rooms, we did resolve those problems. We diagnosed the root causes and took steps to prevent similar issues in the future. I also learned a lot. (Some of the lessons from those days continue to guide us, and they’ve been imprinted on me forever.)
The very next day the March lockdowns started and the College sent everyone away.
We went all-remote for the rest of spring and shifted into hybrid mode for the fall. That made everyone more dependent on system availability. Naturally, I was uneasy about that after the stress of the spring term. I directed myself and the CS admin students to focus on uptime, iterative improvement, and minimal disruption.
What makes me proud is this: it worked.
Since resolving those issues in the winter and spring, we’ve been stable. Individual services and hosts have had issues, of course. Some of those issues took significant time and energy, and we’re still not perfect (probably never will be!). There is always more to fix, more to improve, more to automate, more to introduce.
But the system as a whole has operated without unplanned interruption since March.
We’ve faced uncertainty after uncertainty in 2020. But my colleagues and students have been able to count on our systems working. We’re not a giant shop here, but we have kept up with the changing times.
There it is: one clear 2020 success story. Engineering this success was a collaborative effort to which I’m just one contributor, but I am proud of it.