The admins and I have just emerged from a few weeks of slow-rolling outages. We have good reason to believe all is well again.
I want to write more about the incident and the skills upgrade that it gave us, but I’ll stick with this for now:
We’re a liberal arts college CS department with a couple dozen servers. In other words, we’re not a big shop. Mostly we can see when something’s off with our eyes and ears. We won’t be able to access something, emails will stop going through, etc. For that reason, services rarely disappear for long in our world.
Once in a while, though, we have to trace a problem – like the last month, for example. Ultimately it was a simple problem with a simple solution (hardware failures, nothing to do but replace our old SSD’s) but we spent a lot of time trying to identify that cause. That’s mostly because we currently lack a comprehensive monitoring suite.
That’s getting fixed, and quick.
As part of a broader quality control initiative we’re working on, we’re going to monitor everything – hardware, software, networks, connections, etc. The overhead of maintaining extensible monitoring systems is not as severe as the overhead of tracing problems when those systems don’t exist. To my knowledge, this brings us inline with best practices in industry.
Yes, it’s just a couple dozen servers. But experience has shown: All sentences of the form “It’s just X” are dangerous sentences.