How does a team of students administer a powerful data center for education and research at a small undergraduate liberal arts college?
Success in my job depends largely on how well I can answer that question and implement the answer.
Earlham CS, under the banner of the Applied Groups, has a single team of students running our servers:
The Systems Admin Group’s key functions include the maintenance of the physical machines used by the Earlham Computer Science Department. They handle both the hardware and software sides of Earlham’s Computer Science systems. The students in the sysadmin group configure and manage the machines, computational clusters, and networks used in classes, in our research, and by other science departments at Earlham.
The students in that group are supervised by me, with the invaluable cooperation of a tenured professor.
The students are talented, with experience levels ranging from beginner to proficient. Every semester brings some turnover due to time constraints, shifting interests, graduation, and more.
And students are wonderful and unpredictable. Some join with a specific passion in mind: “I want to learn about cybersecurity.” “I want to administer the bioinformatics software for the interdisciplinary CS-bio class and Icelandic field research.” Others have a vague sense that they’re interested in computing – maybe system administration but maybe not – but no specific focus yet. (In my experience the latter group is consistently larger.)
In addition to that variety of experience and interests, consider our relatively small labor force. To grossly oversimplify:
- Say I put 20 hours of my week into sysadmin work, including meetings, projects, questions, and troubleshooting.
- Assume a student works 8 hours per week, the minimum for a regular work-study position. We have a budget for 7 students. (I would certainly characterize us as two-pizza-compliant.)
- There are other faculty who do some sysadmin work with us, but it’s not their only focus. Assume they put in 10 hours.
- Ignore differences in scheduling during winter and summer breaks. Also ignore emergencies, which are rare but can consume more time.
That’s a total of 86 weekly person-hours (20 + 7 × 8 + 10) to manage all our data, computation, networking, and sometimes power. That number alone limits what we can accomplish in a given week.
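If you like seeing the arithmetic spelled out, here’s the same back-of-the-envelope math as a tiny Python sketch. The figures are just the assumptions above, not measurements:

```python
# Back-of-the-envelope weekly person-hours, using the assumptions above.
MY_HOURS = 20             # my sysadmin time: meetings, projects, troubleshooting
STUDENTS = 7              # budgeted student positions
HOURS_PER_STUDENT = 8     # minimum for a regular work-study position
OTHER_FACULTY_HOURS = 10  # faculty who pitch in, but not as their main focus

total = MY_HOURS + STUDENTS * HOURS_PER_STUDENT + OTHER_FACULTY_HOURS
print(f"{total} person-hours/week")  # -> 86 person-hours/week
```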
Because of all those factors, we have to make tradeoffs all the time:
- interests versus needs
- big valuable projects versus system stability/sustainability
- work getting done versus documenting the work so future admins can learn it
- innovation versus fundamentals
- continuous service versus a momentary unplanned disruption, because someone finally had time this week to look at the problem and made an error on the first attempt
I’ve found ways to turn some of those tradeoffs into “both/and”, but that’s not always possible. When I have to make a decision, I tend to err on the side of education and letting the students learn, rather than getting it done immediately. The minor headache of today is a fair price to pay for student experience and a deepened knowledge base in the institution.
In some respects, this is less pressure than a traditional company or startup. The point is education, so failure is expected and almost always manageable. We’re not worried about reporting back to our shareholders with good quarterly earnings numbers. When something goes wrong, we have several layers of backups to prevent real disaster.
On the other hand, I am constantly turning the dials on my management and technical work to maximize for something – it’s just that instead of profit, that something is the educational mission of the college. Some of that happens through teaching the admins directly; some through continuing our support for interesting applications like genome analysis, data visualization, web hosting, and image aggregation. If students aren’t learning, I’m doing something wrong.
In the big picture, what impresses me about the group I work with is that we have managed to install, configure, troubleshoot, upgrade, retire, protect, and maintain a somewhat complex computational environment with relatively few unplanned interruptions – and we’ve done it for quite a few years now. This is a system with certain obvious limitations, and I’m constantly learning to do my job better, but in aggregate I would consider it an ongoing success. And at a personal level, it’s deeply rewarding.