Last week I had the pleasure of attending the spring 2024 conference of the Coalition for Academic Scientific Computation (CASC) in Washington, DC. As a special treat, I also got to attend the Cyberinfrastructure Leadership Academy (CILA) 2024 in person the day before CASC. [1]
It’s an opportunity to learn about the state of research computing at academic institutions in the U.S. today. Along with SC, it’s also a chance to see in person a lot of people I mostly encounter over email or as boxes in Zoom meetings.
To me, celebrating pride is about celebrating different modes of pursuing happiness. More to the point, it’s about the breaking of arbitrary expectations for gender presentation, identity, and expression. That includes the right to fall in love with someone of the same sex, but it goes well beyond that.
I’m gay, so I’m very much a participant in this month’s celebrations. I’m also a cis male and to outward appearances basically gender-conforming. That’s neither a good thing nor a bad thing – just where I’ve landed. But I like the idea that others enjoy the freedom to be otherwise, that if I felt compelled to change or redefine some aspect of my identity or presentation tomorrow I could, and that the realm of personal freedom keeps expanding.
The opposition is loud and destructive, and it’s reached a fever pitch in the last few years. Transgender people in particular are the targets du jour. I see conservatives trying to drive a wedge between gay/bi and trans people. I see Republicans attacking Pride Month merchandise in stores, shuttering programs promoting diversity, and banning LGBTQ books. Worse, they’re isolating queer kids and queer families in school. They’re making it harder for people to just live as they see fit without doing a bit of harm to anyone else.
In the face of this, my fellow queer people make me proud. These are people living happy, interesting, loving, fulfilling lives despite intimidation and scapegoating. This community gives me hope for the future when it sometimes feels in short supply.
It’s inspiring, and not just in theory and not just for each person individually. We truly have accomplished a lot for the improvement of our society. On a scale of decades, and with plenty of setbacks, America has become more accepting of the wide variety of people who live here. If we (and now I’m including straight folks) can empathize with each other and make a bit of room for other people’s differences, we can continue on that path. To me, that’s what all those rainbow flags and parades are about: celebrating how far we’ve come and looking ahead to how much better we still have to do.
At Earlham Computer Science we have to support a couple dozen intro CS students per semester (or, in COVID times, per 7-week term). We teach Python, and we want to make sure everyone has the right tools to succeed. To do that, we use the Jupyterhub notebook environment, and we periodically respond to user issues related to running notebooks there.
A couple of dozen people running Python code on a server can gobble up resources and induce problems. Jupyter has historically been our toughest service to support, but we’ve vastly improved. In fact, as I’ll show, we have reduced the frequency of incidents by about 90 percent over time.
Note: we only recently began automatic tracking of uptime, so that data is almost useless for comparisons over time. What follows is the best approximation we have. If new information surfaces to discredit any of my methods, I’ll revise the analysis, but my colleagues have confirmed to me that it is at least plausible.
Retrieving the raw data
I started my job at Earlham in June 2018. In November 2018, we resolved an archiving issue with our help desk/admin mailing list that gives us our first dataset.
I ran a grep for the “Messages:” string in the thread archives:
grep 'Messages:' */thread.html # super complicated
I did a little text processing to generate the dataset: regular expression find-and-replace in an editor. That reduced the data to a column of YYYY-Month values and a column of message counts.
Then I searched the subject.html files for all lines mentioning “jupyter”, ignoring case:
grep -i jupyter {2018,2019,2020}*/subject.html
I saved the output to jupyter-messages-18-20.dat. I did some more text processing – again regexes, find and replace – then decided that follow-up messages are not what we care about and ran uniq against that file. A few quick wc -l commands later, we find:
21 Jupyter requests in 2018
17 Jupyter requests in 2019
19 Jupyter requests in 2020
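Here is a rough Python sketch of that dedup-and-count step. It is not what I actually ran (that was grep, an editor, uniq, and wc -l). The input format, the sample subjects, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical input: (year, subject) pairs already pulled out of the
# grep output. The real archive lines look different; this shape is assumed.
SAMPLE = [
    (2019, "Jupyter kernel keeps dying"),
    (2019, "Re: Jupyter kernel keeps dying"),
    (2020, "Jupyter password reset"),
    (2020, "RE: Re: Jupyter password reset"),
    (2020, "Cannot log in to jupyter"),
]

# One or more leading "Re:" markers, in any letter case.
REPLY_PREFIX = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)

def normalize(subject):
    """Strip leading 'Re:' prefixes so follow-ups collapse onto the original."""
    return REPLY_PREFIX.sub("", subject.strip()).lower()

def requests_per_year(rows):
    """Count distinct request subjects per year, ignoring follow-ups."""
    unique = {(year, normalize(subject)) for year, subject in rows}
    return Counter(year for year, _ in unique)

counts = requests_per_year(SAMPLE)
for year in sorted(counts):
    print(year, counts[year])
```

One difference from the shell version: uniq only collapses duplicate lines that sit next to each other, while the set used here catches duplicates anywhere in the input.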
One caveat is that in 2020 we moved a lot of communication to Slack. This adds some uncertainty to the data. However, I know from context that Jupyter requests have continued to flow disproportionately through the mailing list. As such, Slack messages are likely to be the same sort of redundant information that uniq already filtered out during text processing.
Another qualifier is that a year or so ago we began using GitLab’s Issues as a ticket tracking system. I searched that too and found 11 more Jupyter issues, all from 2020. Fortunately, only 1 of those was a problem that did not overlap with a mailing list entry.
Still, I think those raw numbers are a good baseline. At one level, it looks bad. The 2020 number has barely budged from 2018 and in fact it’s worse than 2019. That’s misleading, though.
Digging deeper into the data
Buried in that tiny dataset is some good news about the trends.
For one thing, those 21 Jupyter requests were in only 4 months out of the year – in other words, we were wildly misconfigured and putting out a lot of unnecessary technical fires. (That’s nobody’s fault – it’s primarily due to the fact that my position did not exist for about a year before I arrived at it, so we atrophied.)
What’s more, by inspection, about half of this year’s 19 requests are password or feature requests rather than problems, whereas the 17 we saw in 2019 were, I think, all real problems.
So in terms of Jupyter problems in the admin list, I find:
around 20 in the latter third of 2018
17 in ALL OF 2019
only two (granted one was a BIG problem but still only 2) in 2020
That’s a 90% reduction in Jupyterhub user issues over three years, by my account.
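For anyone who wants to check the arithmetic, here is a minimal sketch. The function name is just something I made up for illustration, and the inputs are the approximate counts from the list above.

```python
def percent_reduction(before, after):
    """Percentage drop going from `before` incidents to `after` incidents."""
    return 100 * (before - after) / before

# Roughly 20 problems in late 2018 versus 2 in 2020.
print(percent_reduction(20, 2))  # 90.0
```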
“That’s amazing, how’d you do it?”
Number one: thank you, imaginary reader, you’re too kind.
Number two: a lot of ways.
In no particular order:
We migrated off of a VM, which given our hardware constraints was not conducive to a resource-intensive service like Jupyterhub.
Gradually over time, we’ve upgraded our storage hardware, as some of it was old and (turns out) failing.
We added RAM. When it comes to RAM, some is good, more is better, and too much is just enough.
We manage user directories better. We export these over NFS but have done all we can to reduce network dependencies. That significantly reduces the amount of time the CPU spends twiddling its thumbs.
What’s more, we’re not stopping here. We’re currently exploring load-balancing options – for example, running Jupyter notebooks through a batch scheduler like Slurm, or potentially a containerized environment like Kubernetes. There are several solutions, but we haven’t yet determined which is best for our use case.
This is the work of a team of people, not just me, but I wanted to share it as an example of growth and progress over time. It’s incremental but it really does make a difference. Jupyterhub user issues, like so many issues, are usually solvable.
When building software for large datasets or HPC workflows, we talk a lot about the trip cost versus the item cost.
The item cost is the expense (almost always measured in time) to run an operation on a single unit of data – one member of a set, for example. The trip cost is the total expense of running a series of operations on some subset (possibly the whole set) of the data. The trip cost incorporates overhead, so it’s not just N times the item cost.
This is a key reason that computers, algorithms, and data structures that support high-performance computing are so important: by analyzing as many items in one trip as is feasible, you can often minimize time wasted on unnecessary setup and teardown.
Thus trip cost versus item cost is an invaluable simplifying distinction. It can clarify how to make many systems perform better.
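To make the distinction concrete, here is a minimal sketch that assumes the simplest possible model: a fixed overhead per trip plus a constant cost per item. Real systems are rarely this tidy, and the numbers and function names below are hypothetical.

```python
def trip_cost(n_items, item_cost, overhead):
    """Total cost of one trip: fixed setup/teardown plus per-item work."""
    return overhead + n_items * item_cost

def cost_per_item(n_items, item_cost, overhead):
    """Trip cost amortized over the items handled in that trip."""
    return trip_cost(n_items, item_cost, overhead) / n_items

# Hypothetical numbers: 2 seconds per item, 120 seconds of overhead per trip.
for n in (1, 10, 100, 1000):
    print(n, cost_per_item(n, item_cost=2, overhead=120))
```

Under this model the amortized cost per item falls toward the bare item cost as the batch grows, which is the whole argument for packing as many items into one trip as is feasible.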
Yes, Virginia, there is a trip cost
Christmas trees provide a good and familiar example.
Let’s stipulate that you celebrate Christmas and that you have a tree. You’ve put up lights. Now you want to hang the ornaments.
The item cost for each of the ornaments is very small: unbox and hang the ornament. It takes a couple of seconds, max – not a lot, for humans. It also parallelizes extremely well, so everyone in the family gets to hang one or more ornaments.
The trip cost is at least an order of magnitude (minutes rather than seconds) more expensive, so you only want to do it once:
Find the ornament box
Bring the box into the same room as the tree
Open the box
Unbox and hang N ornaments
Close the box
Put the box back
Those overhead steps don’t parallelize well, either: we see no performance improvement and possibly a performance decline if two or more people try to move the box in and out of the room instead of just one.
It’s plain to see that you want to hang as many ornaments as possible before putting away the ornament box. This matches our intuition (“let’s decorate the tree” is treated as a discrete task typically completed all in one go), which is nice.
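The ornament example can be sketched the same way. The timings below are made up (2 seconds to hang an ornament, 2 minutes of box handling per trip), and the model assumes that hanging is shared evenly among the people helping while the box handling is not shared at all.

```python
import math

ITEM_SECONDS = 2        # hang one ornament (assumed)
OVERHEAD_SECONDS = 120  # fetch, open, close, and return the box (assumed)

def decorating_time(n_ornaments, trips, hangers=1):
    """Seconds to hang every ornament using `trips` separate box trips.

    Ornaments are split as evenly as possible across the trips. Within a
    trip, hanging is shared among `hangers` people; box handling is serial.
    """
    base, extra = divmod(n_ornaments, trips)
    batches = [base + 1] * extra + [base] * (trips - extra)
    return sum(
        OVERHEAD_SECONDS + math.ceil(batch / hangers) * ITEM_SECONDS
        for batch in batches
    )

print(decorating_time(40, trips=1))             # one trip, one person
print(decorating_time(40, trips=40))            # one trip per ornament
print(decorating_time(40, trips=1, hangers=4))  # one trip, four people
```

With these numbers, extra helpers shave a little off the hanging time, but every extra trip to the box adds the full overhead again.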
Whether Christmas is your holiday or not, I wish you the best as the year draws to a close.