CASC meeting for spring 2024

Last week I had the pleasure of attending the spring 2024 conference of the Coalition for Academic Scientific Computation (CASC) in Washington, DC. As a special treat, I also got to attend (in person) the Cyberinfrastructure Leadership Academy (CILA) 2024 the day before CASC. [1]

The conference is an opportunity to learn about the state of research computing at academic institutions in the U.S. today. Along with SC, it’s also a chance to see in person a lot of people I mostly encounter over email or as boxes in Zoom meetings.

One memorable event was a talk by “HPC Dan” Reed, incorporating information from this blog post and a lot more (in 2024, it wouldn’t do to ignore LLMs). The talk preceded the release of the Indicators report by the National Science Board, which he presented at the White House on the 13th.

[1] Enough acronyms yet?

Pride reflection

June is Pride Month. I have a few thoughts.

Pride street painting, Boulder, CO, 2021

To me, celebrating pride is about celebrating different modes of pursuing happiness. More to the point, it’s about the breaking of arbitrary expectations for gender presentation, identity, and expression. That includes the right to fall in love with someone of the same sex, but it goes well beyond that.

I’m gay, so I’m very much a participant in this month’s celebrations. I’m also a cis male and to outward appearances basically gender-conforming. That’s neither a good thing nor a bad thing – just where I’ve landed. But I like the idea that others enjoy the freedom to be otherwise, that if I felt compelled to change or redefine some aspect of my identity or presentation tomorrow I could, and that the realm of personal freedom keeps expanding.

The opposition is loud and destructive, and it’s reached a fever pitch in the last few years. Transgender people in particular are the targets du jour. I see conservatives trying to drive a wedge between gay/bi and trans people. I see Republicans attacking Pride Month merchandise in stores, shuttering programs promoting diversity, and banning LGBTQ books. Worse, they’re isolating queer kids and queer families in school. They’re making it harder for people to just live as they see fit without doing a bit of harm to anyone else.

In the face of this, my fellow queer people make me proud. These are people living happy, interesting, loving, fulfilling lives despite intimidation and scapegoating. This community gives me hope for the future when it sometimes feels in short supply.

It’s inspiring, and not just in theory and not just for each person individually. We truly have accomplished a lot for the improvement of our society. On a scale of decades, and with plenty of setbacks, America has become more accepting of the wide variety of people who live here. If we (and now I’m including straight folks) can empathize with each other and make a bit of room for other people’s differences, we can continue on that path. To me, that’s what all those rainbow flags and parades are about: celebrating where we’ve been, and looking forward to how much better we can still do.

Happy Pride. 🏳️‍🌈🏳️‍⚧️

The author, windswept, smiling at Seyðisfjörður, Iceland, chapel by rainbow cobblestones, 2021

Fun with the Slurm reservation MAINT and REPLACE flags

Note: as of 23.02.1 at least, Slurm no longer exhibits this behavior:

scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags

I’m preserving this post for posterity, to show why that’s a very good idea.

***

Late this fine Saturday morning I noticed the work Slack blowing up. Uh-oh.

Turns out that earlier in the week I had introduced an issue with our compile functionality, which rests on the logic of Slurm reservations. It’s now fixed, and I wanted to share what we learned in the event that it can help admins at other HPC centers who encounter similar issues.

See, on CU Boulder Research Computing (CURC)’s HPC system Alpine, we have a floating reservation for two nodes to allow users to access a shell on a compute node to compile code, with minimal waiting. Any two standard compute nodes are eligible for the reservation, and we use the Slurm replace flag to exchange the nodes over time as new nodes become idle.

But on Saturday morning we observed several bad behaviors:

  • The reservation, acompile, had the maint flag.
  • Nodes that went through acompile ended up in a MAINTENANCE state that, upon their release, rendered them unusable for standard batch jobs.
  • Because nodes rotate in and out, Slurm was considering more and more nodes to be unavailable.
  • A member of our team attempted to solve the issue by setting flags=replace on the reservation. This seemed to solve the issue briefly but it quickly resurfaced.

I think I have a sense of the proximate cause and an explainer, and I also think I know the underlying cause and possible fixes.

Proximate cause: Slurm reservations (at least as of version 22.05.2) are conservative about how they update the maint flag. For example, to remove maint from a reservation with flags=maint,replace, it’s not sufficient to specify flags=replace – the flag must be explicitly removed, with something like flags-=maint.

Allow me to demonstrate.

This command creates a reservation with flags=maint,replace:

$ scontrol create reservation reservationName=craig_example users=crea5307 nodeCnt=1 starttime=now duration=infinite flags=maint,replace
Reservation created: craig_example

Slurm creates it as expected:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

We (attempt to) update the reservation using flags=replace. The intention is to have replace be the only flag. This would seem to be the logical behavior.

$ scontrol update reservation reservationName=craig_example flags=replace
Reservation updated.

However, despite an apparently satisfying output message, this fails to achieve our goal. The maint flag remains:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Then, using minus-equals, we actually remove the maint flag:

$ scontrol update reservation reservationName=craig_example flags-=maint
Reservation updated.

Lo, the flag is gone:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Based on this example, which replicates the behavior we observed, it very much appears to me that the failure to properly remove the maint flag was the proximate cause of our problem. I’ve made this exact mistake in other contexts before, so I at least had a sense of this already.

That’s all well and good, but the proximate cause is not really what we care about. It’s more important how we got to this point. As it happens, the underlying cause is that the maint flag was set on acompile in the first place. I’ll describe why I did so initially and what we will do differently in the future.

An important fact is that Slurm (sensibly) does not want two reservations scheduled for the same nodes at the same time unless you, the admin, are REAL SURE you want that. The maint flag is one of only two documented ways to create overlapping reservations. We use this flag all the time for its prime intended purpose, reserving the system for scheduled maintenance. So far so good.
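
For reference, a typical maintenance reservation looks something like this (a sketch with illustrative names and times, not our exact command):

$ scontrol create reservation ReservationName=pm_example StartTime=2024-04-05T06:00:00 Duration=12:00:00 Nodes=ALL Users=root Flags=MAINT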

However, at our last planned maintenance (PM), on April 5, we had several fixes to make to our ongoing reservations, including acompile. For simplicity’s sake, we chose to delete and rebuild them according to our improved designs, rather than updating them in place. When I first attempted the rebuild step with acompile, I was blocked from creating it because of the existing (maint) reservations, so I added that flag to my scontrol create reservation command. From my bash history:

# failed
  325  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE

# succeeded
  329  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,MAINT

What I had not realized was that, by setting the maint flag in acompile and never removing it, I was leaving every node that cycled through acompile in the MAINTENANCE state – hence the issues above.
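
If you suspect you’re in the same situation, something like the following should surface any nodes carrying the maintenance state (the exact format string is a matter of taste):

$ sinfo -N -t maint -o '%N %T'
$ scontrol show node c3cpu-a9-u1-2 | grep State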

I can imagine other possible solutions to this issue – scontrol reconfigure or systemctl restart slurmctld may have helped, though I don’t like to start there. In any case, I think what I’ve described did reveal an issue in how I rebuilt this particular reservation.

For the future I see a few followup steps:

  1. Document this information (see: this blog post, internal docs).
  2. Revisit the overlap flag for Slurm reservations, which in my experience is a little trickier than maint but may prevent this issue if we implement it right (see the sketch after this list).
  3. Add reservation config checks as a late step in post-PM testing, perhaps the last thing to do before sending the all-clear message.
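
For step 2, a rebuilt acompile might look something like this – an untested sketch using overlap instead of maint, not a command we’ve put into production:

$ scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,OVERLAP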

This was definitely a mistake on my part, but I (and my team) learned from it. I wrote this post to share the lesson, and I hope it helps if you’re an HPC admin who encounters similar behavior.

Alpine updates and RMACC 2022

This week I had the opportunity to speak at the 2022 RMACC Symposium, hosted by my own institution, about the Alpine supercomputer. My presentation and the others from my CU colleagues are available here.

In summary, Alpine has been in production since our launch event in May. After some supply chain issues (the same that have affected the entire computing sector), we are preparing to bring another round of nodes online within weeks. That will put Alpine’s total available resources (about 16,000 cores) on par with those of the retiring Summit system. It’s an exciting step for us at CURC.

As for RMACC: I had never attended the symposium before. After three days, I came away with a lot of new information, new contacts, and ideas for how to support our researchers better. A few topics I paid particular attention to:

  • Better and more scalable methods of deploying HPC systems and software
  • How the community will navigate the transition from XSEDE to ACCESS
  • The companies, organizations, and universities (like mine!) building the future of this space
  • Changes in business models for the vendors and commercial developers we work with

Academic HPC is a small niche in the computing world, and gatherings like this can be valuable as spaces to connect and share our best ideas.

New supercomputer just dropped

These certainly are server cabinets alright…

Today marks the launch of CU Boulder’s shiny new research supercomputer, Alpine. Text of the university press release:

The celebratory event signals the official launch of CU Boulder’s third-generation high performance computing infrastructure, which is provisioned and available to campus researchers immediately.

On May 18, numerous leaders from on- and off-campus will gather to celebrate, introduce and officially launch the campus’s new high-performance computing infrastructure, dubbed “Alpine.”

Alpine replaces “RMACC Summit,” the previous infrastructure, which has been in use since 2017. Comparable to systems now in use at top peer institutions across the country, Alpine will improve upon RMACC Summit by providing cutting-edge hardware that enhances traditional High Performance Computing workloads, enables Artificial Intelligence/Machine Learning workloads, and provides user-friendly access through tools such as Open OnDemand.

“Alpine is a modular system designed to meet the growing and rapidly evolving needs of our researchers,” said Assistant Vice Chancellor and Director of Research Computing Shelley Knuth. “Alpine addresses our users’ requests for faster compute and more robust options for machine learning.”

Notable among the technical specifications that will make Alpine an invaluable tool in research computing for researchers, industry partners and others, Alpine boasts: 3rd generation AMD EPYC CPUs, which provide enhanced energy efficiency per cycle compared to the Intel Xeon E5-2680 CPUs on RMACC Summit; Nvidia A100 GPUs; AMD MI100 GPUs; HDR InfiniBand; and 25 Gb Ethernet.

The kick-off event on May 18 will celebrate the Alpine infrastructure being fully operational and allow the community to enjoy a 20-minute tour, including snacks, an introduction to Research Computing, and a tour of the supercomputer container. The opportunity is open to the public and free of charge, and CU Boulder Research Computing staff will be on site to answer questions. CU Boulder Chief Information Officer Marin Stanek, Chief Operating Officer Patrick O’Rourke, and Acting Vice Chancellor for Research and Innovation Massimo Ruzzene will offer remarks at 1:30 p.m.

In addition to the main launch event, Research Computing is offering a full slate of training and informational events the week of May 16–20.

Researchers seeking to use Research Computing resources, which includes not only the Alpine supercomputer, but also large scale data storage, cloud computing and secure research computing, are invited to visit the Research Computing website to learn about more training offerings, the community discussion forum, office hours and general contact information.

Alpine is funded by the Financial Futures strategic initiative.

This is the biggest project I have ever worked on. It was in the works months before I arrived but has consumed most of my professional time since September. It’s exciting that we can finally welcome our researchers to use it.

What’s next

Some personal news… 🙂

I have been at Earlham College for almost seven years, including my time as a student and as CS faculty. Today is my last day there.

It’s been an incredible place to grow as a person, deepen my skills, collaborate with talented people from all walks of life, and try to make the world a little bit better. I’ve seen a few generations of the community cycle through and watched us withstand everything up to and including a literal pandemic. I capped it with the trip of a lifetime, spending a month doing research in Iceland – on a project I hope to continue working on in the future.

To the Earlham Computer Science community in particular I owe a big thanks. I have had a supportive environment in which to learn and grow for virtually the entirety of those years. The value they’ve added to my life can’t be quantified. I am deeply grateful.

What’s next?

I am elated to announce that in mid-September I will go to work as a Research Computing HPC Cluster Administrator at the University of Colorado Boulder! I’m excited to take the skills I’ve built at Earlham and apply them at the scale of CU Boulder. Thanks to the many people who’ve helped make this opportunity possible.

Highlights of an amazing trip

Today is the last day most of us are in Iceland for this trip. As I started this post, we were completing a tour of the Golden Circle after a few days in beautiful Reykjavik. Now we are preparing for departure.

Our view of the volcano

I wanted to post some of the highlights of our trip. There’s a rough order to them, but don’t take the numbering too seriously – it’s been a great experience all-around. Without further ado:

  1. The volcano is truly incredible. It was not uncommon for people to spontaneously shout “Wow!” and “Oh my god!” as the lava burst up from the ground.
  2. We woke up every day for a few weeks with a view of a fjord.
  3. We did a glacier hike on Sólheimajökull, with two awesome guides.
  4. This was a historically successful round of data collection, both on the drone side and on the biology side. We’ll write and share a lot more about this in the next few months.
  5. We shared space with a group of phenomenal students from the University of Glasgow. We also collaborated with them on multiple occasions, learning a lot about different ways to study wildlife and local sites.
  6. THE FOOD – you probably don’t associate Iceland with food culture (I certainly didn’t), but our meals were delicious.
  7. The architecture and decorations are so distinctly Icelandic.
  8. Amazing photography and video – in high quality and high quantity.
  9. Walking along the boundary between the North American and European plates.
  10. Guided tour from our Skalanes hosts – who incidentally are awesome people – of a stretch of eastern Iceland.
Getting the rundown about glaciers at Solo

Some of my personal honorable mentions include:

  • Trail running at Skalanes is breathtaking.
  • Blue glacier ice is real neat.
  • The National Museum of Iceland is fascinating and well-done.
  • Rainbow roads in both Seyðisfjörður and Reykjavik highlight what a welcoming place this country is – also perfect reminders of Pride Month in the U.S.!
  • My first-in-my-lifetime tour of a beautiful country happened alongside people I admire who teach me things every single day. What more could I ask for?
A drone photo of the coast by the fjord

If you haven’t already, check out this interview with Charlie and Emmett, conducted by Cincinnati Public Radio.

Davit and Tamara flying

In addition to our success this year, we’ve also set up some great new opportunities for future years. With our long-time friend and collaborator Rannveig Þórhallsdóttir, we’ve added the cemetery in Seyðisfjörður to our list of sites to survey. We believe there may be historically significant artifacts to be found there, and our drone work lends itself well to finding out.

The fjord at Skalanes

Finally, here’s the trip by the numbers:

  • 7 Earlhamites
  • 26 days
  • 183 GB of initial drone images and initial assemblies
  • 2 great hosts at Skalanes
  • 6 outstanding co-dwellers
  • 4 guides at 2 sites
  • 1 perfect dog
  • N angry terns
  • 1 amazing experience
Admiring the view

And that’s a wrap. Hope to see you again soon, Iceland!

Cross-posted at the Earlham Field Science blog.

Flying cameras are good

Update: We have learned! And we no longer agree with this post! GCPs remain critical for deriving elevation. The cameras are not yet ready to replace that kind of precision. There’s always something you didn’t realize at first glance. Post preserved for posterity, and because lots of it is still perfectly valid.

We recently chose not to use ground control points (GCPs) as part of our surveying work. This is a departure from standards and conventions in the near-Earth surveying space. However, we believe we have made a sound decision that will support research that is just as effective while saving time and money. In this post, I’ll explain that decision.

The short version: drone imagery and open-source assembly software (e.g. OpenDroneMap) are now so good that, for our purposes, GCPs have no marginal benefit.

We have high-quality information about our trial area from an established authority – the Cultural Heritage Agency of Iceland. Their 2007 report of finds is the basis of our trial runs here at Skalanes. Surveying these predefined areas, we’ve now flown multiple flights, gathered images, and then run three assemblies with OpenDroneMap.
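
For the curious, the assemblies themselves don’t require anything exotic. Here’s a rough sketch of the commands, assuming the Docker distribution of ODM; the project name and paths are illustrative, and our exact invocations differed a bit:

# assembly from images alone, no GCPs
docker run -ti --rm -v /path/to/datasets:/datasets opendronemap/odm --project-path /datasets roundhouse

# assembly with a ground control point file added to the project
docker run -ti --rm -v /path/to/datasets:/datasets opendronemap/odm --project-path /datasets roundhouse --gcp /datasets/roundhouse/gcp_list.txt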

Here’s a simple run over the area with no GCPs:

Here’s a run over the area with GCPs, adding no location metadata other than the craft’s built-in GPS coordinates (you’ll note that the ground footprint is slightly different, but the roundhouse in the middle is the key feature):

We also manually geocoded the GCPs for one run.

In the end, we observed no meaningful difference between an assembly with GCPs and an assembly without them. Adding the images as raster layers to a QGIS project confirmed this to our satisfaction:

With GCP:

Without GCP:

In summary, ground control points just don’t help us much compared to taking a bunch of good photos and using high-quality software to assemble them. They also cost us in portability: even four GCPs are difficult to carry, occupying significant space in airport luggage and weighing down walks in the field. For scientists interested in doing work over a large area, potentially multiple times, that inconvenience is not a trivial cost.

The ODM assemblies are outstanding by themselves. We have good technology and build on the work of a lot of brilliant people. That frees us to be more nimble than we might have been before.

It wouldn’t be a post by me if it didn’t end with a cool picture. Here’s a drone image from a cliff near the house where we’re staying:

Cross-posted at the Earlham Field Science blog.

Awe in Iceland

The Greater Good Science Center at Berkeley considers awe one of the keys to well-being:

Awe is the feeling we get in the presence of something vast that challenges our understanding of the world, like looking up at millions of stars in the night sky or marveling at the birth of a child. When people feel awe, they may use other words to describe the experience, such as wonder, amazement, surprise, or transcendence.

That’s the feeling I have at least once a day, every day, here in Iceland.

And it’s difficult to write a blog post about awe. Almost by definition, it’s an emotion that defies easy explanation. It has a mystique that risks being lost in the translation to plain language.

But if I can’t describe the feeling, I can describe why I’m having it.

Alone among my traveling companions, I’m on my first-ever trip out of my country of origin (🇺🇸). The sliver of gray in this image is the first thing I ever saw of a country not my own:

When we arrived, I got a passport stamp and exchanged currency – both brand new experiences. However mundane, they were novel for me and began waking me up to the new world I’d entered.

Our first few days were chilly, windy, and rainy. I was much happier about this than were my traveling companions. If our weather wasn’t pleasant, it was nonetheless exactly the immersive experience I was hoping for when I signed up for this trip.

In those first few days, I got to see this amazing waterfall:

I got to participate in collecting soil samples at a glacier —

Solo!

— and in howling wind on the side of a moraine:

The right side of the moraine was calm and quiet. The left was much less so.

For good measure, I saw floating blue ice for the first time:

All of this was great, and for me it made the trip worth the months of planning and days of travel difficulties it took to get here.

Then we got to Skalanes, where I’m writing this post, and its landscapes exist on a whole other level. Here are ten views, drawn almost at random from my photos:

This is a country that absolutely runs up the score on natural beauty.

I’ve taken hundreds of pictures here and they’re all amazing – but none does justice to actually being here. That combination is the signature of an awe-inspiring experience.

Awe puts us in touch with something above and beyond our daily worldly experience – call it the divine, the sublime, whatever speaks to you. It’s an experience you can reproduce if you try, but I believe it connects most deeply when it emerges organically from the world you enter. That’s what’s happened to me here.

It is remarkable that this is what we get to do for work, and I am so glad we have some more time to spend here in this awesome country.

Cross-posted at the Earlham Field Science blog.

Jupyterhub user issues: a 90% improvement

Jupyter errors are not to be confused with Jupiter errors.

At Earlham Computer Science we have to support a couple dozen intro CS students per semester (or, in COVID times, per 7-week term). We teach Python, and we want to make sure everyone has the right tools to succeed. To do that, we use the Jupyterhub notebook environment, and we periodically respond to user issues related to running notebooks there.

A couple of dozen people running Python code on a server can gobble up resources and induce problems. Jupyter has historically been our toughest service to support, but we’ve vastly improved. In fact, as I’ll show, we have reduced the frequency of incidents by about 90 percent over time.

Note: we only recently began automatic tracking of uptime, so that data is almost useless for comparisons over time. This is the best approximation we have. If new information surfaces to discredit any of my methods, I’ll change it, but my colleagues have confirmed to me that this analysis is at least plausible.

Retrieving the raw data

I started my job at Earlham in June 2018. In November 2018, we resolved an archiving issue with our help desk/admin mailing list; that archive gives us our first dataset.

I ran a grep for the “Messages:” string in the thread archives:

grep 'Messages:' */thread.html # super complicated

I did a little text processing to generate the dataset: regular expression find-and-replace in an editor. That reduced the data to a column of YYYY-Month values and a column of message counts.
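
If you’d rather do that step on the command line, a rough equivalent (assuming mailman-style directory names like 2018-November and lines of the form "Messages: 21") is:

grep 'Messages:' */thread.html | sed -E 's|^([0-9]{4}-[A-Za-z]+)/thread\.html.*Messages:[^0-9]*([0-9]+).*|\1 \2|' > message-counts.dat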

Then I searched the subject.html files for all lines matching “{J,j}upyter”:

grep -i jupyter {2018,2019,2020}*/subject.html 

I saved the results to jupyter-messages-18-20.dat. I did some text processing – again regexes, find and replace – and then decided that follow-up messages are not what we care about, so I ran uniq against that file. A few quick wc -l commands later and we find:

  • 21 Jupyter requests in 2018
  • 17 Jupyter requests in 2019
  • 19 Jupyter requests in 2020
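
Spelled out, those per-year counts came from something like this (a rough sketch; it assumes the year-month directory prefix that grep adds survives the cleanup):

uniq jupyter-messages-18-20.dat > jupyter-uniq.dat
grep -c '^2018' jupyter-uniq.dat # repeat for 2019 and 2020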

One caveat is that in 2020 we moved a lot of communication to Slack. This adds some uncertainty to the data. However, I know from context that Jupyter requests have continued to flow disproportionately through the mailing list. As such, the Slack messages are likely the sort of redundant information already collapsed by uniq in the text processing.

Another qualifier is that a year or so ago we began using GitLab’s Issues as a ticket-tracking system. Searching it turned up 11 more Jupyter issues, all from 2020. Fortunately, only 1 of those was a problem that did not overlap with a mailing list entry.

Still, I think those raw numbers are a good baseline. At one level, it looks bad. The 2020 number has barely budged from 2018 and in fact it’s worse than 2019. That’s misleading, though.

Digging deeper into the data

Buried in that tiny dataset is some good news about the trends.

For one thing, those 21 Jupyter requests were in only 4 months out of the year – in other words, we were wildly misconfigured and putting out a lot of unnecessary technical fires. (That’s nobody’s fault – it’s primarily due to the fact that my position did not exist for about a year before I arrived at it, so we atrophied.)

What’s more, by inspection, about half of this year’s 19 are password or feature requests rather than real problems, whereas the 17 we saw in 2019 were, I think, genuine issues.

So in terms of Jupyter problems in the admin list, I find:

  • around 20 in the latter third of 2018
  • 17 in ALL OF 2019
  • only two (granted one was a BIG problem but still only 2) in 2020

That’s a 90% reduction in Jupyterhub user issues over three years, by my account.

“That’s amazing, how’d you do it?”

Number one: thank you, imaginary reader, you’re too kind.

Number two: a lot of ways.

In no particular order:

  1. We migrated off of a VM, which given our hardware constraints was not conducive to a resource-intensive service like Jupyterhub.
  2. Gradually over time, we’ve upgraded our storage hardware, as some of it was old and (turns out) failing.
  3. We added RAM. When it comes to RAM, some is good, more is better, and too much is just enough.
  4. We manage user directories better. We export these over NFS but have done all we can to reduce network dependencies. That significantly reduces the amount of time the CPU spends twiddling its thumbs.

What’s more, we’re not stopping here. We’re currently exploring load-balancing options – for example, running Jupyter notebooks through a batch scheduler like Slurm, or potentially a containerized environment like Kubernetes. There are several solutions, but we haven’t yet determined which is best for our use case.
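
To give a flavor of the Slurm option: in principle, a single-user notebook can be launched as a batch job with something like the line below. This is purely a sketch, not our configuration, and the partition and resource values are made up.

sbatch --partition=jupyter --mem=4G --time=02:00:00 --wrap="jupyter notebook --no-browser --port=8888"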

This is the work of a team of people, not just me, but I wanted to share it as an example of growth and progress over time. It’s incremental but it really does make a difference. Jupyterhub user issues, like so many issues, are usually solvable.