Fun with the Slurm reservation MAINT and REPLACE flags

Note: as of Slurm 23.02.1 at least, this flag combination is no longer allowed:

scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags

I'm preserving this post for posterity, to show why that's a very good idea.

***

Late this fine Saturday morning I noticed the work Slack blowing up. Uh-oh.

Turns out that earlier in the week I had introduced an issue with our compile functionality, which rests on the logic of Slurm reservations. It’s now fixed, and I wanted to share what we learned in the event that it can help admins at other HPC centers who encounter similar issues.

See, on CU Boulder Research Computing (CURC)’s HPC system Alpine, we have a floating reservation for two nodes to allow users to access a shell on a compute node to compile code, with minimal waiting. Any two standard compute nodes are eligible for the reservation, and we use the Slurm replace flag to exchange the nodes over time as new nodes become idle.
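For reference, checking which nodes currently back the reservation is a one-liner (acompile is the reservation's name on our system):

$ scontrol show res acompile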

But on Saturday morning we observed several bad behaviors:

  • The reservation, acompile, had the maint flag.
  • Nodes that cycled through acompile ended up in a MAINTENANCE state that, upon their release, left them unusable for standard batch jobs.
  • Because nodes rotate in and out, Slurm was considering more and more nodes to be unavailable.
  • A member of our team attempted to solve the issue by setting flags=replace on the reservation. This seemed to help briefly, but the problem quickly resurfaced.

I think I have a sense of the proximate cause and an explainer, and I also think I know the underlying cause and possible fixes.

Proximate cause: Slurm reservations (at least as of version 22.05.2) are conservative about how they update the maint flag. In this example, to remove maint from a reservation with flags=maint,replace, it is not sufficient to specify flags=replace – the flag must be explicitly removed, with something like flags-=maint.

Allow me to demonstrate.

This command creates a reservation with flags=maint,replace:

$ scontrol create reservation reservationName=craig_example users=crea5307 nodeCnt=1 starttime=now duration=infinite flags=maint,replace
Reservation created: craig_example

Slurm creates it as expected:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Next, we (attempt to) update the reservation using flags=replace. The intention is for replace to become the only flag, which would seem to be the logical behavior.

$ scontrol update reservation reservationName=craig_example flags=replace
Reservation updated.

However, despite an apparently satisfying output message, this fails to achieve our goal. The maint flag remains:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Then, using minus-equals, we actually remove the maint flag:

$ scontrol update reservation reservationName=craig_example flags-=maint
Reservation updated.

Lo, the flag is gone:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Based on this example, which replicates the behavior we observed, it very much appears to me that the failure to properly remove the maint flag was the proximate cause of our problem. I've made this exact mistake in other contexts before, so I at least had a sense of it already.

That’s all well and good, but the proximate cause is not really what we care about. It’s more important how we got to this point. As it happens, the underlying cause is that the maint flag was set on acompile in the first place. I’ll describe why I did so initially and what we will do differently in the future.

An important fact is that Slurm (sensibly) does not want two reservations scheduled for the same nodes at the same time unless you, the admin, are REAL SURE you want that. The maint flag is one of only two documented ways to create overlapping reservations. We use this flag all the time for its prime intended purpose, reserving the system for scheduled maintenance. So far so good.
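For illustration, a maintenance reservation used for its intended purpose might look something like this (the reservation name and times here are made up):

$ scontrol create reservation ReservationName=pm_example StartTime=2023-04-05T07:00:00 Duration=12:00:00 Nodes=ALL Users=root Flags=MAINT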

However, at our last planned maintenance (PM), on April 5, we had several fixes to make to our ongoing reservations, including acompile. For simplicity's sake, we chose to delete and rebuild them according to our improved designs rather than update them in place. When I first attempted the rebuild step with acompile, I was blocked from creating it because of the existing (maint) reservations, so I added that flag to my scontrol create reservation command. From my bash history:

# failed
  325  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE

# succeeded
  329  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,MAINT

What I had not realized was that, by setting the maint flag in acompile and never removing it, I was leaving every node that cycled through acompile in the MAINTENANCE state – hence the issues above.
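If you hit something similar, the affected nodes are easy to spot. Something like either of these (the node name is just the one from the example above) should show the maintenance state:

$ sinfo -t maint
$ scontrol show node c3cpu-a9-u1-2 | grep -i state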

I can imagine other possible solutions to this issue – scontrol reconfigure or systemctl restart slurmctld may have helped, though I don’t like to start there. In any case, I think what I’ve described did reveal an issue in how I rebuilt this particular reservation.

For the future I see a few followup steps:

  1. Document this information (see: this blog post, internal docs).
  2. Revisit the overlap flag for Slurm reservations, which in my experience is a little trickier than maint but may prevent this issue if we implement it right (a sketch follows this list).
  3. Add reservation config checks as a late step in post-PM testing, perhaps the last thing to do before sending the all-clear message.
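For item 2, a rebuild of acompile using overlap instead of maint might look something like this – untested, and assuming overlap lets the reservation coexist with other reservations the way the documentation suggests:

$ scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,OVERLAP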

This was definitely a mistake on my part, but I (and my team) learned from it. I wrote this post to share the lesson, and I hope it helps if you’re an HPC admin who encounters similar behavior.

Alpine updates and RMACC 2022

This week I had the opportunity to speak at the 2022 RMACC Symposium, hosted by my own institution, about the Alpine supercomputer. My presentation and the others from my CU colleagues are available here.

In summary, Alpine has been in production since our launch event in May. After some supply chain issues (the same that have affected the entire computing sector), we are preparing to bring another round of nodes online within weeks. That will put Alpine’s total available resources (about 16,000 cores) on par with those of the retiring Summit system. It’s an exciting step for us at CURC.

As for RMACC: I had never attended the symposium before. After three days, I came away with a lot of new information, new contacts, and ideas for how to better support our researchers. A few topics I paid particular attention to:

  • Better and more scalable methods of deploying HPC systems and software
  • How the community will navigate the transition from XSEDE to ACCESS
  • The companies, organizations, and universities (like mine!) building the future of this space
  • Changes in business models for the vendors and commercial developers we work with

Academic HPC is a small niche in the computing world, and gatherings like this can be valuable as spaces to connect and share our best ideas.

New supercomputer just dropped

These certainly are server cabinets alright…

Today marks the launch of CU Boulder’s shiny new research supercomputer, Alpine. Text of the university press release:

The celebratory event signals the official launch of CU Boulder’s third-generation high performance computing infrastructure, which is provisioned and available to campus researchers immediately.

On May 18, numerous leaders from on- and off-campus will gather to celebrate, introduce and officially launch the campus’s new high-performance computing infrastructure, dubbed “Alpine.”

Alpine replaces “RMACC Summit,” the previous infrastructure, which has been in use since 2017. Comparable to systems now in use at top peer institutions across the country, Alpine will improve upon RMACC Summit by providing cutting-edge hardware that enhances traditional High Performance Computing workloads, enables Artificial Intelligence/Machine Learning workloads, and provides user-friendly access through tools such as Open OnDemand.

“Alpine is a modular system designed to meet the growing and rapidly evolving needs of our researchers,” said Assistant Vice Chancellor and Director of Research Computing Shelley Knuth. “Alpine addresses our users’ requests for faster compute and more robust options for machine learning.”

Notable among the technical specifications that will make Alpine an invaluable tool in research computing for researchers, industry partners and others, Alpine boasts: 3rd generation AMD EPYC CPUs, which provide enhanced energy efficiency per cycle compared to the Intel Xeon E5-2680 CPUs on RMACC Summit; Nvidia A100 GPUs; AMD MI100 GPUs; HDR InfiniBand; and 25 Gb Ethernet.

The kick-off event on May 18 will celebrate the Alpine infrastructure being fully operational and allow the community to enjoy a 20-minute tour, including snacks, an introduction to Research Computing, and a tour of the supercomputer container. The opportunity is open to the public and free of charge, and CU Boulder Research Computing staff will be on site to answer questions. CU Boulder Chief Information Officer Marin Stanek, Chief Operating Officer Patrick O’Rourke, and Acting Vice Chancellor for Research and Innovation Massimo Ruzzene will offer remarks at 1:30 p.m.

In addition to the main launch event, Research Computing is offering a full slate of training and informational events the week of May 16—20.

Researchers seeking to use Research Computing resources, which includes not only the Alpine supercomputer, but also large scale data storage, cloud computing and secure research computing, are invited to visit the Research Computing website to learn about more training offerings, the community discussion forum, office hours and general contact information.

Alpine is funded by the Financial Futures strategic initiative.

This is the biggest project I have ever worked on. It was in the works months before I arrived but has consumed most of my professional time since September. It’s exciting that we can finally welcome our researchers to use it.

What’s next

Some personal news… 🙂

I have been at Earlham College for almost seven years, including my time as a student and as CS faculty. Today is my last day there.

It’s been an incredible place to grow as a person, deepen my skills, collaborate with talented people from all walks of life, and try to make the world a little bit better. I’ve seen a few generations of the community cycle through and watched us withstand everything up to and including a literal pandemic. I capped it with the trip of a lifetime, spending a month doing research in Iceland – on a project I hope to continue working on in the future.

To the Earlham Computer Science community in particular I owe a big thanks. I have had a supportive environment in which to learn and grow for virtually the entirety of those years. The value they’ve added to my life can’t be quantified. I am deeply grateful.

What’s next?

I am elated to announce that in mid-September I will go to work as a Research Computing HPC Cluster Administrator at the University of Colorado Boulder! I’m excited to take the skills I’ve built at Earlham and apply them at the scale of CU Boulder. Thanks to the many people who’ve helped make this opportunity possible.