Fun with the Slurm reservation MAINT and REPLACE flags

Note: as of 23.02.1 at least, Slurm no longer exhibits this behavior:

scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags

I'm preserving this post for posterity, so readers can see why that's a very good idea.

***

Late this fine Saturday morning I noticed the work Slack blowing up. Uh-oh.

Turns out that earlier in the week I had introduced an issue with our compile functionality, which rests on the logic of Slurm reservations. It’s now fixed, and I wanted to share what we learned in the event that it can help admins at other HPC centers who encounter similar issues.

See, on CU Boulder Research Computing (CURC)’s HPC system Alpine, we have a floating reservation for two nodes to allow users to access a shell on a compute node to compile code, with minimal waiting. Any two standard compute nodes are eligible for the reservation, and we use the Slurm replace flag to exchange the nodes over time as new nodes become idle.

But on Saturday morning we observed several bad behaviors:

  • The reservation, acompile, had the maint flag.
  • Nodes that went through acompile ended up in a MAINTENANCE state that, upon their release, rendered them unusable for standard batch jobs.
  • Because nodes rotate in and out, Slurm was considering more and more nodes to be unavailable.
  • A member of our team attempted to solve the issue by setting flags=replace on the reservation. This seemed to solve the issue briefly, but it quickly resurfaced.
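For admins diagnosing something similar, the symptom should be visible with a couple of standard commands. This is a sketch of the kind of checks involved, not a transcript from our system:

```shell
# Check which flags the reservation actually carries; the problem state
# in our case was Flags=MAINT,REPLACE
scontrol show reservation acompile

# List nodes currently in a maintenance state; for us, these were nodes
# that had rotated through acompile and were never released cleanly
sinfo --states=maint --format="%N %T"
```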

I think I have a sense of the proximate cause and an explainer, and I also think I know the underlying cause and possible fixes.

Proximate cause: Slurm reservations (at least as of version 22.05.2) are conservative in how they update the maint flag. In this example, to remove maint from a reservation with flags=maint,replace, it is not sufficient to set flags=replace – the flag must be explicitly removed, with something like flags-=maint.

Allow me to demonstrate.

This command creates a reservation with flags=maint,replace:

$ scontrol create reservation reservationName=craig_example users=crea5307 nodeCnt=1 starttime=now duration=infinite flags=maint,replace
Reservation created: craig_example

Slurm creates it as expected:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

We (attempt to) update the reservation using flags=replace. The intention is for replace to become the only flag, which would seem to be the logical behavior.

$ scontrol update reservation reservationName=craig_example flags=replace
Reservation updated.

However, despite an apparently satisfying output message, this fails to achieve our goal. The maint flag remains:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Then, using minus-equals, we actually remove the maint flag:

$ scontrol update reservation reservationName=craig_example flags-=maint
Reservation updated.

Lo, the flag is gone:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Based on this example, which replicates the behavior we observed, it very much appears to me that the incomplete removal of the maint flag was the proximate cause of our problem. I’ve made this exact mistake in other contexts before, so I at least had a sense of this already.

That’s all well and good, but the proximate cause is not really what we care about. It’s more important how we got to this point. As it happens, the underlying cause is that the maint flag was set on acompile in the first place. I’ll describe why I did so initially and what we will do differently in the future.

An important fact is that Slurm (sensibly) does not want two reservations scheduled for the same nodes at the same time unless you, the admin, are REAL SURE you want that. The maint flag is one of only two documented ways to create overlapping reservations. We use this flag all the time for its prime intended purpose, reserving the system for scheduled maintenance. So far so good.

However, at our last planned maintenance (PM), on April 5, we had several fixes to make to our ongoing reservations, including acompile. For simplicity’s sake, we chose to delete and rebuild them according to our improved designs, rather than updating them in place. When I first attempted the rebuild step with acompile, I was blocked from creating it because of the existing (maint) reservations, so I added that flag to my scontrol create reservation command. From my bash history:

# failed
  325  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE

# succeeded
  329  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,MAINT

What I had not realized was that, by setting the maint flag in acompile and never removing it, I was leaving every node that cycled through acompile in the MAINTENANCE state – hence the issues above.
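In hindsight, the missing step was a single command once the PM was over: strip the maint flag from acompile, so that nodes rotating out of it return to normal service.

```shell
# Remove only the maint flag, leaving replace in place
scontrol update reservation ReservationName=acompile Flags-=maint
```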

I can imagine other possible solutions to this issue – scontrol reconfigure or systemctl restart slurmctld may have helped, though I don’t like to start there. In any case, I think what I’ve described did reveal an issue in how I rebuilt this particular reservation.

For the future I see a few followup steps:

  1. Document this information (see: this blog post, internal docs).
  2. Revisit the overlap flag for Slurm reservations, which in my experience is a little trickier than maint but may prevent this issue if we implement it right.
  3. Add reservation config checks as a late step in post-PM testing, perhaps the last thing to do before sending the all-clear message.
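For step 2, here is an untested sketch of what an overlap-based rebuild might look like, reusing the parameters from my bash history above. I haven't verified that replace and overlap combine cleanly, so treat this as a starting point rather than a recipe:

```shell
# Create the floating compile reservation, using overlap (rather than
# maint) to coexist with a standing maintenance reservation; nodes
# cycling through should then never inherit the MAINTENANCE state
scontrol create reservation ReservationName=acompile StartTime=now \
    Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root \
    Flags=REPLACE,OVERLAP
```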

This was definitely a mistake on my part, but I (and my team) learned from it. I wrote this post to share the lesson, and I hope it helps if you’re an HPC admin who encounters similar behavior.

Alpine updates and RMACC 2022

This week I had the opportunity to speak at the 2022 RMACC Symposium, hosted by my own institution, about the Alpine supercomputer. My presentation and the others from my CU colleagues are available here.

In summary, Alpine has been in production since our launch event in May. After some supply chain issues (the same that have affected the entire computing sector), we are preparing to bring another round of nodes online within weeks. That will put Alpine’s total available resources (about 16,000 cores) on par with those of the retiring Summit system. It’s an exciting step for us at CURC.

As for RMACC: I’d never attended the symposium before. After three days, I came away with a lot of new information, new contacts, and ideas for how to better support our researchers. A few topics I paid particular attention to:

  • Better and more scalable methods of deploying HPC systems and software
  • How the community will navigate the transition from XSEDE to ACCESS
  • The companies, organizations, and universities (like mine!) building the future of this space
  • Changes in business models for the vendors and commercial developers we work with

Academic HPC is a small niche in the computing world, and gatherings like this can be valuable as spaces to connect and share our best ideas.