Note: as of version 23.02.1 (at least), Slurm no longer permits the flag combination at the root of this behavior:
`scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags`
I'm preserving this post for posterity, as it explains why that restriction is a very good idea.
***
Late this fine Saturday morning I noticed the work Slack blowing up. Uh-oh.
Turns out that earlier in the week I had introduced an issue with our compile functionality, which rests on the logic of Slurm reservations. It’s now fixed, and I wanted to share what we learned in the event that it can help admins at other HPC centers who encounter similar issues.
See, on CU Boulder Research Computing (CURC)'s HPC system Alpine, we have a floating reservation for two nodes to allow users to access a shell on a compute node to compile code, with minimal waiting. Any two standard compute nodes are eligible for the reservation, and we use the Slurm `replace` flag to exchange the nodes over time as new nodes become idle.
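In generic Slurm terms, a user lands on one of these nodes by starting an interactive job that targets the reservation. The exact invocation (or wrapper) we give users at CURC may differ, but as a rough sketch it amounts to something like:
$ srun --partition=acompile --reservation=acompile --pty /bin/bash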
But on Saturday morning we observed several bad behaviors:
- The reservation, `acompile`, had the `maint` flag.
- Nodes that went through `acompile` ended up in a `MAINTENANCE` state that, upon their release, rendered them unusable for users' standard batch jobs (a quick way to check for this is sketched just after this list).
- Because nodes rotate in and out, Slurm was considering more and more nodes to be unavailable.
- A member of our team attempted to solve the issue by setting `flags=replace` on the reservation. This seemed to solve the issue briefly, but it quickly resurfaced.
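For anyone wanting to look for this symptom on their own system, the node state flags are visible with standard queries. A minimal check, using a node and partition name from the examples below purely as illustration:
# show one node's state flags (look for the maintenance flag)
$ scontrol show node c3cpu-a9-u1-2 | grep State
# list nodes in a partition currently carrying the maint state
$ sinfo --partition=amilan --states=maint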
I think I have a sense of the proximate cause (explained below), and I also think I know the underlying cause and possible fixes.
Proximate cause: Slurm reservations (at least as of version 22.05.2) are conservative about how they update the `maint` flag. To use this example: to remove it from a reservation with `flags=maint,replace`, it is not sufficient to set `flags=replace`; the flag must be explicitly removed, with something like `flags-=maint`.
Allow me to demonstrate.
This command creates a reservation with `flags=maint,replace`:
$ scontrol create reservation reservationName=craig_example users=crea5307 nodeCnt=1 starttime=now duration=infinite flags=maint,replace
Reservation created: craig_example
Slurm creates it as expected:
$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
TRES=cpu=64
Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
We (attempt to) update the reservation using `flags=replace`. The intention is to have `replace` be the only flag; this would seem to be the logical behavior.
$ scontrol update reservation reservationName=craig_example flags=replace
Reservation updated.
However, despite an apparently satisfying output message, this fails to achieve our goal. The `maint` flag remains:
$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
TRES=cpu=64
Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Then, using minus-equals, we actually remove the `maint` flag:
$ scontrol update reservation reservationName=craig_example flags-=maint
Reservation updated.
Lo, the flag is gone:
$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=REPLACE
TRES=cpu=64
Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Based on this example, which replicates the behavior we observed, it very much appears to me that failing to properly remove the `maint` flag was the proximate cause of our problem. I've made this exact mistake in other contexts before, so I at least had a sense of this already.
That's all well and good, but the proximate cause is not really what we care about. It's more important how we got to this point. As it happens, the underlying cause is that the `maint` flag was set on `acompile` in the first place. I'll describe why I did so initially and what we will do differently in the future.
An important fact is that Slurm (sensibly) does not want two reservations scheduled for the same nodes at the same time unless you, the admin, are REAL SURE you want that. The `maint` flag is one of only two documented ways to create overlapping reservations. We use this flag all the time for its prime intended purpose, reserving the system for scheduled maintenance. So far so good.
However, at our last planned maintenance (PM), on April 5, we had several fixes to make to our ongoing reservations, including `acompile`. For simplicity's sake, we chose to delete and rebuild them according to our improved designs, rather than updating them in place. When I first attempted the rebuild step with `acompile`, I was blocked from creating it because of the existing (`maint`) reservations, so I added that flag to my `scontrol create reservation` command. From my bash history:
# failed
325 alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE
# succeeded
329 alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,MAINT
What I had not realized was that, by setting the `maint` flag on `acompile` and never removing it, I was leaving every node that cycled through `acompile` in the `MAINTENANCE` state, hence the issues above.
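Given the demonstration above, the shape of the fix on the real reservation is the same minus-equals update (shown here as a sketch of the form, not a transcript of what we actually ran):
$ scontrol update reservation ReservationName=acompile Flags-=maint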
I can imagine other possible solutions to this issue; `scontrol reconfigure` or `systemctl restart slurmctld` may have helped, though I don't like to start there. In any case, I think what I've described did reveal an issue in how I rebuilt this particular reservation.
For the future I see a few followup steps:
- Document this information (see: this blog post, internal docs).
- Revisit the `overlap` flag for Slurm reservations, which in my experience is a little trickier than `maint` but may prevent this issue if we implement it right (a rough sketch follows this list).
- Add reservation config checks as a late step in post-PM testing, perhaps the last thing to do before sending the all-clear message.
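To make the `overlap` idea a bit more concrete, here is a sketch, not a tested recipe, of how the rebuild might have gone during the PM window, assuming `overlap` and `replace` can be combined on the Slurm version in question. The point is to let the new reservation coexist with the system-wide maintenance reservation without taking on maint semantics itself:
# during the PM, while the system-wide maint reservation still covers these nodes
$ scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,OVERLAP
Whether we then strip the flag afterwards (with something like `flags-=overlap`) or leave it in place is part of the "implement it right" homework.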
This was definitely a mistake on my part, but I (and my team) learned from it. I wrote this post to share the lesson, and I hope it helps if you’re an HPC admin who encounters similar behavior.