CASC meeting for spring 2024

Last week I had the pleasure of attending the spring 2024 conference of the Coalition for Academic Scientific Computation (CASC) in Washington, DC. As a special treat, I also got to attend (in person) the Cyberinfrastructure Leadership Academy (CILA) 2024 the day before CASC. [1]

It’s an opportunity to learn about the state of research computing at academic institutions in the U.S. today. Along with SC, it’s also a chance to see in person a lot of people I mostly encounter over email or as boxes in Zoom meetings.

One memorable event was a talk by “HPC Dan” Reed, incorporating information from this blog post and a lot more (in 2024, it wouldn’t do to ignore LLMs). This preceded the release of the Indicators report by the National Science Board, which he presented at the White House on the 13th.

[1] Enough acronyms yet?

End of year roundup, 2023

It’s year-in-review time. I’ve included some short notable points and some of my favorite nature photos of 2023. Here we go.

***

Addition to the family – I am now an uncle and that’s pretty cool. 🙂

First SC – Despite a few years working on HPC systems, I’d never attended SC before. I actually felt like a member of the HPC community (beyond just my current workplace) for the first time. Plus I got to hang out and have drinks and food with coworkers. You can read some more of my highly non-technical impressions here.

Family photos – My project, now a few years in the works, to digitize all my family’s photos is nearing completion. Thousands of photos are now neatly sorted into nice folders on my computer, backed up to a mix of cloud services, and shared online. There could be more to do, but this dramatically increases the discoverability and longevity of these images.

30 – I turned 30 in 2023, the only decade milestone that has ever troubled me. It turned out mercifully anticlimactic. There are areas in my life where I’m not satisfied, but I am happy with my career progression, I like my workplace and my team, and my geographic location is a great fit for me. I wouldn’t want to stop here, because there are other parts of my life that I’ve neglected. However, everyone’s timelines are different, and I believe I’m now well-suited to improve in other areas.

Incremental improvement – I went back and re-read my 2022 year-in-review post. In a lot of ways, 2023 feels like a part two. A lot of what I did either continued or built upon the progress of that year. I’m better at my job, better off financially, and more assured with myself as a person. Like last year, this one was not host to a swath of drastic changes. But looking back over the course of those two years in total, there is unmistakable improvement.

***

This has been 2023. I’m still formulating my 2024 plans, but I think 2023 provided a foundation I can build on throughout my thirties.

Happy Holidays!

SC reflection

I had a great time at SC23. What follows is a series of stray thoughts, some related but others entirely standalone. In a way, that captures the vibe of the six-day event as well as a more coherent narrative might.

Entrance of the convention center. In the middle of the image you’ll see the physical representation of the timeline of SC history.

Seeing people I know through industry or have met in other venues, and getting to connect and talk HPC all in one place, was really exciting.

The free coffee and muffins were also exciting, although I could always do with more coffee. 🙂

I really let myself become a pack rat, since this was my first SC. I wanted to grab as much as I could, so I bought merch at the store and picked up a lot of swag and stickers. A number of institutions brought booklets and brochures as well. A few of the items I took were things I wouldn’t have taken if this hadn’t been my first SC, but it was – so I splurged.

We had some great conversations with peers and some vendors about how different systems can be deployed into our HPC environments. In particular we have some systems coming up that will be suited to expanding our work into quantum computing.

I attended several technical workshops. I found one about memory and energy especially interesting, but there was a wide range of topics catering to a mix of HPC specialties.

I appreciated the emphasis that the conference put on this year’s slogan, “I am HPC”, which highlights the human element of our work. Some of that is diversity and inclusion. Some is developing the workforce. Some involves the networking you get to do at any well-organized trade show. The slogan was a useful lens to consider it all through.

Panel from the “I am HPC” plenary session.

CU Research Computing actually participated in SC. We supplied some hardware (from our retired Summit cluster) for the IndySCC, an event linked to SC’s annual Student Cluster Competition. It was a legitimately challenging project, but it came together in the end and we’re proud to have played our part.

Outside the convention itself, I also got to spend some time in Denver. A coworker of mine lives in the area and had a recommendation, so we went to a few different spots near the convention center to eat. There were plenty of options for a range of different tastes, including some great vegetarian food that I appreciated.

One llama.

Finally, SC turned out to be a great bonding opportunity for the team: we were all in place, with a common purpose and common vocabulary. We had a booth for RMACC that acted as our home base for some stretches of time. I feel better connected to my teammates because of the experience.

SC24 is in Atlanta, Georgia, and I am certainly interested in participating again.

SC23: Very first impressions

Despite now being several years into my career as an HPC professional, I’d never attended SC before. As such, SC23 will be a time of first impressions. My very first impression: it’s wild that a conference in this admittedly niche line of work fully earns a convention center and an airport-style registration line.

Hope to write more on this later.

Campus Cyberinfrastructure Conference

In late September I attended a conference of the National Science Foundation’s Campus Cyberinfrastructure program, on behalf of the University of Colorado Boulder and our research computing team. This year’s event took place in Columbus, Ohio, in a nice convention center.

Visiting Columbus constituted a bit of a return to my past work and home at Earlham College just a few hours from there. I didn’t have time for a visit, but it was a bit nostalgic just the same.

It was also cloudy and cool on arrival, a reprieve from the high temperatures we were still experiencing in northern Colorado at the time.

I spoke for about ten minutes on an NSF award we received to expand the Alpine cluster to support more users from the Rocky Mountain Advanced Computing Consortium (RMACC). The project got picked up by HPCWire, using text I drafted alongside RMACC’s executive director. That’s fun!

Some key takeaways:

  • Computing needs in higher education continue to grow. This is certainly true of research, but it also applies to courses, as the large language model (LLM, colloquially “AI”) surge brings high compute and data storage requirements.
  • This is perhaps obvious but I think it’s worth saying explicitly: needs and approaches vary widely. Some institutions, like CU, run centralized resources that everyone on campus (as well as external contributors) can use. Others build out dedicated HPC clusters customized for a particular domain science. Still others exclusively run a condo (researcher buy-in) model.
  • The campus cyberinfrastructure community is very open and friendly. This was my first conference in this program and I never felt out of place.
  • Researcher trust is central to success in building out campus cyberinfrastructure.
  • Typing “cyberinfrastructure” over and over again is a pain, which is why they abbreviate it to CC*. 🙂

Ultimately I had a good time. I’m looking forward to implementing this project.

Pride reflection

June is Pride Month. I have a few thoughts.

Pride street painting, Boulder, CO, 2021

To me, celebrating pride is about celebrating different modes of pursuing happiness. More to the point, it’s about the breaking of arbitrary expectations for gender presentation, identity, and expression. That includes the right to fall in love with someone of the same sex, but it goes well beyond that.

I’m gay, so I’m very much a participant in this month’s celebrations. I’m also a cis male and to outward appearances basically gender-conforming. That’s neither a good thing nor a bad thing – just where I’ve landed. But I like the idea that others enjoy the freedom to be otherwise, that if I felt compelled to change or redefine some aspect of my identity or presentation tomorrow I could, and that the realm of personal freedom keeps expanding.

The opposition is loud and destructive, and it’s reached a fever pitch in the last few years. Transgender people in particular are the targets du jour. I see conservatives trying to drive a wedge between gay/bi and trans people. I see Republicans attacking Pride Month merchandise in stores, shuttering programs promoting diversity, and banning LGBTQ books. Worse, they’re isolating queer kids and queer families in school. They’re making it harder for people to just live as they see fit, even though living that way doesn’t do a bit of harm to anyone else.

In the face of this, my fellow queer people make me proud. These are people living happy, interesting, loving, fulfilling lives despite intimidation and scapegoating. This community gives me hope for the future when it sometimes feels in short supply.

It’s inspiring, and not just in theory and not just for each person individually. We truly have accomplished a lot for the improvement of our society. On a scale of decades, and with plenty of setbacks, America has become more accepting of the wide variety of people who live here. If we (and now I’m including straight folks) can empathize with each other and make a bit of room for other people’s differences, we can continue on that path. To me, that’s what all those rainbow flags and parades are about: celebrating where we’ve been and looking forward to how much better we can still do.

Happy Pride. 🏳‍🌈🏳️‍⚧️

The author, windswept, smiling at Seyðisfjörður, Iceland, chapel by rainbow cobblestones, 2021

Evening Tones

“Evening Tones” – Oscar Bluemner

Evening Tones abstracts a landscape along the Hudson River into a vibrant range of colors. Bluemner came to the United States to escape Germany’s conservatism, hoping to find the freedom to try new ideas. After years of struggling in his architectural practice, he turned to painting, throwing himself into the exciting theories of modern art that were making their way across the Atlantic from Europe. But in the climate of World War I, foreign painters and foreign ideas were suspect. A critic reviewing Bluemner’s work in 1915 avowed that his art was “utterly alien to the American idea of democracy.”

Smithsonian Open Access

Sometimes a work of art and a story just speak to you.

Fun with the Slurm reservation MAINT and REPLACE flags

Note: as of 23.02.1 at least, Slurm no longer exhibits this behavior:

scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags

I’m preserving this post for posterity, to show why that’s a very good idea.

***

Late this fine Saturday morning I noticed the work Slack blowing up. Uh-oh.

Turns out that earlier in the week I had introduced an issue with our compile functionality, which rests on the logic of Slurm reservations. It’s now fixed, and I wanted to share what we learned in the event that it can help admins at other HPC centers who encounter similar issues.

See, on CU Boulder Research Computing (CURC)’s HPC system Alpine, we have a floating reservation for two nodes to allow users to access a shell on a compute node to compile code, with minimal waiting. Any two standard compute nodes are eligible for the reservation, and we use the Slurm replace flag to exchange the nodes over time as new nodes become idle.

But on Saturday morning we observed several bad behaviors:

  • The reservation, acompile, had the maint flag.
  • Nodes that went through acompile ended up in a MAINTENANCE state that, upon their release, rendered them unusable for standard batch jobs.
  • Because nodes rotate in and out, Slurm was considering more and more nodes to be unavailable.
  • A member of our team attempted to solve the issue by setting flags=replace on the reservation. This seemed to solve the issue briefly but it quickly resurfaced.

I think I have a sense of the proximate cause and can explain it, and I also think I know the underlying cause and possible fixes.

Proximate cause: Slurm reservations (at least as of version 22.05.2) are conservative with how they update the maint flag. To use this example: to remove the maint flag from a reservation with flags=maint,replace, it’s not sufficient to set flags=replace – the flag must be explicitly removed, with something like flags-=maint.

Allow me to demonstrate.

This command creates a reservation with flags=maint,replace:

$ scontrol create reservation reservationName=craig_example users=crea5307 nodeCnt=1 starttime=now duration=infinite flags=maint,replace
Reservation created: craig_example

Slurm creates it as expected:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

We (attempt to) update the reservation using flags=replace. The intention is to have replace be the only flag. This would seem to be the logical behavior.

$ scontrol update reservation reservationName=craig_example flags=replace
Reservation updated.

However, despite an apparently satisfying output message, this fails to achieve our goal. The maint flag remains:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=MAINT,REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Then, using minus-equals, we actually remove the maint flag:

$ scontrol update reservation reservationName=craig_example flags-=maint
Reservation updated.

Lo, the flag is gone:

$ scontrol show res craig_example
ReservationName=craig_example StartTime=2023-04-08T11:58:44 EndTime=2024-04-07T11:58:44 Duration=365-00:00:00
   Nodes=c3cpu-a9-u1-2 NodeCnt=1 CoreCnt=64 Features=(null) PartitionName=amilan Flags=REPLACE
   TRES=cpu=64
   Users=crea5307 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Based on this example, which replicates the behavior we observed, it very much appears to me that the failure to properly remove the maint flag was the proximate cause of our problem. I’ve made this exact mistake in other contexts before, so I at least had a sense of this already.

That’s all well and good, but the proximate cause is not really what we care about. It’s more important how we got to this point. As it happens, the underlying cause is that the maint flag was set on acompile in the first place. I’ll describe why I did so initially and what we will do differently in the future.

An important fact is that Slurm (sensibly) does not want two reservations scheduled for the same nodes at the same time unless you, the admin, are REAL SURE you want that. The maint flag is one of the only two documented ways to create overlapping reservations. We use this flag all the time for its prime intended purpose, reserving the system for scheduled maintenance. So far so good.

However, at our last planned maintenance (PM), on April 5, we had several fixes to make to our ongoing reservations, including acompile. For simplicity’s sake, we chose to delete and rebuild them according to our improved designs, rather than updating them in place. When I first attempted the rebuild step with acompile, I was blocked from creating it because of the existing (maint) reservations, so I added that flag to my scontrol create reservation command. From my bash history:

# failed
  325  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE

# succeeded
  329  alpine scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,MAINT

What I had not realized was that, by setting the maint flag in acompile and never removing it, I was leaving every node that cycled through acompile in the MAINTENANCE state – hence the issues above.
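
Given that, the immediate remedy follows directly from the demonstration above: the same minus-equals update, applied to acompile. A minimal sketch (not a transcript of exactly what we ran):

$ scontrol update reservation ReservationName=acompile Flags-=maint   # explicitly remove maint
$ scontrol show res acompile                                          # confirm only REPLACE remains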

I can imagine other possible solutions to this issue – scontrol reconfigure or systemctl restart slurmctld may have helped, though I don’t like to start there. In any case, I think what I’ve described did reveal an issue in how I rebuilt this particular reservation.

For the future I see a few followup steps:

  1. Document this information (see: this blog post, internal docs).
  2. Revisit the overlap flag for Slurm reservations, which in my experience is a little trickier than maint but may prevent this issue if we implement it right (a rough sketch follows this list).
  3. Add reservation config checks as a late step in post-PM testing, perhaps the last thing to do before sending the all-clear message.
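
Regarding item 2, here’s a rough sketch of what the acompile rebuild might look like with overlap in place of maint. I haven’t validated this exact flag combination (and newer Slurm releases have tightened which flags can coexist), so treat it as a starting point rather than a recipe:

$ # hypothetical rebuild using overlap instead of maint; verify against your Slurm version first
$ scontrol create reservation ReservationName=acompile StartTime=now Duration=infinite NodeCnt=2 PartitionName=acompile Users=-root Flags=REPLACE,OVERLAP

The appeal is that overlap only relaxes the collision check against existing reservations at creation time, so it shouldn’t leave nodes marked MAINTENANCE as they cycle through.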

This was definitely a mistake on my part, but I (and my team) learned from it. I wrote this post to share the lesson, and I hope it helps if you’re an HPC admin who encounters similar behavior.

Making a WordPress code block monospace

Say you’re on WordPress, and say you have a code block.

This is a code block.

That’s fine, but with code I usually want monospace:

This is monospace code.

The way to do that is as follows:

1. Create a code block.
2. Click "More" (the downward caret) in the formatting menu that goes with the block.
3. Click "Keyboard input".

As you can see above, you can apply it to individual lines or to all the code, just like formatting any other bit of text.

I had to search a bit to find this, so I’m posting primarily as a reminder to myself.

These steps work as of the time of this writing. They could always change.

A few more thoughts on family photos

For context, see this post and this one.

I have a few more observations about my family photo project. They are below, and I reserve the right to update this list in the future rather than creating a new post on the subject.

  • Use fdupes or something similar to find and remove duplicate photos. I avoided uploading over 9,000 duplicate photos that way, and I wish I’d discovered it much sooner. (A short sketch of the commands follows this list.)
  • Something I glossed over in my earlier posts: do not throw away photos. I’ve tossed a handful, almost exclusively duplicates, but the default should be to keep the original pictures. It’s one extra layer of redundancy.
  • If you are cropping multiple pictures out of a single scan, you might at some point accidentally crop the original scan itself and lose some of the pictures. To prevent the incredible annoyance – or outright data loss – of an event like this, make sure to have local backups. I’m a believer in the classic 3-2-1 backup system.
  • It is delightful to hear from your loved ones about their excitement at viewing their old photos.
  • You will never be done. 🙂
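
Expanding on the first bullet, this is roughly how I’d run fdupes today. The directory name is just a placeholder for wherever your archive lives, and I’d always review the plain listing before letting it delete anything:

$ fdupes -r ~/Pictures/family-archive     # list sets of duplicate files (no changes made)
$ fdupes -rdN ~/Pictures/family-archive   # keep the first file in each set, delete the rest

Here -r recurses into subdirectories, -d deletes, and -N skips the interactive prompt by preserving the first file in each duplicate set.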

To me there’s often a cutoff where something goes from being a “project” to just something I occasionally refresh. This venture is at that point. I’m quite happy with how it’s turned out.