CentOS 8 is going away

CentOS 8 is going away at the end of 2021 (emphasis added):

The future of the CentOS Project is CentOS Stream, and over the next year we’ll be shifting focus from CentOS Linux, the rebuild of Red Hat Enterprise Linux (RHEL), to CentOS Stream, which tracks just ahead of a current RHEL release. CentOS Linux 8, as a rebuild of RHEL 8, will end at the end of 2021. CentOS Stream continues after that date, serving as the upstream (development) branch of Red Hat Enterprise Linux.

We’re just going to walk into the ocean now…

Losing an HPC-friendly, enterprise-grade, stable, free operating system throws a wrench into our activities. We run several small CentOS clusters, though fortunately they all run CentOS 7 (maintenance updates till 2024). We have time to respond, but it will be extra work when it comes time to upgrade.

Here’s an aggregation of observations from the last couple of days:

  • The comments on that blog post speak bluntly.
  • The Beowulf email list is full of discussion already.
  • I see a lot of chatter now about Ubuntu and Debian. Both would be viable distros for the few CentOS 8 hosts we currently run. They may or may not fit cluster upgrade needs a few years from now.
  • Others have chimed in to mention Oracle Enterprise Linux (ehhhhh), a revival of Scientific Linux, NixOS, and GuixOS.
  • This replacement project, Rocky Linux, is in development here or maybe here (hard for me as an outsider to tell). It may go somewhere – or maybe not. I’ve seen others float their own forks as well. Maybe one will get traction, but who can say?
  • CentOS 8 was quite a bit different from CentOS 7. Hard not to feel frustrated at learning a new system just to see its lifespan cut short by eight years.
  • I put a note about this in our issue tracker today. It reads: “The path is straightforward and requires less data copying than some past migrations. We just need to have a good answer on the OS of choice. I imagine the community will converge on either a ‘correct’ answer or a few good answers relatively soon, but I don’t think we’re there yet.”
  • In short, for us: This will be fine, but it will also be annoying.

This does change our strategy for the next few months. We need a new external SSH server, for example, and it can’t reasonably be CentOS 8 anymore. Down the road, we will have cluster systems to upgrade, and I still have no idea what that looks like.

A 2020 success story from Earlham CS

I’m proud of something from this year – a real 2020 success story.

[Photo: Earlham, winter 2020]

To give some backstory: January-March were maybe the roughest three months of my tech life to date. We had a cascade of server hardware failures that induced a lot of downtime. Total catastrophe. I’m grateful for my institution’s patience.

After a lot of extra hours in windowless rooms working on it, we did resolve those problems. We diagnosed the root causes and took steps to prevent similar issues in the future. I also learned a lot. (Some of the lessons from those days continue to guide us, and they’ve been imprinted on me forever.)

The very next day the March lockdowns started and the College sent everyone away.

We went all-remote for the rest of spring and shifted into hybrid mode for the fall. That increased the dependency on system availability. Naturally, I was uneasy about that after the stress of the spring term. I directed myself and the CS admin students to focus on uptime, iterative improvement, and minimal disruption.

What makes me proud is this: it worked.

Since resolving those issues in the winter and spring, we’ve been stable. Individual services and hosts have had issues, of course. Some of those issues took significant time and energy, and we’re still not perfect (probably never will be!). There is always more to fix, more to improve, more to automate, more to introduce.

But at the system level, we’ve operated without unplanned interruption since March.

We’ve faced uncertainty after uncertainty in 2020. But my colleagues and students have been able to count on our systems working. We’re not a giant shop here, but we have kept up with the changing times.

There it is: one clear 2020 success story. Engineering this success was a collaborative effort to which I’m just one contributor, but I am proud of it.

Tech improves pandemic life

I can’t imagine going through the COVID-19 pandemic without computers. Tech improves pandemic life, and it makes it easier for us to make good decisions.

For reasons of both personal caution and what I see as a moral duty, I am probably in the 80th percentile for cautious behavior during the pandemic. I live alone, and my job lends itself to remote work for almost everything. What’s more, my workplace is a socially-conscious liberal arts college. As a result, I interact with very few people (those I do see are always masked-up).

That lifestyle is only sustainable because of computer technology. I buy and pick up groceries through an app. Meetings take place over video chats. Songs or podcasts play in the background while I cook. I can stream almost anything I want to see. I’ve continued to learn and to work using some excellent rectangles.

[Image: Ron Swanson saying “This is an excellent rectangle.”]

There are tradeoffs, of course, but I have basically lived this way since March. Doing so I have weathered the pandemic as well as I could hope (so far).

The national dialogue now includes a lot of chatter about how to stay safe for the holidays. I’m cautious and want to model good behavior. That means I’ll be on FaceTime for Thanksgiving, Christmas, and New Year’s Eve. That’s not great, and it’ll be sad not to be physically visiting family.

But for people like me, the alternative to a FaceTime holiday isn’t an in-person holiday, but a canceled holiday, spent in isolation. Thanks to the people in my industry, I don’t have to do that. Technology brings people together. It’s one reason I remain idealistic about the work I do.

Amidst the tragedies and terrors of 2020, pause to appreciate the age we live in and the cool things we’ve invented. Tech improves pandemic life – and improves life in general. There’s lots to worry about if you want (conspiracy theories, AI risk, etc.), but I’m happy to live in a technologically advanced society.

Jupyterhub user issues: a 90% improvement

[Photo: Jupiter, the planet]
Jupyter errors are not to be confused with Jupiter errors.

At Earlham Computer Science we have to support a couple dozen intro CS students per semester (or, in COVID times, per 7-week term). We teach Python, and we want to make sure everyone has the right tools to succeed. To do that, we use the Jupyterhub notebook environment, and we periodically respond to user issues related to running notebooks there.

A couple of dozen people running Python code on a server can gobble up resources and induce problems. Jupyter has historically been our toughest service to support, but we’ve vastly improved. In fact, as I’ll show, we have reduced the frequency of incidents by about 90 percent over time.

Note: we only recently began automatic tracking of uptime, so that data is almost useless for comparisons over time. The mailing list analysis below is the best approximation we have. If new information surfaces to discredit any of my methods, I’ll revise the analysis, but my colleagues have confirmed to me that it is at least plausible.

Retrieving the raw data

I started my job at Earlham in June 2018. In November 2018, we resolved an archiving issue with our help desk/admin mailing list, so the archives from that point on give us our first dataset.

I ran a grep for the “Messages:” string in the thread archives:

grep 'Messages:' */thread.html # super complicated

I did a little text processing to generate the dataset: regular expression find-and-replace in an editor. That reduced the data to a column of YYYY-Month values and a column of message counts.
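For anyone who would rather script that cleanup than do it in an editor, here is a minimal sed sketch. It is a swapped-in technique, not the editor workflow described above. It assumes each month of the archive lives in a directory named like 2018-November and that the grep output looks like the sample lines in the demo; that layout and the month_counts function name are assumptions for illustration.

```shell
# month_counts: read grep output on stdin, print "YYYY-Month count" lines.
# Assumed (hypothetical) input shape:
#   2018-November/thread.html:<b>Messages:</b> 42<p>
month_counts() {
    sed -E 's#^([0-9]{4}-[A-Za-z]+)/thread\.html:.*Messages:(</b>)? *([0-9]+).*#\1 \3#'
}

# Demo on two made-up lines rather than a real archive.
printf '%s\n' \
    '2018-November/thread.html:<b>Messages:</b> 42<p>' \
    '2018-December/thread.html:<b>Messages:</b> 17<p>' |
    month_counts
```

Against a real archive, the output of the grep command above would be piped into month_counts in place of the printf demo.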

Then I searched the subject.html files for every line matching “jupyter”, case-insensitively:

grep -i jupyter {2018,2019,2020}*/subject.html 

I saved the output to jupyter-messages-18-20.dat and did some more text processing (again regexes, find and replace). Then I decided that follow-up messages are not what we care about, so I ran uniq against that file. A few quick wc -l commands later, we find:

  • 21 Jupyter requests in 2018
  • 17 Jupyter requests in 2019
  • 19 Jupyter requests in 2020
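The dedupe-and-count step can be sketched roughly as follows. This is an illustration under assumptions, not the exact commands run: it assumes the cleaned-up file holds one line per message in the form "year subject", with reply prefixes such as "Re:" already stripped, and count_requests is a hypothetical helper name. One detail worth knowing is that plain uniq only collapses adjacent duplicate lines, so the sketch sorts first.

```shell
# count_requests: read "<year> <subject>" lines on stdin and print
# "<year> <number of distinct subjects>" for each year.
count_requests() {
    # sort -u drops repeated subjects within a year (follow-up messages);
    # cut keeps only the year; uniq -c tallies lines per year;
    # awk swaps the columns into "year count" order.
    sort -u | cut -d' ' -f1 | uniq -c | awk '{print $2, $1}'
}

# Demo on made-up subjects, not real mailing list data.
printf '%s\n' \
    '2018 jupyter kernel keeps dying' \
    '2018 jupyter kernel keeps dying' \
    '2018 cannot log in to jupyter' \
    '2019 jupyter notebook will not save' |
    count_requests
```

With the real file, the same function would read from jupyter-messages-18-20.dat instead of the printf demo.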

One caveat is that in 2020 we moved a lot of communication to Slack. This adds some uncertainty to the data. However, I know from context that Jupyter requests have continued to flow through the mailing list disproportionately. As such, Slack messages are likely to be the sort of redundant information already obscured using uniq in the text processing.

Another qualifier is that a year or so ago we began using GitLab’s Issues as a ticket tracking system. I searched that as well and found 11 more Jupyter issues, all from 2020. Fortunately, only 1 of those was a problem that did not overlap with a mailing list entry.

Still, I think those raw numbers are a good baseline. At one level, it looks bad. The 2020 number has barely budged from 2018 and in fact it’s worse than 2019. That’s misleading, though.

Digging deeper into the data

Buried in that tiny dataset is some good news about the trends.

For one thing, those 21 Jupyter requests were in only 4 months out of the year – in other words, we were wildly misconfigured and putting out a lot of unnecessary technical fires. (That’s nobody’s fault – it’s primarily due to the fact that my position did not exist for about a year before I arrived at it, so we atrophied.)

What’s more, by inspection, about half of the 19 requests this year are password or feature requests rather than problems, whereas I think the 17 we saw in 2019 were all real problems.

So in terms of Jupyter problems in the admin list, I find:

  • around 20 in the latter third of 2018
  • 17 in ALL OF 2019
  • only two (granted one was a BIG problem but still only 2) in 2020

That’s a 90% reduction in Jupyterhub user issues over three years (from around 20 down to 2), by my count.
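The arithmetic behind that figure is simple enough to check in a couple of lines. The sketch below uses the approximate counts from the list above; percent_reduction is a hypothetical helper name, and the result is rounded to a whole percent.

```shell
# percent_reduction BEFORE AFTER: print the drop from BEFORE to AFTER
# as a whole-number percentage of BEFORE.
percent_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

percent_reduction 20 2   # about 20 problems in late 2018 vs. 2 in 2020
percent_reduction 17 2   # 17 problems in 2019 vs. 2 in 2020
```

Keep in mind that the 2018 count covers only about four months, so the raw comparison, if anything, understates the change in monthly rate.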

“That’s amazing, how’d you do it?”

Number one: thank you, imaginary reader, you’re too kind.

Number two: a lot of ways.

In no particular order:

  1. We migrated off of a VM, which given our hardware constraints was not conducive to a resource-intensive service like Jupyterhub.
  2. Gradually over time, we’ve upgraded our storage hardware, as some of it was old and (turns out) failing.
  3. We added RAM. When it comes to RAM, some is good, more is better, and too much is just enough.
  4. We manage user directories better. We export these over NFS but have done all we can to reduce network dependencies. That significantly reduces the amount of time the CPU spends twiddling its thumbs.

What’s more, we’re not stopping here. We’re currently exploring load-balancing options – for example, running Jupyter notebooks through a batch scheduler like Slurm, or potentially on a container orchestration platform like Kubernetes. There are several solutions, but we haven’t yet determined which is best for our use case.

This is the work of a team of people, not just me, but I wanted to share it as an example of growth and progress over time. It’s incremental but it really does make a difference. Jupyterhub user issues, like so many issues, are usually solvable.

I’m making websites!

As the exclamation point indicates, I’m excited to announce this: I’m now making websites again!

A bit over two years ago, I left self-employment as an all-around tech services provider and joined my alma mater, Earlham College. That was a good move. I have built my skills across the board, and having this job has kept my career steady through e.g. COVID.

However, I’ve missed some of the work from those days, as well as the independence. I don’t like having only one income source in a time of high economic unpredictability. I also want to continue expanding my skillset, growing my portfolio, and controlling the course of my own career.

For all these reasons, I’m accepting new projects effective now. You can click here to see plans and examples, or reach out (cearley@craigearley.com) to hire me to make a website for you.

My particular passions are making websites for individuals and small businesses (including online stores). Most likely if you’re at a larger scale than that, you have in-house web and sysadmin teams anyway. 🙂 If what I offer is right for you, please reach out. I look forward to hearing from you.

Meet our Terrestrial Mapping Platform!

Just a nice photo from Iceland

I’m excited to share that the Earlham field science program is now sharing the core of our Terrestrial Mapping Platform (TMP)! This is very much a work-in-progress, but we’re excited about it and wanted to share it as soon as we could.

We had to delay the 2020 Iceland trip because of COVID-19. That of course pushed back the implementation and case study component of this project, which was Iceland-centric. But we are moving forward at full speed with everything else. As Earlham has now started the new academic year, we have also resumed work on the TMP.

The project is a UAV hardware-software platform for scientists. It consists of:

  • a consumer-grade drone for capturing images
  • flight plan generation software and an application to automate drone flights
  • data analysis workflows for the images – visible light and NIR, assembled into 2D and 3D models

All of this goes toward making science more accessible to a broader range of domain scientists. Archaeologists and glaciologists are our current target cohort, but many more could find use for this work if it’s successful.

We will make all of this accessible in repositories with open licenses on our GitLab instance. Some are already available. Others we will share once we review them for (e.g.) accidentally-committed credentials.

That was all planned, if delayed. We’re also using our extra year of preparation time to make the project better in a few ways:

  • Reevaluating our choice of UAV make and model
  • Prettifying our web presence, which very much includes blog posts like this
  • Reducing the friction and pain points in our current workflow
  • Making our code and infrastructure better in general (I’ve covered my growing emphasis on quality here before)

The team mostly comprises students and faculty (of whom I’m the junior-most). Additionally, there are a few on-site partners in Iceland and innumerable personal supporters who make this possible. We’ll be sharing more at the Earlham Field Science blog as we go. I will undoubtedly share more here as well.

COVID is bad, but we want to make the best of this era. This is one way we’re doing that.

(Disclosure: We received funding for this from a National Geographic grant. None of the views in this blog post or our online presence represents, or is endorsed by, Nat Geo.)

Give yourself the gift of quality control

If you spend any time at all in the tech chatter space, you have probably heard a lot of discontent about the quality of software these days. Just two examples:

I can’t do anything about the cultural, economic, and social environment that cultivates these issues. (So maybe I shouldn’t say anything at all? 🙂 )

I can say that, if you’re in a position to do something about it, you should treat yourself to quality control.

The case I’d like to briefly highlight is about our infrastructure rather than a software package, but I think this principle can be generalized.

Case study: bringing order to a data center

After a series of (related) service outages in the spring of 2020, shortly before the onset of the COVID-19 crisis, we cut back on some expansionary ambitions to get our house in order.

Here’s a sample, not even a comprehensive list, of the things we’ve fixed in the last couple of months:

  • updated every OS we run such that most of our systems will need only incremental upgrades for the next few years
  • transitioned to the Slurm scheduler for all of our clusters and compute nodes, which has already made it easier to track and troubleshoot batch jobs
  • modernized hardware across the board, including upgraded storage and network cards
  • retired unreliable nodes
  • implemented comprehensive monitoring and alerts
  • replaced our old LDAP server and map with a new one that will better suit our authentication needs across many current and future services
  • fixed the configuration of our Jupyterhub instances for efficiency

Notice: None of those are “let’s add a new server” or “let’s support 17 new software packages”. It’s all about improving the things we already supported.

There are a lot of institutional reasons our systems needed this work, primarily the shortage of staffing that affects a lot of small colleges. But from a pragmatic perspective, to me and to the student admins, these reasons don’t matter. What matters is that we were in a position to fix them.

By consciously choosing to do so, we think we’ve reduced future overhead and downtime risk substantially. Quantitatively, we’ve gone from a few dozen open issue tickets to 19 as of this writing. Six others are advancing rapidly.

How we did it and what’s next

I don’t have a dramatic reveal here. We just made the simple (if not always easy) decision to confront our issues and make quality a priority.

Time is an exhaustible, non-renewable resource. We decided to spend our time on making existing systems work much much better, rather than adding new features. This kind of focus can be boring, because of how strictly it blocks distractions, but the results speak for themselves.

After all that work, now we can pivot to the shiny new thing: installing, supporting, and using new software. We’ve been revving up support for virtual machines and containers for a long time. HPC continues to advance and discover new applications. The freedom to explore these domains will open up a lot of room for student and faculty research over time. It may also help as we prepare to move into our first full semester under COVID-19, which is likely to have (at minimum) a substantial remote component.

Some thoughts on moving from Torque to Slurm

This is more about the process than the feature set.

Torque moved out of open-source space a couple of years ago. This summer we are finally making the full shift to Slurm. I’m not going to trash the old thing here. Instead I want to celebrate the new thing and reflect on the process of installing it.

  1. I haven’t researched the provenance of Slurm as a project, but the UI seems engineered to make this shift easier. There are tables all over the Internet (including on our wiki!) of the Torque<->Slurm translations.
  2. Slurm’s accounting features were the trickiest part of this all to configure, but taking the time was worth it. Even at the testing stage, the sacct command’s output is super-informative.
  3. SchedMD’s documentation is among the best of any large piece of software I’ve worked with. If you’re doing this and you feel like you’re missing something, double-check their documents before flogging Stack Overflow etc.
  4. You can in fact do a single-server install as well as a cluster install. We did both, the latter in conjunction with Ansible. Neither is actually much more difficult than the other. That’s because the same three pieces of software (the controller, slurmctld; the accounting database daemon, slurmdbd; and the worker daemon, slurmd) have to run no matter the topology. It’s just that the worker runs on every compute node while the controller and database run only on the head node.
  5. We’ve been successful in using an A –> AB –> B approach to this transition. Right now we have both schedulers next to each other on each of these systems. That will remain the case for a few weeks, until we confirm we’ve done Slurm right.
  6. Schedulers have the most complicated build process of any piece of software I’ve worked with – except gcc, the building of which sometimes makes one want to walk into the ocean.
  7. Dependencies and related programs (e.g. your choice of email tool) are as much a complexity as the scheduler itself.
  8. From a branding perspective, Slurm managed to pull off an impressive feat. Its name is clear and distinctive in the software space, but it’s also a fun Easter egg if you have a certain geek pop culture interest/awareness.
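To make the Torque<->Slurm translation from the first point concrete, here is a small batch script sketch with the Torque equivalent noted above each Slurm directive. The job name, resource numbers, and output file are made-up placeholders, and describe_job is a hypothetical helper; none of this is taken from our actual job scripts.

```shell
#!/bin/bash
# Slurm directives, each preceded by the Torque line it would replace.
# Torque: #PBS -N hello
#SBATCH --job-name=hello
# Torque: #PBS -l nodes=1:ppn=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
# Torque: #PBS -l walltime=00:10:00
#SBATCH --time=00:10:00
# Torque: #PBS -o hello.out
#SBATCH --output=hello.out

# Commands: qsub -> sbatch, qstat -> squeue, qdel -> scancel.
# Variables: PBS_JOBID -> SLURM_JOB_ID, PBS_O_WORKDIR -> SLURM_SUBMIT_DIR.
describe_job() {
    # Fall back to placeholders so the script also runs outside a scheduler.
    echo "job ${SLURM_JOB_ID:-none} submitted from ${SLURM_SUBMIT_DIR:-$PWD}"
}

describe_job
```

Because the directives are ordinary comments to bash, the script can be run directly for a quick sanity check before handing it to sbatch.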

This has been successful up to now. We’ve soft-launched Slurm installs on our scientific computing servers. We should be all-Slurm when classes and researchers return.

Batten down the (network) hatches

It’s been a long time since we systematically updated our security measures at Earlham CS. I spent some time on that this week. I wanted to share some of the changes we made so that if you’re running a small-to-midsize network you might implement similar fixes.

The bare minimum

We’ve been using two critical and often unmentioned security measures already:

  • physically locking down the data center
  • running a network firewall

These two things alone do a lot to secure the system.

Securing services

Of course, we also provide a lot of services over the network, everything from web servers to shells. We have to secure access to all of those tools, plus our data. We want the necessary cracks in our firewall to have as low a risk as possible of being exploited.

What remained, then, was the installation and configuration of server tools to harden security above and beyond physical locks and firewalls – in a word, “DevSecOps”.

First, on those machines that didn’t already have it, we installed unattended-upgrades (Debian/Ubuntu)/yum-cron (CentOS 7)/dnf-automatic (CentOS 8). We use these to automatically apply security patches to package-managed software. We’re still free to install larger updates each semester manually to minimize disruptions. It’s a good balance of stability and security vigilance.
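As a rough sketch of how that choice could be scripted across a mixed fleet, the function below maps a distro to the matching automatic-update package. The ID and VERSION_ID values are the fields found in /etc/os-release; the auto_update_package name is hypothetical, and the mapping covers only the three cases named above.

```shell
# auto_update_package ID VERSION_ID: print the automatic-update package
# for a distro, using the ID and VERSION_ID fields from /etc/os-release.
auto_update_package() {
    case "$1:$2" in
        debian:*|ubuntu:*) echo "unattended-upgrades" ;;
        centos:7*)         echo "yum-cron" ;;
        centos:8*)         echo "dnf-automatic" ;;
        *)                 echo "unknown"; return 1 ;;
    esac
}

# Demo with fixed values; on a real host you could instead run
#   . /etc/os-release && auto_update_package "$ID" "$VERSION_ID"
auto_update_package ubuntu 20.04
auto_update_package centos 7
auto_update_package centos 8
```

Installing the package is only the first half; each tool still needs its own configuration to limit it to security updates.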

Next we installed fail2ban on the small number of servers to which our firewall allows SSH access. It detects and blocks possibly malicious IP addresses trying to connect to the servers. We enabled two “jails” in fail2ban: sshd, which catches likely bad actors attempting SSH connections and bans them for a short time; and recidive, which reads fail2ban’s own log of bans issued by sshd (and potentially other jails), detects repeat offenders, and imposes longer-lasting bans against them.
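For reference, a two-jail setup along those lines might look like the fragment printed below. This is a sketch, not our actual configuration: the ban and find times (in seconds) and the retry limit are placeholder values, and print_jail_local is a hypothetical helper that just writes the fragment to standard output.

```shell
# print_jail_local: emit a jail.local-style fragment enabling two jails.
# All numeric values are placeholders chosen for illustration.
print_jail_local() {
    cat <<'EOF'
[sshd]
enabled  = true
# ban for 10 minutes after 5 failures within 10 minutes
bantime  = 600
findtime = 600
maxretry = 5

[recidive]
enabled  = true
# ban repeat offenders for 1 week, looking back over 1 day
bantime  = 604800
findtime = 86400
EOF
}

print_jail_local
```

On most installs a fragment like this would go in /etc/fail2ban/jail.local, which overrides the packaged jail.conf without editing it.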

(This is the digital equivalent of locking up your house so that the lazy would-be burglar going door-to-door checking knobs can’t get in.)

We then ran trufflehog on our public GitLab repos. It gave us a few warnings, but none of the flagged items actually contained compromising system or user information. I consider this good luck more than anything, and we’re now taking proactive steps to prevent such mistakes.

Still to come

Our next security steps will focus on improved monitoring and notification. This has been an issue in the past for stability, but fixing it will also contribute to security. We are also constantly reevaluating security approaches at a department policy level.

Thanks to this post for pointing me to some of the tools mentioned here.