CentOS 8 is going away

CentOS 8 is going away at the end of 2021 (emphasis added):

The future of the CentOS Project is CentOS Stream, and over the next year we’ll be shifting focus from CentOS Linux, the rebuild of Red Hat Enterprise Linux (RHEL), to CentOS Stream, which tracks just ahead of a current RHEL release. CentOS Linux 8, as a rebuild of RHEL 8, will end at the end of 2021. CentOS Stream continues after that date, serving as the upstream (development) branch of Red Hat Enterprise Linux.

We’re just going to walk into the ocean now…

Losing an HPC-friendly, enterprise-grade, stable, free operating system throws a wrench into our activities. We run several small CentOS clusters, though fortunately they all run CentOS 7 (maintenance updates till 2024). We have time to respond, but it will be extra work when it comes time to upgrade.

Here’s an aggregation of observations from the last couple of days:

  • The comments on that blog post speak bluntly.
  • The Beowulf email list is full of discussion already.
  • I see a lot of chatter now about Ubuntu and Debian. Both would be viable distros for the few CentOS 8 hosts we currently run. They may or may not fit cluster upgrade needs a few years from now.
  • Others have chimed in to mention Oracle Enterprise Linux (ehhhhh), a revival of Scientific Linux, NixOS, and GuixOS.
  • This replacement project, Rocky Linux, is in development here or maybe here (hard for me as an outsider to tell). It may go somewhere – or maybe not. I’ve seen others float their own forks as well. Maybe one will get traction, but who can say?
  • CentOS 8 was quite a bit different from CentOS 7. Hard not to feel frustrated at learning a new system just to see its lifespan cut short by eight years.
  • I put a note about this in our issue tracker today. It reads: “The path is straightforward and requires less data copying than some past migrations. We just need to have a good answer on the OS of choice. I imagine the community will converge on either a ‘correct’ answer or a few good answers relatively soon, but I don’t think we’re there yet.”
  • In short, for us: This will be fine, but it will also be annoying.

This does change our strategy for the next few months. We need a new external SSH server, for example, and it can’t reasonably be CentOS 8 anymore. Down the road, we will have cluster systems to upgrade, and I still have no idea what that looks like.

A 2020 success story from Earlham CS

I’m proud of something from this year – a real 2020 success story.

(Image: Earlham, winter 2020)

To give some backstory: January-March were maybe the roughest three months of my tech life to date. We had a cascade of server hardware failures that induced a lot of downtime. Total catastrophe. I’m grateful for my institution’s patience.

After a lot of extra hours in windowless rooms working on it, we did resolve those problems. We diagnosed the root causes and took steps to prevent similar issues in the future. I also learned a lot. (Some of the lessons from those days continue to guide us, and they’ve been imprinted on me forever.)

The very next day the March lockdowns started and the College sent everyone away.

We went all-remote for the rest of spring and shifted into hybrid mode for the fall. That increased our dependence on system availability. Naturally, I was uneasy about that after the stress of the spring term. I directed myself and the CS admin students to focus on uptime, iterative improvement, and minimal disruption.

What makes me proud is this: it worked.

Since resolving those issues in the winter and spring, we’ve been stable. Individual services and hosts have had issues, of course. Some of those issues took significant time and energy, and we’re still not perfect (probably never will be!). There is always more to fix, more to improve, more to automate, more to introduce.

But the systems as a whole have operated without unplanned interruption since March.

We’ve faced uncertainty after uncertainty in 2020. But my colleagues and students have been able to count on our systems working. We’re not a giant shop here, but we have kept up with the changing times.

There it is: one clear 2020 success story. Engineering this success was a collaborative effort to which I’m just one contributor, but I am proud of it.

Give yourself the gift of quality control

If you spend any time at all in the tech chatter space, you have probably heard a lot of discontent about the quality of software these days. Just two examples:

I can’t do anything about the cultural, economic, and social environment that cultivates these issues. (So maybe I shouldn’t say anything at all? 🙂 )

I can say that, if you’re in a position to do something about it, you should treat yourself to quality control.

The case I’d like to briefly highlight is about our infrastructure rather than a software package, but I think this principle can be generalized.

Case study: bringing order to a data center

After a series of (related) service outages in the spring of 2020, shortly before the onset of the COVID-19 crisis, we cut back on some expansionary ambitions to get our house in order.

Here’s a sample, not even a comprehensive list, of the things we’ve fixed in the last couple of months:

  • updated every OS we run such that most of our systems will need only incremental upgrades for the next few years
  • transitioned to the Slurm scheduler for all of our clusters and compute nodes, which has already made it easier to track and troubleshoot batch jobs
  • modernized hardware across the board, including upgraded storage and network cards
  • retired unreliable nodes
  • implemented comprehensive monitoring and alerts
  • replaced our old LDAP server and map with a new one that will better suit our authentication needs across many current and future services
  • fixed the configuration of our Jupyterhub instances for efficiency

Notice: None of those are “let’s add a new server” or “let’s support 17 new software packages”. It’s all about improving the things we already supported.

There are a lot of institutional reasons our systems needed this work, primarily the shortage of staffing that affects a lot of small colleges. But from a pragmatic perspective, to me and to the student admins, those reasons don’t matter. What matters is that we were in a position to fix the problems.

By consciously choosing to do so, we think we’ve reduced future overhead and downtime risk substantially. Quantitatively, we’ve gone from a few dozen open issue tickets to 19 as of this writing, and six of those remaining are advancing rapidly toward resolution.

How we did it and what’s next

I don’t have a dramatic reveal here. We just made the simple (if not always easy) decision to confront our issues and make quality a priority.

Time is an exhaustible, non-renewable resource. We decided to spend our time on making existing systems work much much better, rather than adding new features. This kind of focus can be boring, because of how strictly it blocks distractions, but the results speak for themselves.

After all that work, now we can pivot to the shiny new thing: installing, supporting, and using new software. We’ve been revving up support for virtual machines and containers for a long time. HPC continues to advance and discover new applications. The freedom to explore these domains will open up a lot of room for student and faculty research over time. It may also help as we prepare to move into our first full semester under COVID-19, which is likely to have (at minimum) a substantial remote component.

Some thoughts on moving from Torque to Slurm

This is more about the process than the feature set.

Torque moved out of open-source space a couple of years ago. This summer we are finally making the full shift to Slurm. I’m not going to trash the old thing here. Instead I want to celebrate the new thing and reflect on the process of installing it.

  1. I haven’t researched the provenance of Slurm as a project, but the UI seems engineered to make this shift easier. There are tables all over the Internet (including on our wiki!) of the Torque<->Slurm translations; see the quick sketch after this list.
  2. Slurm’s accounting features were the trickiest part of this all to configure, but taking the time was worth it. Even at the testing stage, the sacct command’s output is super-informative.
  3. SchedMD’s documentation is among the best of any large piece of software I’ve worked with. If you’re doing this and you feel like you’re missing something, double-check their documents before flogging Stack Overflow etc.
  4. You can in fact do a single-server install as well as a cluster install. We did both, the latter in conjunction with Ansible. Neither is actually much more difficult than the other. That’s because the same three pieces of software (the controller, the database, and the worker daemon) have to run no matter the topology. It’s just that the worker runs on every compute node while the controller and database run only on the head node.
  5. We’ve been successful in using an A –> AB –> B approach to this transition. Right now we have both schedulers next to each other on each of these systems. That will remain the case for a few weeks, until we confirm we’ve done Slurm right.
  6. Schedulers have the most complicated build process of any piece of software I’ve worked with – except gcc, the building of which sometimes makes one want to walk into the ocean.
  7. Dependencies and related programs (e.g. your choice of email tool) are as much a complexity as the scheduler itself.
  8. From a branding perspective, Slurm managed to pull off an impressive feat. Its name is clear and distinctive in the software space, but a fun Easter egg if you have a certain geek pop culture interest/awareness.
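
To make items 1 and 2 concrete, here is the rough command mapping we reference most often, plus an example of sacct output formatting. This is a sketch from memory; the full translation tables online are more complete.

# Torque                       Slurm
# qsub job.sh                  sbatch job.sh
# qstat [-u $USER]             squeue [-u $USER]
# qdel <jobid>                 scancel <jobid>
# pbsnodes                     sinfo (or scontrol show nodes)
# #PBS -l nodes=2:ppn=8        #SBATCH --nodes=2 --ntasks-per-node=8
# #PBS -l walltime=01:00:00    #SBATCH --time=01:00:00

# accounting, once slurmdbd is configured:
sacct --starttime 2020-06-01 --format=JobID,User,Partition,Elapsed,State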

This has been successful so far. We’ve soft-launched Slurm installs on our scientific computing servers. We should be all-Slurm when classes and researchers return.

Batten down the (network) hatches

It’s been a long time since we systematically updated our security measures at Earlham CS. I spent some time on that this week. I wanted to share some of the changes we made so that if you’re running a small-to-midsize network you might implement similar fixes.

The bare minimum

We’ve been using two critical and often unmentioned security measures already:

  • physically locking down the data center
  • running a network firewall

These two things alone do a lot to secure the system.

Securing services

Of course, we also provide a lot of services over the network, everything from web servers to shells. We have to secure access to all of those tools, plus our data. We want the necessary cracks in our firewall to have as low a risk as possible of being exploited.

What remained, then, was the installation and configuration of server tools to harden security above and beyond physical locks and firewalls – in a word, “DevSecOps”.

First, on those machines that didn’t already have it, we installed unattended-upgrades (Debian/Ubuntu), yum-cron (CentOS 7), or dnf-automatic (CentOS 8). We use these to automatically apply security patches to package-managed software. We’re still free to install larger updates each semester manually to minimize disruptions. It’s a good balance of stability and security vigilance.
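
On a CentOS 8 box, the setup is roughly the following (package names, config paths, and settings are the stock ones; adjust for your distro and preferences):

dnf install dnf-automatic
# in /etc/dnf/automatic.conf, set:
#   upgrade_type  = security
#   apply_updates = yes
systemctl enable --now dnf-automatic.timer

# CentOS 7 equivalent: yum-cron, with apply_updates = yes in /etc/yum/yum-cron.conf
yum install yum-cron && systemctl enable --now yum-cron

# Debian/Ubuntu equivalent
apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades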

Next we installed fail2ban on the small number of servers to which our firewall allows SSH access. It detects and blocks possibly-malicious IP addresses trying to connect to the servers. We enabled two “jails” in fail2ban: sshd, which catches likely bad actors attempting ssh connections and bans them for a short time; and recidive, which checks the log records from sshd (and potentially other jails), detects repeat offenders, and imposes longer-lasting bans against them.

(This is the digital equivalent of locking up your house so that the lazy would-be burglar going door-to-door checking knobs can’t get in.)
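
The configuration for that is small. Here is roughly what our jail.local looks like – a sketch, not a drop-in file; newer fail2ban releases accept the m/h/d/w time suffixes, while older ones want plain seconds:

# /etc/fail2ban/jail.local
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h

[recidive]
enabled  = true
findtime = 1d
bantime  = 1w

# then turn it on and confirm the jail is live:
systemctl enable --now fail2ban
fail2ban-client status sshd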

We then ran trufflehog on our public GitLab repos. It gave us a few warnings, but none that actually contained compromising system or user information. I consider this good luck more than anything, and we’re now taking proactive steps to prevent such mistakes.
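
The invocation was roughly this; the exact flags vary between trufflehog versions (check --help for yours), and the repository URL here is just a placeholder:

# scan a repo's full history for known secret patterns and high-entropy strings
trufflehog --regex --entropy=True https://gitlab.example.edu/some-group/some-repo.git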

Still to come

Our next security steps will focus on improved monitoring and notification. Monitoring has been a stability issue for us in the past, and fixing it will also contribute to security. We are also constantly reevaluating security approaches at a department policy level.

Thanks to this post for pointing me to some of the tools mentioned here.

How to enable a custom systemctl service unit without disabling SELinux

We do a lot of upgrades in the summer. This year we’re migrating from the Torque scheduler (which is no longer open-source) to the Slurm scheduler (which is). It’s a good learning experience in addition to being a systems improvement.

First I installed Slurm on a cluster, successfully. That took a while: it turns out schedulers have complicated installation processes with specific dependency chains. To save time in the future, I decided to attempt to automate the installation.

This has gone better than you might initially guess.

I threw my command history into a script, spun up a VM, and began iterating. After a bit of work, I’ve made the installation script work consistently with almost no direct user input.

Then I tried running it on another machine and ran headfirst into SELinux.

The problem

The installation itself went fine, but the OS displayed this message every time I tried to enable the Slurm control daemon:

[root@host system]# systemctl enable slurmctld.service
Failed to enable unit: Unit file slurmctld.service does not exist.

I double- and triple-checked that my file was in a directory that systemctl expected. After that I checked /var/log/messages and saw a bunch of errors like this …

type=AVC msg=audit(1589561958.124:5788): avc:  denied  { read } for  pid=1 comm="systemd" name="slurmctld.service" dev="dm-0" ino=34756852 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:admin_home_t:s0 tclass=file permissive=0

… and this:

type=USER_END msg=audit(1589562370.317:5841): pid=3893 uid=0 auid=1000 ses=29 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:session_close grantors=pam_keyinit,pam_limits,pam_systemd,pam_unix acct="slurm" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'UID="root" AUID="[omitted]"

Then I ran ls -Z on the service file’s directory to check its SELinux context:

-rw-r--r--. 1 root root unconfined_u:object_r:admin_home_t:s0                  367 May 15 13:01 slurmctld.service
[...]
-rw-r--r--. 1 root root system_u:object_r:systemd_unit_file_t:s0               337 May 11  2019 smartd.service

Notice that the smartd file has a different context (system_u...) than does the slurmctld file (unconfined_u...). My inference was that the slurmctld file’s context was a (not-trusted) default, and that the solution was to make its context consistent with the context of the working systemctl unit files.

The solution

Here’s how to give the service file a new context in SELinux:

chcon system_u:object_r:systemd_unit_file_t:s0 slurmctld.service 

To find the appropriate security context, check ls -Z on a working unit file. Trust that more than my command, because your context may not match mine.
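
For what it’s worth, an alternative that should land in the same place – assuming the unit file lives under /etc/systemd/system, where the policy’s default label is already systemd_unit_file_t – is to let SELinux reapply the default label rather than typing one by hand:

restorecon -v /etc/systemd/system/slurmctld.service   # reapply the default label for that path
systemctl daemon-reload
systemctl enable slurmctld.service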

Concluding remarks

I am early-career and have done very little work with SELinux, so this is not a specialty of mine right now. As such, this may or may not be the best solution. But, mindful of some security advice, I think it is preferable to disabling SELinux altogether.

Between the files and the disks of your server

I recently took a painful and convoluted path to understanding management of disks in Linux. I wanted to post that here for my own reference, and maybe you will find it useful as well. Note that these commands should generally not be directly copy-pasted, and should be used advisedly after careful planning.

Let’s take a plunge into the ocean, shall we?

Filesystem

You’ve definitely seen this. This is the surface level, the very highest layer of abstraction. If you’re not a sysadmin, there’s a good chance this is the only layer you care about (on your phone, it’s likely you don’t even care about this one!).

The filesystem is where files are kept and managed. There are tools to mount the filesystem on an underlying device (/dev/mapper/... or /dev/vgname/lvname) at a mount point, or to share it – for example, over NFS. You can also use the filesystem on a logical volume (see below) as the disk for a virtual machine.

This is where ext2, ext3, ext4, xfs, and more come in. This is not a post about filesystems (I don’t know enough about filesystems to credibly write that post) but they each have features and associated utilities. Most of our systems are ext4 but we have some older systems with ext2 and some systems with xfs.

Commands (vary by filesystem)

  • mount and umount; see /etc/fstab and /etc/exports
  • df -h can show you if your filesystem mount is crowded for storage
  • fsck (a front end that dispatches to filesystem-specific checkers such as e2fsck)
  • resize2fs /dev/lv/home 512G # resize a filesystem to be 512G, might accompany lvresize below
  • xfsdump/xfsrestore for XFS filesystems
  • mkfs /dev/lvmdata/device # make a filesystem on a device
  • fdisk -l isn’t technically a filesystem tool, but it operates at a high level of abstraction and you should be aware of it
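
To tie a few of those together, here is the shape of putting a filesystem on a logical volume and mounting it. The device names and mount point are made up for illustration; see the LVM layers below for where /dev/myvg/mylv comes from.

mkfs -t ext4 /dev/myvg/mylv            # make a filesystem on the logical volume
mkdir -p /mnt/research
mount /dev/myvg/mylv /mnt/research     # mount it by hand
df -h /mnt/research                    # check size and free space

# /etc/fstab line so it mounts at boot:
/dev/myvg/mylv  /mnt/research  ext4  defaults  0 2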

LVM

Filesystems are made on top of underlying volumes in LVM, the “logical volume manager” – Linux’s volume management and partitioning layer. (Actually manipulating LVM rather than passively accepting simple defaults is technically optional, but it’s widely used.)

LVM has three layers of abstraction within itself, each with its own set of utilities. This mirrors the layered-abstraction pattern that runs through the whole stack described in this post.

LVM logical volumes

A volume group can then be organized into logical volumes. The commands here are incredibly powerful and give you the ability to manage disk space with ease (we’re grading “easy” on a curve).

If you resize a filesystem, there’s a good chance you’ll also need to resize the logical volume underneath it – grow the volume before growing the filesystem, and shrink the filesystem before shrinking the volume. (There’s a sketch of this after the command list below.)

Commands:

  • lvdisplay
  • lvscan
  • lvcreate -L 20G -n mylv myvg # create a 20GB logical volume called mylv in group myvg
  • lvresize -L 520G /dev/lv/home # resize the logical volume at /dev/lv/home to 520GB
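
Here is the grow workflow mentioned above, with made-up names. (Shrinking reverses the order and is riskier – plan carefully and back up first.)

lvextend -L +100G /dev/myvg/mylv       # grow the logical volume by 100GB first...
resize2fs /dev/myvg/mylv               # ...then grow the ext4 filesystem to fill it

# or let LVM run the filesystem resize for you:
lvextend -r -L +100G /dev/myvg/mylv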

LVM volume groups

A volume group is the pool of space from which logical volumes are created. It’s a collection of one or more LVM “physical” volumes (see below).

Commands:

  • vgscan
  • vgdisplay
  • pvmove /dev/mydevice # to get stuff off of a PV and move it to available free space elsewhere in the VG

LVM physical volumes

At the lowest LVM layer there are “physical” volumes. These might actually correspond to physical disks or partitions (if you have no RAID underneath), or they might be other /dev objects in the OS (/dev/md127 would be a physical volume in this model).

These are the LVM analog to disk partitions.

Commands:

  • pvscan
  • pvdisplay
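
The main time we touch this layer directly is when adding a new disk or RAID device to the pool. Roughly, with placeholder device and group names:

pvcreate /dev/sdb1            # label the new partition as an LVM physical volume
vgextend myvg /dev/sdb1       # add it to an existing volume group
vgdisplay myvg                # the group now shows the additional free extents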

Software RAID (optional)

RAID combines multiple disks into a single logical device, for redundancy, performance, or both. There are both “hardware” and “software” implementations of RAID, and software RAID sits at a higher level of abstraction. It’s convenient for a (super-)user to manage. Our machines (like many) use mdadm, but there are other tools.

Commands:

  • mdadm --detail --scan
  • mdadm -D /dev/mdXYZ # details
  • mdadm -Q /dev/mdXYZ # short, human-readable
  • cat /proc/mdstat
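
For completeness, creating a simple mirror looks roughly like this. The device names are placeholders, and obviously don’t point this at disks holding data you care about.

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --detail --scan >> /etc/mdadm.conf    # persist the array definition (config path varies by distro)
# /dev/md0 can then serve as an LVM physical volume (see above)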

Devices in the OS

“In UNIX, everything is a file.” In Linux that’s mostly true as well.

The /dev directory contains the files that correspond to each particular device detected by the OS. I found these useful mostly for reference, because everything refers to them in some way.

If you look closely, things like /dev/mapper/devicename are often symlinks (pointers) to other devices.

All the other layers provide you better abstractions and more powerful tools for working with devices. For that reason, you probably won’t do much with these directly.

(The astute will observe that /dev is a directory so we’ve leapt up the layers of abstraction here. True! However, it’s the best lens you as a user have on the things the OS detects in the lower layers.)

Also: dmesg. Use dmesg. It will help you.

Hardware RAID (optional)

If you use software RAID for convenience, you use hardware RAID for performance and information-hiding.

Hardware RAID presents the underlying drives to the OS at boot time by way of a RAID controller on the motherboard. At boot, you can access a tiny bit of software (with a GUI that’s probably older than me) to create and modify hardware RAID volumes. In other words, the RAID volume(s), not the physical drives, appear to you as a user.

At least some, and I presume most, RAID controllers have software that you can install on the operating system that will let you get a look at the physical disks that compose the logical volumes.

Relevant software at this level:

  • MegaCLI # we have a MegaRAID controller on the server in question
  • smartctl --scan
  • smartctl -a -d megaraid,15 /dev/bus/6 # substitute the identifying numbers from the scan command above
  • not much else – most hardware RAID changes require a reboot into the controller utility; for this reason we tend to keep ours simple
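
On our MegaRAID box, the two queries we reach for most look roughly like this. The binary may be packaged as MegaCli, MegaCli64, or the newer storcli, depending on vendor and age.

MegaCli64 -LDInfo -Lall -aALL    # list the logical (RAID) volumes and their state
MegaCli64 -PDList -aALL          # list the physical disks behind them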

Physical storage

We have reached the seafloor, where you have some drives – SSDs, spinning disks, etc. Those drives are the very lowest level of abstraction: they are literal, physical devices. Because of this, we don’t tend to work with them directly except at installation and removal.

Summary and context

From highest to lowest layers of abstraction:

  1. filesystem
  2. LVM [lv > vg > pv]
  3. software RAID
  4. devices in the OS
  5. hardware RAID
  6. disks

The origin story of this blog post (also a wiki page, if you’re an Earlham CS sysadmin student!): necessity’s the mother of invention.

I supervise a sysadmin team. It’s a team of students who work part-time, so in practice I’m a player-coach.

In February, we experienced a disk failure that triggered protracted downtime on an important server. It was a topic I was unfamiliar with, so I did a lot of on-the-job training and research. I read probably dozens of blog posts about filesystems, but none used language that made sense to me in a coherent, unified, and specific way. I hope I’ve done so here, so that others can learn from my mistakes!

I didn’t know there was a full First Letters Capitalized list

While reading during the early days of the pandemic, I discovered that someone made a list of the Eight Fallacies of Distributed Computing:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn’t change.
  • There is one administrator.
  • Transport cost is zero.
  • The network is homogeneous.

I’ve seen all those assumptions fail personally. That’s with only a few years of experience running small distributed systems.

The originating article/talk is here.

Monitor more than you think you need

The admins and I have just emerged from a few weeks of slow-rolling outages. We have good reason to believe all is well again.

I want to write more about the incident and the skills upgrade that it gave us, but I’ll stick with this for now:

We’re a liberal arts college CS department with a couple dozen servers. In other words, we’re not a big shop. Mostly we can see when something’s off with our eyes and ears. We won’t be able to access something, emails will stop going through, etc. For that reason, services rarely disappear for long in our world.

Once in a while, though, we have to trace a problem – like the last month, for example. Ultimately it was a simple problem with a simple solution (hardware failures, nothing to do but replace our old SSDs), but we spent a lot of time trying to identify that cause. That’s mostly because we currently lack a comprehensive monitoring suite.

That’s getting fixed, and quick.

As part of a broader quality control initiative we’re working on, we’re going to monitor everything – hardware, software, networks, connections, etc. The overhead of maintaining extensible monitoring systems is not as severe as the overhead of tracing problems when those systems don’t exist. To my knowledge, this brings us in line with best practices in industry.

Yes, it’s just a couple dozen servers. But experience has shown: All sentences of the form “It’s just X” are dangerous sentences.

Golang and more: this week’s personal tech updates

First I haz a sad. After a server choke last week, the Earlham CS admins finally had to declare time-of-death on the filesystem underlying one of our widely-used virtual machines. Definitive causes evade us (I think they are lost to history), so we will now pivot to rebuilding and improving the system design.

In some respects this was frustrating and produced a lot of stress for us. On the other hand, it’s a sweet demo of the power of virtualization. The server died, but the hardware underlying it was still fine. That means we can rebuild at a fraction of the cost of discovering, purchasing, installing, and configuring new metal. The problem doesn’t disappear but it moves from hardware to software.

I’ve also discovered a few hardware problems. One of the drones we will take to Iceland needs a bit of work, for example. I also found that our Canon camera may have a bad orientation sensor, so the LCD display doesn’t auto-rotate. Discovering those things in February is not fun. Discovering them in May or June would have been much worse.

Happier news: I began learning Go this week. I have written a lot of Java, C, and some Python, but for whatever reason I’ve taken to Golang as I have with no other language. It has a lot of the strengths of C, a little less syntactical cruft, good documentation, and a rich developer literature online. I also simply have more experience now than I did when I first learned those other languages.

A brief elaboration on experience: A lot of people say you have to really love programming or you have no hope of being good at it. Maybe. But I’m more partial to thinking of software engineering as a craft. Preexisting passion is invaluable but not critical, because passion can be cultivated (cf. Cal Newport). It emerges from building skills, trying things, perseverance, solving some interesting problems, and observing your own progress over time. In my experience (like here and here), programming as a student, brand-new to the discipline, was often frustrating and opaque. Fast forward, and today I spent several hours on my day off learning Golang because it was interesting and fun. 🤷‍♂️

Your mileage may vary, but that was my experience.

Finally, here are a few articles I read or re-read this week:

Earlham’s four-day weekend runs from today through Sunday. After a couple of stressful weeks, I’m going to take advantage of the remainder of the time off to decompress.