How to enable a custom systemctl service unit without disabling SELinux

We do a lot of upgrades in the summer. This year we’re migrating from the Torque scheduler (which is no longer open-source) to the Slurm scheduler (which is). It’s a good learning experience in addition to being a systems improvement.

First I installed it on a cluster, successfully. That took a while: it turns out schedulers have complicated installation processes with specific dependency chains. To save time in the future, I decided to attempt to automate the installation.

This has gone better than you might initially guess.

I threw my command history into a script, spun up a VM, and began iterating. After a bit of work, I’ve made the installation script work consistently with almost no direct user input.

Then I tried running it on another machine and ran headfirst into SELinux.

The problem

The installation itself went fine, but the OS displayed this message every time I tried to enable the Slurm control daemon:

[root@host system]# systemctl enable slurmctld.service
Failed to enable unit: Unit file slurmctld.service does not exist.

I double- and triple-checked that my file was in a directory that systemctl expected. After that I checked /var/log/messages and saw a bunch of errors like this …

type=AVC msg=audit(1589561958.124:5788): avc:  denied  { read } for  pid=1 comm="systemd" name="slurmctld.service" dev="dm-0" ino=34756852 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:admin_home_t:s0 tclass=file permissive=0

… and this:

type=USER_END msg=audit(1589562370.317:5841): pid=3893 uid=0 auid=1000 ses=29 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:session_close grantors=pam_keyinit,pam_limits,pam_systemd,pam_unix acct="slurm" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'UID="root" AUID="[omitted]"

Then I ran ls -Z on the service file’s directory to check its SELinux context:

-rw-r--r--. 1 root root unconfined_u:object_r:admin_home_t:s0                  367 May 15 13:01 slurmctld.service
[...]
-rw-r--r--. 1 root root system_u:object_r:systemd_unit_file_t:s0               337 May 11  2019 smartd.service

Notice that the smartd file has a different context (system_u...) than the slurmctld file (unconfined_u...). My inference was that the slurmctld file’s context was an untrusted default, and that the solution was to make its context consistent with the context of the working systemd unit files.

The solution

Here’s how to give the service file a new context in SELinux:

chcon system_u:object_r:systemd_unit_file_t:s0 slurmctld.service 

To see the appropriate security context, check ls -Z. Trust that more than my command, because your context may not match mine.
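Putting the whole fix together, it might look like the following. (The context string is the one from my system; check ls -Z on a working unit file and substitute yours.)

```shell
# Copy the context from a known-good unit file in the same directory
ls -Z /etc/systemd/system/

# Relabel the new unit file to match
chcon system_u:object_r:systemd_unit_file_t:s0 slurmctld.service

# Alternative: restorecon resets a file to the policy's default context
# for its path; unlike chcon, the result survives a filesystem relabel
restorecon -v slurmctld.service

# Now systemd can read the file
systemctl daemon-reload
systemctl enable slurmctld.service
```

If restorecon sets the right context on its own, prefer it: chcon changes are lost the next time the filesystem is relabeled.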

Concluding remarks

I am early-career and have done very little work with SELinux, so this is not a specialty of mine right now. As such, this may or may not be the best solution. But, mindful of some security advice, I think it is preferable to disabling SELinux altogether.

Between the files and the disks of your server

I recently took a painful and convoluted path to understanding management of disks in Linux. I wanted to post that here for my own reference, and maybe you will find it useful as well. Note that these commands should generally not be directly copy-pasted, and should be used advisedly after careful planning.

Let’s take a plunge into the ocean, shall we?

Filesystem

You’ve definitely seen this. This is the surface level, the very highest layer of abstraction. If you’re not a sysadmin, there’s a good chance this is the only layer you care about (on your phone, it’s likely you don’t even care about this one!).

The filesystem is where files are kept and managed. There are tools to mount either the underlying device (/dev/mapper or /dev/vgname) or the filesystem itself to mount points – for example, over NFS. You can also use the filesystem on a logical volume (see below) as the disk for a virtual machine.

This is where ext2, ext3, ext4, xfs, and more come in. This is not a post about filesystems (I don’t know enough about filesystems to credibly write that post) but they each have features and associated utilities. Most of our systems are ext4 but we have some older systems with ext2 and some systems with xfs.

Commands (vary by filesystem)

  • mount and umount; see /etc/fstab and /etc/exports
  • df -h can show you if your filesystem mount is crowded for storage
  • fsck (a wrapper that dispatches to filesystem-specific checkers like e2fsck and e4fsck)
  • resize2fs /dev/lv/home 512G # resize a filesystem to be 512G, might accompany lvresize below
  • xfsdump/xfsrestore for XFS filesystems
  • mkfs /dev/lvmdata/device # make a filesystem on a device
  • fdisk -l isn’t technically a filesystem tool, but it operates at a high level of abstraction and you should be aware of it
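If you want to experiment with these tools without risking a real disk, you can build a filesystem inside an ordinary file. This is a sketch with hypothetical paths; the mount step needs root:

```shell
truncate -s 64M /tmp/disk.img    # create a 64MB sparse file
mkfs.ext4 -F /tmp/disk.img       # put an ext4 filesystem on it (-F because it's not a block device)
mkdir -p /mnt/scratch
mount -o loop /tmp/disk.img /mnt/scratch   # mount it via a loop device
df -h /mnt/scratch               # it shows up like any other filesystem
umount /mnt/scratch
```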

LVM

Filesystems are made on top of underlying volumes in LVM, the “logical volume manager” – Linux’s partitioning system. (Actively manipulating LVM rather than passively using simple defaults is technically optional, but it’s widely used.)

LVM has three layers of abstraction within itself that each have utilities associated with them. This closely follows the abstraction patterns we’ve already seen in the layers above this one.

LVM logical volumes

A volume group (described below) can be organized into logical volumes. The commands here are incredibly powerful and give you the ability to manage disk space with ease (we’re grading “easy” on a curve).

If you resize a filesystem, there’s a good chance you’ll also resize the logical volume underneath it. Order matters: grow the volume before growing the filesystem, and shrink the filesystem before shrinking the volume.

Commands:

  • lvdisplay
  • lvscan
  • lvcreate -L 20G -n mylv myvg # create a 20GB logical volume named mylv in volume group myvg
  • lvresize -L 520G /dev/lv/home # grow the logical volume at /dev/lv/home to 520GB
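To make the order of operations concrete, here’s a hedged sketch of growing and shrinking a filesystem together with its logical volume. Device names are hypothetical, and shrinking ext4 requires the filesystem to be unmounted:

```shell
# Growing: enlarge the volume first, then the filesystem
lvresize -L 520G /dev/lv/home
resize2fs /dev/lv/home           # with no size argument, grows to fill the volume

# Or let lvresize drive the filesystem resize itself
lvresize -r -L 520G /dev/lv/home

# Shrinking: shrink the filesystem first, then the volume
umount /home
e2fsck -f /dev/lv/home           # resize2fs insists on a clean filesystem
resize2fs /dev/lv/home 512G
lvresize -L 512G /dev/lv/home
```

Get the shrink order backwards and you cut the volume out from under live filesystem data – plan carefully before running any of this.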

LVM volume groups

A logical volume is created from devices/space within a volume group. It’s a collection of one or more LVM “physical” volumes (see below).

Commands:

  • vgscan
  • vgdisplay
  • pvmove /dev/mydevice # to get stuff off of a PV and move it to available free space elsewhere in the VG
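A common job at this layer is retiring a failing disk: move its data to free space elsewhere in the group, then drop it. A sketch with hypothetical device names:

```shell
pvmove /dev/sdb1           # migrate all extents off the failing PV
vgreduce myvg /dev/sdb1    # remove it from the volume group
pvremove /dev/sdb1         # wipe the LVM label so nothing reuses it by accident
```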

LVM physical volumes

At the lowest LVM layer there are “physical” volumes. These might actually correspond to physical drives (if you have no hardware RAID), or they might be other /dev objects in the OS (/dev/md127, a software RAID array, would be a physical volume in this model).

These are the LVM analog to disk partitions.

Commands:

  • pvscan
  • pvdisplay

Software RAID (optional)

RAID combines multiple drives into one logical device for redundancy, performance, or both. There are both “hardware” and “software” implementations of RAID, and software is at a higher level of abstraction. It’s convenient for a (super-)user to manage. Our machines (like many) use mdadm, but there are other tools.

Commands:

  • mdadm --detail --scan
  • mdadm -D /dev/mdXYZ # details
  • mdadm -Q /dev/mdXYZ # short, human-readable
  • cat /proc/mdstat
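For completeness, creating an array looks like this – a hedged sketch of a two-disk mirror with hypothetical device names:

```shell
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
cat /proc/mdstat                           # watch the initial sync progress
mdadm --detail --scan >> /etc/mdadm.conf   # persist the array across reboots
```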

Devices in the OS

“In UNIX, everything is a file.” In Linux that’s mostly true as well.

The /dev directory contains the files that correspond to each particular device detected by the OS. I found these useful mostly for reference, because everything refers to them in some way.

If you look closely, things like /dev/mapper/devicename are often symlinks (pointers) to other devices.

All the other layers provide you better abstractions and more powerful tools for working with devices. For that reason, you probably won’t do much with these directly.

(The astute will observe that /dev is a directory so we’ve leapt up the layers of abstraction here. True! However, it’s the best lens you as a user have on the things the OS detects in the lower layers.)

Also: dmesg. Use dmesg. It will help you.

Hardware RAID (optional)

If you use software RAID for convenience, you use hardware RAID for performance and information-hiding.

Hardware RAID presents the underlying drives to the OS at boot time by way of a RAID controller on the motherboard. At boot, you can access a tiny bit of software (with a GUI that’s probably older than me) to create and modify hardware RAID volumes. In other words, the RAID volume(s), not the physical drives, appear to you as a user.

At least some, and I presume most, RAID controllers have software that you can install on the operating system that will let you get a look at the physical disks that compose the logical volumes.

Relevant software at this level:

  • MegaCLI # we have a MegaRAID controller on the server in question
  • smartctl --scan
  • smartctl -a -d megaraid,15 /dev/bus/6 # substitute the identifying numbers from the scan command above
  • not much else – managing hardware RAID carefully requires a reboot; for this reason we tend to keep ours simple

Physical storage

We have reached the seafloor, where you have some drives – SSDs, spinning disks, etc. Those drives are the very lowest level of abstraction: they are literal, physical machines. Because of this, we don’t tend to work with them directly except at installation and removal.

Summary and context

From highest to lowest layers of abstraction:

  1. filesystem
  2. LVM [lv > vg > pv]
  3. software RAID
  4. devices in the OS
  5. hardware RAID
  6. disks

The origin story of this blog post (also a wiki page, if you’re an Earlham CS sysadmin student!): necessity’s the mother of invention.

I supervise a sysadmin team. It’s a team of students who work part-time, so in practice I’m a player-coach.

In February, we experienced a disk failure that triggered protracted downtime on an important server. It was a topic I was unfamiliar with, so I did a lot of on-the-job training and research. I read probably dozens of blog posts about filesystems, but none used language that made sense to me in a coherent, unified, and specific way. I hope I’ve done so here, so that others can learn from my mistakes!

I didn’t know there was a full First Letters Capitalized list

While reading during the early days of the pandemic, I discovered that someone had made a list of the Eight Fallacies of Distributed Computing:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn’t change.
  • There is one administrator.
  • Transport cost is zero.
  • The network is homogeneous.

I’ve seen all those assumptions fail personally. That’s with only a few years of experience running small distributed systems.

The originating article/talk is here.

An inspirational place

I graduated from Earlham in December 2016. I returned to work for the Computer Science Department here in June 2018. Like so many in the community, I relate to it as more than an alma mater or an employer: it’s an institution and a community I (and so many others) hold in high esteem.

For all that, I don’t think I’ve ever been more inspired by this place than I was by this:

It’s an incredible display. I wasn’t able to attend this event myself but I watched some of its organization unfold on social media in the hours beforehand. It was a breathtaking, awe-inspiring achievement.

This is a time of a lot of fear, heartbreak, and frustration. To briefly lapse into politics, it is horrifying to check the news and to see the President of the United States so thoroughly, spectacularly, and dangerously fail in guiding this nation through the crisis. The effects of COVID-19 may hover over our heads for a long time to come.

But this is also a moment of profound social solidarity. I need look no further than this small liberal arts college to see it. It’s wonderful to be part of a community where this could materialize.

We’re dispersing geographically for the rest of the semester, but we carry this spirit with us wherever we go. I can only hope the people of America and the world rise to the occasion as this community did.

Monitor more than you think you need

The admins and I have just emerged from a few weeks of slow-rolling outages. We have good reason to believe all is well again.

I want to write more about the incident and the skills upgrade that it gave us, but I’ll stick with this for now:

We’re a liberal arts college CS department with a couple dozen servers. In other words, we’re not a big shop. Mostly we can see when something’s off with our eyes and ears. We won’t be able to access something, emails will stop going through, etc. For that reason, services rarely disappear for long in our world.

Once in a while, though, we have to trace a problem – like the last month, for example. Ultimately it was a simple problem with a simple solution (hardware failures, nothing to do but replace our old SSDs) but we spent a lot of time trying to identify that cause. That’s mostly because we currently lack a comprehensive monitoring suite.

That’s getting fixed, and quick.

As part of a broader quality control initiative we’re working on, we’re going to monitor everything – hardware, software, networks, connections, etc. The overhead of maintaining extensible monitoring systems is not as severe as the overhead of tracing problems when those systems don’t exist. To my knowledge, this brings us in line with best practices in industry.

Yes, it’s just a couple dozen servers. But experience has shown: All sentences of the form “It’s just X” are dangerous sentences.

Golang and more: this week’s personal tech updates

First I haz a sad. After a server choke last week, the Earlham CS admins finally had to declare time-of-death on the filesystem underlying one of our widely-used virtual machines. Definitive causes evade us (I think they are lost to history), so we will now pivot to rebuilding and improving the system design.

In some respects this was frustrating and produced a lot of stress for us. On the other hand, it’s a sweet demo of the power of virtualization. The server died, but the hardware underlying it was still fine. That means we can rebuild at a fraction of the cost of discovering, purchasing, installing, and configuring new metal. The problem doesn’t disappear but it moves from hardware to software.

I’ve also discovered a few hardware problems. One of the drones we will take to Iceland needs a bit of work, for example. I also found that our Canon camera may have a bad orientation sensor, so the LCD display doesn’t auto-rotate. Discovering those things in February is not fun. Discovering them in May or June would have been much worse.

Happier news: I began learning Go this week. I have written a lot of Java, C, and some Python, but for whatever reason I’ve taken to Golang as I have with no other language. It has a lot of the strengths of C, a little less syntactical cruft, good documentation, and a rich developer literature online. I also simply have more experience now.

A brief elaboration on experience: A lot of people say you have to really love programming or you have no hope of being good at it. Maybe. But I’m more partial to thinking of software engineering as a craft. Preexisting passion is invaluable but not critical, because passion can be cultivated (cf. Cal Newport). It emerges from building skills, trying things, perseverance, solving some interesting problems, and observing your own progress over time. In my experience (like here and here), programming as a student, brand-new to the discipline, was often frustrating and opaque. Fast forward, and today I spent several hours on my day off learning Golang because it was interesting and fun. 🤷‍♂️

Your mileage may vary, but that was my experience.

Finally, here are a few articles I read or re-read this week:

Earlham’s four-day weekend runs from today through Sunday. After a couple of stressful weeks, I’m going to take advantage of the remainder of the time off to decompress.

Learning on my own

One of the CS servers choked this week. That made for a stressful recovery, assessment, and cleanup process.

To cut the stress, I put in about an hour each night to some autodidactic computing education. That included a mix of reading and exercises.

In particular, I walked through this talk on compilers.

As I noted in my GitHub repo’s README, I did fine in my Programming Languages course as a student, but I was never fully confident with interpreters and compilers in practice. People (accurately!) talk about building such programs as “metaprogramming”, but as a student I found they always came across as more handwavey or tautological than meta.

This exercise, which I’d emphasize consists of code built by someone else (Tom Stuart, posted at his site Codon) in 2013 for demo purposes and not by me, was clarifying. Meticulously walking through it gave me a better intuition for interpreters and compilers – which are not, in fact, handwavey or tautological. 🙂 The Futamura projections at the end of the article were particularly illuminating to discover and think through.

I also read some articles.

  • “Teach Yourself Programming in Ten Years” (Peter Norvig; re-read)
  • Purported origin of “never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway”. (This was evidently considered an extremely funny joke among 1970s computer geeks.)
  • The performance links in this post.
  • The next language I’d like to explore is Go, and I started this week.

Computing lets you learn on your own. It’s not unique as a field in this way, but I appreciate that aspect of it.

Improve software performance, both sooner and later

This week I read “How To Be A Programmer”. It’s part of my work to shore up my fundamental computing skills. From a section in “Beginners” called “How to Fix Performance Problems” (emphasis added):

The key to improving the performance of a very complicated system is to analyse it well enough to find the bottlenecks, or places where most of the resources are consumed. There is not much sense in optimizing a function that accounts for only 1% of the computation time. As a rule of thumb you should think carefully before doing anything unless you think it is going to make the system or a significant part of it at least twice as fast.

That struck me for two reasons: One: I’ve reflected in the past on high-performance computing showing exceptions to rules we learn as beginners. Two: just an hour earlier, I’d read Nelson Elhage’s excellent blog post “Reflections on software performance” (emphasis added):

I think [“Make it work, then make it right, then make it fast”] may indeed be decent default advice, but I’ve also learned that it is really important to recognize its limitations, and to be able to reach for other paradigms when it matters. In particular, I’ve come to believe that the “performance last” model will rarely, if ever, produce truly fast software (and, as discussed above, I believe truly-fast software is a worthwhile target). 

One of my favorite performance anecdotes is the SQLite 3.8.7 release, which was 50% faster than the previous release in total, all by way of numerous stacked performance improvements, each gaining less than 1% individually. This example speaks to the benefit of worrying about small performance costs across the entire codebase; even if they are individually insignificant, they do add up. And while the SQLite developers were able to do this work after the fact, the more 1% regressions you can avoid in the first place, the easier this work is.

Software development advice: land of contrasts!

Both approaches have merit. However, from my admittedly limited experience, I’m partial to the latter.

The traditional advice – make it work, then make it right, then make it fast – works in many cases. It’s a pleasantly simple entry point if you’re just learning to build software. I learned to code that way, and so do many of our students. Both text selections give it credit – “rule of thumb”, “decent default”. But I think its placement in the “Beginner” section is appropriate.

I’m not even at a tech company, but I work on projects where performance matters from start to finish. I’ve also worked on projects where bad performance made the user experience pretty miserable. As Elhage emphasizes in his post, “Performance is a feature”. CS majors learn “big-O” notation for a reason. Everyone likes fast software, and that requires both good design and ongoing optimization.


Small resolutions for 2020

I have a lot coming up in 2020, so I don’t want to make any major resolutions. But I do see a few obvious, relatively simple places for improvement in the new year:

  • Use social media better. I’ve cut back quite a bit, but Twitter, Facebook, and LinkedIn each have benefits. I want to put them to good use.
  • Listen to more variety in music. I’ve expanded my taste in movies significantly in the last couple of years and want to nurture my musical taste as well.
  • Read fewer articles, more books.
  • More intense workouts. I’ve been coasting on light-to-moderate walking and jogging, and I’d like to push myself more. HIIT and strength training are in my mind currently.

This is all in addition to continuing to the next steps in my career and skills growth.

Happy New Year, friends!

Christmas trees and trip cost vs item cost

When building software for large datasets or HPC workflows, we talk a lot about the trip cost versus the item cost.

The item cost is the expense (almost always measured in time) to run an operation on a single unit of data – one member of a set, for example. The trip cost is the total expense of running a series of operations on some subset (possibly the whole set) of the data. The trip cost incorporates overhead, so it’s not just N times the item cost.

This is a key reason that computers, algorithms, and data structures that support high-performance computing are so important: by analyzing as many items in one trip as is feasible, you can often minimize time wasted on unnecessary setup and teardown.

Trip cost versus item cost is thus an invaluable simplifying distinction. It can clarify how to make many systems perform better.

Yes, Virginia, there is a trip cost

Christmas tree

Christmas trees provide a good and familiar example.

Let’s stipulate that you celebrate Christmas and that you have a tree. You’ve put up lights. Now you want to hang the ornaments.

The item cost for each of the ornaments is very small: unbox and hang the ornament. It takes a couple of seconds, max – not a lot, for humans. It also parallelizes extremely well, so everyone in the family gets to hang one or more ornaments.

The trip cost is at least an order of magnitude (minutes rather than seconds) more expensive, so you only want to do it once:

  • Find the ornament box
  • Bring the box into the same room as the tree
  • Open the box
  • Unbox and hang N ornaments
  • Close the box
  • Put the box back

Those overhead steps don’t parallelize well, either: we see no performance improvement and possibly a performance decline if two or more people try to move the box in and out of the room instead of just one.

It’s plain to see that you want to hang as many ornaments as possible before putting away the ornament box. This matches our intuition (“let’s decorate the tree” is treated as a discrete task typically completed all in one go), which is nice.
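The same distinction shows up in the shell, where process startup is trip overhead. A small demonstration (bash; the file path is arbitrary):

```shell
seq 1000 > /tmp/items.txt

# One trip: a single process reads all 1000 items
time wc -l < /tmp/items.txt

# 1000 trips: a new wc process per item pays startup cost every time
time while read -r item; do
  echo "$item" | wc -l > /dev/null
done < /tmp/items.txt
```

Both versions do the same counting work, but the second pays process-creation overhead once per item, so it runs dramatically slower – hang all the ornaments while the box is out.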

Whether Christmas is your holiday or not, I wish you the best as the year draws to a close.