2020 in review

You don’t need me to tell you 2020 was a bad year. Others will write about the details that apply nationally and globally, so I’m going to jump right into my own retrospective.

The 2020 wallowing

I was planning on a trip to Iceland followed by a new job with room for advancement in 2020. Instead I stayed at my current job (a good job!), made an attempt at a side hustle that has so far largely fizzled, and was obviously not able to go to Iceland. I didn’t visit family in Montana for Thanksgiving or Christmas. At work, I made a few dumb mistakes (we did rebound in each case, happily). It was, on net, a rough year.

That’s about all I have to say about that. I don’t want to wallow too much, but I also don’t want to go further without acknowledging the struggle.

The better stuff

All that said, the rest of this post summarizes my accomplishments for the year. I write this to remind myself that even though the year didn’t produce the external markers of success I’d planned on, I still did a lot. I advanced my skillset, did my job well, and patched through the year.

Accomplishments:

  • Got out of bed every day and went back to sleep every night
  • Kept Earlham CS running through the pandemic, student dispersal, lockdown, and restricted return
  • Modernized our systems engineering infrastructure with better monitoring, solid backups, improved responsiveness to inquiries, and higher availability – still a long way to go, but we’re so much better than we were a year ago
  • For each error I made, rebounded and learned a lesson
  • Migrated our cluster infrastructure from Torque to Slurm successfully
  • Learned a bunch of low-level details about filesystems, SELinux, and more en route to improving overall quality
  • Took 10,000 steps most days and got outside frequently
  • Provided a lot of internal tech support, engineering, feedback, and contributions to the projects associated with the Iceland program
  • Made checking LinkedIn a regular part of my routine, though I should use it more socially in the new year
  • Purely as a hobby, learned a ton about video and audio production

Casual observation: I posted a lot in February, and my tech achievements primarily happened over the summer.

I also want to dedicate a section to expanding my horizons. I couldn’t do it with travel, but there were a lot of new things I got the chance to explore this year:

  • On Spotify, listened to 1,408 new artists and 366 new genres (“genre” seems like a nebulous category, but I am taking the win)
  • Watched a lot of new movies, including the complete Hayao Miyazaki filmography
  • Learned to make curry! (this winter squash red Thai curry recipe is great)
  • Baked a pie – apple – for the first time, at Christmas
  • Grew my hair long for the first time in my life (it’s still growing actually – not getting a professional haircut during a pandemic)
  • Visited and walked new hiking trails

That was 2020 for me.

What’s next?

I can’t guarantee 2021 will be better than this year. However, I do have some broad intentions around a theme for the new year. I will do all that is in my control to make the next year better, and I hope you join me.

Extremely preliminary notes about the Parrot Anafi

After three extremely short flights, these are my initial notes about the Parrot Anafi drone.

We’re experimenting with different UAVs as part of the Iceland terrestrial surveying program (we’re being optimistic about travel in 2021…).

This is based on the simplest case: taking the craft out, taking off, flying for at most a few minutes, and touching back down. As such, don’t take a single word of this as gospel – it’s just preliminary opinions for the historical record. 🙂

Short version of the review: holy portability! One thing I don’t like about the DJI Phantoms is that they are so heavy (both the craft and the RC-tablet unit). If it’s a pain here on-campus, where trips are short, I imagine it’s a pain in the field. The Anafi is ludicrously lightweight and doesn’t feel like a chore to carry around.

Video quality on the built-in camera is fantastic (4K etc.).

It’s not a perfectly seamless fit with our existing workflows. Within our group, for example, we usually use tablets, which are handy for their big screens. The Anafi seems built around the assumption of a phone – all the way down to the RC unit, which is designed to hold a phone but not a tablet. That’s a change for us, but if the app is phone-only, I’m not necessarily sad about it.

There are many X factors I haven’t yet flown enough to review. For example: battery life, stability in wind (heavy winds make most UAVs hard to use), and the software/developer ecosystem.

These have been my extremely preliminary notes about the Parrot Anafi. It’s not even close to a comprehensive evaluation of everything we care about. Still, those usability factors are important if this is going to scale and be useful for others. So for now, I’m impressed.

CentOS 8 is going away

CentOS 8 is going away at the end of 2021:

The future of the CentOS Project is CentOS Stream, and over the next year we’ll be shifting focus from CentOS Linux, the rebuild of Red Hat Enterprise Linux (RHEL), to CentOS Stream, which tracks just ahead of a current RHEL release. CentOS Linux 8, as a rebuild of RHEL 8, will end at the end of 2021. CentOS Stream continues after that date, serving as the upstream (development) branch of Red Hat Enterprise Linux.

We’re just going to walk into the ocean now…

Losing an HPC-friendly, enterprise-grade, stable, free operating system throws a wrench into our plans. We run several small CentOS clusters, though fortunately they all run CentOS 7 (maintenance updates until 2024). We have time to respond, but it will mean extra work when it comes time to upgrade.

Here’s an aggregation of observations from the last couple of days:

  • The comments on that blog post speak bluntly.
  • The Beowulf email list is full of discussion already.
  • I see a lot of chatter now about Ubuntu and Debian. Both would be viable distros for the few CentOS 8 hosts we currently run. They may or may not fit cluster upgrade needs a few years from now.
  • Others have chimed in to mention Oracle Enterprise Linux (ehhhhh), a revival of Scientific Linux, NixOS, and GuixOS.
  • This replacement project, Rocky Linux, is in development here or maybe here (hard for me as an outsider to tell). It may go somewhere – or maybe not. I’ve seen others float their own forks as well. Maybe one will get traction, but who can say?
  • CentOS 8 was quite a bit different from CentOS 7. Hard not to feel frustrated at learning a new system just to see its lifespan cut short by eight years.
  • I put a note about this in our issue tracker today. It reads: “The path is straightforward and requires less data copying than some past migrations. We just need to have a good answer on the OS of choice. I imagine the community will converge on either a ‘correct’ answer or a few good answers relatively soon, but I don’t think we’re there yet.”
  • In short, for us: This will be fine, but it will also be annoying.

This does change our strategy for the next few months. We need a new external SSH server, for example, and it can’t reasonably be CentOS 8 anymore. Down the road, we will have cluster systems to upgrade, and I still have no idea what that looks like.

A 2020 success story from Earlham CS

I’m proud of something from this year – a real 2020 success story.

Earlham, winter 2020

To give some backstory: January through March were maybe the roughest three months of my tech life to date. A cascade of server hardware failures induced a lot of downtime. Total catastrophe. I’m grateful for my institution’s patience.

After a lot of extra hours in windowless rooms working on it, we did resolve those problems. We diagnosed the root causes and took steps to prevent similar issues in the future. I also learned a lot. (Some of the lessons from those days continue to guide us, and they’ve been imprinted on me forever.)

The very next day the March lockdowns started and the College sent everyone away.

We went all-remote for the rest of spring and shifted into hybrid mode for the fall. That increased the dependency on system availability. Naturally, I was uneasy about that after the stress of the spring term. I directed myself and the CS admin students to focus on uptime, iterative improvement, and minimal disruption.

What makes me proud is this: it worked.

Since resolving those issues in the winter and spring, we’ve been stable. Individual services and hosts have had issues, of course. Some of those issues took significant time and energy, and we’re still not perfect (probably never will be!). There is always more to fix, more to improve, more to automate, more to introduce.

But at the system level, we’ve operated without unplanned interruption since March.

We’ve faced uncertainty after uncertainty in 2020. But my colleagues and students have been able to count on our systems working. We’re not a giant shop here, but we have kept up with the changing times.

There it is: one clear 2020 success story. Engineering this success was a collaborative effort to which I’m just one contributor, but I am proud of it.

Tech improves pandemic life

I can’t imagine going through the COVID-19 pandemic without computers. Tech improves pandemic life, and it makes it easier for us to make good decisions.

For reasons of both personal caution and what I see as a moral duty, I am probably in the 80th percentile for cautious behavior during the pandemic. I live alone, and my job lends itself to remote work for almost everything. What’s more, my workplace is a socially-conscious liberal arts college. As a result, I interact with very few people (those I do see are always masked-up).

That lifestyle is only sustainable because of computer technology. I buy and pick up groceries through an app. Meetings take place over video chats. Songs or podcasts play in the background while I cook. I can stream almost anything I want to see. I’ve continued to learn and to work using some excellent rectangles.

Ron Swanson: "This is an excellent rectangle."

There are tradeoffs, of course, but I have basically lived this way since March. Doing so I have weathered the pandemic as well as I could hope (so far).

The national dialogue now includes a lot of chatter about how to stay safe for the holidays. I’m cautious and want to model good behavior. That means I’ll be on FaceTime for Thanksgiving, Christmas, and New Year’s Eve. That’s not great, and it’ll be sad not to be physically visiting family.

But for people like me, the alternative to a FaceTime holiday isn’t an in-person holiday, but a canceled holiday, spent in isolation. Thanks to the people in my industry, I don’t have to do that. Technology brings people together. It’s one reason I remain idealistic about the work I do.

Amidst the tragedies and terrors of 2020, pause to appreciate the age we live in and the cool things we’ve invented. Tech improves pandemic life – and improves life in general. There’s lots to worry about if you want (conspiracy theories, AI risk, etc.), but I’m happy to live in a technologically advanced society.

Jupyterhub user issues: a 90% improvement

Jupyter errors are not to be confused with Jupiter errors.

At Earlham Computer Science we have to support a couple dozen intro CS students per semester (or, in COVID times, per 7-week term). We teach Python, and we want to make sure everyone has the right tools to succeed. To do that, we use the Jupyterhub notebook environment, and we periodically respond to user issues related to running notebooks there.

A couple of dozen people running Python code on a server can gobble up resources and induce problems. Jupyter has historically been our toughest service to support, but we’ve vastly improved. In fact, as I’ll show, we have reduced the frequency of incidents by about 90 percent over time.

Note: we only recently began automatic tracking of uptime, so that data is almost useless for comparisons over time. This is the best approximation we have. If new information surfaces to discredit any of my methods, I’ll change it, but my colleagues have confirmed to me that this analysis is at least plausible.

Retrieving the raw data

I started my job at Earlham in June 2018. In November 2018, we resolved an archiving issue with our help desk/admin mailing list, which gives us our first dataset.

I ran a grep for the “Messages:” string in the thread archives:

grep 'Messages:' */thread.html # super complicated

I did a little text processing to generate the dataset: regular expression find-and-replace in an editor. That reduced the data to a column of YYYY-Month values and a column of message counts.

Then I searched the subject.html files for all lines matching “jupyter”, case-insensitively:

grep -i jupyter {2018,2019,2020}*/subject.html 

I saved the output to jupyter-messages-18-20.dat. I did some text processing – again regex find-and-replace – decided that follow-up messages are not what we care about, and ran uniq against the file. A few quick wc -l commands later, we find:

  • 21 Jupyter requests in 2018
  • 17 Jupyter requests in 2019
  • 19 Jupyter requests in 2020
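The steps above can be sketched as a single function. This is a hedged reconstruction, not the exact commands I ran: it assumes pipermail-style month directories, and the sed here only strips “Re:”/“Fwd:” prefixes as a stand-in for the fuller manual cleanup. Note the sort – uniq alone only collapses adjacent duplicates.

```shell
# Sketch of the search-dedupe-count pipeline. Assumes pipermail-style
# archives with one subject.html per month directory.
count_threads() {
  grep -ih jupyter "$1"/*/subject.html \
    | sed -E 's/[[:space:]]*(Re|Fwd):[[:space:]]*//g' \
    | sort -u \
    | wc -l
}
# Usage: count_threads /path/to/archives
```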

One caveat is that in 2020 we moved a lot of communication to Slack, which adds some uncertainty to the data. However, I know from context that Jupyter requests have continued to flow disproportionately through the mailing list. As such, the Slack messages are likely the sort of redundant follow-ups already collapsed by uniq in the text processing.

Another qualifier is that a year or so ago we began using GitLab’s Issues as a ticket-tracking system. Searching it turned up 11 more Jupyter issues, all from 2020. Fortunately, only 1 of those was a problem that did not overlap with a mailing list entry.

Still, I think those raw numbers are a good baseline. At one level, it looks bad. The 2020 number has barely budged from 2018 and in fact it’s worse than 2019. That’s misleading, though.

Digging deeper into the data

Buried in that tiny dataset is some good news about the trends.

For one thing, those 21 Jupyter requests came in only 4 months of the year – in other words, we were wildly misconfigured and putting out a lot of unnecessary technical fires. (That’s nobody’s fault – it’s primarily because my position did not exist for about a year before I arrived, so the systems atrophied.)

What’s more, by inspection, half of this year’s 19 requests are password resets or feature requests rather than problems – unlike the 17 we saw in 2019, which I think were real.

So in terms of Jupyter problems in the admin list, I find:

  • around 20 in the latter third of 2018
  • 17 in ALL OF 2019
  • only two (granted one was a BIG problem but still only 2) in 2020

That’s a 90% reduction in Jupyterhub user issues over three years, by my count.

“That’s amazing, how’d you do it?”

Number one: thank you, imaginary reader, you’re too kind.

Number two: a lot of ways.

In no particular order:

  1. We migrated off a VM, which, given our hardware constraints, was not conducive to a resource-intensive service like Jupyterhub.
  2. Gradually over time, we’ve upgraded our storage hardware, as some of it was old and (turns out) failing.
  3. We added RAM. When it comes to RAM, some is good, more is better, and too much is just enough.
  4. We manage user directories better. We export these over NFS but have done all we can to reduce network dependencies. That significantly reduces the amount of time the CPU spends twiddling its thumbs.
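To illustrate point 4, here’s the shape of such a setup. This is a hypothetical example – the hostnames, subnets, paths, and option choices are placeholders, not our real configuration:

```shell
# Hypothetical NFS setup (placeholders, not our real config).
#
# On the file server, /etc/exports -- async and no_subtree_check trade
# a little safety for fewer round trips:
#   /home  10.0.0.0/24(rw,async,no_subtree_check)
#
# On the Jupyter host, /etc/fstab -- noatime avoids a metadata write on
# every notebook read; nofail keeps boot from hanging if the server is
# unreachable:
#   fileserver:/home  /home  nfs  rw,noatime,nofail  0  0
```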

What’s more, we’re not stopping here. We’re currently exploring load-balancing options – for example, running Jupyter notebooks through a batch scheduler like Slurm, or potentially a containerized environment like Kubernetes. There are several solutions, but we haven’t yet determined which is best for our use case.
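As a sketch of the Slurm idea: each user’s notebook server could run as a batch job, so CPU and memory caps come from the scheduler rather than from a shared host. Everything here – limits, port, the tunneling scheme – is an illustrative placeholder, not a working production setup:

```shell
#!/bin/bash
# Hypothetical Slurm batch script to run one user's Jupyter server as a
# scheduled job. Resource limits and port are illustrative placeholders.
#SBATCH --job-name=jupyter
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=04:00:00

# Bind to the compute node; the user reaches it via SSH port forwarding.
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```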

This is the work of a team of people, not just me, but I wanted to share it as an example of growth and progress over time. It’s incremental but it really does make a difference. Jupyterhub user issues, like so many issues, are usually solvable.

I’m making websites!

As the exclamation point indicates, I’m excited to announce this: I’m now making websites again!

A bit over two years ago, I left self-employment as an all-around tech services provider and joined my alma mater, Earlham College. That was a good move. I have built my skills across the board, and having this job has kept my career steady through events like COVID.

However, I’ve missed some of the work from those days, as well as the independence. I don’t like having only one income source in a time of high economic unpredictability. I also want to continue expanding my skillset, growing my portfolio, and controlling the course of my own career.

For all these reasons, I’m accepting new projects effective now. You can click here to see plans and examples, or reach out (cearley@craigearley.com) to hire me to make a website for you.

My particular passions are making websites for individuals and small businesses (including online stores). Most likely if you’re at a larger scale than that, you have in-house web and sysadmin teams anyway. 🙂 If what I offer is right for you, please reach out. I look forward to hearing from you.

Meet our Terrestrial Mapping Platform!

Just a nice photo from Iceland

I’m excited to announce that the Earlham field science program is now releasing the core of our Terrestrial Mapping Platform (TMP)! It’s very much a work in progress, but we’re excited about it and wanted to share it as soon as we could.

We had to delay the 2020 Iceland trip because of COVID-19. That of course pushed back the implementation and case study component of this project, which was Iceland-centric. But we are moving forward at full speed with everything else. As Earlham has now started the new academic year, we have also resumed work on the TMP.

The project is a UAV hardware-software platform for scientists. It consists of:

  • a consumer-grade drone for capturing images
  • flight plan generation software and an app to automate drone flights
  • data analysis workflows for the images – visible light and NIR, assembled into 2D and 3D models

All of this goes toward making science more accessible to a broader range of domain scientists. Archaeologists and glaciologists are our current target cohort, but many more could find use for this work if it’s successful.

We will make all of this accessible in repositories with open licenses on our GitLab instance. Some are already available. Others we will share once we review them for (e.g.) accidentally-committed credentials.

That was all planned, if delayed. We’re also using our extra year of preparation time to make the project better in a few ways:

  • Reevaluating our choice of UAV make and model
  • Prettifying our web presence, which very much includes blog posts like this
  • Reducing the friction and pain points in our current workflow
  • Making our code and infrastructure better in general (I’ve covered my growing emphasis on quality here before)

The team mostly comprises students and faculty (of whom I’m the junior-most). Additionally, there are a few on-site partners in Iceland and innumerable personal supporters who make this possible. We’ll be sharing more at the Earlham Field Science blog as we go. I will undoubtedly share more here as well.

COVID is bad, but we want to make the best of this era. This is one way we’re doing that.

(Disclosure: We received funding for this from a National Geographic grant. The views in this blog post and our online presence are ours alone and are not endorsed by Nat Geo.)

Give yourself the gift of quality control

If you spend any time at all in the tech chatter space, you have probably heard a lot of discontent about the quality of software these days.

I can’t do anything about the cultural, economic, and social environment that cultivates these issues. (So maybe I shouldn’t say anything at all? 🙂 )

I can say that, if you’re in a position to do something about it, you should treat yourself to quality control.

The case I’d like to briefly highlight is about our infrastructure rather than a software package, but I think this principle can be generalized.

Case study: bringing order to a data center

After a series of (related) service outages in the spring of 2020, shortly before the onset of the COVID-19 crisis, we cut back on some expansionary ambitions to get our house in order.

Here’s a sample, not even a comprehensive list, of the things we’ve fixed in the last couple of months:

  • updated every OS we run such that most of our systems will need only incremental upgrades for the next few years
  • transitioned to the Slurm scheduler for all of our clusters and compute nodes, which has already made it easier to track and troubleshoot batch jobs
  • modernized hardware across the board, including upgraded storage and network cards
  • retired unreliable nodes
  • implemented comprehensive monitoring and alerts
  • replaced our old LDAP server and map with a new one that will better suit our authentication needs across many current and future services
  • fixed the configuration of our Jupyterhub instances for efficiency

Notice: None of those are “let’s add a new server” or “let’s support 17 new software packages”. It’s all about improving the things we already supported.

There are a lot of institutional reasons our systems needed this work, primarily the staffing shortage that affects a lot of small colleges. But from a pragmatic perspective – mine and the student admins’ – those reasons don’t matter. What matters is that we were in a position to fix the problems.

By consciously choosing to do so, we think we’ve substantially reduced future overhead and downtime risk. Quantitatively, we’ve gone from a few dozen open issue tickets to 19 as of this writing, six of which are advancing rapidly toward resolution.

How we did it and what’s next

I don’t have a dramatic reveal here. We just made the simple (if not always easy) decision to confront our issues and make quality a priority.

Time is an exhaustible, non-renewable resource. We decided to spend our time on making existing systems work much much better, rather than adding new features. This kind of focus can be boring, because of how strictly it blocks distractions, but the results speak for themselves.

After all that work, now we can pivot to the shiny new thing: installing, supporting, and using new software. We’ve been revving up support for virtual machines and containers for a long time. HPC continues to advance and discover new applications. The freedom to explore these domains will open up a lot of room for student and faculty research over time. It may also help as we prepare to move into our first full semester under COVID-19, which is likely to have (at minimum) a substantial remote component.