A few more thoughts on family photos

For context, see this post and this one.

I have a few more observations about my family photo project. They are below, and I reserve the right to update this list in the future rather than creating a new post on the subject.

  • Use fdupes or something similar to find and remove duplicate photos (see the sketch just after this list). It saved me from uploading over 9,000 duplicates. I wish I’d discovered it much sooner.
  • Something I glossed over in my earlier posts: do not throw away photos. I’ve tossed a handful, almost exclusively duplicates, but the default should be to keep the original pictures. It’s one extra layer of redundancy.
  • If you are cropping multiple pictures out of a single scan, you might at some point crop an original scan by accident and lose some of the pictures. To prevent the incredible annoyance – or outright data loss – of an event like this, make sure to have local backups. I’m a believer in the classic 3-2-1 backup system.
  • It is delightful to hear from your loved ones about their excitement at viewing their old photos.
  • You will never be done. 🙂
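
Since fdupes came up in the first item above, here is a minimal Python sketch of the same idea (grouping byte-identical files by hash), in case you want a sense of what such a tool does under the hood. The “scans” directory and the function name are placeholders, not part of my actual workflow.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group byte-identical files under `root` by their SHA-256 hash."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Hashing every file in full is fine for a sketch; fdupes is
            # smarter about it (it compares sizes and partial hashes first).
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by more than one file
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    # "scans" is a placeholder for wherever your photo library lives
    for digest, paths in find_duplicates("scans").items():
        print(digest[:12], *(str(p) for p in paths), sep="\n  ")
```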

To me there’s often a cutoff where something goes from being a “project” to just something I occasionally refresh. This venture is at that point. I’m quite happy with how it’s turned out.

Family photos by decade

I crunched a bit of data about my ongoing photo-scanning project, which now includes some more scans as well as a heap of JPEGs harvested from the SD cards of our digital cameras.

Surprising no one, the rate of photos taken has grown dramatically over the last century:

Linear-scale graph showing the increase in number of photos taken by my family from the 1940s to the 2010s

That’s the linear scale. Here’s a prettier graph on a log scale, showing steep growth every decade:

Log-scale graph showing decade-by-decade jumps in number of photos taken by my family from the 1940s to the 2010s

These charts actually understate the case, of course. The era labeled “2010s” is really only 2010-2015 or so, peaking in 2012. After that, analog and standalone digital cameras finally gave way (in our family) to phone cameras. In the decade or so since then, I’ve taken more than this entire graph’s worth of pictures combined – which is to say nothing of the number of photos my parents, grandparents, and siblings have taken.

UPDATE: It turns out the dataset also includes lots of duplicates, but the order of magnitude by decade remains accurate. Sanitize the data first, people!

Notes on photo digitization

Disclaimer: I do not endorse any of the products mentioned in this post, nor does my employer, the University of Colorado Boulder. I will provide links to programs and resources that are free, and especially those that are FOSS, but not to those that cost money. I include the various tools here for clarity only.

Two shoeboxes of 3x2 photos, some other bits of paper, and one pink photo album with silly-looking cartoon ducks on it
Love those ducks

As part of a project now running multiple years, I recently digitized over 1,100 analog photos for my parents. Like a lot of families, our photo storage and cloud apps safely store most of the last decade of pictures and videos. Memories from around 2012 and earlier, though, were effectively hidden in a series of shoeboxes and envelopes. I wanted to change that.

Here are some things I’ve noted.

I used a Canon LiDE 400, though Epson and other manufacturers also have good scanners that work for this purpose. For me, the specific scanner doesn’t really matter above a certain threshold of resolution and reliability.

I scanned the images at 600dpi and stored them as uncompressed TIFFs to preserve quality. For final sharing, I compressed them into JPEGs, which shrank them by a factor of about 10. (Naturally I’ll keep the original TIFFs as well as the source photographs.)
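
For the curious, the compression step can be as simple as a short Pillow script like the one below. This is a sketch rather than my exact pipeline; the folder names and the quality setting of 85 are assumptions for illustration.

```python
from pathlib import Path

from PIL import Image  # Pillow

SRC = Path("scans-tiff")   # placeholder: folder of 600dpi TIFF masters
DST = Path("share-jpeg")   # placeholder: folder of JPEGs for sharing
DST.mkdir(exist_ok=True)

for tif in sorted(SRC.glob("*.tif")):
    with Image.open(tif) as im:
        # JPEG has no alpha channel, so convert to RGB first
        im.convert("RGB").save(DST / f"{tif.stem}.jpg", "JPEG", quality=85)
```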

I used my workhorse, a Windows PC, with supplemental work in a Debian environment under the Windows Subsystem for Linux. I cobbled together some scripts based on a few core pieces of software:

  • ImageMagick – the one and the only
  • Bulk Rename Utility – an absolutely core part of the workflow once images are properly scanned
  • multicrop2 – it saves a lot of time to scan multiple images at once, and multicrop2 is a handy script that does a reasonably good job breaking them into separate files again

Sometimes a photo will have important metadata on the back. This is a pain if you have a one-sided flatbed scanner (as I do). I scanned a batch of fronts, then flipped them to scan the backs in the same relative positions, and saved those scans in a special folder for later processing. At the end, I used multicrop2 to break them apart, then used the original images as a guide for pairing each front with its back when appending them with convert.
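
Here is a rough sketch of that pairing step, assuming the front and back crops end up with matching filenames in two folders. The folder layout and names are hypothetical, but -append really is how ImageMagick’s convert stacks two images vertically.

```python
import subprocess
from pathlib import Path

FRONTS = Path("fronts")  # placeholder folder of cropped photo fronts
BACKS = Path("backs")    # placeholder folder of matching backs
OUT = Path("paired")
OUT.mkdir(exist_ok=True)

for front in sorted(FRONTS.glob("*.tif")):
    back = BACKS / front.name
    if back.exists():
        # ImageMagick's -append stacks the two images top-to-bottom
        subprocess.run(
            ["convert", str(front), str(back), "-append", str(OUT / front.name)],
            check=True,
        )
```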

I used the genealogist-recommended YYYYMMDD-description format for naming purposes, which makes for easier metadata encoding and sorting. To support my nerdy automation workflows, I used hyphens as the delimiter between words in the description rather than the more commonly used spaces.
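
As a small illustration of why the hyphen delimiter pays off, here is a tiny parser for that naming scheme. The example filename is made up, but the format matches the convention described above.

```python
from datetime import date

def parse_name(filename):
    """Split a 'YYYYMMDD-some-description.ext' name into a date and a description."""
    stem = filename.rsplit(".", 1)[0]
    datestamp, _, description = stem.partition("-")
    return date(int(datestamp[:4]), int(datestamp[4:6]), int(datestamp[6:8])), description

# Hypothetical filename in the format described above
print(parse_name("19871225-christmas-morning-living-room.jpg"))
# -> (datetime.date(1987, 12, 25), 'christmas-morning-living-room')
```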

I have a lot left to do, but those are a few of my notes so far.

The obvious question here is “why bother?” There are good digitization services in the U.S., so why spend the time and effort on this? (I have in fact used such services for other media, like 8mm video and VHS tapes.) Honestly, I did it because I wanted to do it. It was a fun series of low-stakes puzzles that were rewarding to solve, and in the end I had lots of digitized family memories. YMMV, but it was worth it to me.

I’m making websites!

As the exclamation point indicates, I’m excited to announce this: I’m now making websites again!

A bit over two years ago, I left self-employment as an all-around tech services provider and joined my alma mater, Earlham College. That was a good move. I have built my skills across the board, and having this job has kept my career steady through events like COVID.

However, I’ve missed some of the work from those days, as well as the independence. I don’t like having only one income source in a time of high economic unpredictability. I also want to continue expanding my skillset, growing my portfolio, and controlling the course of my own career.

For all these reasons, I’m accepting new projects effective now. You can click here to see plans and examples, or reach out (cearley@craigearley.com) to hire me to make a website for you.

My particular passion is making websites for individuals and small businesses (including online stores). If you’re at a larger scale than that, you most likely have in-house web and sysadmin teams anyway. 🙂 If what I offer is right for you, please reach out. I look forward to hearing from you.

Meet our Terrestrial Mapping Platform!

Just a nice photo from Iceland

I’m excited to share that the Earlham field science program is now sharing the core of our Terrestrial Mapping Platform (TMP)! This is very much a work-in-progress, but we’re excited about it and wanted to share it as soon as we could.

We had to delay the 2020 Iceland trip because of COVID-19. That of course pushed back the implementation and case study component of this project, which was Iceland-centric. But we are moving forward at full speed with everything else. As Earlham has now started the new academic year, we have also resumed work on the TMP.

The project is a UAV hardware-software platform for scientists. It consists of:

  • a consumer-grade drone for capturing images
  • flight plan generation software and an application to automate drone flights
  • data analysis workflows for the images – visible light and NIR, assembled into 2D and 3D models

All of this goes toward making science more accessible to a broader range of domain scientists. Archaeologists and glaciologists are our current target cohort, but many more could find use for this work if it’s successful.

We will make all of this accessible in repositories with open licenses on our GitLab instance. Some are already available. Others we will share once we review them for (e.g.) accidentally-committed credentials.

That was all planned, if delayed. We’re also using our extra year of preparation time to make the project better in a few ways:

  • Reevaluating our choice of UAV make and model
  • Prettifying our web presence, which very much includes blog posts like this
  • Reducing the friction and pain points in our current workflow
  • Making our code and infrastructure better in general (I’ve covered my growing emphasis on quality here before)

The team mostly comprises students and faculty (of whom I’m the junior-most). Additionally, there are a few on-site partners in Iceland and innumerable personal supporters who make this possible. We’ll be sharing more at the Earlham Field Science blog as we go. I will undoubtedly share more here as well.

COVID is bad, but we want to make the best of this era. This is one way we’re doing that.

(Disclosure: We received funding for this from a National Geographic grant. None of the views in this blog post or our online presence represents, or is endorsed by, Nat Geo.)

Golang and more: this week’s personal tech updates

First I haz a sad. After a server choke last week, the Earlham CS admins finally had to declare time-of-death on the filesystem underlying one of our widely-used virtual machines. Definitive causes evade us (I think they are lost to history), so we will now pivot to rebuilding and improving the system design.

In some respects this was frustrating and produced a lot of stress for us. On the other hand, it’s a sweet demo of the power of virtualization. The server died, but the hardware underlying it was still fine. That means we can rebuild at a fraction of the cost of discovering, purchasing, installing, and configuring new metal. The problem doesn’t disappear but it moves from hardware to software.

I’ve also discovered a few hardware problems. One of the drones we will take to Iceland needs a bit of work, for example. I also found that our Canon camera may have a bad orientation sensor, so the LCD display doesn’t auto-rotate. Discovering those things in February is not fun. Discovering them in May or June would have been much worse.

Happier news: I began learning Go this week. I have written a lot of Java, C, and some Python, but for whatever reason I’ve taken to Golang as I have with no other language. It has a lot of the strengths of C, a little less syntactical cruft, good documentation, and a rich developer literature online. It also helps that I have more experience now.

A brief elaboration on experience: A lot of people say you have to really love programming or you have no hope of being good at it. Maybe. But I’m more partial to thinking of software engineering as a craft. Preexisting passion is invaluable but not critical, because passion can be cultivated (cf. Cal Newport). It emerges from building skills, trying things, perseverance, solving some interesting problems, and observing your own progress over time. In my experience (like here and here), programming as a student, brand-new to the discipline, was often frustrating and opaque. Fast forward, and today I spent several hours on my day off learning Golang because it was interesting and fun. 🤷‍♂️

Your mileage may vary, but that was my experience.

Finally, I read or re-read a few articles this week.

Earlham’s four-day weekend runs from today through Sunday. After a couple of stressful weeks, I’m going to take advantage of the remainder of the time off to decompress.

Learning on my own

One of the CS servers choked this week. That made for a stressful recovery, assessment, and cleanup process.

To cut the stress, I devoted about an hour each night to some autodidactic computing education. That included a mix of reading and exercises.

In particular, I walked through this talk on compilers.

As I noted in my GitHub repo’s README, I did fine in my Programming Languages course as a student, but I was never fully confident with interpreters and compilers in practice. People (accurately!) talk about building such programs as “metaprogramming”, but as a student I found they always came across as more handwavey or tautological than meta.

This exercise, which I’d emphasize consists of code built by someone else (Tom Stuart, posted at his site Codon) in 2013 for demo purposes and not by me, was clarifying. Meticulously walking through it gave me a better intuition for interpreters and compilers – which are not, in fact, handwavey or tautological. 🙂 The Futamura projections at the end of the article were particularly illuminating to discover and think through.

I also read some articles.

  • “Teach Yourself Programming in Ten Years” (Peter Norvig; re-read)
  • Purported origin of “never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway”. (This was evidently considered an extremely funny joke among 1970s computer geeks.)
  • The performance links in this post.
  • The next language I’d like to explore is Go, and I started this week.

Computing lets you learn on your own. It’s not unique as a field in this way, but I appreciate that aspect of it.

My new laptop

Now that Apple fixed the keyboard, I finally upgraded my Mac.

I am a non-combatant in the OS wars. I have my Mac laptop and a Windows desktop. I have an iPhone and build for Android at work. I run Linux servers professionally. I love everybody.

My little 2015 MacBook Air is a delightful machine. But it wasn’t keeping up with my developer workloads, in particular Android app builds. I bought the $2799 base model MacBook Pro 16″.

I started from scratch with a fresh install. Generally I try to avoid customizing my environments too much, on the principle of simplicity, so I didn’t bother migrating most of my old configs (one exception: .ssh/config).

Instead I’ve left the default apps and added my own one by one. I migrated my data manually – not as daunting as it sounds, given that the old laptop was only 128GB and much of it was consumed by the OS. I closed with an initial Time Machine backup to my (aging) external hard drive.

Now I’ve had a couple of weeks to actually use the MacBook Pro. Scattered observations:

  • WOW this screen.
  • WOW these speakers.
  • WOW the time I’m going to save building apps (more on that later).
  • I’m learning zsh now that it’s the Mac’s default shell.
  • Switching from MagSafe to USB-C for charging was ultimately worth the tradeoff.
  • I was worried about the footprint of this laptop (my old laptop is only 11-inch!), but I quite like it. Once I return to working at my office, I think it will be even better.
  • I am running Catalina. It’s fine. I haven’t seen some of the bad bugs people have discussed – at least not yet.
  • I’m holding on to my old Mac as a more passive machine or as a fallback if something happens to this one.

Only one of those really matters, though.

Much better for building software

The thing that makes this laptop more than a $2799 toy is the boon to my development work. I wanted to benchmark it, not in a strictly scientific way (there are websites that will do that) but in a comparative way for my actual use case: building Android apps.

The first thing I noticed: a big cut in the time taken to actually launch Studio. It’s an immediate lifestyle improvement.

I invalidated caches and restarted Studio on both machines, with both apps open at the same time (not optimal performance-wise, but not uncommon when I’m working on these apps intensively).

I then ran and recorded times for three events, on each machine, for both of the apps I regularly build:

  • Initial Gradle sync and build
  • Build but don’t install (common for testing)
  • Build and install

Shock of all shocks, the 2019 pro computer is much better than the 2015 budget-by-Apple’s-standards computer (graphs generated with a Jupyter notebook on the new laptop, smaller bars are better; code here):

Yeah. I needed a new computer. 🙂
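
For anyone curious about the plotting side, a minimal matplotlib sketch along these lines reproduces the shape of that comparison. The timings below are placeholders rather than my actual measurements; those live in the linked notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

tasks = ["Initial sync + build", "Build only", "Build + install"]
# Placeholder timings in seconds -- NOT the real measurements
air_2015 = [300, 120, 150]
pro_2019 = [100, 40, 50]

x = np.arange(len(tasks))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, air_2015, width, label="2015 MacBook Air")
ax.bar(x + width / 2, pro_2019, width, label="2019 MacBook Pro 16-inch")
ax.set_ylabel("Seconds (smaller is better)")
ax.set_xticks(x)
ax.set_xticklabels(tasks)
ax.legend()
plt.show()
```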

I expect 2020 to be a big year for me for a number of reasons I’ll share over time, and my old laptop just couldn’t keep up. This one will, and I’m happy with it.

I sure hope there’s enough to do…

I begin the month-long winter break this weekend. Students and teaching faculty finish about a week later. When classes reconvene in January, we’ll start spending a lot of time on Iceland and its spinoff scientific computing projects.

To lay the groundwork, we’ve spent the last few weeks clearing brush:

  • Updating operating systems and apps on a dozen laptops, handsets, and tablets
  • Syncing accounts with Apple and Google
  • Sitting in on planning/logistics meetings
  • Coordinating with the students who will do most of the actual research and development
  • Producing the list of software improvements we need to make

The last item is the most substantial from the perspective of my colleagues and students in CS. We build a lot of software in-house to collect and visualize data, capture many gigabytes of drone photos, and run datasets through complex workflows in the field.

It takes a lot of work locally to make this succeed on-site. Students have to learn how to both use and develop the software in a short time. Since the entire Iceland team (not just CS students) depends on everything we build, these projects provide real and meaningful stakes.

All of this has come together in the last few weeks in a satisfying way. We’re up to 62 GitLab issues based on our experiences using the software. That’s a good enough list to fill a lot of time in the spring, for both students and faculty.

We’ll hit the ground running in January, when the clock officially begins ticking.

Computing lessons from DNA analysis experiments

I’ve been working with my colleagues in Earlham’s Icelandic Field Science program on a workflow for DNA analysis, about which I hope to have other content to share later. (I’ve previously shared my work with them on the Field Day Android app.)

My focus has been heavily experimental and computational: run one workflow using one dataset, check the result, adjust a few “dials”, and run it again. When we’re successful, we can often automate the work through a series of scripts.
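
For a sense of what that looks like once it graduates into a script, here is a hedged sketch of a parameter-sweep driver. The tool name, flags, and parameter values are all made up for illustration; they are not our actual pipeline.

```python
import itertools
import subprocess

# Hypothetical parameter grid; the names and values are illustrative only
kmer_sizes = [21, 31, 41]
min_coverages = [2, 5]

for kmer, cov in itertools.product(kmer_sizes, min_coverages):
    outdir = f"run_k{kmer}_c{cov}"
    # "assemble.sh" stands in for whatever tool the workflow wraps
    result = subprocess.run(
        ["bash", "assemble.sh", "--kmer", str(kmer),
         "--min-coverage", str(cov), "--out", outdir],
        capture_output=True, text=True,
    )
    print(outdir, "exit code:", result.returncode)
```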

At the same time, we’ve been trying to get our new “phat node” working to handle jobs like this faster in the future.

Definitions vary by location, context, etc. but we define a “phat node” or “fat node” as a server with a very high ratio of (storage + RAM)/(CPU). In other words, we want to load a lot of data into RAM and plow through it on however many cores we have. A lot of the bioinformatics work we do lends itself to such a workflow.

All this work should ultimately redound to the research and educational benefit of the college.

It’s also been invaluable for me as a learning experience in software engineering and systems architecture. Here are a few of the deep patterns that experience illustrated most clearly to me:

  • Hardware is good: If you have more RAM and processing power, you can run a job in less time! Who knew?
  • Work locally: Locality is an important principle of computer science – basically, keep your data as close to your processing power as you can given system constraints. In this case, we got a 36% performance improvement just by moving data from NFS mounts to local storage.
  • Abstractions can get you far: To wit, define a variable once and reuse it. We have several related scripts that refer to the same files, for example, and for a while we had to update each script with every iteration to keep them consistent. We took a few hours to build and test a config file, which resolved a lot of silly errors like that (a minimal sketch of the idea follows after this list). This doesn’t help the time for any one job, but it vastly simplifies scaling and replicability.
  • Work just takes a while: The actual time Torque (our choice of scheduler) spends running our job is a small percentage of the overall time we spend shaping the problem:
    • buying and provisioning machines
    • learning the science
    • figuring out what questions to ask
    • consulting with colleagues
    • designing the workflow
    • developing the data dictionary
    • fiddling with configs
    • testing – over, and over, and over again
    • if running a job at a bigger supercomputing facility, you may also have to consider things like waiting for CPU cycles to become available; we are generally our systems’ only users, so this wasn’t a constraint for us
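
To illustrate the config-file point from the list above: something as simple as one shared settings file that every script reads is enough to stop the copy-paste drift. The paths and keys here are made up for illustration, not our actual layout.

```python
import configparser

# settings.ini (hypothetical contents):
# [paths]
# data_dir = /scratch/dna-run
# reference = /scratch/dna-run/reference.fasta
# [resources]
# threads = 16

config = configparser.ConfigParser()
config.read("settings.ini")

data_dir = config["paths"]["data_dir"]
reference = config["paths"]["reference"]
threads = config.getint("resources", "threads")

print(f"Running with {threads} threads against {reference} in {data_dir}")
```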

A lot of this is (for computer scientists, software engineers, etc.) common sense, but taking care to apply that common sense can be critical for doing big interesting work.

The punchline of it all? We managed to reduce the time – walltime, for fellow HPC geeks – required to run this example workflow from a little over 8 hours to 3.5 hours. Just as importantly we developed a bunch of new knowledge in the process. (I’ve said almost nothing here about microbiology, for example, and learning a snippet of that has been critical to this work.) That lays a strong foundation for the next several steps in this project.

If you read all this, here’s a nice picture of some trees as a token of my thanks (click for higher-resolution version):

Image of trees starting to show fall color
Relevance: a tree is a confirmed DNA-based organism.