Chrome History Inspector

I’m interested in digital minimalism right now, so I wanted to examine my browser history. Google Chrome on my desktop is the place I spend most of my time online, and it’s also a black box to me. Unlike iOS with its Screen Time feature, I have no obvious window into my browser activity over time. All chrome://history shows is a stream of links you’ve clicked in reverse-chronological order, with no aggregation options. I have a rough idea, but that’s not much to go on.

This weekend I decided I wanted to investigate. At first I thought I’d keep it simple: get my history as a CSV file and open it in Google Sheets. That didn’t work: 15,000 lines is apparently a lot for a web-connected browser-based tool, and it crashed my tab. I could have used macOS’s Numbers, but I realized quickly that my task lent itself to programming better than to a spreadsheet.

As a rough cut (and presented to you now), I made a Python program – code here – that, given a history file of a particular format, produces a graph of your most-visited websites. It makes use of the pandas, matplotlib, and seaborn libraries. The earliest date in my dataset is October 12, 2018. The program produced this graph:
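The post's program uses pandas, matplotlib, and seaborn; the core aggregation step – pull the domain out of each history URL and count visits – can be sketched with the standard library alone. The column name `url` is an assumption about the export format, not a guarantee:

```python
# Sketch of the aggregation step: count visits per domain.
# Assumes a CSV export with the URL in a column named "url" --
# the real export's column layout may differ.
import csv
from collections import Counter
from urllib.parse import urlparse

def top_sites(urls, n=10):
    """Count visits per domain from an iterable of URL strings."""
    counts = Counter(urlparse(url).netloc for url in urls)
    return counts.most_common(n)

def top_sites_from_csv(path, n=10):
    with open(path, newline="") as f:
        return top_sites((row["url"] for row in csv.DictReader(f)), n)
```

From there, `seaborn.barplot` (as in the post) or plain matplotlib turns the counts into the bar chart.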

The first thing I noticed in the image was that I clicked into Reddit a lot. I’ve had a Reddit account for less than a year, so I knew I could live without it and I swiftly deleted my account.

What was left fell into a few categories:

  • search/reference: I was surprised and then immediately unsurprised by Google’s supremacy on this list; Wikipedia and StackOverflow are also in this category
  • news: Instapaper, Feedly, Twitter
  • professional tools: Gitlab, GitHub, Wiki, Google Drive, and WordPress
  • entertainment: Netflix, TVTropes, Amazon, Facebook, YouTube – all of which I regulate using Freedom
  • one outlier: Esquire scored surprisingly high, I think because viewing a slideshow there requires a click per slide, and I’ve viewed a few of them.

A few caveats about this approach:

  • I’d like something more dynamic, maybe an improved version of some old browser extensions I found in my initial research into this idea. This got me the very specific information I wanted, but now I want more.
  • I’ve separated the code that obtains the data (which I didn’t write) from the code that processes it. This way when Google inevitably changes how it manages history data, I don’t have to disturb the processing code.
  • I used this tool to decide my Reddit account should be axed, but it’s arguably unfair to Reddit: I actually read many more tweets than Reddit posts, but expanding a Reddit post takes a click that changes the URL, so every post I read registers as a visit. (One minor change I may make is to aggregate twitter dot com, t dot co, and tweetdeck into one row.)
  • This analyzes page visits, not time spent. This, I imagine, would be a much stickier problem. I’d need to have an indicator of when the tab and window were both active, and it would be distorted by the frequent distractions of my office. It would also be a much more useful thing to display. Maybe Google can get with the “digital wellness” moment on this.
  • Future work: group by time. I have a much better idea of when I’m on the Internet than of what sites I’m visiting most frequently over time, so this wasn’t my priority. That said, it’s possible I could learn something interesting.
  • Sites visited in Incognito Mode don’t appear in the history so they also don’t appear on the chart.
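The Twitter aggregation mentioned above could be a small alias map applied to each domain before counting. The map below is illustrative, not from the original code:

```python
# Fold related domains into one row before counting.
# This alias map is my illustration, not the post's actual code.
ALIASES = {
    "t.co": "twitter.com",
    "www.twitter.com": "twitter.com",
    "tweetdeck.twitter.com": "twitter.com",
}

def canonical(domain):
    """Map known alias domains onto one canonical name."""
    return ALIASES.get(domain, domain)
```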

Finally, through the lens of digital minimalism, that graph is better than I had expected. There’s not a lot of cruft, the cruft that does exist can be removed pretty easily, and most of the sites provide real value to me. This has been a useful exercise.

Introducing @cooltreepix!

Once upon a time I was a character in a (wholesome!) meme a friend posted to a publicly-visible Earlham Facebook group. The meme, which I’ve stored for posterity here, said that one quality about me is “Takes pictures of cool trees”.

So my Twitter bot is super on-brand.

You can now follow @cooltreepix for pictures of cool trees! I took all of the photos, and one is tweeted per day. I’ve removed or never added geolocations, but probably 90% are from eastern Montana or the Richmond, Indiana, area.

Details follow for the curious. 🙂

What this bot does

Every day this bot tweets a tree picture.

That’s… that’s it: it tweets a picture containing one or more trees. Sometimes the tree will be the subject of the picture. Other times the tree somehow accentuates the main element, e.g. fall color. The images are of varying quality, but most were taken with my iPhone (currently an iPhone 7).

I kept it simple. I didn’t (and still don’t) want to collect your data or do much in the way of analytics. I just wanted to make a simple non-spammy bot that tweets a nice picture once a day.

Process

Here are the steps I followed, roughly, so that you can try your own:

  1. Create a Twitter developer account. I was doing this for education and with no intent to collect data etc., so I had no problems at all in creating the account.
  2. Create your app. If you’re not going to use it for your own account – i.e. if you want to allow the app to tweet on an account other than the @username of your developer account – make sure to enable “Sign in with Twitter”, though there are some more complex ways to do this if you have a specific reason to try them.
  3. Get a cloud-based server to set up your dev environment and hold any assets you need. At first I used AWS because (1) I needed something guaranteed to always be running and (2) I’ve been wanting to learn AWS. Ultimately I decided to stay on Earlham’s servers, but I’m glad to now have the AWS account and some experience with it. Your environment should have some flavor of twurl to make authentication via terminal easier (for more on that, see the next section).
  4. Write your code. I used Python and the tweepy library. My code is simple. As I describe in more detail below, the setup process was much harder than the coding. If I add any features there will be changes to make, but for now I’m happy with it.
  5. Try it out!
  6. Iterate until it works, fixing or adding one thing at a time.
  7. Maintenance.
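Step 4 might look roughly like the sketch below. The tweepy calls and environment-variable names are my assumptions, not the bot's actual code, and the deterministic photo picker is just one way to cycle through a directory without repeats:

```python
# Minimal daily-photo bot sketch (hypothetical names throughout).
# Credentials come from the twurl setup described below.
import datetime
import os

def pick_photo(photo_dir, day):
    """Pick today's photo deterministically, cycling through the
    directory so nothing repeats until every file has been used."""
    photos = sorted(p for p in os.listdir(photo_dir)
                    if p.lower().endswith((".jpg", ".jpeg", ".png")))
    if not photos:
        raise RuntimeError("no photos found in " + photo_dir)
    return photos[day.toordinal() % len(photos)]

if __name__ == "__main__" and os.environ.get("CONSUMER_KEY"):
    import tweepy  # pip install tweepy
    auth = tweepy.OAuth1UserHandler(
        os.environ["CONSUMER_KEY"], os.environ["CONSUMER_SECRET"],
        os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"])
    api = tweepy.API(auth)
    path = os.path.join("photos", pick_photo("photos", datetime.date.today()))
    media = api.media_upload(path)  # upload the image first
    api.update_status(status="", media_ids=[media.media_id])
```

A cron entry (or a systemd timer) running the script once a day completes the picture.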

When you’re done, most of the time your bot should live on its own, just a bot doing bot things.

Biggest challenges

Coding, it turns out, wasn’t the hardest part. I probably only spent about 10 percent of my time on this project programming. The greatest challenges:

  1. Authentication: This was easily my biggest time burner and the problem that most of the steps above solve. It’s easy to make a bot tweet to your personal developer account, but there are extra steps to tweet to a different account, as I wanted to. Worth noting: after you’ve authorized the app on whatever account you want, check your twurl environment (e.g. if you’re running Linux, ~/.twurlc) to get the consumer and access tokens that are needed to make the bot work.
  2. Image transfer: It turns out that when you have a lot of images they take up a lot of storage, so moving them around (i.e. downloading and uploading them) takes gobs of time and bandwidth. I knew this from a project in college, but if I needed a reminder I certainly got it this time.
  3. AWS: I now have a free-tier AWS account, which took some wrangling to figure out. I decided not to use it for this project in the end, but the learning experience was good. I want to try configuring it better for my needs next time I do a similar project.
  4. Image sizes: Twitter caps your images at a particular size, which was producing errors at the terminal and a failure to post tweets. I eventually used ImageMagick’s convert (via a Python subprocess) to solve the problem.
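The ImageMagick fix in item 4 might look something like this. The size cap and the `-define jpeg:extent` flag are my reconstruction of the approach, not the bot's exact command:

```python
# Shrink an image under Twitter's photo size cap with ImageMagick's
# convert, invoked via subprocess (the approach the post describes).
import subprocess

MAX_KB = 5000  # roughly the 5 MB photo cap; verify current limits

def convert_cmd(src, dst, max_kb=MAX_KB):
    # "-define jpeg:extent=..." asks convert to target a maximum
    # output file size when writing JPEG.
    return ["convert", src, "-define", "jpeg:extent=%dkb" % max_kb, dst]

def shrink(src, dst):
    """Run the conversion; raises CalledProcessError on failure."""
    subprocess.run(convert_cmd(src, dst), check=True)
```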

Notes on ownership

All photos tweeted directly by the bot are mine (Craig Earley’s) unless otherwise noted. Please don’t sell them or use them commercially, as they are intended for everyone’s benefit. Also please give me a photo credit and share this link to my site if you use them for your own project.

If you want to submit a tree photo, tweet it to me @cooltreepix or @craigjearley. If it’s a real picture of a tree, I’ll retweet as soon as I see it.

My logo is from the Doodle Library (shared under a CC 4.0 license) and edited by me to add color. My version is under the same license.

Finally you can put a little something in my tip jar if you want to support work like this.

Misadventures in source control

Or, what did Present!Me ever do to Past!Me?

I observed a while ago on Twitter that learning git (for all its headaches) was valuable for me:

This was on my mind because I recently set about curating (and selecting for either skill review or public presentation) all the personal software projects I worked on as a student. It was a vivid reminder of how much I learned then and in the few years since.

Every day since then I have observed more version control errors of mine, and at some point I thought it worth gathering my observations into one post. Here is a non-comprehensive list of the mistakes I observed in my workflows from years past:

  • a bunch of directories called archive, sometimes nested two or three deep
  • inconsistent naming scheme so that archive and old in multiple capitalization flavors were together
  • combinations of the first two: I kid you not, cs350/old-string-compare/archive/archive/old is a path to some files in my (actual, high-level, left-as-it-was-on-final-exam-day) archive
  • multiple versions OF THE SAME REPO with differing levels of completion, features, etc. (sure, branching is tricky but… really?)
  • no apparent rhyme or reason in the sorting at all – a program to find the area under a curve by dividing it up into trapezoids and summing the trapezoid area was next to a program to return a list of all primes less than X, and next to both of those was a project entirely about running software through CUDA, which is a platform not a problem
  • timestamps long since lost because I copied files through various servers without preserving metadata when I was initially archiving
  • inconsistent use of READMEs that would inform me of, say, how to compile a program with mpicc rather than gcc or how to submit a job to qsub
  • files stored on different servers with no real reason for any of them to be in any particular place
  • binaries in some directories but not others
  • Makefiles in some directories but not others

(You may have noticed that parallelism is a recurring theme here, and that’s because it was a parallel and distributed computing course where I realized that my workflows weren’t right. I didn’t learn how to fix that problem in time to go from a B to an A in the course, but after that class I did start improving my efficiency and consistency.)

To be fair to myself and to anyone who might find this eerily familiar: I never learned programming before college, so much of my college years were spent catching up on the basics that a lot of people already knew when they got there. Earlham is a place that values experiment, learn-by-doing, jumping into the pool rather than (or above and beyond) reading a book about swimming, etc. Which is good! I learned vastly more that way than I might have otherwise.

What’s more, I understand that git isn’t easy to pick up quickly and poses accessibility problems for newcomers. Still, looking at my old work, I can’t help but consider version control vastly superior to making it up as you go. It’s well worth the time to learn.

Git and related software carpentry were not something I learned until quite a while into my education. And that’s a bit of a shame, to me: if you’re trying to figure out (as I clearly was) how to manage a workflow, do appropriate file naming, etc. concurrently with learning to code, you end up in a thicket of barely-sorted, unhelpfully-named, badly-organized code.

And then neither becomes especially fun, frankly.

I’ve enjoyed the coding I’ve done since about my junior year in college much more than before that, because I finally learned to get out of my own way.

Looking for bugs in all the wrong places

When I took Earlham’s Networks and Networking class, we implemented Dijkstra’s algorithm.

Dijkstra’s algorithm is an algorithm for finding the shortest paths between nodes in a graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.

The algorithm exists in many variants; Dijkstra’s original variant found the shortest path between two nodes, but a more common variant fixes a single node as the “source” node and finds shortest paths from the source to all other nodes in the graph, producing a shortest-path tree.

Wikipedia
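For reference, the single-source variant reads roughly like this in Python – a textbook sketch using heapq, not my original assignment code:

```python
# Textbook single-source Dijkstra over an adjacency dict:
# graph[u] = {v: weight, ...}. Illustrative, not the assignment code.
import heapq

def dijkstra(graph, source):
    """Return a dict of shortest-path costs from source to each
    reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```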

I got my implementation (in Python) close, but not quite right, by the time the deadline hit for submission.

I felt worse about falling short on this one than on any coding project before or since. I had trained much of my perfectionism out of myself on the way to becoming a CS major and a decent programmer, but this particular hangup hit hard. Over hours of work, I couldn’t find where my implementation was going wrong, or why it was going wrong so consistently.

I submitted my code for grading unhappily, then put it down to focus on other things. I felt like I’d reached my upper limit as a programmer (though I knew in my mind that this was probably not the case). The source code lay quietly in a directory for a couple of years.

Today I’m happy to report that – judged exclusively by my own irrational metric, success in implementing Dijkstra’s algorithm – I underestimated myself.

This semester I’m helping teach the same networks class. Since we may assign Dijkstra’s algorithm at some point, I decided to review my old code and maybe try to make it work.

I spent about two hours today, Sunday, reading that rusty old code, tweaking it, running the new version, and parsing its output. I added debug statement after debug statement. I ran it on different input files.

Then I noticed a mistake in the output. Somehow, an edge of weight 1 was being read as an edge of weight 100000000 (the value I used to approximate infinite cost, i.e. the cost of moving directly between two nodes that do not share an edge). In effect, that edge would never be part of a shortest path between any combination of source and destination. This was bad, because in fact that edge was part of many such shortest paths in this network.

I went back to some of the most basic pieces of the code and found a possible problem. It was small, easy to fix but hard to detect. I edited a single line of code and ran the program.

It worked.

As it turns out, I’d gotten the implementation right. The core of the assignment, Dijkstra’s algorithm itself, had worked on the input it received.

Visually, here’s the network I had:

And here’s the network the program thought I had:

So what did I get wrong?

Believe it or not: counting.

You see, I had set a variable for the number of nodes N in the network graph. I also had a two-dimensional list describing the network, where each item in the list was an edge in the graph, itself represented by a list containing two nodes and the weight to go between them. Crucially, there are at most N^2 edges in such a graph.

My fatal flaw: rather than saying “for each possible edge in the network, read a line from the file,” I said “for each node in the network, read a line from the file.” In other words, for my graph with up to N^2 edges, I would only be loading the data for N of them. In this case, the program read only 4 lines, and the edge of weight 1 was described on the 5th line.
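A toy reconstruction of the bug – the names and file format here are my illustration, not the original code. Each input line describes one edge, so the read loop has to be bounded by the number of edges (or simply read to end-of-file), not by the number of nodes:

```python
# Each line of the input describes one edge: "u v weight".
# Buggy version (paraphrased): read exactly n_nodes lines --
#     for _ in range(n_nodes): line = f.readline()
# -- which silently drops every edge after the first n_nodes.
INF = 100000000  # stand-in for "no direct edge", as in the post

def read_edges(lines):
    """Read every edge line, not just the first n_nodes of them."""
    edges = []
    for line in lines:
        u, v, w = line.split()
        edges.append((int(u), int(v), int(w)))
    return edges
```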

(This might have been obvious had I tested the code more thoroughly on one of the larger network files we had. Alternatively, the combination of edges being missed might have obscured the result a lot. A copy of the same input file, but with the lines reversed, would have been the most useful second test case.)

After switching the variable that the index would be checked against, everything worked as I expected.

The code still has problems. I intend to clean it up and streamline it. But the implementation now consistently returns correct output.

The concrete lessons of this experience for me are:

  • Don’t just write debug statements. Write clear and meaningful debug statements. Be specific.
  • Check your I/O, indices, and other such basic features of the code. You can have the greatest algorithm of all time (though I did not!), but if the program isn’t handling exactly what you expect it to, you won’t get the results you want.
  • Vary the input. Vary the input. Vary the input.
  • Don’t let one project, however important or complex or valuable, determine your feelings about your personal skillset.

Finally, while I emphasized the specific and silly programming error here, failure to count correctly wasn’t a root cause of my mistake. The root causes were factors removed from coding altogether: rushing to completion and getting too tangled in the weeds to think holistically about the problem. I don’t think it’s a coincidence that I solved this problem after spending a lot of time in my life disciplining those tendencies.