On June 30, 1987, when Delta Flight 810 had risen to about 1,700 feet after taking off from LAX, the pilot reached for a switch to give him manual control of the throttle. In a butterfingers or butterbrains move, he accidentally flipped two nearby switches that shut off fuel to both of the plane’s engines. The crew spent the next 60 seconds scrambling to restart the engines, as passengers felt the plane drop about a thousand feet and prepared for a crash. The crew managed to get the engines started in time, and the plane landed safely in Cincinnati. [1]
As a software developer, I think about those switches sometimes. Not that I’m going to crash any airplanes; the very worst thing I could crash would be one of our clients’ websites. But I think about how, with a moment of butterfingers or butterbrains, I’m liable to hit a wrong button or run a wrong command and then have to spend time recovering from some minor disaster.
The other day, I asked one of my coworkers if they had anything like the fuel cutoff switches in their workspace, maybe a big green button that’s right there that they have to go out of their way not to click. By the end of the day, they had a doozy of an example. A second coworker was working on a minor change to one of our clients’ websites. Coworker 2 opened a settings page containing a grid of literally thousands of checkboxes (oh, Drupal). A little while later, Coworker 2 came back to the page, which was still open in a browser tab, and didn’t notice that the page had reloaded itself, clearing all of the offscreen checkboxes and adding a message telling you not to click Save. They clicked Save. This took down the client’s website until Coworker 1 managed to restore the settings from a database backup.
After the incident on Delta Flight 810, the Federal Aviation Administration investigated the problem and realized that it was not a good idea to have the fuel cutoff switches just sitting out in the open. A few days after the incident, they ordered airlines to install a guard over the fuel cutoff switches of all Boeing 767s within 10 days. An FAA spokesperson explained, “We told them to put a cover over the fuel control switch that will make it more difficult for the pilot to touch it unless he is real sure he wants to touch it.”
After the incident with the website settings Save button, my coworker filed a bug report that outlined what had happened and suggested that it would be good to prevent people from accidentally taking down an entire website with one erroneous click. One of the software maintainers quickly replied, pointing out that there was already that message telling you not to click Save. After a little more discussion, the maintainer changed the bug report to a feature request.
I’m not saying we need to get the FAA involved, just that an awful lot of burden is put on us developers to never ever click the wrong button, or run the wrong command, or use the right button/command at the wrong time. That is how the tools are designed. That is the culture. It’s a burden that we place on ourselves and each other.
In the book Normal Accidents: Living with High Risk Technologies, Charles Perrow talks about accidents being blamed on operator error. For an airline company, it’s easier to say “that knucklehead of a pilot flipped the switch that was clearly labeled Fuel Cutoff” and call it a day than to admit “that cockpit layout was an accident waiting to happen”. (Several weeks after the Delta Flight 810 incident, the FAA did revoke the pilot’s airline transport pilot certificate.) I’m also reminded of a blog post by Lisa Wade about firefighters blaming other firefighters who were killed on the job for being “stupid”. Blaming operator error helps people in the responsible organization (other than the operator) feel more comfortable, but doesn’t help prevent future accidents.
In the world of software development accidents, I think Git is an interesting case because of the irony. Git serves two purposes: to help developers preserve their work over time and to help collaborators combine their work without losing changes. Yet simultaneously Git makes it so easy to do the opposite. Like the time I told the intern to run git reset --hard without realizing that he had uncommitted changes. I physically tense up whenever I merge or rebase. It’s not that I don’t understand in theory how to avoid throwing away code changes. What I worry about is making sure my fingers and brain do exactly the right thing in the moment, every single moment.
As software developers, we all have certain things we can do to reduce the chance of operator error. After rebasing and before pushing, if I’m worried that I might have messed up, I’ll compare my local commits to the remote commits to make sure I haven’t lost any changes. When I’m developing a website, I use a browser plugin that colors my title bar to make it really clear if I’m on the production site or a local copy. If I got really motivated, I could set up shell aliases to warn me when I try to run certain dangerous commands. The trouble is that I can’t always be tweaking my developer environment and cross-checking things if I want to get any actual work done.
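For the curious, here is a rough sketch of what such a shell guard might look like. It is the software equivalent of a cover over the switch, not a finished tool. The helper name is_risky_git_command and the short list of "risky" commands are assumptions made up for illustration, and the check assumes the Git subcommand is the first argument (so something like git -C path reset --hard would slip past it).

```shell
# Hypothetical helper (name and rules are illustrative assumptions).
# Returns 0 when the arguments look like a git invocation that can
# throw away local work, and 1 otherwise.
is_risky_git_command() {
    _guard_sub=${1-}
    [ $# -gt 0 ] && shift
    for _guard_arg in "$@"; do
        case "$_guard_sub:$_guard_arg" in
            reset:--hard)           return 0 ;;
            clean:-f|clean:--force) return 0 ;;
        esac
    done
    return 1
}

# Wrapper that shadows git in an interactive shell. "command git"
# skips this function and runs the real git binary.
git() {
    if is_risky_git_command "$@"; then
        # Show uncommitted and untracked files before asking.
        command git status --short
        printf 'About to run: git %s\nType "yes" to continue: ' "$*" >&2
        read -r _guard_answer
        if [ "$_guard_answer" != "yes" ]; then
            echo "Aborted." >&2
            return 1
        fi
    fi
    command git "$@"
}
```

Sourced from a shell startup file, this makes git reset --hard print the pending changes and wait for a typed "yes", while everything else passes straight through. It only covers the two commands listed, so it lowers the odds of a slip rather than removing them.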
I suppose lost-time accidents during software development are just one more piece of the enormous technical debt that our industry continues to pile up. As one little developer, I can’t do much about that — but I can try to be reasonably careful, accept that I make mistakes just like everyone else, and not be too hard on myself or colleagues when we make a mistake that spirals into a minor disaster because of tools or processes that rely on the operator to never make a mistake.
Now, back to work.
[1] This example comes from The Limits of Safety: Organizations, Accidents, and Nuclear Weapons by Scott D. Sagan, with additional information from contemporaneous articles by Dave Skidmore for the AP and an uncredited author for the AP / New York Times.