Chapter 13. Never Send a Human to Do a Machine’s Job
Automate Everything; What You Can’t Automate, Make a Self-Service
Who would have thought that you can learn so much about large-scale IT architecture from the movie trilogy The Matrix? Given that the Matrix is run by machines, though, it should not be completely surprising to find some nuggets of system design wisdom in it. Agent Smith teaches us that one should never send a human to do a machine’s job after his deal with Cypher, one of Morpheus’ human crew members, falls through: Cypher was supposed to betray his boss and hand him over, but failed.
There’s a certain irony in the fact that corporate IT, which has largely established itself by automating business processes, is often not very automated itself. Early in my corporate career, I shocked a large assembly of infrastructure architects by declaring my strategy as: “automate everything and make those parts that can’t be automated a self-service.” The reaction ranged from confusion and disbelief to mild anger. Still, this is exactly what Amazon et al. have done. And it has revolutionized how people procure and access IT infrastructure along the way. These companies have also attracted the top talent in the industry to build said infrastructure. If corporate IT wants to remain relevant, this is the way it ought to be thinking!
It’s Not Only About Efficiency
Just like test-driven development is not a testing technique (it’s primarily a design technique), automation is not just about efficiency but primarily about repeatability and resilience. A vendor’s architect once stated that automation shouldn’t be implemented for infrequently performed tasks because it isn’t economically viable. Basically, the vendor calculated that writing the automation would take more hours than would ever be spent completing the task manually (the vendor also appeared to be on a fixed-price contract).
I challenged this reasoning with the argument of repeatability and traceability: wherever humans are involved, mistakes are bound to happen, and work will be performed ad hoc without proper documentation. That’s why you don’t send humans to do a machine’s job. The error rate is actually likely to be the highest for infrequently performed tasks because the operators are lacking routine.
The second counter-example is disaster scenarios and outages: we hope that they occur infrequently, but when they happen, the systems had better be fully automated to make sure they can return to a running state as quickly as possible. The economic argument here isn’t about saving manpower but about minimizing the loss of business during the outage, which far exceeds the manual labor cost. To appreciate this thinking, you need to understand economies of speed (Chapter 35). Otherwise, you may as well argue that the fire brigade should use a bucket chain because all those fire trucks and pumps are not economically viable given how rarely buildings actually catch fire.
Repeatability Grows Confidence
When I automate tasks, the biggest immediate benefit I usually derive is increased confidence. For example, when I wrote the original self-published version of the book in Markdown, I had to maintain two slightly different versions: the ebook version used hyperlinks for chapter references, whereas the print version used chapter numbers. After quickly becoming tired of manually converting between the formats, I developed two simple scripts that switch between print and epub versions of the text. Because it was easy to do, I also made the scripts idempotent, meaning that running a script multiple times caused no harm. With these scripts at hand, I didn’t even worry a split-second about switching between formats because I could be assured that nothing would go wrong. Automation is hugely liberating and hence speeds up work significantly.
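To make the idea of an idempotent conversion concrete, here is a minimal Python sketch. It is not the author's actual scripts: the link syntax `[Chapter 35](#ch35)` for the ebook version and the anchor naming scheme `#chNN` are assumptions made purely for illustration.

```python
import re

# Assumed ebook convention (hypothetical): chapter references are Markdown
# links such as "[Chapter 35](#ch35)". The print version uses "Chapter 35".
LINKED = re.compile(r"\[Chapter (\d+)\]\(#ch\d+\)")
# A bare reference is "Chapter N" that is not already inside a link.
BARE = re.compile(r"(?<!\[)Chapter (\d+)\b")


def to_print(text: str) -> str:
    """Replace linked chapter references with plain chapter numbers."""
    return LINKED.sub(r"Chapter \1", text)


def to_ebook(text: str) -> str:
    """Turn plain chapter numbers into links; existing links stay untouched."""
    return BARE.sub(r"[Chapter \1](#ch\1)", text)


if __name__ == "__main__":
    sample = "You need to understand economies of speed (Chapter 35)."
    once = to_ebook(sample)
    twice = to_ebook(once)
    print(once)
    print(once == twice)  # running the conversion again changes nothing
```

Each function only rewrites references that are still in the "other" format, so applying it a second time is a no-op. That property is what removes the worry about which state a file is currently in.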
Once things are fully automated, users can directly execute common procedures in a self-service portal. To provide the necessary parameters—for example, the size of a server—they must have a clear mental model of what they are ordering. Amazon Web Services provides a good example of an intuitive user interface, which not only alerts you that your server is reachable from any computer in the world but even detects your IP address to make it easy to restrict access.
When filling out the spreadsheet required to order a Linux server, I was told to just copy the network settings from an existing server because I wouldn’t be able to understand what I need anyway.
Designing good user interfaces can be a challenging but valuable exercise for infrastructure engineers who are largely used to working in hiding on rather esoteric “plumbing.” It’s also a chance for them to show the Pirate Ship (Chapter 19), which is far more exciting than all the bits and pieces it’s made out of.
Self-service gives you better control, accuracy, and traceability than semi-manual processes.
Self-service doesn’t at all imply that infrastructure changes become a free-for-all. Just like a self-service restaurant still has a cashier, validations and approvals apply to users’ change requests. However, instead of a human having to re-enter a request submitted in free-form text or an Excel spreadsheet, when a self-service request is approved the workflow pushes the requested change into production without further human intervention and possibility of error. Self-service also reduces input errors: because free-form text and Excel spreadsheets rarely perform validations, input errors lead to lengthy email cycles or pass through unnoticed. An automated approach gives immediate feedback to the user and makes sure the order actually reflects what the user needs.
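The immediate feedback a self-service request can give might look roughly like the following Python sketch. The request fields, the catalog of sizes, and the limit of 20 servers per order are all hypothetical; a real portal would define its own.

```python
from dataclasses import dataclass

# Hypothetical catalog values, chosen only for illustration.
ALLOWED_SIZES = {"small", "medium", "large"}
ALLOWED_ENVIRONMENTS = {"dev", "test", "prod"}
MAX_SERVERS_PER_ORDER = 20


@dataclass(frozen=True)
class ServerRequest:
    requester: str
    size: str
    environment: str
    count: int


def validate(request: ServerRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is valid."""
    problems = []
    if not request.requester.strip():
        problems.append("requester must not be empty")
    if request.size not in ALLOWED_SIZES:
        problems.append(f"size must be one of {sorted(ALLOWED_SIZES)}")
    if request.environment not in ALLOWED_ENVIRONMENTS:
        problems.append(
            f"environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}"
        )
    if not 1 <= request.count <= MAX_SERVERS_PER_ORDER:
        problems.append(f"count must be between 1 and {MAX_SERVERS_PER_ORDER}")
    return problems


if __name__ == "__main__":
    for problem in validate(ServerRequest("alice", "huge", "prod", 0)):
        print(problem)
```

All problems are reported at once rather than one per email round trip, and a request that passes validation can be handed to the approval workflow as structured data instead of being re-typed.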
Self-service portals are a major improvement over emailing spreadsheets. However, the best place for configuration changes is the source code repository, where approvals can be handled via pull requests and merge operations. Approved changes trigger an automated deployment into production. Source code management has long known how to administer large volumes of complex changes through review and approval processes, including commenting and audit trails. You should leverage these processes for configuration changes so that you can start to think like a software developer (Chapter 14). Because it seems that any good idea needs a buzzword these days, using a source repository to manage code and configuration is now referred to as “GitOps.”
Most enterprise software vendors pitch GUIs as the ultimate in ease of use and cost reduction. However, in large-scale operations the opposite is the case: manual entry into user interfaces is cumbersome and error prone, especially for repeated requests or complex setups. If you need 10 servers with slight variations, would you want to enter this data 10 times by hand? Fully automated configurations should therefore be done via APIs, which can be integrated with other systems or scripted as part of higher-level automation.
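As an illustration of the 10-servers case, the following Python sketch derives all specifications from one base definition plus a handful of overrides. The field names and naming pattern are invented for the example and do not correspond to any particular vendor's API.

```python
import json

# Hypothetical base specification; field names are illustrative only.
BASE_SPEC = {"os": "linux", "size": "medium", "network": "internal"}


def build_specs(count: int, overrides: dict[int, dict]) -> list[dict]:
    """Create `count` server specs from BASE_SPEC with per-server overrides.

    `overrides` maps a 1-based server index to the fields that differ.
    """
    specs = []
    for index in range(1, count + 1):
        spec = {**BASE_SPEC, "name": f"app-{index:02d}"}
        spec.update(overrides.get(index, {}))
        specs.append(spec)
    return specs


if __name__ == "__main__":
    # Servers 9 and 10 are larger; all others use the defaults.
    specs = build_specs(10, {9: {"size": "large"}, 10: {"size": "large"}})
    # The resulting payload could be sent to a provisioning API or
    # committed to a source repository for review.
    print(json.dumps(specs, indent=2))
```

The variations are stated once, next to each other, which makes them far easier to review than 10 forms filled in by hand.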
I once set a rule that no infrastructure changes could be made from a user interface but had to be done through version-controlled automation. This put a monkey wrench into many vendor demos.
Allowing users to specify what they want and providing it quickly in high quality would seem like a pretty happy scenario. However, in the digital world, you can always push things a little further. For example, Google’s “zero-click search” initiative, which resulted in Google Now, considered even one user click too much of a burden, especially on mobile devices. The system should anticipate the users’ needs and answer before a question is even asked. It’s like going to McDonald’s and finding your favorite Happy Meal already waiting for you at the counter. Now that’s customer service! An IT world equivalent may be autoscaling, which allows the infrastructure to automatically provision additional capacity under high load without any human intervention.
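A minimal sketch of what an autoscaling decision could look like, assuming a simple proportional rule: scale the number of instances by the ratio of observed to target load. The function name, the load metric (utilization in percent), and the default limits are assumptions for this example; real autoscalers add smoothing and cool-down periods.

```python
import math


def desired_instances(
    current: int,
    observed_load: float,
    target_load: float,
    minimum: int = 1,
    maximum: int = 20,
) -> int:
    """Scale the instance count by the ratio of observed to target load.

    Loads are average utilization per instance, e.g. CPU in percent.
    The result is clamped to the range [minimum, maximum].
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current * observed_load / target_load)
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    # 4 instances at 90% utilization against a 60% target.
    print(desired_instances(4, 90, 60))
```

No one has to file a request: the system observes the load and adjusts capacity on its own, within limits that humans set once.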
Automation Is Not a One-Way Street
Automation usually focuses on the top-down part; for example, configuring a piece of low-level equipment based on a customer order or the needs of a higher-level component. However, we will learn that control can be an illusion (Chapter 27) wherever humans are involved. Also, “control” necessitates two-way communication that references the current system state: when your room is too hot, you want the control system to turn on the air conditioning instead of the heater. The same is true in IT system automation: to specify how much hardware to order or what network changes to request, you likely first need to examine the current state. Therefore, full transparency on existing system structures and a clear vocabulary are paramount. In one case, it took us weeks just to understand whether a datacenter had sufficient spare capacity to deploy a new application. No amount of order process automation helps if it takes weeks to understand the current state of affairs.
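The dependency of an order on the current state can be sketched in a few lines of Python. The inventory structure and the core-based capacity model are assumptions for illustration; the sketch also ignores that spare capacity may be fragmented across hosts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    total_cores: int
    used_cores: int


def free_cores(inventory: list[Host]) -> int:
    """Sum the spare capacity across all hosts currently in the datacenter."""
    return sum(host.total_cores - host.used_cores for host in inventory)


def hosts_to_order(
    required_cores: int, inventory: list[Host], cores_per_new_host: int
) -> int:
    """Return how many new hosts are needed, given the observed state."""
    shortfall = required_cores - free_cores(inventory)
    if shortfall <= 0:
        return 0  # existing spare capacity is sufficient
    return math.ceil(shortfall / cores_per_new_host)


if __name__ == "__main__":
    inventory = [Host("rack1-a", 32, 28), Host("rack1-b", 32, 30)]
    print(hosts_to_order(40, inventory, cores_per_new_host=32))
```

The calculation itself is trivial. The hard part is obtaining a trustworthy `inventory` in the first place, which is exactly the transparency this section argues for.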
If you manage to fully automate and make your infrastructure immutable, meaning no manual changes are allowed at all, you can start working under the assumption that reality matches what’s specified in the configuration scripts. In that case, transparency becomes trivial: you just look at the scripts. While such a setup is a desirable end-state, it might take significant effort to consistently implement across a large IT estate. For example, legacy hardware or applications might not be automatable.
Explicit Knowledge Is Good Knowledge
Tacit knowledge is knowledge that exists only in employees’ heads but isn’t documented or encoded anywhere. Such undocumented knowledge can be a major overhead for large or rapidly growing organizations because it can easily be lost and requires new employees to relearn things the organization already knew. Encoding tacit knowledge, which existed only in an operator’s head, into a set of scripts, tools, or source code makes these processes visible and eases knowledge transfer.
Tacit knowledge is also a sore spot for any regulatory body whose job it is to assure that businesses in regulated industries operate according to well-defined and repeatable principles and procedures. Full automation forces processes to be well defined and explicit, eliminating unwritten rules and undesired variation inherent in manual processes. As a result, automated systems are easier to audit for compliance. Ironically, classic IT often insists on manual steps in order to maintain separation of duty, ignoring the fact that manually approving an automated process achieves both separation of duty and repeatability.
A Place for Humans
If we automate everything, is there a place left for humans? Computers are much better at executing repetitive tasks, but even though we humans are no longer unbeatable at the board game Go, we are still number one in coming up with new and creative ideas, designing things, or automating stuff. We should stick to this separation of duty and let the machines do the repeatable tasks without fearing that Skynet will take over the world any moment.