Trust, configuration management, and other white lies - Martin Jackson
Timeline: October 2013 to present
The UK Government has 25 transformational ‘exemplar’ projects to create digital services that are simpler, clearer and faster to use. All of these services must meet the Digital by Default Service Standard introduced in April 2014, and all are due to be accessible to the public by March 2015.
As of November 2014, 7 exemplar services are live, with a further 15 in beta and the remainder in alpha.
How do you do Continuous Delivery and DevOps within the UK Government?
How do you do Continuous Delivery and DevOps on a UK government project with strict controls that prevent the majority of the team having direct access to the environments that they develop on, deploy to, and support around the clock?
I am currently working for Equal Experts on an exemplar service within the UK Government, and in my experience you can do Continuous Delivery and DevOps in such a setting by establishing trust within the team and with the client. Here are some of the things I have observed and experienced whilst working in this context.
How can trust be weakened in CD and DevOps in general?
In a high velocity software delivery environment, trust between team members and between teams is vitally important. Unfortunately, trust can easily be worn down over time by the cumulative effect of many seemingly innocuous actions. Let’s look at some of these trust-reducing problems.
Killing Me Softly
Every time someone logs into a box interactively I trust that box a little less. I don’t know what was really done to the machine and the odds are that the person who logged on can’t recall everything they did either even if they have a command history. A command history can’t tell you what changes were made in a text file, for example.
One Step Forward (Two Steps Back)
I have found that there are two main reasons for unpredictable deployments:
- Environments composed of Snowflake Servers, whether by design, through environment drift, or by sheer bad luck
- Methods or tools used that vary between environments
If deployments of a particular component or application are unpredictable, I will always caveat failed deployments and rationalise them away as technical debt because of tool and/or environment differences. In other words, I distrust the toolchain and begin to make excuses for failures.
Man On The Corner
My trust in a deployment drops when it fails the first time because of that odd little corner case that crops up every now and then. You remember it? That odd little corner case has actually occurred more than four or five times in the past few months, but you can never quite find the time to get it fixed.
Just This One Time
“We must do an ad hoc command, process, deployment without the overhead of the release process just this once because it’s urgent”. How many times have you heard that? Every time you allow a manual deployment, trust is weakened.
Need For Speed
Monitoring tools can sometimes be quite tricky to navigate, and log aggregation systems can be quite difficult to search when the pressure is on. So people sometimes (gasp!) feel the pressing need to log directly onto production boxes (see Killing Me Softly) because it is so much faster. This indicates a lack of trust in the available tools, and in what the monitoring tool is actually telling us.
I’ve Got The Power
“If I’m going to be doing on-call support then I must have root access to check configuration files, disk, CPU, I/O and perhaps the contents of the production database, in case I get called”. For some reason people believe they will be extra-awake at 3am when that production incident happens, and they will need root/administrator access to all machines (see Killing Me Softly). This suggests they do not trust their existing access rights.
I’d Do Anything For Love (But I Won’t Do That)
Abusing configuration management tools (Puppet, Chef, Ansible, Salt, etc.) leads to a loss of trust in the tools and in the people who write the code. Just because we can use our configuration management tool as a Swiss army knife does not mean we should. Some abuses of config management tools I have seen include:
- Monitoring every file in the /etc, /var and /xyz directories
- Uninstalling packages A through Z
- Enforcing the version of every package installed (or to be installed) on a system
- Shelling out to other systems to check their state and wait for them to be ready
This kind of scope creep leads to uncertainty, unreproducibility, and lower trust.
Bang Bang (My Baby Shot Me Down)
Every now and then the “Treating your servers like cattle, not pets” metaphor will be raised and driven to its logical conclusion, i.e. using Immutable Servers to guarantee server configurations are never touched. Sometimes this even goes as far as removing any services that would allow me to log in to a server interactively. Phoenix Servers are a huge improvement over Snowflake Servers, but sometimes we need to trust that services are there for a reason.
My Definition Of A Boombastic Jazz Style
I have heard so many definitions of what it means to be Done:
- “It’s Dev Done”
- “It’s Done Done”
- “It’s Done if it’s monitored in production”
A lack of agreement about what Done really means is one of the most effective destroyers of trust.
One of the most common complaints I hear from developers is around the creation and usage of Feature Toggles. Some developers feel toggles are difficult to implement when components are too large or tightly coupled, and that it increases complexity in the codebase.
I think of this argument as “creating feature toggles is hard and I don’t want to do them” or “that bit of functionality is so trivial it doesn’t need a toggle”, and it indicates to me that developers do not trust that such a capability is required. When Feature Toggles are used to Dark Launch features, they can mean the difference between a successful deploy with a disabled failing feature and a complete rollback because the failing feature could not be switched off in isolation. When you have to roll back a release it’s that much harder to get the next release approved.
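As a minimal sketch of the idea (the toggle store, toggle names and gateway names here are illustrative, not the exemplar's actual implementation; in practice toggles usually live in configuration, not code):

```python
# Minimal feature-toggle sketch; toggle store and names are illustrative.
TOGGLES = {
    "new_payment_flow": False,  # dark-launched: deployed, but switched off
}

def is_enabled(feature):
    """Unknown features default to off, so a missing toggle fails safe."""
    return TOGGLES.get(feature, False)

def route_payment(order):
    # A failing feature can be switched off in isolation, avoiding a rollback.
    if is_enabled("new_payment_flow"):
        return "new-gateway"
    return "legacy-gateway"
```

Because the new path sits behind a toggle, a production failure can be handled by flipping `new_payment_flow` back off rather than rolling back the whole release.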
How have we established trust within the UK Government?
What have we done to fix some of these issues within government?
Building tools and canned commands
We created an interactive dashboard that hooked into a series of ‘canned commands’. These canned commands, or scripts, only handle non-destructive operations, so that anyone can monitor the environment and check configuration files securely on all services with limited interactivity.
As part of the implementation and ongoing maintenance we ensure that:
- Everyone has their own login and password, so we can restrict access and audit activity if needed.
- All service issues are reviewed, so we can identify any deficiencies in the canned command list and add them to our backlog.
- Custom programs are used to obfuscate customer-sensitive data e.g. usernames, passwords or API keys.
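A minimal sketch of the canned-command idea, assuming a whitelist of non-destructive commands and regex-based obfuscation (the command names, paths and patterns are illustrative, not the project's actual list):

```python
import re
import subprocess

# Whitelist of non-destructive commands; names and arguments are illustrative.
CANNED_COMMANDS = {
    "disk-usage": ["df", "-h"],
    "uptime": ["uptime"],
}

# Pattern for customer-sensitive values; real rules would be more thorough.
SENSITIVE = re.compile(r"(password|api_key|secret)(\s*[:=]\s*)\S+", re.IGNORECASE)

def redact(text):
    """Obfuscate sensitive values before output reaches the dashboard."""
    return SENSITIVE.sub(r"\1\2****", text)

def run_canned(name):
    """Run a whitelisted command only; anything else is refused outright."""
    if name not in CANNED_COMMANDS:
        raise ValueError(f"{name!r} is not a canned command")
    result = subprocess.run(CANNED_COMMANDS[name], capture_output=True, text=True)
    return redact(result.stdout)
```

Refusing anything outside the whitelist, rather than trying to blacklist dangerous commands, keeps the audit story simple: the dashboard can only ever do what the team has explicitly reviewed.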
Ensure all environments are the same
One of the biggest problems we have faced is ensuring our environments are similar, let alone the same. We have had to address the following problems:
- Legacy implementations
- Lack of resources
- IP address space limitations
- Singleton tools
- Lack of multi environment support from third party services
We tackled this problem in several ways.
Forced environment refresh
Implementations (and mistakes) of the past came back to haunt us through configuration drift. By periodically refreshing and re-baselining our environments, baking in new dependencies with more recent or even conflicting versions of software, we made our configuration management process simpler, albeit by moving items from our configuration management code into our provisioning code.
I was once told by a consultant that our current deployment was wrong because we did not recreate all servers from scratch on every deploy, and that this was the only acceptable way to deploy an infrastructure you can rely on. It was an arbitrary comment that lacked context: it ruled out every other method of refreshing our applications and ignored the impact, cost and overhead of adopting Immutable Servers. We questioned whether we were doing the right thing by taking a hybrid approach, and our trust in ourselves was temporarily weakened.
Use of lightweight containers
Initially, virtualization allowed us to rapidly prototype our development environments. Unfortunately, solutions based solely on hardware virtualization require resources and time, and as our environment grew, recreating our infrastructure on our laptops became almost impossible: the cycle time to spin it up became unworkable, and maintaining our custom laptop build became a painful overhead.
We have started to investigate lightweight container technologies (such as LXC and Docker, with orchestration tools such as Kubernetes) layered on top of hardware virtualization to create our complete environment on one virtual machine. This will require fewer resources, start up faster, and involve less custom configuration management code.
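As an illustration of the direction (the service list, image names and ports are placeholders, not our actual stack), a sketch that composes the `docker run` invocations for the services making up one environment on a single VM:

```python
# Sketch of composing an environment from lightweight containers on one VM.
# Image names and ports are placeholders, not the project's real stack.
SERVICES = {
    "web": {"image": "nginx:1.7", "port": 80},
    "app": {"image": "example/app:latest", "port": 8080},
}

def docker_run_command(name, spec):
    """Build the `docker run` invocation for one service."""
    port = spec["port"]
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--publish", "{0}:{0}".format(port),
        spec["image"],
    ]

# A wrapper would pass each command to subprocess.run() to start the environment.
```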
Create environments that can be switched off or recreated easily
The majority of cloud service providers charge by the hour and most development related activities do not occur 24 hours a day 7 days a week, so outside of those hours those resources are idle and you are burning cash that could be better spent elsewhere.
In order to contain costs and stretch resources, we decided it was worth investing time in generating short-lived environments for the following scenarios:
- Performance or load testing environments that are used once a day or week
- Debugging environments for finding production Heisenbugs
- Experimental environments for testing hypotheses
- Temporarily extending our continuous integration capability on demand
- Testing disaster recovery procedures
- Validating that our configuration management code can do bare metal installations
By recreating our non-production environments regularly, we flush out any manual configurations or tweaks, making our testing more robust and reliable.
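The scheduling side of this can be sketched as a simple working-hours check (the hours and days are illustrative, not our actual policy); a scheduler would call it periodically and suspend or recreate environments accordingly:

```python
from datetime import datetime, time

# Illustrative working hours; real values would come from team policy.
WORK_START = time(7, 0)
WORK_END = time(19, 0)
WEEKDAYS = range(0, 5)  # Monday to Friday

def should_be_running(now):
    """Development environments only need to exist during working hours."""
    return now.weekday() in WEEKDAYS and WORK_START <= now.time() < WORK_END
```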
Proxy what you can but only when it makes sense
It has become a lot harder recently to get hold of fixed IPv4 addresses so initially we used reverse proxies to host multiple environments. However, we learned over time to limit the use of reverse proxies (or to make these reverse proxy implementations as close to production as possible) because of the risk of deploying configuration changes that did not accurately reflect production.
Whenever we deviated from an environment baseline we were always concerned since we could never really test or fully understand how specific changes related to our service and other third party services would be affected pre-deployment, and we considered this risk to be unacceptable. Eventually we took the hit and simply purchased more IP addresses.
Avoid costly common off-the-shelf software and hardware
One thing which saved us a lot of pain and suffering was sticking to our guns when it came to software and hardware stack selection. Experience has taught me to avoid any piece of software or hardware if:
- You can’t put one in every environment
- You can’t automate how you deploy and configure it
- You have to consider license costs before you can deploy it in a temporary development environment
- You have to constantly re-use, re-purpose and re-implement
- The last major version does not behave as expected
- The cost of changing your mind is outweighed by the cost of procurement
We usually met these criteria by favouring open source software in our designs.
Create stubs for third-parties that do not support multiple environments
Some of the services that we integrate with do not:
- Provide support for multiple environments
- Appreciate performance testing against their APIs
- Guarantee true API compatibility between their test and production environments
- Offer free API access
To combat these issues we stubbed out any service in our environments that fell short in one of these areas. Using these simulators allowed us to test and deliver services with a high degree of confidence, especially when soak testing before allowing real customers near our systems.
Obsess over build and deployment inconsistencies
When a deployment fails in any environment other than development, we force ourselves to obsess over the reason it failed, and fix it. Deployment failures in later environments make everyone nervous, and that nervousness translates into people lacking any real confidence in software successfully reaching production without issue. When confidence is lacking, getting the next release out the door is much harder.
Knowing it’s OK to say no
We have a well defined, established and successful Continuous Delivery process that we use for all deployments, so why do people sometimes feel the need to try to bypass it?
Well established Continuous Delivery pipelines can take time to complete, and the pressure to get something out the door yesterday can be immense, so sometimes we have to ask ourselves whether it is worth running through the entire pipeline to update something trivial, e.g. a single true/false option in a configuration file. But is that change really trivial? Trivial is one of those terms that varies with who you ask, but I boil it down to two tests:
- Can an orchestration tool push the change out to every environment, with the change also recorded in our configuration management tool and validated for consistency? If so, it is trivial.
- Would implementing and scoping the change that way take longer than a full deployment? If so, it is non-trivial.
So, if someone requests a non-trivial change it must go through the entire release process. It is not really saying no - it’s providing acceptable alternative options that you can have confidence in.
If your tools are getting in your way then get better tools
Inevitably one or more of your tools will bring you pain. This has happened to us, chiefly for one or more of the following reasons:
- Our needs have outgrown the tool
- The tool is difficult to use
- The tool has been implemented in way which is suboptimal
- The tool itself has drastically changed
- The tool guru has left or entered the building
- There are newer and better tools about
- The tool does not scale as well as we hoped
- The cost of using the tool is greater than we were willing to bear
- The time invested in keeping the tool running was greater than the time spent implementing something else
When one or more of these situations has happened we have either replaced or re-implemented the tool using knowledge learned from the previous implementation or other sources (e.g. the Internet, literature, and subject matter experts). We have taught ourselves to not fear change and also not to throw the baby out with the bath water.
Using a Safety Harness
Working in production scares me, and it scares me most at 3:07am when I am half asleep. In my honest opinion, giving people the greatest responsibility and pressure when they are typically not at their best is perhaps not a winning combination. To combat this fear I use safety harnesses, which give me and others the ability to perform potentially dangerous commands with a degree of safety, both off- and on-call.
Our safety harness is an aide memoire that typically includes:
- A shared knowledge base of safe command sets
- Canned commands or scripts
- Rules of engagement, i.e. Run Books or Work Instructions
- Limited access accounts for investigating issues
Production can be a pretty unforgiving place, and it is less scary when you have a safety harness.
Expanding your tool kit
In the past we’ve sometimes been a little over zealous with our use of configuration management tools. We’ve pushed the tools beyond the boundaries of what they were originally meant to do which has caused us a few issues such as:
- Overly complicated deployment logic
- Excessive shelling out to Bash scripts
- Complicated service dependencies which may or may not be triggered e.g. database elections clobbering data migrations
- Tracking which version of which package was installed where, but without any great degree of granularity
These issues eventually led to hard-to-manage code and difficult-to-debug dependencies whenever we tried to add new functionality, so we took a few steps back to make our lives easier in the following areas.
We implemented phased configuration runs (or stages), with specific configuration pipelines placed one after another, e.g. data migrations running before software updates. By doing this we sacrificed speed but gained greater reliability in our deployments.
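The phasing logic can be sketched as follows (the stage names are illustrative): each stage must complete before the next starts, and a failure halts the run so nothing later executes against a half-configured system.

```python
# Illustrative stage order; data migrations run before the software they support.
STAGES = ["provision", "data_migrations", "software_updates", "smoke_checks"]

def run_stages(stage_runners):
    """Run stages in order; stop at the first failure.

    Returns the list of completed stages and the failed stage (or None).
    """
    completed = []
    for stage in STAGES:
        if not stage_runners[stage]():
            return completed, stage
        completed.append(stage)
    return completed, None
```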
We use remote execution tools to handle the ordering of deployment pipelines for services and service tiers.
We created environment-independent build scripts for each application or service that worked on the Continuous Integration servers, locally on laptops, and in our throwaway virtual development environments. This allowed us to build our code locally and know it was being built by the CI server in the same manner.
We now use a repository management tool to handle our upstream vendor dependencies, guaranteeing which version of which package gets installed where and when.
Everything in moderation
Some old hands in Operations may remember that the IT service management framework ITIL had a non-prescriptive clause in it, which was completely lost when people rushed to implement every discipline without realising you could achieve 80% of the benefits with 20% of the effort.
We’ve come full circle in the DevOps arena and have to learn the same thing again, since much of what we do is a set of options, not prescriptions. For instance:
- We can compose infrastructures from a mix of Immutable Servers and Phoenix Servers
- Not everyone needs to deploy 10+ times per day; we deploy once a week, but our code base is always in a deployable state
- We don’t all use the same processes i.e. our Developers use Scrum and our Web Operations team uses Kanban, but we use the same code base
Running in Production is the only true definition of done
I have heard many definitions of what it means to be Done, but in my mind the only effective definition is to have the software successfully running in production and delivering business value. Any other definition of done is divisive, since it separates the responsibility of creating, testing, and delivering a feature, making each stage in delivery someone else’s problem.
On our team the Web Operations personnel deploy weekly using Kanban, while the developers use Scrum with two-week sprints. We are all continuously committing to the master branch while keeping the code releasable, and avoid exposing features in development using a combination of short-lived feature branches and Feature Toggles. New features are typically created on a feature branch behind a Feature Toggle, and then merged into master in an off state until ready for launch.
We have found Dark Launching features to be particularly useful when integration with a third-party provider is involved and testing beforehand is simply not possible due to provider constraints. For example, if a new feature relies upon a payment gateway that cannot be confidently tested in a pre-production environment then the ability to switch that feature off on production failure is very handy indeed.
Communication is key. However, it is important to continually emphasise that you can’t have continuous delivery without environment stability and a known, trusted state, and that we are aiming to guarantee safe outcomes: deployments that can be depended upon. More importantly, it is not about distrusting your colleagues; we are in fact granting them greater access to more environments than they had before.
We have achieved a number of successes in our approach, namely:
- Trust from the business - our stakeholders are confident in the processes we use, since they are simple and transparent
- A team who feel confident going on call, even with limited access to production
- Skeptics and naysayers have become advocates - our approach which was initially perceived as unworkable has become the norm
- A business which was typically used to running large big bang releases is now used to small incremental weekly releases
- An increased appetite for change among the business community; deployment successes with incremental features have shown that mistakes can be minimized and easily corrected
- To date, a high deployment success rate
A sign of success is that release parties are now a thing of the past - in fact, our releases are now pretty boring. To get here, we learned some valuable lessons:
- Communicate, communicate and keep communicating. You need to bring everyone along with you
- Be patient and unafraid to go over old ground
- Challenge assumptions and established processes respectfully. Try to understand why something is done a certain way
- Start small and show what is possible. Avoid extremely clever solutions and try to keep your solutions simple
- Try to use technology to empower those around you
In the near future we have plans to look at newer technologies such as lightweight containers and explore more automation opportunities. We want to build an automated system which can create, suspend, and delete complex multi-virtualisation environments for development teams. We plan to create infrastructure service tests to decrease the cost of operational acceptance testing, and migrate to an OS native package management system to make our deployment process and configuration code easier to manage.
Finally, we have to translate and transfer the benefits of our approach to other teams and organisations within the UK government!
Song title references
- Killing me Softly by Roberta Flack
- One Step Forward (Two Steps Back) by Johnny Winter
- Man On The Corner by Genesis
- Just This One Time by Cher
- Need for Speed by Petey Pablo
- I’ve got the Power by Snap
- I’d Do Anything For Love (But I Won’t Do That) by Meatloaf
- Bang Bang (my baby shot me down) by Nancy Sinatra
- My Definition Of A Boombastic Jazz Style by Dream Warriors
- Switch by Will Smith