DevOps in the mix - John Clapham
Timeline: Spring 2010 to Summer 2014
MixRadio is like a personal radio station with an uncanny knack for playing your favorite tracks. Available on the web and mobile, it allows free streaming of playlists without subscriptions or ads and is available in 31 countries including the US, Brazil, India and China. It has a catalogue of over 30 million songs. MixRadio grew out of one of the first online music services, On Demand Distribution (OD2), which was funded in part by Peter Gabriel. Based in central Bristol, the organisation has a reputation for forward thinking in both technology and ways of working.
On a crisp, bright day in 2011 I took to the stage for my first ever public talk: “Dev+Ops+Org how we’re including (almost) everyone”. I was nervous, but proud to represent the combined efforts of MixRadio’s engineering organisation. The title was perhaps more telling than we first realised. We included people by default, just one of those agile principles that happens naturally - it would help us along, it would be fun. In truth, if we hadn’t extended collaboration beyond engineering we may never have built such a strong Continuous Delivery capability.
An Unsustainable Pace
Our Continuous Delivery and DevOps journey began about a year before. At the time we were part of an autonomous business unit of about 200 people within Nokia Entertainment, based in Bristol and predominantly building music services. We used a mix of open and closed source technology, we were moving to a service oriented architecture, and we employed agile methods to get things done. The development teams averaged a server side release every three months. While that delivery frequency wasn’t great, we were incurring a lot of costs that caused us to carefully consider our approach:
- Time. An awful lot of activity went into each release and its related artifacts: scoping, release planning, testing, rehearsals, documentation. Much of it was bespoke and specific to that release, adding no lasting value to the organisation or customers.
- Cash. People worked extra hours, environments existed just to support release activities and quality gates, and we had a stack of pizza boxes high enough to reach the elevator ceiling. We were burning cash.
- Engagement. Some people found high pressure releases a buzz, but those pizza fuelled nights weren’t sustainable. It was frustrating that those high odds were of our own making. Most people became engineers to make stuff and solve hard problems, not to babysit changes.
- Reputation. Pressure, boredom, late nights and forced decisions aren’t ingredients for a high quality product, and defects often cropped up at the worst time. The mitigation for this was often more process and downstream checking, adding to cost and delays.
- Opportunity. Customers had to wait for a release before they saw features, potentially costing us competitive advantage. It created a lengthy product feedback cycle, hampering our ability to assess and react to user behaviour. Mirrored on the technical side, three month change cycles reduced our opportunity to evolve our architecture and technical approach.
- Product Flow Interruptus. Perhaps most seriously, product development stopped during the release phase. Engineers were either involved in release activities, or last minute fixes. Fear of missing precious release opportunities influenced product decision making. The cost was greater than the total number of mythical man days of the release. Flow was interrupted, people took time to recover and pick up their threads.
Reviewing the above list you might notice a theme - risk. Attitude to risk was driving behaviour, and it manifested in much of the process. Not wanting to risk downtime, Operations checked everything thoroughly and requested documentation to support unfamiliar applications. Not wanting to risk errors in production, staging and pre-live environments were created. Not wanting to risk errors in pre-production, quality assurance teams were established with multiple handovers. Our system bore many of the hallmarks of Risk Management Theatre.
A Pause For Thought
It was clear the price of release was too high, and the simple metric of talented people getting frustrated reinforced that message. You might wonder how we’d got into that situation, and why things hadn’t changed sooner. In fact it evolved slowly, as it is hard to view a whole organisation in depth. Issues aren’t always obvious as they are obscured by local budgets, politics, silos and shifting fiefdoms. Often the impact of a problem is not felt in the area responsible for introducing it. Few people are in a situation to take this holistic view, and may even consider it not in their interest to comment. The crucial point is that most teams were working to best practices, within their own areas. These local optimisations meant that the overall product delivery system suffered.
This situation couldn’t continue, but it is hard to break out of this kind of deadlock. There was a long period between knowing what to do at a technical level, and actually starting. During that time there were many courageous conversations and tough decisions, but the end result was sponsorship from the leadership level. This was a crucial and foresighted piece of leadership - investing some of the team’s capacity in improving capability, rather than building features and running systems.
The first smart move was to pause and think, to really zoom out and consider the problem. It is comforting to make improvements on existing process and tools, but the danger is iterating on an area that won’t reward investment. This is a common theme. Consider Blockbuster, which kept improving a product without realising it had already been disrupted.
To build Continuous Delivery tools a Scrum team was formed. In creating a new team there was every risk of contributing to the problem - another interface, and another step between concept and customer, or as we soon termed it “Laptop to Live”. Our team was different to the much debated DevOps Team approach, as its remit was to deliver a product to Research & Development and then disband.
To start with, the focus was tools. It soon became clear that tools weren’t the only constraint, and we began to understand the close relationship between tech and ways of working. Each time we attempted to simplify tooling, we would hit a process or people constraint. Here is a typical (but not real) requirements gathering conversation:
“This environment appears to be a bottleneck” “You can’t remove that environment because it’s needed for performance testing” “Why is performance testing needed?” “Performance tester says so” “Why?” “It’s best practice and…”
And so on. So it became clear that tooling and ways of working were closely coupled, and we’d need to do more than just deploy a pipeline. There were touch points on the software throughout the organisation. Understanding those, and their true value, would be crucial to success.
The amount of work that went into building our Continuous Delivery capability can’t be done justice here, and it was by no means a smooth path. On reflection, we were able to gather some of the initiatives into the areas below although at the time our approach was far less structured.
- Improve, continuously. Learning behaviours and a desire to learn are the catalysts for this kind of change. It was necessary to iterate quickly, learning what to change and build and what to leave alone for now. This was especially important as organisational and process changes often set up the technical challenges.
- Listen, constantly. We used formal exercises like value stream mapping and agile workshops, and tried to find a sensible balance between iterating in the right direction and starting absolutely everything again from scratch. It was also important to listen to dissenters, who often uncovered areas we hadn’t considered and new improvements.
- Establish a shared purpose. We didn’t realise the significance of this until much later, but our sound bites like “Laptop to Live” were actually useful rallying points that encouraged people to aim for the same goals. Other methods included sharing and showing progress through demos, monitors, and taking a community approach.
- Establish principles. There were a few key phrases, or principles that crept into the vocabulary of engineering - “never break your consumer”, “put responsibility in the right place”. Some were reminders that increased trust and increased freedom introduced the possibility of going wildly off track. These basics allowed freedom, and invited technical and non-technical solutions to problems.
- Judicious automation. A common slogan is “Automate Everything”, but we tried to make automation targets go away. Once something is automated it becomes an asset requiring maintenance, it becomes less malleable and actually impedes ability to change. Where possible we used simple systems, like a toy badger as the token to lock the build transaction.
- Earn and build trust. Trust is commonly cited as a foundation for DevOps and Continuous Delivery, but it doesn’t just happen. Management can’t declare “we all trust each other now, get out there and do trusting”. Trust has to be earned repeatedly, otherwise it evaporates quickly. Keeping progress visible, inviting feedback and showing successes all helped here.
- Design the organisation for flow. Some groups are more beholden to hierarchy than others. We found it useful to avoid dividing teams with separate managers, styles and incentives. This reinforced our joint purpose, encouraged collaboration and sped communications.
- Build damn good software. Do I need to say more?
Over time a new perspective on risk emerged, a pragmatic approach where the quantity of process and checks around a change was proportional to the risk it carried. A standard change category allowed changes within that category to move with minimum attention. To start with that assessment was carried out by Development and Operations together, but after a learning period developers categorised their own changes.
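The idea of matching checks to risk can be illustrated with a small sketch. This is not MixRadio's actual tooling: only the notion of a "standard change" comes from the account above, while the other category names, the risk criteria and the lists of checks are hypothetical, chosen purely to show the shape of the approach.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Only "standard" is named in the text; the other two are hypothetical.
    STANDARD = "standard"  # low risk, moves with minimum attention
    NORMAL = "normal"      # moderate risk, some extra review
    MAJOR = "major"        # high risk, the fullest set of checks


@dataclass(frozen=True)
class Change:
    description: str
    touches_shared_contract: bool  # hypothetical: alters something consumers rely on
    has_automated_rollback: bool   # hypothetical: can be reverted without manual work
    needs_data_migration: bool     # hypothetical: changes stored data


# Hypothetical mapping: the amount of process grows with the risk category.
CHECKS = {
    Category.STANDARD: ["automated tests"],
    Category.NORMAL: ["automated tests", "peer review"],
    Category.MAJOR: ["automated tests", "peer review", "joint Dev and Ops review"],
}


def categorise(change: Change) -> Category:
    """Assign a risk category using simple, assumed rules."""
    if change.needs_data_migration or not change.has_automated_rollback:
        return Category.MAJOR
    if change.touches_shared_contract:
        return Category.NORMAL
    return Category.STANDARD


def required_checks(change: Change) -> list[str]:
    """Return the checks that apply to a change, proportional to its risk."""
    return CHECKS[categorise(change)]


if __name__ == "__main__":
    tweak = Change("adjust log message", False, True, False)
    print(categorise(tweak).value, required_checks(tweak))
```

In a scheme like this, the rules in `categorise` are the part a team would agree on together at first, before individual developers apply them to their own changes.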
After a relatively short period of time, multiple daily releases became routine. I enjoy bragging about daily release count as much as anyone, but the real milestone was the ability to release often enough to match the pace of development team throughput.
As intended the Continuous Delivery team disbanded, and the MixRadio engineering community as a whole took forward the release toolchain and furthered our DevOps ways of working. Momentum was maintained, Continuous Delivery stayed in the limelight, and we found a few nice ways to promote internally. One example was celebrity deployments by visitors, including Nokia CEO Stephen Elop.
The tooling continued to improve, albeit slowly, on a best efforts basis. Improved tooling revealed more process and ways of working issues that needed attention. Increased confidence enabled bold decisions like removing pre-production environments. With each change quality was monitored, and invariably improved.
A Tipping Point
The MixRadio organisation is a hungry one, keen to improve and better itself. That meant that, just like the environments we’d pruned out, the Continuous Delivery tool itself was subject to scrutiny. A welcome change in hosting strategy permitted migration to Amazon Web Services. The CD tool could be adapted to deploy to the cloud, but it was coupled to the processes it modelled and reflected the conservative organisation in which it was brought up. The architecture had evolved, with fine-grained services reducing queues and making it less likely that changes would conflict. Roles had changed and knowledge had been developed over the years. The first pipeline tool helped engineers to deploy; now engineers had good experience of production systems. Operations needed to turn their attention to specialist areas including infrastructure, security, and provisioning. All these factors were positive, but it was clear the pipeline was on the verge of limiting our delivery capability.
A new Continuous Delivery tool was to be built, but this time things proceeded very differently, a mark of the organisation’s maturity and learning capacity. It didn’t take long to gain sponsorship for a new team, and this time it was an autonomous group of engineers. Organisational buy-in was a given and this enabled a firm focus on technical challenges. The new system took full advantage of the changed environment, organisation and architecture. As opposed to being driven by the need to protect from failure and bolt down every risk, the new system relies on trust, simplicity and mastery. There are just two environments, a handful of commands to deploy, no tooling orchestration, and no exclusive locks.
Continuous Delivery has brought many benefits to MixRadio, improving speed of execution and efficiency. Although highly subjective, engagement and motivation appear to have improved as people focus more on problems that matter and are able to take full responsibility for their code. Having a single, simple, visible goal was crucial for bringing change. In some cases this united teams and encouraged collaboration. In other cases change was less welcome. What mattered was the sense of purpose it brought and that conversations were started. A sense of community helped the changes in many ways, from building trust to gaining feedback and understanding the problem and goals.
While trying to improve delivery capability there appeared to be five interdependent factors: tools, culture, architecture, organisation and process. Often one would enable another, and when improvements were made a new constraint would surface elsewhere. Culture has a significant impact on technical choice: lightweight approaches become viable in environments of sufficient mastery, trust and autonomy.
From this it seems important to view technology and process as learning tools, to not be sentimental and refactor both mercilessly. Unless your business or competitors stand still they are never done.