DevOps-ifying a traditional enterprise - Niek Bartholomeus
A large investment bank in Europe
Timeline: April 2007 to August 2012
Between 2007 and 2012 I had the chance to work in a cross-cutting team within the dev side of the IT department of a large investment bank in Europe. The objectives of this team - that throughout several internal re-organisations had been given very different names like ‘Strategy & Architecture’, ‘Technical Architecture’, and ‘DevTools’ - were never very clear, although they could be broadly summarised as “doing all the things that could benefit more than one development team”. Nonetheless the work was very interesting and each day had its own unique twists and turns.
The team consisted of between four and eight people, each a technical expert specialised in one or two of the organisation’s supported technologies (such as Java, .NET, ETL languages, and reporting). I was the only true generalist in the team, so work that required knowledge of multiple domains simultaneously became my “specialty”.
Initially we focused on creating re-usable building blocks for each of these technologies, going from defining the company’s preferred application and security architectures to building framework components for security, UI templates, a common build platform, common deployment scripts, and so on.
This work - although technical in nature - posed plenty of interesting cultural challenges as well: it required finding common ground between all of the different development flavours practiced within the organisation, and then convincing each team that the chosen solution was the best for the company, even though it might not have been the best for that particular team.
It took a while, but once this technical platform had finally settled down it proved its value, not least for new development teams that were brought in and could hit the ground running by relying on these building blocks for all of their cross-cutting concerns.
On the other hand, all that automation had not contributed much to solving what had by that time become the biggest bottleneck for delivering software to the end users: the infrequent, organisation-wide releases that remained brittle and labour-intensive. A different approach was needed to tame that beast, and for the generalist in me this multi-domain challenge attracted me like a magnet!
Let me first explain the organisation structure in more detail before discussing the problems that caused this bottleneck.
As with so many other traditional enterprises, the company relied heavily on managers to get the work done: they cut the total work into pieces by speciality (analysis, development, testing, …), assign the pieces to their team members, and coordinate the hand-overs between them. With this kind of micro-management the team sizes have to be kept small enough to avoid the manager drowning in work. This typically results in steep, hierarchical organisation structures where higher management is separated from the work floor by several layers of middle management. As a result, a huge gap opens between the place where decisions are made (at the top) and the place where they are executed and where in many cases the deep knowledge sits (at the bottom).
Something else that is typical for these enterprises is their heavy reliance on planning. There is a general assumption that the world is simple, stable, and deterministic, and therefore we can perfectly predict it. Based on this mindset the most efficient way to execute a task is to rigorously plan up-front all the work that is needed and to assign it to specialist teams or individuals, further increasing the need for managers and coordination.
This is also where corporate process frameworks like CMMI and ITIL come in. These frameworks assume that our business is so mature that process-analysts who are far away from the reality can standardise the work we need to do into detailed procedures. This approach to structuring an organisation has some interesting consequences, which we will now explore.
First of all there is the ‘silo-isation’ that comes with these specialist teams. People are motivated to stay inside their domain of expertise - to ‘increase the efficiency’ - and leave coordination to the project managers. I have always been surprised by how little attention generalists receive in these environments. A new problem that arises cannot always be divided up-front over the various specialist teams; rather, it needs people with a good understanding of the bigger picture and an 80% knowledge of multiple domains to find a good solution.
In such a context there is also little room for experimentation. Rather, the expectation is that people come up with solutions by applying reductionist thinking in this supposedly deterministic world. The assumption is that the world can be fully captured upfront in hard requirements (instead of mere hypotheses), so there is no need for experimentation. If these requirements turn out to be wrong, it can only mean - so the thinking goes - that we have not spent enough effort on planning.
Secondly, in such a planning-heavy organisation the most difficult problems are typically solved by bringing in some form of centralisation. For example, if there is a big need for data to flow between applications, and people get the feeling that work is being duplicated to combine, analyse, or transform that data, this immediately sets off a red “bad efficiency” alert throughout the management departments. Significant effort is then spent on rationalising the situation by adding a central data hub that sucks in all the source information, integrates it, and makes it available to any application that may need it.
Another example concerns software delivery: as soon as the number of moving parts that has to be delivered into production reaches a certain threshold, an organisation-wide release management team is brought in to take control over the situation.
Anything whose solution is a company-wide configuration management database (CMDB) or messaging bus is usually also a good example of this phenomenon.
Furthermore, the organisation was characterised by its hugely entangled and heterogeneous application landscape (in terms of technology and architecture), in which most of the applications were acquired on the market, not developed in-house. Many of these applications were tightly integrated between one another and depended on older technologies that did not lend themselves very well to automated deployment or testing.
In general there was a lack of automation throughout the whole software delivery lifecycle. This in itself is quite interesting because one could argue that automation (of business processes) is what we as a department do for a living. Keeping track of which features were implemented in which versions of the application, automated acceptance testing, automated provisioning of test environments, deployment requests, release plans, all kinds of documents in order to pass architectural or project-level approvals, and much more was all done the artisanal way using Word, Excel, and a lot of manual human effort.
Infrequent, organisation-wide releases
All of the above, but especially the high trust in planning, the many (known and unknown) dependencies between the applications, and the many manual steps, led to releases that occurred infrequently and that tied together all the applications that needed upgrading, which in turn led to huge batch sizes (the amount of changes implemented in one release cycle).
An uncertain world
In addition to the problem of huge batch sizes, the whole process of software delivery had several other problems that were all rooted in one fundamental issue: it is simply impossible to predict, in a sufficiently precise manner, the context in which the application will exist once delivered to the end users. Even in a relatively mature domain such as investment banking, there are just too many unknowns: the exact needs of the users, the way all these complex technologies will behave in the real world, and so on. This lack of information, this existence of uncertainty, is further increased by the high degree of silo-isation. Take for example the developers: they may know all about their programming language, but they have only limited knowledge about the infrastructure on which their application depends, or about how exactly their end users act and think. They are shielded away from all these domains that may have an impact on how best to write the application code. The same applies to all the other specialist teams involved in delivering or maintaining the application, each having only a partial comprehension of it.
Many false assumptions exist within a heavily-siloed organisation, and these assumptions will only be exposed when the application is finally deployed and used in an acceptance test or even production environment. Operations people interpret the deployment instructions incorrectly; developers don’t understand how operations have set up a piece of infrastructure, or what the exact procedure is to request their services; and so on. All of these issues take time to resolve, and this unplanned time puts ever greater pressure on the planning downstream. Eventually one of the deadlines is missed, with a domino effect on all the other teams involved. In our case this resulted in testers not having enough time for regression testing (or worse: for testing all of the new features), workarounds and shortcuts being implemented for lack of time to come up with a decent solution, new features being pulled out of the release because they were not finished in time, release weekends running late, and so on.
Tactical solution: enhancing the existing communication flows
I would like to say that we solved the problems by switching to a more agile approach that favours experimentation and a quick feedback cycle between idea and production, one that allows us to spot discrepancies between assumption and reality early on. Unfortunately I only got this insight long after I left the company, when I had had the opportunity to take a step back and see things from a distance. I guess it was just too difficult to think outside the box as long as I was still inside it.
Instead we focused on making the existing software delivery process more reliable by first streamlining the process level and then by automating it as much as possible.
On the process side we made sure we came up with a process that was as simple as possible, was agreed by all stakeholders (and for software delivery that is quite a few), and was understood by everyone else involved. One of the positive consequences was that people got a better insight into what the other teams were doing, which in turn led developers and ops people to better appreciate each other’s work. They finally had a common ground from which to start discussing whenever a misunderstanding arose between them.
On the automation side we decided to introduce a collaboration tool to facilitate the manual work involved in tracking multi-application releases, and integrated it with our existing tools for feature tracking, continuous integration, and deployment automation in order to keep the manual work to a minimum. With the tools taking care of all the simple and recurrent tasks, the people (and the release manager in particular) could finally start focusing on more important, higher-level work. Using tooling to keep track of which versions of your application exist, which version is deployed where, how the application should be deployed, and so on avoids the human errors that would typically have caused lots of troubleshooting and stress downstream, and also increases the level of trust people put in this information.
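The bookkeeping described above can be illustrated with a minimal sketch. All names here are hypothetical, not the actual tool we used; the point is only the kind of question such a registry answers automatically instead of via an Excel sheet:

```python
# A minimal sketch (hypothetical structure) of deployment bookkeeping:
# recording which version of each application is deployed in which
# environment, so the release manager can query it instead of
# maintaining the information manually.

class DeploymentRegistry:
    def __init__(self):
        # Maps (application, environment) -> currently deployed version.
        self._deployed = {}

    def record_deployment(self, app, env, version):
        # Called by the deployment automation after each deploy.
        self._deployed[(app, env)] = version

    def deployed_version(self, app, env):
        # "What is running in production for this application?"
        return self._deployed.get((app, env))

    def environments_running(self, app, version):
        # "In which environments is version X live?"
        return sorted(env for (a, env), v in self._deployed.items()
                      if a == app and v == version)
```

A usage example: after recording that `trading-ui` 1.4.0 went to the test environment while 1.3.2 is still in production, `deployed_version("trading-ui", "prod")` returns `"1.3.2"` without anyone having to remember it.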
Looking back at this project two years later, I realise now that it was only the first step in solving the problem. By improving the quality of the existing communication flows we indeed considerably increased the chances of getting the releases out in time, and we definitely made the whole process more efficient, but it didn’t lead in any way to an increase in the frequency of the releases.
Here is the score card after this first step:
- Reliability: check
- Speed: uncheck
The next step should now be to shorten the release cycle, to make releasing software so easy that nothing stands in the way of doing it whenever the need occurs to validate your assumptions in the wild; that is, to finally get the quick feedback cycle that is necessary to come up with a working solution in a complex and constantly changing business.
Let me briefly explain the obstacles that still stood in the way of speeding up the release cycles and how I would now go about solving them by introducing decentralisation.
Structural solution: decentralisation
The heavy reliance on centralisation that was traditionally used to solve the data integration and release management problems turned out to require a huge communication channel between the central orchestrator and the agents it conducted. Therefore, as the problem domains gradually scaled out, this solution required more and more effort to keep up. By enhancing the existing communication flows we got ourselves out of the worst mess, but we could easily see that it was just a matter of time before even this solution would be pushed to its limits.
A tendency to over-standardise
Another consequence of this centralisation was that there was a natural tendency by the central orchestrator to standardise the behaviour of its agents into a common template. The reality was that there were a lot of very different applications out there, each with their own preferred release cadence, risk profile, business maturity, technology stack, etc.
The online applications typically live in a quickly changing business and therefore demand a rapid release cycle. There are huge opportunities in these markets, and risks have to be taken in order to unlock them. The back end applications, on the other hand, have been around a lot longer and their market has had time to mature, so it has become a little easier to make upfront predictions based on previous experience. Also, cost-efficiency is more important here because the opportunities to create the value to cover these costs are limited. These applications are sometimes referred to as the core applications because so many other - more recent - applications depend on them, which also makes them more difficult (in terms of total cost, risk, etc.) to change. The drive to change them is small anyway because their business doesn’t change that often anymore.
As such, each individual application had a very specific profile, ranging from innovative to mature. It was obvious to me that squeezing them into a common one-size-fits-all structure had a big cost attached, although this cost was not always fully visible up front.
Decentralisation to the rescue!
To avoid these problems with scaling and standardisation, I realise now that it would have been better to ‘loosen up’ this tight coupling by pushing the finer-grained decision-making power down from the orchestrator into the agents, and similarly by allowing these agents to collaborate with one another instead of always having to rely on the orchestrator for all coordination needs. If some agents require close interaction, it makes sense simply to bring them closer together (physically or virtually) or to combine them into one agent, so the communication becomes more local and therefore more reliable. With the decision power they have gained, the agents are then also free to optimise for their own specific needs instead of having to follow the centrally imposed standards.
Decentralisation applied to software releases
Translated to our problem of infrequent releases, this decentralisation would mean that we should first of all get rid of the application integrations that are not strictly necessary (take the use of shared infrastructure as an example) and then decouple as much as possible the inherent integrations that remain. This decoupling can be done by making all changes to the application backward-compatible. Yes, the magic word here is backward-compatibility! Make no mistake, this is an incredibly difficult task that goes to the root of how you architect and design your applications. However, once you have put in the effort to ensure backward-compatibility, you get back the freedom to release your application whenever you want, and as fast as you want, independently of all the other applications and of any corporate release schedules that may exist. No matter which of the other domino blocks may fall, they will not be able to touch yours. The decision power is hereby moved down from the central orchestrator - the release management team - to the individual agents - the development teams, who become autonomous and self-empowered.
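What a backward-compatible change looks like in practice can be sketched with a small example. The names and message format below are hypothetical, chosen only to illustrate the principle: a producer adds a new optional field to a message, and consumers tolerate both the old and the new shape, so either side can be released on its own schedule:

```python
# A sketch (hypothetical names and format) of a backward-compatible
# message change. Producer v2 adds an optional "currency" field;
# consumers built against v1 keep working because they ignore unknown
# fields, and v2 consumers apply a default when the field is missing.

import json

def publish_trade_v2(trade_id, amount, currency="EUR"):
    # Producer v2: new optional field added alongside the old ones.
    return json.dumps({"trade_id": trade_id, "amount": amount,
                       "currency": currency})

def consume_trade_v1(message):
    # Consumer v1: reads only the fields it knows about and tolerates
    # extras, so it can be deployed before or after the new producer.
    data = json.loads(message)
    return (data["trade_id"], data["amount"])

def consume_trade_v2(message):
    # Consumer v2: uses the new field, with a default when talking to
    # an old producer that does not send it yet.
    data = json.loads(message)
    return (data["trade_id"], data["amount"], data.get("currency", "EUR"))
```

Because neither side ever requires the other to upgrade first, the release of the producer and each consumer can happen independently - exactly the autonomy the decentralisation aims for.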
And to keep it within the spirit of autonomy and self-empowerment, in my view there is absolutely no need for all the applications to start this journey towards decentralisation at the same time and pace. The applications on the innovative side of the range would naturally benefit more from increased autonomy so it makes sense to start with them. The other applications could be done at a later time or not at all, whatever makes most sense in their specific case.
With the introduction of decentralisation the score card can hopefully be updated to:
- Reliability: check
- Speed: check
We have seen that at one point the biggest bottleneck for delivering software in the company I worked for was the fact that the releases happened infrequently and tied all applications together. We were able to trace the reason for this back to an organisation structure that relied too heavily on managers and upfront planning (resulting in heavily silo-ised teams and centralised decision making), a hugely entangled application landscape, and a high degree of manual work.
When building software for complex and quickly changing business domains it is impossible to rely so much on upfront planning because the world is simply too uncertain and there are too many false assumptions to work with. Instead we need a quick feedback loop between idea and production. This can only happen when software can be released frequently, with minimal effort.
We were able to reduce the biggest problems of these infrequent releases by improving the existing communication flows, both in terms of the process, and on the automation side. This greatly improved the reliability but didn’t really do much to actually speed up the release cycles.
The next step should now be to increase this release frequency by introducing decentralisation. Where the first step only had an impact on the process and automation side, this step will address the cultural side of the company, attempting to move the minds from a focus on determinism, upfront planning, top-down management, efficiency, and so on, to one with a focus on self-empowerment, mutual collaboration, experimentation, and accepting failure.
To me this looks like a crazy difficult challenge, one with no guarantee of success and with lots of pitfalls along the way. But still one that we should attempt, because there is not really an alternative, is there?