
Communicate to collaborate - Matthew Skelton

Matthew on Twitter: @matthewpskelton - Matthew’s blog: matthewskelton.net

A major UK-based ticket retailing website

Timeline: 2011 to 2013

Context

The organisation began selling tickets online in 1999, making the website one of the longest-running online booking websites in the UK.

From 2010, the organisation embarked upon a major transformation of its technical capabilities across all areas of IT: development, testing, release engineering, system operations, and infrastructure. The company grew the UK-based teams to 5 development (Dev) teams, 10 application support people, 12-15 people in infrastructure and operations (Ops), and 6 in IT Support. In addition, 200 developers, QAs, environments admins, and others were retained in Bangalore through ThoughtWorks and were part of the internal team. By the middle of 2012 the teams looked like this:

  • 10 x 2-pizza dev teams - London and Bangalore
  • 1 x 8-person NFT (Performance Testing) team - Bangalore
  • 1 x 10-person Regression Test team London
  • 2 x 2-person Build & Deployment teams in London and Bangalore
  • 1 x 8-person Environments team - London and Bangalore
  • 1 x 5-person Internal IT infrastructure team - London and Edinburgh
  • 2 x 8-person front-line system support teams - London and Bangalore
  • 1 x 8-person IT Operations - London

At least 260 technical (hands-on) people were involved in building and operating the software systems at this organisation, and I joined the company in late 2011 as Build and Deployment Architect in order to lead the build & deployment activities as part of this transformation.

10+ Dev teams in two locations (India and UK)

This article focuses on the ways in which we within the Build & Deployment team in London nurtured communication with other teams in order to improve collaboration and the way in which the software systems were built and operated. We were inspired by the 2010 book Continuous Delivery by Jez Humble and Dave Farley, and by the emerging patterns from the DevOps movement, and sought to use these and other practices to make software delivery more reliable and predictable.

Problems and causes: mental and financial models, execution, and communication

By 2011, the commercial teams were increasingly concerned that the cost and speed of software delivery were too high, and they began demanding improvements from IT. Predictability of feature releases was paramount for the commercial teams, who had relationships with commercial ticket resellers, large corporate (B2B) clients, and other business customers. Features would be promised contractually to clients for a specific date, so reliability of delivery was more important than pure speed in a market dominated by a few companies.

The working assumption was that it was possible to specify features independently of other features without negative future implications. Over time, this approach had led to a system with significant technical debt that made development of new features ever more costly, a large central monolithic database, and a low level of focus on operational concerns such as logging, metrics, diagnostics, and deployability. This created a lengthy regression testing phase and infrequent big bang releases (a platform release) every three months or more. As platform releases became larger and less frequent they became increasingly error-prone, leading to a drive for even less frequent releases despite the commercial teams pushing for more and more features14.

A ‘smell’ that indicated something was amiss was the presence during 2010 of a small development team that was empowered to bypass the lengthy platform release process and work on innovations in their own workstream. Useful features outside the main platform were developed and released rapidly in close collaboration with the commercial teams. The development speed and collaborative nature of this small team were a welcome eye-opener for many, but unfortunately there was limited focus on Continuous Integration, automated deployments, or logging. After I joined it soon became clear that within our Build & Deployment team we would need to spend significant time increasing awareness of good software practices within the wider organisation if we were to be effective.

Treat the build system as production

When I joined, it’s fair to say that the state of Build & Deployment was not very healthy. The quality and reliability of builds and deployments had become gradually worse: builds would break for no obvious reason; deployments would fail apparently at random; and teams were contending for shared environments in which to test their changes. Pressure to get things working had led to developers being given administrative access to test environments, resulting in tests going green only because of undocumented fixes. My colleague had been given the mostly thankless task of getting the builds green, and likened these activities to juggling flaming plates.

The Bangalore and London teams each had their own set of build and test infrastructure15. The London system initially included ~130 servers:

  • 4 servers running Go for CI and deployments
  • 100 build agents, some doubling as test agents (!)
  • 5 machines for data replication
  • 4 Subversion servers for source code and artifacts
  • ~20 other machines for testing

This infrastructure had grown organically over several years, and had no real owner except the (overworked) Build & Deployment team.

Be a Service Desk

I realised quickly that the build/deployment/replication (BDR) infrastructure was an internal live system, and that we needed to treat it as a live system to stabilise and run it effectively. Specifically, there were certain points in the delivery cycle when BDR unavailability would have blocked deployments to Production. On one occasion, a crucial deployment server in the primary data centre was 2 hours away from being decommissioned because it was not classified as Production. We recommended the machine be retained and rebuilt with Chef to prove we could survive a future decommissioning incident!

We set up a JIRA ticketing system to give us a single point of contact for issues affecting the BDR system. When people walked up to us, sent a Skype message, or emailed over a problem, we asked them (nicely!) to report the problem via a JIRA ticket, explaining that we needed to triage and prioritise the issues.

Crucially, we did not attach SLAs or response time guarantees to the ticketing system and only used it to centralise reporting. Submitting a ticket was as simple as emailing a specific address, and I also printed some cards with details of our JIRA ticketing system:

Biz card for build system problems!

The cards and the JIRA system seemed to confirm for people that we were serious about addressing the instabilities in the BDR infrastructure16, and we began to gain the trust of development teams. We found using a light-touch ITSM approach based on aspects of ITIL®17 also helped us to win friends in IT Operations because we were now speaking their language.

Collaboration with IT Support

A practical way in which we improved BDR was to develop a close working relationship with IT Support. We built up trust by helping each other out; we shielded the IT Support team from the more random requests from developers wanting a new tool, and in return they helped us to operate the BDR infrastructure by adding monitoring or configuring backups.

In order to emphasise the need for the BDR infrastructure to be taken more seriously, we collaborated on defining the service levels (recovery point objectives, backups, etc.) for all the key services and servers in the BDR estate:

Differentiating service levels at different times of the delivery cycle

Shortly after this, we worked together on a highly-available monitoring system built on Zabbix, with API access for the Build & Deployment team to configure alerts and thresholds. Without the earlier collaboration on ticketing and service level requirements, the shared monitoring would have been much harder to achieve.

Use the build systems as a test-bed

A happy side-effect of having a large and complicated Build & Deployment system was that we could prototype new approaches within our own live system before suggesting new approaches for the main Production systems. For instance, in early 2012 the Production systems did not really use DNS for endpoint resolution and most people used specific server IP addresses. We knew this was hampering good deployment and service discovery practices, so we trialled a couple of approaches to DNS management within the BDR system before recommending an approach for other environments: using an environment-agnostic DNS entry in config files with the fully-qualified part being supplied by the Primary DNS Suffix for an environment.

Use DNS, Robin!
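
To illustrate the idea (a minimal sketch: the host name, DNS suffixes, and code are invented for this article, not taken from the actual systems), an application can store only the short, environment-agnostic host name and let each machine's Primary DNS Suffix supply the fully-qualified part:

    import socket

    # Hypothetical short name stored in the application config; the same value
    # is used in every environment.
    DB_HOST_SHORT = "bookingdb"

    def resolve_endpoint(short_name):
        # getfqdn() lets the operating system append the DNS search suffix
        # configured for this environment, so the same unqualified name
        # resolves to a different machine in each environment.
        return socket.getfqdn(short_name)

    if __name__ == "__main__":
        # On a test machine this might print bookingdb.test03.example.internal;
        # on a production machine, bookingdb.prod.example.internal.
        print(resolve_endpoint(DB_HOST_SHORT))

The same configuration file can then be promoted unchanged from environment to environment.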

In a similar way, we also trialled and demonstrated several other new approaches and technologies:

  • Chef for infrastructure configuration
  • Artifactory for NuGet, Chocolatey, and RPM packages
  • Graphite for time-series metrics and monitoring
  • The vSphere API for virtualisation
  • Vagrant for manual and automated VM image changes
  • Patching schemes for Windows and Linux
  • Building and testing VM templates within CI

The introduction of Graphite (promoted by my colleague in London) was a particular success, as the existing monitoring tools used by IT Operations were end-of-life. One morning at 8am, a sharp-eyed developer spotted the metrics had hit zero and told the IT Operations team that something was wrong. Operations had not yet seen the problem because their tools aggregated data every 5 minutes, so they wanted to know how the hell we had found the problem first!

A mini Graphite graph from an unhealthy server
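
The value of frequent, fine-grained metrics is easy to sketch. Here is a minimal, hypothetical example of polling Graphite's render API and flagging a metric that has flatlined at zero; the server address, metric name, and threshold are assumptions rather than our actual configuration:

    import json
    import urllib.request

    # Hypothetical Graphite server and metric name.
    GRAPHITE_URL = "http://graphite.example.internal"
    TARGET = "servers.web01.requests_per_sec"

    def latest_datapoints(target, minutes=10):
        # Graphite's render API can return recent datapoints as JSON.
        url = (f"{GRAPHITE_URL}/render?target={target}"
               f"&from=-{minutes}min&format=json")
        with urllib.request.urlopen(url) as response:
            series = json.load(response)
        # Each datapoint is a [value, timestamp] pair; value may be None.
        return [value for value, _ in series[0]["datapoints"] if value is not None]

    def looks_unhealthy(values):
        # Flag the metric if every recent datapoint is zero.
        return bool(values) and all(v == 0 for v in values)

    if __name__ == "__main__":
        if looks_unhealthy(latest_datapoints(TARGET)):
            print(f"ALERT: {TARGET} has flatlined at zero")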

Improving the software release process

Use Conway’s Law to align teams and software systems

One of the key changes we were deeply involved in was the move from a shared codebase model, where any Dev team could work on any part of the system, to a model where teams were aligned to sets of products & services. We called this ‘product-aligned teams’ and made the change for several reasons:

  • We knew that we needed different parts of the system to be independently deployable
  • Conway’s Law18 strongly suggested that independent subsystems would require teams to have responsibility for specific subsystems
  • We wanted a greater sense of ownership over the code in general

We found volunteers (mostly willing) within each Dev team to act as go-to people for any build or deployment problems relating to a subsystem. It was their responsibility to coordinate the fixing activities with their team, whilst providing a point of contact for other teams with dependencies.

We hit an interesting snag when moving to product-aligned teams: when all broken builds were displayed on a shared information radiator, teams often ignored red statuses, even for their own components. We tried various ways of displaying the information until we realised that, because the radiator never showed any green at all, the meaning of red had become diluted: an interesting psychological gotcha. A developer consequently tweaked the build radiator software to show a percentage bar.

All red dilutes the meaning
Green percentage gives context
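
The ‘percentage green’ idea is simple to express in code. The sketch below uses invented team and pipeline names and is not the actual radiator tweak, which was a small change to an existing tool:

    from dataclasses import dataclass

    @dataclass
    class Pipeline:
        name: str
        team: str
        green: bool

    def green_percentage(pipelines, team):
        # Showing the proportion of a team's builds that are green gives the
        # red ones meaning again, rather than displaying an all-red wall.
        owned = [p for p in pipelines if p.team == team]
        if not owned:
            return 100.0
        return 100.0 * sum(p.green for p in owned) / len(owned)

    if __name__ == "__main__":
        pipelines = [
            Pipeline("tickets-ci", "TeamA", True),
            Pipeline("tickets-deploy", "TeamA", False),
            Pipeline("search-ci", "TeamB", True),
        ]
        for team in ("TeamA", "TeamB"):
            print(f"{team}: {green_percentage(pipelines, team):.0f}% green")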

Applying Conway’s Law to the build & deployment systems themselves, we began using a separate build & deployment server per team, forcing a separation of tooling between teams, in addition to a separation of application logic. Prior to the change, we had one deployment server for each set of activities (build, packaging, test, Pre-Production, Production), shown here in the vertical purple blocks:

Deployment servers optimised for function

We changed the focus of the deployment servers so that each team had its own server (shown here in purple) that could take that team’s software all the way from CI through testing to Production:

Deployment servers optimised for value flow

With the old approach, it was very difficult to trace a change from version control to Production and vice versa. In the new scheme, we could build a deployment pipeline that was simple and clear enough to show to non-technical people and have them immediately grasp the progress of a change:

Deployment pipeline stretching to Production using ThoughtWorks Go

Improve quality incrementally

Having read (and re-read) the Humble & Farley book ‘Continuous Delivery’ I realised that the contents pages of the book read rather like a checklist for sensible technical practices. For example, some of the subheadings are:

  • Every Change Should Trigger the Feedback Process
  • Keep Everything in Version Control
  • Never Go Home on a Broken Build
  • Only Build Your Binaries Once

I printed out the Contents section on large paper, stuck the pages on a wall by a thoroughfare, and annotated each page with a score reflecting how well we thought we were doing. Quite a few people stopped by the printouts and asked about the rankings and the reasons behind the Humble & Farley recommendations, which sparked some useful discussions. This was a practical way to engage people in thinking about the scale and nature of changes we needed to undertake.

The Contents pages of ‘Continuous Delivery’ form a handy roadmap
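
The same checklist idea can be captured in a few lines. The practices and scores below are purely illustrative; the real exercise used the full Contents pages and gut-feel scores agreed at the wall:

    # Hypothetical self-assessment scores (0 = not doing it, 5 = doing it well)
    # against a handful of Continuous Delivery practices.
    practices = {
        "Every change should trigger the feedback process": 2,
        "Keep everything in version control": 3,
        "Never go home on a broken build": 1,
        "Only build your binaries once": 2,
    }

    # List the weakest areas first, as candidates for the next improvements.
    for practice, score in sorted(practices.items(), key=lambda kv: kv[1]):
        print(f"{score}/5  {practice}")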

A further way in which we worked with teams to increase engagement with the different software products and subsystems was to run a series of workshops with Dev and QA teams to characterise the kinds of testing activities they thought were necessary. By this point, members of the QA team had begun to work more closely with Dev teams rather than just as a QA team.

Essentially, we characterised our ideal value stream and looked at how the current state of the deployment pipelines matched that ideal. For each kind of testing activity we discussed a range of factors:

  • Duration (magnitude)
  • Magnitude (num tests)
  • Value Gained
  • Scope
  • Cost of creation
  • Cost of maintenance
  • Cost per test
  • Visibility of test results
  • What does stage failure mean?
  • Number of machines
  • Manual or Automated
  • Duration (seconds)
  • Technology or Business Facing?
  • Purpose - Support Programming or Critique Product?

Exact values were not important; what mattered was that teams consciously thought about these questions in the context of the products and subsystems they were now responsible for. Here is an early version of some subsystem target deployment pipeline stages:

Stage properties      | Build          | Unit Test                                        | Isolated Tests          | Exploratory
Pipeline Meta-Stage   | Commit Testing | Commit Testing                                   | Auto Acceptance Testing | UAT
Duration (magnitude)  | 10 sec         | 1 min                                            | 3 min                   | 1 hour
Magnitude (num tests) | 10             | 100                                              | 30                      | 5
Value Gained          |                |                                                  |                         |
Scope                 | Solution File  | In-memory; does not make HTTP call or contact DB | File System + Local DB  |
Number of machines    | 1              | 1                                                | 1                       | 3
Manual or Automated   | Automated      | Automated                                        | Automated               | Manual
Duration (seconds)    | 10             | 60                                               | 180                     | 3600

We can see the teams felt unit tests should take no longer than about 60 seconds, whereas integration tests might take 3 minutes, and exploratory tests might take an hour to complete. We also captured the principle that unit tests should not make an HTTP call or touch a database, and discussed other kinds of constraints. We were particularly influenced by the outside-in ‘GOOS’ model of system testing. The members of each team also rated the test stages for their own subsystems as Green (healthy), Amber (needs work), Red (problematic), or Grey (not relevant), based on gut feeling and analysis of Go logs:

A heat map for the quality of test stages for different subsystems
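
A minimal sketch of the underlying heat-map data (the subsystem names and ratings here are invented for illustration):

    # RAG(+Grey) ratings captured per subsystem and pipeline stage, based on
    # gut feeling plus analysis of the Go build logs.
    RATINGS = {"Green": "healthy", "Amber": "needs work",
               "Red": "problematic", "Grey": "not relevant"}

    heat_map = {
        ("Checkout", "Unit Test"): "Green",
        ("Checkout", "Isolated Tests"): "Amber",
        ("Reporting", "Unit Test"): "Red",
        ("Reporting", "Exploratory"): "Grey",
    }

    # Summarise the stages most in need of attention.
    for (subsystem, stage), rating in sorted(heat_map.items()):
        print(f"{subsystem:<10} {stage:<15} {rating:<6} ({RATINGS[rating]})")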

The purpose of this push for a bottom-up, shared view of tests at different stages was to increase trust between the Dev teams and the QA team, so that we could avoid test duplication. Previously, separate and overlapping sets of tests were maintained by Dev and QA (Conway’s Law at work again!) but by collaborating on the existing tests we could remove duplication without increased risk of regression problems. The collaborative nature of the workshops helped to get buy-in from teams for ownership of their newly-assigned products and components, and provided useful insights into the health of the code and tests across all the systems we examined.

Increase the focus on operational concerns

The Dev teams had become accustomed to prioritising operational features below user-visible features, leaving deployment, diagnostics, metrics, logging, etc. somewhat neglected. This had led to the Ops teams often distrusting releases because their needs had not been addressed. It also made the software more difficult to work with during a live incident, because the hooks and information that would have been useful to diagnose problems were often not present or difficult to find.

We adopted a multi-pronged approach to improving operability:

  1. Add version information in DLLs and packages for discoverability and audit
  2. Push for log aggregation for both developers and operations people
  3. Encourage Dev and Ops collaboration through Run Books

Version-stamping DLLs

A useful feature of .NET DLLs is that the Windows operating system recognises metadata embedded within the files, which allowed us to store useful information within the DLL itself to be discovered after deployment. After one hotfix copied new DLLs over existing DLLs, the IT Support team could not tell where the files had come from. By embedding the Subversion repository revision or a Git SHA in the DLL, we had a way to trace a particular DLL back to a specific commit in version control.
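
A build step along these lines might look like the following sketch, which writes the current commit identifier into an assembly metadata file before compilation; the file name and version numbers are assumptions for illustration rather than our exact scripts:

    import subprocess

    def current_commit():
        # Works for Git; for Subversion one could use `svnversion` instead.
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True).strip()

    def write_version_info(path="VersionInfo.cs", version="3.16.0.0"):
        sha = current_commit()
        # AssemblyInformationalVersion is surfaced by Windows as the DLL's
        # "Product version", so IT Support can trace any deployed DLL back
        # to a specific commit.
        content = (
            'using System.Reflection;\n'
            f'[assembly: AssemblyVersion("{version}")]\n'
            f'[assembly: AssemblyInformationalVersion("{version}+{sha}")]\n'
        )
        with open(path, "w") as f:
            f.write(content)

    if __name__ == "__main__":
        write_version_info()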

Log aggregation

My colleague put in place some cunning build scripts to generate a SCOM management pack, which defined the alerting thresholds for different events in the logs. We defined some sensible defaults for the location of log files on disk, as the logging location had previously been arbitrary19.

We pushed hard for log aggregation as a first-class capability. We took advantage of a security audit requirement that IT Support should not remote onto Production machines, for which log aggregation was recognised as a good solution. We gained agreement to make the log search available from tech team machines20, and also began to demonstrate the benefits of log aggregation for developers using ElasticSearch/LogStash/Kibana (ELK).

In a pilot run by one Dev team, the time taken for diagnosing deployment and runtime failures in their test environment was greatly reduced, and much developer excitement ensued: rather than having to remote onto each of five or six boxes, they could simply use Kibana from their browser to search across logs from all machines. After seeing this, I realised that using ELK in both Operations and Development produces an excellent opportunity for direct, practical collaboration between Ops and Dev, especially during a live incident, as both Dev and Ops people would be familiar with the same Lucene search queries in Kibana.
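
The ‘sensible defaults’ side of this can be sketched as follows; the paths and JSON format are illustrative assumptions, but the principle is that every application writes structured log lines to an agreed location so a LogStash-style shipper can pick them up with a single pattern:

    import json
    import logging
    import os
    from datetime import datetime, timezone

    # Hypothetical agreed convention: one directory per application under a
    # well-known root, so the log shipper only needs a single glob pattern.
    LOG_ROOT = r"D:\Logs" if os.name == "nt" else "/var/log/apps"

    def get_logger(app_name):
        os.makedirs(os.path.join(LOG_ROOT, app_name), exist_ok=True)
        handler = logging.FileHandler(
            os.path.join(LOG_ROOT, app_name, f"{app_name}.log"))

        class JsonFormatter(logging.Formatter):
            def format(self, record):
                # One JSON document per line is easy for LogStash to parse and
                # for Kibana/Lucene queries to search across machines.
                return json.dumps({
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "app": app_name,
                    "level": record.levelname,
                    "message": record.getMessage(),
                })

        handler.setFormatter(JsonFormatter())
        logger = logging.getLogger(app_name)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        return logger

    if __name__ == "__main__":
        get_logger("booking-service").info("Deployment completed")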

Run Book templates for an operability checklist

A slow-burn but effective approach to improving operability was to define a common Run Book template covering typical operational concerns, and to retro-fit Run Books to existing subsystems as well as applying them to new systems. The operational criteria covered how applications would address concerns such as:

  • Resilience, Fault Tolerance and High-Availability
  • Throttling and Partial Shutdown
  • Security and Access Control
  • Daylight-saving time changes

Prior to joining the organisation I had found that a focus on operational readiness - although painful for many developers - usually paid dividends after the software went into Production, with reduced errors, enhanced response to failure conditions, and better diagnostic capabilities. This resonated with the IT Service Manager and the Head of IT, and so we began collaborating regularly on the details of the Run Book template as a way to improve how software worked in Production.

An extract from an early Run Book template
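
Treating the Run Book template as a machine-checkable operability checklist might look like the sketch below; the section names come from the list above, while the data structure and example answers are invented for illustration:

    # A Run Book entry per subsystem, recording whether each operational
    # concern has been addressed (or explicitly marked not applicable).
    RUN_BOOK_SECTIONS = [
        "Resilience, Fault Tolerance and High-Availability",
        "Throttling and Partial Shutdown",
        "Security and Access Control",
        "Daylight-saving time changes",
    ]

    def missing_sections(run_book):
        # Anything left unanswered becomes a conversation between the Dev
        # team and IT Operations before go-live.
        return [s for s in RUN_BOOK_SECTIONS if not run_book.get(s)]

    if __name__ == "__main__":
        checkout_run_book = {
            "Resilience, Fault Tolerance and High-Availability": "Active/passive pair",
            "Security and Access Control": "AD groups; no shared accounts",
        }
        for section in missing_sections(checkout_run_book):
            print(f"Run Book incomplete: {section}")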

The act of collaborating in this way with key people in IT Operations was in itself useful for building trust between Dev and Ops. We also began introducing the Run Book template to Dev teams working on major new functionality as an operability checklist. If the Dev team was unfamiliar with one of the criteria, IT Operations were on hand to guide the team in some sensible defaults.

Branching is painful

One of the most awkward aspects of platform releases was that everything was released together at the same time. Originally, this had been a fairly sound design decision for the UI part of the codebase, but by 2011 almost all the software components and services were versioned in lockstep. Because the Regression Testing and pre-release activities took at least 6 weeks per release, and work for release N+2 overlapped with previous work for release N+1 and N, we had up to three active source code branches at any one time. The branch names (3.14, 3.15, 3.16, etc.) had to be created in separate Subversion repositories, and because Subversion was also used for artifacts storage the branch names had to match artifact repositories, too. To make matters worse, development for each release began in trunk and then switched to a named branch just before the release was cut:

Late branching with trunk - painful

This scheme combined the horrible inefficiencies of branch-based development with the pain that arises if there is a lack of discipline in a trunk-based approach. Development work often halted for several days around the branching activity, with build pipelines going red and managers putting pressure on the Dev teams in London and Bangalore to finish the branching. Work was regularly lost on the wrong branch, and had to be back-ported or removed from future branches.

We wanted to move to Trunk Based Development, but sensed it would take several years to achieve due to the difficulty of breaking out of commitments already given to external customers for feature X in release Y. Trunk Based Development was also made difficult by an internal rigid date-based release pattern, with release engineers staying awake for many nights at a time preparing and testing the release in Pre-Production.

We decided to remove the biggest source of noise from the platform branching scheme by moving the branching date to the start of each development cycle and using fixed branch names:

Early branching with named branches - less painful

This had the effect of moving away from the spurious trunk-based scheme to a staircase scheme21 that reduced the confusion and pressure around branching, allowing us to automate most of the branching activities using Chef. Because the branching became more deterministic, we could also push responsibility for getting branch builds green to the Dev teams, and over time this helped to align teams with certain products and parts of the system.

During 2013 we began to remove some software components from the platform branching scheme, giving them their own versioning scheme based on semantic versioning. This allowed teams to communicate intent through version numbering, especially when a new version contained breaking changes. Eventually we reached a point where a platform release was essentially a collection of metadata representing the version numbers of the different components & services to be released. We even moved to a declarative verification model where platform aggregation scripts would fail fast if expected components were deemed invalid:

Verifying the platform release aggregation
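
That fail-fast verification might look something like the sketch below, in which the platform release is a manifest of component versions and aggregation stops immediately if any expected artifact is missing; the manifest contents and artifact layout are hypothetical:

    import os
    import sys

    # Hypothetical platform manifest: component name -> semantic version.
    PLATFORM_MANIFEST = {
        "web-frontend": "2.4.1",
        "booking-service": "1.9.0",
        "payments-adapter": "3.0.2",
    }

    ARTIFACT_ROOT = "/srv/artifacts"   # assumed local artifact store layout

    def verify(manifest):
        problems = []
        for component, version in manifest.items():
            package = os.path.join(ARTIFACT_ROOT, component,
                                   f"{component}-{version}.nupkg")
            if not os.path.isfile(package):
                problems.append(f"missing artifact: {package}")
        return problems

    if __name__ == "__main__":
        problems = verify(PLATFORM_MANIFEST)
        if problems:
            # Fail fast so a broken aggregation never reaches the release team.
            print("\n".join(problems), file=sys.stderr)
            sys.exit(1)
        print("Platform aggregation verified: all expected components present")

Communicating intent through a version number in the manifest, rather than through a branch name, is what allowed individual components to escape the platform branching scheme.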

By using Go to orchestrate the replication and aggregation tasks, we could make visible the work in progress (WIP):

Making replication WIP visible

Over the course of 2 years we reduced branching activity time from 2 weeks to 2 days; when you consider that up to 10 Dev teams plus release teams and testers would block on branch completion, this represented a significant cost saving. It was also satisfying to see improved collaboration across teams as a result of the new branching scheme.

The artifact repository as a first-class service

A particularly troublesome part of the initial BDR infrastructure was the replication of source code and artifacts between London and Bangalore. Each location had a build farm which built separate sets of artifacts which were then combined during integration in London. The flow of materials between the two locations was very complicated and tortuous, using a combination of Git, Subversion, and svnsync for replication and duplication of source code and binary artifacts.

Initially it took over eight hours to replicate all the latest artifacts from Bangalore to London, so if a new build was needed we would often wait up to two days for a new set of artifacts owing to the time difference between the UK and India.

Replication represented one of our biggest and most costly delays. Over the course of 2 years, we managed to reduce replication time from 8 hours to 40 minutes for the main set of artifacts22. This significantly improved the experience for everyone involved in the build and release of software: dev teams, QA people, and release engineers, reducing waste and saving money. We achieved this through a combination of:

  • Fixing configuration problems with the MPLS link between Bangalore and London (at one point it was rate-limited to 32KB/s per TCP connection)
  • Replacing unreliable svnsync replication with a home-grown Subversion replication scheme based on rsync
  • Running a customised version of NuGet Gallery with Squid caching build artifacts
  • Moving more of the CI builds to London so we had to ship only source code
  • Splitting up the builds for Bangalore-based components so that artifacts could be shipped independently

Home-grown Subversion replication
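
The essence of a home-grown rsync-based scheme can be sketched as follows; the repository paths and destination host are hypothetical, and the real scripts handled many more edge cases:

    import os
    import shutil
    import subprocess

    # Hypothetical repository paths and destination host.
    REPO = "/srv/svn/platform"
    HOTCOPY = "/srv/svn-hotcopy/platform"
    REMOTE = "buildsync@london-svn:/srv/svn-replica/platform/"

    def replicate():
        # svnadmin hotcopy produces a consistent copy of the repository that
        # is safe to transfer while commits continue on the source.
        if os.path.exists(HOTCOPY):
            shutil.rmtree(HOTCOPY)
        subprocess.check_call(["svnadmin", "hotcopy", REPO, HOTCOPY])
        # rsync then ships only the changed files, which is far cheaper over a
        # constrained WAN link than replaying every revision with svnsync.
        subprocess.check_call(["rsync", "-az", "--delete", HOTCOPY + "/", REMOTE])

    if __name__ == "__main__":
        replicate()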

We eventually began moving to Artifactory for storing and replicating artifacts. Artifactory was useful not only for multi-site replication of deployable artifacts, but also for its native package repository support: NuGet for .NET compile-time packages; Chocolatey for .NET run-time packages; and Yum for RPM packages. We found that because Artifactory can be set up to replicate only artifacts that are actually needed in an environment, the volume of replication traffic was much lower, as was the time taken for replication.

Fostering communication, collaboration, and trust

Within the Build & Deployment group in London we had to interact with many different teams: the IT Support team for hardware-related issues with development infrastructure; the Release team for Production-focussed replication; and IT Security for firewall changes. We therefore set up regular meetings with these teams and fervently stuck to the schedule even if we had nothing particular to discuss. We had weekly informal sessions with:

  • The Build & Deployment team in Bangalore
  • The internal IT Support team
  • The Environments team
  • The Release team
  • Incident Management & Security
  • Software architecture
  • The QA team (although this was patchy/occasional)

The regularity of meetings was important, as it provided a rhythmic heartbeat for changes and improvements. Small problems were caught and addressed early on, before they became larger problems. In particular, the meetings helped to coordinate activities across different teams, which at the time often had very different drivers or success criteria. Regular sessions meant that people knew that the awkward stuff would be addressed at least once a week, so additional quick everyday discussions were based on higher trust and typically did not expand or go off-topic.

Raising awareness within teams and across the organisation

Team engagement

When I started in 2011, the Dev teams were separated physically from the Ops teams at opposite ends of the building with a cafe/seating area in the middle - a true Dev/Ops divide! I called this out on my first day, and we continued to try to find ways to break down this unhelpful split. Some things were small and simple, like a cheery Yuletide greeting:

Ops problem now

Other things were trivial: we took the technical books out of a storage cupboard where they were effectively hidden and put them on display in the open-plan area we used for tech talks in order to encourage people to borrow the books. We also devised and ran training courses in version control fundamentals for Ops people who had never used it before.

More fundamentally, we regularly presented and sought feedback at a weekly lunchtime pizza session, and encouraged people from IT Ops to present. By late 2013 the Incident Manager was doing a monthly report on recent live incidents, explaining what went wrong and how developers could help prevent a repeat incident in future.

Engineering Day

Early on it was clear that many teams did not understand the scale of the technical challenges ahead; for example, at one early meeting the idea of Infrastructure As Code was ridiculed by a fairly senior technologist23. To tackle this we conceived an internal technology day for all of the IT department, but as we started to explore talk topics we realised there was a great opportunity to widen event participation by opening it up to teams outside of IT. By presenting technical material in an easy-to-understand way to people in Commercial, Finance, Legal, Marketing, HR, etc., we would have the opportunity to win wider support for the changes we knew were needed for the next five years.

The first such ‘Engineering Day’ was held in September 2012 in London. There were 16 sessions of around 15-20 minutes each, with presenters from almost every team within the IT department. There was spontaneous cheering from non-IT attendees as the database team showed that a query which took over 6 minutes on the old database took only 5 seconds on the new one; we had cloud computing explained by the lead software architect, and a video link to the office in Bangalore for a session on build system monitoring.

We heard from the Commercial team, who had worked closely with front-end developers to build a mobile-friendly application, and saw demonstrations of how both Fisheye and Graphite could be used by tech teams and non-tech teams for business metrics. The IT Support team ran a tour of the comms room, and our team ran two workshops for complete beginners on HTML and websites with WordPress, which gave some attendees the sense of having super powers (“I programmed the web!”).

In all, the effect of Engineering Day on teams outside IT was strong:

Mahoosive thank-you to [IT] for the Engineering Day. We definitely learnt a lot and appreciate the effort put in to educate the upstairs folk on the technical shizzle going on downstairs. You are all heroes

We went on to run one of these tech events every six months for a half-day. Through these events, we almost forced a kind of collaboration between the different groups within IT, because on the day we had to present a unified message to the rest of the organisation, which meant finding common ground beforehand. It was great to see people from different teams speaking for perhaps the first time in front of a large group, and people across the organisation offered help with putting on the events.

By October 2013, the events had become a core part of the communications strategy for the IT department, and a good heartbeat for capturing and reflecting on successes and failures during the preceding six months. To ‘sell’ the idea of these events within your own organisation, I would emphasise that the events:

  1. Increase cross-team awareness and collaboration within IT
  2. Help to bridge the gap between IT and commercial/Programme teams
  3. Begin to increase faith in IT’s ability to communicate, to care, and to contribute

Reach out to the external tech community

Tech blog

One of the key differentiators for many tech-savvy organisations is their public presence: blogs, open source contributions, attendance and speaking at events, sponsoring of meetups and conferences, and interaction with the tech community in general. To help bootstrap this outreach activity, we started a public tech (engineering) blog covering all areas of IT at the organisation: software development, testing, web operations, performance, networks, etc. We worked with the People team (HR) on the messaging and integration with the jobs website; in fact, the main financial justification for the commercial blogging account was a reduced cost of acquiring new tech staff.

A year after the launch in September 2012 we had a modest but steadily increasing 1000 page views per month, and the blog was driving around 20 people per month to the jobs website.

Meetups and conferences

Members of the tech teams also began speaking at (and hosting) meetup groups and conferences. After speaking at WebPerfDays 2012 on how software architecture should be a function of build & deployment concerns, I met several fellow Continuous Delivery enthusiasts and we decided to start the London Continuous Delivery meetup group; we hosted the first session with an invited speaker at the London offices of the organisation, helping to bootstrap our engineering outreach efforts.

Meetup group at the London offices

People also began blogging about things like attending Velocity Conference, tech books such as Patterns for Performance and Operability, and moving the non-core applications to Continuous Delivery.

Results

During my time at the organisation, we in the Build & Deployment teams in London and Bangalore - together with help from many other people - made some significant changes to the way in which the software systems are built. We introduced and improved many different tools to provide enhanced capabilities:

  • Git instead of Subversion for source code
  • the wider use of version control in general
  • Gitolite for multi-site code replication
  • Fisheye for code review
  • Chef for infrastructure configuration
  • Vagrant for VM template changes and infrastructure development
  • Artifactory as a first-class artifacts store
  • Chocolatey as a .NET package management solution
  • LogStash as a log aggregation system for rapid diagnostics
  • Linux as the appropriate operating system (rather than Windows) for much of the auxiliary tooling
  • Graphite for time-series metrics
  • CI pipelines for infrastructure code
  • dissemination of the concept of ‘infrastructure as code’ itself
  • deployment pipelines that extend from code commit all the way to Production and visible to all

We also saw our lightweight approach to the operation of internal systems using JIRA tickets copied and adapted by many other teams in IT, increasing the visibility and awareness of work underway. The ability to show visually that work was increasing faster than we could tackle it helped significantly in winning support for additional team members and funding.

Reducing the time taken for replication and distribution of deployable artifacts from 8 hours to 40 minutes (or much less for some subsystems) was an important improvement that significantly sped up the workflow for many teams. We played a major role in stabilising the release cadence of the main platform to a regular six weeks (with a lead time of 12 weeks), and demonstrated how parts of the platform could be split off and released separately on a much shorter cycle without compromising reliability. We also provided clarity and sanity on changes to teams and responsibilities - particularly the move to ‘product-aligned’ teams rather than separate silos - and delivered a pragmatic, achievable roadmap for build & deployment for the subsequent three or four years.

Ultimately, however, I think it was the way in which we helped to bring together previously disparate teams that was the biggest achievement. The Engineering Day events had a marked positive effect on the relationship between IT and other departments, and within IT we facilitated collaboration between Development, QA, IT Support, Architecture, Incident Management, Operations, and Project Management. At one point in late 2013 an IT project manager said to me that the current wisdom within IT was “if you want your project to succeed, get the Eccentric Ninjas involved”; hyperbole, obviously, but it gives an indication of the extent to which we were joining up different bits of the IT department in our daily collaboration and communication.

I should mention that we did not magically fix everything, not by a long way; a substantial amount of work remained to be done in terms of both technology and teams. However, we put the Build & Deployment capability at the organisation on the right track, laying the ground for Continuous Delivery across a good part of the software systems, and opening up ways in which a DevOps culture could take root and grow.

Lessons learnt

My two years at the organisation were an excellent experience. I had the privilege of working with some amazing colleagues in both London and Bangalore, and I simply could not have done half of what I did without their encouragement and inspiration. In fact, we worked more closely and effectively with our counterparts in Bangalore than with some of the teams in London; I’m convinced that co-location of purpose beats co-location of desks every time. Above all, I learnt that a high-performing team can do astonishing things given the right context and autonomy.

Some other things I took from that time are:

  • Find opportunities and excuses for collaboration because it produces serendipitous benefits
    • There is a need to communicate often, with many people, and in simple terms, even when it feels unnatural or difficult. This can take time to have an effect (months or even years), but the effect can be transformative.
    • Regular communication - even if it seems a bit forced at first - helps to reinforce the connections between teams (rather like Hebbian learning in neural networks). With stronger connections, more trust builds, and more useful work can be done.
  • Non-IT teams really appreciate openness from IT
    • IT departments have a major opportunity to reach out to and impress other departments through good and well-executed communications
  • Find shared language and terminology to help bring teams together
    • Map ITIL lifecycle or incident management terminology onto software development phases
    • A focus on software operability can be a way to develop trust between Development and Operations
  • Conway’s Law applies not only to application architecture but to auxiliary tooling
    • Avoid centralised monolithic tooling unless shared state is needed (e.g. for version control or infrastructure configuration)
    • Separate teams should probably have separate instances of common tooling (e.g. deployment pipeline tools, monitoring config, etc.) or at least a way to configure and query these tools separately
  • Deployment pipelines should be visible and useful to all
    • Deployment pipelines are not just for moving artifacts towards Production, but for gaining trust and confidence
    • Make pipelines simple enough for less technical staff to be comfortable using them
  • Tools can be used in special ways to enable and facilitate collaboration
    • The collaboration aspect of the tools is often orthogonal to the main tool purpose
    • The way in which a tool is introduced - and the timescale for introduction - can strongly influence the success or failure of a new initiative

I would hesitate in 2015 and beyond before running build/deployment/replication infrastructure internally unless there is a very good reason to do so. Instead, I would start with established SaaS offerings for version control, deployable artifacts, continuous integration, build and test environments, log aggregation, metrics & monitoring, and (possibly) deployment pipelines. A reasonable driver for running some infrastructure internally might be the per-GB cost of a SaaS product, but before believing that the organisation will ‘save money’ by running a large build/deployment/replication estate internally, the organisation should weigh the SaaS costs against the salaries of a team of people (three? six? nine?) to build and operate such a system ‘as a service’ - the real cost is not small, and business-critical build systems do not ‘run themselves’.
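
The comparison is easy to sketch with some arithmetic; every figure below is invented purely for illustration:

    # All numbers are hypothetical, for illustration only.
    saas_annual_cost = 12 * 4000          # e.g. monthly SaaS bills for CI, artifacts, logs
    engineers_needed = 3                  # people to build and run an internal estate
    fully_loaded_salary = 80000           # salary plus overheads, per engineer per year
    internal_annual_cost = engineers_needed * fully_loaded_salary

    print(f"SaaS:     ~{saas_annual_cost:,} per year")
    print(f"In-house: ~{internal_annual_cost:,} per year (people alone, before hardware)")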

It is clear to me that the drivers of behaviour and technology - and the resulting systems architecture - at the organisation are quite typical of many successful organisations, particularly those organisations that do not see building and operating software systems as their core business; the challenges were certainly not unique to this organisation alone. I have since seen some very similar patterns at clients in many different sectors: tourism; financial data; hotels; betting; legal software; online donations; and media organisations. The threads common to these organisations are:

  • Building an internal capability is hard when the organisation has previously seen IT as a provider rather than a partner
  • Projects and budgets damage the viability of software systems
  • Most people don’t understand how crucial build & deployment has become

To address these challenges, we must:

  • Treat build and test infrastructure as a live system, and either outsource the headache by using SaaS or invest properly in internal infrastructure and teams to run it
  • Use programme-based funding rather than project-driven budgets
  • Treat software operability as a first-class concern, with a special focus on the ability to deploy and monitor software systems
  • Recognise that modern cloud-based systems are very different from the early web infrastructures, and require substantial ongoing investment in a ‘value-add’ operations capability (either SaaS or in-house). ‘NoOps’ does not mean ‘no operations’, and ‘DevOps’ does not mean developers running Production.

Finally, I realised that some people have a very strange idea of ‘communication’. During some difficult negotiations over responsibility for patching the Dev and Test servers, someone within IT Operations said “there is no communication problem on the Ops side: we updated the wiki page several weeks ago”. This view of ‘communication’ as a one-way flow of information (or worse, ‘requirements’) probably explains most of the communication problems seen in IT departments in many different organisations.

The word ‘communication’ comes from Latin communicare meaning ‘to share’. Effective collaboration for developing and operating complex software systems requires sharing: sharing of skills; sharing of ideas; sharing of responsibility; sharing of ‘failures’; and sharing of respect.

About the contributor