Continuous Delivery across Time Zones and Cultures - Sriram Narayanan
Timeline: July 2011 to June 2012
ThoughtWorks is a global company of passionate technologists. They specialize in disruptive thinking and cutting-edge technology, and have a hard-line focus on delivery. Their clients are ambitious, with missions to change their industry, their government, and their society. ThoughtWorks provides software design and delivery, pioneering tools, and consulting to help those clients succeed.
Aiming for Continuous Delivery involves a journey of people, of processes, and of tools. It involves letting go of some responsibilities and taking up others, changing how people are measured, and resolving all manner of issues. Everyone needs to be a part of the solution and make things happen.
Over more than 6 years ThoughtWorks has helped one of its clients on its journey to broaden the customer base for its online ticketing services. The application stack comprises an ASP.NET-based web app that supports theming for multi-tenant hosting, a .NET-based web service layer, various custom components (some developed in house, some sourced from third parties), and an Oracle database. The web services also integrate with multiple third parties for ticket reservations, bookings, and validations of various sorts.
The Origin of the Build and Release Team
On smaller teams, the concept of shared responsibility works well. When the same source repository and environments are shared across time zones as well as organisations, there is a need for a single group to hold the pieces together. The programme had a Build and Release team which maintained build/deployment scripts and ThoughtWorks Go-based pipelines.
Various programme managers tried to rotate members in the team, which meant a number of developers got exposure to some form of infrastructure. However, when only developers work on infrastructure, the results may sometimes be suboptimal. Furthermore, the developers sensed they were getting out of touch with the application domain and wanted to get back to development. Finally, the team had grown weary of dealing with various instabilities - even with admin access there was only so much they could fix.
We invested time in learning how the build scripts were organized, how and why we branched, and the reasons for build and test failures. Certain jobs would run correctly on only one particular VM or a set of VMs; an integration test would fail intermittently and then pass soon afterwards. Despite parallelised builds of various components across VMs, we didn’t have predictable timings for the overall build stage. Our regression testing would take up to 18 hours, and some components would take much longer to build than others. Furthermore, during user acceptance testing (UAT), users would suddenly experience an unresponsive website.
Performant environments and reduced functional test times
The top complaint was “environment issues”, which were a combination of deployment misconfigurations and overcrowding of VMs. Regression testing took 18 hours on a good day if there were no environmental issues. We therefore decided first to stabilise the regression testing environment and speed up regression testing, then to stabilise the UAT environment, and lastly to stabilise and speed up the build and packaging process.
By dedicating high-end VMware servers to regression testing on our Selenium grid, we brought regression testing times down from 18 hours to 2 hours. Next we dedicated three high-end servers and SAN storage to hosting just the UAT environment and ensured it was not starved of CPU resources. Later, we isolated another physical server for performance testing. In all these cases we appointed individuals to be complete administrators of a specific environment, and introduced low-level monitoring of physical and virtual infrastructure. Our intent was to delegate administration, to remove the fear of infrastructure, to give people exposure across the full stack, and to help them understand that the infrastructure was now capable of handling load.
Network issues, bandwidth issues, and network latencies
We often had embarrassing situations where the application failed during showcases, code checkout took too long, or functional tests failed intermittently. While infrastructure can be scripted, the network layer is often mysterious and you have to be creative, since latency across distance can only be improved to an extent. We brought in a lot of network layer fixes, including redundant data links, SNMP monitoring of the link between the UK and Indian data centres, and a working SSL VPN connection for everyone. We also migrated the whole codebase from Subversion over svn+https to Git over git+ssh, with local mirrors reducing latency and SSH access improving speed.
Reducing build times
The core components in the application stack were compiled in parallel across a farm of build agents, and there was often no predictability in the times it took for various application components to compile. We used the Go API to get a sample of the build times for various parallel jobs, and split the list of jobs into two groups, long-running and short-running, each assigned to specific build agents. This brought the compile phase down from 75 minutes to 19 minutes.
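The split described above can be illustrated with a minimal sketch. It assumes the build-time samples have already been fetched (the Go API call is not shown) and uses the median of the per-job averages as the cut-off, which is an assumption made for illustration rather than the rule we actually applied. The job names are hypothetical.

```python
from statistics import median


def split_jobs(durations_by_job, threshold_minutes=None):
    """Split jobs into (long_running, short_running) by average duration.

    durations_by_job maps a job name to a list of sampled build times in
    minutes. If no threshold is given, the median of the per-job averages
    is used (an illustrative choice, not a rule from the programme).
    """
    averages = {
        job: sum(samples) / len(samples)
        for job, samples in durations_by_job.items()
        if samples  # ignore jobs with no samples
    }
    if not averages:
        return [], []
    if threshold_minutes is None:
        threshold_minutes = median(averages.values())
    long_running = sorted(j for j, avg in averages.items() if avg > threshold_minutes)
    short_running = sorted(j for j, avg in averages.items() if avg <= threshold_minutes)
    return long_running, short_running


# Hypothetical samples, in minutes, for four compile jobs.
samples = {
    "web": [30, 34],
    "services": [20, 22],
    "theme": [3, 5],
    "tools": [2, 2],
}
long_jobs, short_jobs = split_jobs(samples)
```

Each group could then be pinned to its own set of build agents so that short jobs are not queued behind long ones.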
During this analysis we also noticed that several components were getting compiled as part of other components’ build jobs. We created a team of two developers who broke down the 39-job compile stage into two smaller stages: one that built independent components and ran DB initialisation, followed by another that built components with dependencies. This meant issues with independent components or database initialisation could be reported faster.
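As a rough sketch of the two-stage idea, the function below partitions jobs into those with no dependencies and those with at least one. The dependency map and the job names are hypothetical; the real stage contained 39 jobs whose dependencies are not listed here.

```python
def split_into_stages(dependencies):
    """Partition jobs into (independent_stage, dependent_stage).

    dependencies maps each job name to the set of jobs it depends on.
    Jobs with an empty set can run in the first stage; the rest wait
    for the second stage.
    """
    independent = sorted(name for name, deps in dependencies.items() if not deps)
    dependent = sorted(name for name, deps in dependencies.items() if deps)
    return independent, dependent


# Hypothetical dependency map for illustration only.
deps = {
    "db-init": set(),
    "core-lib": set(),
    "web-services": {"core-lib"},
    "web-app": {"web-services", "core-lib"},
}
first_stage, second_stage = split_into_stages(deps)
```

A failure in the first stage is reported without waiting for any of the dependent jobs to start.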
Providing predictable environments
The existing VMs had existed for 5 years and had become virtualised Snowflake Servers. Our goal was to understand what it took to get a generic build agent. We’d create one, run a variety of build jobs on it, understand what was missing, and then spin up a new build agent and test that out. Over a period of time, we understood what it took to create a generic build agent and used Chef-based cookbooks to create the constituent VMs of a UAT environment. We moved to a forced refresh of VMs every 6 weeks, and created a Go job that could trigger environment creation on demand.
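The six-week refresh rule can be sketched as a simple age check. This is an illustration only: the VM names are hypothetical, and where the creation dates come from (for example a virtualisation inventory) is assumed rather than shown.

```python
from datetime import date, timedelta

# Refresh interval taken from the text: VMs are rebuilt every 6 weeks.
REFRESH_INTERVAL = timedelta(weeks=6)


def vms_due_for_refresh(created_on_by_vm, today):
    """Return the names of VMs that are at least REFRESH_INTERVAL old."""
    return sorted(
        name
        for name, created_on in created_on_by_vm.items()
        if today - created_on >= REFRESH_INTERVAL
    )


# Hypothetical inventory for illustration.
inventory = {
    "agent-01": date(2012, 1, 1),
    "agent-02": date(2012, 2, 20),
}
due = vms_due_for_refresh(inventory, today=date(2012, 3, 1))
```

A scheduled job could feed the resulting list to whatever recreates the VMs from the template and cookbooks.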
Improving test predictability and stabilizing tests
Over time, we identified several reasons for test failure. Parallel tests used the same database, there was no proper test data management, Selenium timeouts were not correctly configured, and our Selenium version was old. All of these caused intermittent problems, which people would overcome by simply retriggering the build. Once a predictable test environment was in place, the functional test automation team improved Selenium timeouts and we migrated to WebDriver.
We set up Graphite monitoring to measure the times that individual jobs took, and to notify us if a job took longer than usual. To help everyone realize that we were now having more green builds, we started to produce a daily report of green and red builds.
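A minimal sketch of this kind of monitoring follows. It formats a job duration in Graphite's plaintext protocol (one `path value timestamp` line per metric, sent to the Carbon listener, which defaults to port 2003) and flags a job as slow when it exceeds its recent average by a tolerance factor. The metric path, the 1.5 tolerance, and the averaging rule are assumptions made for illustration, not the configuration we ran.

```python
import socket


def graphite_line(metric_path, value, timestamp):
    """Format one metric in Graphite's plaintext protocol."""
    return f"{metric_path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host, port=2003):
    """Send a plaintext metric line to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


def is_slower_than_usual(duration, history, tolerance=1.5):
    """Return True if duration exceeds the mean of history by the tolerance.

    The tolerance factor is an illustrative assumption.
    """
    if not history:
        return False
    usual = sum(history) / len(history)
    return duration > usual * tolerance


# Hypothetical metric path and values, in seconds.
line = graphite_line("ci.compile.web.duration_seconds", 1140, 1330000000.7)
slow = is_slower_than_usual(1600, [900, 1000, 1100])
```

Calling `send_to_graphite(line, host)` requires a reachable Carbon host, so it is left out of the example run.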
We had to deal with the following ‘antipattern’ behaviour from developers:
- Checking new code onto broken code.
- Retriggering a broken build without actually fixing the core problems.
- Expecting the Build and Release team to investigate broken builds.
- Calling for email-based notifications when the build breaks.
Early on the teams had low confidence in the build and test environment, choosing instead to verify a build independently in a team-specific environment. People would commit code even if a build was broken, retrigger various jobs to get a green build, and ask the Build and Release team to investigate broken builds for them. My colleague and I realized that everyone had stopped treating the CI system as a CI system, and considered it merely a compilation farm.
Being a purist I disliked the notion of commit tokens and didn’t want to send out email notifications. Instituting such measures would introduce further bad behaviour and move us away from the practice of Continuous Integration. Instead, we introduced pre-commit hooks that would only allow commits on green or if the message contained “[Fixing Build]”. This didn’t encounter any opposition, and managers liked the onus it placed on fixing the build.
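The rule enforced by the hook can be sketched as a single decision function. How a hook obtains the current build status and the commit message depends on the version control system and the CI server, so that part is assumed and not shown; only the “[Fixing Build]” marker comes from the text.

```python
# Marker taken from the text; commits carrying it are allowed on a red build.
FIX_MARKER = "[Fixing Build]"


def commit_allowed(build_is_green, commit_message):
    """Allow a commit if the build is green or the message declares a fix."""
    return build_is_green or FIX_MARKER in commit_message


# Hypothetical commit messages for illustration.
ok_on_green = commit_allowed(True, "Add theme for new tenant")
blocked_on_red = commit_allowed(False, "Add theme for new tenant")
fix_on_red = commit_allowed(False, "[Fixing Build] Restore missing config key")
```

A hook script would typically exit with a non-zero status when this function returns False, which causes the commit to be rejected.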
We sometimes took shortcuts to get a green build, but we actively worked with developers on the longer term fixes. Once we had build monitoring in place it was easy to point out the top 5 jobs that failed the most and required attention. Based on this the programme manager would allocate developers to look at those jobs. Most errors turned out to be caused by dependency management rather than environments.
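Finding the jobs that fail the most is a simple counting exercise once build results are recorded. The sketch below assumes results are available as (job name, status) pairs with a status of "failed" or "passed"; that shape and the job names are hypothetical.

```python
from collections import Counter


def top_failing_jobs(build_results, limit=5):
    """Return up to `limit` (job, failure_count) pairs, most failures first."""
    failures = Counter(job for job, status in build_results if status == "failed")
    return failures.most_common(limit)


# Hypothetical build history for illustration.
history = [
    ("compile-web", "failed"),
    ("compile-web", "passed"),
    ("functional-tests", "failed"),
    ("functional-tests", "failed"),
    ("package", "passed"),
]
worst = top_failing_jobs(history)
```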
The pipeline structure
We ended up with the following pipeline:
Some artifacts were built in India and then published to the UK where they were integrated with the artifacts built by UK-based teams:
Lessons we learned for effective Continuous Delivery
- Push as much into the VM Template as possible. By pushing the platform stack binaries onto a template, we used BizTalk and SQL Server configuration tools to reduce VM preparation from 45 minutes to 10 minutes.
- Make use of OS facilities where possible. To run our Chef scripts on startup we wrote computer startup scripts instead of user logon scripts.
- Get system administrators comfortable with configuration and infrastructure. Had we been comfortable with configuration management and automated infrastructure sooner, we could have reduced cycle time sooner.
- Prioritise your solutions. While there were many burning fires, we chose to stabilize critical elements first.
- Facilitate rapid identification of failures. The first investigation of a broken build should be by the development teams so they own the act of keeping the build green.
- Showcase your ideas rather than seek permission. We spent a lot of time caught up in permission-seeking calls to use Chef, to discuss revisiting build scripts, etc. Showcasing ideas requires time, and it is best to always make sure you have a proof of concept before you start talking to the business.
Personal and team-related
- Ensure excellent relations with counterparts on the other side. Because we worked well with our counterparts on the UK-based build and IT support teams we got a lot done over email or the phone rather than via formal calls or inter-organisation catch up sessions.
- Ensure regular team member rotation. Exposure to infrastructure helps developers consider deployment and maintenance scenarios during development, and keeping the same people on a team for a long time can lead to them becoming out of touch or bored. Rotation can be a win-win.
- Work on getting the staffing right. Since our team had a development background we were able to work across the full stack and resolve issues earlier. Later on, we enabled developers to understand infrastructure and they were able to deal with the full stack.
- Have a ‘one team, one company’ mindset. Team and organisation affiliation leads to a ‘Not Invented Here’ syndrome, to treating ‘offshore’ folks as ‘resources’, and to a general slowdown. Thanks to a programme level drive to have a ‘one company’ mindset, we knew we were in this together. This was the single biggest contributor to success.
There are some other important activities we didn’t get around to solving:
- Test data management. Managing test data effectively is highly desirable in integration and functional testing. Conventional setup-teardown of test data may not work well if the tests are run in parallel.
- Test data at volume. We used to regularly face issues due to out-of-date Production data, but newer data was inaccessible to us due to data privacy issues. Somehow the technical team’s arguments for data obfuscation didn’t make it through.
- A single monitoring dashboard. We had separate dashboards for virtualisation, storage, networks, builds, and tests, as well as custom monitoring. We planned on having a unified dashboard, but didn’t get there.
- Environments on demand. We looked at a number of options for allowing development teams to create and manage environments on-demand, but didn’t have time to select an approach.
Build and Release, or Infrastructure Engineering?
While the team was chartered with taking care of build and release, we focussed solely on infrastructure automation, monitoring and predictability. Once we brought these in, developers and QAs were able to ensure test stability. Similarly, developers were able to work on packaging and deployment with more confidence than before. While we were lucky that developers picked this up, this also highlights that packaging and deployment is best handled by developers, since they know how the application works and what configuration it needs.