Staying Xtremely Unruly through Growth - Alex Wilson and Benji Weber
Timeline: November 2009 onwards
Marketing technology company Unruly is the leading global platform for social video marketing and works with top brands and their agencies to get their videos watched, tracked and shared across the Open Web. Our programmatic video platform reaches and engages the half of the global Internet population who are watching and sharing videos outside of YouTube.
Founded in 2006, Unruly has 12 offices and employs 150 people globally. In 2012, Unruly secured a $25 million Series A investment led by Amadeus, Van den Ende & Deitmers and Business Growth Fund.
Unruly currently has a tech team of 25 spread across 3 development teams. Unruly makes use of Extreme Programming (XP) principles across all of these teams and throughout the rest of the business.
Unruly’s product development teams are cross-skilled, with developers responsible not just for development, but future product direction and keeping systems operational as well. All production code is written in pairs using Test-Driven Development.
“I will never forget my first day at Unruly. We chose who would pair-program with whom in our morning team huddle, then sat down and wrote a bit of code. About an hour later we had made a few tweaks to the appearance of one of our ad units and the developer I was pairing with turned to me and said - ‘Ok, let’s deploy’. I was surprised and somewhat concerned. It was my first day and I was about to deploy changes to production that only my partner and I had tested. Where was the QA team to throw the changes over the fence to? Where was the release board requiring triplicate sign-off on changes?” — Benji
The above reaction is often echoed by developers joining Unruly. Once you get over the initial apprehension, working in this way soon becomes addictive as you can see your work generating value immediately. Each pair does a bit of work then hits the deploy button, waits a couple of minutes for the deploy script to run all the tests and release the changes. Once it is complete we can then wander over to an actual user’s desk to find out if it is working as they desire.
We realised some time ago this was a fundamentally better way of working. There is no room for hours or days of wasted work building things that nobody wanted in the first place.
Unruly has worked this way from the outset, as its founders had experienced the many advantages of Continuous Delivery prior to starting the company, so we can’t tell you about how much better things got after switching to a Continuous Delivery approach. Rather, we’ve made lots of small incremental improvements as we have grown, while striving to maintain the rapid pace of delivery that we value. We’d like to share some lessons we’ve learnt along the way.
Testing in Production
When the organisation was small enough to fit into a single room it was easy to only deploy changes that customers were happy with. Our team’s direct customers were typically other Unrulies. Depending on the impact of the change, we’d either get them to look at what we were about to deploy, or deploy it first and then get feedback. We could deploy frequently and receive rapid synchronous feedback on our changes.
However, as we grew and launched offices all around the globe this became more difficult. Our development team is based in London, and when we built features for people in our American offices (San Francisco, in particular) we only had a few minutes of overlap time each day to discuss our change. It was hard to find time to review changes with customers before deploying them, and consequently our deploy frequency dropped.
We considered introducing a staging environment where we could demonstrate changes to our overseas users before release, but ultimately decided this was the wrong path. It would require maintaining and securing a lot more infrastructure, and would have required us to maintain code branches while we were waiting for feedback.
Instead, we opted to use Feature Toggles in our production environment to give selected users access to new features that were partially complete. This allowed us to continue integrating our changes with the rest of the development team and release them to production while we awaited feedback.
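As a minimal sketch of the idea (not Unruly's actual implementation), a per-user Feature Toggle check might look like the following in Python. The class, the feature name "new_report" and the user names are all hypothetical, and a real system would load the toggle state from configuration rather than build it in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FeatureToggles:
    """Hypothetical in-memory registry: feature name -> users who may see it."""
    enabled_for: Dict[str, Set[str]] = field(default_factory=dict)

    def enable(self, feature: str, user: str) -> None:
        # Grant one selected user access to a partially complete feature.
        self.enabled_for.setdefault(feature, set()).add(user)

    def is_enabled(self, feature: str, user: str) -> bool:
        # Unknown features are treated as switched off for everyone.
        return user in self.enabled_for.get(feature, set())


def render_campaign_page(toggles: FeatureToggles, user: str) -> List[str]:
    """Return the page sections a user sees; the new one sits behind a toggle."""
    sections = ["campaign summary"]
    if toggles.is_enabled("new_report", user):
        sections.append("new report (work in progress)")
    return sections


toggles = FeatureToggles()
toggles.enable("new_report", "sf-account-manager")

print(render_campaign_page(toggles, "sf-account-manager"))
print(render_campaign_page(toggles, "another-user"))
```

The unfinished code path is deployed to production with everything else, but only the selected user reaches it, which is what lets integration continue while feedback is pending.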
The result was that our end-of-day calls consisted of walking the customers through the features in the environment they were already using every day. We could discuss potential improvements for the next day and they could see how the new features interacted with the data and events in the live environment.
Co-ordinating Shared Infrastructure
As the rest of the company grew, so did the product development team. Eventually we found the team was large enough that we were spreading ourselves too thin and becoming unfocussed on product direction, so we split along product boundaries into 3 sub-teams - one for ad-unit management in our video distribution product Unruly Activate, one for campaign management in Unruly Activate, and one for our Unruly Analytics product. Each team had its own infrastructure concerns but everyone was responsible for shared infrastructure such as monitoring, provisioning, and configuration management.
Since we now had 3 teams working by-and-large on entirely different projects which were adding new features and systems faster than before, the parts of our set-up which were used by all teams started to be seen as “not as important”, “not worth prioritising”, and worst of all - “not our problem”. This led to some incidents where problems with our Nagios and Puppetmasters caused production issues. The incidents were ultimately caused by defects that each team had assumed another would fix.
We counteracted this by introducing a cross-team squad responsible for these priorities, consisting of one representative from each of the teams and our site reliability specialist. We affectionately refer to this group as Borat Squad, after the @DEVOPS_BORAT Twitter account.
During each iteration, which varies between 2 and 3 weeks depending on team, we allocate 2 pair days to the Borat Squad. This gives us 3 pair days across all 3 teams to work on shared issues, and allows us to continue practising XP by sharing knowledge across the teams.
Due to the set-up and efforts of the Borat squad we’ve since solidified our infrastructure across all teams. Everyone runs on the same operating system, uses the same tooling, and deploys in the same way. Reducing variance in this way has allowed us to effectively focus our efforts on increasing overall quality in incremental steps.
Making things visible
One of the most memorable emails we received at Unruly was from our CDN provider. It politely enquired as to whether we were aware that over 80% of our traffic was resulting in server errors. It was accompanied by a graph like the following
There was a moment of panic as we looked to see whether we were still serving ads, and whether we were making any money. It turned out we were serving ads as normal, which is why we had not been alerted by our own monitoring. The problem turned out to be that a change we had deployed in our ad-units had caused one particular version of Internet Explorer to miss the end condition of a loop counter and continually request non-existent files.
At times like this there is the temptation to react by adding more checks and reviews to our deployment process to help us avoid releasing things that are broken. Could we have avoided the problem by having more people review a changeset before each release? Possibly. Could we have avoided the problem by having more automated or manual testing? Possibly.
On reflection we realised that this was the wrong reaction. Any given bug might have been caught by these extra steps, or it might not. We tend to only catch problems we’ve anticipated in our tests. We realised we had not spotted this particular problem because the things we had cared about enough to monitor and test were still working as expected. Our real issue was we had no way of spotting problems that we hadn’t thought to alert ourselves on.
In response, we started building dashboards with live graphs showing various aspects of our production systems, not just operational metrics but business metrics such as number of ad impressions, click through rates, even the rate of money flowing through the system. We made these visible by putting them on displays next to our team board. The result of this was we started noticing things that alerting alone would not have told us. For instance, we saw unusual patterns in our traffic when publishers had misconfigured our ad tags, and when we once again deployed a change that broke a subset of browsers we were able to rapidly respond and fix the problem. Our traffic has natural peaks and troughs and as a result it could have been a few minutes before our alerting system would have worked out that there was a serious problem.
The visible dashboards also help us to get synchronous feedback from our deploys. If we are deploying something it’s natural to glance at the dashboards just afterwards to see if everything looks OK.
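To illustrate the kind of figures such a dashboard might plot, here is a small Python sketch that summarises one time window of traffic. It is an assumption-laden example rather than a description of our tooling: the event names, the tuple format and the choice to count any HTTP status of 400 or above as an error are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def dashboard_metrics(events: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Summarise one window of (event_type, http_status) pairs."""
    by_type: Counter = Counter()
    errors = 0
    total = 0
    for event_type, status in events:
        total += 1
        by_type[event_type] += 1
        if status >= 400:  # assumption: 4xx and 5xx both count as errors
            errors += 1
    impressions = by_type["impression"]
    return {
        "impressions": impressions,
        "click_through_rate": by_type["click"] / impressions if impressions else 0.0,
        "error_share": errors / total if total else 0.0,
    }


# Hypothetical window: ads are served normally, yet most requests are for
# files that do not exist, as in the incident described above.
window = (
    [("impression", 200)] * 10
    + [("click", 200)]
    + [("asset_request", 404)] * 44
)
print(dashboard_metrics(window))
```

In this made-up window the impression count looks healthy while the share of failing requests is very high, which is the sort of mismatch that is easy to see on a graph but that an alert on impressions alone would never raise.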
Like many organisations, we started out with a couple of monolithic applications. As these grew ever bigger, our deploy speed reduced. Test suites took longer to run, and it became harder to reason about the consequences of a change.
Partly as a reaction to this we started moving towards smaller, interoperating components. At the same time, we were transitioning our source code from Subversion to Git and initially stored each small component in its own Git repo. It didn’t take long for this to cause us serious problems with our rapid release culture. We found we would deploy one component and inadvertently break other components that relied on it. Individual components were now faster to deploy, but we had less confidence in the system as a whole.
We experimented with strict versioning schemes for services, but this increased both deployment and operational complexity. Without a single revision identifier for an atomic changeset we had to calculate the order in which changes should be deployed, and multiple versions of the same service were running at the same time. We also felt we weren’t truly integrating our changes continuously because old code was running in production.
In response, we are now coalescing our Git repositories into bigger repositories which each contain multiple projects. This means we have atomic history again and it also makes it easier for our deploy tooling to work out what has changed and needs to be re-deployed to production. We are also focussing a lot more on monitoring our production system. We choose to omit some acceptance tests on production deploy if it is highly unlikely the deployment will have an impact on a particular system, but there’s no reason our acceptance tests can’t check our production system.
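A rough Python sketch of how deploy tooling could work out what needs re-deploying from a single changeset is shown below. It is illustrative only: it assumes each top-level directory of the repository is one project, that the list of changed paths has already been obtained (for example from git diff --name-only between two revisions), and that a hand-maintained map of dependents exists. The project names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set

# Hypothetical layout: each top-level directory is a project, mapped to the
# projects that rely on it and so should be re-tested and re-deployed with it.
DEPENDENTS: Dict[str, List[str]] = {
    "targeting": ["ad-units"],
    "ad-units": [],
    "analytics": [],
}


def projects_to_deploy(changed_paths: Iterable[str],
                       dependents: Dict[str, List[str]]) -> Set[str]:
    """Return every project touched by the changeset, plus its dependents."""
    pending: List[str] = []
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Files outside any known project (e.g. a top-level README) are ignored.
        if parts and parts[0] in dependents:
            pending.append(parts[0])

    affected: Set[str] = set()
    while pending:
        project = pending.pop()
        if project not in affected:
            affected.add(project)
            pending.extend(dependents.get(project, []))
    return affected


print(sorted(projects_to_deploy(["targeting/server.py", "README"], DEPENDENTS)))
```

Because all projects share one history, a single revision identifier is enough input; including dependents is one way to guard against deploying a component and inadvertently breaking those that rely on it.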
Delivering major design changes incrementally
We originally served our ad-units as static files from S3. This is reliable and allows us to still serve ad traffic when other parts of our infrastructure are unavailable, but gives us almost no control over what happens between the assets and the ad-unit. When our feature set expanded, the number of required assets ballooned and bigger and more numerous files had to be uploaded, until response times became unacceptable.
Furthermore, we observed the landscape of the online advertising ecosystem changing, and so we decided to move our entire ad-serving stack to an architecture more consistent with OpenRTB. Since we are consistently serving multiple thousands of ad impressions per second, it wasn’t possible for us to simply switch off and over. As we deploy straight to production to get feedback as soon as possible, this meant that we’d have to get the performance characteristics exactly right first time - a risky proposition at best.
In the end, we decided to not change what was being served, just the way we served it - an implementation of Canary Releasing. We designed a set of targeting servers which would at first just transparently proxy requests onto the S3 bucket, and leveraged weighted DNS to help with the migration - since S3 is globally available we needed to do a little work to achieve similar resiliency. We ended up executing this in a 3-stage plan.
- Set up a new DNS record to point to our static assets
- Spin up targeting servers in the required regions and add them to the new DNS record
- Send 10% of the traffic to the new servers and monitor
When we were happy with the performance in production of how we were serving the new assets we would ramp up the traffic by another 10% or so, and leave it running for a few days to expose any performance bugs that we hadn’t caught at lower levels of traffic (ulimits - d’oh!).
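The ramp-up described above can be sketched in Python as a schedule of traffic percentages plus a simulation of weighted answers. This is a simplified model under stated assumptions, not the DNS configuration we used: the function names are hypothetical, the fixed 10-point step stands in for "10% or so", and a seeded random number generator imitates the weighting so that the result is repeatable.

```python
import random
from typing import List


def ramp_schedule(start: int = 10, step: int = 10) -> List[int]:
    """Percentages of traffic sent to the new servers at each stage."""
    return list(range(start, 100, step)) + [100]


def choose_target(new_weight_percent: int, rng: random.Random) -> str:
    """Mimic one weighted answer: 'new' servers or the existing 's3' origin."""
    return "new" if rng.random() * 100 < new_weight_percent else "s3"


def observed_share(new_weight_percent: int, lookups: int, seed: int = 0) -> float:
    """Fraction of simulated lookups that were routed to the new servers."""
    rng = random.Random(seed)
    hits = sum(
        choose_target(new_weight_percent, rng) == "new" for _ in range(lookups)
    )
    return hits / lookups


print(ramp_schedule())
print(observed_share(10, 10_000))
```

At each stage only the weight changes; what is served stays the same, so the traffic share can be raised once production metrics look healthy or dropped back if they do not.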
Developers at Unruly tend to be fans of testing in production, and deploying new infrastructure changes is no exception - such events are often accompanied by the mantra of “Let’s shake the production tree and see what falls out”.
After around 6 weeks we were serving all traffic through these new servers and iterating on new features quickly, which we weren’t able to do when we were serving static files from S3. Our success with this approach meant that when we recently needed to split our targeting servers into a client/server architecture, we repeated this workflow and were able to turn the change around in just over 3 weeks of work - and most of that time was spent on the actual application development.
Operating as a full-stack team affords us many opportunities to practise both XP and Continuous Delivery at all levels of our software development process - to us there is no distinction between parts of the process that would be siloed away from each other in other companies.
We’ve had our share of problems during our growth period, but operating in a fashion that emphasises Continuous Delivery means that each time we’ve been able to conquer our issues and improve, as well as consistently deliver value to the business at each step along the way.