Learning to dance to a faster rhythm - Chris O’Dell
Timeline: August 2010 to July 2014
7digital’s mission is to simplify access to the world’s music. They do that by offering a proven, robust and scalable technology platform that brings business and development agility. Long-lasting relationships with major and independent record labels and a strong content ingestion system have brought their catalogue to over 25 million tracks and counting.
More than 250 partners use 7digital’s music rights and technology to power services across mobile, desktop, cars and other connected devices. Their own music store (www.7digital.com) is localised for 20 countries, with apps available for all major operating systems.
Founded in 2004 in London’s Silicon Roundabout start-up scene, 7digital now employs more than 100 people, of whom roughly half are members of the Technology Teams, and has offices in Luxembourg, San Francisco, New York, and Auckland. 7digital serves on average 12,000 requests per minute through the API with an average response time of 120ms. It handles 3 million music downloads per month on average and can handle 22 million streams per month, serving petabytes of data.
The opening sonata
7digital aims to simplify access to the world’s music. This is done via a robust, scalable, music platform powered by a flexible API. Of course, it wasn’t always this way.
In 2004 7digital was born as a two-man startup in the Shoreditch area of London, before it was trendy. It was a web-based reseller of digital music - MP3s, ringtones and even some video clips. This was a time after Napster had peaked and iTunes had started to dominate the market. Starting a company selling digital music was considered madness. Regardless, the little company sold DRM-free music direct to consumers and via white-labelled miniature web stores.
7digital also sold the collated music metadata of their catalogue to clients allowing them to build their own stores with the music files being supplied by 7digital. This data was provided in the format of large, ever increasing, CSV files. One client did not wish to receive the full CSV files, and asked for a way they could query and retrieve music metadata whenever they needed it. What they wanted was a web based API.
A single ASP.Net WebForms application was created. It was built using shared libraries which already existed to serve the consumer-facing website and the white-labelled stores. This decision made a lot of sense at the time as the application’s purpose was simply to expose existing functionality via the web. This also meant that the API shared the same database as all of the other applications.
The application was developed with testing in mind, not exactly test driven, but there were end to end tests. These covered the small amount of functionality which the application provided and any new functionality was added with more end to end tests.
This approach served that client’s purpose very effectively, and soon enough other clients gained access to it. The API grew gradually as each new client brought their own needs. The size of the test suite increased and the team supporting it also grew.
All applications were set up to run Continuous Integration using a shared TeamCity server. Each commit would trigger a build and a run of the full test suite before deploying to a pre-production server on success. With most of the test suite being end to end tests the time taken to run the suite increased along with the size of the codebase. Before long it was taking over an hour to get feedback, by which time the developer had lost context and possibly moved onto some other task.
Features were added, bugs crept in, and load increased whilst performance decreased. The time needed to add features skyrocketed, and the development team were treading on each other’s toes to make changes.
With the API fast becoming the central part of 7digital’s platform, we realised we had to take a step back, review our current approach and make an architectural change. We realised that we needed to split the monolithic application into smaller, more manageable products and by extension smaller, more focussed teams.
As a small company in a fast moving industry we couldn’t afford to stand still. We could not take the time to develop a new version of the platform in parallel as a separate project. The changes had to be made to the existing application - we had to evolve it.
First we needed to get the current situation under control.
The slow adagio
When running a test suite takes over an hour, developers will start to employ a range of tactics for shortening the feedback loop. One example is to only run the obviously related tests on their local machine after making a change, thus leaving the full suite to be run by the Continuous Integration server upon commit.
This tactic relies on the developer knowing which tests are relevant and also remaining focussed whilst the full suite runs - it’s tempting to assume the work is complete when you’ve run all the ‘relevant’ tests.
The end to end tests also suffered from fragility and ‘bleed’ by requiring the data in the database to be in a particular state. We would experience flaky tests that seemed to fail for no reason other than the order of execution.
Another tactic was to keep a ‘golden database backup’ which contained the data the tests expected. It was a large backup which could not be reduced in size due to the tangled and unquantified actions of the end to end tests. It would be copied to a new starter’s machine like a rite of passage on their first day.
The above practices sound ridiculous, and they are, but you must realise that these things happen gradually - a single change or test at a time. As with the ‘golden database backup’ the pain is most evident when a new developer joins the team and the time it takes for them to get up and running is far longer than desired.
We knew this was a problem and that we needed to tackle it, but with such a large scope it was difficult to pin down. We took the approach that when working in a certain area you would review the associated tests, retain the main user journeys as end to end tests, and push the edge cases down to integration and unit tests. The edge cases included scenarios such as validation and error handling, which could be more easily tested closer to the implementation.
Because the application was built with ASP.Net WebForms, any testing other than end to end testing was extremely difficult: the presentation and logic layers were deeply intertwined. The HTTP Context also could not easily be abstracted away, something which Microsoft made easier in later frameworks such as ASP.Net MVC. We decided to refactor every WebForm into a Model-View-Presenter pattern such that the WebForm itself did as little as possible and the business logic was pushed down into a Presenter class. The Presenter took only the elements it required from the HTTP Context and returned a Model which the WebForm bound to. This allowed us to unit test the business logic in the Presenter without needing to invoke the full ASP.Net lifecycle.
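The shape of that refactoring can be sketched as follows. This is an illustration in Python rather than the ASP.Net code the team wrote, and every name in it (`TrackModel`, `TrackPresenter`, `lookup_track`, the `trackId` parameter) is hypothetical. The point it shows is that the Presenter receives only the values it needs from the request and returns a plain Model, so validation and error handling can be tested without a web server.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TrackModel:
    """Plain data the view binds to; a hypothetical shape for illustration."""
    status: int
    title: Optional[str] = None
    error: Optional[str] = None


class TrackPresenter:
    """Holds the logic that would otherwise live in the page's code-behind."""

    def __init__(self, lookup_track: Callable[[int], Optional[str]]):
        # The catalogue lookup is injected so that tests can supply a fake.
        self._lookup_track = lookup_track

    def present(self, track_id_param: Optional[str]) -> TrackModel:
        # Validation is an edge case that can now be unit tested directly.
        if track_id_param is None or not track_id_param.isdecimal():
            return TrackModel(status=400, error="trackId must be a whole number")
        title = self._lookup_track(int(track_id_param))
        if title is None:
            return TrackModel(status=404, error="track not found")
        return TrackModel(status=200, title=title)


# The view (the WebForm, in the original) would pass in only the query-string
# value it needs and bind to the returned model.
fake_catalogue = {42: "Example Track"}
presenter = TrackPresenter(fake_catalogue.get)
ok = presenter.present("42")
bad = presenter.present("abc")
```

With this split, the remaining end to end tests only need to cover the main user journey, while cases such as a malformed id are checked against `present` alone.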
These changes significantly extended the time it took to fix a bug or add a feature and there were times when a refactoring was considered large enough to be tackled as its own work item. These items would be labelled as Technical Debt and prioritised in the backlog alongside the rest of the items. We had the full support of our Product Manager who had seen and understood the impact which the poorly performing tests had on our productivity, the platform’s stability and our ability to deliver.
In our team area we had a small whiteboard where we noted down sections of code that we felt needed attention and we would regularly hold impromptu discussions around this board. This kept us focussed on the goal even when the changes seemed impossible and morale was low. Crossing items off the board was a reminder of our progress and a source of pride.
Work items were tracked on a simple spreadsheet where we entered the date we started development and the date it was done. Our definition of Done was when the code from the work item had been released to Production. Rob Bowley, VP of Technology at 7digital, performed some analysis on this data which he published in a report in May 2012 and a subsequent report in 2013.
The interesting findings from the report show the team’s cycle time during the period of refactoring greatly increased. The chart below shows a large spike where work items were taking more than 80 days to complete.
To enable the move to a Service Oriented Architecture a feature was added to the API codebase whereby incoming requests could be redirected to another service - an Internal API. The routes were configurable and stored in a database. The API would pattern match against the request URL and decide whether to handle the request itself or to pass it along to an Internal API.
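A minimal sketch of that routing decision, in Python and with made-up routes and host names (the real routes were stored in a database, and the real façade proxied the HTTP request rather than returning a string):

```python
import re
from typing import List, Optional, Tuple

# Hypothetical routing table. In the original the routes lived in a database;
# here they are an in-memory list of (URL pattern, internal API base URL).
ROUTES: List[Tuple[str, str]] = [
    (r"^/track/search", "http://search.internal.example"),
    (r"^/user/\d+/locker", "http://locker.internal.example"),
]


def resolve_route(path: str, routes: List[Tuple[str, str]] = ROUTES) -> Optional[str]:
    """Return the internal API URL for a path, or None to handle it locally."""
    for pattern, base_url in routes:
        if re.search(pattern, path):
            return base_url + path
    return None


def handle(path: str) -> str:
    """Decide whether the API serves the request or passes it along."""
    target = resolve_route(path)
    if target is None:
        return f"handled by the API itself: {path}"
    # A real façade would proxy the HTTP request; this sketch only reports it.
    return f"forwarded to {target}"
```

Because the table is data rather than code, moving one more route to an Internal API is a configuration change, which fits the one-route-at-a-time migration described in the text.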
With the API acting as a routing façade we were able to carve out chunks of the functionality along domain boundaries. Internal APIs were created for Payment Processing, Catalogue Searching, User Lockers (user access to previous music purchases), music downloading, music streaming and many other domains.
In all cases the changes were extremely gradual and took years of work, with a single route being replaced at a time. Some were rewritten completely in new frameworks whilst others were first carved out by duplicating the existing code as a new project and rewriting it separately from the API. Each domain called for a different approach. For example, the Search functionality was rewritten to use SOLR as a more appropriate datastore, while the Purchasing functionality was cut out as-is to isolate the functionality and make it easier to understand and test before attempting to rewrite it.
The development teams also split apart from the API team into domain focussed teams: a Payments Team, a Search Team, a Media Delivery Team and so on. Each team was now able to focus on a smaller subset of the overall platform, and to operate as mostly independent projects. With the original API now a façade each team could release almost all changes independently and without need for co-ordination between teams.
This separation allowed the teams to devise their own build and deployment scripts and finally move away from the now bulky Rake scripts. The Rake scripts were originally created to be a standard way of managing build, testing and deployment. Over time, features and exceptions had been added to them, eventually making them unwieldy, fragile and unintelligible. One team chose simple batch scripts for the deployment with TeamCity managing the build and test steps, whilst another team chose Node.js simply because it was the same language they were using to develop the application.
Even though the consuming projects themselves had been split up they were still tied together by shared libraries which held unknown, and possibly untested, quantities of business logic. Any changes to these shared libraries had to be co-ordinated between the teams to ensure that they pulled in the latest fixes.
Using TeamCity we changed the process so that changes to the shared libraries were picked up and pushed into the consuming applications automatically. This removed a barrier to refactoring the shared libraries - the work involved in ensuring consuming applications are updated - and so many more bug fixes and improvements were made to them. This did cause some problems, such as when a bug crept into a shared library and broke every consuming application, or when an application was not in a position to receive changes (for example during a large refactoring). Even so, we chose to receive fast feedback and consume smaller changes to the libraries rather than let them mount up into a large, scary change.
When the majority of the platform had been split out we turned our attention to replacing the shared libraries with services. This way we could isolate the domain they were intended to encapsulate and have the logic in one place - as per the SOA approach. With frequent deployments to the consuming applications these changes could be done gradually, first by wrapping the calls to the shared libraries then by replacing the wrapped functionality one piece at a time until the library was no longer needed. There was no big bang release where the libraries were removed, it was done in small continuous changes with little to no impact on the end consumers.
The dance of the minuet
Kanban was our chosen method for managing changes. Each team had their own kanban board, backlog and roadmap. We found that keeping our Work in Progress limit small promoted frequent releases and ensured that changes did not hang around unreleased for any length of time. We were able to experiment by implementing a change, releasing it quickly, and monitoring what happened.
Monitoring is an essential part of Continuous Delivery. If you are releasing changes in quick succession, you are doing so in order to gain feedback. We employed many tools for our monitoring including NewRelic, statsd and a logging platform comprising Redis, Logstash, ElasticSearch and Kibana.
Our monitoring gave us information about the performance of the platform, error data and its usage. If we had a theory about a particular area that may be causing a performance issue we would add metrics around it to get a baseline before making changes and watch for any improvement. This would be done in a series of releases, facilitated by the Continuous Delivery process. With the smaller applications and focussed teams we were able to try out changes to many areas of the system in parallel.
When replacing existing functionality, for example swapping a shared library that provided a user lookup for an internal API call with a REST URL per user id, we would first add metrics around the current functionality. We would add a counter for the number of calls, a counter for the number of errors, and a timer. This would give us our baseline. We would then replace the user lookup code with a call to the internal API and monitor the effect this had on the metrics. If it was detrimental we would roll back the change and investigate further.
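The three measurements can be sketched like this. The `Metrics` class is an in-memory stand-in written for this example, not the statsd client the team used, and `lookup_user_from_library` is a hypothetical placeholder for the shared-library call.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class Metrics:
    """In-memory stand-in for a metrics client such as statsd."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = defaultdict(int)
        self.timings_ms: Dict[str, List[float]] = defaultdict(list)

    def measure(self, name: str, operation: Callable[[], T]) -> T:
        """Count the call, time it, and count an error if it raises."""
        self.counters[f"{name}.calls"] += 1
        started = time.perf_counter()
        try:
            return operation()
        except Exception:
            self.counters[f"{name}.errors"] += 1
            raise
        finally:
            elapsed_ms = (time.perf_counter() - started) * 1000
            self.timings_ms[f"{name}.duration"].append(elapsed_ms)


def lookup_user_from_library(user_id: int) -> dict:
    # Hypothetical placeholder for the existing shared-library call.
    return {"id": user_id, "source": "shared-library"}


metrics = Metrics()
user = metrics.measure("user_lookup", lambda: lookup_user_from_library(7))
```

Keeping the metric name the same when the lambda is later pointed at the internal API is what makes the before and after figures comparable.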
Rolling back is another essential feature of Continuous Delivery which we used often. Being able to recover quickly from a bad change allowed the platform to continue to serve requests with minimum downtime. We implemented rollback as a redeploy of the last known good state. It was as quick as a normal deploy as it used the same process and ran all the relevant smoke tests upon completion. If there was any doubt that a change had caused negative effects then we rolled it back and investigated without the added pressure of downtime in a production environment. We also had all the data our monitoring tools had collected during that bad deploy to help isolate what had caused the issue.
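As a rough illustration of rollback being nothing more than a redeploy, consider the sketch below. The `Deployer` class and its `install` and `smoke_test` callables are hypothetical; they stand in for whatever the real deployment scripts did.

```python
from typing import Callable, List, Optional


class Deployer:
    """Sketch: rollback is just another deploy of the last known good version."""

    def __init__(self, install: Callable[[str], None], smoke_test: Callable[[str], bool]):
        self._install = install        # hypothetical: puts a version live
        self._smoke_test = smoke_test  # hypothetical: checks the live version
        self.good_versions: List[str] = []
        self.live: Optional[str] = None

    def deploy(self, version: str) -> bool:
        """Install a version and record it as good only if smoke tests pass."""
        self._install(version)
        self.live = version
        passed = self._smoke_test(version)
        if passed and (not self.good_versions or self.good_versions[-1] != version):
            self.good_versions.append(version)
        return passed

    def rollback(self) -> bool:
        """Redeploy the most recent good version other than the live one."""
        candidates = [v for v in self.good_versions if v != self.live]
        if not candidates:
            raise RuntimeError("no known good version to roll back to")
        bad = self.live
        if bad in self.good_versions:
            # It passed its smoke tests but misbehaved afterwards.
            self.good_versions.remove(bad)
        return self.deploy(candidates[-1])
```

Since `rollback` calls `deploy`, the rollback path is exercised on every ordinary release, so it is unlikely to be broken on the day it is needed.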
When serious downtime did occur we had to take steps to ensure it didn’t happen again. We held blameless post-mortems to ascertain how a scenario came to be, and created actions to put in place changes to prevent a recurrence. It is very important that such discussions are blameless, otherwise it becomes extremely difficult to discover what really happened and to make changes. We realised that we were all part of a system and that a series of events, rather than a single event, led to the downtime, and so we needed to change the system. The actions were followed up in a weekly meeting.
The closing sonata
Continuous Delivery at 7digital is more than the technical challenges. The changes made were not only to the code but also to our culture and how we approached development work.
Improvements to our automated testing meant the role of Quality Assurance moved to the front of the process rather than the traditional position of being after a release candidate has been created. Instead of verifying the accuracy of changes made, QA helped us to ensure that the changes we were making satisfied the requirements and that our understanding of the changes was correct. Together the developer and QA would devise acceptance criteria and tests, including automated acceptance tests, integration tests and unit tests.
The frequent releases, rollback procedure and monitoring allowed us to spike out a change and test it in production with real live data. For example, if we believed that caching user details would be advantageous we could add a simple cache with a short timeout and monitor the effect. If the spike proved successful we could then improve the caching strategy to add redundancy, graceful fallbacks and so on. This changes the way roadmaps are devised and how closely we work with Product Managers.
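A spike of the kind described, a simple cache with a short timeout, might look like the sketch below. The class name, the 30 second default and the hit and miss counters are assumptions made for illustration; the counters are there so that the effect can be monitored.

```python
import time
from typing import Callable, Dict, Tuple


class ShortLivedCache:
    """A deliberately simple cache: entries expire after ttl_seconds."""

    def __init__(self, load: Callable[[int], dict], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load    # hypothetical loader, e.g. the user lookup
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that tests need not sleep
        self._entries: Dict[int, Tuple[float, dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, user_id: int) -> dict:
        """Return a fresh cached value, or load and store a new one."""
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = self._load(user_id)
        self._entries[user_id] = (now, value)
        return value
```

There is no eviction, size limit or fallback here on purpose: the spike only has to show whether caching helps before a fuller strategy is worth building.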
The 7digital development teams no longer sit together, but rather they are situated near their internal clients - the Payments team are near the Customer Operations team, the Media Delivery Development team are near the Content Operations team, the API Routing team are near the Account Managers and so on. This promotes trust and transparency between the teams, leading to greater co-operation - we took full advantage of the ‘Water Cooler Effect’ for incidental conversations and creating relationships across departments.
It can be appealing to be continuously deploying changes all day, but we added some rules around it to ensure a good balance - no releases after 4pm, and no releases on a Friday. This may sound counter-intuitive to the trust we have in the system, but it ensured we maintained a sustainable pace and that people were focussed when making a release. A problem caused by a bad release could take hours to manifest (e.g. a memory leak), so preventing releases after 4pm ensured that people were available to notice such issues.
The same rule applied to the whole of Friday, as there are two whole days over the weekend when people may not be available. There was of course the option of agreeing a developer on-call support rota and allowing releases at any time, but this felt like an anti-solution when a sustainable pace is desired.
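The two release rules are simple enough to express as a check that a deployment script could run before releasing. This is a sketch, not something the text says was automated; blocking Saturday and Sunday as well is an assumption, as the text only mentions Friday and the 4pm cut-off.

```python
from datetime import datetime

RELEASE_CUTOFF_HOUR = 16  # no releases from 4pm onwards
FRIDAY = 4                # datetime.weekday(): Monday is 0, Friday is 4


def release_allowed(now: datetime) -> bool:
    """Apply the two rules from the text to a local timestamp."""
    if now.weekday() >= FRIDAY:
        # Friday is ruled out by the text; the weekend is an assumption.
        return False
    return now.hour < RELEASE_CUTOFF_HOUR
```

For example, `release_allowed(datetime(2014, 7, 10, 15, 59))` is true for a Thursday afternoon, while any time on Friday 11 July 2014 is rejected.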
7digital’s cycle time has demonstrably improved since these painful and laborious architecture changes were made. The work was difficult, took a very long while and at times felt like a Sisyphean task. We pushed on through and now the API Platform is continuously being released to production as small units multiple times a day, averaging 10 or more deployments a day.