Scrum for Ops teams - James Betteley
Timeline: October 2012 to present (October 2014)
Fourth are one of those companies that people are always surprised to have never heard of before. They’re the world’s leading provider of cloud-based solutions for the hospitality industry. This means they’re pretty big, and since they’ve been around for quite a while, their systems are pretty big too. Looking after these systems is a very demanding job, and not for the faint of heart.
I joined the Infrastructure Department back in the autumn of 2012, with grand visions of automating the socks off everything and making the hard stuff look easy. It wasn’t long before I realised that there was a very good reason why the teams hadn’t done all of this stuff already – they were too busy. Yes, they were too busy to do the stuff that would help them to be less busy. To make matters worse, there was a backlog of other work to get done, as well as the day-to-day fire fighting. But, as is so often the case, the problem wasn’t technical in nature, it was about how the team was working – you could say it was more cultural than technical.
All Hands on Deck!
The Infrastructure Department was split into two teams: the IT Ops team, who looked after all the office-based systems including test environments (as well as doing desktop support), and (somewhat confusingly) the Infrastructure Team, who supported the live environments (webservers, database servers, load balancers etc). Both teams were always frantically busy. Most of the time they were busy doing what I call fire-fighting. They were responding to issues in the live environment, or issues from within the office itself. They were constantly being approached by individuals who had some urgent matter they needed resolving right there and then. Somewhat unsurprisingly, these issues would get addressed very quickly. In essence, the team were subconsciously prioritising these interruptions over the work already in their backlog.
What the team needed was a way of visualising their workload, prioritising it, and somehow giving accurate predictions about when it would get done.
Is it Scrum?
Where possible, I try to adopt a Kanban approach within Infrastructure teams, because it embraces the reality of their work, namely that “interruptions happen” and the future is unpredictable. However, the business needed firm dates for the completion of certain projects on the Infrastructure Department’s backlog, so I started to look for an alternative approach – something that embraced regular interruptions but also gave us some predictability.
Scrum offers a fantastic framework for delivering software. Delivery teams are generally ring-fenced (and therefore immune from interruptions) and once they’ve attained a sustainable velocity, they can make accurate predictions on how much work they can get through in any sprint, iteration or even a whole release. I wanted this level of predictability, but I couldn’t ring-fence any of the team, so I knew the interruptions would continue to happen and impact everyone. I desperately needed to get a handle on the amount of backlog work the team could complete each week, but it varied so wildly from one week to the next! So we decided to start tracking the interruptions instead, by recording each one on a sticky note.
We opted for a sprint-based approach in order to give ourselves a reasonably short timeframe, and immediately set about planning our first sprint. We guessed (yes, it would be wrong to call it anything but a guess!) that we’d lose about 50% of our time and capacity to unplanned interruptions, and also guessed at how much backlog work we could get through. We got ourselves a board and started adding our story cards. It felt like Scrum, it looked like Scrum, but somehow, it wasn’t really Scrum.
Points Not Hours
In another tip-of-the-hat to Scrum, we decided to start estimating in points rather than hours. We had a well-established “1 point” item, which everyone was familiar with, and sized all our other stories relative to it. We took stories from our backlog, and sometimes needed to break them down into smaller stories, but we stayed away from using hour-based tasks.
I wanted to avoid using hours at all costs, for many reasons. Firstly, hours are an absolute value, while points are a relative value. Humans just happen to be far more accurate when measuring things using a relative scale than an absolute scale. Secondly, I wanted the team to stop focusing on how long something would take, and consider other aspects such as complexity and unknowns with equal weighting. In my experience, as soon as you start dealing in hours, you automatically start focusing on how long things will take, and the other aspects get far less weighting.
Thirdly, I wanted to use something more abstract than hours, because it can often be too confusing for people outside of the team to understand why the total hours don’t add up to the total available resource hours. I was hoping to avoid that particular political hot potato.
Tackling the Interruptions
We knew we had an issue with interruptions, and there was a feeling that they were being prioritised over our backlog work, sometimes without justification. I suspected that in reality there was a combination of reasons why the interruptions got done over project work:
- They were usually smaller tasks and therefore easier to do!
- Someone was usually standing over you, telling you it was urgent
- They were “quick wins” which provided instant feedback, giving us the feeling that we’d actually done something useful!
- Sometimes it’s not very easy to tell someone that you aren’t going to do their request because it’s not high enough priority
By writing each interruption down on a card, and being able to look at the board and see the list of tasks we had committed to deliver, we were able to make better judgement calls on priority. It also became easier to show people that their requests were competing with a large number of other tasks which we’d promised to get done.
And so the first sprint began. Every time we got an interruption we wrote it down on a sticky note, and compared it with all of the cards in our “In Progress” column. If we felt it was a higher priority, then it would go straight into “In Progress”, replacing one of the existing cards (which would move back into “To Do”). Incidentally (and this was raised in our first retrospective), the sticky notes turned out to be less sticky than the name would suggest, and kept falling off the board, so we switched to pink/red record cards and good old Blu-Tack instead.
If the interruption was not deemed to be higher priority, then it either went into “To Do” or into an interruptions section of the Backlog. We sized the interruptions as we went along, always using the relative sizing we’d used for all the other stories.
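The triage rule described above can be sketched in a few lines of Python. This is only an illustration: the team made these calls by judgement at the board, so the numeric `priority` field, the `Card` class and the `triage` function are all assumptions introduced here, not something the team actually used.

```python
from dataclasses import dataclass


@dataclass
class Card:
    title: str
    priority: int          # assumed scale: higher number = more urgent
    points: int            # relative size, same scale as the backlog stories
    interruption: bool = False


def triage(card, in_progress, to_do):
    """Place a new interruption card and return the column it landed in.

    If the card outranks the lowest-priority card in "In Progress", it takes
    that card's place and the displaced card moves back to "To Do".
    Otherwise the new card queues in "To Do".
    """
    if not in_progress:
        in_progress.append(card)
        return "In Progress"
    lowest = min(in_progress, key=lambda c: c.priority)
    if card.priority > lowest.priority:
        in_progress.remove(lowest)
        to_do.append(lowest)
        in_progress.append(card)
        return "In Progress"
    to_do.append(card)
    return "To Do"


# Hypothetical cards, for illustration only.
in_progress = [Card("Patch web servers", 3, 5), Card("Upgrade load balancer", 5, 8)]
to_do = []
urgent = Card("Restore test environment", 4, 2, interruption=True)
print(triage(urgent, in_progress, to_do))
```

The point of the sketch is that every interruption is compared against committed work before anyone starts on it, rather than being picked up simply because someone is standing at the desk.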
One of the big desires of the team was to be able to give the business more accurate estimations on when our projects would be delivered. These were the bigger items on the backlog which we never seemed to be able to deliver on time because of all the interruptions.
We started out by having regular meetings with the business representatives, and asking them to arrange the projects in priority order. Once they’d done this we started to plan them into our sprints. After about 4 sprints we’d established that we were able to chomp through roughly 50 points of backlog work each sprint. After we estimated each backlog project, and armed with the knowledge that 50 was our magic number, we were able to give estimated delivery dates to the business backed up with figures to prove we could do it. Furthermore, we were able to push back on the business if they demanded that something get done by an unachievable date – we were able to provide them with the figures to prove that it couldn’t be done!
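The arithmetic behind those delivery dates is simple enough to sketch. The example below assumes a velocity of 50 backlog points per sprint, as described above. The sprint length, the start date, the project sizes and the `forecast_delivery` function itself are hypothetical values chosen for illustration.

```python
import math
from datetime import date, timedelta


def forecast_delivery(project_points, points_ahead, velocity, sprint_start, sprint_days):
    """Estimate when a project will be finished.

    project_points: estimated size of the project
    points_ahead:   points of higher-priority work queued in front of it
    velocity:       backlog points the team reliably completes per sprint
    Returns (sprints_needed, end date of the sprint in which it finishes).
    """
    total = points_ahead + project_points
    sprints_needed = math.ceil(total / velocity)
    finish = sprint_start + timedelta(days=sprints_needed * sprint_days)
    return sprints_needed, finish


# Hypothetical figures: a 120-point project behind 80 points of other work,
# with two-week sprints starting on 7 January 2013.
sprints, finish = forecast_delivery(120, 80, 50, date(2013, 1, 7), 14)
print(sprints, finish)  # 4 2013-03-04
```

The same calculation works in reverse for pushing back: if the requested date falls before the forecast sprint, the numbers show how many points would need to be dropped or reprioritised to hit it.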
In order for our projected delivery dates to be reliable, we needed to establish a maintainable velocity. It’s no good delivering 60 points in one sprint but only 10 the next, and so on – our velocity would be totally unreliable. To our surprise, the numbers were strangely consistent in one respect and inconsistent in another. The total number of points we delivered seemed to vary between 90 and 120, but the number of backlog points (i.e. planned points) seemed to remain consistent at 50 points! This was excellent news for us: we only needed to be reliable on the backlog points in order to be able to give estimated delivery dates for our projects. But I couldn’t help but wonder, “How are we able to do such varying numbers of points each sprint?”
If we added up all the time that we spent on unplanned interruptions and the time we spent on our backlog work, we would come up with a number significantly lower than the total number of hours available to the team. We were aware of this, but we weren’t sure where we were losing this time. I had previously read a book called Slack, by Tom DeMarco, in which he discusses the impact of context switching on a person’s productivity. Basically, context switching between a number of different small tasks kills your productivity. In essence, we need to do less in order to do more. I wondered if it was this context switching that was responsible for our lost time, and if it was in some way responsible for the different total points we were getting each sprint. The evidence seemed to back up this theory. The sprints which achieved the most points tended to have a larger number of medium-sized interruptions, whereas the sprints which achieved the lowest points were the ones with a mixture of large backlog stories and a high number of small interruptions. The cost of being constantly interrupted was high – it appeared that being regularly interrupted with small tasks was costing us up to 20 points per sprint.
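To make the "consistent but inconsistent" observation concrete, here is a small sketch that separates planned points from unplanned points and compares how much each varies. The per-sprint figures are invented for illustration and only chosen to fall within the ranges mentioned above; they are not the team's real data, and the `summarise` function is a hypothetical helper.

```python
from statistics import mean

# Invented per-sprint figures, for illustration only.
sprints = [
    {"planned": 50, "unplanned": 40},
    {"planned": 52, "unplanned": 68},
    {"planned": 48, "unplanned": 57},
    {"planned": 50, "unplanned": 45},
]


def summarise(sprints):
    """Compare the stability of planned velocity with total velocity."""
    planned = [s["planned"] for s in sprints]
    total = [s["planned"] + s["unplanned"] for s in sprints]
    return {
        "planned_mean": mean(planned),
        "planned_spread": max(planned) - min(planned),
        "total_spread": max(total) - min(total),
    }


print(summarise(sprints))
```

With these made-up numbers the planned points vary by only 4 from sprint to sprint while the totals vary by 30, which is the pattern that makes the planned figure, not the total, the useful one for forecasting.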
Burning Things Down
No, I’m not talking about arson, I’m talking about another tool we borrowed from Scrum: the burndown. But in a slight twist, we overlaid it with the number of unplanned interruptions we were completing as well, so that we could get a feel for where our efforts were being spent each day.
Over time we noticed some trends – for instance, whenever we saw a sharp increase in interruptions, it correlated with a plateau on the burndown. This was completely in line with expectations: it just meant we were too busy working on interruptions to make any progress with the planned work. We analysed the burndowns in our retrospective and it helped us to visualise the impact of interruptions on our commitments.
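The overlay can also be checked in code rather than by eye. The sketch below assumes two daily series recorded during a sprint: planned points remaining, and interruption points completed that day. The `plateau_days` function and the sample numbers are hypothetical, written only to illustrate the correlation described above.

```python
def plateau_days(remaining, interruptions):
    """Return the day indexes where planned work stalled while interruptions rose.

    remaining[i]:     planned points still to do at the end of day i
    interruptions[i]: interruption points completed during day i
    """
    flagged = []
    for day in range(1, len(remaining)):
        burned = remaining[day - 1] - remaining[day]
        rise = interruptions[day] - interruptions[day - 1]
        if burned == 0 and rise > 0:
            flagged.append(day)
    return flagged


# Invented sample data for a six-day stretch of a sprint.
remaining = [50, 44, 44, 38, 38, 30]
interruptions = [3, 4, 12, 5, 11, 6]
print(plateau_days(remaining, interruptions))  # [2, 4]
```

Flagged days are the ones worth raising in a retrospective, since they show exactly when unplanned work pushed the committed work aside.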
Did It Work?
Both the IT Ops and the Infrastructure teams continue to use this Scrum-based approach. Both have made some more tweaks over time, and the process will continue to evolve. The experiment was certainly a success, both within the teams and within the business as a whole. The teams found this Scrum-based way of working to be refreshing, and the business enjoyed greater visibility of their projects, as well as having more reliable project delivery dates! The Scrum approach also helped the Infrastructure Department work in a way that was more aligned with the development teams. They could align their sprints, help out with planning and estimation, and generally feel far more in sync with the development teams than previously. The main lesson we learned from this system, though, was nothing profound like “doing less means doing more” or anything like that. It was simply that if you raise a card for every interruption then a standard-sized sprint board just isn’t going to cut it - you’re going to need a bigger board!