Chapter 6: Managing Service Outages – The ITSM Iron Triangle


Hiu was manager of the network team, and from my initial look at the data, a lot of the incidents had involved his team in some way to resolve them. It didn’t necessarily mean his team was responsible for them. It just meant he had great visibility of what restored the service.

The references had been pretty clear. With only incident management, no matter how good we got at resolving outages, we could never reduce the number of incidents. That was where problem management made all the difference. I was convinced it had to be part of the solution, too. I needed the help of Hiu’s team if I were going to be able to get problem management off the ground. The question in my mind was whether I could convince him to be part of the solution.

I needed to find a way to get him on my side. I’d been told that Hiu was a big fan of churrascarias, those rodizio-style restaurants with roving passadores, servers who come to your table with meat on skewers and carve it to your order. It was a good place to meet on business. You could set up a slow pace to the meal, making it almost as good as golf for locking in plenty of time for talk.

“So I figure you must have a reason for buying me lunch besides wanting to spend an hour in my company, right?” said Hiu as he worked on his lunch. “After seeing the unceremonious way they bounced Sarah out for being unable to halt the outages, I guess you must be the next sacrificial offering. They’ve thrown you under the bus, but it hasn’t arrived yet. If you’re quick enough, you can escape, but otherwise, you’ll follow her out the door. Am I right?”

I didn’t answer. Somehow admitting to my own risky situation was not the kind of pressure I needed. “I’m still kind of new here. Did you know Sarah well?”

Hiu nodded. “We both started about the same time. It was too bad she got treated that way. She’d done a lot of good things while she was here. I guess the only thing that counts is what you’ve done for them today.”

“Are you at risk in your role? How’s the last thing you did?” I asked. I needed to know how motivated Hiu was if I asked him to change the way he was doing business.

“Given that I’m probably one of two or three people in the whole company who know how the entire network works, I’d say I’m pretty close to indispensable … at least more than you are,” he laughed.

I shrugged my shoulders. “I guess that’s one way to look at it. All depends on whether or not the number of incidents goes down. You know, those things that get you paged in the middle of the night and on weekends. If they don’t decrease, no one is safe.”

“Actually, I don’t mind getting paged. Neither does my team.” Hiu leaned across the table, poking the air between us with his knife. “It’s a real rush when you’re responding to business outages. The company is in crisis and everyone is counting on you to step in and work your magic to make everything right. I mean, who doesn’t love showing off their skills? Besides, the business is very appreciative towards those that solve those problems quickly. They definitely do not forget your name. You wouldn’t believe all the free lunches, swag and sports tickets they push my way in gratitude. And when we go for a week or two without an outage, they’re even more grateful.”

Hiu stopped and put his knife down. “And here’s some advice to you from someone who’s been around for a while. Nothing is better for you at review time than a big pile of certificates of excellence, and appreciation from our business partners. That always impresses leaders and ensures you get a nice raise.”

I didn’t even bother to try to explain the difference between incidents and problems at this point. That would be later. Right now I had to find something in his view of the world to grab onto, anything that would make him think about helping me. I asked the obvious. “Don’t you think they would be even more grateful if you prevented those outages?”

“Not a chance, Chris. I know that’s what they teach you in those ITIL classes. That may be the way academics and executives think, but you’ve got to look at it from the perspective of their operational teams, the people who get their hands dirty delivering results for business operations. If nothing ever breaks, IT is invisible. We become a utility. How many times have you called up the electric company and thanked them because the power didn’t go off at your house yesterday? People only value things they struggle for. Things that don’t kill you make you stronger, and things that cost you dearly are always valued more highly. If you make it too easy and give it to them for free, they will never appreciate it.”

“So it’s about doing what makes the business grateful to you personally, rather than doing what’s right for the business, even though you may not get credit for it?” I asked, trying to see how far he was going down this path.

“They aren’t exclusive, Chris. You’re thinking like an abstract academic, like this is a perfect world. In the real world you are never going to get rid of outages. Something will always happen. If you’re not ready for that, you shouldn’t be in IT. And while those outages are regrettable, and I wish they didn’t happen, the fact that they do simply reinforces to the business that the money they give to IT is important and well spent.”

Logic was not going to work with Hiu. There was no way I could change his beliefs in the timeframe I needed. Perhaps after he saw how the business reacted to making IT almost invisible, as he put it, because there were so few incidents, he might change. But I needed him to start systematically collecting data on incidents before I had any hope of getting problem management off the ground, and I was convinced that was the only way I could generate some successes and get in front of Jessica with my solution.

“Interesting. I’d never thought of it that way. So how do you identify to the business and leadership all the times you stepped up and went beyond the norm?” I asked.

“I can usually remember them, especially the recent ones.”

“I can see that,” I said, as the passador started slicing some roasted, yet unidentifiable hunk of flesh on a skewer, planted firmly on the table between us. “I just think it would be more impressive if they had some context; you know, how many there were in total, what happened, and why. It would make a powerful statement about how much work you’re actually doing for them. I know if I were in your place, that’s what I’d do.”

“All that tracking would be a pain. I just want to get in, fix it, and get out. More than that is way too much. They are not as fixated on the detail as someone in IT might be.”

“Maybe I can help,” I offered. “Our ticketing system already has the ability to track that. All it needs is a few other pieces of data, and it can produce all kinds of reports about how valuable your service has been to the business. It’s nothing new, just noting at the time of the incident the same stuff you carry in your head, like what you did to fix it, what caused it, which part of the business was impacted. You know all that already. I can show your team how to do it if you want. It’s up to you. But it would definitely give you a bigger list of valuable accomplishments to impress the business with.”

Hiu looked long and hard at me, and then smiled. “You know, Chris, when you first started here I figured you to be a bureaucratic jerk, whose sole mission in life was to slow us down. I guess I was wrong. Maybe tracking some of this data would be worth doing, so we can get credit for everything we do. But if I get a single complaint from my team, we stop it, right?”

I nodded. “Yep.”

“Then it’s a deal. Come lay it out for my team on Thursday at my staff meeting. And remember, if anybody complains, it’s over.”

“Thanks,” I said. “I appreciate the opportunity. I’m sure they will see the value in it, just as you do.”

“As far as all that ITIL standards stuff goes,” said Hiu. “Let me give you one bit of advice. I deal with a lot of networking standards, so I understand what they are like. Always make sure you adapt the standards to the situation. Don’t force the situation to meet the standard. You may not get that golden city on the hill the standard is focused on, but you’ll get something that gives you 80% of what you need. Don’t try to get to the end state on day one. Make progress your goal, rather than your ideal state.”

Hiu stood up and pushed his chair in. “Now, where is that dessert buffet and espresso bar? I need some serious sugar and caffeine to balance out all this protein.”

The meeting with Hiu’s team went much better than I had hoped. It helped that I’d visited each person individually before meeting with them as a group. That allowed me to ferret out their concerns. And it probably didn’t hurt that I had taken Hiu’s advice and made sure I crafted a process I could be happy with and they could live with, too. That made them feel like they had some skin in the game.

Initially, I felt guilty about compromising the standards I had learned. When I read the books and studied the standards, it all seemed very cut and dried. Focus on the standard and how you can make reality fit it. There didn’t seem to be any leeway to deviate from the standards and still be compliant. Settling for anything less than best practice seemed like failure. Somehow, adequate practice just didn’t have the same ring to it.

But while working with the network team, I realized Hiu was absolutely right. If you tried to force-fit high-level concepts into reality without taking the local landscape into account, the best you could possibly hope for was minimal task compliance. Anyone could do that with sufficient executive clout behind them.

Hiu had helped me realize that processes, like any other behavior change, only succeed and mature if people embrace them. Mere compliance was not enough. The real art to implementing behavior change was in understanding the needs of the process participants, and creating an intermediate state they could live with, which at the same time would move you toward the ideal state you desired. We might never get there in my lifetime, but my goal was reduction in outages, not elimination.

Once I realized that, it became clear I wasn’t so much concerned about the purity and perfect alignment with the ITIL standards. Instead, I was focused on getting something … anything started, no matter how ugly. I was going to start small and messy, and grow the maturity and completeness over time. But above all, I was going to morph to meet reality, and not ask reality to shift to conform.

I knew this was the right path, because by the time I was done, Hiu’s team was eager to participate, demanding to know when we would start. Once I had cracked that code to getting acceptance of my ideas, aligning the rest of the teams was easy. In fact, when they found out that Hiu’s team was onboard with me, they began to clamor for participation.

It was interesting how competitive the technical teams were when it came to performance. Perhaps it had something to do with the inherent role measurement plays in technical domains. I didn’t care. I was too pleased at how much easier it was to sell each team on the ideas once they knew other teams were getting onboard.

I’d even backed off on trying to end the superhero era and push Sean into a corner. There was no need to. Once the teams had started using the ticketing system to log some of the data surrounding incidents, it didn’t matter. They were pulling in all the data I needed. As long as the information about the incident was collected, I could live with superheroes. And if the goal of incident management was to get the business back online as quickly as possible, then maybe there was some value in a superhero approach for the time being.

My excitement about getting incident data entered into the database lasted only until I actually started to work with the data. The ticketing tool was an incredibly complex database a software engineer had built one weekend several years ago. She’d apparently tried to replicate the full functionality of ticket databases she worked with elsewhere. Leadership lauded her for it because it was free, and they didn’t have to use it. The technical teams loved it because it had been built in-house and confirmed that the company’s technical teams were better than any on the outside. And of course, they hadn’t been using it either.

Of course, the developer had long since left the company. Her code was idiosyncratic, unstructured and undocumented. Most of the fields were useless, except in a much more mature environment. Anyone who’d tried to make changes to the database usually ended up breaking something else. The database didn’t record very much, just some basic information, and most of that was contained in free-text fields. It was hardly the right way to do it. But then again, in a superhero-based world where you never worried about preventing incidents, who needed a lot of categorizing or sorting of information?

Everyone was entering data in different ways, and apparently in cryptic code only understandable to themselves. The only way I could make sense out of the notes was to wade through each individual ticket. Usually I ended up calling the person who made the notes and asking them to translate. The ironic part was that often even they could not make sense of their own notes.

I was deep into trying to understand the SEV1 tickets over the last 45 days when Ramesh walked into my cube. Before he could speak, my phone alerted for another SEV1.

“Looks like you are failing to shut these outages down,” he said with a scowl. “I haven’t seen any reduction in outages since you started working on this.”

“It takes time,” I mumbled, as I texted Sean to ask if he was going to lead the War Room. “We’ve done a lot to improve the process. It just takes time to have an effect.”

“Actually, what you’ve done after all these weeks is to get incident responders to fill out forms after the incident,” said Ramesh. “If that’s how ITIL is going to reduce outages, then I’m not very impressed with either its theory, or your practice. Something has to change, and you’d better figure it out soon. I doubt Jessica has a lot more patience.”

Ramesh took one look at the piles of paper on my desk and said, “What a mess. No wonder you don’t know what is going on. I want to see your quantitative evidence of how our incident situation has improved, by tomorrow.”

Just then Sean responded to my text with a message that he was busy elsewhere and I’d need to go manage the outage myself this time. This was going to be an ugly day.

Tips that would have helped Chris

Not everyone shares your likes or dislikes. Many IT people like the adrenaline of late-night paging and War Rooms. You need to find a common goal as the basis for building relationships, before you can work out a solution and gain commitment.

Try to build new processes so that they do not increase the workload on the technical SMEs. You want the processes ultimately to let them focus on their technical discipline, not on filling out forms. If you have to add bureaucracy to their workflow, make sure they understand why it is important and whom it benefits. And above all, make sure they have a hand in designing it.