Chapter 11: Why Service Outages are Like Dandelions – The ITSM Iron Triangle

The War Room was jammed and noisy. The air was stale, and smelled of carpets and furniture that hadn’t been cleaned in a long time. Several large bags of empty Chinese take-out containers were overflowing the trash can. And someone was already making the fourth pot of coffee.

The good news was, we were just starting to get some traction on the first SEV1 service outage of the evening, when the second one cascaded in. But nobody got flustered like in the old days. The incident response team passed the work out, and assigned people to the second one without a hitch.

All the work everyone had put into improving incident management was paying off. The junior SMEs were focused on restoration and workarounds, while the more senior staff dug into the events, trying to identify and capture key bits of data that would prove critical later, during the root cause phase of problem management.

I was really proud of them, and I guess a little of myself. I was the architect behind the new processes. I had translated the standards into procedures and tasks that meshed well with our organization and skill sets. The tools were still crude. We were loading data into spreadsheets and simple databases, but that had been a conscious decision. It didn’t make sense to stand up a tool until we knew what we needed that tool to do. I kept reminding myself, first people, then process, then tools.

It had taken a lot of work to get incident management aligned with problem management, so both processes were using the same categorization schemes. I’d long since lost track of how many hours I’d spent with each of the technical teams convincing them to give it a chance. Fortunately, it had paid off almost immediately.

Now it was so much easier to track incidents directly, from incident management, through root cause and remediation. Trends were even starting to appear. The problem management meetings were now reasonably well attended by the managers of the technical teams, and although it took us a little while, everyone finally agreed that the meeting was not a working session; it was strictly for status readouts and making commitments. I knew that Ramesh was pleased at how far we’d come when he started asking when problem management was going to start reducing the number of incidents. And in talking with one of Mia’s DBAs, I discovered what a cultural klutz I had been during my meeting with her. It had taken a little while, but I had even repaired that relationship, and she was now participating in problem management, along with everyone else.

But two major SEV1s at the same time was going to be a challenge. It would mean careful prioritization of resources. Sean was a great help there. He seemed to instinctively know where and how he could skimp at any given moment, without slowing down the final result. It was almost as if, from experience, he was able to carry a master project plan around in his head.

I was feeling confident we would be able to handle today’s issues, when an alert for the third SEV1 popped up on my phone, followed less than two minutes later by a fourth. I stuck my phone back in my pocket, only to have it start ringing again. I felt only a little better when I saw it was Ramesh, and not another SEV1 coming in.

“Hi Ramesh, what can I do for you?”

“Chris, what is the status of the incidents? We seem to be having a lot of them all at once. Why is that?” From the echo in his voice, he was on speakerphone. I waited for a second to see if he would let me know who was there with him. It wasn’t forthcoming. I tried to be subtle.

“I’m here in the War Room. We’re working on it right now. We’ll get all services restored as quickly as possible. Does anyone there with you have questions?” Not an overly slick way to get him to tell me who else was with him, but about as direct as I thought I could be.

“This is Jason, the VP of sales and marketing, and I want to know what the hell is going on,” boomed a resonant voice. Jason had a reputation as the golden rainmaker as far as the business was concerned. No matter what the economic situation, product status, or customer issue, his team always came back with the sale and made the revenue. From what I’d read in the annual report, he was very well compensated for his efforts. More importantly, since his team was the one that delivered the company’s revenue, he pretty much got whatever he wanted, whenever he wanted it.

“We’re executing our incident management process,” I said. “We’re focused on getting these IT services restored to you as quickly as possible. We have representatives from each of the technical teams, as well as our known error and workaround database. We’ll be … ”

Jason cut me off. “Don’t ever tell me what you are going to do, because I don’t work on promises. I want commitments to solutions and closure. I want to know why we’re continuing to have these disruptions in service, and when they will stop. I was promised you were going to make these disruptions go away. You seem to be failing from where I sit.”

“I apologize for the disruption of IT services, Jason. While the incident is not my fault, I do accept it as my problem to investigate and identify some preventative remediation.”

“Being willing to fall on your sword does not impress me,” snapped Jason. “Frankly, I don’t care if you keep your job or not. All I care about is stopping IT from continually disrupting the hard work of my team due to IT’s inability to do anything without breaking it.”

I wasn’t going to rise to the verbal bait and get in an argument with Jason. That seemed to be what he wanted. I simply reminded myself that this wasn’t personal; it wasn’t about me.

“We’ll be assessing today’s events during the problem management meeting tomorrow,” I said. “If you, or any of your team, would like to attend, we would love the participation.”

“If I send one of my staff to your meeting,” said Jason, “will you waste their time, or will you actually fix the source of these disruptions?”

“There is more value we add to it.”

“Here it comes,” said Jason. There was a moment of silence, then I heard him in a much quieter voice say, “Yes … it’s on mute, Ramesh. Don’t worry. They can’t hear us. Do you have a replacement available for Chris? Every word I hear sounds like a loser forecasting more failure coming down the road at us. Chris should go sooner rather than later. I don’t know what you are waiting for. Just pull the damn trigger, man.”

“Don’t worry,” said Ramesh. “Succession planning is a key part of what I do for all of my employees. I have someone lined up and ready.”

“You’ve talked to them about it?” asked Jason.

“Absolutely,” said Ramesh. “They understand the full situation, and are eager to step in if Chris can’t fix things.”

“Good,” said Jason. “But I wouldn’t wait too long. Because if things don’t improve quickly, or they get worse, I will be having a similar conversation with Jessica about you. Do you understand?”

“Got it,” said Ramesh.

There was the beeping of some fumbled buttons on Ramesh’s phone and then his voice.

“Sorry we had to drop off, Chris. Jessica stopped by … ”

Jason added, “And she wanted to talk about a few things that didn’t involve you.”

I smiled because although I had found out weeks ago that the mute button on Ramesh’s speakerphone no longer worked, obviously he hadn’t figured it out yet.

“No problem,” I said. “I don’t need to be burdened with things that don’t involve me or my work.”

The problem management meeting started right on time. Jason had assigned Meredith as his representative at the meeting. I’d worked with her before the meeting, so that she was fully briefed on the problem management process.

As usual, the first thing we did was review the incoming SEV1 incidents that had occurred since the last meeting, to determine if they should be taken through the full root cause process of problem management, or simply filed for future reference. An incident was filed because the risk of recurrence was low, because it was part of an existing problem already being worked, because it had a very low impact, or because there simply wasn’t enough forensic information to make a root cause determination. This was a data filter we used to ensure that we focused on the events that were important, and that we had a chance of impacting.

Of the five SEV1s we’d had in the last week, it turned out that only one was appropriate for problem management. Three of them had insufficient forensic evidence to even attempt a complete root cause. The other one was a complete one-off, and everyone agreed the odds of something like that happening again were minimal.

I had explained this part of the process thoroughly to Meredith, and I thought she understood that some, but not all, SEV1s end up getting a full and final root cause in problem management. Some go no further than the incident trigger stage of analysis. During the briefing, she’d been resistant to a lot of what I’d told her, and occasionally turned almost hostile. But I thought we had reached an agreement as professionals that, while we might personally disagree on some details, we were both working toward the same ends, and could work together using this process for the good of the company. It turns out my assessment of her state of mind was all wrong.

“Everyone needs to stop right now,” she said. “This is just not right. We need to be doing a full root cause on every SEV1 that occurs. We can’t just toss the hard ones away and concentrate only on the easy ones. Why do you even bother meeting? This is just a waste of time if you throw out these incidents that are hurting the business.”

“We don’t throw them away,” I responded. “They become part of our database that we use when evaluating whether or not new SEV1s should go through root cause. We keep them around just in case patterns start to occur. What we don’t do is take up a lot of time doing root causes that realistically will have little operational or preventative benefit.”

“That may be good in theory,” added Meredith. “But from what I can see, you haven’t really had any success using this plan, have you? The number of incidents remains unchanged, and if last week is any indicator, they have actually increased. How does your theory respond to the fact that incidents are actually increasing, because of the way you run these meetings?”

“We don’t run these meetings based on theory,” I said. “We take all that we have learned, and continue to learn, and continually improve and upgrade the process.”

“Well then, you must not have learned much, because what you’re doing isn’t working. In sales and marketing we measure success by results, and so far no one has seen any.” She gestured around at the entire room. “You … all of you, are hurting the business and our customers by what you’re doing. I don’t care if you waste your own time. IT seems pretty good at that. But when your fiddling around reduces the company’s opportunities for success, then I get mad. So please excuse my being emotional and disrupting the meeting, but this company means a lot to me. Too many people are working too hard to make it a success, for me to sit idly by and watch you do nothing to help it.”

“Meredith, it’s not a question of supporting, or not supporting, the company. It’s about leveraging the resources IT has, by first focusing on what matters most. We start there and grow better. That is … ”

“No, Chris,” said Nicola. “Meredith is correct. The only measure that matters is whether or not we have succeeded in reducing the number of incidents. Everything else is just about the method of execution.”

“He’s right,” said Jose. “Look at how many incidents we actually took into problem management for root cause and remediation. There aren’t that many.”

“Did the ones we took in get all the way through and actually get remediated?” I asked.

“Yeah,” said Jose.

“And of the ones that were remediated, where things were changed in technology, processes, or people activities, have any of those recurred? Have they become repeats?”

“Well, no.”

“And had any of them been recurring issues occurring over and over again that continued to impact the business?”

“No,” offered Mia. “My team was very pleased that we addressed the issue of memory leaks on the Windows® servers that kept corrupting the database indexes, and requiring substantial downtime on all the Windows® servers to repair.”

“Yes,” I offered. “That was a huge event when that happened. Do you remember that, Meredith?”

“Yes, that was right after I came back to work from my maternity leave. But there were plenty of other issues at the same time.”

“Yes, and that’s because by closing off some of the biggest pain points, we can focus on others. Look, things will always break. People will always make mistakes. Incidents will never go away. Problems will always be with us. What’s important is that we continually work on finding them and clearing them out. Problem management is not about getting to a point where there are no incidents. That will never happen. Problem management is about having a rational, cost-effective way to separate lesser issues from those with the potential to seriously damage the company, and then focusing on the subset of those that you can actually fix. Over time the definition of what constitutes a substantial issue may change, and even when you think you have them all handled, more will erupt.”

“So what you are saying,” said Nicola, “is that our work here will never reach a destination; it will always be on a journey of improvement, and how we make that journey is the value we add to the company.”

I smiled. “Nicola, are you trying to get my job now?”

Tips that would have helped Chris

Take a moment now and then to assess how far you have come. It is easy to look only at how far you have to go. Looking at where you began, and where you are now, is very important and energizing. Celebrate your success. You have earned it.

At the end of the day, it does not matter how efficient or ITIL compliant your solution is. What matters is how effective your solution is in reducing pain for your users. It is easy to spot the people who confuse this by looking at the metrics they use to measure their performance. How many of your KPIs and metrics track user experience and satisfaction, versus how many track how well items move through your solution process? If it is mostly the latter, then you should consider some new metrics and a shift in focus.
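
One rough way to run that check on your own metrics is sketched below. It is only an illustration: the `Metric` class, the `user_experience_share` function, the sample KPI names, and the 50 percent cut-off are all assumptions made for this example, not part of ITIL or of any particular tool. Substitute your own list and your own judgment about which metrics truly reflect what users feel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One KPI, labeled by what it actually measures (hypothetical structure)."""
    name: str
    tracks_user_experience: bool  # False means it tracks flow through the process


def user_experience_share(metrics):
    """Return the fraction of metrics that track user experience."""
    if not metrics:
        raise ValueError("no metrics to assess")
    return sum(m.tracks_user_experience for m in metrics) / len(metrics)


# Hypothetical KPI list, for illustration only.
kpis = [
    Metric("user-reported satisfaction after restoration", True),
    Metric("hours of service disruption felt by users", True),
    Metric("tickets closed per week", False),
    Metric("average time a problem record spends in each stage", False),
    Metric("percentage of records with complete categorization", False),
]

share = user_experience_share(kpis)
print(f"{share:.0%} of metrics track user experience")

# Assumed threshold: "mostly the latter" is read here as less than half.
if share < 0.5:
    print("Mostly process metrics: consider new metrics and a shift in focus")
```

The labeling of each metric is the part that takes honesty. A count of closed tickets can look like a user measure, but it only tells you how quickly items moved through the process.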