Establishing a Business-Driven, Multi-Factor Priority Model Shared by Multiple ITSM Processes

Yesterday, Troy DuMoulin (@TroyDuMoulin), VP of Research and Development at Pink Elephant, joined Everbridge for an ITSM deep dive webinar on prioritization, classification, escalation and alerting.

During the webinar, Troy offered advice and best practices around establishing a business-driven, multi-factor priority model shared by multiple ITSM processes, and explained how this model drives automation related to escalation and alerting practices. If you missed the webinar, download the slides below, read the transcript or watch the on-demand recording.

“Gaining Velocity Through Alignment of Purpose”

Troy’s overarching theme during the webinar was the need to align processes, prioritization and classification in order to gain “collective velocity,” or speed with direction. Troy reminded us that “(i)f each IT group were being agile in different directions, using different classification structures, or even the same classification structures in different ways, then you can actually kill velocity.”

Although ITIL is a great source of knowledge on the lifecycle of how things are delivered, Troy reiterated that in order for the processes within ITIL to work in a linear, streamlined manner, an integrated priority and classification model needs to be the glue that holds them together.

The Need for an Integrated Framework and Process Priority Model

Imagine an IT department where there is no agreement on how much impact a given incident has on the business customer. In this organization one person believes an incident to represent a minor impact while another believes the sky is falling. If this sounds familiar then consider that an organization that does not have a solid agreement around a model for establishing ticket priority has no hope of supporting published service level agreements.

Many organizations, Troy states, typically start with incident management and define classification and prioritization from an incident perspective only. When this method is used, the model may contain verbs and adjectives that don’t directly translate, or aren’t applicable, to problem, change, and release management, leading to siloed process frameworks. By creating an integrated model, you create a “backbone” across all ITSM processes that helps organizations gain collective velocity, says Troy. Once there is an integrated model in place, it enables IT teams to get a better sense of any repeat issues and trends, make correlations, drive automation around queue assignment, escalations, and alerting, as well as drive SLAs around service restoration.

“Before we can automate something it has to be defined and actually efficient and effective”
– Troy DuMoulin   

Key Takeaways from the Webinar

  • Classification Structures
    • Should describe generically an environment and/or specific service – they are not your CMDB
    • Need to support multiple processes, so don’t use verbs or adjectives reflective of the service/hardware, for example “Password reset”
    • Can include an “other” category, but it should be managed to avoid incidents and problems going unnoticed
    • Should be a shared policy and not have ungoverned control where anyone can add to or modify the structure
  • Prioritization Considerations
    • Priority should be defined based on customer or business need, financial impact, service criticality, business risk, component failure impact analysis, legal requirements, etc…
    • Impact analysis should be based upon degree or scope of the service outage, qualitative and quantitative study into the effects upon other areas of the business, degree of the consequence, data sensitivity, etc…
    • Urgency needs to be tied back to the mission of the organization and needs to be agreed upon ahead of time by everyone across the organization.
    • For urgency and impact, stick to a simple four-level model (e.g. Low, Medium, High, Critical) – don’t go beyond four
    • Pre-classify services and applications relative to the urgency factor so that urgency can be auto-populated based on the pre-defined model

Once you have a shared classification structure and priority model across multiple processes, you can automate escalations, alerting, and notifications, Troy highlighted.

“If you can’t agree on a shared priority model and what that looks like, then ask yourself ‘What hope do you have to stand up behind any SLA?'”
– Troy DuMoulin   

To learn more and hear more from Troy on the subject, watch the on-demand recording and check out the slides below or download them from SlideShare. The transcript is also available below so you can read along!

Webinar Transcript

Troy DuMoulin

All right, so we are indeed doing a deep dive into the mysterious world of IT prioritization and classification. Now there’s this dream out there that we can automate things, and there’s a lot of great tools like the Everbridge product and service management suites, that do just that. They take this promise of escalation, notification, automation and they make it real, but, there is a but, before we can automate something it has to be defined and actually efficient and effective. You know, reality is, we have to know what we’re automating and ensure what we build is something worth automating. That’s the context of what I’ll be sharing with you today.

Context is a good word to start this off, because in the world of big data, we have lots of data out there produced by all of these different products around monitoring and escalation and notification, ticketing tools, workflow tools, et cetera, but how do I find a way to take all this, which for now is just data, and give it context? You might have heard in the past of the DIKW data continuum. Data first has to be introduced as information. That information has to be the beginning of context, so what does it mean? That information allows me to take that and apply some knowledge. What does it mean for me in the future? Then, wisdom in respect to, how can I proactively use this in the collective context? That whole context is set by, as we’re going to see, a series of classification structures which are critical when you’re implementing an ITSM overall process.

All right, so our agenda today, we’re going to look at just that, why do we need to think about an integrated framework or integrated process priority model? That will bring us into an example of classification structures, prioritization model, and then we’ll end with escalation and alerting. A key thing I want to bring up before we dive into the details of these various agenda items is that a lean principle for a system of value comprised of various agents, and those agents can be people, groups, systems, that’s a concept of systems thinking, and complex adaptive systems operate in the principle of multiple domains and dimensions and participants all working collectively.

Now, the word velocity I’ve chosen here for a very specific and purposeful reason, versus agility. Velocity is defined by speed with direction. For that value system, complex and adaptive as it might be, to gain collective velocity, they need to all be constant in how they do things and how they prioritize, how they practice certain activities. If each of those dimensions/agents/groups/silos were all being, let’s use the word agile, and again, not a negative word in its own right, but agile means to be responsive and quick and nimble, but if they’re all being agile in different directions using different classification structures, or even the same classification structure in different ways, then you can actually kill velocity.

A key principle, principle being the thing we all believe is true, is that we share constancy of purpose, and it’s only when that complex adaptive system shares constancy of purpose that I can gain collective velocity. This is a key principle as we go into this, thinking about this system velocity premise.

Okay, so everything in life begins with an event. In previous webinars, we’ve talked about this a little bit already, but in essence, not all events are negative. Some are simply there for acknowledging the fact that something has occurred, right? I have a back up that’s completed, a batch job has run, I’ve been able to restore a file. These are just notifications, but there are some events out there that deal in the world of anxiety.

Anxiety is something we want to avoid, but life is full of anxiety, so we try to minimize the impact of anxiety and to shorten those timelines in which we actually respond to it. This is where we get into the reality of, we’re getting an unavailable status. The availability of the service is in question, or the service is degraded. The performance isn’t actually up to par, and so people’s willingness to be patient, of course, being a key issue, but it all begins with an event.

Now, what we do with these events is going to be critical, right? Event management isn’t just, “Let’s throw up a monitoring tool, and let’s see, you know, green and yellow and red lights.” Event management would also say, “When I do have a yellow and or red condition, and it’s a situation where I have stress and anxiety, then what am I going to do? Am I going to flow this alert into an incident management process for correlation and prioritization and potential restoration? Am I going to be tracking the potential repeat patterns of these events to understand where I might have repeat issues which might be corresponding to an overall systemic problem, then I have problem management investigate?”

There’s actually flow, workflow, process flow, that actually would be built behind these monitoring tools, which would allow me to shorten the time from the point of view of events discovered to the concept of, now what do I do next? These workflows have to be enabled by classification structures. What am I dealing with? What service is it? What priority am I dealing with in respect to the urgency of reaction in this context? Event might be the beginning, but it’s more than just a monitoring tool.
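The routing described above can be sketched in a few lines of Python. This is only an illustration: the `Event` record, the green/yellow/red status values, the destination names and the repeat threshold of three are all assumptions made for the example, not features of any particular monitoring product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # hypothetical: the component that raised the event
    status: str   # assumed values: "green", "yellow" or "red"


def route(event: Event) -> str:
    """Decide what happens next for a single event."""
    if event.status == "green":
        # Informational only, e.g. a backup completed or a batch job ran.
        return "log-only"
    # Yellow or red conditions flow into incident management.
    return "incident-management"


def repeat_offenders(events, threshold=3):
    """Sources with repeated non-green events: candidates for problem management."""
    counts = Counter(e.source for e in events if e.status != "green")
    return sorted(src for src, n in counts.items() if n >= threshold)
```

A source that keeps turning yellow or red shows up in `repeat_offenders`, which is the hook for handing the pattern to problem management rather than restoring service one event at a time.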

A little bit about why we even have to think about these classifications in respect to multiple processes. ITIL is a great source of knowledge in respect to the life cycle of how things are delivered. In essence, we’re in a large demand supply chain. You might not have thought of it that way, but that’s how it works. We get demand in one end, be it strategic, tactical or operational demand, that turns the crank of a factory, which creates or enhances value, and on the other end we supply.

Now, there’s a lot of processes in that service, strategy, design, transition, operation world, but the thing is, these processes are not individual processes or silos. They’re a flow, you can call it a process architecture if you’d like, where we take stuff coming in one end and it produces outcome out the other. That would indicate there’s a great deal of dependence or integration within these processes, which there is. You have to think about the ITIL processes across that life cycle.

I like to use the metaphor, think about this room full of chairs, right? All these chairs are tied together by string. You knock a chair over in the middle, the three chairs beside it, they fall over, and in the back of the room, one moves six inches. Unfortunately, you often feel like a long-tailed cat in a room full of rocking chairs, but that’s the reality of how it goes. If I’m going to have such an integrated framework, I’m going to have to consider what it means to keep these processes glued together. The basis for that gluing together of these processes is going to be the classification, prioritization, escalation models we’re about to talk about.

This is just a basic set of ITIL processes, not the whole life cycle. It’s starting from the bottom up. We might get incoming, incoming be it service restoration issues, complaints from service. You might have requests for fulfillment. These are all going to be, of course, linked to this configuration management database. You’re going to have event correlation across all of those. Those would indicate potential need for change relative to service restoration and or enhancement. All those changes need to be bundled up into a larger context of a release. We have this veritable complex system of processes which require classification structures shared among them to be able to work as a shared process model or architecture.

A challenge is that people don’t normally think of it this way. They’ll attack the ITIL processes one by one. In fact, what typically happens is to divide and conquer, we’ll say, “Hey, Mary, you take change and Joe, you take problem, and Jeff, you’ve got incident.” They all go and prosper in their own corners, come up with their own priority, their own models of process and classification. They might even use different tools, SharePoint for change management, ticketing tool for incident, Excel for problem, and in the end, they’ve basically painted themselves into proverbial process corners because they haven’t built an integrated anything. They’ve implemented these processes in silos.

Now, I want you to think about this from a more integrated perspective. Right, so typically you might start your journey on incident management. Most companies do, because you have to do a decent break fix to be called a service organization. You might start there, and you might think about all of these models we’re about to talk about from an incident perspective only. You might, in your classification structure, actually embed verbs and adjectives respective of just incident management, but that’s a problem because that classification structure around priority and category type item or service classification, would also be potentially the one you use for problem, which would also be the thing you use for change, release, service level monitoring, all of them.

Think about it this way. Literally, I have problem records, which I could attach to multiple incident records, right, so I’ve got this problem record and many incident children are hanging off of it. Then, that problem record will produce a known error, and that known error record will basically attach to a change record. That change record then might be one of many in a release.

I literally have this backbone, or this vertebrae, this connecting tissue between all of these processes. If I don’t get this right, I literally can break the back of my ITIL process automation. They share common classification. Again, I’m just showing you a sample set, all of which will, of course, be connected to the same database of people context, which also has a taxonomy of business unit, department, team, and individual. Configuration management, the same thing, because I’m going to be classifying the records at the CMDB level by the service, the CI, configuration item is connected to and or the technical domain it’s connected to.
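One way to picture that backbone is as a chain of linked records that all carry the same classification fields. The sketch below is an assumption-laden illustration, not a schema from any ITSM tool: the class names, field names and the `incidents_behind` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Classified:
    """Fields every process record shares (the 'backbone')."""
    record_id: str
    category: str   # technology classification
    service: str    # service classification
    priority: int   # same scale for every process, 1 is highest


@dataclass
class Incident(Classified):
    pass


@dataclass
class Problem(Classified):
    incidents: List[Incident] = field(default_factory=list)  # incident children


@dataclass
class KnownError(Classified):
    problem: Optional[Problem] = None


@dataclass
class Change(Classified):
    known_error: Optional[KnownError] = None


@dataclass
class Release:
    record_id: str
    changes: List[Change] = field(default_factory=list)


def incidents_behind(release: Release) -> List[str]:
    """Walk release -> change -> known error -> problem -> incidents."""
    ids = []
    for change in release.changes:
        known_error = change.known_error
        if known_error is not None and known_error.problem is not None:
            ids.extend(i.record_id for i in known_error.problem.incidents)
    return ids
```

Because every record type inherits the same `category`, `service` and `priority` fields, a report can follow the chain in either direction without translating between per-process vocabularies.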

Hopefully this gives you a sense of the business reason why these classification structures are so critical to be shared. Here’s some examples. Just a couple of examples. Traditionally, we think about categorization in the context of technology and technology grouping. On the left hand side, you see top level, hardware. Mid level, desktop. Under desktop, I’ve got a monitor, and under monitor I could even have a specific model number. I have this ability to click through and select drop downs to classify that, but what if this monitor, so called, was actually part of several critical services? How would I gain that understanding, right? Or if it was a switch or if it was a server or an application?

I’d want to be able to collectively understand how many incidents, problems, changes, requests I’m getting around a specific type of category of technology, but I also, if that’s an incident coming in or a problem that I’m dealing with, want to understand if that is hitting a specific service. Now, that’s not the same classification structure. Typically it’s going to be a secondary classification on my record, so I might categorize it according to the technology that’s failed or being requested for enhancement, but I also want to say, regardless of the purpose or the reason that I have an outage, we’ll use email because we’ll come back to that later in our webinar, let’s say I’m getting all these calls for email.

I initially classify it in a technology context, which I’m guessing, because I have no real idea about what technical component has failed. The one thing I do know is that it’s been an email service outage, right? Regardless if the technology classification changes multiple times, I still, at the end of the day, want to be able to report, show me all of the incidents, problems, changes and releases related to email as a service, and give me a secondary breakdown by technical domain or dimension.

This is a critical way to, if you’re going to be looking at measuring and monitoring service consumption and reporting, you have to do this, either in the classification structure like I’m demonstrating now, or through a configuration item association, which is also a potential way to do this because if I have an incident problem change request type of record, I can potentially attach a service CI, a record for my CMDB which represents email as a service. Now I can extract that service correlation a little differently, but I still have to do both.
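The report described here (everything related to email as a service, with a secondary breakdown by technical domain) only works if each record carries both classifications. A minimal sketch, assuming records are plain dictionaries with hypothetical `process`, `service` and `technology` keys and made-up sample data:

```python
from collections import Counter


def service_report(records, service):
    """Count records for one service, by process and by technical domain."""
    matching = [r for r in records if r["service"] == service]
    return {
        "by_process": dict(Counter(r["process"] for r in matching)),
        "by_technology": dict(Counter(r["technology"] for r in matching)),
    }


# Illustrative data only: the technology guess can differ from ticket to
# ticket while the service classification stays constant.
RECORDS = [
    {"process": "incident", "service": "email", "technology": "server"},
    {"process": "incident", "service": "email", "technology": "network"},
    {"process": "problem", "service": "email", "technology": "server"},
    {"process": "change", "service": "payroll", "technology": "application"},
]
```

Calling `service_report(RECORDS, "email")` groups the three email records by process and by technology and ignores the payroll change.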

These classification structures, some rules of thumb, right? They describe generically an environment and or a specific service. They’re not your configuration management database in their own right. I’ve literally seen, more than once, categorization structures where the last tier of the category was an actual instance of an asset. Now imagine what that looks like in an organization which has tens of thousands of these assets. It’s a complete mess. It’s not departmental or role based. There’s another classification around that, around people and or support groups, but I can’t now get into this taxonomy, including some kind of organizational context, because that changes over time.

I need it to support multiple processes, so I’m not going to put in my generic classification structure password reset, which is a verb or adjective reflective of either the service and or the hardware. That’s going to have to be another action, another drop down, which qualifies this classification. You shouldn’t have it in the standard CTI, category type item, right? It’s going to be critical for me to get any kind of sense of repeat issues or trends, right? Where are my top 10 incident categories, and which services are they impacting, and what’s the frequency of their correlations within the organization? It’s going to drive automation, especially around queue assignments.

When it’s this technology and this service in this specific geographic location, who gets to do first, second, third support? I’m going to automate that queue assignment, and I have to have this clearly defined. It’s going to drive, as we’re going to see, SLAs around service restoration as well. This is key.
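Automated queue assignment of this kind is essentially a lookup keyed on the shared classification. The sketch below assumes a hypothetical routing table; the team names, the location and the fallback queues are invented for illustration.

```python
# Hypothetical routing table: (technology, service, location) -> support tiers.
ROUTING = {
    ("server", "email", "Toronto"): ("Service Desk", "Messaging Team", "Platform Engineering"),
    ("network", "email", "Toronto"): ("Service Desk", "Network Operations", "Network Vendor"),
}

# Assumed fallback when a combination has not been defined yet.
FALLBACK = ("Service Desk", "Service Desk Lead", "Duty Manager")


def support_queue(technology, service, location, tier):
    """Return the queue that provides first, second or third level support."""
    if tier not in (1, 2, 3):
        raise ValueError("tier must be 1, 2 or 3")
    tiers = ROUTING.get((technology, service, location), FALLBACK)
    return tiers[tier - 1]
```

The lookup is only as good as the classification behind it: if the category values are inconsistent or ungoverned, tickets fall through to the fallback queues and the automation stops adding value.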

Now, the reality is that this structure is organic. It’s always going to be changing because your environment changes. I’m not one of these people who say, “You can’t have an Other category,” because if you can’t find what you’re looking for, then obviously you’ve got to put it somewhere, but the last thing you want is an Other category that’s not managed. What happens typically is it just becomes a garbage bin. Everybody can’t figure out or take the time to figure out, and they put it into the Other category, and now I’m not dealing with that as either an opportunity for education and or the need to modify.

Now, because this classification structure becomes so critical, as the vertebrae tying all of these processes together, which they’re all using the same one, you certainly don’t want this to have ungoverned control, where anyone can add any kind of leaf or limb to this structure. This structure becomes a critical shared classification policy and it becomes something under a process for changing and change management, right? It’s that critical, otherwise it begins to go south on you very quickly.

That’s your generic classification structure, based on technology and service, and that’s not either or, that’s and, and. Now let’s think about priority considerations, because some work, by nature, simply takes precedence over other work. This is true of, whether we’re going to restore one incident before another incident, whether we’re going to do root cause on a problem before another problem, whether we’re going to do one change in advance of another change or what, from a release perspective, takes precedence over other releases.

This priority model is not, again, only something we use for service restoration. The basis for how we determine how fast we move and what priority something comes in will be shared across all processes. The business rules, what we do with this now known priority, will be different per process because in an incident classification, it might be, I need to restore service within X number of hours. Within a problem, I might need to think about it from the point of view, I need to get a root cause within X number of days. Within a change, it probably will take a sequencing perspective in scheduling.
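To make the split between a shared priority and per-process business rules concrete, here is a small sketch. Apart from the four-hour P1 restoration target, which is the example used later in this talk, every number below is a placeholder, and the function names are hypothetical.

```python
from datetime import timedelta

# One shared priority scale (1 is highest), different rules per process.
TARGETS = {
    # Incident: restore service within X hours.
    "incident": {1: timedelta(hours=4), 2: timedelta(hours=8),
                 3: timedelta(days=2), 4: timedelta(days=5)},
    # Problem: reach a root cause within X days.
    "problem": {1: timedelta(days=5), 2: timedelta(days=10),
                3: timedelta(days=20), 4: timedelta(days=40)},
}


def target_for(process, priority):
    """Look up the time target a process attaches to a shared priority."""
    return TARGETS[process][priority]


def sequence_changes(changes):
    """For change, priority drives scheduling order rather than a clock.

    `changes` is a list of (change_id, priority) pairs; ties keep their
    original order because Python's sort is stable.
    """
    return [change_id for change_id, _ in sorted(changes, key=lambda c: c[1])]
```

The same priority value therefore means "restore within four hours" to incident management, "root cause within days" to problem management, and "goes first in the schedule" to change management.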

How we deal with business rules from the priority model will be different, but the basis of how we establish priority will be the same across multiple processes, and it will be multiple dimensions, as we can see here, right? It’s not just how many people are impacted. There’s a lot here, respect to the brand and financial and liability and legal requirements, so we’re going to talk about how we can look at building one of these models which take multiple dimensions in hand, because the last thing you want to do is be guessing at this at the service desk or asking a service desk professional to have all of this in their brains at that moment of stress and try to imagine and guess what the priority is, because I can guarantee you, what you will get is a high variability, of course, in how it’s done, but in the essence, it all defaults back to the squeaky wheel, which isn’t any kind of way to establish priority. I think you could all agree.

How does that work? From theory perspective, we know there’s a couple of factors here. Traditionally, before ITIL became a popular model, it was all about severity. Some of you probably are still using the severity concept. Severity and the word impact are synonymous, and we’ll see a number of different indicators in a moment about what that means, but in general, it’s really reflective of how big is big and how bad is bad? It’s how wide, how impactful, what’s the scope of this thing? Is it one person? Is it five? Is it organizational-wide?

The degree, or the scope of failure, is one context. The other is urgency, and urgency is separate from degree of failure because I could have something that is organizational-wide, but it’s not as mission critical as something else that’s potentially organizational-wide. I’d have to get to the sense of speed, and I’ve already given you an indication, it’s going to be based on the mission of the organization. We’re going to combine these two pieces and then, of course, there’s always the reality of the reality check. If something takes just a moment, and I have 100 things, I have five minutes to do something, I can still use human logic to do something based on the ease of doing so. Expected effort.

Let’s get a little bit deeper here in the context of the impact analysis. I’ve already mentioned, we can talk about this in the concept of how big is big, right? I’ve got my little pocket pooch versus the big dog, and the reality is here we’re getting into the scope, whether it’s geographic or it’s multiple business units, it’s an internal only or it’s affecting external clients, it’s looking at this from the point of view of the degree of failure. Yes, we have to look at it differently, that one person can never be the same as a complete organizational failure.

Now, does that mean that one person has the same urgency across the board as another person? Not at all, because we’ll talk about VIP type of personas as well, which is not always, by the way, about org chart and about political power. We’ll come back to that, but for now, understand that we’re talking about impact. We’re talking about degree.

What gets a little bit more complicated is this context of urgency. Urgency basically goes back to the mission of the organization. Many of you listening today are working for companies whose primary mission is profitability, revenue generating for your shareholders, whether that’s a privately held company or you’re publicly traded, in the end, you’re in the business of generating a profitable return on investment for somebody, right?

In that context, mission is about profit, but not all of you are working for profit-based companies. You could be working for healthcare, and in this case the most critical services you provide are about health and welfare. Hopefully that’s true in a healthcare scenario, or if you’re in a military situation, you’re going to be supporting front line support, so front line support will be, of course, higher in urgency than back office, clerical type work. Or if you’re in a government scenario, you’re looking for the best interests of your constituency.

Mission is always the beginning of the conversation, and so we think about that. Let’s use profit as the basis for the general conversation here. We’ve got to figure this out ahead of time. First of all, there are certain individuals or roles or personas which have a direct correlation to revenue in the profit scenario. If I’m doing online investment and banking, then an online investment type of role will be critical because they’re going to either make or lose money by the millions in a matter of minutes, versus an individual who is back office. The role of that person has to be taken into consideration.

I already mentioned that some services that I provide are direct to revenue, so again, using the online banking metaphor, if I’ve got an application or service which is about making trades, that’s going to be mission critical. Something that supports but is indirect to making trades, it’s going to be support to mission. Something like your internal intranet, very indirect, you know? Distantly. We’re going to have something further out.

Brand, reputation, legal, compliance, security, data protection, in fact, some organizations, if you’re in the government, you have classified and unclassified, right? Sensitivity comes in to context here. Partners relative to what you’ve agreed in contracts, safety and health, all of these factors would be taken into consideration when thinking about, how fast do I jump, versus something else.

Again, it’s too complicated for you to do this on the fly. You have to have something that’s going to be done ahead of time that allows you to, with some kind of multidimensional perspective, gain agreement around this, and that’s critical, because again, if none of you have agreement around how big is big and small is small and how fast do I jump versus something else, we’re going back to that whole question of constancy of purpose, right?

You’re all being agile based on your own subjective relativism, so if you can’t agree on a shared priority model and what that looks like, then ask yourself, what hope do you have to stand up behind any SLA? You think it’s a priority one, I think it’s a priority two. John over there thinks it’s a priority three, and we just spend minutes, hours, days, arguing and debating reality, which by the way is unfortunate, because it’s usually what I find.

Here’s an example, walking you through some of this. This is a customer example. It’s actually something that was actual from one of my past engagements, and so what you’re seeing here is the concept of keeping a simple four level model. Again, in my experience, you should never go beyond four. You get beyond that, you’re starting to get into challenges, so your urgency factor, right?

Going back to our revenue conversation, critical, it’s got to directly impact revenue. High, it’s indirect. Medium, it’s an intercompany transfer, collaboration, operational efficiencies. Low, it’s a productivity tool for one person. I want you to begin to think about you and your services or your applications even. You begin to pre-classify those relative to this urgency factor, because that can be done and should be done ahead of time. When I classify it as, this has failed, in respect to technology and or service, the urgency should be auto populated based on a predefined model.

What then takes human consideration is this degree of failure. Sometimes you can automate that by monitoring and understand the context of the scope of failure. That helps deal with this, but in the end, I could probably override it as well. Here you’re seeing disruption of service across entire business units. Significant impact including VIPs. Note the VIP is higher than an individual who might be not a persona relative to mission, right? Even when I attach a person record to the ticket, if there’s a role relative to that SLA that, again, online trader, this person belongs to an online trading group, the moment I put this person attached to the ticket, it automates the escalation and notification to a higher level.
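Pre-classification and persona-based escalation can be sketched as two lookups applied when a ticket is opened. The four urgency definitions follow the customer example described above; the service names, the group name and the rule that a mission-critical persona raises impact to at least High are assumptions made for illustration.

```python
LEVELS = ("Low", "Medium", "High", "Critical")

# Urgency is decided ahead of time, per service (hypothetical catalog).
SERVICE_URGENCY = {
    "online-trading": "Critical",  # directly impacts revenue
    "trade-reporting": "High",     # indirect to revenue
    "intranet": "Medium",          # collaboration, operational efficiency
    "pdf-reader": "Low",           # productivity tool for one person
}

# Groups whose members are mission-critical personas (assumed example).
MISSION_GROUPS = {"online-trading-desk"}


def classify(service, observed_impact, caller_group):
    """Auto-populate urgency and adjust impact when a ticket is opened."""
    urgency = SERVICE_URGENCY[service]
    impact = observed_impact
    if caller_group in MISSION_GROUPS:
        # Assumption: such a persona lifts impact to at least "High".
        impact = max(impact, "High", key=LEVELS.index)
    return {"urgency": urgency, "impact": impact}
```

The person at the service desk only supplies the observed degree of failure; urgency comes from the catalog, and the persona adjustment happens as soon as the caller's record is attached.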

When you’ve got this basic model in place and you’ve pre-populated your tool and automated some of this, what you can then get to is something like this, right? Now I have this combination of impact on the left, urgency on the top, and I know that if it’s a mission critical, direct to mission service, and it’s across the board, well then I might definitely have a critical incident, which might then require me to call into play my major incident process, which is separate and distinct from my normal process. That’s a question. If it’s high high, this might be the indicator, at least when you begin to automate the triggering of that process.

Notably, if we’ve got low or high, we’ve got different classifications. Again, note that I’ve tried to keep it here in a simple model. This is actually just a three by three matrix, because after you get beyond four, which is my recommended max, it becomes something that’s unwieldy. Again, the last thing you want to do is just to leave this to individual interpretation. I guarantee you, what happens from that context is that you begin to get simply the squeaky wheel syndrome.
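A priority matrix like the one described is just a function of impact and urgency. The cell values from the slide are not reproduced in this transcript, so the scoring rule below is an assumption chosen only to show the mechanics: each level gets a rank, the ranks are added, and the sum is mapped onto four priorities, with the top priority triggering the major incident process.

```python
LEVELS = ("Low", "Medium", "High", "Critical")


def priority(impact, urgency):
    """Map impact and urgency onto priority 1 (highest) to 4 (lowest).

    Assumed rule, not the matrix from the slide: add the two ranks
    (0 to 3 each, so the score runs 0 to 6) and halve the result.
    """
    score = LEVELS.index(impact) + LEVELS.index(urgency)
    return 4 - score // 2


def needs_major_incident(impact, urgency):
    """Only the top priority triggers the separate major incident process."""
    return priority(impact, urgency) == 1
```

Whatever the actual cell values are, the point is that they are agreed once and encoded, so the result for a given impact and urgency is the same no matter who opens the ticket.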

Now, let’s argue that we’ve got a shared classification structure that’s been used consistently, right? I’ve got a shared priority model across multiple processes. Now what I can do is to automate escalations, alerting and notification, but that’s dependent on the first two things being true. This is an example from an incident or service restoration perspective, and so what I’ve got going on here is priority one through four on the left hand side. Priority one, let’s call this mission critical, disaster across the majority of the organization, so it’s pretty nasty. I want to make sure I get that restored pretty quick versus four, probably something that’s going to be more individualistic.

I’m going to have a number of things begin, right? You see the menu at the bottom. I’m notifying the assignment groups, maybe verbally. I’m alerting senior business stakeholders, because it’s a P1 issue. I’m not going to alert business stakeholders unless I’m telling my own internal IT senior management team, right? In fact, I might escalate it immediately to a team lead or a higher level person because it’s that big, right? I have either things firing automatically or I’m triggering manual activities.

At 50%, let’s say my P1 target was to restore a P1 incident within four hours, 80% of the time, right? Four hours is my target, and 80% of the time is my tolerance. For you it might be two hours, but that doesn’t matter. At the 50% mark, time is getting short. I better be triggering some more action, because for me to, at 50%, still meet my goal, which is now two hours away, I’m going to have to make sure people are paying attention. At 75%, anxiety is growing, so I’m going to be triggering things before I get to 100%, which is the four-hour mark. At four hours, I have to let people know that we have now reached the SLA target. Now I might do this over again. If I hit eight hours, I might do a separate set of activities.

What you’re seeing here is that I’ve predefined a set of automated and/or manual alerts and notifications to go out in respect to how big is big and how small is small. Now you can’t do this, or even automate this, until you have the other things in place, because no one takes them seriously. Everything’s a P1 and basically you end up shutting off all of your automation because no one agrees that the world should get paged and senior management should be escalated to every 30 minutes.
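The time-based checkpoints in this example can be sketched in a few lines of Python. The four-hour P1 target and the checkpoints at 50%, 75%, 100% and eight hours come from the talk; the action descriptions and the names `ESCALATION_STEPS` and `due_actions` are hypothetical placeholders.

```python
from datetime import timedelta

# P1 restoration target used in the talk's example.
P1_TARGET = timedelta(hours=4)

# (fraction of target elapsed, action). The action text is illustrative only.
ESCALATION_STEPS = [
    (0.0, "notify assignment group and alert IT senior management"),
    (0.5, "re-notify assignment group and escalate to team lead"),
    (0.75, "escalate to senior management"),
    (1.0, "announce that the SLA target has been reached"),
    (2.0, "start the separate post-target activity set"),
]


def due_actions(elapsed, target=P1_TARGET):
    """Return every action whose checkpoint has been reached so far."""
    fraction = elapsed / target  # timedelta / timedelta yields a float
    return [action for threshold, action in ESCALATION_STEPS
            if fraction >= threshold]


if __name__ == "__main__":
    for hours in (0, 2, 3, 4, 8):
        print(hours, "h:", due_actions(timedelta(hours=hours)))
```

A real tool would track which actions have already fired; this sketch only shows how the thresholds derive from the agreed target.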

These tools are designed to do this, but they are not going to work unless you gain organizational agreement around the definition around these models. What I find, unfortunately, over and over again, is that people buy these $500,000 plus tools that do all of this wonderfully, but it’s completely turned off because of the lack of design and consideration and organizational change management required to gain agreement and consensus around these as policies, as models.

One last example here, then I’ll be turning it over to Vincent. You might say, “Well, how do I use this in a context of multiple dimensions?” Okay, so fair question. Here’s an example, again, another customer example from an organization in the oil industry. What they did is they already had predefined scales. As you can see, revenue generating, business continuity planning classification from their BIA. They had security classification with respect to the confidentiality and classification of data. Brand exposure, safety exposure, SOX related. What we ended up doing was aligning the various models and coming up with a point system.

Walking through each of these, understanding is it a one, two, three or five revenue generating, and five being direct to revenue, or five being high in this case. Each time we would classify each line, we would have a certain point system. We could even weight those points based on how mission critical each of the classification models were. The weighting would then give me my total weighted points as you can see, and then the cumulative points from all of the above allowed me to understand the implications of now multiple scales.

In this example, I have a 15.75 total cumulative score, which as you can see below, is greater than 10, automatically putting me into the high, and this is an urgency model by the way. That allows me to understand at least one dimension from multiple perspectives. If you’re interested, I have a blog post on my blog, Troy DuMoulin blog, you can just search that. It’s called The Practicality of Prioritization, and you can download this as an example.
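The weighted point system can be illustrated with a short Python sketch. The scale names and the "greater than 10 means high" rule come from the talk. The individual scores and weights below are invented, chosen only so that the total matches the 15.75 mentioned above, and the medium/low cut-off is an assumption; none of these are the customer's actual values.

```python
# (scale name, score on a 1/2/3/5 scale, weight). Scores and weights are
# invented for illustration and are not the values from the slide.
CLASSIFICATIONS = [
    ("revenue generating", 5, 1.0),
    ("business continuity", 3, 1.0),
    ("security classification", 3, 0.75),
    ("brand exposure", 2, 1.0),
    ("safety exposure", 1, 1.0),
    ("SOX related", 5, 0.5),
]


def weighted_total(classifications):
    """Sum score * weight across every classification scale."""
    return sum(score * weight for _, score, weight in classifications)


def urgency_band(total):
    """Map a cumulative score to an urgency band."""
    if total > 10:   # threshold stated in the talk
        return "high"
    if total > 5:    # assumed cut-off, not from the source
        return "medium"
    return "low"


if __name__ == "__main__":
    total = weighted_total(CLASSIFICATIONS)
    print(total, urgency_band(total))
```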

Last thing I’m going to say is let me walk you through now how this all works out using an actual case scenario from my past. It goes back a couple years, but it’s still very useful as a case scenario. Here’s the situation. A company I was working with at one point, they were having ongoing and significant issues around their email system. I told you I’d come back. It didn’t even occur to them that it was ongoing and probably until it got into two or three months later, but this was organizational-wide. It just seemed to be around the end of every month, the email system would completely fail and the end result, they had to shut down the servers involved and basically reboot the system, and up she would come.

Eventually, after this had happened four or five times, right, and the world had melted, the question was asked, what’s going on? Why is it happening, and how do we stop it from happening? By the way, every time it happened, we’d have all these incidents flood in. We would have a crisis team called, and then the major incident process would ensue. That was a lot of human capital just in the service restoration, let alone the business impact of the failure.

Eventually somebody said, “Let’s stop the insanity. Let’s get off the hamster wheel here and figure out what’s going on.” They began to do what we’ll call problem management, as you can see, bottom left corner. Their initial perspective was that the email service was failing and it had something to do with software and/or data.

They tried different things. They tried to stabilize the environment, they even added more memory, thinking it was a memory leak issue at one point. A couple of experiments produced, finally, the root cause. What was happening was that when the system came under heavy load at month end (this was a financial industry company), the servers would overheat within the specific rack. When they took this thing down, shut down the machines, opened the door and basically rebooted it, it recovered because the overheating, of course, diminished.

What you don’t know is that this door situation was relatively recent. Three to four months earlier, at the beginning of this process, a security review had been done at the data center, and it was realized that all the racks had their doors taken off when they were installed. Of course, this was a policy that couldn’t be tolerated, so security demanded that all of the doors be reinstalled. What they didn’t take into account was the load and the heating scenario when they reinstalled the doors and locked the cabinets.

They realized that they had produced the problem, inadvertently, with this rack configuration. Now they had to come up with a solution, and they had a couple of different options. One, take the doors off. Low tech, low cost. Another, replace the rack and its individual cooling system. Another, change all the environmentals in the data center. Obviously, you can probably imagine which they decided to do. Despite the chagrin of the security department, they went back and took the door off the rack, at least that specific rack that was involved. A change was put forward, and that issue was removed.

Now, I wouldn’t have even known this was a pattern of relationship unless I had this data correlation going on, and the prioritization allowed me to understand this was major. Because this wasn’t effective, people weren’t asking the questions, “Haven’t we seen this before,” or, “This is a real nasty thing we all agree is big and disastrous. We should probably do something about it, open a problem record.”

When this is working, we have a virtuous cycle going on. When this is broken, we’re all just shooting from the hip. That’s the problem that we’re all in often because of this. The tools aren’t able to tell us what we don’t and haven’t automated, so interesting story. It’s a cautionary tale.

Vincent Geffray

Thank you again, Troy, for sharing with us some of the best practices here when it comes to the classification, categorization, and prioritization of incidents, whose goal is to make IT organizations more efficient.

What I want to talk about now is how much impact a given incident has on the business customer. To illustrate my point, I will share with you some of the findings of a recent survey that we conducted with Everbridge on the state of incident management. I’m going to start by going back to the last example that Troy talked about, which was, if you remember, an email outage. Now, what I’d like to do is take that same example, but from a business customer point of view.

Please meet Cindy here. She works in a legal department. She is remote, and as you can tell, she is going through some difficulties here. Indeed, she’s trying to send an important contract to one of her company’s biggest customers, but it’s 5:00 PM and just like in Troy’s example, email is down. This can, as we all know, have a huge impact on the business, right, especially if today is the last day of the fiscal year for Cindy. This is just an example. An email outage is only one of the many reasons why companies can experience major incidents.

What we did with our survey, we asked that question to 152 IT professionals, and asked them to tell us about the top causes for major incidents at their organizations last year, so in the past 12 months. Here’s what we found out. We found out that, as you can see here, network and IT competence failures came first, 61 and 58%. Then comes the business applications failures, and we could consider here the Outlook issue, the email issue that Cindy is facing to be a business application failure, so that would fall under the third bucket here. Other, some of the other top causes were unplanned maintenance, release deployment issues, data center outages, and we were surprised by the last item here on the list, cyberattack and DDoS attack only making it to 14% of the top causes of major incidents.

Now, the second thing that we wanted to know was about the impact on the business. We always talk about the impact on the business, but what is it really, right? There’s very little information about the real cost of unplanned IT downtime. There’s a good report that you can read, published by the Ponemon Institute, that can give you some indication, so something we did here is we asked that question: can you estimate the cost of unplanned downtime on your organization? Here’s the finding. From the respondents’ collective answers, we were able to come up with an average cost of unplanned downtime of around $8,662 per minute. If you do the math, you’ll see that this is more than half a million dollars per hour.

Given this result, as we can see, this can get very expensive very quickly, right? What we wanted to understand is where and how CIOs and IT organizations were trying to reduce this cost. What we did next is we looked into the very process of resolving major incidents. To make it easy and industry agnostic, I like to break this process into five phases. The first phase is, if we are experiencing an incident, IT first needs to be made aware of this incident. Troy explained very well how we can apply best practices to find out how big is big, how bad is bad, and come up with the detection, identification, classification, categorization and, ultimately, the prioritization of the incident.

Let’s say we have a major incident. Now comes phase two. We need to identify what team or who’s going to be responding to this particular incident. Who are going to be those IT staff, those experts that can start investigating the issue? Once they do identify or isolate the root cause, then they’re going to put together a remediation plan. They will get it approved by the CAB, the Change Advisory Board or the Emergency CAB, then they’ll be allowed to execute the remediation plan, and last but not least, they will make sure that, after the change that has been made, the service has been restored, right? This is pretty much the process of resolving a major incident. We usually look at this process in terms of efficiency by looking at the mean time to restore, so I’ve put the MTTR here as an indication.

Now, what I’m going to do is look at this phase two here, response team assembled. In the survey that we conducted, we asked those IT professionals what solution they used, or how they were able, to identify who to contact in case of major incidents. This is the answer. As we can see, only 11% of the respondents said that they used a centralized on-call schedule management solution. Therefore, they were able to identify pretty quickly who should be responding to the different incidents based on the nature, based on the priority, and so on and so forth.

Almost 30% of the respondents told us that they had no formal process in place to manage on-call personnel, and 24%, so a fourth of the people, said that they used spreadsheets and company phone books to identify who to call. That’s fine, and that’s fine if it works, right? We then asked them, we said okay, whatever the solution you’re using today or the techniques you’re using today, once you’ve identified the people that you want to respond to the major incident, how do you technically reach out to them? This is their answer. For 83% of the respondents, they use email to communicate with their IT teams. That was, to be honest, that was very surprising to us. Why? Because if I want to send you an email, let’s say I need you to be on this restoration call. If I send you an email, I have no way to know if you’ve opened my email. I have no way to know if you’ve taken any action following this email that you’ve received from me. That’s for many, many reasons. One could be that you’re away from your desk. You can be in a meeting. You can be on PTO, you can be on vacation, or you may just be sleeping because we’re not working on the same time zones, right? As far as I know, emails don’t wake up anybody at two in the morning, at two o’clock in the morning. This is why we were surprised by this answer.

This was the segue to the next question for us, which is, okay, again, whatever solution you use to reach out to your response team, now what I’d like to know is, how much time does it take your organization between the moment when you’ve declared a major incident and the time when all the required IT staff are actively collaborating on a conference call? Here’s the answer. The answer is it takes an average of 27 minutes to assemble the response team. Again, for phase two here, it takes 27 minutes to get the right IT experts on board. Now, in the meantime, remember Cindy. She is still stuck at home, waiting for someone to fix her email issue, right, which is not only impacting her ability to do her job, but in this case, that’s a company-wide outage.

What’s interesting now is that we have data, we can do the math, so let’s do the math once again. If we multiply the 27 minutes by the average cost that was given previously in the survey, you see that the cost of assembling the team comes up to more than $230,000 per major incident, which is a pretty significant number. We could do the math again and ask those respondents to multiply this number by the number of major incidents that they’ve experienced in the past 12 months, to give them an idea of how much it really costs the business or the organization over a full year.
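The arithmetic behind these figures is simple enough to check in a few lines of Python. The $8,662 per minute and 27 minutes are the survey averages quoted in the talk; the function names and the yearly incident count in the usage example are placeholders, since the talk gives no annual figure.

```python
# Survey averages quoted in the webinar.
COST_PER_MINUTE = 8662        # USD per minute of unplanned downtime
MINUTES_TO_ASSEMBLE = 27      # average time to assemble the response team


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE):
    """Cost of a given number of minutes of unplanned downtime."""
    return minutes * cost_per_minute


def annual_assembly_cost(incidents_per_year):
    """Yearly cost of the team-assembly phase alone."""
    return incidents_per_year * downtime_cost(MINUTES_TO_ASSEMBLE)


if __name__ == "__main__":
    print(downtime_cost(60))                   # cost per hour: 519720
    print(downtime_cost(MINUTES_TO_ASSEMBLE))  # cost per major incident: 233874
    # Hypothetical example: 10 major incidents in 12 months.
    print(annual_assembly_cost(10))
```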

Given this information, this is an area where IT service alerting solutions and Everbridge can help organizations. What we see now is that a lot of IT organizations are too busy or don’t even know that there are solutions available out there to help them with that. That’s pretty much what we do at Everbridge. We help IT organizations improve their incident communication, so reaching out to the IT staff, but also senior management, key stakeholders and impacted customers, and improve their incident response so they can consistently engage their IT response team in five minutes or less.

How do we do it? Based on the nature and the priority of the incident, the system will automatically locate the right on-call staff. It will then automatically reach out to the right people via different communication channels. It can be SMS, it can be voice, so on and so forth, and provide those very people with the collaboration tools so that they can start addressing IT issues as quickly as possible.

For today, my takeaway for you: if there are three things that you can do today, here’s what I would recommend. The first thing you want to do is assess what we could call here the average time to find someone, that is, how much time does it take your organization, your team, from the moment where a major incident has been declared to the moment where your IT experts are actively collaborating on the resolution of the incident? If you find out that this mean time is better than five minutes, then great. You’re doing a great job, and you have good processes in place and good solutions in place.

Now, if it’s higher than five minutes, then the good news is, there may be room for improvement. The suggestion I will make here would be to review your on-call schedule management solution, so basically how do you know who to contact in case of a major incident? Then review your escalation process: if those primary IT staff are not available, who is next in line, and how do I automatically escalate to those people? Finally, I would review your notification and communication system to see if, like we saw in the survey, you heavily rely on email or if you have a multimodal communication approach to notification.
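To make the escalation and multimodal points concrete, here is a minimal Python sketch of walking an on-call chain across several channels until someone acknowledges. It does not represent any vendor's API: `engage_responder`, the `send` callback, the channel order and the names in the usage example are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Assumed channel order: try the most interruptive channels before email.
CHANNELS = ("sms", "voice", "email")


def engage_responder(
    chain: Sequence[str],
    send: Callable[[str, str], bool],
    channels: Sequence[str] = CHANNELS,
) -> Optional[str]:
    """Walk the on-call chain until someone acknowledges.

    `send(person, channel)` is a caller-supplied function that delivers
    the alert and returns True only if the person acknowledged it.
    Returns the first person who acknowledged, or None if nobody did.
    """
    for person in chain:            # primary first, then who is next in line
        for channel in channels:    # multimodal: try each channel in turn
            if send(person, channel):
                return person
    return None


if __name__ == "__main__":
    # Hypothetical scenario: the primary is asleep, the secondary answers a call.
    def fake_send(person, channel):
        return person == "secondary" and channel == "voice"

    print(engage_responder(["primary", "secondary", "team lead"], fake_send))
```

The point of the sketch is the acknowledgement check: unlike a plain email, each attempt either confirms that someone has taken ownership or moves on to the next channel or person.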

By | 2017-03-01T14:37:53-05:00 Feb 15th, 2017|
