During a recent webinar, Pink Elephant’s VP of Research and Development, Troy DuMoulin, joined us for a talk on ITIL service management, discussing the integration of normal incident management, major incident management, and service continuity management. At the webinar, Troy reviewed how these three service processes are related, and how each is triggered to support the rapid escalation or de-escalation of service recovery processes – ensuring the best mean time to repair (MTTR), balanced with cost and risk.
If you missed the webinar, download the slides below, watch the on-demand webinar, or keep reading for a recap!
“It all begins with an event…”
Troy began his talk by reminding us that incident management, major incident management, and service continuity management processes all start with an event – a detectable or discernible change in an environment. Once an event is detected, sets of events are collected and correlated to determine whether intervention is needed and which processes need to be launched. Different events have different paths and different processes: from adding capacity, to notifications of batch process completions. But in the event of service degradation, the processes of incident management, major incident management, or perhaps service continuity management need to be launched.
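To make that event-routing idea concrete, here is a minimal sketch in Python. The event kinds, field names, and process names are invented for illustration; they are not from ITIL or any specific monitoring product:

```python
# Hypothetical sketch of event correlation routing. Event kinds and the
# process names returned are illustrative assumptions, not ITIL terms.
from dataclasses import dataclass

@dataclass
class Event:
    source: str            # the component that emitted the event
    kind: str              # "notification", "threshold", or "degradation"
    service_impacted: bool # does this event degrade a consumed service?

def route(event: Event) -> str:
    """Decide which process a correlated event should launch."""
    if event.kind == "notification":
        return "log_only"             # e.g. batch job or backup completed
    if event.kind == "threshold":
        return "change_management"    # e.g. add capacity before a failure
    if event.kind == "degradation" and event.service_impacted:
        return "incident_management"  # the service recovery path begins here
    return "log_only"
```

A degradation event with business-service impact is the one case that starts the recovery narrative of this article; notifications and threshold triggers take other paths.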
Normal Incident and Major Incident Management
Once events causing service degradation have triggered incidents, these incidents have to be tracked from a problem, root cause, and mitigation perspective. This falls under normal incident management: the linear process to capture/log incident information, then categorize, prioritize, diagnose, escalate, resolve, and close the incident.
“When something (in IT) fails… know the business impact.”
– Troy DuMoulin Tweet
“This is the world of incident management… normally,” says Troy. But when there’s a situation that has a major impact on the organization (as defined by a business-driven, multi-factor priority and classification model), or when no one is quite sure what is broken, that’s when major incident management processes kick in. Many organizations, though, have yet to define their major incident processes, says Troy, which can be problematic given that major incident management is not a linear process like normal incident management. (You may also be interested in our other on-demand webinar with Pink Elephant and Troy: Applying the Principles of First Responders to Major IT Incident Management.)
“Major incident management is not a linear process (like normal incident management). It’s more like a case management process.”
– Troy DuMoulin Tweet
When applying the Cynefin sense-making model (see below) to help make sense of the three processes, normal incidents fall under the “complicated” or “simple” quadrants of the framework, whereas major incidents fall under the “complex” domain. Major incident management includes bringing together the war room and conference bridges to conduct “hypothesis testing,” says Troy. Unlike normal incident management’s linear flow, major incident management is about probing and experimenting until a resolution is found.
Time… Bridging IcM, MIM and IT Service Continuity Management
Time, though, is the differentiator between major incident management and IT service continuity management. When there is no time for experimentation and there is severe liability to the business or organization, that’s when IT service continuity / disaster recovery processes kick in. Referring back to the Cynefin framework, service continuity and disaster recovery fall under the “chaotic” domain – where you have to act first and experiment later.
Understanding and having a shared model of incident criticality and prioritization across the business can then dictate when service continuity processes kick in, by tying service levels directly to the shared model. For example, when a service is unavailable for a specific amount of time as defined by the shared model, a normal incident is escalated to a major incident; at some point, even the major incident may run out of time for experimentation (according to the service levels), at which point service continuity / disaster recovery plans have to be implemented. Troy reiterates the importance of having a shared model, stating “we can’t have a model where one person, regardless of their political power, is going to bring down the organization and implicate our service continuity processes.”
To learn more and hear more from Troy on the subject, watch the on-demand recording and check out the slides below, or download them from SlideShare. The transcript is also available below so you can read along!
Let’s begin our narrative at the beginning, which is always a good place to begin. While we’ve talked about the recovery processes of normal, major, and service continuity, we have to begin with the fact that all of these begin with an event. That’s another process in ITIL we’re not gonna cover in any great detail today, but these events happen all the time. Not all of them are intrusive. Not all of them are negative, but we have a world that is surrounded by events of some sort. So, there’s a detectable or discernible change in our environment. That could be simply the notification that a batch job is completed, a backup has been completed successfully, or it could be that the status of a service is unavailable, so we have a ping out there and it’s not returning back.
But it’s not just the fact that the service we expect to be there is currently not there; there’s also the concept of an event that describes a degradation of service, where the performance we expect is somehow not there, with respect to either the speed at which something runs or the availability overall. The concept of status is a key component here. In fact, we can even put events out there that actually provide us triggers before an event of negative proportions occurs, such as putting a threshold event on a storage or memory aspect that triggers an indication that some kind of process intervention is needed. With all of this, just be aware that most organizations have monitoring tools, and there are good ones such as the Everbridge Product Set, but there’s also a set of processes behind that.
You can picture yourself in that dashboard environment, or that network operations center, and you’re seeing the proverbial green-yellow-red lights. The question about that then becomes, based on what light I’m seeing and what type of event I’m actually receiving, what is the correlation of events? Because we’re gonna actually need to collect all those events now and provide some kind of pattern recognition with respect to whether it’s a negative or simply a notification, but also in the sense that some of these will actually trigger different processes.
In the context of a threshold, we’re gonna be potentially triggering a change from the point of view of adding capacity. In the context of a batch scenario or a backlog, we’re simply managing more from the point of view of just operations. But when it comes to service degradation, meaning it’s completely not available or it’s degraded to some level that’s not acceptable, now we have to have the processes of an incident, major incident, and perhaps service continuity, though that last one is normally not automated. That’s going to be definitely a human intervention on the conversation of when to trigger that separate process, which is part of our narrative today. We’re focusing on this event correlation where we have understood and predefined that certain events have different paths, different processes, which enable them. And in the context of a service degradation, or service unavailability, it’s certainly gonna begin at least in the normal incident world.
Let’s begin with what is a normal incident. A normal incident is a scenario where we have a disruption, so it’s something that’s not what we expect and that hopefully has been defined in something of a context of a service definition, perhaps presented in a catalog, maybe it’s in a service level agreement, but you’ve defined what good looks like and we have some deviation from that. So, the variability here is not positive.
Now, that can simply be, “I can’t do something I expect to,” like print something. I could have an entire segment of my network disappear unexplainably. I might see the performance of a screen refresh be unacceptably slow. Most of these, though, would be something that you would normally attribute to a user experience, and that’s understandable, because most of the incidents we normally think about are user impacting. That’s normally where we have our “service levels.”
However, we have to also understand that there are events which then trigger incidents that do not have a customer facing impact, but nonetheless are incidents that we need to track. If you consider it this way, there are many incidents which come in from a user experience perspective. But again, if now you’re watching your alerting tools and you’re understanding that you have a failure, or a failover to a situation where you’ve built in redundancy, the fact that that failure occurs, and perhaps has been recurring on a regular basis, still needs to be tracked from a perspective of, “Why am I seeing this?” From a problem management perspective, “What’s the root cause, and how do I avoid this becoming something even more negative where now user experience is impacted?”
But all of this falls into the normal world of standard incident management. To give you a concept of that, it’s a linear process that most of us understand, because regardless of whether you have a background in ITIL or not, you have a break/fix process, I’m assuming. Because as any service organization delivers service, there’s also kind of an understanding that you also support that service. You begin by logging it, and we’re gonna talk about that in the context of prioritization in a minute, but we have this captured incident. I can’t assume that, because as we know, many incidents in my environment don’t get logged. Often the argument is, “Well, it takes more time to log this thing than it actually does to recover from it,” and that’s often the password reset example. But then again, without that captured knowledge we have a challenge, because we don’t see the pattern. We can’t understand, we can’t manage, we can’t perceive what is not defined.
This is where monitoring tools often come in, especially around the event correlation component, and especially if you can get automation around the logging. It’s difficult for the human bias and the human opinion to basically determine on the stress of the moment whether something should be logged or not.
Logging in itself: you have to begin with the fact that you have the data to support decision making. But now once we get this in, we have to categorize it, and of course we can categorize it multiple ways. We can look at it from the point of view of a technical categorization, and that’s most typically what you see in a category type item, where I’m now getting it down to a network switch, or perhaps an application of a certain character or nature, or a hardware device. We classify it in the technology sense, but we also want to categorize it in a service orientation, based on what service is currently being consumed or not being provided.
Actually, right there, that’s a key telling point of whether an organization has a service orientation or a technical orientation. Because if your categorization today on any of these recovery processes is only at a technical level, then you’re only thinking switches, servers, and routers, or applications, and you’re not thinking about the business context of the failure. So, categorization also has to be thought of in the business service context, and many of the configuration management tools out there enable you to model services, so that literally if a failure at a component level happens, there’s an upward correlation to the business service impacted. That’s critical for the next conversation we’re gonna have, which is prioritization. Because how do I actually understand how I prioritize something, in the context of how fast I jump and support based on business risk, unless I understand the business impact that that incident is actually having?
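That upward correlation – from a failed component to the business services it supports – can be sketched as a simple graph walk. The dependency map and service names below are invented for illustration; in practice a CMDB with a service model would supply this data:

```python
# Hypothetical component-to-service dependency map. All names are made up;
# a real CMDB would hold this model.
SUPPORTS = {
    "switch-07": ["trading-app"],       # network switch underpins an app
    "trading-app": ["online-trading"],  # app underpins a business service
    "feed-gateway": ["market-data"],
}
BUSINESS_SERVICES = {"online-trading", "market-data"}

def impacted_services(component: str) -> set:
    """Walk upward from a failed component to the business services hit."""
    impacted, frontier = set(), [component]
    while frontier:
        node = frontier.pop()
        for parent in SUPPORTS.get(node, []):
            if parent in BUSINESS_SERVICES:
                impacted.add(parent)
            else:
                frontier.append(parent)  # keep climbing the dependency tree
    return impacted
```

So a failure logged against `switch-07` can be reported in business terms as an impact on the online-trading service, which is what the prioritization step that follows needs.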
There’s an assumption I’m gonna make right now that when something fails you do know the business impact. And I know that’s a large assumption ’cause that’s when we begin moving from a technical to a service orientation. But let’s consider right now that that service orientation is in place.
Initial diagnosis says, “Hey, can I actually resolve this at first level?” That’s a service desk question, and ideally that’s often the case, but if I can’t, then I’m gonna move it on and I’m going to say, “Okay, what escalation?” Now, escalation in a classic sense could simply be: move it to a second tier support group. But that only happens when you have the time, and it’s all about time. So, if you have a service agreement that says, “Well, you know, it’s an incident that’s going to be a certain priority,” and so now I’ve got eight hours, say, to deliver a resolution to this, then I can follow a linear process where I get to second, third level, et cetera, and I have a recovery process. We move through and we recover and we close. And those things, by the way, are different, because recovery simply means, “I think I have resolved this”; closure is when you validate with the initiator of the incident that they actually can do what they were previously supposed to be doing, which is the service is restored.
Now, all of that is the normal world of incident management. We’re gonna dive deeper into prioritization, but it’s a very linear flow; it’s a well known, understood process. I know I go from Step A to Step B to Step C. That’s if I’ve experienced this before. If I have some knowledge within my organization, if I’ve seen it before, I have a pattern, I have a recognized outage. This is the world of incident management, normally. Where it gets a little bit dicey is when we get into situations where it is so large that the normal process can’t wait, so we have a time emergency. Or where it gets into the context of, “I have no clue what is broken. It’s large, it’s impactful to the organization’s welfare. I’ve not seen this before. I need to move into a different type of process, which is major incident management.”
That’s what we’re gonna talk about next, but note this question mark. This is the typical flow. The question is, “Is this a major incident?” And if it’s a yes, it flows off into this box called “major incident management,” and for most organizations that’s as much detail as they have in defining that process, which in itself is a problem. ’Cause major incident management isn’t a linear process. It’s more like a case management process, where we come into a box and we just don’t know how we’re gonna get out yet, but there are many different inputs we can use to actually do that. We’ll talk about that in a few more moments, but if you’re interested in major incident management, a month ago we did another webinar with Everbridge on major incident management in its own right. That can be found on the Everbridge website. I’m sure the link will be provided for you.
The key question then comes, “When is it major, and how do I understand that it’s so big and so nasty I can’t wait for the linear process of normality to follow?” I have to jump into a completely different mindset and work in a more case management perspective, which is the world of healthcare and different types of organizations where we’re experimenting. To describe this, I want to introduce you to what’s called the Cynefin Model. That’s a Welsh word meaning “place of origin.” I’m gonna keep it simple, but it’s really about understanding the difference between an ordered system and a complex system.
If you think about it, you can divide it right down the middle. You begin your journey of any situation where you’re responding and recovering in the middle. You don’t know what kind of situation you’re dealing with so that’s why we’re going through the categorization and the prioritization trying to get a sense of how big this is.
On the right hand side is the world of the simple and the complicated, but both can be considered ordered systems in a sense. We know what we’re doing, we’ve been here before, and there’s someone who knows the right answer. The bottom right hand corner is where we’ve seen this incident hundreds of times before. I sense that this is a simple scenario, I know the root cause, I do have the resolution. So there’s a best practice. I pick it off the shelf and I say, “Control, alt, delete works all the time. Apply and fix.”
So in the simple conversation, definitely we’re talking about level one type support, ideally, if you have a good knowledge base, where we’ll rapidly be able to resolve because we know this and we’ve seen it hundreds of times. Complicated: this is the world where there is somebody who knows something about this. I sense that it’s something I’ve seen before. It might have a small or different permutation, but I can go touch someone and say, “Hey, can you help, because you’re the expert in this area?” And their previous experience, maybe combined with some other different experiences as well, is enabling us to get to a resolution. This is typically the world of second and third level intervention.
Both on the right hand side – whether it’s simple or it’s complicated, there’s a right answer – this is the world of linear incident management. Where we get into the world of major incident management, and especially crisis management, is on the left hand side, and this is where we’re in a complex adaptive system. There really is no way to predict what’s gonna happen based on what we’re gonna try, and we’ve never been here before, so we certainly don’t have the knowledge, or maybe even the internal subject matter expertise.
In the top left hand corner, we pull together the major incident process we’re gonna talk about. We bring together the war room, the telephony bridges, and the center for major incident control, and we’re, in essence, doing hypothesis testing. We’re probing through experimentation, saying, “I think this is going to be solved in this way.” We come up with a number of different experiments, and based on that we sense whether the [inaudible 00:14:00] are going in the right direction, hopefully resolving our issue, and then we respond by doing either more of the above or less, based on the success of our experimentation.
That’s the world of major incident management. There is no linear flow. We’re pretty much probing and experimenting. That does still mean we have some time, and time, as we’re gonna see, is one of the differentiators between major incident management and IT service continuity.
Time is the basis for experimentation. It might still be compressed time, but it’s not like I have to completely now … stop trying to resolve and move onto the next situation, which we’ll talk about. It’s called disaster recovery, crisis management, or IT service continuity, depending on the conversation you’re having.
This is the world of the chaotic. The chaotic is: we have no time, the liability we’re experiencing is a severe liability now. We have to make a decision to simply act and recover the best we can and once we recover, now by perhaps completely moving over to a failover system, we can then go back and find some time for experimentation on figuring out why this happened.
This is the world of disaster recovery. We’ve run out of time, in essence, to experiment, and it’s certainly not something we’ve seen before. To apply this model, one of the things any organization has to understand when they’re in the middle of a recovery is, first of all, “Are we in a situation we’ve seen before?” That’s the right hand side, an ordered system. Or, “Are we in a situation we have not seen before?” In which case we’re either gonna apply experimentation, aka major incident management, or we’re in a chaos scenario where we need to respond, because any decision is better than no decision, and that might be what we have to trigger. The left hand side is the world of major incident and service continuity.
To kinda help us understand time, let’s think about the concept of priority. Because as we saw in the previous slide, priority is a key indicator of “do we have time?” Really it comes back to that whole service interruption, service disruption, business risk perspective. A lot of questions come up around how you do prioritization right. Well, we’ll talk about some of that now.
The key question: ITIL talks about priority, many organizations use the concept of severity – so is a priority one incident equal to a major incident? Well, that depends. And I’ll give you more than “it depends.” I’ll give you some examples.
Let’s understand that priority is actually multidimensional. It’s many things that you have to consider here. It’s not as simple as the system is down and it hurts and people are complaining and yelling and screaming. We have a customer or business need and so the mission of the organization is at risk. We have potential for financial impact especially if we’re a revenue based mission. We might have a situation where we’re in healthcare where the system failure actually means life and death. We might be in a military scenario where system failure is actually leaving troops exposed on the front lines. The reality is that mission and mission impact will be significant.
This will have some further delineation about, “How much money are we losing? What criticality? What risk?” And risk can come in different flavors, it’s not just about immediate risk. It could be brand implications, it could be liability and opening ourselves up to lawsuits et cetera. There’s maybe different permutations to understanding priority that have to be thought through before you can become consistent in how you apply this model.
Here’s an example. First of all, understand the ITIL concepts of impact and urgency. Impact, on the left hand side of this grid – think of it as more like severity, the concept of degree of failure. It has to be reasonably understood that an entire business being down can never be equal to one individual being down. There’s a degree of failure in the scope of how wide this thing is. Now, we’ll come back, ’cause I know in your minds you’re already thinking, “What about VIPs?” We’ll talk about that, as you can see here in the model. We’ll focus on that as well, ’cause not all people are gonna be equal – unfortunately that’s a reality in the world of work – relative to the mission of the organization.
Then on the urgency side, there are several parameters here, but let’s use the mission conversation again. Most organizations listening on this webinar are probably revenue based organizations, so it comes back to profit. The mission is to create profit and benefit for the stakeholders of the company. But there are others – again, the healthcare scenario or the military scenario – which have a different primary focus for their mission. But once you define your service model – these are the services we define and provide so that the business can enable its capabilities to an outside market – what you can do is understand those services in the context of the mission.
If I was a financial institution involved in doing online trading and investment, there are services which directly allow me to execute online trades. Those would be mission critical because they are directly tied to revenue: if they fail or they’re not available, I’m potentially losing millions of dollars of investment revenue. Then there are supporting services to that mission. Perhaps I’m using a Bloomberg NASDAQ data feed to make investment decisions. It’s not the system I use to execute trades, but it’s supporting that. And then there’s another set of services which are perhaps internal – optimization, automation, my internal company portal, for example – which are indirectly, or not directly at all, related to mission criticality. What you’re doing right off the bat is classifying the services in your portfolio relative to their criticality to the business mission.
Now, put these together. Just a bit more on the left hand side: your VIP. As we talked about, not all services are equal relative to mission, and not all roles in the organization are equal relative to mission. There are some people – again, using our metaphor of the online trading and investment company – who are actually making the trades happen. We’ll talk about them as very important persons, not so much in the context of politics but more in the sense of mission to the organization. Then there are people who, though they’re still important in the concept of humanity, have a role that isn’t as critical to mission. So, we have to understand that when we know what part of the organization and what roles are affected, we can have different implications for severity as well as urgency.
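One common way to put the two dimensions together is a simple impact × urgency lookup. The labels and the additive scoring below are assumptions made for illustration, not the values ITIL prescribes:

```python
# Hypothetical impact x urgency priority model. The category labels and the
# additive scoring are illustrative assumptions, not ITIL-prescribed values.
IMPACT = {"organization": 1, "department": 2, "individual": 3}
URGENCY = {"mission_critical": 1, "mission_supporting": 2, "internal": 3}

def priority(impact: str, urgency: str) -> int:
    """Combine the two dimensions; lower number = higher priority (P1..P5)."""
    return min(IMPACT[impact] + URGENCY[urgency] - 1, 5)
```

The point of a shared table like this is consistency: the whole organization agrees up front that an individual losing an internal portal can never score the same as the organization losing a mission critical service.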
Both of these combine – you try to keep it simple – so that we understand that while there are definitely many things which can be time sensitive, not all things are the same. We can’t have a model where one person, regardless of their political power, is going to bring down the organization and it’s going to implicate our service continuity processes. This is a ticklish subject, but it’s something that’s gotta be addressed in this conversation.
As you would imagine, now that I have some kind of understanding of criticality – high, medium, and low – shared across the organization, I would then put service levels around this: how long can this system be unavailable before we need to escalate it up to a corresponding higher level of intervention? We might start the high category in the normal incident management process until we expire our service level, and now we move into major incident. So someone is in the role of calling the major incident process into being.
And at some point even the major incident may run out of time relative to the next tier of service level, and we’ve now run out of time for experimentation. Someone – and again, this may be several someones here – will have to make the call of, “We move to our service continuity / disaster recovery plan,” because there’s a significant implication of cost and risk to that question, but also to not doing anything.
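The time-tiered escalation described above can be sketched as a lookup against the shared model. The threshold values (in minutes) are invented to show the mechanism, not recommended numbers:

```python
# Illustrative time-tiered escalation against a shared criticality model.
# All threshold values (minutes) are made-up examples, not recommendations.
THRESHOLDS = {
    # criticality: (escalate_to_major_after, invoke_continuity_after)
    "high":   (15, 60),
    "medium": (60, 240),
    "low":    (240, None),  # low criticality never auto-invokes continuity
}

def current_process(criticality: str, minutes_down: int) -> str:
    """Which recovery process should be running after this much downtime?"""
    major_at, continuity_at = THRESHOLDS[criticality]
    if continuity_at is not None and minutes_down >= continuity_at:
        return "service_continuity"   # out of time for experimentation
    if minutes_down >= major_at:
        return "major_incident"       # normal process has run out of time
    return "normal_incident"
```

In practice the final step is a human call (the “several someones”), but tying the thresholds to the shared model is what keeps that call consistent rather than political.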
Moving this along: when that major incident process is called, in essence what we’re doing is bringing together everyone into the theoretical, or actual, space they need to be in. That could be a physical conference room, that could be telephony bridges, that could be multiple telephony bridges, in the sense that there’s a technical bridge and a business bridge. This is the command center concept, and ITIL gives us some basic information, but we can look to other sources like crisis management and Homeland Security sources, which we covered in the other webinar, about how this process works.
But in essence, to give you a high level understanding, ’cause we’ll be focusing more on service continuity in a few minutes, how this works is: when we call this war room together … there’s an initial triage team, actually. I should step back. Before we call the major process together, there’s a SWAT team or a forward deployed triage team, which is cross functional and has multiple different roles in it, that does an initial situational analysis and based on that might even have to act to stabilize the environment and cauterize the liability. That triage team will then activate the larger process, if you will. The triage team might be called into action, but the full process would be called in only if the triage team deems it a true major incident; otherwise, we talked about deactivation. The triage team can also deactivate it back to normal incident management, thereby not calling into effect our full major incident process and not incurring the cost that that would entail in everyone’s time and, of course, resources.
But let’s imagine that the activation happens and the war room is pulled together. Now we’re into a series of assess, stabilize, restore. And this is the period we talked about earlier, which is the experimentation. Where I have a hypothesis, we have a couple of hypotheses, we’re testing out various scenarios, and we’re doing scientific experimentation. We’re amplifying or dampening experiments based on their success or failure. Ideally we come out of that with successful experiments, even if shorter term, and then once we’ve restored the service we do a post mortem, a full root cause analysis, to understand the impact and the implication.
Now, this process has to be called into play either at the beginning of a normal incident process, or at the point where the normal incident process has run out of time based on the service level. It can be triggered in either scenario. But we can only go around this loop so many times before we get into the next world, which is service continuity.
I’m hoping that you’re getting a sense that these are tiered and integrated, or they should be at least. So, we’ve run out of time. That’s what we’re talking about. Time is not on our hands. This is not availability management. Business continuity management is a business process for recovery in the face of a disaster or crisis. IT service continuity management is a child process to business continuity management. It’s, in essence, realizing that to the best of our attempt we have not been able to successfully restore so we have to failover to our alternative environment. This is not availability on steroids where we simply have built redundancy into everything, because that’s not practical and even when we talk about our service portfolio we’re gonna talk about different levels of continuity based on the criticality of that service. But it definitely is now giving up on restoration and moving to a failover scenario where we’re moving to a completely alternative system and providing service in a different way.
We’re now in the world of service continuity. To understand it a little bit, we first gotta define the word “crisis.” I use a bit of a metaphor here. I grew up on Sesame Street, like probably many of you did, where we have the beloved Cookie Monster who cares dearly about his cookies. First of all, let’s talk about a crisis. A crisis is where we’ve run out of time – back to that word. It’s an unplanned situation – we certainly didn’t plan to be here – in which it is expected that the period during which one or more IT services will be unavailable will exceed the threshold we’ve already agreed to with the customer. We’ve pre-determined how long we’ll try to restore, and based on that, we can come up with different scenarios of recovery.
Just to give the metaphor a chance: one recovery option is we basically rebuild it from scratch. Hopefully we’ve got the procedures and the protocols and the detailed how-to on how to do that. Think of the metaphor of a recipe book, ’cause this service will have to be restored, and if I don’t have a configuration management database, which gives me a snapshot or baseline, that will be a very tricky scenario. That’s another conversation for another day. But we basically rebuild it from scratch. That’s one way.
Another one is where we have reciprocal agreements, and there are a couple of different options here. This is where we’ve prepared, we have a plan, but we have different levels of plan. So, we have a gradual recovery option. Think about this as: I’ve got vendors in line, I’ve got contracts in place, I even have a place to move, and I’ve got all of my precursors defined … and my recipe, understood, for recovery. All of these things are in place, but I have to trigger their execution. We call this a cold standby. This is where we have it identified; it will be a rebuild, but it’s a rebuild with a known understanding of how we’re gonna do it. That’s one level of recovery.
Another recovery method is intermediate recovery. This is where you have somewhere else you can go for the service, so we’re gonna take time and drive over to Grandma’s house … She might be in a different state, but we know the way there and we can book a flight, though there’s still a time period where we literally will have to do this. A reciprocal agreement with perhaps another business unit or with a vendor, where they have a subset of our environment and our service model in play already, one that looks like, or can be used as the equivalent of, what we’re doing today at some agreed level. It might not be the full level we’re doing, but some agreed level. That’s your warm standby. It’s a place, but it’s not just an empty place. There are literally cookies over there and I can go eat them.
Then, of course, you have immediate recovery, which is gonna be your most expensive scenario, where you’ve got a hot standby and in some cases a completely replicated environment that we call upon when we need to. Ideally that’s in a different location, on a different power grid, so that if there’s a natural disaster we’re not losing both environments. Note it’s not one-or-all-of-the-above; it’s probably not one of, it’s many of. So you could actually, based on that service model again, have some services be manual rebuild, and some be cold, warm, or hot depending on criticality, ’cause all of it has cost and it has risk, which comes back to the process.
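The tiered options Troy walks through, from rebuild-from-scratch up to hot standby, could be sketched as a simple mapping from service criticality to a recovery tier. This is a minimal illustration, not anything shown in the webinar; the tier names, hour/cost figures, and example portfolio are all hypothetical.

```python
# Hypothetical sketch: choose a recovery option per service based on
# business criticality. Tiers, times, costs, and services are illustrative.

RECOVERY_OPTIONS = {
    # tier: (typical recovery time in hours, relative cost)
    "rebuild":      (72, 1),  # rebuild from scratch using documented procedures
    "cold_standby": (48, 2),  # gradual recovery: site and contracts ready, must be built out
    "warm_standby": (8, 4),   # intermediate recovery: partial environment already running
    "hot_standby":  (1, 8),   # immediate recovery: fully replicated environment
}

def choose_recovery_option(criticality: int) -> str:
    """Map a 1 (low) to 4 (mission-critical) criticality rating to a tier."""
    tiers = ["rebuild", "cold_standby", "warm_standby", "hot_standby"]
    return tiers[max(0, min(criticality, 4) - 1)]

# One portfolio, many tiers: "it's probably not one of, it's many of."
portfolio = {"intranet": 1, "batch_reporting": 2, "email": 3, "order_entry": 4}
plan = {svc: choose_recovery_option(c) for svc, c in portfolio.items()}
```

The point of the table is the cost/time trade-off: each step up the tiers roughly doubles cost while shrinking recovery time, which is why the decision has to be made per service rather than portfolio-wide.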
This is an example of the ITIL slide, and I’ll be looking at another one in a second with a bit more detail, but note the process of business continuity management that you see on the left … We have a strategy as an organization, we know what our mission-critical services are, we understand how to reconstitute not just the technical environment but the business environment and the people. So we have a business continuity strategy, not just the technical component which underlies it. Then we have the ability to restore ongoing operation from a total business unit perspective. As you would well imagine, given that all business processes are digital at this point in history, to make that work we’re gonna have to have a corresponding set of plans relative to the technology environment that enables the business environment. So if we have a business strategy, it’s gonna have to drive things such as a business impact analysis.
Now a business impact analysis is literally based on mission, we’ll use that word again … The business has predetermined what level of criticality each of these business capabilities has. This business impact analysis will be a key feed into, as you would probably imagine, the priority model we discussed earlier. And if it’s not, if they exist completely separately, you can already begin to see the disconnects here, because the business has decided what is critical and you have a priority model for restoring incidents, which is also deciding what is critical. Ideally these things mesh and one is driving the other, but in my experience that’s more often not the case.
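The connection Troy describes, where the BIA feeds the incident priority model rather than living beside it, could be sketched as follows. The criticality ratings, the impact × urgency matrix, and the service names here are illustrative assumptions, not a model from the webinar.

```python
# Hypothetical sketch: the business impact analysis (BIA) drives the
# "impact" axis of a classic impact x urgency priority matrix, so the
# business and IT are working from one definition of "critical".
# All ratings and the matrix values are illustrative.

BIA_CRITICALITY = {"order_entry": 1, "email": 2, "intranet": 3}  # 1 = most critical

PRIORITY_MATRIX = {
    # (impact, urgency) -> priority; 1 = P1, a major-incident candidate
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}

def incident_priority(service: str, urgency: int) -> int:
    """Look up priority using BIA-derived impact, not an ad-hoc IT rating."""
    impact = BIA_CRITICALITY.get(service, 3)  # unknown services default to low impact
    return PRIORITY_MATRIX[(impact, urgency)]
```

Because the impact axis comes straight from the BIA, a change to the business’s criticality ratings automatically changes how incidents are prioritized, which is exactly the mesh Troy says is usually missing.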
Implementation is not just that we’re gonna run a DR plan. That DR plan has to exist. You have to have the environment set up, either the cold or the hot standby. You have to have it not just set up initially and tested, but continually updated as changes to the production environment occur. Implementation is the initial implementation, but then there’s the ongoing operation, so that ideally, when the disaster does happen, when we’ve run out of time for a major incident, when I need to call the plan into execution, I can.
What happens many times, unfortunately, is that organizations implement their service continuity processes initially but fail to manage them on an ongoing basis, and so what is currently in production looks very little like what is in the service continuity environment. What many organizations unfortunately end up doing is running a test, which is often a requirement before an audit, and that test will simply verify that … Well, before you run the test you update the service continuity environment, ’cause this synchronization is at risk. God forbid we have an actual disaster before we’ve planned our test, ’cause the disconnect between production and what is in your continuity environment is often complete.
That previous slide is fleshed out a bit further here. This is actually an older ITIL slide. I still like bringing it out because it brings in some of the key components. Stage one: someone actually has to intend to start up business continuity. That means you then have to think about stage two: doing your impact assessment, risk assessment, and business continuity strategy. That’s going back to my cookie scenarios.
Now I have to figure out: who are the service owners for these services? Who’s going to be the captain for recovery in the case of a disaster? How do we maintain disaster recovery plans per key service in the system? Where are those located? How do I have an alternative location if that location is unavailable? I’m doing my initial testing, but at the same time I’m managing any changes to the production environment through change management. Synchronizing, that’s a driver. I’m making sure that I’m testing not just initially but at least annually. I have people who have practiced the recovery procedures, and it’s something that I’m training people to do. All of this would be necessary for me to actually execute.
This is a bit of an eye chart, I recognize that, but I will describe what you’re seeing. This is actually a business process from a customer that we worked with. This is business continuity management, that darker line. You’re talking to your customer, your business unit line-of-business owner, and you’re saying, “Okay, what would you like for service continuity in the event all else fails and we need to recover? What would the level of DR be for you? Disaster recovery.”
Of course, they come back, we go down to the next box, and they say, “Well, we’d like it all to be completely, redundantly recoverable at a hot site.” So you do your cost estimation, you go back up the loop on the left-hand side, and you say, “Well, that would cost X.” And of course they have a bit of a mental meltdown and say, “Well, that’s not going to do.”
You go through this loop from DCM to evaluation until you get agreement: what are your critical systems? Which are less critical? How much are you willing to spend per system, and what level of recovery, from hot standby down to build-it-from-scratch, am I talking about per service? I move down to the middle box and now I’ve got ongoing operations: as changes occur, I’m not just updating the production environment but I’m now actually updating the DR components and, ideally, the procedures. That’s why you see a database there, because without that recipe for recovery, good luck trying to rebuild it.
This process keeps it live, keeps it synchronized between incident and change et cetera, or change specifically. Then to the far left you see the testing going on, where a test will either prove you need to do something different or, of course, might reopen the whole conversation about how much DR we buy and how much we apply. That’s at least an annual process.
But everything in the center and to the left is about maintaining this process for the eventuality where the right-hand side has to actually be called into play. That’s where someone one day says, “This incident has gotten outta hand, we’re outta time, and we now have to recover in a completely alternative way. We need to fail over to DR.” That’s where we get into this conversation we’re having now, and the alerting that goes with it.
So in summary, ’cause I’m done here: we have all these processes. There are three processes: normal, major, and service continuity. Think of them as three tiers of restoration, and at any given point while an incident’s happening, you have to be asking, does it have to be escalated up, or can it be de-escalated down, as we’re doing situational analysis?
The roles, the processes, how we do this: this is critical. Unfortunately, what we often see is that while these three processes may exist in some form, they’re not necessarily integrated.
Thank you, Jesse, and thank you very much, Troy, for another great presentation. This was great content, especially for those of you not familiar with the IT processes.
Good afternoon or good morning, depending on where you are, everybody. My name is Vincent Geffray, and I am a member of the product team here at Everbridge, focusing on communication workflow automation and collaboration tools for IT.
Troy just talked about the three tiers of service restoration processes. What I want to do now for the next five to ten minutes is not to talk about how we can get those Grandma’s cookies that Troy talked about. I’m gonna be talking about communication and how it relates to [inaudible 00:39:40], especially why communication workflows are key if you want to streamline your service restoration processes. We’ll see that just like you have a business continuity plan and a disaster recovery plan in place, you should have a communication plan pre-defined and ready to go whenever your organization is facing a major incident.
Again, Troy gave us a few examples: a data center power outage. I could add a DDoS attack, a cyber attack, an EMR outage, that is, an electronic medical records system, if you work in a hospital. A network outage. A website going down for an eCommerce company, or just the website being too slow. Or something even more basic, like email not being available for your organization, for your corporation.
So if we look at these situations, what they all have in common is that they most likely are going to impact the business operations of your organization, of your company. And because they impact a large number of users or customers, they are most likely time-sensitive events that the IT organization’s gonna have to handle.
If I refer back to Troy’s presentation, I’m gonna be talking about situations where we are, from an IT perspective, facing a P1, a Sev 1, or a major incident, where urgency is high and impact on the business is also high. In this situation, every minute counts and there’s no real time for guesswork. The response teams need to be activated very quickly; ideally we would wish that they act in a coordinated manner, and hopefully they’re not gonna make too many mistakes while restoring the services. This is easy to put on a slide, and this could be seen as the best-case scenario. But is it really what we see in reality?
What I can share with you is that when we work with IT organizations and service desks, we still see a lot of situations where the service desk or the incident manager is trying to reach out to people under these circumstances using blast emails to entire teams, hoping that someone’s gonna jump in and take accountability, or responsibility, for the issue. So they’re gonna send this email, and they’re gonna wait. At some point they may start a conference call and reach out to people to ask them to join it. They may open a chat or chatroom, depending on the tools they’re using for collaboration, and so on and so forth. So, a lot of manual processes.
Using those manual processes is like driving in the fog, because they have no clue whatsoever whether the people they’re trying to reach have received the message. They don’t know if the people they are trying to reach are the right people for this type of incident, and so on and so forth. In the meantime, relying on emails or ChatOps to reach out to people doesn’t give you the ability to be sure that people are going to see the message. Maybe they are out for lunch, they may be in a meeting, on the phone, they may be on vacation, and if it’s the middle of the night they may just be sleeping.
What I want to do now is share with you a few best practices. Actually, I’m gonna share with you five best practices, which will help your organization get better at dealing with major IT incidents.
My first recommendation for you is gonna be this one: have a plan. Have a plan. As we all know, failing to prepare is preparing to fail. Have a plan and identify the critical services that are supporting your business, [inaudible] or directly with the business. Identify those critical services. They may already be mentioned in your disaster recovery plan or your business continuity plan. You also have to be able to recognize major incidents, and Troy spent some time helping us define those priorities and define what major incidents could be for your organization. So for each service, identify the team of first responders and try to define the recovery timelines with the business.
You’ll see that there are solutions available in the market to help you with this. My recommendation would be to use cross-team on-call schedule management, so that for each on-call team you automatically know who should be responding to the different incidents. A rule engine is also a feature that can be used to identify, within those teams, who should be contacted to respond to a given incident.
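The combination described here, on-call schedules plus a rule engine, could be sketched roughly as below. The team names, schedule layout, and keyword rules are invented for illustration; they are not the API of any real alerting product.

```python
# Hypothetical sketch: cross-team on-call lookup plus a minimal rule
# engine that picks responders by incident category. All data is illustrative.
from datetime import datetime

ON_CALL = {
    # team -> list of (start_hour_utc, end_hour_utc, responder)
    "network": [(0, 12, "alice"), (12, 24, "bob")],
    "database": [(0, 12, "carol"), (12, 24, "dave")],
}

RULES = [
    # (keyword in incident summary, team to engage)
    ("outage", "network"),
    ("latency", "network"),
    ("replication", "database"),
]

def responders_for(summary: str, now: datetime) -> list:
    """Match rules against the incident summary, then resolve each matched
    team to whoever is on call at `now`."""
    teams = {team for kw, team in RULES if kw in summary.lower()}
    hour = now.hour
    return sorted(
        person
        for team in teams
        for start, end, person in ON_CALL[team]
        if start <= hour < end
    )
```

With data like this, a 3 a.m. “core router outage” resolves straight to the on-call network engineer, with no blast email and no spreadsheet lookup.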
Best practice number two. Again, we are facing a major incident, or even a P1, so at this point we need to find an easy but also reliable way of contacting the people who matter. I’m gonna say: just don’t use emails. Emails don’t wake up anybody at two in the morning. You don’t have time to go through spreadsheets and call trees when you’re dealing with a major incident. You need to know who’s gonna be contacted, but also who has received the notification and who has responded to it. You’re gonna need to ensure that all the required staff … we talked about the war room, we talked about the triage team, we talked about the SWAT team … we need to make sure that all the required staff is gonna be on board.
So use sequential, multi-modal, targeted notifications: SMS, voice, mobile application push notifications. The system has to provide two-way notifications, so that not only can you send notifications but you can also see who has responded and how long it took them to respond. Globally local solutions are also important if you have distributed teams. If your service desk or your support teams are spread out all over the globe, you may want to think about globally local solutions so people receive a notification from local numbers.
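The "sequential multi-modal" idea, try each channel in turn until the responder acknowledges, could be sketched like this. The channel order and the acknowledgement callback are assumptions for illustration, not a real notification provider's API.

```python
# Hypothetical sketch of sequential multi-modal notification: escalate
# through channels until the responder acknowledges, keeping a record of
# what was attempted (the two-way visibility mentioned above).

CHANNELS = ["push", "sms", "voice"]  # least to most intrusive

def notify(responder: str, message: str, acked) -> list:
    """Attempt each channel in order until `acked(responder, channel)`
    returns True; return the list of channels attempted."""
    attempted = []
    for channel in CHANNELS:
        attempted.append(channel)
        # a real system would call the provider's delivery API for `channel` here
        if acked(responder, channel):
            break
    return attempted

# Example: a responder asleep at 2 a.m. who only answers the voice call
trail = notify("alice", "P1: email service down", lambda r, c: c == "voice")
```

The returned trail is what makes this two-way: you can report not just that someone was paged, but which channels it took before they acknowledged.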
Number three. In the middle of a crisis there’s no time to start writing messages. In this situation you’re gonna need clear and crisp messages. Consider the fact that you may not want to send the same message and the same information to everybody, whether they are your IT staff, the business, your key customers, or the impacted users and customers. What I want to say here is: unlike us, automation doesn’t know how to panic. That’s why using automation tools for communication will help you in those crisis situations.
Use error-proof communication systems. Provide your operators with pre-defined message templates so there’s no error possible, and a system with the ability to send messages and updates. You gotta keep your stakeholders, your key customers, and your impacted users updated every 15 minutes or every half hour. Again, your solution should be able to differentiate messages depending on who you’re sending the message to.
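Pre-defined, audience-specific templates could be sketched with nothing more than the standard library. The audience names, template wording, and field names below are illustrative assumptions.

```python
# Hypothetical sketch: pre-defined message templates per audience, so an
# operator in a crisis only fills in blanks instead of writing from scratch.
from string import Template

TEMPLATES = {
    "it_staff": Template(
        "P$priority: $service impaired. Join the technical bridge: $bridge"),
    "business": Template(
        "$service is currently unavailable. Next update in $interval minutes."),
}

def render(audience: str, **fields) -> str:
    """Fill the audience's template; substitute() raises KeyError if a
    required field is missing, catching mistakes before anything is sent."""
    return TEMPLATES[audience].substitute(fields)
```

Note the deliberate asymmetry: the IT-staff template carries bridge details, while the business template carries only status and the next update time, matching the advice above about differentiating messages per audience.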
Number four. Provide ready-to-use collaboration tools. Troy talked about the war room and the conference bridges. Yes. When you’re dealing with a major incident, you want to have at least two conference bridges open: one for your technical teams and one for the business. And believe me, you don’t want to mix the two. You don’t want your business people involved on a technical restoration call.
Ready-to-use. Try to find ways to send information to the IT resolvers so that it’s very easy for them to join the conference bridge. One-click access to conference bridges, so if you’re at the service desk, you don’t have to send [inaudible 00:49:41] information and access codes. There is identification; they click or they press one, and here they are, working with the team on the same conference bridge. Group chat and ChatOps tools should also be available at this point, when the team is starting the investigation and the remediation of the incident.
Best practice number five: continuous improvement. Troy talked about this as well. Have a post-mortem after every major incident. Why? This is gonna help you refine your communication workflows, and you’re gonna become better over time. So my recommendation here is to track the mean time to restore, or mean time to repair. Also track metrics like the mean time between failures and the mean time to engage. One of the benefits of a solution like an IT alerting solution is to reduce this mean time to engage.
Use solutions that include reporting and auditing capabilities. After a major incident you want to know who was contacted, how long it took them to acknowledge the notification, how long it took them to jump on the conference bridge, whether they left, and so on and so forth.
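Given timestamped incident records like those an auditing capability provides, the metrics mentioned above fall out of simple arithmetic. This sketch and its field names ("detected", "engaged", "restored") are illustrative, not a real reporting schema.

```python
# Hypothetical sketch: compute mean time to engage (detection -> first
# acknowledgement) and mean time to restore from incident timestamps.
from datetime import datetime

incidents = [
    {"detected": datetime(2024, 1, 1, 2, 0),
     "engaged":  datetime(2024, 1, 1, 2, 4),
     "restored": datetime(2024, 1, 1, 3, 0)},
    {"detected": datetime(2024, 1, 2, 9, 0),
     "engaged":  datetime(2024, 1, 2, 9, 6),
     "restored": datetime(2024, 1, 2, 9, 45)},
]

def mean_minutes(start_key: str, end_key: str) -> float:
    """Average the span between two timestamps across all incidents, in minutes."""
    spans = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return sum(spans) / len(spans)

mtte = mean_minutes("detected", "engaged")   # mean time to engage
mttr = mean_minutes("detected", "restored")  # mean time to restore
```

Tracking mean time to engage separately from mean time to restore is what shows whether a slow resolution was a hard technical problem or simply a slow assembly of the response team.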
That was my final recommendation. If we apply those five recommendations, this is what most likely is gonna happen: you’re gonna see the team becoming more efficient at getting ready to tackle or handle a major IT incident. You’ll see the resolution of the IT incident going faster. And because the resolution time’s gonna be shorter, depending on the cost to the business, you should see savings in terms of impact on the business.
You’ll be able to engage your IT response teams in five minutes or less. Another benefit of those solutions is reduced inbound call volumes. Why? Because proactive notification of the impacted customers and users will help prevent them from calling into your service desk to investigate, to figure out why an application is not working as it should be.
It can also reduce alert fatigue: not using emails but using targeted notifications means you only contact the people who are required on those restoration calls. Overall it can increase the team’s efficiency and performance.