First 4 Steps to Recover From An IT Outage

Whether it’s because a rushed user downloaded ransomware or an overloaded admin missed a critical patch, it’s not a question of “if” you’ll be hit by an outage or security breach but “when.” And in today’s digital world, when a critical server, network or application goes down, your business goes down with it.

Here are four critical steps to take in the first minutes after you learn of an outage, and what you need to do beforehand so you’ll be ready to hit the ground running.  

1. Assess the severity

This is essential to deciding how many, and which, resources to assign to the incident. This assessment should also tell you how soon you’ll start incurring external penalties (such as regulatory fines or lost sales) or internal penalties (such as breached service level agreements or damage to the reputation of the IT group).

To prepare: Work with your business stakeholders to create standard definitions for various incident levels based on which systems are most critical to your business. For example, for a hospital, patient monitoring and clinical records systems will be more critical than its public-facing web site. For an e-commerce company, the public web site will be the top priority.
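
One way to make those definitions concrete is to write them down as a lookup table that anyone on call can apply. The sketch below is only an illustration: the tier numbers, the system names, the three-level `Severity` scale and the `assess_severity` helper are all assumptions, not part of any standard or product.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical incident levels; 1 is the most severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Assumed criticality tiers agreed with business stakeholders
# (1 = most critical), using the hospital example from the text.
HOSPITAL_TIERS = {
    "patient-monitoring": 1,
    "clinical-records": 1,
    "public-website": 3,
}


def assess_severity(system: str, tiers: dict[str, int], full_outage: bool) -> Severity:
    """Map an affected system to an incident level.

    Systems missing from the table default to tier 2 so they get
    reviewed rather than ignored. A partial degradation is rated one
    level lower than a full outage, never below SEV3.
    """
    tier = tiers.get(system, 2)
    level = tier if full_outage else min(tier + 1, 3)
    return Severity(level)
```

An e-commerce company would use the same function with a different table, one that puts its public web site in tier 1.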

For more information, check out these blog posts on measuring the impact of an IT incident or this one on determining urgency, or watch this ITIL deep dive webinar on prioritization and escalation.

2. Assign a major incident manager to the case

Having a single “coach” in charge, with an overall view of the problem and its impact on the business, vastly improves your chances of an efficient and effective fix.

To prepare: Find, and assign, someone to this task beforehand. Technical skills, while valuable, aren’t as essential as two “soft” skills. The first is project management: the ability to keep track of complex, constantly changing workflows and make sure everyone involved is completing their tasks on time, or at least alerting someone if they can’t. The other essential skill is communication: the ability to clearly explain to every stakeholder the status of the issue and what needs to happen to resolve it.

For more information, download our white paper on streamlining the major incident resolution process, with tips on staffing your MIM and response teams.

3. Have your collaboration tools in place

When you’re in the midst of a critical outage, your stakeholders cannot afford to sit idle while they wait to hear from each other.

To prepare: Based on your remediation plans and the responders who will execute them, define who will need what types of information, and when. Then make sure you have the communication systems in place to deliver that information as quickly as needed. Be sure to take into account work schedules, varying time zones and the telecommunications infrastructure used by your global support teams. This may require multi-modal communications (a mix of email, text, voice and other channels) to confirm each responder knows what they are supposed to do and has committed to doing it. Be sure to include everyone affected, including business managers, customers and external partners.
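
The multi-modal idea above can be sketched as a loop that tries each channel in turn until the responder confirms. This is a minimal illustration under stated assumptions: the `Sender` type and `notify_until_acknowledged` function are hypothetical names, and a real sender would wrap whatever email, SMS or voice gateway you actually use.

```python
from typing import Callable, Iterable, Optional

# A sender takes (responder, message) and returns True only if the
# responder acknowledged. This signature is an assumption for the sketch.
Sender = Callable[[str, str], bool]


def notify_until_acknowledged(
    responder: str,
    message: str,
    channels: Iterable[tuple[str, Sender]],
) -> Optional[str]:
    """Try each channel in order.

    Returns the name of the first channel on which the responder
    acknowledged, or None if no channel produced an acknowledgement.
    """
    for name, send in channels:
        if send(responder, message):
            return name
    return None
```

A `None` result is the signal to escalate to the next person on the schedule rather than wait.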

4. Collect and share remediation information with those who need it

To prepare: Configure your monitoring and alerting systems so the right stakeholders can get the information they need as quickly and easily as possible. Give resolvers all the information they need in the initial communication, and allow easy access and links to relevant chat channels and/or conference bridges so they can quickly collaborate. The ITIL (Information Technology Infrastructure Library) framework offers many best practices for getting started. Linking these systems to an IT response automation system can facilitate the information sharing, conference bridging and other collaboration steps required for all personnel to resolve the issue as quickly as possible. As you share information, be sure you are also tracking whether the proper remediation steps are being taken and, if not, whether corrective action is underway.
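
As a rough sketch of what "all the information in the initial communication" might look like, the example below bundles the summary, collaboration links and remediation steps into one message and reports which steps are still open. The `IncidentAlert` and `RemediationStep` classes and their field names are assumptions made for illustration; they do not reflect any particular alerting product.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationStep:
    description: str
    owner: str
    done: bool = False


@dataclass
class IncidentAlert:
    """Hypothetical first message sent to resolvers."""
    summary: str
    severity: str
    chat_channel: str        # link or name of the chat channel
    conference_bridge: str   # dial-in or meeting link
    steps: list[RemediationStep] = field(default_factory=list)

    def render(self) -> str:
        """Build the text of the alert, one line per item."""
        lines = [
            f"[{self.severity}] {self.summary}",
            f"Chat: {self.chat_channel}",
            f"Bridge: {self.conference_bridge}",
        ]
        for number, step in enumerate(self.steps, start=1):
            mark = "x" if step.done else " "
            lines.append(f"{number}. [{mark}] {step.description} ({step.owner})")
        return "\n".join(lines)

    def outstanding(self) -> list[RemediationStep]:
        """Return the steps that have not been completed yet."""
        return [step for step in self.steps if not step.done]
```

Checking `outstanding()` at regular intervals gives the incident manager a simple way to see whether remediation is on track or needs corrective action.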

With these four steps you’re well on your way to minimizing the cost and disruption of an IT service outage or security event. But you’ll only be able to get this fast start with the proper preparation before an incident begins.

Dec 11th, 2017
