Remediation comes right after the investigation phase. Once all your stakeholders are fully informed about an IT issue and have agreed on a restoration plan, it’s time to start actually fixing things. This requires orchestration: carrying out a properly sequenced, detailed plan that ensures every step required of a person or a machine is completed correctly to restore critical business services or end a security breach.
What Orchestration Means
- A properly sequenced plan. This might mean, for example, taking into account dependencies among multiple applications, meeting overall Recovery Time Objectives (RTOs) or ensuring automated repair sequences have run on RAID arrays that may have been corrupted by abrupt shutdowns before restoring the applications that rely on them.
- A properly detailed plan. For example, if the restoration of a system at a remote office requires sending equipment or staff to that office, the plan should explain where to purchase the required equipment and who will arrange transportation of the required staff.
- A properly assigned plan. The automated plan should make clear who is responsible for carrying out which steps, and to whom a function should be escalated if the “first responder” is unavailable. For example, if the usual DBA who would make a required change to a database is on vacation, who is the backup person, how will they be notified, and how will you know they received the message and have taken the required steps?
- A properly automated plan. Automation not only shortens your mean time to respond, but also helps ensure the required steps are executed correctly and consistently. It reduces the chance of human error and, by sending notifications when steps are not taken or responsible parties fail to respond, helps ensure no steps fall through the cracks.
- A plan covering both machine and human actions. While some recovery steps (such as approvals) can and should be taken only by humans, many others can and should be initiated automatically by machines. These include checking the validity of backups before restoring them and creating snapshots of system images before an incident to speed recovery.
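As a minimal sketch of the sequencing and assignment points above, the Python below orders restoration steps by their dependencies and picks the first available person from an escalation chain. The step names, the people, and the `is_available` check are hypothetical placeholders, not part of any particular product.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical restoration steps, each mapped to the steps that must finish first.
DEPENDENCIES = {
    "raid_repair": set(),
    "database": {"raid_repair"},
    "app_server": {"database"},
    "web_frontend": {"app_server"},
}

# Hypothetical escalation chains: first responder first, backups after.
ESCALATION = {
    "database": ["usual_dba", "backup_dba", "dba_manager"],
}


def restoration_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def pick_responder(chain, is_available):
    """Return the first available person in the chain, or None if nobody is."""
    for person in chain:
        if is_available(person):
            return person
    return None


if __name__ == "__main__":
    print(restoration_order(DEPENDENCIES))
    # Assume the usual DBA is on vacation, so the plan escalates to the backup.
    on_vacation = {"usual_dba"}
    print(pick_responder(ESCALATION["database"], lambda p: p not in on_vacation))
```

In a real plan the availability check would come from an on-call or scheduling system, and a `None` result would itself trigger a notification so the step does not silently stall.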
What It Requires
- A unified plan for remedial action across silos and teams such as IT Security Information and Network Operations Centers, IT Infrastructure, server and network teams, application groups, service desks and, as needed, those responsible for functions such as Business Continuity and Disaster Recovery.
- Integrations with workflow platforms, whether on-site, cloud or hybrid. This also includes Release and Runbook automation tools.
- The ability to automate the flow of work, the flow of information and the human-approved checkpoints flexibly enough to accommodate even a complex series of interlocking processes involving multiple internal and external players.
- A closed-loop solution that can not only gather, analyze and present information from every relevant IT monitoring, ticketing and reporting system, but also proactively communicate updates while eliminating the multiple logins and manual data entry that cost time and increase the likelihood of human error.
- Agreed-upon rules for which players need to perform which functions at each step in the remediation process, to reduce the human error that can cause subsequent outages.
- A unified collaboration platform to support your orchestration. Every minute counts, and wherever your engineers and IT resolvers might be, they need to be able to quickly start collaborating with one another, on a global conference call and via ChatOps, all of which should be integrated into your response system.
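One way to picture automated work flowing through human-approved checkpoints is the sketch below. It runs steps in order, pauses at any step flagged as needing approval, and sends a notification after each outcome. The `Step` type and the `approve` and `notify` callbacks are assumptions made for illustration; a real system would wire them to its ticketing and chat tools.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    needs_approval: bool = False  # True means a human must sign off first


def run_plan(steps: List[Step],
             approve: Callable[[str], bool],
             notify: Callable[[str], None]) -> List[str]:
    """Run steps in order; halt at the first checkpoint that is not approved."""
    completed = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            notify(f"Halted: approval for '{step.name}' not granted")
            break
        step.action()
        completed.append(step.name)
        notify(f"Completed: {step.name}")
    return completed


if __name__ == "__main__":
    plan = [
        Step("verify_backup", lambda: None),
        Step("restore_database", lambda: None, needs_approval=True),
        Step("restart_application", lambda: None),
    ]
    # Here every checkpoint is approved; updates are simply printed.
    run_plan(plan, approve=lambda name: True, notify=print)
```

Keeping approval and notification as callbacks means the same plan can be exercised in a drill with stand-in functions and then run for real with the production integrations.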
As one disaster recovery manager put it, “Sensible automation…reduces people dependencies and human errors to help us build competitive advantage, not so much against our competitors, but against the event itself by reducing latency between event trigger and event response.”