No matter how well you document your IT configuration, change, and testing processes, it doesn’t take much for someone to accidentally cause a slowdown, outage, or security issue. Human error is estimated to cause almost a quarter of data center failures, and we’ve all made enough mistakes to know it’s impossible to prevent them all.
Some of the most public and expensive outages have hit the airline industry. This spring’s failure of the British Airways system responsible for passenger booking, baggage handling, mobile apps and check-in desks resulted in more than 1,000 cancelled flights and tens of thousands of stranded passengers. Shares in the parent company dropped significantly following the outage, which reportedly cost the airline $91.6 million and untold damage to its brand.
While details about the cause remain unclear, they seem to revolve around an outside contractor’s unplanned cutoff and restoration of power to the servers running these systems. The involvement of an outside contractor is especially relevant given the often complex support structures at modern organizations, where multiple outside vendors provide hardware and software support that used to be handled by a single internal IT organization.
A similar failure that hit Delta Air Lines in August 2016 caused the cancellation of around 1,780 flights, at an estimated cost of about $30 million. The Wall Street Journal reported the outage was caused by the failure of an Automatic Transfer Switch that was supposed to move the power load to an alternate source such as a generator if the primary power failed.
Just as damaging as the initial failure is an outage that drags on and on, leaving you unable to give your customers or even your own employees accurate updates about the situation. The middle of an emergency is too late to start avoiding such problems.
A better approach is to define procedures, proactively develop the proper recovery workflows, communicate them to all stakeholders, and automate the alerting and response process for when something does go wrong. Here are ten ways this approach might have minimized the damage from some major outages with airline ticketing and operation systems.
- Develop standard processes for shutting down and restoring servers, network gear, and their power supplies, including each step to be performed, its expected result, and how to return to a previous safe state if a change produces an unexpected result.
- Require approval by management for any changes to these processes.
- Create advisories within the workflow about what warning signs, such as noises or error messages, might signal various failures.
- Develop safeguards (such as required approval by a second person) for any action taken during the restoration process that could disrupt business-critical systems.
- Conduct periodic tests of the configuration and status of backup power systems and the switches that move the power load to backup sources.
- Conduct periodic audits to ensure that any new or upgraded hardware is covered by the backup and restoration plan and is properly configured for power and connectivity failover.
- Require periodic tests of the recovery of servers and databases and the rapid updating of production data.
- Share the approved processes with all outside service providers and internal stakeholders to ensure they are followed.
- Automate response workflows for when something does go wrong (e.g., when an outage is detected) to ensure that information reaches the right resolvers and stakeholders, such as customers, in a timely manner.
- Keep audit trails to verify which resolvers received which information, whether they confirmed the receipt of the information, and if they took the required action.
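The last two points, automated alert routing and an audit trail of who was notified and who acknowledged, can be sketched in a few dozen lines. This is a minimal illustration, not a production tool: the names (`IncidentWorkflow`, `Resolver`, `AuditEntry`) and the routing scheme are hypothetical, and a real system would persist the log and actually send the notifications.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Resolver:
    """A person or team that can be paged about a failure (hypothetical model)."""
    name: str
    role: str

@dataclass
class AuditEntry:
    """One record in the audit trail: who got what message, and whether they confirmed it."""
    resolver: str
    message: str
    sent_at: str
    acknowledged: bool = False

class IncidentWorkflow:
    """Routes an outage alert to the resolvers mapped to its failure category,
    and keeps an audit trail of every notification."""

    def __init__(self, routing: dict):
        self.routing = routing          # failure category -> list of Resolvers
        self.audit_log: list = []

    def trigger(self, category: str, message: str) -> list:
        """Notify every resolver assigned to this category; log each notification."""
        entries = []
        for resolver in self.routing.get(category, []):
            entry = AuditEntry(
                resolver=resolver.name,
                message=message,
                sent_at=datetime.now(timezone.utc).isoformat(),
            )
            self.audit_log.append(entry)   # audit trail of who received what
            entries.append(entry)
        return entries

    def acknowledge(self, resolver_name: str) -> None:
        """Mark this resolver's notifications as confirmed."""
        for entry in self.audit_log:
            if entry.resolver == resolver_name:
                entry.acknowledged = True

    def unacknowledged(self) -> list:
        """Resolvers who were notified but have not confirmed receipt."""
        return [e.resolver for e in self.audit_log if not e.acknowledged]

# Example: a power failure pages both the facilities and database on-call.
wf = IncidentWorkflow({
    "power-failure": [Resolver("ops-oncall", "facilities"),
                      Resolver("dba-oncall", "database")],
})
wf.trigger("power-failure", "UPS failover did not engage; primary feed lost")
wf.acknowledge("ops-oncall")
print(wf.unacknowledged())  # -> ['dba-oncall'], a candidate for escalation
```

The unacknowledged list is what drives escalation: anyone still on it after a timeout gets re-paged or replaced, so an alert can never silently go nowhere.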
None of these steps are rocket science. You’re probably doing some of them, some of the time. But as airlines (and other industries) keep finding out, there is no way to anticipate every possible mistake. You need a response plan ready to go before things go south.