Tech outages of 2016 and how to prevent them in 2017

How to make sure your brand isn’t negatively impacted by downtime.

Downtime

In 2016, major downtime events cost a number of highly recognizable brands revenue and dealt a severe blow to their reputations and to consumer confidence. One of the most common causes of outages is an unplanned configuration change to a system, often made when an immediate fix for a bug or potential system vulnerability unintentionally creates a much larger problem.

To avoid unexpected downtime, BigPanda recommends that companies take the steps below to ensure the availability and reliability of their services. But first, let’s review some of the computer- and server-related outages of the past 12 months.

Southwest Airlines

About 836 Southwest flights were delayed in October in what was described as a problem related to the airline’s technology systems. Employees had to work around issues with primary systems and use backup procedures to get customers and their checked luggage to their destinations, the airline said.

Delta Air Lines

The airline confirmed in an update that a power outage in Atlanta that started at 2:30 a.m. Eastern Time had affected its computer systems and operations worldwide, leading to flight delays. It warned of large-scale flight cancellations on Monday and said that airport screens and other flight status systems were incorrectly showing flights as on time.

BigPanda said the 5-hour outage led to 2,000 cancelled flights and an estimated $150 million in losses.

Salesforce

The cloud applications company said on its website that the disruption, which lasted more than 12 hours, resulted from a database failure on the NA14 instance that introduced a file integrity issue in the NA14 database.

BigPanda said the estimated revenue impact was approximately $20 million.

Apple

In June, internet services such as iCloud, the App Store, iTunes and Apple TV were down for 9 hours. In early December, users were also unable to access their iCloud accounts.

Slack

In June, 3 million users lost access to Slack for 2 hours when its web servers were overwhelmed.

Now, on to how to prevent these situations.

Identify what is mission critical

To avoid unexpected downtime, BigPanda recommends that IT Ops teams tier their services and identify the systems that are mission critical to the business. Top-tier applications should include those that are directly linked to the success or failure of the business, such as point-of-sale, ticketing, or billing.
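
As a rough illustration of tiering, the Python sketch below sorts services into three tiers. The two flags (revenue_critical, customer_facing), the three-tier scheme, and the sample service names other than point-of-sale and ticketing are assumptions made for this example, not criteria from BigPanda.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    revenue_critical: bool  # assumption: an outage directly stops sales or billing
    customer_facing: bool   # assumption: customers notice an outage right away


def assign_tier(service: Service) -> int:
    """Return 1 for mission-critical services, 2 for customer-visible ones, 3 otherwise."""
    if service.revenue_critical:
        return 1
    if service.customer_facing:
        return 2
    return 3


# Hypothetical inventory used only to exercise the function.
services = [
    Service("point-of-sale", revenue_critical=True, customer_facing=True),
    Service("ticketing", revenue_critical=True, customer_facing=True),
    Service("status-page", revenue_critical=False, customer_facing=True),
    Service("internal-wiki", revenue_critical=False, customer_facing=False),
]

tiers = {s.name: assign_tier(s) for s in services}
```

In practice the criteria would come from the business itself; the point is only that tier membership is written down and reviewable rather than implied.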

Develop an ironclad failover plan for top-tier systems

Offering a high level of availability is not something that happens by chance. It must be carefully planned for in every aspect of the system architecture. Top-tier systems should be bolstered by an ironclad failover plan – one that carefully plans for load capacity to handle unexpected spikes.
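
One way to make "plans for load capacity" concrete is a simple headroom check: can the remaining nodes carry a spiked peak load if the largest node fails? The Python sketch below is a minimal version of that check. The function name, the requests-per-second units, and the default spike factor of 1.5 are all assumptions chosen for illustration.

```python
def survives_largest_node_loss(node_capacities, peak_load, spike_factor=1.5):
    """Check whether the remaining nodes can absorb a spiked peak load
    after the single largest node fails.

    node_capacities: per-node capacity, e.g. requests per second
    peak_load: normal observed peak, in the same units
    spike_factor: assumed multiplier for unexpected spikes
    """
    if len(node_capacities) < 2:
        return False  # a single node has nothing to fail over to
    remaining = sum(node_capacities) - max(node_capacities)
    return remaining >= peak_load * spike_factor


# Three nodes of 1,000 req/s each, normal peak of 1,200 req/s:
# losing one node leaves 2,000 req/s against a spiked load of 1,800 req/s.
three_nodes_ok = survives_largest_node_loss([1000, 1000, 1000], peak_load=1200)

# Two nodes of the same size: losing one leaves only 1,000 req/s.
two_nodes_ok = survives_largest_node_loss([1000, 1000], peak_load=1200)
```

A real failover plan also has to cover data replication, failover time, and regular testing; this check addresses only the capacity question.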

Invest in a best-of-breed monitoring stack

You can’t protect against what you don’t see coming. In the age of continuous integration and continuous delivery, the only way to keep an accurate pulse on the health of your IT systems is to implement the best monitoring tool for each layer of your stack (e.g., systems monitoring, application monitoring, web and user monitoring, logging and error tracking). The industry is rapidly replacing monolithic monitoring architectures with this “best-of-breed” approach to better serve increasingly complex and dynamic IT systems.

Implement alert correlation to distinguish signal from noise

More tools, monitoring more moving parts, lead to more noise. It’s a simple fact. To efficiently identify, triage, and remedy potential issues before they have the chance to do real damage, IT teams need a way to separate the signal (i.e., the real problem) from the many sources of noise. By implementing an alert correlation solution, IT teams can see how alerts from their various monitoring tools are related, allowing them to quickly filter out non-critical issues and focus on what matters most.
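
To give a feel for what correlation means in practice, here is a deliberately simple Python sketch that groups alerts for the same service when they arrive within a few minutes of each other. It is not how BigPanda or any particular product works. The Alert fields, the 300-second window, and the tool and service names in the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str      # the service the alert is about
    source: str       # hypothetical name of the monitoring tool that raised it
    message: str


def correlate(alerts, window_seconds=300):
    """Group alerts into incidents: an alert joins the latest incident for its
    service if it arrives within window_seconds of that incident's last alert."""
    incidents = []
    latest_for_service = {}  # service -> most recent incident (a list of alerts)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_for_service.get(alert.service)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            incidents.append(current)
            latest_for_service[alert.service] = current
    return incidents


# Hypothetical alerts from three different tools.
sample = [
    Alert(0, "billing", "system-monitor", "CPU high"),
    Alert(60, "billing", "app-monitor", "latency above threshold"),
    Alert(90, "search", "uptime-check", "request timed out"),
    Alert(1000, "billing", "system-monitor", "disk nearly full"),
]

incidents = correlate(sample)
```

Here four alerts collapse into three incidents, and the first incident shows two tools reporting on the same billing problem. Production systems correlate on much richer signals, such as topology and alert content, but the goal is the same: fewer, more meaningful items for a person to triage.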