Security leaders hindered by lack of automation to respond to incidents

Vincent Geffray shares insights gained from a new survey assessing the ability of leaders to respond to IT outages and security incidents, with suggestions on where to focus for rapid improvement

Incidents, especially security incidents, get a lot of attention.

That attention creates pressure to perform. In turn, we focus on prevention, speed of detection, and appropriateness of response. In that process, then, how do you actually respond to the incident? How automated is your response? How long does it take to rally the team and get moving in the right direction?

The deeper we dive into the process, the more questions we need to consider.

Everbridge just published the "State of IT Incident Management: Lack of Automation Hinders Speed of Response to IT Outages and Incidents" report, with some interesting insights.

Vincent Geffray (LinkedIn, @VGeffray), Senior Director of Product Marketing at Everbridge, and I talked about the survey's findings. Vincent has over fifteen years of experience in the IT Operations and Service Management space, with expertise in Critical Communications, IT service alerting, Application Performance Management, IT Process, Runbook and Workload Automation.

Despite the pressure to perform and a growing awareness of the business impact, organizations have some surprising opportunities to improve and automate. Those that do end up better protecting their organizations and saving money in the process.

Our discussion centered on some key definitions, two surprise findings, an interesting conclusion, and some actionable steps for someone to follow today.

I liked the way you defined prevention and preparation. Can you explain how those concepts work for incident response?

The digital transformation we’ve seen companies go through forces businesses to rely more and more on their IT infrastructure, whether it’s hosted on-premises or in the cloud. This has exposed businesses to major IT incidents: major disruptions of IT service due, for instance, to a cyberattack or DDoS attack, a network outage, a hardware failure, a datacenter outage, a website slowdown, an ERP failure, or a hospital’s central Electronic Health Record system being partially unavailable. As with earthquakes along the San Andreas Fault, with IT incidents the question is not “if” but “when” they’ll strike. Having said that, the key to minimizing the impact of IT issues and restoring services as quickly as possible is to consider both prevention and preparation.

With regard to prevention, companies have adopted anti-intrusion solutions, system redundancy, clustering or high availability architectures.

Now, because failing to prepare is preparing to fail, IT organizations must be ready to react promptly to major incidents, at any time, to minimize the damage to the business. That means being prepared for a range of situations, including outages, failures and cyberattacks. This extends to internal processes and actions, to the tools, and also to the people who are key in these crisis situations.

One of the surprises in the survey was the lack of automation in the way organizations respond to incidents. What did you learn and what do you make of it?

Even though more than 90% of the companies we surveyed have invested in ticketing and IT Service Management systems, it was a surprise for us to see that the time to engage the first IT responders was on average 27 minutes. That’s 27 minutes between the time a major incident has been categorized and the time the IT experts start investigating. It is really important to realize that during this time, the number of impacted customers or users continues to grow, and so do the frustration and the effect on the business’s bottom line.

What we see here is totally in line with other recent studies, which showed that the biggest inefficiencies in the major incident process are found in response team engagement and root cause investigation. We see very little automation being leveraged in these areas, which is an opportunity for organizations to track and eliminate wasted time, reduce cost and become more predictable and efficient. Everbridge’s experience with customers who have automated the IT response process shows that companies could reduce this average time to 5 minutes or less. That represents roughly $190,000 in savings per major IT incident (22 minutes saved at the average cost of $8,662 per minute this survey shows).
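The savings figure above is simple arithmetic, and it can be worth re-running with your own numbers. The following is a minimal sketch in Python; the constants are the survey averages quoted in this interview, and the function name is made up for illustration. It assumes that every minute shaved off the time to engage also shortens the outage by a minute.

```python
# Survey averages quoted in the interview; substitute your own figures.
COST_PER_MINUTE = 8_662      # average cost of a major IT incident, USD per minute
CURRENT_ENGAGE_MIN = 27      # average time to engage first responders
TARGET_ENGAGE_MIN = 5        # time to engage reported with automation


def savings_per_incident(current_min, target_min, cost_per_min):
    """Cost avoided per incident by shortening the time to engage.

    Assumes each minute saved on engagement shortens the outage by one minute.
    """
    minutes_saved = current_min - target_min
    return minutes_saved * cost_per_min


saving = savings_per_incident(CURRENT_ENGAGE_MIN, TARGET_ENGAGE_MIN, COST_PER_MINUTE)
print(f"Minutes saved per incident: {CURRENT_ENGAGE_MIN - TARGET_ENGAGE_MIN}")
print(f"Savings per incident: ${saving:,}")
```

With the survey averages this works out to $190,564, which is where the rounded $190,000 comes from.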

Another surprise was the reliance on email as an alerting and coordination function. What did you find?

Exactly, the survey shows that responding to time-sensitive critical situations is still a fairly manual and unpredictable process. 83% of the organizations rely on email to contact people and communicate with them during incident resolution. In addition, two thirds (66%) of companies have distributed IT organizations with people spread across multiple locations and time zones. 39 percent have more than 25 people on their IT response teams, 28% have more than 50 people who need to be coordinated to respond to an incident, and 16% have more than 100. You can see how communication and collaboration can quickly become a big challenge without any sort of process automation.

The main reason organizations should not build their notification process on email is that email systems can go down just like any other business application. If a business relies on email alone and the incident itself brings down the email system, the incident has also brought down the ability to respond to it.

In addition, emails themselves have their weaknesses. They do not provide any visibility into whether or not recipients have received the email, and emails don’t wake people up in the middle of the night.

We see that extensive use of email can quickly lead to “alert fatigue,” as too many people receive too many emails. The importance of each message tends to fade very quickly, which can explain why the time to engage is so high.

IT professionals are obsessed with optimizing the performance of systems, but haven’t applied this mindset to their own communications. 43 percent of respondents reported that at least part of their process relies on manually calling and reaching out to people to activate the incident response team. Only 11 percent reported using an IT Alerting tool to automate the process. These systems can improve response by reaching people through multiple modalities, using on-call schedules to see who is available, automatically escalating to additional people if designated primary contacts do not respond, automatically organizing conference bridges, and providing an audit trail of performance.
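To make the escalation idea concrete, here is a small Python sketch of the logic such a tool might follow: skip anyone who is not on call, try each person's channels in order, escalate when nobody acknowledges, and record every step. It is not how any particular product works. The class and function names are hypothetical, and the acknowledges callback is a stand-in for real message delivery and acknowledgement tracking.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Responder:
    name: str
    channels: List[str]      # ordered by preference, e.g. ["push", "sms", "voice"]
    on_call: bool = True     # would come from an on-call schedule in practice


@dataclass
class AuditTrail:
    entries: List[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        self.entries.append(message)


def engage(
    chain: List[Responder],
    acknowledges: Callable[[Responder, str], bool],
    audit: AuditTrail,
) -> Optional[Responder]:
    """Walk the escalation chain until someone acknowledges the alert."""
    for level, responder in enumerate(chain, start=1):
        if not responder.on_call:
            audit.log(f"level {level}: {responder.name} skipped (not on call)")
            continue
        for channel in responder.channels:
            audit.log(f"level {level}: notified {responder.name} via {channel}")
            if acknowledges(responder, channel):
                audit.log(f"level {level}: {responder.name} acknowledged via {channel}")
                return responder
        audit.log(f"level {level}: no response from {responder.name}, escalating")
    audit.log("escalation chain exhausted")
    return None


# Hypothetical escalation chain for illustration.
chain = [
    Responder("primary", ["push", "sms"]),
    Responder("secondary", ["push", "sms", "voice"], on_call=False),
    Responder("manager", ["sms", "voice"]),
]


def simulated_ack(responder: Responder, channel: str) -> bool:
    # Simulated behaviour: only the manager answers, and only a voice call.
    return responder.name == "manager" and channel == "voice"


audit = AuditTrail()
owner = engage(chain, simulated_ack, audit)
print(owner.name if owner else "nobody engaged")
for entry in audit.entries:
    print(entry)
```

In a real deployment each notification would wait for a timeout before moving on, and the audit trail would carry timestamps so the time to engage can be measured afterwards.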

A positive sign is the understanding of the business impact. But that also reveals a missed opportunity in a lot of organizations. How so?

Yes, more than 90% of the companies which participated in our survey say they’ve experienced significant business impact due to IT incidents or outages. The average cost reported is $8,662 per minute, which equates to more than half a million US dollars per hour, every time an outage occurs. The maximum cost reported in this survey is $100,000 per minute.

47 percent of companies have experienced a major IT outage or incident six times in the past 12 months; 36% experience them close to monthly (11 or more times per year). More than a quarter of respondents reported that their companies experienced more than 21 incidents last year, which is almost two major incidents per month. Only 9% of respondents reported that their organization did not experience a major IT outage or incident in the past year.

The good news is that people recognize the business impact and they understand financial implications.

However, connecting the impact to the number of incidents is not yet spurring action. This could be due to lack of focus, tight budgets, competition with other higher priority projects, or lack of knowledge about options to improve the processes.

Where should a security leader get started to explore how they can improve the coordination of incident response?

The very first thing to do is to audit the processes that are currently in place to see how well the IT organization is responding to and communicating during major IT incidents. Based on this assessment, the processes should be either improved or course corrected. Meanwhile, the tools people use for crisis communication and collaboration, or the absence of tools, should be reviewed as well. The questions I ask IT leaders are: How do you categorize major IT incidents? What process or procedure do you follow in this case? Can automation be leveraged to improve the process, communication and/or collaboration? Is the process predictable, repeatable, and efficient?

Specifically, look to measure the time to respond, from categorization to assembling the right team.

IT leaders should also assess the cost of inefficient processes to their business. They should start by looking at the number of major incidents they had in the past year. Look at the average duration of these outages and calculate the overall downtime and ultimately the costs to the business. As a goal, organizations should aim for a time to engage of 5 minutes. And given their actual time to engage, they should be able to calculate what automation would save them.
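As a starting point for that assessment, the Python sketch below works from a simple incident log. The three incidents are invented for illustration, the cost per minute is the survey average rather than a figure for any particular business, and the savings estimate assumes downtime shrinks minute for minute with the time to engage.

```python
from datetime import datetime
from statistics import mean

COST_PER_MINUTE = 8_662    # survey average in USD; substitute your own figure
TARGET_ENGAGE_MIN = 5      # goal suggested in the interview

# Hypothetical incident log: (categorized, team engaged, service restored)
incidents = [
    ("2017-01-10 09:00", "2017-01-10 09:30", "2017-01-10 11:00"),
    ("2017-03-02 22:15", "2017-03-02 22:35", "2017-03-02 23:45"),
    ("2017-06-21 14:00", "2017-06-21 14:25", "2017-06-21 15:10"),
]


def minutes_between(start, end):
    """Elapsed minutes between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Time to engage: from categorization to the response team being assembled.
engage_times = [minutes_between(cat, eng) for cat, eng, _ in incidents]
# Downtime: from categorization to service restoration.
downtimes = [minutes_between(cat, res) for cat, _, res in incidents]

avg_engage = mean(engage_times)
total_downtime = sum(downtimes)
total_cost = total_downtime * COST_PER_MINUTE

# Minutes that would be recovered if every incident met the engagement target.
recoverable = sum(max(t - TARGET_ENGAGE_MIN, 0) for t in engage_times)
potential_savings = recoverable * COST_PER_MINUTE

print(f"Incidents: {len(incidents)}")
print(f"Average time to engage: {avg_engage:.0f} min")
print(f"Total downtime: {total_downtime:.0f} min")
print(f"Estimated cost of downtime: ${total_cost:,.0f}")
print(f"Estimated savings at a {TARGET_ENGAGE_MIN}-minute time to engage: ${potential_savings:,.0f}")
```

Feeding this from ticketing system exports rather than hand-entered timestamps keeps the measurement repeatable from one review to the next.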

Copyright © 2017 IDG Communications, Inc.
