Business Continuity Event Planning: Framework for root cause and continuous improvement analysis

This is the final post in the Business Continuity Event Planning series.  We close with a look at how to manage the process, improve response and recovery, and prevent recurrence of an event.

In the previous posts, we stepped through the phases leading to a documented and tested Business Continuity Event Plan (BCEP).

Once all processes are restored, two additional activities should be performed: a post-event response review and an event root cause analysis.

General considerations

Before we look at each of these activities, we should examine guidelines which apply to both.  Failure to follow these recommendations may produce results which fall short of management’s expectations.

  1. Don’t assign or discuss blame.  The purpose of these activities is to improve performance and reduce the frequency and impact of continuity events.  If participants know they might be sanctioned or ridiculed for what is said during discovery meetings, they will resist sharing their observations or opinions.
  2. Don’t immediately dismiss any comment as irrelevant or impossible.  As in brainstorming sessions, post-event and root cause analysis meetings should be open, encouraging more rather than less participation.  No comment related to the topic discussed should be dismissed until viewed within the context of all other information provided.  I’ve often been involved in root cause analysis meetings where a comment that drew chuckles and raised eyebrows early in the discussion, or a version of it, turned out to be a key to part or all of the chain of events.
  3. Invite everyone who had anything to do with the event.  Minimally, this includes:

    1. Business users affected by process interruption
    2. The incident response team
    3. Technical staff responsible for recovery
  4. Use non-management personnel to facilitate the meetings.  This might not be necessary, depending on organizational culture and the managers involved.  The important outcome of this guideline is assurance that the facilitator doesn’t intentionally or inadvertently steer these activities toward recommendations that lean unjustifiably toward a manager’s perspectives or opinions, rather than balancing them with those of the people who were actually in the trenches.  Most managers would not consciously make this mistake.  But strong managers, or managers who tend to intimidate staff, might by their mere presence cause less valuable outcomes.

Post-event response review

The purpose of this review is to determine whether the planned response and recovery activities worked as expected.  It should answer the following questions:

  • What happened?
  • What was supposed to happen?
  • What were the gaps?
  • What can we do to eliminate the gaps?

As we’ve seen during this series, the BCEP is based on management expectations concerning critical process downtime, workarounds, and recovery timelines.  If one or more of the expectations were not met, business impact might have exceeded acceptable thresholds.  Answering these four questions helps identify weaknesses in the response and recovery efforts, including:

  • Gaps in documentation
  • Team training requirements, including actual technology recovery testing at the alternate location
  • Incorrect assumptions about maximum tolerable downtime
  • Inaccurate recovery timelines due to incorrect assumptions about:

    • Availability of the alternate facility or facility resources
    • Vendor support availability
    • Product availability
    • Staff availability
    • Effectiveness of the communication plan
    • Actual times needed to recover technology
    • Actual times needed to implement manual workarounds
  • Additional dependent processes
  • Additional customer concerns or expectations

Once issues are identified, a discussion about how to remediate the gaps should result in an action plan designed to resolve the issues and update relevant documentation.
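As a rough illustration of comparing "what was supposed to happen" with "what happened," the following Python sketch lists the processes whose actual recovery time exceeded the planned time.  It is only a sketch: the RecoveryRecord type, the find_gaps function, and the process names and hours are all hypothetical and would be replaced by figures from your own BCEP and event logs.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRecord:
    """One critical process, with planned and observed recovery times in hours."""
    process: str
    planned_hours: float  # what was supposed to happen (from the BCEP)
    actual_hours: float   # what happened during the event


def find_gaps(records):
    """Return (process, overrun_hours) for each process that exceeded its plan,
    largest overrun first."""
    gaps = [
        (r.process, r.actual_hours - r.planned_hours)
        for r in records
        if r.actual_hours > r.planned_hours
    ]
    return sorted(gaps, key=lambda g: g[1], reverse=True)


# Hypothetical figures, for illustration only.
records = [
    RecoveryRecord("order entry", planned_hours=4, actual_hours=9),
    RecoveryRecord("payroll", planned_hours=24, actual_hours=20),
    RecoveryRecord("customer support phones", planned_hours=2, actual_hours=3.5),
]

for process, overrun in find_gaps(records):
    print(f"{process}: exceeded planned recovery time by {overrun:g} hours")
```

A list like this only answers the third question (what were the gaps).  Deciding how to eliminate each gap still requires the open discussion described above.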

Root cause analysis

In addition to improving response and recovery, you also want to look at the causes of the event and whether it could have been prevented or rendered less harmful to the business.  This is the purpose of event root cause analysis.

When looking at causes, it’s important not to stop at identifying and treating symptoms.  Doing so doesn’t effectively prevent recurrence.  Many approaches to root cause analysis are available on the Internet.  However, I prefer the simple “five question” approach described in “Prevent recurring problems with root cause analysis” (Olzak, TechRepublic, 10 September 2008).  This process is simple, intuitive, and doesn’t require complex diagram-building skills.

Briefly, the team clearly describes the event and then asks why it occurred.  Answering the first why should provide a description of the “proximate cause.”  The team must then answer why the proximate cause occurred.  This continues until a root cause is identified or a dead end is reached.  (See the article above for more details about root cause analysis.)
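The chain of repeated "why" questions can be recorded in a very simple structure.  The Python sketch below is one possible way to capture it, not the method from the referenced article: the trace_root_cause function, the event description, and the answers are all hypothetical.  Each answer becomes the subject of the next "why," and the walk stops when no further answer has been recorded.

```python
def trace_root_cause(event, answers):
    """Walk a chain of 'why?' answers, starting from the event description.

    `answers` maps a statement to the reason it occurred. The walk stops when
    a statement has no recorded reason (a root cause or a dead end).
    Returns the chain from the event to the last statement reached.
    """
    chain = [event]
    seen = {event}
    current = event
    while current in answers:
        current = answers[current]
        if current in seen:  # guard against circular answers
            break
        chain.append(current)
        seen.add(current)
    return chain


# Hypothetical event and answers, for illustration only.
answers = {
    "Order entry system was down for nine hours": "The database server lost power",
    "The database server lost power": "The UPS battery failed",
    "The UPS battery failed": "The battery was past its service life",
    "The battery was past its service life": "No one owns UPS maintenance scheduling",
}

chain = trace_root_cause("Order entry system was down for nine hours", answers)
for depth, statement in enumerate(chain):
    label = "Event" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {statement}")
```

In this made-up example, the first answer is the proximate cause, and the last statement is where the team would focus prevention.  Replacing the battery alone would treat a symptom.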

The final word

The only way to mitigate risk associated with business continuity events is to prepare.  It’s unreasonable to believe that events will never happen or that all business processes will continue to operate flawlessly.  Planning, training, and continuous improvement of response and recovery efforts make the most important difference between a business that successfully moves past an event and one that is seriously damaged.

Copyright © 2009 IDG Communications, Inc.
