Disaster recovery and business continuity planning are processes that help organizations prepare for disruptive events—whether those event might include a hurricane or simply a power outage caused by a backhoe in the parking lot. The CSO's involvement in this process can range from overseeing the plan, to providing input and support, to putting the plan into action during an emergency. This primer (compiled from articles on CSOonline) explains the basic concepts of business continuity planning and also directs you to more resources on the topic. Last update: 4/2/2012.
- What's the difference between disaster recovery and business continuity planning?
- What does a disaster recovery and business continuity plan include?
- How do I get started?
- Is it really necessary to disrupt business by testing the plan?
- What kinds of things have companies discovered when testing a plan?
- What are the top mistakes that companies make in disaster recovery?
- How do changes in technology trends affect BC/DR planning?
- Who should lead our BC/DR program? Where should it report?
- Can we outsource our contingency measures?
- How can I sell this business continuity planning to other executives?
- How do I make sure the plans aren't overkill for my company?
- (Also see related articles from CSO.)
A: Disaster recovery is the process by which you resume business after a disruptive event. The event might be something huge-like an earthquake or the terrorist attacks on the World Trade Center-or something small, like malfunctioning software caused by a computer virus.
Given the human tendency to look on the bright side, many business executives are prone to ignoring "disaster recovery" because disaster seems an unlikely event. "Business continuity planning" suggests a more comprehensive approach to making sure you can keep making money, not only after a natural calamity but also in the event of smaller disruptions including illness or departure of key staffers, supply chain partner problems or other challenges that businesses face from time to time.
Despite these distinctions, the two terms are often married under the acronym BC/DR because of their many common considerations.
FREE: Download CSOonline's Ultimate Guide to Business Continuity now! [11 page PDF free Insider registration is required]
All BC/DR plans need to encompass how employees will communicate, where they will go and how they will keep doing their jobs. The details can vary greatly, depending on the size and scope of a company and the way it does business. For some businesses, issues such as supply chain logistics are most crucial and are the focus on the plan. For others, information technology may play a more pivotal role, and the BC/DR plan may have more of a focus on systems recovery. For example, the plan at one global manufacturing company would restore critical mainframes with vital data at a backup site within four to six days of a disruptive event, obtain a mobile PBX unit with 3,000 telephones within two days, recover the company's 1,000-plus LANs in order of business need, and set up a temporary call center for 100 agents at a nearby training facility.
But the critical point is that neither element can be ignored, and physical, IT and human resources plans cannot be developed in isolation from each other. (In this regard, BC/DR has much in common with security convergence.) At its heart, BC/DR is about constant communication.
Business, security and IT leaders should work together to determine what kind of plan is necessary and which systems and business units are most crucial to the company. Together, they should decide which people are responsible for declaring a disruptive event and mitigating its effects. Most importantly, the plan should establish a process for locating and communicating with employees after such an event. In a catastrophic event (Hurricane Katrina being a relatively recent example), the plan will also need to take into account that many of those employees will have more pressing concerns than getting back to work.
A good first step is a business impact analysis (BIA). This will identify the business's most crucial systems and processes and the effect an outage would have on the business. The greater the potential impact, the more money a company should spend to restore a system or process quickly.
For instance, a stock trading company may decide to pay for completely redundant IT systems that would allow it to immediately start processing trades at another location. On the other hand, a manufacturing company may decide that it can wait 24 hours to resume shipping. A BIA will help companies set a restoration sequence to determine which parts of the business should be restored first.
Here are 10 absolute basics your plan should cover:
- Develop and practice a contingency plan that includes a succession plan for your CEO.
- Train backup employees to perform emergency tasks. The employees you count on to lead in an emergency will not always be available.
- Determine offsite crisis meeting places and crisis communication plans for top executives. Practice crisis communication with employees, customers and the outside world.
- Invest in an alternate means of communication in case the phone networks go down.
- Make sure that all employees-as well as executives-are involved in the exercises so that they get practice in responding to an emergency.
- Make business continuity exercises realistic enough to tap into employees' emotions so that you can see how they'll react when the situation gets stressful.
- Form partnerships with local emergency response groups—firefighters, police and EMTs—to establish a good working relationship. Let them become familiar with your company and site.
- Evaluate your company's performance during each test, and work toward constant improvement. Continuity exercises should reveal weaknesses.
- Test your continuity plan regularly to reveal and accommodate changes. Technology, personnel and facilities are in a constant state of flux at any company.
Useful Books on Business Continuity and Disaster Recovery
By Wallace and Webber (Anacom 2010)Protect vital operations, facilities and assets
By Kelley Okolita (CRC Press 2009)Includes numerous checklists and test scenerios
by Julia Graham et al (Rothstein Associates 2006)Case studies with a business focus
For more details, see this book excerpt on business impact analysis, including a sample BIA form.
Let us give you an example of a company that thinks tabletops and paper simulations aren't enough. And why their experience suggests they're right.
When [former] CIO Steve Yates joined USAA, a financial services company, business continuity exercises existed only on paper. Every year or so, top-level staffers would gather in a conference room to role-play; they would spend a day examining different scenarios, talking them out-discussing how they thought the procedures should be defined and how they thought people would respond to them.
Live exercises were confined to the company's technology assets. USAA would conduct periodic data recovery tests of different business units-like taking a piece of the life insurance department and recovering it from backup data.
Yates wondered if such passive exercises reflected reality. He also wondered if USAA's employees would really know how to follow such a plan in a real emergency. When Sept. 11 came along, Yates realized that the company had to do more. "Sept. 11 forced us to raise the bar on ourselves," said Yates.
Yates engaged outside consultants who suggested that the company build a second data center in the area as a backup. After weighing the costs and benefits of such a project, USAA initially concluded that it would be more efficient to rent space on the East Coast. But after the attack on the World Trade Center and Pentagon, when air traffic came to a halt, Yates knew it was foolhardy to have a data center so far away. Ironically, USAA was set to sign the lease contract the week of Sept. 11.
Instead, USAA built a center in Texas, only 200 miles away from its offices-close enough to drive to, but far enough away to pull power from a different grid and water from a different source. The company has also made plans to deploy critical employees to other office locations around the country.
Yates made site visits to companies such as FedEx, First Union, Merrill Lynch and Wachovia to hear about their approach to contingency planning. USAA also consulted with PR firm Fleishman-Hillard about how USAA, in a crisis situation, could communicate most effectively with its customers and employees.
Finally, Yates put together a series of large-scale business continuity exercises designed to test the performance of individual business units and the company at large in the event of wide-scale business disruption. When the company simulated a loss of the primary data center for its federal savings bank unit, Yates found that it was able to recover the systems, applications and all 19 of the third-party vendor connections. USAA also ran similar exercises with other business units.
For the main event, however, Yates wanted to test more than the company's technology procedures; he wanted to incorporate the most unpredictable element in any contingency planning exercise: the people.
USAA ultimately found that employees who walked through the simulation were in a position to observe flaws in the plans and offer suggestions. Furthermore, those who practice for emergency situations are less likely to panic and more likely to remember the plan.
Read Disaster drill: practice makes perfect for more details of the USAA exercise.
Some companies have discovered that while they back up their servers or data centers, they've overlooked backup plans for laptops. Many businesses fail to realize the importance of data stored locally on laptops. Because of their mobile nature, laptops can easily be lost or damaged. It doesn't take a catastrophic event to disrupt business if employees are carting critical or irreplaceable data around on laptops.
One company reports that it is looking into buying MREs (meals ready-to-eat) from the company that sells them to the military. MREs have a long shelf life, and they don't take up much space. If employees are stuck at your facility for a long time, this could prove a worthwhile investment.
Mike Hager, former head of information security and disaster recovery for OppenhiemerFunds, said 9/11 brought issues like these to light. Many companies, he said, were able to recover data, but had no plans for alternative work places. The World Trade Center had provided more than 20 million square feet of office space, and after Sept. 11th there was only 10 million square feet of office space available in Manhattan. The issue of where employees go immediately after a disaster and where they will be housed during recovery should be addressed before something happens, not after.
USAA discovered that while it had designated a nearby relocation area, the setup process for computers and phones took nearly two hours. During that time, employees were left standing outside in the hot Texas sun. Seeing the plan in action raised several questions that hadn't been fully addressed before: Was there a safer place to put those employees in the interim? How should USAA determine if or when employees could be allowed back in the building? How would thousands of people access their vehicle if their car keys were still sitting on their desk? And was there an alternate transportation plan if the company needed to send employees home?
Hager and other experts have noted the following pitfalls:
- Inadequate planning: Have you identified all critical systems, and do you have detailed plans to recover them to the current day? (Everybody thinks they know what they have on their networks, but most people don't really know how many servers they have, or how they're configured, or what applications reside on them-what services were running, what version of software or operating systems they were using. Asset management tools claim to do the trick here, but they often fail to capture important details about software revisions and so on.
- Failure to bring the business into the planning and testing of your recovery efforts.
- Failure to gain support from senior-level managers. The largest problems here are:
- Not demonstrating the level of effort required for full recovery.
- Not conducting a business impact analysis and addressing all gaps in your recovery model.
- Not building adequate recovery plans that outline your recovery time objective, critical systems and applications, vital documents needed by the business, and business functions by building plans for operational activities to be continued after a disaster.
- Not having proper funding that will allow for a minimum of semiannual testing.
How does changing technology affect my BC/DR plans?