6 reasons why IT recoveries fail

The top 6 reasons IT recoveries fail to meet SLAs


Few of us believe that our disaster recovery (DR) solution will fail when we need it. We plan, prepare and test for such eventualities so that we are protected against the embarrassing stories of lengthy IT downtime we read about in the press – right? Yet companies like BA, the NHS and county councils (some of whom are bound by codes of practice and guidelines) still experience IT downtime that is out of their control and beyond reasonable expectations. How is this happening, and what is really going wrong?

Here are the top 6 reasons recoveries fail to meet their SLAs:

1. Wrong decision making at the point of failure

When your IT systems are down, people within the business make mistakes because of the high stress they are under. Take the BA and SSP incidents, for example. Following a power surge, BA decided against invoking its DR solution and instead tried to restore power to its production systems as its primary strategy. Unfortunately, this went wrong (the power engineer tried to restore power to too many systems at the same time), compounding the failure and ultimately producing a larger problem with a longer recovery than if BA had simply invoked its DR solution in the first instance. SSP similarly tried a quick fix, replacing some failed SAN discs instead of failing over to the secondary data center that acted as its DR site. When further discs failed, SSP was eventually forced to use its DR provision, and recovery time was extended significantly. If you have a DR solution, knowing when to use it is important. Fast but correct decision-making is essential to preventing a crisis.

2. DR solutions lacking adequate testing

If you ask anyone responsible for DR, of course they’ll tell you that they test it. However, testing that your data backup is available is very different from testing that a virtual machine is recoverable, which is different again from testing the recovery of your entire systems to the point where users are working on them. At the lowest level, ensuring your data is available for restore is of course necessary; however, this is not a true DR test – it’s a backup test. Testing your DR properly means verifying that your full recovery systems come up, that applications and data are visible, that all systems (physical and virtual) are configured to work together, and that all relevant DNS changes have been made. Only once you do this, and start working with your recovery systems, will you see whether there are any errors (missing databases and files, etc.). It’s surprising how many errors surface when DR tests are carried out properly – errors that would prevent the success of your recovery at your time of need.
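One way to make such a test concrete is a smoke test that probes whether the recovered systems are actually reachable once DNS has been cut over. Below is a minimal sketch in Python; the hostnames and ports are illustrative assumptions, not a real recovery environment:

```python
import socket

# Hypothetical recovery-site endpoints -- substitute your own.
CHECKS = [
    ("app.dr.example.com", 443),    # application front end
    ("db.dr.example.com", 5432),    # database
    ("files.dr.example.com", 445),  # file shares
]

def check_endpoint(host, port, timeout=5):
    """Return (ok, detail). DNS must resolve and the port must accept a TCP connection."""
    try:
        addr = socket.gethostbyname(host)  # verifies the DNS cutover happened
    except socket.gaierror as e:
        return False, f"DNS lookup failed for {host}: {e}"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True, f"{host} -> {addr}:{port} reachable"
    except OSError as e:
        return False, f"{host} -> {addr}:{port} unreachable: {e}"

def run_dr_smoke_test(checks=CHECKS):
    """Run every check; an empty result means all endpoints passed."""
    results = {host: check_endpoint(host, port) for host, port in checks}
    return {h: detail for h, (ok, detail) in results.items() if not ok}
```

This only proves reachability, of course – a full test still means logging in and working on the recovered systems – but it catches DNS and connectivity failures in seconds.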

3. Changes in live system not being updated on DR systems

DR systems need to be exact replicas of your production systems to work properly. Every time your production system is updated, the change should be logged as part of a change-control process and your DR system should be updated to match. What often happens instead is that a DR system is invoked only to find that the database or application it points at has been moved and there’s nothing there, leading to recovery failure.
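This kind of gap can be caught early with an automated drift check that diffs a production inventory against the DR copy. The sketch below assumes you can export both inventories as simple dictionaries; all system names and paths are hypothetical:

```python
def find_drift(production, dr_copy):
    """Return systems that are missing or configured differently on the DR side."""
    drift = {}
    for system, config in production.items():
        if system not in dr_copy:
            drift[system] = "missing from DR"
        elif dr_copy[system] != config:
            drift[system] = f"differs: prod={config!r} dr={dr_copy[system]!r}"
    return drift

prod = {"crm-db": "/vol1/crm", "web-01": "/vol2/www"}
dr   = {"crm-db": "/vol1/crm_old"}  # database path changed; web-01 never replicated
print(find_drift(prod, dr))  # flags both crm-db (path differs) and web-01 (missing)
```

Run as part of the change-control process, a check like this surfaces the moved database before an invocation does.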

4. Data volumes and bandwidth restrictions

SSP’s customers were informed that their DR solution consisted of continuous replication to a secondary data center for minimal data loss. Sounds great, right? But replicating the data meant that, in effect, all SSP had was a backup of customers’ raw data, plus the building blocks to deliver a secondary platform for that data if required. The first step in SSP’s recovery was to get the platform and applications back up and running (which takes a while if it is not already prepared). Once they had done this, SSP (and their customers) were in for a nasty surprise: it took around three weeks to unpack all of the customer data and restore it. Bandwidth and data volumes have a big impact on replication and recovery times.
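The impact of data volume and bandwidth on recovery time is easy to estimate up front with a back-of-envelope calculation. The figures below are illustrative, not SSP’s actual numbers:

```python
def transfer_hours(data_tb, link_mbps, efficiency=0.7):
    """Hours to move data_tb terabytes over a link_mbps line at a given utilisation."""
    bits = data_tb * 8 * 10**12                    # TB -> bits (decimal units)
    seconds = bits / (link_mbps * 10**6 * efficiency)
    return seconds / 3600

# 50 TB of customer data over a 1 Gbps link at 70% utilisation:
print(f"{transfer_hours(50, 1000):.0f} hours")     # prints "159 hours" -- about 6.6 days
```

Running a calculation like this against your own volumes and links tells you, before a disaster, whether your recovery time objective is even physically achievable.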

5. False DR test reports

I work for Plan B, which runs DR tests daily for customers. It’s interesting how often a replication product reports a replication as a success when in actual fact it has not worked. I know this because we’ve had instances where we’ve gone to boot a DR system to test it properly and the virtual machine (VM) or data replication is missing – there’s simply nothing there. Be wary: automated test reports, if not independently verified, can be misleading.
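One cheap safeguard is an independent check that the replica artefact actually exists and is plausibly sized, rather than trusting the replication tool’s success flag alone. A minimal sketch, assuming file-based VM replicas; the path and size threshold are illustrative assumptions:

```python
import os

def verify_replica(path, min_bytes=2**30):
    """Independently confirm a replica file exists and is at least min_bytes in size."""
    if not os.path.exists(path):
        return False, "replica file is missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"replica suspiciously small ({size} bytes)"
    return True, f"replica present ({size} bytes)"

ok, detail = verify_replica("/replicas/crm-db.vmdk")  # hypothetical path
```

A real verification goes much further – booting the VM and logging in – but even this trivial check catches the “there’s simply nothing there” case that a success report can hide.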

6. Reluctance to invoke DR solutions

Many companies are slow or reluctant to use their disaster recovery solution at the point of failure. This is normally because the failback process (reverting from your DR system to your live system) is lengthy – typically a migration project in its own right. With the introduction of more advanced technology, failback is becoming increasingly automated, which lets customers treat their DR system as a usable standby they can flip to and from. This gives customers much more confidence in their DR system, so the decision to invoke DR is made faster. It also opens up a whole world of benefits, such as the ability to test any migration, development or change knowing that they can easily revert to the DR system if required.

So why are companies still experiencing costly, business-crippling IT downtime incidents? Usually it’s because of human limitations rather than technology limitations – and accountable individuals should build the kind of tested, earned confidence in their DR solutions that makes the right decision easy at the point of failure.

This article is published as part of the IDG Contributor Network.
