Few of us believe that our Disaster Recovery solution will fail when we need to use it. We plan, prepare and test for such eventualities so we are protected against the embarrassing stories of lengthy IT downtime we read about in the press – right? Yet companies like BA, the NHS and county councils (some of whom are bound by codes of practice and guidelines) are still experiencing IT downtime beyond their control and beyond reasonable expectations. How is this being allowed to happen, and what's really going wrong? Here are the top six reasons recoveries fail to meet their SLAs:

1. Wrong decision-making at the point of failure

When your IT systems are down, people within the business make mistakes caused by the high stress levels they are under. Take the BA and SSP incidents, for example. BA decided against using its DR solution following a power surge and instead tried to restore power to its production systems as its primary strategy. Unfortunately, this went wrong (the power engineer tried to restore power to too many systems at the same time), compounding the magnitude of the failure and ultimately creating a larger problem with a longer recovery than if they had simply invoked their DR solution in the first instance. SSP also tried to fix its problem quickly, replacing some failed SAN discs instead of switching to the secondary data center that acted as its DR site. When further disc failures occurred, SSP was eventually forced to use its DR provision, and recovery time was extended significantly. If you have a DR solution, knowing when to use it is important. Fast but correct decision-making is essential to preventing a crisis.

2. DR solutions lacking adequate testing

Ask anyone responsible for DR and of course they'll tell you that they test their DR.
However, testing that your data backup is available is very different from testing that a virtual machine is recoverable, which is different again from testing the recovery of your entire systems to the point where users are working on them. At the lowest level, ensuring your data is available for restore is of course necessary, but this is not a true DR test – it's a backup test. Testing your DR properly means testing that your full recovery systems come up, that applications and data are visible, that all systems (physical and virtual) are configured to work together, and that all relevant DNS changes have been made. It's only once you do this, and start working with your recovery systems, that you will be able to see whether there are any errors (missing databases and files, etc.). And it's surprising how many errors surface when DR tests are properly carried out – errors that would prevent the success of your recovery at your time of need.

3. Changes in live systems not being updated on DR systems

DR systems need to be exact replicas of your production systems to work properly. Every time your production system is updated, the change should be logged as part of a change control process and your DR system should be updated too. What often happens is that a DR system is invoked only to find that the database or application it's pointing at has been moved and there's nothing there, leading to recovery failure.

4. Data volumes and bandwidth restrictions

SSP customers were informed that their DR solution consisted of continuous replication to a secondary data center for minimal data loss. Sounds great, right? But replicating the data meant that, in effect, all they had was a backup of customers' raw data, with the building blocks to deliver a secondary platform for that data if required. The first step in SSP's recovery was to get the platform and applications back up and running (which takes a while if they are not already prepared).
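The scale of the problem is easy to underestimate. A back-of-envelope calculation shows how data volume and link speed dominate recovery time even before any unpacking or verification work begins – the figures below are purely illustrative, not SSP's actual numbers:

```python
def transfer_time_days(data_tb: float, bandwidth_mbps: float,
                       efficiency: float = 0.7) -> float:
    """Days needed to move `data_tb` terabytes over a `bandwidth_mbps` link,
    assuming only `efficiency` of the raw line rate is usable in practice."""
    bits = data_tb * 8 * 10**12                      # decimal TB -> bits
    seconds = bits / (bandwidth_mbps * 10**6 * efficiency)
    return seconds / 86400

# Hypothetical example: 50 TB of customer data over a 1 Gbps link
print(round(transfer_time_days(50, 1000), 1))        # roughly a week
```

Even under these generous assumptions, simply moving the data takes days – which is why a recovery plan that starts from raw replicated data, rather than a standby platform, can stretch into weeks.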
Once they had done this, SSP (and their customers) were in for a nasty surprise when it took around three weeks to unpack all of the customer data and get it restored. Bandwidth and data volumes have a big impact on replication and recovery times.

5. False DR test reports

I work for Plan B, which runs DR tests on a daily basis for customers. It's interesting how many times a replication product reports a replication as a success when in fact it has not worked. I know this because we've had instances where we've gone to boot a DR system to test it properly and found the virtual machine (VM) or data replication missing. There's simply nothing there. Users need to be wary that automated testing, if not verified, can be misleading.

6. Reluctance to invoke DR solutions

Many companies are slow or reluctant to use their Disaster Recovery solution at the point of failure. This is normally because the failback process (reverting from your DR system to your live system) is lengthy – typically a migration project. With the introduction of more advanced technology, failback is increasingly automated, which lets customers view their DR system as a usable standby system that they can flip to/from. This gives customers much more confidence in their DR system, so the decision to invoke DR is made faster. It also opens up a whole world of benefits, with the ability to test any migration, development or change knowing that they can easily revert to the DR system if required.

So why are companies still experiencing costly and business-crippling IT downtime incidents? Usually it's because of human limitations rather than technology limitations – and accountable individuals should be able to have real confidence in their DR solutions.