There is no such thing as a DR test failure

The only failure is not learning from prior tests.


Testing your IT Disaster Recovery (DR) plan can be laborious, tedious, and fraught with potential landmines. Case in point: my first exposure to DR, way back in the ancient times of the early 1990s.

We were a mainframe shop, Big Blue, Amdahl, you know the beasts. Our infrastructure team had been performing annual DR tests for several years. These were the kind of tests where you rented space and equipment in some far-away datacenter for a finite amount of time, something like 36 hours. Within that window, you had to fire up the mainframes, tape drives, and disks, then restore the OS, middleware, and all the utilities.

This year was going to be different, however. This year, they actually wanted to recover an application. At the time, I was the lead contractor assigned to the order management applications. The applications consisted of dozens of systems, hundreds of programs and literally thousands of data files backed up on countless tapes. We had certainly used the tape backups to restore files often enough, but we had never attempted a full recovery of all the applications under the order management umbrella.

The date of the test had been set for late that summer. Earlier in the summer, one other person and I were assigned the monumental task of preparing for the test by conducting the “pre-test.” We were locked away together in the back recesses of the datacenter, given our own mainframe on which to play—er uh, test—and given some instructions from the infrastructure engineer assigned to our little team.

What followed was hours and hours of running restores, verifying data, running applications, documenting issues and, in a lot of cases, changing production jobs and programs to ensure the backups were taken at proper sync points for the applications that had to talk to each other. Finally, after three months, we had a documented “run book” for restoring the order management applications.

The week before the real test, we ran several successful recovery scenarios. We were confident THIS would be a successful test.

We piled into cars and caravanned the three-hour drive up I-65 from Indianapolis to Chicago. In all, we had three application people and about 15 infrastructure engineers. 

What did one Tape Librarian say to the other Tape Librarian? Give up? He said, “You brought the tapes, right?” Yes, ladies and gentlemen, 18 people had driven three hours and no one had thought to pick up the tapes from the offsite vault. The clock was ticking. Four hours later, the tapes arrived. The mainframes were fired up, the tape drives were started, the disk drives were initialized. 30 hours to go.

It was then our infrastructure team realized the Tape Management System at the facility had been updated to a newer (read NOT backwards compatible) version. The next several hours were spent back-leveling the TMS so our tapes could be read. 28 hours to go.

Finally, the restores of the system software and the OS could begin. The data transfer rate between the tape drives and the mainframe was painfully slow. The hours ticked by with every tape mount. 19 hours to go. It was time to reset and bring the system online so the recovery of the applications could begin. Time and old age have dulled my memory of exactly what happened next. Suffice it to say, the disks would not come online. The app team headed back to the hotel to get some sleep.

By noon the next day, we had to accept the fact that not one of the application tapes had even been mounted yet. Six hours to go and it would take eight hours to restore the applications and data. The test was declared over and unsuccessful. Gloomily, we headed back to Indianapolis.

But wait! Was it really unsuccessful?

  • We had spent months ensuring the applications could get recovered. We had made dozens of changes to the programs, the job schedules, and the tape rotation cycles.
  • We had a documented (and tested) application recovery procedure in our DR run book.
  • We now had a system of double checks in place to ensure all the tapes were accounted for before heading to the cold site.
  • We knew to validate the release levels of all the equipment in the cold site, including the tape management system.
  • We had dozens of documented lessons learned.
  • AND we had successfully raised the importance of DR Testing, including Application DR Testing, with our executive management.
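Two of those lessons, accounting for every tape before leaving for the cold site and validating release levels at the recovery facility, are easy to automate today. Here is a minimal sketch of such a pre-test checklist; all of the volume names, component names, and version numbers are hypothetical, not details from the actual test:

```python
# Hypothetical pre-DR-test checklist: verify the tape manifest and the
# cold site's software release levels before the test window starts.
# Every name and version below is illustrative, not from the original story.

def missing_tapes(manifest, on_hand):
    """Return tapes listed in the run book manifest but not physically present."""
    return sorted(set(manifest) - set(on_hand))

def version_mismatches(required, cold_site):
    """Return (component, required, found) tuples for each release-level mismatch."""
    problems = []
    for component, version in required.items():
        found = cold_site.get(component)  # None if the component is absent entirely
        if found != version:
            problems.append((component, version, found))
    return problems

if __name__ == "__main__":
    manifest = ["VOL001", "VOL002", "VOL003"]
    on_hand = ["VOL001", "VOL003"]                   # VOL002 is still in the vault
    required = {"TMS": "5.1", "OS": "MVS/ESA 4.3"}
    cold_site = {"TMS": "6.0", "OS": "MVS/ESA 4.3"}  # TMS was upgraded at the site

    print("Missing tapes:", missing_tapes(manifest, on_hand))
    print("Version mismatches:", version_mismatches(required, cold_site))
```

Run against the sample data, the checklist would have flagged both of the problems we hit that day: a tape left in the vault and a non-backward-compatible Tape Management System upgrade, before anyone got in a car.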

I am happy to report the test the following year was more successful. We were able to recover our order management system and run some transactions against the recovered application. As we progressed, the definition of success progressed. The only failure was not learning from the prior test. Yes, it took several years, but in the end, we did have a fully tested and validated DR plan, just in time to incorporate client-server applications!

The key thing to keep in mind when performing DR tests for your company is that, no matter the results, your real objective is to learn from the situation and your team’s process. Oftentimes, it’s people, not technology, that make or break a successful recovery. The technology is important, yes. But so is the competency of your IT department in handling DR scenarios. Recognize vulnerabilities in the aftermath of your DR test, and shore those areas up.

This article is published as part of the IDG Contributor Network.
