Is Your DR Plan Complete?

Kevin Hill (blog|twitter) posted a thought-provoking item on his blog last week about Disaster Recovery Plans. While I am in the 10% who perform DR tests for basic functionality on a regular basis, there’s a lot more to being prepared for disaster than just making sure you can get the databases back online.

You really need a company-wide business continuity plan (BCP), of which your DR plan is an integral part. Here come the Boy Scouts chanting “Be Prepared!”

When disaster strikes:

  • How will you communicate it to your customers, including regular status updates?
  • How will you communicate within the company?
  • Do you have your systems prioritized so that you know in what order things have to be brought online? Which systems can lag by a day or two while you get the most critical things online?
  • Do you have contingency plans for all of the disaster conditions that could impact your business or failure modes of your systems?
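
One way to make the prioritization question concrete is to write the order down as data rather than keep it in people's heads. Here is a minimal Python sketch of that idea. The system names, tiers, and downtime limits are all made up for illustration; your own list would come from the BCP, not from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    tier: int                  # 1 = most critical (hypothetical scale)
    max_downtime_hours: float  # how long the business can tolerate an outage


def recovery_order(systems):
    """Return systems in the order they should be brought online.

    Lower tier first; within a tier, the system that tolerates the
    least downtime first; name is only a tie-breaker for stable output.
    """
    return sorted(systems, key=lambda s: (s.tier, s.max_downtime_hours, s.name))


# Hypothetical inventory, for illustration only.
SYSTEMS = [
    System("reporting warehouse", tier=3, max_downtime_hours=48),
    System("orders database", tier=1, max_downtime_hours=1),
    System("public website", tier=1, max_downtime_hours=2),
    System("internal wiki", tier=3, max_downtime_hours=72),
    System("email", tier=2, max_downtime_hours=8),
]

if __name__ == "__main__":
    for position, system in enumerate(recovery_order(SYSTEMS), start=1):
        print(f"{position}. {system.name} "
              f"(tier {system.tier}, up within {system.max_downtime_hours}h)")
```

The tier 3 entries are the ones that can lag by a day or two. Keeping the list in a file that lives with the runbook means it gets reviewed whenever the plan is rehearsed.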

Let’s say you’re prepared to fail over from your primary datacenter to a DR datacenter when a catastrophe hits the primary. You’ve got that all worked out and you rehearse it monthly or quarterly. You can bring critical databases and websites online within the required time period and the steps are well-documented. That’s a great start!

You probably do this periodic test on the 2nd Tuesday of each quarter, from the comfort of your desk at work, under “normal” conditions.

  • What if your main office is unavailable due to fire, flood, or weather conditions? Can you remotely access any of your datacenters or cloud infrastructure without first connecting to the building that just got wiped off the planet by a tornado? Can you weather a wide-scale blackout?
  • Are you expecting everyone to work from home (or wherever they may be/may find convenient), or do you have a fixed location to use as a command center? Do you have a contingency plan if that “command center” is inaccessible due to unsafe travel conditions or the same problems that plague your main office?
  • Have you tested executing your DR/BCP out of those alternate locations?
  • What if you can access the office (either via VPN or physically), but the connection to your offsite datacenter(s) is severed?
  • Maybe you’ve got everyone set up to work “remotely”. Are they able to work at 100% or even 50% capacity if you lose the office, or are they dependent upon a VPN endpoint in the office? How many routes to the datacenter(s) do you have? Are all the necessary tools available on laptops for remote work, or are you reliant upon a jumpbox? Is that jumpbox accessible in a true disaster scenario?
  • A modern laptop that’s only running an RDP client (aka smart terminal) can run quite a while on a full charge and is pretty responsive even when tethered to your phone’s LTE connection or a MiFi device. Are you keeping all those batteries fully charged (confession: my laptop is at about 40% as it sits in its bag right now, and I don’t carry a battery pack for my phone all the time) so you can work a few hours while waiting for the lights to come back on at home?
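
Several of the questions above boil down to "can I reach what I need from where I am sitting right now?" A rough way to answer that during a test from home or an alternate location is a small reachability check like the Python sketch below. The hostnames and ports are placeholders I made up (the `example.com` names will not resolve to anything useful), and a successful TCP connection only shows that a route exists, not that you can authenticate or actually do work once connected.

```python
import socket

# Placeholder endpoints, for illustration only. Replace with your own.
ENDPOINTS = {
    "DR VPN gateway": ("vpn.dr.example.com", 443),
    "DR jumpbox (RDP)": ("jump.dr.example.com", 3389),
    "Primary VPN gateway": ("vpn.primary.example.com", 443),
}


def check_routes(endpoints, connect=socket.create_connection, timeout=3.0):
    """Try a TCP connection to each endpoint and report which ones answer.

    `connect` is injectable so the function can be exercised without
    touching the network; by default it is socket.create_connection.
    """
    results = {}
    for label, (host, port) in endpoints.items():
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[label] = False
        else:
            conn.close()
            results[label] = True
    return results


if __name__ == "__main__":
    for label, reachable in check_routes(ENDPOINTS).items():
        print(f"{label}: {'reachable' if reachable else 'NOT reachable'}")
```

Run it from the places you would really be working in a disaster (home, a phone tether, the alternate command center), not from your desk at the office, and keep the results with your DR test notes.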

As a DBA, I’m responsible for ensuring that we can get the necessary databases online with reasonably recent data (meeting our SLAs) and accepting connections for users. But that presumes that I can gain access to the DR site. It also presumes that communication channels are documented and followed, so that my team is allowed to work the problem instead of being asked for status updates every 3 minutes.

There are a lot of moving parts that have to be working together for your database DR plan to execute successfully, and many of them are outside the DBA’s realm or even the IT department. Testing your database recovery plan is terrific - but unless you’ve prepared and tested an end-to-end plan that encompasses everything the company needs to do to continue operating, how can you be sure that you’ll even be in a position to execute the database DR plan?