Welcome! I’m Thomas…

How To Survive Any Database Disaster

Database disasters happen from time to time. I have a method that allows you and your business to survive any database disaster.

Have backups.

This isn’t rocket surgery. You are only going to be able to fully recover from any disaster to the point of your last good backup.

No backups mean you can’t recover. At all. Ever.

I see stories every day about companies (or even regular people) that don’t have backups of critical systems and files. I would think that by now everyone understands if it is important then you should have three copies.
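Three copies is the short version. The fuller "rule of three" is three places, two different formats, and at least one offsite location. Here is a minimal Python sketch of that check. The `BackupCopy` class, its field names, and the sample locations are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical structure for illustration)."""
    location: str   # e.g. "san-primary", "tape-vault"
    media: str      # e.g. "disk", "tape"
    offsite: bool   # True if stored away from the primary site


def satisfies_rule_of_three(copies):
    """Three places, two formats, at least one offsite."""
    locations = {c.location for c in copies}
    formats = {c.media for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(locations) >= 3 and len(formats) >= 2 and has_offsite


copies = [
    BackupCopy("san-primary", "disk", offsite=False),
    BackupCopy("backup-server", "disk", offsite=False),
    BackupCopy("tape-vault", "tape", offsite=True),
]
print(satisfies_rule_of_three(copies))
```

If every copy lives on the same SAN, the check fails no matter how many copies you have.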

I also see and hear people confuse the terms “high availability” and “disaster recovery”. These are two different terms with very different meanings. One will help you recover from a disaster, the other will not. Guess which one?

Whenever I talk about database disaster planning with my clients and customers I like to remind them of some key concepts. Let’s all take a minute to review some key definitions:

Database Disaster Terminologies You Need To Know

HA – Stands for High Availability. The word you want to think about here is this: uptime. It’s that simple. If your servers have a high uptime percentage (five nines, or 99.999%, for example) then they are highly available.

DR – Stands for Disaster Recovery. The word you want to think about here is this: recovery. It’s that simple. If you are able to recover your data then you have the makings of a DR plan.

Now, here is the very important piece of information that you need to know: HA IS NOT THE SAME AS DR. For the developers that might stumble upon this blog I would explain it like this: HA <> DR

You would be surprised how many people confuse these two terms. I know I was surprised that some folks would either confuse the terms, or try to classify issues as “events” versus a “disaster”. To me it does not matter if one server or one hundred servers are wiped out: a disaster is a disaster and you need to be able to recover the data.

RPO – Stands for Recovery Point Objective and is the point in time to which you can recover data as part of an overall business continuity plan. In other words, it is the acceptable amount of data loss. For example, if your RPO is 15 minutes then you are going to want to be doing some kind of backup at least every 15 minutes. But there’s a rub here, and that is your business may not be able to pick back up at that point and carry on. Read this article for more examples of where you need to set an appropriate RPO based upon the nature of the system. You can’t just set an arbitrary number like “15 minutes” and expect it to have real meaning without knowing the underlying system.
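To make the RPO idea concrete, here is a minimal sketch in Python. It assumes you can look up the finish time of your most recent good backup. The function names and the timestamps are hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta


def current_exposure(last_backup, now):
    """How much data you would lose if the disaster happened right now."""
    return now - last_backup


def meets_rpo(last_backup, now, rpo):
    """True if the most recent good backup is still within the RPO."""
    return current_exposure(last_backup, now) <= rpo


rpo = timedelta(minutes=15)
last_backup = datetime(2013, 4, 5, 10, 0)  # hypothetical backup finish time

print(meets_rpo(last_backup, datetime(2013, 4, 5, 10, 12), rpo))  # True
print(meets_rpo(last_backup, datetime(2013, 4, 5, 10, 20), rpo))  # False
```

A check like this only tells you that a backup exists and is recent enough. It says nothing about whether that backup can actually be restored.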

RTO – Stands for Recovery Time Objective and this is the maximum amount of time you can take to recover data before the business is severely impacted. Taking log backups every 15 minutes may help satisfy your RPO, but if it takes you hours to recover a 5TB database then you are probably not going to be helping your business continuity plans.
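A rough back-of-the-envelope estimate shows why a 5TB restore can blow through an RTO. The sketch below is Python and rests on loud assumptions: the 200 MB/s restore throughput is a made-up figure, and `overhead_hours` is a hypothetical placeholder for anything that happens before the restore starts, such as fetching the backup media. Measure your own numbers with a real restore test.

```python
def estimated_restore_hours(size_gb, throughput_mb_per_s, overhead_hours=0.0):
    """Estimate restore time from database size and sustained throughput."""
    transfer_seconds = size_gb * 1024 / throughput_mb_per_s
    return overhead_hours + transfer_seconds / 3600


def meets_rto(size_gb, throughput_mb_per_s, rto_hours, overhead_hours=0.0):
    """True if the estimated restore time fits inside the RTO."""
    estimate = estimated_restore_hours(size_gb, throughput_mb_per_s, overhead_hours)
    return estimate <= rto_hours


# 5TB database, assumed 200 MB/s sustained restore throughput
hours = estimated_restore_hours(5 * 1024, 200)
print(f"Estimated restore time: {hours:.1f} hours")  # about 7.3 hours
print(meets_rto(5 * 1024, 200, rto_hours=4))         # False
```

Even with generous throughput, the restore takes hours. No backup schedule will fix that, because it is a function of data size and hardware.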

Your business continuity and recovery plans should include both a recovery point objective (RPO) and a recovery time objective (RTO).

Do You Know Your DR Plan?

For most folks the DR plan is simple: recover the server from a tape backup and restore the databases from backup files (also written to tape). Now, some folks will tell you that they have replication deployed as a DR solution. But I like to play a game called “what if?” So, if your shop is using SAN replication and claims it is its DR solution, ask some simple questions such as:

“What if a corruption happens at Site A and is replicated immediately to Site B?”

And see where that question leads you. (HINT: it should lead you to your current DR solution (if you have one) which is most likely recovering from tape, which (hopefully) shows you are only as good as your last tape backup.)

Here’s another question that most folks tend to overlook: What if your RPO and RTO agreements are no longer (or never were) compatible?

For example, your RPO could be stated as “We need to be put back to a point in time no more than 15 minutes prior to the disaster”, and your RTO could be 15 minutes as well. So if it takes you 15 minutes in recovery time to be at a point 15 minutes prior to the disaster then you are going to be starting again 30 minutes behind: 15 minutes of lost data plus 15 minutes of downtime.
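The arithmetic in that example is simple enough to write out. This is a tiny Python sketch, and the function name is my own invention for illustration:

```python
from datetime import timedelta


def worst_case_setback(rpo, rto):
    """Lost data (up to the RPO) plus recovery time (up to the RTO)."""
    return rpo + rto


setback = worst_case_setback(timedelta(minutes=15), timedelta(minutes=15))
print(setback)  # 0:30:00
```

If the business only ever agreed to "15 minutes", this is the number you need to put in front of them.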

Think about what your current RPO and RTO are right now (assuming they exist). How long ago were they agreed upon? When was the last time you tested to make certain those agreements were still compatible?

Check That Your RPO and RTO Still Make Sense

You must check often to make certain that you are meeting your RTO for the given RPO and that you are at an acceptable re-starting point for business to continue. Chances are the RPO and RTO you agreed to initially are no longer viable as the size of the data has grown over the years.
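To see how data growth quietly breaks an RTO, here is a minimal Python sketch that projects restore time forward. Everything in it is an assumption for illustration: the function name, the steady monthly growth rate, and the constant restore throughput. Real growth is rarely this tidy, so treat the result as a prompt to go test, not as a forecast.

```python
def months_until_rto_breach(size_gb, monthly_growth, throughput_mb_per_s,
                            rto_hours, horizon_months=120):
    """Return the first month in which the estimated restore time exceeds
    the RTO, or None if that does not happen within the horizon."""
    for month in range(horizon_months + 1):
        restore_hours = size_gb * 1024 / throughput_mb_per_s / 3600
        if restore_hours > rto_hours:
            return month
        size_gb *= 1 + monthly_growth  # assumed steady compound growth
    return None


# 1000 GB today, growing 10% a month, 200 MB/s restores, 2 hour RTO
print(months_until_rto_breach(1000, 0.10, 200, rto_hours=2))  # 4
```

In this made-up case, an RTO that looks comfortable today stops being achievable within four months.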

Quite often that reality of being set back 30 minutes, and not the desired (and expected) 15 minutes, causes folks to start thinking about alternatives such as HA in an effort to augment their DR situation. I’ve seen folks then start to think of HA as a replacement for DR, and that’s where real trouble creeps into your shop.

And the prices rise considerably as you try to narrow those gaps. As you get closer and closer to no downtime, and no data loss as well, your costs skyrocket.

Which is probably why companies then decide that some downtime is acceptable. But not having backups, or a DR solution in place?

That is never acceptable.

If you want to survive any database disaster then you need to start with having your backups in place.

  • What amazes me is how often people don’t take into account the location of their backups. Long term and short term. For example at my company our backups are copied to tape and the tape is moved to an off site location. It takes anywhere from 4-24 hours to get that tape back depending on which of our locations it has to be shipped to. That time had better be added in to our RTO or we are going to be messed over. Not to mention the importance of making periodic if not frequent tests to make sure that the backups actually make it on to tape and that we can get them back off again if we need to.

    Short term is easier since in our case that is just a location on the SAN. However we still need to keep an eye on how long the files in that location remain before being dumped for space. In other words how short is short term? We have discovered (rather unhappily) that our 7 days worth of backups on disk had been shortened to 3 (and by Wed that meant no full backups on the disk).

    • ThomasLaRock

      Yeah, we had a tape storage provider that promised a 2 hour tape recovery time. But that’s just the tape itself. It could be terabytes of data to restore!

      Storing your backups locally is an option I have seen used as well, but then what do you do if that location is destroyed? Now you’ve lost your current data AND your backups!

      I think the “rule of three” is key here. If it is important then store it in three places, in two different formats, with at least one offsite location.

      Will that cost extra? Absolutely. But that cost is better than having your business collapse.

      • Our backups move onto tape once a day, then off site later that day. Then we also have the several days of backups on disk. We run periodic tests to make sure that not only can we restore from our backups on disk but how many days worth of backups are available. Then on the other side we pay the “rush” fee several times a year to get tapes back from our offsite location to test how long it actually takes (as opposed to how long they say it will) for us to get the tape. Then find out how long it takes to restore and if our tapes are still viable.

        • ThomasLaRock

          Sounds like you have the makings of a decent business continuity plan there. I love hearing that you test your process, that’s wonderful.

          • Test is an understatement. Two to three times a year we run a 24×7 test. We put up a DR environment, recover all of our servers (hundreds of mainframe and PC), then recover all of the applications and the databases, then run a set of test batch runs, then have the business test to see if everything still works ok.

            On top of that we have a huge process where we write up everything that went wrong and what we should do better. Then we update our continuity plans as appropriate.

            It’s a big headache at times, but worth it if we ever need it.

          • ThomasLaRock

            Yessir…worth it if you need it. Very, very true.

  • MattWilliams81

    Losing data is something that can be bad for businesses as they can lose all different types of internal and external data. If this does affect your business then the best thing to do would be to get a London data recovery company to help you recover it.
