02 Apr 2013
How To Survive Any Database Disaster
This isn’t rocket surgery. You are only going to be able to fully recover from any disaster to the point of your last good backup.
No backups mean you can’t recover. At all. Ever.
I see stories every day about companies (or even regular people) that don’t have backups of critical systems and files. I would think that by now everyone understands if it is important then you should have three copies.
I also see and hear people confuse the terms “high availability” and “disaster recovery”. They are two different terms with very different meanings. One will help you recover from a disaster, the other will not. Guess which one?
Whenever I talk about database disaster planning with my clients and customers I like to remind them of some key concepts. Let’s all take a minute to review some key definitions:
Database Disaster Terminologies You Need To Know
HA – Stands for High Availability. The word you want to think about here is this: uptime. It’s that simple. If your servers have a high uptime percentage (five nines, or 99.999%, is the usual example) then they are highly available.
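To put a number on what an uptime percentage means, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name and the list of targets are mine, chosen for illustration, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; pick whatever your business has actually agreed to.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year. Notice that nothing in this calculation says anything about whether you can get your data back.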
DR – Stands for Disaster Recovery. The word you want to think about here is this: recovery. It’s that simple. If you are able to recover your data then you have the makings of a DR plan.
Now, here is the very important piece of information that you need to know: HA IS NOT THE SAME AS DR. For the developers that might stumble upon this blog I would explain it like this: HA <> DR
You would be surprised how many people confuse these two terms. I know I was surprised that some folks would either confuse the terms or try to classify issues as “events” versus a “disaster”. To me it does not matter if one server or one hundred servers are wiped out: a disaster is a disaster and you need to be able to recover the data.
RPO – Stands for Recovery Point Objective and is the point in time to which you can recover data as part of an overall business continuity plan. In other words, it is the acceptable amount of data loss. For example, if your RPO is 15 minutes then you are going to want to be doing some kind of backup at least every 15 minutes. But there’s a rub here, and that is your business may not be able to pick back up at that point and carry on. Read this article for more examples of where you need to set an appropriate RPO based upon the nature of the system. You can’t just set an arbitrary number like “15 minutes” and expect it to have real meaning without knowing the underlying system.
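One way to check whether your backup schedule actually honors an RPO is to look at the gaps between backups that really completed. The sketch below is a rough illustration in Python, not a feature of any backup tool: the function names and the sample timestamps are made up, and you would feed it completion times from your own backup history.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times: list[datetime], now: datetime) -> timedelta:
    """Longest stretch with no backup, including the time since the last one."""
    if not backup_times:
        raise ValueError("no backups: nothing can be recovered")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """True if no gap between backups (or since the last one) exceeds the RPO."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Hypothetical history: the 09:45 and 10:00 log backups silently failed.
history = [
    datetime(2013, 4, 2, 9, 0),
    datetime(2013, 4, 2, 9, 15),
    datetime(2013, 4, 2, 9, 30),
    datetime(2013, 4, 2, 10, 5),
]
now = datetime(2013, 4, 2, 10, 10)

print(worst_case_data_loss(history, now))             # 0:35:00
print(meets_rpo(history, now, timedelta(minutes=15)))  # False
```

The schedule said “every 15 minutes”, but the history shows a 35-minute hole. The schedule is a promise; the history is what you can actually recover to.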
RTO – Stands for Recovery Time Objective and is the maximum amount of time you can take to restore service before the business is severely impacted. Taking log backups every 15 minutes may help satisfy your RPO, but if it takes you hours to restore a 5TB database then you are probably not meeting your RTO, and you are not helping your business continuity plans.
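A back-of-the-envelope estimate shows why a 5TB restore takes hours. This Python sketch assumes a sustained restore throughput of 200 MB/s, which is a number I picked purely for illustration; measure your own by timing a real test restore. It also ignores log replay and any time spent fetching tapes.

```python
def estimated_restore_hours(database_tb: float, throughput_mb_per_s: float) -> float:
    """Rough restore time in hours, using decimal units (1 TB = 1,000,000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    total_mb = database_tb * 1_000_000
    return total_mb / throughput_mb_per_s / 3600


# Assumed throughput of 200 MB/s; replace with a figure from a real restore test.
hours = estimated_restore_hours(database_tb=5, throughput_mb_per_s=200)
print(f"Restoring 5 TB at 200 MB/s takes roughly {hours:.1f} hours")
```

Under those assumptions the restore runs close to seven hours, so an RTO measured in minutes is simply not achievable from backups alone.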
Your business continuity and recovery plans should include both a recovery point objective (RPO) and a recovery time objective (RTO).
Do You Know Your DR Plan?
For most folks the DR plan is simple: recover the server from a tape backup and restore the databases from backup files (also written to tape). Now, some folks will tell you that they have replication deployed as a DR solution. But I like to play a game called “what if?” So, if your shop is using SAN replication and claims it as the DR solution, ask some simple questions such as:
“What if a corruption happens at Site A and is replicated immediately to Site B?”
And see where that question leads you. (HINT: it should lead you to your current DR solution (if you have one) which is most likely recovering from tape, which (hopefully) shows you are only as good as your last tape backup.)
Here’s another question that most folks tend to overlook: What if your RPO and RTO agreements are no longer (or never were) compatible?
For example, your RPO could be stated as “We need to be put back to a point in time no more than 15 minutes prior to the disaster”, and your RTO could be 15 minutes as well. So if it takes you 15 minutes of recovery time to get back to a point 15 minutes prior to the disaster, then you are starting again 30 minutes behind: 15 minutes of lost data plus 15 minutes of downtime.
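The arithmetic is easier to see on a timeline. Here is a small Python sketch with a made-up disaster time; the function name is mine and exists only for this illustration.

```python
from datetime import datetime, timedelta


def recovery_timeline(disaster_at: datetime, data_loss: timedelta, recovery_time: timedelta):
    """Return (point recovered to, time back online, total gap between the two)."""
    recovered_to = disaster_at - data_loss
    back_online = disaster_at + recovery_time
    return recovered_to, back_online, back_online - recovered_to


# Hypothetical disaster at noon, with 15 minutes of data loss and 15 minutes to recover.
recovered_to, back_online, gap = recovery_timeline(
    disaster_at=datetime(2013, 4, 2, 12, 0),
    data_loss=timedelta(minutes=15),
    recovery_time=timedelta(minutes=15),
)

print(f"Data is restored to {recovered_to:%H:%M}")      # 11:45
print(f"Service is back at {back_online:%H:%M}")        # 12:15
print(f"Gap to make up: {gap.total_seconds() / 60:.0f} minutes")  # 30 minutes
```

When the business comes back at 12:15, the newest data it has is from 11:45. That half hour is what the business experiences, regardless of how the two objectives were written down.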
Think about what your current RPO and RTO are right now (assuming they exist). How long ago were they agreed upon? When was the last time you tested to make certain those agreements were still compatible?
Check That Your RPO and RTO Still Make Sense
You must check often to make certain that you are meeting your RTO for the given RPO and that you are at an acceptable re-starting point for business to continue. Chances are the RPO and RTO you agreed to initially are no longer viable as the size of the data has grown over the years.
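To get a feel for how data growth erodes an RTO, you can project restore time forward. The Python sketch below assumes steady compound growth and a fixed restore throughput. Both are simplifications, and every number in the example is hypothetical.

```python
from typing import Optional


def months_until_rto_breached(
    size_tb: float,
    monthly_growth: float,
    throughput_mb_per_s: float,
    rto_hours: float,
    horizon_months: int = 120,
) -> Optional[int]:
    """First month in which the projected restore time exceeds the RTO.

    Returns 0 if the RTO is already missed today, or None if it holds
    for the whole horizon.
    """
    for month in range(horizon_months + 1):
        projected_tb = size_tb * (1 + monthly_growth) ** month
        # Decimal units: 1 TB = 1,000,000 MB.
        restore_hours = projected_tb * 1_000_000 / throughput_mb_per_s / 3600
        if restore_hours > rto_hours:
            return month
    return None


# Hypothetical shop: 1 TB today, growing 5% a month, restoring at 200 MB/s, 4-hour RTO.
print(months_until_rto_breached(1, 0.05, 200, 4))  # 22
```

In this made-up case an RTO that is comfortable today stops being achievable in under two years, without anyone changing a single setting. A projection like this is no substitute for a timed test restore, but it tells you when to schedule the next one.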
Quite often the reality of a 30-minute setback, rather than the desired (and expected) 15 minutes, causes folks to start thinking about alternatives such as HA in an effort to augment their DR situation. I’ve seen folks then start to think of HA as a replacement for DR, and that’s where real trouble creeps into your shop.
And the prices rise considerably as you try to narrow those gaps. As you get closer and closer to no downtime, and no data loss as well, your costs skyrocket.
Which is probably why companies then decide that some downtime is acceptable. But not having backups, or a DR solution in place?
That is never acceptable.
If you want to survive any database disaster then you need to start with having your backups in place.