09 Jun SQL University – HA/DR Week
I may be at TechEd this week, but that doesn’t excuse me from pitching in for SQL University. This week we are going to discuss the most critical function you have as administrators: your ability to recover from a disaster. Make no mistake, this is going to decide if you are able to keep your job or not. And in order to do that, you had better be able to tell the difference between what it means to be highly available and what it means to recover from a disaster.
There is little doubt in my mind that most people do not understand the difference between high availability (HA) and disaster recovery (DR). My advice to you is simple: learn and understand the difference between these two terms. They are not the same. There is, indeed, a difference. And should you ever be in a meeting with someone who happens to say “If we have HA then we don’t need DR” you have my permission to slap that person. Repeatedly. And then go look for another job because you do *not* want to work there anymore anyway.
HA is a solution you implement to keep your data available to end users when something fails. That’s it, nothing more. It does not protect your data (or yourself) in the event of a disaster; it merely offers you a chance for your data to remain available to the end users.
DR, on the other hand, is the solution that allows you to recover from a disaster, and it is often associated with business continuity plans. While it may be nice to always have your data highly available, if you (and others like you) want to keep your job then you had best have a way to recover that data should a disaster happen.
(And don’t get me started on the difference between a “disaster” and an “event”, or try to split hairs about “available” and “recoverable”. Look, at the end of the day, if all is lost, how would you rebuild your servers, or your business? Whatever process you just described is your DR plan. For most, their DR plan is to recover from a tape backup. If you are relying on your HA solution to always work, and believe that DR is unnecessary, then I suggest you update your resume now and save yourself some time later on.)
At this level you need to be able to get the job done. In order to do that, you need to understand the different options available in SQL 2008 with regard to HA. Get yourself familiar with database mirroring, transaction log shipping, clustering, and SQL replication. I am not saying you need to be an expert in each, just that you understand the advantages of each one and how they can be deployed as part of an HA/DR solution. These details will be crucial for you to have success at the next level.
Here you want to listen to what people are asking for before you present a solution. Write down all of the facts and requirements that you are able to capture. Then go back and see how well those requirements and facts match up to the features available in SQL 2008. If you are lucky then you will be able to fit everything neatly into one solution. Chances are you will need a combination of solutions in order to meet all the requirements. I often see examples of how people deploy an HA/DR solution that combines two or more features from SQL 2008, so keep that in mind when you are designing your solution; do not pigeonhole yourself into just one feature, use as many as are necessary.
If you want to be considered a master then you are going to need to throw around some acronyms in order for other people to think you know what you are talking about. If you want to build credibility fast then look no further than the terms “RPO” and “RTO”. These two acronyms are going to be your very close friends here; you’ll sound smart when you are using them, people will believe you are an expert, and you may have an opportunity to teach people the difference between HA and DR because people are always more willing to listen to an expert who uses fancy acronyms.
RTO stands for recovery time objective and represents the maximum acceptable length of time between the point at which your data becomes inaccessible and the time it is brought back online. Put simply, it is the amount of time you are allowed to be dead in the water, and it is often miscalculated. For example, have you ever been told that it would be OK for a system to be down for, say, a few hours? And then that system goes down and your phone rings and the voice on the other end is upset that it will take longer than five minutes to bring it back online? Yeah, most of us have been there, and that is why it is not only important to have a defined RTO, but also to make certain it is documented and communicated.
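If it helps to see the arithmetic, here is a minimal sketch of an RTO check in Python. The function name `meets_rto` and the timestamps are made up for illustration; they are not from any real incident or tool. All it does is compare actual downtime against the agreed objective.

```python
from datetime import datetime, timedelta

def meets_rto(outage_start, service_restored, rto):
    """Return (met, downtime): whether actual downtime stayed within the RTO."""
    if service_restored < outage_start:
        raise ValueError("service_restored must not be earlier than outage_start")
    downtime = service_restored - outage_start
    return downtime <= rto, downtime

# Hypothetical incident: data inaccessible at 14:00, back online at 16:30.
outage_start = datetime(2010, 6, 9, 14, 0)
service_restored = datetime(2010, 6, 9, 16, 30)

# The "few hours" you were told was OK, versus the five minutes the caller expected.
print(meets_rto(outage_start, service_restored, timedelta(hours=4)))
print(meets_rto(outage_start, service_restored, timedelta(minutes=5)))
```

The same two and a half hours of downtime passes one objective and fails the other, which is the whole argument for writing the number down before the phone rings.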
RPO stands for recovery point objective and represents the point in time to which you must be able to recover your data. Put simply, it is the acceptable amount of data loss, measured in time. So if you have a defined RPO of one hour then you need to be able to recover all data up to one hour prior to the disaster. The same rules apply here as for the RTO: you need to have it well defined, documented, and communicated. Otherwise you will have a voice on the other end of the phone line rather upset when you tell them you can get them back to a one hour RPO and they say they need to be back within a five minute RPO. Not good times.
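The RPO check is the same kind of subtraction, just pointed the other way. Below is a rough sketch in Python, assuming your last recoverable point is the end of your most recent good backup; the names `data_loss_window` and `meets_rpo` and the times are hypothetical.

```python
from datetime import datetime, timedelta

def data_loss_window(last_recoverable_point, failure_time):
    """How much work is lost: time between the last recoverable point and the failure."""
    if failure_time < last_recoverable_point:
        raise ValueError("failure_time must not be earlier than last_recoverable_point")
    return failure_time - last_recoverable_point

def meets_rpo(last_recoverable_point, failure_time, rpo):
    """True if the data loss window fits inside the agreed RPO."""
    return data_loss_window(last_recoverable_point, failure_time) <= rpo

# Hypothetical: last good log backup finished at 13:45, disaster struck at 14:00.
last_backup = datetime(2010, 6, 9, 13, 45)
failure = datetime(2010, 6, 9, 14, 0)

print(meets_rpo(last_backup, failure, timedelta(hours=1)))    # one hour RPO
print(meets_rpo(last_backup, failure, timedelta(minutes=5)))  # five minute RPO
```

Fifteen minutes of lost work is fine against a one hour RPO and a failure against a five minute one, so the backup schedule has to be designed around whichever number the business actually agreed to.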