Welcome back to SQL University! Today we talk about Very Large DataBases (VLDBs). I hope this blog post helps get you pointed in the right direction with regards to architecting and building your very own VLDB someday. (Chances are you already have one and maybe don’t know it yet!)
Let’s get it started.
First up, a question for you: are you tall? How do you define yourself as tall? Do you compare your height to those around you? Or do you compare it to some statistical average? Is that average for all humans? Just adults? Adults in your part of the world?
OK, enough stalling, let’s get to the real question. Do you have a large database? How about a very large database? How do you know if it is large, or very large? Who decides this?
Nobody does. Well, technically you do, so I guess that really isn’t “nobody”. But the point here is that typically nobody knows that they have a large (or very large) database until they have hit some type of tipping point. Good tipping points to consider would be the following:
- backup time
- restore time
- batch load time
- database size
- server memory needed
- number of CPUs needed
If you find yourself administering a database that requires special hardware purchases (disk, CPU, memory) or the time it takes to perform data operations (batch loads, backups, restores) is taking too long then there is a good chance you would classify that database as being “very large”.
(At some point I feel the need to tell you that size doesn’t matter, at least I am always told that is the case.)
How do you decide that tipping point anyway?
Whose concerns are more important? Yours? Your boss? The end user? The business? (HINT: The answer is ‘All of the above’)
My tipping point was centered on recovery. Whatever was built needed to be recovered in an acceptable amount of time. Who decided what would be acceptable? Everyone, that’s who.
The managers wanted the batch load to be completed as quickly as possible. Their tipping point was focused on the amount of time it would take to shove terabytes of data into the database. The end users tipping point was reporting. They wanted their reports generated as quickly as possible, no matter what parameters were being used to generate the reports.
Sense any problems yet? Well, you should. Just those three things (backups, writes, reads) are not always playing nice together in the sandbox at recess. Once you decide that you have hit that tipping point, and that you have a very large database, make sure that everyone understands the expected performance. If report generation is going to take five hours, then make sure the end users know that. If the batch load takes four hours, make sure the managers know that. If the backups take three hours, make sure everyone knows that fact (as well as how long it will take for you to recover, if recovery is to be needed).
Now that everyone has their tipping point identified, as well as an expectation for their area of need, we can talk about the actual design. What? You normally design the database before everyone gets together to discuss their needs? Then you’re doing it wrong. If you don’t have a list of (at least general) requirements then how do you expect to design something that is going to satisfy anyone?
Let’s assume you have talked to everyone at this point. Now you need to get down to the details. Know the options you have with things like partitioning, filegroups, and piecemeal backups. Understand how you can help to architect a database that can help those who need to write, those who need to read, and those who need to recover.
At this level people will come seek out your knowledge. You will have been able to not only help architect a viable solution for the requirements given at the onset, but you will have planned for future growth as well. If one word could describe someone at the master level it would be this: scalability.
Make certain that whatever you spend time building is flexible enough to be moved around whenever necessary. And I don’t mean a lift and load from one server to another. What I mean is making certain that your design can be shifted as business needs change. Today the reports are fine, tomorrow they need to run 50% faster, and don’t be stuck saying “it will take six days to rebuild that array onto faster disks”. Nobody wants to hear that from a master. What they want to hear is “We can get it done, this is what the plan is, and if we start today we can have the new array up and running by tomorrow and here is the cost.”
Thinking two steps ahead and having a plan for whatever needs arise is the sign of a master. And don’t forget to mention the cost, because that usually makes everyone stop and rethink if their needs are true. You’ll be surprised at home often things go from “WE NEED THIS NOW” to “damn, we can’t afford that”. But if you don’t have the details then you are not seen as a master, just a roadblock to progress.
Surprisingly there are not a whole lot of people who write or present on working with VLDBs specifically. I will see items about scalibility and the like but I rarely see anyone ever saying “hey, check out what I am doing with this VLDB”. At any rate, here are a few links I believe you will find useful.