Welcome back to SQL University, this week we are going to learn about data warehouses. I still remember the first time I saw what was described as a data warehouse. It was really just a bunch of excel spreadsheets shoved into SQL 2000 in a manner that allowed folks to run queries against and if they were lucky then the queries would finish in a decent amount of time (although I think some of them may still be running). In time I started to learn about a lot of warehousing concepts, tips and tricks, design concerns, etc. And the more I learned, the more I understood how difficult it is to build a proper data warehouse.
Oh, sure, it is easy to create a database and shove a bunch of data into it in a short amount of time. The trick is to design it in a way that allows for users to get data, the *right* data, out of the warehouse in a short enough amount of time to allow for them to make good business decisions. I can’t give you the skills overnight to be able to create a 30TB warehouse, but I can point out the skills you will need in order to build a warehouse of any size.
At this level you need to have an awareness about what a data warehouse is and how it is different than a traditional transactional database. Much like a traditional warehouse that is used to store a large volume of goods (for example, pickles) a data warehouse is used to store a large amount of data. And just like it is important to be able to get the pickles from the warehouse to the consumers, it is important that the data stored is able to be retrieved as well. Otherwise, why store the data if you would not need to use it at some time?
A data warehouse is the cornerstone to any effective Business Intelligence solution. A warehouse will allow for the extract, transform, and load (ETL) of data from a variety of sources. It will also allow for the ability to cleanse and verify and data discrepancies and the ability for end users to perform analysis and reporting quickly. In other words, the perfect data warehouse is a little like bacon made from a unicorn and served to leprechauns on gold plates under a double rainbow. With so many moving parts it is near impossible to have a perfect data warehouse. The phrase I often think of when it comes to data warehousing is “close enough for government work”, and I think that describes many of the systems I have ever seen.
In order to get the job done with data warehouses you need to become familiar with two terms: normalization and star schema. It should be easy for you to get started in learning how to normalize a database. Simply go to any database in your shop and examine the table structure, looking for primary and foreign keys. Chances are you will have lots of databases without any relationships defined. This is where you put on your thinking cap and start outlining what you believe may be a better database design.
Start thinking about how you would organize the data, the relationships you would build, and how your design would best allow for data to get in and out. I am not saying that every database needs to be in the third normal form, but I am saying that having a “database” with one table and 400 billion rows is not going to be entered into any performance benchmark tests.
What better level to talk about the need to have a good data model? Unfortunately good data models are hard to find. As a result most data warehouses end up serving as the data model that the underlying data sources wish they could have been. Which is fine, really, because some of the better data warehouses get their data from a lot of sources that are not structured, or normalized, to any degree.
So don’t get wrapped up in making certain that every possible database is a perfect data model. They don’t have to be, really. But your data warehouse certainly will need to be a better model than the source systems. And that’s where those earlier concepts of normalization and star-schemas come into play.
If you want to master data warehousing then you need to go build one. It’s that simple, and it does not have to be complex. Almost everything I have ever needed to learn has been a result of football. Well, fantasy football, to be accurate. And the same is true for data warehousing. Whenever I need to learn something new I always try to use data that is most familiar to me. So take whatever data you can, data that is familiar to you in some way, and try to shove it into a data warehouse. You will find that in a short amount of time you will learn just about everything you need to know. You will learn about ETL, you will learn about normalization and star schemas, you will learn about Analysis Services, you will learn about PowerPivot, and you will learn about Business Intelligence just by trying to take some data you know and storing it to use for reports and making decisions.
Here is a list of resources for data warehousing and/or business intelligence.
- Edgar Codd
- Ralph Kimball (blog | no twitter)
- David Stein (blog | @Made2Mentor)
- Karen Lopez (blog | @datachick)
- Donald Farmer (blog | @donalddotfarmer)
- Siddharth Mehta (blog | @siddhumehta)