Recently I wrote about five database design mistakes you don’t want to make. Those mistakes were focused on database design, but what about the design process itself? You could have every intention of building a solid design only to be let down by your overall design process.
Here are the five ways I have seen the design process fail even the best architects.
1. Not Testing To Scale
I think we have all seen this at least once. The coding is done, it gets moved to production, and everything comes to a halt faster than you can say “scalability”. What took five seconds in the development and test databases now takes over fifty minutes in production. The difference? It’s the data, of course. The code took five seconds against the 10,000 rows in development and test, but fifty minutes against the 6,000,000 rows in production.
After a quick round of blamestorming, the application team is looking to the DBA team to perform some miracle to bring performance to within acceptable limits. The DBA team is looking to the application team to make the necessary code and/or design changes, but of course that ship has sailed since it was important that the code be deployed by the drop-dead date (a successful deployment means you met your deadline; having the code actually work isn’t all that necessary for the project to be a success and for managers to cash fat bonus checks, apparently).
The solution? Test your code to scale, of course, and do so before you get to production. In fact, you should push your code beyond what is expected to be a common production load so that you know you have capacity for the future. You can do this the easy way by capturing query statistics (think logical I/O) and using extrapolation methods. Or you can go purchase one of a myriad of tools to help stress test your code for scalability.
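If you need a starting point for capturing those statistics, a minimal sketch in T-SQL might look like the following. The Orders and Customers tables are made-up placeholders, and the second query simply reads logical I/O numbers back out of the plan cache, so it assumes the test workload has already been run.

-- Measure logical I/O and CPU for a candidate query (table names are hypothetical).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.CustomerID, SUM(o.OrderTotal) AS total_sales
FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- Or pull aggregate logical reads from the plan cache after a test run,
-- then extrapolate from test row counts to expected production row counts.
SELECT TOP (20)
    qs.execution_count,
    qs.total_logical_reads,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;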
2. Failing To Look For Missing Indexes
Assuming you are testing against a reasonable production workload, another area that is often overlooked is reviewing the indexes. Are you missing any? Do you have too many? How do you know if the ones you have are the right ones?
There are a handful of ways to get this information. One way is to use the Database Engine Tuning Advisor. Another way is to query the DMVs to pull out information on the indexes. Kimberly Tripp has a handful of posts regarding indexes including this one that helps you find if you have duplicate indexes.
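If you want to skip the GUI, a rough sketch of those DMV queries looks like this; treat both result sets as candidates for review, not as a to-do list.

-- Missing index suggestions recorded by the optimizer since the last restart.
SELECT
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_total_user_cost,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
    INNER JOIN sys.dm_db_missing_index_groups AS mig
        ON mig.index_handle = mid.index_handle
    INNER JOIN sys.dm_db_missing_index_group_stats AS migs
        ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks DESC;

-- Nonclustered indexes that are being maintained but never read ("too many").
SELECT
    OBJECT_NAME(ius.object_id) AS table_name,
    i.name AS index_name,
    ius.user_updates,
    ius.user_seeks + ius.user_scans + ius.user_lookups AS total_reads
FROM sys.dm_db_index_usage_stats AS ius
    INNER JOIN sys.indexes AS i
        ON i.object_id = ius.object_id
       AND i.index_id = ius.index_id
WHERE ius.database_id = DB_ID()
  AND i.type_desc = 'NONCLUSTERED'
  AND ius.user_seeks + ius.user_scans + ius.user_lookups = 0
ORDER BY ius.user_updates DESC;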
Another indexing tip: put indexes on your foreign keys. I won’t say “always do this”, but you should certainly be reviewing your foreign keys as part of your indexing strategy. If you do create them, just make sure they are going to be used, otherwise you have unnecessary overhead.
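One way to do that review is to look for foreign keys whose leading column does not lead any existing index. The sketch below only checks the first column of each key, so treat it as a starting point rather than a complete audit.

-- Foreign keys whose leading column is not the leading key column of any index.
SELECT
    OBJECT_NAME(fk.parent_object_id) AS table_name,
    fk.name AS foreign_key_name,
    COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS fk_column
FROM sys.foreign_keys AS fk
    INNER JOIN sys.foreign_key_columns AS fkc
        ON fkc.constraint_object_id = fk.object_id
       AND fkc.constraint_column_id = 1
WHERE NOT EXISTS
(
    SELECT 1
    FROM sys.index_columns AS ic
    WHERE ic.object_id = fkc.parent_object_id
      AND ic.column_id = fkc.parent_column_id
      AND ic.key_ordinal = 1
);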
3. Not Reviewing Datatypes
Things change; that’s what they do. A column that started out as an excellent candidate for datetime might actually only need smalldatetime. Your primary key may have started out narrow, but after a few rounds of design meetings it may have been widened and could use some trimming.
Here’s a pro tip for you: never assume the datatype based upon the name of the column. A Vehicle Identification Number (VIN) is not a number. Most of the account numbers you have with companies are not numbers, either. If your business folks call something a “number”, there is a good chance it is not really a number in the true sense of the word. Always keep in mind that the business folks are there to run the business and not to design databases and applications.
Reviewing the datatypes is a very tedious job, and not very glamorous. But the performance benefits are often worth the investment.
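One way to make the review less tedious is to pull a simple inventory of declared datatypes from the catalog views and scan it for suspects, something along these lines:

-- Column datatype inventory for review (max_length of -1 means MAX; lengths are in bytes).
SELECT
    OBJECT_SCHEMA_NAME(c.object_id) AS schema_name,
    OBJECT_NAME(c.object_id) AS table_name,
    c.name AS column_name,
    t.name AS datatype,
    c.max_length,
    c.precision,
    c.scale
FROM sys.columns AS c
    INNER JOIN sys.types AS t
        ON t.user_type_id = c.user_type_id
    INNER JOIN sys.tables AS tbl
        ON tbl.object_id = c.object_id
ORDER BY schema_name, table_name, c.column_id;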
4. Not Listening To Your SMEs
You have a database administrator on staff and/or a data architect as well. You pay them for their experience. So why don’t you listen to them? If they suggest that your clustered key is too wide, maybe you should rethink those three GUIDs you put in there. If the DBA suggests that using table-valued functions is not an optimal design choice, maybe you should rethink that part of your design. All too often the SMEs are perceived to be a roadblock to progress simply because they want to help build something that has a reasonable amount of stability.
You have subject matter experts right there in front of you. Why aren’t you listening to them?
5. Skipping Data Profiling
With each new system comes an influx of data, and much of it seems familiar to you, as if you have seen it before. That’s because you have. It’s the same data, but from a different provider, and presented in a shiny new way. But don’t be misled into thinking it is complete. For example, if you have purchased a mailing list of customers from the CYOA company, you may want to verify that it contains the details of customers spread throughout a wide geographical area. How many times have you worked with an ETL system and been ready to deploy to production only to hear someone comment, “what do you mean we don’t have any customers east of Chicago?” Or one of my favorites, “what do you mean we don’t have any stock price or mutual fund information for the last three years, where did it go?”
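A little up-front profiling answers those questions before the go-live call. The staging tables and columns below are hypothetical, but the idea is simply to check the distribution and coverage of whatever just landed:

-- Geographic spread: does every state or region you expect actually show up?
SELECT State, COUNT(*) AS customer_count
FROM staging.CustomerFeed
GROUP BY State
ORDER BY customer_count DESC;

-- Date coverage: are there gaps in the history you were promised?
SELECT
    MIN(PriceDate) AS earliest_price,
    MAX(PriceDate) AS latest_price,
    COUNT(DISTINCT PriceDate) AS distinct_days
FROM staging.StockPriceFeed;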
There you go, five ways you could be let down by your design process. Together with the five database design mistakes you now have ten ways to improve upon your database creations.
I’ll list a few more.
1) Not properly evaluating the use of natural keys vs synthetic keys, typically resulting in simply applying synthetic (identity) keys to all tables, even when natural keys make more sense.
2) Simply creating “Primary Keys” on tables using the Enterprise Manager GUI, or even scripting and creating primary keys, allowing SQL to create the default clustered index. Quite often, primarily in data warehouse settings, the clustered index and the primary key should be different. Indeed, in some cases the primary key should be on a synthetic key and the clustered index should be on a natural key.
An example here in my case was a claim transaction table I designed recently, where I created a clustered index on insured id, claim number, transaction date, transaction id. The primary key is simply on transaction id. The transaction id is unique in and of itself, but the way the data is reported on is always in relation to an insured or to a claim, and typically within date ranges. (There’s a rough sketch of this layout, along with the compression discussed next, after this list.)
3) COMPRESSION! Evaluating compression should be a standard practice for database design now, especially in a data warehouse or reporting type of environment. And this is a big deal, because it’s not just a matter of deciding after the fact whether or not to compress; the use of compression can impact your table design.
Again, in a reporting environment, I’m trending more and more toward de-normalization and page compression for superior performance. Normalization is still very important in OLTP systems for data integrity, but it’s much less important now for efficient data storage. As we all know, query performance is generally better for de-normalized data, but in the past the size of the data remained a constraint on de-normalization. It was considered foolish to store duplicate values across rows, or to store “decoded values” as opposed to storing tinyint keys that reference decode tables, etc. But now (in reporting settings) I’m de-normalizing data, flattening it out, and storing real values in the data, using a lot of varchar and even char fields to store the real values of things like “types” and “categories”. And yeah, when you look at the data you see tons of duplication. But guess what: now you throw page compression on it and your 50 gig database shrinks down to 5 gig.
When you write reports against this data it’s super fast. Very few joins, and now you can put columnstore indexes on it too. In many cases it essentially avoids the need to build cubes.
Another approach of course is to use a compressed indexed view against normalized data, but I haven’t had much luck with this due to the complexities and requirements of setting up indexed views.
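For what it’s worth, here is a rough sketch of the pattern described in points 2 and 3 above: the synthetic key kept as a nonclustered primary key, the clustered index on the natural reporting path, page compression applied up front, and an optional columnstore index on top. The object names are made up for illustration, and the estimation procedure is there to check the savings on an existing, populated table before you commit.

-- Synthetic key as a nonclustered primary key; hypothetical names throughout.
CREATE TABLE dbo.ClaimTransaction
(
    TransactionID   bigint        NOT NULL,
    InsuredID       int           NOT NULL,
    ClaimNumber     varchar(20)   NOT NULL,
    TransactionDate date          NOT NULL,
    TransactionType varchar(30)   NOT NULL,  -- de-normalized "real value" instead of a tinyint lookup
    Amount          decimal(12,2) NOT NULL,
    CONSTRAINT PK_ClaimTransaction
        PRIMARY KEY NONCLUSTERED (TransactionID)
);

-- Clustered index follows the way the data is actually reported on, compressed from day one.
CREATE CLUSTERED INDEX CIX_ClaimTransaction
    ON dbo.ClaimTransaction (InsuredID, ClaimNumber, TransactionDate, TransactionID)
    WITH (DATA_COMPRESSION = PAGE);

-- Check the estimated savings before compressing an existing, populated table.
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'ClaimTransaction',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Optional reporting accelerator (SQL Server 2012+; note it makes the table
-- read-only on 2012/2014, so it fits load-then-report warehouses best).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_ClaimTransaction
    ON dbo.ClaimTransaction (InsuredID, ClaimNumber, TransactionDate, TransactionType, Amount);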
Those are great additions, thanks for the comment.