Look, everyone makes mistakes. It’s true. But not all of us have the chance to make mistakes that end up costing millions of dollars in hardware and production support costs.
Any one of the five mistakes listed below will add costs to your company. It’s guaranteed. The costs could be hardware related (extra disk space, network bandwidth), and those tend to add up quickly. The costs could also be support related (bad performance, database redesign, report creation, etc.), and those add up even more quickly.
Want to save your company money, spare your end-users some headaches, and ensure data quality all at the same time? Then avoid these five database design mistakes.
1. Going BIG, just in case
I often see data types being chosen as if they don’t matter. But the truth is they do matter. If you know that the only possible values for a certain column are between 0 and 100,000, then you don’t need to slap a BIGINT data type on that column when an INT would do just fine. Why does this matter? The BIGINT data type requires 8 bytes of storage, while INT requires only 4. That means for each row of data you could be wasting 4 bytes. Doesn’t sound like much, right?
OK then, let’s consider that your table has two million rows. Multiply those rows by 4 bytes and you have 8 million bytes, or roughly 7.6MB of wasted space. That still doesn’t sound like a lot, does it? Well, it adds up, and quickly. I’ve only shown you an example for just one column, but what about those NCHAR(2000) columns that are storing mostly first and last names? How much extra overhead for those? How about your date columns? Do you have a need for calendar dates prior to the year 1900, or after the year 2079? If not, then SMALLDATETIME is likely to work just fine for you. Oh, and let’s not forget that these columns can be indexed, and those indexes will be unnecessarily wide as well.
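To make the sizing concrete, here is a minimal sketch (the table and column names are mine, purely for illustration) contrasting an oversized design with a right-sized one:

```sql
-- Oversized: 8-byte BIGINT for values that never exceed 100,000,
-- fixed-width NCHAR(2000) that burns 4,000 bytes per name on every row,
-- and an 8-byte DATETIME for dates that fit between 1900 and 2079.
CREATE TABLE dbo.OrderBloated
(
    OrderNumber BIGINT      NOT NULL, -- 8 bytes
    LastName    NCHAR(2000) NOT NULL, -- 4,000 bytes, every row
    OrderDate   DATETIME    NOT NULL  -- 8 bytes
);

-- Right-sized: the same data in a fraction of the row width.
CREATE TABLE dbo.OrderTrim
(
    OrderNumber INT           NOT NULL, -- 4 bytes covers 0 to 100,000
    LastName    NVARCHAR(100) NOT NULL, -- pays only for characters stored
    OrderDate   SMALLDATETIME NOT NULL  -- 4 bytes, 1900-01-01 to 2079-06-06
);
```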
Choosing the right data type matters, for all sorts of reasons. Take the time and make an effort to get it right at the start.
2. Expecting a DBMS feature to mean the same across all systems
Bigger companies tend to have a variety of systems, all of which need to exchange data. It is quite common to see familiar terms in use across those systems. For example, you are likely to see the term TIMESTAMP. You simply cannot assume that the meaning of TIMESTAMP will be the same between systems. If you do, you are likely to have issues at some point. If you are lucky, the issues will manifest themselves early. If you are unlucky, the issue will lie dormant, surfacing at the worst possible time and bringing your system to a grinding halt.
I chose TIMESTAMP as my example here because it is one of my favorite misnomers inside of SQL Server. Why? Because TIMESTAMP has nothing to do with date or time. And I love this hidden gem inside the BOL:
The Transact-SQL timestamp data type is different from the timestamp data type defined in the ISO standard.
So that means other systems that do adhere to the ISO standard of what it means to be a TIMESTAMP are going to return something vastly different. Take Oracle, for example, and how it defines a TIMESTAMP value. DB2 has some subtle differences as well. The lesson here is that when you are working with disparate systems, do not assume that familiar terminology will work in exactly the same way.
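If you want to see the SQL Server behavior for yourself, here is a quick sketch (the table name is mine) using ROWVERSION, the current synonym for timestamp:

```sql
-- In SQL Server, timestamp/ROWVERSION is an auto-incrementing 8-byte
-- binary value used for row-change detection. No date or time involved.
CREATE TABLE dbo.VersionDemo
(
    Id     INT IDENTITY(1,1) PRIMARY KEY,
    RowVer ROWVERSION -- populated automatically on INSERT and UPDATE
);

INSERT INTO dbo.VersionDemo DEFAULT VALUES;

SELECT Id, RowVer FROM dbo.VersionDemo;
-- RowVer comes back as something like 0x00000000000007D1, while an
-- ISO-style TIMESTAMP (Oracle, for example) holds an actual date/time.
```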
3. Expecting database-generated identifiers to be sequential
Being able to insert data into a database in sequential order has many benefits, and most database designers will look to implement some type of mechanism to ensure that such sequences are used. However, many of those mechanisms will leave “gaps” from time to time, and these gaps lead to confusion as people expect to see {1, 2, 3, 4, 5…} and not something like {1, 2, 4, 5, 6…}. As soon as a gap is found, people send out a search party looking for the missing data. I know I’ve lost hours trying to track down “bugs” in systems where people thought rows of data were missing, only to realize that the integrity of the data was fine but the method for generating the sequence was not.
Now, for reasons I can’t explain, people just love GUIDs in database designs. One of the (many) issues with GUIDs is that they are not sequential by default. This will lead to decreased performance, often in a very short amount of time. For a great explanation of this in detail, check out this post by Kimberly Tripp at SQL Skills. The part you want to pay attention to is here:
“But, a GUID that is not sequential…can be a horribly bad choice – primarily because of the fragmentation that it creates in the base table but also because of its size. It’s unnecessarily wide…”
Yeah, that *is* what she said.
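If you are stuck with GUID keys anyway, one common mitigation (a sketch; the table name is mine) is to let the server generate them sequentially:

```sql
-- NEWID() returns random GUIDs, which scatter inserts across the
-- clustered index and cause the fragmentation Kimberly describes.
-- NEWSEQUENTIALID() is valid only as a column DEFAULT and produces
-- ever-increasing GUIDs on a given machine, keeping new rows at the
-- end of the index.
CREATE TABLE dbo.OrderGuid
(
    OrderId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_OrderGuid_OrderId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_OrderGuid PRIMARY KEY CLUSTERED,
    OrderDate SMALLDATETIME NOT NULL
);
```

Even then, the 16-byte width she mentions doesn’t go away; sequential generation only addresses the fragmentation half of the problem.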
4. Not examining your foreign keys (FKs) as part of your indexing strategy
I am assuming, of course, that you even have foreign keys defined. I’ve seen *many* databases that have little to no primary keys, foreign keys, or even any indexes defined. No, I don’t know who would do such a thing either. But they’re out there, and sooner or later you will find them, too.
Assuming you have FKs defined, you should be evaluating whether it would make sense to add indexes to match those FK definitions. In some cases, it will. In other cases, it won’t. But you should make certain that this type of review is part of your overall design process.
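As a starting point for that review, here is a rough sketch of a catalog query that lists foreign key columns with no index leading on them. It is simplified (it checks only the leading key column of each index), so treat it as a conversation starter, not a verdict:

```sql
-- Foreign key columns on the referencing table that do not lead any index.
SELECT fk.name AS ForeignKeyName,
       OBJECT_NAME(fk.parent_object_id) AS TableName,
       c.name AS ColumnName
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc
    ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns AS c
    ON c.object_id = fkc.parent_object_id
   AND c.column_id = fkc.parent_column_id
WHERE NOT EXISTS
(
    SELECT 1
    FROM sys.index_columns AS ic
    WHERE ic.object_id = fkc.parent_object_id
      AND ic.column_id = fkc.parent_column_id
      AND ic.key_ordinal = 1 -- the column leads an index
);
```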
5. Applying a surrogate key without keeping the original business key as an alternate
Back to those folks who love the GUIDs for a moment. If you are using a surrogate key (and, ideally, one that is sequential) in your database design you should not forget that the data in your table should also have an alternate key based upon the business logic that was used to necessitate the creation of the table in the first place.
Many database designers tend to focus on the surrogate key only and forget all about the other columns. This is bad for your data quality: without a unique constraint/index on that alternate key, duplicate values can creep into your business data. Nobody wants that, trust me.
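A minimal sketch of what that looks like (the names are illustrative): the surrogate key handles joins and foreign keys, while a unique constraint on the business key keeps duplicates out.

```sql
CREATE TABLE dbo.Customer
(
    -- Surrogate key: narrow and sequential, ideal for joins and FKs.
    CustomerId   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,

    -- Business (alternate) key: what the business actually uses to
    -- identify a customer. The UNIQUE constraint blocks duplicates.
    CustomerCode VARCHAR(20) NOT NULL
        CONSTRAINT UQ_Customer_CustomerCode UNIQUE,

    CustomerName NVARCHAR(100) NOT NULL
);
```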
And there you have it. Avoid these five database design mistakes and you will save your company money, spare your end-users some headaches, and ensure data quality all at the same time.
“I’ve only shown you an example for just one column, but what about those NVARCHAR(2000) columns that are storing mostly first and last names? How much extra overhead for those?”
None? It’s a varchar 😉
Sorry, meant for that to be NCHAR, thanks for catching that.
Thomas, wouldn’t you use TINYINT if the values are only between 0 and 255? Also, for dates, if you don’t need the time, use DATE instead; it is only 3 bytes compared to 4 bytes for SMALLDATETIME. The nvarchar(2000) I don’t understand… are you talking about it being nvarchar instead of varchar (2 bytes per character)?
Another thing that I see is having a SMALLINT in one table and then using an INT in the table with the FK.
The other issue with using something like nvarchar(2000) for names is that people get tempted into stuffing more data into that column rather than adding a new column. I’ve worked in shops where the standard is that all character columns are varchar or nvarchar(2000), using the same logic of “it doesn’t take up more space.” And in almost all cases, when I profile the data, I find all kinds of extra data in those columns. I call them “bucket columns” because they get used to hold miscellaneous data, usually delimited with something that requires parsing the column to get what’s really needed. That can be both a data quality and a performance hit.
I have seen all kinds of crazy data in the AddressLine2 column as well
Ah, I usually see it in AddressLineFour 🙂 . Like the client who stored the date they actually mailed cheques to their vendors, not the date on their agreement. Then some yahoo printed all four lines on the cheques.
So the vendors figured it out. And a new column was created, just for that special data. The world was at peace again.
You can tell that I get all hyped up over this “but storage is free now” stuff I hear about db design. I don’t think storage is the only issue when trying to figure out cost, benefit, and risk in design.
Storage is cheap, but not when you have to replicate or mirror that useless data over a slow LAN. Also, the backups are bigger (and yes, compression is a nice gift… thankful for that as well).
+1 Denis. Excellent point.
Denis,
Thanks for the catch, I had some thoughts and ideas get crossed up in between the time my brain thought of them and the words left my fingers. Sorry for the confusion. I changed the 255 to 100,000, so that the use of INT would make more sense in that context, as opposed to TINYINT (which would be the correct choice for 0 to 255).
Actually, the problem with TINYINT is that other systems often define 1-byte ints as ranging from -128 to 127 as opposed to from 0 to 255. Because of this I shy away from TINYINT and use SMALLINT instead.
Thanks for this post. I’m pretty new to all this, but I understood it and took something away from it.
Thanks again!
Thomas,
Exactly, well said. Most of the time when I have struggled with performance tuning because of bad design, if you ask the folks at the organization, they will say they don’t know who designed it.
See my post. I also plan to write about each bullet in a post of its own.
http://www.sqlserverblogforum.com/2011/07/t-sql-best-practice-sql-server/
Actually, another mistake I see a lot is ALWAYS using a synthetic identity key as opposed to natural keys, or defining a synthetic identity key and not putting any kind of indexes or constraints on the natural key. I think this is a common error among software developers who create databases, as opposed to actual database developers. I’ve seen databases with stuff like state and zip code look-up tables with identity columns on them, as opposed to just defining the 2-char state code or the zip code as the primary key. Absurdity, especially since this required searching the zip code table by zip code to look up the id, in order to join to another table by zip code id, in order to look up some fact by zip code, all because in every table with zip code data the only foreign key was to the ZipId identity column instead of simply to ZipCode :p
Bad decision. Your company decides to sell into the Australian marketplace, which has three-letter state codes and 4-digit postcodes. You’ve just set up your primary key based on a two-char state code and a 9-digit zip code.
Natural keys are a bad move, because even the most basic requirements can change.
Number 5 is my number one.
Excellent post.
Data types are often neglected, even in big companies. I encountered one database where the data type used in all tables was NVARCHAR(MAX). What was more horrible was the fact that they didn’t believe in indexing. None of the tables had any PK-FK relationships.
What mechanism for creating surrogate key sequences is both gap-free and safe for concurrent access?
Some good points.
I’ve made a whole series on DB Design Mistakes To Avoid.
Starting with comparing bad DB designs to the Leaning Tower of Pisa: http://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/
Thomas, we use goods because of dbs that get folding into one another…..combined data….and we need to know the id is going to be truly unique across the enterprise…..Do you have other suggestions that might accomplish this…
There are other ways to guarantee uniqueness that are more expensive, but I can’t say without more background what the right solution would be for your case.
Sorry that was guids….don’t know if the system did a word search or my typing is just that bad.
It’s an excellent topic about database design. I have really improved my database design skills by reading your points.
This is a nice summary. I have a similar series of articles (5 articles) written around these topics in 2009: http://www.sql-server-citation.com/search/label/Common%20mistakes%20in%20sql%20server . I have observed misuse of IDENTITY most of the time, and of the NCHAR data type.
Thanks
I know, I’m late to the party with commenting on a 2 year old post.
I’m surprised nothing about row-by-row processing was included here. I’ve seen a lot of databases that crawl because they rely on a foundation built upon either scalar functions or parameterized table-valued functions. The worst part is, these get so widely used, and so much gets thrown into them, that they end up being the most difficult performance issues to resolve.
However, the other 5 points are perfect, so I guess you’d have to do a top-6, which doesn’t sound as nice.
Steve,
Those things can be implemented regardless of the underlying database design. They are more dev/programming issues than design issues.
Bad code can bring good design to its knees, begging for mercy, just as you have described.
Thanks for the comment, even if it’s been 2 years!
At what point do you draw the line between database design and code? The indexes are part of the foundation of a database, but they’re easier to manipulate than these functions.
I tend to think of them as part of the database design when it’s a documented procedure that, when you want to get a piece of data, you run fn_SpeedLimit(@LoopOnMe).
Steve,
Interesting question…and it might be a matter of preference for some. For me I try to think about things that are physical design (file layout, partitioning, etc.) and logical design (tables, views). After that, I see things like procedures and functions as part of a data access layer, and that’s when I tend to say “it’s code” and not design.
Others may disagree with that viewpoint. Since you can script procedures and functions as part of the database schema, they may see it all as design.
Tom