How well do you know your data?

I’m not talking about user names, passwords, or credit card numbers. I don’t expect you to know the values stored inside the rows and tables of your database (although I *was* once called out for not knowing such details, so apparently there are some people out there who expect the DBA to know every piece of information).

What I am asking you here has to do with datatypes. You know what those are, right? They are how you define your data when creating a table. They have fancy names like NVARCHAR, TIMESTAMP, and VARBINARY.

And most folks have no idea which ones they should be using. How confident are you that your systems were designed by someone who took the time and effort to choose their datatypes wisely?

I’m willing to wager that you have systems in your care, right now, with a less-than-optimal selection of datatypes for the actual data being stored. As such, I have put together a list of ten ways for you to right-size your datatypes and your data.

The items below apply to relational database platforms as a whole, but many of the scripts will be focused on SQL Server. I’ve tried to make note of that where appropriate. If you have a script you’d like me to link to, just let me know and I can include it in an update to this post. Just leave a comment at the end.

1. Understand the storage requirements for every datatype you consider

You need to know and understand the implications of choosing one datatype over another. An INT takes 4 bytes of storage and a BIGINT takes 8, a difference of 4 bytes per value. That may not seem like a lot, but those bytes add up fast given that you are likely to have more than one column with the wrong datatype and millions of rows in the table.

The documentation for each database platform spells out these storage requirements; SQL Server’s data type reference in the official documentation is a good place to start.
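You can also verify these sizes for yourself. Here’s a minimal check in T-SQL using DATALENGTH, which returns the number of bytes used to store a value:

```sql
-- DATALENGTH returns the number of bytes a value occupies in storage
SELECT DATALENGTH(CAST(42 AS INT))                AS int_bytes,      -- 4
       DATALENGTH(CAST(42 AS BIGINT))             AS bigint_bytes,   -- 8
       DATALENGTH(CAST(N'abc' AS NVARCHAR(10)))   AS nvarchar_bytes; -- 6 (2 bytes per character)
```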

2. Review all design decisions based on the shape of the data – where it is now and where it is likely to be later

When designing a database you need to understand how the data will move through your system: how it gets inserted, how it gets stored, and how it gets used, both today and as the system grows. Once you have those details it will be easier to know whether you have made the right decisions.

3. Set datatypes based on business requirements, not tool defaults

Different tools use different defaults. If you create a new table in a database using the SQL Server Management Studio (SSMS) table designer, you may very well end up with columns defined as NCHAR(10) on that table. If you then create a stored procedure from the SSMS template, the default parameter datatype is INT, a likely mismatch with the table you just created. Don’t rely on the defaults to build your system.
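As a contrast to those defaults, here is a hypothetical table where every datatype is declared explicitly from the business rules (the table and column names and the sizes are illustrative, not from any real system):

```sql
-- Hypothetical example: every type chosen deliberately, not accepted from a designer default
CREATE TABLE dbo.Customer
(
    CustomerID   INT          NOT NULL,  -- surrogate key; BIGINT only if growth demands it
    CountryCode  CHAR(2)      NOT NULL,  -- fixed-width ISO code, not NCHAR(10)
    Email        VARCHAR(254) NOT NULL   -- sized to the business rule, not a default
);
```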

4. Measure and monitor fit of the data to its datatypes regularly

You can use my script to look inside a SQL Server database for integer values that may need their datatypes adjusted. Or you can run this script to check the datatypes currently residing in memory, as those are likely the ones you should focus on adjusting first. Either way, you are being proactive in measuring and monitoring how well the data matches its defined datatypes.
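If you want a quick spot check in the meantime, something like the following works. This is an illustrative sketch, not the script mentioned above; dbo.Orders and OrderID are hypothetical names you would swap for your own:

```sql
-- Illustrative sketch: how much headroom is left in an INT identity column?
-- Replace dbo.Orders / OrderID with your own table and column.
SELECT MAX(OrderID)                AS current_max,
       2147483647                  AS int_max,
       2147483647 - MAX(OrderID)   AS headroom
FROM dbo.Orders;
```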

5. Review each index creation request to see if it is duplicate

You want to avoid adding unnecessary indexes to your database, as each one adds additional overhead for any delete, update, or insert statement. One of the easiest ways to achieve this is to review the index creation scripts before you execute them against the database. In other words, don’t just take scripts from your developers and deploy them to production without taking the time to review them.
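One simple review aid is to list the existing indexes and their leading key columns before approving a new one. This is a minimal sketch against the SQL Server catalog views; two indexes sharing a leading column is not proof of duplication, just a flag for a closer look:

```sql
-- Illustrative sketch: list every index and its leading key column,
-- so a proposed new index can be compared against what already exists.
SELECT t.name AS table_name,
       i.name AS index_name,
       c.name AS leading_key_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
      ON ic.object_id   = i.object_id
     AND ic.index_id    = i.index_id
     AND ic.key_ordinal = 1
JOIN sys.columns AS c
      ON c.object_id = ic.object_id
     AND c.column_id = ic.column_id
JOIN sys.tables AS t
      ON t.object_id = i.object_id
ORDER BY t.name, c.name;
```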

Most people think of data governance as just something for tables and columns, but the right datatype and index choice will impact user satisfaction for both data quality and performance.

6. Measure for unused and duplicate indexes regularly

Reviewing the scripts before the indexes are created is great, but there’s a good chance you won’t catch everything. You will need to monitor for duplicate indexes at regular intervals. I have a script that I have used in the past for finding duplicate indexes inside of SQL Server. Running that script helps me find and eliminate wasted space inside of databases.
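As one way to approach this (a rough sketch, not the script mentioned above), you can compare the key column lists of every index directly. This version uses STRING_AGG, which requires SQL Server 2017 or later:

```sql
-- Illustrative sketch: find pairs of indexes on the same table whose
-- key column lists are identical (exact duplicates).
WITH IndexKeys AS
(
    SELECT i.object_id,
           i.index_id,
           i.name,
           keys = STRING_AGG(c.name, ',') WITHIN GROUP (ORDER BY ic.key_ordinal)
    FROM sys.indexes AS i
    JOIN sys.index_columns AS ic
          ON ic.object_id = i.object_id AND ic.index_id = i.index_id
    JOIN sys.columns AS c
          ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    WHERE ic.is_included_column = 0
    GROUP BY i.object_id, i.index_id, i.name
)
SELECT OBJECT_NAME(a.object_id) AS table_name,
       a.name                   AS index_name,
       b.name                   AS duplicate_of
FROM IndexKeys AS a
JOIN IndexKeys AS b
      ON a.object_id = b.object_id
     AND a.keys      = b.keys
     AND a.index_id  < b.index_id;
```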

7. Review new stored procedures to verify parameters are matching

This could be a tad time consuming, but the benefits should be worth the effort over time. Datatype mismatches often cause inefficient query plans to be chosen, which means poorer performance. If you can help avoid poor performance simply by reviewing and verifying the parameters defined inside a stored procedure before it is deployed, why wouldn’t you take the time? I would also remind you that not all queries are part of a stored procedure, so you should be mindful of the parameter declarations for ad-hoc statements as well.
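Here’s a hypothetical example of what that review is looking for. Assume dbo.Customer.Email is defined as VARCHAR(254); the parameter should match that definition exactly:

```sql
-- Hypothetical example: the parameter type matches the column it filters.
-- If the column is VARCHAR(254), declaring the parameter as NVARCHAR forces
-- an implicit conversion that can prevent an index seek on Email.
CREATE PROCEDURE dbo.GetCustomerByEmail
    @Email VARCHAR(254)   -- matches the column, not NVARCHAR(254)
AS
BEGIN
    SELECT CustomerID, Email
    FROM dbo.Customer
    WHERE Email = @Email;
END;
```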

8. Find longest running and most often used queries

You can find which queries are being executed most frequently inside your server instance as well as the queries that have the highest duration. I like to use the diagnostic queries provided by Glenn Berry (blog | @GlennAlanBerry) in order to get the details from the DMVs. You can get a full copy of the SQL2012 version here. By finding the queries that are executed most often or have the highest duration you can prioritize the right-sizing of datatypes that are being actively used.
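Glenn’s diagnostic queries are far more thorough, but as a minimal sketch, sys.dm_exec_query_stats can rank cached statements by elapsed time on its own:

```sql
-- Illustrative sketch: top 10 cached statements by total elapsed time
SELECT TOP (10)
       qs.total_elapsed_time / 1000 AS total_elapsed_ms,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
```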

9. Look for implicit conversions in your plan cache

You can scrub your plan cache to find the queries that had implicit conversion warnings in the query plan. Here is a great SQL Server script from Jonathan Kehayias (blog | @SQLPoolboy) that will help you do just that.
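Jonathan’s script parses the plan XML properly; as a rough first pass, you can simply search cached plans for the CONVERT_IMPLICIT operator. Note that this scans the entire plan cache, so run it on a quiet system:

```sql
-- Rough first pass: find cached plans containing an implicit conversion.
-- This scans the whole plan cache and can be expensive on a busy server.
SELECT TOP (20)
       cp.usecounts,
       qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE CAST(qp.query_plan AS NVARCHAR(MAX)) LIKE '%CONVERT_IMPLICIT%';
```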

10. Remember that size matters

Despite everything you may have been told while in college, size does matter.

That’s my list of ten ways to help you right-size your data and datatypes. They have served me well over the years, helping me to keep my environments running with as little waste as possible.