Does This Datatype Make My Column Look Fat?

Last Wednesday I delivered a webinar for Confio titled “Database Design: Size DOES Matter”. I had a piece of code at the end that evaluated defined datatypes to the current data values in those columns. The idea being that if a column is defined as BIGINT but contains values that are best defined as a SMALLINT the script would recognize this and flag that column as a candidate for being altered. The script also outputs a couple of result sets, one result set for the amount of disk space and one result set for the amount of logical I/O you could potentially save.

I had many requests for that script but I don’t feel it is ready to be shared yet. One person asked me for guidance on how to build their own script and that gave me an idea: I could build a scaled down version of my script and share it here. That way if someone wanted to take my script and add their own code on top of it they could.

So here it is. This script will only evaluate the TINYINT, SMALLINT, INT, and BIGINT datatypes. It is for SQL Server only, and should run against any version from SQL 2005 and later.

Please don’t blindly follow the advice that any script provides. You MUST know a few things about the database design, the business requirements, etc., before you start altering columns. You certainly would NOT want to change from BIGINT to INT and then find out you ran out of values sooner than necessary! I would also recommend that you test thoroughly. In fact, let’s just go with the standard disclaimer:

DISCLAIMER: Do not run code you find on the internet in your production environment without testing it first. Do not use this code if your vision becomes blurred. Seek medical attention if this code runs longer than four hours. On rare occasions this code has been known to cause one or more of the following: nausea, headaches, high blood pressure, popcorn cravings, and the impulse to reformat tabs into spaces. If this code causes your servers to smoke, seek shelter. Do not taunt this code.

/*=============================================
  File: check_datatype_int.sql

  Author: Thomas LaRock, https://thomaslarock.com/contact-me/

  Summary: This script will verify defined datatypes to the data
           that currently exists. 

  Returns: Two (2) results sets. The first is for an estimate of 
           disk space savings. The second is for potential
           logical I/O savings.

  Date: September 4th, 2012

  SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012

  You may alter this code for your own purposes. You may republish
  altered code as long as you give due credit. 

  THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
  FITNESS FOR A PARTICULAR PURPOSE.

=============================================*/

--CHANGE TO WHATEVER DATABASE YOU WANT TO EVALUATE
USE [AdventureWorks2012]
GO

SET NOCOUNT ON
SET ANSI_WARNINGS OFF

IF  EXISTS (SELECT * FROM tempdb.dbo.sysobjects 
WHERE id = OBJECT_ID(N'tempdb..#tmp_ValDT') 
AND type in (N'U'))
DROP TABLE #tmp_ValDT
GO

IF  EXISTS (SELECT * FROM tempdb.dbo.sysobjects 
WHERE id = OBJECT_ID(N'tempdb..#tmp_spTmp') 
AND type in (N'U'))
DROP TABLE #tmp_spTmp
GO

CREATE TABLE #tmp_ValDT
        (stmt nvarchar(max) NULL,
        min_val bigint NULL,
        max_val bigint NULL,
        rc bigint,
        table_name sysname,
        schema_name sysname,
        column_name sysname,
        TypeName sysname,
        recomm_DT sysname NULL,
        recomm_byte smallint NULL,
        max_length smallint,
        precision tinyint,
        scale tinyint)
GO

CREATE TABLE #tmp_spTmp
        (name nvarchar(128) NULL,
        rows char(11) NULL,
        reserved varchar(18) NULL,
        data varchar(18) NULL,
        index_size varchar(18) NULL,
        unused varchar(18))
GO


INSERT INTO #tmp_ValDT
SELECT '' AS stmt
, NULL -- min value not known yet
, NULL -- max value not known yet
, NULL -- rowcount not known yet
, tbl.name AS table_name
, SCHEMA_NAME(tbl.schema_id) AS schema_name
, col.name AS column_name
, t.name AS TypeName
, NULL -- recommended DT not known yet
, NULL -- recommended DT bytes known yet
, col.max_length
, col.PRECISION
, col.scale
FROM sys.tables AS tbl
INNER JOIN sys.columns col ON tbl.OBJECT_ID = col.OBJECT_ID
INNER JOIN sys.types t ON col.user_type_id = t.user_type_id
--not evaluating user defined datatypes, or MAX types, at this time
WHERE t.is_user_defined = 0
AND t.is_assembly_type = 0
AND col.max_length <> -1
ORDER BY schema_name, table_name;

DECLARE @SQL nvarchar(max)
DECLARE DT CURSOR
FOR 
SELECT table_name, schema_name, column_name, TypeName, max_length 
FROM #tmp_ValDT

DECLARE @table_name sysname, @schema_name sysname
DECLARE @column_name sysname, @TypeName sysname
DECLARE @min_val nvarchar(max), @max_val nvarchar(max)
DECLARE @max_length nvarchar(max), @rc nvarchar(max), @max_lenOUT nvarchar(max)
OPEN DT

FETCH NEXT FROM DT INTO @table_name, @schema_name, @column_name, @TypeName, @max_length
WHILE (@@fetch_status <> -1)
BEGIN
        IF (@@fetch_status <> -2)
        BEGIN

                IF @TypeName IN ('tinyint', 'smallint', 'int', 'bigint')
                BEGIN
                
                        SET @SQL = 'SELECT @min_val = MIN(['+@column_name+']), @max_val = MAX(['+@column_name+']), @rc = COUNT(*) FROM ['+@schema_name+'].['+@table_name+']'            
                        EXEC sp_executesql @SQL, N'@min_val bigint OUTPUT
                        , @max_val bigint OUTPUT
                        , @rc bigint OUTPUT', @min_val = @min_val OUTPUT, @max_val = @max_val OUTPUT, @rc = @rc OUTPUT

                        IF CONVERT(BIGINT, @min_val) < -2147483648 OR CONVERT(BIGINT, @max_val) > 2147483648
                                BEGIN
                                        UPDATE #tmp_ValDT
                                        SET stmt = @SQL, min_val = @min_val, max_val = @max_val, rc = @rc,
                                                recomm_DT = 'bigint', recomm_byte = 8
                                        WHERE CURRENT OF DT
                                END
                        ELSE IF @min_val < -32768 OR @max_val > 32768
                                BEGIN
                                        UPDATE #tmp_ValDT
                                        SET stmt = @SQL, min_val = @min_val, max_val = @max_val, rc = @rc,
                                                recomm_DT = 'int', recomm_byte = 4
                                        WHERE CURRENT OF DT
                                END
                        ELSE IF @min_val < 0 OR @max_val > 255
                                BEGIN
                                        UPDATE #tmp_ValDT
                                        SET stmt = @SQL, min_val = @min_val, max_val = @max_val, rc = @rc,
                                                recomm_DT = 'smallint', recomm_byte = 2
                                        WHERE CURRENT OF DT
                                END
                        ELSE IF @min_val >= 0 AND @min_val <= 255 AND @max_val >= 0 AND @max_val <= 255
                                BEGIN
                                        UPDATE #tmp_ValDT
                                        SET stmt = @SQL, min_val = @min_val, max_val = @max_val, rc = @rc,
                                                recomm_DT = 'tinyint', recomm_byte = 1
                                        WHERE CURRENT OF DT
                                END
                        ELSE IF @min_val IS NULL OR @max_val IS NULL
                                BEGIN
                                        PRINT 'empty tables: ' + @SQL
                                END
                        ELSE
                                BEGIN
                                        PRINT 'how did i get here int?'
                                END
                END

        END
        FETCH NEXT FROM DT INTO @table_name, @schema_name, @column_name, @TypeName, @max_length
END

CLOSE DT
DEALLOCATE DT
GO

INSERT INTO #tmp_spTmp
EXEC sp_MSforeachtable 'EXECUTE sp_spaceused [?];'

--summarize the possible space savings
SELECT schema_name + '.' + table_name AS [Tablename]
, column_name AS [ColumnName], TypeName AS [CurrentDT]
, max_length AS [Length], recomm_DT AS [RecommendedDT]
, recomm_byte AS [RecommendedLength]
, CASE WHEN recomm_DT NOT IN ('varchar', 'char')
        THEN ((max_length - recomm_byte) * rc)/(1024.0*1024.0) 
        ELSE (recomm_byte * 2)/(1024.0*1024.0) 
        END AS [Space_Saved_MB]
FROM #tmp_ValDT
WHERE TypeName <> recomm_DT
AND recomm_byte <> 0
ORDER BY [Space_Saved_MB] DESC

--summarize the possible logical i/o savings
SELECT vdt.table_name, SUM(vdt.max_length-vdt.recomm_byte) AS [row_savings_bytes] 
, SUM(vdt.max_length-vdt.recomm_byte) * spt.rows / 8192 AS [LIO_savings_pages]
FROM #tmp_ValDT vdt INNER JOIN #tmp_spTmp spt ON vdt.table_name = spt.name
GROUP BY vdt.table_name, spt.rows
ORDER BY 3 DESC

I hope you find this script useful. If you decide to use it, please provide me some feedback on what you think, so I can incorporate that feedback into the full version I hope to make available soon.

Interesting, but I have a script that tests every column in a table and comes up with a recommended data type, which of course is always the smallest possible type that will accommodate the data.

The script shows you the current datatype and the suggested data type, as well as the top 10 and bottom 10 values, and the sum and the average (if numeric), as well as the most common value and the count of values as well as the count of NULLs and the count of the most common value.

It’s all dynamic SQL though, and relies on calling a second stored procedure (also dynamic SQL) which basically use a brute force method to determine the optimal data type of trying to convert the data to each of the major data types, and it returns a value once it’s successful.

It’s a series of nested TRY CATCH blocks, and in the final TRY CATCH it determines if the data type should either be CHAR or VARCHAR (I don’t test for NCHAR/VARCHAR).

Basically, the proc that determines the column type starts by trying to CAST the passed in column to a datetime, if it can, then it determines if its a DATE, TIME or DATETIME. If that cast throws an error then in the CATCH it tries to cast the value to a BIGINT, if so then it determines if its a BIGINT, INT, SMALLINT, or TINYINT, if the cast fails it tries to cast it to a FLOAT, if so then it determines if its a FLOAT or a REAL, if that fails it tries to cast it to MONEY (small money, etc.), then a BIT, and if all that fails then it determines if it should be a CHAR or VARCHAR.

Actually a better way to do this would be to use CLR and use Regular Expressions to do it all, but at the time I wrote this I hadn’t done any CLR, and it works so I haven’t bothered to re-do it… (though I assume it would be faster).

14 thoughts on “Does This Datatype Make My Column Look Fat?”

R. G. Price

September 4, 2012 at 5:11 PM

Interesting, but I have a script that tests every column in a table and comes up with a recommended data type, which of course is always the smallest possible type that will accommodate the data.

The script shows you the current datatype and the suggested data type, as well as the top 10 and bottom 10 values, and the sum and the average (if numeric), as well as the most common value and the count of values as well as the count of NULLs and the count of the most common value.

It’s all dynamic SQL though, and relies on calling a second stored procedure (also dynamic SQL) which basically use a brute force method to determine the optimal data type of trying to convert the data to each of the major data types, and it returns a value once it’s successful.

It’s a series of nested TRY CATCH blocks, and in the final TRY CATCH it determines if the data type should either be CHAR or VARCHAR (I don’t test for NCHAR/VARCHAR).

Basically, the proc that determines the column type starts by trying to CAST the passed in column to a datetime, if it can, then it determines if its a DATE, TIME or DATETIME. If that cast throws an error then in the CATCH it tries to cast the value to a BIGINT, if so then it determines if its a BIGINT, INT, SMALLINT, or TINYINT, if the cast fails it tries to cast it to a FLOAT, if so then it determines if its a FLOAT or a REAL, if that fails it tries to cast it to MONEY (small money, etc.), then a BIT, and if all that fails then it determines if it should be a CHAR or VARCHAR.

Actually a better way to do this would be to use CLR and use Regular Expressions to do it all, but at the time I wrote this I hadn’t done any CLR, and it works so I haven’t bothered to re-do it… (though I assume it would be faster).
- ThomasLaRock
  
  September 5, 2012 at 3:01 PM
  
  Thanks for the comment.
  
  You’ve taken an interesting approach. Are you also considering the cases where the datatype selection doesn’t match the meaning? For example, I’ve seen dates defined as a char(8) (i.e., ‘20120905’).
  
  My idea was to take a more passive approach than what you have done.
  
  Is your code publicly available? I think we should get something like that up on Codeplex!
  - R. G. Price
    
    September 12, 2012 at 12:09 PM
    
    Yeah, it’s not 100% fool proof, but I’ll post it on my blog (which I haven’t made a post to in a while anyway). I’ll let you know when I get it posted, it may be a couple of weeks, I want to clean it up and comment it, etc. before posting it.
    - Gary Pritchard
      
      December 29, 2012 at 10:24 AM
      
      Did you ever post your script on your blog? If so please share the link. It sounds perfect for a design review I’m doing.
Pingback: Incorrect Datatype Selection in SQL | My Notes
Pingback: Something for the Weekend - SQL Server Links 07/09/12
Chuck Rummel

September 7, 2012 at 8:04 PM

I’ve done conceptually similar things for decimals. Upstream db has larger precision and scale and we have to squeeze into a smaller legacy table. I cast as the max(abs(value)) to varchar then do len() on the substrings before and after the decimal.
Brian G

September 10, 2012 at 8:02 AM

Laughed out loud at that title. Well done.
Pingback: How To: Right-sizing The Datatypes Currently In The Buffer Pool | SQLRockstar | Thomas LaRock
Glenn Wilson

September 16, 2012 at 7:05 PM

Nice one, Thanks.
Pingback: Does Your Data Match the Type? | SQLRockstar | Thomas LaRock
Pingback: Log Buffer #290, A Carnival of the Vanities for DBAs | The Pythian Blog
Pingback: Atom Wire » Blog Archive » Log Buffer #290, A Carnival of the Vanities for DBAs
Pingback: It's Christmas: 3 Ways To Be Wise About Your Rowsize | SQLRockstar | Thomas LaRock

Does This Datatype Make My Column Look Fat?

Related

14 thoughts on “Does This Datatype Make My Column Look Fat?”

Leave a Comment Cancel reply

Share this:

Related

14 thoughts on “Does This Datatype Make My Column Look Fat?”

Leave a Comment Cancel reply