Introduced with SQL Server 2008, page and row compression are turning ten years old this year. In that time, the internet has become littered with posts describing both features: how they work, the performance gains, and so on.
Despite all of that information being available, a colleague recently asked me a very simple question those posts did not answer:
“How do I know whether to use row or page compression in SQL Server?”
That’s what this post is for: to provide some clarity on row versus page compression. You’re welcome.
Types of Data Compression in SQL Server
Row and page compression are a dirty secret among many SQL consultants. I know more than one person who walked into a shop, enabled compression, showed a performance gain on a critical report, and walked out with a fat contract that same day. I’m not writing this post to out them or their secret. I’m writing to help you get a fat contract, or a fat bonus, by understanding compression a bit better.
Let’s start with the basics. SQL Server 2017 has the following compression features:
– Row Compression
– Page Compression
– COMPRESS function
– Columnstore compression
The COMPRESS function uses GZIP to store data in a compressed format. Columnstore indexes store compressed data by default, and also offer archival compression using XPRESS. We aren’t going to talk about COMPRESS or Columnstore in depth today; instead, we will focus on row and page compression, because that’s the question that was asked. But I thought you should know what types of data compression are available in SQL Server 2017.
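As a quick aside, using COMPRESS is as simple as wrapping a value. Here’s a minimal sketch (COMPRESS and its counterpart DECOMPRESS require SQL Server 2016 or later; the string is just an illustration):

-- COMPRESS returns a GZIP-compressed varbinary(max);
-- DECOMPRESS reverses it, and you CAST back to the original type
DECLARE @blob varbinary(max) = COMPRESS(N'A long string of mostly repeated text, repeated text, repeated text');
SELECT CAST(DECOMPRESS(@blob) AS nvarchar(max)) AS original_text;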
How They Work
The first thing to understand is that row compression is a subset of page compression. Put another way, you cannot have page compression without row compression happening first. When you enable page compression, the engine performs the following actions, in this order:
– Row compression
– Prefix compression
– Dictionary compression
Row compression is defined here. The simple explanation is that it takes fixed-length columns and stores them as variable-length, adding a few bytes of overhead to track the changes being made. The link provided has a table that lists the savings for each data type. It’s interesting reading. And by “interesting” I mean “it won’t matter for my summary later”.
Prefix and dictionary compression come next, as part of page compression. You can read all the details here. The simple explanation is that the process looks for repetition. First it looks for common patterns at the beginning of the values in each column (that’s the prefix part), then it looks for repeated values across the entire page (that’s the dictionary part). The link provided has the details and is more interesting than the row compression article for reasons that will be apparent soon-ish, I promise.
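To make that concrete, here is a conceptual sketch (comments only, not runnable; the values are made up for illustration):

-- Prefix compression works column by column. If a column on the page holds
-- 'aaabb', 'aaaab', and 'aaacc', a common prefix such as 'aaa' is stored once
-- in the page's anchor record, and each value keeps only a note of how much
-- of that prefix it reuses plus its own remaining bytes.
-- Dictionary compression then works across the whole page. If the exact value
-- 'aaabb' appears in multiple rows or columns, it is stored once in a
-- page-level dictionary and each occurrence becomes a short reference.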
How to enable Row or Page Compression in SQL Server
Enabling row or page compression on a table is easy. (However, enabling row or page compression for an entire database is not, because you have to enable it for each object manually. For large schemas, this can be a PITA and make you wish for the days of Microsoft Access and the ‘Compact and Repair Database’ option. LOL, no, it doesn’t. Nothing makes anyone wish for Access. #justsayin)
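If you do find yourself needing to enable compression across a whole database, generating the statements beats typing them. Here’s a minimal sketch (it assumes row compression for every user table; review the output before executing any of it, and note that nonclustered indexes and individual partitions would need their own statements):

-- Generate (but do not execute) a rebuild statement per user table
SELECT 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
    + ' REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = ROW);'
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id;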
Let’s set up a test table and enable compression. Here’s the example I’ve been using in my session “Database Design: Size DOES Matter!” for years. In that session I use sys.messages to populate a table:
CREATE TABLE [dbo].[TestCompression] (
    [m_id] [int] NULL,
    [text] [nvarchar](3000) NULL
) ON [PRIMARY];
GO
INSERT INTO [dbo].[TestCompression]
SELECT message_id, [text] FROM sys.messages;
GO
After the table is created, we estimate the space savings for row compression like this:
EXEC sp_estimate_data_compression_savings 'dbo', 'TestCompression', NULL, NULL, 'ROW' ;
GO
Here’s my result set:
The results show that row compression is expected to shrink the table to about 54.9% of its current size (taking the column size_with_requested_compression_setting(KB) and dividing by the column size_with_current_compression_setting(KB), or 31880/58112), a savings of roughly 45%. I will leave estimating the space savings for page compression as an exercise for the reader. Besides, I don’t want to spoil the surprise ending.
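If you would rather have the arithmetic done for you, here is one way to capture the estimate and compute the savings percentage (a sketch that assumes the procedure’s documented output columns; the temp table must match them exactly):

-- Capture the output of sp_estimate_data_compression_savings,
-- then compute savings as (current - requested) / current * 100
CREATE TABLE #estimate (
    object_name sysname,
    schema_name sysname,
    index_id int,
    partition_number int,
    [size_with_current_compression_setting(KB)] bigint,
    [size_with_requested_compression_setting(KB)] bigint,
    [sample_size_with_current_compression_setting(KB)] bigint,
    [sample_size_with_requested_compression_setting(KB)] bigint
);
INSERT INTO #estimate
EXEC sp_estimate_data_compression_savings 'dbo', 'TestCompression', NULL, NULL, 'ROW';
SELECT object_name, index_id,
    100.0 * ([size_with_current_compression_setting(KB)]
        - [size_with_requested_compression_setting(KB)])
        / NULLIF([size_with_current_compression_setting(KB)], 0) AS percent_savings
FROM #estimate;
DROP TABLE #estimate;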
One last item before we enable row compression for this table: I want to know how many logical reads it takes to return all the rows. I will use STATISTICS IO to get this information:
SET STATISTICS IO ON
SELECT m_id, text
FROM dbo.TestCompression
SET STATISTICS IO OFF
This returns the following message (formatted to fit on this page):
(232056 rows affected) Table 'TestCompression'. Scan count 1, logical reads 7226, physical reads 0, read-ahead reads 7172, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
We now enable row compression with the following statement:
ALTER TABLE dbo.TestCompression REBUILD PARTITION = ALL
WITH
(DATA_COMPRESSION = ROW)
GO
Now we rerun the SELECT and observe the following message returned with the results:
(232056 rows affected) Table 'TestCompression'. Scan count 1, logical reads 3982, physical reads 0, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Logical reads dropped to about 55% of their original count (3982/7226), in line with the earlier estimate that row compression would shrink the table to about 55% of its original size.
(I love it when I check my work and the answers match.)
When to Use Row or Page Compression in SQL Server
OK, we’ve reviewed the details of compression in SQL Server; let’s answer the original question: “How do I know whether to use row or page compression in SQL Server?”
The MSDN article I referenced and linked to at the beginning has a wonderful summary for this purpose. If you are too lazy to scroll up, I will link to it again for you here. If you are too lazy to read the article, let me repeat the pertinent details for you now:
A more detailed approach to deciding what to compress involves analyzing the workload characteristics for each table and index. It is based on the following two metrics:
U: The percentage of update operations on a specific table, index, or partition, relative to total operations on that object. The lower the value of U (that is, the table, index, or partition is infrequently updated), the better candidate it is for page compression.
S: The percentage of scan operations on a table, index, or partition, relative to total operations on that object. The higher the value of S (that is, the table, index, or partition is mostly scanned), the better candidate it is for page compression.
That sounds great except we don’t know the percentage of update and scan operations. Yet.
Lucky for us the whitepaper includes the T-SQL needed to determine such metrics. For the U-value, the percentage of updates, we are given this code:
/*
To compute U, use the statistics in the DMV sys.dm_db_index_operational_stats.
U is the ratio (expressed in percent) of updates performed on a table or index
to the sum of all operations (scans + DMLs + lookups) on that table or index.
The following query reports U for each table and index in the database.
*/
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
i.partition_number AS [Partition],
i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
i.leaf_update_count * 100.0 /
(i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count
) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
ORDER BY [Percent_Update] ASC
For S, the percentage of scans, we are given this code:
/*
To compute S, use the statistics in the DMV sys.dm_db_index_operational_stats.
S is the ratio (expressed in percent) of scans performed on a table or index
to the sum of all operations (scans + DMLs + lookups) on that table or index.
In other words, S represents how heavily the table or index is scanned.
The following query reports S for each table, index, and partition in the database.
*/
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
i.partition_number AS [Partition],
i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
i.range_scan_count * 100.0 /
(i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count
) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
ORDER BY [Percent_Scan] DESC
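If you would rather see U and S side by side, the two queries combine naturally into one. This is my own consolidation rather than something from the whitepaper, but it uses the same DMV and the same denominator:

SELECT o.name AS [Table_Name], x.name AS [Index_Name],
    i.partition_number AS [Partition],
    i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
    i.leaf_update_count * 100.0 /
        (i.range_scan_count + i.leaf_insert_count
        + i.leaf_delete_count + i.leaf_update_count
        + i.leaf_page_merge_count + i.singleton_lookup_count
        ) AS [Percent_Update],
    i.range_scan_count * 100.0 /
        (i.range_scan_count + i.leaf_insert_count
        + i.leaf_delete_count + i.leaf_update_count
        + i.leaf_page_merge_count + i.singleton_lookup_count
        ) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
    + i.leaf_delete_count + i.leaf_update_count
    + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id, 'IsUserTable') = 1
ORDER BY [Percent_Scan] DESC;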
OK, let’s put it all together.
Summary
All of the above information can be found on hundreds of blog posts over the past ten years. Here’s the part that I haven’t found mentioned anywhere else:
When your table has few duplicate values, as our test table does here, row compression is the right answer no matter what your U and S values are. The reason is detailed above: page compression looks for repeated patterns, so if your data does not have repeated patterns, you don’t get much extra benefit from it. You will pay extra CPU, but probably not see enough of a performance gain to make it worthwhile. Our test table has few repeated patterns, therefore page compression is likely to be more overhead than it is worth.
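A quick way to gauge how much repetition a column holds is to compare distinct values to total rows. Here’s a sketch using the test table and column from above; for your own tables, substitute the columns you suspect are repetitive:

-- A low percentage of distinct values means many repeats,
-- which is exactly what prefix and dictionary compression feed on
SELECT COUNT_BIG(DISTINCT [text]) AS distinct_values,
    COUNT_BIG(*) AS total_rows,
    COUNT_BIG(DISTINCT [text]) * 100.0 / COUNT_BIG(*) AS percent_distinct
FROM dbo.TestCompression;

For our sys.messages-based table you would expect a high percentage of distinct values, which is exactly why page compression has little left to squeeze.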
(Of course you should test everything and not just listen to me ramble. Oh, and now would be a good time to do that exercise I left for you above.)
And that’s the decision point I want you to take away from this post. When you need to decide between row and page compression, look at the data and determine whether it is likely to have repeated values. If it is, then it is a candidate for page compression, provided the U and S values are favorable. If there are few repeated values, then row compression is likely to be sufficient.
(BONUS: I once wrote a post on how to find the objects in your database that store the fewest rows per page. It may be relevant to your interests here, too. Have a look.)
This is a great article, thanks for dedicating the time and effort to create original content and insight.
Based on my understanding “prefix compression looks for common patterns in the beginning of the column values” and “dictionary compression searches for repeated values”
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd894051(v=sql.100)
In your article you have “repeated values” and “patterns” reversed:
“First it looks for repeated values by column (that’s the prefix part), then it looks for patterns on the entire page (that’s the dictionary part).”
Just wanted to clarify since it confused me when cross referencing with additional documentation
Great article! Thanks for keeping it simple and to the point.
Calculation of Percentage Gain seems wrong. It should be ((SIZE CURRENT – SIZE AFTER COMPRESSION)/SIZE CURRENT)*100
Ravinder,
can you please mention exact field name for calculation of percentage gain.
Here is the column list with my values:
object_name = eton_rei_workload, schema_name = Admin, index_id = 1, partition_number = 1, size_with_current_compression_setting(KB) = 38008, size_with_requested_compression_setting(KB) = 27768, sample_size_with_current_compression_setting(KB) = 38280, sample_size_with_requested_compression_setting(KB) = 27968
Very useful instructions for those coming from Oracle background.
Thank you for this post, Thomas.
I’d like to focus on this sentence you wrote:
“Therefore, if your data does not have repeated patterns, you don’t get much extra benefit from page compression”
Based on this, I’m understanding that when it comes down to it, your considerations for whether to use ROW or PAGE compression, actually depend on the compression benefit that PAGE compression would provide relative to ROW compression: If the benefit is not high enough relative to the added CPU cost, then you’d prefer the ROW compression. Is that correct?
In that case, it should be quite easy to use the sp_estimate_data_compression_savings procedure using both ROW and PAGE compression types, save their results in a temp table and compare the savings in relation to each other. If the PAGE compression savings isn’t significant enough compared to the ROW compression savings (based on some kind of arbitrary threshold), then you would prefer the ROW compression, otherwise, you’d prefer the PAGE compression.
Sure, you could do that. However, you still want to track the U and S values. HTH
That definitely makes things more complicated. How is one supposed to decide on thresholds for picking the right compression type?
Collect metrics, as you suggest. Then, test.
Well then that kinda makes this kind of article rather pointless, doesn’t it?
Thanks for the feedback.
Can you be more specific about ‘repeated values’? Do you mean every column has the same value, or just some columns? If the whole article depends on this, I think it warrants more examples, or at least a definition.
Think of a table that has a column with only a handful of distinct values. After row compression is complete, page compression will look for those repeated values and perform additional compression (dictionary compression). In my test example, I used sys.messages, which has unique message text. Thus, there are very few chances for repeated values in the column, and hence page compression offers no additional performance benefit. HTH
ok got it, thank you
Great article with a lot of useful information. I just have one question about the decision-making, row vs. page compression: how can I analyze my data with regard to repeated patterns? We have thousands of different tables in our database; one table is almost 100 GB… =( Are there analytical tools, or what is the general approach?
Many thanks
”
U: The percentage of update operations on a specific table, index, or partition, relative to total operations on that object. The lower the value of U (that is, the table, index, or partition is infrequently updated), the better candidate it is for *page compression*.
S: The percentage of scan operations on a table, index, or partition, relative to total operations on that object. The higher the value of S (that is, the table, index, or partition is mostly scanned), the better candidate it is for *page compression*.
”
Are the ending of both statements supposed to be page compression? Is one supposed to be row compression instead of page?