VM Snapshots and Performance or: How I Learned to Love My Martian Data Center

It takes a radio signal about 1.3 seconds to reach the Moon (about 239,000 miles away), and about 2.5 seconds for round trip communication between our secret moon base and Earth. Therefore, this common SQL Server error message, number 833…

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds
to complete on file [E:\SQL\database.mdf] in database [database].
The OS file handle is 0x0000000000000000.
The offset of the latest long I/O is: 0x00000000000000

…implies that the round trip time is over 15 seconds. Using 7.5 seconds as a one-way minimum (we really don’t know how long it is taking), we see the underlying SAN disks are over 1,396,500 miles away, or about 5.8 times as far away as the Moon. No, I don’t have any idea how they got there, either. But how else to explain this error? For all I know this SAN could be on Mars!
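If you want to check my back-of-the-envelope math, here it is as a sketch (the speed-of-light figure is rounded; radio waves travel at light speed):

```powershell
# Back-of-the-envelope check on the "Martian SAN" math.
$c = 186200              # speed of light, miles per second (rounded)
$oneWaySeconds = 15 / 2  # half of the 15-second round trip

$miles = $oneWaySeconds * $c
$miles                   # 1,396,500 miles to the "SAN"

$miles / 239000          # ~5.8 Earth-Moon distances
```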

Now, I’ve seen this error message many times in my career. The traditional answers you find on the internet tell you to look at your queries and try to figure out which ones are causing the I/O bottleneck. In my experience, this guidance was wrong more than 95% of the time. In fact, this is the type of guidance that usually results in people wanting to just throw hardware at the problem. I’ve seen that error message appear with little to no workload being run against the instance.

In my experience the true answer was almost always “shared storage”, or “the nature of the beast that is known as a SAN”. It turns out that when several servers share the same storage arrays, you can end up a victim of what is commonly called a “noisy neighbor”: one workload, on one particular server, causing performance pain for a seemingly unrelated server elsewhere.

What’s more frustrating is that sometimes the only hint of the issue is the SQL Server error message. The conventional tools used to monitor the SAN often don’t show the problem, as they focus on the overall health of the SAN rather than the health of specific applications, database servers, or the end-user experience.

And just when I thought I had seen it all when it comes to the error message above, along comes something new for me to learn.

Snapshots and Checkpoints

No, they aren’t what’s new. I’ve been involved with the virtualization of database servers for more than eight years now, and the concepts of snapshots and checkpoints are not recent revelations for me. I’ve used them from time to time when building personal VMs for demos, and I’ve seen them used sparingly in production environments. Why the two names? (VMware calls them snapshots; Hyper-V calls them checkpoints.) To avoid confusion, of course. (Too late.)

The concept of a snapshot or checkpoint is simple: to create a copy of the virtual machine at a point in time. The reason for wanting this point in time copy is simple as well: recovery. You want to be able to quickly put the virtual machine back to the point in time created by the snapshot or checkpoint. Think of things like upgrades or service packs. Take a snapshot of the virtual machine, apply changes, sign off that everything looks good, and remove the snapshot. Brilliant!
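For Hyper-V, that take/verify/remove cycle looks something like this. This is just a sketch using the Hyper-V PowerShell module; “SQL01” and the checkpoint name are hypothetical:

```powershell
# Hypothetical VM name; requires the Hyper-V PowerShell module on the host.
$vm = "SQL01"

# 1. Take a checkpoint before the change.
Checkpoint-VM -Name $vm -SnapshotName "Pre-SP-upgrade"

# 2. Apply the service pack, test, get sign-off...

# 3a. Everything looks good? Remove the checkpoint
#     (this merges the delta disk back into the parent).
Remove-VMSnapshot -VMName $vm -Name "Pre-SP-upgrade"

# 3b. Or, if things went sideways, roll back instead:
# Restore-VMSnapshot -VMName $vm -Name "Pre-SP-upgrade" -Confirm:$false
```

The important part is step 3: whichever way you go, the checkpoint does not stay around afterward.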

How do they work?

For snapshots in VMware, the documentation is very clear:

When you create a snapshot, the system creates a delta disk file for that snapshot in the datastore and writes any changes to that delta disk.

So, that means the original file(s) used for the virtual machine become read-only, and this new delta file stores all of the changes. I liken this to the similar “copy-on-write” technology in database snapshots inside of SQL Server. In fact, this VMware KB article explains the process in the same way:

The child disk, which is created with a snapshot, is a sparse disk. Sparse disks employ the copy-on-write (COW) mechanism, in which the virtual disk contains no data in places, until copied there by a write.

OK, so we know how they work, but are they bad for performance?

Are they bad?

Not at first, no. But just like meat left out overnight, they can go bad. And the reason why should be very clear: the longer you keep them, the more overhead you incur as the delta disk keeps track of all the changes. Snapshots and checkpoints are meant to be temporary, not something you keep around. In fact, VMware suggests that you keep a snapshot for no more than 72 hours, due to the performance impact. Here’s a brief summary of other items from the “Best practices for virtual machine snapshots in the VMware environment” KB article:

  • Snapshots are not backups, and do not contain all the info needed to restore the VM. If you delete the original disk files, the snapshot is useless.
  • The delta files can grow to be the same size as the original files. Plan accordingly.
  • Up to 32 snapshots are supported (unless you run out of disk), but you are crazy to use more than 2-3 at any one time.
  • Rarely, if ever, should you use a snapshot on high-transaction virtual machines such as email and database servers.
  • Snapshots should only be left unattended for 24-72 hours, and don’t feed them after midnight, ever.

OK, I made that last part up. You aren’t supposed to feed them, ever, otherwise they become like your in-laws at Christmas and they will never leave.

So, snapshots and checkpoints can have an adverse effect on performance! I found out about it through this Spiceworks thread, and then from other articles on the internet that detailed this very same issue.

So this performance issue wasn’t exactly unknown, but it was new to me, since I hadn’t come across issues related to snapshots in production or thought to check for them. And, from what I can tell, most people haven’t either, hence the head scratching when we see the effects of snapshots and checkpoints on our database servers.

Do I have one?

I don’t know, you’ll need to look for yourself. For VMware, you have three methods as detailed in this KB article:

1. Using vSphere
2. Using the virtual machine Snapshot Manager
3. Viewing the virtual machine configuration file
4. Viewing the virtual machine configuration file on the ESX host

Yes, that’s four things. I didn’t write the KB article. You can read it for yourself. Consider number 4 to be a bonus option or something. Or maybe they meant to combine the last two. Again, I didn’t write the article, I’m just pointing you to what it says.

Now, for Hyper-V, we can look at the Hyper-V Manager GUI as well, which is essentially similar to using vSphere. But we could also use the Hyper-V Powershell cmdlets as listed here. In fact, this little piece of code is all you really need:

PS C:\> Get-VM "vmName" | Get-VMSnapshot
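And if you want to go hunting across the whole host for checkpoints that have overstayed the 72-hour welcome, a sketch like this works (CreationTime, VMName, and Name are properties on the checkpoint objects that Get-VMSnapshot returns):

```powershell
# List every checkpoint on the host older than three days.
Get-VM |
    Get-VMSnapshot |
    Where-Object { $_.CreationTime -lt (Get-Date).AddDays(-3) } |
    Select-Object VMName, Name, CreationTime
```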

Summary

Snapshots and checkpoints are fine tools for you to use, but when you are done with them you should get rid of them, especially for a database server. Otherwise you can expect to see a lot of disk latency and high CPU as a result. And should you see such things but your server team reports back that everything looks normal, I hope this post will stick in your head enough for you to remember to go looking for any rogue snapshots that may exist.

9 thoughts on “VM Snapshots and Performance or: How I Learned to Love My Martian Data Center”

  1. The performance issues associated with vSphere snapshots are the reason I switched most of my SAN snapshots to plain storage snapshots (not synchronized with vSphere) with my SQL Server VMs stored on my Nimble SAN. I only do vSphere snapshots during off-hours, so I get at least one application-consistent storage snapshot each day, and once the storage snapshot is completed, the vSphere one is automatically deleted.

    • Carl,

      Sounds like a good plan. Do you have alerts in place to let you know if a snapshot is not deleted properly?

      • Yes. The Nimble SAN OS automatically sends email notifications if it can’t get confirmation from vSphere that the snapshot was deleted. If you miss that email, it also automatically generates a support issue to Nimble tech support, and emails come from their automated system to warn you about snapshot deletion problems.

  2. One other way to check for vmware snapshot is to query the vcenter database with something like this:
    SELECT vm.DNS_NAME,ss.SNAPSHOT_NAME,ss.SNAPSHOT_DESC,ss.CREATE_TIME
    FROM dbo.VPX_VM vm
    INNER JOIN dbo.VPX_SNAPSHOT ss
    ON vm.ID = ss.VM_ID
    WHERE ss.CREATE_TIME < DATEADD(dd,-3,GETDATE());

    • You can also use vSphere PowerCLI; connect to the vCenter server using Connect-VIServer ‘servername’. Once connected, run a command like ‘Get-VM | Get-Snapshot | Format-List VM, Name, Created, SizeGB’

      • Very cool, I seem to forget PowerCLI exists. Using that would make it easy to automate a process to report on rogue snapshots.

        Thanks!

