With the idea that we will be consolidating our servers, squeezing them together to make more efficient use of resources, it suddenly dawned on me: How much?
Do we want our CPU, for example, at 100% all the time? Logically it would make sense that you would use every last resource in some way, like a master chef who throws all his scraps into a stock pot instead of tossing them in the garbage like you might do at home.
And yet, it doesn’t make sense to be at 100% all of the time. So what number should we think about as a target? 80%? 90%?
What is the magic number for our consolidated servers and CPU usage? Does one even exist, or am I going to get back a bunch of adult diapers (“It Depends”) for responses?
I don’t think it’s realistic or even possible to target a specific percentage and have it work out. IT folks freak out when machines are “running hot,” but my feeling is that’s what they’re meant to be. The danger with running at, say, 90% all the time is that you don’t have room for those occasions where you need or want some process to consume more CPU. Since you’re letting the rest of the system monopolize the CPU, this process takes a back seat, even though it might be your most important task at the time.
It’s kind of like buying a one-bedroom house even though you know you are going to be having weekend company a lot. An air mattress or pull-out couch can only cut it for so long. 🙂
interesting points aaron. and yet i get the sense that some 3rd party tools try to size things in exactly that way. Take this box at avg 10%, this one at avg 20%…and so on until you build a host that is (in theory) operating efficiently by overloading the available CPUs on the host.
i am starting to get a sense that these tools may end up leading us to a bad place. i doubt we want to be running hot all the time, and yet i don’t really know how warm we want to be. and even if we spent the time trying to size things exactly, you know someone will change code at some point, looking to perform some new calculation with a handful of sorts, and the CPU will spike and throw all of our work out the window.
oh sure…that’s where your software kicks in to dynamically move your systems over to an idle CPU somewhere. well, that just seems like a lot of extra overhead for any environment. and yet i don’t see any other way. once you start down this consolidation path it would appear you really need to go all in.
I agree with Aaron, but I also need to chime in about what your expected usage looks like.
In my current environment, we’re only “busy” 30 days of the year. The rest of the time we’re at 1% (if that) as a daily average. If you average our load out over a year, you’re looking at ~7% utilization. However, that’s really 30 days of 75% utilization and 335.25 days of 1%.
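Just to make that concrete, here’s a quick back-of-the-envelope check (the 75%/1% split is just the rough figures above, not real measurements):

```python
# Back-of-the-envelope check of the averages above: 30 busy days at ~75%,
# the rest of the year at ~1% (rough figures, not real measurements).
busy_days, busy_util = 30, 75.0
idle_days, idle_util = 365.25 - 30, 1.0  # ~335.25 days

avg = (busy_days * busy_util + idle_days * idle_util) / (busy_days + idle_days)
print(f"Yearly average: {avg:.1f}%")  # ~7.1%, even though the peaks sit at 75%
```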
So, yes – “it depends.”
yes, another good point, and also why i am leery of software that tries to tell you how to consolidate. unless that software has been in place for a year, how would you ever really have an idea about your peak usage? it would seem you are always making a best-guess estimate.
Our system is pretty much in use constantly, and I’d be worried if it ever went above a 50% average load, because that would mean we couldn’t handle a doubling in traffic. If I were running a busy website, I would probably want that number to be much lower, say around 5 to 10%, because of the potential for the slashdot/digg effect taking the server down.
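Put as a rough rule of thumb (a toy sketch; it assumes CPU scales roughly linearly with traffic, which is a big simplification):

```python
# Rough headroom rule of thumb: if average CPU sits at U%, the box can absorb
# roughly a (100 / U)x traffic multiple before it saturates. Assumes CPU
# scales linearly with traffic, which is a simplification.
def traffic_headroom(avg_cpu_pct: float) -> float:
    return 100.0 / avg_cpu_pct

for avg in (50, 10, 5):
    print(f"{avg}% average -> roughly {traffic_headroom(avg):.0f}x traffic before 100% CPU")
```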
why are you shouting at me?
Interesting post, especially because I have recently been asked by a former employer about consolidating several SQL Servers to one server for this reason.
My recommendation was to get a good set of perfmon stats and then determine how many of the servers they could consolidate.
I know I’d prefer to keep the average no higher than 50% just so you can handle those spikes.
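For what it’s worth, this is the sort of thing I mean by a good set of perfmon stats. A rough Python sketch, assuming you’ve exported the counter log to CSV (for example with relog); the counter name and file layout are assumptions about how your log was captured:

```python
# Rough sketch: summarize CPU from a perfmon counter log exported to CSV
# (e.g. "relog server01.blg -f csv -o server01_cpu.csv"). Counter name and
# CSV layout are assumptions about how the log was captured.
import csv

def cpu_summary(path, counter="% Processor Time"):
    values = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        col = next(c for c in reader.fieldnames if counter in c)  # first matching counter column
        for row in reader:
            try:
                values.append(float(row[col]))
            except (TypeError, ValueError):
                pass  # skip blank or malformed samples
    values.sort()
    return {
        "avg": sum(values) / len(values),
        "p95": values[int(len(values) * 0.95)],
        "max": values[-1],
    }

print(cpu_summary("server01_cpu.csv"))  # hypothetical export file
```

If the averages across the candidate servers plus the worst spike still leave headroom on the target host, they’re probably reasonable candidates to stack.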
Matt,
If you have several servers and those 30 days are all different days, could you then consolidate?
Jack,
How long would you want perfmon to collect stats for? A week? A month? A year?
good question about those 30-day spikes. i suppose if you knew which 30 days, you could overlay them with servers that did not spike on the same days, if possible.
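here’s a crude way to picture that overlay; a toy sketch with made-up daily profiles, just to show the idea of summing each server’s profile and checking the combined peak against whatever target we settle on:

```python
# Toy example of "overlaying" servers whose busy periods don't overlap:
# sum each server's daily CPU profile and compare the combined peak to a
# target. The profiles below are made up purely for illustration.
daily_cpu = {
    "srv-a": [75 if 0 <= d < 30 else 1 for d in range(365)],     # busy in January
    "srv-b": [75 if 180 <= d < 210 else 1 for d in range(365)],  # busy in July
}

combined = [sum(day) for day in zip(*daily_cpu.values())]
target = 80  # whatever "warm enough" means for you

peak = max(combined)
print(f"combined peak: {peak}%  ->  {'fits' if peak <= target else 'too hot'} against a {target}% target")
```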
Lazy Response from Buck: 75% usage for 75% of the time. That’s back from my mainframe days.
not a bad guide to follow. seems simple enough, especially if you know going in that you will never get it perfect.
I simplified the problem for the sake of brevity, but in short, the 30 days are almost identical across the main line servers. The only consolidation could be with the test servers, since little testing is done during the busy season.
i would imagine the same would be true for most any line of business, with the exception of a few departments within.
Just realised I never got back to you about the replication/mirroring thing (mostly because I was Doing What I Was Told at that point).
In my previous position, I had a server running SQL and MIIS, and it was generally at about 95-99% CPU, in perpetual danger of total collapse: SQL was taking the vast majority of that CPU, and the box really needed rebooting once a week.
75% usage 75% of the time would have been rather dreamlike…