Resource Overcommit: When It Works and When It Bites

Posted on June 17, 2012 by Luis Fernandez

Resource overcommit feels like free pizza until you realize who is paying the tab.
In the lab it looks like magic. In production it works until it does not, and that moment arrives fast when the mix changes.

Resource overcommit is the promise that sells virtualization: run more guests than you have physical muscle by betting that not everyone peaks at the same time. On paper, it is solid. In practice, you need guardrails.

CPU overcommit shines with bursty app servers, CI boxes that sit idle most of the day, and light web stacks that spike on deploys. ESXi is very good at time-slicing and keeping vCPUs busy, and CPU ready stays low when your mix is right. Memory overcommit can also work if you lean on ballooning and transparent page sharing and you size VMs by observed use instead of what someone asked for on a ticket years ago. Storage can be overcommitted with thin provisioning, and with smart monitoring it stretches your array without the finance team breathing down your neck. Even network can be shared if you keep an eye on uplink saturation and queue depths.

The sweet spot shows up when you right-size, tune shares and reservations for a few crown jewels, and watch the cluster rather than a single host. With vSphere 5 giving us Storage DRS and Storage IO Control, and with KVM and Xen maturing in the open source world, the tools help a lot. For rough numbers that I see working, think two to four vCPUs per physical core on mixed workloads that are not latency sensitive, with CPU ready under five percent as a comfort line. For RAM, going to 120 to 150 percent of physical across a balanced cluster can be fine when TPS and ballooning keep the hot set in memory and swap activity stays near zero. For storage, keep an IOPS budget and treat it like money, not a wish. Done this way, overcommit works and you ship more for less without angry emails.
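To keep myself honest on those ratios, I like to run the math at the cluster level before touching anything. Here is a minimal sketch in Python that does just that. The host and VM numbers are made up for illustration, and in real life the vCPU and RAM figures come from whatever you export out of vCenter or virsh, not from this script.

# A minimal sketch: cluster-level overcommit ratios from hand-collected inventory.
# Host and VM numbers are illustrative, not pulled from any real environment.

physical_cores = 16      # total physical cores across the cluster
physical_ram_gb = 96     # total physical RAM across the cluster, in GB

# One entry per VM: (allocated vCPUs, configured RAM in GB)
vms = [
    (4, 16), (2, 8), (2, 8), (8, 32), (2, 4),
    (4, 16), (2, 8), (1, 2), (2, 8), (4, 24),
]

total_vcpus = sum(v for v, _ in vms)
total_vram_gb = sum(r for _, r in vms)

vcpu_ratio = total_vcpus / float(physical_cores)
ram_commit_pct = 100.0 * total_vram_gb / physical_ram_gb

print("vCPU to pCore ratio: %.2f to 1" % vcpu_ratio)
print("RAM commit: %.0f%% of physical" % ram_commit_pct)

# Comfort lines from the rough numbers above, for mixed, non latency sensitive work.
if vcpu_ratio > 4:
    print("WARNING: vCPU ratio above 4 to 1, expect CPU ready to climb")
if ram_commit_pct > 150:
    print("WARNING: RAM commit above 150%, ballooning and host swap are close")
elif ram_commit_pct > 120:
    print("NOTE: RAM commit above 120%, make sure TPS and ballooning are carrying it")

Nothing fancy, but seeing the cluster-wide ratio next to the comfort lines ends a lot of arguments about squeezing in one more VM.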

Then comes the day it bites. The usual suspect is not CPU; it is storage and memory. One chatty VM turns into a noisy neighbor that drags everyone down. A single report in a legacy database can send random reads into a spin, and your datastore latency goes from happy to red. On shared arrays with mixed tiers, thin disks and snapshots stack up, and writes hit the ceiling. On the public cloud side, you can feel this too: anyone who has run an EC2 instance on busy EBS volumes knows the story of variable IO and surprise pauses.

When memory overcommit is pushed too far, the balloon driver starts reclaiming and the guest does not like it. The next step is host swap, and then users ask why the app stutters when no one changed code. The killer is double paging: the host swaps a page out, the guest then picks that same page to evict under its own pressure, and the host has to swap it back in just so the guest can write it to its own swap. Both think they are being clever, and both are wrong.

For CPU, the pain shows with latency sensitive workloads like voice gateways, trading feeds, HPC nodes with tight loops, or even chatty Java services with lots of synchronized regions. They hate time-slicing and long ready queues. Watch co-stop on big SMP guests, because overcommitting vCPUs on a single large VM without the cores to back it up turns into long pauses when the scheduler tries to co-schedule. With RAM, respect NUMA: placing a fat VM across memory nodes invites remote access penalties that look like a mystery heisenbug in your app.

With storage, the ugliest failure is running out of space on thin-provisioned datastores, which can pause or crash VMs. That is not a warning; that is downtime. Snapshot sprawl adds write amplification and long commit times. Queue depths on HBAs and arrays get ignored until average latency passes thirty milliseconds and everything slows to a crawl. Overcommit bites when we forget that physics still rules and that shared means shared, even if the UI looks like a private island.
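The thin provisioning out-of-space failure is the one worth a standing check, because it is the one that turns into downtime with no warning you will catch by eye. Here is a minimal sketch, assuming you can export per-datastore capacity, used space, and the total provisioned size of the thin disks sitting on it. The datastore names and numbers are illustrative, not from any real array.

# A minimal sketch: flag thin provisioned datastores that can run out of space.
# Capacity, used, and provisioned figures are illustrative; in practice they come
# from your vCenter or array reporting, exported however you like.

datastores = [
    # (name, capacity in GB, used in GB, provisioned thin in GB)
    ("ds-sas-01", 2000, 1450, 3100),
    ("ds-sas-02", 2000, 900, 1600),
    ("ds-ssd-01", 800, 660, 2200),
]

FREE_SPACE_ALERT_GB = 200   # a real line, not a soft guess

for name, cap_gb, used_gb, provisioned_gb in datastores:
    free_gb = cap_gb - used_gb
    oversub = provisioned_gb / float(cap_gb)   # how far capacity is oversubscribed
    if provisioned_gb > cap_gb and free_gb < FREE_SPACE_ALERT_GB:
        status = "ALERT: oversubscribed and nearly full, VMs can pause or crash"
    elif provisioned_gb > cap_gb:
        status = "watch: oversubscribed, keep snapshot growth in view"
    else:
        status = "ok"
    print("%s: %d GB free, %.1fx provisioned, %s" % (name, free_gb, oversub, status))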

So what is the playbook from the field? Start with right-sizing. Few workloads need eight vCPUs; many perform better with two or four that can actually run. Watch CPU ready, not just usage. If ready rises over five to ten percent during busy times, you are over the line. Keep an eye on co-stop for big VMs.

For memory, track active and consumed, not just configured. Balloon growth is a signal; host swap-in is an alarm bell. If the swap-in rate is not zero during business hours, fix your commit level or add RAM. Respect NUMA boundaries, use per-VM NUMA awareness, and keep guest vCPU counts aligned with memory locality.

For storage, manage to an IO budget. You can oversubscribe capacity, but do not oversubscribe IOPS without a plan. Use Storage IO Control to keep the bullies in check. Thin provision with alerts when free space hits a real line, not a soft guess. Clear snapshots on a schedule. Keep average datastore latency under twenty milliseconds for transactional work; ten is a good target. For network, do not pack too many chatty guests on a single vmnic, and watch for microbursts.

In VMware, use shares for fairness, use reservations only for the few that truly need a floor, and avoid limits unless you are doing lab work. Limits become invisible walls that cost you nights and weekends. Test with load, not with a ping, and record the before and after of consolidation changes. When you plan consolidation ratios, pick a ceiling per cluster and live within it. Two to one or three to one vCPUs per core for mixed enterprise apps has been a safe middle ground for me. Go higher for VDI, but measure login storms and print storms. Keep memory commit at the cluster level, not per host, and spread large VMs to avoid stacking. On KVM or Xen, turn on virtio, use huge pages for memory hungry guests, and pin to sockets for the monsters that need it. If you run OpenStack Essex today, the same rules apply.

And a note for the cloud crowd: overcommit exists there too. You feel it as variable performance, and your only defense is to spread risk, keep read replicas, and test in the time window when your users show up. The easy win with overcommit is to explain the trade to the business. Consolidation saves cash until it costs revenue. Show the chart. Choose the line you will not cross.
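Most of this playbook boils down to a handful of lines in the sand, so it is worth wiring the numbers into a check you can run on a schedule. Here is a minimal sketch that applies the thresholds above to hand-collected samples. The metric names and values are stand-ins for whatever you pull out of esxtop, vCenter, or your monitoring tool, and the VM and datastore names are invented.

# A minimal sketch: apply the playbook thresholds to hand-collected samples.
# Metric values are stand-ins for what you would read out of esxtop or vCenter.

CPU_READY_WARN_PCT = 5.0    # comfort line; over 10 is clearly over the line
LATENCY_WARN_MS = 20.0      # line for transactional work; 10 is the target
LATENCY_BAD_MS = 30.0       # around here everything slows to a crawl

vm_samples = [
    # (name, CPU ready percent, balloon in MB, host swap-in in KB/s)
    ("erp-db-01", 2.1, 0, 0),
    ("ci-runner-03", 12.4, 0, 0),
    ("report-legacy", 4.0, 1024, 350),
]

datastore_latency_ms = [
    ("ds-sas-01", 8.5),
    ("ds-ssd-01", 34.0),
]

for name, ready_pct, balloon_mb, swap_in_kbps in vm_samples:
    if ready_pct > CPU_READY_WARN_PCT:
        print("%s: CPU ready %.1f%%, over the comfort line, check co-stop too" % (name, ready_pct))
    if balloon_mb > 0:
        print("%s: balloon at %d MB, the commit level is being tested" % (name, balloon_mb))
    if swap_in_kbps > 0:
        print("%s: host swap-in at %d KB/s, fix the commit level or add RAM" % (name, swap_in_kbps))

for name, latency_ms in datastore_latency_ms:
    if latency_ms >= LATENCY_BAD_MS:
        print("%s: %.0f ms average latency, noisy neighbor or blown IOPS budget" % (name, latency_ms))
    elif latency_ms > LATENCY_WARN_MS:
        print("%s: %.0f ms average latency, above the 20 ms line for transactional work" % (name, latency_ms))

The point is not the script; it is that every threshold in it came from a chart you can show the business when you pick the line you will not cross.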

Overcommit is a great trick as long as you remember it is a trick and respect the bill that comes due.
