For quite some time now, my team and I have been chasing a strange, intermittent performance problem in our VMware infrastructure. It is difficult to reproduce on purpose, but it recurs reasonably frequently. And now we may finally have identified its cause… maybe!
What we’re seeing is that both Linux and Windows VMs periodically “lose” their virtual disks. The disks themselves are located on shared storage; in our case they are on NetApp filers and shared to the ESX hosts via NFS. Windows VMs seem to be much more resilient in dealing with the issue than Linux VMs. Despite using journaled filesystems, Linux VMs will sometimes experience such severe problems that the filesystems actually crash (though this probably happens only once every few months or so). More typically, in the Linux world the issue manifests as a SCSI error. This makes sense, given that VMware presents the virtual disks to Linux through a SCSI driver (for us, specifically, it’s the mptscsi driver). Here’s an example of the issue:
mptscsi: ioc0: attempting task abort! (sc=f7dcb940)
scsi0 : destination target 0, lun 0
command = Write (10) 00 00 83 3d d0 00 00 08 00
mptbase: ioc0: WARNING - IOCStatus(0x004b): SCSI IOC Terminated on CDB=2a id=0 lun=0
mptscsi: ioc0: task abort: SUCCESS (sc=f7dcb940)
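If you want to keep an eye on how often this happens, the log excerpt above is regular enough to scan for. Here is a minimal Python sketch that pairs each "attempting task abort" line with its outcome. The function name `summarize_aborts` and the two regular expressions are my own illustration, written only against the five lines shown here; other driver versions may format these messages differently.

```python
import re

# Patterns written against the excerpt above; other driver versions may differ.
ABORT_RE = re.compile(r"mptscsi: (ioc\d+): attempting task abort! \(sc=([0-9a-f]+)\)")
RESULT_RE = re.compile(r"mptscsi: (ioc\d+): task abort: (\w+) \(sc=([0-9a-f]+)\)")


def summarize_aborts(log_text):
    """Return {sc_id: outcome}; outcome is None if no result line was seen."""
    aborts = {}
    for line in log_text.splitlines():
        match = ABORT_RE.search(line)
        if match:
            # Remember the abort attempt until its outcome shows up.
            aborts.setdefault(match.group(2), None)
            continue
        match = RESULT_RE.search(line)
        if match:
            aborts[match.group(3)] = match.group(2)
    return aborts


SAMPLE = """\
mptscsi: ioc0: attempting task abort! (sc=f7dcb940)
scsi0 : destination target 0, lun 0
command = Write (10) 00 00 83 3d d0 00 00 08 00
mptbase: ioc0: WARNING - IOCStatus(0x004b): SCSI IOC Terminated on CDB=2a id=0 lun=0
mptscsi: ioc0: task abort: SUCCESS (sc=f7dcb940)
"""

print(summarize_aborts(SAMPLE))
```

Feeding it the contents of a saved kernel log instead of `SAMPLE` would give a rough count of how many aborts a VM has logged.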
Until recently, I thought I had looked at virtually every possible cause of a slowdown between the ESX hosts (in our case, HP BL465c Blades) and the shared storage (NetApp FAS-3020s). I’d looked at packet frame sizes, NIC teaming algorithms, duplex matching, spanning-tree configs, NFS keepalives, NFS window sizes and a whole bunch of other things I’ve since forgotten about. My boss had a suggestion that I hadn’t thought of: perhaps the root cause of everything was balloon memory and VMs swapping? At first, I didn’t believe him; after all, we have VMs using 167GB of RAM spread across a cluster of six hosts that has a total of 228GB. While that is an admittedly high average utilization of 73% per Blade, we only have 36GB in the Blades and they will take up to 64GB each. So, we have headroom to grow our RAM resources if we can find the money in our collective wallet.
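For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above (167GB in use, 228GB total, six hosts, 64GB maximum per Blade). The variable names are mine, and the "fully populated" figure simply assumes every Blade gets upgraded to its maximum.

```python
vm_ram_gb = 167           # RAM allocated to VMs, from the text
cluster_ram_gb = 228      # total RAM in the cluster today, from the text
hosts = 6
max_ram_per_host_gb = 64  # what each Blade can take, from the text

# Current utilization across the cluster.
utilization = vm_ram_gb / cluster_ram_gb

# Hypothetical: every Blade upgraded to its maximum.
max_cluster_ram_gb = hosts * max_ram_per_host_gb
utilization_if_maxed = vm_ram_gb / max_cluster_ram_gb

print(f"current utilization: {utilization:.0%}")
print(f"cluster RAM if fully populated: {max_cluster_ram_gb}GB")
print(f"utilization if fully populated: {utilization_if_maxed:.0%}")
```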
His theory was an interesting one. When VMs experience balloon memory conditions (that is, they demand more memory than is available to them from the pool), they essentially stop using RAM for their temporary storage (as they have run out) and start using disk instead. Under the best conditions (i.e., fast local disk), this would involve a hopefully slight hit to performance. But what if the disk they’re swapping to is not, in fact, local? What if it is a disk presented via NFS over a LAN, as part of a larger shared storage pool? Simply put, you cannot expect the same latency or throughput over a 2Gb NFS link (a pair of 1Gb NICs bonded together) that you would get from a fast local SATA, SAS or SCSI disk in the real world. Now what if this problem is occurring not on a single VM but on multiple VMs? Or even dozens? You would have a cascading problem: some VMs run out of RAM and begin swapping to what is really an NFS disk. More VMs run out of RAM and they, too, begin swapping to an NFS disk. And your disk (in actuality, the TCP/IP pipe to your disk) becomes more and more congested.
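To put a rough number on that congestion, here is a back-of-the-envelope sketch. It assumes the 2Gb link is split evenly among the swapping VMs and ignores protocol overhead and all the regular (non-swap) disk traffic, so treat the results as a best-case ceiling rather than a measurement. The function name `per_vm_share_mb_s` and the VM counts in the loop are made up for illustration.

```python
LINK_GBIT = 2.0  # two bonded 1Gb NICs, from the text


def per_vm_share_mb_s(swapping_vms, link_gbit=LINK_GBIT):
    """Best-case MB/s per VM, assuming an even split and no protocol overhead."""
    link_mb_s = link_gbit * 1000 / 8  # gigabits per second -> megabytes per second
    return link_mb_s / swapping_vms


# Illustrative VM counts only; these are not measurements from our cluster.
for n in (1, 4, 12, 24):
    print(f"{n:>2} swapping VMs: at most {per_vm_share_mb_s(n):.1f} MB/s each")
```

Even in this idealized model, the per-VM share drops off quickly as more VMs start swapping at the same time.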
To test his theory, we concocted a strange kind of environment. We built a single ESX host with 4 CPUs (which are essentially irrelevant here) and 4GB of physical RAM. Then we deployed 2 Red Hat VMs: one had 3GB of RAM and a truly local (i.e., on the Blade itself) virtual disk; the other had 1GB of RAM and an NFS-mounted virtual disk. But the key is this: the 3GB VM got all of its RAM reserved, and the 1GB VM had none of its RAM reserved. His theory is that, if we introduce a workload on the 3GB (local disk) VM that chews up all its RAM (which it has gleefully reserved), the 1GB VM, when faced with a workload, will need to swap to disk. NFS disk. And it will do it slowly. And then we will see performance go to hell.
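The memory accounting behind the test can be sketched in a few lines. This is a simplification of my own, not how ESX actually schedules memory: it just subtracts reservations (and an optional hypervisor overhead) from host RAM and sees what is left for the unreserved VM. The `overhead_gb` value of 0.5 is a hypothetical number picked for illustration; we have not measured the real overhead.

```python
def unreserved_shortfall_gb(host_ram_gb, reservations_gb, unreserved_vm_gb, overhead_gb=0.0):
    """How many GB the unreserved VM cannot get from physical RAM (simplified model)."""
    # RAM left over once the hypervisor and the reserved VMs have taken theirs.
    available = max(host_ram_gb - overhead_gb - sum(reservations_gb), 0.0)
    return max(unreserved_vm_gb - available, 0.0)


# The test host from the text: 4GB host, 3GB fully reserved, 1GB unreserved.
no_overhead = unreserved_shortfall_gb(4, [3], 1)
with_overhead = unreserved_shortfall_gb(4, [3], 1, overhead_gb=0.5)  # hypothetical overhead

print(f"shortfall with no overhead: {no_overhead}GB")
print(f"shortfall with 0.5GB assumed overhead: {with_overhead}GB")
```

On paper the two VMs exactly fill the host, so any memory the hypervisor needs for itself comes out of the 1GB VM's share, which is what should push it into swapping once both VMs are busy.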
I’m running the tests right now, but so far it looks like he will be proved right. Which is embarrassing for yours truly.