Quickly benchmarking vSphere’s PVSCSI driver

There are a lot of discussions going on about VMware’s PVSCSI driver (available in vSphere, under select guest OSs), so I figured I’d give it a try. Although I’m a little disappointed that it’s not [currently?] supported to boot off a PVSCSI-controlled disk, there are people who have gotten it to work. I thought I’d do a quick-and-dirty test using Oracle’s ORiON tool; the guest operating system is Windows Server 2008 SP1 (64-bit). I’m booting off a standard 20GB partition (connected to the LSI Logic SCSI controller), with a simple, raw 1GB partition (connected to the PVSCSI controller) as a test LUN. Here are the results:

orion.exe -run simple -testname mytest -num_disks 1

Using paravirtualized SCSI driver:

Maximum Large MBPS=103.97 @ Small=0 and Large=2
Maximum Small IOPS=6159 @ Small=5 and Large=0
Minimum Small Latency=0.51 @ Small=1 and Large=0

Not using paravirtualized SCSI driver:

Maximum Large MBPS=108.16 @ Small=0 and Large=2
Maximum Small IOPS=6543 @ Small=5 and Large=0
Minimum Small Latency=0.56 @ Small=2 and Large=0

Notice that, when not under load, latency is lower (which is better) with the paravirtualized driver, but throughput is also lower (which is worse). In order to use a raw partition in ORiON under Windows, you need to specify something like this in your .lun file:

\\.\e:

Where e: is the drive letter assigned to the raw, non-formatted partition. I’ll do some more tests as time allows — I have a feeling that, when the VM is under higher load, the PVSCSI’s numbers will be better. Similarly, I’ll need to play with block sizes & disk counts, too.
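For completeness: ORION reads its volume list from a file named after the test, so the simple run above expects a mytest.lun in the working directory. With the raw partition mounted as E: (substitute whatever drive letter yours ends up with), the entire file is just that single line:

mytest.lun:
\\.\e: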

Here is a new test (stolen from here), which gives the VM a bit more of an I/O beating. The results are very interesting:

orion.exe -run advanced -testname mytest -num_disks 1 -type rand -write 5 -matrix row -num_large 1

Using paravirtualized SCSI driver:

Maximum Large MBPS=83.17 @ Small=0 and Large=1
Maximum Small IOPS=956 @ Small=5 and Large=1
Minimum Small Latency=5.21 @ Small=1 and Large=1

Not using paravirtualized SCSI driver:

Maximum Large MBPS=70.46 @ Small=5 and Large=1
Maximum Small IOPS=806 @ Small=5 and Large=1
Minimum Small Latency=5.46 @ Small=1 and Large=1

Here, we can see quite a big difference in the numbers. Latency is down by about 4.6% (5.21 ms vs. 5.46 ms), while throughput is up by roughly 18% in MBPS (83.17 vs. 70.46) and almost 19% in IOPS (956 vs. 806). For the record, the ESX host is vSphere build 175625; the datastore is NFS-mounted to a Sun Fire x4540 storage server with 48 500GB SATA disks.

Contractor + UPS + datacenter = epic fail

So, we had a contractor in today to upgrade the firmware in our UPS. Sounds simple, right? It’s a gigantic Emerson thing; just put it in bypass and have at it.

Anyway, the guy forgot to put it in bypass. So, at 3pm, the whole datacenter went dark.

I am not making this up.

Everything seems back to normal, now. I shudder to think how many man-hours have gone into restoring affected servers, given that it took down everything on campus. 😦

VMware ESX, Microsoft NLB and why you should care

…especially if you use Cisco switches!

We are a fairly heavy user of Microsoft’s Network Load Balancing (NLB) technology on Windows Server 2003 and 2008. Specifically, we’re clustering a reasonably large (4,000-mailbox) Exchange Server 2007 deployment, as well as numerous SQL Server 2005 and 2008 instances. We’re doing this across physical, virtual and hybrid deployments.

While I’m not our resident Microsoft expert, my understanding of the fundamentals of this kind of clustering is this: servers transmit heartbeats/keepalives to the other nodes and react accordingly. This is complicated by the fact that NLB has two modes of operation for cluster traffic: unicast and multicast. Microsoft’s default is unicast; VMware’s recommendation is multicast, and you can read the rationale for this decision here. (There is even more information here. For information about setting up multicast NLB, go here.)

Where this turns into a real problem is if you’re using Cisco switches. You can read a somewhat wordy explanation here, which summarizes the problem better than I can. But essentially it boils down to this: VMware recommends you use multicast NLB. In order to use multicast NLB, you need to (*gulp*) hard-code MACs and ARP entries in your Cisco switching infrastructure. For those with a relatively small systems infrastructure, this isn’t the biggest deal in the world. But when you have more than 12,000 ports on campus, it presents some serious scalability (and, consequently, feasibility) problems. Every time you set up a cluster — which admittedly might not be that often — you’re going to have to coordinate configuration changes to your switching platforms.
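To give a sense of what that hard-coding involves, here’s a rough sketch of the static entries on an IOS switch. The addresses are made up (a cluster VIP of 192.168.1.50 on VLAN 100, with the NLB nodes on two hypothetical ports), and the exact syntax varies between IOS versions and platforms, so treat it as illustrative rather than copy-and-paste:

! Static ARP entry mapping the cluster VIP to NLB's multicast MAC
! (03bf. followed by the cluster IP in hex, in non-IGMP multicast mode)
arp 192.168.1.50 03bf.c0a8.0132 arpa
! Static MAC entry so cluster traffic is only forwarded to the node-facing ports
mac-address-table static 03bf.c0a8.0132 vlan 100 interface GigabitEthernet1/0/10 GigabitEthernet1/0/11

Multiply that by every cluster, and by every switch in the path, and you can see why it doesn’t scale gracefully across a big campus network.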

How does this make you feel?

When incorrectly configured — which is to say, the switches have not been configured at all to handle the uniqueness of multicast NLB — you can experience problems where nodes seemingly drop off the network. This causes a split-brain scenario, where the cluster hasn’t actually failed over but thinks that it has. When you add shared storage into the mix, things gather the momentum to turn pear-shaped very quickly. Which is what we’re seeing now, and what we’re trying to debug…
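As a first sanity check while we debug, it’s worth confirming from the Windows side which mode each cluster is actually running in and which cluster MAC it ended up with. On a 2008 node, something along these lines should show it (wlbs is the older equivalent on 2003; exact output varies by version):

nlb.exe display

Among other things, the output lists the cluster operation mode (unicast, multicast or IGMP multicast) and the cluster network address, which is the value that has to line up with whatever gets hard-coded on the switches.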

Sun x4540 in action

So, the x4540 (I still think “Thumper” is a cooler name) has arrived and we’ve got it in action. We managed to dig up some fairly bizarre-looking 240VAC power gear, complete with strange connectors and pre-SNMP PDUs. (I’ll post pictures when I get a chance.)

Overall I’m pretty happy with the unit. The build quality is very good, connectivity (quad on-board gigabit ethernet, ILOM, serial console) is nice and expansive, and the box seems very fast. There are some caveats, though: out of the box, the server will seemingly display BIOS and POST information to the main video output (a VGA port), but will not show the operating system loading over it — instead, you have to use the serial console. Which, I think, is kind of prehistoric. The server is also quite noisy; much noisier than anything else we have in the server room, with the possible exception of the SATAbeasts.

But, those are fairly minor inconveniences when you can see this via top:
load averages: 0.02, 0.02, 0.02; up 11+01:15:33 13:43:40
48 processes: 47 sleeping, 1 on cpu
CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 1181M free mem, 4103M total swap, 4103M free swap

And,

[cwaltham@grinder ~]$ df -h | grep pool
pool1 3.6T 286G 3.3T 8% /vol1
pool2 4.5T 30K 4.5T 1% /vol2
pool3 4.5T 30K 4.5T 1% /vol3
pool4 3.6T 29K 3.6T 1% /vol4

16.2 TB usable is very nice, indeed! I am not yet happy with the I/O performance we’re getting out of them as NFS datastores, but — and this is a big but — I’m currently limited by uplink capacity to the Thumper itself. That is, the Thumper is connected to a 24-port Cisco 3500 series, which has a single gigabit uplink to a Catalyst 6500, which in turn has the other NFS filers connected. So far, I’ve only managed to get about 10MB/sec sustained over that link:

[Graph: network traffic on grinder’s nge1 interface]
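If anyone wants to watch the same thing from the Thumper’s side, the stock Solaris tooling gives a rough picture; these are simply the commands I’d reach for, with the interval, pool and interface names specific to this box:

# per-pool throughput, refreshed every 5 seconds
zpool iostat pool1 5

# byte counters on the NFS-facing NIC, sampled every 5 seconds
dladm show-link -s -i 5 nge1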