…especially if you use Cisco switches!
We are a reasonably heavy user of Microsoft’s network load-balancing (NLB) technology in Windows Server 2003 and 2008 architectures. Specifically, we’re clustering a reasonably large (4000 mailboxes) instance of Exchange Server 2007, as well as numerous SQL Server 2005 & 2007 instances. We’re doing this across physical, virtual and hybrid deployments.
While I’m not our resident Microsoft expert, my understanding of the fundamentals of this kind of clustering is this: servers transmit heartbeats/keepalives to other servers, and react accordingly. This is complicated by the fact that you have two options for heartbeat transmission: unicast and multicast. Microsoft’s default is unicast; VMware’s recommendation is multicast, and you can read the rationale for this decision here. (There is even more information here. For information about setting up multicast NLB, go here.)
Where this turns into an embarrassment-of-riches problem is if you’re using Cisco switches. You can read a somewhat wordy explanation here, which summarizes the problem better than I can. But essentially it boils down to this: VMware recommends you use multicast NLB. In order to use multicast NLB, you need to (*gulp*) hard-code MACs and ARP entries in your Cisco switching infrastructure. For those with relatively small systems infrastructure, this isn’t the biggest deal in the world. But when you have more than 12,000 ports on campus, it presents some serious scalability (and, subsequently, feasibility) problems. Every time you set up a cluster (which admittedly might not be that often) you’re going to have to coordinate configuration changes to your switching platforms.
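To give a feel for what “hard-coding MACs and ARP entries” means in practice, here is a rough Python sketch that derives the cluster MAC for a multicast NLB cluster and prints the kind of static entries a switch would need. It relies on the documented NLB convention that the multicast cluster MAC is 03-BF followed by the four octets of the cluster IP. The cluster IP, VLAN number and interface names are made up for illustration, the function names are my own, and the exact IOS command syntax varies between platforms and software versions (older releases spell it mac-address-table), so treat the output as a starting point rather than something to paste into production.

```python
import ipaddress


def nlb_multicast_mac(cluster_ip: str) -> str:
    """Return the NLB multicast cluster MAC in Cisco dotted notation.

    Multicast-mode NLB uses 03-BF followed by the four octets of the
    cluster's IPv4 address.
    """
    octets = ipaddress.IPv4Address(cluster_ip).packed  # 4 raw bytes
    raw = (bytes([0x03, 0xBF]) + octets).hex()         # 12 hex digits
    # Cisco writes MACs as three groups of four hex digits.
    return ".".join(raw[i:i + 4] for i in range(0, 12, 4))


def cisco_static_entries(cluster_ip: str, vlan: int, interfaces: list[str]) -> list[str]:
    """Build illustrative static ARP and MAC-table lines for one cluster.

    Assumption: IOS-style syntax; check your platform's documentation.
    """
    mac = nlb_multicast_mac(cluster_ip)
    return [
        # Static ARP entry on the routing device for the cluster IP.
        f"arp {cluster_ip} {mac} ARPA",
        # Static MAC entry pinning the cluster MAC to the node-facing ports.
        f"mac address-table static {mac} vlan {vlan} interface {' '.join(interfaces)}",
    ]


if __name__ == "__main__":
    # Hypothetical cluster: IP, VLAN and ports are placeholders.
    for line in cisco_static_entries("10.0.0.50", 10, ["Gi1/0/1", "Gi1/0/2"]):
        print(line)
```

Multiply those two lines by every cluster, and by every switch in the path that needs the static MAC entry, and the coordination overhead becomes obvious.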
How does this make you feel?
When incorrectly configured (which is to say, the switches have not been configured at all to handle the peculiarities of multicast NLB) you can experience problems where nodes seemingly drop off the network. This causes a split-brain scenario, where the cluster hasn’t actually failed over but it thinks that it has. When you add shared storage into the mix, things gather the momentum to turn pear-shaped very quickly. Which is what we’re seeing now, and what we’re trying to debug…