More on mixing ESX & ESXi in the same cluster

A while back I wrote this post, which detailed some of my struggles as I moved from ESX to ESXi at the same time as we conducted some network reconfiguration. Now that vSphere 5 is almost ready to go gold, it’s imperative that I finish off our ESX->ESXi conversions as vSphere 5 is going to be ESXi-only. Previously, I had written the following:

So, now we have a mismatch in the architecture between ESX and ESXi.

While that is technically true, it turns out that it doesn’t really matter. Yes, your Service Console (ESX) is now your vmk0 (ESXi), which means your VMkernel(s) are going to be vmkN where N>0 because the VMkernels get created after the Management Network (ESXi’s new name for the Service Console). Now, this presents you with a small problem that you’ll see appear as this error dialog:

The vMotion interface of the destination host uses network ‘vmk1’, which differs from the network ‘vmk0’ used by the vMotion interface of the source host. Are you sure you want to continue? Yes / No

Normally, if you put a host in Maintenance Mode, the VMs will vMotion off said host automatically. However, if you see this warning when manually vMotioning a VM, the VMs will not migrate automatically when the host goes into Maintenance Mode! They will just sit there and be surly. If you vMotion the VMs yourself, though, the migration will succeed; you simply have to click Yes when the warning appears.

What I used to do, and what I now recommend against, is removing the Management Network from your new ESXi host, creating the new VMkernel port(s) and then re-creating your Management Network in order to have your device names “match” (i.e., vmk0 on your ESX host is a VMkernel just like vmk0 on your ESXi host is a VMkernel). You don’t need to do it! Sure, you won’t be able to migrate your VMs simply by putting the host into Maintenance Mode, but the steps involved are a PITA and require iLO or physical console access. After all, how can you remove your Management Network in ESXi while you’re logged in to it? 🙂

Another thing to keep in mind is that your datastores must be mapped the same in ESXi as they were on your ESX servers — at least, if you don’t want to have to power off every VM before you migrate it. Here’s what I mean: I have two hosts, ESX and ESXi, and both are connected to an NFS datastore with the name nfsprod. When I try and vMotion a VM between the ESX and the ESXi host, it fails with the not-particularly-descriptive error message:

A general system error occurred: the source detected that the destination failed to resume

If you look in the VMkernel error log, you probably won’t see anything descriptive (at least, I didn’t when I looked.) But this blog post came up from The Googles, and it contained the answer I was looking for. Here’s what you see at the hypervisor level — remember, this is a datastore that appears as nfsprod on both the ESX & ESXi via the vSphere Client. Check it out:

[root@ESX ~]# esxcfg-nas -l
nfsprod is /vol/nfsprod from 10.140.231.135 mounted

And now, the ESXi host:

~ # esxcfg-nas -l
nfsprod is /vol/nfsprod from blender

See? One host mounts the NFS server by its IP address, and the other mounts it by name. And that is enough to make the vMotions fail!
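If you have more than a couple of datastores, eyeballing the two listings gets tedious. Here is a minimal Python sketch that compares the `esxcfg-nas -l` output from two hosts and flags any datastore whose server or share differs. It assumes the output looks like the two lines shown above (`<label> is <share> from <server>`); other ESX/ESXi builds may print it differently, and the function names are my own, purely illustrative.

```python
import re

# Assumed format, taken from the listings above:
#   "nfsprod is /vol/nfsprod from 10.140.231.135 mounted"
NAS_LINE = re.compile(r"^(?P<label>\S+) is (?P<share>\S+) from (?P<server>\S+)")


def parse_nas_list(output):
    """Map datastore label -> (server, share) from `esxcfg-nas -l` text."""
    mounts = {}
    for line in output.splitlines():
        match = NAS_LINE.match(line.strip())
        if match:
            mounts[match["label"]] = (match["server"], match["share"])
    return mounts


def find_mismatches(output_a, output_b):
    """Return labels present on both hosts whose server or share differ."""
    a = parse_nas_list(output_a)
    b = parse_nas_list(output_b)
    return {
        label: (a[label], b[label])
        for label in a.keys() & b.keys()
        if a[label] != b[label]
    }


# The two listings from this post, pasted in as strings.
esx_output = "nfsprod is /vol/nfsprod from 10.140.231.135 mounted"
esxi_output = "nfsprod is /vol/nfsprod from blender"

for label, (esx_mount, esxi_mount) in find_mismatches(esx_output, esxi_output).items():
    print(f"{label}: ESX uses {esx_mount[0]}, ESXi uses {esxi_mount[0]}")
```

The comparison is deliberately a plain string match: as far as the hosts are concerned, an IP address and a name are different mounts even when they point at the same filer, and that is exactly the condition that bit me.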

A grand don’t come for free

Or, “exploring some of the hidden costs of computing & virtualization.” (And with apologies to The Streets for stealing their album title.)

My college education was in Economics and Political Science, two fields with significant overlap — hence why I chose to major in both of them. As such, I am the exception to the norm of trained computer scientists; it is also why I resist using the word “Engineer” in my title. That said, Economics and Political Science have brought me understanding of many things, a couple of which I will bring up here.

Economics and Political Science are obviously two very broad fields, but nevertheless I will try and distill a couple of points here. First, that everything has a cost: there ain’t no such thing as a free lunch (or a free grand). And second: that people respond to incentives.

I work for a small, private, resource-rich college in Maine; I graduated from a large, public, resource-scarce university in Maine. The two offer very different appearances, but share surprisingly many problems. Specifically, the abundance of technology (and the ease of its use) often makes people think that computing is either easy, cheap or both. Anyone who’s tried to build a large, fast, reliable wireless network can attest that building something ubiquitous or commoditized is not necessarily easy. One thing that VMware and virtualization have brought to the front is the [reinvention of the] idea that computing, or computing power/resources, is something that should be commoditized. Being able to deploy 15 virtual machines off a single physical server is certainly a) a very good thing but also b) something that supports the notion of commoditization.

Unfortunately, commoditization has its costs (see? I’m an Economist after all!). When we commoditize something, we tend to ignore its true costs, many of which are not immediately apparent. Gasoline is a commodity; we think of a price at the pump per gallon (and maybe a price per barrel of oil) but little else. We do not consider its effects on the environment, its effects on world geopolitics and U.S. foreign policy or many other things that a seemingly simple product calls into play. We see a price at the pump, decide to pay it or not, and receive a product. Little else seems to enter into the equation.

Virtualization has brought us something similar: in many cases it has allowed us, the administrators or providers of systems, to offer products to a consumer base (in my case, the rest of a college) at a price much more in line with something we could identify as a commodity. If a department wants to deploy a server, they think that a virtual machine is essentially “free” because there’s no more hardware to buy.

This, of course, is a fallacy.

People trained in Economics — myself included — love to think about things in terms of opportunity costs. If I do X, what am I giving up in regards to Y? If I give you $5 to buy a beer, that’s $5 less I have to spend on beer, or to spend on anything else for that matter. The cost incurred in doing something means I give up some ability to do something else.

Unfortunately, at least in technology circles, this is an effect not often considered. Where I work, we have about 15 ESX & ESXi hosts running around 200 virtual machines; we no longer need to purchase one piece of hardware to deploy one service. Because we don’t perform chargebacks, this gives the impression to our clients (i.e., other campus departments) that their new virtual machine is “free”. But, of course, it is not – I had to buy a license for VMware, as well as some fraction of the underlying physical hardware. And, I had to set it all up to boot. We run at a consolidation ratio of roughly 13 VMs to one physical host; every time I provision a VM for a department or faculty member that’s 1/13th less of an ESX host I have access to.

So, there is a definite opportunity cost to the provisioning of commoditized computing resources because those are resources I could be doing other things with. Instead of deploying a new server for a Geology professor, I could have a sandbox for Oracle Data Guard. While I didn’t invoice the Geology department, I still incurred a cost for the work.

And physical resources are not the only ones we should be concerned with, either. The commoditization of computing resources, just as in the commoditization of mass production (e.g., factory labor), tends to obscure the true cost of the underlying labor. This is especially widespread in my institution because, so often, the true cost of simply implementing a solution is ignored. The commoditization of services like virtualization, clustering and high-availability frameworks has significantly obscured the underlying costs of deploying such services.

Recently, there was a discussion at my institution regarding the choice of database for a new SIS implementation. The vendor requires Oracle, but should we use single-node or RAC? Or RAC One Node? The managers, of course, want 100% reliability; therefore clustering (and therefore “full” RAC) must be the best choice! This simple — yet common — misunderstanding seems particularly rife in many I.T. environments today. While RAC does allow you to do many great things that you can’t do on a single-instance database, it also has costs: significant ones at that. But how do we even begin to quantify those?

Say your boss wants 99.99% reliability, aka “four nines”, which works out to roughly 1 minute of downtime per week; a little under an hour a year. To do that, he wants to use Oracle RAC. Let’s say that, without Oracle RAC, you could reasonably accomplish 99.9% (“three nines”) of reliability, which works out to roughly 10 minutes of downtime per week — a little under 9 hours of downtime a year. Great, he says, more reliability is better! Less downtime is good!
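If you want to check the “nines” arithmetic yourself, here is a small Python sketch. It assumes a 7-day week and a 365-day year, and the function name is just something I made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (assumes a non-leap year)


def downtime_minutes(availability_percent, period_minutes):
    """Minutes of allowed downtime in a period at a given availability."""
    return (100 - availability_percent) / 100 * period_minutes


for label, availability in (("three nines", 99.9), ("four nines", 99.99)):
    weekly = downtime_minutes(availability, MINUTES_PER_WEEK)
    yearly = downtime_minutes(availability, MINUTES_PER_YEAR)
    print(
        f"{label} ({availability}%): "
        f"{weekly:.1f} min/week, {yearly:.1f} min/year ({yearly / 60:.2f} hours)"
    )
```

Four nines comes out at about 1 minute a week and roughly 53 minutes a year; three nines at about 10 minutes a week and roughly 8.8 hours a year, which matches the rounded figures above.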

What if an Oracle RAC license costs you $1000/year? This absurdly-low figure just bought you 468 minutes of added uptime (9 minutes a week for 52 weeks), or about 7.8 hours’ worth, for $1000. Now let’s divide 1000 by 7.8 and get the average cost: about $128. So, if your institution accepts that price for a RAC license, then it is valuing avoiding downtime at something like $128 per hour avoided per year.

Now let’s add a real RAC price. For our SIS, Oracle RAC was something like $44,000 per year, and that’s a deeply-discounted price! If we assume the same, “four nines” v. “three nines” model, we’re still getting the same added uptime (or, the same avoided downtime) of roughly 8 hours per year. Now, divide 44000 by 8 and you’ll get an hourly value for Oracle RAC. Curious? It’s roughly $5,500 per hour of avoided downtime per year.

Do you think that’s good value? More to the point, do you think anyone bothered to calculate it out that way? My guess is, you don’t, and they didn’t. $5,500 buys a BbWorld 2011 conference pass, airfare, accommodation and spending money next year. In New Orleans!
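For anyone who would like to run these numbers against their own license quote, here is a rough Python sketch of the same calculation. It assumes a 365-day year and takes the “three nines” versus “four nines” figures at face value; the function names are hypothetical. Because it doesn’t round the avoided downtime to a whole number of hours, the results land close to, but not exactly on, the figures in the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 (assumes a non-leap year)


def avoided_downtime_hours(baseline_percent, target_percent):
    """Hours of downtime per year avoided by moving from baseline to target."""
    return (target_percent - baseline_percent) / 100 * HOURS_PER_YEAR


def cost_per_avoided_hour(annual_cost, baseline_percent, target_percent):
    """Annual license cost divided by the downtime hours it avoids per year."""
    return annual_cost / avoided_downtime_hours(baseline_percent, target_percent)


avoided = avoided_downtime_hours(99.9, 99.99)
print(f"Avoided downtime: {avoided:.2f} hours/year")

# $1,000 is the made-up low figure; $44,000 is the discounted SIS quote.
for annual_cost in (1_000, 44_000):
    per_hour = cost_per_avoided_hour(annual_cost, 99.9, 99.99)
    print(f"${annual_cost:,}/year works out to about ${per_hour:,.0f} per avoided hour")
```

Swap in your own availability targets and license cost; the point is less the exact dollar figure than the habit of dividing the price by what it actually buys you.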

Now that we’ve sort of quantified the cost (and, opportunity cost) of an Oracle RAC license, let’s examine something that’s thought about even less: the added complexity that RAC involves. As somebody who builds I.T. systems for a living, my mantra has always been “Build it once, build it right”; I have a relatively short attention span so doing the same thing over and over tends to bore me. I like to deploy a solution once, deploy it right and move on to a new challenge. When your application fails (and sooner or later, it will fail) you’ll have to fix it. Now, though, you’re not just debugging a single instance of Oracle and a single application: maybe you’re debugging a problem with Oracle RAC, or a problem with RAC and an application load-balancer thrown somewhere in the mix.

Although I am by no means a RAC expert, I can tell you that a two-node RAC instance is not simply twice as hard to diagnose as a single Oracle instance. RAC introduces many new layers of complexity and features, all of which need to be understood if you’re to have any success in fixing your outage. More complexity means more time, of course, and now your $44,000 RAC license (which didn’t protect you against this hypothetical case of failure) means that the Mean Time To Recovery, or MTTR, is now larger.

This is important! I have just shown that a solution like Oracle RAC — a solution that is really pretty good, and something I quite like — has both a financial opportunity cost (it’s money you could spend elsewhere) and a cost to MTTR, too. As good as it is, if RAC goes wrong, it is very possibly going to be more difficult to fix and take more time to fix than a simpler, single-instance solution. See what I mean? Everything has costs.

Now that we’ve got that out of the way, I’ll return to my other point: people respond to incentives. Where I work, we have about 1,600 students and something on the order of 15,000 network ports. There are no chargebacks; ports are “free”: if you want one, you ask for one & you get one. Where I went to school, there were about 16,000 students and a smaller number of network ports. My wife worked in a department on campus when she was in graduate school; they had to pay the University I.T. department $8 per port per month for each hardwired port they had activated in their building. Wireless “ports” (i.e., access points) were free. Guess what the department did? Shut down the physical ports, went to wireless and saved their money. This, of course, is what the University I.T. department wanted all along as it was less infrastructure for them to support.

Still with me? Good, because now it’s time to wrap these two ideas together. If everything has a cost, and if people respond to incentives, what happens when you match people to their incentives? As any behavioral economist will tell you, people will react to the costs. Our facilities department recently replaced one of their physical servers with seven virtual machines, because they didn’t have to pay for the VMs (because there are no chargebacks). They have no incentive not to use as many resources as they desire, because they are unaware of the true costs of the commodities they’re using. This isn’t their fault; they simply don’t know how much the resources they use cost. This is one example of behavior that leads to virtual machine “sprawl”: when no one considers that VMs aren’t free, everyone builds as many VMs as they want. This helps to explain why a college with ~1,600 students has ~200 VMs!

So, to summarize: the commoditization of I.T. resources is a great thing, and our abilities to avoid downtime are great abilities to have, but they both have costs, and those costs are often hidden. But just because you can’t see them doesn’t mean they’re not there. Just because you can deploy 15 or 20 VMs to one physical host doesn’t mean you’re not really using some [large] fraction of an underlying host, and it doesn’t mean you should be able to utilize a commodity without helping defray some of its true, underlying costs.