Do we need a new paradigm to evaluate application platforms?

While technology in the IT world changes pretty rapidly, the paradigms that organizations (including my employer) value and pursue don’t tend to change that often. Sure, the paradigms are influenced by the technology that’s available, but a change in one does not necessarily mandate a change in the other.

20 years ago, technology was slower and more expensive than it is today. It was also less reliable. 10 years ago, technology was slower and more expensive than it is today. It was also less reliable (but probably not by too much.) Today, you can get an incredible amount of computational power for very little money. A few months ago we purchased a half-dozen HP BL465c blade servers; each contained 12 2.1GHz CPU cores and 48GB of RAM, and each cost considerably less than $5000. 10 years ago, that would have been an unbelievable amount of computational power at an unconscionable price.

But as power has increased, reliability has increased too, or, perhaps more accurately, plateaued at a (very) high level. I’ve been administering hardware & software for more than a decade, and I can’t remember the last time a NIC or a CPU went bad in a server. I’ve seen a few HDDs go bad (which is why we run RAID), and I’ve seen RAM go bad over time.

Why am I writing about this? Mainly because it presents an interesting “problem”. We have a few database platforms here that are built on clusters of HP BL465c G5 servers, each with dual-socket, dual-core AMD Opterons. We build them in small clusters using network load-balancing to address both performance and reliability needs. But do we really need to? The availability of hex-core, dual-socket HP BL465c G6 servers makes performance concerns (provided your databases take advantage of multi-core environments, of course) seem somewhat silly. And given the very high reliability of individual blades, with their multiple CPUs, multiple banks of RAM, multiple HDDs and multiple NICs, building clusters of servers for single applications starts to seem a little unnecessary from a reliability point of view, too.

The hard part, of course, is figuring out where to draw the line. Designing platforms for performance costs money but doesn’t introduce a whole lot of extra complexity (at least in most circumstances) to your environment. Designing platforms for high-availability costs you both money (you have to buy more stuff) and complexity (you have to set up load-balancing, failover, or any of myriad other HA options.)

But what if I told you that building for high-availability is, for the most part, pointless? Would you be shocked? Or would you try to think of the last time a brand-name server failed catastrophically, come up empty, and be a little surprised? As I mentioned, I’ve seen HDDs fail and RAM go bad plenty of times. But if you build a single server with any degree of intelligence (a couple CPUs, a couple banks of RAM, a couple HDDs in RAID, and even a couple NICs in a bond, which I think is becoming less and less meaningful in today’s world), then building paired servers (network load-balancing, failover, whatever) introduces added complexity and increases cost while giving you very little in return.
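As an aside, if you do run bonded NICs on Linux, checking on them is cheap. Here’s a minimal sketch, assuming the stock in-kernel bonding driver and a bond named bond0 (the name is an assumption; adjust for your environment), that parses the driver’s status file for per-slave link state:

```python
# Minimal sketch: report per-slave link status for a Linux NIC bond.
# Assumes the in-kernel bonding driver, which exposes its status at
# /proc/net/bonding/<bond name>; the bond name here is an assumption.

def bond_slave_status(path="/proc/net/bonding/bond0"):
    slaves = {}
    current = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current is not None:
                # The first MII Status line (the bond itself) is skipped
                # because no slave has been seen yet.
                slaves[current] = line.split(":", 1)[1].strip()
    return slaves  # e.g. {"eth0": "up", "eth1": "up"}

if __name__ == "__main__":
    for nic, status in sorted(bond_slave_status().items()):
        print(nic, status)
```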

Going back to my organization’s example: we have two totally separate database platforms. Both consist of three HP BL465c G5 blade servers with dual-socket, dual-core AMD Opterons. Each blade has 4 CPU cores, 16GB of RAM, 2 HDDs and 4 NICs. What are the chances of a catastrophic failure? Pretty low, I would say. But a three-way network load-balanced cluster introduces complexity (which costs you time) and additional expense (more servers to buy, more power to consume.) With the HP BL465c G6 blade servers, a single server can hold 12 CPU cores, 48GB of RAM (the limit is actually 64GB, but that’s super-expensive), 2 HDDs and 4 NICs (ditto.) We get as much CPU and RAM capacity in one server as we had in three, and we get it with much less complexity and much less cost.
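The back-of-the-envelope arithmetic, using the configurations above, is simple enough to sketch:

```python
# Capacity comparison using the blade configurations described above.
g5_blade = {"cores": 4, "ram_gb": 16}    # BL465c G5: 2 sockets x 2 cores
g6_blade = {"cores": 12, "ram_gb": 48}   # BL465c G6: 2 sockets x 6 cores

three_g5 = {k: 3 * v for k, v in g5_blade.items()}
print(three_g5)              # {'cores': 12, 'ram_gb': 48}
print(three_g5 == g6_blade)  # True: one G6 blade matches three G5 blades
```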

But the big question remains — did we lose anything in terms of reliability? Statistically, of course, a 3-way platform is less likely to fail than a 1-way platform. But is that a 0.001% chance versus a 0.0001% chance? And if so, how much is that worth — in money and in time — to your organization?
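To put rough numbers on that intuition (and these failure rates are purely illustrative assumptions, not measurements): if each server independently has a one-in-a-thousand chance of catastrophic failure in a given year, the three-way cluster’s numbers look like this:

```python
# Illustrative only: assume each server independently has a 0.1% chance
# of catastrophic failure in a given year. Real-world rates vary, and
# correlated failures (power, storage, network) are ignored here.
p_single = 0.001

# The 3-way load-balanced platform is down only if all three nodes fail.
p_cluster = p_single ** 3

print(f"single server: {p_single:.4%} yearly failure chance")    # 0.1000%
print(f"3-way cluster: {p_cluster:.10%} yearly failure chance")  # 0.0000001000%
```

Whether closing the gap between an already-tiny number and an even tinier one justifies the extra servers and the extra complexity is exactly the question.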

I would argue that you are gaining very little while giving up quite a lot. In other words, I would be happy to see the 3-way SQL Server platforms replaced with 1-way platforms that utilize a single, high-capacity server. Especially when you consider that the actual SQL Server data lives on shared (in our case, NetApp) storage, which further reduces the chance of failure versus a generic RAID array (such as the blade server’s internal drives.)

To me, this comes back full-circle to my initial question about chasing paradigms. The industry’s pursuit of Moore’s Law has given us incredibly powerful (and reliable) hardware for what is really very little cost. Put very plainly, I don’t think you need to specifically chase either the performance or the high-availability paradigm in any way other than choosing intelligently when you design your platform. Choose a reputable vendor, multiple CPUs, multiple NICs, and HDDs in RAID, and I really, really don’t think that network load-balancing or failover (either at the hardware level or the network level) is worthwhile anymore for most organizations. (Obviously if you’re incredibly large, like say Facebook, or have a huge dataset, like say ESPN, you have very different concerns.)

Put another way: choose your servers right the first time and, given the state of today’s technology, I don’t think you need to complicate your architecture to chase either performance or high-availability concerns. Maybe I have more faith in modern hardware than most administrators, but I really think today’s hardware is so reliable that a lot of traditional concerns have simply been mitigated. Surely I can’t be the only one who’s noticed, though?