Bitpushr's Blog

I push bits around.

Backups, virtualization and shared storage

Backup operations are often dictated by corporate data retention policies, such as: “We must have a recovery window of N days, and we must store the backups off-site.” It is rare, in my experience, for an organization to actually specify the required media (e.g., tape) or where the off-site location needs to be. Typically, though, backups are written to tape on-site and then moved to an off-site location soon after.

With the adoption of virtualized environments (that live on shared storage), these operations took an odd turn. Many organizations are still backing up their virtual machines via agent/server backup applications: EMC/Legato NetWorker, IBM Tivoli, Symantec BackupExec, etc. And they run the same way on virtual machines as they do on physical machines: you install an agent on the (virtual machine) client, and the backup server connects to that agent and sucks a (full) copy out of the virtual machine. You may do this full every night, or incrementals, but you’re still accomplishing the same thing — you’re taking the contents of the VM, which lives on shared storage, out through the VM itself.

My question is: why not just back up the virtual machine where it sits? Why not just tell your Filer to back it up primarily? A .vmdk is the same thing no matter where or how it sits.

A couple of weeks ago I had a customer (a very big organization) that remarked that the single biggest workload in their virtualized environment was backups. That is, their Filers see the most IOps and their switches see the most traffic at 4am every morning when they kick off their system backups. This is because each VM is backed up via an agent — and the backup server pulls the entire contents of the virtual machine out through the virtual machine itself.

This problem isn’t limited to virtual machines, either. Many organizations that deploy SQL Server (or, to a lesser extent, Oracle) are running full backups of the actual SQL databases, even though those databases are living on shared storage. Again, while this is feasible and effective, it is hardly efficient.

My question is: why not just back up the SQL Server (or Oracle) database where it sits? Why not just tell your Filer to back it up primarily? An SQL database is the same no matter where it sits.

In the case of NetApp (my employer), it doesn’t matter if you’re using Oracle Database over NFS on Linux or SQL Server over FC on Windows: your database is living in WAFL, in a whole big series of 4KB blocks. And because we can take snapshots of anything stored on WAFL no matter how you access it, why not just take a snapshot of the database? It’s just a bunch of blocks, right? After all, the Filer doesn’t care if it’s a chunk of SQL or a chunk of VMDK or a chunk of Word document. Blocks are blocks are blocks, and if it’s in a block we can snap it (and we can dedupe it).
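To make the “blocks are blocks” idea concrete, here is a minimal Python sketch of a toy block store. It is purely illustrative and is not how WAFL is implemented: the class name, the file names and the use of SHA-256 as a block fingerprint are all assumptions made for the example. It shows two things from the paragraph above: a snapshot only needs to copy block pointers, and identical 4KB blocks only need to be stored once.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, as in the text


class ToyBlockStore:
    """Toy content-addressed block store; an illustration, not WAFL."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> block bytes (each unique block once)
        self.files = {}       # file name -> ordered list of fingerprints
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer map

    def write(self, name, data):
        fingerprints = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fingerprint, block)  # identical blocks dedupe
            fingerprints.append(fingerprint)
        self.files[name] = fingerprints

    def snapshot(self, snap_name):
        # Copies only the block pointers, never the block data itself.
        self.snapshots[snap_name] = {n: list(f) for n, f in self.files.items()}

    def read(self, name, snap_name=None):
        pointer_map = self.snapshots[snap_name] if snap_name else self.files
        return b"".join(self.blocks[f] for f in pointer_map[name])


store = ToyBlockStore()
store.write("vm.vmdk", b"A" * BLOCK_SIZE * 3)                    # three identical blocks
store.write("sales.mdf", b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)  # shares the "A" block
store.snapshot("nightly")
store.write("sales.mdf", b"C" * BLOCK_SIZE * 2)                  # overwrite after the snapshot

print(len(store.blocks))                        # number of unique blocks stored
print(store.read("sales.mdf", "nightly")[:1])   # the snapshot still sees the old data
```

The store has no idea whether a block belongs to a virtual disk or a database file, which is the point: the snapshot and the dedupe work the same way for both.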

So why not just take a snapshot of it? One common objection is “Well, a snapshot is fine, but I have to store them off-site”. Okay, cool, I understand — I was the same way. So let’s take that snapshot and move it off-site; to another Filer in a different building or a different city or a different country. “Well, that sounds okay, but my auditor tells me it has to be read-only”. Okay, cool, I understand — I was the same way. So let’s take that snapshot and lock it as read-only. “Well, that sounds okay too, but my CIO tells me it has to live for 7 years.” Okay, cool, I understand — I was the same way. So let’s take that snapshot and vault it.

What I’m trying to get at is, the policies you’re living inside of (in this case, backup policies) shouldn’t dictate the technologies you use. Just because you need off-site backups doesn’t oblige you to use tape. You should, under any policy at any organization, use the best tool or technology for the job. In the case of virtualized environments living on shared storage, what is the best tool? What is the best technology? If the data are blocks living on a Filer somewhere, why not just back them up where they belong: on the Filer itself?

Why cache is better than tiering for performance

When it comes to adjusting (either increasing or decreasing) storage performance, two approaches are common: caching and tiering. Caching refers to a process whereby commonly-accessed data gets copied into the storage controller’s high-speed, solid-state cache. Therefore, a client request for cached data never needs to hit the storage’s disk arrays at all; it is simply served right out of the cache. As you can imagine this is very, very fast.

Tiering, in contrast, refers to the movement of a data set from one set of disk media to another; be it from slower to faster disks (for high-performance, high-importance data) or from faster to slower disks (for low-performance, low-importance data). For example, you may have your high-performance data living on a SSD, FC or SAS disk array, and your low-performance data may only require the IOps that can be provided by relatively low-performance SATA disks.

Both solutions have pros and cons. Cache is typically less configurable by the user, as the cache’s operation will be managed by the storage controller. It is considerably faster, as the cache will live on the bus — it won’t need to traverse the disk subsystem (via SAS, FCAL etc.) to get there, nor will it have to compete with other I/O along the disk subsystem(s). But, it’s also more expensive: high-grade, high-performance solid-state cache memory is more costly than SSD disks. Last but not least, the cache needs to “warm up” in order to be effective — though in the real world this does not take long at all!

Tiering’s main advantage is that it is more easily tunable by the customer. However, all is not simple: in complex environments, tuning the tiering may literally be too complex to bother with. Also, manual tiering relies on you being able to predict the needs of your storage and adjust the tiering accordingly: how do you know today which business application will hit the storage hardest tomorrow? Again, in complex environments, this relatively simple question may be decidedly difficult to answer. On the positive side, tiering offers more flexibility in terms of where you put your data. Cache is cache, regardless of environment; data is either in the cache or it’s not. Tiering, by contrast, lets you take advantage of more types of storage: SSD or FC or SAS or SATA, depending on your business needs.

But if you’re tiering for performance (which is the focus of this blog post), then you have to deal with one big issue: the very act of tiering increases the load on your storage system! Tiering actually creates latency as it occurs: in order to move the data from one storage tier to another, we are literally creating IOps on the storage back-end in order to accomplish the performance increase! That is, in order to get higher performance, we’re actually hurting performance in the meantime (i.e., while the data is moving around.)

In stark contrast, caching reduces latency and increases throughput as it happens. This is because the data doesn’t really move: the first time data is requested, a cache request is made (and misses, since it’s not in the cache yet) and the data is served from disk. On its way to the customer, though, the data is copied into the cache, where it will stay for a while. If it’s requested again, another cache request is made (and hits, since the data is already in the cache) and the data is served from cache. And it’s served fast!
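The miss-then-hit behavior described above can be sketched in a few lines of Python. This is a generic read-through cache with least-recently-used eviction, not a model of any particular storage controller; the class, the capacity of two entries and the fake `read_from_disk` function are all assumptions for illustration.

```python
from collections import OrderedDict


class ReadCache:
    """Minimal read-through LRU cache in front of a slow backing store."""

    def __init__(self, backing_read, capacity):
        self.backing_read = backing_read   # function: block id -> data (the "disk")
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(block_id)   # mark as recently used
            return self.entries[block_id]
        self.misses += 1
        data = self.backing_read(block_id)       # first request is served from disk
        self.entries[block_id] = data            # ...and kept in cache on its way out
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used entry
        return data


disk_reads = []


def read_from_disk(block_id):
    # Stand-in for a slow disk read; records every request that reaches "disk".
    disk_reads.append(block_id)
    return f"data-{block_id}"


cache = ReadCache(read_from_disk, capacity=2)
for block_id in [1, 2, 1, 1, 3, 2]:
    cache.read(block_id)

print(cache.hits, cache.misses, disk_reads)
```

Every hit is a request that never touched the disks at all, and nothing had to be migrated ahead of time to make that happen. The last read of block 2 misses because the tiny cache had already evicted it, which is the “warm up” and sizing trade-off in miniature.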

(It’s worthwhile to note that NetApp’s cache solutions actually offer more than “simple” caching: we can even cache things like file/block metadata. And customers can tune their cache to behave how they want it.)

Below is a graph from a customer’s benchmark. It was a large SQL Server read, but what is particularly interesting is the behavior of the graph: throughput (in red) goes up while latency (in blue) actually drops!

If you were seeking a performance boost via tiering, there would be two possibilities. If your data were already tiered, throughput would go up while latency stayed the same. If your data weren’t already tiered, throughput would decrease and latency would increase while the data was being tiered; only after the tiering completed would you actually see an increase in throughput.

For gaining performance in your storage system, caching is simply better than tiering.

Is VMware forcing the homogenization of enterprise storage?

Stephen Foskett wrote this article about possible futures of enterprise storage; in turn it was further examined by Scott Lowe here.

Stephen’s article posits this: is VMware forcing the homogenization of enterprise storage? (If it is, will that be at the expense of its corporate parent, EMC?) Is VMware’s progress in technologies like VMFS, snapshots etc. moving customers to make different purchasing decisions, away from VMware? My answer is: no. At least, not in a significant fashion.

In order for VMware’s storage offerings to prosper, they need to be able to supplant existing storage vendors that are not offering significant value-adds. Those can come in many forms, though: be they great hardware offerings, great software offerings, or perhaps even great price offerings. The companies that don’t offer anything strategic or unique are likely to go by the wayside because those are the companies whose products VMware can “replace” with its own offerings.

It’s important, then, to keep in mind the key differentiators of various storage vendors. If they have a product or a solution that fills a particular niche better than anyone else, they’re likely to survive (even against VMware as its own software vendor). However — like any performer in any market — if you make a poor product in a competitive industry, you’re going to be replaced. The only thing that makes VMware’s offerings uniquely potent is that they’re sometimes able to better-leverage the vSphere platform because they literally wrote the APIs that they’re hooking in to.

New job, new town, same look

As some of my more astute readers may know, I recently left Bowdoin College for a position down at NetApp in Waltham, MA. I’m a Systems Engineer in the Enterprise group; essentially I’ll be helping with both pre-sales consulting as well as post-sales review (with focuses on performance and VMware).

While I was sad to leave Bowdoin, I am super excited to be at my new position! It required a bit of a move down from Maine, as well as an environmental adjustment from academia to corporate, but so far I am having a great time.

I’ll be blogging here when I can, and (like previous posts) mostly focusing on issues surrounding VMware and performance. I have a couple of small updates in the pipeline; hopefully I’ll get to them this week!

Migrating vCenter Server from SQL Server Express to SQL Server Standard

Under vSphere 4, we were using SQL Server Express 2005; when we upgraded to vSphere 5 we kept the same database (even though vSphere 5 comes bundled with SQL Server Express 2008). However, we had long since surpassed 5 hosts, hence VMware suggested we migrate from SQL Server Express 2005 to SQL Server Standard 2008 R2. Here is a quick synopsis of how that happened:

Pre-flight:

  1. Stop all VMware services, in particular VMware vCenter
  2. Install SQL Server 2008 R2 Management Studio
  3. Perform a full backup of the VIM_VCDB database (I do this via SQL Server Management Studio)
  4. Uninstall VMware vCenter Server
  5. Move the backup somewhere, like C:\
  6. Uninstall SQL Server 2005

Installing SQL Server 2008 R2:

  1. Install SQL Server 2008 R2 Enterprise Edition
  2. Installation features:
    • Database Engine Services
    • Management Tools – Basic
    • Management Tools – Complete
  3. Service Accounts:
    • SQL Server Agent: NT_AUTHORITY\SYSTEM; startup type Automatic
    • SQL Server Database Engine: NT_AUTHORITY\SYSTEM; startup type Automatic
    • SQL Server Browser: NT_AUTHORITY\LOCAL S...; startup type Automatic

Restoring the vCenter Server DB:

  1. Launch SQL Server Management Studio & connect to your instance
  2. Right-click on Databases and choose Restore Database…
  3. Select the file and the database name (probably VIM_VCDB5 or something unique)
  4. Create a new, 64-bit System DSN and point it to the new database. Use SQL Server Native Client 10.0 as your driver
  5. Make sure the default database is VIM_VCDB5, not master!
  6. Start the SQL Server 2008 R2 agent, if it’s not already running

Now, you should be able to install vCenter Server 5. When prompted, select the new DSN you created, and make sure you use your existing database!

For those using VMware Update Manager, you will also need to re-create a 32-bit System DSN and point it to the same VIM_VCDB5 database. You create a 32-bit DSN by calling the 32-bit ODBC manager which is located at c:\windows\SysWOW64\odbcad32.exe. (You’ll still use SQL Server Native Client 10.0 as the driver, though.)

Fixing slow boot times with ESXi & NetApp iSCSI LUNs

I’ve been using a mix of iSCSI LUNs and NFS datastores since VMware ESX 3.5; I used them quite heavily in ESX & ESXi 4 without issue. Since moving to ESXi 5, though, I’ve noticed that my ESXi hosts are taking a long time to boot: more than 45 minutes! During the boot process, they’re hanging on this screen for the majority of that time:

You can see that the message is vmware_vaaip_netapp loaded successfully. I did some debugging work, and my boss chimed in with his suggestions, and we got the issue squared away this morning. I had narrowed it to the point where I could identify the cause of the symptoms: the presence of a dynamic iSCSI target. You can have as many NFS datastores as you want, and even as many iSCSI software HBAs as you want, but the moment you add a dynamic iSCSI target is the moment where you have issues — at least in our environment. What I mean by that is, we have a large number of server VLANs (several dozen) and our NetApp filers provide file services to almost all of those VLANs:

[root@palin ~]# rsh blender iscsi portal show
Network portals:
IP address        TCP Port  TPGroup  Interface
10.140.229.7         3260    2000    vif0-260
10.140.231.135       3260    2001    vif0-265
10.140.235.7         3260    2003    vif0-278
10.140.224.135       3260    2006    vif0-251
10.140.225.7         3260    2007    vif0-252
10.140.226.7         3260    2008    vif0-254
10.140.227.7         3260    2009    vif0-256
10.140.227.135       3260    2010    vif0-257
10.140.228.7         3260    2011    vif0-258
10.140.228.135       3260    2012    vif0-259
10.140.230.7         3260    2013    vif0-262
10.140.231.7         3260    2014    vif0-264
10.140.232.7         3260    2015    vif0-266
10.140.232.71        3260    2016    vif0-267
10.140.232.135       3260    2017    vif0-268
10.140.232.199       3260    2018    vif0-269
10.140.233.7         3260    2019    vif0-270
10.140.233.135       3260    2021    vif0-272
10.140.233.199       3260    2022    vif0-273
10.140.234.7         3260    2023    vif0-274
10.140.234.71        3260    2024    vif0-275
10.140.234.135       3260    2025    vif0-276
10.140.234.199       3260    2026    vif0-277
10.140.235.199       3260    2027    vif0-281

You can see that there are two dozen VIFs there, each on their own VLAN. In my case, I’m looking for the target that sits on vif0-265; I don’t care about any of the other targets. Trouble is, though, that my ESXi hosts only have VLAN 265 trunked to their VMkernels, hence the only target they can see is on VLAN 265. After I explained this to my boss, he said “I bet the Filer is enumerating all of those portals to the ESXi host, and 99% of them are timing out” (because they are inaccessible.)
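To see what the host is up against, here is a rough Python sketch that parses a listing in the format shown above and splits the portals into those the host can reach and those it cannot. The three sample rows are copied from the output above; the function names are made up for the example, and the idea that the host tries every advertised portal is the theory under discussion here, not a documented fact.

```python
# Three rows taken from the 'iscsi portal show' output above.
PORTAL_LISTING = """\
Network portals:
IP address        TCP Port  TPGroup  Interface
10.140.229.7         3260    2000    vif0-260
10.140.231.135       3260    2001    vif0-265
10.140.235.7         3260    2003    vif0-278
"""


def parse_portals(listing):
    """Return (ip, interface) pairs from 'iscsi portal show' style text."""
    portals = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have exactly four columns and a numeric TCP port;
        # the title and header lines do not, so they are skipped.
        if len(fields) == 4 and fields[1].isdigit():
            portals.append((fields[0], fields[3]))
    return portals


def split_by_reachability(portals, trunked_interfaces):
    """Separate portals on interfaces the host can see from the rest."""
    reachable = [ip for ip, iface in portals if iface in trunked_interfaces]
    unreachable = [ip for ip, iface in portals if iface not in trunked_interfaces]
    return reachable, unreachable


portals = parse_portals(PORTAL_LISTING)
reachable, unreachable = split_by_reachability(portals, {"vif0-265"})
print(len(reachable), len(unreachable))
```

Run against the full listing, the same split gives one reachable portal and 23 unreachable ones.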

Turns out, he was right! This is taken from the iscsi(1) man page:

* If a network interface is disabled for iSCSI use (via iscsi interface disable), then it is not accessible to any initiator regardless of any accesslists in effect.

Since we’re using initiator groups and not accesslists, this is our problem: the Filer is indeed enumerating every portal (all two dozen!) it has configured, even though our ESXi host is only trunked out to one of them. So, I’m waiting for 23 separate connections to time out so that 1 connection can work. So we came up with this:

[root@palin ~]# rsh blender iscsi interface accesslist add iqn.1998-01.com.vmware:londonderry-29a21d34 vif0-265
Adding interface vif0-265 to the accesslist for iqn.1998-01.com.vmware:londonderry-29a21d34

[root@palin ~]# rsh blender iscsi interface accesslist show
Initiator Name                      Access List
iqn.1998-01.com.vmware:londonderry-29a21d34    vif0-265

Now, because that accesslist restricts the initiator to that VIF, the Filer reports the target as present only on the correct VIF & VLAN (in this case, vif0-265). Problem solved! Now, all I have to do is go through and add the rest of my iSCSI initiator names to my Filers, and Robert will be my father’s brother.
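Adding the remaining initiators is repetitive enough to script. The Python sketch below only builds and prints the command strings, following the exact `iscsi interface accesslist add` form used above; it does not run anything. The first initiator name is the one from this post, while the second is a made-up placeholder, and the helper function is hypothetical.

```python
def accesslist_commands(filer, interface, initiators):
    """Build one 'accesslist add' command per initiator name (nothing is executed)."""
    return [
        f"rsh {filer} iscsi interface accesslist add {iqn} {interface}"
        for iqn in initiators
    ]


initiators = [
    "iqn.1998-01.com.vmware:londonderry-29a21d34",   # from the example above
    "iqn.1998-01.com.vmware:example-host-00000000",  # hypothetical placeholder
]

commands = accesslist_commands("blender", "vif0-265", initiators)
for command in commands:
    print(command)
```

Printing first and running later makes it easy to eyeball the list before touching the Filer.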

Deleting problem files from VMFS data stores

Following up from my issues yesterday, I had a bunch of files (old, bad snapshot deltas) that I needed to delete. Problem was, I couldn’t:

/vmfs/volumes/datastore/vm/badsnapshots # rm -rf *
rm: cannot remove 'foghorn-000101-ctk.vmdk': Invalid argument
rm: cannot remove 'foghorn-000101-delta.vmdk': Invalid argument

Try as I might, I couldn’t get rid of them. Via lsof I could see that the files weren’t locked; indeed, I was able to move them, just not delete them. So I cheated, by echoing a character to each file (to verify its sanity and update its mtime):

for f in *; do echo "a" > "$f"; done

Then, rm worked. Victory!

vSphere problems with vpxa on hosts

I had a very bizarre issue recently, where two of my 20 vSphere ESXi 5 hosts disconnected from their clusters. When I’d try and reconnect (or, remove & connect) the hosts from the clusters, I would get an error message saying the host couldn’t be added because timed waiting for vpxa to start. Bad grammar theirs, not mine!

After filing a support request with VMware, a very helpful engineer helped me determine the cause. Looking through the vpxa logs (/var/log/vpxa.log), he noticed that some virtual machines on each host had lots of snapshot files, and vCenter Server was having trouble managing that host. So, we enabled SSH on the problematic ESXi host, and took a look:

/vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654/Foghorn(Test BlackBaud DB) # ls
Foghorn-000001-ctk.vmdk    Foghorn-000041-delta.vmdk  Foghorn-000081.vmdk        Foghorn-000122-ctk.vmdk    Foghorn-000162-delta.vmdk  Foghorn-000202.vmdk
Foghorn-000001-delta.vmdk  Foghorn-000041.vmdk        Foghorn-000082-ctk.vmdk    Foghorn-000122-delta.vmdk  Foghorn-000162.vmdk        Foghorn-000203-ctk.vmdk

I cut that off, because there were more than 200 delta files for that VM! Obviously, the snapshot process had spun way out of control for this particular VM. It’s unclear why this happened, but removing those VMs from the host allowed me to add the hosts back to the cluster.

After that, I simply cloned the problematic VMs (which automatically flattens the snapshots) into new VMs and the problem was solved.
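A runaway snapshot chain like this is easy to spot if you count the delta files per VM. Here is a small Python sketch that does so from a list of file names; the sample names are taken from the directory listing above, while the threshold and the function names are arbitrary choices for the example. In practice the list would come from something like `os.listdir()` on the VM’s directory.

```python
import re
from collections import Counter

# Matches names such as 'Foghorn-000041-delta.vmdk' and captures the VM name.
DELTA_PATTERN = re.compile(r"^(?P<vm>.+)-(?P<seq>\d{6})-delta\.vmdk$")


def count_deltas(filenames):
    """Count snapshot delta files per VM name."""
    counts = Counter()
    for name in filenames:
        match = DELTA_PATTERN.match(name)
        if match:
            counts[match.group("vm")] += 1
    return counts


def runaway_vms(filenames, threshold):
    """Return VM names with more delta files than the threshold."""
    return sorted(vm for vm, n in count_deltas(filenames).items() if n > threshold)


listing = [
    "Foghorn-000001-ctk.vmdk", "Foghorn-000001-delta.vmdk",
    "Foghorn-000041-delta.vmdk", "Foghorn-000041.vmdk",
    "Foghorn-000122-delta.vmdk", "Foghorn-000162-delta.vmdk",
]

print(count_deltas(listing))
print(runaway_vms(listing, threshold=3))
```

Only the `-delta.vmdk` files are counted; the `-ctk.vmdk` and descriptor files are ignored so each snapshot is counted once.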

The reality of economics in tech innovation

There are things in the economies of tech businesses that scale well (mass production; agglomeration of labor) and there are things that don’t scale well. Innovation is not something that scales well, and I will try to point out a few reasons why.

A few days ago HP announced that it is killing its tablet and spinning off WebOS. Perhaps the only surprising thing is that it happened so quickly; the HP tablet was only a few months old.

On the surface, you would imagine, as many people did, that a company as large, talented and wealthy as HP could actually pull off a device to compete with the Apple iPad. However, I think this was a poor assumption to begin with, for reasons I will go into in this post.

HP’s competitive advantage in the tech industry is that, up until maybe the last decade or so, it had the best engineers in the business. While that is arguably no longer true, it is important to note that it never really had a significant competitive advantage in the mobile/tablet market. The mobile/tablet market is dominated by Apple, of course, whose competitive advantage is simply selling great design and great UI. Apple have the most market share because they are obviously the best at it. When was the last time you said “This HP device is really great, and really easy to use?” I will go ahead and give you the answer, i.e. “Never”. They had tried mobile before, with the iPaq smartphones, and they sucked.

While HP’s lack of competitive advantage in the mobile/tablet sector doesn’t mean the HP tablet was destined to fail, it sure had a huge mountain to climb if it was to unseat the iPhone & iPad. Unfortunately for HP, they failed.

It is also worth noting that competitive advantages, like everything else in the world, are dynamic. Research In Motion (RIM) had a competitive advantage in building mobile devices that a) worked well with enterprise software and b) were very network-efficient. Now, neither is really an advantage, and RIM employees are starting to see the writing on the wall. RIM failed to notice that customers started wanting different things: as the mobile market expanded to “regular” consumers (as opposed to corporate consumers), people wanted things like cameras and MP3 players and didn’t care about “true” multitasking or many of the things RIM thinks they care about. RIM is wrong, and it will kill the company if they don’t reverse course.

Innovation can come out of nowhere, and it can disappear just as quickly. Perhaps the first great innovator in the mobile space was Nokia; now their products are openly mocked. They had the best technology at the time the technology was emerging; once the technology became common and cheap they were quickly displaced by other manufacturers who had better devices (Samsung, Motorola, etc.) Businesses cannot rest on their laurels and presume that their advantages will live forever.

BlackBoard & Oracle benchmarks on SSD storage

I spent Friday benchmarking two platforms against each other: our production BlackBoard instance (9.1 SP6) versus a development BlackBoard instance (also 9.1 SP6) that I spun up for this test. The unique thing about this test was that I put the development instance entirely on SSD storage.

Production instance:

  • BlackBoard 9.1 SP6 on VMware ESXi VM (Linux)
  • Oracle 10g R2 on HP BL465c G6 Blade (Linux)
  • BlackBoard on NetApp filer; 1 volume on a 45-disk RAID-DP aggregate
  • Oracle on local HP SAS RAID-1 disk set

Development instance:

  • BlackBoard 9.1 SP6 on VMware ESXi VM (Linux)
  • Oracle 10g R2 on VMware ESXi VM (Linux)
  • BlackBoard on FusionIO ioDrive SSD
  • Oracle also on (the same) FusionIO ioDrive SSD

With the exception of workloads (itself a significant difference, to be sure), I tried to keep everything else the same — same patches, same OS revision, same LDAP & SSL configurations, etc.

To test performance, I had a suggestion from Steve (@Seven_Seconds) at BlackBoard: hit the /webapps/login/ page with ab, the Apache benchmark tool.

The results were, well, strange. With each test run issuing 5000 requests to /webapps/login/, here is a graph:

NetApp v SSD

You can see that the mean response time of the SSD (blue line) equals or outperforms that of the NetApp (red line) at every concurrency value tested. What is particularly interesting, though, is the incredibly slow rate at which the standard deviation of the SSD (purple bars) grows; compare that with the rate at which the standard deviation of the NetApp’s response times (green bars) grows.

Right now, I don’t have a clear answer as to why the deviations from the mean grow so differently. The NetApp filer has a significantly different workload to the SSD (the SSD has no workload other than this test), but the exponential growth of the NetApp’s standard deviation is something I’ll need to investigate further.

Last but not least, there were no SSD-specific tuning options put in place, nor were there any NetApp-specific tuning options. This was as vanilla as both installs could get in order to keep everything relatively comparable.
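For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch of the two pieces involved: building the ab command line for a given concurrency, and summarizing a set of response times as a mean and standard deviation. The hostname, the concurrency value and the timing samples are all made up for illustration and are not results from this test; only ab’s `-n` (number of requests) and `-c` (concurrency) options and the /webapps/login/ path come from the setup described above.

```python
import statistics


def ab_command(url, requests, concurrency):
    """Build an Apache benchmark command line (nothing is executed)."""
    return f"ab -n {requests} -c {concurrency} {url}"


def summarize(samples_ms):
    """Return (mean, population standard deviation) of response times in ms."""
    return statistics.mean(samples_ms), statistics.pstdev(samples_ms)


# Hypothetical hostname; the path is the one used in the test above.
print(ab_command("http://blackboard.example.edu/webapps/login/", 5000, 10))

# Made-up samples: both sets have the same mean but very different spread.
steady = [100, 102, 98, 100]
erratic = [40, 160, 20, 180]

steady_mean, steady_sd = summarize(steady)
erratic_mean, erratic_sd = summarize(erratic)
print(steady_mean, round(steady_sd, 2))
print(erratic_mean, round(erratic_sd, 2))
```

The two sample sets are a reminder of why the deviation matters as much as the mean: users experience the slow outliers, not the average.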
