Differences between NetApp Flash Cache and Flash Pool

NetApp’s Flash Cache product has been around for several years; the Flash Pool product is much newer. While there is a lot of overlap between the two products, there are distinct differences as well. The goal of both technologies is the same: to accelerate reads by serving data from solid-state memory instead of spinning disk. But although they produce the same end result, they are built differently and they operate differently.

First of all, why do we bother with read (and write) caching? Because there’s a significant speed differential depending on the source from which we serve data. RAM is much faster than SSD, and SSD is much faster than rotating disk. This graphic illustrates the paradigm:

[Image: why-cache (relative speeds of RAM, SSD and rotating disk)]

(Note that we can optimize the performance of 100%-rotating disk environments with techniques like read-aheads, buffering and the like.)

Sequential reads can be read off rotating disks extremely quickly and very rarely need to be accelerated by solid-state storage. Random reads are much better served from Flash than from rotating disks, and very recent random reads can be served from buffer cache RAM (i.e., the controller’s RAM), which is an order of magnitude faster again. Although Flash Pool and Flash Cache both provide a caching mechanism for those random reads (and some other I/O operations), they do so in different ways.
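To make that gap concrete, here is a toy comparison in Python using rough, commonly-quoted ballpark latencies; the numbers are illustrative orders of magnitude only, not measurements from any particular system:

# Very rough ballpark access latencies, for illustration only
latency_s = {
    "controller RAM (buffer cache)": 100e-9,   # on the order of 100ns
    "flash / SSD":                   100e-6,   # on the order of 100us
    "rotating disk (random read)":    10e-3,   # on the order of 10ms
}

baseline = latency_s["rotating disk (random read)"]
for medium, lat in latency_s.items():
    print(f"{medium}: roughly {baseline / lat:,.0f}x the speed of a random disk read")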

Here is a quick overview of some of the details that come up when comparing the two technologies:

Detail                        | Flash Cache                                | Flash Pool
------------------------------|--------------------------------------------|------------------------------------------
Physical entity               | PCI-e card                                 | SSD drive
Physical location             | Controller head                            | Disk shelf
Logical location              | Controller bus                             | Disk stack/loop
Logical accessibility         | Controller head                            | Disk stack/loop
Cache mechanism               | First-in, first-out                        | Temperature map
Cache persistence on failover | Requires re-warm                           | Yes
Cache data support            | Reads                                      | Reads, random overwrites
Cache metadata support        | Yes                                        | Yes
Cache sequential data support | Yes, with lopri mode                       | No
Infinite Volume support       | Yes                                        | No
Management granularity        | System-level for all aggregates & volumes  | Aggregate-level with per-volume policies
32-bit aggregate support      | Yes                                        | No
64-bit aggregate support      | Yes                                        | Yes
RAID protection               | N/A                                        | RAID-4, RAID-DP (recommended)
Minimum quantity              | 1 PCI-e card                               | 3 SSD drives
Minimum Data ONTAP version    | 7.3.2                                      | 8.1.1
Removable                     | Yes                                        | Yes, but aggregate must be destroyed

The most basic difference between the two technologies is that Flash Cache is a PCI-e card (or cards) that sits in the NetApp controller, whereas Flash Pool is made up of SSD drives that sit in NetApp shelves. This gives rise to some important points. First, Flash Cache memory accessed via the PCI-e bus will always have a much higher potential for throughput than Flash Pool memory that sits on a SAS stack. Second, the two architectures result in different means of logical accessibility: with Flash Cache, any and all aggregates that sit on a controller with a Flash Cache card can be accelerated; with Flash Pool, any and all aggregates that sit in a disk stack (or loop) with Flash Pool SSDs can be accelerated. It also means that in the event of planned downtime (such as a takeover/giveback), the Flash Cache card’s cache data can be copied to the HA pair’s partner node (“rewarming”), but in the event of unplanned downtime (such as a panic), the Flash Cache card’s cache data will be lost. The corollary is that if a controller with Flash Pool fails, the aggregate fails over to the partner controller and the Flash Pool remains accessible there.

Another important difference is Flash Pool’s ability to cache random overwrites. First, let’s define what we’re talking about: a random overwrite is a small, random write to a block (or blocks) that was recently written to HDD and is now being overwritten. The fact that we’re talking only about random overwrites is important, because Flash Pool does not accelerate traditional, sequential writes; Data ONTAP is already optimized for writes. Rather, Flash Pool caches these random overwrites by allowing the Consistency Points (CPs) that contain them to be written into the SSD cache. Writing to SSD is up to 2x quicker than writing to rotating disk, which means the Consistency Point completes faster. Random overwrites are the most expensive (i.e., slowest) operation we can perform against a rotating disk, so it’s in our interest to accelerate them.

Although Flash Cache cannot cache random overwrites, it can function as a write-through cache via its lopri mode. In some workloads, recently-written data needs to be read back immediately after being written (the so-called “read after write” scenario). For these workloads, Flash Cache can improve performance by caching recently-written blocks rather than seeking the rotating disks for that recently-written data. Note that we are not writing the client data into Flash Cache as its primary destination; rather, we are writing the data to rotating disk through the Flash Cache, hence the term write-through cache. Flash Cache serves as a write-through cache under two scenarios: if lopri is disabled, it does so only while the cache is less than 70% full; if lopri is enabled, it always does so. Enabling lopri mode also causes Flash Cache to cache sequential reads, whereas Flash Pool has no mechanism for caching sequential data.

Continuing this exercise are the differences between how Flash Cache and Flash Pools actually cache data. Flash Cache is populated with blocks of data that are being evicted from the controller’s primary memory (i.e. its RAM) but are requested by client I/O. Flash Cache utilizes a first-in, first-out algorithm: when a new block of data arrives in the Flash Cache, an existing block of data is purged. When the next new block arrives, our previous block is now one step closer to its own eviction. (The number of blocks that can exist in cache depends on the size of your Flash Cache card.)
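As a toy illustration only (this is not NetApp’s implementation, and the slot count is invented), first-in, first-out behaves roughly like this in Python:

from collections import OrderedDict

class FifoReadCache:
    """Toy FIFO cache: the oldest inserted block is always the next one evicted."""

    def __init__(self, slots):
        self.slots = slots              # capacity in blocks (invented number)
        self.blocks = OrderedDict()     # block_id -> data, kept in insertion order

    def insert(self, block_id, data):
        """Called when a block falls out of controller RAM but is still wanted by clients."""
        if block_id in self.blocks:
            return                      # already cached; FIFO order is not refreshed
        if len(self.blocks) >= self.slots:
            self.blocks.popitem(last=False)   # evict the oldest block, however popular it is
        self.blocks[block_id] = data

    def read(self, block_id):
        """Serve from cache if present; a hit does not move the block in the queue."""
        return self.blocks.get(block_id)

cache = FifoReadCache(slots=4)
for i in range(6):
    cache.insert(i, b"...")
print(list(cache.blocks))               # [2, 3, 4, 5]: blocks 0 and 1 were pushed out first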

Flash Pool is initially populated the same way: when blocks of data match the cache insertion policy and have been evicted from the controller’s primary memory but are requested by client I/O. Flash Pool utilizes a temperature map algorithm: when a new block of data arrives in the Flash Pool, it is assigned a neutral temperature. Data ONTAP keeps track of the temperature of blocks by forming a heat map of those blocks.

This is what a temperature map looks like, and shows how a read gets evicted:

[Image: read-cache-mgmt (temperature map for cached reads)]

You can see the read cache process: a block gets inserted and labeled with a neutral temperature. If the block gets accessed by clients, the scanner sees this and increases the temperature of the block — meaning it will stay in cache for longer. If the block doesn’t get accessed, the scanner sees this and decreases the temperature of the block. When it gets cold enough, it gets evicted. This means that we can keep hot data while discarding cold data. Note that the eviction scanner only starts running when the cache is at least 75% utilized.

This is how a random overwrite gets evicted:

[Image: write-cache-mgmt (temperature map for cached random overwrites)]

The hotter the block, the farther right on the temperature gauge and the longer the block is kept in cache. Again, you can see the process: a block gets inserted and labeled with a neutral temperature. If the block gets randomly overwritten again by clients, the block is re-inserted and labeled with a neutral temperature. Unlike a read, a random overwrite cannot have its temperature increased — it can only be re-inserted.
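Pulling the read and overwrite behavior together, here is a minimal Python sketch of a temperature-map cache as described above; the temperature scale, thresholds and scan cadence are internal to Data ONTAP, so the values below are invented purely for illustration:

class TempMapCache:
    """Toy temperature-map cache: hot blocks stay, cold blocks are evicted."""

    NEUTRAL, HOT_MAX, COLD = 0, 3, -2   # invented temperature scale

    def __init__(self, slots):
        self.slots = slots
        self.temp = {}                  # block_id -> temperature

    def insert_read(self, block_id):
        """A read block enters the cache at a neutral temperature."""
        self.temp.setdefault(block_id, self.NEUTRAL)

    def insert_overwrite(self, block_id):
        """A random overwrite is (re-)inserted at neutral; it never gets hotter."""
        self.temp[block_id] = self.NEUTRAL

    def read_hit(self, block_id):
        """Reads warm a block, so it stays in cache longer."""
        if block_id in self.temp:
            self.temp[block_id] = min(self.temp[block_id] + 1, self.HOT_MAX)

    def scan(self):
        """Eviction scanner: only runs once the cache is about 75% utilized.
        Every block cools off each pass; anything cold enough is evicted."""
        if len(self.temp) < 0.75 * self.slots:
            return
        for block_id in list(self.temp):
            self.temp[block_id] -= 1
            if self.temp[block_id] <= self.COLD:
                del self.temp[block_id]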

Note that I mentioned Flash Pool actually has RAID protection. This is to ensure the integrity of the cached data on disk, but it’s important to remember that read cache data is just that – a cache of read data. The permanent copy of the data always lives on rotating disks. If your Flash Pool SSD drives all caught fire at the same time, the read cache would be disabled but you would not lose data.


Deep diving into NetApp NFS operations — file operations

Following on from my last post about NFS operation types, I thought I’d show some basic filesystem commands and then use tcpdump to illustrate what operations they generate on a Filer. Here are some simple commands and their relevant output. Note that I will omit the non-NFS traffic from the tcpdump output. The client is a 32-bit RHEL VM, and the Filer in question is a NetApp FAS3240 running Data ONTAP 8.1.2 7-mode. It’s important to note that, despite the existence of NFS RFCs and associated standards, NFS servers (and clients!) vary in their behavior in the real world. That said, to keep things simple, I am only specifying a handful of mount options:

charles-e0a:/vol/blog	/mnt/blog	nfs	noac,vers=3,tcp,soft,bg	0 0

Now let’s jump into things! First, we’ll mount the /mnt/blog filesystem and see what happens (mount command omitted):

[Image: tcpdump output for mount]

Not much to see when mounting the filesystem — just a bunch of acks. However, if we wait a little while, Linux appears to look at the new filesystem of its own volition:

[Image: tcpdump output for mount follow-up]

Note the appearance of two new operations: fsinfo and fsstat. From my previous post, you’ll remember that these commands – not surprisingly, given their names – return information about filesystems. Now, let’s take the next step and cd into the NFS mount:

[Image: tcpdump output for cd]

Nothing. OK, let’s try ls:

[Image: tcpdump output for ls]

Interesting — we see a repeat of the fsstat operation, along with a couple of new operations: getattr and access. getattr returns file attributes (such as ownership and timestamps), and access returns file access permission information. Note that, although we’re only running ls once, we’ve seen a total of 5 operations travel along the wire to our NFS server. Let’s try and create an empty file with touch:

[Image: tcpdump output for touch]

More operations involving getattr and access, but interestingly no read or write traffic. Is that because the file is empty? Let’s write some data to it:

[Image: tcpdump output for echo]

Even more operations! Now we’re up to a handful. We have the previously-seen getattr and access, along with some new ones: lookup, create and setattr. (I didn’t highlight every operation as it would have made the graphic a little unwieldy.) As their names suggest, lookup looks up a file’s handle, create creates a file and setattr is used to set a file’s attributes.

One curious thing to note is that, right after setting an attribute with a setattr, we go ahead and read it right back with getattr. Also curious is that we still haven’t seen the write operation.

Let’s try writing to a new file:

[Image: tcpdump output for echo to a new file]

Finally, we see a write operation — highlighted in red for your viewing pleasure. But we still see a plethora of other operations that we’ve seen before, all to create and write to a single file: getattr, setattr, access, lookup and create.

Now that we have two files in our NFS mount, let’s run an ls and see what we see:

[Image: tcpdump output for ls]

We see all the operations we’ve seen before, and a new operation called readdirplus. As mentioned in my previous post, readdirplus returns the names of the files in a directory and their attributes. Interestingly, we see the output of ls (files bar and foo) returned after four operations, but then we see another six operations (including readdirplus) occur after we’ve already seen the output.

Let’s append to a file that already has data in it:

[Image: tcpdump output for append]

Not a lot to see here, and no write command — just some access. Is Linux caching my request? Let’s try reading the file back to make sure the data is there:

[Image: tcpdump output for cat]

Huh! Before the content of the file is returned to us, we see three operations: getattr, access and write. Then we see the file’s content, and then we see another getattr and another access.

For the sake of being thorough, let’s run an ls -al to see if the NFS operations are any different to a regular ls:

[Image: tcpdump output for ls -al]

Yikes. Even though we have only two files in the directory, there are more than 20 operations before we even see the directory listing (which I cropped from my screenshot). Nothing new, though; just a lot of getattr and access.

Let’s copy a non-empty file and see what happens:

[Image: tcpdump output for cp]

We see largely the same output as our append process. There’s a whole lot of getattr and access, interspersed with a lookup — but no write! In order to see the write command, I’d have to cat the file to read its contents.

Let’s copy a file that should be too big to cache. We’ll use my favorite Linux calculator, bc, and copy it into our NFS mount:

[Image: tcpdump output for cp of bc]

By now, we expect to see something like the above: a bunch of metadata operations (access, getattr, lookup) followed by a bunch of writes. For the record, the bc binary is only ~70KB. Note that the write operations completed while the copy process was taking place – i.e., there is no filesystem caching going on here. Also note that the number of write operations we see depends on the NFS block size we’re using.
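As a rough back-of-the-envelope, the number of WRITE operations for a given file is approximately the file size divided by the negotiated write size (wsize). A quick Python sketch, with wsize values picked purely for illustration:

import math

def nfs_write_ops(file_bytes, wsize_bytes):
    """Approximate number of NFS WRITE calls needed to push file_bytes over a mount
    negotiated with the given wsize (ignores retransmits and COMMIT traffic)."""
    return math.ceil(file_bytes / wsize_bytes)

bc_binary = 70 * 1024                   # ~70KB, per the copy above
for wsize in (8192, 32768, 65536):
    print(wsize, nfs_write_ops(bc_binary, wsize))
# 8192 -> 9, 32768 -> 3, 65536 -> 2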

For something a little different, let’s try to create a new symbolic link (horse) to point to a file that already exists (donkey):

[Image: tcpdump output for ln -s]

Hmm. Nothing we haven’t seen before — which is weird, because I know a symlink operation exists in the NFS RFCs. Let’s try and look it up with ls -al:

[Image: tcpdump output for ls -al after the symlink]

Et voila! Buried in the output of ls -al, we see our new symlink operation, which is of course used to create symbolic links. Let’s try and follow the symlink:

[Image: tcpdump output for cat on the symlink]

Interesting. We see some typical getattr and access operations, but we do not see the readlink operation. What else can we do? Let’s try changing the permissions on bc:

[Image: tcpdump output for chmod]

Not a lot unusual there. We read the permissions attributes in (via access and getattr), then we set the permissions attributes via setattr. How about changing ownership?

[Image: tcpdump output for chown]

Not much difference — it operates pretty much the same way that chmod operates, i.e. the setting of an attribute. Now, let’s try deleting a file:

[Image: tcpdump output for rm]

Only one new operation here, and it’s not much of a surprise that it’s called remove, nor a surprise that its function is to remove a file 🙂 What if I remove a symbolic link? If you remember, horse is a symlink to donkey:

[Image: tcpdump output for removing the symlink]

Here we can see the remove operation doesn’t occur until we bother doing an ls. Last but not least, what about renaming a file?

[Image: tcpdump output for the rename]

Again, we see the operation (in this case, rename) after the ls. And that’s it for now! Up next: directories. As always, questions and comments and suggestions are welcome.

Understanding NFS operation types

During my day, I spend a lot of time analyzing NetApp performance statistics in an effort to help my customers better understand what their Filers are doing. Most of my customers prefer NFS, so that’s what I usually focus on. But just as every customer is different, so is every customer’s environment: they have different infrastructure to serve different data.

As the NFS article on Wikipedia explains, NFS traffic is actually made up of a variety of different operations, or “server procedures” in RFC language. There are reads, of course, and there are writes, but there are other things too: metadata needs to be looked up, file ownerships examined, directories enumerated and so forth. And you may well be surprised at how much of your Filer workload is spent doing “other” (i.e., neither reads nor writes) work.

Here is a graph of a Filer that does a lot of reads (blue) and writes (yellow), but look what else:

[Image: graph of NFS operations by type, including reads, writes and GETATTR]

That Filer does an average of 1,500 IOps just on one type of metadata operation – in this case, GETATTR. So what are the various types of operations, and what do they do? Let’s have a look at operations valid in NFSv1 and NFSv2. Note that not all operations are shown in the graph, as they rarely occur:

  • GETATTR: return the attributes of a file. E.g. file type, permission, size, ownership, last access time etc.
  • SETATTR: set the attributes of a file. E.g. permission, ownership, size, last access time etc.
  • STATFS: return the status of a file system. E.g., the output of df.
  • LOOKUP: lookup a file. E.g., when you open a file, you receive the file handle.
  • READ: read from a file.
  • WRITE: write to a file.
  • CREATE: create a file.
  • REMOVE: delete a file.
  • RENAME: rename a file.
  • LINK: create a hard link to a file.
  • SYMLINK: create a soft link to a file.
  • READLINK: read a symbolic link. E.g., find out where the link goes.
  • MKDIR: make a directory.
  • RMDIR: delete a directory.
  • READDIR: read a directory. E.g., the output of ls.

NFSv3 accepts the above operations, as well as:

  • ACCESS: check file access permissions.
  • MKNOD: create a UNIX special device.
  • READDIRPLUS: return the names of files in a directory and their attributes.
  • FSSTAT: return dynamic information about a filesystem. E.g., the filesystem mountpoint and capacity.
  • FSINFO: return static information about a filesystem. E.g., the maximum read & write size the filesystem supports.
  • PATHCONF: return POSIX.1 information about a file. E.g., whether or not the file has case sensitivity.
  • COMMIT: commit previous async writes to storage.

(I stole most of these summaries from W. Richard Stevens’ invaluable TCP/IP Illustrated, Volume 1. The NFS v3 RFC document is also very helpful.)
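As a quick reference, and as a sketch of how you might tally an operation mix like the GETATTR-heavy graph above, here are the NFSv3 procedures keyed by their RFC 1813 procedure numbers (Python; the sample counts are invented):

# NFSv3 procedure numbers, per RFC 1813
NFSV3_PROCS = {
    0: "NULL",      1: "GETATTR",      2: "SETATTR",  3: "LOOKUP",
    4: "ACCESS",    5: "READLINK",     6: "READ",     7: "WRITE",
    8: "CREATE",    9: "MKDIR",       10: "SYMLINK", 11: "MKNOD",
   12: "REMOVE",   13: "RMDIR",       14: "RENAME",  15: "LINK",
   16: "READDIR",  17: "READDIRPLUS", 18: "FSSTAT",  19: "FSINFO",
   20: "PATHCONF", 21: "COMMIT",
}

def metadata_share(op_counts):
    """Fraction of operations that are neither READ nor WRITE."""
    total = sum(op_counts.values())
    other = sum(n for op, n in op_counts.items() if op not in ("READ", "WRITE"))
    return other / total if total else 0.0

# Invented sample mix, loosely echoing the graph: plenty of GETATTR alongside reads and writes
sample = {"READ": 3000, "WRITE": 1200, "GETATTR": 1500, "ACCESS": 400, "LOOKUP": 250}
assert set(sample) <= set(NFSV3_PROCS.values())      # every op name is a real NFSv3 procedure
print(f"{metadata_share(sample):.0%} of operations are metadata")   # ~34%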

READDIRPLUS deserves special mention because it’s one of the most useful metadata operations available in NFSv3. Hold on for my next post, where I’ll introduce tcpdump outputs.

Getting started with Data ONTAP cluster-mode

First things first: this post (and subsequent posts) are built off the backs of many great NetApp employees, especially the PS organization and the cmode-ses mailing list. In particular I’d like to thank Luke Mun, Doug Moore, Barry Spencer and Pavel Sobiezczuk.

This post will describe how to convert a system that’s currently running Data ONTAP 8.1 7-mode into 8.1 cluster-mode. If you’re using 8.0 7-mode, I strongly recommend upgrading to 8.1 7-mode before the conversion: there’ll be fewer steps involved. I’ll assume you’re using a Cisco Nexus 55xx switch as the private cluster interconnect. While this isn’t required per se, it is the only option that’s properly supported.

The Cisco switch setup is actually quite simple. You can get the full, NetApp-approved configuration from the NetApp support site here. Here are the relevant global options for the switch(es):

feature lacp

errdisable recovery interval 30
errdisable recovery cause pause-rate-limit

policy-map type network-qos cluster
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos cluster

spanning-tree port type edge default
port-channel load-balance ethernet source-dest-port

You can see that we’re not doing a whole lot: enabling LACP (the Link Aggregation Control Protocol), building a basic QoS policy and setting some error-recovery parameters. Now, here are the relevant port options for cluster nodes:

interface Ethernet1/1
  description Cluster Node 1
  no lldp transmit
  no lldp receive
  spanning-tree port type edge
  spanning-tree bpduguard enable

Here, you can see we’re disabling LLDP (the Link Layer Discovery Protocol) and configuring the switch ports as spanning-tree edge ports with BPDU guard enabled. Last but not least, if you’re connecting two Cisco Nexus switches together, here are the relevant options for the inter-switch link:

interface Ethernet1/13
  description Inter-Cluster Switch ISL Port 13 (Port-channel)
  no lldp transmit
  no lldp receive
  switchport mode trunk
  channel-group 1 mode active

Not much to see here, either: disable LLDP, set the port to trunk mode and note that the link is part of a channel group (i.e. an aggregated link). Once you’ve done the Cisco configuration, you’re off to the races!

From here, we boot the Filer and interrupt the boot process by hitting Ctrl-C, which drops us to the LOADER prompt. Enter these options to configure the Filer to boot cluster-mode now and in the future:

set-defaults
setenv bootarg.init.boot_clustered true
setenv bootarg.init.usebootp false
setenv bootarg.bsdportname e0a

After you’ve set these options, boot Data ONTAP by entering boot_ontap. Once Data ONTAP starts booting, hit Ctrl-C again to get to the installation options. You’ll see a menu like this:

(1) Normal Boot.
(2) Boot without /etc/rc.
(3) Change password.
(4) Clean configuration and initialize all disks.
(5) Maintenance mode boot.
(6) Update flash from backup config.
(7) Install new software first.
(8) Reboot node.

Select option (4) Clean configuration and initialize all disks. This will re-zero the disks, which can take an hour or two. After re-zeroing, your Filer will reboot automatically and proceed to the setup wizard.

And that’s enough for now!

Why efficiency matters: NetApp versus Isilon for scale-out

Disclaimer: I am a NetApp employee. The views I express are my own.

As the marketplace for scale-out offerings heats up, it’s interesting to see the different approaches that different companies take with their product development. The two leaders in scale-out NAS & SAN are NetApp and Isilon, and they both take rather different approaches to scale-out technology when it comes to performance. In this post, I will attempt to quantify the result of those differences: how NetApp does more, with less, than a comparable Isilon offering.

In terms of reference material, I’ll be drawing from the SPECsfs2008_nfs.v3 submissions from both NetApp and Isilon. NetApp’s 4-node FAS6240 submission is here, and Isilon’s 28-node S200 submission is here. First things first: I picked the two submissions that most closely matched one another in terms of performance. As you can see from the sfs2008 overview, there are a lot of submissions to choose from. I chose NetApp’s smallest published cluster-mode offering, and then looked for an Isilon submission that was roughly equivalent.

As part of this exercise, I put together list prices based on data taken from here (NetApp) and here (Isilon). I chose list prices because there is no standard discount rate from one vendor, or one customer, to another. If you have an updated list price sheet for either vendor, please let me know. Here are the results:

NetApp

  • 260,388 IOps
  • $1,086,808 list
  • 288 disks for 96TB usable

Isilon

  • 230,782 IOps
  • $1,611,932 list
  • 672 disks for 172TB usable

Doing some basic math, that’s $4.17 per IOp for NetApp and $6.98 per IOp for Isilon.
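For transparency, here is the arithmetic behind those figures, plus the per-disk numbers the next paragraph alludes to; everything comes straight from the submissions and list prices quoted above:

systems = {
    "NetApp FAS6240 (4 nodes)": {"iops": 260_388, "list_usd": 1_086_808, "disks": 288},
    "Isilon S200 (28 nodes)":   {"iops": 230_782, "list_usd": 1_611_932, "disks": 672},
}

for name, s in systems.items():
    print(f"{name}: ${s['list_usd'] / s['iops']:.2f} per IOp, "
          f"{s['iops'] / s['disks']:.0f} IOps per disk")
# NetApp FAS6240 (4 nodes): $4.17 per IOp, 904 IOps per disk
# Isilon S200 (28 nodes): $6.98 per IOp, 343 IOps per disk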

And to get a roughly equivalent level of performance, Isilon needs more than twice as many disks! That brings us full-circle to the point about efficiency: NetApp does more with less disk, which means we get significantly more performance per disk than Isilon does.

“But wait”, I hear you cry, “NetApp has FlashCache! That’s cheating because you’re driving up the IOps of the disks via cache!” It’s true – the submission did include FlashCache. 2TB in total; 512GB in each of the four FAS6240 controllers. Isilon’s submission had solid-state media of their own; 5.6TB in total from a 200GB SSD in each of the 28 nodes.

“But wait”, I hear you cry, “RAID-DP has all of that overhead! WAFL, snap reserve space; you must be short-stroking to get those numbers!” Wrong again – we’re using more space than the competition.

In order to meet roughly the same performance goal, Isilon needed to provision almost twice as much space as an equivalent NetApp offering. That’s hardly storage efficient. It’s not cost-efficient, either, because you have to buy all those spindles to reach a performance goal even if you don’t have that much data you need to run at that speed.

“But wait”, I hear you cry, “those NetApp boxes are huge! They must be chock-full of RAM. And CPUs. And NVRAM too!” True again; each NetApp controller has 48GB of RAM for a total of 192GB. By contrast, Isilon has 1344GB of RAM. Isilon does have slightly less NVRAM (14GB) compared to NetApp (16GB).

“But wait”, I hear you cry, “NetApp requires 10Gb Ethernet for that performance!”. Yes, and so does Isilon. Let’s look at efficiency not only for the nodes themselves, but also for the load-generating clients.

Although 10Gb Ethernet switch ports are coming down in price, they’re still not particularly cheap. And look at the client throughputs: Isilon struggled to get more than 10,000 IOps from each client, which means you have to scale out your client architecture as well. Which, of course, means more money.

“But wait”, I hear you cry, “NetApp is still going to use more power and space with their big boxes!” Not true. Here are the environmental specs for both:

Isilon did use less rack space (56RU) than NetApp (64RU). The environmental data were taken from here (Isilon) and here (NetApp).

Every single graph pictured above was compiled with data taken only from the three sources listed: the SPECSFS submissions themselves (found via Google), the list-price lists (found via Google) and the environmental submissions (found via Google). I will gladly provide the .xls file from which I generated the graphs if anyone’s interested.

Thoughts and comments are welcome!

Backups, virtualization and shared storage

The actual operation of backups is often dictated by corporate data retention policies, such as “We must have a recovery window of N days, and we must store the backups off-site.” It is rare, in my experience, for an organization to actually specify the required media (e.g., tape) or where the off-site location needs to be. Typically, though, backups are written to tape on-site and then moved to an off-site location soon after.

With the adoption of virtualized environments (that live on shared storage), these operations took an odd turn. Many organizations are still backing up their virtual machines via agent/server backup applications: EMC/Legato NetWorker, IBM Tivoli, Symantec BackupExec, etc. And they run the same way on virtual machines as they do on physical machines: you install an agent on the (virtual machine) client, and the backup server connects to that agent and sucks a (full) copy out of the virtual machine. You may do this full backup every night, or incrementals, but you’re still accomplishing the same thing — you’re taking the contents of the VM, which lives on shared storage, out through the VM itself.

My question is: why not just backup the virtual machine where it sits? Why not just tell your Filer to back it up primarily? A .vmdk is the same thing no matter where or how it sits.

A couple of weeks ago I had a customer (a very big organization) that remarked that the single biggest workload in their virtualized environment was backups. That is, their Filers see the most IOps and their switches see the most traffic at 4am every morning when they kick off their system backups. This is because each VM is backed up via an agent — and the backup server pulls the entire contents of the virtual machine out through the virtual machine itself.

This problem isn’t limited to virtual machines, either. Many organizations that deploy SQL Server (or, to a lesser extent, Oracle) are running full backups of the actual SQL databases, even though those databases are living on shared storage. Again, while this is feasible and effective, it is hardly efficient.

My question is: why not just backup the SQL Server (or Oracle) database where it sits? Why not just tell your Filer to back it up primarily? An SQL database is the same no matter where it sits.

In the case of NetApp (my employer), it doesn’t matter if you’re using Oracle Database over NFS on Linux or SQL Server over FC on Windows: your database is living in WAFL, in a whole big series of 4KB blocks. And because we can take snapshots of anything stored on WAFL no matter how you access it, why not just take a snapshot of the database? It’s just a bunch of blocks, right? After all, the Filer doesn’t care if it’s a chunk of SQL or a chunk of VMDK or a chunk of Word document. Blocks are blocks are blocks, and if it’s in a block we can snap it (and, we can dedupe it.)

So why not just take a snapshot of it? One common objection is “Well, a snapshot is fine, but I have to store them off-site”. Okay, cool, I understand — I was the same way. So let’s take that snapshot and move it off-site; to another Filer in a different building or a different city or a different country. “Well, that sounds okay, but my auditor tells me it has to be read-only”. Okay, cool, I understand — I was the same way. So let’s take that snapshot and lock it as read-only. “Well, that sounds okay too, but my CIO tells me it has to live for 7 years.” Okay, cool, I understand — I was the same way. So let’s take that snapshot and vault it.

What I’m trying to get at is that the policies you’re living inside of (in this case, backup policies) shouldn’t dictate the technologies you use. Just because you need off-site backups doesn’t mean you’re beholden to tape. You should, under any policy at any organization, use the best tool or technology for the job. In the case of virtualized environments living on shared storage, what is the best tool? What is the best technology? If the data are blocks living on a Filer somewhere, why not just back them up where they belong — on the Filer itself?

Why cache is better than tiering for performance

When it comes to adjusting (either increasing or decreasing) storage performance, two approaches are common: caching and tiering. Caching refers to a process whereby commonly-accessed data gets copied into the storage controller’s high-speed, solid-state cache. Therefore, a client request for cached data never needs to hit the storage’s disk arrays at all; it is simply served right out of the cache. As you can imagine this is very, very fast.

Tiering, in contrast, refers to the movement of a data set from one set of disk media to another; be it from slower to faster disks (for high-performance, high-importance data) or from faster to slower disks (for low-performance, low-importance data). For example, you may have your high-performance data living on a SSD, FC or SAS disk array, and your low-performance data may only require the IOps that can be provided by relatively low-performance SATA disks.

Both solutions have pros and cons. Cache is typically less configurable by the user, as the cache’s operation will be managed by the storage controller. It is considerably faster, as the cache will live on the bus — it won’t need to traverse the disk subsystem (via SAS, FCAL etc.) to get there, nor will it have to compete with other I/O along the disk subsystem(s). But, it’s also more expensive: high-grade, high-performance solid-state cache memory is more costly than SSD disks. Last but not least, the cache needs to “warm up” in order to be effective — though in the real world this does not take long at all!

Tiering’s main advantage is that it is more easily tunable by the customer. However, all is not simple: in complex environments, tuning the tiering may literally be too complex to bother with. Also, manual tiering relies on you being able to predict the needs of your storage and adjust the tiering accordingly: how do you know today which business application will be hit hardest tomorrow? Again, in complex environments, this relatively simple question may be decidedly difficult to answer. On the positive side, tiering offers more flexibility in terms of where you put your data. Cache is cache, regardless of environment; data is either in the cache or it’s not. Tiering, on the other hand, lets you take advantage of more types of storage: SSD or FC or SAS or SATA, depending on your business needs.

But if you’re tiering for performance (which is the focus of this blog post), then you have to deal with one big issue: the very act of tiering increases the load on your storage system! Tiering creates latency as it occurs: moving data from one storage tier to another generates IOps on the storage back-end. That is, in order to get higher performance later, we’re actually hurting performance in the meantime (i.e., while the data is moving around).

In stark contrast, caching reduces latency and increases throughput as it happens. This is because the data doesn’t really move: the first time data is requested, a cache request is made (and misses — it’s not in the cache yet) and the data is served from disk. On its way to the client, though, the data stays in the cache for a while. If it’s requested again, another cache request is made (and hits — the data is already in the cache) and the data is served from cache. And it’s served fast!
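A toy model of that flow (not any particular vendor’s implementation): the first request misses and is served from disk while populating the cache, and repeat requests hit the cache and never touch disk at all; a tiering move, by contrast, would have to generate extra back-end I/O before any request got faster:

class ReadThroughCache:
    def __init__(self, backend):
        self.backend = backend          # the slow "rotating disk" lookup
        self.cache = {}

    def get(self, key):
        if key in self.cache:           # hit: served from cache, no disk I/O at all
            return self.cache[key]
        data = self.backend(key)        # miss: served from disk...
        self.cache[key] = data          # ...and kept in the cache on the way through
        return data

disk_reads = 0
def disk(key):
    global disk_reads
    disk_reads += 1
    return f"data-for-{key}"

c = ReadThroughCache(disk)
for _ in range(3):
    c.get("block42")
print(disk_reads)                       # 1: only the first request touched disk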

(It’s worthwhile to note that NetApp’s cache solutions actually offer more than “simple” caching: we can even cache things like file/block metadata. And customers can tune their cache to behave how they want it.)

Below is a graph from a customer’s benchmark. It was a large SQL Server read, but what is particularly interesting is the behavior of the graph: throughput (in red) goes up while latency (in blue) actually drops!

If you were seeking a performance boost via tiering, there would have been two possibilities. If your data were already tiered, throughput would go up while latency remained the same. If your data weren’t already tiered, throughput would decrease and latency would increase while the data was being tiered; only after the tiering completed would you actually see an increase in throughput.

For gaining performance in your storage system, caching is simply better than tiering.