Characterizing workloads in Data ONTAP

One of the most important things when analyzing Data ONTAP performance is being able to characterize the workload that you’re seeing. This can be done a couple of ways, but I’ll outline one here. Note that this is not a primer on analyzing storage bottlenecks — for that, you’ll have to attend Insight this year! — but rather a way to see how much data a system is serving out and what kind of data it is.

When it comes to characterizing workloads on a system, everyone has a different approach — and that’s totally fine. I tend to start with a few basic questions:

  1. Where is the data being served from?
  2. How much is being served?
  3. What protocols are doing the most work?
  4. What kind of work are the protocols doing?

You can do this with the stats command if you know which counters to look for. If you find a statistic and you don’t know what it means, there’s a decent help system at hand in the form of stats explain counters:

# rsh charles "priv set -q diag; stats explain counters readahead:readahead:incore"
Counters for object name: readahead
Name: incore
Description: Total number of blocks that were already resident
Properties: delta
Unit: none

Okay, so that’s not the greatest explanation, but it’s a start! More commonly-used counters are going to have better descriptions.
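
If you’re not even sure which objects or counters exist in the first place, the stats command can enumerate them for you. From memory, on a 7-mode system the listing commands look something like this (check the stats man page for your release):

# rsh charles "priv set -q diag; stats list objects"
# rsh charles "priv set -q diag; stats list counters readahead"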

When it comes to analyzing your workload, a good place to start is the read_io_type counters in the wafl object. These statistics are not normally collected, so you have to start and stop counting them yourself. Here’s an example on a 7-mode system — I’m going to use rsh non-interactively because there are a lot of counters!

# rsh charles "priv set -q diag; stats show wafl:wafl:read_io_type"
wafl:wafl:read_io_type.cache:0%
wafl:wafl:read_io_type.ext_cache:0%
wafl:wafl:read_io_type.disk:0%
wafl:wafl:read_io_type.bamboo_ssd:0%
wafl:wafl:read_io_type.hya_hdd:0%
wafl:wafl:read_io_type.hya_cache:0%

Here is an explanation of these counters:

  • cache: reads served from the Filer’s RAM
  • ext_cache: reads served from the Filer’s Flash Cache
  • disk: reads served from any model of spinning disk
  • bamboo_ssd: reads served from any model of solid-state disk
  • hya_hdd: reads served from the spinning disks that are part of a Flash Pool
  • hya_cache: reads served from the solid-state disks that are part of a Flash Pool

Note that these counters are present regardless of your system’s configuration. For example, the Flash Cache counters will be visible even if you don’t have Flash Cache installed.
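
Because the read_io_type statistics aren’t collected by default, bracket the interesting period with stats start and stats stop, mirroring the perfstat-style capture shown below. The identifier after -I is an arbitrary name of your choosing:

# rsh charles "priv set -q diag; stats start -I readtypes"
(let the workload run for a while)
# rsh charles "priv set -q diag; stats stop -I readtypes" | grep read_io_type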

Now that you know where your reads are being served from, you might be curious how big the reads are and whether they’re random or sequential. Those statistics are available, but it takes a bit more grepping and cutting to get at them. Here’s a capture I took while driving I/O to my volume, sefiles. Note that I need to prefix this capture with priv set:

# rsh charles "priv set -q diag; stats start -I perfstat_nfs"
# rsh charles "priv set -q diag; stats stop -I perfstat_nfs"

Okay, so there’s a lot of output there. It’s data for the entire system. That said, it’s a great way to learn — you can see everything the system sees. And if you dump it to a text file, it will save you using rsh over and over again!
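
Something like this (plain shell redirection and grep on the admin host; nothing ONTAP-specific) captures the dump once and lets you mine it locally:

# rsh charles "priv set -q diag; stats stop -I perfstat_nfs" > /tmp/perfstat_nfs.txt
# grep "volume:sefiles" /tmp/perfstat_nfs.txt
# grep "nfsv3_read_latency_hist" /tmp/perfstat_nfs.txt

Let’s use rsh again and drill down to the volume stats, which is the most useful way to investigate a particular volume: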

# rsh charles "priv set -q diag; stats show volume:sefiles"
volume:sefiles:instance_name:sefiles
volume:sefiles:instance_uuid:a244800b-5506-11e1-91cc-123478563412
volume:sefiles:parent_aggr:aggr1
volume:sefiles:avg_latency:425.77us
volume:sefiles:total_ops:93/s
volume:sefiles:read_data:6040224b/s
volume:sefiles:read_latency:433.44us
volume:sefiles:read_ops:92/s
volume:sefiles:write_data:0b/s
volume:sefiles:write_latency:0us
volume:sefiles:write_ops:0/s
volume:sefiles:other_latency:12.17us
volume:sefiles:other_ops:1/s
volume:sefiles:nfs_read_data:6040224b/s
volume:sefiles:nfs_read_latency:433.44us
volume:sefiles:nfs_read_ops:92/s
volume:sefiles:nfs_write_data:0b/s
volume:sefiles:nfs_write_latency:0us
volume:sefiles:nfs_write_ops:0/s
volume:sefiles:nfs_other_latency:15.09us
volume:sefiles:nfs_other_ops:0/s

I’ve removed some of the internal counters for clarity. Now, let’s take a look at a couple of things: the volume name is listed as sefiles in the instance_name counter. During the time I was capturing statistics, the average number of operations per second was 93 ("total_ops"). Read latency ("read_latency") was 433us, which is 0.433ms. Notice that we did no writes ("write_ops") and essentially no metadata operations either ("other_ops").
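
One derived number I find handy: dividing read_data by read_ops gives the average read size, which for the capture above works out to roughly 64KB per read (6040224 ÷ 92 ≈ 65,655 bytes). A quick-and-dirty awk sketch, assuming the counter:value output format shown above:

# rsh charles "priv set -q diag; stats show volume:sefiles" | awk -F: '
    /:read_data:/ { gsub(/[^0-9]/, "", $4); data = $4 }
    /:read_ops:/  { gsub(/[^0-9]/, "", $4); ops = $4 }
    END { if (ops > 0) printf "average read size: %d bytes\n", data / ops }'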

If you’re interested in seeing what the system is doing in general, I tend to go protocol-by-protocol. If you grep for nfsv3, you can start to wade through things:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_latency_hist"
nfsv3:nfs:nfsv3_read_latency_hist.0 - <1ms:472
nfsv3:nfs:nfsv3_read_latency_hist.1 - <2ms:7
nfsv3:nfs:nfsv3_read_latency_hist.2 - <4ms:10
nfsv3:nfs:nfsv3_read_latency_hist.4 - <6ms:8
nfsv3:nfs:nfsv3_read_latency_hist.6 - <8ms:7
nfsv3:nfs:nfsv3_read_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_read_latency_hist.10 - <12ms:1

During our statistics capture, 472 requests were served in under 1ms, 7 requests between 1ms and 2ms, 10 requests between 2ms and 4ms, and so on. If you keep looking, you’ll see write latencies as well:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_write_latency_hist"
nfsv3:nfs:nfsv3_write_latency_hist.0 - <1ms:1761
nfsv3:nfs:nfsv3_write_latency_hist.1 - <2ms:0
nfsv3:nfs:nfsv3_write_latency_hist.2 - <4ms:2
nfsv3:nfs:nfsv3_write_latency_hist.4 - <6ms:1
nfsv3:nfs:nfsv3_write_latency_hist.6 - <8ms:0
nfsv3:nfs:nfsv3_write_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_write_latency_hist.10 - <12ms:0

So that’s 1761 writes at 1ms or better. Your writes should always show excellent latencies as writes are acknowledged once they’re committed to NVRAM — the actual writing of the data to disk will occur at the next consistency point.

Keep going and you’ll see latency figures that combine all types of requests — reads, writes and other (i.e. metadata):

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_latency_hist"
nfsv3:nfs:nfsv3_latency_hist.0 - <1ms:2124
nfsv3:nfs:nfsv3_latency_hist.1 - <2ms:5
nfsv3:nfs:nfsv3_latency_hist.2 - <4ms:16
nfsv3:nfs:nfsv3_latency_hist.4 - <6ms:18
nfsv3:nfs:nfsv3_latency_hist.6 - <8ms:13
nfsv3:nfs:nfsv3_latency_hist.8 - <10ms:6
nfsv3:nfs:nfsv3_latency_hist.10 - <12ms:1
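
If you want a single number out of a histogram like this, a little awk does the trick. This sketch (assuming the counter:value format above, where the count is the last colon-delimited field) totals the operations and reports the fraction that completed in under 1ms:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_latency_hist" | awk -F: '
    { total += $NF }
    /<1ms/ { fast += $NF }
    END { if (total > 0) printf "%d ops, %.1f%% under 1ms\n", total, 100 * fast / total }'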

Once you’ve established latencies, you may be interested to see the I/O sizes of the actual requests. Behold:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_size_histo"
nfsv3:nfs:nfsv3_read_size_histo.0-511:0
nfsv3:nfs:nfsv3_read_size_histo.512-1023:0
nfsv3:nfs:nfsv3_read_size_histo.1K-2047:0
nfsv3:nfs:nfsv3_read_size_histo.2K-4095:0
nfsv3:nfs:nfsv3_read_size_histo.4K-8191:328
nfsv3:nfs:nfsv3_read_size_histo.8K-16383:71

These statistics are represented like the latencies were: 0 requests were served that were between 0-511 bytes in size, 0 between 512 & 1023 bytes, and so on; then, 328 requests were served that were between 4KB & 8KB in size and 71 requests between 8KB & 16KB.

When it comes to protocol operations, we break things down by the individual operation types — including metadata operations. I’ll show the operation counts here, but the system also reports the percentage of requests and the individual latency for each operation type:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_op_count"
nfsv3:nfs:nfsv3_op_count.getattr:4
nfsv3:nfs:nfsv3_op_count.setattr:0
nfsv3:nfs:nfsv3_op_count.lookup:0
nfsv3:nfs:nfsv3_op_count.access:1
nfsv3:nfs:nfsv3_op_count.readlink:0
nfsv3:nfs:nfsv3_op_count.read:424
nfsv3:nfs:nfsv3_op_count.write:1767
nfsv3:nfs:nfsv3_op_count.create:0
nfsv3:nfs:nfsv3_op_count.mkdir:0
nfsv3:nfs:nfsv3_op_count.symlink:0
nfsv3:nfs:nfsv3_op_count.mknod:0
nfsv3:nfs:nfsv3_op_count.remove:0
nfsv3:nfs:nfsv3_op_count.rmdir:0
nfsv3:nfs:nfsv3_op_count.rename:0
nfsv3:nfs:nfsv3_op_count.link:0
nfsv3:nfs:nfsv3_op_count.readdir:0
nfsv3:nfs:nfsv3_op_count.readdirplus:0

In this case, most of my requests were reads & writes; there was very little metadata operation (just 4 attribute lookups). If you want to know what these operations represent, you can read some of my previous blog entries or check out the RFC that defines your protocol.
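
To turn those counts into a read/write mix, something like this works (again, a sketch against the output format above); for the capture shown it comes out at roughly 19% reads, 80% writes and well under 1% metadata:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_op_count" | awk -F: '
    { total += $NF }
    /op_count\.read:/  { r = $NF }
    /op_count\.write:/ { w = $NF }
    END { if (total > 0) printf "reads %.0f%%, writes %.0f%%, other %.0f%%\n", 100*r/total, 100*w/total, 100*(total-r-w)/total }'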

One last thing: figuring out whether your workload is random or sequential. The easiest way I’ve found to do that is to look at the readahead stats, as the readahead engine is particularly useful for determining how your reads are behaving. We don’t need to focus on writes as much, because any write much larger than 16KB is coalesced in memory and written sequentially — even if it started out as random at the client. (I’ll talk about RAVE, the new readahead engine in Data ONTAP 8.1+, in later posts.)

I’ve filtered out some of the irrelevant stuff, so let’s gather what we can use and take a look:

# rsh charles "priv set -q diag; stats show readahead:readahead:total_read_reqs"
readahead:readahead:total_read_reqs.4K:629
readahead:readahead:total_read_reqs.8K:77
readahead:readahead:total_read_reqs.12K:0
readahead:readahead:total_read_reqs.16K:4
readahead:readahead:total_read_reqs.20K:1
readahead:readahead:total_read_reqs.24K:7
readahead:readahead:total_read_reqs.28K:8
readahead:readahead:total_read_reqs.32K:0
readahead:readahead:total_read_reqs.40K:0
readahead:readahead:total_read_reqs.48K:0
readahead:readahead:total_read_reqs.56K:0
readahead:readahead:total_read_reqs.64K:77662

These statistics work much the same as our previous ones. 629 read requests were serviced that were 4KB blocks, 77 were serviced that were 8KB blocks, etc. The numbers continue right up to 1024KB. (Keep this in mind whenever someone tells you that WAFL & ONTAP can only read or write 4KB at a time!) Now if you want to see sequentiality:

# rsh charles "priv set -q diag; stats show readahead:readahead:seq_read_reqs"
readahead:readahead:seq_read_reqs.4K:36%
readahead:readahead:seq_read_reqs.8K:46%
readahead:readahead:seq_read_reqs.12K:0%
readahead:readahead:seq_read_reqs.16K:0%
readahead:readahead:seq_read_reqs.20K:0%
readahead:readahead:seq_read_reqs.24K:0%
readahead:readahead:seq_read_reqs.28K:0%
readahead:readahead:seq_read_reqs.32K:0%
readahead:readahead:seq_read_reqs.40K:0%
readahead:readahead:seq_read_reqs.48K:0%
readahead:readahead:seq_read_reqs.56K:0%
readahead:readahead:seq_read_reqs.64K:96%

We can see that, of the 4KB reads that we serviced, 36% were sequential. Mathematically this means that ~64% would be random, which we can confirm here:

# rsh charles "priv set -q diag; stats show readahead:readahead:rand_read_reqs"
readahead:readahead:rand_read_reqs.4K:63%
readahead:readahead:rand_read_reqs.8K:53%
readahead:readahead:rand_read_reqs.12K:0%
readahead:readahead:rand_read_reqs.16K:100%
readahead:readahead:rand_read_reqs.20K:100%
readahead:readahead:rand_read_reqs.24K:100%
readahead:readahead:rand_read_reqs.28K:100%
readahead:readahead:rand_read_reqs.32K:0%
readahead:readahead:rand_read_reqs.40K:0%
readahead:readahead:rand_read_reqs.48K:0%
readahead:readahead:rand_read_reqs.56K:0%
readahead:readahead:rand_read_reqs.64K:3%

Here, 63% of our 4KB requests are random, and 53% of our 8KB requests are random. Notice that, for each block size, the random and sequential percentages add up to roughly 100%, give or take rounding. If you dig deeper into the stats, readahead can provide you with some useful information about how your data came to be read:

# rsh charles "priv set -q diag; stats show readahead:readahead"
readahead:readahead:requested:2217395
readahead:readahead:read:82524
readahead:readahead:incore:25518
readahead:readahead:speculative:79658
readahead:readahead:read_once:82200

You can interpret those numbers as follows (again, I’ve removed the irrelevant stuff):

  • requested: the sum of all blocks requested by the client
  • read: blocks actually read from disk (where the readahead engine had a “miss”)
  • incore: blocks read from cache (where the readahead engine had a “hit”)
  • speculative: blocks that were read ahead because we figured the client might need them later in the I/O stream
  • read_once: blocks that were not read ahead and were not cached because they didn’t match the cache policy (e.g. flexscale.lopri_blocks being off)
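
Treating incore as hits and read as misses, per the definitions above, you can boil this down to a readahead hit rate; for the numbers shown that’s roughly 24%. A sketch:

# rsh charles "priv set -q diag; stats show readahead:readahead" | awk -F: '
    $3 == "read"   { miss = $NF }
    $3 == "incore" { hit = $NF }
    END { if (hit + miss > 0) printf "readahead hit rate: %.1f%%\n", 100 * hit / (hit + miss) }'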

With this information, you should have a pretty good idea about how to characterize what your Filer is doing for workloads. Although I’ve focused on NFSv3 here, the same techniques can be applied to the other protocols. If you’ve got any questions, please holler!

How Data ONTAP caches, assembles and writes data

In this post, I thought I would try and describe how ONTAP works and why. I’ll also explain what the “anywhere” means in the Write Anywhere File Layout, a.k.a. WAFL.

A couple of common points of confusion are the role that NVRAM plays in performance, and where our write optimization comes from. I’ll attempt to cover both of these here. I’ll also try and cover how we handle parity. Before we start, here are some terms (and ideas) that matter. This isn’t NetApp 101, but rather NetApp 100.5 — I’m not going to cover all the basics, but I’ll cover the basics that are relevant here. So here goes!

What hardware is in a controller, anyway?

NetApp storage controllers contain some essential ingredients: hardware and software. In terms of hardware the controller has CPUs, RAM and NVRAM. In terms of software the controller has its operating system, Data ONTAP. Our CPUs do what everyone else’s CPUs do — they run our operating system. Our RAM does what everyone else’s RAM does — it runs our operating system. Our RAM also has a very important function in that it serves as a cache. While our NVRAM does roughly what everyone else’s NVRAM does — i.e., it contains our transaction journal — a key difference is the way we use NVRAM.

That said, at NetApp we do things in unique ways. And although we have one operating system, Data ONTAP, we have several hardware platforms. They range from small controllers to big controllers, with medium controllers in between. Different controllers have different amounts of CPU, RAM & NVRAM, but the principles are the same!

CPU

Our CPUs run the Data ONTAP operating system, and they also process data for clients. Our controllers vary from a single dual-core CPU (in a FAS2220) to a pair of hex-core CPUs (in a FAS6290). The higher the amount of client I/O, the harder the CPUs work. Not all protocols are equal, though; serving 10 IOps via CIFS generates a different CPU load than serving 10 IOps via FCP.

RAM

Our RAM contains the Data ONTAP operating system, and it also caches data for clients. It is also the source for all writes that are committed to disk via consistency points. Writes do not come from NVRAM! Our controllers vary from 6GB of RAM (FAS2220) to 96GB of RAM (FAS6290). Not all workloads are equal, though; different features and functionality require a different memory footprint.

NVRAM

Physically, NVRAM is little more than RAM with a battery backup. Our NVRAM contains a transaction log of client I/O that has not yet been written to disk from RAM by a consistency point. Its primary mission is to preserve that not-yet-written data in the event of a power outage or similar, severe problem. Our controllers vary from 768MB of NVRAM (FAS2220) to 4GB of NVRAM (FAS6290). In my opinion, NVRAM’s function is perhaps the most-commonly misunderstood part of our architecture. NVRAM is simply a double-buffered journal of pending write operations. NVRAM, therefore, is simply a redo log — it is not the write cache! After data is written to NVRAM, it is not looked at again unless you experience a dirty shutdown. This is because NVRAM’s importance to performance comes from software.

In an HA pair environment where two controllers are connected to each other, NVRAM is mirrored between the two nodes. Its primary mission is to preserve that not-yet-written data in the event the partner controller suffers a power outage or a similarly severe problem. NVRAM mirroring happens for HA pairs in Data ONTAP 7-mode, HA pairs in clustered Data ONTAP and HA pairs in MetroCluster environments.

Disks and disk shelves

Disk shelves contain disks. DS14 shelves contain 14 drives; DS2246, DS4243 and DS4246 shelves contain 24 disks each; and DS4486 shelves contain 48 disks. Disk shelves are connected to controllers via shelf modules, and those connections run either FC-AL (DS14) or SAS (DS2246/DS42xx/DS4486) protocols.

What software is in a controller, anyway?

Data ONTAP

Data ONTAP is our controller’s operating system. Almost everything sits here — from configuration files and databases to license keys, log files and some diagnostic tools. Our operating system is built on top of FreeBSD, and usually lives in a volume called vol0. Data ONTAP features implementations of protocols for client access (e.g. NFS, CIFS), APIs for programming access (ZAPI) and implementations of protocols for management access (SSH). It is fair to say that Data ONTAP is the heart of a NetApp controller.

WAFL

WAFL is our Write Anywhere File Layout. If NVRAM’s role is the most-commonly misunderstood, WAFL comes in 2nd. Yet WAFL has a simple goal, which is to write data in full stripes across the storage media. WAFL acts as an intermediary of sorts — there is a top half where files and volumes sit, and a bottom half that interacts with RAID, manages SnapShots and some other things. WAFL isn’t a filesystem, but it does some things a filesystem does; it can also contain filesystems.

WAFL contains mechanisms for dealing with files & directories, for interacting with volumes & aggregates, and for interacting with RAID. If Data ONTAP is the heart of a NetApp controller, WAFL is the blood that it pumps.

Although WAFL can write anywhere we want, in reality we write where it makes the most sense: in the closest place (relative to the disk head) where we can write a complete stripe in order to minimize seek time on subsequent I/O requests. WAFL is optimized for writes, and we’ll see why below. Rather unusually for storage arrays, we can write client data and metadata anywhere.

A colleague has this to say about WAFL, and I couldn’t put it better:

There is a relatively simple “cheating at Tetris” analogy that can be used to articulate WAFL’s advantages. It is not hard to imagine how good you could be at Tetris if you were able to review the next thousand shapes that were falling into the pattern, rather than just the next shape.

Now imagine how much better you could be at Tetris if you could take any of the shapes from within the next thousand to place into your pattern, rather than being forced to use just the next shape that is falling.

Finally, imagine having plenty of time to review the next thousand shapes and plan your layout of all 1,000, rather than just a second or two to figure out what to do with the next piece that is falling. In summary, you could become the best Tetris player on Earth, and that is essentially what WAFL is in the arena of data allocation techniques onto underlying disk arrays.

The Tetris analogy is incredibly important, as it directly relates to the way that NetApp uses WAFL to optimize for writes. Essentially, we collect random I/O that is destined to be written to disk, reorganize it so that it resembles sequential I/O as much as possible, and then write it to disk sequentially. Another way of describing this behavior is write coalescing: we reduce the number of operations that ultimately land on the disk, because we re-organize them in memory before we commit them to disk, and we wait until we have a bunch of them before committing them to disk via a Consistency Point. Put another way, write coalescing allows us to avoid the common (and expensive) RAID workflow of “read-modify-write”.

Putting it all together

NetApp storage arrays are made up of controllers and disk shelves. Where the top and bottom are depends on your perspective: disks are grouped into RAID groups, and those RAID groups are combined to make aggregates. Volumes live in aggregates, and files and LUNs live in those volumes. A volume of CIFS data is shared to a client via an SMB share; a volume of NFS data is shared to a client via an NFS export. A LUN of data is shared to a client via an FCP, FCoE or iSCSI initiator group. Note the relationship here between controller and client — all clients care about are volumes, files and LUNs. They don’t care directly about CPUs, NVRAM or really anything else when it comes to hardware. There is data and I/O, and that’s it.

In order to get data to and from clients as quickly as possible, NetApp engineers have done much to try to optimize controller performance. A lot of the architectural design you see in our controllers reflects this. Although the clients don’t care directly about how NetApp is architected, our architecture does matter to the way their underlying data is handled and the way their I/O is served. Here is a basic workflow, from the inimitable Recovery Monkey himself:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. The client receives an acknowledgement that the data has been written

Sounds pretty simple right? On the surface, it is a pretty simple process. It also explains a few core concepts about NetApp architecture:

  • Because we acknowledge the client write once it’s hit NVRAM, we’re optimized for writes out of the box
  • Because we don’t need to wait for the disks, we can write anywhere we choose to
  • Because NVRAM is mirrored between partners, we can survive an outage to either half of an HA pair

The 2nd bullet provides the backbone of this post. Because we can write data anywhere we choose to, we tend to write data in the best place possible. This is typically in the largest contiguous stripe of free space in the volume’s aggregate (closest to the disk heads). After 10 seconds have elapsed, or if NVRAM becomes >=50% full, we write the client’s data from RAM (not from NVRAM) to disk.

This operation is called a consistency point. Because we use RAID, a consistency point requires us to perform RAID calculations and to calculate parity. These calculations are processed by the CPUs using data that exists in RAM.

Many people think that our acceleration occurs because of NVRAM. Rather, a lot of our acceleration happens while we’re waiting for NVRAM. For example — and very significantly — we transmogrify random I/O from the client into sequential I/O before writing it to disk. This processing is done by the CPU and occurs in RAM. Another significant benefit is that, because we calculate parity in RAM, we do not need to hammer the disk drives that contain parity information.

So, with the added step of a CP, this is how things really happen:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. When the partner NVRAM acknowledges the write, the client receives an acknowledgement that the data has been written [to disk]
  5. A consistency point occurs and the data (incl. parity) is written to disk

To re-iterate, writes are cached in the controller’s RAM at the same time as being logged into NVRAM (and the partner’s NVRAM if you’re using a HA pair). Once the data has been written to disk via a CP, the writes are purged from the controller’s NVRAM but retained (with a lower priority) in the controller’s RAM. The prioritization allows us to evict less commonly-used blocks in order to avoid overrunning the system memory. In other words, recently-written data is the first to be ejected from the first-level read cache in the controller’s RAM.

The parity issue also catches people out. Traditional filesystems and arrays write data (and metadata) into pre-allocated locations; Data ONTAP and WAFL let NetApp write data (and metadata) in whatever location will provide fastest access. This is usually a stripe to the nearest available set of free blocks. The ability of WAFL to write to the nearest available free disk blocks lets us greatly reduce disk seeking. This is the #1 performance challenge when using spinning disks! It also lets us avoid the “hot parity disk” paradigm, as WAFL always writes to new, free disk blocks using pre-calculated parity.

Consistency points are also commonly misunderstood. The purpose of a consistency point is simple: to write data to free space on disks. Once a CP has taken place, the contents of NVRAM are discarded. Incoming writes that are received during the actual process of a consistency point’s commitment (i.e., the actual writing to disk) will be written to disk in the next consistency point.

The workflow for a write:

1. Write is sent from the host to the storage system (via a NIC or HBA)
2. Write is processed into system memory while a) being logged in NVRAM and b) being logged in the HA partner’s NVRAM
3. Write is acknowledged to the host
4. Write is committed to disk storage in a consistency point (CP)

The importance of consistency points cannot be overstated — they are a cornerstone of Data ONTAP’s architecture! They are also the reason why Data ONTAP is optimized for writes: no matter the destination disk type, we acknowledge client writes as soon as they hit NVRAM, and NVRAM is always faster than any type of disk!

Consistency points typically occur every 10 seconds or whenever NVRAM begins to get full, whichever comes first. The latter is referred to as a “watermark”. Imagine that NVRAM is a bucket, and now divide that bucket in half. One side of the bucket is for incoming data (from clients) and the other side of the bucket is for outgoing data (to disks). As one side of the bucket fills, the other side drains. In an HA pair, we actually divide NVRAM into four buckets: two for the local controller and two for the partner controller. (This is why, when activating or deactivating HA on a system, a reboot is required for the change to take effect. The 10-second rule is also why, on a system that is doing practically no I/O, the disks will blink every 10 seconds.)
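
You can actually watch consistency points happening with sysstat; the CP ty column shows what triggered each one (from memory, T is the ten-second timer and F is an NVRAM half filling up — check the man page for the full list on your release):

# rsh charles "sysstat -x 1"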

The actual value of the watermark varies depending on the exact configuration of your environment: the Filer model, the Data ONTAP version, whether or not it’s HA, and whether SATA disks are present. Because SATA disks write more slowly than SAS disks, consistency points take longer to write to disk if they’re going to SATA disks. In order to combat this, Data ONTAP lowers the watermark in a system when SATA disks are present.

The size of NVRAM only really dictates a Filer’s performance envelope when doing a large amount of large sequential writes. But with the new FAS80xx family, NVRAM sizes have grown massively — up to 4x those of the FAS62xx family.

Getting started with NetApp Storage QoS

Storage QoS is a new feature in Clustered Data ONTAP 8.2. It is a full-featured QoS stack, which replaces the FlexShare stack of previous ONTAP versions. So what does it do and how does it work? Let’s take a look!

The administration of QoS involves two parts: policy groups, and policies. Policy groups define boundaries between workloads, and contain one or more storage objects. We can monitor, isolate and limit the workloads of storage objects from the biggest point of granularity down to the smallest — from entire Vservers, through whole volumes and individual LUNs, all the way down to single files.

The actual policies are behavior modifiers that are applied to a policy group. Right now, we can set throughput limits based on operation counts (i.e., IOps) or throughput counts (i.e., MB/s). When limiting throughput, storage QoS throttles traffic at the protocol stack. Therefore, a client whose I/O is being throttled will see queuing in their protocol stack (e.g., CIFS or NFS) and latency will eventually rise. However, the addition of this queuing will not affect the NetApp cluster’s resources.

In addition to the throttling of workloads, storage QoS also includes very effective measuring tools. And because QoS is “always on”, you don’t even need to have a policy group in order to monitor performance.
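
For example, even with no policy groups defined you can watch per-workload activity from the cluster shell. The exact command set varies by release (these are from memory; typing qos statistics ? will show what yours supports):

dot82cm::> qos statistics workload performance show
dot82cm::> qos statistics workload latency show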

So, let’s get started. Getting our first policy applied to a workload actually takes three steps, plus a fourth to watch it in action:

  1. Create the policy group (with qos policy-group create...)
  2. Set a throughput limit on the policy group (with qos policy-group modify...)
  3. Apply the policy group to a Vserver or volume (with vserver modify... or volume modify...)
  4. Monitor what’s going on (with qos statistics...)

Before we start, let’s verify that we’re starting from a clean slate:

dot82cm::> qos policy-group show
This table is currently empty.

Okay, good — no policy groups exist yet. Step one of three is to create the policy group itself, which we’ll call blog-group. In reality, you’d specify a throughput limit (either IOps or MB/s), but for now we won’t bother limiting the throughput:

dot82cm::> qos policy-group create blog-group -vserver vs0

Let’s make sure the policy group was created:

dot82cm::> qos policy-group show                    
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined -     0-INF

But, because we didn’t specify a throughput limit, the Throughput column is still showing 0 to infinity. Let’s add a limit of 1500 IOps:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 1500iops

(If we wanted to limit that volume to 1500MB/s, we could have substituted 1500mb for 1500iops.)

And verify:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 0     0-1500IOPS

So, step three is to associate the new policy group with an actual object whose I/O we wish to throttle. The object can be one or many volumes, LUNs or files. For now though, we’ll apply it to a single volume, blog_volume:

dot82cm::> volume modify blog_volume -vserver vs0 -qos-policy-group blog-group

Volume modify successful on volume: blog_volume

Let’s confirm that it was successfully modified:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-1500IOPS

Cool! We can see that Workloads has gone from 0 to 1. I’ve mounted that volume via NFS on a Linux VM, and will throw a bunch of workloads at it using dd.
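
If you want to generate a similar workload, the dd invocations would look something like this (the mount point and file names are made up for illustration; 64KB blocks line up nicely with the request sizes you’ll see below, and oflag=direct keeps the client’s own cache out of the picture):

client# dd if=/dev/zero of=/mnt/blog_volume/ddtest.1 bs=64k count=4096 oflag=direct
client# dd if=/dev/zero of=/mnt/blog_volume/ddtest.2 bs=64k count=4096 oflag=direct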

While the workload is running, here’s how it looks:

dot82cm::> qos statistics workload characteristics show                         
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      392         5.26MB/s          14071B     14%          14 
_USERSPACE_APPS     14      170       109.46KB/s            659B     32%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792     1679       104.94MB/s          65527B      0%           4

As you can see, our volume blog_volume is pretty busy — it’s pushing almost 1,700 IOps at over 100MB/sec. So, let’s see if the throttling is effective. First, we’ll give the policy group a low throughput maximum:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 100iops

Now let’s check its status:

dot82cm::> qos policy-group show
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-100IOPS

Now let’s see how the Filer is doing:

dot82cm::> qos statistics workload characteristics show
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      384         6.71MB/s          18333B     11%          33 
_USERSPACE_APPS     14      169         2.50MB/s          15528B     26%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792       83         5.19MB/s          65536B      0%          15 
-total-              -      207         4.81MB/s          24348B      0%          17

You can see that the throughput has gone way down! In fact it’s gone below our limit of 100 IOps. And that, of course, is what’s supposed to happen. Now let’s remove the limit and see if things return to normal:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput none

dot82cm::> qos statistics workload characteristics show                            
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -     1073        44.37MB/s          43363B      8%           1 
blog_volume-w..  11792      626        39.12MB/s          65492B      0%           1 
_USERSPACE_APPS     14      302         4.90MB/s          17041B     29%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
-total-              -      263       471.86KB/s           1837B     19%           0

Because we can apply the QoS policy groups to entire Vservers, volumes, files & LUNs, it is important to keep track of what’s applied where. This is how you’d apply a policy group to an individual volume:

dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group blog-group

(To remove the policy, set the -qos-policy-group field to none.)

To apply a policy group against an entire Vserver (in this case, Vserver vs0):

dot82cm::> vserver modify -vserver vs0 -qos-policy-group blog-group

(Again, to remove the policy, set the -qos-policy-group field to none.)

To see which volumes are assigned our policy group:

dot82cm::> volume show -vserver vs0 -qos-policy-group blog-group                       
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
vs0       blog_volume  aggr1_node1  online     RW          1GB    365.8MB   64%

To see all volumes in all Vservers’ QoS policy groups:

dot82cm::> volume show -vserver * -qos-policy-group * -fields vserver,volume,qos-policy-group

vserver    volume qos-policy-group 
---------- ------ ---------------- 
vs0        blog_empty 
                  -                
vs0        blog_volume 
                  blog-group

To see a Vserver’s full configuration, including its QoS policy group:

dot82cm::> vserver show -vserver vs0

                                    Vserver: vs0
                               Vserver Type: data
                               Vserver UUID: c280658e-bd77-11e2-a567-123478563412
                                Root Volume: vs0_root
                                  Aggregate: aggr1_node2
                        Name Service Switch: file, nis, ldap
                        Name Mapping Switch: file, ldap
                                 NIS Domain: newengland.netapp.com
                 Root Volume Security Style: unix
                                LDAP Client: -
               Default Volume Language Code: C
                            Snapshot Policy: default
                                    Comment: 
                 Antivirus On-Access Policy: default
                               Quota Policy: default
                List of Aggregates Assigned: -
 Limit on Maximum Number of Volumes allowed: unlimited
                        Vserver Admin State: running
                          Allowed Protocols: nfs, cifs, ndmp
                       Disallowed Protocols: fcp, iscsi
            Is Vserver with Infinite Volume: false
                           QoS Policy Group: -

To get a list of all Vservers by their policy groups:

dot82cm::> vserver show -vserver * -qos-policy-group * -fields vserver,qos-policy-group 
vserver qos-policy-group 
------- ---------------- 
dot82cm -                
dot82cm-01 
        -                
dot82cm-02 
        -                
vs0     blog-group       
4 entries were displayed.

If you’re in a hurry and want to remove all instances of a policy from volumes in a particular Vserver:

dot82cm::> vol modify -vserver vs0 -volume * -qos-policy-group none

That should be enough to get us going. Stay tuned, because in the next episode I’ll show some video with iometer running!

Differences between NetApp Flash Cache and Flash Pool

NetApp’s Flash Cache product has been around for several years; the Flash Pool product is much newer. While there is a lot of overlap between the two products, there are distinct differences as well. The goal of both technologies is the same: to accelerate reads by serving data from solid-state memory instead of spinning disk. Although they produce the same end result, they are built differently and they operate differently.

First of all, why do we bother with read (and write) caching? Because there’s a significant speed differential depending on the source from which we serve data. RAM is much faster than SSD, and SSD is much faster than rotating disk. This graphic illustrates the paradigm:

[Image: why-cache]

(Note that we can optimize the performance of 100%-rotating disk environments with techniques like read-aheads, buffering and the like.)

Sequential reads can be read off rotating disks extremely quickly, and very rarely need to be accelerated by solid-state storage. Random reads are much better served from Flash than rotating disks, and very recent random reads can be served from buffer cache RAM (i.e., the controller’s RAM) — which is an order of magnitude faster again. Although Flash Pool and Flash Cache provide a caching mechanism for those random reads (and some other I/O operations), they do so in different ways.

Here is a quick overview of some of the details that come up when comparing the two technologies:

Detail                          Flash Cache                                  Flash Pool
Physical entity                 PCI-e card                                   SSD drive
Physical location               Controller head                              Disk shelf
Logical location                Controller bus                               Disk stack/loop
Logical accessibility           Controller head                              Disk stack/loop
Cache mechanism                 First-in, first-out                          Temperature map
Cache persistence on failover   Requires re-warm                             Yes
Cache data support              Reads                                        Reads, random overwrites
Cache metadata support          Yes                                          Yes
Cache sequential data support   Yes, with lopri mode                         No
Infinite Volume support         Yes                                          No
Management granularity          System-level for all aggregates & volumes    Aggregate-level with per-volume policies
32-bit aggregate support        Yes                                          No
64-bit aggregate support        Yes                                          Yes
RAID protection                 N/A                                          RAID-4, RAID-DP (recommended)
Minimum quantity                1 PCI-e card                                 3 SSD drives
Minimum Data ONTAP version      7.3.2                                        8.1.1
Removable                       Yes                                          Yes, but aggregate must be destroyed

The most basic difference between the two technologies is that Flash Cache is a PCI-e card (or cards) that sits in the NetApp controller, whereas Flash Pool is a set of SSD drives that sit in NetApp shelves. This gives rise to some important points. First, Flash Cache memory accessed via the PCI-e bus will always have much higher potential throughput than Flash Pool memory sitting on a SAS stack. Second, the two architectures result in different means of logical accessibility: with Flash Cache, any and all aggregates that sit on a controller with a Flash Cache card can be accelerated; with Flash Pool, any and all aggregates that sit in a disk stack (or loop) with Flash Pool SSDs can be accelerated. It also follows that in the event of planned downtime (such as a takeover/giveback), the Flash Cache card’s cache data can be copied to the HA pair’s partner node (“rewarming”), but in the event of unplanned downtime (such as a panic), the Flash Cache card’s cache data will be lost. The corollary is that if a controller with a Flash Pool fails, the aggregate fails over to the partner controller and the Flash Pool remains accessible there.

Another important difference is Flash Pool’s ability to cache random overwrites. First, let’s define what we’re talking about: a random overwrite is a small, random write of a block (or blocks) that was recently written to HDD and is now being overwritten. The fact that we’re talking only about random overwrites is important, because Flash Pools do not accelerate ordinary or sequential writes — Data ONTAP is already optimized for writes. Rather, Flash Pool lets us cache these random overwrites by allowing the Consistency Points (CPs) that contain them to be written into the SSD cache. Writing to SSD is up to 2x quicker than writing to rotating disk, which means that the Consistency Point completes faster. Random overwrites are the most expensive (i.e., slowest) operation that we can perform on a rotating disk, so it’s in our interest to accelerate them.

Although Flash Cache cannot cache random overwrites, it can function as a write-through cache via the use of lopri mode. In some workloads, we may want recently-written data to be read back immediately after being written (the so-called “read after write” scenario). For these workloads, Flash Cache can improve performance by caching recently-written blocks rather than having to seek the rotating disks for that recently-written data. Note that we are not writing the client data into Flash Cache primarily; rather, we are writing the data to rotating disk through the Flash Cache — thus, a write-through cache. Flash Cache serves as a write-through cache under two scenarios: with lopri disabled, it does so only while the cache is less than 70% full; with lopri enabled, it always does so. Enabling lopri mode in Flash Cache will also cache sequential reads, whereas Flash Pool has no mechanism for caching sequential data.
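
From memory, the lopri behaviour on 7-mode is controlled through the flexscale options; listing them shows whether low-priority block caching is enabled, and it can be toggled like any other option (check the options man page for your release):

# rsh charles "options flexscale"
# rsh charles "options flexscale.lopri_blocks on"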

Continuing this exercise are the differences between how Flash Cache and Flash Pools actually cache data. Flash Cache is populated with blocks of data that are being evicted from the controller’s primary memory (i.e. its RAM) but are requested by client I/O. Flash Cache utilizes a first-in, first-out algorithm: when a new block of data arrives in the Flash Cache, an existing block of data is purged. When the next new block arrives, our previous block is now one step closer to its own eviction. (The number of blocks that can exist in cache depends on the size of your Flash Cache card.)

Flash Pool is initially populated the same way: when blocks of data match the cache insertion policy and have been evicted from the controller’s primary memory but are requested by client I/O. Flash Pool utilizes a temperature map algorithm: when a new block of data arrives in the Flash Pool, it is assigned a neutral temperature. Data ONTAP keeps track of the temperature of blocks by forming a heat map of those blocks.

This is what a temperature map looks like, and shows how a read gets evicted:

[Image: read-cache-mgmt]

You can see the read cache process: a block gets inserted and labeled with a neutral temperature. If the block gets accessed by clients, the scanner sees this and increases the temperature of the block — meaning it will stay in cache for longer. If the block doesn’t get accessed, the scanner sees this and decreases the temperature of the block. When it gets cold enough, it gets evicted. This means that we can keep hot data while discarding cold data. Note that the eviction scanner only starts running when the cache is at least 75% utilized.

This is how a random overwrite gets evicted:

[Image: write-cache-mgmt]

The hotter the block, the farther right on the temperature gauge and the longer the block is kept in cache. Again, you can see the process: a block gets inserted and labeled with a neutral temperature. If the block gets randomly overwritten again by clients, the block is re-inserted and labeled with a neutral temperature. Unlike a read, a random overwrite cannot have its temperature increased — it can only be re-inserted.

Note that I mentioned Flash Pool actually has RAID protection. This is to ensure the integrity of the cached data on disk, but it’s important to remember that read cache data is just that – a cache of read data. The permanent copy of the data always lives on rotating disks. If your Flash Pool SSD drives all caught fire at the same time, the read cache would be disabled but you would not lose data.

Why cache is better than tiering for performance

When it comes to adjusting (either increasing or decreasing) storage performance, two approaches are common: caching and tiering. Caching refers to a process whereby commonly-accessed data gets copied into the storage controller’s high-speed, solid-state cache. Therefore, a client request for cached data never needs to hit the storage’s disk arrays at all; it is simply served right out of the cache. As you can imagine this is very, very fast.

Tiering, in contrast, refers to the movement of a data set from one set of disk media to another; be it from slower to faster disks (for high-performance, high-importance data) or from faster to slower disks (for low-performance, low-importance data). For example, you may have your high-performance data living on a SSD, FC or SAS disk array, and your low-performance data may only require the IOps that can be provided by relatively low-performance SATA disks.

Both solutions have pros and cons. Cache is typically less configurable by the user, as the cache’s operation will be managed by the storage controller. It is considerably faster, as the cache will live on the bus — it won’t need to traverse the disk subsystem (via SAS, FCAL etc.) to get there, nor will it have to compete with other I/O along the disk subsystem(s). But, it’s also more expensive: high-grade, high-performance solid-state cache memory is more costly than SSD disks. Last but not least, the cache needs to “warm up” in order to be effective — though in the real world this does not take long at all!

Tiering’s main advantage is that it is more easily tunable by the customer. However, all is not simple: in complex environments, tuning the tiering may literally be too complex to bother with. Also, manual tiering relies on you being able to predict the needs of your storage and adjust the tiering accordingly: how do you know today which business application will be hit hardest tomorrow? Again, in complex environments, this relatively simple question may be decidedly difficult to answer. On the positive side, tiering offers more flexibility in terms of where you put your data. Cache is cache, regardless of environment; data is either in the cache or it’s not. Tiering, on the other hand, lets you take advantage of more types of storage: SSD, FC, SAS or SATA, depending on your business needs.

But if you’re tiering for performance (which is the focus of this blog post), then you have to deal with one big issue: the very act of tiering increases the load on your storage system! Tiering actually creates latency as it occurs: in order to move the data from one storage tier to another, we are literally creating IOps on the storage back-end in order to accomplish the performance increase! That is, in order to get higher performance, we’re actually hurting performance in the meantime (i.e., while the data is moving around.)

In stark contrast, caching reduces latency and increases throughput as it happens. This is because the data doesn’t really move: the first time data is requested, a cache request is made (and misses — it’s not in the cache yet) and the data is served from disk. On its way to the customer, though, the data will stay in the cache for a while. If it’s requested again, another cache request is made (and hits — the data is already in the cache) and the data is served from cache. And it’s served fast!

(It’s worthwhile to note that NetApp’s cache solutions actually offer more than “simple” caching: we can even cache things like file/block metadata. And customers can tune their cache to behave how they want it.)

Below is a graph from a customer’s benchmark. It was a large SQL Server read, but what is particularly interesting is the behavior of the graph: throughput (in red) goes up while latency (in blue) actually drops!

If you were seeking a performance augmentation via tiering, there would have been two different possibilities. If your data was already tiered, throughput will go up while latency will remain the same. If your data wasn’t already tiered, throughput will decrease as latencies will increase as the data is tiered; only after the tiering is completed will you actually see an increase in throughput.

For gaining performance in your storage system, caching is simply better than tiering.

BlackBoard & Oracle benchmarks on SSD storage

I spent Friday benchmarking two platforms against each other: our production BlackBoard instance (9.1 SP6) versus a development BlackBoard instance (also 9.1 SP6) that I spun up for this test. The unique thing about this test was that I put the development instance entirely on SSD storage.

Production instance:

  • BlackBoard 9.1 SP6 on VMware ESXi VM (Linux)
  • Oracle 10g R2 on HP BL465c G6 Blade (Linux)
  • BlackBoard on NetApp filer; 1 volume on a 45-disk RAID-DP aggregate
  • Oracle on local HP SAS RAID-1 disk set

Development instance:

  • BlackBoard 9.1 SP6 on VMware ESXi VM (Linux)
  • Oracle 10g R2 on VMware ESXi VM (Linux)
  • BlackBoard on FusionIO ioDrive SSD
  • Oracle also on (the same) FusionIO ioDrive SSD

With the exception of workloads (itself a significant difference, to be sure), I tried to keep everything else the same — same patches, same OS revision, same LDAP & SSL configurations, etc.

To test performance, I had a suggestion from Steve (@Seven_Seconds) at BlackBoard: hit the /webapps/login/ page with ab, the Apache benchmark tool.
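
The ab invocation itself is straightforward; something like this, with a placeholder hostname and the concurrency (-c) varied from run to run:

$ ab -n 5000 -c 10 http://blackboard.example.edu/webapps/login/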

The results were, well, strange. With each test consisting of ab making 5000 requests to /webapps/login/, here is a graph of the results:

[Image: NetApp v SSD]

You can see that the mean response time of the SSD (blue line) equals or out-performs that of the NetApp (red line) at every concurrency value tested. What is particularly interesting, though, is the incredibly slow rate at which the standard deviation of the SSD’s response times (purple bars) grows; compare that with the rate at which the standard deviation of the NetApp’s response times (green bars) grows.

Right now, I don’t have a clear answer as to why the deviations from the mean grow so differently. The NetApp filer has a significantly different workload to the SSD (the SSD has no workload other than this test), but the exponential growth of the NetApp’s standard deviation is something I’ll need to investigate further.

Last but not least, there were no SSD-specific tuning options put in place, nor were there any NetApp-specific tuning options. This was as vanilla as both installs could get in order to keep everything relatively comparable.

Performance in strange places

Sometimes, you find performance improvements in places where you weren’t looking for an improvement. I will give an example: at a meeting last week, our webmaster was passing on complaints from users that their Content Management System was performing poorly. He had done a bunch of investigative work and concluded that the problem was likely an Active Directory problem. While that’s not exactly what he meant, it’s what he said, and our Active Directory administrator was a little annoyed.

What the Webmaster actually meant was that, when publishing several thousand files that live on a NetApp filer that is Active Directory-aware, there may be issues if you’re performing thousands of LDAP lookups serially. Which is a valid point: that can get computationally expensive. So, we got to talking about it, and another engineer mentioned nscd, the name service cache daemon — something I didn’t know existed outside of NIS (i.e., legacy) environments.

Long story short, it made for an amusing (and satisfying) test.

/home#  time ls -al | wc -l
3965

real    0m19.547s
user    0m3.729s
sys     0m1.700s

Almost 20 seconds to perform a lookup on an NFS-mounted directory that has almost 4,000 unique Active Directory users. That’s a lot of UIDs and GIDs! Now, we install nscd, and run the same test once to populate the cache.
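
If you want to try this yourself, the installation is something like the following on a Red Hat-style box (package and service names vary by distro); nscd -g prints the daemon’s statistics, so you can watch the passwd and group caches fill up:

# yum install nscd
# service nscd start
# ls -al /home > /dev/null
# nscd -g | head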

Then we run the test again, with a populated cache. Care to see the results?

/home# time ls -al | wc -l     
3965

real    0m2.241s
user    0m0.360s
sys     0m0.570s

From 19 seconds down to less than 3 seconds. Quite the improvement! Obviously you may need to adjust your cache parameters, particularly if this is a fairly inactive system most of the time but does have the occasional need for performance — in such a case, you may set your cache to expire unusually slowly. The lesson from this, though, is that sometimes you find gains when you weren’t looking. Or even in places you didn’t know gains could be found.