Using CDP to understand Data ONTAP networking

Have you ever been on a Data ONTAP system without a clear idea of how the physical network is connected, and wished you could interrogate your network to find out? If so, CDP – the Cisco Discovery Protocol – might be the help you’re looking for. This can be very useful on systems with large or complex Ethernet configurations. Once CDP is enabled in Data ONTAP, your Cisco switches will become aware of which NetApp systems are cabled to which ports. They’ll know both the source port (on the NetApp) and the destination port (on the Cisco).

CDP is supported in both 7-mode and cDOT, and is simply enabled with an option command.

To enable CDP in 7-mode:

options cdpd.enable on

To enable CDP in cDOT:

node run -node * options cdpd.enable on

The nice thing about NetApp’s CDP implementation is that it’s bi-directional: you can query CDP from either the Cisco switch or the NetApp controller, so you don’t have to rely on a network administrator to provide you the information.

To view CDP information from 7-mode Data ONTAP, you would use the cdpd show-neighbors command.

Here’s some sample output:

mystic> cdpd show-neighbors
Local  Remote          Remote                 Remote           Hold  Remote    
Port   Device          Interface              Platform         Time  Capability
------ --------------- ---------------------- ---------------- ----- ---------- 
e0M    charles         e0M                    FAS3170           146   H         
e0M    nane-cat4948-sw GigabitEthernet1/8     cisco WS-C4948-.  174   RSI       
e3a    nane-nx5010-sw. Ethernet1/4            N5K-C5010P-BF     173   SI        
e4a    nane-nx5010-sw. Ethernet1/14           N5K-C5010P-BF     177   SI

Note that we can see the Filer’s HA partner, charles, in the output. We can also see that e0M is cabled to port GigabitEthernet1/8 on nane-cat4948-sw, while e3a and e4a are cabled to Ethernet1/4 and Ethernet1/14 on nane-nx5010-sw respectively.

This is incredibly useful information if you’re ever trying to track down how a system is cabled!
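If you’re mapping a larger environment, it can be handy to turn this output into a structured cabling map. Here’s a minimal Python sketch — the helper and its name are my own, not a NetApp tool. It only trusts the first three whitespace-delimited columns (Local Port, Remote Device, Remote Interface), since those never contain spaces even when other columns (like Remote Platform) do:

```python
# Sample data lines from `cdpd show-neighbors`, as shown above.
SAMPLE = """\
e0M    charles         e0M                    FAS3170           146   H
e0M    nane-cat4948-sw GigabitEthernet1/8     cisco WS-C4948-.  174   RSI
e3a    nane-nx5010-sw. Ethernet1/4            N5K-C5010P-BF     173   SI
e4a    nane-nx5010-sw. Ethernet1/14           N5K-C5010P-BF     177   SI
"""

def cabling_map(text):
    """Map each local port to a list of (remote device, remote interface)."""
    cabling = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 6:  # skip blank/partial lines
            local_port, device, interface = fields[0], fields[1], fields[2]
            cabling.setdefault(local_port, []).append((device, interface))
    return cabling

print(cabling_map(SAMPLE)["e3a"])  # [('nane-nx5010-sw.', 'Ethernet1/4')]
```

Dump the command output to a file once and you can answer cabling questions without touching the Filer again.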

And from clustered Data ONTAP, to view CDP neighbor information you would use the run -node nodeName cdpd show-neighbors command.

The output is the same format as in 7-mode:

dot83cm::> node run -node local cdpd show-neighbors
Local  Remote          Remote                 Remote           Hold  Remote    
Port   Device          Interface              Platform         Time  Capability
------ --------------- ---------------------- ---------------- ----- ---------- 
e6a    nane-nx5010-sw. Ethernet1/12           N5K-C5010P-BF     145   SI        
e6b    nane-nx5010-sw. Ethernet1/5            N5K-C5010P-BF     145   SI        
e4a    dot83cm-01      e4a                    FAS3240           161   H         
e4b    dot83cm-01      e4b                    FAS3240           161   H         
e0a    nane-cat4948-s. GigabitEthernet1/9     cisco WS-C4948-.  168   RSI

In this case, e6a and e6b go to the same switch, with e4a and e4b going to the other node in this HA pair — that’s my switchless cluster interconnect. e0a goes to an older Catalyst switch.

And to view CDP information from Cisco IOS or NX-OS, you would use the show cdp neighbors command.

Some sample output:

nane-nx5010-sw# show cdp neighbors
Capability Codes: R - Router, T - Trans-Bridge, B - Source-Route-Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater,
                  V - VoIP-Phone, D - Remotely-Managed-Device,
                  s - Supports-STP-Dispute, M - Two-port Mac Relay

Device ID              Local Intrfce   Hldtme  Capability  Platform      Port ID
US-WLM-LS02            mgmt0           124     R S I       WS-C6509      Gig5/1 
nane-cat4948-sw        Eth1/2          179     R S I       WS-C4948-10GE Ten1/49
dot83cm-01             Eth1/3          163     H           FAS3240       e6b    
mystic                 Eth1/4          127     H           FAS3170       e3a    
dot83cm-02             Eth1/5          158     H           FAS3240       e6b

More detailed information about the output of CDP commands can be found in the relevant Network Management Guide, linked below. CDP has been supported since Data ONTAP 8.0.

In IOS/NX-OS, you may wish to run show cdp neighbors detail to gather more information.



Characterizing workloads in Data ONTAP

One of the most important things when analyzing Data ONTAP performance is being able to characterize the workload that you’re seeing. This can be done a couple of ways, but I’ll outline one here. Note that this is not a primer on analyzing storage bottlenecks — for that, you’ll have to attend Insight this year! — but rather a way to see how much data a system is serving out and what kind of data it is.

When it comes to characterizing workloads on a system, everyone has a different approach — and that’s totally fine. I tend to start with a few basic questions:

  1. Where is the data being served from?
  2. How much is being served?
  3. What protocols are doing the most work?
  4. What kind of work are the protocols doing?

You can do this with the stats command if you know which counters to look for. And if you find a statistic you don’t know the meaning of, we have a good help system at hand: stats explain counters. To wit:

# rsh charles "priv set -q diag; stats explain counters readahead:readahead:incore"
Counters for object name: readahead
Name: incore
Description: Total number of blocks that were already resident
Properties: delta
Unit: none

Okay, so that’s not the greatest explanation, but it’s a start! More commonly-used counters are going to have better descriptions.

When it comes to analyzing your workload, a good place to start are the read_io_type counters in the wafl object. These statistics are not normally collected, so you have to start and stop the collection yourself. Here’s an example on a 7-mode system — I’m going to use rsh non-interactively because there are a lot of counters!

# rsh charles "priv set -q diag; stats show wafl:wafl:read_io_type"

Here is an explanation of these counters:

  • cache are reads served from the Filer’s RAM
  • ext_cache are reads served from the Filer’s Flash Cache
  • disk are reads served from any model of spinning disk
  • bamboo_ssd are reads served from any model of solid-state disk
  • hya_hdd are reads served from the spinning disks that are part of a Flash Pool
  • hya_cache are reads served from the solid-state disks that are part of a Flash Pool

Note that these counters are present regardless of your system’s configuration. For example, the Flash Cache counters will be visible even if you don’t have Flash Cache installed.
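To see at a glance where reads are coming from, you can turn a snapshot of these counters into percentages. A quick sketch — the block counts here are made up for illustration; the real values come from the stats show command above:

```python
# Hypothetical read_io_type snapshot (values are block counts).
read_io_type = {
    "cache": 61327,      # reads served from RAM
    "ext_cache": 21902,  # reads served from Flash Cache
    "disk": 16771,       # reads served from spinning disk
    "bamboo_ssd": 0,     # reads served from SSD
    "hya_hdd": 0,        # reads served from Flash Pool HDDs
    "hya_cache": 0,      # reads served from Flash Pool SSDs
}

total = sum(read_io_type.values())
for source, blocks in read_io_type.items():
    print(f"{source:>10}: {100 * blocks / total:5.1f}%")
```

In this hypothetical case, most reads are cache hits (RAM plus Flash Cache), which usually correlates with low read latency.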

Now that you know where your reads are being served from, you might be curious how big the reads are, and whether they’re random or sequential. Those statistics are available, but it takes a bit more grepping and cutting to get them. Here’s a capture covering my volume, sefiles. Note that I need to prefix it with priv set:

# rsh charles "priv set -q diag; stats start -I perfstat_nfs"
# rsh charles "priv set -q diag; stats stop -I perfstat_nfs"

Okay, so there’s a lot of output there. It’s data for the entire system. That said, it’s a great way to learn — you can see everything the system sees. And if you dump it to a text file, it will save you using rsh over and over again! Let’s use rsh again and drill down to the volume stats, which is the most useful way to investigate a particular volume:

# rsh charles "priv set -q diag; stats show volume:sefiles"

I’ve removed some of the internal counters for clarity. Now, let’s take a look at a couple of things: the volume name is listed as sefiles in the instance_name variable. During the time I was capturing statistics, the average number of operations per second was 93 ("total_ops"). Read latency ("read_latency") was 433us, which is 0.433ms. Notice that we did no writes ("write_ops") and basically no metadata operations either ("other_ops").

If you’re interested in seeing what the system is doing in general, I tend to go protocol-by-protocol. If you grep for nfsv3, you can start to wade through things:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_latency_hist"
nfsv3:nfs:nfsv3_read_latency_hist.0 - <1ms:472
nfsv3:nfs:nfsv3_read_latency_hist.1 - <2ms:7
nfsv3:nfs:nfsv3_read_latency_hist.2 - <4ms:10
nfsv3:nfs:nfsv3_read_latency_hist.4 - <6ms:8
nfsv3:nfs:nfsv3_read_latency_hist.6 - <8ms:7
nfsv3:nfs:nfsv3_read_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_read_latency_hist.10 - <12ms:1

During our statistics capture, approximately 472 requests were served at 1ms or better. 7 requests were served between 1ms and 2ms, 10 between 2ms and 4ms, and so on. If you keep looking, you’ll see write latencies as well:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_write_latency_hist"
nfsv3:nfs:nfsv3_write_latency_hist.0 - <1ms:1761
nfsv3:nfs:nfsv3_write_latency_hist.1 - <2ms:0
nfsv3:nfs:nfsv3_write_latency_hist.2 - <4ms:2
nfsv3:nfs:nfsv3_write_latency_hist.4 - <6ms:1
nfsv3:nfs:nfsv3_write_latency_hist.6 - <8ms:0
nfsv3:nfs:nfsv3_write_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_write_latency_hist.10 - <12ms:0

So that’s 1761 writes at 1ms or better. Your writes should always show excellent latencies as writes are acknowledged once they’re committed to NVRAM — the actual writing of the data to disk will occur at the next consistency point.

Keep going and you’ll see latency figures that are combined across all types of requests — reads, writes and other (i.e. metadata):

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_latency_hist"
nfsv3:nfs:nfsv3_latency_hist.0 - <1ms:2124
nfsv3:nfs:nfsv3_latency_hist.1 - <2ms:5
nfsv3:nfs:nfsv3_latency_hist.2 - <4ms:16
nfsv3:nfs:nfsv3_latency_hist.4 - <6ms:18
nfsv3:nfs:nfsv3_latency_hist.6 - <8ms:13
nfsv3:nfs:nfsv3_latency_hist.8 - <10ms:6
nfsv3:nfs:nfsv3_latency_hist.10 - <12ms:1
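One handy way to read these histograms is cumulatively: what fraction of all requests came in under each latency bound? Here’s a short sketch using the combined histogram values above (the summarising code is mine, not an ONTAP feature):

```python
# (bucket label, request count) pairs from the combined latency histogram.
hist = [("<1ms", 2124), ("<2ms", 5), ("<4ms", 16),
        ("<6ms", 18), ("<8ms", 13), ("<10ms", 6), ("<12ms", 1)]

total = sum(count for _, count in hist)
served = 0
for bucket, count in hist:
    served += count  # running total of requests at this latency or better
    print(f"{bucket:>6}: {count:5d} ops, {100 * served / total:5.1f}% cumulative")
```

With these numbers, over 97% of all NFSv3 operations completed in under a millisecond — a quick way to summarise a histogram for a report.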

Once you’ve established latencies, you may be interested to see the I/O sizes of the actual requests. Behold:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_size_histo"

These statistics are represented like the latencies were: 0 requests were served that were between 0-511 bytes in size, 0 between 512 & 1023 bytes, and so on; then, 328 requests were served that were between 4KB & 8KB in size and 71 requests between 8KB & 16KB.

When it comes to protocol operations, we break things down by the individual operation types — including metadata operations. I’ll show the counts here, but the output also includes percentages of requests and individual latencies for each operation:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_op_count"

In this case, most of my requests were reads & writes; there was very little metadata operation (just 4 attribute lookups). If you want to know what these operations represent, you can read some of my previous blog entries or check out the RFC that defines your protocol.

One last thing I should add: figuring out whether your workload is random or sequential. The easiest way I’ve found to do that is to look at the readahead stats, as the readahead engine is a particularly useful way of determining how your reads are operating. We don’t need to focus on writes as much, because any write much larger than 16KB is coalesced in memory and written sequentially — even if it started as random at the client. (I’ll talk about RAVE, the new readahead engine in DOT 8.1+, in later posts.)

I’ve filtered out some of the irrelevant stuff. So let’s gather what we can use and see:

# rsh charles "priv set -q diag; stats show readahead:readahead:total_read_reqs"

These statistics work much the same as our previous ones. 629 read requests were serviced that were 4KB blocks, 77 were serviced that were 8KB blocks, etc. The numbers continue right up to 1024KB. (Keep this in mind whenever someone tells you that WAFL & ONTAP can only read or write 4KB at a time!) Now if you want to see sequentiality:

# rsh charles "priv set -q diag; stats show readahead:readahead:seq_read_reqs"

We can see that, of the 4KB reads that we serviced, 36% were sequential. Mathematically this means that ~64% would be random, which we can confirm here:

# rsh charles "priv set -q diag; stats show readahead:readahead:rand_read_reqs"

Here, of our 4KB requests, 63% of them are random. Of our 8KB requests, 53% were random. Notice that the random percentages and sequential percentages for each block histogram will add up to ~100% depending on rounding. If you dig deeper into the stats, readahead can provide you with some useful information about how your data came to be read:

# rsh charles "priv set -q diag; stats show readahead:readahead"

You can interpret those numbers as follows. Again, I’ve removed the irrelevant stuff:

  • requested is the sum of all blocks requested by the client
  • read are the blocks actually read from disk (where the readahead engine had a “miss”)
  • incore are the blocks read from cache (where the readahead engine had a “hit”)
  • speculative are blocks that were readahead because we figured the client might need them later in the I/O stream
  • read_once are blocks that were not read ahead and were not cached because they didn’t match the cache policy (e.g. flexscale.enable.lopri_blocks being off)
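As a sanity check on the sequentiality numbers earlier: for each block size, the sequential and random percentages should sum to roughly 100%. A sketch with hypothetical counts modelled on the figures above (in the real counters the sequential and random requests are tracked separately, so rounding can make them sum to 99% or 101%):

```python
# Hypothetical readahead request counts per block size, modelled on
# the article's numbers (~36% of 4KB reads sequential, ~53% of 8KB
# reads random).
total_reads = {"4k": 629, "8k": 77}
seq_reads = {"4k": 228, "8k": 36}

for size in total_reads:
    seq_pct = 100 * seq_reads[size] / total_reads[size]
    rand_pct = 100 * (total_reads[size] - seq_reads[size]) / total_reads[size]
    print(f"{size}: {seq_pct:.0f}% sequential, {rand_pct:.0f}% random")
```

If the two percentages for a block size are wildly short of 100%, you’re probably comparing counters from different capture intervals.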

With this information, you should have a pretty good idea about how to characterize what your Filer is doing for workloads. Although I’ve focused on NFSv3 here, the same techniques can be applied to the other protocols. If you’ve got any questions, please holler!

How Data ONTAP caches, assembles and writes data

In this post, I thought I would try and describe how ONTAP works and why. I’ll also explain what the “anywhere” means in the Write Anywhere File Layout, a.k.a. WAFL.

A couple of common points of confusion are the role that NVRAM plays in performance, and where our write optimization comes from. I’ll attempt to cover both of these here. I’ll also try and cover how we handle parity. Before we start, here are some terms (and ideas) that matter. This isn’t NetApp 101, but rather NetApp 100.5 — I’m not going to cover all the basics, but I’ll cover the basics that are relevant here. So here goes!

What hardware is in a controller, anyway?

NetApp storage controllers contain some essential ingredients: hardware and software. In terms of hardware, the controller has CPUs, RAM and NVRAM. In terms of software, the controller has its operating system, Data ONTAP. Our CPUs do what everyone else’s CPUs do — they run our operating system. Our RAM does what everyone else’s RAM does — it holds our operating system, and it also has a very important function in that it serves as a cache. While our NVRAM does roughly what everyone else’s NVRAM does — i.e., it contains our transaction journal — a key difference is the way we use NVRAM.

That said, at NetApp we do things in unique ways. And although we have one operating system, Data ONTAP, we have several hardware platforms. They range from small controllers to big controllers, with medium controllers in between. Different controllers have different amounts of CPU, RAM & NVRAM, but the principles are the same!


Our CPUs run the Data ONTAP operating system, and they also process data for clients. Our controllers vary from a single dual-core CPU (in a FAS2220) to a pair of hex-core CPUs (in a FAS6290). The higher the amount of client I/O, the harder the CPUs work. Not all protocols are equal, though; serving 10 IOps via CIFS generates a different CPU load than serving 10 IOps via FCP.


Our RAM contains the Data ONTAP operating system, and it also caches data for clients. It is also the source for all writes that are committed to disk via consistency points. Writes do not come from NVRAM! Our controllers vary from 6GB of RAM (FAS2220) to 96GB of RAM (FAS6290). Not all workloads are equal, though; different features and functionality require different memory footprints.


Physically, NVRAM is little more than RAM with a battery backup. Our NVRAM contains a transaction log of client I/O that has not yet been written to disk from RAM by a consistency point. Its primary mission is to preserve that not-yet-written data in the event of a power outage or similar severe problem. Our controllers vary from 768MB of NVRAM (FAS2220) to 4GB of NVRAM (FAS6290). In my opinion, NVRAM’s function is perhaps the most commonly misunderstood part of our architecture. NVRAM is simply a double-buffered journal of pending write operations — a redo log, not the write cache! After data is written to NVRAM, it is not looked at again unless you experience a dirty shutdown. NVRAM’s importance to performance comes from how the software uses it, as we’ll see below.

In an HA pair environment where two controllers are connected to each other, NVRAM is mirrored between the two nodes. Its primary mission is to preserve that not-yet-written data in the event a partner controller suffers a power outage or similar severe problem. NVRAM mirroring happens for HA pairs in Data ONTAP 7-mode, HA pairs in clustered Data ONTAP and HA pairs in MetroCluster environments.

Disks and disk shelves

Disk shelves contain disks. DS14 shelves contain 14 drives; DS2246, DS4243 and DS4246 shelves contain 24 disks; and DS4486 shelves contain 48 disks. Disk shelves are connected to controllers via shelf modules, and those connections run either FC-AL (DS14) or SAS (DS2246 and the DS4xxx shelves) protocols for connectivity.

What software is in a controller, anyway?


Data ONTAP is our controller’s operating system. Almost everything sits here — from configuration files and databases to license keys, log files and some diagnostic tools. Our operating system is built on top of FreeBSD, and usually lives in a volume called vol0. Data ONTAP features implementations of protocols for client access (e.g. NFS, CIFS), APIs for programming access (ZAPI) and implementations of protocols for management access (SSH). It is fair to say that Data ONTAP is the heart of a NetApp controller.


WAFL is our Write Anywhere File Layout. If NVRAM’s role is the most-commonly misunderstood, WAFL comes in 2nd. Yet WAFL has a simple goal, which is to write data in full stripes across the storage media. WAFL acts as an intermediary of sorts — there is a top half where files and volumes sit, and a bottom half that interacts with RAID, manages SnapShots and some other things. WAFL isn’t a filesystem, but it does some things a filesystem does; it can also contain filesystems.

WAFL contains mechanisms for dealing with files & directories, for interacting with volumes & aggregates, and for interacting with RAID. If Data ONTAP is the heart of a NetApp controller, WAFL is the blood that it pumps.

Although WAFL can write anywhere we want, in reality we write where it makes the most sense: in the closest place (relative to the disk head) where we can write a complete stripe in order to minimize seek time on subsequent I/O requests. WAFL is optimized for writes, and we’ll see why below. Rather unusually for storage arrays, we can write client data and metadata anywhere.

A colleague has this to say about WAFL, and I couldn’t put it better:

There is a relatively simple “cheating at Tetris” analogy that can be used to articulate WAFL’s advantages. It is not hard to imagine how good you could be at Tetris if you were able to review the next thousand shapes that were falling into the pattern, rather than just the next shape.

Now imagine how much better you could be at Tetris if you could take any of the shapes from within the next thousand to place into your pattern, rather than being forced to use just the next shape that is falling.

Finally, imagine having plenty of time to review the next thousand shapes and plan your layout of all 1,000, rather than just a second or two to figure out what to do with the next piece that is falling. In summary, you could become the best Tetris player on Earth, and that is essentially what WAFL is in the arena of data allocation techniques onto underlying disk arrays.

The Tetris analogy is incredibly important, as it directly relates to the way that NetApp uses WAFL to optimize for writes. Essentially, we collect random I/O that is destined to be written to disk, reorganize it so that it resembles sequential I/O as much as possible, and then write it to disk sequentially. Another way of explaining this behavior is write coalescing: we reduce the number of operations that ultimately land on the disk, because we re-organize them in memory before we commit them to disk, and we wait until we have a bunch of them before committing them via a Consistency Point. Put another way, write coalescing allows us to avoid the common (and expensive) RAID workflow of “read-modify-write”.
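The coalescing idea can be sketched in a few lines: collect the dirty block numbers, sort them, and group them into contiguous runs that can each be written as one sequential stripe. This is purely an illustration of the concept — it is not ONTAP’s actual allocator:

```python
def coalesce(blocks):
    """Group dirty block numbers into contiguous runs (sequential stripes)."""
    runs = []
    for block in sorted(set(blocks)):
        if runs and block == runs[-1][-1] + 1:
            runs[-1].append(block)   # extends the current run
        else:
            runs.append([block])     # starts a new run
    return runs

# Nine random-looking client writes become three sequential runs:
dirty = [107, 4, 5, 105, 3, 106, 41, 40, 6]
print(coalesce(dirty))  # [[3, 4, 5, 6], [40, 41], [105, 106, 107]]
```

Nine random writes landing as three sequential runs means far fewer disk seeks than nine in-place updates would cost.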

Putting it all together

NetApp storage arrays are made up of controllers and disk shelves. Where the top and bottom is depends on your perspective: disks are grouped into RAID groups, and those RAID groups are combined to make aggregates. Volumes live in aggregates, and files and LUNs live in those volumes. A volume of CIFS data is shared to a client via an SMB share; a volume of NFS data is shared to a client via an NFS export. A LUN of data is shared to a client via an FCP, FCoE or iSCSI initiator group. Note the relationship here between controller and client — all clients care about are volumes, files and LUNs. They don’t care directly about CPUs, NVRAM or really anything else when it comes to hardware. There is data and I/O, and that’s it.

In order to get data to and from clients as quickly as possible, NetApp engineers have done much to try to optimize controller performance. A lot of the architectural design you see in our controllers reflects this. Although the clients don’t care directly about how NetApp is architected, our architecture does matter to the way their underlying data is handled and the way their I/O is served. Here is a basic workflow, from the inimitable Recovery Monkey himself:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. The client receives an acknowledgement that the data has been written

Sounds pretty simple right? On the surface, it is a pretty simple process. It also explains a few core concepts about NetApp architecture:

  • Because we acknowledge the client write once it’s hit NVRAM, we’re optimized for writes out of the box
  • Because we don’t need to wait for the disks, we can write anywhere we choose to
  • Because NVRAM is mirrored, we can survive an outage to either half of an HA pair

The 2nd bullet provides the backbone of this post. Because we can write data anywhere we choose to, we tend to write data in the best place possible. This is typically in the largest contiguous stripe of free space in the volume’s aggregate (closest to the disk heads). After 10 seconds have elapsed, or if NVRAM becomes >=50% full, we write the client’s data from RAM (not from NVRAM) to disk.

This operation is called a consistency point. Because we use RAID, a consistency point requires us to perform RAID calculations and to calculate parity. These calculations are processed by the CPUs using data that exists in RAM.

Many people think that our acceleration occurs because of NVRAM. Rather, a lot of our acceleration happens while we’re waiting for NVRAM. For example — and very significantly — we transmogrify random I/O from the client into sequential I/O before writing it to disk. This processing is done by the CPU and occurs in RAM. Another significant benefit is that because we calculate parity in RAM, we do not need to hammer the disk drives that contain parity information.

So, with the added step of a CP, this is how things really happen:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. When the partner NVRAM acknowledges the write, the client receives an acknowledgement that the data has been written [to disk]
  5. A consistency point occurs and the data (incl. parity) is written to disk

To re-iterate, writes are cached in the controller’s RAM at the same time as being logged into NVRAM (and the partner’s NVRAM if you’re using a HA pair). Once the data has been written to disk via a CP, the writes are purged from the controller’s NVRAM but retained (with a lower priority) in the controller’s RAM. The prioritization allows us to evict less commonly-used blocks in order to avoid overrunning the system memory. In other words, recently-written data is the first to be ejected from the first-level read cache in the controller’s RAM.

The parity issue also catches people out. Traditional filesystems and arrays write data (and metadata) into pre-allocated locations; Data ONTAP and WAFL let NetApp write data (and metadata) in whatever location will provide fastest access. This is usually a stripe to the nearest available set of free blocks. The ability of WAFL to write to the nearest available free disk blocks lets us greatly reduce disk seeking. This is the #1 performance challenge when using spinning disks! It also lets us avoid the “hot parity disk” paradigm, as WAFL always writes to new, free disk blocks using pre-calculated parity.

Consistency points are also commonly misunderstood. The purpose of a consistency point is simple: to write data to free space on disks. Once a CP has taken place, the contents of NVRAM are discarded. Incoming writes that are received during the actual process of a consistency point’s commitment (i.e., the actual writing to disk) will be written to disk in the next consistency point.

The workflow for a write:

1. Write is sent from the host to the storage system (via a NIC or HBA)
2. Write is processed into system memory while a) being logged in NVRAM and b) being logged in the HA partner’s NVRAM
3. Write is acknowledged to the host
4. Write is committed to disk storage in a consistency point (CP)

The importance of consistency points cannot be overstated — they are a cornerstone of Data ONTAP’s architecture! They are also the reason why Data ONTAP is optimized for writes: no matter the destination disk type, we acknowledge client writes as soon as they hit NVRAM, and NVRAM is always faster than any type of disk!

Consistency points typically occur every 10 seconds or whenever NVRAM begins to get full, whichever comes first. The latter is referred to as a “watermark”. Imagine that NVRAM is a bucket, and now divide that bucket in half. One side of the bucket is for incoming data (from clients) and the other side of the bucket is for outgoing data (to disks). As one side of the bucket fills, the other side drains. In an HA pair, we actually divide NVRAM into four buckets: two for the local controller and two for the partner controller. (This is why, when activating or deactivating HA on a system, a reboot is required for the change to take effect. The 10-second rule is why, on a system that is doing practically no I/O, the disks will always blink every 10 seconds.)

The actual value of the watermark varies depending on the exact configuration of your environment: the Filer model, the Data ONTAP version, whether or not it’s HA, and whether SATA disks are present. Because SATA disks write more slowly than SAS disks, consistency points take longer to write to disk if they’re going to SATA disks. In order to combat this, Data ONTAP lowers the watermark in a system when SATA disks are present.
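The divided-bucket behaviour can be sketched as a toy model: one half of the journal fills while the other drains to disk, and a consistency point fires when the filling half hits its watermark (the 10-second timer is omitted here, and the sizes and 50% watermark are illustrative — as noted above, the real watermark varies by configuration):

```python
class NvramJournal:
    """Toy model of double-buffered NVRAM with a CP watermark."""

    def __init__(self, size_mb, watermark=0.5):
        self.capacity = size_mb / 2   # each half-bucket gets half the space
        self.watermark = watermark
        self.filling = 0              # MB logged in the active half
        self.cp_count = 0

    def log_write(self, mb):
        self.filling += mb
        if self.filling >= self.capacity * self.watermark:
            self.trigger_cp()

    def trigger_cp(self):
        # The full half starts draining to disk; the empty half takes over.
        self.filling = 0
        self.cp_count += 1

journal = NvramJournal(size_mb=768)   # e.g. a FAS2220's NVRAM size
for _ in range(100):
    journal.log_write(4)              # a stream of 4MB writes
print(journal.cp_count)  # prints: 2
```

Lowering the watermark (as ONTAP does when slower SATA disks are present) simply makes CPs fire earlier, so each one has less data to drain.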

The size of NVRAM only really dictates a Filer’s performance envelope when doing a large amount of large sequential writes. But with the new FAS80xx family, NVRAM sizes have grown massively — up to 4x those of the FAS62xx family.

Getting started with Clustered Data ONTAP & FC storage

A couple of days ago, I helped a customer get their cDOT system up and running using SAN storage. They had inherited a Cisco MDS switch running NX-OS, and were having trouble getting the devices to log in to the fabric.

As you may know, Clustered Data ONTAP requires NPIV when using Fibre Channel storage — i.e., hosts connecting to a NetApp cluster via the Fibre Channel protocol. NPIV is N-Port ID Virtualization for Fibre Channel, and should not be confused with NPV — which is simply N-Port Virtualization. Scott Lowe has an excellent blog post comparing and contrasting the two.

NetApp uses NPIV in order to abstract away the underlying hardware (i.e., FC HBAs) to the client-facing hardware (i.e., Storage Virtual Machine Logical Interfaces). The use of logical interfaces, or LIFs, allows us not only to carve up a single physical HBA port into many logical ports, but also for the WWPNs to be different. This is particularly useful when it comes to zoning — if you buy an HBA today, you’ll create your FC zone based on the LIF WWPNs and not the HBA’s.

For example, I have a two-node FAS3170 cluster, and each node has two FC HBAs:

dot82cm::*> fcp adapter show -fields fc-wwnn,fc-wwpn
node       adapter fc-wwnn                 fc-wwpn                 
---------- ------- ----------------------- ----------------------- 
dot82cm-01 2a      50:0a:09:80:89:6a:bd:4d 50:0a:09:81:89:6a:bd:4d 
dot82cm-01 2b      50:0a:09:80:89:6a:bd:4d 50:0a:09:82:89:6a:bd:4d 

(Note that that command needs to be run in a privileged mode in the cluster shell.) But the LIFs have different port addresses, thanks to NPIV:

dot82cm::> net int show -vserver vs1
  (network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
                         up/up    20:05:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2a      true
                         up/up    20:06:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2b      true

So, I have one Vserver (sorry, SVM!) with four LIFs. If I ever remove my dual-port 8Gb FC HBAs and replace them with, say, dual-port 16Gb FC HBAs, the port names on the LIFs that are attached to the SVM will not change. So when you zone your FC switch, you’ll use the LIF WWPNs.

Speaking of FC switches, let’s look at what we need. I’m using a Cisco Nexus 5020 in my lab, which means I’ll need the NPIV (not NPV!) license enabled. To verify if you have that license, it’s pretty simple:

nane-nx5020-sw# show feature | i npiv
npiv                  1         enabled

That’s pretty much it. For a basic fabric configuration on a Nexus, you need the following to work with cluster-mode:

  1. An NPIV license
  2. A Virtual Storage Area Network, or VSAN
  3. A zoneset
  4. A zone

I’m using a VSAN of 101; for most environments the default VSAN is VSAN 1. I have a single zoneset, which contains a single zone. I’m using aliases to make the zone slightly easier to manage.

Here is the zoneset:

nane-nx5020-sw# show zoneset brief vsan 101
zoneset name vsan101-zoneset vsan 101
  zone vsan101-zone

You can see that the zoneset is named vsan101-zoneset, and it’s in VSAN 101. The member zone is rather creatively named vsan101-zone. Let’s look at the zone’s members:

nane-nx5020-sw# show zone vsan 101
zone name vsan101-zone vsan 101
  fcalias name ucs-esxi-1-vmhba1 vsan 101
    pwwn 20:00:00:25:b5:00:00:1a
  fcalias name dot82cm-01_fc_lif_1 vsan 101
    pwwn 20:05:00:a0:98:0d:e7:76

Note that I have two devices defined by aliases — my ESXi host and one of the NetApp LIFs — and that each alias contains the relevant WWPN. Make sure you commit your zone changes and activate your zoneset!

Once you’ve configured your switch appropriately, you need to do four things from the NetApp perspective:

  1. Create an initiator group
  2. Populate that initiator group with the host’s WWPNs
  3. Create a LUN
  4. Map the LUN to the relevant initiator group
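
Sketched out in cDOT command syntax, those four steps look roughly like this. (These commands are illustrative rather than a capture from my cluster; the names match the output below.)

dot82cm::> igroup create -vserver vs1 -igroup vm5_fcp_igrp -protocol fcp -ostype vmware
dot82cm::> igroup add -vserver vs1 -igroup vm5_fcp_igrp -initiator 20:00:00:25:b5:00:00:1a
dot82cm::> lun create -vserver vs1 -path /vol/vm5_fcp_volume/vm5_fcp_lun1 -size 250GB -ostype vmware
dot82cm::> lun map -vserver vs1 -path /vol/vm5_fcp_volume/vm5_fcp_lun1 -igroup vm5_fcp_igrp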

When creating your initiator group, you’ll need to select a host type. This will ensure the correct ALUA settings, amongst others. After the initiator group is populated, it should look something like this:

dot82cm::> igroup show -vserver vs1
Vserver   Igroup       Protocol OS Type  Initiators
--------- ------------ -------- -------- ------------------------------------
vs1       vm5_fcp_igrp fcp      vmware   20:00:00:25:b5:00:00:1a

We’re almost there! Now all we need to do is map the initiator group to a LUN. I’ve already done this for one LUN:

dot82cm::> lun show -vserver vs1
Vserver   Path                            State   Mapped   Type        Size
--------- ------------------------------- ------- -------- -------- --------
vs1       /vol/vm5_fcp_volume/vm5_fcp_lun1 
                                          online  mapped   vmware      250GB

We can see that the LUN is mapped, but how do we know which initiator group it’s mapped to?

dot82cm::> lun mapped show -vserver vs1
Vserver    Path                                      Igroup   LUN ID  Protocol
---------- ----------------------------------------  -------  ------  --------
vs1        /vol/vm5_fcp_volume/vm5_fcp_lun1          vm5_fcp_igrp  0  fcp

Now we have all the pieces in place! We have a Vserver (or SVM), vs1. It contains a volume, vm5_fcp_volume, which in turn contains a single LUN, vm5_fcp_lun1. That LUN is mapped to an initiator group called vm5_fcp_igrp of type vmware, over protocol FCP. And that initiator group contains a single WWPN that corresponds to the WWPN of my ESXi host.

Clear as mud?

Getting started with NetApp Storage QoS

Storage QoS is a new feature in Clustered Data ONTAP 8.2. It is a full-featured QoS stack, which replaces the FlexShare stack of previous ONTAP versions. So what does it do and how does it work? Let’s take a look!

The administration of QoS involves two parts: policy groups, and policies. Policy groups define boundaries between workloads, and contain one or more storage objects. We can monitor, isolate and limit the workloads of storage objects from the coarsest granularity down to the finest: entire Vservers, whole volumes, individual LUNs, all the way down to single files.

The actual policies are behavior modifiers that are applied to a policy group. Right now, we can set throughput limits based on operation counts (i.e., IOps) or throughput counts (i.e., MB/s). When limiting throughput, storage QoS throttles traffic at the protocol stack. Therefore, a client whose I/O is being throttled will see queuing in their protocol stack (e.g., CIFS or NFS) and latency will eventually rise. However, the addition of this queuing will not affect the NetApp cluster’s resources.

In addition to the throttling of workloads, storage QoS also includes very effective measuring tools. And because QoS is “always on”, you don’t even need to have a policy group in order to monitor performance.

So, let’s get started. Applying our first policy to a workload takes three steps, plus a fourth to keep an eye on the result:

  1. Create the policy (with qos policy-group create...)
  2. Apply the policy to a Vserver (with vserver modify...)
  3. Apply the policy to a volume (with vol modify...)
  4. Monitor what’s going on (with qos statistics...)
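
The steps above can be sketched as follows, using the names from the walkthrough below:

dot82cm::> qos policy-group create blog-group -vserver vs0 -max-throughput 1500iops
dot82cm::> vserver modify -vserver vs0 -qos-policy-group blog-group
dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group blog-group
dot82cm::> qos statistics workload characteristics show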

Before we start, let’s verify that we’re starting from a clean slate:

dot82cm::> qos policy-group show
This table is currently empty.

Okay, good — no policy groups exist yet. Step one of three is to create the policy group itself, which we’ll call blog-group. In reality, you’d specify a throughput limit (either IOps or MB/s), but for now we won’t bother limiting the throughput:

dot82cm::> qos policy-group create blog-group -vserver vs0

Let’s make sure the policy group was created:

dot82cm::> qos policy-group show                    
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined -     0-INF

But, because we didn’t specify a throughput limit, the Throughput column is still showing 0 to infinity. Let’s add a limit of 1500 IOps:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 1500iops

(If we wanted to limit that volume to 1500MB/s instead, we could have substituted 1500mb/s for 1500iops.)

And verify:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 0     0-1500IOPS

So, step three is to associate the new policy group with an actual object whose I/O we wish to throttle. The object can be one or many volumes, LUNs or files. For now though, we’ll apply it to a single volume, blog_volume:

dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group blog-group

Volume modify successful on volume: blog_volume

Let’s confirm that it was successfully modified:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-1500IOPS

Cool! We can see that the workload count (Wklds) has gone from 0 to 1. I’ve mounted that volume via NFS on a Linux VM, and will throw a bunch of workloads at it using dd.
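
For reference, the load generator is nothing fancier than a few parallel dd runs along these lines. (The mount point /mnt/blog_volume and file names are just illustrations of where I happened to mount the volume; the 64k block size lines up with the ~64KB request sizes in the stats below.)

linux-vm$ dd if=/dev/zero of=/mnt/blog_volume/testfile1 bs=64k count=100000 oflag=direct &
linux-vm$ dd if=/dev/zero of=/mnt/blog_volume/testfile2 bs=64k count=100000 oflag=direct &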

While the workload is running, here’s how it looks:

dot82cm::> qos statistics workload characteristics show                         
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      392         5.26MB/s          14071B     14%          14 
_USERSPACE_APPS     14      170       109.46KB/s            659B     32%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792     1679       104.94MB/s          65527B      0%           4

As you can see, our volume blog_volume is pretty busy — it’s pushing almost 1,700 IOps at over 100MB/sec. So, let’s see if the throttling is effective. First, we’ll give the policy group a low throughput maximum:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 100iops

Now let’s check its status:

dot82cm::> qos policy-group show
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-100IOPS

Now let’s see how the Filer is doing:

dot82cm::> qos statistics workload characteristics show
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      384         6.71MB/s          18333B     11%          33 
_USERSPACE_APPS     14      169         2.50MB/s          15528B     26%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792       83         5.19MB/s          65536B      0%          15 
-total-              -      207         4.81MB/s          24348B      0%          17

You can see that the throughput has gone way down! In fact it’s gone below our limit of 100 IOps. And that, of course, is what’s supposed to happen. Now let’s remove the limit and see if things return to normal:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput none

dot82cm::> qos statistics workload characteristics show                            
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -     1073        44.37MB/s          43363B      8%           1 
blog_volume-w..  11792      626        39.12MB/s          65492B      0%           1 
_USERSPACE_APPS     14      302         4.90MB/s          17041B     29%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
-total-              -      263       471.86KB/s           1837B     19%           0

Because we can apply the QoS policy groups to entire Vservers, volumes, files & LUNs, it is important to keep track of what’s applied where. This is how you’d apply a policy group to an individual volume:

dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group blog-group

(To remove the policy, set the -qos-policy-group field to none.)

To apply a policy group against an entire Vserver (in this case, Vserver vs0):

dot82cm::> vserver modify -vserver vs0 -qos-policy-group blog-group

(Again, to remove the policy, set the -qos-policy-group field to none.)

To see which volumes are assigned our policy group:

dot82cm::> volume show -vserver vs0 -qos-policy-group blog-group                       
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
vs0       blog_volume  aggr1_node1  online     RW          1GB    365.8MB   64%

To see all volumes in all Vservers’ QoS policy groups:

dot82cm::> volume show -vserver * -qos-policy-group * -fields vserver,volume,qos-policy-group

vserver    volume      qos-policy-group 
---------- ----------- ---------------- 
vs0        blog_empty  -                
vs0        blog_volume blog-group       

To see a Vserver’s full configuration, including its QoS policy group:

dot82cm::> vserver show -vserver vs0

                                    Vserver: vs0
                               Vserver Type: data
                               Vserver UUID: c280658e-bd77-11e2-a567-123478563412
                                Root Volume: vs0_root
                                  Aggregate: aggr1_node2
                        Name Service Switch: file, nis, ldap
                        Name Mapping Switch: file, ldap
                                 NIS Domain:
                 Root Volume Security Style: unix
                                LDAP Client: -
               Default Volume Language Code: C
                            Snapshot Policy: default
                 Antivirus On-Access Policy: default
                               Quota Policy: default
                List of Aggregates Assigned: -
 Limit on Maximum Number of Volumes allowed: unlimited
                        Vserver Admin State: running
                          Allowed Protocols: nfs, cifs, ndmp
                       Disallowed Protocols: fcp, iscsi
            Is Vserver with Infinite Volume: false
                           QoS Policy Group: -

To get a list of all Vservers by their policy groups:

dot82cm::> vserver show -vserver * -qos-policy-group * -fields vserver,qos-policy-group 
vserver qos-policy-group 
------- ---------------- 
dot82cm -                
vs0     blog-group       
4 entries were displayed.

If you’re in a hurry and want to remove all instances of a policy from volumes in a particular Vserver:

dot82cm::> vol modify -vserver vs0 -volume * -qos-policy-group none

That should be enough to get us going. Stay tuned, because in the next episode I’ll show some video with iometer running!


SnapMirror data between cDOT 8.1 and cDOT 8.2

I wanted to quickly put this together as I had some data I wanted to move from a cDOT 8.1 cluster to a cDOT 8.2 cluster. At the time of writing, you cannot do this procedure with OnCommand System Manager GUI — it has to be done through the command-line. Big thanks to Doug & Khalif for providing me some troubleshooting help.

Although this seems like a long post, the process is quite simple:

  1. Create the inter-cluster logical interfaces (LIFs) on each cluster
  2. Create a peer relationship between clusters
  3. Create SnapMirror relationships between clusters for volumes to be transferred
  4. Initialize the SnapMirror relationships to transfer the data
  5. Quiesce, break and remove the SnapMirror relationships

As you can see, the naming scheme I’m using is very simple. The cDOT 8.1 cluster is called dot81cm, and the cDOT 8.2 cluster is called dot82cm. On both clusters, the Vserver is named vserver1, and the volume is called newDatastore on both the source and destination. Of course, any or all of these parameters can be changed in your environment; the Vserver names don’t need to be the same, nor do the volume names. However, you will need to ensure that DNS resolution works between the two clusters.

First, we’ll probably need to create the inter-cluster LIFs to connect the two clusters together. Note that you can use dedicated inter-cluster LIFs or you can use shared ones. Here, I’m creating dedicated LIFs:

dot81cm::> network interface create -vserver dot81cm-01 -lif migration_iclif_81 -role intercluster -home-node dot81cm-01 -home-port e3a -address -netmask
dot82cm::> network interface create -vserver dot82cm-01 -lif migration_iclif_82 -role intercluster -home-node dot82cm-01 -home-port e0a -address -netmask

Note that the inter-cluster LIFs are created on the node Vserver (i.e., the node dot81cm-01) and not on a data Vserver. In a larger environment, if you’re concerned about traffic load or failover resources, you may want to create more than one LIF. Also note that you’ll need to select your own home-node and home-port (e.g., e0a) values. And, in my case, the two LIFs are on the same network, so I didn’t need to worry about routing between them. You can find some useful information about creating the LIFs in the Clustered Data ONTAP® 8.2 Cluster and Vserver Peering Express Guide.
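
For completeness, here’s the shape of the command with addresses filled in. The 192.0.2.x addresses and netmask are placeholders; use whatever your inter-cluster network actually requires:

dot81cm::> network interface create -vserver dot81cm-01 -lif migration_iclif_81 -role intercluster -home-node dot81cm-01 -home-port e3a -address 192.0.2.81 -netmask 255.255.255.0
dot82cm::> network interface create -vserver dot82cm-01 -lif migration_iclif_82 -role intercluster -home-node dot82cm-01 -home-port e0a -address 192.0.2.82 -netmask 255.255.255.0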

Create the cluster peer in cDOT 8.2 first:

dot82cm::> cluster peer create -peer-addrs

Notice: Successfully configured the local cluster. Execute this command on the
        remote cluster to complete the peer setup.

Now, create the cluster peer in cDOT 8.1:

dot81cm::> cluster peer create -peer-addrs
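
With placeholder addresses, each side’s command points at the other cluster’s inter-cluster LIF. (Again, the 192.0.2.x addresses are placeholders, and you may also be prompted for admin credentials on the remote cluster.)

dot82cm::> cluster peer create -peer-addrs 192.0.2.81
dot81cm::> cluster peer create -peer-addrs 192.0.2.82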

Verify the existence of the peer relationship:

dot82cm::> cluster peer show
Peer Cluster Name          Cluster Serial Number Availability
-------------------------- --------------------- ---------------
dot81cm                    1-80-000011           Available
dot81cm::> cluster peer show
Peer Cluster Name          Cluster Serial Number Availability
-------------------------- --------------------- ---------------
dot82cm                    1-80-000013           Available

Create the destination volume and configure it for data protection:

dot82cm::> vol create -vserver vserver1 -vol newDatastore -aggr aggr1_node1 -size 100GB -type DP 

Create the SnapMirror relationship. A key point here is that you must create the relationship on the cDOT 8.2 cluster:

dot82cm::> snapmirror create -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore -type DP

Initialize the SnapMirror relationship:

dot82cm::> snapmirror initialize -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore -type DP

You should see that a job is created. Find the job:

dot82cm::> job show
Job ID Name                 Vserver    Node           State
------ -------------------- ---------- -------------- ----------
1820   SnapMirror initialize 
                            dot82cm    dot82cm-01     Running
       Description: snapmirror initialize of destination dot82cm://vserver1/newDatastore

Find more status about the job:

dot82cm::> job watch-progress 1820
3.75GB sent for 1 of 1 Snapshot copies, transferring Snapshot copy snapmirror.aa

…and that’s pretty much it! Hit Ctrl-C to exit out of the job watch. If you’re only planning on transferring the data once, you can now finalize the migration by quiescing, breaking and deleting the SnapMirror relationship. (You can consider quiescing as an optional step.)

dot82cm::> snapmirror quiesce -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

Now you can break the SnapMirror relationship:

dot82cm::> snapmirror break -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

And now you can delete the relationship:

dot82cm::> snapmirror delete -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

And we’re done!

Differences between NetApp Flash Cache and Flash Pool

NetApp’s Flash Cache product has been around for several years; the Flash Pool product is much newer. While there is a lot of overlap between the two products, there are distinct differences as well. The goal of both technologies is the same: to accelerate reads by serving data from solid-state memory instead of spinning disk. Although they produce the same end result, they are built differently and they operate differently.

First of all, why do we bother with read (and write) caching? Because there’s a significant speed differential depending on the source from which we serve data. RAM is much faster than SSD, and SSD is much faster than rotating disk. This graphic illustrates the paradigm:


(Note that we can optimize the performance of 100%-rotating disk environments with techniques like read-aheads, buffering and the like.)

Sequential reads can be served from rotating disks extremely quickly, and very rarely need to be accelerated by solid-state storage. Random reads are much better served from Flash than from rotating disks, and very recent random reads can be served from buffer cache RAM (i.e., the controller’s RAM), which is an order of magnitude faster again. Although Flash Pool and Flash Cache both provide a caching mechanism for those random reads (and some other I/O operations), they do so in different ways.

Here is a quick overview of some of the details that come up when comparing the two technologies:

Detail                           Flash Cache                                Flash Pool
-------------------------------  -----------------------------------------  ----------------------------------------
Physical entity                  PCI-e card                                 SSD drives
Physical location                Controller head                            Disk shelf
Logical location                 Controller bus                             Disk stack/loop
Logical accessibility            Controller head                            Disk stack/loop
Cache mechanism                  First-in, first-out                        Temperature map
Cache persistence on failover    Requires re-warm                           Yes
Cache data support               Reads                                      Reads, random overwrites
Cache metadata support           Yes                                        Yes
Cache sequential data support    Yes, with lopri mode                       No
Infinite Volume support          Yes                                        No
Management granularity           System-level for all aggregates & volumes  Aggregate-level with per-volume policies
32-bit aggregate support         Yes                                        No
64-bit aggregate support         Yes                                        Yes
RAID protection                  N/A                                        RAID-4, RAID-DP (recommended)
Minimum quantity                 1 PCI-e card                               3 SSD drives
Minimum Data ONTAP version       7.3.2                                      8.1.1
Removable                        Yes                                        Yes, but aggregate must be destroyed

The most basic difference between the two technologies is that Flash Cache is a PCI-e card (or cards) that sits in the NetApp controller, whereas Flash Pool is a set of SSD drives that sit in NetApp shelves. This leads to some important points. First, Flash Cache memory that is accessed via the PCI-e bus will always have much higher potential throughput than Flash Pool memory that sits on a SAS stack. Second, the two architectures result in different means of logical accessibility: with Flash Cache, any and all aggregates that sit on a controller with a Flash Cache card can be accelerated; with Flash Pool, any and all aggregates that sit in a disk stack (or loop) with Flash Pool SSDs can be accelerated. It also means that in the event of planned downtime (such as a takeover/giveback), the Flash Cache card’s cache data can be copied to the HA pair’s partner node (“rewarming”), but in the event of unplanned downtime (such as a panic), the Flash Cache card’s cache data will be lost. The corollary is that if a controller with Flash Pool fails, the aggregate will fail over to the partner controller and the Flash Pool cache will remain accessible there.

Another important difference is Flash Pool’s ability to cache random overwrites. First, let’s define what we’re talking about: a random overwrite is a small, random write to a block (or blocks) that was recently written to HDD and is now being overwritten. The distinction matters, because Flash Pool does not accelerate traditional sequential writes; Data ONTAP is already optimized for those. Rather, Flash Pool caches random overwrites by allowing the Consistency Points (CPs) that contain them to be written into the SSD cache. Writing to SSD is up to 2x quicker than writing to rotating disk, which means the Consistency Point completes faster. Random overwrites are the most expensive (i.e., slowest) operation we can perform against a rotating disk, so it’s in our interest to accelerate them.

Although Flash Cache cannot cache random overwrites, it can function as a write-through cache via lopri mode. In some workloads, recently-written data is read back immediately after being written (the so-called “read after write” scenario). For these workloads, Flash Cache can improve performance by caching recently-written blocks rather than having to seek the rotating disks for that recently-written data. Note that we are not writing the client data into Flash Cache primarily; rather, we are writing the data to rotating disk through the Flash Cache, hence a write-through cache. Flash Cache serves as a write-through cache under two scenarios: with lopri disabled, it does so only while the card is less than 70% full; with lopri enabled, it always does. Enabling lopri mode also causes Flash Cache to cache sequential reads, whereas Flash Pool has no mechanism for caching sequential data.

Continuing this exercise are the differences between how Flash Cache and Flash Pools actually cache data. Flash Cache is populated with blocks of data that are being evicted from the controller’s primary memory (i.e. its RAM) but are requested by client I/O. Flash Cache utilizes a first-in, first-out algorithm: when a new block of data arrives in the Flash Cache, an existing block of data is purged. When the next new block arrives, our previous block is now one step closer to its own eviction. (The number of blocks that can exist in cache depends on the size of your Flash Cache card.)

Flash Pool is initially populated the same way: when blocks of data match the cache insertion policy and have been evicted from the controller’s primary memory but are requested by client I/O. Flash Pool utilizes a temperature map algorithm: when a new block of data arrives in the Flash Pool, it is assigned a neutral temperature. Data ONTAP keeps track of the temperature of blocks by forming a heat map of those blocks.

This is what a temperature map looks like, and shows how a read gets evicted:


You can see the read cache process: a block gets inserted and labeled with a neutral temperature. If the block gets accessed by clients, the scanner sees this and increases the temperature of the block — meaning it will stay in cache for longer. If the block doesn’t get accessed, the scanner sees this and decreases the temperature of the block. When it gets cold enough, it gets evicted. This means that we can keep hot data while discarding cold data. Note that the eviction scanner only starts running when the cache is at least 75% utilized.

This is how a random overwrite gets evicted:


The hotter the block, the farther right on the temperature gauge and the longer the block is kept in cache. Again, you can see the process: a block gets inserted and labeled with a neutral temperature. If the block gets randomly overwritten again by clients, the block is re-inserted and labeled with a neutral temperature. Unlike a read, a random overwrite cannot have its temperature increased — it can only be re-inserted.

Note that I mentioned Flash Pool actually has RAID protection. This is to ensure the integrity of the cached data on disk, but it’s important to remember that read cache data is just that – a cache of read data. The permanent copy of the data always lives on rotating disks. If your Flash Pool SSD drives all caught fire at the same time, the read cache would be disabled but you would not lose data.