Managing cDOT images and packages

When you upgrade a cDOT cluster, Data ONTAP keeps copies of both the downloaded installation files (the tarballs) and the unpacked installation packages.

Note that the downloaded images are kept on each node, so we use a system image command to manipulate them:

dot83cm::> system image package show
Node         Repository     Package File Name
------------ -------------- -----------------
11 entries were displayed.

Those can be deleted simply:

dot83cm::> system image package delete -node * -package 83P1_q_image.tgz
1 entries were deleted.

You can delete them from one node or from many. To verify that the package is gone:

dot83cm::> system image package show
Node         Repository     Package File Name
------------ -------------- -----------------
9 entries were displayed.
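
If you only need to clear a package from a single node rather than the whole cluster, name the node explicitly instead of using the wildcard. A minimal sketch, assuming a node called dot83cm-01 and an illustrative package file name:

dot83cm::> system image package delete -node dot83cm-01 -package 832P3_q_image.tgz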

Note that we also keep the unpacked packages, only this time they're cluster-wide, so we use a cluster image command to see what's out there:

dot83cm::> cluster image package show
Package Version  Package Build Time
---------------- ------------------
8.3.1            8/31/2015 08:42:08
8.3.2            2/23/2016 22:29:11
8.3.2P1          4/14/2016 11:58:25
8.3.2P3          6/22/2016 04:16:50
8.3.2RC1         11/5/2015 08:41:17
8.3.2RC2         12/8/2015 05:24:13
8.3P1            4/7/2015 12:05:35
8.3P2            5/19/2015 03:34:02
8 entries were displayed.

They can also be deleted:

dot83cm::> cluster image package delete -version 8.3.1
Package Delete Operation Completed Successfully

Because it's a cluster-wide operation, there's no need to specify a node. Verifying again:

dot83cm::> cluster image package show
Package Version  Package Build Time
---------------- ------------------
8.3.2            2/23/2016 22:29:11
8.3.2P1          4/14/2016 11:58:25
8.3.2P3          6/22/2016 04:16:50
8.3.2RC1         11/5/2015 08:41:17
8.3.2RC2         12/8/2015 05:24:13
8.3P1            4/7/2015 12:05:35
8.3P2            5/19/2015 03:34:02
7 entries were displayed.
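
Deleting packages from the repository doesn't change what each node is actually running. If you want to double-check the active version before you clean up, ONTAP 8.3 has a handy command for that (output omitted here):

dot83cm::> cluster image show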

Using CDP to understand Data ONTAP networking

Have you ever been on a Data ONTAP system without a clear idea of how the physical network is connected, and wished you could interrogate the network to find out? If so, CDP, the Cisco Discovery Protocol, might be the help you're looking for. It can be very useful on systems with large or complex Ethernet configurations. Once CDP is enabled in Data ONTAP, your Cisco switches will become aware of which NetApp systems are cabled to which ports. They'll know both the source port (on the NetApp) and the destination port (on the Cisco).

CDP is supported in both 7-mode and cDOT, and is simply enabled with an option command.

To enable CDP in 7-mode:

options cdpd.enable on

To enable CDP in cDOT:

node run -node * options cdpd.enable on
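
To confirm the option took effect, or to check its current setting beforehand, query the option without supplying a value. The same trick works in 7-mode if you drop the node run wrapper:

node run -node * options cdpd.enable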

The nice thing about NetApp's CDP implementation is that it's bi-directional: you can query CDP from either the Cisco switch or the NetApp controller, so you don't have to rely on a network administrator to dig the information out for you.

To view CDP information from 7-mode Data ONTAP, you would use the cdpd show-neighbors command.

Here’s some sample output:

mystic> cdpd show-neighbors
Local  Remote          Remote                 Remote           Hold  Remote    
Port   Device          Interface              Platform         Time  Capability
------ --------------- ---------------------- ---------------- ----- ---------- 
e0M    charles         e0M                    FAS3170           146   H         
e0M    nane-cat4948-sw GigabitEthernet1/8     cisco WS-C4948-.  174   RSI       
e3a    nane-nx5010-sw. Ethernet1/4            N5K-C5010P-BF     173   SI        
e4a    nane-nx5010-sw. Ethernet1/14           N5K-C5010P-BF     177   SI

Note that we can see this Filer's HA partner, charles, in the output. We can also see that e0M is cabled to port GigabitEthernet1/8 on nane-cat4948-sw, while e3a and e4a are cabled to Eth1/4 and Eth1/14 on nane-nx5010-sw respectively.

This is incredibly useful information if you’re ever trying to track down how a system is cabled!

And from clustered Data ONTAP, to view CDP neighbor information you would use the node run -node nodeName cdpd show-neighbors command.

The output is the same format as in 7-mode:

dot83cm::> node run -node local cdpd show-neighbors
Local  Remote          Remote                 Remote           Hold  Remote    
Port   Device          Interface              Platform         Time  Capability
------ --------------- ---------------------- ---------------- ----- ---------- 
e6a    nane-nx5010-sw. Ethernet1/12           N5K-C5010P-BF     145   SI        
e6b    nane-nx5010-sw. Ethernet1/5            N5K-C5010P-BF     145   SI        
e4a    dot83cm-01      e4a                    FAS3240           161   H         
e4b    dot83cm-01      e4b                    FAS3240           161   H         
e0a    nane-cat4948-s. GigabitEthernet1/9     cisco WS-C4948-.  168   RSI

In this case, e6a and e6b go to the same switch, with e4a and e4b going to the other node in this HA pair — that’s my switchless cluster interconnect. e0a goes to an older Catalyst switch.

And to view CDP information from Cisco IOS or NX-OS, you would use the show cdp neighbors command.

Some sample output:

nane-nx5010-sw# show cdp neighbors
Capability Codes: R - Router, T - Trans-Bridge, B - Source-Route-Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater,
                  V - VoIP-Phone, D - Remotely-Managed-Device,
                  s - Supports-STP-Dispute, M - Two-port Mac Relay

Device ID              Local Intrfce   Hldtme  Capability  Platform      Port ID
US-WLM-LS02            mgmt0           124     R S I       WS-C6509      Gig5/1 
nane-cat4948-sw        Eth1/2          179     R S I       WS-C4948-10GE Ten1/49
dot83cm-01             Eth1/3          163     H           FAS3240       e6b    
mystic                 Eth1/4          127     H           FAS3170       e3a    
dot83cm-02             Eth1/5          158     H           FAS3240       e6b

More detailed information about the output of CDP commands can be found in the relevant Network Management Guides, which are linked below. CDP has been supported since Data ONTAP 8.0.

In IOS/NX-OS, you may wish to run show cdp neighbors detail to gather more information.
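
If you're chasing a single cable, you can also narrow the detailed output to one switch port. A small sketch on NX-OS, with the interface number assumed:

nane-nx5020-sw# show cdp neighbors interface ethernet 1/4 detail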


Characterizing workloads in Data ONTAP

One of the most important things when analyzing Data ONTAP performance is being able to characterize the workload that you’re seeing. This can be done a couple of ways, but I’ll outline one here. Note that this is not a primer on analyzing storage bottlenecks — for that, you’ll have to attend Insight this year! — but rather a way to see how much data a system is serving out and what kind of data it is.

When it comes to characterizing workloads on a system, everyone has a different approach — and that’s totally fine. I tend to start with a few basic questions:

  1. Where is the data being served from?
  2. How much is being served?
  3. What protocols are doing the most work?
  4. What kind of work are the protocols doing?

You can do this with the stats command if you know which counters to look for. And if you find a statistic you don't know the meaning of, there's a decent built-in help system: stats explain counters. For example:

# rsh charles "priv set -q diag; stats explain counters readahead:readahead:incore"
Counters for object name: readahead
Name: incore
Description: Total number of blocks that were already resident
Properties: delta
Unit: none

Okay, so that’s not the greatest explanation, but it’s a start! More commonly-used counters are going to have better descriptions.

When it comes to analyzing your workload, a good place to start is the read_io_type counter in the wafl object. These statistics are not normally collected, so you have to start and stop the counting yourself. Here's an example on a 7-mode system; I'm using rsh non-interactively because there are a lot of counters!

# rsh charles "priv set -q diag; stats show wafl:wafl:read_io_type"

Here is an explanation of these counters:

  • cache are reads served from the Filer’s RAM
  • ext_cache are reads served from the Filer’s Flash Cache
  • disk are reads served from any model of spinning disk
  • bamboo_ssd are reads served from any model of solid-state disk
  • hya_hdd are reads served from the spinning disks that are part of a Flash Pool
  • hya_cache are reads served from the solid-state disks that are part of a Flash Pool

Note that these counters are present regardless of your system’s configuration. For example, the Flash Cache counters will be visible even if you don’t have Flash Cache installed.

Now that you know where your reads are being served from, you might be curious how big the reads are and whether they're random or sequential. Those statistics are available, but it takes a bit more grepping and cutting to get them. I'll capture a system-wide sample first and then drill down to my volume, sefiles. Note that I need to prefix the capture with priv set:

# rsh charles "priv set -q diag; stats start -I perfstat_nfs"
# rsh charles "priv set -q diag; stats stop -I perfstat_nfs"

Okay, so there’s a lot of output there. It’s data for the entire system. That said, it’s a great way to learn — you can see everything the system sees. And if you dump it to a text file, it will save you using rsh over and over again! Let’s use rsh again and drill down to the volume stats, which is the most useful way to investigate a particular volume:

# rsh charles "priv set -q diag; stats show volume:sefiles"

I've removed some of the internal counters for clarity. Now, let's take a look at a couple of things: the volume name is listed as sefiles in the instance_name variable. During the time I was capturing statistics, the average number of operations per second was 93 (total_ops). Read latency (read_latency) was 433us, which is 0.433ms. Notice that we did no writes (write_ops) and essentially no metadata operations either (other_ops).

If you’re interested in seeing what the system is doing in general, I tend to go protocol-by-protocol. If you grep for nfsv3, you can start to wade through things:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_latency_hist"
nfsv3:nfs:nfsv3_read_latency_hist.0 - <1ms:472
nfsv3:nfs:nfsv3_read_latency_hist.1 - <2ms:7
nfsv3:nfs:nfsv3_read_latency_hist.2 - <4ms:10
nfsv3:nfs:nfsv3_read_latency_hist.4 - <6ms:8
nfsv3:nfs:nfsv3_read_latency_hist.6 - <8ms:7
nfsv3:nfs:nfsv3_read_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_read_latency_hist.10 - <12ms:1

During our statistics capture, 472 requests were served in under 1ms, 7 requests between 1ms and 2ms, 10 requests between 2ms and 4ms, and so on. If you keep looking, you'll see write latencies as well:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_write_latency_hist"
nfsv3:nfs:nfsv3_write_latency_hist.0 - <1ms:1761
nfsv3:nfs:nfsv3_write_latency_hist.1 - <2ms:0
nfsv3:nfs:nfsv3_write_latency_hist.2 - <4ms:2
nfsv3:nfs:nfsv3_write_latency_hist.4 - <6ms:1
nfsv3:nfs:nfsv3_write_latency_hist.6 - <8ms:0
nfsv3:nfs:nfsv3_write_latency_hist.8 - <10ms:0
nfsv3:nfs:nfsv3_write_latency_hist.10 - <12ms:0

So that’s 1761 writes at 1ms or better. Your writes should always show excellent latencies as writes are acknowledged once they’re committed to NVRAM — the actual writing of the data to disk will occur at the next consistency point.

Keep going and you'll see latency figures that combine all types of requests, i.e. reads, writes and other (metadata):

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_latency_hist"
nfsv3:nfs:nfsv3_latency_hist.0 - <1ms:2124
nfsv3:nfs:nfsv3_latency_hist.1 - <2ms:5
nfsv3:nfs:nfsv3_latency_hist.2 - <4ms:16
nfsv3:nfs:nfsv3_latency_hist.4 - <6ms:18
nfsv3:nfs:nfsv3_latency_hist.6 - <8ms:13
nfsv3:nfs:nfsv3_latency_hist.8 - <10ms:6
nfsv3:nfs:nfsv3_latency_hist.10 - <12ms:1

Once you’ve established latencies, you may be interested to see the I/O sizes of the actual requests. Behold:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_read_size_histo"

These statistics are represented like the latencies were: 0 requests were served that were between 0-511 bytes in size, 0 between 512 & 1023 bytes, and so on; then, 328 requests were served that were between 4KB & 8KB in size and 71 requests between 8KB & 16KB.

When it comes to protocol operations, we break things down by individual operation type, including metadata operations. I'll show the counts here, but the same output also includes the percentage of requests and the individual latency for each operation type:

# rsh charles "priv set -q diag; stats show nfsv3:nfs:nfsv3_op_count"

In this case, most of my requests were reads & writes; there was very little metadata operation (just 4 attribute lookups). If you want to know what these operations represent, you can read some of my previous blog entries or check out the RFC that defines your protocol.

One last thing that I forgot to add is trying to figure out if your workload is random or sequential. The easiest way that I’ve found to do that is to look in the readahead stats, as the readahead engine is a particularly useful way of determining how your reads are operating. We don’t need to focus on writes as much because any write much larger than 16KB is coalesced in memory and written sequentially — even if it started as random at the client. (I’ll talk about RAVE, the new readahead engine in DOT 8.1+, in later posts.)

I’ve filtered out some of the irrelevant stuff. So let’s gather what we can use and see:

# rsh charles "priv set -q diag; stats show readahead:readahead:total_read_reqs"

These statistics work much the same as our previous ones. 629 read requests were serviced that were 4KB blocks, 77 were serviced that were 8KB blocks, etc. The numbers continue right up to 1024KB. (Keep this in mind whenever someone tells you that WAFL & ONTAP can only read or write 4KB at a time!) Now if you want to see sequentiality:

# rsh charles "priv set -q diag; stats show readahead:readahead:seq_read_reqs"

We can see that, of the 4KB reads that we serviced, 36% were sequential. Mathematically this means that ~64% would be random, which we can confirm here:

# rsh charles "priv set -q diag; stats show readahead:readahead:rand_read_reqs"

Here, of our 4KB requests, 63% of them are random. Of our 8KB requests, 53% were random. Notice that the random percentages and sequential percentages for each block histogram will add up to ~100% depending on rounding. If you dig deeper into the stats, readahead can provide you with some useful information about how your data came to be read:

# rsh charles "priv set -q diag; stats readahead:readahead"

You can interpret those numbers thusly. Again, I’ve removed the irrelevant stuff:

  • requested is the sum of all blocks requested by the client
  • read are the blocks actually read from disk (where the readahead engine had a “miss”)
  • incore are the blocks read from cache (where the readahead engine had a “hit”)
  • speculative are blocks that were readahead because we figured the client might need them later in the I/O stream
  • read_once are blocks that were not read ahead and were not cached because they didn't match the cache policy (e.g. flexscale.lopri_blocks being off)

With this information, you should have a pretty good idea about how to characterize what your Filer is doing for workloads. Although I’ve focused on NFSv3 here, the same techniques can be applied to the other protocols. If you’ve got any questions, please holler!

How Data ONTAP caches, assembles and writes data

In this post, I thought I would try and describe how ONTAP works and why. I’ll also explain what the “anywhere” means in the Write Anywhere File Layout, a.k.a. WAFL.

A couple of common points of confusion are the role that NVRAM plays in performance, and where our write optimization comes from. I’ll attempt to cover both of these here. I’ll also try and cover how we handle parity. Before we start, here are some terms (and ideas) that matter. This isn’t NetApp 101, but rather NetApp 100.5 — I’m not going to cover all the basics, but I’ll cover the basics that are relevant here. So here goes!

What hardware is in a controller, anyway?

NetApp storage controllers contain some essential ingredients: hardware and software. In terms of hardware the controller has CPUs, RAM and NVRAM. In terms of software the controller has its operating system, Data ONTAP. Our CPUs do what everyone else’s CPUs do — they run our operating system. Our RAM does what everyone else’s RAM does — it runs our operating system. Our RAM also has a very important function in that it serves as a cache. While our NVRAM does roughly what everyone else’s NVRAM does — i.e., it contains our transaction journal — a key difference is the way we use NVRAM.

That said, at NetApp we do things in unique ways. And although we have one operating system, Data ONTAP, we have several hardware platforms. They range from small controllers to big controllers, with medium controllers in between. Different controllers have different amounts of CPU, RAM & NVRAM, but the principles are the same!


Our CPUs run the Data ONTAP operating system, and they also process data for clients. Our controllers vary from a single dual-core CPU (in a FAS2220) to a pair of hex-core CPUs (in a FAS6290). The higher the amount of client I/O, the harder the CPUs work. Not all protocols are equal, though; serving 10 IOps via CIFS generates a different CPU load than serving 10 IOps via FCP.


Our RAM contains the Data ONTAP operating system, and it also caches data for clients. It is also the source for all writes that are committed to disk via consistency points. Writes do not come from NVRAM! Our controllers vary from 6GB of RAM (FAS2220) to 96GB of RAM (FAS6290). Not all workloads are equal, though; different features and functionality require different memory footprints.


Physically, NVRAM is little more than RAM with a battery backup. Our NVRAM contains a transaction log of client I/O that has not yet been written to disk from RAM by a consistency point. Its primary mission is to preserve that not-yet-written data in the event of a power outage or similar severe problem. Our controllers vary from 768MB of NVRAM (FAS2220) to 4GB of NVRAM (FAS6290). In my opinion, NVRAM's function is perhaps the most commonly misunderstood part of our architecture. NVRAM is simply a double-buffered journal of pending write operations; in other words, a redo log. It is not the write cache! After data is written to NVRAM, it is not looked at again unless you experience a dirty shutdown. NVRAM's importance to performance comes from how the software uses it, not from the hardware itself.

In an HA pair environment, where two controllers are connected to each other, NVRAM is mirrored between the two nodes. Its primary mission is to preserve the partner's not-yet-written data in the event that the partner controller suffers a power outage or similar severe problem. NVRAM mirroring happens for HA pairs in Data ONTAP 7-mode, HA pairs in clustered Data ONTAP and HA pairs in MetroCluster environments.

Disks and disk shelves

Disk shelves contain disks. DS14 shelves contain 14 drives; DS2246, DS4243 and DS4246 shelves contain 24 drives; and DS4486 shelves contain 48 drives. Disk shelves are connected to controllers via shelf modules, and those connections run either FC-AL (DS14) or SAS (DS2246/DS42xx/DS4486) for connectivity.

What software is in a controller, anyway?


Data ONTAP is our controller’s operating system. Almost everything sits here — from configuration files and databases to license keys, log files and some diagnostic tools. Our operating system is built on top of FreeBSD, and usually lives in a volume called vol0. Data ONTAP features implementations of protocols for client access (e.g. NFS, CIFS), APIs for programming access (ZAPI) and implementations of protocols for management access (SSH). It is fair to say that Data ONTAP is the heart of a NetApp controller.


WAFL is our Write Anywhere File Layout. If NVRAM’s role is the most-commonly misunderstood, WAFL comes in 2nd. Yet WAFL has a simple goal, which is to write data in full stripes across the storage media. WAFL acts as an intermediary of sorts — there is a top half where files and volumes sit, and a bottom half that interacts with RAID, manages SnapShots and some other things. WAFL isn’t a filesystem, but it does some things a filesystem does; it can also contain filesystems.

WAFL contains mechanisms for dealing with files & directories, for interacting with volumes & aggregates, and for interacting with RAID. If Data ONTAP is the heart of a NetApp controller, WAFL is the blood that it pumps.

Although WAFL can write anywhere we want, in reality we write where it makes the most sense: in the closest place (relative to the disk head) where we can write a complete stripe in order to minimize seek time on subsequent I/O requests. WAFL is optimized for writes, and we’ll see why below. Rather unusually for storage arrays, we can write client data and metadata anywhere.

A colleague has this to say about WAFL, and I couldn’t put it better:

There is a relatively simple “cheating at Tetris” analogy that can be used to articulate WAFL’s advantages. It is not hard to imagine how good you could be at Tetris if you were able to review the next thousand shapes that were falling into the pattern, rather than just the next shape.

Now imagine how much better you could be at Tetris if you could take any of the shapes from within the next thousand to place into your pattern, rather than being forced to use just the next shape that is falling.

Finally, imagine having plenty of time to review the next thousand shapes and plan your layout of all 1,000, rather than just a second or two to figure out what to do with the next piece that is falling. In summary, you could become the best Tetris player on Earth, and that is essentially what WAFL is in the arena of data allocation techniques onto underlying disk arrays.

The Tetris analogy is incredibly important, as it directly relates to the way that NetApp uses WAFL to optimize for writes. Essentially, we collect random I/O that is destined to be written to disk, reorganize it so that it resembles sequential I/O as much as possible, and then write it to disk sequentially. Another way of explaining this behavior is write coalescing: we reduce the number of operations that ultimately land on disk, because we reorganize them in memory and wait until we have a batch of them before committing them to disk via a consistency point. Put another way, write coalescing lets us avoid the common (and expensive) RAID workflow of "read-modify-write".

Putting it all together

NetApp storage arrays are made up of controllers and disk shelves. Where the top and bottom are depends on your perspective: disks are grouped into RAID groups, and those RAID groups are combined to make aggregates. Volumes live in aggregates, and files and LUNs live in those volumes. A volume of CIFS data is shared to a client via an SMB share; a volume of NFS data is shared to a client via an NFS export. A LUN is presented to a client via an FCP, FCoE or iSCSI initiator group. Note the relationship here between controller and client: all clients care about are volumes, files and LUNs. They don't care directly about CPUs, NVRAM or really anything else when it comes to hardware. There is data and I/O, and that's it.

In order to get data to and from clients as quickly as possible, NetApp engineers have done much to try to optimize controller performance. A lot of the architectural design you see in our controllers reflects this. Although the clients don’t care directly about how NetApp is architected, our architecture does matter to the way their underlying data is handled and the way their I/O is served. Here is a basic workflow, from the inimitable Recovery Monkey himself:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. The client receives an acknowledgement that the data has been written

Sounds pretty simple right? On the surface, it is a pretty simple process. It also explains a few core concepts about NetApp architecture:

  • Because we acknowledge the client write once it’s hit NVRAM, we’re optimized for writes out of the box
  • Because we don’t need to wait for the disks, we can write anywhere we choose to
  • Because NVRAM is mirrored between the two controllers, we can survive an outage to either half of an HA pair

The 2nd bullet provides the backbone of this post. Because we can write data anywhere we choose to, we tend to write data in the best place possible. This is typically in the largest contiguous stripe of free space in the volume’s aggregate (closest to the disk heads). After 10 seconds have elapsed, or if NVRAM becomes >=50% full, we write the client’s data from RAM (not from NVRAM) to disk.

This operation is called a consistency point. Because we use RAID, a consistency point requires us to perform RAID calculations and to calculate parity. These calculations are processed by the CPUs using data that exists in RAM.

Many people think that our acceleration occurs because of NVRAM. Rather, a lot of our acceleration happens while we're waiting for NVRAM. For example, and very significantly, we transmogrify random I/O from the client into sequential I/O before writing it to disk. This processing is done by the CPU and occurs in RAM. Another significant benefit is that, because we calculate parity in RAM, we do not need to hammer the disk drives that contain parity information.

So, with the added step of a CP, this is how things really happen:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. When the partner NVRAM acknowledges the write, the client receives an acknowledgement that the data has been written [to disk]
  5. A consistency point occurs and the data (incl. parity) is written to disk

To re-iterate, writes are cached in the controller’s RAM at the same time as being logged into NVRAM (and the partner’s NVRAM if you’re using a HA pair). Once the data has been written to disk via a CP, the writes are purged from the controller’s NVRAM but retained (with a lower priority) in the controller’s RAM. The prioritization allows us to evict less commonly-used blocks in order to avoid overrunning the system memory. In other words, recently-written data is the first to be ejected from the first-level read cache in the controller’s RAM.

The parity issue also catches people out. Traditional filesystems and arrays write data (and metadata) into pre-allocated locations; Data ONTAP and WAFL let NetApp write data (and metadata) in whatever location will provide fastest access. This is usually a stripe to the nearest available set of free blocks. The ability of WAFL to write to the nearest available free disk blocks lets us greatly reduce disk seeking. This is the #1 performance challenge when using spinning disks! It also lets us avoid the “hot parity disk” paradigm, as WAFL always writes to new, free disk blocks using pre-calculated parity.

Consistency points are also commonly misunderstood. The purpose of a consistency point is simple: to write data to free space on disks. Once a CP has taken place, the contents of NVRAM are discarded. Incoming writes that are received during the actual process of a consistency point’s commitment (i.e., the actual writing to disk) will be written to disk in the next consistency point.

The workflow for a write:

1. Write is sent from the host to the storage system (via a NIC or HBA)
2. Write is processed into system memory while a) being logged in NVRAM and b) being logged in the HA partner’s NVRAM
3. Write is acknowledged to the host
4. Write is committed to disk storage in a consistency point (CP)

The importance of consistency points cannot be overstated: they are a cornerstone of Data ONTAP's architecture! They are also the reason why Data ONTAP is optimized for writes: no matter the destination disk type, we acknowledge client writes as soon as they hit NVRAM, and NVRAM is always faster than any type of disk!

Consistency points typically occur every 10 seconds or whenever NVRAM begins to get full, whichever comes first. The latter is referred to as a "watermark". Imagine that NVRAM is a bucket, and now divide that bucket in half. One side of the bucket is for incoming data (from clients) and the other side of the bucket is for outgoing data (to disks). As one side of the bucket fills, the other side drains. In an HA pair, we actually divide NVRAM into four buckets: two for the local controller and two for the partner controller. (This is why, when activating or deactivating HA on a system, a reboot is required for the change to take effect.) The 10-second rule is also why, on a system that is doing practically no I/O, the disks will always blink every 10 seconds.

The actual value of the watermark varies depending on the exact configuration of your environment: the Filer model, the Data ONTAP version, whether or not it’s HA, and whether SATA disks are present. Because SATA disks write more slowly than SAS disks, consistency points take longer to write to disk if they’re going to SATA disks. In order to combat this, Data ONTAP lowers the watermark in a system when SATA disks are present.

The size of NVRAM only really dictates a Filer's performance envelope when doing a lot of large sequential writes. But with the new FAS80xx family, NVRAM sizes have grown massively, up to 4x those of the FAS62xx family.

Getting started with Clustered Data ONTAP & FC storage

A couple of days ago, I helped a customer get their cDOT system up and running using SAN storage. They had inherited a Cisco MDS switch running NX-OS, and were having trouble getting the devices to log in to the fabric.

As you may know, Clustered Data ONTAP requires NPIV when using Fibre Channel storage, i.e., hosts connecting to a NetApp cluster via the Fibre Channel protocol. NPIV is N-Port ID Virtualization for Fibre Channel, and should not be confused with NPV, which is simply N-Port Virtualization. Scott Lowe has an excellent blog post comparing and contrasting the two.

NetApp uses NPIV to abstract the underlying hardware (i.e., the FC HBAs) away from the client-facing logical ports (i.e., Storage Virtual Machine logical interfaces). The use of logical interfaces, or LIFs, allows us not only to carve up a single physical HBA port into many logical ports, but also for the WWPNs to be different. This is particularly useful when it comes to zoning: if you buy an HBA today, you'll create your FC zones based on the LIF WWPNs and not the HBA's.

For example, I have a two-node FAS3170 cluster, and each node has two FC HBAs:

dot82cm::*> fcp adapter show -fields fc-wwnn,fc-wwpn
node       adapter fc-wwnn                 fc-wwpn                 
---------- ------- ----------------------- ----------------------- 
dot82cm-01 2a      50:0a:09:80:89:6a:bd:4d 50:0a:09:81:89:6a:bd:4d 
dot82cm-01 2b      50:0a:09:80:89:6a:bd:4d 50:0a:09:82:89:6a:bd:4d 

(Note that that command needs to be run in a privileged mode in the cluster shell.) But the LIFs have different port addresses, thanks to NPIV:

dot82cm::> net int show -vserver vs1
  (network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
                         up/up    20:05:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2a      true
                         up/up    20:06:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2b      true

So, I have one Vserver (sorry, SVM!) with four LIFs. If I ever remove my dual-port 8Gb FC HBA and replace it with, say, a dual-port 16Gb FC HBA, the port names on the LIFs that are attached to the SVM will not change. So when you zone your FC switch, you'll use the LIF WWPNs.

Speaking of FC switches, let's look at what we need. I'm using a Cisco Nexus 5020 in my lab, which means I'll need the NPIV (not NPV!) feature enabled. To verify whether it's enabled, it's pretty simple:

nane-nx5020-sw# show feature | i npiv
npiv                  1         enabled
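
If the feature shows as disabled, it can be switched on from configuration mode. A quick sketch:

nane-nx5020-sw# configure terminal
nane-nx5020-sw(config)# feature npiv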

That’s pretty much it. For a basic fabric configuration on a Nexus, you need the following to work with cluster-mode:

  1. The NPIV feature enabled
  2. A Virtual Storage Area Network, or VSAN
  3. A zoneset
  4. A zone

I’m using a VSAN of 101; for most environments the default VSAN is VSAN 1. I have a single zoneset, which contains a single zone. I’m using aliases to make the zone slightly easier to manage.

Here is the zoneset:

nane-nx5020-sw# show zoneset brief vsan 101
zoneset name vsan101-zoneset vsan 101
  zone vsan101-zone

You can see that the zoneset is named vsan101-zoneset, and it’s in VSAN 101. The member zone is rather creatively named vsan101-zone. Let’s look at the zone’s members:

nane-nx5020-sw# show zone vsan 101
zone name vsan101-zone vsan 101
  fcalias name ucs-esxi-1-vmhba1 vsan 101
    pwwn 20:00:00:25:b5:00:00:1a
  fcalias name dot82cm-01_fc_lif_1 vsan 101
    pwwn 20:05:00:a0:98:0d:e7:76

Note that I have two hosts defined by aliases, and that those aliases contain the relevant WWPN from the host. Make sure you commit your zone changes and activate your zoneset!
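
For reference, here's roughly how that aliasing and zoning would be built from scratch on the Nexus. This is a hedged sketch using basic zoning; with enhanced zoning you'd also need a zone commit for the VSAN:

nane-nx5020-sw# configure terminal
nane-nx5020-sw(config)# fcalias name ucs-esxi-1-vmhba1 vsan 101
nane-nx5020-sw(config-fcalias)# member pwwn 20:00:00:25:b5:00:00:1a
nane-nx5020-sw(config-fcalias)# fcalias name dot82cm-01_fc_lif_1 vsan 101
nane-nx5020-sw(config-fcalias)# member pwwn 20:05:00:a0:98:0d:e7:76
nane-nx5020-sw(config-fcalias)# zone name vsan101-zone vsan 101
nane-nx5020-sw(config-zone)# member fcalias ucs-esxi-1-vmhba1
nane-nx5020-sw(config-zone)# member fcalias dot82cm-01_fc_lif_1
nane-nx5020-sw(config-zone)# zoneset name vsan101-zoneset vsan 101
nane-nx5020-sw(config-zoneset)# member vsan101-zone
nane-nx5020-sw(config-zoneset)# zoneset activate name vsan101-zoneset vsan 101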

Once you've configured your switch appropriately, you need to do four things from the NetApp perspective (a rough sketch of the commands follows the list):

  1. Create an initiator group
  2. Populate that initiator group with the host’s WWPNs
  3. Create a LUN
  4. Map the LUN to the relevant initiator group
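
Here's a hedged sketch of those four steps in my environment. It assumes the volume vm5_fcp_volume already exists; the names, size and WWPN match the output shown below:

dot82cm::> igroup create -vserver vs1 -igroup vm5_fcp_igrp -protocol fcp -ostype vmware
dot82cm::> igroup add -vserver vs1 -igroup vm5_fcp_igrp -initiator 20:00:00:25:b5:00:00:1a
dot82cm::> lun create -vserver vs1 -path /vol/vm5_fcp_volume/vm5_fcp_lun1 -size 250GB -ostype vmware
dot82cm::> lun map -vserver vs1 -path /vol/vm5_fcp_volume/vm5_fcp_lun1 -igroup vm5_fcp_igrp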

When creating your initiator group, you'll need to select a host type; this ensures the correct ALUA settings, among other things. After the initiator group is populated, it should look something like this:

dot82cm::> igroup show -vserver vs1
Vserver   Igroup       Protocol OS Type  Initiators
--------- ------------ -------- -------- ------------------------------------
vs1       vm5_fcp_igrp fcp      vmware   20:00:00:25:b5:00:00:1a

We’re almost there! Now all we need to do is map the initiator group to a LUN. I’ve already done this for one LUN:

dot82cm::> lun show -vserver vs1
Vserver   Path                            State   Mapped   Type        Size
--------- ------------------------------- ------- -------- -------- --------
vs1       /vol/vm5_fcp_volume/vm5_fcp_lun1 
                                          online  mapped   vmware      250GB

We can see that the LUN is mapped, but how do we know which initiator group it’s mapped to?

dot82cm::> lun mapped show -vserver vs1
Vserver    Path                                      Igroup   LUN ID  Protocol
---------- ----------------------------------------  -------  ------  --------
vs1        /vol/vm5_fcp_volume/vm5_fcp_lun1          vm5_fcp_igrp  0  fcp

Now we have all the pieces in place! We have a Vserver (or SVM), vs1. It contains a volume, vm5_fcp_volume, which in turn contains a single LUN, vm5_fcp_lun1. That LUN is mapped to an initiator group called vm5_fcp_igrp of type vmware, over protocol FCP. And that initiator group contains a single WWPN that corresponds to the WWPN of my ESXi host.

Clear as mud?

Getting started with NetApp Storage QoS

Storage QoS is a new feature in Clustered Data ONTAP 8.2. It is a full-featured QoS stack, which replaces the FlexShare stack of previous ONTAP versions. So what does it do and how does it work? Let’s take a look!

The administration of QoS involves two parts: policy groups, and policies. Policy groups define boundaries between workloads, and contain one or more storage objects. We can monitor, isolate and limit the workloads of storage objects from the biggest point of granularity down to the smallest — from entire Vservers, whole volumes, individual LUNs all the way down to single files.

The actual policies are behavior modifiers that are applied to a policy group. Right now, we can set throughput limits based on operation counts (i.e., IOps) or throughput counts (i.e., MB/s). When limiting throughput, storage QoS throttles traffic at the protocol stack. Therefore, a client whose I/O is being throttled will see queuing in their protocol stack (e.g., CIFS or NFS) and latency will eventually rise. However, the addition of this queuing will not affect the NetApp cluster’s resources.

In addition to the throttling of workloads, storage QoS also includes very effective measuring tools. And because QoS is “always on”, you don’t even need to have a policy group in order to monitor performance.
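
For example, even before you define a single policy group, you can watch per-workload activity scroll by. A small sketch, assuming the default rolling output (Ctrl-C stops it):

dot82cm::> qos statistics workload performance show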

So, let's get started. Getting a policy applied to a workload takes three steps, plus a fourth to watch the results:

  1. Create the policy group (with qos policy-group create...)
  2. Set a throughput limit on the policy group (with qos policy-group modify...)
  3. Apply the policy group to a storage object such as a volume (with volume modify...) or a Vserver (with vserver modify...)
  4. Monitor what's going on (with qos statistics...)

Before we start, let’s verify that we’re starting from a clean slate:

dot82cm::> qos policy-group show
This table is currently empty.

Okay, good — no policy groups exist yet. Step one of three is to create the policy group itself, which we’ll call blog-group. In reality, you’d specify a throughput limit (either IOps or MB/s), but for now we won’t bother limiting the throughput:

dot82cm::> qos policy-group create -policy-group blog-group -vserver vs0

Let’s make sure the policy group was created:

dot82cm::> qos policy-group show                    
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined -     0-INF

But because we didn't specify a throughput limit, the Throughput column is still showing 0 to infinity (0-INF). Let's add a limit of 1500 IOps:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 1500iops

(If we wanted a limit of 1500MB/s instead, we could have substituted 1500mb for 1500iops.)

And verify:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 0     0-1500IOPS

So, step three is to associate the new policy group with an actual object whose I/O we wish to throttle. The object can be one or many volumes, LUNs or files. For now though, we’ll apply it to a single volume, blog_volume:

dot82cm::> volume modify blog_volume -vserver vs0 -qos-policy-group blog-group

Volume modify successful on volume: blog_volume

Let’s confirm that it was successfully modified:

dot82cm::> qos policy-group show                                                
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-1500IOPS

Cool! We can see that Workloads has gone from 0 to 1. I’ve mounted that volume via NFS on a Linux VM, and will throw a bunch of workloads at it using dd.

While the workload is running, here’s how it looks:

dot82cm::> qos statistics workload characteristics show                         
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      392         5.26MB/s          14071B     14%          14 
_USERSPACE_APPS     14      170       109.46KB/s            659B     32%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792     1679       104.94MB/s          65527B      0%           4

As you can see, our volume blog_volume is pretty busy — it’s pushing almost 1,700 IOps at over 100MB/sec. So, let’s see if the throttling is effective. First, we’ll give the policy group a low throughput maximum:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput 100iops

Now let’s check its status:

dot82cm::> qos policy-group show
Name             Vserver     Class        Wklds Throughput  
---------------- ----------- ------------ ----- ------------
blog-group       vs0         user-defined 1     0-100IOPS

Now let’s see how the Filer is doing:

dot82cm::> qos statistics workload characteristics show
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -      384         6.71MB/s          18333B     11%          33 
_USERSPACE_APPS     14      169         2.50MB/s          15528B     26%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
blog_volume-w..  11792       83         5.19MB/s          65536B      0%          15 
-total-              -      207         4.81MB/s          24348B      0%          17

You can see that the throughput has gone way down! In fact, at 83 IOps it's below our limit of 100. And that, of course, is what's supposed to happen. Now let's remove the limit and see if things return to normal:

dot82cm::> qos policy-group modify -policy-group blog-group -max-throughput none

dot82cm::> qos statistics workload characteristics show                            
Workload          ID     IOPS      Throughput      Request size    Read  Concurrency 
--------------- ------ -------- ---------------- --------------- ------- ----------- 
-total-              -     1073        44.37MB/s          43363B      8%           1 
blog_volume-w..  11792      626        39.12MB/s          65492B      0%           1 
_USERSPACE_APPS     14      302         4.90MB/s          17041B     29%           0 
_Scan_Backgro..  11702      115            0KB/s              0B      0%           0 
-total-              -      263       471.86KB/s           1837B     19%           0

Because we can apply the QoS policy groups to entire Vservers, volumes, files & LUNs, it is important to keep track of what’s applied where. This is how you’d apply a policy group to an individual volume:

dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group blog-group

(To remove the policy, set the -qos-policy-group field to none.)
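
Spelled out in full, that removal would look something like this:

dot82cm::> volume modify -vserver vs0 -volume blog_volume -qos-policy-group none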

To apply a policy group against an entire Vserver (in this case, Vserver vs0):

dot82cm::> vserver modify -vserver vs0 -qos-policy-group blog-group

(Again, to remove the policy, set the -qos-policy-group field to none.)

To see which volumes are assigned our policy group:

dot82cm::> volume show -vserver vs0 -qos-policy-group blog-group                       
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
vs0       blog_volume  aggr1_node1  online     RW          1GB    365.8MB   64%

To see all volumes in all Vservers’ QoS policy groups:

dot82cm::> volume show -vserver * -qos-policy-group * -fields vserver,volume,qos-policy-group

vserver    volume qos-policy-group 
---------- ------ ---------------- 
vs0        blog_empty 
vs0        blog_volume 

To see a Vserver’s full configuration, including its QoS policy group:

dot82cm::> vserver show -vserver vs0

                                    Vserver: vs0
                               Vserver Type: data
                               Vserver UUID: c280658e-bd77-11e2-a567-123478563412
                                Root Volume: vs0_root
                                  Aggregate: aggr1_node2
                        Name Service Switch: file, nis, ldap
                        Name Mapping Switch: file, ldap
                                 NIS Domain:
                 Root Volume Security Style: unix
                                LDAP Client: -
               Default Volume Language Code: C
                            Snapshot Policy: default
                 Antivirus On-Access Policy: default
                               Quota Policy: default
                List of Aggregates Assigned: -
 Limit on Maximum Number of Volumes allowed: unlimited
                        Vserver Admin State: running
                          Allowed Protocols: nfs, cifs, ndmp
                       Disallowed Protocols: fcp, iscsi
            Is Vserver with Infinite Volume: false
                           QoS Policy Group: -

To get a list of all Vservers by their policy groups:

dot82cm::> vserver show -vserver * -qos-policy-group * -fields vserver,qos-policy-group 
vserver qos-policy-group 
------- ---------------- 
dot82cm -                
vs0     blog-group       
4 entries were displayed.

If you’re in a hurry and want to remove all instances of a policy from volumes in a particular Vserver:

dot82cm::> vol modify -vserver vs0 -volume * -qos-policy-group none

That should be enough to get us going. Stay tuned, because in the next episode I’ll show some video with iometer running!


SnapMirror data between cDOT 8.1 and cDOT 8.2

I wanted to quickly put this together as I had some data I wanted to move from a cDOT 8.1 cluster to a cDOT 8.2 cluster. At the time of writing, you cannot do this procedure with OnCommand System Manager GUI — it has to be done through the command-line. Big thanks to Doug & Khalif for providing me some troubleshooting help.

Although this seems like a long post, the process is quite simple:

  1. Create the inter-cluster logical interfaces (LIFs) on each cluster
  2. Create a peer relationship between clusters
  3. Create SnapMirror relationships between clusters for volumes to be transferred
  4. Initialize the SnapMirror relationships to transfer the data
  5. Quiesce, break and remove the SnapMirror relationships

The naming scheme I'm using is very simple. The cDOT 8.1 cluster is called dot81cm, and the cDOT 8.2 cluster is called dot82cm. On both clusters, the name of the Vserver is vserver1, and the volume will be called newDatastore on both the source and destination clusters. Of course, any or all of these parameters can be changed in your environment; the Vserver names don't need to be the same, nor do the volume names. However, you will need to ensure that DNS resolution works between the two clusters.

First, we’ll probably need to create the inter-cluster LIFs to connect the two clusters together. Note that you can use dedicated inter-cluster LIFs or you can use shared ones. Here, I’m creating dedicated LIFs:

dot81cm::> network interface create -vserver dot81cm-01 -lif migration_iclif_81 -role intercluster -home-node dot81cm-01 -home-port e3a -address -netmask
dot82cm::> network interface create -vserver dot82cm-01 -lif migration_iclif_82 -role intercluster -home-node dot82cm-01 -home-port e0a -address -netmask

Note that the inter-cluster LIFs are created on the node Vserver (i.e., the node dot81cm-01) and not a data Vserver. In a larger environment, if you're concerned about traffic load or failover resources, you may want to create more than one LIF. Also note that you'll need to select your own home-node and home-port (e.g., e0a) values. And, in my case, the two LIFs are on the same network, so I didn't need to bother about having them routed. You can find some useful information about creating the LIFs in the Clustered Data ONTAP® 8.2 Cluster and Vserver Peering Express Guide.
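
Once the LIFs exist on both sides, it's worth confirming they're up before attempting the peer relationship. A quick check, run on each cluster:

dot81cm::> network interface show -role intercluster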

Create the cluster peer in cDOT 8.2 first:

dot82cm::> cluster peer create -peer-addrs

Notice: Successfully configured the local cluster. Execute this command on the
        remote cluster to complete the peer setup.

Now, create the cluster peer in cDOT 8.1:

dot81cm::> cluster peer create -peer-addrs

Verify the existence of the peer relationship:

dot82cm::> cluster peer show
Peer Cluster Name          Cluster Serial Number Availability
-------------------------- --------------------- ---------------
dot81cm                    1-80-000011           Available
dot81cm::> cluster peer show
Peer Cluster Name          Cluster Serial Number Availability
-------------------------- --------------------- ---------------
dot82cm                    1-80-000013           Available

Create the destination volume and configure it for data protection:

dot82cm::> vol create -vserver vserver1 -vol newDatastore -aggr aggr1_node1 -size 100GB -type DP 

Create the SnapMirror relationship. A key point here is that you must create the relationship on the cDOT 8.2 cluster:

dot82cm::> snapmirror create -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore -type DP

Initialize the SnapMirror relationship:

dot82cm::> snapmirror initialize -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore -type DP

You should see that a job is created. Find the job:

dot82cm::> job show
Job ID Name                 Vserver    Node           State
------ -------------------- ---------- -------------- ----------
1820   SnapMirror initialize 
                            dot82cm    dot82cm-01     Running
       Description: snapmirror initialize of destination dot82cm://vserver1/newDatastore

To see more detail about the job's progress:

dot82cm::> job watch-progress 1820
3.75GB sent for 1 of 1 Snapshot copies, transferring Snapshot copy snapmirror.aa

…and that’s pretty much it! Hit Ctrl-C to exit out of the job watch. If you’re only planning on transferring the data once, you can now finalize the migration by quiescing, breaking and deleting the SnapMirror relationship. (You can consider quiescing as an optional step.)
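
Before quiescing and breaking, you can optionally confirm that the baseline transfer finished and the relationship is idle. A hedged check; the exact field names can vary a little by release:

dot82cm::> snapmirror show -destination-path dot82cm://vserver1/newDatastore -fields state,status

Then proceed with the quiesce: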

dot82cm::> snapmirror quiesce -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

Now you can break the SnapMirror relationship:

dot82cm::> snapmirror break -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

And now you can delete the relationship:

dot82cm::> snapmirror delete -source-path dot81cm://vserver1/newDatastore -destination-path dot82cm://vserver1/newDatastore

And we’re done!