How Data ONTAP caches, assembles and writes data

In this post, I thought I would try and describe how ONTAP works and why. I’ll also explain what the “anywhere” means in the Write Anywhere File Layout, a.k.a. WAFL.

A couple of common points of confusion are the role that NVRAM plays in performance, and where our write optimization comes from. I’ll attempt to cover both of these here. I’ll also try and cover how we handle parity. Before we start, here are some terms (and ideas) that matter. This isn’t NetApp 101, but rather NetApp 100.5 — I’m not going to cover all the basics, but I’ll cover the basics that are relevant here. So here goes!

What hardware is in a controller, anyway?

NetApp storage controllers contain some essential ingredients: hardware and software. In terms of hardware the controller has CPUs, RAM and NVRAM. In terms of software the controller has its operating system, Data ONTAP. Our CPUs do what everyone else’s CPUs do — they run our operating system. Our RAM does what everyone else’s RAM does — it runs our operating system. Our RAM also has a very important function in that it serves as a cache. While our NVRAM does roughly what everyone else’s NVRAM does — i.e., it contains our transaction journal — a key difference is the way we use NVRAM.

That said, at NetApp we do things in unique ways. And although we have one operating system, Data ONTAP, we have several hardware platforms. They range from small controllers to big controllers, with medium controllers in between. Different controllers have different amounts of CPU, RAM & NVRAM, but the principles are the same!


Our CPUs run the Data ONTAP operating system, and they also process data for clients. Our controllers vary from a single dual-core CPU (in a FAS2220) to a pair of hex-core CPUs (in a FAS6290). The higher the amount of client I/O, the harder the CPUs work. Not all protocols are equal, though; serving 10 IOPS via CIFS generates a different CPU load than serving 10 IOPS via FCP.


Our RAM contains the Data ONTAP operating system, and it also caches data for clients. It is also the source for all writes that are committed to disk via consistency points. Writes do not come from NVRAM! Our controllers vary from 6GB of RAM (FAS2220) to 96GB of RAM (FAS6290). Not all workloads are equal, though; different features and functionality require a different memory footprint.


Physically, NVRAM is little more than RAM with a battery backup. Our NVRAM contains a transaction log of client I/O that has not yet been written to disk from RAM by a consistency point. Its primary mission is to preserve that not-yet-written data in the event of a power outage or similar severe problem. Our controllers vary from 768MB of NVRAM (FAS2220) to 4GB of NVRAM (FAS6290). In my opinion, NVRAM’s function is perhaps the most commonly misunderstood part of our architecture. NVRAM is simply a double-buffered journal of pending write operations. NVRAM, therefore, is simply a redo log — it is not the write cache! After data is written to NVRAM, it is not looked at again unless you experience a dirty shutdown. As we’ll see below, NVRAM’s importance to performance actually comes from software.
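
The double-buffered journal idea can be sketched in a few lines of Python. This is purely conceptual (not ONTAP code): writes land in the active half, a consistency point flips the halves, and the journal is only ever read back to replay operations after a dirty shutdown.

```python
# Conceptual sketch (not ONTAP code): NVRAM as a double-buffered redo log.
# Writes are appended to the active half; at a consistency point the halves
# swap, and entries are only ever replayed after a dirty shutdown.

class NvramJournal:
    def __init__(self):
        self.buffers = [[], []]   # the two halves of the journal
        self.active = 0           # index of the half accepting new entries

    def log(self, op):
        """Append a pending write operation to the active half."""
        self.buffers[self.active].append(op)

    def swap_on_cp(self):
        """At a consistency point, flip halves; the committed half is cleared."""
        drained = self.active
        self.active = 1 - self.active
        self.buffers[drained].clear()

    def replay(self):
        """Only used after a dirty shutdown: redo any still-logged operations."""
        return self.buffers[0] + self.buffers[1]

journal = NvramJournal()
journal.log(("write", "blockA"))
journal.log(("write", "blockB"))
journal.swap_on_cp()              # the CP commits from RAM; this half is discarded
journal.log(("write", "blockC"))  # new writes keep flowing into the other half
```

Note that `replay()` is never called in normal operation, which is exactly the point: NVRAM protects data, it doesn’t serve it.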

In an HA pair environment where two controllers are connected to each other, NVRAM is mirrored between the two nodes. Its primary mission is to preserve not-yet-written data in the event the partner controller suffers a power outage or a similar severe problem. NVRAM mirroring happens for HA pairs in Data ONTAP 7-mode, HA pairs in clustered Data ONTAP and HA pairs in MetroCluster environments.

Disks and disk shelves

Disk shelves contain disks. DS14 shelves contain 14 drives; DS2246, DS4243 and DS4246 shelves contain 24 disks; DS4486 shelves contain 48 disks. Disk shelves are connected to controllers via shelf modules, and those connections use either FC-AL (DS14) or SAS (DS2246/DS42xx) protocols for connectivity.

What software is in a controller, anyway?


Data ONTAP is our controller’s operating system. Almost everything sits here — from configuration files and databases to license keys, log files and some diagnostic tools. Our operating system is built on top of FreeBSD, and usually lives in a volume called vol0. Data ONTAP features implementations of protocols for client access (e.g. NFS, CIFS), APIs for programming access (ZAPI) and implementations of protocols for management access (SSH). It is fair to say that Data ONTAP is the heart of a NetApp controller.


WAFL is our Write Anywhere File Layout. If NVRAM’s role is the most commonly misunderstood, WAFL comes in second. Yet WAFL has a simple goal, which is to write data in full stripes across the storage media. WAFL acts as an intermediary of sorts — there is a top half where files and volumes sit, and a bottom half that interacts with RAID, manages Snapshots and some other things. WAFL isn’t a filesystem, but it does some things a filesystem does; it can also contain filesystems.

WAFL contains mechanisms for dealing with files & directories, for interacting with volumes & aggregates, and for interacting with RAID. If Data ONTAP is the heart of a NetApp controller, WAFL is the blood that it pumps.

Although WAFL can write anywhere we want, in reality we write where it makes the most sense: in the closest place (relative to the disk head) where we can write a complete stripe in order to minimize seek time on subsequent I/O requests. WAFL is optimized for writes, and we’ll see why below. Rather unusually for storage arrays, we can write client data and metadata anywhere.

A colleague has this to say about WAFL, and I couldn’t put it better:

There is a relatively simple “cheating at Tetris” analogy that can be used to articulate WAFL’s advantages. It is not hard to imagine how good you could be at Tetris if you were able to review the next thousand shapes that were falling into the pattern, rather than just the next shape.

Now imagine how much better you could be at Tetris if you could take any of the shapes from within the next thousand to place into your pattern, rather than being forced to use just the next shape that is falling.

Finally, imagine having plenty of time to review the next thousand shapes and plan your layout of all 1,000, rather than just a second or two to figure out what to do with the next piece that is falling. In summary, you could become the best Tetris player on Earth, and that is essentially what WAFL is in the arena of data allocation techniques onto underlying disk arrays.

The Tetris analogy is incredibly important, as it directly relates to the way that NetApp uses WAFL to optimize for writes. Essentially, we collect random I/O that is destined to be written to disk, reorganize it so that it resembles sequential I/O as much as possible, and then write it to disk sequentially. Another way of explaining this behavior is write coalescing: we reduce the number of operations that ultimately land on the disk, because we re-organize them in memory before we commit them to disk, and we wait until we have a bunch of them before committing them to disk via a Consistency Point. Put another way, write coalescing allows us to avoid the common (and expensive) RAID workflow of “read-modify-write”.
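
A minimal sketch of write coalescing, in Python rather than anything resembling WAFL internals: random block writes are buffered, sorted into block order, and emitted as full stripes. The stripe width here is illustrative.

```python
# Conceptual sketch (not ONTAP code): write coalescing. Random incoming
# writes are gathered in memory, sorted into ascending block order, and
# emitted as full stripes so the disks see sequential I/O.

STRIPE_WIDTH = 4  # data blocks per stripe; purely illustrative

def coalesce(pending_writes):
    """Turn a batch of (block, data) writes into sequential full stripes."""
    ordered = sorted(pending_writes)               # random -> sequential
    return [ordered[i:i + STRIPE_WIDTH]
            for i in range(0, len(ordered), STRIPE_WIDTH)]

random_io = [(9, "i"), (2, "b"), (7, "g"), (1, "a"),
             (8, "h"), (3, "c"), (6, "f"), (4, "d")]
stripes = coalesce(random_io)
# Eight scattered writes become two sequential full-stripe writes.
```

Eight random operations arriving in any order leave the buffer as two stripe-sized sequential writes, which is the “cheating at Tetris” trick in miniature.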

Putting it all together

NetApp storage arrays are made up of controllers and disk shelves. Where the top and the bottom are depends on your perspective: disks are grouped into RAID groups, and those RAID groups are combined to make aggregates. Volumes live in aggregates, and files and LUNs live in those volumes. A volume of CIFS data is shared to a client via an SMB share; a volume of NFS data is shared to a client via an NFS export. A LUN of data is shared to a client via an FCP, FCoE or iSCSI initiator group. Note the relationship here between controller and client — all clients care about are volumes, files and LUNs. They don’t care directly about CPUs, NVRAM or really anything else when it comes to hardware. There is data and I/O, and that’s it.

In order to get data to and from clients as quickly as possible, NetApp engineers have done much to try to optimize controller performance. A lot of the architectural design you see in our controllers reflects this. Although the clients don’t care directly about how NetApp is architected, our architecture does matter to the way their underlying data is handled and the way their I/O is served. Here is a basic workflow, from the inimitable Recovery Monkey himself:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. The client receives an acknowledgement that the data has been written

Sounds pretty simple right? On the surface, it is a pretty simple process. It also explains a few core concepts about NetApp architecture:

  • Because we acknowledge the client write once it’s hit NVRAM, we’re optimized for writes out of the box
  • Because we don’t need to wait for the disks, we can write anywhere we choose to
  • Because NVRAM is mirrored, we can survive an outage to either half of an HA pair

The 2nd bullet provides the backbone of this post. Because we can write data anywhere we choose to, we tend to write data in the best place possible. This is typically in the largest contiguous stripe of free space in the volume’s aggregate (closest to the disk heads). After 10 seconds have elapsed, or if NVRAM becomes >=50% full, we write the client’s data from RAM (not from NVRAM) to disk.

This operation is called a consistency point. Because we use RAID, a consistency point requires us to perform RAID calculations and to calculate parity. These calculations are processed by the CPUs using data that exists in RAM.
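
The two trigger conditions described above can be sketched as a simple predicate. This is conceptual only; as discussed further down, the real watermark varies by platform and configuration.

```python
# Conceptual sketch (not ONTAP code) of the two consistency-point triggers:
# a 10-second timer and an NVRAM watermark. The 50% value matches the rule
# of thumb above; the real watermark varies by platform and configuration.

CP_INTERVAL_SECONDS = 10.0
NVRAM_WATERMARK = 0.5   # fraction of NVRAM fill that forces a CP

def should_start_cp(seconds_since_last_cp, nvram_fill_fraction):
    """A CP starts when either trigger fires, whichever comes first."""
    return (seconds_since_last_cp >= CP_INTERVAL_SECONDS
            or nvram_fill_fraction >= NVRAM_WATERMARK)
```

An idle system still trips the timer every 10 seconds, while a write-heavy system trips the watermark long before the timer does.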

Many people think that our acceleration occurs because of NVRAM. Rather, a lot of our acceleration happens while we’re waiting for NVRAM. For example — and very significantly — we transmogrify random I/O from the client into sequential I/O before writing it to disk. This processing is done by the CPU and occurs in RAM. Another significant benefit is that because we calculate parity in RAM, we do not need to hammer the disk drives that contain parity information.

So, with the added step of a CP, this is how things really happen:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. When the partner NVRAM acknowledges the write, the client receives an acknowledgement that the data has been written [to disk]
  5. A consistency point occurs and the data (incl. parity) is written to disk
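
The five steps above can be sketched conceptually in Python (this is not ONTAP code): the client acknowledgement depends only on RAM and NVRAM, while the disk commit happens later, inside the CP.

```python
# Conceptual sketch (not ONTAP code) of the write path: the ack is sent
# once the data is cached in RAM and journaled in both NVRAMs; the disk
# commit happens later, inside a consistency point.

def handle_write(data, ram, nvram, partner_nvram):
    ram.append(data)            # 2. cached in RAM...
    nvram.append(data)          # ...and logged in local NVRAM
    partner_nvram.append(data)  # 3. mirrored to the HA partner's NVRAM
    return "ack"                # 4. acknowledged -- no disk I/O has happened

def consistency_point(ram, nvram, partner_nvram, disk):
    disk.extend(ram)            # 5. data (plus parity) written from RAM
    nvram.clear()               # the journal entries are now obsolete
    partner_nvram.clear()       # RAM keeps the data as a low-priority read cache

ram, nvram, partner_nvram, disk = [], [], [], []
ack = handle_write("block", ram, nvram, partner_nvram)
# At this point the client already has its ack, but disk is still empty.
consistency_point(ram, nvram, partner_nvram, disk)
```

The key ordering to notice: the ack precedes the disk write, which is exactly why NVRAM-backed journaling makes the system write-optimized regardless of disk speed.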

To re-iterate, writes are cached in the controller’s RAM at the same time as being logged into NVRAM (and the partner’s NVRAM if you’re using a HA pair). Once the data has been written to disk via a CP, the writes are purged from the controller’s NVRAM but retained (with a lower priority) in the controller’s RAM. The prioritization allows us to evict less commonly-used blocks in order to avoid overrunning the system memory. In other words, recently-written data is the first to be ejected from the first-level read cache in the controller’s RAM.
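
That eviction policy can be sketched conceptually. The numeric priorities here are invented for illustration; the point is only that just-written blocks are the cheapest to evict.

```python
# Conceptual sketch (not ONTAP code): after a CP, committed blocks stay in
# the RAM read cache but at low priority, so they are evicted first when
# memory runs short. The priority values are invented for illustration.

cache = []  # list of (priority, block); lowest priority is evicted first

def cache_read(block):
    cache.append((1, block))    # recently-read data: worth keeping around

def cache_after_cp(block):
    cache.append((0, block))    # just-written data: first out the door

def evict_one():
    victim = min(cache, key=lambda entry: entry[0])
    cache.remove(victim)
    return victim[1]

cache_read("hot-block")
cache_after_cp("just-written-block")
evicted = evict_one()           # the freshly-written block goes first
```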

The parity issue also catches people out. Traditional filesystems and arrays write data (and metadata) into pre-allocated locations; Data ONTAP and WAFL let NetApp write data (and metadata) in whatever location will provide fastest access. This is usually a stripe to the nearest available set of free blocks. The ability of WAFL to write to the nearest available free disk blocks lets us greatly reduce disk seeking. This is the #1 performance challenge when using spinning disks! It also lets us avoid the “hot parity disk” paradigm, as WAFL always writes to new, free disk blocks using pre-calculated parity.
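
A toy example of why pre-calculated, full-stripe parity is cheap. With the whole stripe in RAM, parity is just the XOR of the data blocks; the read-modify-write sequence a traditional array performs for a partial update is summarized in the trailing comment. The block values are arbitrary.

```python
# Conceptual sketch (not ONTAP code): full-stripe parity. With a complete
# stripe in RAM, parity is the XOR of the new data blocks -- no old data
# and no old parity needs to be read back from disk first.

from functools import reduce

def parity_for_full_stripe(data_blocks):
    """XOR all data blocks together; nothing is read from disk."""
    return reduce(lambda a, b: a ^ b, data_blocks)

stripe = [0b1100, 0b1010, 0b0110]   # arbitrary example data blocks
parity = parity_for_full_stripe(stripe)

# A partial update on a traditional array would instead need:
#   read old data block + read old parity   (2 disk reads),
#   new_parity = old_parity ^ old_data ^ new_data,
#   write new data + write new parity       (2 disk writes).
```

The XOR also gives the usual RAID recovery property: any lost block equals the parity XORed with the surviving blocks.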

Consistency points are also commonly misunderstood. The purpose of a consistency point is simple: to write data to free space on disks. Once a CP has taken place, the contents of NVRAM are discarded. Incoming writes that are received during the actual process of a consistency point’s commitment (i.e., the actual writing to disk) will be written to disk in the next consistency point.

The workflow for a write:

1. Write is sent from the host to the storage system (via a NIC or HBA)
2. Write is processed into system memory while a) being logged in NVRAM and b) being logged in the HA partner’s NVRAM
3. Write is acknowledged to the host
4. Write is committed to disk storage in a consistency point (CP)

The importance of consistency points cannot be overstated — they are a cornerstone of Data ONTAP’s architecture! They are also the reason why Data ONTAP is optimized for writes: no matter the destination disk type, we acknowledge client writes as soon as they hit NVRAM, and NVRAM is always faster than any type of disk!

Consistency points typically occur every 10 seconds or whenever NVRAM begins to get full, whichever comes first. The fill level that triggers a CP is referred to as a “watermark”. Imagine that NVRAM is a bucket, and now divide that bucket in half. One side of the bucket is for incoming data (from clients) and the other side of the bucket is for outgoing data (to disks). As one side of the bucket fills, the other side drains. In an HA pair, we actually divide NVRAM into four buckets: two for the local controller and two for the partner controller. (This is why a reboot is required when activating or deactivating HA on a system. The 10-second rule is also why, on a system that is doing practically no I/O, the disks will always blink every 10 seconds.)
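
The bucket arithmetic can be sketched as follows. This is conceptual only; real bucket sizing is more subtle, and as noted below the watermark itself varies by configuration.

```python
# Conceptual sketch (not ONTAP code) of the bucket division. Standalone:
# NVRAM splits into two halves that alternate filling and draining. In an
# HA pair the same NVRAM also holds the partner's journal, so each
# controller effectively works with four smaller buckets.

def bucket_layout(nvram_mb, ha_enabled):
    """Return the bucket count and per-bucket size for a given NVRAM capacity."""
    buckets = 4 if ha_enabled else 2
    return {"buckets": buckets, "mb_per_bucket": nvram_mb / buckets}

standalone = bucket_layout(768, ha_enabled=False)  # e.g. a FAS2220's 768MB
ha_pair = bucket_layout(768, ha_enabled=True)      # same NVRAM, half the headroom
```

This halving of usable journal space is one concrete reason enabling HA requires a reboot: the NVRAM layout itself changes.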

The actual value of the watermark varies depending on the exact configuration of your environment: the Filer model, the Data ONTAP version, whether or not it’s HA, and whether SATA disks are present. Because SATA disks write more slowly than SAS disks, consistency points take longer to write to disk if they’re going to SATA disks. In order to combat this, Data ONTAP lowers the watermark in a system when SATA disks are present.

The size of NVRAM only really dictates a Filer’s performance envelope when doing a large amount of sequential writes. But with the new FAS80xx family, NVRAM sizes have grown massively — up to 4x those of the FAS62xx family.

Getting started with Clustered Data ONTAP & FC storage

A couple of days ago, I helped a customer get their cDOT system up and running using SAN storage. They had inherited a Cisco MDS switch running NX-OS, and were having trouble getting the devices to log in to the fabric.

As you may know, Clustered Data ONTAP requires NPIV when using Fibre Channel storage — i.e., hosts connecting to a NetApp cluster via the Fibre Channel protocol. NPIV is N-Port ID Virtualization for Fibre Channel, and should not be confused with NPV — which is simply N-Port Virtualization. Scott Lowe has an excellent blog post comparing and contrasting the two.

NetApp uses NPIV in order to abstract away the underlying hardware (i.e., FC HBAs) from the client-facing hardware (i.e., Storage Virtual Machine Logical Interfaces). The use of logical interfaces, or LIFs, allows us not only to carve up a single physical HBA port into many logical ports, but also for the WWPNs to be different. This is particularly useful when it comes to zoning — if you buy an HBA today, you’ll create your FC zones based on the LIF WWPNs and not the HBA’s.
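
The practical consequence can be sketched as follows. The WWPNs are modelled on the examples in this post (the “new” HBA WWPN is invented): the zone is built from the LIF WWPNs, so replacing the physical HBA changes nothing in the zoning.

```python
# Conceptual sketch: zoning against NPIV LIF WWPNs instead of HBA WWPNs.
# The LIF WWPNs mirror the examples in this post; the replacement HBA WWPN
# is invented for illustration.

hba_wwpn = "50:0a:09:81:89:6a:bd:4d"         # physical 8Gb HBA port
lif_wwpns = ["20:05:00:a0:98:0d:e7:76",      # NPIV virtual ports; these
             "20:06:00:a0:98:0d:e7:76"]      # belong to the SVM's LIFs

zone_members = list(lif_wwpns)               # the switch zone uses LIF WWPNs

hba_wwpn = "50:0a:09:85:99:6a:bd:4d"         # swap in a hypothetical 16Gb HBA
# zone_members is unchanged: no re-zoning is required after the swap.
```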

For example, I have a two-node FAS3170 cluster, and each node has two FC HBAs:

dot82cm::*> fcp adapter show -fields fc-wwnn,fc-wwpn
node       adapter fc-wwnn                 fc-wwpn                 
---------- ------- ----------------------- ----------------------- 
dot82cm-01 2a      50:0a:09:80:89:6a:bd:4d 50:0a:09:81:89:6a:bd:4d 
dot82cm-01 2b      50:0a:09:80:89:6a:bd:4d 50:0a:09:82:89:6a:bd:4d 

(Note that that command needs to be run in a privileged mode in the cluster shell.) But the LIFs have different port addresses, thanks to NPIV:

dot82cm::> net int show -vserver vs1
  (network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
                         up/up    20:05:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2a      true
                         up/up    20:06:00:a0:98:0d:e7:76 
                                                     dot82cm-01    2b      true

So, I have one Vserver (sorry, SVM!) with four LIFs. If I ever remove my dual-port 8Gb FC HBAs and replace them with, say, dual-port 16Gb FC HBAs, the port names on the LIFs that are attached to the SVM will not change. So when you zone your FC switch, you’ll use the LIF WWPNs.

Speaking of FC switches, let’s look at what we need. I’m using a Cisco Nexus 5020 in my lab, which means I’ll need the NPIV (not NPV!) license enabled. To verify if you have that license, it’s pretty simple:

nane-nx5020-sw# show feature | i npiv
npiv                  1         enabled

That’s pretty much it. For a basic fabric configuration on a Nexus, you need the following to work with cluster-mode:

  1. An NPIV license
  2. A Virtual Storage Area Network, or VSAN
  3. A zoneset
  4. A zone

I’m using a VSAN of 101; for most environments the default VSAN is VSAN 1. I have a single zoneset, which contains a single zone. I’m using aliases to make the zone slightly easier to manage.

Here is the zoneset:

nane-nx5020-sw# show zoneset brief vsan 101
zoneset name vsan101-zoneset vsan 101
  zone vsan101-zone

You can see that the zoneset is named vsan101-zoneset, and it’s in VSAN 101. The member zone is rather creatively named vsan101-zone. Let’s look at the zone’s members:

nane-nx5020-sw# show zone vsan 101
zone name vsan101-zone vsan 101
  fcalias name ucs-esxi-1-vmhba1 vsan 101
    pwwn 20:00:00:25:b5:00:00:1a
  fcalias name dot82cm-01_fc_lif_1 vsan 101
    pwwn 20:05:00:a0:98:0d:e7:76

Note that the zone has two members defined by aliases: the initiator WWPN of my ESXi host and the target WWPN of the NetApp LIF. Make sure you commit your zone changes and activate your zoneset!

Once you’ve configured your switch appropriately, you need to do four things from the NetApp perspective:

  1. Create an initiator group
  2. Populate that initiator group with the host’s WWPNs
  3. Create a LUN
  4. Map the LUN to the relevant initiator group
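
The four steps can be modelled conceptually in a few lines of Python. The function names merely echo the ONTAP commands (igroup create, igroup add, lun map); this is a data-model sketch, not the ONTAP CLI.

```python
# Conceptual sketch (not the ONTAP CLI): an initiator group collects host
# WWPNs, and mapping a LUN to that igroup is what makes the LUN visible to
# those initiators. The names mirror the examples in this post.

igroups = {}
lun_maps = {}

def igroup_create(name, protocol, ostype):
    igroups[name] = {"protocol": protocol, "ostype": ostype, "initiators": []}

def igroup_add(name, wwpn):
    igroups[name]["initiators"].append(wwpn)

def lun_map(lun_path, igroup_name, lun_id=0):
    lun_maps[lun_path] = {"igroup": igroup_name, "lun_id": lun_id}

igroup_create("vm5_fcp_igrp", protocol="fcp", ostype="vmware")
igroup_add("vm5_fcp_igrp", "20:00:00:25:b5:00:00:1a")
lun_map("/vol/vm5_fcp_volume/vm5_fcp_lun1", "vm5_fcp_igrp")
```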

When creating your initiator group, you’ll need to select a host type. This will ensure the correct ALUA settings, amongst others. After the initiator group is populated, it should look something like this:

dot82cm::> igroup show -vserver vs1
Vserver   Igroup       Protocol OS Type  Initiators
--------- ------------ -------- -------- ------------------------------------
vs1       vm5_fcp_igrp fcp      vmware   20:00:00:25:b5:00:00:1a

We’re almost there! Now all we need to do is map the initiator group to a LUN. I’ve already done this for one LUN:

dot82cm::> lun show -vserver vs1
Vserver   Path                            State   Mapped   Type        Size
--------- ------------------------------- ------- -------- -------- --------
vs1       /vol/vm5_fcp_volume/vm5_fcp_lun1 
                                          online  mapped   vmware      250GB

We can see that the LUN is mapped, but how do we know which initiator group it’s mapped to?

dot82cm::> lun mapped show -vserver vs1
Vserver    Path                                      Igroup   LUN ID  Protocol
---------- ----------------------------------------  -------  ------  --------
vs1        /vol/vm5_fcp_volume/vm5_fcp_lun1          vm5_fcp_igrp  0  fcp

Now we have all the pieces in place! We have a Vserver (or SVM), vs1. It contains a volume, vm5_fcp_volume, which in turn contains a single LUN, vm5_fcp_lun1. That LUN is mapped to an initiator group called vm5_fcp_igrp of type vmware, over protocol FCP. And that initiator group contains a single WWPN that corresponds to the WWPN of my ESXi host.

Clear as mud?