How Data ONTAP caches, assembles and writes data

In this post, I thought I would try and describe how ONTAP works and why. I’ll also explain what the “anywhere” means in the Write Anywhere File Layout, a.k.a. WAFL.

A couple of common points of confusion are the role that NVRAM plays in performance, and where our write optimization comes from. I’ll attempt to cover both of these here. I’ll also try and cover how we handle parity. Before we start, here are some terms (and ideas) that matter. This isn’t NetApp 101, but rather NetApp 100.5 — I’m not going to cover all the basics, but I’ll cover the basics that are relevant here. So here goes!

What hardware is in a controller, anyway?

NetApp storage controllers contain some essential ingredients: hardware and software. In terms of hardware the controller has CPUs, RAM and NVRAM. In terms of software the controller has its operating system, Data ONTAP. Our CPUs do what everyone else’s CPUs do — they run our operating system. Our RAM does what everyone else’s RAM does — it runs our operating system. Our RAM also has a very important function in that it serves as a cache. While our NVRAM does roughly what everyone else’s NVRAM does — i.e., it contains our transaction journal — a key difference is the way we use NVRAM.

That said, at NetApp we do things in unique ways. And although we have one operating system, Data ONTAP, we have several hardware platforms. They range from small controllers to big controllers, with medium controllers in between. Different controllers have different amounts of CPU, RAM & NVRAM, but the principles are the same!

CPU

Our CPUs run the Data ONTAP operating system, and they also process data for clients. Our controllers vary from a single dual-core CPU (in a FAS2220) to a pair of hex-core CPUs (in a FAS6290). The higher the amount of client I/O, the harder the CPUs work. Not all protocols are equal, though; serving 10 IOps via CIFS generates a different CPU load than serving 10 IOps via FCP.

RAM

Our RAM contains the Data ONTAP operating system, and it also caches data for clients. It is also the source for all writes that are committed to disk via consistency points. Writes do not come from NVRAM! Our controllers vary from 6GB of RAM (FAS2220) to 96GB of RAM (FAS6290). Not all workloads are equal, though; different features and functionality require a different memory footprint.

NVRAM

Physically, NVRAM is little more than RAM with a battery backup. Our NVRAM contains a transaction log of client I/O that has not yet been written to disk from RAM by a consistency point. Its primary mission is to preserve that not-yet-written data in the event of a power outage or similar severe problem. Our controllers vary from 768MB of NVRAM (FAS2220) to 4GB of NVRAM (FAS6290). In my opinion, NVRAM's function is perhaps the most commonly misunderstood part of our architecture. NVRAM is simply a double-buffered journal of pending write operations — a redo log, not the write cache! After data is written to NVRAM, it is not looked at again unless you experience a dirty shutdown. NVRAM's importance to performance, as we'll see, comes from software rather than from the hardware itself.
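
If it helps to see that distinction spelled out, here is a minimal sketch (plain Python with invented names, not how ONTAP is actually implemented) of a write being cached in RAM while being journalled to NVRAM:

```python
class Controller:
    """Conceptual sketch only — class and method names are invented."""

    def __init__(self):
        self.ram_cache = []    # write cache: the source of data for consistency points
        self.nvram_log = []    # redo log: only ever replayed after a dirty shutdown

    def receive_write(self, data):
        self.ram_cache.append(data)   # this copy is what eventually goes to disk
        self.nvram_log.append(data)   # journal entry, never read back in normal operation

    def replay_after_dirty_shutdown(self):
        # The only time NVRAM is read: rebuild the lost RAM contents from the journal.
        self.ram_cache = list(self.nvram_log)
```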

In an HA pair environment, where two controllers are connected to each other, NVRAM is mirrored between the two nodes. Here its mission is to preserve not-yet-written data in the event that the partner controller suffers a power outage or similar severe problem. NVRAM mirroring happens for HA pairs in Data ONTAP 7-mode, HA pairs in clustered Data ONTAP and HA pairs in MetroCluster environments.

Disks and disk shelves

Disk shelves contain disks. DS14 shelves contain 14 drives; DS2246, DS4243 and DS4246 shelves contain 24 disks; and DS4486 shelves contain 48 disks. Disk shelves are connected to controllers via shelf modules; those connections use either FC-AL (DS14) or SAS (DS2246/DS42xx) for connectivity.

What software is in a controller, anyway?

Data ONTAP

Data ONTAP is our controller's operating system. Almost everything sits here — from configuration files and databases to license keys, log files and some diagnostic tools. Our operating system is built on top of FreeBSD and usually lives in a volume called vol0. Data ONTAP includes protocol implementations for client access (e.g. NFS, CIFS), APIs for programmatic access (ZAPI) and protocols for management access (SSH). It is fair to say that Data ONTAP is the heart of a NetApp controller.

WAFL

WAFL is our Write Anywhere File Layout. If NVRAM’s role is the most-commonly misunderstood, WAFL comes in 2nd. Yet WAFL has a simple goal, which is to write data in full stripes across the storage media. WAFL acts as an intermediary of sorts — there is a top half where files and volumes sit, and a bottom half that interacts with RAID, manages SnapShots and some other things. WAFL isn’t a filesystem, but it does some things a filesystem does; it can also contain filesystems.

WAFL contains mechanisms for dealing with files & directories, for interacting with volumes & aggregates, and for interacting with RAID. If Data ONTAP is the heart of a NetApp controller, WAFL is the blood that it pumps.

Although WAFL can write anywhere we want, in reality we write where it makes the most sense: in the closest place (relative to the disk head) where we can write a complete stripe in order to minimize seek time on subsequent I/O requests. WAFL is optimized for writes, and we’ll see why below. Rather unusually for storage arrays, we can write client data and metadata anywhere.

A colleague has this to say about WAFL, and I couldn’t put it better:

There is a relatively simple “cheating at Tetris” analogy that can be used to articulate WAFL’s advantages. It is not hard to imagine how good you could be at Tetris if you were able to review the next thousand shapes that were falling into the pattern, rather than just the next shape.

Now imagine how much better you could be at Tetris if you could take any of the shapes from within the next thousand to place into your pattern, rather than being forced to use just the next shape that is falling.

Finally, imagine having plenty of time to review the next thousand shapes and plan your layout of all 1,000, rather than just a second or two to figure out what to do with the next piece that is falling. In summary, you could become the best Tetris player on Earth, and that is essentially what WAFL is in the arena of data allocation techniques onto underlying disk arrays.

The Tetris analogy is incredibly important, as it directly relates to the way NetApp uses WAFL to optimize for writes. Essentially, we collect random I/O that is destined to be written to disk, reorganize it so that it resembles sequential I/O as much as possible, and then write it to disk sequentially. Another way of describing this behavior is write coalescing: we reduce the number of operations that ultimately land on disk, because we reorganize them in memory before we commit them, and we wait until we have a batch of them before committing them to disk via a consistency point. Put another way, write coalescing allows us to avoid the common (and expensive) RAID workflow of "read-modify-write".
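
As a rough illustration of the coalescing idea (a hedged sketch, not NetApp code — the stripe width and data structures are invented for the example), imagine buffering random block writes in memory and only laying them down once full stripes can be built:

```python
def coalesce_writes(pending_writes, stripe_width=4):
    """Group buffered random writes into full stripes before committing.

    pending_writes: list of (block_id, payload) tuples collected in RAM.
    Returns a list of stripes, each `stripe_width` blocks wide, so the RAID
    layer never has to do a read-modify-write for a partial stripe.
    """
    # Sort by block id so the writes land as sequentially as possible.
    ordered = sorted(pending_writes, key=lambda w: w[0])
    stripes = [ordered[i:i + stripe_width]
               for i in range(0, len(ordered), stripe_width)]
    # In this simplified model, only full stripes are committed now; a partial
    # tail would stay in RAM (and in the NVRAM journal) a little longer.
    return [s for s in stripes if len(s) == stripe_width]
```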

Putting it all together

NetApp storage arrays are made up of controllers and disk shelves. Which end is the top and which is the bottom depends on your perspective: disks are grouped into RAID groups, and those RAID groups are combined to make aggregates. Volumes live in aggregates, and files and LUNs live in those volumes. A volume of CIFS data is shared to a client via an SMB share; a volume of NFS data is shared to a client via an NFS export. A LUN is presented to a client via an FCP, FCoE or iSCSI initiator group. Note the relationship here between controller and client — all clients care about are volumes, files and LUNs. They don't care directly about CPUs, NVRAM or really anything else when it comes to hardware. As far as they're concerned, there is data and I/O and that's it.
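
The containment hierarchy is easy to picture as nested structures (a simplified sketch in Python — real RAID groups, aggregates and volumes obviously carry far more state than this):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RaidGroup:
    disks: List[str]                       # e.g. ["0a.00.1", "0a.00.2", ...]

@dataclass
class Aggregate:
    raid_groups: List[RaidGroup]

@dataclass
class Volume:
    name: str
    aggregate: Aggregate
    files_and_luns: List[str] = field(default_factory=list)

# A client only ever sees the volume (via an SMB share or NFS export) or the
# LUN (via an FCP/FCoE/iSCSI initiator group) — never the layers underneath.
```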

In order to get data to and from clients as quickly as possible, NetApp engineers have done much to try to optimize controller performance. A lot of the architectural design you see in our controllers reflects this. Although the clients don’t care directly about how NetApp is architected, our architecture does matter to the way their underlying data is handled and the way their I/O is served. Here is a basic workflow, from the inimitable Recovery Monkey himself:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. The client receives an acknowledgement that the data has been written

Sounds pretty simple, right? On the surface, it is a pretty simple process. It also explains a few core concepts of the NetApp architecture:

  • Because we acknowledge the client write once it's hit NVRAM, we're optimized for writes out of the box
  • Because we don't need to wait for the disks, we can write anywhere we choose to
  • Because NVRAM is mirrored between partners, we can survive an outage to either half of an HA pair

The 2nd bullet provides the backbone of this post. Because we can write data anywhere we choose to, we tend to write data in the best place possible. This is typically in the largest contiguous stripe of free space in the volume’s aggregate (closest to the disk heads). After 10 seconds have elapsed, or if NVRAM becomes >=50% full, we write the client’s data from RAM (not from NVRAM) to disk.
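
In rough pseudocode terms (the 10-second timer and 50% watermark are from the description above; the function itself is just an illustration), the trigger looks like this:

```python
def should_trigger_cp(seconds_since_last_cp, nvram_used_bytes, nvram_capacity_bytes):
    """Fire a consistency point every 10 seconds, or sooner if the active
    portion of NVRAM reaches its watermark — whichever comes first."""
    timer_expired = seconds_since_last_cp >= 10
    watermark_hit = nvram_used_bytes >= 0.5 * nvram_capacity_bytes
    return timer_expired or watermark_hit
```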

This operation is called a consistency point. Because we use RAID, a consistency point requires us to perform RAID calculations and to calculate parity. These calculations are processed by the CPUs using data that exists in RAM.

Many people think that our acceleration occurs because of NVRAM. In reality, a lot of our acceleration happens while we're waiting for NVRAM. For example — and very significantly — we transmogrify random I/O from the client into sequential I/O before writing it to disk. This processing is done by the CPU and occurs in RAM. Another significant benefit is that because we calculate parity in RAM, we do not need to hammer the disk drives that contain parity information.

So, with the added step of a CP, this is how things really happen:

  1. The client writes some data
  2. Write arrives in controller’s RAM and is copied into controller’s NVRAM
  3. Write is then mirrored into partner’s NVRAM
  4. When the partner NVRAM acknowledges the write, the client receives an acknowledgement that the data has been written [to disk]
  5. A consistency point occurs and the data (incl. parity) is written to disk
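
Strung together as a sketch (the function and variable names here are invented for illustration, and real ONTAP does far more work at each step), the five steps look roughly like this:

```python
def handle_client_write(data, ram_cache, local_nvram, partner_nvram):
    # Steps 1-2: the write lands in RAM and is journalled in local NVRAM.
    ram_cache.append(data)
    local_nvram.append(data)

    # Step 3: the journal entry is mirrored to the HA partner's NVRAM.
    partner_nvram.append(data)

    # Step 4: only now is the client acknowledged — no disk I/O has happened yet.
    return "ACK"

def consistency_point(ram_cache, local_nvram, partner_nvram, disk):
    # Step 5: data (plus parity) is written to disk from RAM, then the
    # journal entries covering that data are discarded.
    disk.extend(ram_cache)
    local_nvram.clear()
    partner_nvram.clear()
    # Note: the data stays in RAM as read cache (at lower priority);
    # only the journal is cleared.
```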

To reiterate, writes are cached in the controller's RAM at the same time as being logged into NVRAM (and the partner's NVRAM if you're using an HA pair). Once the data has been written to disk via a CP, the writes are purged from the controller's NVRAM but retained (with a lower priority) in the controller's RAM. The prioritization allows us to evict less commonly-needed blocks in order to avoid overrunning system memory; in other words, recently-written data is the first to be ejected from the first-level read cache in the controller's RAM.

The parity issue also catches people out. Traditional filesystems and arrays write data (and metadata) into pre-allocated locations; Data ONTAP and WAFL let NetApp write data (and metadata) in whatever location will provide fastest access. This is usually a stripe to the nearest available set of free blocks. The ability of WAFL to write to the nearest available free disk blocks lets us greatly reduce disk seeking. This is the #1 performance challenge when using spinning disks! It also lets us avoid the “hot parity disk” paradigm, as WAFL always writes to new, free disk blocks using pre-calculated parity.
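
A tiny worked example shows why writing full stripes to fresh blocks is so much cheaper than updating blocks in place. For a full-stripe write, parity is just an XOR across the new data blocks already sitting in RAM; for an in-place partial update, the RAID layer would first have to read the old data and old parity back from disk (the classic read-modify-write). This is only a conceptual sketch of single-parity XOR, not NetApp's RAID-DP implementation:

```python
from functools import reduce

def full_stripe_parity(new_blocks):
    """Parity for a full-stripe write: XOR the in-memory data blocks together.
    No disk reads are needed at all."""
    return reduce(lambda a, b: a ^ b, new_blocks)

def read_modify_write_parity(old_block, old_parity, new_block):
    """Parity for an in-place partial update: two extra disk reads
    (old data + old parity) before the two writes can happen."""
    return old_parity ^ old_block ^ new_block

# Example with 4-bit "blocks":
blocks = [0b1010, 0b0110, 0b1100]
print(bin(full_stripe_parity(blocks)))   # 0b0 — parity for this stripe
```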

Consistency points are also commonly misunderstood. The purpose of a consistency point is simple: to write data to free space on disks. Once a CP has taken place, the contents of NVRAM are discarded. Incoming writes that are received during the actual process of a consistency point’s commitment (i.e., the actual writing to disk) will be written to disk in the next consistency point.

The workflow for a write:

1. Write is sent from the host to the storage system (via a NIC or HBA)
2. Write is processed into system memory while a) being logged in NVRAM and b) being logged in the HA partner’s NVRAM
3. Write is acknowledged to the host
4. Write is committed to disk storage in a consistency point (CP)

The importance of consistency points cannot be overstated — they are a cornerstone of Data ONTAP's architecture! They are also the reason why Data ONTAP is optimized for writes: no matter the destination disk type, we acknowledge client writes as soon as they hit NVRAM, and NVRAM is always faster than any type of disk!

Consistency points typically occur every 10 seconds or whenever NVRAM begins to get full, whichever comes first. The point at which NVRAM is considered full enough is referred to as a "watermark". Imagine that NVRAM is a bucket, and now divide that bucket in half. One side of the bucket is for incoming data (from clients) and the other side of the bucket is for outgoing data (to disks). As one side of the bucket fills, the other side drains. In an HA pair, we actually divide NVRAM into four buckets: two for the local controller and two for the partner controller. (This is why, when activating or deactivating HA on a system, a reboot is required for the change to take effect. The 10-second rule is also why, on a system that is doing practically no I/O, the disks will always blink every 10 seconds.)
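
The bucket picture amounts to a double buffer, which can be sketched like this (a conceptual illustration only; the real division and watermark logic depend on the platform, as noted below):

```python
class NvramBuckets:
    """Two halves of NVRAM: one fills with incoming journal entries while
    the other is drained by the consistency point in progress."""

    def __init__(self):
        self.buckets = [[], []]
        self.active = 0                   # index of the half currently filling

    def log(self, entry):
        self.buckets[self.active].append(entry)

    def start_consistency_point(self):
        draining = self.active
        self.active = 1 - self.active     # new writes now land in the other half...
        return self.buckets[draining]     # ...while this half is committed to disk

    def finish_consistency_point(self):
        self.buckets[1 - self.active].clear()   # journal entries are discarded
```

In an HA pair the same split is applied again for the partner's mirrored journal, which is where the four buckets come from.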

The actual value of the watermark varies depending on the exact configuration of your environment: the Filer model, the Data ONTAP version, whether or not it’s HA, and whether SATA disks are present. Because SATA disks write more slowly than SAS disks, consistency points take longer to write to disk if they’re going to SATA disks. In order to combat this, Data ONTAP lowers the watermark in a system when SATA disks are present.

The size of NVRAM only really dictates a Filer's performance envelope when handling very large amounts of sequential writes. With the new FAS80xx family, though, NVRAM sizes have grown massively — up to 4x those of the FAS62xx family.
