Fixing slow boot times with ESXi & NetApp iSCSI LUNs

I’ve been using a mix of iSCSI LUNs and NFS datastores since VMware ESX 3.5, and I used them quite heavily in ESX & ESXi 4 without issue. Since moving to ESXi 5, though, I’ve noticed that my ESXi hosts are taking a long time to boot: more than 45 minutes! During the boot process, they hang on a single screen for the majority of that time:

The message on that screen is vmware_vaaip_netapp loaded successfully. I did some debugging work, my boss chimed in with his suggestions, and we got the issue squared away this morning. I had narrowed it down far enough to identify the cause of the symptoms: the presence of a dynamic iSCSI target. You can have as many NFS datastores as you want, and even as many iSCSI software HBAs as you want, but the moment you add a dynamic iSCSI target, the problems start (at least in our environment). What I mean by “our environment” is that we have a large number of server VLANs (several dozen), and our NetApp filers provide file services to almost all of those VLANs:

[root@palin ~]# rsh blender iscsi portal show
Network portals:
IP address        TCP Port  TPGroup  Interface
10.140.229.7         3260    2000    vif0-260
10.140.231.135       3260    2001    vif0-265
10.140.235.7         3260    2003    vif0-278
10.140.224.135       3260    2006    vif0-251
10.140.225.7         3260    2007    vif0-252
10.140.226.7         3260    2008    vif0-254
10.140.227.7         3260    2009    vif0-256
10.140.227.135       3260    2010    vif0-257
10.140.228.7         3260    2011    vif0-258
10.140.228.135       3260    2012    vif0-259
10.140.230.7         3260    2013    vif0-262
10.140.231.7         3260    2014    vif0-264
10.140.232.7         3260    2015    vif0-266
10.140.232.71        3260    2016    vif0-267
10.140.232.135       3260    2017    vif0-268
10.140.232.199       3260    2018    vif0-269
10.140.233.7         3260    2019    vif0-270
10.140.233.135       3260    2021    vif0-272
10.140.233.199       3260    2022    vif0-273
10.140.234.7         3260    2023    vif0-274
10.140.234.71        3260    2024    vif0-275
10.140.234.135       3260    2025    vif0-276
10.140.234.199       3260    2026    vif0-277
10.140.235.199       3260    2027    vif0-281

You can see that there are two dozen VIFs there, each on its own VLAN. In my case, I’m looking for the target that sits on vif0-265; I don’t care about any of the other targets. The trouble is that my ESXi hosts only have VLAN 265 trunked to their VMkernels, so the only target they can reach is the one on VLAN 265. After I explained this to my boss, he said, “I bet the Filer is enumerating all of those portals to the ESXi host, and 99% of them are timing out” (because they are inaccessible).
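To make that ratio concrete, here is a minimal Python sketch that parses the portal listing and splits it into portals the host can reach and portals that would time out. It is only an illustration: the function names `parse_portals` and `split_by_interface` are made up for this example, the sample text is a three-row excerpt of the output above, and it assumes that the set of VIFs trunked to the host is known in advance.

```python
# Sketch only: classify Filer iSCSI portals as reachable or not for one host.
# PORTAL_OUTPUT is a shortened excerpt of the "iscsi portal show" output above.
PORTAL_OUTPUT = """\
Network portals:
IP address        TCP Port  TPGroup  Interface
10.140.229.7         3260    2000    vif0-260
10.140.231.135       3260    2001    vif0-265
10.140.235.7         3260    2003    vif0-278
"""


def parse_portals(text):
    """Return (ip, port, tpgroup, interface) tuples from the portal listing."""
    portals = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four fields and a numeric TCP port; this
        # skips the "Network portals:" banner and the column header row.
        if len(fields) == 4 and fields[1].isdigit():
            ip, port, tpgroup, interface = fields
            portals.append((ip, int(port), int(tpgroup), interface))
    return portals


def split_by_interface(portals, trunked_interfaces):
    """Split portals by whether their VIF is in the host's trunked set."""
    reachable = [p for p in portals if p[3] in trunked_interfaces]
    unreachable = [p for p in portals if p[3] not in trunked_interfaces]
    return reachable, unreachable


if __name__ == "__main__":
    portals = parse_portals(PORTAL_OUTPUT)
    # Assumption: this host only has the VLAN behind vif0-265 trunked to it.
    reachable, unreachable = split_by_interface(portals, {"vif0-265"})
    print("reachable:", [p[0] for p in reachable])
    print("portals that would time out:", len(unreachable))
```

Fed the full listing instead of the excerpt, this would report one reachable portal (10.140.231.135) and 23 unreachable ones.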

Turns out, he was right! This is taken from the iscsi(1) man page:

* If a network interface is disabled for iSCSI use (via iscsi interface disable), then it is not accessible to any initiator regardless of any accesslists in effect.

Since we’re using initiator groups and not accesslists, this is our problem: with no accesslist in place for an initiator, nothing restricts which interfaces it is offered, so the Filer is indeed enumerating every portal it has configured (all two dozen!), even though our ESXi host is only trunked out to one of them. That means I’m waiting for 23 separate connections to time out so that 1 connection can work. So we came up with this:

[root@palin ~]# rsh blender iscsi interface accesslist add iqn.1998-01.com.vmware:londonderry-29a21d34 vif0-265
Adding interface vif0-265 to the accesslist for iqn.1998-01.com.vmware:londonderry-29a21d34

[root@palin ~]# rsh blender iscsi interface accesslist show
Initiator Name                      Access List
iqn.1998-01.com.vmware:londonderry-29a21d34    vif0-265

Now, because that accesslist exists for that initiator, the Filer only advertises the target on the correct VIF & VLAN (in this case, vif0-265). Problem solved! Now, all I have to do is go through and add the rest of my iSCSI initiator names to my Filers, and Robert will be my father’s brother.
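For that last step, a small Python sketch along these lines could generate the commands for a list of initiators. Treat it as a sketch under assumptions: `accesslist_commands` is a hypothetical helper, the second IQN is a made-up placeholder, and the script only prints the rsh commands so they can be reviewed before anything is run against the Filer.

```python
import shlex

FILER = "blender"       # Filer hostname used in this post
INTERFACE = "vif0-265"  # the VIF on the VLAN the ESXi hosts can reach


def accesslist_commands(initiators, filer=FILER, interface=INTERFACE):
    """Build one 'iscsi interface accesslist add' rsh command per initiator."""
    commands = []
    for iqn in initiators:
        argv = ["rsh", filer, "iscsi", "interface", "accesslist", "add",
                iqn, interface]
        # Quote each argument so the printed line is safe to paste in a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    # Only the first IQN comes from this post; the second is a placeholder.
    initiators = [
        "iqn.1998-01.com.vmware:londonderry-29a21d34",
        "iqn.1998-01.com.vmware:example-host-00000000",
    ]
    for command in accesslist_commands(initiators):
        print(command)  # review the output, then run each command by hand
```

Printing rather than executing is deliberate: a wrong VIF name in an accesslist would cut an initiator off from its target, so a human look at the generated commands seems worth the extra step.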


3 thoughts on “Fixing slow boot times with ESXi & NetApp iSCSI LUNs”

  1. Can you tell me where you ran this command? On your ESXi hosts, your NetApp SAN, or the switch? I’m having the same issue.

  2. Hi Shaun,

    In this blog, “blender” was the name of my NetApp Filer, so that’s where you’ll need to run the commands. I ran them via non-interactive rsh, but you could do it via interactive telnet, SSH, or serial if you wanted.

