Fixing slow boot times with ESXi & NetApp iSCSI LUNs
by bitpushr
I’ve been using a mix of iSCSI LUNs and NFS datastores since VMware ESX 3.5, and I used them quite heavily in ESX & ESXi 4 without issue. Since moving to ESXi 5, though, I’ve noticed that my ESXi hosts are taking a long time to boot: more than 45 minutes! During the boot process, they hang on a single screen for most of that time.
The message on that screen is vmware_vaaip_netapp loaded successfully. I did some debugging work, my boss chimed in with his suggestions, and we got the issue squared away this morning. I had narrowed it down far enough to identify the trigger: the presence of a dynamic iSCSI target. You can have as many NFS datastores as you want, and even as many iSCSI software HBAs as you want, but the moment you add a dynamic iSCSI target is the moment the slow boots start, at least in our environment. By "our environment" I mean this: we have a large number of server VLANs (several dozen), and our NetApp filers provide file services to almost all of those VLANs:
    [root@palin ~]# rsh blender iscsi portal show
    Network portals:
    IP address      TCP Port  TPGroup  Interface
    10.140.229.7    3260      2000     vif0-260
    10.140.231.135  3260      2001     vif0-265
    10.140.235.7    3260      2003     vif0-278
    10.140.224.135  3260      2006     vif0-251
    10.140.225.7    3260      2007     vif0-252
    10.140.226.7    3260      2008     vif0-254
    10.140.227.7    3260      2009     vif0-256
    10.140.227.135  3260      2010     vif0-257
    10.140.228.7    3260      2011     vif0-258
    10.140.228.135  3260      2012     vif0-259
    10.140.230.7    3260      2013     vif0-262
    10.140.231.7    3260      2014     vif0-264
    10.140.232.7    3260      2015     vif0-266
    10.140.232.71   3260      2016     vif0-267
    10.140.232.135  3260      2017     vif0-268
    10.140.232.199  3260      2018     vif0-269
    10.140.233.7    3260      2019     vif0-270
    10.140.233.135  3260      2021     vif0-272
    10.140.233.199  3260      2022     vif0-273
    10.140.234.7    3260      2023     vif0-274
    10.140.234.71   3260      2024     vif0-275
    10.140.234.135  3260      2025     vif0-276
    10.140.234.199  3260      2026     vif0-277
    10.140.235.199  3260      2027     vif0-281
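If you want to check a listing like this without counting rows by eye, a small filter helps. This is only a rough sketch: the function names portal_rows and portals_on_iface are made up for illustration, it assumes the four-column layout shown above, and the sample holds just three rows copied from that output rather than the full list.

```shell
# Hypothetical helpers for saved `iscsi portal show` output.

# portal_rows: keep only data rows, i.e. four fields with an IPv4 address first.
# This drops the "Network portals:" line and the column header.
portal_rows() {
    awk 'NF == 4 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/'
}

# portals_on_iface IFACE: print IP:PORT for every portal bound to IFACE.
portals_on_iface() {
    portal_rows | awk -v iface="$1" '$4 == iface { print $1 ":" $2 }'
}

# Three rows copied from the listing above (not the whole thing).
sample='Network portals:
IP address      TCP Port  TPGroup  Interface
10.140.229.7    3260      2000     vif0-260
10.140.231.135  3260      2001     vif0-265
10.140.235.7    3260      2003     vif0-278'

# How many portals are in the sample?
printf '%s\n' "$sample" | portal_rows | wc -l

# Which portal lives on the interface we care about?
printf '%s\n' "$sample" | portals_on_iface vif0-265
```

In practice you would pipe the real rsh output into the same functions instead of the sample variable.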
That portal listing shows two dozen VIFs, each on its own VLAN. In my case, I’m looking for the target that sits on vif0-265; I don’t care about any of the other targets. The trouble is that my ESXi hosts only have VLAN 265 trunked to their VMkernels, so the only target they can reach is the one on VLAN 265. After I explained this to my boss, he said “I bet the Filer is enumerating all of those portals to the ESXi host, and 99% of them are timing out” (because they are inaccessible).
Turns out, he was right! This is taken from the iscsi(1) man page:
* If a network interface is disabled for iSCSI use (via iscsi interface disable), then it is not accessible to any initiator regardless of any accesslists in effect.
Since we’re using initiator groups and not accesslists, this is our problem: the Filer is indeed enumerating every portal it has configured (all two dozen!), even though our ESXi host is only trunked out to one of them. The host has to wait for 23 separate connections to time out before the 1 connection that can work does. So we came up with this:
    [root@palin ~]# rsh blender iscsi interface accesslist add iqn.1998-01.com.vmware:londonderry-29a21d34 vif0-265
    Adding interface vif0-265 to the accesslist for iqn.1998-01.com.vmware:londonderry-29a21d34
    [root@palin ~]# rsh blender iscsi interface accesslist show
    Initiator Name                                 Access List
    iqn.1998-01.com.vmware:londonderry-29a21d34    vif0-265
Now, because that accesslist exists for that initiator, the Filer advertises the target to it only on the correct VIF & VLAN (in this case, vif0-265), so there are no unreachable portals left to time out. Problem solved! Now, all I have to do is go through and add the rest of my iSCSI initiator names to my Filers, and Robert will be my father’s brother.
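Adding the remaining initiators by hand gets tedious, so here is a rough sketch of how the commands could be generated. The function name gen_accesslist_cmds is made up for illustration, and the second IQN in the usage example is a hypothetical placeholder, not one of my hosts. It only prints the rsh commands, so you can review them before running anything against a Filer.

```shell
# Hypothetical helper: print (do not run) one "accesslist add" command per initiator.
# Usage: gen_accesslist_cmds FILER INTERFACE IQN [IQN ...]
gen_accesslist_cmds() {
    filer=$1
    iface=$2
    shift 2
    for iqn in "$@"; do
        printf 'rsh %s iscsi interface accesslist add %s %s\n' "$filer" "$iqn" "$iface"
    done
}

# The first IQN is the one used above; the second is a made-up placeholder.
gen_accesslist_cmds blender vif0-265 \
    iqn.1998-01.com.vmware:londonderry-29a21d34 \
    iqn.1998-01.com.vmware:examplehost-00000000
```

Once the printed commands look right, pipe them to sh, then confirm the result with iscsi interface accesslist show as above.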
