Migrating vCenter Server from SQL Server Express to SQL Server Standard

Under vSphere 4, we were using SQL Server Express 2005; when we upgraded to vSphere 5 we kept the same database (even though vSphere 5 comes bundled with SQL Server Express 2008). However, we had long since surpassed the five hosts VMware supports on the bundled Express edition, so VMware suggested we migrate from SQL Server Express 2005 to SQL Server Standard 2008 R2. Here’s a quick synopsis of how that went:

Pre-flight:

  1. Stop all VMware services, in particular VMware vCenter
  2. Install SQL Server 2008 R2 Management Studio
  3. Perform a full backup of the VIM_VCDB database (I do this via SQL Server Management Studio; there’s a T-SQL sketch of the backup after this list)
  4. Uninstall VMware vCenter Server
  5. Move the backup somewhere, like C:\
  6. Uninstall SQL Server 2005
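
If you’d rather script the backup than click through Management Studio, a T-SQL sketch along these lines does the same job (the backup path is just an example; adjust to taste):

-- run in a query window against the existing SQL Server 2005 Express instance
BACKUP DATABASE VIM_VCDB
TO DISK = 'C:\VIM_VCDB.bak'
WITH INIT, NAME = 'VIM_VCDB full backup before migration';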

Installing SQL Server 2008 R2:

  1. Install SQL Server 2008 R2 Standard Edition
  2. Installation features (a scripted-install sketch covering these follows this list):
    • Database Engine Services
    • Management Tools – Basic
    • Management Tools – Complete
  3. Service Accounts:
    • SQL Server Agent: NT AUTHORITY\SYSTEM; startup type Automatic
    • SQL Server Database Engine: NT AUTHORITY\SYSTEM; startup type Automatic
    • SQL Server Browser: NT AUTHORITY\LOCAL SERVICE; startup type Automatic
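
If you prefer a scripted install over clicking through the wizard, the same feature set can be laid down from an elevated command prompt with something roughly like this (the instance name and sysadmin account below are assumptions; double-check the switches against your installation media):

setup.exe /Q /ACTION=Install /IACCEPTSQLSERVERLICENSETERMS ^
  /FEATURES=SQLENGINE,SSMS,ADV_SSMS /INSTANCENAME=MSSQLSERVER ^
  /SQLSVCACCOUNT="NT AUTHORITY\SYSTEM" /SQLSVCSTARTUPTYPE=Automatic ^
  /AGTSVCACCOUNT="NT AUTHORITY\SYSTEM" /AGTSVCSTARTUPTYPE=Automatic ^
  /BROWSERSVCSTARTUPTYPE=Automatic /SQLSYSADMINACCOUNTS="BUILTIN\Administrators"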

Restoring the vCenter Server DB:

  1. Launch SQL Server Management Studio & connect to your instance
  2. Right-click on Databases and choose Restore Database…
  3. Select the file and the database name (probably VIM_VCDB5 or something unique); there’s a T-SQL sketch of the restore after this list
  4. Create a new, 64-bit System DSN and point it to the new database. Use SQL Server Native Client 10.0 as your driver
  5. Make sure the default database is VIM_VCDB5, not master!
  6. Start the SQL Server Agent service, if it’s not already running
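
The restore and the default-database check can also be done in T-SQL. Here’s a rough sketch, assuming the backup sits at C:\VIM_VCDB.bak, the new database is named VIM_VCDB5, and the vCenter login is called vcenter (the logical file names and paths are guesses; take the real ones from the FILELISTONLY output):

-- see what logical file names are inside the backup
RESTORE FILELISTONLY FROM DISK = 'C:\VIM_VCDB.bak';

-- restore under the new name, relocating the data and log files
RESTORE DATABASE VIM_VCDB5
FROM DISK = 'C:\VIM_VCDB.bak'
WITH MOVE 'VIM_VCDB' TO 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VIM_VCDB5.mdf',
     MOVE 'VIM_VCDB_log' TO 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VIM_VCDB5_log.ldf';

-- make sure the vCenter login lands in the new database by default (step 5 above)
ALTER LOGIN [vcenter] WITH DEFAULT_DATABASE = [VIM_VCDB5];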

Now, you should be able to reinstall vCenter Server 5. When prompted, select the new DSN you created, and make sure you choose to use your existing database!

For those using VMware Update Manager, you will also need to re-create a 32-bit System DSN and point it to the same VIM_VCDB5 database. You create a 32-bit DSN by launching the 32-bit ODBC administrator, which lives at C:\Windows\SysWOW64\odbcad32.exe. (You’ll still use SQL Server Native Client 10.0 as the driver, though.)

Fixing slow boot times with ESXi & NetApp iSCSI LUNs

I’ve been using a mix of iSCSI LUNs and NFS datastores since VMware ESX 3.5; I used both quite heavily in ESX & ESXi 4 without issue. Since moving to ESXi 5, though, I’ve noticed that my ESXi hosts are taking a long time to boot, more than 45 minutes! During the boot process, they hang on a single screen for the majority of that time.

The last message on that screen is vmware_vaaip_netapp loaded successfully. I did some debugging work, my boss chimed in with his suggestions, and we got the issue squared away this morning. I had narrowed it down to the point where I could identify the cause of the symptoms: the presence of a dynamic iSCSI target. You can have as many NFS datastores as you want, and even as many iSCSI software HBAs as you want, but the moment you add a dynamic iSCSI target is the moment you have issues, at least in our environment. What I mean by that is: we have a large number of server VLANs (several dozen), and our NetApp filers provide file services to almost all of those VLANs:

[root@palin ~]# rsh blender iscsi portal show
Network portals:
IP address        TCP Port  TPGroup  Interface
10.140.229.7         3260    2000    vif0-260
10.140.231.135       3260    2001    vif0-265
10.140.235.7         3260    2003    vif0-278
10.140.224.135       3260    2006    vif0-251
10.140.225.7         3260    2007    vif0-252
10.140.226.7         3260    2008    vif0-254
10.140.227.7         3260    2009    vif0-256
10.140.227.135       3260    2010    vif0-257
10.140.228.7         3260    2011    vif0-258
10.140.228.135       3260    2012    vif0-259
10.140.230.7         3260    2013    vif0-262
10.140.231.7         3260    2014    vif0-264
10.140.232.7         3260    2015    vif0-266
10.140.232.71        3260    2016    vif0-267
10.140.232.135       3260    2017    vif0-268
10.140.232.199       3260    2018    vif0-269
10.140.233.7         3260    2019    vif0-270
10.140.233.135       3260    2021    vif0-272
10.140.233.199       3260    2022    vif0-273
10.140.234.7         3260    2023    vif0-274
10.140.234.71        3260    2024    vif0-275
10.140.234.135       3260    2025    vif0-276
10.140.234.199       3260    2026    vif0-277
10.140.235.199       3260    2027    vif0-281

You can see that there are two dozen VIFs there, each on its own VLAN. In my case, I’m only interested in the target that sits on vif0-265; I don’t care about any of the other portals. The trouble is that my ESXi hosts only have VLAN 265 trunked to their VMkernel ports, so the only target they can reach is the one on VLAN 265. After I explained this to my boss, he said, “I bet the Filer is enumerating all of those portals to the ESXi host, and 99% of them are timing out” (because they are inaccessible).
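
For reference, the only dynamic discovery (Send Targets) entry configured on the ESXi side is the portal on vif0-265; on ESXi 5 that was set up with something like the following, where vmhba33 stands in for whatever your software iSCSI adapter is actually named:

~ # esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=10.140.231.135:3260
~ # esxcli iscsi adapter discovery sendtarget list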

Turns out, he was right! This is taken from the iscsi(1) man page:

* If a network interface is disabled for iSCSI use (via iscsi interface disable), then it is not accessible to any initiator regardless of any accesslists in effect.

Since we’re using initiator groups and not accesslists, this is our problem: the Filer is indeed enumerating every portal it has configured (all two dozen of them!), even though our ESXi host is only trunked out to one of them. So I’m waiting for 23 separate connections to time out so that 1 connection can work. Here’s the fix we came up with:

[root@palin ~]# rsh blender iscsi interface accesslist add iqn.1998-01.com.vmware:londonderry-29a21d34 vif0-265   
Adding interface vif0-265 to the accesslist for iqn.1998-01.com.vmware:londonderry-29a21d34

[root@palin ~]# rsh blender iscsi interface accesslist show     
Initiator Name                      Access List
iqn.1998-01.com.vmware:londonderry-29a21d34    vif0-265

Now, because that accesslist exists for that initiator, the Filer only reports the target as available on the correct VIF & VLAN (in this case, vif0-265). Problem solved! Now all I have to do is go through and add the rest of my iSCSI initiator names to my Filers, and Robert will be my father’s brother.
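
Since that has to happen for every host, a quick loop from the admin box takes care of it. A sketch, using the same command as above with placeholder IQNs; substitute your real initiator names:

#!/bin/sh
# add the vif0-265 accesslist entry for each remaining ESXi initiator (placeholder IQNs)
for iqn in \
    iqn.1998-01.com.vmware:host02-00000000 \
    iqn.1998-01.com.vmware:host03-00000000
do
    rsh blender iscsi interface accesslist add "$iqn" vif0-265
done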

Deleting problem files from VMFS data stores

Following up from my issues yesterday, I had a bunch of files (old, bad snapshot deltas) that I needed to delete. Problem was, I couldn’t:

/vmfs/volumes/datastore/vm/badsnapshots # rm -rf *
rm: cannot remove 'foghorn-000101-ctk.vmdk': Invalid argument
rm: cannot remove 'foghorn-000101-delta.vmdk': Invalid argument

Try as I might, I couldn’t get rid of them. Via lsof I could see that the files weren’t locked; indeed, I was able to move them, just not delete them. So I cheated, by echoing a character to each file (to verify its sanity and update its mtime):

echo "a" > *

Then, rm worked. Victory!
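
If the wildcard redirect makes your shell complain (redirecting to a glob that matches more than one file is ambiguous), a per-file loop does the same trick:

# touch each stuck file individually, then remove it
for f in *; do
    echo "a" > "$f"
    rm -f "$f"
done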

vSphere problems with vpxa on hosts

I had a very bizarre issue recently, where two of my 20 vSphere ESXi 5 hosts disconnected from their clusters. When I’d try to reconnect (or remove & re-add) the hosts to their clusters, I would get an error message saying the host couldn’t be added because it “timed waiting for vpxa to start.” Bad grammar theirs, not mine!

After I filed a support request with VMware, a very helpful engineer helped me determine the cause. Looking through the vpxa logs (/var/log/vpxa.log), he noticed that some virtual machines on each host had so many snapshot files that vCenter Server was having trouble managing those hosts. So we enabled SSH on one of the problematic ESXi hosts and took a look:

/vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654/Foghorn(Test BlackBaud DB) # ls
Foghorn-000001-ctk.vmdk    Foghorn-000041-delta.vmdk  Foghorn-000081.vmdk        Foghorn-000122-ctk.vmdk    Foghorn-000162-delta.vmdk  Foghorn-000202.vmdk
Foghorn-000001-delta.vmdk  Foghorn-000041.vmdk        Foghorn-000082-ctk.vmdk    Foghorn-000122-delta.vmdk  Foghorn-000162.vmdk        Foghorn-000203-ctk.vmdk

I cut that off because there were more than 200 delta files for that VM! Obviously, the snapshot process had spun way out of control for this particular VM. It’s unclear why this happened, but removing those VMs from the hosts allowed me to add the hosts back to the cluster.
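
If you want to spot a runaway snapshot chain like this before vpxa falls over, a couple of quick checks from the ESXi shell will do it (the Vmid below is a placeholder; take the real one from the getallvms output):

~ # find /vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654 -name '*-delta.vmdk' | wc -l
~ # vim-cmd vmsvc/getallvms              # note the Vmid of the suspect VM
~ # vim-cmd vmsvc/snapshot.get 42        # 42 is a placeholder Vmid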

After that, I simply cloned the problematic VMs (which automatically flattens the snapshots) into new VMs and the problem was solved.