vSphere problems with vpxa on hosts

I had a very bizarre issue recently: two of my 20 vSphere ESXi 5 hosts disconnected from their clusters. When I tried to reconnect the hosts (or remove and re-add them), I got an error message saying the host couldn’t be added because it “timed waiting for vpxa to start”. Bad grammar theirs, not mine!

After filing a support request with VMware, a very helpful engineer helped me determine the cause. Looking through the vpxa logs (/var/log/vpxa.log), he noticed that some virtual machines on each host had lots of snapshot files, and vCenter Server was having trouble managing that host. So, we enabled SSH on the problematic ESXi host, and took a look:

/vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654/Foghorn(Test BlackBaud DB) # ls
Foghorn-000001-ctk.vmdk    Foghorn-000041-delta.vmdk  Foghorn-000081.vmdk        Foghorn-000122-ctk.vmdk    Foghorn-000162-delta.vmdk  Foghorn-000202.vmdk
Foghorn-000001-delta.vmdk  Foghorn-000041.vmdk        Foghorn-000082-ctk.vmdk    Foghorn-000122-delta.vmdk  Foghorn-000162.vmdk        Foghorn-000203-ctk.vmdk

I cut that off, because there were more than 200 delta files for that VM! Obviously, the snapshot process had spun way out of control for this particular VM. It’s unclear why this happened, but removing those VMs from the host allowed me to add the hosts back to the cluster.
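If you want to check other VMs for the same runaway condition, a rough sketch like the following counts the `-delta.vmdk` files in each VM directory and prints the ones at or above a threshold. The function names, the threshold, and the assumption that each VM lives in its own directory directly under the datastore root are mine, not something VMware support provided; it uses only plain POSIX shell, but I have not verified it against every ESXi shell build.

```shell
# Hypothetical helper: print the number of *-delta.vmdk files in one VM directory.
# $1: path to the VM directory (quoting handles spaces and parentheses in names)
count_deltas() {
    n=0
    for f in "$1"/*-delta.vmdk; do
        # If the glob matches nothing it stays literal, so check the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Hypothetical helper: list VM directories with at least $2 delta files.
# $1: datastore root (for example a path under /vmfs/volumes); $2: threshold
report_deltas() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        dir=${dir%/}
        count=$(count_deltas "$dir")
        if [ "$count" -ge "$2" ]; then
            echo "$count $dir"
        fi
    done
}
```

For example, `report_deltas /vmfs/volumes/<your-datastore> 10` would print one line per VM directory holding ten or more delta files, with the count first. The threshold of 10 is arbitrary; pick whatever is abnormal for your backup schedule.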

After that, I simply cloned the problematic VMs (which automatically flattens the snapshots) into new VMs and the problem was solved.


14 thoughts on “vSphere problems with vpxa on hosts”

  1. Oh man, I just had the same issue. It was my Virtual Center Server VM that had 234 VMDK files and caused the host to go crazy. After 8 hours of support looking at logs, I found the rogue VM. We ended up connecting using the vCenter Client directly to the problem host and did a consolidation of snapshots on the problem VM.

    I think it got so many snapshots after VDR 2 reported errors backing up this VM saying that it could not quiesce the file system. The VM was also running very slowly. The consolidation is still going, but I think we nailed it down.

    I saw this page in my research, but didn’t think any of my VMs had a lot of snapshots so I wrote it off. I should have looked closer.

  2. Same problem here..
    Using VMware Data Recovery with vSphere 5.0
    Two different VMs with tens of rogue snapshots each and vCenter Server disconnected the host containing the VMs (same “timed waiting for vpxa to start” error message on re-connection attempts). I had to remove the VMs from inventory directly on the host to get it to show back up as connected in vSphere and then use VMware Converter to re-add the VMs.

    Does anyone have any idea how to prevent this from happening again?

  3. Same problem here! Thanks for the help and comments.

    In my case a virtual machine had too many snapshots, so I consolidated them, checked with “Repo browser” that all were deleted by the consolidation, and later the host was added to the cluster without problems.

  4. Thank you for this article.
    I found 240 abandoned snapshots of a particular VM that always threw snapshot errors in VDR2 (it’s a Win2k Server VM).
    Those snapshots were never shown in the vSphere Client, so I’m glad I found out about it this way, as my storage was almost full.

  5. SSH was not enabled for that particular host, so how do I delete those VMs? I upgraded ESX to 5.5, and I never had this error before. Do you have any more suggestions?

    • Suresh, the only thing I can think of is, do you have access to that datastore? Perhaps via a Linux host? You could conceivably delete them from there.

