vSphere problems with vpxa on hosts
by bitpushr
I had a very bizarre issue recently, where two of my 20 vSphere ESXi 5 hosts disconnected from their clusters. When I tried to reconnect the hosts (or remove and re-add them to the clusters), I got an error message saying the host couldn’t be added because “timed waiting for vpxa to start”. Bad grammar theirs, not mine!
After filing a support request with VMware, a very helpful engineer helped me determine the cause. Looking through the vpxa logs (/var/log/vpxa.log), he noticed that some virtual machines on each host had lots of snapshot files, and vCenter Server was having trouble managing that host. So, we enabled SSH on the problematic ESXi host, and took a look:
/vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654/Foghorn(Test BlackBaud DB) # ls
Foghorn-000001-ctk.vmdk    Foghorn-000041-delta.vmdk  Foghorn-000081.vmdk        Foghorn-000122-ctk.vmdk    Foghorn-000162-delta.vmdk  Foghorn-000202.vmdk
Foghorn-000001-delta.vmdk  Foghorn-000041.vmdk        Foghorn-000082-ctk.vmdk    Foghorn-000122-delta.vmdk  Foghorn-000162.vmdk        Foghorn-000203-ctk.vmdk
I cut that off, because there were more than 200 delta files for that VM! Obviously, the snapshot process had spun way out of control for this particular VM. It’s unclear why this happened, but removing those VMs from the host allowed me to add the hosts back to the cluster.
After that, I simply cloned the problematic VMs (which automatically flattens the snapshots) into new VMs and the problem was solved.
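If you want to check for this before a host drops out, counting the delta files per VM directory is easy to script. What follows is only a rough sketch, not something the VMware engineer gave me: the function names and the threshold of 30 are made up for illustration, and it assumes that every VM lives in its own directory directly under the datastore and that delta files follow the NAME-NNNNNN-delta.vmdk pattern seen in the listing above.

```python
import os
import re

# Matches snapshot delta files such as "Foghorn-000041-delta.vmdk".
# The six-digit pattern is an assumption based on the listing above.
DELTA_RE = re.compile(r"-\d{6}-delta\.vmdk$")


def count_delta_files(vm_dir):
    """Return the number of snapshot delta files directly inside vm_dir."""
    return sum(1 for name in os.listdir(vm_dir) if DELTA_RE.search(name))


def find_snapshot_heavy_vms(datastore_dir, threshold=30):
    """Return {vm_directory_name: delta_count} for VMs above the threshold.

    The threshold is an arbitrary illustrative value, not a VMware limit.
    """
    heavy = {}
    for entry in sorted(os.listdir(datastore_dir)):
        vm_dir = os.path.join(datastore_dir, entry)
        if os.path.isdir(vm_dir):
            count = count_delta_files(vm_dir)
            if count > threshold:
                heavy[entry] = count
    return heavy
```

To use it, call find_snapshot_heavy_vms() with the path of a datastore (for example, one of the directories under /vmfs/volumes) and print the resulting dictionary. Any VM it reports is a candidate for snapshot consolidation or cloning.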
Oh man, I just had the same issue. It was my Virtual Center Server VM that had 234 VMDK files and caused the host to go crazy. After 8 hours of support looking at logs, I found the rogue VM. We ended up connecting using the vCenter Client directly to the problem host and did a consolidation of snapshots on the problem VM.
I think it got so many snapshots after VDR 2 reported errors backing up this VM saying that it could not quiesce the file system. The VM was also running very slowly. The consolidation is still going, but I think we nailed it down.
I saw this page in my research, but didn’t think any of my VMs had a lot of snapshots so I wrote it off. I should have looked closer.
I had the same issue with VDR2. This post saved me a ton of headache!
I had a similar issue; however, there were no rogue snapshots. It turns out a simple restart of the vpxa service fixed it.
Damn! You’re my hero! Thanks for posting this, wasted a day’s worth of time on this!
Same here.
One VM has 237 Snapshot Delta Files.
VDR2 could not back up the VM, complaining it could not create a quiesced snapshot.
I’m still in the process of cleaning up…
Same problem here..
Using VMware Data Recovery 2.0.0.1861 with vSphere 5.0
Two different VMs with tens of rogue snapshots each and vCenter Server disconnected the host containing the VMs (same “timed waiting for vpxa to start” error message on re-connection attempts). I had to remove the VMs from inventory directly on the host to get it to show back up as connected in vSphere and then use VMware Converter to re-add the VMs.
Does anyone have any idea how to prevent this from happening again?
Same problem here! Thanks for the help and comments.
In my case a virtual machine had too many snapshots, so I consolidated them, checked with “Repo browser” that all were deleted by the consolidation, and later the host was added to the cluster without problems.
I had the same problem; removing the VM from the inventory fixed the connection problems. The VDR had problems creating snapshots of this machine prior to this error.