I ran into a very bizarre issue recently: two of my 20 vSphere ESXi 5 hosts disconnected from their clusters. When I tried to reconnect the hosts (or remove and re-add them), I got an error message saying the host couldn’t be added “because timed waiting for vpxa to start.” Bad grammar theirs, not mine!
After I filed a support request with VMware, a very helpful engineer helped me determine the cause. Looking through the vpxa logs (/var/log/vpxa.log), he noticed that some virtual machines on each host had lots of snapshot files, and vCenter Server was having trouble managing those hosts as a result. So, we enabled SSH on one of the problematic ESXi hosts and took a look:
/vmfs/volumes/4e68dec0-274d0c10-21f1-002655806654/Foghorn(Test BlackBaud DB) # ls
Foghorn-000001-ctk.vmdk    Foghorn-000041-delta.vmdk  Foghorn-000081.vmdk        Foghorn-000122-ctk.vmdk    Foghorn-000162-delta.vmdk  Foghorn-000202.vmdk
Foghorn-000001-delta.vmdk  Foghorn-000041.vmdk        Foghorn-000082-ctk.vmdk    Foghorn-000122-delta.vmdk  Foghorn-000162.vmdk        Foghorn-000203-ctk.vmdk
I cut the listing off there, because there were more than 200 delta files for that VM! Obviously, the snapshot process had spun way out of control for this particular VM. It’s unclear why this happened, but removing those VMs from the hosts allowed me to add the hosts back to their clusters.
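If you want to check for the same condition without eyeballing ls output, here is a rough Python sketch that counts snapshot delta files per VM directory. It is not something VMware support gave me. It assumes the delta files follow the `<name>-NNNNNN-delta.vmdk` pattern seen in the listing above, and the function names and the threshold of 10 are my own hypothetical choices. It is written for Python 3, so you may need to run it from a machine that can see the datastore rather than on the host itself.

```python
import os
import re

# Assumption: snapshot delta extents are named like "Foghorn-000041-delta.vmdk",
# as in the directory listing above.
DELTA_PATTERN = re.compile(r"-\d{6}-delta\.vmdk$")


def count_delta_files(root):
    """Return {directory: delta file count} for directories under root that contain any."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if DELTA_PATTERN.search(name))
        if n:
            counts[dirpath] = n
    return counts


def report(root, threshold=10):
    """Print directories whose delta file count meets the (hypothetical) threshold."""
    ranked = sorted(count_delta_files(root).items(), key=lambda kv: kv[1], reverse=True)
    for directory, n in ranked:
        if n >= threshold:
            print("{0}: {1} delta files".format(directory, n))


# Example call (the path is the usual datastore mount point on an ESXi host):
# report("/vmfs/volumes")
```

Any VM directory that shows up with dozens of delta files is worth a closer look before it gets as bad as the one above.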
After that, I simply cloned the problematic VMs (which automatically flattens the snapshots) into new VMs and the problem was solved.