VMware HA Cluster Failure – Split Brain Interrogation

September 13, 2010

If one or more VMware ESX cluster nodes have suffered a hard crash or failure, you must reintroduce them into the cluster by following the steps below, one host at a time. This guide applies when multiple ESX hosts in an HA cluster have crashed due to a power outage, massive hardware failure, or a similar event, and the HA service on some or all of the ESX nodes is non-functional. It also covers the case where virtual machines have been displaced by the dreaded (God forbid it ever happens to you) "split-brain" scenario.

It can be useful to begin by querying the cluster for its HA primaries from PowerShell. I use VMware PowerCLI and run this simple script, which I call Get-HA-Primaries.ps1:

Connect-VIServer YourVirtualCenterServerNameHere
((Get-View (Get-Cluster YourESXClusterNameHere).id).RetrieveDasAdvancedRuntimeInfo()).DasHostInfo.PrimaryHosts

This will output what the cluster currently knows about HA Primaries.

1)      At the root of the cluster, set DRS to "Manual". This ensures that automated migrations do not start until all nodes are correctly configured and back in the cluster. In Virtual Center, right-click the root of the cluster, choose "Edit Settings", click "VMware DRS", set the automation level to "Manual", and click OK.

2)      Power on the ESX host if it is off and watch it from the console to make sure it boots properly.

3)      Next, log into the SIM page of the host (if applicable) as root to validate that the hardware is not displaying any obvious problems.

4)      In Virtual Center, verify that the ESX host is back in the cluster. If the host shows as disconnected or has any HA errors, perform steps 5 through 8 in that exact order.

5)      Restart the Virtual Center Server service, "VMware VirtualCenter Server".

6)      Run the following commands from the problematic ESX host's console (KVM, local console, or PuTTY) as root (or via sudo):

        service vmware-vpxa restart

        service mgmt-vmware restart

        service xinetd restart

7)      Verify that the VMware core services are running on the host server by typing:

         ps -ef | grep hostd

The output should look similar to the following; the vmware-hostd lines show that hostd is running.

root      1887     1  0 Oct31 ?        00:00:01 cmahostd -p 15 -s OK
root      2713     1  0 Oct31 ?        00:00:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/hostd-support /usr/sbin/vmware-hostd -u
root      2724  2713  0 Oct31 ?        00:11:41 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
root     21263 12546  0 11:34 pts/0    00:00:00 grep hostd
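If you want to script this check rather than eyeball the ps output, a small filter works. This is just a sketch; it matches the vmware-hostd process name shown in the listing above and ignores the grep process itself:

```shell
# check_hostd: reads "ps -ef" output on stdin and reports whether the
# vmware-hostd daemon appears in it (ignoring the grep process itself).
check_hostd() {
  grep -v grep | grep -q 'vmware-hostd' && echo RUNNING || echo STOPPED
}
```

On the host you would run it as: ps -ef | check_hostd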

End of host commands

        8)      Reconfigure HA in Virtual Center by right-clicking the host and selecting "Reconfigure for HA". If any HA or connection errors persist, try disconnecting and reconnecting the host; both are right-click operations on the host within Virtual Center. You may be asked to re-authenticate the host to Virtual Center. If the wizard prompts you, simply provide the host's root password.

If the host cannot be reconnected after following these steps, call either the VMware lead or VMware support at 1-877-4VM-Ware.

If the host becomes connected and operational, you may have VM guest registration issues.

There are several scenarios that may require you to remove the virtual machines from inventory and re-add them. If multiple hosts crash simultaneously, you will most likely have HA issues that create a state known as "split-brain", in which virtual machines are split around the cluster because of the SAN locking mechanism used by the ESX host servers. The result is that more than one host "thinks" it has the same virtual machine registered to it. The SAN locking can also leave locks on a guest's vswap file on several hosts at the same time. You must manually release the lock on each host that holds outdated vswap file location info. This is time consuming, and the virtual machine(s) will not boot until the lock is freed. The following commands let you view where the lock is held (always on either vmnic0 or vmnic1) by enumerating the MAC address to determine which host has the invalid data (vmkfstools -D dumps the lock details to the vmkernel log, which the tail command then displays):

vmkfstools -D /vmfs/volumes/sanvolumename/vmname/swapfile

tail -f /var/log/vmkernel
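As a convenience, the owner MAC can be pulled out of the log line with a small filter. This is a sketch that assumes the ESX 3.x lock-dump format, in which the last segment of the owner UID is the 12-digit MAC of the lock holder; the exact line format can vary between builds, so treat the parsing as illustrative and verify it against your own vmkernel log first:

```shell
# extract_lock_mac: reads vmkernel lock-dump lines on stdin and prints the
# lock owner's MAC address with colons inserted.
# Format assumption (ESX 3.x): "... owner 45e5c4a7-56086bd3-6af3-0017a4442836 ..."
extract_lock_mac() {
  sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\).*/\1/p' \
    | sed 's/\(..\)/\1:/g; s/:$//'
}
```

You would then compare the printed MAC against the vmnic0/vmnic1 MACs of each host to find the one holding the stale lock.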

Once you have identified the host, migrate all of the guests off it and put it into maintenance mode, then reboot it to flush the memory and locks, forcing the release of the bad, outdated VM inventory data.

If the hex field where the MAC would otherwise appear contains all zeroes, the lock is held by the very host the guest is attempting to boot from. In that case, simply delete the vswap file and let the guest re-create it upon booting. The vswap file is in the virtual machine's folder, /vmfs/volumes/sanvolumename/vmname.

To view VM registration on a host, view /etc/vmware/hostd/vmInventory.xml

This is the ESX host's local database file for VM inventory.

You can also list the registered VMs by running vmware-cmd -l from the service console.
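If you just want the .vmx paths out of the inventory file, a quick filter does the job. This sketch assumes the ESX 3.x vmInventory.xml layout, in which each registered VM's config path is wrapped in a vmxCfgPath element; check your own file first, since the tag name is an assumption here:

```shell
# list_vmx_paths: reads vmInventory.xml on stdin and prints one .vmx path
# per line, assuming each path is wrapped in a <vmxCfgPath> element.
list_vmx_paths() {
  grep -o '<vmxCfgPath>[^<]*</vmxCfgPath>' | sed 's/<[^>]*>//g'
}
```

On the host: list_vmx_paths < /etc/vmware/hostd/vmInventory.xml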

Good luck.
