Difference between revisions of "Troubleshooting (ESX)"

1,765 bytes added ,  09:54, 11 May 2012
→‎High Availability: Reviewed the HA section
Line 423: Line 423:
* Primary reconfigured for HA


It's quite common for HA to go into an error state, normal course of action is to use the '''Reconfigure for HA''' option for the ESX that's experiencing the problem.  This reinstalls the HA agent onto the ESX onto the ESX.  It's also common to have to do this a couple of times for it to be successful.  Other things to try...
If HA has never worked on the cluster, skip the [[#First Fixes|First Fixes]] section and proceed to [[#Check DNS|Check DNS]], and/or check the [[#Error Hints|Error Hints]] to help diagnose what might be wrong.
* Restart the HA process - see [[#High_Availability_Stop.2FStart|High Availability Stop/Start]]
* [[#Manually Deinstall|Deinstall HA and VPXA]] and reinstall


HA is very dependant on proper DNS, to check everything is in order do the following from each ESX. Some versions of ESX3 are sensitive to case, always user lower, FQDN of ESX's should be lower case, and VC's FQDN and domain suffix search should be lower case
=== First Fixes ===
It's quite common for HA to go into an error state, so there are a couple of standard first fixes to try.  Neither of these should affect running VM's on any of the ESX's in the cluster (though see the note above regarding Isolation Response).
 
# '''Reinstall HA on an ESX'''
#* This will cause the HA Agent to be reinstalled on the ESX, though note that the installer package is not refreshed from the vCentre if the right version is already on the ESX - it can be worth attempting twice
## Right-click over the problem ESX and select ''Reconfigure for VMware HA''
# '''Reinstall HA on the Cluster'''
#* This will cause a complete rebuild of the HA cluster configuration, with HA being reinstalled on all ESX's.  This can make matters worse if there are critical configuration problems with the cluster (though these deteriorations are inevitable, it may be worth avoiding them during production hours if in a high-blame environment)
## Right-click over the problem Cluster and select ''Edit Settings...''
## Untick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be removed from all ESX's
## Right-click over the problem Cluster and select ''Edit Settings...''
## Tick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be installed on all ESX's (it can be worth reinstalling on a few ESX's after this if there are a few persistent ESX's in error)
 
If the above fails then proceed to the sections below to investigate further.
 
=== Check DNS ===
HA is very dependent on proper DNS. To check everything is in order, do the following from each ESX. Some versions of ESX3 are case sensitive, so always use lower case: the [[Acronyms#F|FQDN]] of each ESX, the vCentre FQDN and the domain suffix search should all be lower case
# Check that the hostname/IP of the local ESX is as expected
#* <code> hostname </code>
Line 443: Line 459:
# Check the vCentre's FQDN and DNS suffix search are correct and lower case
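
The lower-case checks in this section can be sketched as a small shell helper. This is only an illustration: <code>is_lower_case</code> and <code>check_name</code> are hypothetical names, not part of ESX or vCentre, and the sketch assumes a POSIX-style shell is available on the service console.

```shell
#!/bin/sh
# Sketch only: flag host names that contain upper-case characters.
# is_lower_case and check_name are hypothetical helper names.

is_lower_case() {
    # Succeeds (exit 0) when the argument contains no upper-case letters.
    case "$1" in
        *[[:upper:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

check_name() {
    # Prints a one-line verdict for the supplied name.
    if is_lower_case "$1"; then
        echo "OK: $1"
    else
        echo "WARN: $1 contains upper-case characters"
    fi
}

# Example usage on an ESX:
#   check_name "$(hostname)"
```

The same helper could be pointed at the vCentre FQDN and each entry in the DNS suffix search list. It only checks case; it does not confirm that the names actually resolve.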


If you need to correct DNS names, don't be surprised if you need to reinstall HA and VPXA, it can be done without interrupting running VM's, but its obviously a lot less stressful not to.
If you need to correct DNS names, it's likely that you will need to reinstall HA and VPXA on individual ESX's or the whole cluster.  This can be done without interrupting running VM's, but it's obviously a lot less stressful not to.


=== Manually Deinstall ===
=== Manually De-Install ===
Sometimes reinstalling via the VI Client doesn't do a full enough job, or it can fail, so you have to revert to doing it yourself.
# Put the ESX into maintenance mode (optional - VM's can be left running on ESX)
# Disconnect the ESX from the Virtual Centre
Line 452: Line 469:
# <code> ./VMware-vpxa-uninstall.sh </code>
# <code> ./VMware-aam-ha-uninstall.sh </code>
# Reconect the ESX to the VC
#* See note below if the uninstallers fail to remove files
# Reconnect the ESX to the VC
# Take out of maintenance mode


Line 464: Line 482:
# Take out of maintenance mode


If the VC Agent or HA Agent fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  Especially if installed on a USB key, consider replacing ASAP.  This is most likely to occur in ESX3 where HA tends to wear the USB key out, this is fixed in ESX4.
If the VC Agent or HA Agent fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  If the ESX is installed on a USB key, consider replacing it ASAP.  This is most likely to occur in ESX3, where HA tends to wear the USB key out; it is less of an issue in ESX4, where the amount of writing to the USB key is kept to a minimum. To work around, rename the folder(s) that could not be removed/modified.
* <code> mv /opt/vmware/aam /opt/vmware/aam_old </code>
* <code> mv /opt/vmware/vpxa /opt/vmware/vpxa_old </code>
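
A slightly more defensive version of the rename commands above might look like the following sketch. <code>rename_stale</code> is a hypothetical helper name, and the sketch assumes the two folder paths shown above. It renames a folder only if it exists, and refuses to overwrite an earlier <code>_old</code> copy.

```shell
#!/bin/sh
# Sketch only: rename a folder the uninstaller could not remove.
# rename_stale is a hypothetical helper name, not an ESX command.

rename_stale() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # Nothing to do if the uninstaller already removed the folder.
        echo "skipped $dir (not present)"
        return 0
    fi
    if [ -e "${dir}_old" ]; then
        # Avoid moving the folder inside a previous _old copy.
        echo "refused $dir (${dir}_old already exists)"
        return 1
    fi
    mv "$dir" "${dir}_old" && echo "renamed $dir"
}

# Example usage on the ESX, using the paths from the commands above:
#   rename_stale /opt/vmware/aam
#   rename_stale /opt/vmware/vpxa
```

If the helper refuses because an <code>_old</code> folder is already present, pick a different suffix by hand rather than forcing the move.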


=== Command Line Interface etc ===
