Difference between revisions of "Troubleshooting (ESX)"

3,404 bytes added ,  13:33, 16 August 2012
→‎Random Problems: Added ""Host XXX currently has no management network redundancy" warning"
m (→‎ESXi Tech Support Mode: Minor rewording)
(→‎Random Problems: Added ""Host XXX currently has no management network redundancy" warning")
 
(5 intermediate revisions by the same user not shown)
Line 190: Line 190:


=== Storage Monitor Log Entries ===
=== Storage Monitor Log Entries ===
How to decode the following type of entries...
How to decode the following type of vmkernel log entries that are generated by the Storage Monitor...
  Sep  3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
  Sep  3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
  Sep  3 15:15:32 esx1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1
  Sep  3 15:15:32 esx1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1
Line 251: Line 251:
| 0x0 || No Sense Information
| 0x0 || No Sense Information
|-
|-
| 0x1 || Last command completed but used error correction
| 0x1 || Last command completed but required error correction to complete
|-
|-
| 0x2 || Unit Not Ready
| 0x2 || Unit Not Ready
|-
|-
| 0x3 || Medium Error
| 0x3 || Medium Error (non-recoverable data error)
|-
|-
| 0x4 || Hardware Error
| 0x4 || Hardware Error (non-recoverable hardware error)
|-
|-
| 0x5 || ILLEGAL_REQUEST (Passive SP)
| 0x5 || Illegal request (Passive SP)
|-
|-
| 0x6 || LUN Reset
| 0x6 || LUN Reset
Line 269: Line 269:
| 0xa || Copy_Aborted
| 0xa || Copy_Aborted
|-
|-
| 0xb || Aborted_Command - Target aborted command
| 0xb || Aborted_Command - Target disk aborted command
|-
|-
| 0xc || Comparison for SEARCH DATA unsuccessful
| 0xc || Comparison for SEARCH DATA unsuccessful
Line 278: Line 278:
|}
|}
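
As a rough illustration of using the table above, the sketch below splits one of the example StorageMonitor entries into its fields and looks up the sense key. It assumes the values after <code>status =</code> are, in order, the device/host status, the sense key, the Additional Sense Code and the Additional Sense Code Qualifier. The function names are made up for this example, and only the sense keys visible in the table above are included.

```shell
#!/bin/sh
# Sketch: split a StorageMonitor "status =" entry into its fields and
# look up the sense key. The field order is an assumption (see text above).

# Map a sense key to its description (only keys shown in the table above).
decode_sense_key() {
    case "$1" in
        0x0) echo "No Sense Information" ;;
        0x1) echo "Last command completed but required error correction to complete" ;;
        0x2) echo "Unit Not Ready" ;;
        0x3) echo "Medium Error (non-recoverable data error)" ;;
        0x4) echo "Hardware Error (non-recoverable hardware error)" ;;
        0x5) echo "Illegal request (Passive SP)" ;;
        0x6) echo "LUN Reset" ;;
        0xa) echo "Copy_Aborted" ;;
        0xb) echo "Aborted_Command - Target disk aborted command" ;;
        0xc) echo "Comparison for SEARCH DATA unsuccessful" ;;
        *)   echo "Not covered by this sketch" ;;
    esac
}

# Print the decoded fields of one vmkernel log line.
decode_storage_monitor_line() {
    # Keep only the text after "status = ", e.g. "2/0 0x6 0x2a 0x1"
    fields=${1##*status = }
    # Word-split into positional parameters: $1=status $2=key $3=ASC $4=ASCQ
    set -- $fields
    echo "status=$1 sense_key=$2 ($(decode_sense_key "$2")) asc=$3 ascq=$4"
}

decode_storage_monitor_line 'Sep  3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1'
```

The Additional Sense Code and Qualifier are printed as-is, since their meanings still need to be looked up separately.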


The Additional Sense Code and Additional Sense Code Qualifier mean
To decode the Additional Sense Code and Additional Sense Code Qualifier, see [[SCSI Disk Additional Sense Codes]] or http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm?nc=/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm
{| class="vwikitable"
|-
! Hex !! Sense Code
|-
| 0x4 || Unit Not Ready
|-
| 0x3 || Unit Not Ready - Manual Intervention Required
|-
| 0x2 || Unit Not Ready - Initializing Command Required
|-
| 0x25 || Logical Unit Not Supported (eg LUN doesn't exist)
|-
| 0x29 || Device Power on or SCSI Reset
|}
 
For further info on sense codes see - http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm?nc=/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm


=== Recovering VM's from failed storage ===
=== Recovering VM's from failed storage ===
Line 423: Line 407:
* Primary reconfigured for HA
* Primary reconfigured for HA


It's quite common for HA to go into an error state, normal course of action is to use the '''Reconfigure for HA''' option for the ESX that's experiencing the problem.  This reinstalls the HA agent onto the ESX onto the ESX.  It's also common to have to do this a couple of times for it to be successful.  Other things to try...
If HA has never worked on the cluster then you're best skipping the [[#First Fixes|First Fixes]] section, and should proceed to [[#Check DNS|Check DNS]] and/or check the [[#Error Hints|Error Hints]] to help diagnose what might be wrong.
* Restart the HA process - see [[#High_Availability_Stop.2FStart|High Availability Stop/Start]]
* [[#Manually Deinstall|Deinstall HA and VPXA]] and reinstall


HA is very dependant on proper DNS, to check everything is in order do the following from each ESX. Some versions of ESX3 are sensitive to case, always user lower, FQDN of ESX's should be lower case, and VC's FQDN and domain suffix search should be lower case
=== First Fixes ===
It's quite common for HA to go into an error state, and so there are some normal first things to try.  Neither of these should affect running VM's on any of the ESX's in the cluster (though see note above regarding Isolation Response).
 
# '''Reinstall HA on an ESX'''
#* This will cause the HA Agent to be reinstalled on the ESX, though note that the installer package is not refreshed from the vCentre if the right version is already on the ESX - it can be worth attempting twice
## Right-click over the problem ESX and select ''Reconfigure for VMware HA''
# '''Reinstall HA on the Cluster'''
#* This will cause a complete rebuild of the HA cluster configuration, with HA being reinstalled on all ESX's.  This can make matters worse if there are critical configuration problems with the cluster (though these deteriorations are inevitable, they may be worth avoiding during production hours if in a high-blame environment)
## Right-click over the problem Cluster and select ''Edit Settings...''
## Untick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be removed from all ESX's
## Right-click over the problem Cluster and select ''Edit Settings...''
## Tick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be installed on all ESX's (it can be worth reinstalling on a few ESX's after this if there are a few persistent ESX's in error)
 
If the above fails then proceed to the sections below to investigate further.
 
=== Check DNS ===
HA is very dependent on proper DNS.  To check everything is in order, do the following from each ESX.  Some versions of ESX3 are sensitive to case, so always use lower case: the [[Acronyms#F|FQDN]] of each ESX, the vCentre FQDN and the domain suffix search should all be lower case.
# Check that the hostname/IP of the local ESX is as expected
# Check that the hostname/IP of the local ESX is as expected
#* <code> hostname </code>
#* <code> hostname </code>
Line 443: Line 443:
# Check the vCentre's FQDN and DNS suffix search are correct and lower case
# Check the vCentre's FQDN and DNS suffix search are correct and lower case
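
The lower case check can be scripted rather than eyeballed. The following is a minimal sketch, not a complete DNS check: it only reports whether a name contains upper-case characters. The function names are invented for this example.

```shell
#!/bin/sh
# Sketch: flag host names that contain upper-case characters.

# Succeeds (returns 0) if the given name has no upper-case letters.
is_lower_case() {
    case "$1" in
        *[[:upper:]]*) return 1 ;;
        *)             return 0 ;;
    esac
}

# Print an OK/WARNING line for one name.
check_name() {
    if is_lower_case "$1"; then
        echo "OK: $1"
    else
        echo "WARNING: $1 contains upper-case characters"
    fi
}

# Check the local host name as reported by the hostname command.
check_name "$(hostname)"
```

The same <code>check_name</code> call can be repeated with the vCentre's FQDN and each DNS suffix passed in by hand.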


If you need to correct DNS names, don't be surprised if you need to reinstall HA and VPXA, it can be done without interrupting running VM's, but its obviously a lot less stressful not to.
If you need to correct DNS names, it's likely that you will need to reinstall HA and VPXA on individual ESX's or the whole cluster.  This can be done without interrupting running VM's, but it's obviously a lot less stressful not to.


=== Manually Deinstall ===
=== Manually De-Install ===
Sometimes reinstalling via the VI Client doesn't do a full enough job, or it can fail, so you have to revert to doing it yourself.
# Put the ESX into maintenance mode (optional - VM's can be left running on ESX)
# Put the ESX into maintenance mode (optional - VM's can be left running on ESX)
# Disconnect the ESX from the Virtual Centre
# Disconnect the ESX from the Virtual Centre
Line 452: Line 453:
# <code> ./VMware-vpxa-uninstall.sh </code>
# <code> ./VMware-vpxa-uninstall.sh </code>
# <code> ./VMware-aam-ha-uninstall.sh </code>
# <code> ./VMware-aam-ha-uninstall.sh </code>
# Reconect the ESX to the VC
#* See note below if the uninstallers fail to remove files
# Reconnect the ESX to the VC
# Take out of maintenance mode
# Take out of maintenance mode


Line 464: Line 466:
# Take out of maintenance mode
# Take out of maintenance mode


If the VC Agent or HA Agent fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  Especially if installed on a USB key, consider replacing ASAP.  This is most likely to occur in ESX3 where HA tends to wear the USB key out, this is fixed in ESX4.
If the VC Agent or HA Agent fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  If installed on a USB key especially, consider replacing it ASAP.  This is most likely to occur in ESX3, where HA tends to wear the USB key out; it is less of an issue in ESX4, where the amount of writing to the USB key is kept to a minimum.  To work around, rename the folder(s) that could not be removed/modified.
* <code> mv /opt/vmware/aam /opt/vmware/aam_old </code>
* <code> mv /opt/vmware/vpxa /opt/vmware/vpxa_old </code>


=== Command Line Interface etc ===
=== Command Line Interface etc ===
Line 539: Line 543:


* '''Could not copy...Have you run out of disk space?'''
* '''Could not copy...Have you run out of disk space?'''
** ESX - Check that there's space to be able to write in <code>/tmp<code>
** ESX - Check that there's space to be able to write in <code>/tmp</code>
** ESXi - Check that the ESX has been configured with a scratch disk, and that it has space
** ESXi - Check that the ESX has been configured with a scratch disk, and that it has space
* '''tar: write error: Broken pipe'''
* '''tar: write error: Broken pipe'''
** ESXi - Check that the ESX has been configured with a scratch disk
** ESXi - Check that the ESX has been configured with a scratch disk
=== "Remote Tech Support Mode (SSH) for the host XXX has been enabled" warning ===
Whilst not always recommended for production use, it's certainly very common for admins to want SSH access enabled on all ESX's all of the time.  But this causes a yellow warning on ESX's with the message above.
You can remove the message by rebooting the ESX or restarting the hostd service (<code>/etc/init.d/hostd restart</code>), but this isn't always reliable, and doesn't survive an ESX upgrade.  The following disables the alert through the ESX's advanced config.
# Go to the ESX's '''Advanced Settings'''
## In the VI Client, with an ESX selected in the left-hand pane
## Go to the ''Configuration'' tab, then in the ''Software'' section, go to ''Advanced Settings''
# Change <code> UserVars.SuppressShellWarning </code> to <code> 1 </code>
The change takes effect immediately, no restart etc required!
The above was gleaned from the script found at http://www.ivobeerens.nl/2011/11/11/enable-or-disable-remote-tech-support-mode-ssh/
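If you would rather make the change from the command line than the VI Client, the sketch below shows one way it might be done.  It assumes the <code>esxcfg-advcfg</code> tool is present on the ESX and that the option is addressed as <code>/UserVars/SuppressShellWarning</code>.  The wrapper function name is invented for this example, and the guard is only there so the script does nothing harmful when run elsewhere.

```shell
#!/bin/sh
# Sketch: set UserVars.SuppressShellWarning from the ESX command line.
# Assumes esxcfg-advcfg is available on the host (see text above).
suppress_shell_warning() {
    if command -v esxcfg-advcfg >/dev/null 2>&1; then
        # -s sets the value; the option path uses "/" in place of "."
        esxcfg-advcfg -s 1 /UserVars/SuppressShellWarning
    else
        echo "esxcfg-advcfg not found; run this on the ESX itself"
    fi
}

suppress_shell_warning
```

Running <code>esxcfg-advcfg -g /UserVars/SuppressShellWarning</code> afterwards should show the current value, if the tool behaves as assumed here.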
=== "Host XXX currently has no management network redundancy" warning ===
If your ESX is meant to have redundant network connectivity then confirm that this is still fully operational.
* If redundancy is in place...
*# Right-click over the ESX and select '''Reconfigure for VMware HA'''
* If ESX is not meant to have redundancy, you'll need to disable the check at the cluster level...
*# Right-click over the ESX's cluster and select '''Edit Settings...'''
*# Select '''VMware HA''' and hit the '''Advanced Options...''' button
*# Type in a new option
*#* Option: <code>das.ignoreRedundantNetWarning</code>
*#* Value: <code>True</code>
*# '''Disable''' and '''re-enable HA''' on cluster to apply


[[Category:ESX]]
[[Category:ESX]]
[[Category:Troubleshooting]]
