Troubleshooting (ESX)
'''Timestamps in logfiles are in UTC !!!'''
=== ESX ===
{|class="vwikitable"
|-
! Item                  !!  Path                                      !!  Comments
|-
However, this is most easily achieved if you've got the PowerCLI installed, in which case see [[VI_Toolkit_(PowerShell)#ESXi_Logs|ESXi Logs via PowerCLI]]

{|class="vwikitable"
|-
! Name          !!  PowerCLI Key            !!  Diagnostic Dump Path                        !!  Comments
|-

=== ESXi Tech Support Mode ===
There's no Service Console on ESXi, so you have to do without.  Well, almost: there is the ''unsupported'' Tech Support Mode, which is like a lightweight Service Console. To enable SSH access to your ESX...

'''ESXi 3.5 and 4.0'''
== ESXTOP ==

{|class="vwikitable"
|-
! Key              !!  Change View  !!  Key              !!  Sort by
|-
CPU Ready can creep up if the system is pushed, or if the VM has multiple CPUs (as it needs multiple physical CPUs to become available at the same time, aka CPU Co-Scheduling).  Multiple CPUs are especially a problem in environments where there are a large number of SMP VMs.
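As a rough worked example of relating the two figures in the table below (assuming a 20,000 ms sample interval, as used by the vCenter real-time performance charts - an assumption, check your version), a CPU Ready summation value in milliseconds converts to a percentage like this:

```shell
# Convert a CPU Ready summation value (in ms) to a % figure.
# Assumption: the sample interval is 20000 ms (vCenter real-time charts).
ready_ms=1600        # CPU Ready as read from the performance chart
sample_ms=20000      # sample interval
pct=$(( ready_ms * 100 / sample_ms ))
echo "CPU Ready: ${pct}%"     # prints "CPU Ready: 8%"
```

For an SMP VM, remember the figure needs considering per vCPU.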

{|class="vwikitable"
|-
! % CPU Ready !! MSec CPU Ready !! Performance
|-

Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO.  As a rough guide to indicate whether there's a problem or not...
{|class="vwikitable"
|-
! Latency up to !! Status
|-

=== Storage Monitor Log Entries ===
How to decode the following type of vmkernel log entries, generated by the Storage Monitor...
  Sep  3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
  Sep  3 15:15:32 esx1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1

The status message consists of the following four decimal and hex blocks...
{| class="vwikitable"
|-
|''Device Status'' / ''Host Status'' || ''Sense Key'' || ''Additional Sense Code'' || ''Additional Sense Code Qualifier''
|}

  Mar  2 10:04:44 vmkernel: 81:00:00:15.893 cpu8:4633964)StorageMonitor: 196: vmhba0:0:4:0 status = D:0x28/H:0x0 0x0 0x0 0x0

{| class="vwikitable"
|-
|<code>D:</code>''Device Status'' / <code>H:</code>''Host Status'' || ''Sense Key'' || ''Additional Sense Code'' || ''Additional Sense Code Qualifier''
|}

Where the ESX Device and SAN Host statuses mean...
{| class="vwikitable"
|-
! Decimal !! Device Status        !! Host Status      !! Comments
|-

Where the Sense Keys mean...
{| class="vwikitable"
|-
! Hex !! Sense Key
|-
| 0x0 || No Sense Information
|-
| 0x1 || Last command completed but required error correction to complete
|-
| 0x2 || Unit Not Ready
|-
| 0x3 || Medium Error (non-recoverable data error)
|-
| 0x4 || Hardware Error (non-recoverable hardware error)
|-
| 0x5 || Illegal Request (Passive SP)
|-
| 0x6 || LUN Reset
| 0xa || Copy_Aborted
|-
| 0xb || Aborted_Command - Target disk aborted command
|-
| 0xc || Comparison for SEARCH DATA unsuccessful
|}
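The field layout described above can be applied with a quick Service Console sketch (a hypothetical helper, using the example log line from earlier):

```shell
# Split a StorageMonitor status string into its four blocks:
# Device/Host status, Sense Key, ASC, ASCQ.
line='vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1'
status=${line#*= }        # strip everything up to "status = "
set -- $status            # $1=Device/Host $2=SenseKey $3=ASC $4=ASCQ
device=${1%/*}
host=${1#*/}
echo "Device=$device Host=$host SenseKey=$2 ASC=$3 ASCQ=$4"
# prints "Device=2 Host=0 SenseKey=0x6 ASC=0x2a ASCQ=0x1"
```

Per the Sense Key table above, the 0x6 in this example indicates a LUN Reset.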

To decode the Additional Sense Code and Additional Sense Code Qualifier, see [[SCSI Disk Additional Sense Codes]] or http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm?nc=/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm

=== Recovering VM's from failed storage ===

* Primary reconfigured for HA

If HA has never worked on the cluster then you're best skipping the [[#First Fixes|First Fixes]] section, and should proceed to [[#Check DNS|Check DNS]] and/or check the [[#Error Hints|Error Hints]] to help diagnose what might be wrong.

=== First Fixes ===
It's quite common for HA to go into an error state, and so there are some normal first things to try.  Neither of these should affect running VMs on any of the ESX's in the cluster (though see the note above regarding Isolation Response).

# '''Reinstall HA on an ESX'''
#* This will cause the HA Agent to be reinstalled on the ESX, though note that the installer package is not refreshed from the vCentre if the right version is already on the ESX - it can be worth attempting twice
## Right-click over the problem ESX and select ''Reconfigure for VMware HA''
# '''Reinstall HA on the Cluster'''
#* This will cause a complete rebuild of the HA cluster configuration, with HA being reinstalled on all ESX's.  This can make matters worse if there are critical configuration problems with the cluster (though such deteriorations are inevitable, it may be worth avoiding during production hours in a high-blame environment)
## Right-click over the problem Cluster and select ''Edit Settings...''
## Untick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be removed from all ESX's
## Right-click over the problem Cluster and select ''Edit Settings...''
## Tick the ''Turn On VMware HA'' option, and click ''OK''
## Wait for HA to be installed on all ESX's (it can be worth reinstalling on a few ESX's after this if a few remain stuck in error)

If the above fails then proceed to the sections below to investigate further.

=== Check DNS ===
HA is very dependent on proper DNS; to check everything is in order, do the following from each ESX. Some versions of ESX3 are sensitive to case - always use lower case: the [[Acronyms#F|FQDN]]'s of ESX's should be lower case, and the vCentre FQDN and domain suffix search should be lower case.
# Check that the hostname/IP of the local ESX is as expected
#* <code> hostname </code>
# Check the vCentre's FQDN and DNS suffix search are correct and lower case
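A quick way to run the forward lookups in bulk (a sketch, assuming <code>getent</code> is available on the Service Console; the hostnames are placeholders - substitute your own ESX's):

```shell
# Forward-resolve each ESX in the cluster and flag failures.
# Hostnames below are examples only.
for h in esx1.example.local esx2.example.local; do
  ip=$(getent hosts "$h" | awk '{print $1}')
  if [ -n "$ip" ]; then
    echo "$h -> $ip"
  else
    echo "$h -> LOOKUP FAILED"
  fi
done
```

<code>nslookup</code> works similarly if <code>getent</code> isn't available, and also lets you confirm the reverse records.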

If you need to correct DNS names, it's likely that you will need to reinstall HA and VPXA on individual ESX's or the whole cluster.  This can be done without interrupting running VMs, but it's obviously a lot less stressful not to.

=== Manually De-Install ===
Sometimes reinstalling via the VI Client doesn't do a thorough enough job, or it can fail, so you have to revert to doing it yourself.
# Put the ESX into maintenance mode (optional - VM's can be left running on the ESX)
# Disconnect the ESX from the Virtual Centre
# <code> ./VMware-vpxa-uninstall.sh </code>
# <code> ./VMware-aam-ha-uninstall.sh </code>
#* See note below if the uninstallers fail to remove files
# Reconnect the ESX to the VC
# Take out of maintenance mode

# Take out of maintenance mode

If the VC Agent or HA Agent uninstall fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  Especially if installed on a USB key, consider replacing it ASAP.  This is most likely to occur in ESX3, where HA tends to wear the USB key out; it is less of an issue in ESX4, where the amount of writing to the USB key is kept to a minimum.  To work around it, rename the folder(s) that could not be removed/modified.
* <code> mv /opt/vmware/aam /opt/vmware/aam_old </code>
* <code> mv /opt/vmware/vpxa /opt/vmware/vpxa_old </code>
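A guarded version of the rename workaround above, so it only touches folders the uninstallers actually left behind:

```shell
# Rename leftover agent folders only if they still exist
# (same paths as the workaround above).
for d in /opt/vmware/aam /opt/vmware/vpxa; do
  if [ -d "$d" ]; then
    mv "$d" "${d}_old"
  fi
done
```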

=== Command Line Interface etc ===

The interface is a bit dodgy - you can enter the same command twice, and it'll be rejected one time and accepted another; patience is required.

{|class="vwikitable"
|-
! Command                                !! Comments
|-

* '''Could not copy...Have you run out of disk space?'''
** ESX - Check that there's space to be able to write in <code>/tmp</code>
** ESXi - Check that the ESX has been configured with a scratch disk, and that it has space
* '''tar: write error: Broken pipe'''
** ESXi - Check that the ESX has been configured with a scratch disk

=== "Remote Tech Support Mode (SSH) for the host XXX has been enabled" warning ===
Whilst not always recommended for production use, it's certainly very common for admins to want SSH access enabled to all ESX's all of the time.  But this causes a yellow warning on the ESX with the message above.
You can remove the message by rebooting the ESX or restarting the hostd service (<code>/etc/init.d/hostd restart</code>), but this isn't always reliable, and doesn't survive an ESX upgrade.  The following disables the alert through the ESX's advanced config.
# Go to the ESX's '''Advanced Settings'''
## In the VI Client, with an ESX selected in the left-hand pane
## Go to the ''Configuration'' tab, then in the ''Software'' section, go to ''Advanced Settings''
# Change <code> UserVars.SuppressShellWarning </code> to <code> 1 </code>
The change takes effect immediately - no restart etc required!
The above was gleaned from the script found at http://www.ivobeerens.nl/2011/11/11/enable-or-disable-remote-tech-support-mode-ssh/
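The same option can also be set from Tech Support Mode; the command below is taken to be the CLI equivalent of the VI Client steps (an assumption based on the linked script, so verify against your ESX version before relying on it):

```shell
# Set the advanced option from the ESXi shell
# (assumed equivalent of the VI Client steps above).
vim-cmd hostsvc/advopt/update UserVars.SuppressShellWarning long 1
```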

=== "Host XXX currently has no management network redundancy" warning ===
If your ESX is meant to have redundant network connectivity, first confirm that this is still fully operational.
* If redundancy is in place...
*# Right-click over the ESX and select '''Reconfigure for VMware HA'''
* If the ESX is not meant to have redundancy, you'll need to disable the check at the cluster level...
*# Right-click over the ESX's cluster and select '''Edit Settings...'''
*# Select '''VMware HA''' and hit the '''Advanced Options...''' button
*# Type in a new option
*#* Option: <code>das.ignoreRedundantNetWarning</code>
*#* Value: <code>True</code>
*# '''Disable''' and '''re-enable HA''' on the cluster to apply

[[Category:ESX]]
[[Category:Troubleshooting]]