Troubleshooting (ESX)
If all else fails you can always raise a VMware Service Request
Useful paths / logfiles
Timestamps in logfiles are in UTC !!!
ESX
Item | Path | Comments
---|---|---
Vmkernel logfile | /var/log/vmkernel | Pretty much everything seems to be recorded here
Vmkernel warnings | /var/log/vmkwarning | Virtual machine warnings
Host Daemon logfile | /var/log/vmware/hostd.log | Services log
vCentre Agent logfile | /var/log/vmware/vpx/vpxa.log | vCentre agent
Local VM files | /vmfs/volumes/storage | storage name can vary, use TAB so the shell selects what's available
SAN VM files | /vmfs/volumes/SAN | SAN will vary depending on what you've called your storage
HA agent logs | /opt/LGTOaam512/log/ | Various logs of limited use - deprecated
HA agent log | /var/log/vmware/aam/agent/run.log | Main HA log
HA agent install log | /var/log/vmware/aam/aam_config_util_install.log | HA install log
ESXi
To view logfiles from an ESXi server, assuming you don't have SSH access, they need to be downloaded to your client machine 1st, and then viewed from there...
- Using VI Client, go to File | Export | Export System Logs...
- Tick the appropriate object
- Untick Include information from vCenter Server and vSphere Client, unless you additionally want this info
- Once exported, uncompress the ESX's tgz file
However, this is most easily achieved if you've got the PowerCLI installed, in which case see ESXi Logs via PowerCLI
Name | PowerCLI Key | Diagnostic Dump Path | Comments
---|---|---|---
Syslog | messages | /var/log/messages | Equivalent to ESX hostd and vmkernel logs combined
Host Daemon | hostd | /var/log/vmware/hostd.log | Equivalent to ESX hostd log
vCenter Agent | vpxa | /var/log/vmware/vpx/vpxa.log | 
SNMP Config | | /etc/vmware/snmp.xml | Edit via vicfg-snmp
Logfiles get lost at restart! If you have to restart your ESX (say, because it locked up) there will be no logs prior to the most recent boot. In theory they'll get written to a dump file if a crash is detected, but I've never found them, so assume they're only generated during a semi-graceful software crash.
However, there is a way around this. Messages can be sent to a syslog file (say, on a centrally available SAN LUN), to a syslog server (in both cases see VMware KB 1016621), or to a vMA server (see http://www.vmware.com/support/developer/vima/vima40/doc/vma_40_guide.pdf). Be aware that when sending logs over the network (e.g. to a syslog server) it's quite common for the last few log entries not to be written when an ESX fails; you'll get more complete logs when writing directly to a file.
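For example, to point an ESXi host at a remote syslog server from the vSphere CLI or a vMA, something like the following should do it (a sketch only - the hostnames and port are placeholders, and the exact vicfg-syslog option names can vary slightly between vCLI versions, so check vicfg-syslog --help first)...
# Show the host's current syslog settings
vicfg-syslog --server esx1.example.com --username root --show
# Point the host's syslog at a remote collector on the default UDP port
vicfg-syslog --server esx1.example.com --username root --setserver syslog.example.com --setport 514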
ESXi Tech Support Mode
There's no Service Console on ESXi, so you have to do without. Well, almost: there is the unsupported Tech Support Mode, which is like a lightweight Service Console. To enable SSH access to your ESX...
ESXi 3.5 and 4.0
- Go to the local ESXi console and press Alt+F1
- Type unsupported
- Blindly type the root password (yes, there's no prompt)
- Edit /etc/inetd.conf, uncomment (remove the #) the line that starts with #ssh, and save
- Restart the management service: /sbin/services.sh restart
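If you'd rather not drive vi on the console, a one-liner along these lines ought to do the uncomment (a sketch - it assumes the busybox sed on the host supports in-place editing; if not, redirect to a temp file and copy it back over)...
# Uncomment the ssh line in inetd.conf, then restart the management services
sed -i 's/^#ssh/ssh/' /etc/inetd.conf
/sbin/services.sh restart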
ESXi 4.1
- Go to the local ESXi console and press F2
- Enter root user and pass
- Go to the Troubleshooting Options
- Enable Local Tech Support or Remote Tech Support (SSH) as required
Alternatively...
- From the vSphere Client, select the host and click the Configuration tab
- Go to Security profile > Properties
- Select Local Tech Support or Remote Tech Support (SSH) and click Options button
- Choose the Start automatically startup policy, click Start, and then OK.
- This will cause a yellow warning alert on the vCentre VI Client for the ESX; to remove it, restart the hostd process
/etc/init.d/hostd restart
ESXTOP
Key | Change View | Key | Sort by
---|---|---|---
c | ESX CPU | U | % CPU Used
 | | R | % CPU Ready
 | | N | Normal / default
m | ESX Memory | M | Memsz
 | | B | Mctlsz
 | | N | Normal / default
d | ESX Disk Adapter | r | Reads/sec
 | | w | Writes/sec
 | | R | Read MB/sec
 | | T | Write MB/sec
 | | N | Normal / default
u | ESX Disk Drive/LUN | r | Reads/sec
 | | w | Writes/sec
 | | R | Read MB/sec
 | | T | Write MB/sec
 | | N | Normal / default
v | VM Disk | r | Reads/sec
 | | w | Writes/sec
 | | R | Read MB/sec
 | | T | Write MB/sec
 | | N | Normal / default
n | ESX NIC | t | Transmit Packet/sec
 | | r | Receive Packet/sec
 | | T | Transmit MB/sec
 | | R | Receive MB/sec
 | | N | Normal / default
CPU
Poor performance
Basic things to check are that neither the VM nor the ESX it's hosted on is saturating its available CPU. However, if VM's are performing sluggishly and/or are slow to start despite not appearing to be using excessive CPU time, further investigation is required...
- Use esxtop on the ESX service console. Look at Ready Time (%RDY), which is how long a VM is waiting for CPUs to become available.
- Alternatively, look for CPU Ready in the performance charts. Here it's measured in msec, over the normal 20 sec sampling interval (see the worked conversion after the table below).
CPU Ready can creep up if the system is pushed, or if the VM has multiple CPUs (as it needs multiple physical CPUs to become available at the same time, aka CPU Co-Scheduling). Multiple CPU's are especially a problem in environments where there are large numbers of SMP VM's.
% CPU Ready | MSec CPU Ready | Performance |
---|---|---|
< 1.25 % | < 500 msec | Excellent
< 2.5 % | < 500 msec | Good |
< 5 % | < 1000 msec | Acceptable
< 10 % | < 2000 msec | Poor |
> 15 % | > 3000 msec | Bad |
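As a worked example of the msec to % conversion (over the default 20 second = 20,000 msec sampling interval), and a sketch of capturing ready time over a period with esxtop's batch mode (the sample counts below are just illustrative)...
#  500 msec of ready time in a 20,000 msec sample =  500 / 20000 = 2.5% CPU Ready
# 2000 msec of ready time in a 20,000 msec sample = 2000 / 20000 = 10% CPU Ready
# Capture 30 samples at 20 second intervals (10 minutes) to a CSV for offline analysis
esxtop -b -d 20 -n 30 > /tmp/esxtop-cpu-ready.csv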
CPU Co-Scheduling is more relaxed in ESX4 than ESX3, due to changes in the way that differences in progress between the separate vCPU's of a single VM are calculated. This means that the detrimental effect on pCPU efficiency of having multi-CPU VM's is reduced (but not eliminated). See http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf for further info.
Storage
Poor throughput
Use esxtop on the service console and switch to the disk monitor. Enable the latency view and you will see values like GAVG, KAVG and DAVG.
- GAVG is the total guest experienced latency on IO commands averaged over 2 seconds
- KAVG is the vmkernel/hypervisor IO latency averaged over 2 seconds
- DAVG is the device (HBA) IO latency averaged over the last 2 seconds (will include any latency at lower level, eg SAN)
Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO. Note that GAVG is roughly KAVG + DAVG, so a high GAVG with a low DAVG points at the hypervisor rather than the array. As a rough guide to indicate if there's a problem or not...
Latency up to | Status |
---|---|
2 ms | Excellent - look elsewhere |
10 ms | Good |
20 ms | Reasonable |
50 ms | Poor / Busy |
higher | Bad |
Storage Monitor Log Entries
How to decode the following type of vmkernel log entries that are generated by the Storage Monitor...
Sep 3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
Sep 3 15:15:32 esx1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1
The status message consists of the following four decimal and hex blocks...
Device Status / Host Status | Sense Key | Additional Sense Code | Additional Sense Code Qualifier |
...or in the more recent format (ESX v3.5 Update 4 and above)...
Mar 2 10:04:44 vmkernel: 81:00:00:15.893 cpu8:3258649)StorageMonitor: 196: vmhba0:0:4:0 status = D:0x28/H:0x0 0x0 0x0 0x0
Mar 2 10:04:44 vmkernel: 81:00:00:15.893 cpu8:4633964)StorageMonitor: 196: vmhba0:0:4:0 status = D:0x28/H:0x0 0x0 0x0 0x0
D: Device Status / H: Host Status | Sense Key | Additional Sense Code | Additional Sense Code Qualifier
Where the ESX Device Status and SAN Host Status values mean...
Decimal | Device Status | Host Status | Comments |
---|---|---|---|
0 | No Errors | Host_OK | |
1 | Host No_Connect | ||
2 | Check Condition | Host_Busy_Busy | |
3 | Host_Timeout | ||
4 | Host_Bad_Target | ||
5 | Host_Abort | ||
6 | Host_Parity | ||
7 | Host_Error | ||
8 | Device Busy | Host_Reset | |
9 | Host_Bad_INTR | ||
10 | Host_PassThrough | ||
11 | Host_Soft_Error | ||
24 | Reservation Conflict | | 24/0 indicates a locking error, normally caused by too many ESX's mounting a LUN, wrong config on the storage array, or too many VM's on a LUN
28 | Queue full / Task set full | | Indicates the SAN is busy handling writes and is passing back notification of such when asked to handle more data
Where the Sense Key values mean...
Hex | Sense Key |
---|---|
0x0 | No Sense Information |
0x1 | Last command completed but required error correction to complete |
0x2 | Unit Not Ready |
0x3 | Medium Error (non-recoverable data error) |
0x4 | Hardware Error (non-recoverable hardware error) |
0x5 | Illegal request (Passive SP) |
0x6 | LUN Reset |
0x7 | Data_Protect - Access to data is blocked |
0x8 | Blank_Check - Reached an unexpected region |
0xa | Copy_Aborted |
0xb | Aborted_Command - Target disk aborted command |
0xc | Comparison for SEARCH DATA unsuccessful |
0xd | Volume_Overflow - Medium is full |
0xe | Source and Data on Medium do not agree |
In order to decode the Additional Sense Code and Additional Sense Code Qualifier meanings see SCSI Disk Additional Sense Codes or http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm?nc=/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm
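Putting the tables together, here's a quick decode of the first example entry above, plus a simple way to pull recent StorageMonitor lines out of the vmkernel log (the ASC/ASCQ meaning given is the generic SCSI one, so treat it as indicative rather than authoritative for your particular array)...
# vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
#   Device Status 2    = Check Condition
#   Host Status   0    = Host_OK
#   Sense Key     0x6  = LUN Reset (Unit Attention)
#   ASC/ASCQ 0x2a/0x1  = Mode parameters changed (generic SCSI meaning)
grep StorageMonitor /var/log/vmkernel | tail -n 20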
Recovering VM's from failed storage
This procedure was generated from an occasion where the ESX software was installed on top of the shared SAN VMFS storage: the VM files still existed so the VM's continued to run, but as the file system index no longer existed, the vmdk's etc were orphaned and would be lost if the VM's were restarted. It can be adapted to suit any situation where the ESX datastore is corrupted, VM's cannot be powered on, and rebooting a VM would lose it. However, it's well worth calling VMware support before carrying this out; they may be able to provide an easier solution.
- On each VM
- Shut-down running applications
- Install VMware Converter (Typical install, all default options)
- Hot migrate local VM to a new VM on new storage
- As VMware converter starts, select Continue in Starter Mode
- Select Import Machine from the bottom of the initial screen
- Select source as Physical Machine, then on next screen This local machine
- Select default options for source disk
- Select VMware ESX server... as your destination
- Enter ESX hostname, and root user/pass
- Enter new VM name, e.g. myserver-recov (not the same as the existing, it will let you do it, but the VC isn’t happy later on)
- Select host
- Select datastore
- Select network and uncheck Connect at power on...
- Don’t select power on after creation, and let the migration run
- Reconfig the new VM, edit its settings as follows
- Floppy Drive 1 --> Client Device
- CD/DVD Drive 1 --> Client Device
- Parallel Port 1 --> Remove
- Serial Port 1 --> Remove
- Serial Port 2 --> Remove
- USB Controller --> Remove
- Power up the new VM and check it over
- Power off the old VM (you will lose it forever, be very sure the new VM is good)
- Connect the network of the new VM
- Delete the old VM
- Delete the knackered SAN datastore and refresh on all other ESX’s that share it (deletes the name but doesn’t free up any space)
- Create a new SAN datastore (this formats the old space)
- Refresh on all other ESX’s that share the datastore
- Shutdown all the new VM’s
- Clone them to the new SAN datastore using the original name (e.g. myserver)
- Power up the new VM's on the SAN datastore, confirm OK, then delete the myserver-recov servers
Recover lost SAN VMFS partition (ESX3)
EG After a powerdown, ESX's can see the SAN storage, but the VMFS cannot be found in the Storage part of the ESX config, even after Refresh. To fix, the VMFS needs to be resignatured...
Do not attempt to Add Storage to recover the VMFS, this will format the partition
- On one of the ESX's, in Advanced Settings, change LVM.EnableResignature to 1
- Refresh Storage, the VMFS should be found with a new name, something like snap-000000002-OriginalName.
- Remove from Inventory all VM's from the old storage, the old storage should disappear from the list of datastores
- Rename the found storage to the original name
- Refresh Storage on all other ESX's, they should see the VMFS again
- Revert LVM.EnableResignature on the appropriate ESX to 0
- Via the ESX, browse the datastore and re-add the VM's to the inventory (right-click over the .vmx file)
- For a Virtual Machine Question about what to do about a UUID, select Keep
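If you prefer to flip the setting from the service console rather than the VI Client, the same advanced option can be toggled with esxcfg-advcfg (a sketch; check the current value with -g first and remember to set it back afterwards)...
# Check the current value, enable resignaturing, then revert once the VMFS is back
esxcfg-advcfg -g /LVM/EnableResignature
esxcfg-advcfg -s 1 /LVM/EnableResignature
# ...refresh storage and re-add the VM's as above, then set it back...
esxcfg-advcfg -s 0 /LVM/EnableResignature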
Recover lost SAN VMFS partition (ESX4)
Recovering SAN LUN's that can be seen by the ESX, but for which the VMFS isn't visible can generally be resolved by re-adding the storage...
- In the Configuration > Hardware > Storage view of an ESX, select Add Storage...
- The ESX should find the LUN, and correctly display its VMFS name in the VMFS Label column
- Select the LUN and click Next, then try each of the following options in order to re-add the VMFS (if neither of these options are available see further procedure below)
- Keep the existing signature
- Assign a new signature
- Refresh the storage on the other ESX's that can see the LUN.
If you're unable to re-add the VMFS without formatting the disk, but the VMFS is visible to at least one ESX, then perform the following to get the VMFS added to ESX's that don't have it
- Get the UUID of the volume (the Can mount flag needs to be Yes)
esxcfg-volume -l
- Force the ESX to mount the volume
- EG
esxcfg-volume -M 4d2b6ba3-123ebca8-4f9b-18a905c1234a
- Refresh the storage view in vCentre
- If the VMFS is a scratch disk, restart the ESX
For more info see VMWare KB 1015986
USB / SD Hypervisor Checks
USB and SD cards are notorious for causing problems, especially USB sticks, which were designed for occasional-access storage and not for the repetitive use they get when running the ESXi hypervisor. The SD cards may well be tarnished with the shadow of USB. In order to perform a disk check, use the following...
Assumes you're running ESXi 4; if using ESXi 3 use this procedure (from which this section is adapted): http://www.vm-help.com/esx/esx3i/check_system_partitions.php
Firstly a quick overview of the partitions...
/vmfs/volumes/Hypervisor1   /bootbank      Where the ESX boots from
/vmfs/volumes/Hypervisor2   /altbootbank   Used during ESX updates
/vmfs/volumes/Hypervisor3   /store         VMTools ISO's etc
Everything else in an ESXi server is stored on the scratch disk, or is created at boot in a ramdisk
Run fdisk -l to list the available partitions on the USB/SD card (you'll see your SAN partitions as well)...
Disk /dev/disks/mpx.vmhba32:C0:T0:L0: 8166 MB, 8166309888 bytes
64 heads, 32 sectors/track, 7788 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
                           Device Boot  Start   End  Blocks  Id  System
/dev/disks/mpx.vmhba32:C0:T0:L0p1           5   900  917504   5  Extended
/dev/disks/mpx.vmhba32:C0:T0:L0p4   *       1     4    4080   4  FAT16 <32M
/dev/disks/mpx.vmhba32:C0:T0:L0p5           5   254  255984   6  FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p6         255   504  255984   6  FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p7         505   614  112624  fc  VMKcore
/dev/disks/mpx.vmhba32:C0:T0:L0p8         615   900  292848   6  FAT16
The two partitions with an identical number of blocks are /bootbank and /altbootbank; perform a check disk on these...
dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:5
dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:6
to perform a verification pass use -V, or to test for bad sectors use -t (with which you also need to include -a (automatically repair) or -r (interactively repair) options).
dosfsck -V /dev/disks/mpx.vmhba32:C0:T0:L0:5
dosfsck -t -r /dev/disks/mpx.vmhba32:C0:T0:L0:5
Unable to Add RDM
Basic steps to add an RDM are...
- Provision LUN on SAN
- Rescan LUN's on ESX
- Add RDM to VM
.vmdk is larger than the maximum size supported by datastore
- Normally this error is misleading and really means that the RDM can't be created for some untrapped reason. It does not mean that there is not enough space to create the (very small) RDM mapping file on the VMFS!
- Double-check that the LUN has been properly created and is available.
- Attempt to add the disk as a new VMFS to an ESX (cancel at the last part of wizard)
- Then re-attempt to add the RDM to the VM
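If the wizard still refuses, it can be worth creating the mapping file by hand from the service console with vmkfstools (a hedged sketch - the device name, datastore, VM folder and vmdk name below are placeholders; -r creates a virtual-compatibility RDM, -z a physical/pass-through one)...
# List the available LUN device names
ls /vmfs/devices/disks/
# Create a virtual-compatibility RDM mapping file in the VM's folder
vmkfstools -r /vmfs/devices/disks/naa.60012345000000000000000000000001 /vmfs/volumes/datastore1/myvm/myvm_rdm.vmdk
# ...then add the resulting vmdk to the VM as an existing disk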
High Availability
Be aware that playing with HA can have disastrous effects, especially if the Isolation Response of your cluster is set to Power Off. If you can, consider waiting until outside of production hours before trying to resolve a problem. Unstable clusters can disintegrate if you're unlucky.
There are 5 primaries in an HA cluster; the first ESX's to join the cluster become primaries. This only changes (through an election) when one of the following occurs (note - not during an ESX failure)...
- Primary ESX goes into Maintenance Mode
- Primary disconnected from the cluster
- Primary removed from the cluster
- Primary reconfigured for HA
If HA has never worked on the cluster then you're best skipping the First Fixes section, and should proceed to Check DNS and/or check the Error Hints to help diagnose what might be wrong.
First Fixes
It's quite common for HA to go into an error state, and so there are some normal first things to try. Neither of these should affect running VM's on any of the ESX's in the cluster (though see note above regarding Isolation Response).
- Reinstall HA on an ESX
- This will cause the HA Agent to be reinstalled on the ESX, though note that the installer package is not refreshed from the vCentre if the right version is already on the ESX - it can be worth attempting twice
- Right-click over the problem ESX and select Reconfigure for VMware HA
- Reinstall HA on the Cluster
- This will cause a complete rebuild of the HA cluster configuration, with HA being reinstalled on all ESX's. This can make matters worse if there are critical configuration problems with the cluster (though such deteriorations are inevitable anyway, it may be worth avoiding this during production hours if in a high-blame environment)
- Right-click over the problem Cluster and select Edit Settings...
- Untick the Turn On VMware HA option, and click OK
- Wait for HA to be removed from all ESX's
- Right-click over the problem Cluster and select Edit Settings...
- Tick the Turn On VMware HA option, and click OK
- Wait for HA to be installed on all ESX's (it can be worth reinstalling on a few ESX's after this if there are a few persistent ESX's in error)
If the above fails then proceed to the sections below to investigate further.
Check DNS
HA is very dependent on proper DNS; to check everything is in order, do the following from each ESX. Some versions of ESX3 are sensitive to case, so always use lower case: the FQDN's of ESX's should be lower case, and the vCentre FQDN and domain suffix search should be lower case.
- Check that the hostname/IP of the local ESX is as expected
hostname
hostname -s
hostname -i
- If not check the following files
/etc/hosts
/etc/sysconfig/network
/etc/vmware/esx.conf
- Check that HA can properly resolve other ESX's in the cluster (note: only one IP address should be returned)
/opt/vmware/aam/bin/ft_gethostbyname <my_esx_name>
- Check that HA can properly resolve the vCentre
/opt/vmware/aam/bin/ft_gethostbyname <my_vc_name>
- Check the vCentre server can properly resolve the ESX names
- Check the vCentre's FQDN and DNS suffix search are correct and lower case
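A quick way to run the resolution checks against every node in one go from an ESX service console (a sketch; the hostnames are placeholders for your own ESX's and vCentre)...
# Check forward resolution of each cluster member and the vCentre via the HA agent's resolver
for h in esx1 esx2 esx3 vcentre01; do
  echo "== $h =="
  /opt/vmware/aam/bin/ft_gethostbyname $h
done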
If you need to correct DNS names, it's likely that you will need to reinstall HA and VPXA on individual ESX's or the whole cluster. This can be done without interrupting running VM's, but it's obviously a lot less stressful not to.
Manually De-Install
Sometimes reinstalling via the VI Client doesn't do a full enough job, or it can fail, so you have to revert to doing it yourself.
- Put the ESX into maintenance mode (optional - VM's can be left running on ESX)
- Disconnect the ESX from the Virtual Centre
- SSH to the ESX server (or use ESXi Tech Support Mode)
cd /opt/vmware/uninstallers
./VMware-vpxa-uninstall.sh
./VMware-aam-ha-uninstall.sh
- See note below if the uninstallers fail to remove files
- Reconnect the ESX to the VC
- Take out of maintenance mode
Alternatively, to avoid re-installing the vCentre agent
- Put the ESX into maintenance mode (optional - VM's can be left running on ESX)
- SSH to the ESX server (or use ESXi Tech Support Mode)
/etc/opt/init.d/vmware-vpxa stop
cd /opt/vmware/uninstallers
./VMware-aam-ha-uninstall.sh
/etc/opt/init.d/vmware-vpxa start
- Take out of maintenance mode
If the VC Agent or HA Agent uninstall fails because the uninstaller is unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt. Especially if ESX is installed on a USB key, consider replacing it ASAP. This is most likely to occur in ESX3, where HA tends to wear the USB key out; it is less of an issue in ESX4, where the amount of writing to the USB key is kept to a minimum. To work around it, rename the folder(s) that could not be removed/modified.
mv /opt/vmware/aam /opt/vmware/aam_old
mv /opt/vmware/vpxa /opt/vmware/vpxa_old
Command Line Interface etc
Using the commands in this section isn't supported by VMware
To start the CLI run the following command...
/opt/vmware/aam/bin/Cli
The interface is a bit dodgy: you can enter the same command twice and have it rejected one time and accepted another, so patience is required.
Command | Comments
---|---
ln | List cluster nodes and their status
addNode <hostname> | Add ESX/node to cluster (use ESX's short DNS name)
promoteNode <hostname> | Promote existing ESX/node to be a primary
demoteNode <hostname> | Demote existing ESX/node to be a secondary
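As an illustration, a typical sequence for checking the primaries and promoting a secondary looks something like this (the hostname is a placeholder, and as noted above a rejected command is often just worth re-entering)...
# Start the HA CLI, then at its prompt list the nodes and promote a secondary
/opt/vmware/aam/bin/Cli
#   ln
#   promoteNode esx2
#   ln   (confirm the node now shows as a primary)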
There are also the following scripts, which behave as you'd expect (found in /opt/vmware/aam/bin)...
./ft_setup
./ft_startup
./ft_shutdown
Error Hints
Host in HA Cluster must have userworld swap enabled
- ESXi servers need to have scratch space enabled
- In vCentre, go to the Advanced Settings of the ESX
- Go to ScratchConfig and locate ScratchConfig.ConfiguredScratchLocation
- Set to a directory with sufficient space (1GB) (can be configured on local storage or shared storage, the folder must exist and be dedicated to the ESX, delete its contents if you've rebuilt the ESX)
  - Format: /vmfs/volumes/<DatastoreName>
  - EG /vmfs/volumes/SCRATCH-DISK/my_esx
- Locate ScratchConfig.ConfiguredSwapState and set it to enabled
- Bounce the ESX
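If you'd rather do this from Tech Support Mode than the VI Client, the same advanced options can usually be read and set with esxcfg-advcfg (a sketch only - the option path below is an assumption based on the setting name above, so check it with -g before setting anything)...
# Inspect, then set, the scratch location advanced option (path and value are placeholders)
esxcfg-advcfg -g /ScratchConfig/ConfiguredScratchLocation
esxcfg-advcfg -s /vmfs/volumes/SCRATCH-DISK/my_esx /ScratchConfig/ConfiguredScratchLocation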
Unable to contact primary host in cluster
- The ESX is unable to contact a primary ESX in the cluster - some kind of networking issue
- If there are no existing HA'ed ESX's, start by looking at the vCentre's networking (for example inconsistent domain names, including case)
cmd remove failed
- HA failed to uninstall properly prior to being reinstalled; try to manually deinstall HA as per the Manually De-Install section above. This can be indicative of a dying USB key (if your ESX is installed on a USB key), so fingers crossed.
Snapshots
See also Virtual Machines Snapshot Troubleshooting
Random Problems
ESXi Lockup
Affects ESXi v3.5 Update 4 only. Caused by a problem with updated CIM software in Update 4.
- Workaround
  - Disable CIM (disables hardware monitoring) by setting Advanced Settings | Misc | Misc.CimEnabled to 0 (restart to apply)
- Fix
- Apply patch ESXe350-200910401-I-SG, see http://kb.vmware.com/kb/1014761
For further info see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012575
Cimserver High CPU
Caused by problems with the VMware CIM server software. However, it can also be triggered by other problems causing it to go nuts (check the VMKernel logs, etc).
- Restart
service pegasus restart
Log Bundle Creation Fails
ESX log bundle creation fails, either via the VI Client or via vm-support
- SSH to the ESX
- Run
vm-support
to try to create a log bundle
- Could not copy...Have you run out of disk space?
  - ESX - Check that there's space to be able to write in /tmp
  - ESXi - Check that the ESX has been configured with a scratch disk, and that it has space
- tar: write error: Broken pipe
- ESXi - Check that the ESX has been configured with a scratch disk
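A quick sketch of checking free space and pointing vm-support somewhere with room (the -w working-directory flag is from the classic vm-support script, and the datastore name is a placeholder)...
# Check free space in /tmp
df -h /tmp
# Write the support bundle to a datastore with plenty of space instead
vm-support -w /vmfs/volumes/datastore1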
"Remote Tech Support Mode (SSH) for the host XXX has been enabled" warning
Whilst not always recommended for production use, it's certainly very common for admins to want SSH access enabled to all ESX's all of the time. But this causes a yellow warning on the ESX's with the message above.
You can remove the message by rebooting the ESX or restarting the hostd service (/etc/init.d/hostd restart), but this isn't always reliable, and doesn't survive an ESX upgrade. The following disables the alert through the ESX's advanced config.
- Go to the ESX's Advanced Settings
- In the VI Client, with an ESX selected in the left-hand pane
- Go to the Configuration tab, then in the Software section, go to Advanced Settings
- Change UserVars.SuppressShellWarning to 1
The change takes effect immediately, no restart etc required!
The above was gleaned from the script found at http://www.ivobeerens.nl/2011/11/11/enable-or-disable-remote-tech-support-mode-ssh/
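From Tech Support Mode the same setting can also be flipped with esxcfg-advcfg (a hedged sketch; check the current value with -g first)...
# Suppress the Tech Support Mode / SSH enabled warning from the command line
esxcfg-advcfg -g /UserVars/SuppressShellWarning
esxcfg-advcfg -s 1 /UserVars/SuppressShellWarning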
"Host XXX currently has no management network redundancy" warning
If your ESX is meant to have redundant network connectivity then confirm that this is still fully operational.
- If redundancy is in place...
- Right-click over the ESX and select Reconfigure for VMware HA
- If ESX is not meant to have redundancy, you'll need to disable the check at the cluster level...
- Right-click over the ESX's cluster and select Edit Settings...
- Select VMware HA and hit the Advanced Options... button
- Type in a new option
  - Option: das.ignoreRedundantNetWarning
  - Value: True
- Disable and re-enable HA on cluster to apply