{{TOC limit|3}}

= Build Notes =

== Installation ==
* '''[[ESX3 Installation]]''' - Example, based on an old ESX v3 build guide
* '''[[ESX4i Installation]]''' - Example, a bit brief in places
* [http://www.jam-software.com/heavyload/download.shtml HeavyLoad] - Load tester (run it in a test VM; the memory test doesn't really work as ESX page sharing kicks in)

== Build Numbers ==
ESX build numbers. Note that installing subsequent patches on top of one of the major releases below will increase the build number.
{|class="vwikitable"
|-
! ESX version !! ESX !! ESXi
|-
| 3.5 Update 1 || 82663 || 82664
|-
| 3.5 Update 2 || 110268 || 110271
|-
| 3.5 Update 3 || 123630 || 123629
|-
| 3.5 Update 4 ||colspan="2"| 153875
|-
| 3.5 Update 5 ||colspan="2"| 207095
|-
| 4.0 ||colspan="2"| 164009
|-
| 4.0 Update 1 ||colspan="2"| 208167
|-
| 4.0 Update 2 ||colspan="2"| 261974
|-
| 4.0 Update 3 ||colspan="2"| 398348
|-
| 4.0 Update 4 ||colspan="2"| 504850
|-
| 4.1 ||colspan="2"| 260247
|-
| 4.1 Update 1 ||colspan="2"| 348481
|-
| 4.1 Update 2 ||colspan="2"| 502767
|-
| 4.1 Update 3 ||colspan="2"| 800380
|-
| 5.0 ||colspan="2"| 469512
|-
| 5.0 Update 1 ||colspan="2"| 623860
|-
| 5.1 ||colspan="2"| 799733
|}
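
To check which build a host is actually running, use the version commands from the [[#Quick commands|Quick commands]] section below; a minimal example (output format from memory, so treat as indicative):
<pre>
# ESX 3.x service console - reports version and build
vmware -v
VMware ESX Server 3.5.0 build-207095
</pre>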

== USB Image ==
# Disconnect all images, reboot server, cross fingers
#* <code> reboot </code>

== VMware CLI ==

== Security Hardening ==
=== Service Console ===
Applicable to ESX only (not ESXi, as ESXi doesn't have a service console).

==== Disk Partitions ====
Suggested partition sizing for the Service Console on local disk, to prevent the root partition being filled with user data.

==== Network Settings ====
{|class="vwikitable"
|-
! Setting !! Default !! Preferred !! Explanation
|}

= Configuration Considerations =
== Hardware ==
=== CPU ===
{|class="vwikitable"
|-
! Feature !! Set to !! Intel name !! AMD name
|-
| Node Interleaving || Disabled (allows NUMA operation) || ||
|-
| Execute Protection || Enabled || eXecute Disable (XD) || No-Execute Page-Protection
|-
| Virtualisation assist || Enabled || Intel VT || AMD-V
|}

=== CPU Power vs Performance ===
'''If in doubt, put server BIOS settings to maximum performance''' - this ensures that ESX can get the most out of the hardware; allowing the BIOS to balance or use low-power modes may impact VM performance. ESX's are expected to work hard, that's how they save you money, and so they should be set up to be able to perform. In theory, allowing the motherboard to throttle back the CPUs when under low load shouldn't cause a problem.

'''When using ESX 4.1 or higher''', set the BIOS to allow the OS (ie ESX) control of CPU performance (if the setting is available); this allows the CPU performance to be controlled dynamically by ESX as it manages VM load (and is configurable through the VI Client).

See [http://kb.vmware.com/kb/1018206 VM KB 1018206 - Poor virtual machine application performance may be caused by processor power management settings] for further info.

=== HP ASR ===
'''Should be disabled.'''

VMware don't recommend use of the HP ASR feature (designed to restart a server in the case of an OS hang); they've come across occasions when an ESX under load will suddenly restart due to ASR time-outs. See [http://kb.vmware.com/kb/1010842 VM KB 1010842 - HP Automatic Server Recovery in a VMware ESX Environment] for further info.

== Networking ==
=== Beacon Probing ===
Should only be used when there are 3 or more physical NIC's assigned to the vSwitch, uplinked to the network switch.

This is to enable the ESX to properly determine the state of the network during a fault condition. If there are only two uplinks and the beacon gets lost between the two NIC's, the ESX can't know which uplink is faulty, just that there is a fault.

See [http://kb.vmware.com/kb/1005577 VM KB 1005577 - What is beacon probing?] for further info.

== Storage ==
=== ESX Installation Sizing ===
See [http://kb.vmware.com/kb/1026500 VM KB 1026500 - Recommended disk or LUN sizes for VMware ESX/ESXi installations]

=== SCSI Resets ===
When accessing centralised storage via SCSI, VMware recommends the following configuration (only the disabling of SCSI Device Resets is a change from the default). These settings are intended to limit the scope of SCSI Resets, and so reduce contention and overlapping of SCSI commands from different hosts accessing the same storage system.
* <code> Disk.UseLunReset </code> set to <code> 1 </code>
* <code> Disk.UseDeviceReset </code> set to <code> 0 </code>
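
These can also be checked and set from the service console with <code>esxcfg-advcfg</code>; a minimal sketch (the option paths below assume the usual ESX 3.x/4.x advanced-option naming):
<pre>
# check the current values
esxcfg-advcfg -g /Disk/UseLunReset
esxcfg-advcfg -g /Disk/UseDeviceReset

# set the recommended values
esxcfg-advcfg -s 1 /Disk/UseLunReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
</pre>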

=== Path Selection Policy (PSP) ===
* Active-Active (AA) - Storage array allows access to LUN's through all paths simultaneously.
* Active-Passive (AP) - Storage array allows access to LUN's through one storage processor at a time.
* Asymmetric (ALUA) - Storage array prioritises paths available to access a LUN (see http://www.yellow-bricks.com/2009/09/29/whats-that-alua-exactly/)

{|class="vwikitable"
|-
! Policy !! For Arrays !! Description
|-
| Most Recently Used (VMW_PSP_MRU) || All (default for AP arrays) || ESX uses whatever path is available, initially defaulting to last used or first detected at start up
|-
| Fixed (VMW_PSP_FIXED) || Active-Active (not for AP) || ESX uses the preferred path, unless it's not available. Can cause path thrashing with AP arrays
|-
| Fixed AP (VMW_PSP_FIXED_AP) || All (though really for ALUA) || As for Fixed, but the ESX picks the preferred path, and uses a path-thrashing avoidance algorithm
|-
| Round Robin (VMW_PSP_RR) || All || ESX uses all available paths (will be limited by AP arrays)
|}
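
On ESX/ESXi 4.x the active policy can be checked and changed per device with <code>esxcli</code>; a sketch (the device ID below is a placeholder):
<pre>
# list devices with their current path selection policy
esxcli nmp device list

# set Round Robin on a specific device
esxcli nmp device setpolicy --device naa.60060160c6931100cc319eea7add0001 --psp VMW_PSP_RR
</pre>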

= Procedures =
Links to VMware KB docs...
* [http://kb.vmware.com/kb/1026380 VMware KB1026380 - Committing snapshots on ESX/ESXi host from command line]

== Quick commands ==
{|class="vwikitable"
|-
|<code> vmware -v </code> || ESX3 software version and build
|-
|<code> vmware -l </code> || ESX4 software version and build
|-
|<code> vm-support -x </code> || List running VM's
|-
|<code> vmware-cmd -l </code> || List config files of VM's registered to ESX
|-
|<code> esxcfg-rescan vmhba0 </code> || Perform LUN rescan on vmhba0
|-
|<code> esxcfg-vmhbadevs </code> || List HBA LUN mappings
|-
|<code> esxcfg-scsidevs --hbas </code> || List HBA devices
|-
|<code> esxcfg-mpath -l </code> || List all LUN's and their paths
|}

== ESX Shutdown / Reboot ==
'''ESX'''
* Shutdown a host ready for power off
** <code> shutdown -h now </code>
* Restart a host
** <code> shutdown -r now </code>

'''ESXi'''
* Shutdown a host ready for power off
** <code> /bin/host_shutdown.sh </code>
* Restart a host, either of
** <code> /bin/host_reboot.sh </code>
** <code> reboot </code>

== High Availability Stop/Start ==
* Stop HA...
** <code> /etc/init.d/VMWAREAAM51_vmware stop </code>
* Start HA...
** <code> /etc/init.d/VMWAREAAM51_vmware start </code>

== VMware Management Agent Restart ==
'''ESX'''
<pre>
service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Services [ OK ]
VMware ESX Server Host Agent Watchdog [ OK ]
VMware ESX Server Host Agent [ OK ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background) [ OK ]
Availability report startup (background) [ OK ]
</pre>

If this fails to stop the service, you can try to manually kill the processes.
# Determine the PID's of the processes
#* <code> ps -auxwww | grep vmware-hostd </code>
#* ...which should give you something like the following, in which case the PID's are 2807 and 2825...
#* <code> root 2807 0.0 0.3 4244 884 ? S Mar10 0:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u </code>
#* <code> root 2825 0.1 12.0 72304 32328 ? S Mar10 1:14 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u </code>
#* <code> root 13848 0.0 0.2 3696 556 pts/0 R 08:43 0:00 grep vmware-hostd </code>
# Kill the PID's using <code> kill -9 <pid> </code>
#* So, for example, <code> kill -9 2807 </code> and <code> kill -9 2825 </code>
# Then reattempt the service restart

To also restart the Virtual Centre Agent, use
<code> service vmware-vpxa restart </code>

'''ESXi'''<br>
<code> services.sh restart </code>

== VMware Web Access Restart ==
<pre>
service vmware-webAccess restart
Stopping VMware ESX Server webAccess:
VMware ESX Server webAccess [FAILED]
Starting VMware ESX Server webAccess:
VMware ESX Server webAccess [ OK ]
</pre>

== VM Start ==
On the ESX that currently owns the VM...
# Get the VM's config file path
#* <code> vmware-cmd -l | grep VM_Name </code>
# Start the VM using the path found
#* <code> vmware-cmd /vm_path/VM_Name.vmx start </code>
# Wait for start-up to complete; if start-up fails, check the VM's log
#* <code> less /vm_path/vmware.log </code>

== Maintenance Mode ==
To put the ESX into maintenance mode with no access from the Infrastructure Client (VI Client), use the following commands - use with caution.

Put the ESX into maintenance mode:
<pre>
vimsh -n -e /hostsvc/maintenance_mode_enter
</pre>

Check the ESX is in maintenance mode:
<pre>
vimsh -n -e /hostsvc/runtimeinfo | grep inMaintenanceMode | awk '{print $3}'
</pre>

Exit maintenance mode:
<pre>
vimsh -n -e /hostsvc/maintenance_mode_exit
</pre>

== TCPDump Network Sniffer ==
Basic network sniffer available in the Service Console.

[http://www.tcpdump.org/tcpdump_man.html TCPDump instruction manual]

EG To sniff all traffic on the Service Console interface, vswif0, going to/from 159.104.224.70

<code> tcpdump -i vswif0 host 159.104.224.70 </code>
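
To capture to a file for later analysis in something like Wireshark, rather than printing to screen (a sketch using standard tcpdump options):
<pre>
# -s 0 captures full-length packets, -w writes the raw capture to file
tcpdump -i vswif0 -s 0 -w /tmp/vswif0-capture.pcap host 159.104.224.70
</pre>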

== Security ==
=== Password Complexity Override ===
In order to be able to change a user (or root) password to one that breaches password complexity checking...
# Disable PAM module
#* <code> esxcfg-auth --usepamqc -1 -1 -1 -1 -1 -1 </code>
# Disable complexity checker
#* <code> esxcfg-auth --usecrack -1 -1 -1 -1 -1 -1 </code>
# Change password
# Re-enable PAM module
#* <code> esxcfg-auth --usepamqc=-1 -1 -1 -1 8 8 </code>

=== Regenerate Certificate ===
You might need to regenerate certificates if you...
* Change the ESX host name
* Accidentally delete the certificates

To generate new certificates for the ESX Server host...
# Change directory to /etc/vmware/ssl
# Create backups of any existing certificates:
#* <code> mv rui.crt orig.rui.crt </code>
#* <code> mv rui.key orig.rui.key </code>
# Restart the vmware-hostd process:
#* <code> service mgmt-vmware restart </code>
# Confirm that the ESX Server host generated new certificates by comparing the time stamps of the new certificate files with orig.rui.crt and orig.rui.key
#* <code> ls -la </code>
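
The new certificate can also be sanity-checked with openssl (assuming it's present in the service console, which it normally is):
<pre>
# show who the certificate was issued to and its validity dates
openssl x509 -in /etc/vmware/ssl/rui.crt -noout -subject -dates
</pre>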

== NIC Operations ==
=== Get NIC Firmware/Driver versions ===
* '''ESX4'''
** <code> ethtool -i vmnic<no> </code>
** Where <code><no></code> is your NIC number, eg <code> ethtool -i vmnic0 </code>
* '''ESX3i / ESX4i'''
** <code> vsish -e get net/pNics/vmnic<no>/properties </code>
** Where <code><no></code> is your NIC number, eg <code> vsish -e get net/pNics/vmnic1/properties </code>

== HBA and SAN Operations ==
=== VMFS / LUN Addition ===
The new LUN needs to be carved up and presented to all ESX's that should see it (normally all ESX's from a particular cluster). Once completed, follow the procedure below to add it to the ESX's...
# Pick the ESX in the cluster with the lowest load
# Go to '''Storage Adapters''', hit '''Rescan...''' and untick ''Scan for New VMFS Volumes''
# Once the scan has completed, go to '''Storage''', and hit '''Add Storage...'''
# Click '''Next >''' to select ''Disk/LUN'' storage
# Select the appropriate device and click '''Next >'''
# Check the current disk layout (ie it's blank if it's meant to be) and click '''Next >'''
# Give the datastore an appropriate name, and click '''Next >'''
# Select an appropriate block size (this limits maximum VMDK size), and click '''Next >'''
# Review config and click '''Finish'''
# On the remaining ESX's, go to '''Storage Adapters''', hit '''Rescan...''' (leave both boxes checked)
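
The rescan on the remaining hosts can also be driven from the service console rather than the VI Client (the adapter name will vary per host):
<pre>
# rescan a specific adapter for new LUN's
esxcfg-rescan vmhba1
</pre>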

=== SAN LUN ID ===
The SAN LUN ID is used by SAN admin's to identify LUN's. It's not readily available from the GUI and has to be extracted from the vml file...

So from the following...
* <code> /vmfs/devices/disks/vml.020006000060060160c6931100cc319eea7adddd11524149442035 </code>
you need to extract the mid characters from the vml name...
* <code> /vmfs/devices/disks/vml.0200060000'''60060160c6931100cc319eea7adddd11'''524149442035 </code>
So the SAN LUN ID is <code> 60060160c6931100cc319eea7adddd11 </code>
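
A quick way to pull the ID out from the shell (a sketch; assumes the 10-character prefix after <code>vml.</code> is fixed-length, as in the example above):
<pre>
# strip the path, then keep the 32 characters after "vml." plus the 10-character prefix
basename /vmfs/devices/disks/vml.020006000060060160c6931100cc319eea7adddd11524149442035 | cut -c 15-46
# returns: 60060160c6931100cc319eea7adddd11
</pre>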

=== Emulex ===
==== Find Emulex HBA Driver and Firmware Version, and WWPN ====
Doesn't require the Emulex HBA utility to be installed
# <code> cd /proc/scsi/lpfc </code>
# <code> more 1 </code> for HBA 1
# <code> more 2 </code> for HBA 2

The <code> Portname </code> number is the WWPN number used to identify the HBA's by the SAN.
<pre>
[root@uklonesxp2 lpfc]# more 1
Emulex LightPulse FC SCSI 7.1.14_vmw1
Emulex LightPulse LP1050 2 Gigabit PCI Fibre Channel Adapter on PCI bus 0f device 20 irq 121
SerialNum: BG70569148
Firmware Version: 1.91A1 (M2F1.91A1)
Hdw: 1001206d
VendorId: 0xf0a510df
Portname: 10:00:00:00:c9:61:73:de   Nodename: 20:00:00:00:c9:61:73:de

Link Up - Ready:
PortID 0x645213
Fabric
Current speed 2G
</pre>

==== Install Emulex HBA Utility ====
Can be found at [http://www.emulex.com/vmware/support/index.jsp Emulex Lputil].

To install lputil (example uses lpfcutil-7.1.14):
# Put the downloaded tgz file on the ESX server
#* EG <code> mkdir /var/updates/Emulex-lpfcutil-7.1.14 </code>
# Go into the folder and extract:
#* <code> cd /var/updates/Emulex-lpfcutil-7.1.14/ </code>
#* <code> tar -xvzf Emulex-lpfcutil-7.1.14.tgz </code>
# Install:
#* <code> ./Install.sh </code>
<pre>
[root@uklonesxp2 Emulex-lpfcutil-7.1.14]# ./Install.sh
Installing Emulex HBAAPI libraries and applications...
Installation of Emulex HBAAPI libraries and utilities is completed.
</pre>
* Start the utility (on startup it should detect one or more HBA's):
* <code> /usr/sbin/lpfc/lputil </code>
<pre>
LightPulse Common Utility for Linux. Version 1.6a10 (10/7/2004).
Copyright (c) 2004, Emulex Network Systems, Inc.

Emulex Fibre Channel Host Adapters Detected: 1
Host Adapter 0 (lpfc0) is an LP1050 (Ready Mode)
</pre>

==== HBAnywhere Installation ====
# Download the Driver and Application kit for VMware from [http://www.emulex.com/downloads/emulex/cnas-and-hbas/drivers/vmware/fc-74040-pkg.html Emulex's website].
#* At the time of writing the current version of the package was <code>elxvmwarecorekit-esx35-4.0a45-1.i386.rpm</code>
# Copy the package to the server
#* EG <code> pscp -pw [password] elxvmwarecorekit-esx35-4.0a45-1.i386.rpm platadmn@dtcp-esxsvce01a:/home/platadmn </code>
# Install the package
#* EG <code> rpm -ivh elxvmwarecorekit-2.1a42-1.i386.rpm </code>

==== Check Emulex HBA Firmware Version ====
Requires the HBA Utility to be installed 1st (see above)

# Start the utility (on startup it should detect one or more HBA's):
#* <code> /usr/sbin/lpfc/lputil </code>
# From the Main menu, enter 2, '''Adapter Revision Levels'''
#* Example shows version 1.91a5
<pre>
BIU: 1001206D
Sequence Manager: 00000000
Endec: 00000000
Operational Firmware: SLI-2 Overlay
Kernel: 1.40a3
Initial Firmware: Initial Load 1.91a5 (MS1.91A5 )
SLI-1: SLI-1 Overlay 1.91a5 (M1F1.91A5 )
SLI-2: SLI-2 Overlay 1.91a5 (M2F1.91A5 )
Highest FC-PH Version: 4.3
Lowest FC-PH Version: 4.3
</pre>

==== Update Emulex HBA Firmware ====
* '''Using HBA Utility''' (must be installed 1st - see above). See the Emulex website for the latest version, eg [http://www.emulex.com/ts/downloads/lp1050/lp1050ex.jsp Emulex LP1050Ex]

To update the firmware (example uses LP1050Ex-mf191a5)...
# Put the downloaded zip file on the UKLONVCP1 NFS Share, and unzip to a folder, eg EmulexLP1050Ex-mf191a5
# Create a folder in /var/updates:
#* <code> mkdir /var/updates/EmulexLP1050Ex-mf191a5 </code>
# Copy the firmware update onto the ESX
#* <code> cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/EmulexLP1050Ex-mf191a5/mf191a5.all /var/updates/EmulexLP1050Ex-mf191a5/ </code>
# Start the utility (on startup it should detect one or more HBA's):
#* <code> /usr/sbin/lpfc/lputil </code>
# From the Main menu, enter 3, '''Firmware Maintenance'''
# If prompted, choose the HBA that is being updated
# Enter 1, '''Load Firmware Image'''
# Enter the full path to the firmware file; the upgrade will then complete, eg
<pre>
Enter Image Filename => /var/updates/EmulexLP1050Ex-mf191a5/mf191a5.all
Opening File...
End Of File
Checksum OK!!!
Reading AIF Header #1...
Validating Checksum...
Erasing Flash ROM Sectors...
100% complete
Loading Image...
First Download
100% complete
Image Successfully Downloaded...
Reading AIF Header #2...
Validating Checksum...
Erasing Flash ROM Sectors...
100% complete
Loading Image...
First Download
100% complete
Updating Wakeup Parameters...
Image Successfully Downloaded...
Reading AIF Header #3...
End Of File
Resetting Host Adapter...
Image Successfully Downloaded...
</pre>

* '''Using HBAnywhere''' (must be installed 1st - see above)
# Download the correct firmware version from Emulex's website
#* EG for [http://www.emulex.com/downloads/emulex/cnas-and-hbas/firmware-and-boot-code/lpe11002.html LPe11002's]
# Extract, and copy the file to the server
# Find the adapter's WWPN's
#* EG <code> /usr/sbin/hbanyware/hbacmd ListHBAs </code>
# Download the new firmware version to each HBA
#* EG <code> /usr/sbin/hbanyware/hbacmd download 10:00:00:00:c9:82:97:9e zf280a4.all </code>

==== EMCgrab Collection ====
# Download the correct version from EMC's website
#* At the time of writing the current version file was [ftp://ftp.emc.com/pub/emcgrab/ESX/Old_Releases/v1.1/ emcgrab_ESX_v1.1.tar]
# Copy to the server
#* EG <code> pscp emcgrab_ESX_v1.1.tar platadmn@dtcp-esxsvce02a:/home/platadmn </code>
# Uncompress the file
#* EG <code> tar -xvf emcgrab_ESX_v1.1.tar </code>
# Run the grab (can take a few minutes, best done out of hours)
#* EG <code> ./emcgrab.sh </code>
# Results can be found in the <code>emcgrab/outputs</code> folder

=== QLogic ===
==== Find QLogic HBA Driver and Firmware Version ====
# <code> cd /proc/scsi/qla2300 </code>
# <code> more 1 </code> for HBA 1
<pre>
[root@uklonesxp1 qla2300]# more 1
QLogic PCI to Fibre Channel Host Adapter for QLA2340 :
Firmware version: 3.03.19, Driver version 7.07.04
Entry address = 0x7dc314
HBA: QLA2312 , Serial# E79916
Request Queue = 0x3f403000, Response Queue = 0x3f414000
...
</pre>

==== Install QLogic HBA Utility ====
Installation instructions for the SANsurfer utility
# Put the downloaded file on the UKLONVCP1 NFS Share, eg scli-1.7.0-12.i386.rpm.gz
# Copy to the folder /var/updates (create if it doesn't exist)
#* <code> cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/scli-1.7.0-12.i386.rpm.gz /var/updates </code>
# Uncompress the file with the following command:
#* <code> gunzip scli-1.7.0-12.i386.rpm.gz </code>
# Enter the following commands to install the package, and then check it's installed:
#* <code> rpm -iv scli-1.7.0-12.i386.rpm </code>
#* <code> rpm -q scli </code>
<pre>
[root@uklonesxp1 updates]# rpm -iv scli-1.7.0-12.i386.rpm
Preparing packages for installation...
scli-1.7.0-12
[root@uklonesxp1 updates]# rpm -q scli
scli-1.7.0-12
</pre>

==== Update QLogic HBA Firmware ====
See the QLogic website for the latest version; you must ensure the firmware version is compatible with the current running driver version. Requires SANsurfer to be installed 1st (see above)

# Put the downloaded zip file on the UKLONVCP1 NFS Share, eg q231x_234x_bios147.zip, and unzip to a folder
# Create a new folder for the update:
#* <code> mkdir /var/updates/q231x_234x_bios147 </code>
# Copy the firmware onto the ESX server:
#* <code> cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/q231x_234x_bios147/QL23ROM.BIN /var/updates/q231x_234x_bios147/ </code>
# Move to the folder containing the update:
#* <code> cd /var/updates/q231x_234x_bios147/ </code>
# Start the SANsurfer utility
#* <code> scli </code>
# Go into the '''HBA Utilities''' option
# Select the '''Save Flash''' option
# Follow the prompts to save the flash to a backup file, eg BackupROM.bin
# Select the '''Update Flash''' option
# Follow the prompts to update the flash, using the file copied to the ESX, eg QL23ROM.BIN
<pre>
Enter a file name or Hit <RETURN> to abort: QL23ROM.BIN
Updating flash on HBA 0 - QLA2340 . Please wait...
Option ROM update complete. Changes have been saved to the HBA 0.
Please reboot the system for the changes to take effect.
Updating flash on HBA 1 - QLA2340 . Please wait...
Option ROM update complete. Changes have been saved to the HBA 1.
Please reboot the system for the changes to take effect.
</pre>

=== SAN Downtime ===
ESX's don't like to lose the SAN, to the extent that during scheduled SAN downtime the following is recommended...
# Shutdown ESX's (and hosted VM's) connected to affected storage
# Perform SAN maintenance
# Restart ESX's (and hosted VM's)
If the above is not possible then it's recommended that you...
# Migrate away/shutdown VM's that are hosted on affected storage
# Un-present LUN's
# Rescan LUN's from the ESX and confirm they disappear (any VM's on removed storage will become greyed-out)
# Perform SAN maintenance
# Re-present LUN's
# Re-scan LUN's from the ESX and confirm that they re-appear (greyed-out VM's should ''reconnect'')
# Restart / migrate VM's

== Netflow ==
'''Netflow is available on ESX v3 only, and is an experimental feature. Netflow v5 is sent.'''

* '''To start Netflow'''
*# Load the module
*#* <code> vmkload_mod netflow </code>
*# Configure monitoring of the appropriate vSwitch's to the Netflow collector IP and port
*#* <code> /usr/lib/vmware/bin/vmkload_app -S -i vmktcp /usr/lib/vmware/bin/net-netflow -e vSwitch0,vSwitch1 10.20.255.31:2055 </code>
** To reconfigure the Netflow module you must stop and restart the module

* '''To confirm running'''
*# Check the module is running...
*#* <code> [root@esx1 root]# vmkload_mod -l | grep netflow </code>
*#* <code> netflow 0x9b4000 0x3000 0x298b640 0x1000 16 Yes </code>
*# Check the correct config is running...
*#* <code> [root@esx1 root]# ps -ef | grep netflow </code>
*#* <code> root 2413 1 0 Feb05 ? 00:00:00 /usr/lib/vmware/bin/vmkload_app -S -i vmktcp /usr/lib/vmware/bin/net-netflow -e vSwitch0,vSwitch1 10.20.255.31:2055 </code>

* '''To stop Netflow'''
*# <code> ps -ef | grep netflow </code>
*# <code> kill <pid> </code>
*# <code> vmkload_mod -u netflow </code>

== Change Service Console IP Information ==
Logged in as root, use the esxcfg-vswif command: <code> esxcfg-vswif <options> [vswif] </code>

Description: Creates and updates service console network settings. This command is used if you cannot manage the ESX Server host through the VI Client because of network configuration issues.

Note that the -l command will display the name(s) of the virtual interfaces, which must be specified on the other commands, so the trailing [vswif] is not optional on most commands.

Options:
 -a  Add vswif, requires IP parameters. Automatically enables interface.
 -d  Delete vswif.
 -l  List configured vswifs.
 -e  Enable this vswif interface.
 -s  Disable this vswif interface.
 -p  Set the portgroup name of the vswif.
 -i <x.x.x.x> or DHCP  The IP address for this vswif, or specify DHCP to use DHCP for this address.
 -n <x.x.x.x>  The IP netmask for this vswif.
 -b <x.x.x.x>  The IP broadcast address for this vswif (not required if netmask and IP are set).
 -c  Check to see if a virtual NIC exists. Program outputs a 1 if the given vswif exists, 0 otherwise.
 -D  Disable all vswif interfaces. (WARNING: This may result in a loss of network connectivity to the Service Console)
 -E  Enable all vswif interfaces and bring them up.
 -r  Restore all vswifs from the configuration file. (Internal use only)
 -h  Displays command help.

Note: You set the Service Console default gateway by editing the /etc/sysconfig/network file or through the VI Client under Configuration, DNS & Routing.

Note: You set the Service Console VLAN (to 1234) using a command similar to: <code> esxcfg-vswitch -v1234 -p"Service Console" vSwitch0 </code>
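
For example, to re-IP the default service console interface (the addressing is a placeholder; adjust to suit):
<pre>
# list current vswif's, then set a new IP and netmask on vswif0
esxcfg-vswif -l
esxcfg-vswif -i 192.168.10.15 -n 255.255.255.0 vswif0
</pre>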

== Change Timezone ==
# Log into the ESX Server service console as root.
# Find the desired time zone under the directory /usr/share/zoneinfo
# Edit <code> /etc/sysconfig/clock </code> to show the relative path to the file representing the new time zone, and ensure that UTC and ARC are set as shown:
#* <code> ZONE="Etc/GMT" </code>
#* <code> UTC=true </code>
#* <code> ARC=false </code>
# Copy the desired time zone file to /etc/localtime
#* <code> cp /usr/share/zoneinfo/GMT /etc/localtime </code>
# Confirm that /etc/localtime has been updated with the correct zoneinfo data by comparing it to the zoneinfo file used in step 2; if the files are identical, your prompt will return without any output.
#* <code> diff /etc/localtime /usr/share/zoneinfo/GMT </code>
# Confirm the system and hardware clocks are correct. Use the Linux date command to check, and set the correct time if necessary.
#* Set the system clock to the local date and time: <code> date MMDDhhmmYYYY </code>
# Update the hardware clock with the current time of the system clock:
#* <code> /sbin/hwclock --systohc </code>

= Troubleshooting =
If all else fails you can always raise a [[VMware Service Request]]

== Useful paths / logfiles ==
'''Timestamps in logfiles are in UTC!'''

=== ESX ===
{|class="vwikitable"
|-
! Item !! Path !! Comments
|-
| Vmkernel logfile || <code> /var/log/vmkernel </code> || Pretty much everything seems to be recorded here
|-
| Vmkernel warnings || <code> /var/log/vmkwarning </code> || Virtual machine warnings
|-
| Host Daemon logfile || <code> /var/log/vmware/hostd.log </code> || Services log
|-
| vCentre Agent logfile || <code> /var/log/vmware/vpx/vpxa.log </code> || vCentre agent
|-
| Local VM files || <code> /vmfs/volumes/storage </code> || ''storage'' name can vary, use TAB so the shell selects what's available
|-
| SAN VM files || <code> /vmfs/volumes/SAN </code> || ''SAN'' will vary depending on what you've called your storage
|-
| HA agent logs || <code> /opt/LGTOaam512/log/ </code> || Various logs of limited use - deprecated
|-
| HA agent log || <code> /var/log/vmware/aam/agent/run.log </code> || Main HA log
|-
| HA agent install log || <code> /var/log/vmware/aam/aam_config_util_install.log </code> || HA install log
|}
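
When watching for a problem as it happens, it can be handy to follow the main log live from the service console:
<pre>
# follow the vmkernel log as entries are written
tail -f /var/log/vmkernel
</pre>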

=== ESXi ===
To view logfiles from an ESX'''i''' server, assuming you don't have SSH access, they need to be downloaded to your client machine 1st, and then viewed from there...
# Using the VI Client, go to '''File | Export | Export System Logs...'''
#* Tick the appropriate object
#* Untick ''Include information from vCenter Server and vSphere Client'', unless you additionally want this info
# Once exported, uncompress the ESX's tgz file
However, this is most easily achieved if you've got the PowerCLI installed, in which case see [[VI_Toolkit_(PowerShell)#ESXi_Logs|ESXi Logs via PowerCLI]]

{|class="vwikitable"
|-
! Name !! PowerCLI Key !! Diagnostic Dump Path !! Comments
|-
| Syslog || <code> messages </code> || <code> /var/log/messages </code> || Equivalent to ESX ''hostd'' and ''vmkernel'' logs combined
|-
| Host Daemon || <code> hostd </code> || <code> /var/log/vmware/hostd.log </code> || Equivalent to ESX ''hostd'' log
|-
| vCenter Agent || <code> vpxa </code> || <code> /var/log/vmware/vpx/vpxa.log </code> ||
|-
| SNMP Config || || <code> /etc/vmware/snmp.xml </code> || Edit via [[RemoteCLI#vicfg-snmp|vicfg-snmp]]
|}

'''Logfiles get lost at restart!''' If you have to restart your ESX (say, because it locked up) there will be no logs prior to the most recent boot. In theory they'll get written to a dump file if a crash is detected, but I've never found them, so assume they're only generated during a semi-graceful software crash.

However, there is a way around this. Messages can be sent to a syslog file (say on a centrally available SAN LUN), a syslog server (in both cases see [http://kb.vmware.com/kb/1016621 VM KB 1016621]), or to a vMA server (see http://www.vmware.com/support/developer/vima/vima40/doc/vma_40_guide.pdf). Be aware that when sending logs over the network (eg to a syslog server) it's quite common for the last few log entries not to be written when an ESX fails; you'll get more complete logs when writing direct to a file.

=== ESXi Tech Support Mode ===
There's no Service Console on ESXi, so you have to do without. Well, almost - there is the ''unsupported'' Tech Support Mode, which is a lightweight Service Console. To enable...

'''ESXi 3.5 and 4.0'''
# Go to the local ESXi console and press Alt+F1
# Type '''unsupported'''
# Blindly type the root password (yes, there's no prompt)
# Edit <code> /etc/inetd.conf </code> and uncomment (remove the #) from the line that starts with <code> #ssh </code>, and save
# Restart the management service <code> /sbin/services.sh restart </code>

'''ESXi 4.1'''
# Go to the local ESXi console and press F2
# Enter the root user and pass
# Go to '''Troubleshooting Options'''
# Enable '''Local Tech Support''' or '''Remote Tech Support (SSH)''' as required
Alternatively...
# From the vSphere Client, select the host and click the '''Configuration''' tab
# Go to '''Security profile > Properties'''
# Select '''Local Tech Support''' or '''Remote Tech Support (SSH)''' and click the '''Options''' button
# Choose the '''Start automatically''' startup policy, click '''Start''', and then '''OK'''

== ESXTOP ==
{|class="vwikitable"
|-
! Key !! Change View !! Key !! Sort by
|-
| rowspan="3" | '''<code> c </code>''' || rowspan="3" | '''ESX CPU'''
| <code> U </code> || % CPU Used
|-
| <code> R </code> || % CPU Ready
|-
| <code> N </code> || Normal / default
|-
| rowspan="3" | '''<code> m </code>''' || rowspan="3" | '''ESX Memory'''
| <code> M </code> || Memsz
|-
| <code> B </code> || Mctlsz
|-
| <code> N </code> || Normal / default
|-
| rowspan="5" | '''<code> d </code>''' || rowspan="5" | '''ESX Disk Adapter'''
| <code> r </code> || Reads/sec
|-
| <code> w </code> || Writes/sec
|-
| <code> R </code> || Read MB/sec
|-
| <code> T </code> || Write MB/sec
|-
| <code> N </code> || Normal / default
|-
| rowspan="5" | '''<code> u </code>''' || rowspan="5" | '''ESX Disk Drive/LUN'''
| <code> r </code> || Reads/sec
|-
| <code> w </code> || Writes/sec
|-
| <code> R </code> || Read MB/sec
|-
| <code> T </code> || Write MB/sec
|-
| <code> N </code> || Normal / default
|-
| rowspan="5" | '''<code> v </code>''' || rowspan="5" | '''VM Disk'''
| <code> r </code> || Reads/sec
|-
| <code> w </code> || Writes/sec
|-
| <code> R </code> || Read MB/sec
|-
| <code> T </code> || Write MB/sec
|-
| <code> N </code> || Normal / default
|-
| rowspan="5" | '''<code> n </code>''' || rowspan="5" | '''ESX NIC'''
| <code> t </code> || Transmit Packet/sec
|-
| <code> r </code> || Receive Packet/sec
|-
| <code> T </code> || Transmit MB/sec
|-
| <code> R </code> || Receive MB/sec
|-
| <code> N </code> || Normal / default
|}

== CPU ==
=== Poor performance ===
Basic things to check are that the VM, or the ESX it's hosted on, isn't saturating its available CPU. However, if VM's are performing sluggishly and/or are slow to start, despite not appearing to be using excessive CPU time, further investigation is required...

* Use <code>esxtop</code> on the ESX service console. Look at Ready Time (%RDY), which is how long a VM is waiting for CPUs to become available.
* Alternatively look for CPU Ready in the performance charts. Here it's measured in msec, over the normal 20 sec sampling interval.
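
To capture figures for offline analysis, esxtop can also be run in batch mode; a sketch (20 second samples, 90 iterations, so roughly 30 minutes of data):
<pre>
# -b batch mode, -d delay between samples (secs), -n number of samples
esxtop -b -d 20 -n 90 > /tmp/esxtop-capture.csv
</pre>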

CPU Ready can creep up if the system is pushed, or if the VM has multiple CPUs (as it needs multiple physical CPUs to become available at the same time, aka CPU Co-Scheduling). Multiple CPU's are especially a problem in environments where there are large numbers of SMP VM's.

{|class="vwikitable"
|-
! % CPU Ready !! MSec CPU Ready !! Performance
|-
| < 1.25 % || < 250 msec || Excellent
|-
| < 2.5 % || < 500 msec || Good
|-
| < 5 % || < 1000 msec || Acceptable
|-
| < 10 % || < 2000 msec || Poor
|-
| > 15 % || > 3000 msec || Bad
|}

CPU Co-Scheduling is more relaxed in ESX4 than ESX3, due to changes in the way that differences in the progress of separate vCPU's within a single VM are calculated. This means the detrimental effect on pCPU efficiency of having multi-CPU VM's is reduced (but ''not'' eliminated). See http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf for further info.

== Storage ==
=== Poor throughput ===
Use <code>esxtop</code> on the service console and switch to the disk monitor. Enable the views for latency; you will see values like GAVG, KAVG and DAVG.
* '''GAVG''' is the total guest-experienced latency on IO commands, averaged over 2 seconds
* '''KAVG''' is the vmkernel/hypervisor IO latency, averaged over 2 seconds
* '''DAVG''' is the device (HBA) IO latency averaged over the last 2 seconds (will include any latency at a lower level, eg SAN)

Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO. As a rough guide to indicate if there's a problem or not...
{|class="vwikitable"
|-
! Latency up to !! Status
|-
| 2 ms || Excellent - look elsewhere
|-
| 10 ms || Good
|-
| 20 ms || Reasonable
|-
| 50 ms || Poor / Busy
|-
| higher || Bad
|}

=== Storage Monitor Log Entries ===
How to decode the following type of entries...
 Sep 3 15:15:14 esx1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
 Sep 3 15:15:32 esx1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1

The status message consists of the following four decimal and hex blocks...
{|class="vwikitable"
|-
|''Device Status'' / ''Host Status'' || ''Sense Key'' || ''Additional Sense Code'' || ''Additional Sense Code Qualifier''
|}

...or in the more recent format (ESX v3.5 Update 4 and above)...
 Mar 2 10:04:44 vmkernel: 81:00:00:15.893 cpu8:3258649)StorageMonitor: 196: vmhba0:0:4:0 status = D:0x28/H:0x0 0x0 0x0 0x0
 Mar 2 10:04:44 vmkernel: 81:00:00:15.893 cpu8:4633964)StorageMonitor: 196: vmhba0:0:4:0 status = D:0x28/H:0x0 0x0 0x0 0x0

{|class="vwikitable"
|-
|<code>D:</code>''Device Status'' / <code>H:</code>''Host Status'' || ''Sense Key'' || ''Additional Sense Code'' || ''Additional Sense Code Qualifier''
|}

Where the ESX Device and SAN Host status' mean...
{|class="vwikitable"
|-
! Decimal !! Device Status !! Host Status !! Comments
|-
| 0 || No Errors || Host_OK ||
|-
| 1 || || Host No_Connect ||
|-
| 2 || Check Condition || Host_Busy_Busy ||
|-
| 3 || || Host_Timeout ||
|-
| 4 || || Host_Bad_Target ||
|-
| 5 || || Host_Abort ||
|-
| 6 || || Host_Parity ||
|-
| 7 || || Host_Error ||
|-
| 8 || Device Busy || Host_Reset ||
|-
| 9 || || Host_Bad_INTR ||
|-
| 10 || || Host_PassThrough ||
|-
| 11 || || Host_Soft_Error ||
|-
| 24 || Reservation Conflict || || 24/0 indicates a locking error, normally caused by too many ESX's mounting a LUN, wrong config on the storage array, or too many VM's on a LUN
|-
| 28 || Queue full / Task set full || || Indicates the SAN is busy handling writes and is passing back notification of such when asked to handle more data
|}

Where the Sense Keys mean...
{|class="vwikitable"
|-
! Hex !! Sense Key
|-
| 0x0 || No Sense Information
|-
| 0x1 || Last command completed but used error correction
|-
| 0x2 || Unit Not Ready
|-
| 0x3 || Medium Error
|-
| 0x4 || Hardware Error
|-
| 0x5 || ILLEGAL_REQUEST (Passive SP)
|-
| 0x6 || LUN Reset
|-
| 0x7 || Data_Protect - Access to data is blocked
|-
| 0x8 || Blank_Check - Reached an unexpected region
|-
| 0xa || Copy_Aborted
|-
| 0xb || Aborted_Command - Target aborted command
|-
| 0xc || Comparison for SEARCH DATA unsuccessful
|-
| 0xd || Volume_Overflow - Medium is full
|-
| 0xe || Source and Data on Medium do not agree
|}

The Additional Sense Code and Additional Sense Code Qualifier mean...
{|class="vwikitable"
|-
! Hex !! Sense Code
|-
| 0x4 || Unit Not Ready
|-
| 0x3 || Unit Not Ready - Manual Intervention Required
|-
| 0x2 || Unit Not Ready - Initializing Command Required
|-
| 0x25 || Logical Unit Not Supported (eg LUN doesn't exist)
|-
| 0x29 || Device Power on or SCSI Reset
|}

For further info on sense codes see http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm
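
As a worked example, the first log line above, <code>status = 2/0 0x6 0x2a 0x1</code>, decodes (using the tables above) as Device Status 2 (Check Condition) / Host Status 0 (Host_OK), Sense Key 0x6 (LUN Reset), with Additional Sense Code/Qualifier 0x2a/0x1 (not in the table above, but in the standard SCSI sense tables this is ''mode parameters changed'') - ie the device flagged a check condition following a LUN reset, rather than a hard failure.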

=== Recovering VM's from failed storage ===
This procedure was generated from an occasion where the ESX software was installed on top of the shared SAN VMFS storage. The VM files still existed, so the VM's continued to run, but as the file system index no longer existed, the vmdk's etc were orphaned and would be lost if the VM's were restarted. It could be adapted to suit any situation where the ESX datastore is corrupted, you cannot power on VM's, and rebooting a VM would lose it. However, it's well worth calling VMware support before carrying this out; they may be able to provide an easier solution.

# On each VM
## Shut down running applications
## Install VMware Converter (Typical install, all default options)
## Hot migrate the local VM to a new VM on new storage
### As VMware Converter starts, select '''Continue in Starter Mode'''
### Select '''Import Machine''' from the bottom of the initial screen
### Select source as '''Physical Machine''', then on the next screen '''This local machine'''
### Select default options for the source disk
### Select '''VMware ESX server...''' as your destination
### Enter the ESX hostname, and root user/pass
### Enter the new VM name, eg ''myserver''-recov (not the same as the existing; it will let you do it, but the VC isn't happy later on)
### Select host
### Select datastore
### Select network and uncheck '''Connect at power on...'''
### Don't select power on after creation, and let the migration run
## Reconfig the new VM, editing its settings as follows
##* Floppy Drive 1 --> Client Device
##* CD/DVD Drive 1 --> Client Device
##* Parallel Port 1 --> Remove
##* Serial Port 1 --> Remove
##* Serial Port 2 --> Remove
##* USB Controller --> Remove
## Power up the new VM and check it over
## Power off the old VM (you will lose it forever, be very sure the new VM is good)
## Connect the network of the new VM
## Delete the old VM
# Delete the knackered SAN datastore and refresh on all other ESX's that share it (deletes the name but doesn't free up any space)
# Create a new SAN datastore (this formats the old space)
# Refresh on all other ESX's that share the datastore
# Shutdown all the new VM's
# Clone them to the new SAN datastore using the original name (eg ''myserver'')
# Power up the new VM's on the SAN datastore, confirm OK, then delete the ''myserver''-recov servers

=== Recover lost SAN VMFS partition ===
EG After a power-down, ESX's can see the SAN storage, but the VMFS cannot be found in the Storage part of the ESX config, even after a Refresh. To fix, the VMFS needs to be resignatured...

'''Do not attempt to ''Add Storage'' to recover the VMFS, this will format the partition'''

# On one of the ESX's, in Advanced Settings, change LVM.EnableResignature to 1
# '''Refresh''' Storage, the VMFS should be found with a new name, something like snap-000000002-''OriginalName''
# '''Remove from Inventory''' all VM's from the old storage, the old storage should disappear from the list of datastores
# Rename the found storage to the original name
# '''Refresh''' Storage on all other ESX's, they should see the VMFS again
# Revert LVM.EnableResignature on the appropriate ESX
# Via the ESX, browse the datastore and re-add the VM's to the inventory (right-click over the .vmx file)
#* For a Virtual Machine Question about what to do about a UUID, select Keep

=== USB / SD Hypervisor Checks ===
USB and SD cards are notorious for causing problems. Especially USB sticks, which were designed for occasional-access storage, and not to be used as repetitively as they are when running the ESXi hypervisor. The SD cards may well be tarnished with the shadow of USB. In order to perform a disk check, use the following...

'''Assumes you're running ESXi4; if using ESXi3 use this procedure (from which this section is adapted): http://www.vm-help.com/esx/esx3i/check_system_partitions.php'''

Firstly a quick overview of the partitions...
 /vmfs/volumes/Hypervisor1   /bootbank      Where the ESX boots from
 /vmfs/volumes/Hypervisor2   /altbootbank   Used during ESX updates
 /vmfs/volumes/Hypervisor3   /store         VMTools ISO's etc
Everything else in an ESXi server is stored on the scratch disk, or is created at boot in a ramdisk.

Run <code> fdisk -l </code> to list the available partitions on the USB/SD card (you'll also see your SAN partitions as well)...
<pre>
Disk /dev/disks/mpx.vmhba32:C0:T0:L0: 8166 MB, 8166309888 bytes
64 heads, 32 sectors/track, 7788 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

Device Boot Start End Blocks Id System
/dev/disks/mpx.vmhba32:C0:T0:L0p1 5 900 917504 5 Extended
/dev/disks/mpx.vmhba32:C0:T0:L0p4 * 1 4 4080 4 FAT16 <32M
/dev/disks/mpx.vmhba32:C0:T0:L0p5 5 254 255984 6 FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p6 255 504 255984 6 FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p7 505 614 112624 fc VMKcore
/dev/disks/mpx.vmhba32:C0:T0:L0p8 615 900 292848 6 FAT16
</pre>

The two partitions with the identical number of blocks are /bootbank and /altbootbank; perform a check disk on these:
 dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:5
 dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:6

To perform a verification pass use -V, or to test for bad sectors use -t (with which you also need to include the -a (automatically repair) or -r (interactively repair) options):
 dosfsck -V /dev/disks/mpx.vmhba32:C0:T0:L0:5
 dosfsck -t -r /dev/disks/mpx.vmhba32:C0:T0:L0:5

=== Unable to Add RDM ===
Basic steps to add an RDM are...
# Provision the LUN on the SAN
# Rescan LUN's on the ESX
# Add the RDM to the VM

'''.vmdk is larger than the maximum size supported by datastore'''
* Normally this error is misleading and really means that the RDM can't be created due to an untrapped reason. It does not mean that there is not enough space to create the (very small) RDM mapping file on the VMFS!

# Double-check that the LUN has been properly created and is available
# Attempt to add the disk as a new VMFS to an ESX (cancel at the last part of the wizard)
# Then re-attempt to add the RDM to the VM

== High Availability ==
'''Be aware that playing with HA can have disastrous effects, especially if the ''Isolation Response'' of your cluster is set to ''Power Off'''''. If you can, consider waiting until outside of production hours before trying to resolve a problem. Unstable clusters can disintegrate if you're unlucky.

There are 5 primaries in an HA cluster; the first ESX's to join the cluster become primaries. This only changes (through an election) when the following occurs (note: not during an ESX failure)...
* Primary ESX goes into Maintenance Mode
* Primary disconnected from the cluster
* Primary removed from the cluster
* Primary reconfigured for HA

It's quite common for HA to go into an error state; the normal course of action is to use the '''Reconfigure for HA''' option for the ESX that's experiencing the problem. This reinstalls the HA agent onto the ESX. It's also common to have to do this a couple of times for it to be successful. Other things to try...
* Restart the HA process - see [[#High_Availability_Stop.2FStart|High Availability Stop/Start]]
* [[#Manually Deinstall|Deinstall HA and VPXA]] and reinstall

HA is very dependent on proper DNS. To check everything is in order, do the following from each ESX. Some versions of ESX3 are sensitive to case - always use lower case: the FQDN of the ESX's should be lower case, and the VC's FQDN and domain suffix search should be lower case.
# Check that the hostname/IP of the local ESX is as expected
#* <code> hostname </code>
#* <code> hostname -s </code>
#* <code> hostname -i </code>
#* If not, check the following files
#** <code> /etc/hosts </code>
#** <code> /etc/sysconfig/network </code>
#** <code> /etc/vmware/esx.conf </code>
# Check that HA can properly resolve other ESX's in the cluster (note: only one IP address should be returned)
#* <code> /opt/vmware/aam/bin/ft_gethostbyname <my_esx_name> </code>
# Check that HA can properly resolve the vCentre
#* <code> /opt/vmware/aam/bin/ft_gethostbyname <my_vc_name> </code>
# Check the vCentre server can properly resolve the ESX names
# Check the vCentre's FQDN and DNS suffix search are correct and lower case

If you need to correct DNS names, don't be surprised if you need to reinstall HA and VPXA; it can be done without interrupting running VM's, but it's obviously a lot less stressful not to.

=== Manually Deinstall ===
# Put the ESX into maintenance mode
# Disconnect the ESX from the Virtual Centre
# SSH to the ESX server (or use [[#ESXi_Tech_Support_Mode|ESXi Tech Support Mode]])
# <code> cd /opt/vmware/uninstallers </code>
# <code> ./VMware-vpxa-uninstall.sh </code>
# <code> ./VMware-aam-ha-uninstall.sh </code>
# Reconnect the ESX to the VC
# Take out of maintenance mode

If the VC Agent or HA Agent uninstall fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt. Especially if installed on a USB key - consider replacing ASAP.

=== Command Line Interface etc ===
'''''Using the commands in this section isn't supported by VMware'''''

To start the CLI run the following command...
* <code> /opt/vmware/aam/bin/Cli </code>

The interface is a bit dodgy - you can enter the same command twice, and it'll be rejected one time and accepted another; patience is required.

{|class="vwikitable"
|-
! Command !! Comments
|-
| <code> ln </code> || List cluster nodes and their status
|-
| <code> addNode <hostname> </code> || Add ESX/node to cluster (use the ESX's short DNS name)
|-
| <code> promoteNode <hostname> </code> || Promote existing ESX/node to be a primary
|-
| <code> demoteNode <hostname> </code> || Demote existing ESX/node to be a secondary
|}

There are also the following scripts to be found (in <code> /opt/vmware/aam/bin </code>), which behave as you'd expect...
* <code> ./ft_setup </code>
* <code> ./ft_startup </code>
* <code> ./ft_shutdown </code>

=== Error Hints ===
'''Host in HA Cluster must have userworld swap enabled'''
* ESXi servers need to have scratch space enabled
# In vCentre, go to the '''Advanced Settings''' of the ESX
# Go to '''ScratchConfig''' and locate <code> ScratchConfig.ConfiguredScratchLocation </code>
# Set to a directory with sufficient space (1GB); it can be configured on local storage or shared storage, the folder must exist and be dedicated to the ESX (delete the contents if you've rebuilt the ESX)
#* Format <code> /vmfs/volumes/<DatastoreName> </code>
#* EG <code> /vmfs/volumes/SCRATCH-DISK/my_esx </code>
#* Locate <code> ScratchConfig.ConfiguredSwapState </code> and set it
# Bounce the ESX

'''Unable to contact primary host in cluster'''
* The ESX is unable to contact a primary ESX in the cluster - some kind of networking issue
** If there are no existing HA'ed ESX's, start by looking at the vCentre's networking (for example inconsistent domain names, including case)

''':cmd remove failed:'''
HA failed to uninstall properly prior to being reinstalled; try to manually deinstall HA as per [[#Manually_Deinstall|these instructions]]. This can be indicative of a dying USB key (if your ESX is installed on a USB key), so fingers crossed.

== Snapshots ==
http://geosub.es/vmutils/Troubleshooting.Virtual.Machine.snapshot.problems/Troubleshooting.Virtual.Machine.snapshot.problems.html

See also [[Virtual_Machines#Can.27t_Snapshot|Virtual Machines Snapshot Troubleshooting]]

== Random Problems ==
=== ESXi Lockup ===
Affects ESXi v3.5 Update 4 ''only''. Caused by a problem with updated CIM software in Update 4.

* Workaround
** Disable CIM (disables hardware monitoring) by setting <code>Advanced Settings | Misc | Misc.CimEnabled</code> to <code>0</code> (restart to apply)
* Fix
** Apply patch ESXe350-200910401-I-SG, see http://kb.vmware.com/kb/1014761

For further info see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012575

=== Cimserver High CPU ===
Caused by problems with the VMware CIM server software. However, it can also be caused by other problems making it go nuts (check VMKernel logs, etc).

* Restart
** <code> service pegasus restart </code>

=== Log Bundles Creation Fails ===
ESX log bundle creation fails, either via the VI Client or via <code>vm-support</code>

# SSH to the ESX
# Run <code>vm-support</code> to try to create a log bundle

* '''Could not copy...Have you run out of disk space?'''
** ESX - Check that there's space to be able to write in <code>/tmp</code>
** ESXi - Check that the ESX has been configured with a scratch disk, and that it has space
* '''tar: write error: Broken pipe'''
** ESXi - Check that the ESX has been configured with a scratch disk

[[Category:VMware]]