Build Notes

Installation

ESX3 Installation - Example, based on an old ESX v3 build guide
HeavyLoad - Load tester (run it in a test VM; the memory test doesn't really work because ESX page sharing kicks in)

USB Image

If you're installing ESXi 4 then you don't need to do this; the installer will detect the USB stick and install to it.

Required software etc...

WinImage - http://www.winimage.com/download.htm
DD - http://www.chrysocome.net/dd
ESXi install ISO
Disk Cloner, eg G4U - http://www.feyrer.de/g4u/
- Ideally use a cloner that ignores the actual disk contents and does a block-by-block copy; anything that tries to interpret the disk image may not copy it faithfully
You must be able to connect two image files remotely to your server: a disk cloner CD ISO and the USB image ISO (hint: use the floppy drive).

Creating the USB image file

Open up the ISO with WinImage
Extract the INSTALL.TGZ from the ISO
Uncompress INSTALL.TGZ and locate .\INSTALL\usr\lib\vmware\installer\VMware-VMvisor-big-3.5.0_Update_4-153875.i386.dd.bz2
Uncompress VMware-VMvisor-big-3.5.0_Update_4-153875.i386.dd.bz2 so that you have VMware-VMvisor-big-3.5.0_Update_4-153875.i386.dd
Create an ISO image from the dd image using dd
- dd bs=1M if=VMware-VMvisor-big-3.5.0_Update_4-153875.i386.dd of=esx3.5ihp-usbimage.iso

Deploying the USB image file

Attach your disk cloner image to your server and boot
Once the server is booting from the CD ISO, attach the USB ISO
List the available disks
- list
Identify the image disk (which is 750MB) and the USB disk (which will be whatever size your USB key is)
Copy the image to the USB key
- copydisk sd1 sd0
Disconnect all images, reboot server, cross fingers
- reboot

Build Numbers

ESX version	ESX	ESXi
3.5 Update 1	82663	82664
3.5 Update 2	110268	110271
3.5 Update 3	123630	123629
3.5 Update 4	153875	153875
3.5 Update 5	207095	207095
4.0	164009
4.0 Update 1	208167	208167
4.0 Update 2	261974	261974
4.1	260247	260247

VMware CLI

Especially if using ESXi, you'll need to install the VMware CLI on any machine you want to access the ESX command line from. Be aware that ActivePerl gets installed as well, so proceed with caution if you've already got Perl installed on the machine.

Security Hardening

Service Console

Disk Partitions

Suggested partition sizing for the Service Console on local disk, to prevent the root partition being filled with user data

part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
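
As a quick sanity check before using a layout like the one above, the partition sizes can be totalled to confirm the local disk is big enough. The sketch below is illustrative only: the function name is hypothetical, and it assumes every size is given in MB via --size, as in the lines above.

```python
import re

# Partition lines copied from the layout above.
PARTITIONS = """\
part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
"""


def total_size_mb(kickstart_text):
    """Sum every --size value (assumed to be in MB) found in the text."""
    sizes = re.findall(r"--size[ =](\d+)", kickstart_text)
    return sum(int(size) for size in sizes)


print("Total: %d MB" % total_size_mb(PARTITIONS))
```

With the sizes above this comes to 20580 MB, so allow a little over 20 GB of local disk for the Service Console partitions.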

Local Accounts

Password Policy

No policy is implemented by default. If you're not using AD integration then it's sensible to apply a policy on the ESX using the PAMQC module, though it's not particularly elegant.

Active Directory Integration

Because service console authentication is Unix-based, it cannot use Active Directory to define user accounts. However, it can use Active Directory to authenticate users, by matching the account name in the local passwd file against Active Directory, with appropriate support from SFU (Services For Unix).

See Scott Lowe's blog for further info

Sudo

It is possible to limit the enhanced privileges that a user can gain by using sudo. This is most appropriate where there is a large number of admins. However, such an environment is also likely to have a large number of ESX's, and managing the sudo config on each ESX is a headache.

Example of possible sudo config (/etc/sudoers)

...
# Defaults specification
Defaults logfile=/var/log/sudolog

# User privilege specification
root    ALL=(ALL) ALL
User_Alias VI_JR_ADMINS=esxoper, esxoper2
User_Alias VI_ADMINS=esxadmin

Cmnd_Alias STOP=/usr/sbin/shutdown, /usr/sbin/halt, /usr/sbin/poweroff 
Cmnd_Alias REBOOT=/usr/sbin/reboot
Cmnd_Alias KILL=/usr/bin/kill 
Cmnd_Alias NTP=/usr/sbin/ntpdate, /sbin/hwclock 

VI_JR_ADMINS ALL=STOP, REBOOT, KILL, NTP
VI_ADMINS ALL=(ALL) ALL
...

Logging

It is recommended to compress and increase the maximum log file size by modifying the configuration files in the /etc/logrotate.d directory and the /etc/logrotate.conf file.

For example, changing vmkwarning to be 2096k in size, and compressed...

[root@dtcp-esxsvce01b root]# more /etc/logrotate.d/vmkwarning
/var/log/vmkwarning{
    create 0600 root root
    missingok
    compress
    size 2096k
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

...and changing the relevant part of /etc/logrotate.conf to allow compression...

...
# uncomment this if you want your log files compressed
compress

...

Finally, it's worth redirecting sudo log activity to /var/log/sudolog; see the above section on sudo.

Banners

There are three modes of direct management access to an ESX: web, SSH, and direct (local) console.

Web Access

Edit the html page /usr/lib/vmware/hostd/docroot/index.html

SSH

Edit the /etc/ssh/sshd_config file so that it knows to display a defined banner file during login...

Banner /etc/banner

Create the banner file with the appropriate contents.

Console

Prepend your banner to the /etc/issue file

ESX

Network Settings

Setting	Default	Preferred	Explanation
Promiscuous Mode	Reject	Reject	Principally used in situations where you need to perform a network traffic capture (sniff). Data from all ports propagates to all ports (the VM port group becomes a hub rather than a switch)
MAC address changes	Accept	Reject	There are situations where MAC Address Changes must be set to Accept, for example legacy applications, clustered environments, and licensing. Legacy applications may require a specific MAC address to be used for the application. Microsoft Clusters utilize an artificial MAC address for all servers in the cluster
Forged Transmits	Accept	Reject	The setting affects traffic transmitted from a virtual machine. If this option is set to reject, the virtual switch compares the source MAC address being transmitted by the operating system with the effective MAC address for its virtual network adapter to see if they are the same. If the MAC addresses are different, the virtual switch drops the frame. The guest operating system will not detect that its virtual network adapter cannot send packets using the different MAC address. To protect against MAC address impersonation, all virtual switches should have forged transmissions set to reject

Configuration Considerations

Hardware

CPU

Feature	Set to	Intel name	AMD name
Node Interleaving	Disabled (allows NUMA operation)
Execute Protection	Enabled	eXecute Disable (XD)	No-Execute Page-Protection
Virtualisation assist	Enabled	Intel VT	AMD-V

CPU Power vs Performance

If in doubt, set the server BIOS to maximum performance. This ensures that ESX can get the most out of the hardware; allowing the BIOS to balance or use low power modes may impact VM performance. ESX's are expected to work hard (that's how they save you money), so they should be set up to be able to perform. In theory, allowing the motherboard to throttle back the CPUs when under low load shouldn't cause a problem.

When using ESX4.1 or higher, set the BIOS to allow the OS (ie ESX) control of CPU performance (if the setting is available). This allows CPU performance to be controlled dynamically by ESX as it manages VM load (and is configurable through the VI Client).

See VM KB 1018206 - Poor virtual machine application performance may be caused by processor power management settings for further info

HP ASR

Should be disabled.

VMware don’t recommend that we use the HP ASR feature (designed to restart a server in the case of an OS hang), as they’ve come across occasions when an ESX under load will suddenly restart due to ASR time-outs. See VM KB 1010842 - HP Automatic Server Recovery in a VMware ESX Environment for further info.

Networking

Beacon Probing

Should only be used when there are 3 or more physical NIC's assigned to the vSwitch and uplinked to the network switch.

This enables the ESX to properly determine the state of the network during a fault condition. If there are only two uplinks and the beacon gets lost between the two NIC's, the ESX can't know which uplink is faulty, just that there is a fault.

See VM KB 1005577 - What is beacon probing? for further info.

Storage

ESX Installation Sizing

See VM KB 1026500 - Recommended disk or LUN sizes for VMware ESX/ESXi installations

SCSI Resets

When accessing centralised storage via SCSI, VMware recommends the following configuration (only the disabling of SCSI Device Resets is a change from the default). These settings are intended to limit the scope of SCSI Resets, and so reduce contention and overlapping of SCSI commands from different hosts accessing the same storage system.

Disk.UseLunReset set to 1
Disk.UseDeviceReset set to 0

Procedures

Links to VMware KB docs...

VMware KB1026380 - Committing snapshots on ESX/ESXi host from command line

Quick commands

`vmware -v`	ESX3 software version and build
`vmware -l`	ESX4 software version and build
`vm-support -x`	List running VM's
`vmware-cmd -l`	List config files of VM's registered to ESX
`esxcfg-rescan vmhba0`	Perform LUN rescan on vmhba0
`esxcfg-vmhbadevs`	List hba LUN mappings
`esxcfg-mpath -l`	List all LUNS and their paths

ESX Shutdown / Reboot

ESX

Shutdown a host ready for power off
- shutdown -h now
Restart a host
- shutdown -r now

ESXi

Shutdown a host ready for power off
- /bin/host_shutdown.sh
Restart a host, either of
- /bin/host_reboot.sh
- reboot

High Availability Stop/Start

Stop HA...
- /etc/init.d/VMWAREAAM51_vmware stop
Start HA...
- /etc/init.d/VMWAREAAM51_vmware start

VMware Management Agent Restart

ESX

service mgmt-vmware restart
Stopping VMware ESX Server Management services:
  VMware ESX Server Host Agent Services                   [  OK  ]
  VMware ESX Server Host Agent Watchdog                   [  OK  ]
  VMware ESX Server Host Agent                            [  OK  ]
Starting VMware ESX Server Management services:
  VMware ESX Server Host Agent (background)               [  OK  ]
  Availability report startup (background)                [  OK  ]

If this fails to stop the service, you can try to manually kill the processes.

Determine the PID's of the processes
- ps -auxwww | grep vmware-hostd
- which should give you something like the following, in which case the PID's are 2807 and 2825...
- root 2807 0.0 0.3 4244 884 ? S Mar10 0:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u
- root 2825 0.1 12.0 72304 32328 ? S Mar10 1:14 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
- root 13848 0.0 0.2 3696 556 pts/0 R 08:43 0:00 grep vmware-hostd
Kill the PID's using kill -9 pid
- So, for example, kill -9 2807 and kill -9 2825
Then reattempt the service restart
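
Picking the PID's out of the ps output by eye is error-prone when you're in a hurry, so the step above can be sketched in Python as below. This is a sketch under assumptions rather than a supported tool: the function name is hypothetical, and it assumes the usual ps aux column order (user, PID, ..., with the command starting at the 11th field) as seen in the sample output above.

```python
# Sample output copied from the procedure above.
PS_OUTPUT = """\
root 2807 0.0 0.3 4244 884 ? S Mar10 0:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u
root 2825 0.1 12.0 72304 32328 ? S Mar10 1:14 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
root 13848 0.0 0.2 3696 556 pts/0 R 08:43 0:00 grep vmware-hostd
"""


def hostd_pids(ps_output, pattern="vmware-hostd"):
    """Return PIDs of processes matching pattern, ignoring the grep itself."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip short lines and lines that don't mention the pattern.
        if len(fields) < 11 or pattern not in line:
            continue
        # Field 10 is the start of the command; skip our own grep process.
        if fields[10] == "grep":
            continue
        pids.append(int(fields[1]))
    return pids


print(hostd_pids(PS_OUTPUT))
```

The returned PID's are the ones to pass to kill -9.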

To also restart the Virtual Centre Agent, use

service vmware-vpxa restart

ESXi

services.sh restart

VMware Web Access Restart

service vmware-webAccess restart
Stopping VMware ESX Server webAccess:
   VMware ESX Server webAccess                             [FAILED]
Starting VMware ESX Server webAccess:
   VMware ESX Server webAccess                             [  OK  ]

VM Start

On the ESX that currently owns the VM...

Get the VM's config file path
- vmware-cmd -l | grep VM_Name
Start the VM using the path found
- vmware-cmd /vm_path/VM_Name.vmx start
Wait for start-up to complete; if start-up fails check the VM's log
- less /vm_path/vmware.log

Maintenance Mode

To put the ESX into maintenance mode with no access from the Infrastructure Client (VCP) use the following commands - use with caution

Put the ESX into maintenance mode:

vimsh -n -e /hostsvc/maintenance_mode_enter

Check the ESX is in maintenance mode:

vimsh -n -e /hostsvc/runtimeinfo | grep inMaintenanceMode | awk '{print $3}'

Exit maintenance mode:

vimsh -n -e /hostsvc/maintenance_mode_exit

TCPDump Network Sniffer

Basic network sniffer available in Service Console

TCPDump instruction manual

EG To sniff all traffic on the Service Console interface, vswif0, going to/from 159.104.224.70

tcpdump -i vswif0 host 159.104.224.70

Security

Password Complexity Override

In order to be able to change a user (or root) password to one that breaches password complexity checking

Disable PAM module
- esxcfg-auth --usepamqc -1 -1 -1 -1 -1 -1
Disable complexity checker
- esxcfg-auth --usecrack -1 -1 -1 -1 -1 -1
Change password
Re-enable PAM module
- esxcfg-auth --usepamqc=-1 -1 -1 -1 8 8

Regenerate Certificate

You might need to regenerate certificates if you

Change the ESX host name
Accidentally delete the certificates

To generate new Certificates for the ESX Server host...

Change directories to /etc/vmware/ssl.
Create backups of any existing certificates:
- mv rui.crt orig.rui.crt
- mv rui.key orig.rui.key
Restart the vmware-hostd process:
- service mgmt-vmware restart
Confirm that the ESX Server host generated new certificates by executing the following command and comparing the time stamps of the new certificate files with orig.rui.crt and orig.rui.key
- ls -la

NIC Operations

Get NIC Firmware/Driver versions

ESX4
- ethtool -i vmnic<no>
- Where <no> is your NIC no, eg ethtool -i vmnic0
ESXi4
- vsish -e get net/pNics/vmnic<no>/properties
- Where <no> is your NIC no, eg vsish -e get net/pNics/vmnic1/properties

HBA and SAN Operations

VMFS / LUN Addition

The new LUN needs to be carved up and presented to all ESX's that should see it (normally all ESX's from a particular cluster). Once completed, follow the procedure below to add to the ESX's...

Pick ESX in cluster with lowest load
Go to Storage Adapters, hit Rescan... and untick Scan for New VMFS Volumes
Once the scan has completed, go to Storage, and hit Add Storage...
Click Next > to select Disk/LUN storage
Select the appropriate device and click Next >
Check the current disk layout (ie it's blank if it's meant to be) and click Next >
Give the datastore an appropriate name, and click Next >
Select an appropriate block size (this limits maximum VMDK size), and click Next >
Review config and click Finish
On the remaining ESX's, go to Storage Adapters, hit Rescan... (leave both boxes checked)

SAN LUN ID

The SAN LUN ID is used by SAN admin's to identify LUN's. It's not readily available from the GUI and has to be extracted from the vml file...

So from the following...

/vmfs/devices/disks/vml.020006000060060160c6931100cc319eea7adddd11524149442035

you need to extract the middle characters from the vml name. In this example that means skipping the first 10 characters after vml. and taking the next 32 characters, leaving the final 12 characters unused.

So the SAN LUN ID is 60060160c6931100cc319eea7adddd11
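
If you need to do this for a lot of LUN's, the extraction can be scripted. The following is a minimal sketch, assuming the layout seen in the single example above (vml. followed by 10 header characters, then the 32 character LUN ID); the function name and the fixed offsets are assumptions drawn from that one example, so check the result against your SAN admin's records before relying on it.

```python
import posixpath

# Layout assumed from the single example above:
# "vml." + 10 header characters + 32 character LUN ID + trailing characters.
VML_PREFIX = "vml."
HEADER_LEN = 10
LUN_ID_LEN = 32


def san_lun_id(vml_path):
    """Return the 32 character SAN LUN ID embedded in a vml device name."""
    name = posixpath.basename(vml_path)
    if not name.startswith(VML_PREFIX):
        raise ValueError("not a vml device name: %r" % vml_path)
    body = name[len(VML_PREFIX):]
    lun_id = body[HEADER_LEN:HEADER_LEN + LUN_ID_LEN]
    if len(lun_id) != LUN_ID_LEN:
        raise ValueError("vml name too short: %r" % vml_path)
    return lun_id


EXAMPLE = "/vmfs/devices/disks/vml.020006000060060160c6931100cc319eea7adddd11524149442035"
print(san_lun_id(EXAMPLE))
```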

Emulex

Find Emulex HBA Driver and Firmware Version, and WWPN

Doesn't require Emulex HBA utility to be installed

cd /proc/scsi/lpfc
more 1 (for HBA 1)
more 2 (for HBA 2)

The Portname number is the WWPN number used to identify the HBA's by the SAN.

[root@uklonesxp2 lpfc]# more 1
Emulex LightPulse FC SCSI 7.1.14_vmw1
Emulex LightPulse LP1050 2 Gigabit PCI Fibre Channel Adapter on PCI bus 0f device 20 irq 121
SerialNum: BG70569148
Firmware Version: 1.91A1 (M2F1.91A1)
Hdw: 1001206d
VendorId: 0xf0a510df
Portname: 10:00:00:00:c9:61:73:de   Nodename: 20:00:00:00:c9:61:73:de

Link Up - Ready:
   PortID 0x645213
   Fabric
   Current speed 2G
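
When collecting these details from several hosts, the interesting fields can be pulled out of the /proc output with a few regular expressions. This is a rough sketch that assumes the output format shown above; the function name is hypothetical and other driver versions may format the file differently.

```python
import re

# Relevant lines copied from the example output above.
SAMPLE = """\
Emulex LightPulse FC SCSI 7.1.14_vmw1
Firmware Version: 1.91A1 (M2F1.91A1)
Portname: 10:00:00:00:c9:61:73:de   Nodename: 20:00:00:00:c9:61:73:de
"""


def parse_lpfc(text):
    """Extract driver version, firmware version and WWPN (Portname)."""
    driver = re.search(r"Emulex LightPulse FC SCSI\s+(\S+)", text)
    firmware = re.search(r"Firmware Version:\s*(\S+)", text)
    wwpn = re.search(r"Portname:\s*((?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})", text)
    return {
        "driver": driver.group(1) if driver else None,
        "firmware": firmware.group(1) if firmware else None,
        "wwpn": wwpn.group(1) if wwpn else None,
    }


print(parse_lpfc(SAMPLE))
```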

Install Emulex HBA Utility

Can be found at Emulex Lputil.

To install lputil (uses example of lpfcutil-7.1.14);

Put the downloaded tgz file on the ESX server
- EG mkdir /var/updates/Emulex-lpfcutil-7.1.14
Go into folder and extract;
- cd /var/updates/Emulex-lpfcutil-7.1.14/
- tar -xvzf Emulex-lpfcutil-7.1.14.tgz
Install;
- ./Install.sh

[root@uklonesxp2 Emulex-lpfcutil-7.1.14]# ./Install.sh
Installing Emulex HBAAPI libraries and applications...
Installation of Emulex HBAAPI libraries and utilities is completed.

Start the utility (on startup it should detect one or more HBA's);
/usr/sbin/lpfc/lputil

LightPulse Common Utility for Linux. Version 1.6a10 (10/7/2004).
Copyright (c) 2004, Emulex Network Systems, Inc.

Emulex Fibre Channel Host Adapters Detected: 1
Host Adapter 0 (lpfc0) is an LP1050 (Ready Mode)

HBAnyware Installation

Download the Driver and Application kit for VMware from Emulex's website.
- At time of writing the current version of package was elxvmwarecorekit-esx35-4.0a45-1.i386.rpm
Copy the package to the server
- EG pscp -pw [password] elxvmwarecorekit-esx35-4.0a45-1.i386.rpm platadmn@dtcp-esxsvce01a:/home/platadmn
Install the package
- EG rpm -ivh elxvmwarecorekit-esx35-4.0a45-1.i386.rpm

Check Emulex HBA Firmware Version

Requires the HBA Utility to be installed 1st (see above)

Start the utility (on startup it should detect one or more HBA's);
- /usr/sbin/lpfc/lputil
From the Main menu, enter 2, Adapter Revision Levels
- Example shows version 1.91a5

                   BIU: 1001206D
      Sequence Manager: 00000000
                 Endec: 00000000
  Operational Firmware: SLI-2 Overlay
                Kernel: 1.40a3
      Initial Firmware: Initial Load 1.91a5 (MS1.91A5 )
                 SLI-1: SLI-1 Overlay 1.91a5 (M1F1.91A5 )
                 SLI-2: SLI-2 Overlay 1.91a5 (M2F1.91A5 )
 Highest FC-PH Version: 4.3
  Lowest FC-PH Version: 4.3

Update Emulex HBA Firmware

Using HBA Utility (must be installed 1st - see above). See the Emulex website for the latest version, eg Emulex LP1050Ex

To update the firmware (example uses LP1050Ex-mf191a5)

Put the downloaded zip file on the UKLONVCP1 NFS Share, and unzip to a folder, eg EmulexLP1050Ex-mf191a5
Create folder in /var/updates;
- mkdir /var/updates/EmulexLP1050Ex-mf191a5
Copy the firmware update onto the ESX
- cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/EmulexLP1050Ex-mf191a5/mf191a5.all /var/updates/EmulexLP1050Ex-mf191a5/
Start the utility (on startup it should detect one or more HBA's);
- /usr/sbin/lpfc/lputil
From the Main menu, enter 3, Firmware Maintenance.
If prompted, choose the HBA that is being updated.
Enter 1, Load Firmware Image.
Enter the full path to the firmware file, upgrade will then complete, eg

Enter Image Filename => /var/updates/EmulexLP1050Ex-mf191a5/mf191a5.all
Opening File...
End Of File
Checksum OK!!!
Reading AIF Header #1...
Validating Checksum...
Erasing Flash ROM Sectors...
100% complete
Loading Image...
First Download
100% complete
Image Successfully Downloaded...
Reading AIF Header #2...
Validating Checksum...
Erasing Flash ROM Sectors...
100% complete
Loading Image...
First Download
100% complete
Updating Wakeup Parameters...
Image Successfully Downloaded...
Reading AIF Header #3...
End Of File
Resetting Host Adapter...
Image Successfully Downloaded...

Using HBAnyware (must be installed 1st - see above)

Download the correct firmware version from Emulex's website
- EG for LPe11002's
Extract, and copy file to server
Find adapter's WWPN's
- EG /usr/sbin/hbanyware/hbacmd ListHBAs
Download the new firmware version to each HBA
- EG /usr/sbin/hbanyware/hbacmd download 10:00:00:00:c9:82:97:9e zf280a4.all

EMCgrab Collection

Download the correct version from EMC's website
- At time of writing the current version file was emcgrab_ESX_v1.1.tar
Copy to server
- EG pscp emcgrab_ESX_v1.1.tar platadmn@dtcp-esxsvce02a:/home/platadmn
Uncompress the file
- EG tar -xvf emcgrab_ESX_v1.1.tar
Run grab (can take a few minutes, best done out of hours)
- EG ./emcgrab.sh
Results can be found in \emcgrab\outputs folder

QLogic

Find QLogic HBA Driver and Firmware Version

cd /proc/scsi/qla2300
more 1 (for HBA 1)

[root@uklonesxp1 qla2300]# more 1
QLogic PCI to Fibre Channel Host Adapter for QLA2340 :
        Firmware version:  3.03.19, Driver version 7.07.04
Entry address = 0x7dc314
HBA: QLA2312 , Serial# E79916
Request Queue = 0x3f403000, Response Queue = 0x3f414000
...
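
The firmware and driver versions sit on one line of this output, so they can be extracted together. A small sketch, assuming the line format shown above (the function name is hypothetical, and other driver versions may lay the line out differently):

```python
import re

# First lines of the example output above.
SAMPLE = """\
QLogic PCI to Fibre Channel Host Adapter for QLA2340 :
        Firmware version:  3.03.19, Driver version 7.07.04
"""

VERSIONS_RE = re.compile(r"Firmware version:\s*([\w.]+),\s*Driver version\s+([\w.]+)")


def qla_versions(text):
    """Return (firmware, driver) versions, or None if the line isn't found."""
    match = VERSIONS_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(qla_versions(SAMPLE))
```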

Install QLogic HBA Utility

Installation instructions for the SANsurfer utility

Put the downloaded file on the UKLONVCP1 NFS Share, eg scli-1.7.0-12.i386.rpm.gz
Copy to folder /var/updates (create if it doesn't exist)
- cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/scli-1.7.0-12.i386.rpm.gz /var/updates
Uncompress the file with the following command;
- gunzip scli-1.7.0-12.i386.rpm.gz
Enter the following commands to install the package, and then check it's installed;
- rpm -iv scli-1.7.0-12.i386.rpm
- rpm -q scli

[root@uklonesxp1 updates]# rpm -iv scli-1.7.0-12.i386.rpm
Preparing packages for installation...
scli-1.7.0-12
[root@uklonesxp1 updates]# rpm -q scli
scli-1.7.0-12

Update QLogic HBA Firmware

See the QLogic website for the latest version. You must ensure the firmware version is compatible with the currently running driver version. Requires SANsurfer to be installed 1st (see above)

Put the downloaded zip file on the UKLONVCP1 NFS Share, eg q231x_234x_bios147.zip, and unzip to a folder
Create a new folder for the update;
- mkdir /var/updates/q231x_234x_bios147
Copy the firmware onto the ESX server;
- cp /vmfs/volumes/UKLONVCP1\ NFS\ Share/q231x_234x_bios147/QL23ROM.BIN /var/updates/q231x_234x_bios147/
Move to the folder containing the update;
- cd /var/updates/q231x_234x_bios147/
Start the SANsurfer utility
- scli
Go into the HBA Utilities option
Select the Save Flash option
Follow the prompts to save the flash to a backup file, eg BackupROM.bin
Select the Update Flash option
Follow the prompts to update the flash, using the file copied to the ESX, eg QL23ROM.BIN

Enter a file name or Hit <RETURN> to abort: QL23ROM.BIN
Updating flash on HBA 0 - QLA2340 . Please wait...
Option ROM update complete. Changes have been saved to the HBA 0.
Please reboot the system for the changes to take effect.
Updating flash on HBA 1 - QLA2340 . Please wait...
Option ROM update complete. Changes have been saved to the HBA 1.
Please reboot the system for the changes to take effect.

SAN Downtime

ESX's don't like to lose the SAN, to the extent that during scheduled SAN downtime the following is recommended...

Shutdown ESX's (and hosted VM's) connected to affected storage
Perform SAN maintenance
Restart ESX's (and hosted VM's)

If the above is not possible then it's recommended that you...

Migrate away/shutdown VM's that are hosted on affected storage
Un-present LUN's
Rescan LUN's from ESX and confirm they disappear (any VM's on extinct storage will become greyed-out)
Perform SAN maintenance
Re-present LUN's
Rescan LUN's from ESX and confirm that they re-appear (greyed-out VM's should reconnect)
Restart / migrate VM's

Netflow

Netflow is available on ESX v3 only, and is an experimental feature.  Netflow v5 is sent.

To start Netflow
Load the module
- vmkload_mod netflow
Configure monitoring of the appropriate vSwitches to the Netflow collector IP and port
- /usr/lib/vmware/bin/vmkload_app -S -i vmktcp /usr/lib/vmware/bin/net-netflow -e vSwitch0,vSwitch1 10.20.255.31:2055
To reconfigure the Netflow module you must stop and restart the module

To confirm running
Check the module is running...
- [root@esx1 root]# vmkload_mod -l | grep netflow
- netflow             0x9b4000          0x3000      0x298b640         0x1000        16 Yes
Check the correct config is running...
- [root@esx1 root]# ps -ef | grep netflow
- root      2413     1  0 Feb05 ?        00:00:00 /usr/lib/vmware/bin/vmkload_app -S -i vmktcp /usr/lib/vmware/bin/net-netflow -e vSwitch0,vSwitch1 10.20.255.31:2055

To stop Netflow
- ps -ef | grep netflow
- kill <pid>
- vmkload_mod -u netflow

Change Service Console IP Information

Logged in as root, use the esxcfg-vswif command

esxcfg-vswif <options> [vswif]

Description: Creates and updates service console network settings. This command is used if you cannot manage the ESX Server host through the VI Client because of network configuration issues.

Note that the -l option displays the name(s) of the configured vswifs, one of which must be specified on the other commands, so the trailing [vswif] is not optional on most commands.

Options:

-a	Add vswif, requires IP parameters. Automatically enables interface.
-d	Delete vswif.
-l	List configured vswifs.
-e	Enable this vswif interface.
-s	Disable this vswif interface.
-p	Set the portgroup name of the vswif.
-i <x.x.x.x> or DHCP	The IP address for this vswif or specify DHCP to use DHCP for this address.
-n <x.x.x.x>	The IP netmask for this vswif.
-b <x.x.x.x>	The IP broadcast address for this vswif. (not required if netmask and ip are set)
-c	Check to see if a virtual NIC exists. Program outputs a 1 if the given vswif exists, 0 otherwise.
-D	Disable all vswif interfaces. (WARNING: This may result in a loss of network connectivity to the Service Console)
-E	Enable all vswif interfaces and bring them up.
-r	Restore all vswifs from the configuration file. (Internal use only)
-h	Displays command help.

Note: You set the Service Console default gateway by editing the /etc/sysconfig/network file or through the VI Client under Configuration, DNS & Routing.
Note: You set the Service Console VLAN (to 1234) using a similar command to: esxcfg-vswitch -v1234 -p"Service Console" vSwitch0

Change Timezone

Log into the ESX Server service console as root.
Find the desired time zone under the directory /usr/share/zoneinfo
Edit /etc/sysconfig/clock to show the relative path to the file representing the new time zone, and ensure that UTC and ARC are set as shown:
 ZONE="Etc/GMT"
 UTC=true
 ARC=false
Copy the desired time zone file to /etc/localtime
- cp /usr/share/zoneinfo/GMT /etc/localtime
Confirm that /etc/localtime has been updated with the correct zoneinfo data by comparing it to the zoneinfo file found in step 2; if the files are identical, your prompt will return without any output.
- diff /etc/localtime /usr/share/zoneinfo/GMT
Confirm the system and hardware clocks are correct. Use the Linux date command to check, and set the correct time if necessary.
Set the hardware clock to match the correct system time.
Set the system clock to the local date and time;
- date MMDDhhmmYYYY
Update the hardware clock with the current time of the system clock;
- /sbin/hwclock --systohc

Troubleshooting

If all else fails you can always raise a VMware Service Request

Useful paths / logfiles

Timestamps in logfiles are in UTC !!!
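
When lining log entries up against events reported in local time, it can help to convert the timestamps. The sketch below is illustrative only: it assumes syslog-style timestamps with no year (as seen in vmkernel log entries), so the year and the local UTC offset (here 2010 and +1 hour) are made-up inputs you must supply yourself, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def log_time_to_local(stamp, year, utc_offset_hours):
    """Convert a syslog-style UTC timestamp (no year) to a fixed-offset local time."""
    # The log line carries no year, so it has to be supplied by the caller.
    utc_time = datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


# Hypothetical example: a 15:15:14 UTC entry viewed from a UTC+1 site.
local = log_time_to_local("Sep  3 15:15:14", 2010, 1)
print(local.strftime("%Y-%m-%d %H:%M:%S"))
```

A fixed offset is used to keep the sketch simple; it takes no account of daylight saving changes.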

ESX

Item	Path	Comments
Vmkernel logfile	/var/log/vmkernel	Pretty much everything seems to be recorded here
Vmkernel warnings	/var/log/vmkwarning	Virtual machine warnings
Host Daemon logfile	/var/log/vmware/hostd.log	Services log
vCentre Agent logfile	/var/log/vmware/vpx/vpxa.log	vCentre agent
Local VM files	/vmfs/volumes/storage	storage name can vary, use TAB so shell selects available
SAN VM files	/vmfs/volumes/SAN	SAN will vary depending on what you've called your storage
HA agent logs	/opt/LGTOaam512/log/	Various logs of limited use - deprecated
HA agent log	/var/log/vmware/aam/agent/run.log	Main HA log
HA agent install log	/var/log/vmware/aam/aam_config_util_install.log	HA install log

ESXi

To view logfiles from an ESXi server, assuming you don't have SSH access, they need to be downloaded to your client machine 1st, and then viewed from there...

Using VI Client, go to File | Export | Export System Logs...
Tick the appropriate object
Untick Include information from vCenter Server and vSphere Client, unless you additionally want this info
Once exported, uncompress the ESX's tgz file

However, this is most easily achieved if you've got the PowerCLI installed, in which case see ESXi Logs via PowerCLI

Name	PowerCLI Key	Diagnostic Dump Path	Comments
Syslog	messages	/var/log/messages	Equivalent to ESX hostd and vmkernel logs combined
Host Daemon	hostd	/var/log/vmware/hostd.log	Equivalent to ESX hostd log
vCenter Agent	vpxa	/var/log/vmware/vpx/vpxa.log
SNMP Config		/etc/vmware/snmp.xml	Edit via vicfg-snmp

Logfiles get lost at restart! If you have to restart your ESX (say, because it locked up) there will be no logs prior to the most recent boot. In theory they'll get written to a dump file if a crash is detected, but I've never found them, so assume they're only generated during a semi-graceful software crash.
However, there is a way around this. Messages can be sent to a syslog file (say on a centrally available SAN LUN), a syslog server (in both cases see VM KB 1016621), or to a vMA server (see http://www.vmware.com/support/developer/vima/vima40/doc/vma_40_guide.pdf).

ESXi Tech Support Mode

There's no Service Console on ESXi, so you have to do without. Well almost: there is the unsupported Tech Support Mode, which is a lightweight Service Console. To enable it...

ESXi 3.5 and 4.0

Go to the local ESXi console and press Alt+F1
Type unsupported
Blindly type the root password (yes, there's no prompt)
Edit /etc/inetd.conf and uncomment (remove the # from) the line that starts with #ssh, and save
Restart the management service with /sbin/services.sh restart

ESXi 4.1

Go to the local ESXi console and press F2
Enter root user and pass
Go to the Troubleshooting Options
Enable Local Tech Support or Remote Tech Support (SSH) as required

Alternatively...

From the vSphere Client, select the host and click the Configuration tab
Go to Security profile > Properties
Select Local Tech Support or Remote Tech Support (SSH) and click Options button
Choose the Start automatically startup policy, click Start, and then OK.

ESXTOP

Key	Change View	Key	Sort by
c	ESX CPU	U	% CPU Used
		R	% CPU Ready
		N	Normal / default
m	ESX Memory	M	Memsz
		B	Mctlsz
		N	Normal / default
d	ESX Disk Adapter	r	Reads/sec
		w	Writes/sec
		R	Read MB/sec
		T	Write MB/sec
		N	Normal / default
u	ESX Disk Drive/LUN	r	Reads/sec
		w	Writes/sec
		R	Read MB/sec
		T	Write MB/sec
		N	Normal / default
v	VM Disk	r	Reads/sec
		w	Writes/sec
		R	Read MB/sec
		T	Write MB/sec
		N	Normal / default
n	ESX NIC	t	Transmit Packet/sec
		r	Receive Packet/sec
		T	Transmit MB/sec
		R	Receive MB/sec
		N	Normal / default

CPU

Poor performance

Basic things to check are that the VM, or the ESX it's hosted on, isn't saturating its available CPU. However, if VM's are performing sluggishly and/or are slow to start, despite not appearing to be using excessive CPU time, further investigation is required...

Use esxtop on the ESX service console. Look at Ready Time (%RDY), which is how long a VM is waiting for CPUs to become available.
Alternatively look for CPU Ready in the performance charts. Here it's measured in msec, over the normal 20 sec sampling interval.

CPU Ready can creep up if the system is pushed, or if the VM has multiple CPUs (as it needs multiple physical CPUs to become available at the same time, aka CPU Co-Scheduling). Multiple CPU's are especially a problem in environments where there are a large number of SMP VM's.

% CPU Ready	MSec CPU Ready	Performance
< 1.25 %	< 250 msec	Excellent
< 2.5 %	< 500 msec	Good
< 5 %	< 1000 msec	Acceptable
< 10 %	< 2000 msec	Poor
> 15 %	> 3000 msec	Bad
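
The percentage and msec figures are two views of the same measurement, related by the sampling interval: 1000 msec of ready time in a 20 sec (20000 msec) sample is 5 %. A minimal sketch of the conversion follows, assuming the normal 20 sec interval mentioned above and a figure for a single VM; the function names are hypothetical.

```python
def cpu_ready_percent(ready_ms, interval_s=20.0):
    """Convert CPU Ready in msec per sample into a percentage of the sample."""
    return ready_ms * 100.0 / (interval_s * 1000.0)


def cpu_ready_ms(ready_percent, interval_s=20.0):
    """Convert a CPU Ready percentage into msec per sample."""
    return ready_percent * interval_s * 1000.0 / 100.0


print(cpu_ready_percent(1000))  # msec from a performance chart
print(cpu_ready_ms(10))         # %RDY from esxtop
```

If your charts use a different sampling interval, pass it as interval_s, as the msec thresholds only apply to 20 sec samples.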

CPU Co-Scheduling is more relaxed in ESX4 than ESX3, due to changes in the way that differences in the progress of separate vCPU's within a single VM are calculated. This means that the detrimental effect on pCPU efficiency of having multi-CPU VM's is reduced (but not eliminated). See http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf for further info.

Storage

Poor throughput

Use esxtop on the service console and switch to the disk monitor. Enable the latency views, and you will see values like GAVG, KAVG and DAVG.

GAVG is the total guest experienced latency on IO commands averaged over 2 seconds
KAVG is the vmkernel/hypervisor IO latency averaged over 2 seconds
DAVG is the device (HBA) IO latency averaged over the last 2 seconds (will include any latency at lower level, eg SAN)

Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO, as a rough guide to indicate if theres a problem or not...




Latency up to	Status
2 ms	Excellent - look elsewhere
10 ms	Good
20 ms	Reasonable
50 ms	Poor / Busy
higher	Bad
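
If you're collecting latency figures in bulk (eg from esxtop batch output), the rough guide above can be applied with something like the following sketch. The thresholds and labels are copied from the guide; the function name latency_status is hypothetical, and treating each "up to" limit as inclusive is an assumption.

```python
from bisect import bisect_left

# Upper limits in ms and their labels, taken from the rough guide above.
_LIMITS_MS = [2, 10, 20, 50]
_LABELS = ["Excellent - look elsewhere", "Good", "Reasonable", "Poor / Busy", "Bad"]


def latency_status(latency_ms):
    """Return the rough-guide status for a GAVG/KAVG/DAVG latency value in ms."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    # bisect_left finds the first limit that is >= latency_ms, so each
    # "up to" limit is treated as inclusive; anything above 50 ms is "Bad".
    return _LABELS[bisect_left(_LIMITS_MS, latency_ms)]
```

The same function works for any of the three counters, but remember that DAVG includes latency from lower levels such as the SAN, so a poor DAVG points at the storage side rather than the hypervisor.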

Storage Monitor Log Entries

How to decode the following type of entries...

Sep  3 15:15:14 tfukesxent1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
Sep  3 15:15:32 tfukesxent1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1

The status message consists of the following four decimal and hex blocks...




Device Status / Host Status	Sense Key	Additional Sense Code	Additional Sense Code Qualifier

Where the ESX Device and SAN Host statuses mean...




Decimal	Device Status	Host Status	Comments
0	No Errors	Host_OK
1		Host_No_Connect
2	Check Condition	Host_Bus_Busy
3		Host_Timeout
4		Host_Bad_Target
5		Host_Abort
6		Host_Parity
7		Host_Error
8	Device Busy	Host_Reset
9		Host_Bad_INTR
10		Host_PassThrough
11		Host_Soft_Error
24	Reservation Conflict		24/0 indicates a locking error, normally caused by too many ESX's mounting a LUN, a wrong config on the storage array, or too many VM's on a LUN

Where the Sense Key means...




Hex	Sense Key
0x0	No Sense Information
0x1	Last command completed but used error correction
0x2	Unit Not Ready
0x3	Medium Error
0x4	Hardware Error
0x5	ILLEGAL_REQUEST (Passive SP)
0x6	LUN Reset
0x7	Data_Protect - Access to data is blocked
0x8	Blank_Check - Reached an unexpected region
0xa	Copy_Aborted
0xb	Aborted_Command - Target aborted command
0xc	Comparison for SEARCH DATA unsuccessful
0xd	Volume_Overflow - Medium is full
0xe	Source and Data on Medium do not agree

The Additional Sense Code and Additional Sense Code Qualifier mean




Hex	Sense Code
0x4	Unit Not Ready
0x3	Unit Not Ready - Manual Intervention Required
0x2	Unit Not Ready - Initializing Command Required
0x25	Logical Unit Not Supported (eg LUN doesn't exist)
0x29	Device Power on or SCSI Reset

For further info on sense codes see - http://www.adaptec.com/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm?nc=/en-US/support/scsi/2940/AHA-2940AU/use_prod/SCSI_event_codes.htm
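
The decoding described above can be sketched in a few lines. This is an illustration only: the regular expression is inferred from the two example log lines in this section, the lookup tables are abridged from the tables above, and the function name decode_storage_monitor is made up for this example. Other ESX versions may log in a different format.

```python
import re

# Abridged lookups, labels as used in the tables in this section.
DEVICE_STATUS = {0: "No Errors", 2: "Check Condition", 8: "Device Busy",
                 24: "Reservation Conflict"}
HOST_STATUS = {0: "Host_OK", 1: "Host_No_Connect", 2: "Host_Bus_Busy",
               3: "Host_Timeout", 4: "Host_Bad_Target", 5: "Host_Abort",
               6: "Host_Parity", 7: "Host_Error", 8: "Host_Reset",
               9: "Host_Bad_INTR", 10: "Host_PassThrough", 11: "Host_Soft_Error"}
SENSE_KEY = {0x0: "No Sense Information", 0x2: "Unit Not Ready",
             0x3: "Medium Error", 0x4: "Hardware Error",
             0x5: "ILLEGAL_REQUEST", 0x6: "LUN Reset", 0xb: "Aborted_Command"}

# Pattern assumed from the example lines: "<path> status = D/H key asc ascq"
_PATTERN = re.compile(
    r"StorageMonitor: \d+: (?P<path>\S+) status = (?P<dev>\d+)/(?P<host>\d+) "
    r"(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)"
)


def decode_storage_monitor(line):
    """Split a StorageMonitor log line into its status blocks, or return None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    dev, host = int(match.group("dev")), int(match.group("host"))
    key = int(match.group("key"), 16)
    return {
        "path": match.group("path"),
        "device_status": DEVICE_STATUS.get(dev, f"unknown ({dev})"),
        "host_status": HOST_STATUS.get(host, f"unknown ({host})"),
        "sense_key": SENSE_KEY.get(key, f"unknown ({key:#x})"),
        "asc": int(match.group("asc"), 16),    # Additional Sense Code
        "ascq": int(match.group("ascq"), 16),  # Additional Sense Code Qualifier
    }
```

Fed the first example line, this reports path vmhba1:2:0:0 with a device status of Check Condition, a host status of Host_OK and a sense key of LUN Reset. The asc and ascq values are returned as plain numbers, to be looked up in the sense code reference linked above.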

Recovering VM's from failed storage

This procedure was generated from an occasion where the ESX software was installed on top of the shared SAN VMFS storage.  The VM files still existed so the VM’s continued to run, but as the file system index no longer existed, the vmdk’s etc were orphaned and would be lost if the VM’s were to be restarted.  It could be adapted to suit any situation where the ESX datastore is corrupted, VM’s cannot be powered on, and rebooting a VM would lose it.  However, it’s well worth calling VMware support before carrying this out, as they may be able to provide an easier solution.

On each VM
Shut-down running applications
Install VMware Converter (Typical install, all default options)
Hot migrate local VM to a new VM on new storage
As VMware converter starts, select Continue in Starter Mode
Select Import Machine from the bottom of the initial screen
Select source as Physical Machine, then on next screen This local machine
Select default options for source disk
Select VMware ESX server... as your destination
Enter ESX hostname, and root user/pass
Enter new VM name, e.g. myserver-recov (not the same as the existing, it will let you do it, but the VC isn’t happy later on)
Select host
Select datastore
Select network and uncheck Connect at power on...
Don’t select power on after creation, and let the migration run
Reconfig the new VM, edit its settings as follows
- Floppy Drive 1 --> Client Device
- CD/DVD Drive 1 --> Client Device
- Parallel Port 1 --> Remove
- Serial Port 1 --> Remove
- Serial Port 2 --> Remove
- USB Controller --> Remove
Power up the new VM and check it over
Power off the old VM (you will lose it forever, be very sure the new VM is good)
Connect the network of the new VM
Delete the old VM
Delete the knackered SAN datastore and refresh on all other ESX’s that share it (deletes the name but doesn’t free up any space)
Create a new SAN datastore (this formats the old space)
Refresh on all other ESX’s that share the datastore
Shutdown all the new VM’s
Clone them to the new SAN datastore using the original name (e.g. myserver)
Power up the new VM’s on the SAN datastore, confirm OK, then delete the myserver-recov servers

Recover lost SAN VMFS partition

For example, after a power-down the ESX's can see the SAN storage, but the VMFS cannot be found in the Storage part of the ESX config, even after a Refresh.  To fix, the VMFS needs to be resignatured...
Do not attempt to use Add Storage to recover the VMFS, this will format the partition

On one of the ESX's, in Advanced Settings, change LVM.EnableResignature to 1
Refresh Storage, the VMFS should be found with a new name, something like snap-000000002-OriginalName.
Remove from Inventory all VM's from the old storage, the old storage should disappear from the list of datastores
Rename the found storage to the original name
Refresh Storage on all other ESX's, they should see the VMFS again
Revert LVM.EnableResignature to 0 on the ESX where it was changed
Via the ESX, browse the datastore and re-add the VM's to the inventory (right-click over the .vmx file)
For a Virtual Machine Question about what to do about a UUID, select Keep

USB / SD Hypervisor Checks

USB sticks and SD cards are notorious for causing problems.  Especially USB sticks, which were designed for occasional access storage, and not to be used repetitively in the way they are when running the ESXi hypervisor.  SD cards may well just be tarnished by the reputation of USB.  In order to perform a disk check, use the following...
This assumes you're running ESXi4, if using ESXi3 use this procedure (from which this section is adapted): http://www.vm-help.com/esx/esx3i/check_system_partitions.php
Firstly a quick overview of the partitions...

/vmfs/volumes/Hypervisor1	/bootbank	Where the ESX boots from
/vmfs/volumes/Hypervisor2	/altbootbank	Used during ESX updates
/vmfs/volumes/Hypervisor3	/store		VMTools ISO's etc

Everything else in an ESXi server is stored on the scratch disk, or is created at boot in a ramdisk
Run  fdisk -l  to list the available partitions on the USB/SD card (you'll see your SAN partitions as well)...

Disk /dev/disks/mpx.vmhba32:C0:T0:L0: 8166 MB, 8166309888 bytes
64 heads, 32 sectors/track, 7788 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

                           Device Boot      Start         End      Blocks  Id System
/dev/disks/mpx.vmhba32:C0:T0:L0p1             5       900    917504    5  Extended
/dev/disks/mpx.vmhba32:C0:T0:L0p4   *         1         4      4080    4  FAT16 <32M
/dev/disks/mpx.vmhba32:C0:T0:L0p5             5       254    255984    6  FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p6           255       504    255984    6  FAT16
/dev/disks/mpx.vmhba32:C0:T0:L0p7           505       614    112624   fc  VMKcore
/dev/disks/mpx.vmhba32:C0:T0:L0p8           615       900    292848    6  FAT16

The two partitions with the identical number of blocks are /bootbank and /altbootbank, perform a check disk on these

dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:5
dosfsck -v /dev/disks/mpx.vmhba32:C0:T0:L0:6

To perform a verification pass use -V.  To test for bad sectors use -t, which also needs either the -a (automatically repair) or -r (interactively repair) option.

dosfsck -V /dev/disks/mpx.vmhba32:C0:T0:L0:5
dosfsck -t -r /dev/disks/mpx.vmhba32:C0:T0:L0:5
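
Picking out the /bootbank and /altbootbank pair from the fdisk listing can also be done programmatically. The following is a rough sketch that assumes the column layout of the fdisk output shown in this section; the function name find_bootbank_candidates is made up for this example.

```python
from collections import defaultdict


def find_bootbank_candidates(fdisk_output):
    """Return groups of FAT16 partitions that have an identical block count."""
    by_blocks = defaultdict(list)
    for line in fdisk_output.splitlines():
        fields = line.split()
        # Only partition rows start with a device path.
        if not fields or not fields[0].startswith("/dev/"):
            continue
        # Drop the boot flag so the remaining columns line up.
        if "*" in fields:
            fields.remove("*")
        if len(fields) < 6:
            continue
        # Columns: device, start, end, blocks, id, system...
        device, blocks, system = fields[0], fields[3], " ".join(fields[5:])
        if system == "FAT16":
            # Some fdisk builds append "+" to odd block counts.
            by_blocks[int(blocks.rstrip("+"))].append(device)
    return [devices for devices in by_blocks.values() if len(devices) > 1]
```

Against the listing above this returns one group containing the p5 and p6 partitions.  Note that fdisk prints the partitions with a p5 style suffix, whereas the dosfsck commands in this section address the same partitions with a :5 style suffix.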

High Availability

Be aware that playing with HA can have disastrous effects, especially if the Isolation Response of your cluster is set to Power Off.  If you can, consider waiting until outside of production hours before trying to resolve a problem.  Unstable clusters can disintegrate if you're unlucky.
There are 5 primaries in an HA cluster.  The first ESX's to join the cluster become primaries, and this only changes (through an election) when one of the following occurs (note - not during an ESX failure)...

Primary ESX goes into Maintenance Mode
Primary disconnected from the cluster
Primary removed from the cluster
Primary reconfigured for HA

It's quite common for HA to go into an error state.  The normal course of action is to use the Reconfigure for HA option for the ESX that's experiencing the problem, which reinstalls the HA agent onto the ESX.  It's also common to have to do this a couple of times for it to be successful.  Other things to try...

Restart the HA process - see High Availability Stop/Start
Deinstall HA and VPXA and reinstall

HA is very dependent on proper DNS, to check everything is in order do the following from each ESX.  Some versions of ESX3 are sensitive to case, so always use lower case: the FQDN of ESX's should be lower case, as should the VC's FQDN and domain suffix search.

Check that the hostname/IP of the local ESX is as expected
 hostname 
 hostname -s 
 hostname -i 
If not check the following files
 /etc/hosts 
 /etc/sysconfig/network 
 /etc/vmware/esx.conf 
Check that HA can properly resolve other ESX's in the cluster (note: only one IP address should be returned)
 /opt/vmware/aam/bin/ft_gethostbyname <my_esx_name> 
Check that HA can properly resolve the vCentre
 /opt/vmware/aam/bin/ft_gethostbyname <my_vc_name> 
Check the vCentre server can properly resolve the ESX names
Check the vCentre's FQDN and DNS suffix search are correct and lower case
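
The name checks above (lower case, and exactly one IP address returned) can be scripted from any machine with Python.  This is only a sketch: it uses the operating system's resolver rather than HA's own ft_gethostbyname, so it complements the checks above rather than replacing them, and the function name check_ha_name is made up for this example.

```python
import socket


def check_ha_name(name, resolver=socket.gethostbyname_ex):
    """Return a list of problems found with an ESX or vCentre DNS name."""
    problems = []
    # Some versions of ESX3 are case sensitive, so insist on lower case.
    if name != name.lower():
        problems.append("name is not lower case")
    try:
        # gethostbyname_ex returns (hostname, aliases, ip_addresses)
        _, _, addresses = resolver(name)
    except OSError as exc:
        problems.append(f"does not resolve: {exc}")
        return problems
    if len(addresses) != 1:
        problems.append(f"expected one IP address, got {len(addresses)}")
    return problems
```

An empty list means the name passed both checks.  The resolver argument is only there so that the function can be exercised without real DNS, in normal use just call check_ha_name with the FQDN of each ESX and of the vCentre.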

If you need to correct DNS names, don't be surprised if you need to reinstall HA and VPXA.  It can be done without interrupting running VM's, but it's obviously a lot less stressful not to.

Manually Deinstall

Put the ESX into maintenance mode
Disconnect the ESX from the Virtual Centre
SSH to the ESX server (or use ESXi Tech Support Mode)
 cd /opt/vmware/uninstallers 
 ./VMware-vpxa-uninstall.sh 
 ./VMware-aam-ha-uninstall.sh 
Reconnect the ESX to the VC
Take out of maintenance mode

If the VC Agent or HA Agent fails due to the uninstaller being unable to remove files/folders, and you can't remove them manually, this is an indication that the disk is becoming corrupt.  Especially if installed on a USB key, consider replacing ASAP.

Command Line Interface etc

Using the commands in this section isn't supported by VMware
To start the CLI run the following command...

/opt/vmware/aam/bin/Cli

The interface is a bit dodgy, you can enter the same command twice, and it'll be rejected one time and accepted another, patience is required.




Command	Comments
ln	List cluster nodes and their status
addNode <hostname>	Add ESX/node to cluster (use ESX's short DNS name)
promoteNode <hostname>	Promote existing ESX/node to be a primary
demoteNode <hostname>	Demote existing ESX/node to be a secondary

There are also the following scripts, which behave as you'd expect (found in  /opt/vmware/aam/bin )...

 ./ft_setup 
 ./ft_startup 
 ./ft_shutdown

Error Hints

Host in HA Cluster must have userworld swap enabled

ESXi servers need to have scratch space enabled

In vCentre, go to the Advanced Settings of the ESX
Go to ScratchConfig and locate ScratchConfig.ConfiguredScratchLocation 
Set to directory with sufficient space (1GB) (can be configured on local storage or shared storage, folder must exist and be dedicated to ESX, delete contents if you've rebuilt the ESX)
Format  /vmfs/volumes/<DatastoreName> 
EG  /vmfs/volumes/SCRATCH-DISK/my_esx 
Locate  ScratchConfig.ConfiguredSwapState  and set
Bounce the ESX

Unable to contact primary host in cluster

The ESX is unable to contact a primary ESX in the cluster, which indicates some kind of networking issue
If there's no existing HA'ed ESX's, start by looking at VC networking

Snapshots

http://geosub.es/vmutils/Troubleshooting.Virtual.Machine.snapshot.problems/Troubleshooting.Virtual.Machine.snapshot.problems.html
See also Virtual Machines Snapshot Troubleshooting

Random Problems

ESXi Lockup

Affects ESXi v3.5 Update 4 only.  Caused by a problem with updated CIM software in Update 4.

Workaround
Disable CIM (disables hardware monitoring) by setting Advanced Settings | Misc | Misc.CimEnabled to 0 (restart to apply)
Fix
Apply patch ESXe350-200910401-I-SG, see http://kb.vmware.com/kb/1014761

For further info see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012575

Cimserver High CPU

Caused by problems with the VMware CIM server software.  However, it can also be triggered by other problems that cause it to go nuts (check VMKernel logs, etc).

Restart
 service pegasus restart

Log Bundles Creation Fails

ESX log bundle creation fails, either via the VI Client or via vm-support

SSH to the ESX
Run vm-support to try to create a log bundle

Could not copy...Have you run out of disk space?
ESX - Check that there's space to be able to write in /tmp

ESXi - Check that the ESX has been configured with a scratch disk, and that it has space

tar: write error: Broken pipe
ESXi - Check that the ESX has been configured with a scratch disk

Key	Change View	Key	Sort by
`c`	ESX CPU	`U`	% CPU Used
		`R`	% CPU Ready
		`N`	Normal / default
`m`	ESX Memory	`M`	Memsz
		`B`	Mctlsz
		`N`	Normal / default
`d`	ESX Disk Adapter	`r`	Reads/sec
		`w`	Writes/sec
		`R`	Read MB/sec
		`T`	Write MB/sec
		`N`	Normal / default
`u`	ESX Disk Drive/LUN	`r`	Reads/sec
		`w`	Writes/sec
		`R`	Read MB/sec
		`T`	Write MB/sec
		`N`	Normal / default
`v`	VM Disk	`r`	Reads/sec
		`w`	Writes/sec
		`R`	Read MB/sec
		`T`	Write MB/sec
		`N`	Normal / default
`n`	ESX NIC	`t`	Transmit Packet/sec
		`r`	Receive Packet/sec
		`T`	Transmit MB/sec
		`R`	Receive MB/sec
		`N`	Normal / default
