Installation (ESX): Difference between revisions
m (→Troubleshooting: Re-ordered sections) |
m (→Troubleshooting: Added CPU troubleshooting) |
Revision as of 11:44, 7 December 2009
Build Notes
Security Hardening
Service Console
Disk Partitions
Suggested partition sizing for the Service Console on local disk, to prevent the root partition being filled with user data
part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
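To sanity-check that the layout fits on the local disk, the partition sizes can be totalled. The following is a minimal sketch, assuming the kickstart `--size` values are in megabytes (the kickstart default); the `partition_sizes` helper and the `KICKSTART` constant are hypothetical names introduced for illustration only.

```python
import re

# The partition lines from the suggested layout, one per line.
KICKSTART = """\
part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
"""


def partition_sizes(kickstart_text):
    """Return {mount point: size in MB} for every 'part' line."""
    sizes = {}
    for line in kickstart_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0] != "part":
            continue  # not a partition definition
        match = re.search(r"--size[= ](\d+)", line)
        if match:
            # tokens[1] is the mount point ("None" for the vmkcore partition)
            sizes[tokens[1]] = int(match.group(1))
    return sizes


if __name__ == "__main__":
    sizes = partition_sizes(KICKSTART)
    for mount, size_mb in sizes.items():
        print(f"{mount:8} {size_mb:6} MB")
    print(f"{'total':8} {sum(sizes.values()):6} MB")
```

The total gives the minimum local disk space the Service Console layout needs before any VMFS volume is added.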
Local Accounts
Password Policy
No policy is implemented by default. If you are not using AD integration, it is sensible to apply a policy on the ESX host using the PAMQC module, although it is not particularly elegant.
Active Directory Integration
Because service console authentication is Unix-based, it cannot use Active Directory to define user accounts. However, it can use Active Directory to authenticate users, by matching account names in the local passwd file with Active Directory accounts, given appropriate support from SFU (Services For Unix).
See Scott Lowe's blog for further info
Sudo
It is possible to limit the enhanced privileges that a user can gain by using sudo. This is most appropriate where there is a large number of admins. However, such an environment is also likely to have a large number of ESX hosts, and managing the sudo config on each one is a headache.
Example of possible sudo config (/etc/sudoers)
...
# Defaults specification
Defaults logfile=/var/log/sudolog
# User privilege specification
root ALL=(ALL) ALL
User_Alias VI_JR_ADMINS=esxoper, esxoper2
User_Alias VI_ADMINS=esxadmin
Cmnd_Alias STOP=/usr/sbin/shutdown, /usr/sbin/halt, /usr/sbin/poweroff
Cmnd_Alias REBOOT=/usr/sbin/reboot
Cmnd_Alias KILL=/usr/bin/kill
Cmnd_Alias NTP=/usr/sbin/ntpdate, /sbin/hwclock
VI_JR_ADMINS ALL=STOP, REBOOT, KILL, NTP
VI_ADMINS ALL=(ALL) ALL
...
Logging
It is recommended to compress log files and increase their maximum size by modifying the configuration files in the /etc/logrotate.d directory and the /etc/logrotate.conf file.
For example, changing vmkwarning to be 2096k in size, and compressed...
[root@dtcp-esxsvce01b root]# more /etc/logrotate.d/vmkwarning
/var/log/vmkwarning{
    create 0600 root root
    missingok
    compress
    size 2096k
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}
...and changing the relevant part of /etc/logrotate.conf to allow compression...
...
# uncomment this if you want your log files compressed
compress
...
Finally, it's worth redirecting sudo log activity to /var/log/sudolog, see the above section on sudo.
Banners
There are three modes of direct management access to an ESX host: web, SSH, and direct (local) console.
Web Access
Edit the html page /usr/lib/vmware/hostd/docroot/index.html
SSH
Edit the /etc/ssh/sshd_config file so that it knows to display a defined banner file during login...
Banner /etc/banner
Create the banner file with the appropriate contents.
Console
Prepend your banner to the /etc/issue file
ESX
Network Settings
Setting | Default | Preferred | Explanation |
---|---|---|---|
Promiscuous Mode | Reject | Reject | Principally used where you need to capture (sniff) network traffic. Data from all ports propagates to all ports (the VM port group behaves as a hub rather than a switch) |
MAC address changes | Accept | Reject | There are situations where setting MAC Address Changes to Accept is required, for example legacy applications, clustered environments, and licensing. Legacy applications may require a specific MAC address to be used for the application. Microsoft Clusters utilize an artificial MAC address for all servers in the cluster |
Forged Transmits | Accept | Reject | The setting affects traffic transmitted from a virtual machine. If this option is set to reject, the virtual switch compares the source MAC address being transmitted by the operating system with the effective MAC address for its virtual network adapter to see if they are the same. If the MAC addresses are different, the virtual switch drops the frame. The guest operating system will not detect that its virtual network adapter cannot send packets using the different MAC address. To protect against MAC address impersonation, all virtual switches should have forged transmissions set to reject |
Procedures
Quick commands
vmware -v | ESX software version and build |
Security
Password Complexity Override
To change a user (or root) password to one that would fail the password complexity checks:
- Disable PAM module
esxcfg-auth --usepamqc -1 -1 -1 -1 -1 -1
- Disable complexity checker
esxcfg-auth --usecrack -1 -1 -1 -1 -1 -1
- Change password
- Re-enable PAM module
esxcfg-auth --usepamqc=-1 -1 -1 -1 8 8
Regenerate Certificate
You might need to regenerate certificates if you:
- Change the ESX host name
- Accidentally delete the certificates
To generate new certificates for the ESX Server host...
- Change directory to /etc/vmware/ssl.
- Create backups of any existing certificates:
mv rui.crt orig.rui.crt
mv rui.key orig.rui.key
- Restart the vmware-hostd process:
service mgmt-vmware restart
- Confirm that the ESX Server host generated new certificates by running the following command and comparing the time stamps of the new certificate files with those of orig.rui.crt and orig.rui.key:
ls -la
HBA and SAN Operations
HBAnyware Installation
- Download the Driver and Application kit for VMware from Emulex's website.
- At the time of writing, the current version of the package was elxvmwarecorekit-esx35-4.0a45-1.i386.rpm
- Copy the package to the server
- EG
pscp -pw [password] elxvmwarecorekit-esx35-4.0a45-1.i386.rpm platadmn@dtcp-esxsvce01a:/home/platadmn
- Install the package
- EG
rpm -ivh elxvmwarecorekit-esx35-4.0a45-1.i386.rpm
HBA Firmware Upgrade
Requires HBAnyware to be installed first, see HBAnyware Installation for further info.
- Download the correct firmware version from Emulex's website
- EG for LPe11002s
- Extract, and copy the file to the server
- Find the adapters' WWPNs
- EG
/usr/sbin/hbanyware/hbacmd ListHBAs
- Download the new firmware version to each HBA
- EG
/usr/sbin/hbanyware/hbacmd download 10:00:00:00:c9:82:97:9e zf280a4.all
EMCgrab Collection
- Download the correct version from EMC's website
- At the time of writing, the current version of the file was emcgrab_ESX_v1.1.tar
- Copy to server
- EG
pscp emcgrab_ESX_v1.1.tar platadmn@dtcp-esxsvce02a:/home/platadmn
- Extract the archive
- EG
tar -xvf emcgrab_ESX_v1.1.tar
- Run grab (can take a few minutes, best done out of hours)
- EG
./emcgrab.sh
- Results can be found in the \emcgrab\outputs folder
Troubleshooting
CPU
Poor performance
If VMs are performing sluggishly and/or are slow to start, use esxtop on the ESX service console. Look at Ready Time (%RDY), which is how long a VM is waiting for CPUs to become available. This can creep up if the system is pushed, or if the VM has multiple CPUs (as it needs multiple physical CPUs to become available at the same time).
Ideally %RDY should be below 5%, though below 10% is normally acceptable. Anything above 15% is bad.
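The thresholds above can be expressed as a small helper for triaging esxtop readings. This is a sketch only: `classify_ready` is a hypothetical function name, and the "borderline" label for the 10% to 15% band is an assumption, since the guidance above does not say how to treat that range.

```python
def classify_ready(rdy_percent):
    """Classify a VM's %RDY value using the rule-of-thumb thresholds."""
    if rdy_percent < 0:
        raise ValueError("%RDY cannot be negative")
    if rdy_percent < 5:
        return "ideal"
    if rdy_percent < 10:
        return "acceptable"
    if rdy_percent <= 15:
        # Assumption: the text gives no verdict for 10-15%, so flag it for review.
        return "borderline"
    return "bad"


if __name__ == "__main__":
    for value in (3.2, 7.5, 12.0, 22.4):
        print(f"%RDY {value:5.1f} -> {classify_ready(value)}")
```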
Storage
Poor throughput
Use esxtop on the service console and switch to the disk monitor. Enable the latency views; you will see values such as GAVG, KAVG and DAVG.
- GAVG is the total latency on IO commands, averaged over 2 seconds
- KAVG is the hypervisor IO latency, averaged over 2 seconds
- DAVG is the IO latency of everything outside the ESX server, averaged over 2 seconds
Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO.
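Since GAVG is the total and is roughly the sum of KAVG and DAVG, comparing the two components shows where the delay is being introduced. The sketch below is illustrative only; `dominant_latency` is a hypothetical helper, and the values passed to it are made-up millisecond readings rather than real esxtop output.

```python
def dominant_latency(kavg_ms, davg_ms):
    """Return (approximate GAVG, which side contributes most of the latency)."""
    if kavg_ms < 0 or davg_ms < 0:
        raise ValueError("latencies cannot be negative")
    gavg_ms = kavg_ms + davg_ms  # total latency is roughly KAVG + DAVG
    if kavg_ms > davg_ms:
        source = "hypervisor"
    elif davg_ms > kavg_ms:
        source = "storage"
    else:
        source = "equal"
    return gavg_ms, source


if __name__ == "__main__":
    # Hypothetical readings: 1.5 ms in the hypervisor, 28 ms outside the ESX server.
    total, source = dominant_latency(1.5, 28.0)
    print(f"GAVG ~ {total} ms, dominated by: {source}")
```

A high DAVG points at the HBA, fabric or array, whereas a high KAVG points at queuing inside the ESX host.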
Storage Monitor Log Entries
How to decode the following type of entries...
Sep 3 15:15:14 tfukesxent1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
Sep 3 15:15:32 tfukesxent1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1
The status message consists of the following four decimal and hex blocks...
Device Status / Host Status | Sense Key | Additional Sense Code | Additional Sense Code Qualifier |
Where the ESX device and SAN host statuses mean...
Decimal | Device Status | Host Status | Comments |
---|---|---|---|
0 | No Errors | Host_OK | |
1 | | Host_No_Connect | |
2 | Check Condition | Host_Bus_Busy | |
3 | | Host_Timeout | |
4 | | Host_Bad_Target | |
5 | | Host_Abort | |
6 | | Host_Parity | |
7 | | Host_Error | |
8 | Device Busy | Host_Reset | |
9 | | Host_Bad_INTR | |
10 | | Host_PassThrough | |
11 | | Host_Soft_Error | |
24 | Reservation Conflict | | 24/0 indicates a locking error, normally caused by too many ESX hosts mounting a LUN, wrong config on the storage array, or too many VMs on a LUN |
Where the Sense Key means...
Hex | Sense Key |
---|---|
0x0 | No Sense Information |
0x1 | Last command completed but used error correction |
0x2 | Unit Not Ready |
0x3 | Medium Error |
0x4 | Hardware Error |
0x5 | ILLEGAL_REQUEST (Passive SP) |
0x6 | LUN Reset |
0x7 | Data_Protect - Access to data is blocked |
0x8 | Blank_Check - Reached an unexpected region |
0xa | Copy_Aborted |
0xb | Aborted_Command - Target aborted command |
0xc | Comparison for SEARCH DATA unsuccessful |
0xd | Volume_Overflow - Medium is full |
0xe | Source and Data on Medium do not agree |
The Additional Sense Code and Additional Sense Code Qualifier mean...
Hex | Sense Code |
---|---|
0x4 | Unit Not Ready |
0x3 | Unit Not Ready - Manual Intervention Required |
0x2 | Unit Not Ready - Initializing Command Required |
0x29 | Device Power on or SCSI Reset |
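The decoding described above can be sketched as a small parser. This is an illustration under stated assumptions: `decode_storage_monitor` and the lookup dictionaries are hypothetical names, the dictionaries hold only a subset of the table entries above for brevity, and the regular expression assumes the log line layout shown in the sample entries.

```python
import re

# Abbreviated lookup tables (a subset of the entries listed above).
DEVICE_STATUS = {0: "No Errors", 2: "Check Condition",
                 8: "Device Busy", 24: "Reservation Conflict"}
HOST_STATUS = {0: "Host_OK", 3: "Host_Timeout", 5: "Host_Abort",
               7: "Host_Error", 8: "Host_Reset"}
SENSE_KEY = {0x0: "No Sense Information", 0x2: "Unit Not Ready",
             0x3: "Medium Error", 0x4: "Hardware Error", 0x6: "LUN Reset"}

# device status / host status are decimal, the remaining three blocks are hex
STATUS_RE = re.compile(
    r"StorageMonitor: \d+: (?P<device>\S+) status = "
    r"(?P<dev>\d+)/(?P<host>\d+) "
    r"(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)"
)


def decode_storage_monitor(line):
    """Decode one StorageMonitor log line, or return None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    dev, host = int(match.group("dev")), int(match.group("host"))
    key = int(match.group("key"), 16)
    return {
        "device": match.group("device"),
        "device_status": DEVICE_STATUS.get(dev, f"unknown ({dev})"),
        "host_status": HOST_STATUS.get(host, f"unknown ({host})"),
        "sense_key": SENSE_KEY.get(key, f"unknown ({key:#x})"),
        "asc": int(match.group("asc"), 16),
        "ascq": int(match.group("ascq"), 16),
    }


if __name__ == "__main__":
    sample = ("Sep 3 15:15:14 tfukesxent1 vmkernel: 85:01:23:01.532 "
              "cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 "
              "status = 2/0 0x6 0x2a 0x1")
    print(decode_storage_monitor(sample))
```

Codes that are missing from the abbreviated dictionaries are reported as "unknown" together with their raw value, so they can still be looked up in the tables by hand.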