Difference between revisions of "Installation (ESX)"

Revision as of 10:47, 16 October 2009

Build Notes

Security Hardening

Service Console

Disk Partitions

Suggested partition sizing for the Service Console on local disk, to prevent the root partition being filled with user data:

part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
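
As a quick sanity check before building, the partition sizes can be totalled to confirm that the local disk is large enough. The following is only a sketch: the helper name partition_sizes is hypothetical, and it assumes every line carries a --size value in MB (the kickstart unit).

```python
# Kickstart partition lines from the section above.
KICKSTART = """\
part /boot --fstype ext3 --size 1024 --ondisk=sda --asprimary
part / --fstype ext3 --size 5120 --ondisk=sda --asprimary
part swap --size 2048 --ondisk=sda --asprimary
part /var --fstype ext3 --size 5120 --ondisk=sda
part /tmp --fstype ext3 --size 5120 --ondisk=sda
part /home --fstype ext3 --size 2048 --ondisk=sda
part None --fstype vmkcore --size 100 --ondisk=sda
"""


def partition_sizes(text):
    """Map each mount point to its --size value (MB). Hypothetical helper."""
    sizes = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "part":
            continue  # ignore anything that is not a partition line
        mount_point = tokens[1]
        sizes[mount_point] = int(tokens[tokens.index("--size") + 1])
    return sizes


if __name__ == "__main__":
    sizes = partition_sizes(KICKSTART)
    print("Total MB required:", sum(sizes.values()))  # 20580
```

With the sizes suggested here the total comes to 20580 MB, so roughly 20 GB of local disk is needed before any local VMFS volume is considered.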

Local Accounts

Password Policy

No password policy is implemented by default. If you are not using AD Integration, it is sensible to apply a policy on the ESX host using the PAMQC module, although it is not particularly elegant.

Active Directory Integration

Because service console authentication is Unix-based, it cannot use Active Directory to define user accounts. However, it can use Active Directory to authenticate users by matching account names in the local passwd file against Active Directory, given appropriate SFU (Services For Unix) support.

See Scott Lowe's blog for further info

Sudo

It is possible to limit the enhanced privileges that a user can gain by using sudo. This is most appropriate where there are a large number of admins. However, such an environment is also likely to have a large number of ESX hosts, and managing the sudo config on each one is a headache.

Example of possible sudo config (/etc/sudoers)

...
# Defaults specification
Defaults logfile=/var/log/sudolog

# User privilege specification
root    ALL=(ALL) ALL
User_Alias VI_JR_ADMINS=esxoper, esxoper2
User_Alias VI_ADMINS=esxadmin

Cmnd_Alias STOP=/usr/sbin/shutdown, /usr/sbin/halt, /usr/sbin/poweroff 
Cmnd_Alias REBOOT=/usr/sbin/reboot
Cmnd_Alias KILL=/usr/bin/kill 
Cmnd_Alias NTP=/usr/sbin/ntpdate, /sbin/hwclock 

VI_JR_ADMINS ALL=STOP, REBOOT, KILL, NTP
VI_ADMINS ALL=(ALL) ALL
...

Logging

It is recommended to compress log files and increase their maximum size by modifying the configuration files in the /etc/logrotate.d directory and the /etc/logrotate.conf file.

For example, changing vmkwarning to be 2096k in size, and compressed...

[root@dtcp-esxsvce01b root]# more /etc/logrotate.d/vmkwarning
/var/log/vmkwarning {
    create 0600 root root
    missingok
    compress
    size 2096k
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

...and changing the relevant part of /etc/logrotate.conf to allow compression...

...
# uncomment this if you want your log files compressed
compress

...

Finally, it's worth redirecting sudo log activity to /var/log/sudolog, see the above section on sudo.

Banners

There are three modes of direct management access to an ESX host: web, SSH, and direct (local) console.

Web Access

Edit the HTML page /usr/lib/vmware/hostd/docroot/index.html

SSH

Edit the /etc/ssh/sshd_config file so that it knows to display a defined banner file during login...

Banner /etc/banner

Create the banner file with the appropriate contents.

Console

Prepend your banner to the /etc/issue file

ESX

Network Settings

Setting | Default | Preferred | Explanation
Promiscuous Mode | Reject | Reject | Principally used in situations where you need to perform a network traffic (sniff) capture. Data from all ports propagates to all ports (the VM port group becomes a hub rather than a switch).
MAC address changes | Accept | Reject | There are situations where MAC Address Changes must be set to Accept, for example legacy applications, clustered environments, and licensing. Legacy applications may require a specific MAC address to be used for the application. Microsoft Clusters utilize an artificial MAC address for all servers in the cluster.
Forged Transmits | Accept | Reject | This setting affects traffic transmitted from a virtual machine. If it is set to Reject, the virtual switch compares the source MAC address being transmitted by the operating system with the effective MAC address for its virtual network adapter to see if they are the same. If the MAC addresses are different, the virtual switch drops the frame. The guest operating system will not detect that its virtual network adapter cannot send packets using the different MAC address. To protect against MAC address impersonation, all virtual switches should have Forged Transmits set to Reject.
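
The Forged Transmits check described above can be modelled in a few lines. This is purely an illustration of the comparison, not VMware code: the function names and the example MAC addresses are made up for the sketch.

```python
def normalise_mac(mac):
    """Lower-case a MAC address and use ':' separators so that
    differently formatted copies of the same address compare equal."""
    return mac.replace("-", ":").lower()


def vswitch_forwards(source_mac, effective_mac, forged_transmits="accept"):
    """Model of the Forged Transmits decision (hypothetical helper).

    source_mac    -- source MAC in the frame sent by the guest OS
    effective_mac -- effective MAC of the VM's virtual network adapter
    """
    if forged_transmits == "accept":
        return True  # no comparison is made, every frame is forwarded
    # "reject": forward only when the two addresses are the same
    return normalise_mac(source_mac) == normalise_mac(effective_mac)


if __name__ == "__main__":
    effective = "00:50:56:AA:BB:CC"  # example address, not from a real host
    print(vswitch_forwards("00:50:56:aa:bb:cc", effective, "reject"))
    print(vswitch_forwards("00:11:22:33:44:55", effective, "reject"))
```

In the second call the frame would be dropped, and as noted in the table the guest operating system receives no indication that this has happened.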

Procedures

Quick commands

vmware -v | ESX software version and build

Security

Password Complexity Override

To change a user (or root) password to one that fails password complexity checking:

  1. Disable PAM module
    • esxcfg-auth --usepamqc -1 -1 -1 -1 -1 -1
  2. Disable complexity checker
    • esxcfg-auth --usecrack -1 -1 -1 -1 -1 -1
  3. Change password
  4. Re-enable PAM module
    • esxcfg-auth --usepamqc=-1 -1 -1 -1 8 8

Regenerate Certificate

You might need to regenerate certificates if you:

  • Change ESX host name
  • Accidentally delete the certificates

To generate new Certificates for the ESX Server host...

  1. Change directories to /etc/vmware/ssl.
  2. Create backups of any existing certificates:
    • mv rui.crt orig.rui.crt
    • mv rui.key orig.rui.key
  3. Restart the vmware-hostd process:
    • service mgmt-vmware restart
  4. Confirm that the ESX Server host generated new certificates by executing the following command and comparing the time stamps of the new certificate files with those of orig.rui.crt and orig.rui.key
    • ls -la


HBA and SAN Operations

HBAnyware Installation

  1. Download the Driver and Application kit for VMware from Emulex's website.
    • At time of writing the current version of the package was elxvmwarecorekit-esx35-4.0a45-1.i386.rpm
  2. Copy the package to the server
    • EG pscp -pw [password] elxvmwarecorekit-esx35-4.0a45-1.i386.rpm platadmn@dtcp-esxsvce01a:/home/platadmn
  3. Install the package
    • EG rpm -ivh elxvmwarecorekit-esx35-4.0a45-1.i386.rpm

HBA Firmware Upgrade

Requires HBAnyware to be installed first, see HBAnyware Installation for further info.

  1. Download the correct firmware version from Emulex's website
  2. Extract, and copy file to server
  3. Find the adapters' WWPNs
    • EG /usr/sbin/hbanyware/hbacmd ListHBAs
  4. Download the new firmware version to each HBA
    • EG /usr/sbin/hbanyware/hbacmd download 10:00:00:00:c9:82:97:9e zf280a4.all

EMCgrab Collection

  1. Download the correct version from EMC's website
  2. Copy to server
    • EG pscp emcgrab_ESX_v1.1.tar platadmn@dtcp-esxsvce02a:/home/platadmn
  3. Uncompress the file
    • EG tar -xvf emcgrab_ESX_v1.1.tar
  4. Run grab (can take a few minutes, best done out of hours)
    • EG ./emcgrab.sh
  5. Results can be found in the emcgrab/outputs folder

Troubleshooting

Storage

Poor throughput

Use esxtop on the service console and switch to the disk monitor. Enable the latency views to see values such as GAVG, KAVG and DAVG.

  • GAVG is the total latency on IO commands, averaged over 2 seconds
  • KAVG is the hypervisor IO latency, averaged over 2 seconds
  • DAVG is the IO latency of everything outside the ESX server, averaged over 2 seconds

Latency occurs when the hypervisor or physical storage cannot keep pace with the demand for IO.
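
Since GAVG is the total and KAVG and DAVG are its hypervisor and external components, comparing the two components gives a rough indication of where the latency is accumulating. The sketch below assumes GAVG is approximately KAVG + DAVG; the function name and the sample figures are hypothetical, and it sets no thresholds for what counts as "poor".

```python
def latency_breakdown(kavg_ms, davg_ms):
    """Return (gavg_ms, dominant_source) for one esxtop sample.

    kavg_ms -- hypervisor IO latency (KAVG), in milliseconds
    davg_ms -- latency outside the ESX server (DAVG), in milliseconds
    """
    gavg_ms = kavg_ms + davg_ms  # assumed relationship: GAVG ~ KAVG + DAVG
    if kavg_ms > davg_ms:
        source = "hypervisor"
    elif davg_ms > kavg_ms:
        source = "outside the ESX server"
    else:
        source = "evenly split"
    return gavg_ms, source


if __name__ == "__main__":
    # Made-up sample: most of the delay sits beyond the ESX server
    # (HBA, fabric or array), so that is the place to look first.
    print(latency_breakdown(kavg_ms=2.0, davg_ms=18.0))
```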


Storage Monitor Log Entries

How to decode the following type of entries...

Sep  3 15:15:14 tfukesxent1 vmkernel: 85:01:23:01.532 cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1
Sep  3 15:15:32 tfukesxent1 vmkernel: 85:01:23:19.391 cpu4:2253)StorageMonitor: 196: vmhba1:3:9:0 status = 2/0 0x6 0x2a 0x1

The status message consists of the following four decimal and hex blocks...

Device Status / Host Status | Sense Key | Additional Sense Code | Additional Sense Code Qualifier

Where the ESX Device and SAN Host statuses mean...

Decimal | Device Status | Host Status | Comments
0 | No Errors | Host_OK |
1 | | Host_No_Connect |
2 | Check Condition | Host_Busy_Busy |
3 | | Host_Timeout |
4 | | Host_Bad_Target |
5 | | Host_Abort |
6 | | Host_Parity |
7 | | Host_Error |
8 | Device Busy | Host_Reset |
9 | | Host_Bad_INTR |
10 | | Host_PassThrough |
11 | | Host_Soft_Error |
24 | Reservation Conflict | | 24/0 indicates a locking error, normally caused by too many ESX hosts mounting a LUN, wrong config on the storage array, or too many VMs on a LUN

Where the Sense Key values mean...

Hex | Sense Key
0x0 | No Sense Information
0x1 | Last command completed but used error correction
0x2 | Unit Not Ready
0x3 | Medium Error
0x4 | Hardware Error
0x5 | ILLEGAL_REQUEST (Passive SP)
0x6 | LUN Reset (SCSI Unit Attention)
0x7 | Data_Protect - Access to data is blocked
0x8 | Blank_Check - Reached an unexpected region
0xa | Copy_Aborted
0xb | Aborted_Command - Target aborted command
0xc | Comparison for SEARCH DATA unsuccessful
0xd | Volume_Overflow - Medium is full
0xe | Source and Data on Medium do not agree

Where the Additional Sense Code and Additional Sense Code Qualifier mean...

Hex | Sense Code
0x4 | Unit Not Ready
0x3 | Unit Not Ready - Manual Intervention Required
0x2 | Unit Not Ready - Initializing Command Required
0x29 | Device Power on or SCSI Reset
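
The lookups above can be scripted when there are many entries to work through. The following is a minimal sketch rather than a complete decoder: the dictionaries hold only the values listed in the tables above, the function name decode_storage_monitor is hypothetical, and the regular expression assumes the log line layout of the two sample entries. Because the sense code table does not separate codes from qualifiers, the sketch looks up only the Additional Sense Code and returns the qualifier as a raw number.

```python
import re

# Values transcribed from the tables above; anything else decodes as UNKNOWN.
DEVICE_STATUS = {0: "No Errors", 2: "Check Condition", 8: "Device Busy",
                 24: "Reservation Conflict"}
HOST_STATUS = {0: "Host_OK", 1: "Host_No_Connect", 2: "Host_Busy_Busy",
               3: "Host_Timeout", 4: "Host_Bad_Target", 5: "Host_Abort",
               6: "Host_Parity", 7: "Host_Error", 8: "Host_Reset",
               9: "Host_Bad_INTR", 10: "Host_PassThrough",
               11: "Host_Soft_Error"}
SENSE_KEY = {0x0: "No Sense Information", 0x2: "Unit Not Ready",
             0x3: "Medium Error", 0x4: "Hardware Error",
             0x5: "ILLEGAL_REQUEST", 0x6: "LUN Reset",
             0x7: "Data_Protect", 0x8: "Blank_Check", 0xa: "Copy_Aborted",
             0xb: "Aborted_Command", 0xd: "Volume_Overflow"}
SENSE_CODE = {0x4: "Unit Not Ready", 0x29: "Device Power on or SCSI Reset"}
UNKNOWN = "unknown (not in the tables above)"

# Assumed layout: "... StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1"
PATTERN = re.compile(
    r"StorageMonitor: \d+: (?P<path>vmhba\S+) status = "
    r"(?P<dev>\d+)/(?P<host>\d+) (?P<key>0x[0-9a-fA-F]+) "
    r"(?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)"
)


def decode_storage_monitor(line):
    """Decode one StorageMonitor entry; return None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    asc = int(match.group("asc"), 16)
    return {
        "path": match.group("path"),
        "device_status": DEVICE_STATUS.get(int(match.group("dev")), UNKNOWN),
        "host_status": HOST_STATUS.get(int(match.group("host")), UNKNOWN),
        "sense_key": SENSE_KEY.get(int(match.group("key"), 16), UNKNOWN),
        "asc": asc,
        "ascq": int(match.group("ascq"), 16),
        "sense_code": SENSE_CODE.get(asc, UNKNOWN),
    }


if __name__ == "__main__":
    sample = ("Sep  3 15:15:14 tfukesxent1 vmkernel: 85:01:23:01.532 "
              "cpu4:2264)StorageMonitor: 196: vmhba1:2:0:0 status = 2/0 0x6 0x2a 0x1")
    for field, value in decode_storage_monitor(sample).items():
        print(field, "=", value)
```

For the first sample entry this gives a device status of Check Condition, a host status of Host_OK and a sense key of LUN Reset. The additional sense code 0x2a is not covered by the table above, so it is reported as unknown rather than guessed.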