Troubleshooting (Virtual Machine)

Can't Connect to VM Console

Error connecting: Cannot connect to host... or Can't connect to MKS...

This is caused by a TCP connection failure to the ESX server the VM is hosted on. Using telnet or a port test utility, confirm you can connect on both TCP 902 and 443 from your machine to the ESX server.
If the problem is affecting a single ESX that previously worked, restart the management services on that ESX.
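
Where telnet isn't to hand, the port test can be sketched as below. This is a minimal sketch, assuming a machine with bash (for its /dev/tcp pseudo-device) and the coreutils timeout command; port_status is a hypothetical helper name and esx01.example.com is a placeholder host name.

```shell
# port_status is a hypothetical helper: prints "open" if a TCP connection
# to host $1 on port $2 succeeds within 3 seconds, otherwise "closed"
# ("closed" also covers filtered ports, timeouts and unresolvable names).
port_status() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Example (placeholder host name):
#   for port in 902 443; do
#       echo "TCP $port: $(port_status esx01.example.com "$port")"
#   done
```

If either port reports closed, check firewalls between your machine and the ESX before restarting anything.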

Can't Deploy VM

The VirtualCenter server is unable to decrypt passwords stored in the customization specification

Bizarrely, this is caused by the VirtualCenter server running out of disk space. Free up some space and all will be well.

A general system error occurred: Failed to create journal file provider

Check ESX disks are not full

Customization of the guest operating system 'winLonghornGuest' is not supported in this configuration. Microsoft Vista (TM) and Linux guests with Logical Volume Manager are supported only for recent ESX host and VMware Tools versions.

Caused by trying to deploy a guest-customised Windows 2008 template where the guest OS of the source template is set to Windows 2008(!). Essentially Win2008 is only barely supported in ESX3.5. Setting the guest OS of the source machine to Vista should resolve this issue.
With Windows 2008 R2 templates the above fix has been seen not to work, in which case:
1. Deploy a clone (with no guest customisation)
2. Perform a Sysprep

Can't Start VM

HA Admission Control

Can't start the VM, as doing so wouldn't leave enough failover capacity to restart failed VMs should an ESX fail. Options are to
- Reduce resource usage of VMs that are already running
- Increase cluster capacity
- Reduce the cluster's failover capacity, or allow constraint violations
If no VMs have recently been added to the cluster, it's likely that the HA agent on one of the ESXs has stopped functioning, in which case one of the ESXs within the cluster will have a red warning/exclamation triangle. If so, you can restart HA on that ESX:
1. Highlight this ESX; on the Summary tab you should see a notice regarding HA problems
2. Run the Reconfigure for HA command; this will re-install the HA agent on the ESX

Failed to relocate virtual machine

DRS is attempting to relocate the VM at power up, and the relocation is failing
- Reattempt to power on machine
- Manually migrate to a less loaded ESX and reattempt power on

Access to VMFS storage

ESX may have lost connectivity to VMFS partition on which VM resides

VMFS full

If VMFS is full, the ESX won't be able to write to the VM's logs when it starts it up, causing VM start-up to fail

ESX licensing

Either ESX isn't licensed, or has lost contact with the license server (VI3) for a long period of time

Waiting for question to be answered

Generally after changes (such as cold migrations or new deployments), a VM may need to have a question answered before it can continue to power on

Could not power on VM: No swap file. Failed to power on VM

The ESX you're starting the VM up on can't get proper access to the VM's files, either because
- The VM is already powered up on another ESX
- The VM is already powered up (but shows as down on the VI Client)
- The VM's files have been corrupted / locked

Is the VM actually powered off?
- If the VM responds to ping and RDP/VNC/SSH etc (as appropriate) then proceed to VM is Powered On, but appears Powered Off
Has an ESX recently failed?
- If the ESX the virtual is/was on has recently failed and HA's isolation response is set to leave powered-on, then it's possible that only the ESX's network connections have failed, and the virtual machines are still running on the ESX but are isolated from the network.
  - To cause a full HA failover, pull the power cables out of the ESX to kill it completely
  - Alternatively, attempt to restore network connectivity to allow the VMs to be reachable again
- If the ESX the virtual is/was on has recently failed, it's possible that the file lock times have not yet expired (or are being kept updated).
  - If you're able to get a console onto the failed ESX, ensure it has fully failed (powered off or PSOD). If not, power it off, in case it has failed enough to stop its VMs running but not enough to stop it updating the file locks. HA will restart the VM if it's still a very recent failure, else restart the VM manually.

If there have been no ESX failures, then the VM's files may be corrupted. The VM can be re-registered by removing and re-adding it to the inventory, but the re-add may fail if the wrong files are corrupted. To investigate corruption further...

To test whether the ESX should be able to lock the VM's files, use touch. Within the VM's directory, run touch *.vswp
- If it succeeds, retry the power on
- If you get device or resource busy, the VM is probably owned by another ESX - find that ESX!
- If you get Invalid argument, the file can't be accessed (eg corrupt or other storage problem)
It's also worth doing a touch on the following files; if they are accessible then the VM may be recoverable. To work around the .vswp issue, remove the reference to the file in the .vmx config file
- touch *.vmx
- touch *flat.vmdk
- touch *delta.vmdk
- touch vmware.log

For further info see - VMware KB10051 - Virtual machine does not power on because of missing or locked files
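
The touch tests above can be run in one pass. This is a minimal sketch, assuming a POSIX shell on the ESX console; probe_vm_files is a hypothetical helper name, and the path in the usage comment is a placeholder. It keeps the error text from touch, since the difference between device or resource busy and Invalid argument is what you're looking for.

```shell
# probe_vm_files is a hypothetical helper: touches each swap, config, disk
# and log file in the VM directory $1 and reports the result per file.
probe_vm_files() {
    for f in "$1"/*; do
        # Only test the file types listed above; skip everything else
        case "$f" in
            *.vswp|*.vmx|*flat.vmdk|*delta.vmdk|*/vmware.log) ;;
            *) continue ;;
        esac
        if err=$(touch "$f" 2>&1); then
            echo "OK      $f"
        else
            echo "FAILED  $f ($err)"
        fi
    done
}

# Example (placeholder path):
#   probe_vm_files /vmfs/volumes/datastore1/MyVM
```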

Cannot open the disk '/vmfs/volumes/.../MyVM-000001.vmdk' or one of the snapshot disks it depends on...

Cannot open the disk '/vmfs/volumes/.../MyVM-000001.vmdk' or one of the snapshot disks it depends on. Reason: The parent virtual disk has been modified since the child was created

The ESX can't work out the chain of vmdk's that make up the VM's disks, most likely because
- Snapshot CID chain is corrupted

You need to establish the chain of files. Start by looking at the vmx file to work out the top vmdk, then track back through them until you get to the base disk.
- Any vmdk files not referenced in this chain are erroneous and can be deleted (or better, moved to a temporary sub-folder)
- Any delta file <= 16MB is effectively empty and can be skipped
Then
1. Display the stored CIDs and work out their correct order
   EG  grep CID My-VM.vmdk My-VM-00000[1-9].vmdk
2. Edit the vmdk descriptor files to correct the CID chain
3. Start the VM and confirm it's working as expected
4. Create a new temporary snapshot, then remove it to clear them up
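
In a healthy chain, each child descriptor's parentCID equals the CID of the disk it depends on, and the base disk's parentCID is ffffffff. The check for one link can be sketched as below. It is a minimal sketch, assuming the small text descriptor files (not the -delta or -flat data files); cid_link_ok is a hypothetical helper name.

```shell
# cid_link_ok is a hypothetical helper.
# $1 = parent descriptor (eg My-VM.vmdk), $2 = child descriptor (eg My-VM-000001.vmdk)
# Succeeds only if the child's parentCID matches the parent's CID.
cid_link_ok() {
    parent_cid=$(sed -n 's/^CID=//p' "$1" | tr -d '\r')
    child_parent_cid=$(sed -n 's/^parentCID=//p' "$2" | tr -d '\r')
    if [ -n "$parent_cid" ] && [ "$parent_cid" = "$child_parent_cid" ]; then
        echo "OK: $2 parentCID=$child_parent_cid matches $1 CID=$parent_cid"
    else
        echo "MISMATCH: $2 parentCID=$child_parent_cid, but $1 CID=$parent_cid"
        return 1
    fi
}

# Example, working up the chain one link at a time:
#   cid_link_ok My-VM.vmdk        My-VM-000001.vmdk
#   cid_link_ok My-VM-000001.vmdk My-VM-000002.vmdk
```

Where a link is reported as a mismatch, that child's parentCID is the value to correct (take a copy of the descriptor first).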

General system error occurred...

A general system error occurred: The system returned an error. Communication with the virtual machine might have been interrupted.

This error generally seems to occur when the ESX is having trouble launching the VM's processes, sometimes because it's having trouble reading the VM's VMX file.
If the problem is erratically affecting one or more VMs, it's likely that the ESX's hostd process is struggling a bit, in which case restart the ESX management agents.
If the problem is continually affecting one (or possibly more) VMs, the VM's config file may be corrupted, or storage may be experiencing problems.

Can't Stop / Power-Off a VM

This normally occurs because you've lost management (VI Client) access to the ESX, or the ESX doesn't appear to be aware that it's running the VM, but it is (so appears Inaccessible via the VI Client).  If you have access to the VM via the VI Client but can't power off, it'll probably be a permissions issue.  There is no way to gracefully shut down a VM without access via the VI Client (or direct access to the VM via RDP, VNC, etc).

1. SSH to the ESX you believe the VM is running on
2. Find the path to the VM's config file
   EG  vmware-cmd -l | grep VM_Name
   If the VM is not listed, the VM isn't registered to that ESX
3. Instruct the ESX to power off the VM using the VMX path already found
   EG  vmware-cmd /path/to/VM_Name.vmx stop

If the above fails, you'll need to get a bit more forceful...

1. Find the PID of the VM
   EG  ps -auxwww | grep VM_Name
2. Kill the VM using the PID found (make sure you've got the right PID, you could kill the ESX by mistake!)
   EG  kill -9 1234

VM is Powered On, but appears Powered Off

The VM responds to ping and RDP/VNC/SSH etc (as appropriate) but is showing as down in the VI Client.  Also see Confirm VM's Status on ESX

Restart the management agents on the ESX and recheck

If that doesn't improve matters...

1. Find the location of the vmx file for the VM (so it can be re-added to the inventory)
2. Connect a VI Client to the ESX and unregister the VM (remove from inventory)
3. Restart the management agents on the ESX
4. Re-add the VM to the inventory

If running ESX4i see VMware KB 1033591 - Virtual machine appears powered off after restarting the management services on the host, but note that...

- vMotion all powered-on VMs off the affected ESX first
- Recover 1 VM at a time, and vMotion it off as soon as it is recovered (it may disappear when recovering the next VM)
- Recovered VMs may end up with a state of Unknown on vCenter and ESX, in which case, remove from ESX inventory and re-add
- Restart the ESX once all are recovered

Can't VMotion a VM

VM network doesn't exist at destination

VM is using a particular port group which doesn’t exist on the destination ESX

ESX / network too busy

VMotion can’t copy across the VM's memory contents/changes quickly enough.  An alternative is to use a Low Priority VMotion, which is more likely to succeed, but may result in the VM experiencing temporary freezes (avoids full OS downtime, but not without impact to hosted applications)

ESXs can't communicate

ESXs need to be able to communicate via VMotion network. DNS problems and FQDN inaccuracies can also cause problems

VM is connected to CD-ROM/ISO

The VM's CD-ROM is connected to an ISO file via the host ESX, tying it to that ESX

Can't Increase a VM's Disk

A general system error occurred: Internal error

Can be caused by existing snapshots running on a VM
Check the ESX logs / available disk space etc

Snapshots

Can't Create Snapshot

Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine

Prevents hot-cloning or snapshot based backup of a machine, because of either of...
- VMware Tools aren't properly installed
- The machine has high transactional IO (eg Exchange, SQL, AD) and cannot pause disk access in order to create the snapshot
See the following...
- VMware KB1003383 - 64bit Windows virtual machines generate errors when trying to use the VSS and NT backup
- VMware KB1007696 - Troubleshooting Volume Shadow Copy (VSS) quiesce related issues
- VMware KB1009073 - Unable to take a quiesced VMware snapshot of a virtual machine

Can't Delete/Commit Snapshot

If snapshot files are large then patience is of the essence. If possible, shut the VM down first, or at the very least limit activity on the VM.  To commit a snapshot in a running VM, first a new snapshot is started, then the original redo files are merged with the base disk(s), then the extra redo file is merged.

Operation timed-out

Not unusual for large (>10GB) redo files; the process continues and it's just vCenter reporting it as a time-out
- Check the VM's files for any activity (changes in disk sizes/timestamps). Speed is dependent on redo size, storage speed, ESX load and VM activity (if possible shut the VM down before removing the snapshot)
- Also see Is Snapshot Still Active?

No Snapshots Exist in Snapshot Manager (but still exist)

Can happen if a snapshot Delete (All) fails to complete properly (eg ESX pseudo-hangs and you restart the management agents)
1. Back up and then delete the VM's VMSD file
2. Start a new snapshot
3. In Snapshot Manager use Delete All (not Delete!)
4. If this fails, check the ESX log to see what went wrong

Is Snapshot Still Active?

- Check Snapshot Manager; if there are snapshots listed then there are still active snapshots
- Open up Datastore Browser to the VM's folder and see if any snapshot files exist; if not then there are no active snapshots
- Check the VM's VMX file; the VMDK filename(s) will be either a snapshot or a normal base disk file
  EG  scsi0:0.fileName = "MyVM-000001.vmdk"        ← Snapshot file (snapshot running)
  EG  scsi0:0.fileName = "MyVM-000001-delta.vmdk"  ← Snapshot file (snapshot running)
  EG  scsi0:0.fileName = "MyVM.vmdk"               ← Base disk file (no snapshot running)
  EG  scsi0:0.fileName = "MyVM-flat.vmdk"          ← Base disk file (no snapshot running)
If there are no snapshots running, but snapshot files exist, then the files can be deleted (if you're sure!)
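
The VMX check can be sketched as below. It is a minimal sketch, assuming SCSI disks only (as in the examples above) and the usual six-digit snapshot suffix; vmx_disk_state is a hypothetical helper name. A base disk whose own name happens to end in -NNNNNN would be misreported as a snapshot.

```shell
# vmx_disk_state is a hypothetical helper; $1 = path to the VM's .vmx file.
# Prints each SCSI disk entry, labelled "snapshot:" or "base:" by file name.
vmx_disk_state() {
    grep -i '^scsi[0-9]*:[0-9]*\.fileName' "$1" | while IFS= read -r line; do
        case "$line" in
            *-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk\"*|*-[0-9][0-9][0-9][0-9][0-9][0-9]-delta.vmdk\"*)
                echo "snapshot: $line" ;;
            *)
                echo "base:     $line" ;;
        esac
    done
}

# Example:
#   vmx_disk_state MyVM.vmx
```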

Revert to Snapshot Causes Trust Relationship Failure

When reverting a VM that is a member of a Windows domain to a snapshot you can get the following errors at boot up, or when trying to logon

The trust relationship between this workstation and the primary domain failed
Windows cannot connect to the domain, either because the domain controller is down or otherwise unavailable, or because your computer account was not found. Please try again later. If this message continues to appear, contact your system administrator for assistance.

The problem is caused by the VM's computer account, which the domain member (the snapshotted machine) uses to authenticate to the domain controller, having an invalid password.  Domain member servers periodically change the password they use to connect to the domain (by default every 30 days).  So if a VM is snapshotted and subsequently updates its computer account password, a revert to the snapshot restores the old, now invalid, password and the machine is unable to log in to the domain.

To resolve:
- The machine needs to be taken off the domain and put back on (you'll need a domain account with rights to do this)
- See Re-Add Server to Domain for further info
To prevent (see note below):
- Disable machine account password changes
  - On the domain member machine, set the registry value
    HKLM\SYSTEM\CurrentControlSet\Services\NetLogon\Parameters\DisablePasswordChange  to 1
- Reduce machine account password change frequency
  - On the domain member machine, set the registry value
    HKLM\SYSTEM\CurrentControlSet\Services\NetLogon\Parameters\MaximumPasswordAge  to a higher value (in days), eg 60

The prevention options reduce domain security!
They should only be actioned if you understand the risks and are not breaching any security policies that may be in force at your organisation.
If it's not a regular occurrence, it's probably best to just live with the problem, and resolve when required.  Snapshots should not be allowed to run for many days in normal operation, which means that the problem should not occur frequently in a well run environment.

Further reading...

http://blogs.msdn.com/b/mikekol/archive/2009/03/18/does-restoring-a-snapshot-break-domain-connectivity-here-s-why.aspx
http://www.petri.co.il/working-with-domain-member-virtual-machines-and-snapshots.htm

Can't Customise

Windows setup could not configure Windows to run on this computer's hardware

Windows could not complete the installation.  To install Windows on this computer, restart the installation.

The guest customisation is failing because either
- The virtual hardware has changed (especially disk type) since the original machine was created
- Sysprep can't customise the machine because it doesn't have administrator rights; this can occur where a DC's users have been offloaded to LDS

Can't Connect VM's NIC

When powering up a VM its network card becomes disconnected.  If you tick the Connected checkbox, the task completes without error, but the checkbox is unchecked again.
This can happen when exporting a VM from Lab Manager, or can be caused by config errors with vShield Zones (and there are probably other triggers as well).  You may notice either of the following in the VM's vmware.log file

vcpu-0| [msg.ethernet.e1000.openFailed] Failed to connect ethernet0.
vcpu-0| [msg.ethernet.openFailed] Failed to initialize ethernet0.

Filter config has been left on the VM's NIC, which is causing problems when trying to connect the vNIC to its portgroup.
In order to resolve, either...

- Replace the virtual NIC in the VM's config (remove the NIC, then re-add it)
- Manually remove the offending config lines from the VMX file
  1. Power off the VM, and identify where its VMX config file is, and what ESX it's on
  2. Remove the VM from the vCenter inventory
  3. SSH to the ESX, and edit the VM's VMX config file
  4. Remove any lines that reference a filter on the affected NIC, for example...
     ethernet0.filter1.param0 = "0x2000029"
     ethernet0.filter1.param1 = "3"
     ethernet0.filter1.name = "vsla-fence"
  5. Register the VM
  6. Verify the NIC's connected network, and reconnect the NIC
  7. Power the VM back up

For more info see VMware KB 1028151
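
The manual VMX edit can be sketched as below, for a VM that has already been powered off and removed from the inventory. It is a minimal sketch, assuming the filter lines follow the ethernetN.filterN. naming shown in the example lines; strip_nic_filters is a hypothetical helper name, and the path in the usage comment is a placeholder.

```shell
# strip_nic_filters is a hypothetical helper.
# $1 = path to the .vmx file, $2 = NIC name as used in the file (eg ethernet0)
# Keeps the original as <file>.bak and rewrites the file without filter lines.
strip_nic_filters() {
    cp "$1" "$1.bak" || return 1
    grep -v "^$2\.filter[0-9]*\." "$1.bak" > "$1"
    [ -s "$1" ]    # succeed only if the rewritten file is non-empty
}

# Example (placeholder path):
#   strip_nic_filters /vmfs/volumes/datastore1/MyVM/MyVM.vmx ethernet0
```

Compare the file with its .bak copy before re-registering the VM, to confirm that only the filter lines have gone.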

VMTools Automatic Cursor Release Not Working

The console automatic cursor release (which allows you to seamlessly switch focus from a VM console to your desktop by moving your mouse, avoiding having to use CTRL+ALT) sometimes doesn't work.  Seems to be more common with VMs deployed from templates/cloned from VMs.
To resolve...

1. Uninstall VM Tools
2. Reboot
3. Install VM Tools
4. Reboot

Confirm VM's Status on ESX

The following commands take you through confirming the status of a VM, as determined by the ESX

List the VMs the ESX believes it's running (with their world IDs), to check the ESX believes it's hosting the VM
 vm-support -x 
Get the VM's ID (vmid)
 vim-cmd vmsvc/getallvms | grep <VM name>
Get the state of the VM (as the ESX believes)
 vim-cmd vmsvc/power.getstate <vmid>
Check if the ESX has any running processes for the VM (in which case it's powered on, regardless of the above)
 ps | grep <VM name>

To check that a VM is being locked by the ESX you're on

Get the lock info for the VM's disk (use the 1st if there are several)
 vmkfstools -D <VM-name>-flat.vmdk
Pick out the MAC address from the lock info (78e7d192a548 in the example below)
List the NIC info for the ESX and compare; if the MAC address matches, this ESX holds the lock
 esxcfg-vmknic -l

Lock [type 10c00001 offset 72968192 v 470, hb offset 3985408
gen 583, mode 1, owner 4d2dcc7b-20fb6d90-2b80-78e7d192a548 mtime 25711553]
Addr <4, 151, 197>, gen 299, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 37580963840, nb 17688 tbz 0, cow 0, zla 3, bs 2097152
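
Picking the MAC address out of the lock info can be sketched as below. It is a minimal sketch, assuming the lock info is available as text in the format shown above (depending on ESX version it may be printed to the terminal or written to the vmkernel log); lock_owner_mac is a hypothetical helper name.

```shell
# lock_owner_mac is a hypothetical helper: reads lock info on stdin and
# prints the last 12 hex digits of the owner field as a colon-separated MAC.
lock_owner_mac() {
    sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\).*/\1/p' |
        sed 's/../&:/g; s/:$//'
}

# Example, using the owner line from the lock info above:
#   echo "gen 583, mode 1, owner 4d2dcc7b-20fb6d90-2b80-78e7d192a548 mtime 25711553]" | lock_owner_mac
#   prints 78:e7:d1:92:a5:48
```

The colon-separated form is easier to compare by eye against the MAC addresses in the NIC listing.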
