Troubleshooting and Performance Monitoring Virtualized Environments

From DocWiki

Revision as of 19:35, 7 June 2011 by Cchetty (Talk | contribs)

Introduction

A virtual environment brings new considerations to troubleshooting and performance monitoring. Those considerations are discussed in this section.

General Guidelines

Performance indicators are still valid from within virtual machines. For the UC applications that support it, use RTMT or perfmon data to analyze the performance of the UC application. Data from these tools provides a view of guest performance: disk, CPU, memory, and other details.

Move to the VMware infrastructure when there is a need to get the perspective from the ESXi host. Use the vSphere Client to view data:

  • If vCenter is available, historical data is available through the client
  • If vCenter is not available, live data from the host is available through the client

VMware and VM Configuration

  1. Verify your virtualization configuration matches the UC requirements/restrictions
  2. Verify your VM conforms to the specifications of one of the supported configurations available from the OVA for the specific release of the application you are running.
Note: The released OVAs include virtual disk drives with aligned partition(s) (to optimize performance). The OVA must be used to create the virtual machine.

vCenter Settings

Just like some of the UC applications, vCenter can be configured to save more performance data. The more historical data is saved, the more disk space the vCenter database needs. This is one of the main areas where you need vCenter rather than going directly to the ESXi host for performance data: vCenter can save historical data that the ESXi host does not keep.

The settings that control the amount of historical data saved by vCenter are located in the vSphere client under Administration > Server Settings. The statistics level can be set for each interval duration and save time. Statistics levels range from 1 to 4, with level 4 containing the most data. View the data size estimates to ensure there is enough space to keep all statistics.

VMware Performance Indicators

The following table lists the performance indicators to monitor and view from a VMware perspective when a virtual machine is having suboptimal (or bad) performance. Most counters are from the ESXi host, which can give a perspective of VM interactions and overall host and data store utilization.

Performance Area | Object             | Counter                                           | Acceptable range
CPU              | Host               | Usage                                             | Less than 80%
CPU              | Virtual Machine    | Ready                                             | Less than 3%
Memory           | Host               | Consumed                                          | General trend is stable
Memory           | Host               | Balloon/Swap used                                 | 0 KB
Disk             | Specific datastore | Kernel command latency                            | Less than 3 ms
Disk             | Specific datastore | Physical device command latency                   | Less than 20 ms
Disk             | Specific datastore | Average commands issued per second                | Less than LUN capacity
Network          | Host               | Receive packets dropped/Transmit packets dropped  | 0 packets
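
As a rough illustration of how the fixed ranges in the table might be applied to collected samples, the following Python sketch flags counters that fall outside their acceptable range. The dictionary keys and the function name are hypothetical labels chosen for this example, not VMware counter identifiers. The "Consumed" trend and the "Less than LUN capacity" rows are left out because they depend on history and on storage details that a single sample does not carry.

```python
# Acceptable ranges taken from the table above.
# The key names are made up for this sketch; they are not VMware counter IDs.
THRESHOLDS = {
    "host_cpu_usage_pct": lambda v: v < 80,
    "vm_cpu_ready_pct": lambda v: v < 3,
    "host_balloon_kb": lambda v: v == 0,
    "host_swap_used_kb": lambda v: v == 0,
    "kernel_latency_ms": lambda v: v < 3,
    "device_latency_ms": lambda v: v < 20,
    "rx_packets_dropped": lambda v: v == 0,
    "tx_packets_dropped": lambda v: v == 0,
}


def out_of_range(sample):
    """Return the sorted names of counters in `sample` outside the acceptable range.

    Counters that have no entry in THRESHOLDS are ignored.
    """
    return sorted(
        name
        for name, value in sample.items()
        if name in THRESHOLDS and not THRESHOLDS[name](value)
    )


if __name__ == "__main__":
    sample = {
        "host_cpu_usage_pct": 85,
        "vm_cpu_ready_pct": 1.2,
        "host_balloon_kb": 0,
        "kernel_latency_ms": 4,
    }
    print(out_of_range(sample))
```

A check like this only highlights where to look first; the troubleshooting sections below describe how to investigate each area.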


Physical Hardware Serviceability Items

Area          | Top Items                                                                                                  | View at                                                | Alerted How?
CPU           | 1. Temperature; 2. Utilization/status; 3. Thresholds with events; 4. Condition & events for abnormal state | ESXi Host or vCenter                                   | SNMP/Email (via vCenter)
Memory        | 1. Utilization/status; 2. Errors/condition                                                                 | ESXi Host or vCenter                                   | SNMP/Email (via vCenter)
Hard Drives   | 1. Utilization/status; 2. Disk failure alerting                                                            | ESXi Host or vCenter                                   | SNMP/Email (via vCenter)
RAID Controller | 1. State (defunct, rebuilding, etc.); 2. Cache/battery status; 3. Thresholds with events                 | ESXi Host or vCenter (DAS only)                        | SNMP/Email (via vCenter)
NIC           | 1. Port failure events                                                                                     | vCenter                                                | SNMP/Email (via vCenter)
Power Supply  | 1. Voltage; 2. Redundancy status; 3. Thresholds with events                                                | ESXi Host or vCenter; UCS Manager (B-series)           | SNMP/Email (via vCenter)
Fans          | 1. Status/Speed; 2. Thresholds with events                                                                 | ESXi Host or vCenter (C-series); UCS Manager (B-series) | SNMP/Email (via vCenter)
IO Controller |                                                                                                            | ESXi Host or vCenter (DAS only)                        | SNMP/Email (via vCenter)

Note: The vSphere client can be used to view the data and alarms. vCenter is required for any automatic notification.

CPU Troubleshooting

High CPU usage could be due to a small number of VMs taking all of the resources or to too many VMs running on the host. If too many VMs are running, look at the VMs on the host and see if CPU reservations are in use (see oversubscription section). To isolate a CPU issue for a particular VM, consider moving it to another ESXi host.

To view the CPU performance indicators, go to the ESXi host's Performance tab and select the Advanced button. Under Chart options, select CPU and Timeframe, and then select only the host (not individual cores) to view overall CPU usage on the host. You can view each VM's CPU usage from the Virtual Machines tab on the host.
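
The "Ready less than 3%" guidance in the indicator table is a percentage, while the charts may show Ready as a time value. The sketch below assumes, which this page does not state, that the chart reports Ready as a summation in milliseconds per sample interval and that real-time charts use a 20-second interval. The function name is hypothetical. For a VM with several vCPUs the summation may cover all of them, so treat the result as a rough guide only.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms per sample) to a percentage.

    Assumption: `ready_ms` was accumulated over `interval_s` seconds
    (20 seconds is assumed here for real-time charts).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    # 400 ms of ready time in a 20-second sample
    print(cpu_ready_percent(400))
```

Historical charts use longer intervals, so pass the interval that matches the chart being read.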

To get a view of the reservations set by all of the VMs, use the Resource Allocation tab of the cluster.

Note: The "Resource Allocation" tab is only available via vCenter.

Memory Troubleshooting

Our guidelines do not support memory sharing between VMs. To verify that none is in use, check the following performance indicators and make sure the swapping and ballooning counters are zero. If a given VM does not have enough memory and there are no memory issues on the specific host, consider increasing the VM's memory.

To view the memory performance indicators, go to the ESXi host's Performance tab and select the Advanced button. Under Chart options, select Memory and Timeframe, then select the following counters:

  • Used memory (to view general trends)
  • Swap used
  • Balloon

Swap and Balloon should always be zero; otherwise, memory sharing is being used (which should not be the case).

Disk Troubleshooting

Bad disk performance often shows up as high CPU usage. IOPS data can provide information on how hard the application/VM is working the disks. Specific activities can cause spikes in IOPS: upgrades and DB maintenance are two examples. If VMs running on the same datastore are all doing these activities at the same time, the disks might not be able to keep up. IOPS data can be seen from vCenter or the SAN. Disk latency (response time) is a good indicator of disk performance.

To view the disk performance indicators, go to the ESXi host's Performance tab and select the Advanced button. The appropriate datastore needs to be selected; it can be found on the datastore page. Under Chart options, select Disk and Timeframe, then select the following counters:

  • Physical device command latency
  • Kernel command latency
  • Average commands issued per second

The kernel counter should not be greater than 2-3 ms. The physical device counter should not be greater than 15-20 ms. The "Average commands issued per second" counter can be used if IOPS are not available from the SAN. IOPS should be considered if the datastore appears to be overloaded. This IOPS data is viewable from the host and from each VM. Note that for NFS datastores, the physical and kernel latency data is not available. Starting with VMware 4.0 Update 2, the esxtop command (see below) can be used to view NFS counters, in particular the guest latency (called GAVG in esxtop). The guest latency is the sum of the physical device and kernel latencies.
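
The relationship between the three latencies, and the limits quoted above, can be sketched in a few lines of Python. The function names are hypothetical, and the default limits use the upper end of each quoted range (3 ms and 20 ms); adjust them to suit your environment.

```python
def guest_latency_ms(device_ms, kernel_ms):
    """Guest latency is the sum of the physical device and kernel latencies."""
    return device_ms + kernel_ms


def latency_findings(device_ms, kernel_ms, kernel_limit_ms=3.0, device_limit_ms=20.0):
    """Return which latency components exceed their limits.

    The result is a list containing "kernel" and/or "device"; an empty
    list means both values are within the limits.
    """
    findings = []
    if kernel_ms > kernel_limit_ms:
        findings.append("kernel")
    if device_ms > device_limit_ms:
        findings.append("device")
    return findings


if __name__ == "__main__":
    print(guest_latency_ms(12.0, 0.5))
    print(latency_findings(25.0, 1.0))
```

Splitting the guest latency this way helps show whether the delay is accumulating in the ESXi kernel or at the physical device.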

On the C-series UCS servers there have been issues with the write cache battery backup. If this battery is not operating correctly, performance will suffer. Use a tool like wbemcli to verify that the battery is healthy. An example wbemcli command:

wbemcli ei -noverify 'https://root:<password>@<ESXi Host IP>:5989/root/cimv2:VMware_HHRCBattery'

See the MegaCli User Guide for more information.
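
If the battery check needs to be run against several hosts, the wbemcli command shown above could be assembled from a script. The following Python sketch only builds the same argument list as that example; the function name is hypothetical, and it assumes the user name and password contain no characters (such as "@" or ":") that would need escaping inside the URL.

```python
import subprocess


def wbemcli_battery_command(host, password, user="root"):
    """Build the wbemcli argument list from the example above.

    Assumption: `user` and `password` need no escaping inside the URL.
    """
    url = "https://{}:{}@{}:5989/root/cimv2:VMware_HHRCBattery".format(
        user, password, host
    )
    return ["wbemcli", "ei", "-noverify", url]


def run_battery_check(host, password):
    """Run the command and return its raw text output (requires wbemcli on PATH)."""
    result = subprocess.run(
        wbemcli_battery_command(host, password),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems. The output still has to be read by a person, because this page does not describe its format.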

Network Troubleshooting

Generally, network performance issues show up as dropped packets. If dropped packets are seen from an ESXi host, the network infrastructure needs to be investigated for the issue, which might include a virtualized switch (Nexus 1000V). In ESXi 4.1, issues have been seen with large file transfers (e.g. SFTP/FTP transfers). For this issue, the Large Receive Offload (LRO) options need to be disabled on the ESXi host. That setting is found on the host's Configuration tab > Advanced Settings > Net.*. Note that there are several LRO settings on this page and all of them need to be disabled. If a VM has been cloned and uses static MAC addresses, verify that there are no duplicate MAC addresses in the network.
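
The duplicate MAC address check mentioned above can be sketched in Python, assuming you have already exported a list of VM names and their MAC addresses (how that list is obtained is outside the scope of this sketch). The function name and the sample VM names are hypothetical.

```python
from collections import defaultdict


def duplicate_macs(vm_macs):
    """Find MAC addresses used by more than one VM.

    `vm_macs` is an iterable of (vm_name, mac) pairs. MACs are normalized
    to lowercase, colon-separated form before comparison. Returns a dict
    mapping each duplicated MAC to the list of VM names that use it.
    """
    seen = defaultdict(list)
    for vm_name, mac in vm_macs:
        key = mac.strip().lower().replace("-", ":")
        seen[key].append(vm_name)
    return {mac: names for mac, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    inventory = [
        ("cucm-pub", "00:50:56:AA:BB:01"),
        ("cucm-sub", "00:50:56:aa:bb:02"),
        ("cucm-sub-clone", "00-50-56-AA-BB-02"),
    ]
    print(duplicate_macs(inventory))
```

Normalizing case and separators first matters because different tools print the same address in different forms.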

To view the network performance indicators, go to the ESXi host's Performance tab and select the Advanced button. Under Chart options, select Network and Timeframe, then select the following counters:

  • Receive packets dropped
  • Transmit packets dropped

The main thing to check is that no packets are getting dropped in the network.

Note: Advanced network debugging and configuration can be done on the Nexus 1000V (if used, which requires vCenter and Enterprise Plus licensing).

Alternate Access to Performance Data

If vCenter and/or the vSphere client are not available, some real-time data can be pulled using command line tools. If you have a vMA VM, the resxtop tool, a remote version of esxtop, can be used. Otherwise, the esxtop tool can be used directly on the ESXi host (root access must be enabled). See http://communities.vmware.com/docs/DOC-11812 for details on esxtop.
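
esxtop can also write its counters to a CSV file in batch mode (for example, esxtop -b -d 2 -n 30 > capture.csv), which makes offline analysis possible. The Python sketch below scans such a file for columns whose header contains a given substring and reports the highest value seen in each. The function name is hypothetical, and the column headers in the sample are made up for illustration; real captures use much longer header names that should be checked against your own output.

```python
import csv
import io


def column_maxima(csv_text, needle):
    """Return {column header: maximum value} for headers containing `needle`.

    Cells that are missing or not numeric are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    maxima = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue
            name = header[i]
            maxima[name] = max(value, maxima.get(name, value))
    return maxima


if __name__ == "__main__":
    # Made-up headers; real esxtop batch headers are longer.
    sample = (
        "time,diskA Guest MilliSec/Command,cpu\n"
        "t1,1.5,10\n"
        "t2,4.25,20\n"
    )
    print(column_maxima(sample, "Guest MilliSec/Command"))
```

For a real capture, read the file contents with open(...).read() and pass a substring that matches the counter of interest.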


