Troubleshooting and Performance Monitoring Virtualized Environments
Latest revision as of 06:40, 17 October 2013
A virtual environment brings new considerations to troubleshooting and performance monitoring. Those considerations are discussed in this section.
Performance indicators are still valid from within virtual machines. For the UC applications that support it, use RTMT or perfmon data to analyze the performance of the UC application. Data from these tools provides a view of guest performance: disk, CPU, memory, and other details.
Move to the VMware infrastructure when there is a need to get the perspective from the ESXi host. Use the vSphere Client to view data:
- If vCenter is available, historical data is available through the client
- If vCenter is not available, live data from the host is available through the client
VMware and VM Configuration
Verify your virtualization configuration matches the requirements/restrictions for each application.
- E.g. correct application versions, allowed VM configurations using Cisco-provided files, co-residency policy and virtual-to-physical sizing rules, correct ESXi versions, compliance with either TRC or Specs-based hardware support policies.
- See application links, Sizing links and Hardware links on www.cisco.com/go/uc-virtualized for more information.
Verify each VM uses a configuration from the Cisco-provided OVA download file of the application version you are running.
- To be TAC supported, it is required to use a Cisco-provided OVA to build the VM for initial install. E.g. see instructions for Unified Communications Manager here: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/virtual/CUCM_BK_CA526319_00_cucm-on-virtualized-servers_chapter_00.html#CUCM_TK_D1CB01EA_00
- E.g. if you are running Unified Communications Manager 9.1(2), new deployments must use the OVA download file for 9.1(2). If you are upgrading from older versions, see the readme for the 9.1(2) OVA on how to handle the VM configurations from the old version.
- It is not enough to just match the specs of the virtual hardware. The Cisco-provided OVAs include virtual disk drives whose partitions are aligned to 64K boundaries to optimize storage performance. It is required to use the Cisco-provided OVA to create the virtual machine, or you risk application issues due to non-optimized storage performance.
- The storage / partition / filesystem alignment is set up at install time through use of the OVA file, and is not changed by subsequent upgrades.
- If the VM is manually created without use of the OVA, and alignment is not configured, it can only be resolved after the fact via the following procedure:
- Deploy a VM configuration from the Cisco-provided OVA file of the application version
- Reinstall application
- Restore from backup
- If Cisco TAC detects unaligned partitions and deems it necessary in order to provide effective support, you will be required to correct the alignment before further troubleshooting can occur.
- Some application versions will generate an alert if they detect unaligned partitions. For example, Unified Communications Manager 9.1(2) or higher will generate an alert similar to the following:
- VMware Installation: 2 vCPU Intel(R) Xeon(R) CPU E5540 @ 2.53GHz, disk 1: 146Gbytes, disk 2: 146Gbytes, 6144Mbytes RAM, ERROR-UNSUPPORTED: Partitions unaligned
- You may also notice from the above alert that there is a second problem: the VM was created with 2x146GB vDisks which does not match any of the supported VM configurations in the UCM OVA download file.
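As a rough illustration of the two checks above, the following Python sketch tests whether a partition's starting offset falls on a 64K boundary and scans an alert string for the unaligned-partition marker and the reported vDisk sizes. The function names, the 512-byte sector size, and the regular expression are assumptions made for this example; they are not part of any Cisco tool, and the alert wording may differ between application versions.

```python
import re

BOUNDARY_BYTES = 64 * 1024  # 64K alignment boundary described above


def is_aligned(start_sector, sector_bytes=512, boundary=BOUNDARY_BYTES):
    """Return True if the partition's byte offset is a multiple of the boundary."""
    return (start_sector * sector_bytes) % boundary == 0


def parse_install_alert(alert):
    """Extract vDisk sizes (GB) and the unaligned flag from an alert string."""
    disks = [int(size) for size in re.findall(r"disk \d+: (\d+)Gbytes", alert)]
    unaligned = "Partitions unaligned" in alert
    return {"disks_gb": disks, "unaligned": unaligned}


# Alert text copied from the example above.
alert = ("VMware Installation: 2 vCPU Intel(R) Xeon(R) CPU E5540 @ 2.53GHz, "
         "disk 1: 146Gbytes, disk 2: 146Gbytes, 6144Mbytes RAM, "
         "ERROR-UNSUPPORTED: Partitions unaligned")
result = parse_install_alert(alert)
print(result)

# Hypothetical start sectors: 63 is not on a 64K boundary, 128 is.
print(is_aligned(63), is_aligned(128))
```

The extracted disk sizes still have to be compared by hand against the VM configurations listed in the OVA readme for the application version in use.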
|Note:||Recall that VMware vCenter is mandatory for UC on UCS Specs-based and HP/IBM Specs-based, as described here. VMware vCenter is optional for UC on UCS TRC deployments.|
Just like some of the UC applications, vCenter can be configured to save more performance data. The more historical data saved, the more disk space the vCenter database needs. Note that this is one of the main reasons to use vCenter rather than going directly to the ESXi host for performance data: vCenter can save historical data that the ESXi host does not keep.
The settings that control the amount of historical data saved by vCenter are located in the vSphere Client under Administration > Server Settings. The statistics level can be set for each interval duration and save time. Statistics levels range from 1 to 4, with level 4 containing the most data. View the data size estimates to ensure there is enough space to keep all statistics.
For a UC on UCS Specs-based or HP/IBM Specs-based deployment, Statistics Level 4 is required on all statistics. Configuring VMware vCenter to capture detailed logs, as shown in Figure 1 below, is strongly recommended. If not configured by default, Cisco TAC may request enabling these settings in order to troubleshoot problems.
VMware Performance Indicators
The following table lists the performance indicators to monitor and view from a VMware perspective when a virtual machine is having suboptimal (or bad) performance. Most counters are from the ESXi host, which can give a perspective of VM interactions and overall host and data store utilization.
|Performance Area||Object||Counter||Acceptable range|
|CPU||Host||Usage||Less than 80%|
|CPU||Virtual Machine||Ready||Less than 3%|
|Memory||Host||Consumed||General trend is stable|
|Memory||Host||Balloon/Swap used||0 Kb|
|Disk||Specific datastore||Kernel command latency||Less than 3ms|
|Disk||Specific datastore||Physical device command latency||Less than 20ms|
|Disk||Specific datastore||Average commands issued per second||Less than LUN capacity|
|Network||Host||Receive packets dropped/Transmit packets dropped||0 packets|
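The acceptable ranges in the table above can be applied to collected samples mechanically. The sketch below is one minimal way to do that in Python; the dictionary keys are hypothetical labels chosen for this example (they are not vSphere counter identifiers), and the sample values are invented. The "less than LUN capacity" and "general trend is stable" rows are left out because they depend on site-specific data.

```python
# Thresholds transcribed from the table above; the keys are hypothetical labels.
LIMITS = {
    "host_cpu_usage_pct": 80.0,   # less than 80%
    "vm_cpu_ready_pct": 3.0,      # less than 3%
    "kernel_latency_ms": 3.0,     # less than 3 ms
    "device_latency_ms": 20.0,    # less than 20 ms
}
# Counters that the table expects to stay at zero.
MUST_BE_ZERO = ("balloon_kb", "swap_used_kb", "rx_dropped", "tx_dropped")


def out_of_range(sample):
    """Return the sorted names of counters in `sample` outside the acceptable range."""
    bad = [k for k, limit in LIMITS.items() if k in sample and sample[k] >= limit]
    bad += [k for k in MUST_BE_ZERO if sample.get(k, 0) != 0]
    return sorted(bad)


# Invented sample for illustration only.
sample = {"host_cpu_usage_pct": 62.0, "vm_cpu_ready_pct": 4.1,
          "kernel_latency_ms": 0.4, "device_latency_ms": 12.0,
          "balloon_kb": 0, "swap_used_kb": 2048, "rx_dropped": 0, "tx_dropped": 0}
print(out_of_range(sample))
```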
Physical Hardware Serviceability Items
|Area||Top Items||View at||Alerted How?|
| || ||ESXi Host or vCenter||SNMP/Email (via vCenter)|
| || ||ESXi Host or vCenter||SNMP/Email (via vCenter)|
| || ||ESXi Host or vCenter||SNMP/Email (via vCenter)|
| || ||ESXi Host or vCenter (DAS only)||SNMP/Email (via vCenter)|
| || ||ESXi Host or vCenter|| |
| || ||ESXi Host or vCenter (C-series)|| |
|IO Controller|| ||ESXi Host or vCenter (DAS only)||SNMP/Email (via vCenter)|
|Note:||The vSphere client can be used to view the data and alarms. vCenter is required for any automatic notification.|
High CPU usage could be due to a small number of VMs taking all of the resources or to too many VMs running on the host. If too many VMs are running, look at the VMs on the host and see whether CPU reservations are in use (see oversubscription section). To isolate a CPU issue for a particular VM, consider moving it to another ESXi host.
To view the CPU performance indicators, go to the ESXi host's performance tab and select the Advanced button. Under Chart options, select CPU, timeframe, and then only the host (not individual cores) to view overall CPU usage on the host. You can view each VM's CPU usage from the Virtual Machines tab on the host.
To get a view of the reservations set by all of the VMs, use the Resource Allocation tab of the cluster.
|Note:||The "Resource Allocation" tab is only available via vCenter.|
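The CPU Ready guideline earlier in this section is given as a percentage, while the vSphere performance charts report CPU Ready as a summation in milliseconds per sample interval. The sketch below applies the conversion VMware documents for this (ready time divided by the interval length), assuming the 20-second interval of the real-time chart; historical charts use longer intervals, so pass the matching value. The function name and the sample numbers are made up for this example.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) over one sample interval to a percentage."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms / (interval_s * 1000.0) * 100.0


# Hypothetical samples: ms of ready time per 20-second interval for one VM.
samples_ms = [150, 400, 900]
for ms in samples_ms:
    pct = cpu_ready_percent(ms)
    status = "ok" if pct < 3.0 else "above 3% guideline"
    print(f"{ms} ms -> {pct:.2f}% ({status})")
```

For a VM with several vCPUs the chart value may be aggregated across vCPUs, so check which object the counter was read from before comparing it with the guideline.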
Our guidelines do not support memory sharing between VMs. To verify, check the following performance indicators and make sure the swapping and ballooning counters are zero. If a given VM does not have enough memory and there are no memory issues on the specific host, consider increasing the VM's memory.
To view the memory performance indicators, go to the ESXi host's Performance tab and select the Advanced button. Under Chart options, select Memory and Timeframe, then select the following counters:
- Used memory (to view general trends)
- Swap used
- Balloon
Swap and Balloon should always be zero; otherwise memory sharing is in use, which should not be the case.
Bad disk performance often shows up as high CPU usage. IOPS data can provide information on how hard the application/VM is working the disks. Specific activities can cause spikes in IOPS: upgrades and DB maintenance are two examples. If VMs running on the same datastore are all doing these activities at the same time, the disks might not be able to keep up. IOPS data can be seen from vCenter or the SAN. Disk latency (response time) is a good indicator of disk performance.
To view the disk performance indicators, go to the ESXi host's performance tab and select the advanced button. The appropriate datastore needs to be selected, which can be found on the datastore page (see below). Under chart options, select disk and timeframe, then select the following counters:
- Physical device command latency
- Kernel command latency
- Average commands issued per second
The kernel counter should not be greater than 2-3 ms. The physical device counter should not be greater than 15-20 ms. The "Average commands issued per second" counter can be used if IOPS are not available from the SAN. IOPS should be considered if it looks like the datastore is overloaded. This IOPS data is viewable from the host and each VM. Note that for NFS datastores, the physical device and kernel latency data is not available. Starting with VMware 4.0 Update 2, the esxtop command (see below) can be used to view NFS counters, in particular the guest latency (called GAVG in esxtop). The guest latency is the sum of the physical device and kernel latencies.
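The relationship between the three latency values can be written out as a small Python sketch. It assumes latencies in milliseconds and uses the upper ends of the ranges given above as limits; the function names and sample values are hypothetical. In esxtop the device and kernel components are labeled DAVG and KAVG.

```python
KERNEL_LIMIT_MS = 3.0    # upper end of the 2-3 ms guidance above
DEVICE_LIMIT_MS = 20.0   # upper end of the 15-20 ms guidance above


def guest_latency_ms(device_ms, kernel_ms):
    """Guest latency as the sum of physical device and kernel latency."""
    return device_ms + kernel_ms


def latency_findings(device_ms, kernel_ms):
    """Return a list of findings for one latency sample; empty means within limits."""
    findings = []
    if kernel_ms > KERNEL_LIMIT_MS:
        findings.append("kernel latency high")
    if device_ms > DEVICE_LIMIT_MS:
        findings.append("device latency high")
    return findings


# Invented samples for illustration only.
print(guest_latency_ms(12.5, 0.5))
print(latency_findings(25.0, 0.5))
```

Splitting the guest latency this way helps show whether time is being lost in the ESXi storage stack (kernel) or at the array or local disks (device).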
On the C-series UCS servers there have been issues with the write cache battery backup. If this battery is not operating correctly, performance will suffer. Use a tool like wbemcli to verify the battery is ok. An example of using the wbemcli:
wbemcli ei -noverify 'https://root:<password>@<ESXi Host IP>:5989/root/cimv2:VMware_HHRCBattery'
See the MegaCli User Guide for more information.
Generally, network performance issues show up as dropped packets. If dropped packets are seen from an ESXi host, the network infrastructure needs to be investigated for the issue, which might include a virtualized switch (Nexus 1000V). In ESXi 4.1, issues have been seen with large file transfers (e.g. SFTP/FTP transfers). For this issue, the Large Receive Offload (LRO) options need to be disabled on the ESXi host. That setting is found on the host's Configuration tab -> Advanced Settings -> Net.*. Note that there are several LRO settings on this page and all of them need to be disabled. If a VM has been cloned and uses static MAC addresses, verify there are no duplicate MAC addresses in the network.
To view the network performance indicators, go to the ESXi host's performance tab and select the advanced button. Under chart options, select Network, timeframe, then select the following counters:
- Receive packets dropped
- Transmit packets dropped
The main thing to check is that no packets are getting dropped in the network.
|Note:||Advanced network debugging and configuration can be done on Nexus 1000v (if used, which requires vCenter and Enterprise Plus licensing).|
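The duplicate MAC address check mentioned above for cloned VMs can be sketched in Python as follows. The inventory structure, VM names, and MAC addresses are invented for this example; in practice the list would come from whatever inventory export is available.

```python
from collections import defaultdict


def normalize_mac(mac):
    """Lowercase a MAC address and strip common separators."""
    return mac.lower().replace("-", "").replace(":", "").replace(".", "")


def duplicate_macs(inventory):
    """Map each MAC used by more than one VM to the sorted list of VM names."""
    owners = defaultdict(set)
    for vm_name, macs in inventory.items():
        for mac in macs:
            owners[normalize_mac(mac)].add(vm_name)
    return {mac: sorted(vms) for mac, vms in owners.items() if len(vms) > 1}


# Hypothetical inventory: VM name -> list of vNIC MAC addresses.
inventory = {
    "cucm-pub": ["00:50:56:AA:BB:01"],
    "cucm-sub1": ["00-50-56-aa-bb-01"],   # clone that kept the static MAC
    "cuc-1": ["00:50:56:aa:bb:02"],
}
print(duplicate_macs(inventory))
```

Normalizing the address first matters because different tools print the same MAC with different separators and letter case.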
Alternate Access to Performance Data
If vCenter and/or the vSphere Client are not available, some real-time data can be pulled using command-line tools. If you have a vMA VM, then the resxtop tool can be used. The resxtop tool is a remote version of the esxtop tool. Otherwise, the esxtop tool can be used directly on the ESXi host (root access must be enabled). See http://communities.vmware.com/docs/DOC-11812 for details on esxtop.
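esxtop can also run in batch mode and write its counters as CSV, for example `esxtop -b -d 5 -n 12 > esxtop.csv` for twelve samples taken five seconds apart. The Python sketch below shows one possible way to pull a single counter out of such a file by matching the end of the column header. The two-sample excerpt, host name, and VM name are made up, and the exact header text should be checked against real output from the ESXi version in use.

```python
import csv
import io


def columns_matching(csv_text, counter_suffix):
    """Return {header: [float values]} for columns whose header ends with the suffix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


# Made-up two-sample excerpt in the style of esxtop batch output.
sample = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx1\\Group Cpu(101:cucm-pub)\\% Ready"\n'
    '"10/17/2013 06:40:00","1.25"\n'
    '"10/17/2013 06:40:05","3.50"\n'
)
ready = columns_matching(sample, "\\% Ready")
for name, values in ready.items():
    print(name, max(values))
```

Real batch output contains a very large number of columns, so selecting by header suffix keeps the processing manageable.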