Monitoring and Maintaining CUCM Appliance Hardware
Hardware Platform Monitoring and Management
The Appliance supports a variety of interfaces to enable monitoring in the following eight focus areas:
1) CPU status/utilization
2) Memory status/utilization
3) System components temperatures
4) Fan status
5) Power Supply status
6) RAID & disk status
7) Network status (incl. NIC)
8) Operational status, including instrumentation of system/kernel status and data dumps following major system issues, indicating nature/type of the operational problem and degree of severity.
This section focuses on hardware-layer monitoring for areas 3) through 8). Areas 1) and 2) are covered in the section on CUCM Application-layer and Services-layer Monitoring.
Hardware Monitoring via SNMP
The CUCM hardware server is monitored via SNMP MIBs. The following MIBs are supported for CUCM:
1. Native Hardware Platform MIBs
   IBM-SYSTEM-LMSENSOR
   IBM-SYSTEM-POWER
   IBM-SYSTEM-RAID
   IBM-SYSTEM-xxx-MIB
   CPQ-xxx-MIB (HP)
   CPQHEALTH (HP)
   MIB-DELL-10892
2. Standard MIBs
   SYSAPPL-MIB
   HOST-RESOURCES-MIB
   RFC1213-MIB
   IF-MIB
3. Cisco MIBs
   CISCO-CCM-MIB
   CISCO-SYSLOG-MIB
Through polling and traps with the MIBs mentioned above, the eight focus areas can be monitored. You configure your SNMP trap receiver (Network Management Applications) to receive these traps.
Specific MIB support varies by CUCM version and hardware vendor (HP or IBM models). For MCS vendor server MIB support by CUCM release, see the following URL: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/compat/cmmibcmp.xls
Hardware Monitoring via RTMT and syslog
In addition to SNMP traps, the Cisco Unified Real Time Monitoring Tool (RTMT), described later in this document, can monitor and parse syslog messages provided by the hardware vendors and send the resulting alerts to RTMT Alert Central. A CUCM system administrator can configure in RTMT how to be notified when these alerts occur (by e-mail, by Epage, or both).
RTMT includes a pre-canned alert named Hardware Failure. When you see this alert, one of the following problems might have occurred:
1. RAID: drive removed, drive failed, I/O error
2. Hard drive replication
3. BIOS versions
4. Power supply: redundant power failed, voltage fluctuations
RTMT is best used for a single cluster. For large and enterprise customers who have multiple clusters deployed, Cisco Unified Operations Manager (CUOM) is recommended. For details on CUOM, see the following URL: http://www.cisco.com/en/US/products/ps6535/index.html
Hardware Monitoring via CLI
The system BIOS is viewable during the Appliance’s boot sequence.
The following commands are useful for viewing details on hardware, BIOS, RAID, firmware and other items. These items are included as part of the CUCM image and do not need to be managed separately as in CUCM 4.x, but may need to be inspected during diagnostic activity.
    show hardware
    show environment [fans | power-supply | temperature]
    show tech all
    utils create report hardware (e.g. to see firmware versions)
Integration with Uninterruptible Power Supplies (UPS)
As of CUCM 6.0(1a), the Appliance supports integration with select models of APC UPS for select MCS 7800 models. Earlier Appliance releases rely on an external script that monitors the UPS and issues the Cisco CLI command for graceful shutdown. A future release will expand UPS and MCS support.
See the release notes for CUCM 6.0(1a) for more details.
Note: native hardware out-of-band management such as HP iLO or IBM RSA II cannot be used for graceful shutdown of CUCM software.
Use of Native Hardware Out of Band Management (OOB)
Select features of HP iLO and IBM RSA II are supported to enable the eight focus areas.
Phase 1 support of these interfaces on the Appliance includes the following capabilities (specific feature names vary by hardware vendor):
- Remote console (to access boot screens and the Cisco CLI)
- Remote power management
CUCM OS, Application and Services Monitoring
The primary tool to monitor CUCM, IP phones and other services is the Real-Time Monitoring Tool (RTMT), which is available as a plug-in through the Unified CM Administration page. This tool is used to monitor:
1. The real-time information of various components that are part of Cisco Unified Communications Manager
2. The real-time information of various devices that are controlled by CUCM, for example, MGCP voice gateways, IP phones, and CTI applications.
RTMT presents a unified view of the entire CUCM cluster by providing real-time information on all the IP phones, gateways, and CTI devices registered to all the CUCM servers in a cluster, as well as real-time information on native components within CUCM.
RTMT can be installed as a standalone Java application on the client side. It is considerably more secure than earlier versions: it uses HTTPS (unlike HTTP in Unified CM version 4.x and earlier). Communications Manager sends alarms to a real-time database, which is polled by RTMT. RTMT also polls the performance counters for CUCM. RTMT is installed with a default configuration, including polling intervals and threshold settings. An administrator may customize these settings by creating new profiles in the System menu, and may add more performance counters. The polling interval values can also be changed through the RIS Data Collector service.
RTMT Summary View
This view displays the overall health of the system, including:
- CPU utilization level
- Memory utilization level
- Phone registration status
- Calls in progress
- Gateway status
This information should be monitored on a daily basis. If CPU and memory utilization levels exceed the 70% mark, check whether the CUCM publisher and the subscribers that participate in call processing are overloaded.
Key indicators of system health and performance issues are:
- System Time, User Time, IOWait, soft irq, irq
- CPU Pegging alerts
- Process using most CPU
- High % iowait
- High % iowait due to Common Partition
- Process responsible for disk I/O
- CodeYellow
If you do not want to keep the RTMT client running on your workstation or PC all the time, you can log in to the RTMT client, set a threshold for each alert that interests you, choose how you would like to be notified when that alert occurs, and then close the RTMT client. The RTMT backend (AMC service), which runs whenever the CUCM server is up, collects and processes all the information needed and notifies you in the way you configured.
The RTMT CPU and Memory page reports CPU usage in terms of:
- %System: the percentage of CPU utilization that occurred while executing at the system level (kernel).
- %User: the percentage of CPU utilization that occurred while executing at the user level (application).
- %IOWait: the percentage of time that the CPU was idle while waiting for an outstanding disk I/O request.
- %SoftIrq: the percentage of time that the processor is executing deferred IRQ processing (e.g., processing of network packets).
- %Irq: the percentage of time that the processor is executing the interrupt request that is assigned to devices for interrupt, or sending a signal to the computer when it is finished processing.
CPU Usage Monitoring
High CPU utilization can impact call processing by creating delays or interruptions in service that may be noticeable to end users. Sometimes high memory utilization is indicative of a memory leak.
RIS Data Collector PerfMonLog should be enabled to track CPU usage. To enable it, set the RIS Data Collector service parameter “Enable Logging”. General CPU usage guidelines are given in the table below.
Counter                                              MCS-7835                                        MCS-7845
Total CPU usage “Processor (_Total) \ % CPU Time”    < 68% - good; 68 – 70% - warning; > 80% - bad   < 68% - good; 68 – 70% - warning; > 80% - bad
Process ccm CPU                                      < 44%                                           < 22%
IOWAIT “Processor (_Total) \ IOwait Percentage”      < 10% - good                                    < 10% - good
CallManager Service Virtual Memory size              < 2.1 GB                                        < 2.1 GB
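The guideline bands in the table above can be encoded as a small check, for example when post-processing exported perfmon values. This is only a sketch: the function names are hypothetical, and because the table assigns no label to total CPU usage between 70% and 80%, that range is reported as "unspecified" rather than guessed.

```python
def classify_total_cpu(percent):
    """Classify "Processor (_Total) \\ % CPU Time" using the table's bands."""
    if percent < 68:
        return "good"
    if percent <= 70:
        return "warning"
    if percent > 80:
        return "bad"
    # The guideline table does not label the 70-80% range.
    return "unspecified"


def classify_ccm_cpu(percent, model):
    """Check the ccm process guideline: < 44% on MCS-7835, < 22% on MCS-7845."""
    limits = {"MCS-7835": 44, "MCS-7845": 22}
    return "good" if percent < limits[model] else "above guideline"
```

For instance, a ccm process at 30% CPU is within the guideline on an MCS-7835 but above it on an MCS-7845.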
You can also monitor CPU usage through APIs:
- Through the SOAP API, you can monitor the following perfmon counters:
    Under the Processor object: % CPU Time, System Percentage, User Percentage, IOwait Percentage, Softirq Percentage, Irq Percentage
    Under the Process object: % CPU Time
- Through the SNMP interface, you can monitor the following perfmon counters:
    Host Resource MIB: hrProcessorLoad, hrSWRunPerfCPU
    CPQHOST-MIB: cpqHoCpuUtilMin, cpqHoCpuUtilFiveMin
You can also download historical information using RTMT Trace & Log Central or the SOAP APIs, such as:
- Cisco AMC Service PerfMonLog (enabled by default; deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog was introduced)
- Cisco RIS Data Collector PerfMonLog (disabled by default in CUCM 5.x; enabled by default in CUCM 6.0)
If you see high CPU usage, try to identify which process causes it. If %system and/or %user is high enough to generate a CPUPegging alert, check the alert message to see which processes are using the most CPU. You can also go to the RTMT Process page and sort by %CPU to identify high-CPU processes. Please see below as an example:
For postmortem analysis, RIS Data Collector PerfMonLog tracks %CPU usage per process as well as at the system level.
RTMT monitors CPU usage. When CPU usage is above a threshold, RTMT generates CPUPegging/CallProcessNodeCPUPegging alerts. From RTMT Alert Central, you can also see the current status.
There are two kinds of RTMT alerts. The first set is pre-configured (also called pre-canned), and the second set is user-defined. You can customize both of them. The main difference is that you cannot delete pre-configured alerts, whereas you can add and delete user-defined alerts. However, you can disable both pre-configured and user-defined alerts. To view the pre-configured alerts, from the RTMT client application, select the RTMT -> Tools -> Alert -> Alert Central menu option. The pre-configured alerts are enabled by default. In most cases, you do not have to change the default threshold settings configured for the pre-configured alerts, but you can change them to meet your requirements. The notification can be sent by e-mail or to a pager. To set up e-mail notification, specify the SMTP server name and port number. You can do this in the RTMT client application by selecting the Alert/Threshold -> Enable E-Mail Server menu option.
In addition to CPUPegging / CallProcessNodeCPUPegging, high CPU usage can cause other alerts to occur, such as:
- CodeYellow
- CodeRed
- CoreDumpFileFound
- CriticalServiceDown
- LowCallManagerHeartbeatRate
- LowTFTPServerHeartbeatRate
- LowAttendantConsoleServerHeartbeatRate
% IOWait Monitoring
High %iowait indicates high disk I/O activity. Consider the following possible causes:
- High IOwait due to heavy memory swapping. Check % CPU Time for the Swap Partition to see whether there is a high level of memory swapping activity. One potential cause of heavy memory swapping is a memory leak.
- High IOwait due to DB activity. The database accesses the Active Partition. If % CPU Time for the Active Partition is high, then most likely there is a lot of DB activity.
- High IOwait due to the Common (or Log) Partition, where trace and log files are stored. You can check the following things:
  1. Check Trace & Log Central to see whether any trace collection activity is going on. If call processing is impacted (i.e., CodeYellow), consider adjusting the trace collection schedule. If the zip option is used, turn it off.
  2. Trace setting: at the Detailed level, CUCM generates a lot of trace. If %iowait is high and/or CUCM is in CodeYellow state, and the CUCM service trace setting is Detailed, change the trace setting to “Error” to reduce trace writing.
You can use RTMT to identify processes that are responsible for high %iowait:
- If %iowait is high enough to cause a CPUPegging alert, check the alert message for processes waiting for disk I/O.
- Go to the RTMT Process page and sort by Status. Check for processes in Uninterruptible Disk Sleep state.
- Download the RIS Data Collector PerfMonLog file to examine the process status over a longer period of time.
Below is an example of the RTMT Process page, sorted by Status. You can check for processes in Uninterruptible Disk Sleep state. In the case below, it is the sFTP process:
You can also use CLI to isolate which process causes high IOwait:
Syntax
admin:utils fior
    utils fior status
    utils fior enable
    utils fior disable
    utils fior start
    utils fior stop
    utils fior list
    utils fior top

For example:
admin:utils fior list
2007-05-31 Counters Reset

Time      Process     PID  State     Bytes Read  Bytes Written
--------  --------  -----  -----  -------------  -------------
17:02:45  rpmq      31206  Done        14173728              0
17:04:51  java      31147  Done          310724           3582
17:04:56  snmpget   31365  Done          989543              0
17:10:22  top       12516  Done         7983360              0
17:21:17  java      31485  Done          313202           2209
17:44:34  java       1194  Done          192483              0
17:44:51  java       1231  Done          192291              0
17:45:09  cdpd       6145  Done               0        2430100
17:45:25  java       1319  Done          192291              0
17:45:31  java       1330  Done          192291              0
17:45:38  java       1346  Done          192291              0
17:45:41  rpmq       1381  Done        14172704              0
17:45:44  java       1478  Done          192291              0
17:46:05  rpmq       1540  Done        14172704              0
17:46:55  cat        1612  Done            2560         165400
17:46:56  troff      1615  Done          244103              0
18:41:52  rpmq       4541  Done        14172704              0
18:42:09  rpmq       4688  Done        14172704              0
CLI fior output sorted by top disk users:
admin:utils fior top
Top processes for interval starting 2007-05-31 15:27:23
Sort by Bytes Written

Process      PID  Bytes Read  Read Rate  Bytes Written  Write Rate
---------  -----  ----------  ---------  -------------  ----------
Linuxzip   19556    61019083   15254771       12325229     3081307
Linuxzip   19553    58343109   11668622        9860680     1972136
Linuxzip   19544    55679597   11135919        7390382     1478076
installdb  28786     3764719      83660        6847693      152171
Linuxzip   20150    18963498    6321166        6672927     2224309
Linuxzip   20148    53597311   17865770        5943560     1981187
Linuxzip   19968     9643296    4821648        5438963     2719482
Linuxzip   19965    53107868   10621574        5222659     1044532
Linuxzip   19542    53014605   13253651        4922147     1230537
mv          5048     3458525    3458525        3454941     3454941
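When the same process name appears under many PIDs, as Linuxzip does in the output above, it can help to total the bytes written per process name. The following is a minimal sketch that assumes the `utils fior top` output has been saved as text and that each data row has the six whitespace-separated fields shown above; the function name and the abbreviated sample are illustrative only.

```python
from collections import defaultdict

# A few rows copied from the "utils fior top" example output.
SAMPLE = """\
Process PID Bytes Read Read Rate Bytes Written Write Rate
Linuxzip 19556 61019083 15254771 12325229 3081307
Linuxzip 19553 58343109 11668622 9860680 1972136
installdb 28786 3764719 83660 6847693 152171
mv 5048 3458525 3458525 3454941 3454941
"""


def bytes_written_by_process(text):
    """Sum the Bytes Written column per process name, largest first."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields: name, PID, and four numeric columns.
        # Header and separator rows fail this check and are skipped.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        name, _pid, _read, _read_rate, written, _write_rate = fields
        totals[name] += int(written)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling `bytes_written_by_process(SAMPLE)` puts Linuxzip first, which matches the earlier advice to turn off the zip option when trace collection drives up %iowait.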
utils diagnose list
This command lists all available diagnostic tests. For example:
admin: utils diagnose list
Available diagnostics modules
disk_space      - Check available disk space as well as any unusual disk usage
service_manager - Check if service manager is running
tomcat          - Check if Tomcat is deadlocked or not running

utils diagnose test
This command executes each diagnostic test, but does not attempt to repair anything. Example:
admin: utils diagnose test
Starting diagnostic test(s)
test - disk_space : Passed
test - service_manager : Passed
test - tomcat : Passed
Diagnostics Completed

utils diagnose module <moduleName>
This command executes a single diagnostic test and attempts to fix the problem if possible. You can also use the command "utils diagnose fix" to run all of the diagnostic tests at once. Example:
admin: utils diagnose module tomcat
Starting diagnostic test(s)
test - tomcat : Passed
Diagnostics Completed

utils diagnose fix
This command executes all diagnostic tests and, if possible, attempts to repair the system. Example:
admin: utils diagnose fix
Starting diagnostic test(s)
test - disk_space : Passed
test - service_manager : Passed
test - tomcat : Passed
Diagnostics Completed

utils create report hardware
No parameters are required. This command creates a system report containing disk array, remote console, diagnostic, and environmental data. Example:
admin:utils create report hardware
*** W A R N I N G ***
This process can take several minutes as the disk array, remote console, system diagnostics and environmental systems are probed for their current values.
Continue? Press y or Y to continue, any other key to cancel request.
Continuing with System Report request...
Collecting Disk Array Data...SmartArray Equipped server detected...Done
Collecting Remote Console Data...Done
Collecting Model Specific System Diagnostic Information...Done
Collecting Environmental Data...Done
Collecting Remote Console System Log Data...Done
Creating single compressed system report...Done
System report written to SystemReport-20070730020505.tgz
To retrieve diagnostics use CLI command:
file get activelog platform/log/SystemReport-20070730020505.tgz
utils iostat
Parameters:
    interval    optional  Interval (in seconds) between two iostat readings; mandatory if iterations is used
    iterations  optional  The number of iostat iterations to be performed; mandatory if interval is used
    filename    optional  Redirect the output to a file
Help: utils iostat: This command provides the iostat output for the given number of iterations and interval.
Example:
admin: utils iostat
Executing command... Please be patient
Tue Oct 9 12:47:09 IST 2007
Linux 2.4.21-47.ELsmp (csevdir60) 10/09/2007
Time: 12:47:09 PM
avg-cpu:   %user   %nice    %sys  %iowait   %idle
            3.61    0.02    3.40     0.51   92.47

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  rkB/s   wkB/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda        3.10   19.78  0.34  7.49   27.52  218.37  13.76  109.19     31.39      0.05    5.78   0.73   0.57
sda1       0.38    4.91  0.14  0.64    4.21   44.40   2.10   22.20     62.10      0.02   26.63   1.62   0.13
sda2       0.00    0.00  0.00  0.00    0.00    0.00   0.00    0.00     10.88      0.00    2.20   2.20   0.00
sda3       0.00    0.00  0.00  0.00    0.00    0.00   0.00    0.00      5.28      0.00    1.88   1.88   0.00
sda4       0.00    0.00  0.00  0.00    0.00    0.00   0.00    0.00      1.83      0.00    1.67   1.67   0.00
sda5       0.00    0.08  0.01  0.01    0.04    0.73   0.02    0.37     64.43      0.00  283.91  69.81   0.08
sda6       2.71   14.79  0.20  6.84   23.26  173.24  11.63   86.62     27.92      0.02    2.98   0.61   0.43
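If the iostat output is redirected to a file, the per-device rows can be scanned for slow partitions. The sketch below assumes the column layout shown in the example above; the function names and the 100 ms await threshold are illustrative assumptions, not Cisco recommendations.

```python
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s "
          "avgrq-sz avgqu-sz await svctm %util")

# Three device rows copied from the example output.
SAMPLE = HEADER + """
sda 3.10 19.78 0.34 7.49 27.52 218.37 13.76 109.19 31.39 0.05 5.78 0.73 0.57
sda5 0.00 0.08 0.01 0.01 0.04 0.73 0.02 0.37 64.43 0.00 283.91 69.81 0.08
sda6 2.71 14.79 0.20 6.84 23.26 173.24 11.63 86.62 27.92 0.02 2.98 0.61 0.43
"""


def parse_iostat(text):
    """Return {device: {column: value}} for the rows after the Device: header."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header_idx = next(i for i, f in enumerate(lines) if f[0] == "Device:")
    columns = lines[header_idx][1:]
    rows = {}
    for fields in lines[header_idx + 1:]:
        # Skip anything that does not have one value per column.
        if len(fields) != len(columns) + 1:
            continue
        rows[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return rows


def slow_devices(rows, await_ms=100.0):
    """List devices whose average wait time exceeds the (assumed) threshold."""
    return [dev for dev, stats in rows.items() if stats["await"] > await_ms]
```

With the sample rows, only sda5 is reported, because its await value of 283.91 is far above the other partitions.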
The following table lists some equivalent perfmon counters between CUCM 4.x and CUCM 5.x and later:
CUCM 4.x Perfmon counters        CUCM 5.x appliance Perfmon counters
Process                          Process
  % Privileged Time                STime
  % Processor Time                 % CPU Time
Processor                        Processor
  % UserTime                       User Percentage
  % Privileged Time                System Percentage
  % Idle Time                      Nice Percentage
  % Processor Time                 % CPU Time
Memory Monitoring
Virtual memory consists of physical memory (RAM) and swap memory (disk). The RTMT “CPU & Memory” page shows the following system-level memory usage information:
- Total: total amount of physical memory
- Free: amount of free memory
- Shared: amount of shared memory used
- Buffers: amount of memory used for buffering purposes
- Cached: amount of cached memory
- Used: calculated as Total – Free – Buffers – Cached + Shared
- Total Swap: total amount of swap space
- Used Swap: the amount of swap space in use on the system
- Free Swap: the amount of free swap space available on the system
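The Used calculation above can be reproduced when working from raw values, for example from an exported log. This is a small worked example; the function name and the sample figures (in Kbytes) are made up for illustration.

```python
def used_memory_kb(total, free, buffers, cached, shared):
    """Used = Total - Free - Buffers - Cached + Shared (all values in Kbytes)."""
    return total - free - buffers - cached + shared


# Hypothetical sample values, in Kbytes.
sample = dict(total=4_000_000, free=500_000, buffers=200_000,
              cached=1_300_000, shared=100_000)
used = used_memory_kb(**sample)
percent_used = 100.0 * used / sample["total"]
print(used, percent_used)
```

With these sample figures, Used works out to 2,100,000 Kbytes, or 52.5% of physical memory.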
You can also query memory information through APIs:
- Through SOAP, you can query the following perfmon counters:
    Under the Memory object: % Mem Used, % VM Used, Total Kbytes, Total Swap Kbytes, Total VM Kbytes, Used Kbytes, Used Swap Kbytes, Used VM Kbytes
    Under the Process object: VmSize, VmData, VmRSS, % Memory Usage
- Through SNMP, you can query the following perfmon counters:
    Host Resource MIB: hrStorageSize, hrStorageUsed, hrStorageAllocationUnits, hrStorageDescr, hrStorageType, hrMemorySize
You can also download historical information by using RTMT Trace & Log Central:
- Cisco AMC Service PerfMonLog (enabled by default; deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog was introduced)
- Cisco RIS Data Collector PerfMonLog (disabled by default in CUCM 5.x; enabled by default in CUCM 6.0)
Note: Perfmon Virtual Memory refers to Total (Physical + Swap) memory, whereas Host Resource MIB Virtual Memory refers to Swap memory only.
The RTMT “Process” pre-canned screen displays process-level memory usage information (VmSize, VmRSS, and VmData):
- VmSize is the total virtual memory used by the process.
- VmRSS is the Resident Set currently in physical memory used by the process, including Code, Data and Stack.
- VmData is the virtual memory usage of the heap by the process.
- Page Fault Count represents the number of major page faults that a process encountered that required the data to be loaded into physical memory.
To identify which process consumes the most memory, go to the RTMT “Process” pre-canned screen and sort by VmSize by clicking the VmSize tab.
Hints on Memory Leaks
On the RTMT Process page, if a process’ VmSize is continuously increasing, that process is probably leaking memory. When a process leaks memory, the system administrator should report it to Cisco with the proper trace files. RIS Data Collector PerfMonLog is a good one to collect, as it contains historical information on memory usage. The system administrator can then schedule a restart of the service during off hours to reclaim the memory.
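The "continuously increasing VmSize" hint can be applied to a series of VmSize samples taken from the PerfMonLog. The following is only a rough heuristic sketch: the function name, the minimum sample count, and the rule (never shrinks and ends higher than it started) are assumptions made for illustration, and a positive result is a reason to investigate rather than proof of a leak.

```python
def looks_like_leak(vmsize_samples, min_samples=5):
    """Return True if VmSize never decreases and has grown overall."""
    if len(vmsize_samples) < min_samples:
        return False  # too few samples to say anything
    pairs = zip(vmsize_samples, vmsize_samples[1:])
    never_shrinks = all(later >= earlier for earlier, later in pairs)
    return never_shrinks and vmsize_samples[-1] > vmsize_samples[0]
```

A process whose VmSize rises and falls with load fails the check, as does one whose VmSize is simply flat.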
Alert Central
Alert Central (RTMT -> Tools -> Alert -> Alert Central) lists all the Cisco predefined alerts and provides the current status of each alertable condition. One column to pay attention to is “In Safe Range”. If it is marked No, the condition has not yet been corrected. For instance, if “In Safe Range” is “No” for CallProcessingNodeCPUPegging, the CPU usage on that node is still above the threshold. The bottom pane shows history information, where you can see alerts generated previously. Quite often, by the time you realize that a service has crashed, the corresponding trace files have been overwritten, and it is hard for Cisco TAC to work on the issue without trace files. For this reason, the CoreDumpFileFound, CodeYellow, and CriticalServiceDown alerts have an Enable Trace Download option. To enable it, open Set Alert Properties; the last page has the option to enable trace download. This can be used to make sure that the trace files corresponding to a crash are preserved.
Caution - Enabling TCT Download may affect services on the server. Configuring a high number of downloads will adversely impact the quality of services on the server.
Alerts can also send out alarms (syslog messages) if you configure the Alarm Configuration page for the Cisco AMC Service. In addition to the AlertHistory shown in RTMT Alert Central, there is the AMC Alert Log, which you can download with Trace & Log Central to get up to 7 days (by default) of alert history.
From Alert Central, you can see the current status:
The following table compares the names of perfmon counters on virtual memory between CUCM 4.x and CUCM 5.x.
CUCM 4.x Perfmon counters        CUCM 5.x appliance Perfmon counters
Process                          Process
  Private Bytes                    VmRSS
  Virtual Bytes                    VmSize
Partition (Disk) Usage Monitoring
There are 4 partitions on the CUCM hard drive:
- The Common partition, also referred to as the Log partition, is where trace/log files are stored.
- The Active partition contains the files (binaries, libraries and config files) of the active OS and CUCM version.
- The Inactive partition contains the files for the alternative CUCM version (e.g., an older version that was upgraded from, or a newer version recently upgraded to that the server has not yet been toggled to run).
- The Swap partition is used for swap space.
You can also get partition information through APIs:
- Through SOAP APIs, you can query the following perfmon counters:
    Under the Partition object: Total Mbytes, Used Mbytes, Queue Length, Write Bytes Per Sec, Read Bytes Per Sec
- Through SNMP MIB, you can query the following information:
    Host Resource MIB: hrStorageSize, hrStorageUsed, hrStorageAllocationUnits, hrStorageDescr, hrStorageType
You can also download historical information by using RTMT Trace & Log Central:
- Cisco AMC Service PerfMonLog (enabled by default; deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog was introduced)
- Cisco RIS Data Collector PerfMonLog (disabled by default in CUCM 5.x; enabled by default in CUCM 6.0)
You can use RTMT to monitor disk usage:
Partition Name Mapping
Perfmon Instance Names as shown in RTMT and SOAP    Names shown in Host Resource hrStorage Description
Active                                              /
Inactive                                            /partB
Common                                              /common
Boot                                                /grub
Swap                                                Virtual Memory
SharedMemory                                        /dev/shm
The LogPartitionLowWaterMarkExceeded alert occurs when the percentage of used disk space in the log partition has exceeded the configured low water mark. Treat this alert as an early warning to clean up disk space. You can use RTMT Trace & Log Central to collect trace/log files and then delete them from the server. In addition to manually cleaning up the trace/log files, the system administrator should also adjust the number of trace files to be kept, to avoid hitting the low water mark again.
The LogPartitionHighWaterMarkExceeded alert occurs when the percentage of used disk space in the log partition has exceeded the configured high water mark. When this alert is generated, the Log Partition Monitoring (LPM) utility starts to delete files in the Log Partition until usage is down to the low water mark, to avoid running out of disk space. Since LPM may delete some files that you want to keep, you need to act upon receiving the LogPartitionLowWaterMarkExceeded alert.
The LowActivePartitionAvailableDiskSpace alert occurs when the percentage of available disk space on the Active Partition is lower than the configured value. Please use the default threshold that Cisco recommends. At the default threshold, this alert should never be generated. If this alert occurs, a system administrator can adjust the threshold as a temporary workaround, but Cisco TAC should look into it. One place to look is /tmp, using remote access. We have seen cases where large files are left there by 3rd party software.
The LowInactivePartitionAvailableDiskSpace alert occurs when the percentage of available disk space on the Inactive Partition is lower than the configured value. Please use the default threshold that Cisco recommends. At the default threshold, this alert should never be generated. If this alert occurs, a system administrator can adjust the threshold as a temporary workaround, but Cisco TAC should look into it.
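The relationship between the two water marks can be illustrated with a small simulation. This is a sketch of the behaviour described above, not of the LPM implementation: the function names are hypothetical, the water mark values in the usage note are arbitrary examples, and the order in which files are deleted (here, simply list order) is an assumption.

```python
def water_mark_alert(used_percent, low_mark, high_mark):
    """Return the name of the alert that applies, or None if usage is normal."""
    if used_percent > high_mark:
        return "LogPartitionHighWaterMarkExceeded"
    if used_percent > low_mark:
        return "LogPartitionLowWaterMarkExceeded"
    return None


def purge_until_low_mark(files, used_mb, total_mb, low_mark):
    """Simulate deleting files until usage is down to the low water mark.

    files is a list of (name, size_mb) tuples in the assumed deletion order.
    Returns the names that would be deleted and the remaining used space.
    """
    deleted = []
    for name, size_mb in files:
        if 100.0 * used_mb / total_mb <= low_mark:
            break
        used_mb -= size_mb
        deleted.append(name)
    return deleted, used_mb
```

For example, with example marks of 90% and 95% on a 1000 MB partition that is 960 MB full, the high water mark alert applies, and two 30 MB files would be removed before usage reaches 90%. Acting on the low water mark alert first lets the administrator choose which files go.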
The following table is a comparison of partition-related perfmon counters between CUCM 4.x and CUCM 5.x.
CUCM 4.x Perfmon counters        CUCM 5.x appliance Perfmon counters
Logical Disk                     Partition
  % Disk Time                      % CPU Time
  Disk Read Bytes/sec              Read Kbytes Per Sec
  Disk Write Bytes/sec             Write Kbytes Per Sec
  Current Disk Queue Length        Queue Length
  Free Megabytes                   Used Mbytes, Total Mbytes
  % Free Space                     % Used
Database Replication among CUCM nodes
You can use the RTMT Database Summary to monitor your database activities (i.e., CallManager -> Service -> Database Summary):
The following CLI commands can be used to monitor and manage intra-cluster connections:
    utils dbreplication status
    utils dbreplication repair all/nodename
    utils dbreplication reset all/nodename
    utils dbreplication stop
    utils dbreplication dropadmindb
    utils dbreplication setrepltimeout
    show tech dbstateinfo
    show tech dbinuse
    show tech notify
    run sql <query>
Cisco Unified Communications Manager Monitoring
“ccm” is the process name for the Cisco Unified Communications Manager service. The following table is a general guideline for ccm service CPU usage.
ccm CPU usage “Process(ccm)\% CPU Time”
MCS-7835 Server        MCS-7845 Server
< 44% - good           < 22% - good
44-52% - warning       22-36% - warning
> 60% - bad            > 30% - bad
You may ask: “Why does the MCS-7845 server, which has more processors, have a lower threshold for CPU usage?”
Here is why: the ccm process is a multithreaded application, but the main router thread does the bulk of call processing. A single thread can run on only one processor at any given time, even when multiple processors are available. That means the ccm main router thread can run out of CPU resources even when there are idle processors. With hyper-threading on, the MCS 7845 server has 4 virtual processors. So on a server where the main router thread is running at full blast to do call processing, it is possible that the three other processors are nearly idle. In this situation UC Manager can get into Code Yellow state even when total CPU usage is 25-30%. Similarly, on a 7835 server with two virtual processors, UC Manager could get into Code Yellow state at around 50-60% CPU usage.
NOTE 1: Code Yellow state is when the ccm service is so overloaded that it cannot process incoming calls anymore. In this case, ccm initiates call throttling.
NOTE 2: This does not mean you will see one processor's CPU usage at 100% and the rest at 0% in RTMT. Since the main thread can run on processor A for 1/10th of a second and on processor B for the next 2/10ths of a second, and so on, the CPU usage shown in RTMT is more balanced. By default RTMT shows average CPU usage over a 30-second duration.
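The arithmetic behind the 25-30% and 50-60% figures above can be sketched as follows. The function name and the assumed background load on the other processors are illustrative; the point is only that one saturated thread caps out at 100/N percent of total CPU on N processors.

```python
def total_cpu_with_one_saturated_thread(processors, other_load_percent=0.0):
    """Total CPU % when one processor is 100% busy with a single thread
    and each remaining processor runs at other_load_percent."""
    busy = 100.0 + (processors - 1) * other_load_percent
    return busy / processors


print(total_cpu_with_one_saturated_thread(4))       # four virtual processors
print(total_cpu_with_one_saturated_thread(2))       # two virtual processors
print(total_cpu_with_one_saturated_thread(4, 5.0))  # light load elsewhere
```

With four virtual processors the total is 25% when the others are idle and 28.75% when they carry a 5% load each; with two virtual processors it is 50%.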
You can also use APIs to query perfmon counters.
- Through SOAP APIs, you can query:
    Perfmon counters
    Device information
    DB access
    CDR access
- Through SNMP, you can query CISCO-CCM-MIB: ccmPhoneTable, ccmGatewayTable, etc.
You can also download historical information by using RTMT Trace & Log Central:
- Cisco AMC Service PerfMonLog (enabled by default; deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog was introduced)
- Cisco RIS Data Collector PerfMonLog (disabled by default in CUCM 5.x; enabled by default in CUCM 6.0)
Code Yellow
The CodeYellow alert is generated when the ccm service goes into Code Yellow state, which means the ccm service is overloaded. You can configure the CodeYellow alert so that, once it occurs, the trace files are downloaded for troubleshooting purposes.
The AverageExpectedDelay counter represents the current average expected delay for handling any incoming message. If the value is above the value specified in the "Code Yellow Entry Latency" service parameter, the CodeYellow alarm is generated. This counter is one of the key indicators of call processing performance issues.
Sometimes you might see CodeYellow when total CPU usage is only 25%. This is because CUCM needs one processor for call processing; when that processor has no resources available, CodeYellow may occur even though total CPU usage is only around 25-30% on a server with four virtual processors. Similarly, on a two-processor server, CodeYellow is possible at around 50% total CPU usage.
Other perfmon counters that should be monitored are:
- Cisco CallManager\CallsActive, CallsAttempted, EncryptedCallsActive, AuthenticatedCallsActive, VideoCallsActive
- Cisco CallManager\RegisteredHardwarePhones, RegisteredMGCPGateway
- Cisco CallManager\T1ChannelsActive, FXOPortsActive, MTPResourceActive, MOHMulticastResourceActive
- Cisco Locations\BandwidthAvailable
- Cisco CallManager System Performance\AverageExpectedDelay
The following alerts should also be monitored:
- CodeYellow
- DBReplicationFailure
- LowCallManagerHeartbeat
- ExcessiveVoiceQualityReports
- MaliciousCallTrace
- CDRFileDeliveryFailure/CDRAgentSendFileFailed
- CriticalServiceDown
- CoreDumpFileFound
The following is a screen shot of the RTMT performance page:
Note: In general, CUCM 4.x Communications Manager perfmon counters have been preserved by using the same names and representing the same values. And also CISCO-CCM-MIB has backward compatibility.
RIS Data Collector PerfMonLog
In CUCM 5.x, the RIS Data Collector PerfMonLog file is not enabled by default. To enable it, go to the CUCM admin page, go to the Service Parameter page, select the Cisco RIS Data Collector service and set Enable Logging to True, as shown in the following:
It is recommended that you enable RIS Data Collector PerfMonLog, which is very useful for troubleshooting since it tracks CPU, memory, disk, network, etc. If you enable RIS Data Collector PerfMonLog, you can disable AMC PerfMonLog.
Note: RIS Data Collector PerfMonLog is introduced in CUCM 6.0 to replace AMC PerfMonLog. RIS Data Collector PerfMonLog provides a little more information than AMC PerfMonLog. For detailed information, please see CUCM Serviceability User Guide.
It is recommended that you turn on RIS Data Collector PerfMonLog as soon as CUCM is up and running (in CUCM 6.0, it is turned on by default). When RIS Data Collector PerfMonLog is turned on, the impact on CPU is so small (around 1%) that it can be ignored.
RIS Data Collector PerfMonLog
Use RTMT Trace & Log Central to download the Cisco RIS Data Collector PerfMonLog files for the time period that you are interested in. Open the log file using Windows Perfmon Viewer (or the RTMT Perfmon viewer), then add performance counters of interest, such as:
- CPU usage -> Processor or Process % CPU
- Memory usage -> Memory % VM Used
- Disk usage -> Partition % Used
- Call processing -> Cisco CallManager CallsActive
The following is a screen shot of Windows Perfmon Viewer:
Service Status Monitoring
The RTMT Critical Service page provides the current status of all critical services, as follows:
The CriticalServiceDown alert is generated when any of these services is down.
Note 1: The RTMT backend service checks service status every 30 seconds (by default). If a service goes down and comes back up within that period, the CriticalServiceDown alert may not be generated.
Note 2: The CriticalServiceDown alert monitors only those services listed in the RTMT Critical Services page.
If you suspect (or want to double check) that a service was restarted without generating core files, there are a few ways to check:
The RTMT Critical Service page shows the elapsed time for each service.
Check the RIS Troubleshooting perfmon log files and see whether the PID for the service (process) has changed.
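The PID check described above can be sketched in a few lines of Python. This is only an illustration: it assumes you have already extracted a time-ordered series of (timestamp, PID) samples for one service from the perfmon log files; the function name and the sample values are hypothetical and not part of any CUCM tool.

```python
from typing import Iterable, List, Tuple


def find_pid_changes(samples: Iterable[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (timestamp, old_pid, new_pid) for each sample whose PID
    differs from the previous sample, which suggests a process restart."""
    changes = []
    previous_pid = None
    for timestamp, pid in samples:
        if previous_pid is not None and pid != previous_pid:
            changes.append((timestamp, previous_pid, pid))
        previous_pid = pid
    return changes


# Hypothetical samples for one service, taken at 30-second intervals.
samples = [
    ("10:00:00", 4321),
    ("10:00:30", 4321),
    ("10:01:00", 5870),  # PID changed: the process was restarted
    ("10:01:30", 5870),
]
print(find_pid_changes(samples))
```

A PID change between two consecutive samples indicates a restart even when the outage was too short for the CriticalServiceDown alert to fire.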
The following CLI commands can be used to check the logs of Service Manager:
file get activelog platform/servm_startup.log
file get activelog platform/log/servm*.log
The following CLI commands can be used to duplicate certain RTMT functions:
utils service
show perf
show risdb
The CoreDumpFileFound alert is generated when the RTMT backend service detects a new core dump file. Both the CriticalServiceDown and CoreDumpFileFound alerts can be configured to download the corresponding trace files for troubleshooting purposes. This helps to preserve trace files from the time of a crash.
Syslog Messages Monitoring
Syslog can be viewed using the RTMT syslog viewer; please see the following screen shot:
Sending syslog traps to remote server (CISCO-SYSLOG-MIB)
If you want to send syslog messages as syslog traps, follow these steps:
1. Set up the Trap (Notification) destination from the Unified CM Serviceability SNMP page: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_1_3/ccmsrva/sasnmpv1.html
2. Enable trap generation in CISCO-SYSLOG-MIB.
3. Set the appropriate SysLog level in CISCO-SYSLOG-MIB.
If you feel you are missing syslog traps for some Unified Communications Manager service alarms, check the RTMT syslog viewer to see if the alarms are shown there. If not, adjust the alarm configuration setting to send alarms to local syslog. For information on alarm configuration, refer to the Alarm Configuration section of the Cisco Unified CallManager Serviceability Administration Guide: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_1_3/ccmsrva/saalarm.html
Also check that the SysLog level in CISCO-SYSLOG-MIB is set at the appropriate level.
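To see why a wrong SysLog level can silently suppress traps, the following Python sketch models the filtering decision. It assumes the agent follows the usual CISCO-SYSLOG-MIB semantics, where clogNotificationsEnabled turns trap generation on and messages with a severity value greater than clogMaxSeverity are ignored; verify the object names and behavior against the MIB for your release. The function name is hypothetical.

```python
# Severity values as used by the SyslogSeverity convention in CISCO-SYSLOG-MIB
# (assumption: standard syslog severity plus one; lower number = more severe).
SEVERITY = {
    "emergency": 1,
    "alert": 2,
    "critical": 3,
    "error": 4,
    "warning": 5,
    "notice": 6,
    "info": 7,
    "debug": 8,
}


def trap_would_be_sent(message_severity: str, max_severity: str,
                       notifications_enabled: bool) -> bool:
    """Model of the agent's decision: a trap is sent only if notifications
    are enabled and the message is at least as severe as the configured level."""
    if not notifications_enabled:
        return False
    return SEVERITY[message_severity] <= SEVERITY[max_severity]


# With the level set to "warning", an error is forwarded but an info message is not.
print(trap_would_be_sent("error", "warning", True))
print(trap_would_be_sent("info", "warning", True))
```

If an alarm appears in the RTMT syslog viewer but no trap arrives, a configured level that is stricter than the alarm's severity is one likely cause.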
Syslog generated due to hardware failures has an event severity of 4 or higher and contains one of the following patterns:
Therefore, you can do a manual search for the patterns above to find hardware failure events in syslog.
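The manual search can also be scripted. The sketch below is a minimal Python example that assumes the syslog has been saved as plain text lines and that you supply the vendor patterns yourself; the pattern string and log lines shown are placeholders invented for illustration, not real vendor patterns.

```python
from typing import Iterable, List, Sequence


def find_hardware_events(lines: Iterable[str], patterns: Sequence[str]) -> List[str]:
    """Return every log line that contains at least one of the given patterns."""
    return [line for line in lines if any(pattern in line for pattern in patterns)]


# Placeholder pattern and log lines, for illustration only.
patterns = ["EXAMPLE_HW_PATTERN"]
log_lines = [
    "Jan 01 10:00:00 server1 local7 4 EXAMPLE_HW_PATTERN: fan speed low",
    "Jan 01 10:00:05 server1 local7 6 routine informational message",
]
for line in find_hardware_events(log_lines, patterns):
    print(line)
```

Plain substring matching is used here for simplicity; switch to regular expressions if the vendor patterns require it.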
RTMT Alerts as Syslog Messages and Traps
RTMT alerts can be logged as syslog messages and sent to a remote syslog server and a syslog trap server. To send alerts to local and remote syslog, configure the AMC alarm configuration page of the CUCM Serviceability Web Page. For CUCM 5.1 and later releases, go to the Serviceability Web Page and, under Alarm Configuration, check the AMC service parameter “Alarm Enabled”. Then, under Tools -> Control Center – Network Services, restart the AMC service.
Phone registration status needs to be monitored for sudden changes. If the registration status changes slightly and readjusts quickly over a short time frame, it could be indicative of a phone move, add, or change. A sudden, smaller drop in the phone registration counter can be indicative of a localized outage, for instance an access switch or WAN circuit outage or malfunction. A significant drop in the registered phone level needs immediate attention by the administrator. This counter especially needs to be monitored before and after upgrades to ensure that the system is restored completely.
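The three cases above can be expressed as a simple classification of the change between two samples of the registered phone count. The Python sketch below is illustrative only: the percentage thresholds are assumptions that must be tuned to your cluster, and the function name and return labels are hypothetical.

```python
def classify_registration_change(previous: int, current: int,
                                 minor_pct: float = 2.0,
                                 major_pct: float = 20.0) -> str:
    """Classify the drop in registered phones between two samples.

    minor_pct and major_pct are assumed thresholds, not Cisco recommendations.
    """
    if previous <= 0:
        return "no-baseline"  # nothing to compare against
    drop_pct = (previous - current) / previous * 100.0
    if drop_pct <= minor_pct:
        return "normal"       # consistent with phone moves, adds, or changes
    if drop_pct < major_pct:
        return "investigate"  # possible localized outage (switch, WAN circuit)
    return "urgent"           # significant drop, needs immediate attention


print(classify_registration_change(1000, 995))
print(classify_registration_change(1000, 940))
print(classify_registration_change(1000, 400))
```

Running the same comparison on the counts recorded before and after an upgrade gives a quick check that all phones have registered again.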
RTMT has a number of pre-canned screens for information such as Summary, Call Activity, Device Status, Server Status, Service Status, and Alert Status. The RTMT “Summary” pre-canned screen shows a summary view of CUCM system health. It shows CPU, Memory, Registered Phones, CallsInProgress, and ActiveGateway ports & channels. This should be one of the first things you check each day to make sure that CPU & memory usage are within the normal range for your cluster and that all phones are registered properly.
The Phone Summary and Device Summary pre-canned screens provide more detailed information about phone and gateway status. If a number of devices fail to register, you can use the Admin Find/List page or RTMT device search to get further information about the problem devices. The Critical Services pre-canned screen displays the current running/activation status of key services. You can access all the pre-canned screens by clicking the corresponding icons on the left.
Serviceability Reports Archive
The Cisco Serviceability Reporter service generates daily reports in the Cisco Unified CallManager Serviceability Web Page. Each report provides a summary that comprises different charts that display the statistics for that particular report. Reporter generates reports once a day on the basis of logged information, such as:
Device Statistics Report
Server Statistics Report
Service Statistics Report
Call Activities Report
Alert Summary Report
Performance Protection Report
For detailed information about each report, please see the following URL: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_0_2/ccmsrvs/sssrvrep.html#wp1033420