Cisco Nexus 7000 Series NX-OS Troubleshooting Guide -- Troubleshooting Tools and Methodology
Troubleshooting Tools and Methodology
This article describes the troubleshooting tools and methodology available for Cisco NX-OS.
Troubleshooting Installs, Upgrades, and Reboots
Before Contacting Technical Support
Troubleshooting Tools and Methodology (this section)
Command-Line Interface Troubleshooting Commands
The command-line interface (CLI) allows you to configure and monitor Cisco NX-OS using a local console or remotely using a Telnet or SSH session. The CLI provides a command structure similar to Cisco IOS software, with context-sensitive help, show commands, multiuser support, and roles-based access control.
Each feature has show commands that provide information about the feature configuration, status, and performance. Additionally, you can use the following commands for more information:
- show system--Provides information about system-level components, including cores, errors, and exceptions. Use the show system error-id command to find details on error codes.
- switch# copy running-config startup-config
[########################################] 100% 2008 Jan 16 09:59:29 zoom %$ VDC-1 %$ %BOOTVAR-2-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7000-s1-dk220.127.116.11.837.bin.S8 to standby failed, '''error=0x401e0008'''
- switch# show system error-id 0x401e0008
Error Facility: sysmgr Error Description: request was aborted, standby disk may be full
- show platform--Provides platform-specific information, including route forwarding, Quality of Service (QoS), and access control list (ACL) information.
Configuration files contain the Cisco NX-OS commands used to configure the features on a Cisco NX-OS device. Cisco NX-OS has two types of configuration files, running configuration and startup configuration. The device uses the startup configuration (startup-config) during the device startup to configure the software features. The running configuration (running-config) contains the current changes that you make to the startup-configuration file. You should create a backup version of your configuration files before modifying that configuration. You can back up the configuration files to a remote server. See the configuration file information in the Cisco NX-OS Fundamentals Configuration Guide, Release 4.0. You can also create a checkpoint copy of the configuration file that you can roll back to if problems occur. See the Rollback feature in the Cisco NX-OS System Management Configuration Guide, Release 4.0.
Cisco NX-OS features can create internal locks on the startup configuration file. In rare cases, these locks may not be removed by the features. Use the show system internal sysmgr startup-config locks command to determine if any locks remain on the startup configuration file. Use the system startup-config unlock command to remove these locks.
Cisco NX-OS supports an extensive debugging feature set for actively troubleshooting a network. Using the CLI, you can enable debugging modes for each feature and view a real-time updated activity log of the control protocol exchanges. Each log entry has a timestamp and is listed chronologically. You can limit access to the debug feature through the CLI roles mechanism to partition access on a per-role basis. While the debug commands show real-time information, you can use the show commands to list historical and real-time information.
|Caution:||Use the debug commands only under the guidance of your Cisco technical support representative because some debug commands can impact your network performance.|
|Note:||You can log debug messages to a special log file, which is more secure and easier to process than sending the debug output to the console.|
By using the ? option, you can see the options that are available for any feature. A log entry is created for each entered command in addition to the actual debug output. The debug output shows a time-stamped account of the activity that occurred between the local device and other adjacent devices.
You can use the debug facility to track events, internal messages, and protocol errors. However, you should be careful when using the debug utility in a production environment, because some options may prevent access to the device by generating too many messages to the console or creating CPU-intensive events that could seriously affect network performance.
|Note:||You can filter out unwanted debug information by using the debug-filter command.|
Ping and Traceroute
|Note:||Use the ping and traceroute features to troubleshoot problems with connectivity and path choices. Do not use these features to identify or resolve network performance issues.|
The ping and traceroute commands are two of the most useful tools for troubleshooting TCP/IP networking problems. The ping utility generates a series of echo packets to a destination across a TCP/IP internetwork. When the echo packets arrive at the destination, they are rerouted and sent back to the source.
The traceroute utility operates in a similar fashion but can also determine the specific path that a frame takes to its destination on a hop-by-hop basis.
Use the ping command to verify connectivity and latency to a particular destination across an IPv4 routed network.
Use the ping6 command to verify connectivity and latency to a particular destination across an IPv6 routed network.
The ping utility allows you to send a short message to a port or end device. By specifying the IPv4 or IPv6 address, you can send a series of frames to a target destination. Once these frames reach the target, they are looped back to the source and a time stamp is taken.
- switch# ping 172.28.230.1 vrf management
PING 172.28.230.1 (172.28.230.1): 56 data bytes 64 bytes from 172.28.230.1: icmp_seq=0 ttl=254 time=1.095 ms 64 bytes from 172.28.230.1: icmp_seq=1 ttl=254 time=1.083 ms 64 bytes from 172.28.230.1: icmp_seq=2 ttl=254 time=1.101 ms 64 bytes from 172.28.230.1: icmp_seq=3 ttl=254 time=1.093 ms 64 bytes from 172.28.230.1: icmp_seq=4 ttl=254 time=1.237 ms --- 172.28.230.1 ping statistics --- 5 packets transmitted, 5 packets received, 0.00% packet loss round-trip min/avg/max = 1.083/1.121/1.237 ms
Use traceroute to do the following:
- Trace the route followed by the data traffic.
- Compute the interswitch (hop-to-hop) latency.
The traceroute utility identifies the path taken on a hop-by-hop basis and includes a time stamp at each hop in both directions. You can use traceroute to test the connectivity of ports along the path between the generating device and the device closest to the destination.
Use the traceroute command for IPv4 networks. Use the traceroute6 command for IPv6 networks. If the destination cannot be reached, the path discovery traces the path up to the point of failure.
- switch# traceroute 172.28.254.254 vrf management
traceroute to 172.28.254.254 (172.28.254.254), 30 hops max, 40 byte packets 1 172.28.230.1 (172.28.230.1) 0.941 ms 0.676 ms 0.585 ms 2 172.24.114.213 (172.24.114.213) 0.733 ms 0.7 ms 0.69 ms 3 172.20.147.46 (172.20.147.46) 0.671 ms 0.619 ms 0.615 ms 4 172.28.254.254 (172.28.254.254) 0.613 ms 0.628 ms 0.61 ms
Monitoring Processes and CPUs
Use the show processes command to identify the processes that are running and the status of each process. The command output includes the following:
- PID = process ID.
- State = process state.
- PC = current program counter in hexadecimal format.
- Start_cnt = how many times a process has been started (or restarted).
- TTY = terminal that controls the process. A --(hyphen) usually means a daemon not running on any particular TTY.
- Process = name of the process.
Process states are as follows:
- D = uninterruptible sleep (usually I/O).
- R = runnable (on run queue).
- S = sleeping.
- T = traced or stopped.
- Z = defunct (zombie) process.
- NR = not-running.
- ER = should be running but currently not-running.
- Note: Typically, the ER state designates a process that has been restarted too many times, causing the system to classify it as faulty and disable it.
- switch# show processes ?
cpu Show processes CPU Info log Show information about process logs memory Show processes Memory Info
- switch# show processes
PID State PC Start_cnt TTY Type Process ----- ----- -------- ----------- ---- ---- ------------- 1 S b7f9e468 1 - O init 2 S 0 1 - O migration/0 3 S 0 1 - O ksoftirqd/0 4 S 0 1 - O desched/0 5 S 0 1 - O migration/1 6 S 0 1 - O ksoftirqd/1 7 S 0 1 - O desched/1 8 S 0 1 - O events/0 9 S 0 1 - O events/1 10 S 0 1 - O khelper 15 S 0 1 - O kthread 24 S 0 1 - O kacpid 103 S 0 1 - O kblockd/0 104 S 0 1 - O kblockd/1 117 S 0 1 - O khubd 184 S 0 1 - O pdflush 185 S 0 1 - O pdflush 187 S 0 1 - O aio/0 188 S 0 1 - O aio/1 189 S 0 1 - O SerrLogKthread ...
Using the show processes cpu Command
Use the show processes cpu command to display CPU utilization. The command output includes the following:
- Runtime(ms) = CPU time that the process has used, expressed in milliseconds.
- Invoked = Number of times the process has been invoked.
- uSecs = Average CPU time, in microseconds, for each process invocation.
- 1Sec = Percentage of CPU utilization for the last one second.
- switch# show processes cpu
PID Runtime(ms) Invoked uSecs 1Sec Process ----- ----------- -------- ----- ----- ----------- 1 2264 108252 20 0 init 2 950 211341 4 0 migration/0 3 1154 32833341 0 0 ksoftirqd/0 4 609 419568 1 0 desched/0 5 758 214253 3 0 migration/1 6 2462 155309355 0 0 ksoftirqd/1 7 2496 392083 6 0 desched/1 8 443 282990 1 0 events/0 9 578 260184 2 0 events/1 10 56 2681 21 0 khelper 15 0 30 25 0 kthread 24 0 2 5 0 kacpid 103 81 89 914 0 kblockd/0 104 56 265 213 0 kblockd/1 117 0 5 17 0 khubd 184 0 3 3 0 pdflush 185 1796 104798 17 0 pdflush 187 0 2 3 0 aio/0 188 0 2 3 0 aio/1 189 0 1 3 0 SerrLogKthread ...
Using the show system resource Command
Use the show system resources command to display system-related CPU and memory statistics. The output includes the following:
- Load is defined as the number of running processes. The average reflects the system load over the past 1, 5, and 15 minutes.
- Processes displays the number of processes in the system, and how many are actually running when the command is issued.
- CPU states shows the CPU usage percentage in user mode, kernel mode, and idle time in the last one second.
- Memory usage provides the total memory, used memory, free memory, memory used for buffers, and memory used for cache in kilobytes. Buffers and the cache are also included in the used memory statistics.
- switch# show system resources
Load average: 1 minute: 0.30 5 minutes: 0.34 15 minutes: 0.28 Processes : 606 total, 2 running CPU states : 0.0% user, 0.0% kernel, 100.0% idle Memory usage: 2063268K total, 1725944K used, 337324K free 2420K buffers, 857644K cache
Using Onboard Failure Logging
Cisco NX-OS provides the facility to log failure data to the persistent storage, which can be retrieved and displayed for analysis. This onboard failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. This information will help you analyze failed modules.
The data stored by the OBFL facility includes the following:
- Time of initial power on
- Slot number of the module in the chassis
- Initial temperature of the module
- Firmware, BIOS, FPGA, and ASIC versions
- Serial number of the module
- Stack trace for crashes
- CPU hog information
- Memory leak information
- Software error messages
- Hardware exception logs
- Environmental history
- OBFL specific history information
- ASIC interrupt and error statistics history
- ASIC register dumps
For more information about configuring OBFL, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
Using GOLD Diagnostics
Generic Online Diagnostics (GOLD) defines a common framework for diagnostic operations across Cisco platforms. The GOLD implementation checks the health of hardware components and verifies proper operation of the system data and control planes. Some tests take effect when the system is booting up; other tests take effect when the system is operational.
A booting module goes through a series of checks before coming online to allow the system to detect faults in the hardware components at bootup and to ensure that a failing module is not introduced in a live network.
Defects are also diagnosed during system operation or runtime. You can configure a series of diagnostic checks to determine the condition of an online system. You must distinguish between disruptive and nondisruptive diagnostic tests. Although nondisruptive tests occur in the background and do not affect the system data or control planes, disruptive tests do affect live packet flows. You should schedule disruptive tests during special maintenance windows. The show diagnostic content module output displays test attributes such as disruptive or nondisruptive tests.
You can configure runtime diagnostic checks to run at a specific time or to run continually in the background.
Health-monitoring diagnostic tests are nondisruptive, and they run in the background while the system is in operation. The role of online diagnostic health monitoring is to proactively detect hardware failures in the live network environment and inform you of a failure.
GOLD collects diagnostic results and detailed statistics for all tests including the last execution time, the first and last test pass time, the first and last test failure time, the total run count, the total failure count, the consecutive failure count, and the error code. These test results help administrators determine the condition of a system and understand the reason for a system failure. Use the show diagnostic result command to view diagnostic results.
For more information about configuring GOLD, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
Using Embedded Event Manager
Embedded Event Manager (EEM) is a policy-based framework that allows you to monitor key system events and then act on those events through a set policy. The policy is a preprogrammed script that you can load that defines actions that the device should invoke based on set events occurring. The script can generate actions, including, but not limited to, generating custom syslog or SNMP traps, invoking CLI commands, forcing a failover, and much more.
For more information about configuring EEM, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
Ethanalyzer is a Cisco NX-OS protocol analyzer tool based on the Wireshark (formerly Ethereal) open source code. Ethanalyzer is a command-line version of Wireshark that captures and decodes packets. You can use Ethanalyzer to troubleshoot your network and analyze the control-plane traffic.
To configure Ethanalyzer, use the following commands:
|ethanalyzer local interface||Captures packets sent or received by the supervisor and provides detailed protocol information.|
|ethanalyzer local interface brief||Captures packets sent or received by the supervisor and provides a summary of protocol information.|
|ethanalyzer local interface limit-captured-frames||Limits the number of frames to capture.|
|ethanalyzer local interface limit-frame-size||Limits the length of the frame to capture.|
|ethanalyzer local interface capture-filter||Filters the types of packets to capture.|
|ethanalyzer local interface display-filter||Filters the types of captured packets to display.|
|ethanalyzer local interface decode-internal|| Decodes the internal frame header for Cisco NX-OS.
|ethanalyzer local interface write||Saves the captured data to a file.|
|ethanalyzer local interface read||Opens the captured data file and analyzes it.|
Ethanalyzer does not capture data traffic that Cisco NX-OS forwards in the hardware.
This example shows captured data (limited to four packets) on the management interface:
- switch(config)# ethanalyzer local interface mgmt brief limit-captured-frames 4
Capturing on eth1 2008-02-18 13:21:21.841182 172.28.230.2 -> 18.104.22.168 HSRP Hello (state Standy) 2008-02-18 13:21:21.842190 10.86.249.17 -> 172.28.231.193 TCP 4261 > telnet [AC] Seq=0 Ack=0 Win=64475 Len=0 2008-02-18 13:21:21.843039 172.28.231.193 -> 10.86.249.17 TELNET Telnet Data .. 2008-02-18 13:21:21.850463 00:13:5f:1c:ee:80 -> ab:00:00:02:00:00 0x6002 DEC DN Remote Console 4 packets captured
This example shows detailed captured data for one HSRP packet:
- switch(config)# ethanalyzer local interface mgmt capture-filter "udp port 1985"
limit-captured-frames 1 Capturing on eth1 Frame 1 (62 bytes on wire, 62 bytes captured) Arrival Time: Feb 18, 2008 13:29:19.961280000 [Time delta from previous captured frame: 1203341359.961280000 seconds] [Time delta from previous displayed frame: 1203341359.961280000 seconds] [Time since reference or first frame: 1203341359.961280000 seconds] Frame Number: 1 Frame Length: 62 bytes Capture Length: 62 bytes [Frame is marked: False] [Protocols in frame: eth:ip:udp:hsrp] Ethernet II, Src: 00:00:0c:07:ac:01 (00:00:0c:07:ac:01), Dst: 01:00:5e:00:00:02 (01:00:5e:00:00:02) Destination: 01:00:5e:00:00:02 (01:00:5e:00:00:02) Address: 01:00:5e:00:00:02 (01:00:5e:00:00:02) .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast) .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default) Source: 00:00:0c:07:ac:01 (00:00:0c:07:ac:01) Address: 00:00:0c:07:ac:01 (00:00:0c:07:ac:01) .... ...0 .... .... .... .... = IG bit: Individual address (unicast) .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default) Type: IP (0x0800) Internet Protocol, Src: 172.28.230.3 (172.28.230.3), Dst: 22.214.171.124 (126.96.36.199) Version: 4 Header length: 20 bytes Differentiated Services Field: 0xc0 (DSCP 0x30: Class Selector 6; ECN: 0x00) 1100 00.. = Differentiated Services Codepoint: Class Selector 6 (0x30) .... ..0. = ECN-Capable Transport (ECT): 0 .... ...0 = ECN-CE: 0 Total Length: 48 Identification: 0x0000 (0) Flags: 0x00 0... = Reserved bit: Not set .0.. = Don't fragment: Not set ..0. = More fragments: Not set Fragment offset: 0 Time to live: 1 Protocol: UDP (0x11) Header checksum: 0x46db [correct] [Good: True] [Bad : False] Source: 172.28.230.3 (172.28.230.3) Destination: 188.8.131.52 (184.108.40.206) User Datagram Protocol, Src Port: 1985 (1985), Dst Port: 1985 (1985) Source port: 1985 (1985) Destination port: 1985 (1985) Length: 28 Checksum: 0x8ab9 [correct] [Good Checksum: True] [Bad Checksum: False] Cisco Hot Standby Router Protocol Version: 0 Op Code: Hello (0) State: Active (16) Hellotime: Default (3) Holdtime: Default (10) Priority: 105 Group: 1 Reserved: 0Authentication Data: Default (cisco) Virtual IP Address: 172.28.230.1 (172.28.230.1) 1 packets captured
This example uses a display filter to show only those HSRP packets that have an active HSRP state:
- switch(config)# ethanalyzer local interface mgmt brief display-filter "hsrp.state==Active" limit-captured-frames 2
Capturing on eth1 2008-02-18 14:35:41.443118 172.28.230.3 -> 220.127.116.11 HSRP Hello (state Active) 2008-02-18 14:35:44.326892 172.28.230.3 -> 18.104.22.168 HSRP Hello (state Active) 2 packets captured
Cisco DCNM gathers events and statistics for each supported feature.
For more information about DCNM, see the Cisco DCNM Fundamentals Configuration Guide, Release 4.0.
SNMP and RMON Support
Cisco NX-OS provides extensive SNMPv1, v2, and v3 support, including Management Information Bases (MIBs) and notifications (traps and informs).
The SNMP standard allows any third-party applications that support the different MIBs to manage and monitor Cisco NX-OS.
SNMPv3 provides extended security. Each device can be selectively enabled or disabled for SNMP service. In addition, each device can be configured with a method of handling SNMPv1 and v2 requests.
Cisco NX-OS also supports Remote Monitoring (RMON) alarms and events. RMON alarms and events provide a mechanism for setting thresholds and sending notifications based on changes in network behavior.
The AlarmGroup allows you to set alarms. Alarms can be set on one or multiple parameters within a device. For example, you can set an RMON alarm for a specific level of CPU utilization on a device. The EventGroup allows you to configure events that are actions to be taken based on an alarm condition. The types of events that are supported include logging, SNMP traps, and log-and-trap.
For more information about configuring SNMP and RMON, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
The RADIUS protocol is used to exchange attributes or credentials between a head-end RADIUS server and a client device. These attributes relate to three classes of services:
Authentication refers to the authentication of users for access to a specific device. You can use RADIUS to manage user accounts for access to a Cisco NX-OS device. When you try to log into a device, Cisco NX-OS validates you with information from a central RADIUS server.
Authorization refers to the scope of access that you have once you have been authenticated. Assigned roles for users can be stored in a RADIUS server with a list of actual devices that the user should have access to. Once the user has been authenticated, the device can then refer to the RADIUS server to determine the access that the user will have.
Accounting refers to the log information that is kept for each management session in a device. You can use this information to generate reports for troubleshooting purposes and user accountability. You can implement accounting locally or remotely (using RADIUS).
This example shows accounting log entries:
- switch# show accounting log
Sun Dec 15 04:02:27 2007:start:/dev/pts/0_1039924947:admin Sun Dec 15 04:02:28 2007:stop:/dev/pts/0_1039924947:admin:vsh exited normally Sun Dec 15 04:02:33 2007:start:/dev/pts/0_1039924953:admin Sun Dec 15 04:02:34 2007:stop:/dev/pts/0_1039924953:admin:vsh exited normally Sun Dec 15 05:02:08 2007:start:snmp_1039928528_172.22.95.167:public Sun Dec 15 05:02:08 2007:update:snmp_1039928528_172.22.95.167:public:Switchname
|Note:||The accounting log shows only the beginning and end (start and stop) for each session.|
The system message logging software saves messages in a log file or directs the messages to other devices. This feature provides the following capabilities:
- Logging information for monitoring and troubleshooting
- Selection of the types of logging information to be captured
- Selection of the destination of the captured logging information
You can use syslog to store a chronological log of system messages locally or to send this information to a central syslog server. The syslog messages can also be sent to the console for immediate use. These messages can vary in detail depending on the configuration that you choose.
The syslog messages are categorized into seven severity levels from debug to critical events. You can limit the severity levels that are reported for specific services within the device. For example, you may wish only to report debug events for the OSPF service but record all severity level events for the BGP service.
Log messages are not saved across system reboots. However, a maximum of 100 log messages with a severity level of critical and below (levels 0, 1, and 2) are saved in NVRAM. You can view this log at any time with the show logging nvram command.
Cisco NX-OS supports the following logging levels:
By default, the device logs normal but significant system messages to a log file and sends these messages to the system console. Users can specify which system messages should be saved based on the type of facility and the severity level. Messages have a time stamp to enhance real-time debugging and management.
Enabling Logging for Telnet or SSH
System logging messages are sent to the console based on the default or configured logging facility and severity values.
Users can disable logging to the console or enable logging to a given Telnet or Secure Shell (SSH) session.
- To disable console logging, use the no logging console command in configuration mode.
- To enable logging for Telnet or SSH, use the terminal monitor command in EXEC mode.
- When logging to a console session is disabled or enabled, that state is applied to all future console sessions. If a user exits and logs in again to a new session, the state is preserved. However, when logging to a Telnet or SSH session is enabled or disabled, that state is applied only to that session. The state is not preserved after the user exits the session.
The no logging console command disables console logging and is enabled by default.
- switch(config)# no logging console
The terminal monitor command enables logging for Telnet or SSH and is disabled by default.
- switch# terminal monitor
For more information about configuring syslog, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
You can use the Switched Port Analyzer (SPAN) utility to perform detailed troubleshooting or to take a sample of traffic from a particular application host for proactive monitoring and analysis.
When you have a problem in your network that you cannot solve by fixing the device configuration, you typically need to take a look at the protocol level. You can use debug commands to look at the control traffic between an end node and a device. However, when you need to focus on all the traffic that originates from or is destined to a particular end node, you can use a protocol analyzer to capture protocol traces.
To use a protocol analyzer, you must insert the analyzer inline with the device under analysis, which disrupts input and output (I/O) to and from the device.
In Ethernet networks, you can solve this problem by using the SPAN utility. SPAN allows you to take a copy of all traffic and direct it to another port within the device. The process is nondisruptive to any connected devices and is facilitated in the hardware, which prevents any unnecessary CPU load.
SPAN allows you to create up to 16 independent SPAN sessions within the device. Each session can have up to four unique sources and one destination port. In addition, you can apply a filter to capture only the traffic received or the traffic transmitted. You can even capture traffic from a particular VLAN.
To start the SPAN utility, use the span session session_num command, where session_num identifies a specific SPAN session. When you enter this command, the system displays a submenu, which allows you to configure the destination interface and the source VLAN or interfaces.
- switch2# config terminal
- switch2(config)# span session 1 <<=== Create a span session
- switch2(config-span)# source interface e1/8 <<=== Specify the port to be spanned
- switch2(config-span)# destination interface e1/3 <<==== Specify the span destination port
- switch2(config-span)# end
- switch2# show span session 1
Session 1 (active) Destination is e1/3 No session filters configured Ingress (rx) sources are e1/8, Egress (tx) sources are fe1/8,
For more information about configuring SPAN, see the Cisco NX-OS System Management Configuration Guide, Release 4.0.
Using the Blue Beacon Feature
On some platforms, you can cause the platform LEDs to blink. This feature is a useful way to mark a piece of hardware so that a local administrator can quickly identify the hardware for troubleshooting or replacement.
To flash the LEDs on a hardware entity, use the following commands:
|blink chassis||Flashes the chassis LED.|
|blink fan number||Flashes one of the fan LEDs.|
|blink module slot||Flashes the selected module LED.|
|blink powersupply number||Flashes one of the power supply LEDs.|
|blink xbar number||Flashes one of the crossbar module LEDs.|
To flash a single port LED on a module, use the following command in interface configuration mode:
|beacon||Flashes the interface LED.|