Cisco Application Control Engine (ACE) Module Troubleshooting Guide, Release A2(x) -- Troubleshooting Redundancy
This article describes the procedures for troubleshooting redundancy issues with your ACE.
Overview of ACE Redundancy
Redundancy (or fault tolerance) allows your network to remain operational even if one of the ACEs becomes unresponsive. Redundancy ensures that your network services and applications are always available.
Redundancy provides seamless switchover of flows if an ACE becomes unresponsive or a critical host, interface, or HSRP group fails. Redundancy supports the following network applications that require fault tolerance:
- Mission-critical enterprise applications
- Banking and financial services
- E-commerce
- Long-lived flows such as FTP and HTTP file transfers
Redundancy Protocol
You can configure a maximum of two ACEs (peers) in the same Catalyst 6500 series switch or in different chassis for redundancy. Each peer module can contain one or more fault-tolerant (FT) groups. Each FT group consists of two members: one active context and one standby context. For more information about contexts, see the Cisco Application Control Engine Module Virtualization Configuration Guide. An FT group has a unique group ID that you assign.
Both ACE modules can be active at the same time, processing traffic for distinct virtual devices and backing up each other (stateful redundancy). See Figure 1.
Figure 1. Example of an Active-Active Configuration
The ACE uses the redundancy protocol to communicate between the redundant peers. The election of the active member within each FT group is based on a priority scheme. The member configured with the higher priority is elected as the active member. If a member with a higher priority is found after the other member becomes active, the new member becomes active because it has a higher priority. This behavior is known as preemption and is enabled by default.
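The election rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the ACE implementation: the Member class and the elect_active function are hypothetical names, and the tie-breaking rule (equal priorities leave the current active member in place) is an assumption that the text above does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    name: str       # hypothetical label for an FT group member
    priority: int   # configured FT group priority


def elect_active(current: Optional[Member], candidate: Member,
                 preempt: bool = True) -> Member:
    """Return the member that should be active once `candidate` is seen."""
    if current is None:
        # No active member yet, so the candidate takes over.
        return candidate
    if preempt and candidate.priority > current.priority:
        # Preemption: a higher-priority member replaces the active one.
        return candidate
    # Assumption: ties and lower priorities leave the current member active.
    return current


active = elect_active(None, Member("ace-1", 100))
active = elect_active(active, Member("ace-2", 200))
print(active.name)  # prints "ace-2"
```

With preempt set to False, the same call sequence would leave ace-1 active.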
One virtual MAC address (VMAC) is associated with each FT group. The format of the VMAC is: 00-0b-fc-fe-1b-groupID. Because a VMAC does not change upon a switchover, the client and server ARP tables do not require updating. The ACE selects a VMAC from a pool of virtual MACs available to it. You can specify the pool of MAC addresses that the local ACE and the peer ACE use by configuring the shared-vlan-hostid command and the peer shared-vlan-hostid command, respectively. To avoid MAC address conflicts, be sure that the two pools are different on the two ACEs. For more information about VMACs and MAC address pools, see the Cisco Application Control Engine Module Routing and Bridging Configuration Guide.
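As a rough illustration of the VMAC format, the following Python sketch builds the address for a given FT group ID. The function name ft_group_vmac is hypothetical, and rendering the group ID as a single two-digit hexadecimal octet is an assumption; the text above states only that the last field is the group ID.

```python
def ft_group_vmac(group_id: int) -> str:
    """Build the FT group VMAC in the 00-0b-fc-fe-1b-groupID format."""
    if not 0 <= group_id <= 0xFF:
        # Assumption: the group ID must fit in the final octet.
        raise ValueError("group ID does not fit in one octet")
    # Assumption: the group ID is written as two lowercase hex digits.
    return f"00-0b-fc-fe-1b-{group_id:02x}"


print(ft_group_vmac(1))   # prints "00-0b-fc-fe-1b-01"
print(ft_group_vmac(20))  # prints "00-0b-fc-fe-1b-14"
```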
Each FT group acts as an independent redundancy instance. When a switchover occurs, the active member in the FT group becomes the standby member and the original standby member becomes the active member. A switchover can occur for the following reasons:
- The active member becomes unresponsive.
- A tracked host, interface, or HSRP group fails.
- You enter the ft switchover command to force a switchover.
FT VLAN
Redundancy uses a dedicated FT VLAN between redundant ACEs to transmit flow-state information and the redundancy heartbeat. You must configure this same VLAN on both peer modules. You also must configure a different IP address within the same subnet on each module for the FT VLAN. Cisco recommends two port-channeled 1-Gigabit Ethernet links for the FT VLAN.
| Note: | Do not use the FT VLAN for any other network traffic, including HSRP traffic and data. |
The two redundant modules constantly communicate over the FT VLAN to determine the operating status of each module. The standby member uses the heartbeat packet to monitor the health of the active member. The active member uses the heartbeat packet to monitor the health of the standby member. Communications over the switchover link include the following data:
- Redundancy protocol packets
- State information replication data
- Configuration synchronization information
- Heartbeat packets
For multiple contexts, the FT VLAN resides in the system configuration file. Each FT VLAN on the ACE has one unique MAC address associated with it. The ACE uses these device MAC addresses as the source or destination MACs for sending or receiving redundancy protocol state and configuration replication packets.
| Note: | The IP address and the MAC address of the FT VLAN do not change at switchover. |
Configuration Requirements and Restrictions
Follow these requirements and restrictions when configuring the redundancy feature:
- Redundancy is not supported between an ACE module and an ACE appliance operating as peers. Both peers must be of the same ACE device type and run the same software release.
- In bridged mode (Layer 2), two contexts cannot share the same VLAN.
- To achieve active-active redundancy, a minimum of two contexts and two FT groups are required on each ACE.
- When you configure redundancy, the ACE keeps all interfaces that do not have an IP address in the Down state. The IP address and the peer IP address that you assign to a VLAN interface should be in the same subnet but should be different IP addresses. For more information about configuring VLAN interfaces, see the Cisco Application Control Engine Module Routing and Bridging Configuration Guide.
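The addressing rule in the last item (same subnet, different addresses) can be checked offline before you apply a configuration. The following Python sketch uses the standard ipaddress module; the function name check_peer_addressing is hypothetical, and the sketch assumes that both addresses use the same netmask.

```python
import ipaddress


def check_peer_addressing(ip: str, peer_ip: str, netmask: str) -> list:
    """Return a list of problems with an ip address / peer ip address pair."""
    local = ipaddress.ip_interface(f"{ip}/{netmask}")
    peer = ipaddress.ip_interface(f"{peer_ip}/{netmask}")
    problems = []
    if local.ip == peer.ip:
        problems.append("IP address and peer IP address are identical")
    if local.network != peer.network:
        problems.append("IP address and peer IP address are in different subnets")
    return problems


# An empty list means the pair satisfies both conditions.
print(check_peer_addressing("192.168.1.1", "192.168.1.2", "255.255.255.0"))  # prints []
```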
Example of a Redundancy Configuration
The following example shows a running-configuration file that defines fault tolerance (FT) for a single ACE module operating in a redundancy configuration. You can configure a maximum of two ACE modules (peers) for redundancy, which allows failover from the active module to the standby module.
| Note: | All FT parameters are configured in the Admin context. |
This configuration addresses the following redundancy components:
- A dedicated FT VLAN for communication between the members of an FT group. You must configure this same VLAN on both peer modules.
- An FT peer definition.
- An FT group that is associated with the Admin context.
- A critical tracking and failure detection process for an interface.
access-list ACL1 line 10 extended permit ip any any
class-map type management match-any L4_REMOTE-MGT_CLASS
  2 match protocol telnet any
  3 match protocol ssh any
  4 match protocol icmp any
  5 match protocol http any
  7 match protocol snmp any
  8 match protocol https any
policy-map type management first-match L4_REMOTE-MGT_POLICY
  class L4_REMOTE-MGT_CLASS
    permit
interface vlan 100
  ip address 192.168.83.219 255.255.255.0
  peer ip address 192.168.83.230 255.255.255.0
  alias 192.168.83.200 255.255.255.0
  access-group input ACL1
  service-policy input L4_REMOTE-MGT_POLICY
  no shutdown
ft interface vlan 200
  ip address 192.168.1.1 255.255.255.0
  peer ip address 192.168.1.2 255.255.255.0
  no shutdown
ft peer 1
  ft-interface vlan 200
  heartbeat interval 300
  heartbeat count 10
ft group 1
  peer 1
  priority 200
  associate-context Admin
  inservice
ft track interface TRACK_VLAN100
  track-interface vlan 100
  peer track-interface vlan 200
  priority 50
  peer priority 5
ip route 0.0.0.0 0.0.0.0 192.168.83.1
Troubleshooting ACE Redundancy
This section describes the methods and CLI commands that you can use to troubleshoot redundancy issues in your ACE.
1. Ensure that the software versions and licenses installed in the two ACEs are identical. A software or license mismatch may generate the following syslog message:
%ACE-1-727006: HA: Peer is incompatible due to error str. Cannot be Redundant.
To verify the software (SRG) and license compatibility of the FT peer, enter the following command:
ACE_module5/Admin# show ft peer status
Peer Id : 1
State : FSM_PEER_STATE_MY_IPADDR
Maintenance mode : MAINT_MODE_OFF
SRG Compatibility : COMPATIBLE
License Compatibility : COMPATIBLE
FT Groups : 1
If the software or license is incompatible, install the appropriate software image or license in the peer to correct the problem.
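If you capture the command output to a file or a script, the compatibility check can be automated. The following Python sketch is one possible approach, not a Cisco tool: parse_fields and incompatibilities are hypothetical helper names, and the sketch assumes one "Field : Value" pair per line, as in the output shown above. It reports any compatibility field whose value is not exactly COMPATIBLE.

```python
def parse_fields(text: str) -> dict:
    """Split 'Field : Value' lines into a dictionary; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def incompatibilities(text: str) -> dict:
    """Return the compatibility fields whose value is not COMPATIBLE."""
    fields = parse_fields(text)
    return {
        name: fields.get(name, "MISSING")
        for name in ("SRG Compatibility", "License Compatibility")
        if fields.get(name) != "COMPATIBLE"
    }


SAMPLE = """\
Peer Id : 1
State : FSM_PEER_STATE_MY_IPADDR
Maintenance mode : MAINT_MODE_OFF
SRG Compatibility : COMPATIBLE
License Compatibility : COMPATIBLE
FT Groups : 1
"""

print(incompatibilities(SAMPLE))  # prints {}
```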
2. Ensure that any SSL certificates (certs) and keys that exist in the active ACE are also configured in the standby ACE. SSL certs and keys are not synchronized automatically from the active to the standby. Use the crypto export and crypto import commands to accomplish this task. This requirement also applies to scripts and scripted probes. Failure to keep the active and standby configurations identical will cause configuration synchronization to fail and may cause the standby ACE to enter the STANDBY-COLD state.
The ACE sends heartbeat packets via UDP over the FT VLAN between peers. When heartbeats are not received during the specified interval (the interval and count are configurable), the ACE notifies the HA processor on the CP by sending a Peer_Down interprocess communication protocol (IPCP) message. If a peer is down or unreachable, you may receive one of the following syslog messages:
%ACE-1-727001: HA: Peer IP address is not reachable. Error: error str
%ACE-1-727002: HA: FT interface interface name to reach peer IP address is down. Error: error str
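To estimate how long the ACE waits before it declares a peer down, you can multiply the heartbeat interval by the heartbeat count. The following Python sketch assumes that the interval is expressed in milliseconds and that the detection window is simply interval times count; neither assumption is stated in this article, so verify both against the configuration guide for your release. The function name peer_down_window_ms is hypothetical.

```python
def peer_down_window_ms(interval_ms: int, count: int) -> int:
    """Approximate time without heartbeats before the peer is declared down."""
    # Assumption: detection window = heartbeat interval x heartbeat count.
    return interval_ms * count


# Values from the example configuration: heartbeat interval 300, heartbeat count 10.
print(peer_down_window_ms(300, 10))  # prints 3000
```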
3. Verify connectivity between the peers over the FT VLAN. If a peer device is physically up but connectivity is the problem, you may end up with two active devices. If connectivity is lost due to the peer going down, reboot the peer to restore redundancy between the two devices.
4. Display heartbeat statistics, including missed heartbeats, by entering the following command:
ACE_module5/Admin# show ft stats
HA Heartbeat Statistics
------------------------
Number of Heartbeats Sent : 0
Number of Heartbeats Received : 0
Number of Heartbeats Missed : 0
Number of Unidirectional HB's Received : 0
Number of HB Timeout Mismatches : 0
Num of Peer Up Events Sent : 0
Num of Peer Down Events Sent : 0
Successive HB's miss Intervals counter : 0
Successive Uni HB's recv counter : 0
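A single snapshot of these counters does not show whether heartbeats are still being missed. One approach is to capture the output twice and compare the counters. The following Python sketch does this; heartbeat_counters and missed_delta are hypothetical helper names, and the counter values in the two samples are invented for illustration.

```python
import re


def heartbeat_counters(text: str) -> dict:
    """Extract 'Name : <integer>' counters from show ft stats output."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def missed_delta(before: str, after: str) -> int:
    """Return how many heartbeats were missed between two captures."""
    key = "Number of Heartbeats Missed"
    return heartbeat_counters(after)[key] - heartbeat_counters(before)[key]


# Invented sample captures, taken some time apart.
FIRST = "Number of Heartbeats Sent : 1000\nNumber of Heartbeats Missed : 2\n"
SECOND = "Number of Heartbeats Sent : 1200\nNumber of Heartbeats Missed : 9\n"

print(missed_delta(FIRST, SECOND))  # prints 7
```

A delta that keeps growing between captures suggests an ongoing problem on the FT VLAN rather than a past event.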
5. Provide an alternate path for the ACE to check the peer's status in case of missed heartbeats by configuring a query interface with the following commands:
ACE_module5/Admin# config
Enter configuration commands, one per line. End with CNTL/Z.
ACE_module5/Admin(config)# ft peer 1
ACE_module5/Admin(config-ft-peer)# query-interface vlan 100
If the query interface is configured, upon receiving a PEER_DOWN message from the heartbeat process, the ACE data plane attempts to ping the peer using the Query VLAN. If the ping fails, the standby transitions to the ACTIVE state. If the ping is successful, the standby transitions to the STANDBY_COLD state. To recover from the STANDBY_COLD state, reboot the standby.
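The outcome of the query-interface check can be summarized as a small decision function. This Python sketch models only the case described above, in which a query interface is configured and a PEER_DOWN message has been received; the function name is hypothetical.

```python
def standby_state_after_peer_down(query_ping_ok: bool) -> str:
    """Model the standby's next state once heartbeats have been lost."""
    if query_ping_ok:
        # The peer still answers on the query VLAN, so only the FT VLAN is
        # affected and the standby must not take over.
        return "STANDBY_COLD"
    # The peer is unreachable on both paths, so the standby takes over.
    return "ACTIVE"


print(standby_state_after_peer_down(query_ping_ok=False))  # prints "ACTIVE"
print(standby_state_after_peer_down(query_ping_ok=True))   # prints "STANDBY_COLD"
```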
Each peer uses a VMAC that is dependent on the FT group number. If you are using multiple ACEs in the same chassis, be careful when using the same FT groups in more than one module.
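If you keep an inventory of the FT group IDs configured on each module, a short script can flag IDs that are reused across modules and therefore deserve a closer look. The following Python sketch is illustrative only; the function name shared_ft_group_ids and the module labels are hypothetical.

```python
from collections import defaultdict


def shared_ft_group_ids(modules: dict) -> dict:
    """Map each FT group ID used by more than one module to those modules."""
    seen = defaultdict(list)
    for module, group_ids in modules.items():
        for group_id in set(group_ids):
            seen[group_id].append(module)
    return {gid: sorted(mods) for gid, mods in seen.items() if len(mods) > 1}


# Hypothetical inventory: module label -> FT group IDs configured on it.
inventory = {"slot5": [1, 2], "slot6": [2, 3]}
print(shared_ft_group_ids(inventory))  # prints {2: ['slot5', 'slot6']}
```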
6. Display the VMAC for an FT group by entering the following command:
ACE_module5/Admin# show interface internal iftable vlan100
vlan100
--------
ifid: 6
Context: 0
ifIndex: 16777316
physid: 100
rmode: 0 (unknown)
iftype: 0 (vlan)
bvi_bgid: 0
MTU: 1500
MAC: 00:18:b9:a6:91:15
VMAC: 00:00:00:00:00:00 <------- Virtual MAC Address
Flags: 0x8a000800 (valid, down, admin-down, Non-redundant, tracked)
ACL In: 0
ACL Out: 0
Route ID: 0
FTgroupID: 0
Zone ID: 6
Sec Level: 0
L2 ACL: bpdu DENY, ipv6 DENY, mpls DENY, all DENY
LastChange: 0 (Thu Jan 1 00:00:00 1970)
iflookup index: 100
vlan-vmac index:0
Next Shared IF: 0
Lock: Unlocked, seq 5
Lock errors: 0
Unlock errors: 0
No. of times locked: 5
No. of times unlocked: 5
Current/last owner: 0x40a7fc
If the members of an FT group are unable to reach the active or standby state, there may be a context name mismatch for the same FT group. You may receive the following syslog message:
%ACE-1-727003: HA: Mismatch in context names detected for FT group FTgroupID. Cannot be redundant.
7. Check the FT group configuration on both devices. Make sure that both devices are associated with the same context. Enter the following command:
ACE_module5/Admin# show running-config ft
8. Verify the FT peer status and configuration by entering the following command:
ACE_module5/Admin# show ft peer detail
Peer Id : 1
State : FSM_PEER_STATE_COMPATIBLE
Maintenance mode : MAINT_MODE_OFF
FT Vlan : 100
FT Vlan IF State : DOWN
My IP Addr : 10.1.1.1
Peer IP Addr : 10.1.1.2
Query Vlan : 110
Query Vlan IF State : DOWN
Peer Query IP Addr : 172.25.91.202
Heartbeat Interval : 300
Heartbeat Count : 20
Tx Packets : 318573
Tx Bytes : 66301061
Rx Packets : 318540
Rx Bytes : 66272840
Rx Error Bytes : 0
Tx Keepalive Packets : 318480
Rx Keepalive Packets : 318480
TL_CLOSE count : 0
FT_VLAN_DOWN count : 0
PEER_DOWN count : 0
SRG Compatibility : COMPATIBLE
License Compatibility : COMPATIBLE
FT Groups : 3
9. Verify the FT group status and configuration by entering the following command:
ACE_module5/Admin# show ft group detail
FT Group : 1
No. of Contexts : 1
Configured Status : in-service
Maintenance mode : MAINT_MODE_OFF
My State : FSM_FT_STATE_ACTIVE
My Config Priority : 110
My Net Priority : 110
My Preempt : Enabled
Peer State : FSM_FT_STATE_STANDBY
Peer Config Priority : 100
Peer Net Priority : 100
Peer Preempt : Enabled
Peer Id : 1
Last State Change time : Thu Apr 2 00:00:00 2009
Running cfg sync enabled : Enabled
Running cfg sync status : Running configuration sync has completed
Startup cfg sync enabled : Enabled
Startup cfg sync status : Running configuration sync has completed
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
For information on troubleshooting the FT group status, see the "FT Group Status Conditions" section.
FT Group Status Conditions
Certain error conditions are indicated by an FT group status that shows the ACE locked in the STANDBY_COLD or STANDBY_CONFIG state.
Troubleshooting STANDBY_COLD Status
The STANDBY_COLD state may result from these errors:
- Config sync failure (including incr-sync and bulk-sync)
- FT VLAN is down while the query interface is up
Config Sync Failure—A config sync failure can be diagnosed with the following steps:
- Output of the show ft peer detail command shows that the peer state is FSM_PEER_STATE_COMPATIBLE.
- Output of the show ft group detail command shows that the FT group is in the STANDBY_COLD state, and the Running cfg sync status field shows the reason for the failure. For an incr-sync failure, the output shows exactly which command resulted in an execution error on the standby; for a bulk-sync failure, the reason is Error on Standby device when applying configuration file replicated from active.
- To further investigate bulk-sync failure, perform these steps on the standby device:
- For A2(2.0) and earlier and A2(1.3) and earlier, from the Admin context, run show ft history cfg_cntlr and grep for "error:" to find any CLI commands that caused execution errors.
- For later releases, run show ft config-error <ctx_name> to view failed CLI commands.
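If the history output has been saved to a file, the search for "error:" can also be done offline. The following Python sketch keeps only the lines that contain that string. The function name failed_cli_lines is hypothetical, and the sample text is invented for illustration; it does not reproduce the actual format of the ACE history output.

```python
def failed_cli_lines(history_text: str) -> list:
    """Return the lines of saved history output that contain 'error:'."""
    return [line.strip() for line in history_text.splitlines() if "error:" in line]


# Invented sample; real history lines will look different.
SAMPLE_HISTORY = """\
applying command one
error: command two failed
applying command three
"""

print(failed_cli_lines(SAMPLE_HISTORY))  # prints ['error: command two failed']
```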
To work around a bulk sync failure, perform these steps:
- Remove the CLI commands that triggered the error (as identified from the preceding analysis) and then retrigger the bulk sync operation, as follows.
- Retrigger bulk sync by disabling config sync with the no ft auto-sync running command and then re-enabling config sync with ft auto-sync running.
- If the problem persists, repeat the above sequence until you eliminate the CLI command that triggered the problem.
FT VLAN Down with Query Interface Up—To diagnose whether the FT VLAN is down with the query interface up, perform these steps:
- Run show ft peer detail. The peer state shows "FT_VLAN_DOWN".
- Run show ft stats. It shows heartbeats are being missed.
In this case, check the physical connectivity of the device. It might be a physical port or cable issue.
Troubleshooting STANDBY_CONFIG Status
If the FT group status indicates that the device is stuck in the STANDBY_CONFIG state:
- Run show ft history cfg_cntlr to determine whether the peer devices successfully exchanged notifications regarding configuration synchronization.
- Grep for the keywords MTS_OPC_REQ_CFG_DNLD_STATUS and MTS_OPC_CFG_DNLD_STATUS.
If one or both of the messages are missing, an error occurred in the synchronization exchange process.
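When the history output is saved to a file, the keyword check can be scripted. The following Python sketch reports which of the two keywords is absent from the text. The function name missing_sync_messages is hypothetical, and the sketch makes no assumption about the rest of the line format.

```python
import re

REQUIRED_KEYWORDS = ("MTS_OPC_REQ_CFG_DNLD_STATUS", "MTS_OPC_CFG_DNLD_STATUS")


def missing_sync_messages(history_text: str) -> list:
    """Return the required keywords that do not appear in the history text."""
    missing = []
    for keyword in REQUIRED_KEYWORDS:
        # Word boundaries prevent a match inside a longer identifier.
        if not re.search(r"\b" + re.escape(keyword) + r"\b", history_text):
            missing.append(keyword)
    return missing


# Only the request message is present in this invented sample.
print(missing_sync_messages("... MTS_OPC_REQ_CFG_DNLD_STATUS ..."))
# prints ['MTS_OPC_CFG_DNLD_STATUS']
```

An empty list means that both messages were found; any other result points to an error in the synchronization exchange.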
Note that while a device is stuck in the STANDBY_CONFIG state, configuration mode is disabled on both the active and standby devices. The device can remain in this state for up to 4 hours, until the timeout period expires.
About WARM_COMPATIBLE and STANDBY_WARM
While peers should operate with identical versions of the software, during a version upgrade it's possible for peers to temporarily have different software versions. To ease the task of upgrading and downgrading the software, an HA Peer SRG state WARM_COMPATIBLE and the HA FT state STANDBY_WARM have been introduced to allow best-effort configuration sync and state replication between peers.
When HA peers run different software versions, the output of the show ft peer detail command shows SRG Compatibility : WARM_COMPATIBLE instead of COMPATIBLE. When the peer SRG is WARM_COMPATIBLE, the FT groups on the standby go to STANDBY_WARM instead of STANDBY_HOT.
In WARM_COMPATIBLE, whether the bulk config sync fails or passes, the transition to STANDBY_BULK is always made and eventually the standby goes to STANDBY_WARM. (If the peer SRG is COMPATIBLE, then the steady state will continue to be ACTIVE/STANDBY_HOT).
The STANDBY_WARM state is similar to the STANDBY_HOT state: configuration mode on the standby is locked, and state replication and configuration synchronization continue. The difference is that when configuration synchronization fails (for instance, because of new, obsolete, or enhanced CLI commands), the standby does not move to the STANDBY_COLD state. STANDBY_WARM is a best-effort state; the active continues to synchronize and replicate its configuration and state to the standby. However, when the FT VLAN goes down with a query interface configured, the standby still goes to the STANDBY_COLD state, because the peers can no longer synchronize state or configuration. For automatic switchover, STANDBY_WARM behaves the same as STANDBY_HOT: the HA process decides when a switchover should occur based on the net priority and preempt settings.
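The steady-state rules in this section can be condensed into a small lookup. The following Python sketch models only the cases described above; the function name and its parameters are hypothetical, and any SRG value other than COMPATIBLE or WARM_COMPATIBLE is rejected rather than guessed at.

```python
def expected_standby_state(srg_compatibility: str,
                           ft_vlan_down_with_query_up: bool = False) -> str:
    """Return the steady state that the standby is expected to reach."""
    if ft_vlan_down_with_query_up:
        # State and configuration can no longer be synchronized.
        return "STANDBY_COLD"
    if srg_compatibility == "COMPATIBLE":
        return "STANDBY_HOT"
    if srg_compatibility == "WARM_COMPATIBLE":
        # Best-effort state, reached whether bulk config sync passes or fails.
        return "STANDBY_WARM"
    raise ValueError(f"unhandled SRG compatibility: {srg_compatibility}")


print(expected_standby_state("WARM_COMPATIBLE"))                                   # prints "STANDBY_WARM"
print(expected_standby_state("WARM_COMPATIBLE", ft_vlan_down_with_query_up=True))  # prints "STANDBY_COLD"
```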
The SRG compatibility matrix is the following:
Module: C: COMPATIBLE / WC: WARM_COMPATIBLE
| Active(Column)/Standby(Row) | <= A2(1.3) | A2(1.4) | A2(1.5) | A2(1.6) | A2(2.0) | A2(2.1) | A2(2.2) | A2(3.0) |
| <= A2(1.3) | C | C | C | C | C | C | C | C |
| A2(1.4) | C | C | C | WC | C | C | WC | WC |
| A2(1.5) | C | C | C | WC | C | C | WC | WC |
| A2(1.6) | C | WC | WC | C | C | WC | WC | WC |
| A2(2.0) | C | C | C | C | C | C | C | C |
| A2(2.1) | C | C | C | WC | C | C | WC | WC |
| A2(2.2) | C | WC | WC | WC | C | WC | C | WC |
| A2(3.0) | C | WC | WC | WC | C | WC | WC | C |
Appliance: C: COMPATIBLE / WC: WARM_COMPATIBLE
| Active(Column)/Standby(Row) | <= A1(7.0) | A1(8.0) | A3(1.0) | A3(2.0) | A3(2.1) | A3(2.2) | A3(2.3) | A3(2.4) |
| <= A1(7.0) | C | C | C | C | C | C | C | C |
| A1(8.0) | C | C | WC | WC | WC | WC | WC | WC |
| A3(1.0) | C | WC | C | C | C | C | WC | WC |
| A3(2.0) | C | WC | C | C | C | C | WC | WC |
| A3(2.1) | C | WC | C | C | C | C | WC | WC |
| A3(2.2) | C | WC | C | C | C | C | WC | WC |
| A3(2.3) | C | WC | WC | WC | WC | WC | C | WC |
| A3(2.4) | C | WC | WC | WC | WC | WC | WC | C |
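When you plan an upgrade path, it can help to look up the matrix programmatically. The following Python sketch encodes the module matrix above (columns are the active version, rows are the standby version) and returns C or WC for a given pair. The names VERSIONS, MODULE_MATRIX, and srg_compatibility are hypothetical, and the label "<=A2(1.3)" stands for A2(1.3) and all earlier releases. The appliance matrix could be encoded the same way.

```python
VERSIONS = ["<=A2(1.3)", "A2(1.4)", "A2(1.5)", "A2(1.6)",
            "A2(2.0)", "A2(2.1)", "A2(2.2)", "A2(3.0)"]

# One row per standby version; entries follow the column order in VERSIONS.
MODULE_MATRIX = {
    "<=A2(1.3)": "C C C C C C C C",
    "A2(1.4)":   "C C C WC C C WC WC",
    "A2(1.5)":   "C C C WC C C WC WC",
    "A2(1.6)":   "C WC WC C C WC WC WC",
    "A2(2.0)":   "C C C C C C C C",
    "A2(2.1)":   "C C C WC C C WC WC",
    "A2(2.2)":   "C WC WC WC C WC C WC",
    "A2(3.0)":   "C WC WC WC C WC WC C",
}


def srg_compatibility(active: str, standby: str) -> str:
    """Return 'C' (COMPATIBLE) or 'WC' (WARM_COMPATIBLE) for a version pair."""
    row = MODULE_MATRIX[standby].split()
    return row[VERSIONS.index(active)]


print(srg_compatibility(active="A2(1.6)", standby="A2(1.4)"))  # prints "WC"
print(srg_compatibility(active="A2(2.0)", standby="A2(3.0)"))  # prints "C"
```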
Here is a show command output example:
itasca-1/Admin# show ft peer de
Peer Id : 1
State : FSM_PEER_STATE_COMPATIBLE
Maintenance mode : MAINT_MODE_OFF
FT Vlan : 20
FT Vlan IF State : UP
My IP Addr : 209.165.201.1
Peer IP Addr : 209.165.201.2
Query Vlan : Not Configured
Query Vlan IF State : DOWN
Peer Query IP Addr : 0.0.0.0
Heartbeat Interval : 300
Heartbeat Count : 10
Tx Packets : 926
Tx Bytes : 220440
Rx Packets : 879
Rx Bytes : 232241
Rx Error Bytes : 0
Tx Keepalive Packets : 756
Rx Keepalive Packets : 756
TL_CLOSE count : 0
FT_VLAN_DOWN count : 0
PEER_DOWN count : 0
SRG Compatibility : WARM_COMPATIBLE <<<<<<<<<<<<<<<<<<<<<<<<<<
License Compatibility : COMPATIBLE
FT Groups : 1
itasca-1/Admin#
itasca-1/Admin# show ft group de
FT Group : 1
No. of Contexts : 1
Context Name : Admin
Context Id : 0
Configured Status : in-service
Maintenance mode : MAINT_MODE_OFF
My State : FSM_FT_STATE_ACTIVE
My Config Priority : 120
My Net Priority : 120
My Preempt : Enabled
Peer State : FSM_FT_STATE_STANDBY_WARM <<<<<<<<<<<<<<<<<<<<<<<
Peer Config Priority : 110
Peer Net Priority : 110
Peer Preempt : Enabled
Peer Id : 1
Last State Change time : Fri Mar 21 19:07:13 2008
Running cfg sync enabled : Enabled
Running cfg sync status : Running configuration sync has completed
Startup cfg sync enabled : Enabled
Startup cfg sync status : Startup configuration sync has completed
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
itasca-2/Admin# show system srg
Software ID: swid-aceapp Software Version: A3(1.0) Software Interim Version: 3.0(0)A3(0.0.13)
ema_11:58:21-2008/03/20_/ws/ema/mercury
Switchover Supersedes: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
swid-aceapp A1(8.0)
Compatible: swid-aceapp A1(7a) A1(7b) A1(7c)

