Cisco Application Control Engine (ACE) Module Troubleshooting Guide, Release A2(x) -- Troubleshooting Redundancy

This article describes the procedures for troubleshooting redundancy issues with your ACE.

Overview of ACE Redundancy

Redundancy (or fault tolerance) allows your network to remain operational even if one of the ACEs becomes unresponsive. Redundancy ensures that your network services and applications are always available.

Redundancy provides seamless switchover of flows if an ACE becomes unresponsive or a critical host, interface, or HSRP group fails. Redundancy supports the following network applications that require fault tolerance:

  • Mission-critical enterprise applications
  • Banking and financial services
  • E-commerce
  • Long-lived flows such as FTP and HTTP file transfers

Redundancy Protocol

You can configure a maximum of two ACEs (peers) in the same Catalyst 6500 series switch or in different chassis for redundancy. Each peer module can contain one or more fault-tolerant (FT) groups. Each FT group consists of two members: one active context and one standby context. For more information about contexts, see the Cisco Application Control Engine Module Virtualization Configuration Guide. An FT group has a unique group ID that you assign.

Both ACE modules can be active at the same time, processing traffic for distinct virtual devices and backing up each other (stateful redundancy). See Figure 1.

Figure 1. Example of an Active-Active Configuration


The ACE uses the redundancy protocol to communicate between the redundant peers. The election of the active member within each FT group is based on a priority scheme. The member configured with the higher priority is elected as the active member. If a member with a higher priority is found after the other member becomes active, the new member becomes active because it has a higher priority. This behavior is known as preemption and is enabled by default.
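
The election rule above can be modeled in a few lines of Python. This is only a simplified sketch of the behavior described in this section, not the ACE implementation; the function name elect_active and the handling of equal priorities (which this article does not describe) are assumptions.

```python
from typing import Optional


def elect_active(local_priority: int, peer_priority: int,
                 current_active: Optional[str] = None,
                 preempt: bool = True) -> Optional[str]:
    """Return "local" or "peer" for the member that should be active.

    Assumption: on a tie, the current active member is kept (or None is
    returned when no member is active yet). The real tie-breaker is not
    covered by this article.
    """
    if local_priority > peer_priority:
        higher = "local"
    elif peer_priority > local_priority:
        higher = "peer"
    else:
        higher = None

    if current_active is None:
        return higher  # initial election: the higher priority wins
    if preempt and higher is not None:
        return higher  # preemption: the higher-priority member takes over
    return current_active  # no preemption (or a tie): nothing changes


print(elect_active(200, 100))                                         # local
print(elect_active(100, 200, current_active="local"))                 # peer
print(elect_active(100, 200, current_active="local", preempt=False))  # local
```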

One virtual MAC address (VMAC) is associated with each FT group. The format of the VMAC is: 00-0b-fc-fe-1b-groupID. Because a VMAC does not change upon a switchover, the client and server ARP tables do not require updating. The ACE selects a VMAC from a pool of virtual MACs available to it. You can specify the pool of MAC addresses that the local ACE and the peer ACE use by configuring the shared-vlan-hostid command and the peer shared-vlan-hostid command, respectively. To avoid MAC address conflicts, be sure that the two pools are different on the two ACEs. For more information about VMACs and MAC address pools, see the Cisco Application Control Engine Module Routing and Bridging Configuration Guide.
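
As a small illustration of the VMAC format, the following Python sketch builds the VMAC string for a given FT group ID. It assumes that the group ID fills the final octet and is written as two hexadecimal digits; the function name ft_group_vmac is hypothetical and is not part of any Cisco tool.

```python
def ft_group_vmac(group_id: int) -> str:
    """Return the VMAC for an FT group in the 00-0b-fc-fe-1b-groupID format.

    Assumption: the group ID occupies the last octet as two hex digits.
    """
    if not 0 <= group_id <= 0xFF:
        raise ValueError("group ID must fit in one octet for this sketch")
    return "00-0b-fc-fe-1b-{:02x}".format(group_id)


print(ft_group_vmac(1))   # 00-0b-fc-fe-1b-01
print(ft_group_vmac(20))  # 00-0b-fc-fe-1b-14
```

Comparing the computed value with the VMAC field in the interface table output can help confirm which FT group an interface belongs to.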

Each FT group acts as an independent redundancy instance. When a switchover occurs, the active member in the FT group becomes the standby member and the original standby member becomes the active member. A switchover can occur for the following reasons:

  • The active member becomes unresponsive.
  • A tracked host, interface, or HSRP group fails.
  • You enter the ft switchover command to force a switchover.

FT VLAN

Redundancy uses a dedicated FT VLAN between redundant ACEs to transmit flow-state information and the redundancy heartbeat. You must configure this same VLAN on both peer modules. You also must configure a different IP address within the same subnet on each module for the FT VLAN. Cisco recommends two port-channeled 1-Gigabit Ethernet links for the FT VLAN.

Note: Do not use the FT VLAN for any other network traffic, including HSRP traffic and data.

The two redundant modules constantly communicate over the FT VLAN to determine the operating status of each module. The standby member uses the heartbeat packet to monitor the health of the active member. The active member uses the heartbeat packet to monitor the health of the standby member. Communications over the switchover link include the following data:

  • Redundancy protocol packets
  • State information replication data
  • Configuration synchronization information
  • Heartbeat packets

For multiple contexts, the FT VLAN resides in the system configuration file. Each FT VLAN on the ACE has one unique MAC address associated with it. The ACE uses these device MAC addresses as the source or destination MACs for sending or receiving redundancy protocol state and configuration replication packets.

Note: The IP address and the MAC address of the FT VLAN do not change at switchover.

Configuration Requirements and Restrictions

Follow these requirements and restrictions when configuring the redundancy feature:

  • Redundancy is not supported between an ACE module and an ACE appliance operating as peers. Both peers must be the same ACE device type and run the same software release.
  • In bridged mode (Layer 2), two contexts cannot share the same VLAN.
  • To achieve active-active redundancy, a minimum of two contexts and two FT groups are required on each ACE.
  • When you configure redundancy, the ACE keeps all interfaces that do not have an IP address in the Down state. The IP address and the peer IP address that you assign to a VLAN interface should be in the same subnet but should be different IP addresses. For more information about configuring VLAN interfaces, see the Cisco Application Control Engine Module Routing and Bridging Configuration Guide.
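
The addressing rule in the last item (same subnet, different addresses) can be checked before you apply a configuration. The following Python sketch uses the standard ipaddress module; the function name check_peer_addressing is hypothetical, and the sample addresses are taken from the FT VLAN in the example configuration in this article.

```python
import ipaddress


def check_peer_addressing(ip: str, peer_ip: str, netmask: str) -> list:
    """Return a list of problems with an IP / peer IP pair (empty if none)."""
    local = ipaddress.ip_interface(f"{ip}/{netmask}")
    peer = ipaddress.ip_interface(f"{peer_ip}/{netmask}")
    problems = []
    if local.ip == peer.ip:
        problems.append("local and peer IP addresses are identical")
    if local.network != peer.network:
        problems.append("local and peer IP addresses are in different subnets")
    return problems


print(check_peer_addressing("192.168.1.1", "192.168.1.2", "255.255.255.0"))  # []
```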

Example of a Redundancy Configuration

The following example shows a running-configuration file that defines fault tolerance (FT) for a single ACE module operating in a redundancy configuration. You can configure a maximum of two ACE modules (peers) for redundancy, so that traffic fails over from the active module to the standby module.

Note: All FT parameters are configured in the Admin context.

This configuration addresses the following redundancy components:

  • A dedicated FT VLAN for communication between the members of an FT group. You must configure this same VLAN on both peer modules.
  • An FT peer definition.
  • An FT group that is associated with the Admin context.
  • A critical tracking and failure detection process for an interface.

access-list ACL1 line 10 extended permit ip any any

class-map type management match-any L4_REMOTE-MGT_CLASS
  2 match protocol telnet any
  3 match protocol ssh any
  4 match protocol icmp any
  5 match protocol http any
  7 match protocol snmp any
  8 match protocol https any

policy-map type management first-match L4_REMOTE-MGT_POLICY
  class L4_REMOTE-MGT_CLASS
    permit

interface vlan 100
  ip address 192.168.83.219 255.255.255.0
  peer ip address 192.168.83.230 255.255.255.0
  alias 192.168.83.200 255.255.255.0
  access-group input ACL1
  service-policy input L4_REMOTE-MGT_POLICY
  no shutdown

ft interface vlan 200
  ip address 192.168.1.1 255.255.255.0
  peer ip address 192.168.1.2 255.255.255.0
  no shutdown

ft peer 1
  ft-interface vlan 200
  heartbeat interval 300
  heartbeat count 10

ft group 1
  peer 1
  priority 200
  associate-context Admin
  inservice

ft track interface TRACK_VLAN100
  track-interface vlan 100
  peer track-interface vlan 200
  priority 50
  peer priority 5

ip route 0.0.0.0 0.0.0.0 192.168.83.1

Troubleshooting ACE Redundancy

This section describes the methods and CLI commands that you can use to troubleshoot redundancy issues in your ACE.

1. Ensure that the software versions and licenses installed in the two ACEs are identical. A software or license mismatch may generate the following syslog message:

%ACE-1-727006: HA: Peer is incompatible due to error str. Cannot be Redundant.

To verify the software (SRG) and license compatibility of the FT peer, enter the following command:

ACE_module5/Admin# show ft peer status

Peer Id                      : 1
State                        : FSM_PEER_STATE_MY_IPADDR
Maintenance mode             : MAINT_MODE_OFF
SRG Compatibility            : COMPATIBLE
License Compatibility        : COMPATIBLE
FT Groups                    : 1

If the software or license is incompatible, install the appropriate software image or license in the peer to correct the problem.
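
If you collect this output from many devices, a short script can flag peers that are not fully compatible. The following Python sketch parses the "field : value" lines shown above; the function names are hypothetical, and the sample text is the output from this step. Note that it also reports WARM_COMPATIBLE, because that value differs from COMPATIBLE.

```python
def parse_show_output(text: str) -> dict:
    """Parse 'Field : value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def not_fully_compatible(fields: dict) -> list:
    """Return the compatibility fields whose value is not COMPATIBLE."""
    names = ("SRG Compatibility", "License Compatibility")
    return [name for name in names if fields.get(name) != "COMPATIBLE"]


SAMPLE = """\
Peer Id                      : 1
State                        : FSM_PEER_STATE_MY_IPADDR
Maintenance mode             : MAINT_MODE_OFF
SRG Compatibility            : COMPATIBLE
License Compatibility        : COMPATIBLE
FT Groups                    : 1
"""

print(not_fully_compatible(parse_show_output(SAMPLE)))  # []
```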

2. Ensure that any SSL certificates (certs) and keys that exist in the active ACE are also configured in the standby ACE. SSL certs and keys are not synchronized automatically from the active to the standby. Use the crypto export and crypto import commands to accomplish this task. This requirement also applies to scripts and scripted probes. Failure to keep the active and standby configurations identical will cause configuration synchronization to fail and may cause the standby ACE to enter the STANDBY-COLD state.

The ACE sends heartbeat packets via UDP over the FT VLAN between peers. When heartbeats are not received during the specified interval (the interval and count are configurable), the ACE notifies the HA processor on the CP by sending a Peer_Down interprocess communication protocol (IPCP) message. If a peer is down or unreachable, you may receive one of the following syslog messages:

%ACE-1-727001: HA: Peer IP address is not reachable. Error: error str

%ACE-1-727002: HA: FT interface interface name to reach peer IP address is down. Error: error str
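
To get a feel for how long the ACE waits before it declares a peer down, you can multiply the heartbeat interval by the heartbeat count. The following Python sketch rests on two assumptions that this article does not state: the interval is expressed in milliseconds, and the peer is declared down after the configured count of consecutive heartbeats is missed. The function name is hypothetical.

```python
def peer_down_detection_ms(interval_ms: int, count: int) -> int:
    """Approximate time before a silent peer is declared down.

    Assumptions: interval is in milliseconds, and `count` consecutive
    missed heartbeats trigger the Peer_Down notification.
    """
    return interval_ms * count


# Values from the example configuration: heartbeat interval 300, count 10.
print(peer_down_detection_ms(300, 10))  # 3000
```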

3. Verify connectivity between the peers over the FT VLAN. If a peer device is physically up but connectivity is the problem, you may end up with two active devices. If connectivity is lost due to the peer going down, reboot the peer to restore redundancy between the two devices.

4. Display heartbeat statistics, including missed heartbeats, by entering the following command:

ACE_module5/Admin# show ft stats
HA Heartbeat Statistics
------------------------

Number of Heartbeats Sent                 : 0
Number of Heartbeats Received             : 0
Number of Heartbeats Missed               : 0
Number of Unidirectional HB's Received    : 0
Number of HB Timeout Mismatches           : 0
Num of Peer Up Events Sent                : 0
Num of Peer Down Events Sent              : 0
Successive HB's miss Intervals counter    : 0
Successive Uni HB's recv counter          : 0
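
A single snapshot of these counters says little; what matters is whether the missed-heartbeat counter keeps growing. The following Python sketch compares two captures of the show ft stats output and reports the counters that changed. The function names are hypothetical, and the counter values in the sample captures are invented for illustration.

```python
import re


def parse_counters(text: str) -> dict:
    """Parse 'Counter name : number' lines into a dictionary of integers."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before: str, after: str) -> dict:
    """Return the counters that changed between two captures."""
    old, new = parse_counters(before), parse_counters(after)
    return {name: new[name] - old[name]
            for name in new if name in old and new[name] != old[name]}


# Hypothetical captures taken some time apart.
BEFORE = """\
Number of Heartbeats Sent                 : 1000
Number of Heartbeats Received             : 1000
Number of Heartbeats Missed               : 0
"""
AFTER = """\
Number of Heartbeats Sent                 : 1100
Number of Heartbeats Received             : 1090
Number of Heartbeats Missed               : 10
"""

print(counter_deltas(BEFORE, AFTER))
```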

5. To provide an alternate path for the ACE to check the peer's status in case of missed heartbeats, configure a query interface by entering the following commands:

ACE_module5/Admin# config
Enter configuration commands, one per line.  End with CNTL/Z.
ACE_module5/Admin(config)# ft peer 1
ACE_module5/Admin(config-ft-peer)# query-interface vlan 100

If the query interface is configured, upon receiving a PEER_DOWN message from the heartbeat process, the ACE data plane attempts to ping the peer using the Query VLAN. If the ping fails, the standby transitions to the ACTIVE state. If the ping is successful, the standby transitions to the STANDBY_COLD state. To recover from the STANDBY_COLD state, reboot the standby.
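
The decision described in the preceding paragraph can be summarized as a small function. This Python sketch only restates that paragraph for the case where a query interface is configured; the function name is hypothetical, and the behavior without a query interface is deliberately not modeled.

```python
def standby_state_after_peer_down(query_ping_succeeded: bool) -> str:
    """State the standby enters after PEER_DOWN when a query interface exists.

    A successful ping means the peer is still alive and only the FT VLAN
    path failed, so the standby must not take over.
    """
    return "STANDBY_COLD" if query_ping_succeeded else "ACTIVE"


print(standby_state_after_peer_down(False))  # ACTIVE
print(standby_state_after_peer_down(True))   # STANDBY_COLD
```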

Each FT group uses a VMAC that depends on the FT group number. If you are using multiple ACEs in the same chassis, be careful when using the same FT group numbers in more than one module, because FT groups with the same number use the same VMAC.

6. Display the VMAC for an FT group by entering the following command:

ACE_module5/Admin# show interface internal iftable vlan100
vlan100
--------
ifid:           6
Context:        0
ifIndex:        16777316
physid:         100
rmode:          0 (unknown)
iftype:         0 (vlan)
bvi_bgid:       0
MTU:            1500
MAC:            00:18:b9:a6:91:15
VMAC:           00:00:00:00:00:00 <------- Virtual MAC Address
Flags:          0x8a000800 (valid, down, admin-down, Non-redundant, tracked)
ACL In:         0
ACL Out:        0
Route ID:       0
FTgroupID:      0
Zone ID:        6
Sec Level:      0
L2 ACL:         bpdu DENY, ipv6 DENY, mpls DENY, all DENY

LastChange:     0 (Thu Jan  1 00:00:00 1970)
iflookup index: 100
vlan-vmac index:0
Next Shared IF: 0
Lock:           Unlocked, seq 5
Lock errors:    0
Unlock errors:  0
No. of times locked:    5
No. of times unlocked:  5
Current/last owner:     0x40a7fc

If the members of an FT group are unable to reach the active or standby state, there may be a context name mismatch for the same FT group. You may receive the following syslog message:

%ACE-1-727003: HA: Mismatch in context names detected for FT group FTgroupID. Cannot be redundant.

7. Check the FT group configuration on both devices. Make sure that the FT group is associated with the same context on both devices. Enter the following command:

ACE_module5/Admin# show running-config ft
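
Once you have the FT configuration from both peers, the comparison can be automated. The following Python sketch assumes that the output has the same layout as the ft group block in the example configuration in this article (a top-level ft group line followed by indented subcommands). The function names and the peer context name C1 are hypothetical.

```python
import re


def ft_group_contexts(config_text: str) -> dict:
    """Map each FT group ID to the context named by associate-context."""
    contexts = {}
    group = None
    for line in config_text.splitlines():
        stripped = line.strip()
        match = re.fullmatch(r"ft group (\d+)", stripped)
        if match:
            group = int(match.group(1))
            continue
        if line and not line[0].isspace():
            group = None  # another top-level command ends the ft group block
            continue
        if group is not None and stripped.startswith("associate-context "):
            contexts[group] = stripped.split(None, 1)[1]
    return contexts


def context_mismatches(local_cfg: str, peer_cfg: str) -> dict:
    """Return {group ID: (local context, peer context)} for groups that differ."""
    local, peer = ft_group_contexts(local_cfg), ft_group_contexts(peer_cfg)
    return {g: (local.get(g), peer.get(g))
            for g in sorted(set(local) | set(peer))
            if local.get(g) != peer.get(g)}


LOCAL = "ft group 1\n  peer 1\n  priority 200\n  associate-context Admin\n  inservice\n"
PEER = "ft group 1\n  peer 1\n  priority 100\n  associate-context C1\n  inservice\n"

print(context_mismatches(LOCAL, PEER))  # {1: ('Admin', 'C1')}
```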

8. Verify the FT peer status and configuration by entering the following command:

ACE_module5/Admin# show ft peer detail

Peer Id                      : 1
State                        : FSM_PEER_STATE_COMPATIBLE
Maintenance mode             : MAINT_MODE_OFF
FT Vlan                      : 100
FT Vlan IF State             : DOWN
My IP Addr                   : 10.1.1.1
Peer IP Addr                 : 10.1.1.2
Query Vlan                   : 110
Query Vlan IF State          : DOWN
Peer Query IP Addr           : 172.25.91.202
Heartbeat Interval           : 300
Heartbeat Count              : 20
Tx Packets                   : 318573
Tx Bytes                     : 66301061
Rx Packets                   : 318540
Rx Bytes                     : 66272840
Rx Error Bytes               : 0
Tx Keepalive Packets         : 318480
Rx Keepalive Packets         : 318480
TL_CLOSE count               : 0
FT_VLAN_DOWN count           : 0
PEER_DOWN count              : 0
SRG Compatibility            : COMPATIBLE
License Compatibility        : COMPATIBLE
FT Groups                    : 3

9. Verify the FT group status and configuration by entering the following command:

ACE_module5/Admin# show ft group detail           
 
FT Group                     : 1
No. of Contexts              : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_ACTIVE
My Config Priority           : 110
My Net Priority              : 110
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_STANDBY
Peer Config Priority         : 100
Peer Net Priority            : 100
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Thu Apr  2 00:00:00 2009
Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Running configuration sync has completed
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0

For information on troubleshooting the FT group status, see the "FT Group Status Conditions" section.


FT Group Status Conditions

Certain error conditions are indicated by an FT group status showing that the ACE is stuck in the STANDBY_COLD or STANDBY_CONFIG state.


Troubleshooting STANDBY_COLD Status

The STANDBY_COLD state may result from these errors:

  • Config sync failure (including incr-sync and bulk-sync)
  • FT VLAN is down while the query interface is up

Config Sync Failure: You can diagnose a config sync failure with the following steps:

  1. Enter the show ft peer detail command. The output shows that the peer state is Compatible.
  2. Enter the show ft group detail command. The output shows that the FT group is in the STANDBY_COLD state, and the Running cfg sync status field shows the reason for the failure. For an incr-sync failure, the output shows exactly which command resulted in an execution error on the standby; for a bulk-sync failure, the reason is "Error on Standby device when applying configuration file replicated from active".
  3. To investigate a bulk-sync failure further, perform these steps on the standby device:
    1. For A2(2.0) and earlier, and A2(1.3) and earlier, enter the show ft history cfg_cntlr command from the Admin context and grep for "error:" to find any CLI commands that caused execution errors.
    2. For later releases, enter the show ft config-error <ctx_name> command to view the failed CLI commands.
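
If you save the history output to a file, the grep step can also be done offline. The following Python sketch keeps only the lines that contain the "error:" marker. The function name is hypothetical, and the sample lines are placeholders, not real ACE history output.

```python
def failed_command_lines(history_text: str, marker: str = "error:") -> list:
    """Return the lines of saved history output that contain the marker."""
    return [line.strip() for line in history_text.splitlines() if marker in line]


# Placeholder text standing in for saved command output.
SAMPLE_HISTORY = """\
placeholder line without a problem
placeholder line with error: example failure
another placeholder line
"""

print(failed_command_lines(SAMPLE_HISTORY))
```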

To work around a bulk sync failure, perform these steps:

  1. Remove the CLI commands that triggered the error (as identified in the preceding analysis).
  2. Retrigger the bulk sync by disabling config sync with the no ft auto-sync running command and then re-enabling config sync with the ft auto-sync running command.
  3. If the problem persists, repeat the above sequence until you eliminate the CLI command that triggered the problem.

FT VLAN Down with Query Interface Up: To diagnose whether the FT VLAN is down with the query interface up, perform these steps:

  1. Enter the show ft peer detail command. The peer state shows FT_VLAN_DOWN.
  2. Enter the show ft stats command. The output shows that heartbeats are being missed.

In this case, check the physical connectivity of the device. It might be a physical port or cable issue.

Troubleshooting STANDBY_CONFIG Status

If the FT group status indicates that the device is stuck in the STANDBY_CONFIG state, perform these steps:

  1. Enter the show ft history cfg_cntlr command to determine whether the peer devices successfully exchanged notifications regarding configuration synchronization.
  2. Grep the output for the keywords MTS_OPC_REQ_CFG_DNLD_STATUS and MTS_OPC_CFG_DNLD_STATUS.

If one or both of the messages are missing, an error occurred in the synchronization exchange process.
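
The keyword check can be scripted against saved history output. The following Python sketch reports which of the two messages is absent; the function name is hypothetical, and the sample text is a placeholder, not real ACE output.

```python
REQUIRED_MESSAGES = ("MTS_OPC_REQ_CFG_DNLD_STATUS", "MTS_OPC_CFG_DNLD_STATUS")


def missing_sync_messages(history_text: str) -> list:
    """Return the synchronization messages that do not appear in the text."""
    return [name for name in REQUIRED_MESSAGES if name not in history_text]


# Placeholder text in which only the request message appears.
SAMPLE = "placeholder line mentioning MTS_OPC_REQ_CFG_DNLD_STATUS\n"

print(missing_sync_messages(SAMPLE))  # ['MTS_OPC_CFG_DNLD_STATUS']
```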

Note that once a device is stuck in the STANDBY_CONFIG state, configuration mode is disabled on both the active and standby devices. The device can remain stuck in this state for up to 4 hours, until the timeout period expires.

About WARM_COMPATIBLE and STANDBY_WARM

Peers should run identical versions of the software, but during a version upgrade the peers may temporarily run different versions. To ease the task of upgrading and downgrading the software, the HA peer SRG state WARM_COMPATIBLE and the HA FT state STANDBY_WARM were introduced to allow best-effort configuration sync and state replication between peers.

When HA peers run different software versions, the output of the show ft peer detail command shows SRG Compatibility: WARM_COMPATIBLE instead of COMPATIBLE. When the peer SRG state is WARM_COMPATIBLE, the FT groups on the standby go to STANDBY_WARM instead of STANDBY_HOT.

In WARM_COMPATIBLE, the standby always transitions to STANDBY_BULK, whether the bulk config sync passes or fails, and eventually goes to STANDBY_WARM. (If the peer SRG state is COMPATIBLE, the steady state continues to be ACTIVE/STANDBY_HOT.)

The STANDBY_WARM state is similar to the STANDBY_HOT state: configuration mode on the standby is locked, and state replication and config sync continue. The difference is that when config sync fails (because of new, obsolete, or enhanced CLI commands, for instance), the standby does not move to the STANDBY_COLD state. STANDBY_WARM is a best-effort state; the active keeps synchronizing its configuration and replicating its state to the standby. However, when the FT VLAN goes down with a query interface configured, the standby still goes to the STANDBY_COLD state, because the peers can no longer synchronize state or configuration. For automatic switchover, STANDBY_WARM behaves the same as STANDBY_HOT: HA decides when a switchover should happen based on the net priority and preempt settings.
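
The relationship between SRG compatibility, config sync results, and the standby state can be condensed into a lookup. The following Python sketch restates only what this article describes (it ignores FT VLAN failures and transient states such as STANDBY_BULK); the function name is hypothetical.

```python
def standby_steady_state(srg_compatibility: str, config_sync_ok: bool) -> str:
    """Return the state the standby is expected to settle in."""
    if srg_compatibility == "WARM_COMPATIBLE":
        # Best effort: a config sync failure does not force STANDBY_COLD.
        return "STANDBY_WARM"
    if srg_compatibility == "COMPATIBLE":
        return "STANDBY_HOT" if config_sync_ok else "STANDBY_COLD"
    raise ValueError("unexpected SRG compatibility: " + srg_compatibility)


print(standby_steady_state("COMPATIBLE", True))        # STANDBY_HOT
print(standby_steady_state("COMPATIBLE", False))       # STANDBY_COLD
print(standby_steady_state("WARM_COMPATIBLE", False))  # STANDBY_WARM
```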

The SRG compatibility matrix is the following:

Module: C: COMPATIBLE / WC: WARM_COMPATIBLE

Active(Column)/Standby(Row)  <= A2(1.3)  A2(1.4)     A2(1.5)     A2(1.6)     A2(2.0)     A2(2.1)     A2(2.2)     A2(3.0)
<= A2(1.3)                   C           C           C           C           C           C           C           C
A2(1.4)                      C           C           C           WC          C           C           WC          WC
A2(1.5)                      C           C           C           WC          C           C           WC          WC
A2(1.6)                      C           WC          WC          C           C           WC          WC          WC
A2(2.0)                      C           C           C           C           C           C           C           C
A2(2.1)                      C           C           C           WC          C           C           WC          WC
A2(2.2)                      C           WC          WC          WC          C           WC          C           WC
A2(3.0)                      C           WC          WC          WC          C           WC          WC          C
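
When planning an upgrade, it can be convenient to query the module matrix programmatically. The following Python sketch encodes the module matrix above (columns are the active version, rows are the standby version) and looks up one pair. The names VERSIONS, ROWS, and srg_compatibility are hypothetical, and the first version label is written as <=A2(1.3) for A2(1.3) and earlier.

```python
VERSIONS = ["<=A2(1.3)", "A2(1.4)", "A2(1.5)", "A2(1.6)",
            "A2(2.0)", "A2(2.1)", "A2(2.2)", "A2(3.0)"]

# One row per standby version; cells follow the order of VERSIONS (active).
ROWS = {
    "<=A2(1.3)": "C C  C  C  C C  C  C",
    "A2(1.4)":   "C C  C  WC C C  WC WC",
    "A2(1.5)":   "C C  C  WC C C  WC WC",
    "A2(1.6)":   "C WC WC C  C WC WC WC",
    "A2(2.0)":   "C C  C  C  C C  C  C",
    "A2(2.1)":   "C C  C  WC C C  WC WC",
    "A2(2.2)":   "C WC WC WC C WC C  WC",
    "A2(3.0)":   "C WC WC WC C WC WC C",
}

LABELS = {"C": "COMPATIBLE", "WC": "WARM_COMPATIBLE"}


def srg_compatibility(active: str, standby: str) -> str:
    """Look up the SRG compatibility for an active/standby version pair."""
    cells = ROWS[standby].split()
    return LABELS[cells[VERSIONS.index(active)]]


print(srg_compatibility("A2(1.6)", "A2(1.4)"))  # WARM_COMPATIBLE
print(srg_compatibility("A2(2.2)", "A2(2.2)"))  # COMPATIBLE
```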

Appliance: C: COMPATIBLE / WC: WARM_COMPATIBLE

Active(Column)/Standby(Row)  <= A1(7.0)  A1(8.0)     A3(1.0)     A3(2.0)     A3(2.1)     A3(2.2)     A3(2.3)     A3(2.4)
<= A1(7.0)                   C           C           C           C           C           C           C           C
A1(8.0)                      C           C           WC          WC          WC          WC          WC          WC
A3(1.0)                      C           WC          C           C           C           C           WC          WC
A3(2.0)                      C           WC          C           C           C           C           WC          WC
A3(2.1)                      C           WC          C           C           C           C           WC          WC
A3(2.2)                      C           WC          C           C           C           C           WC          WC
A3(2.3)                      C           WC          WC          WC          WC          WC          C           WC
A3(2.4)                      C           WC          WC          WC          WC          WC          WC          C


Here is a show command output example:

itasca-1/Admin# show ft peer de

Peer Id : 1
State : FSM_PEER_STATE_COMPATIBLE
Maintenance mode : MAINT_MODE_OFF
FT Vlan : 20
FT Vlan IF State : UP
My IP Addr : 209.165.201.1
Peer IP Addr : 209.165.201.2
Query Vlan : Not Configured
Query Vlan IF State : DOWN
Peer Query IP Addr : 0.0.0.0
Heartbeat Interval : 300
Heartbeat Count : 10
Tx Packets : 926
Tx Bytes : 220440
Rx Packets : 879
Rx Bytes : 232241
Rx Error Bytes : 0
Tx Keepalive Packets : 756
Rx Keepalive Packets : 756
TL_CLOSE count : 0
FT_VLAN_DOWN count : 0
PEER_DOWN count : 0
SRG Compatibility : WARM_COMPATIBLE <<<<<<<<<<<<<<<<<<<<<<<<<<
License Compatibility : COMPATIBLE
FT Groups : 1
itasca-1/Admin#



itasca-1/Admin# show ft group de

FT Group : 1
No. of Contexts : 1
Context Name : Admin
Context Id : 0
Configured Status : in-service
Maintenance mode : MAINT_MODE_OFF
My State : FSM_FT_STATE_ACTIVE
My Config Priority : 120
My Net Priority : 120
My Preempt : Enabled
Peer State : FSM_FT_STATE_STANDBY_WARM <<<<<<<<<<<<<<<<<<<<<<<
Peer Config Priority : 110
Peer Net Priority : 110
Peer Preempt : Enabled
Peer Id : 1
Last State Change time : Fri Mar 21 19:07:13 2008
Running cfg sync enabled : Enabled
Running cfg sync status : Running configuration sync has completed
Startup cfg sync enabled : Enabled
Startup cfg sync status : Startup configuration sync has completed
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0



itasca-2/Admin# show system srg

Software ID: swid-aceapp Software Version: A3(1.0) Software Interim Version: 3.0(0)A3(0.0.13) 
                         ema_11:58:21-2008/03/20_/ws/ema/mercury
Switchover Supersedes: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
swid-aceapp A1(8.0)
Compatible: swid-aceapp A1(7a) A1(7b) A1(7c)