RCA for Engine failover
An abrupt Engine mastership failover to the HA node can happen for several reasons:
- Engine process crashed
  - How to find it? Refer to "How to check if a process crashed".
  - How to analyze the core file? Refer to "How to analyze the core file".
- CVD process crashed. The Engine service depends on the CVD service, so if CVD crashes or is restarted, the Engine is restarted as well, which causes mastership to fail over to the other node.
- Engine ran into OutOfMemory
  - Check the MIVR logs for the OOM reason, for example:
    - java.lang.OutOfMemoryError: GC overhead limit exceeded
  - Debug it based on the OOM reason. Refer to "How to debug OutOfMemoryError".
- Nodes went into island mode (multiple masters) and then recovered. Upon recovery, the publisher node retains mastership.
  - Check the MCVD logs for the failover entries.
- An application error occurred and the Engine decided to shut down.
  - Look for com.cisco.wfapi.WFKeepAliveException: KeepAliveException in ManagerManagerImpl in the MIVR logs.
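
The log checks above can be partly automated. The following is a minimal sketch, not an official tool: it searches text log files for the two signatures named in this list (java.lang.OutOfMemoryError and com.cisco.wfapi.WFKeepAliveException). The function names, the labels, the log directory, and the "*.log" file pattern are all assumptions made for illustration; point the script at wherever the collected MIVR logs are stored.

```python
import re
from pathlib import Path

# Signatures taken from the causes listed above; the labels are illustrative.
SIGNATURES = {
    "out_of_memory": re.compile(
        r"java\.lang\.OutOfMemoryError(?::\s*(?P<reason>.+))?"
    ),
    "keep_alive": re.compile(r"com\.cisco\.wfapi\.WFKeepAliveException"),
}


def scan_lines(lines):
    """Return (line_number, label, detail) for every line matching a signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                # Only the OOM pattern captures a reason; others yield "".
                detail = (match.groupdict().get("reason") or "").strip()
                hits.append((number, label, detail))
    return hits


def scan_directory(log_dir, glob="*.log"):
    """Scan every file matching `glob` in log_dir.

    Both the directory and the file pattern are assumptions; adjust them
    to match how the MIVR logs were collected.
    """
    results = {}
    for path in sorted(Path(log_dir).glob(glob)):
        with path.open(errors="replace") as handle:
            hits = scan_lines(handle)
        if hits:
            results[path.name] = hits
    return results
```

For example, calling scan_directory on a folder of collected MIVR logs returns a dictionary keyed by file name, where each value lists the matching line numbers, the kind of signature found, and, for an OutOfMemoryError, the reason text that follows it. A hit only indicates where to start reading; the surrounding log lines still need to be reviewed to confirm the cause of the failover.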