Automatic failover

Switching a master domain manager to a backup master domain manager.

Recovery is easy when you are prepared for potential problems. If the master domain manager becomes unavailable, a long-term switchmgr operation is triggered to ensure continuous operations, and the workload is automatically switched to an eligible backup master domain manager. Similarly, the backup event processors automatically detect whether the event processor is unavailable, and a long-term switcheventprocessor command is triggered. This is the default behavior for a complete fresh installation of V9.5 Fix Pack 2 or later, and it can also be enabled for back-level environments that are upgraded to V9.5 Fix Pack 2 or later.

Note: The feature is enabled by default only for a complete fresh installation. If you perform a fresh installation of a backup master domain manager at the V9.5 Fix Pack level in an existing back-level environment, the automatic failover feature is disabled. To enable it, follow this procedure.

You can optionally define potential backups for both the master domain manager and the event processor in two separate lists, adding preferential backups at the top of the lists. The backup engines monitor the behavior of the master domain manager and event processor to detect anomalous behavior and then attempt to recover. Each component plays a role in either detecting a failure or recovering from it:

Each backup master domain manager monitors the status of the active master domain manager.
The master domain manager (active or backup) is self-aware. It monitors the status of its fault-tolerant agent to check processes such as Batchman, Mailman, and Jobman. If at least one of these processes is down, the master domain manager makes 3 attempts to restart the processes that are down.
If the WebSphere Application Server Liberty Base goes down, the watchdog process attempts to restart it.
If the active master domain manager cannot be automatically restored within 5 minutes (the threshold after which the master is declared unavailable), then a permanent switch to a backup is automatically triggered by any of the backup candidates when one or more of the following conditions persist:
- The fault-tolerant agent, WebSphere Application Server Liberty Base, or both are still down.
- The engine is unable to communicate with the database, for example, due to a network outage.
If you have defined potential backups in a list, and a switch after 5 minutes is not possible with the first backup in the list because it is unavailable, then an attempt is made to contact the remaining backups in the list, following the order specified in the list, until an available backup is found to perform the switch. In this case, 5 minutes pass between each attempt.
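The order in which the backups in the list are tried can be illustrated with a small simulation. This is a minimal sketch and not product code: the function name pick_backup, the workstation names, and the is_available callback are hypothetical. Only the 5-minute threshold and the ordered walk through the list are taken from the description above. What happens when no backup in the list is available is not described here, so the sketch simply returns None in that case.

```python
from typing import Callable, Optional, Sequence, Tuple

# From the text: the master is declared unavailable after 5 minutes,
# and 5 minutes pass between successive attempts on the backup list.
THRESHOLD_MINUTES = 5


def pick_backup(
    candidates: Sequence[str],
    is_available: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Return (chosen backup, minutes elapsed since the master failed).

    Candidates are tried in list order, one attempt every THRESHOLD_MINUTES.
    If none is available, the result is (None, time of the last attempt).
    """
    elapsed = 0
    for name in candidates:
        elapsed += THRESHOLD_MINUTES  # wait for the threshold before each attempt
        if is_available(name):
            return name, elapsed
    return None, elapsed


if __name__ == "__main__":
    # Hypothetical workstation names; BKM1 is the preferred backup but is down.
    backups = ["BKM1", "BKM2", "BKM3"]
    down = {"BKM1"}
    chosen, minutes = pick_backup(backups, lambda ws: ws not in down)
    print(f"switch to {chosen} after {minutes} minutes")
```

In this example the preferred backup is unavailable at the 5-minute mark, so the second entry in the list performs the switch at the 10-minute mark.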

The list of potential event processor backups is separate from the list of potential master domain manager backups, because you might have a workstation that can serve as the event processor backup, but that you do not want to act as a potential master domain manager backup. If the event processor fails while the master domain manager is running normally, then only the event processor switches to a backup defined in its list of potential backups.

You can track detected failures and the actions taken by checking the messages.log file located in the following path:

On Linux and UNIX operating systems:
<TWA_DATA_DIR>/stdlist/appserver/engineServer/logs/messages.log
On Windows operating systems:
<TWA_home>\TWS\stdlist\appserver\engineServer\logs\messages.log
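When the log is large, it can help to filter it for lines of interest. The following is a minimal sketch that assumes nothing about the log format: it returns the lines that contain any of the keywords you supply, ignoring case. The function name matching_lines is hypothetical, and the keywords to search for are your choice, because the text of the failover messages is not specified in this topic.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def matching_lines(log_path: Path, keywords: Iterable[str]) -> List[str]:
    """Return the lines of log_path that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    hits: List[str] = []
    # errors="replace" keeps the scan going if the log holds unexpected bytes.
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.lower()
            if any(k in text for k in lowered):
                hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage: python scan_log.py <path to messages.log> keyword [keyword ...]
    for hit in matching_lines(Path(sys.argv[1]), sys.argv[2:]):
        print(hit)
```

Pass the full path to messages.log for your platform as the first argument, followed by one or more keywords.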

Note: On Linux® and UNIX®, a fresh installation installs an extended agent together with the master domain manager. The extended agent is used to indicate where to run the FINAL job stream and its jobs. With an extended agent, $MASTER can be used to indicate that the agent's host workstation is the master domain manager. If the role of the master is switched to a backup, then $MASTER represents the new master. This supports both a short-term and a long-term switch for the automatic failover feature. If you are upgrading from a version earlier than 9.5 Fix Pack 2, then you must define the extended agent manually.

On Windows™ workstations, the FINAL job stream is not defined on the extended agent, but remains on the master domain manager. The FINAL and FINALPOSTREPORTS job streams and their jobs need to be moved from the master to the extended agent workstation. For this reason, only a short-term switch can be performed automatically, and the long-term switch must be performed manually as documented in Extended loss or permanent change of master domain manager and in Complete procedure for switching a domain manager. See also switchmgr command, which contains both the command-line syntax and the procedural steps to perform the switch from the Dynamic Workload Console.