Switching a master domain manager to a backup master domain manager.
Recovery is easy when you are prepared for potential problems. If the master domain manager becomes
unavailable, a long-term switchmgr operation is triggered and the workload is automatically switched
to an eligible backup master domain manager to ensure continuous operations. Similarly, the backup
event processors automatically detect when the event processor is unavailable, and a long-term
switcheventprocessor command is triggered. This is the default behavior for a complete fresh
installation of V9.5 Fix Pack 2 or later, but it can also be enabled for back-level environments that
are upgraded to V9.5 Fix Pack 2 or later.
Note: If you perform a fresh installation of a backup master domain manager at the V9.5 Fix Pack level
in an existing back-level environment, the automatic failover feature is disabled. To enable it,
follow this procedure. The feature is enabled by default only for a complete fresh installation.
You can optionally define potential backups for both the master domain manager and the event
processor in two separate lists, adding preferential backups at the top of the lists. The backup
engines monitor the behavior of the master domain manager and event processor to detect anomalous
behavior and then attempt to recover. Each component plays a role in either detecting a failure or
recovering from it:
- Each backup master domain manager monitors the status of the active master domain manager.
- The master domain manager (active or backup) is self-aware. It monitors its fault-tolerant agent to
  check the status of processes such as Batchman, Mailman, and Jobman. If at least one of these
  processes is down, the master domain manager makes 3 attempts to restart them.
- If the WebSphere Application Server Liberty Base goes down, the watchdog process attempts to
  restart it.
- If the active master domain manager cannot be automatically restored within 5 minutes (the
  threshold after which the master is declared unavailable), then a permanent switch to a backup is
  automatically triggered by any of the backup candidates when one or more of the following
  conditions persist:
  - The fault-tolerant agent, WebSphere Application Server Liberty Base, or both are still down.
  - The engine is unable to communicate with the database, for example, due to a network outage.
If you have defined potential backups in a list, and a switch after 5 minutes is not possible
with the first backup in the list because it is unavailable, then an attempt is made to contact the
remaining backups in the list, following the order specified in the list, until an available backup
is found to perform the switch. In this case, 5 minutes pass between each attempt.
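The trigger condition and the walk through the backup list can be illustrated with a minimal Python sketch. It is not the product's implementation: the MasterStatus fields, the function names, the workstation names, and the availability check are hypothetical, and only the 5-minute threshold, the persisting conditions, and the 5-minute interval between attempts come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

UNAVAILABLE_THRESHOLD_MIN = 5  # master is declared unavailable after this long
RETRY_INTERVAL_MIN = 5         # wait between attempts on successive backups


@dataclass
class MasterStatus:
    """Hypothetical snapshot of the active master's health."""
    minutes_unrecovered: int    # how long automatic restore has been failing
    agent_down: bool            # fault-tolerant agent processes are down
    liberty_down: bool          # WebSphere Application Server Liberty Base is down
    database_unreachable: bool  # engine cannot communicate with the database


def switch_needed(status: MasterStatus) -> bool:
    """True when the threshold has passed and at least one condition persists."""
    condition_persists = (
        status.agent_down or status.liberty_down or status.database_unreachable
    )
    return (
        status.minutes_unrecovered >= UNAVAILABLE_THRESHOLD_MIN
        and condition_persists
    )


def pick_backup(
    candidates: Sequence[str],
    is_available: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Walk the ordered list; return (chosen backup, minutes since the outage began)."""
    elapsed = UNAVAILABLE_THRESHOLD_MIN
    for index, name in enumerate(candidates):
        if index > 0:
            # Each further candidate is contacted one interval later.
            elapsed += RETRY_INTERVAL_MIN
        if is_available(name):
            return name, elapsed
    return None, elapsed


# Hypothetical workstation names: the first backup in the list is unavailable.
print(pick_backup(["BKM1", "BKM2", "BKM3"], lambda name: name != "BKM1"))
# -> ('BKM2', 10)
```

In this sketch, an unavailable first backup delays the switch by one extra interval, so the second backup in the list takes over 10 minutes after the outage began.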
The list of potential event processor backups is separate from the list of potential master domain
manager backups, because you might have a workstation that can serve as the event processor backup
but that you do not want to act as a potential master domain manager backup. If the event processor
fails but the master domain manager is running normally, then only the event processor switches to a
backup defined in its list of potential backups.
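As a rough illustration of why the two lists are kept separate, the following Python sketch selects a backup only for the component that failed. The component keys, the workstation names, and the plan_switches function are hypothetical and are not part of the product.

```python
from typing import Dict, Optional, Sequence, Set


def plan_switches(
    failed: Set[str],
    available: Set[str],
    backup_lists: Dict[str, Sequence[str]],
) -> Dict[str, Optional[str]]:
    """For each failed component, pick the first available backup from its own list."""
    plan: Dict[str, Optional[str]] = {}
    for component in sorted(failed):
        candidates = backup_lists.get(component, [])
        # Preferential backups sit at the top of each list, so order matters.
        plan[component] = next((w for w in candidates if w in available), None)
    return plan


# Hypothetical lists: EVT1 may back up the event processor but never the master.
backup_lists = {
    "master": ["BKM1", "BKM2"],
    "event_processor": ["EVT1", "BKM1"],
}

# Only the event processor failed, so only that component is switched.
print(plan_switches({"event_processor"}, {"EVT1", "BKM1", "BKM2"}, backup_lists))
# -> {'event_processor': 'EVT1'}
```

Because EVT1 appears only in the event processor list, it can never be chosen as a master domain manager backup in this sketch.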
You can track detected failures and the actions taken by checking the messages.log file located in
the following path:
- On UNIX and Linux: <TWA_DATA_DIR>/stdlist/appserver/engineServer/logs/messages.log
- On Windows: <TWA_home>\TWS\stdlist\appserver\engineServer\logs\messages.log
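A small helper can make it easier to pick out the relevant entries in messages.log. The Python sketch below is only an illustration: the find_log_lines function is hypothetical, and the keywords to search for are an assumption, because the exact wording of the failover messages is not given here.

```python
from pathlib import Path
from typing import Iterable, List


def find_log_lines(log_path: Path, keywords: Iterable[str]) -> List[str]:
    """Return the lines of the log that contain any keyword (case-insensitive)."""
    needles = [keyword.lower() for keyword in keywords]
    matches: List[str] = []
    # errors="replace" keeps the scan going if the log holds undecodable bytes.
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lowered = line.lower()
            if any(needle in lowered for needle in needles):
                matches.append(line.rstrip("\n"))
    return matches
```

For example, calling find_log_lines with the path to messages.log and the keyword "switchmgr" returns every line that mentions that operation, assuming the log records it under that name. Replace the keyword with whatever text your log actually uses.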
Note: On Linux® and UNIX®, for a fresh installation, an extended agent is installed with the master
domain manager. The extended agent is used to indicate where to run the FINAL job stream and its
jobs. With an extended agent, $MASTER can be used to indicate that the agent's host workstation is
the master domain manager. If the role of the master is switched to a backup, then the new master is
represented by $MASTER. This supports both a short-term and a long-term switch for the automatic
failover feature. If you are upgrading from a version earlier than 9.5 Fix Pack 2, then you must
define the extended agent manually.
On Windows™ workstations, the FINAL job stream is not defined on the extended agent, but remains on
the master domain manager. The FINAL and FINALPOSTREPORTS job streams and jobs need to be moved from
the master to the extended agent workstation. For this reason, only a short-term switch can be
performed automatically, and the long-term switch must be performed manually as documented in
Extended loss or permanent change of master domain manager and in Complete procedure for switching a
domain manager. See also switchmgr command, which contains both the command-line syntax and the
procedural steps for performing the switch from the Dynamic Workload Console.