Understanding Quality of Service (QoS) behavior and logging

This topic covers details of QoS including server and server controller behavior during kill events, failover trigger, and log file content.

QoS kill events

The following is how the server and server controller should behave during kill events.
  • 'nsd -kill' does not produce an nsd. It produces only a kill_* file.
  • If and only if the server is due to be restarted, the controller generates its own 'nsd -stacks' for troubleshooting purposes.
  • With QoSShutdownNSD=seconds set in the notes.ini, an 'nsd -stacks' is generated every QoSShutdownNSD seconds if the server has not come down cleanly within QoSShutdownNSD seconds. This notes.ini setting is used for troubleshooting servers that are taking too long to shut down.
Note: In the following table, timeout indicates that the qosprobe server add-in is unable to open the server's names.nsf ($Servers view) successfully within QOS_PROBE_TIMEOUT milliseconds.
Table 1. Events and actions
Event Controller action Configurable?
Probe (qosprobe) timeout Server is killed after 5 minutes and restarted. dcontroller.ini:QOS_PROBE_TIMEOUT=minutes
Long running applications timeout Server is killed after 10 minutes and restarted. dcontroller.ini:QOS_APPS_TIMEOUT=minutes
Server runs out of shared handles Server is killed and restarted. no
Server runs out of session tables Server is killed and restarted. no
Server runs out of net memory Server is killed and restarted. no
Server runs out of shared memory handles Server is killed and restarted. no
Server crash/panic while running Server is restarted after 5 minutes. no
Server takes too long to shutdown ('quit') Server is killed after 5 minutes. dcontroller.ini:QOS_SHUTDOWN_TIMEOUT=minutes
Server takes too long to restart ('restart server') Server is killed after 5 minutes and restarted. dcontroller.ini:QOS_RESTART_TIMEOUT=minutes
The server process has terminated abnormally Server is killed and restarted. no

QoS failover trigger

A QoS smart kill can have a server down for up to 20 minutes. Total downtime can include an approximately 5-minute detection of a probe timeout, the running of nsd to collect data on all processes (~3 minutes), the killing of the server (~1-2 minutes), and the restarting (including gating task time - up to 10 minutes). Any new requests designated to process on a server that QoS is set to will immediately fail over to a clustermate within seconds of the moment that QoS detects that the server should be smart killed.

Because failover matters only when the server is known to be up, running, and processing, the fast failover feature is not used in the following smart kill scenarios:
  • Server shutdown is taking too long
  • Server restart is taking too long
  • The server has crashed and QoS needs to clean up after the crash
Note: You can disable the StaticHang mechanism by using the notes.ini setting QOS_DISABLE_FAILOVER_TRIGGER=1. With this parameter set, the triggerImmediateServerFailover file is stillcreated and deleted, but the server does not StaticHang to force failover.

QoS controller log file

QoS places a new log file in the Domino® server's data directory. The QoS controller log file contains details corresponding to various events as captured or processed by the QoS controller, events relating to QoS probing, hygienic server restart, server crashes, QoS smart kills, and other miscellaneous events. The following sections describe this log file, how it works, and how to properly read it when troubleshooting an event in the service.

Note: You may also want to provide IBM® support with the log file if you are troubleshooting a server problem with them.
Log file naming convention

The QoS controller log file name contains a 24 hour timestamp in the format YYYYMMDDHHmm, for example:
qoscntrlr201105171528.out

This timestamp indicates the time that the QoS controller was started. The example filename would be the QoS controller log for a service start of May 17th, 2011 at 3:28 PM. If the service is stopped and started again, the current qoscntrlrYYYYMMDDHHmm.out file is given the .log extension and a new qoscntrlrYYYYMMDDHHmm.out file is created with the current time. These qoscntrlrYYYYMMDDHHmm.log files are automatically deleted when the service is started if they are older than 14 days.

Log file naming convention

The QoS controller log file name contains a 24 hour timestamp in the format YYYYMMDDHHmm, for example:
qoscntrlr201105171528.out

This timestamp indicates the time that the QoS controller was started. The example filename would be the QoS controller log for a service start of May 17th, 2011 at 3:28 PM. If the service is stopped and started again, the current qoscntrlrYYYYMMDDHHmm.out file is given the .log extension and a new qoscntrlrYYYYMMDDHHmm.out file is created with the current time. These qoscntrlrYYYYMMDDHHmm.log files are automatically deleted when the service is started if they are older than 14 days.

How to read the log file

At the beginning of the log file, general configuration information appears for this logged run of the QoS controller:
2012/08/06 06:33:34 QoS Controller: Starting QOSPipeWatcher
2012/08/06 06:33:34 QoS Controller: QOS_PROBE_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_SHUTDOWN_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_RESTART_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_APPS_TIMEOUT=10 minutes
2012/08/06 06:33:34 QoS Controller: nsd Program Path=/opt/ibm/notes/latest/linux/nsd.sh
2012/08/06 06:33:34 QoS Controller: QOS_RESTART_LIMIT_MAXIMUM=3
2012/08/06 06:33:34 QoS Controller: QOS_RESTART_LIMIT_PERIOD=30 minutes
2012/08/06 06:33:34 QoS Controller: QOS_NOKILL=false
2012/08/06 06:33:34 QoS Controller: QOS_MAIL_TO=test/ibm
2012/08/06 06:33:34 QoS Controller: QOS_MAIL_SMTP_SERVER=xx
These items, along with some other basic items, can be configured in the Domino® controller ini file (dcontroller.ini), found in the server's data directory. The rest of the file from this point on contains a log entry for each message sent to the QoS controller by the server or one of its tasks. These messages have the format:
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev ObjectType2=ProcessName ObjectValue2=nserver TimeDate=20120508T001506,95-04
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=SERVER TimeDate=20120508T001507,40-04
2012/05/08 00:15:21 QoS Controller: OpMsg=READY Type=SERVER TimeDate=20120508T001517,92-04
All messages logged to the QoS controller log file have a timestamp. If the QoS controller logs the message, it has the format:
TimeDate=20120508T001506,95-04
If one of the QoS controller's other threads logs a message to the log file, it has the format:
2012/05/08 00:15:21 QoS Probe: message
2012/05/08 00:15:21 QoS Applications: message
2012/05/08 00:15:21 QoS Kill: message

What to look for in the log file

This table shows examples of basic logging events you should see when looking at the QoS controller log file.

Table 2. Examples of basic logging events
Event Examples of what log shows
Normal server startup
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev ObjectType2=ProcessName ObjectValue2=nserver TimeDate=20120508T001506,95-04
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=SERVER TimeDate=20120508T001507,40-04
2012/05/08 00:15:10 QoS Applications: Clearing long running apps list
2012/05/08 00:15:21 QoS Controller: OpMsg=READY Type=SERVER TimeDate=20120508T001517,92-04
2012/05/08 00:15:21 QoS Controller: Server is ready to process requests
Normal server shutdown
2012/05/08 00:45:22 QoS Controller: OpMsg=END Type=SERVER ObjectType=Detail ObjectValue=Quit TimeDate=20120508T004516,01-04
2012/05/08 00:45:22 QoS Controller: Deactivating probe...
2012/05/08 00:45:22 QoS Controller: QoS Probe deactivated.
2012/05/08 00:45:26 QoS Controller: OpMsg=END Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev TimeDate=20120508T004523,51-04
2012/05/08 00:45:27 QoS Applications: Clearing long running apps list
QoS probing
2012/05/08 00:15:21 QoS Controller: Activating probe...
2012/05/08 00:15:21 QoS Controller: QoS Probe activated.
2012/05/08 00:15:21 QoS Probe: Starting qosprobe...
2012/05/08 00:15:25 QoS Probe: OpMsg=START, Type=PROBE
2012/05/08 00:16:25 QoS Probe: The QoS Probe is probing.
2012/05/08 00:16:25 QoS Probe: SUCCESS (156ms)
2012/05/08 00:17:25 QoS Probe: SUCCESS (16ms)
2012/05/08 00:18:25 QoS Probe: SUCCESS (31ms)
2012/05/08 00:19:25 QoS Probe: SUCCESS (16ms)
2012/05/08 00:20:26 QoS Probe: SUCCESS (15ms)
Long-running applications
2012/05/08 00:38:32 QoS Controller: OpMsg=START Type=FIXUP ObjectType=DB ObjectValue=C:\Program Files\IBM\Domino\Data\ddm.nsf TimeDate=20120508T003826,18-04
2012/05/08 00:38:32 QoS Controller: OpMsg=END Type=FIXUP ObjectType=DB ObjectValue=C:\Program Files\IBM\Domino\Data\ddm.nsf TimeDate=20120508T003829,79-04
2012/05/08 00:38:32 QoS Applications: Adding FIXUP[C:\Program Files\IBM\Domino\Data\ddm.nsf] to long running apps list
2012/05/08 00:38:32 QoS Applications: Removing FIXUP[C:\Program Files\IBM\Domino\Data\ddm.nsf] from long running apps list..
2012/05/08 00:47:42 QoS Controller: OpMsg=START Type=COMPACT ObjectType=DB ObjectValue=events4.nsf TimeDate=20120508T004740,23-04
2012/05/08 00:47:42 QoS Controller: OpMsg=END Type=COMPACT ObjectType=DB ObjectValue=events4.nsf TimeDate=20120508T004740,23-04
2012/05/08 00:47:43 QoS Applications: Adding COMPACT[events4.nsf] to long running apps list
2012/05/08 00:47:43 QoS Applications: Removing COMPACT[events4.nsf] from long running apps list

Evidence of a server crash in the log file

The QoS controller monitors and logs crash events to ensure the kill and restart are performed in a reasonable amount of time. To see evidence of this in the QoS controller log, search for the text "=CRASH" in the log file. Here is an example:
2012/05/08 01:00:44 QoS Controller: OpMsg=CRASH Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev TimeDate=20120508T010039,48-04
2012/05/08 01:00:44 QoS Controller: Server CN=rc45/O=dev has crashed.
2012/05/08 01:00:44 QoS Controller: Deactivating probe...
2012/05/08 01:00:44 QoS Controller: QoS Probe deactivated.

Evidence of a smart kill in the log file

The QoS controller is coded to kill the server intelligently based on information it receives from the server or from QoS probing. Here is what a smart kill from an QoS Probe timeout might look like in the QoS controller file:
2012/05/08 00:31:41 QoS Probe: SUCCESS (78ms)
2012/05/08 00:32:41 QoS Probe: SUCCESS (16ms)
2012/05/08 00:37:41 The probe thread has not received a message from qosprobe within the timeout period.
2012/05/08 00:37:41 QoS Probe: The qosprobe addin has timed out, is not responding, or is not running.
2012/05/08 00:37:41 QoS Controller: Deactivating probe...
2012/05/08 00:37:41 QoS Controller: QoS Probe deactivated.
2012/05/08 00:37:43 QoS Controller: OpMsg=TIMEOUT Type=PROBE TimeDate=null
2012/05/08 00:37:43 QoS Controller: The controller has received a probe timeout.
2012/05/08 00:37:43 QoS Kill: Triggering failover...
2012/05/08 00:37:47 QoS Kill: Running nsd...
2012/05/08 00:38:12 QoS Kill: Running nsd -kill
2012/05/08 00:38:16 QoS Kill: Setting kill complete.
2012/05/08 00:38:21 QoS Kill: Restarting DominoStarter thread