Understanding Quality of Service (QoS) behavior and logging
This topic covers QoS details, including server and server controller behavior during kill events, the failover trigger, and log file content.
QoS kill events
- 'nsd -kill' does not produce an nsd. It produces only a kill_* file.
- If and only if the server is due to be restarted, the controller generates its own 'nsd -stacks' for troubleshooting purposes.
- With QoSShutdownNSD=seconds set in the notes.ini, an 'nsd -stacks' is generated every QoSShutdownNSD seconds if the server has not come down cleanly within QoSShutdownNSD seconds. This notes.ini setting is used for troubleshooting servers that are taking too long to shut down.
A probe timeout occurs when the qosprobe server add-in is unable to open the server's names.nsf ($Servers view) successfully within QOS_PROBE_TIMEOUT milliseconds.
Event | Controller action | Configurable? |
---|---|---|
Probe (qosprobe) timeout | Server is killed after 5 minutes and restarted. | dcontroller.ini:QOS_PROBE_TIMEOUT=minutes |
Long running applications timeout | Server is killed after 10 minutes and restarted. | dcontroller.ini:QOS_APPS_TIMEOUT=minutes |
Server runs out of shared handles | Server is killed and restarted. | no |
Server runs out of session tables | Server is killed and restarted. | no |
Server runs out of net memory | Server is killed and restarted. | no |
Server runs out of shared memory handles | Server is killed and restarted. | no |
Server crash/panic while running | Server is restarted after 5 minutes. | no |
Server takes too long to shut down ('quit') | Server is killed after 5 minutes. | dcontroller.ini:QOS_SHUTDOWN_TIMEOUT=minutes |
Server takes too long to restart ('restart server') | Server is killed after 5 minutes and restarted. | dcontroller.ini:QOS_RESTART_TIMEOUT=minutes |
The server process has terminated abnormally | Server is killed and restarted. | no |
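
The configurable rows in the table can be read as default values plus optional overrides. The following is a minimal Python sketch of that idea, not the controller's actual parser. It assumes that dcontroller.ini holds plain KEY=value lines, which is an assumption, and the function name load_qos_timeouts is hypothetical. The default values, in minutes, are the ones listed in the table.

```python
# Defaults in minutes, taken from the table above.
DEFAULT_TIMEOUTS = {
    "QOS_PROBE_TIMEOUT": 5,
    "QOS_APPS_TIMEOUT": 10,
    "QOS_SHUTDOWN_TIMEOUT": 5,
    "QOS_RESTART_TIMEOUT": 5,
}


def load_qos_timeouts(ini_text):
    """Return timeouts in minutes: table defaults overridden by ini_text.

    Assumption: settings are simple KEY=value lines. Unknown keys and
    non-numeric values are ignored. Hypothetical helper for illustration.
    """
    timeouts = dict(DEFAULT_TIMEOUTS)
    for raw in ini_text.splitlines():
        key, sep, value = raw.strip().partition("=")
        key = key.strip()
        value = value.strip()
        if sep and key in timeouts and value.isdigit():
            timeouts[key] = int(value)
    return timeouts


if __name__ == "__main__":
    # Only the probe timeout is overridden; the rest keep their defaults.
    print(load_qos_timeouts("QOS_PROBE_TIMEOUT=8\n"))
```

Events marked "no" in the Configurable? column have no corresponding key in this sketch, so they cannot be overridden.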
QoS failover trigger
A QoS smart kill can keep a server down for up to 20 minutes. Total downtime can include approximately 5 minutes to detect a probe timeout, approximately 3 minutes to run nsd to collect data on all processes, approximately 1-2 minutes to kill the server, and up to 10 minutes to restart it (including gating task time). New requests directed to a server that QoS is about to smart kill fail over to a clustermate within seconds of the moment that QoS detects that the server should be smart killed.
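
The 20-minute figure is the sum of the worst-case phase durations given in the paragraph above. A short Python sketch of that arithmetic follows; the dictionary keys are descriptive labels chosen for illustration, not product terms.

```python
# Approximate worst-case phase durations in minutes, from the paragraph above.
PHASES_MINUTES = {
    "probe timeout detection": 5,
    "nsd data collection": 3,
    "killing the server": 2,  # upper end of the ~1-2 minute range
    "restart including gating tasks": 10,
}

total_minutes = sum(PHASES_MINUTES.values())
print(total_minutes)  # 20
```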
- Server shutdown is taking too long
- Server restart is taking too long
- The server has crashed and QoS needs to clean up after the crash
With QOS_DISABLE_FAILOVER_TRIGGER=1 set, the triggerImmediateServerFailover file is still created and deleted, but the server does not StaticHang to force failover.
QoS controller log file
QoS places a new log file in the Domino® server's data directory. The QoS controller log file contains details corresponding to various events as captured or processed by the QoS controller, events relating to QoS probing, hygienic server restart, server crashes, QoS smart kills, and other miscellaneous events. The following sections describe this log file, how it works, and how to properly read it when troubleshooting an event in the service.
Log file naming convention
The QoS controller log file name has the form qoscntrlrYYYYMMDDHHmm.out, for example:
qoscntrlr201105171528.out
The timestamp indicates the time that the QoS controller was started. The example file name is the QoS controller log for a service start of May 17, 2011 at 3:28 PM. If the service is stopped and started again, the current qoscntrlrYYYYMMDDHHmm.out file is given the .log extension and a new qoscntrlrYYYYMMDDHHmm.out file is created with the current time. The qoscntrlrYYYYMMDDHHmm.log files are automatically deleted when the service is started if they are older than 14 days.
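
When collecting logs for troubleshooting, it can help to recover the controller start time from the file name. The following Python sketch parses the naming convention described above. The function names are hypothetical, and the sketch assumes that a file's age is measured from the timestamp in its name; the controller may use a different measure, such as the file modification time.

```python
import re
from datetime import datetime, timedelta

# qoscntrlr + YYYYMMDDHHmm + .out (current) or .log (rotated)
NAME_RE = re.compile(r"^qoscntrlr(\d{12})\.(out|log)$")


def controller_start_time(filename):
    """Return the start time encoded in the file name, or None if no match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y%m%d%H%M")


def is_expired(filename, now, max_age_days=14):
    """True for a rotated .log file older than max_age_days.

    Assumption: age is computed from the file name timestamp.
    """
    started = controller_start_time(filename)
    if started is None or not filename.endswith(".log"):
        return False
    return now - started > timedelta(days=max_age_days)


if __name__ == "__main__":
    print(controller_start_time("qoscntrlr201105171528.out"))
    # 2011-05-17 15:28:00
```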
How to read the log file
2012/08/06 06:33:34 QoS Controller: Starting QOSPipeWatcher
2012/08/06 06:33:34 QoS Controller: QOS_PROBE_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_SHUTDOWN_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_RESTART_TIMEOUT=5 minutes
2012/08/06 06:33:34 QOSController: QOS_APPS_TIMEOUT=10 minutes
2012/08/06 06:33:34 QoS Controller: nsd Program Path=/opt/ibm/notes/latest/linux/nsd.sh
2012/08/06 06:33:34 QoS Controller: QOS_RESTART_LIMIT_MAXIMUM=3
2012/08/06 06:33:34 QoS Controller: QOS_RESTART_LIMIT_PERIOD=30 minutes
2012/08/06 06:33:34 QoS Controller: QOS_NOKILL=false
2012/08/06 06:33:34 QoS Controller: QOS_MAIL_TO=test/ibm
2012/08/06 06:33:34 QoS Controller: QOS_MAIL_SMTP_SERVER=xx
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev ObjectType2=ProcessName ObjectValue2=nserver TimeDate=20120508T001506,95-04
2012/05/08 00:15:09 QoS Controller: OpMsg=START Type=SERVER TimeDate=20120508T001507,40-04
2012/05/08 00:15:21 QoS Controller: OpMsg=READY Type=SERVER TimeDate=20120508T001517,92-04
TimeDate=20120508T001506,95-04
2012/05/08 00:15:21 QoS Probe: message
2012/05/08 00:15:21 QoS Applications: message
2012/05/08 00:15:21 QoS Kill: message
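
The sample lines above share a common shape: a timestamp, an optional component prefix such as QoS Controller, QoS Probe, QoS Applications, or QoS Kill, and a message that may carry KEY=value fields. The following Python sketch splits a line into those parts. It is based only on the sample lines shown in this topic, so the patterns are assumptions rather than a formal log specification, and the function name parse_line is hypothetical.

```python
import re
from datetime import datetime

# Timestamp, optional "QoS <Component>: " or "QOSController: " prefix, message.
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:(QoS \w+|QOSController): )?(.*)$"
)
# KEY=value pairs; a value runs to the next space, so it may contain "=".
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Split one controller log line into its parts, or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, component, message = m.groups()
    # Only OpMsg lines are treated as structured KEY=value records.
    fields = dict(FIELD_RE.findall(message)) if message.startswith("OpMsg=") else {}
    return {
        "time": datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"),
        "component": component,
        "message": message,
        "fields": fields,
    }


if __name__ == "__main__":
    rec = parse_line(
        "2012/05/08 00:15:21 QoS Controller: OpMsg=READY Type=SERVER "
        "TimeDate=20120508T001517,92-04"
    )
    print(rec["component"], rec["fields"]["OpMsg"], rec["fields"]["Type"])
```

Lines without a component prefix are returned with component set to None, and their message is kept whole.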
What to look for in the log file
This table shows examples of basic logging events you should see when looking at the QoS controller log file.
Event | Examples of what log shows |
---|---|
Normal server startup | |
Normal server shutdown | |
QoS probing | |
Long-running applications | |
Evidence of a server crash in the log file
2012/05/08 01:00:44 QoS Controller: OpMsg=CRASH Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev TimeDate=20120508T010039,48-04
2012/05/08 01:00:44 QoS Controller: Server CN=rc45/O=dev has crashed.
2012/05/08 01:00:44 QoS Controller: Deactivating probe...
2012/05/08 01:00:44 QoS Controller: QoS Probe deactivated.
Evidence of a smart kill in the log file
2012/05/08 00:31:41 QoS Probe: SUCCESS (78ms)
2012/05/08 00:32:41 QoS Probe: SUCCESS (16ms)
2012/05/08 00:37:41 The probe thread has not received a message from qosprobe within the timeout period.
2012/05/08 00:37:41 QoS Probe: The qosprobe addin has timed out, is not responding, or is not running.
2012/05/08 00:37:41 QoS Controller: Deactivating probe...
2012/05/08 00:37:41 QoS Controller: QoS Probe deactivated.
2012/05/08 00:37:43 QoS Controller: OpMsg=TIMEOUT Type=PROBE TimeDate=null
2012/05/08 00:37:43 QoS Controller: The controller has received a probe timeout.
2012/05/08 00:37:43 QoS Kill: Triggering failover...
2012/05/08 00:37:47 QoS Kill: Running nsd...
2012/05/08 00:38:12 QoS Kill: Running nsd -kill
2012/05/08 00:38:16 QoS Kill: Setting kill complete.
2012/05/08 00:38:21 QoS Kill: Restarting DominoStarter thread
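
The crash and smart kill excerpts above each contain a distinctive OpMsg marker that can be searched for when scanning a long log. The following Python sketch shows one way to do that. The function name find_qos_events and the event labels are hypothetical, and the marker strings are taken from the excerpts in this topic rather than from a complete list of controller messages.

```python
from datetime import datetime

# Marker substrings taken from the log excerpts above.
MARKERS = {
    "OpMsg=CRASH": "crash",
    "OpMsg=TIMEOUT Type=PROBE": "probe timeout",
}


def find_qos_events(lines):
    """Return a list of (timestamp, kind) for crash and probe timeout lines."""
    events = []
    for line in lines:
        for marker, kind in MARKERS.items():
            if marker in line:
                # The first 19 characters hold the "YYYY/MM/DD HH:MM:SS" stamp.
                stamp = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
                events.append((stamp, kind))
    return events


if __name__ == "__main__":
    sample = [
        "2012/05/08 00:32:41 QoS Probe: SUCCESS (16ms)",
        "2012/05/08 00:37:43 QoS Controller: OpMsg=TIMEOUT Type=PROBE TimeDate=null",
    ]
    for stamp, kind in find_qos_events(sample):
        print(stamp, kind)
```

In the smart kill excerpt, the last successful probe (00:32:41) and the timeout message (00:37:41) are five minutes apart, which is consistent with the QOS_PROBE_TIMEOUT=5 minutes value shown in the startup lines.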