Monitoring quality of service
Quality of Service (QoS) monitors the general operation of a Domino® server in order to keep that server functioning reliably and available. If QoS detects that a server is hung or not responding, QoS probing can be configured to email an administrator about the problem, to automatically terminate and restart the server, or both. QoS log information can also be useful for analysis by Support.
About this task
QoS requires that the Domino® server be run under the Java controller using the Java console.
The qosprobe add-in task can be configured with the following settings on the Domino® server in the server NOTES.INI file:
QOS_PROBE_INTERVAL=n
The probe interval in minutes. This can be set in the notes.ini. The default is 1 minute.
QOS_PROBE_TIMEOUT=n
The probe timeout in minutes. This can be set in the dcontroller.ini. The default is 5 minutes.
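As a minimal sketch of how these two settings might be checked before a server is started, the following Python reads INI-style text and reports whether the probe timeout exceeds the probe interval. The function names and the sample values are illustrative assumptions, not part of Domino; the defaults of 1 and 5 minutes are the ones stated above.

```python
def read_qos_settings(ini_text):
    """Return (interval, timeout) in minutes from INI-style text.

    Hypothetical helper: it only understands KEY=value lines and
    falls back to the defaults described in this topic.
    """
    values = {"QOS_PROBE_INTERVAL": 1, "QOS_PROBE_TIMEOUT": 5}
    for line in ini_text.splitlines():
        key, sep, raw = line.partition("=")
        key = key.strip().upper()
        if sep and key in values:
            values[key] = int(raw.strip())
    return values["QOS_PROBE_INTERVAL"], values["QOS_PROBE_TIMEOUT"]


def timeout_exceeds_interval(ini_text):
    """True when the probe timeout is larger than the probe interval."""
    interval, timeout = read_qos_settings(ini_text)
    return timeout > interval


# Sample values chosen for illustration only.
sample = "QOS_PROBE_INTERVAL=2\nQOS_PROBE_TIMEOUT=10\n"
print(read_qos_settings(sample))
print(timeout_exceeds_interval(sample))
```

The check only tests that the timeout is larger than the interval. How much larger it should be depends on the server and is left to the administrator.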
QOS_PROBE_TIMEOUT should be much greater than QOS_PROBE_INTERVAL. If the timeout occurs before the probe is set to respond, the server will be restarted constantly.
The qosprobe add-in communicates its probing results (SUCCESS, ERROR, TIMEOUT) to the controller. The messages are captured in the qoscntrlrtimestamp.out file found in the server data directory.
The following is an example of a SUCCESS message:
2010/01/07 07:42:56 QoS Probe: SUCCESS (88ms)
The following is an example of an error message:
2010/01/07 08:05:59 QoS Probe: ERROR: ProbeError=4803
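For log analysis, probe lines of the shape shown in the two examples can be parsed with a small script. The following Python is a sketch that assumes every probe line follows exactly those two example formats; the function name and the returned field names are hypothetical, and real log files may contain other variants.

```python
import re
from datetime import datetime

# Pattern derived only from the two example lines in this topic.
PROBE_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) QoS Probe: "
    r"(?P<status>SUCCESS|ERROR)"
    r"(?: \((?P<ms>\d+)ms\)|: ProbeError=(?P<code>\d+))?\s*$"
)


def parse_probe_line(line):
    """Return a dict for a QoS Probe line, or None if it does not match."""
    match = PROBE_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S"),
        "status": match.group("status"),
        "elapsed_ms": int(match.group("ms")) if match.group("ms") else None,
        "probe_error": int(match.group("code")) if match.group("code") else None,
    }


ok = parse_probe_line("2010/01/07 07:42:56 QoS Probe: SUCCESS (88ms)")
err = parse_probe_line("2010/01/07 08:05:59 QoS Probe: ERROR: ProbeError=4803")
print(ok["status"], ok["elapsed_ms"])
print(err["status"], err["probe_error"])
```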
ERROR - The NSFDbOpen or NIFOpenCollection calls used by the probe return Domino's ERR_TIMEOUT error. This error is sent to the controller and a smart kill/restart is initiated.
TIMEOUT - The controller does not receive a message from qosprobe within the timeout period (QOS_PROBE_TIMEOUT). This can happen in one of the following ways:
- qosprobe was told to quit ('tell qosprobe quit') or is not running.
- qosprobe becomes hung while probing.
If the controller receives a probe timeout, it may not initiate a server kill/restart, because long-running or load-intensive operations may be in progress and may have caused the probe to time out. These operations include BACKUP, COMPACT, DBCOPY, FIXUP and DBPURGE. In these cases, you see messages like the following in the qoscntrlrtimestamp.out file:
2010/01/07 07:42:56 QoS Controller: The controller has received a probe timeout.
2010/01/07 07:42:56 QoS Controller: There are long running applications - probing will pause until they have completed.
If this condition is detected, the controller will then allow the lengthy ("long-running") operation more time to complete. If any lengthy operation fails to complete within that amount of time, the controller will then proceed with the smart kill/restart. You see a message like the one in the following example in the qoscntrlrtimestamp.out file:
2010/01/07 07:42:56 QoS Controller: Applications are not making progress.
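When scanning the log file, the three controller messages shown above can be told apart by their wording. The following Python is a sketch that matches on substrings taken from those example messages; the label names are hypothetical, and the exact message text in a given Domino release is an assumption.

```python
# (substring from the example message, illustrative label)
CONTROLLER_MARKERS = [
    ("probe timeout", "timeout_received"),
    ("long running applications", "probing_paused"),
    ("not making progress", "no_progress"),
]


def classify_controller_line(line):
    """Label a 'QoS Controller:' line, or return None for other lines."""
    prefix = "QoS Controller:"
    if prefix not in line:
        return None
    message = line.split(prefix, 1)[1].lower()
    for needle, label in CONTROLLER_MARKERS:
        if needle in message:
            return label
    return "other"


lines = [
    "2010/01/07 07:42:56 QoS Controller: The controller has received a probe timeout.",
    "2010/01/07 07:42:56 QoS Controller: There are long running applications - probing will pause until they have completed.",
    "2010/01/07 07:42:56 QoS Controller: Applications are not making progress.",
]
labels = [classify_controller_line(text) for text in lines]
print(labels)
```

A "no_progress" label is the one to watch for, since the text above describes it as the point at which the controller proceeds with the smart kill/restart.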
QOS_PROBE_INTERVAL
QOS_PROBE_TIMEOUT
QOS_RESTART_LIMIT_PERIOD
QOS_SHUTDOWN_TIMEOUT
QOS_RESTART_TIMEOUT
QOS_APPS_TIMEOUT