Status commands
The tell status command for the IBM Traveler server is tell traveler status.
If you run the command when the overall status is Green, the only message the system displays is IBM Traveler overall status is GREEN. When the status is Yellow or Red, the system displays all the conditions causing noncompliance. The returned messages include both the reason for the noncompliance and the probable cause for the failure (when available). This status information is part of the systemdump command.
tell traveler status
The IBM Traveler task has been running since Tue Jun 15 17:08:37 EDT 2010.
The last successful device sync was on Mon Jun 21 06:43:01 EDT 2010.
Yellow Status Messages
The response times for opening databases on mail server CN=Mail1/O=Test are above the acceptable threshold.
The response times for opening databases on mail server CN=Mail7/O=Test are above the acceptable threshold.
Red Status Messages
17,238 errors have been logged for user CN=Joe Tester/OU=Test/O=IBM.
There have been 3,845 device sync failures for reasons other than the server is too busy.
The overall status of IBM Traveler is Red.
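For monitoring scripts, the console output above can be grouped by severity. The following is a minimal Python sketch, assuming the output has already been captured as text and that the section headings and overall status lines are worded as in the example; the function name parse_status is hypothetical and not part of IBM Traveler.

```python
import re

def parse_status(output: str) -> dict:
    """Group lines of captured `tell traveler status` output by severity."""
    result = {"overall": None, "yellow": [], "red": []}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Yellow Status Messages":
            section = "yellow"
            continue
        if line == "Red Status Messages":
            section = "red"
            continue
        # Matches both "IBM Traveler overall status is GREEN" and
        # "The overall status of IBM Traveler is Red."
        match = re.search(r"overall status (?:of IBM Traveler )?is (\w+)",
                          line, re.IGNORECASE)
        if match:
            result["overall"] = match.group(1).capitalize()
            section = None
        elif section:
            result[section].append(line)
    return result
```

Lines that appear before the first section heading, such as the task start time, are ignored by this sketch.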
Thread checks
- DS or PS threads have run for a "long period" of time
- Problem threshold:
- Yellow: Wall clock run time is greater than 30 minutes
- Red: Wall clock run time is greater than 120 minutes
Console Message:
User {User name} on thread {thread name} has been running for {xx} minutes.
Probable cause: If the Red threshold is reached, then the thread is likely hung. In rare instances there may be a device sync or an extremely long prime sync that is working against a very large user database or a slow mail server, which is normal.
Corrective actions:
- Persistent yellow conditions might indicate a slow mail server or an overloaded Traveler server. Monitor and look for other status conditions that might give a better indication of the cause.
- For the first occurrence, take a system dump, which includes information about all of the threads in the Traveler service. Use tell traveler systemdump and run an nsd at the Domino command line to gather native stacks. Collect the logs.
- Restart the Traveler service. There is a good chance this will require a complete Domino® server restart, and you may need to kill the Domino® server in order for it to shut down completely.
- Percentage of Device Syncs that failed with 503 return code
- Problem threshold:
- Yellow: The number of 503 synchs is more than 5%.
- Red: The number of 503 synchs is more than 10%.
Console Message:
There have been {number of 503 RC} device sync failures because the server is too busy and returned status code 503.
Probable cause: The most probable cause is that the server is running over capacity. 503 means that there are no threads available to handle a synchronization request, and the Traveler server continues to allocate threads until it becomes resource constrained.
Corrective actions: Either increase the memory, or move some of the users to another IBM Traveler server.
- Percentage of Device Syncs failing with an error code other than 503
- Problem threshold:
- Yellow: The number of unsuccessful synchs is more than 5%.
- Red: The number of unsuccessful synchs is more than 10%.
Console Message:
There have been {number of error code other than 503 RC} device sync failures for reasons other than the server is too busy.
Probable cause: There are network connectivity issues between IBM Traveler server and the user's device(s).
- HTTP thread allocations
- Problem threshold:
- Yellow: The peak or current number of connections is greater than 80% of HTTP threads.
- Red: The peak or current number of connections is greater than 90% of HTTP threads.
Console Message:
The number of active HTTP connections is {current percentage} percent of the available HTTP threads ({HTTP Threads}).
The peak number of HTTP connections is {peak percentage} percent of the available HTTP threads ({HTTP Threads}).
Probable cause: This condition implies that there are not enough HTTP threads for the number of devices trying to use the IBM Traveler server.
Corrective actions:
- Increase the number of HTTP threads if there are enough memory and CPU resources.
- Move some of the users to another IBM Traveler server.
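The thread checks above share one pattern: a measured value is compared against a Yellow and a Red threshold. The following is a minimal Python sketch of that pattern using the default thresholds. The function names are hypothetical, the comparisons are assumed to be strictly "greater than" as the thresholds are worded, and applying a minimum sample count of 100 (the NTS_STATUS_MINIMUM_SAMPLES default) to the sync failure percentage is an assumption.

```python
def classify(value, yellow, red):
    """Return the status color for a value measured against two thresholds."""
    if value > red:
        return "Red"
    if value > yellow:
        return "Yellow"
    return "Green"

def thread_runtime_status(minutes_running):
    # Defaults: Yellow above 30 minutes, Red above 120 minutes.
    return classify(minutes_running, yellow=30, red=120)

def sync_failure_status(failed, total, minimum_samples=100):
    # Assumption: with too few samples no percentage is computed (Green).
    if total < minimum_samples:
        return "Green"
    return classify(100.0 * failed / total, yellow=5, red=10)

def http_thread_status(current, peak, http_threads):
    # The check considers the peak or the current connection count,
    # so the higher of the two decides the status.
    pct = 100.0 * max(current, peak) / http_threads
    return classify(pct, yellow=80, red=90)
```

For example, a thread that has run for 45 minutes evaluates to Yellow, and 101 failed syncs out of 1,000 (10.1%) evaluates to Red.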
Memory checks
- Native memory usage
- Problem threshold:
- Yellow: Native Memory usage is greater than 85%
- Red: Native Memory usage is greater than 95%
Console Message:
The current native memory usage is {current percentage} percent of the available memory.
Probable cause: Native memory includes memory shared with other Domino® applications on the Domino® server.
Corrective actions:
- Verify whether too many HTTP threads are allocated.
- Reduce the number of applications running on the Domino® server.
- Reduce the number of IBM Traveler users on the machine.
- Issue the tell traveler mem command to see the history of memory and CPU usage.
- Java™ memory usage
- Problem threshold:
- Yellow: Java™ Memory usage is greater than 75%
- Red: Java™ Memory usage is greater than 85%
- Trusted server causing yellow status
- Problem threshold:
- Yellow: Mail server {MailServerName} does not have the IBM Traveler server {TravelerServerName} in the trusted server list.
Other checks
- CPU usage
Checks the current data to see if the system is overworked. The code checks from the present back through one complete interval. On average, the time period used for measuring CPU utilization is 1.5 times the interval length. By default, the interval is 15 minutes.
Problem threshold:
- Yellow: CPU threshold is 70%.
- Red: CPU threshold is 90%.
Console Message:
The IBM Traveler's CPU usage is {current percentage} percent over the last {minutes} minutes of processing.
Corrective actions:
- Reduce the number of applications running on the Domino® server.
- Reduce the number of IBM Traveler users on the machine.
- Issue the tell traveler mem command to see the history of memory and CPU usage.
- Error messages logged
Checks whether the number of error messages logged for a user has reached the threshold. These thresholds are monitored per person, not for all users on the system.
Problem threshold:
- Yellow: A user's error count is greater than 50 errors
- Red: A user's error count is greater than 100 errors
Console Message:
{0} errors have been logged for user {1}.
- Database open times
Checks the time taken to open databases on a given mail server.
Problem thresholds:
- Yellow: 5% of the opens are above the "Yellow Open Threshold"
- Red: 5% of the opens are above the "Red Open Threshold"
Console Message:
The response times for opening databases on mail server {mail server name} are above the acceptable threshold.
Probable Cause: Check for network delays between the IBM Traveler server and mail server.
- Free disk space
Checks that there is adequate free disk space on the IBM Traveler server. Applies to both the data directory and the log directory, as indicated by the *_DATA_DIR_FREE_* and *_LOG_DIR_FREE_* parameters. By default, the log directory is contained under the data directory, but the administrator can move the log directory to a different disk.
Problem threshold:
- Yellow: Less than 15% Free Disk Space
- Red: Less than 5% Free Disk Space
Console Message:
Disk space for {location} has {%} percent free.
Corrective action: Remove unneeded files to increase free disk space.
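The free disk space check can be reproduced outside the server to see how close a directory is to the thresholds. The following is a minimal Python sketch, assuming the default percentage thresholds (15% Yellow, 5% Red); it does not cover the gigabyte-based thresholds, and the function names are hypothetical.

```python
import shutil

def free_space_status(free_pct, yellow_pct=15, red_pct=5):
    """Classify a free-space percentage against the Yellow and Red limits."""
    if free_pct < red_pct:
        return "Red"
    if free_pct < yellow_pct:
        return "Yellow"
    return "Green"

def check_directory(path):
    """Return (status, percent free) for the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    return free_space_status(free_pct), round(free_pct, 1)
```

Calling check_directory once for the data directory and once for the log directory mirrors the two locations the check applies to.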
Constraint processing
Constraint processing is proactive code that monitors the system to check whether it has entered a resource-constrained state. The system enters a constrained state when system memory or database connections exceed a given threshold. Once the constraint state is detected, IBM Traveler does not allow new device sync or prime sync threads to start. Threads that are already running are allowed to complete, which may alleviate the constraint condition. If the constraint condition persists, the existing IBM Traveler thread pool logic kills off the additional unused threads, further reducing the system's memory footprint. The minimum number of prime sync threads is 5 and the minimum number of device sync threads is 10. While the system is in the constraint state, new device syncs are denied with the 503 status code (server is busy). The system logs information-level messages, with thread summary information, when entering and exiting the constraint state. Whenever a constraint state lasts longer than 60 minutes, an error message is logged and a system dump is executed.
The system enters constraint mode when memory conditions hit the Red state, and exits when usage is 5% below the Red entry level. By default, the system enters constraint when native memory usage is greater than NTS_STATUS_MEMORY_NATIVE_RED, which is 95% by default, or when Java™ memory usage is greater than NTS_STATUS_MEMORY_JAVA_RED, which is 85% by default. The system exits constraint when native memory usage is below 90% and Java™ memory usage is below 80%.
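The separate entry and exit levels form a hysteresis band, which keeps the server from flipping in and out of the constraint state when usage hovers near the Red threshold. The following is a minimal Python sketch of the memory part of that logic, assuming the defaults from the configuration table (native Red 95, Java Red 85, exit delta 5). The class and method names are hypothetical, and the database connection trigger is not modeled.

```python
class MemoryConstraint:
    """Tracks the constraint state with separate entry and exit levels."""

    def __init__(self, native_red=95, java_red=85, exit_delta=5):
        self.native_red = native_red
        self.java_red = java_red
        self.exit_delta = exit_delta
        self.constrained = False

    def update(self, native_pct, java_pct):
        """Feed the latest memory percentages and return the new state."""
        if not self.constrained:
            # Enter when either memory type exceeds its Red threshold.
            if native_pct > self.native_red or java_pct > self.java_red:
                self.constrained = True
        else:
            # Exit only when both are below Red minus the exit delta.
            if (native_pct < self.native_red - self.exit_delta
                    and java_pct < self.java_red - self.exit_delta):
                self.constrained = False
        return self.constrained

    def response_code_for_new_sync(self):
        # New device syncs are denied with 503 while constrained.
        return 503 if self.constrained else 200
```

With these defaults, native memory at 96% enters the constraint state, 92% stays in it, and only a reading below 90% (with Java memory below 80%) leaves it.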
Stats
GetAlarm.Time.Histogram
NameLookup.Time.Histogram
DCA.DB_OPEN
DCA.DB_CLOSE
ERRORS.<UserId>
CPU.Pct.<% CPU Range in 10% increments> (000-010, 010-020, and so on)
DATABASE.QUERY.HISTOGRAM<SimpleName>.(000-001,001-002,002-005, and so on)
Configuration parameters
The notes.ini parameters required to change the thresholds:

Parameter name | Default | Description |
---|---|---|
NTS_STATUS_CPU_PCT_RED_THRESHOLD | 90 | Red CPU percentage threshold. |
NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD | 70 | Yellow CPU percentage threshold. |
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_RED | 5 | Red threshold for gigabytes of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_YELLOW | 10 | Yellow threshold for gigabytes of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_RED | 5 | Red threshold for percentage of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_YELLOW | 15 | Yellow threshold for percentage of free space on the data directory. |
NTS_STATUS_DB_ACCESS_INTERVAL | 0 | Defines which histogram bucket for the Database.Query.Histogram stat is considered acceptable. Any entries in buckets longer than this setting will count towards the percentage for Yellow or Red status. |
NTS_STATUS_DB_ACCESS_PCT_OVER_RED | 5 | Sets the status to red if the percentage of the Database.Query.Histogram stat is in a bucket higher than what is defined in NTS_STATUS_DB_ACCESS_INTERVAL. |
NTS_STATUS_DB_ACCESS_PCT_OVER_YELLOW | 2 | Sets the status to yellow if the percentage of the Database.Query.Histogram stat is in a bucket higher than what is defined in NTS_STATUS_DB_ACCESS_INTERVAL. |
NTS_STATUS_DB_OPEN_INTERVAL_YELLOW | 4 | Lower time limit interval index to open databases in GENERAL_TIME_HISTOGRAM_BOUNDARIES_NAMES. The intervals are "000-001", "001-002", "002-005", "005-010", "010-030", "030-060", "060-120", "120-Inf". |
NTS_STATUS_DB_OPEN_PCT_OVER_YELLOW | 5 | Percentage over the NTS_STATUS_DB_OPEN_INTERVAL_YELLOW to set status to Yellow. |
NTS_STATUS_DS_FAILURE_503_RED | 10 | Percentage of threads failing with a 503 error message to be considered in Red state. |
NTS_STATUS_DS_FAILURE_503_YELLOW | 5 | Percentage of threads failing with a 503 error message to be considered in Yellow state. |
NTS_STATUS_DS_FAILURE_NON_503_RED | 10 | Percentage of threads failing with a non-503 error message to be considered in Red state. |
NTS_STATUS_DS_FAILURE_NON_503_YELLOW | 5 | Percentage of threads failing with a non-503 error message to be considered in Yellow state. |
NTS_STATUS_ERROR_COUNT_RED | 100 | For each user, if their error count is above this value, their status will be set to Red. |
NTS_STATUS_ERROR_COUNT_YELLOW | 50 | For each user, if their error count is above this value, the status will be set to Yellow. |
NTS_STATUS_HTTP_THREAD_PCT_RED | 90 | If the peak HTTP thread usage is above this limit, the status will be set to Red. |
NTS_STATUS_HTTP_THREAD_PCT_YELLOW | 80 | If the peak HTTP thread usage is above this limit, the status will be set to Yellow. |
NTS_STATUS_IPC_DELAY_TIME_PCT_YELLOW | 95 | IPC.DelayTime statistics are a histogram measuring the delay for sending objects between HTTP and IBM Traveler. The status will be set to yellow if the number in the smallest histogram IPC.DelayTime bucket is over this percentage. |
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_RED | 5 | Red threshold for gigabytes of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_YELLOW | 10 | Yellow threshold for gigabytes of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_RED | 5 | Red threshold for percentage of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_YELLOW | 15 | Yellow threshold for percentage of free space on the logging directory. |
NTS_STATUS_MEMORY_EXIT_CONSTRAINT_DELTA | 5 | When high memory usage causes IBM Traveler to enter the constrained state, memory usage must drop this many percentage points below the Red threshold before the constrained state can be exited. |
NTS_STATUS_MEMORY_JAVA_RED | 85 | Red Java™ memory percentage threshold. |
NTS_STATUS_MEMORY_JAVA_YELLOW | 75 | Yellow Java™ memory percentage threshold. |
NTS_STATUS_MEMORY_NATIVE_RED | 95 | Red native memory percentage threshold. |
NTS_STATUS_MEMORY_NATIVE_YELLOW | 85 | Yellow native memory percentage threshold. |
NTS_STATUS_MINIMUM_SAMPLES | 100 | The minimum number of samples that must be taken before the percentages are computed for red/yellow status. |
NTS_STATUS_SSL_CERT_EXPIRATION_RED | 7 | If NTS_SSL is true, this is the threshold for red status for the number of days remaining before the SSL certificate expiration date. |
NTS_STATUS_SSL_CERT_EXPIRATION_YELLOW | 30 | If NTS_SSL is true, this is the threshold for yellow status for the number of days remaining before the SSL certificate expiration date. |
NTS_STATUS_THREAD_MAX_RUN_RED | 120 | If a thread runs longer than this number of minutes, the state will be considered to be Red. |
NTS_STATUS_THREAD_MAX_RUN_YELLOW | 30 | If a thread runs longer than this number of minutes, the state will be considered to be Yellow. |
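
The histogram-based parameters in the table combine a bucket index with a percentage. The following is a minimal Python sketch of how the database open check might be evaluated from bucket counts. It assumes the interval index is zero-based (so the default of 4 selects the "010-030" bucket), that opens in that bucket and all slower buckets count as over the threshold, and that NTS_STATUS_MINIMUM_SAMPLES applies to this check. The function name is hypothetical.

```python
BUCKETS = ["000-001", "001-002", "002-005", "005-010",
           "010-030", "030-060", "060-120", "120-Inf"]

def db_open_status(counts, interval_index=4, pct_over_yellow=5,
                   minimum_samples=100):
    """Classify database open times from a dict of bucket name to count."""
    total = sum(counts.get(bucket, 0) for bucket in BUCKETS)
    if total < minimum_samples:
        # Too few samples to compute a meaningful percentage.
        return "Green"
    # Count opens in the bucket at interval_index and every slower bucket.
    slow = sum(counts.get(bucket, 0) for bucket in BUCKETS[interval_index:])
    pct_slow = 100.0 * slow / total
    return "Yellow" if pct_slow > pct_over_yellow else "Green"
```

For example, 1,000 opens with 100 of them in the "010-030" bucket or slower gives 10%, which exceeds the 5% default and evaluates to Yellow under these assumptions.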
Performance considerations
Highly efficient performance of the health check itself is not absolutely critical, as it runs only periodically (every 15 minutes by default). However, because it runs repeatedly, the process should be as efficient as possible. The new method for determining whether the system is in the constraint state is critical to performance, as it executes each time a new device sync begins.
The other critical piece for performance is the collection of additional stats. Because the current procedure already batch writes stats, adding more stats should not cause any noticeable degradation in performance.
Java™ memory usage will be moderate, as there is a cache for CPU and memory statistics that are retrieved every 15 minutes, holding a total of 100 entries. This is only a small amount of memory compared to the memory usage of the system as a whole.
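A cache limited to 100 entries sampled every 15 minutes covers 100 × 15 = 1,500 minutes, or 25 hours, of history. The following is a minimal Python sketch of such a bounded cache, using a deque as a stand-in; the actual IBM Traveler implementation is not described in this document, and the class name is hypothetical.

```python
from collections import deque

class StatsCache:
    """Keeps only the most recent CPU and memory samples."""

    def __init__(self, max_entries=100):
        # A deque with maxlen discards the oldest entry when it is full.
        self._samples = deque(maxlen=max_entries)

    def record(self, cpu_pct, memory_pct):
        self._samples.append((cpu_pct, memory_pct))

    def __len__(self):
        return len(self._samples)

    def oldest(self):
        return self._samples[0]

    def average_cpu(self):
        if not self._samples:
            return 0.0
        return sum(cpu for cpu, _ in self._samples) / len(self._samples)
```

Because the size is fixed, the memory cost stays constant no matter how long the server runs.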