Status commands
The tell status command for the HCL Traveler server is tell traveler status.
If you run the command when the overall status is Green, the only message the system displays is HCL Traveler overall status is GREEN. When the status is Yellow or Red, the system displays all the conditions causing noncompliance. The returned messages include both the reason for the noncompliance and the probable cause for the failure (when available). This status information is also part of the output of the systemdump command.
tell traveler status
The HCL Traveler task has been running since Tue Jun 15 17:08:37 EDT 2010.
The last successful device sync was on Mon Jun 21 06:43:01 EDT 2010.
Yellow Status Messages
The response times for opening databases on mail server CN=Mail1/O=Test are above the acceptable threshold.
The response times for opening databases on mail server CN=Mail7/O=Test are above the acceptable threshold.
Red Status Messages
17,238 errors have been logged for user CN=Joe Tester/OU=Test/O=HCL.
There have been 3,845 device sync failures for reasons other than the server is too busy.
The overall status of HCL Traveler is Red.
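For monitoring scripts that capture console output, the sample above can be reduced to an overall status plus the messages under each colour. The following is a minimal sketch, not part of HCL Traveler: the function name parse_status is hypothetical, and the line formats it matches are taken only from the sample output and the Green message quoted in this section, so other releases may word these lines differently.

```python
import re

# Overall-status line formats, copied from the sample output and the Green
# message quoted in this section (assumption: no other wordings exist).
_OVERALL_PATTERNS = (
    re.compile(r"overall status of HCL Traveler is (\w+)", re.IGNORECASE),
    re.compile(r"HCL Traveler overall status is (\w+)", re.IGNORECASE),
)
_SECTION_HEADERS = {
    "Yellow Status Messages": "yellow",
    "Red Status Messages": "red",
}


def _overall(line):
    """Return the overall status named on this line, or None."""
    for pattern in _OVERALL_PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group(1).capitalize()
    return None


def parse_status(console_text):
    """Split captured 'tell traveler status' output into its parts."""
    result = {"overall": None, "yellow": [], "red": []}
    section = None
    for raw in console_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in _SECTION_HEADERS:
            section = _SECTION_HEADERS[line]
            continue
        overall = _overall(line)
        if overall:
            result["overall"] = overall
            section = None
        elif section:
            # Any line under a colour header is treated as one message.
            result[section].append(line)
    return result
```

Lines that appear before the first colour header, such as the task start time, are ignored by this sketch.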
Thread checks
- DS or PS threads have run for a "long period" of time
- Problem threshold:
- Yellow: Wall clock run time is greater than 30 minutes
- Red: Wall clock run time is greater than 120 minutes
Console Message:
User {User name} on thread {thread name} has been running for {xx} minutes.
Probable cause: If the Red threshold is reached, the thread is likely hung. In rare instances, a device sync or an extremely long prime sync may be working against a very large user database or a slow mail server, which is normal.
Corrective actions:
- Persistent yellow conditions might indicate a slow mail server or an overloaded Traveler server. Monitor and look for other status conditions that might give a better indication of the cause.
- For the first occurrence, take a system dump, which includes information about all of the threads in the Traveler service. Use tell traveler systemdump and run an nsd at the Domino command line to gather native stacks. Collect the logs.
- Restart the Traveler service. There is a good chance this will require a complete Domino® server restart, and you may need to kill the Domino® server for it to shut down completely.
- Percentage of Device Syncs that failed with 503 return code
- Problem threshold:
- Yellow: The number of 503 syncs is more than 5%.
- Red: The number of 503 syncs is more than 10%.
Console Message:
There have been {number of 503 RC} device sync failures because the server is too busy and returned status code 503.
Probable cause: The most probable cause is that the server is running over capacity. The Traveler server continues to allocate threads until it becomes resource constrained, so a 503 means that no threads are available to handle a synchronization request.
Corrective actions: Either increase the memory, or move some of the users to another HCL Traveler server.
- Percentage of Device Syncs that failed with an error code other than 503
- Problem threshold:
- Yellow: The number of unsuccessful syncs is more than 5%.
- Red: The number of unsuccessful syncs is more than 10%.
Console Message:
There have been {number of error code other than 503 RC} device sync failures for reasons other than the server is too busy.
Probable cause: There are network connectivity issues between HCL Traveler server and the user's device(s).
- HTTP thread allocations
- Problem threshold:
- Yellow: The peak or current number of connections is greater than 80% of HTTP threads.
- Red: The peak or current number of connections is greater than 90% of HTTP threads.
Console Message:
The number of active HTTP connections is {current percentage} percent of the available HTTP threads ({HTTP Threads}).
The peak number of HTTP connections is {peak percentage} percent of the available HTTP threads ({HTTP Threads}).
Probable cause: There are not enough HTTP threads for the number of devices trying to use the HCL Traveler server.
Corrective actions:
- Increase the number of HTTP threads if there are enough memory and CPU resources.
- Move some of the users to another HCL Traveler server.
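The percentage-based thread checks above (503 failures, other sync failures, HTTP connections) all compare a ratio with a Yellow and a Red limit. The sketch below illustrates that comparison only; it is not HCL Traveler code. The function name percent_status is hypothetical, and treating a result with fewer than minimum_samples observations as Green is an assumption based on the description of NTS_STATUS_MINIMUM_SAMPLES.

```python
def percent_status(count, total, yellow_pct, red_pct, minimum_samples=100):
    """Classify count/total as Green, Yellow or Red.

    Assumption: with too few samples no percentage is computed and the
    check stays Green.
    """
    if total <= 0 or total < minimum_samples:
        return "Green"
    pct = 100.0 * count / total
    if pct > red_pct:
        return "Red"
    if pct > yellow_pct:
        return "Yellow"
    return "Green"


# 60 of 1000 device syncs returned 503: 6% is above the 5% Yellow limit.
sync_503 = percent_status(60, 1000, yellow_pct=5, red_pct=10)

# 85 connections against 100 HTTP threads: above the 80% Yellow limit.
# minimum_samples=0 because this is a direct ratio, not a sampled rate.
http_threads = percent_status(85, 100, yellow_pct=80, red_pct=90,
                              minimum_samples=0)
```

Both example calls return Yellow under the default limits documented above.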
Memory checks
- Native memory usage
- Problem threshold:
- Yellow: Native Memory usage is greater than 85%
- Red: Native Memory usage is greater than 95%
Console Message:
The current native memory usage is {current percentage} percent of the available memory.
Probable cause: Native memory includes memory that is shared with other Domino® applications on the Domino® server.
Corrective actions:
- Verify whether too many HTTP threads are allocated.
- Reduce the number of applications running on the Domino® server.
- Reduce the number of HCL Traveler users on the machine.
- Issue the tell traveler mem command to see the history of memory and CPU usage.
- Java™ memory usage
- Problem threshold:
- Yellow: Java™ Memory usage is greater than 75%
- Red: Java™ Memory usage is greater than 85%
- Trusted server causing yellow status
- Problem threshold:
- Yellow: Mail server {MailServerName} does not have the HCL Traveler server {TravelerServerName} in the trusted server list.
Other checks
- CPU usage
- Checks the current data to see if the system is overworked. The code checks from the present back through one complete interval. On average, the time period used for measuring the CPU utilization is 1.5 times the interval length. By default, the interval is 15 minutes.
Problem threshold:
- Yellow: CPU threshold is 70%.
- Red: CPU threshold is 90%.
Console Message:
The HCL Traveler's CPU usage is {current percentage} percent over the last {minutes} minutes of processing.
Corrective actions:
- Reduce the number of applications running on the Domino® server.
- Reduce the number of HCL Traveler users on the machines.
- Issue the tell traveler mem command to see the history of memory and CPU usage.
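The 1.5 times figure in the CPU usage description can be reproduced with a little arithmetic. The sketch below assumes one reading of that description: the measured window is the partly elapsed current interval plus the one complete interval before it, and the check is equally likely to run at any point in the current interval. Both function names are hypothetical.

```python
def cpu_window_minutes(minutes_into_interval, interval_minutes=15):
    """Window length: the partial current interval plus one full interval."""
    return minutes_into_interval + interval_minutes


def average_window_minutes(interval_minutes=15, steps=1000):
    """Average the window over evenly spaced points in the current interval."""
    offsets = [interval_minutes * i / steps for i in range(steps + 1)]
    windows = [cpu_window_minutes(o, interval_minutes) for o in offsets]
    return sum(windows) / len(windows)
```

Under these assumptions the window ranges from 15 to 30 minutes with the default interval, and its average is 22.5 minutes, which is 1.5 times 15.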
- Error messages logged
- Checks whether the number of error messages logged for a user has reached the threshold. These thresholds are monitored per person, not for all users on the system.
Problem threshold:
- Yellow: A user's error count is greater than 50 errors
- Red: A user's error count is greater than 100 errors
Console Message:
{0} errors have been logged for user {1}.
- Database open response times
- Checks the time taken to open databases on a given mail server.
Problem thresholds:
- Yellow: 10% of the opens are above the "Yellow Open Threshold"
- Red: 5% of the opens are above the "Red Open Threshold"
Console Message:
The response times for opening databases on mail server {mail server name} are above the acceptable threshold.
Probable cause: Network delays between the HCL Traveler server and the mail server.
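The database open check works on a time histogram: opens are counted per bucket, and the share of opens in the slow buckets is compared with a percentage limit. The sketch below illustrates that calculation using the bucket boundaries listed for NTS_STATUS_DB_OPEN_INTERVAL_YELLOW in the parameter table. It is an illustration under assumptions: the function name is hypothetical, the interval index is treated as zero-based, and the bucket at the index itself is counted as slow. The real indexing convention is not stated in this document.

```python
# Bucket names (milliseconds) as listed in the parameter table.
BUCKETS = [
    "00000-00100", "00100-00200", "00200-00300", "00300-00400",
    "00400-00600", "00600-00800", "00800-01000", "01000-02000",
    "02000-05000", "05000-10000", "10000-30000", "30000-60000",
    "60000-Inf",
]


def percent_slow_opens(counts, interval_index=6):
    """Percentage of opens that fall in buckets at or above interval_index.

    counts holds one open count per entry in BUCKETS.
    Assumption: interval_index is zero-based and inclusive.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    slow = sum(counts[interval_index:])
    return 100.0 * slow / total


# 900 fast opens, 60 in "00800-01000" and 40 in "10000-30000".
example_counts = [900, 0, 0, 0, 0, 0, 60, 0, 0, 0, 40, 0, 0]
example_pct = percent_slow_opens(example_counts)
```

In the example, 100 of 1000 opens are slow, so the result is 10.0 percent.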
- Free disk space
- Checks that there is adequate free disk space on the HCL Traveler server. Applies to both the data directory and the log directory, as indicated by the *_DATA_DIR_FREE_* and *_LOG_DIR_FREE_* parameters. By default, the log directory is contained under the data directory, but the administrator can move the log directory to a different disk.
Problem threshold:
- Yellow: Less than 15% free disk space
- Red: Less than 5% free disk space
Console Message:
Disk space for {location} has {%} percent free.
Corrective action: Remove unneeded files to increase free disk space.
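To see roughly what the free disk space check would report before the console does, the same percentage can be computed with the Python standard library. This is a minimal sketch, not the Traveler implementation: the function names are hypothetical, and only the percentage thresholds are modelled, not the gigabyte thresholds from the parameter table.

```python
import shutil


def free_percent(path):
    """Percentage of free space on the disk that holds path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_status(free_pct, yellow_pct=15, red_pct=5):
    """Apply the documented default percentages to a free-space value."""
    if free_pct < red_pct:
        return "Red"
    if free_pct < yellow_pct:
        return "Yellow"
    return "Green"
```

For example, disk_status(free_percent(data_dir)) could be run against the Domino data directory and again against the log directory if it was moved to a different disk; data_dir here stands for whatever path applies on your server.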
- Time difference between database server and Traveler server.
- Checks that the database server time and the Traveler server time are properly configured. The code compares the current time of the Traveler server with the current time of the database server.
- Expiring APNS Certificates
- Checks that enabled APNS certificates are not approaching expiration. The code takes
the current date and checks that it is within the specified number of days from
expiration.
Problem Thresholds
- Yellow: 60 days from expiration
- Red: 7 days from expiration
Console Message
- The APNS certificate for {APNS Provider description} expires on {Expiration date}.
- The APNS certificate for {APNS Provider description} has expired.
Corrective Action
Ensure that the latest Traveler version is installed or that third party certificates are up to date.
- Expiring SSL Certificates
- Checks that the SSL certificates are not approaching expiration if SSL is enabled.
The code takes the current date and checks that it is within the specified number of
days from expiration.
Problem Thresholds
Yellow: 30 days from expiration.
Red: 7 days from expiration.
Console Message
Traveler Server SSL certificate with alias '{certificate alias}' in file {key store file name} expires on {expiration date}.
Corrective Action
Update SSL Certificates to their newest version. This only applies to server to server communication. For more information see Enable server to server secure communications (optional).
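Both certificate checks compare the number of days left before expiration with a Yellow and a Red limit. The sketch below shows that comparison with the documented defaults (60 and 7 days for APNS, 30 and 7 days for SSL). The function name is hypothetical, and whether the real check treats a certificate exactly on the limit as inside or outside the threshold is an assumption.

```python
from datetime import date


def certificate_status(expires_on, today, yellow_days, red_days=7):
    """Classify a certificate by the days remaining until it expires.

    Assumption: a certificate exactly on a limit counts as inside it.
    Expired certificates have negative days left and are reported as Red.
    """
    days_left = (expires_on - today).days
    if days_left <= red_days:
        return "Red"
    if days_left <= yellow_days:
        return "Yellow"
    return "Green"


# A certificate with 59 days left, checked against both sets of defaults.
apns = certificate_status(date(2025, 3, 1), date(2025, 1, 1), yellow_days=60)
ssl_cert = certificate_status(date(2025, 3, 1), date(2025, 1, 1), yellow_days=30)
```

The same expiration date is Yellow under the APNS defaults but still Green under the SSL defaults, because the SSL Yellow window is shorter.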
- Tracking HTTP Task Status
- Checks whether the HTTP task is running after the Traveler server has started.
Problem Thresholds
Yellow: The HTTP task is still not running 45 seconds after the Traveler server started.
Red: The HTTP task is still not running 180 seconds after the Traveler server started.
Console Message
HTTP task is not running
Corrective Action
Start HTTP if not started.
Analyze why HTTP is slow to start or not starting and take corrective action as needed.
Constraint processing
Constraint processing is proactive code that monitors the system to see whether it has entered a resource constraint state. The system enters a constrained state when system memory or database connections exceed a given threshold. Once the constraint state is detected, HCL Traveler does not allow new device sync or prime sync threads to start, and new device syncs are denied with the 503 status code (server is busy). Threads that are already running are allowed to complete, which may be enough to relieve the constraint condition. If the constraint condition persists, the existing HCL Traveler thread pool logic kills the additional unused threads, further reducing the system's memory footprint. The minimum number of prime sync threads is 5 and the minimum number of device sync threads is 10.
The system logs information-level messages, with thread summary information, when entering and exiting the constraint state. Whenever a constraint state lasts longer than 60 minutes, an error message is logged and a system dump is executed.
The system enters constraint mode when memory conditions hit the Red state, and exits when usage is 5% below the Red entry level. By default, the system enters constraint when native memory usage is greater than NTS_STATUS_MEMORY_NATIVE_RED, which is 95% by default, or when Java™ memory usage is greater than NTS_STATUS_MEMORY_JAVA_RED, which is 85% by default. With these defaults, the system exits constraint when native memory usage is below 90% and Java™ memory usage is below 80%.
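The enter and exit rule is a form of hysteresis: the state is entered at the Red level but is left only once usage has dropped a further margin below it, which avoids flapping around a single limit. The sketch below is a minimal illustration, not HCL Traveler code. The class name is hypothetical, the defaults are the Red thresholds and the exit delta (NTS_STATUS_MEMORY_EXIT_CONSTRAINT_DELTA) from the parameter table, and only memory is modelled, not database connections.

```python
class ConstraintMonitor:
    """Tracks a constrained state with separate enter and exit levels."""

    def __init__(self, native_red=95, java_red=85, exit_delta=5):
        self.native_red = native_red
        self.java_red = java_red
        self.exit_delta = exit_delta
        self.constrained = False

    def update(self, native_pct, java_pct):
        """Feed the latest memory percentages and return the new state."""
        if not self.constrained:
            # Enter when either kind of memory is above its Red level.
            if native_pct > self.native_red or java_pct > self.java_red:
                self.constrained = True
        else:
            # Exit only when both are below their Red level minus the delta.
            native_ok = native_pct < self.native_red - self.exit_delta
            java_ok = java_pct < self.java_red - self.exit_delta
            if native_ok and java_ok:
                self.constrained = False
        return self.constrained
```

For example, after update(96, 50) the monitor is constrained, it stays constrained at update(92, 50), and it leaves the state at update(89, 50). While constrained, a caller would answer new device syncs with status code 503.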
Stats
GetAlarm.Time.Histogram
NameLookup.Time.Histogram
DCA.DB_OPEN
DCA.DB_CLOSE
ERRORS.<UserId>
CPU.Pct.<% CPU Range in 10% increments> (000-010, 010-020, and so on)
DATABASE.QUERY.HISTOGRAM<SimpleName>.(000-001,001-002,002-005, and so on)
Configuration parameters
notes.ini parameters required to change the thresholds.

Parameter name | Default | Description |
---|---|---|
NTS_STATUS_APNS_CERTIFICATE_EXPIRATION_YELLOW | 60 days | If APNS notifications are enabled, this is the threshold for yellow status for the number of days remaining before the APNS certificate expiration date. |
NTS_STATUS_APNS_CERTIFICATE_EXPIRATION_RED | 7 days | If APNS notifications are enabled, this is the threshold for red status for the number of days remaining before the APNS certificate expiration date. |
NTS_STATUS_CPU_PCT_RED_THRESHOLD | 90 | Red CPU percentage threshold. |
NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD | 70 | Yellow CPU percentage threshold. |
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_RED | 5 | Red threshold for gigabytes of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_YELLOW | 10 | Yellow threshold for gigabytes of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_RED | 5 | Red threshold for percentage of free space on the data directory. |
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_YELLOW | 15 | Yellow threshold for percentage of free space on the data directory. |
NTS_STATUS_DB_ACCESS_INTERVAL | 6 | Defines which histogram bucket for the Database.Query.Histogram stat is considered acceptable. Any entries in buckets longer than this setting will count towards the percentage for Yellow or Red status. |
NTS_STATUS_DB_ACCESS_PCT_OVER_RED | 5 | Sets the status to red if more than this percentage of Database.Query.Histogram entries fall in a bucket higher than the one defined in NTS_STATUS_DB_ACCESS_INTERVAL. |
NTS_STATUS_DB_ACCESS_PCT_OVER_YELLOW | 2 | Sets the status to yellow if more than this percentage of Database.Query.Histogram entries fall in a bucket higher than the one defined in NTS_STATUS_DB_ACCESS_INTERVAL. |
NTS_STATUS_DB_OPEN_INTERVAL_YELLOW | 6 | Lower time limit interval index to open Databases in GENERAL_TIME_HISTOGRAM_BOUNDARIES_NAMES. The intervals are (in milliseconds) "00000-00100", "00100-00200", "00200-00300", "00300-00400", "00400-00600", "00600-00800", "00800-01000", "01000-02000", "02000-05000", "05000-10000", "10000-30000", "30000-60000", "60000-Inf". |
NTS_STATUS_DB_OPEN_PCT_OVER_YELLOW | 5 | Percentage over the NTS_STATUS_DB_OPEN_INTERVAL_YELLOW to set status to Yellow. |
NTS_STATUS_DB_TIME_DIFFERENCE_YELLOW_THRESHOLD | 60000 milliseconds | Time difference threshold in milliseconds between the database server and the traveler server for a yellow status. |
NTS_STATUS_DB_TIME_DIFFERENCE_RED_THRESHOLD | 900000 milliseconds | Time difference threshold in milliseconds between the database server and the traveler server for a red status. |
NTS_STATUS_DS_FAILURE_503_RED | 10 | Percentage of threads failing with a 503 error message to be considered in Red state. |
NTS_STATUS_DS_FAILURE_503_YELLOW | 5 | Percentage of threads failing with a 503 error message to be considered in Yellow state. |
NTS_STATUS_DS_FAILURE_NON_503_RED | 10 | Percentage of threads failing with a non-503 error message to be considered in Red state. |
NTS_STATUS_DS_FAILURE_NON_503_YELLOW | 5 | Percentage of threads failing with a non-503 error message to be considered in Yellow state. |
NTS_STATUS_ERROR_COUNT_RED | 100 | For each user, if their error count is above this value, their status will be set to Red. |
NTS_STATUS_ERROR_COUNT_YELLOW | 50 | For each user, if their error count is above this value, the status will be set to Yellow. |
NTS_STATUS_HTTP_NOT_RUNNING_RED | 180 | Seconds to wait for checking whether HTTP task is loaded or not before a red status message. |
NTS_STATUS_HTTP_NOT_RUNNING_YELLOW | 45 | Seconds to wait for checking whether HTTP task is loaded or not before a yellow status message. |
NTS_STATUS_HTTP_THREAD_PCT_RED | 90 | If the peak HTTP thread usage is above this limit, the status will be set to Red. |
NTS_STATUS_HTTP_THREAD_PCT_YELLOW | 80 | If the peak HTTP thread usage is above this limit, the status will be set to Yellow. |
NTS_STATUS_IPC_DELAY_TIME_PCT_YELLOW | 95 | IPC.DelayTime statistics are a histogram measuring the delay for sending objects between HTTP and HCL Traveler. The status will be set to yellow if the number in the smallest histogram IPC.DelayTime bucket is over this percentage. |
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_RED | 5 | Red threshold for gigabytes of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_YELLOW | 10 | Yellow threshold for gigabytes of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_RED | 5 | Red threshold for percentage of free space on the logging directory. |
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_YELLOW | 15 | Yellow threshold for percentage of free space on the logging directory. |
NTS_STATUS_MEMORY_EXIT_CONSTRAINT_DELTA | 5 | When high memory usage causes HCL Traveler to enter the constrained state, the current memory usage must be below the limit set here before the constrained state can be exited. |
NTS_STATUS_MEMORY_JAVA_RED | 85 | Red Java™ memory percentage threshold. |
NTS_STATUS_MEMORY_JAVA_YELLOW | 75 | Yellow Java™ memory percentage threshold. |
NTS_STATUS_MEMORY_NATIVE_RED | 95 | Red native memory percentage threshold. |
NTS_STATUS_MEMORY_NATIVE_YELLOW | 85 | Yellow native memory percentage threshold. |
NTS_STATUS_MINIMUM_SAMPLES | 100 | The minimum number of samples that must be taken before the percentages are computed for red/yellow status. |
NTS_STATUS_SSL_CERT_EXPIRATION_RED | 7 | If NTS_SSL is true, this is the threshold for red status for the number of days remaining before the SSL certificate expiration date. |
NTS_STATUS_SSL_CERT_EXPIRATION_YELLOW | 30 | If NTS_SSL is true, this is the threshold for yellow status for the number of days remaining before the SSL certificate expiration date. |
NTS_STATUS_THREAD_MAX_RUN_RED | 120 | If a thread runs longer than this number of minutes, the state is considered to be Red. |
NTS_STATUS_THREAD_MAX_RUN_YELLOW | 30 | If a thread runs longer than this number of minutes, the state is considered to be Yellow. |
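
When auditing a server, it can help to see which thresholds are overridden in notes.ini and which still use the defaults from the table. The sketch below merges NAME=value lines with a handful of those defaults. It is an illustration only: the function name is hypothetical, only a few parameters are included, and it assumes that each override is written on one line as NAME=value with an integer value.

```python
# A few defaults copied from the parameter table above.
DEFAULTS = {
    "NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD": 70,
    "NTS_STATUS_CPU_PCT_RED_THRESHOLD": 90,
    "NTS_STATUS_MEMORY_NATIVE_YELLOW": 85,
    "NTS_STATUS_MEMORY_NATIVE_RED": 95,
    "NTS_STATUS_THREAD_MAX_RUN_YELLOW": 30,
    "NTS_STATUS_THREAD_MAX_RUN_RED": 120,
}


def effective_thresholds(notes_ini_text, defaults=DEFAULTS):
    """Return the defaults with any overrides found in the text applied."""
    values = dict(defaults)
    for raw in notes_ini_text.splitlines():
        name, sep, value = raw.strip().partition("=")
        name = name.strip()
        if not sep or name not in values:
            continue  # not a NAME=value line, or not a tracked parameter
        try:
            values[name] = int(value.strip())
        except ValueError:
            pass  # keep the default when the value is not an integer
    return values


example = effective_thresholds(
    "NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD=75\nServerTasks=Replica,Router\n"
)
```

In the example, the CPU Yellow threshold becomes 75 while every other value keeps its default, and the unrelated ServerTasks line is ignored.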
Performance considerations
Highly efficient performance of the health check itself is not absolutely critical, because it runs only periodically (every 15 minutes by default). However, because it runs repeatedly, the process should be as efficient as possible. The new method for determining whether the system is in constraint state is critical to performance, because it executes each time a new device sync begins.
The other critical piece for performance is the collection of additional stats. Because the current procedure already batch writes stats, the extra stats should not cause any further performance degradation.
Java™ memory usage will be moderate: a cache holds the CPU and memory statistics, which are retrieved every 15 minutes, for a total of 100 entries. This is only a small amount of memory compared to the memory usage of the system as a whole.
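A cache capped at 100 entries behaves like a fixed-size ring buffer: once it is full, each new sample pushes out the oldest one, so memory use stays constant. The sketch below shows the idea with a bounded deque from the Python standard library. It is an analogy, not the Traveler implementation, and the class name is hypothetical.

```python
from collections import deque


class StatsCache:
    """Keeps only the most recent CPU and memory samples."""

    def __init__(self, max_entries=100, interval_minutes=15):
        self.interval_minutes = interval_minutes
        self.samples = deque(maxlen=max_entries)

    def record(self, cpu_pct, memory_pct):
        """Add a sample; the oldest one is dropped when the cache is full."""
        self.samples.append((cpu_pct, memory_pct))

    def hours_covered(self):
        """History length represented by the cached samples."""
        return len(self.samples) * self.interval_minutes / 60
```

With the documented values, a full cache of 100 samples taken 15 minutes apart covers 25 hours of history.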