Status commands

To display the status of the HCL Traveler server, use the tell traveler status command.

If you run the command when the overall status is Green, the only message the system displays is HCL Traveler overall status is GREEN. When the status is Yellow or Red, the system displays all the conditions causing noncompliance. The returned messages include both the reason for the noncompliance and the probable cause of the failure (when available). This status information is also included in the output of the systemdump command.

The following example shows what the results may look like when the overall status is Red:

tell traveler status
The HCL Traveler task has been running since Tue Jun 15 17:08:37 EDT 2010.
The last successful device sync was on Mon Jun 21 06:43:01 EDT 2010.
Yellow Status Messages
The response times for opening databases on mail server CN=Mail1/O=Test are above the acceptable threshold.
The response times for opening databases on mail server CN=Mail7/O=Test are above the acceptable threshold.
Red Status Messages
17,238 errors have been logged for user CN=Joe Tester/OU=Test/O=HCL. 
There have been 3,845 device sync failures for reasons other than the server is too busy.
The overall status of HCL Traveler is Red.
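For monitoring scripts, the output above can be grouped by severity. The following is a minimal sketch, assuming the console text has already been captured as a string; the function name parse_status and the returned dictionary layout are illustrative and not part of HCL Traveler.

```python
import re

SAMPLE = """\
The HCL Traveler task has been running since Tue Jun 15 17:08:37 EDT 2010.
The last successful device sync was on Mon Jun 21 06:43:01 EDT 2010.
Yellow Status Messages
The response times for opening databases on mail server CN=Mail1/O=Test are above the acceptable threshold.
Red Status Messages
17,238 errors have been logged for user CN=Joe Tester/OU=Test/O=HCL.
There have been 3,845 device sync failures for reasons other than the server is too busy.
The overall status of HCL Traveler is Red.
"""

def parse_status(text):
    """Group status lines by severity and extract the overall status (hypothetical helper)."""
    result = {"overall": None, "yellow": [], "red": []}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "Yellow Status Messages":
            section = "yellow"
        elif line == "Red Status Messages":
            section = "red"
        else:
            # Matches both "HCL Traveler overall status is GREEN"
            # and "The overall status of HCL Traveler is Red."
            match = re.search(r"overall status (?:of HCL Traveler )?is (\w+)",
                              line, re.IGNORECASE)
            if match:
                result["overall"] = match.group(1).capitalize()
            elif section:
                result[section].append(line)
    return result

status = parse_status(SAMPLE)
print(status["overall"], len(status["yellow"]), len(status["red"]))
```

Lines that appear before the first severity heading (task start time, last successful sync) are ignored by this sketch.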

Threadchecks

The threshold values specified in these sections are default values. The red and yellow thresholds can be customized using configuration files. The configuration parameters are detailed later in this document.

DS or PS threads have run for a "long period" of time

Problem threshold:

Yellow: Wall clock run time is greater than 30 minutes
Red: Wall clock run time is greater than 120 minutes

Console Message: User {User name} on thread {thread name} has been running for {xx} minutes.

Probable cause: If the Red threshold is reached, the thread is likely hung. In rare instances, a device sync or an extremely long prime sync may legitimately run this long because it is working against a very large user database or a slow mail server.

Corrective actions:

Persistent yellow conditions might indicate a slow mail server or an overloaded Traveler server. Monitor and look for other status conditions that might have a better indication of a diagnosis.
On the first occurrence, take a system dump, which includes information about all of the threads in the Traveler service. Use tell traveler systemdump and run an nsd at the Domino command line to gather native stacks. Collect the logs.
Restart the Traveler service. There is a good chance this will require a complete Domino® server restart, and you may need to kill the Domino® server in order for it to shut down completely.
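The run-time thresholds above amount to a simple comparison. The following is a minimal sketch using the default values of 30 and 120 minutes; the function names and the thread name DS-1 are hypothetical and only illustrate the logic.

```python
def thread_status(run_minutes, yellow=30, red=120):
    """Classify a DS or PS thread by wall clock run time in minutes."""
    if run_minutes > red:
        return "Red"
    if run_minutes > yellow:
        return "Yellow"
    return "Green"

def console_message(user, thread, run_minutes):
    """Format the console message shown for a long-running thread."""
    return f"User {user} on thread {thread} has been running for {run_minutes} minutes."

minutes = 45
if thread_status(minutes) != "Green":
    # "DS-1" is a made-up thread name used only for illustration.
    print(console_message("CN=Joe Tester/OU=Test/O=HCL", "DS-1", minutes))
```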

Percentage of Device Syncs that failed with 503 return code

Problem threshold:

Yellow: The number of 503 syncs is more than 5%.
Red: The number of 503 syncs is more than 10%.

Console Message: There have been {number of 503 RC} device sync failures because the server is too busy and returned status code 503.

Probable Cause: The most probable cause is that the server is running over capacity. 503 means that there are no threads available to handle a synchronization request, and the Traveler server continues to allocate threads until it becomes resource constrained.

Corrective actions: Either increase the memory, or move some of the users to another HCL Traveler server.
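The percentage check can be sketched as follows. This is an illustration only: it assumes that the percentage is the number of 503 failures divided by the total number of device syncs, and that no status is computed until NTS_STATUS_MINIMUM_SAMPLES (100 by default) syncs have been observed. The function name is hypothetical.

```python
def sync_failure_status(total_syncs, failed_syncs,
                        yellow_pct=5, red_pct=10, minimum_samples=100):
    """Return (status, percentage) for device sync failures.

    Assumption: below minimum_samples the percentage is not computed
    and the check reports Green.
    """
    if total_syncs < minimum_samples:
        return "Green", None
    pct = 100.0 * failed_syncs / total_syncs
    if pct > red_pct:
        return "Red", pct
    if pct > yellow_pct:
        return "Yellow", pct
    return "Green", pct

print(sync_failure_status(1000, 60))   # 6% of syncs returned 503
print(sync_failure_status(50, 40))     # too few samples to judge
```

The same comparison applies to failures with other error codes, using the NTS_STATUS_DS_FAILURE_NON_503_* defaults.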

Percentage of Device Syncs that failed with an error code other than 503

Problem threshold:

Yellow: The number of unsuccessful syncs is more than 5%.
Red: The number of unsuccessful syncs is more than 10%.

Console Message: There have been {number of error code other than 503 RC} device sync failures for reasons other than the server is too busy.

Probable cause: There are network connectivity issues between HCL Traveler server and the user's device(s).

HTTP thread allocations

Problem threshold:

Yellow: The peak or current number of connections is greater than 80% of HTTP threads.
Red: The peak or current number of connections is greater than 90% of HTTP threads.

Console Message:

The number of active HTTP connections is {current percentage} percent of the available HTTP threads ({HTTP Threads}).
The peak number of HTTP connections is {peak percentage} percent of the available HTTP threads ({HTTP Threads}).

Probable cause: This condition implies that there are not enough HTTP threads for the number of devices trying to use the HCL Traveler server.

Corrective actions:

Increase the number of HTTP threads if there are enough memory and CPU resources.
Move some of the users to another HCL Traveler server.

Memory checks

The threshold values specified are default values. The red and yellow thresholds can be customized using configuration files. The configuration parameters are detailed later in this document.

Native memory usage

Problem threshold:

Yellow: Native Memory usage is greater than 85%
Red: Native Memory usage is greater than 95%

Console Message: The current native memory usage is {current percentage} percent of the available memory.

Probable cause: Native memory includes memory that is shared with other Domino® applications on the Domino® server.

Corrective actions:

Verify whether too many HTTP Threads are allocated.
Reduce the number of applications running on the Domino® server.
Reduce the number of HCL Traveler users on the machines.
Issue the tell traveler mem command to see the history of memory and CPU usage.

Java™ memory usage

Problem threshold:

Yellow: Java™ Memory usage is greater than 75%
Red: Java™ Memory usage is greater than 85%

Console Message:

The current Java memory usage is {current percentage} percent of the available memory.

Probable cause: Not enough Java™ heap memory for the number of users on the system.

Corrective actions:

Issue the tell traveler mem command to see the history of memory and CPU usage.
Increase the Maximum Memory Size in the Domino® server document under the HCL Traveler tab.

Trusted server causing yellow status

Problem threshold:

Yellow: Mail server {MailServerName} does not have the HCL Traveler server {TravelerServerName} in the trusted server list.

Corrective actions:

Add {TravelerServerName} to the "Trusted Servers" list on the Security tab in the {MailServerName} server document.

Other checks

CPU usage

Checks the current data to see if the system is overworked. The code checks from the present back through one complete interval. On average, the time period used for measuring the CPU utilization is 1.5 times the interval length. By default, the interval is 15 minutes.
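The 1.5 factor can be worked through with a short calculation. This sketch assumes one reading of the description above: the measured window is the elapsed part of the current interval plus one complete interval before it, so with the default 15-minute interval it ranges from 15 to 30 minutes. The function name is hypothetical.

```python
def cpu_window_minutes(minutes_into_current_interval, interval=15):
    """Length of the measured window under the assumption stated above:
    one complete interval plus the elapsed part of the current one."""
    return interval + minutes_into_current_interval

# Sample the window at each whole minute of the current interval (0..15).
windows = [cpu_window_minutes(m) for m in range(0, 16)]
average = sum(windows) / len(windows)

print(min(windows), max(windows), average)  # shortest, longest, mean window
print(average / 15)                          # mean window as a multiple of the interval
```

Under this assumption the mean window is 22.5 minutes, which is 1.5 times the 15-minute interval.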

Problem threshold:

Yellow: CPU threshold is 70%.
Red: CPU threshold is 90%.

Console Message: The HCL Traveler's CPU usage is {current percentage} percent over the last {minutes} minutes of processing.

Corrective actions:

Reduce the number of applications running on the Domino® server.
Reduce the number of HCL Traveler users on the machines.
Issue the tell traveler mem command to see the history of memory and CPU usage.

Error messages logged

Checks to see if the number of error messages logged for a user has reached the threshold. These thresholds are monitored per person, not for all users on the system.

Problem Threshold:

Yellow: A user's error count is greater than 50 errors
Red: A user's error count is greater than 100 errors

Console Message: {number of errors} errors have been logged for user {user name}.

Database open time

Checks the time taken to open databases on a given mail server.

Problem Thresholds:

Yellow: 10% of the opens are above the "Yellow Open Threshold"
Red: 5% of the opens are above the "Red Open Threshold"

Console Message: The response times for opening databases on mail server {mail server name} are above the acceptable threshold.

Probable Cause: Check for network delays between the HCL Traveler server and mail server.
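The percentage of slow opens can be derived from the open-time histogram. The following is a sketch under stated assumptions: the bucket boundaries are the ones listed for NTS_STATUS_DB_OPEN_INTERVAL_YELLOW in the configuration parameters table, the interval index is zero-based, and opens in the bucket at that index or any later bucket count as slow. Neither the indexing convention nor the function name comes from the product.

```python
# Histogram buckets in milliseconds, as listed in the configuration table.
BUCKETS = [
    "00000-00100", "00100-00200", "00200-00300", "00300-00400",
    "00400-00600", "00600-00800", "00800-01000", "01000-02000",
    "02000-05000", "05000-10000", "10000-30000", "30000-60000",
    "60000-Inf",
]

def percent_slow_opens(counts, interval_index=6):
    """Percentage of database opens that fall in the bucket at
    interval_index or later (zero-based index is an assumption)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    slow = sum(counts[interval_index:])
    return 100.0 * slow / total

# Hypothetical open counts, one per bucket.
counts = [900, 50, 20, 10, 5, 5, 4, 3, 2, 1, 0, 0, 0]
pct = percent_slow_opens(counts)
print(BUCKETS[6], pct)
print("Yellow" if pct > 5 else "Green")  # 5 is the NTS_STATUS_DB_OPEN_PCT_OVER_YELLOW default
```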

Free disk space

Checks that there is adequate free disk space on the HCL Traveler server. Applies to both the data directory and the log directory as indicated by the *_DATA_DIR_FREE_* and *_LOG_DIR_FREE_* parameters. By default, the log directory is contained under the data directory, but it is possible for the administrator to move the log directory to a different disk.

Problem threshold:

Yellow: Less than 15% Free Disk Space
Red: Less than 5% Free Disk Space

Console Message: Disk space for {location} has {%} percent free.

Corrective action: Remove unneeded files to increase free disk space.
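To see how much free space a directory has before the check reports a problem, the percentage can be computed with the Python standard library. This is a minimal sketch of the percentage comparison only, using the default thresholds of 15% and 5%; the gigabyte-based parameters are not covered, and the function names are hypothetical.

```python
import shutil

def free_percent(path):
    """Percentage of free space on the disk that holds path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def disk_status(pct_free, yellow=15, red=5):
    """Classify free disk space against the percentage thresholds."""
    if pct_free < red:
        return "Red"
    if pct_free < yellow:
        return "Yellow"
    return "Green"

# "." stands in for the data or log directory of the server.
pct = free_percent(".")
print(f"Disk space for . has {pct:.0f} percent free. Status: {disk_status(pct)}")
```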

Time difference between database server and Traveler server

Checks that the database server time and the Traveler server time are properly configured. The code compares the difference between the current time of the Traveler server with the current time of the database server.

Problem Thresholds

Yellow: 60000 milliseconds difference between database server time and Traveler server time
Red: 900000 milliseconds difference between database server time and Traveler server time

Console Message

The database time {database time} is outside the configured threshold {red or yellow threshold} (milliseconds) from the Traveler server time {Traveler server time}.

Corrective Action

Check that the Traveler Server and database server are in the same time zone.
Synchronize the time between the Traveler Server and the database server such that the time difference is less than one minute.

Expiring APNS Certificates

Checks that enabled APNS certificates are not approaching expiration. The code takes the current date and checks that it is within the specified number of days from expiration.

Problem Thresholds

Yellow: 60 days from expiration
Red: 7 days from expiration

Console Message

The APNS certificate for {APNS Provider description} expires on {Expiration date}.
The APNS certificate for {APNS Provider description} has expired.

Corrective Action

Ensure that the latest Traveler version is installed or that third party certificates are up to date.
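The expiration check can be sketched as a comparison of days remaining against the thresholds. This example assumes that a certificate exactly at a threshold counts as inside it and that an expired certificate is reported as Red; both are assumptions, and the function name is hypothetical.

```python
from datetime import date

def apns_cert_status(expires_on, today, yellow_days=60, red_days=7):
    """Return (status, days_left) for a certificate expiration date.

    Assumptions: thresholds are inclusive, and an already expired
    certificate (negative days_left) is Red.
    """
    days_left = (expires_on - today).days
    if days_left <= red_days:
        return "Red", days_left
    if days_left <= yellow_days:
        return "Yellow", days_left
    return "Green", days_left

today = date(2024, 1, 1)
print(apns_cert_status(date(2024, 1, 5), today))   # 4 days left
print(apns_cert_status(date(2024, 2, 1), today))   # 31 days left
print(apns_cert_status(date(2024, 6, 1), today))   # 152 days left
```

The SSL certificate check follows the same pattern with thresholds of 30 and 7 days.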

Expiring SSL Certificates

Checks that the SSL certificates are not approaching expiration if SSL is enabled. The code takes the current date and checks that it is within the specified number of days from expiration.

Problem Thresholds

Yellow: 30 days from expiration
Red: 7 days from expiration

Console Message

Traveler Server SSL certificate with alias '{certificate alias}' in file {key store file name} expires on {expiration date}.

Corrective Action

Update SSL Certificates to their newest version. This only applies to server to server communication. For more information see Enable server to server secure communications (optional).

Constraint processing

Constraint processing is proactive code that monitors the system to see whether it has entered a resource-constrained state. The system enters a constrained state when system memory or database connections exceed a given threshold. Once the constraint state is detected, HCL Traveler does not allow new device sync or prime sync threads to start, and new device syncs are denied with the 503 status code (server is busy). Threads that are already running are allowed to complete, which may alleviate the constraint condition. If the constraint condition persists, the existing HCL Traveler thread pool logic kills additional unused threads, further reducing the system's memory footprint. The minimum number of prime sync threads is 5 and the minimum number of device sync threads is 10.

The system logs information-level messages, with thread summary information, when entering and exiting the constraint state. Whenever a constraint state lasts longer than 60 minutes, an error message is logged and a system dump is executed.

The system enters constraint mode when memory conditions reach the Red state, and exits when usage is 5% below the Red entry level. By default, the system enters constraint when native memory usage is greater than NTS_STATUS_MEMORY_NATIVE_RED, which is 95% by default, or when Java™ memory usage is greater than NTS_STATUS_MEMORY_JAVA_RED, which is 85% by default. The system exits constraint when native memory usage is below 90% and Java™ memory usage is below 80%.
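The enter and exit rules form a hysteresis: the exit level sits below the entry level so that the state does not flap. The following is a minimal sketch, assuming that the exit level for each memory pool is its Red threshold minus NTS_STATUS_MEMORY_EXIT_CONSTRAINT_DELTA and that both pools must be below their exit level before the state clears. The class is hypothetical and does not model database connections.

```python
class ConstraintMonitor:
    """Track constraint state from native and Java memory percentages."""

    def __init__(self, native_red=95, java_red=85, exit_delta=5):
        self.native_red = native_red
        self.java_red = java_red
        self.exit_delta = exit_delta
        self.constrained = False

    def update(self, native_pct, java_pct):
        if not self.constrained:
            # Enter when either pool exceeds its Red threshold.
            if native_pct > self.native_red or java_pct > self.java_red:
                self.constrained = True
        else:
            # Exit only when both pools are below Red minus the delta (assumption).
            native_ok = native_pct < self.native_red - self.exit_delta
            java_ok = java_pct < self.java_red - self.exit_delta
            if native_ok and java_ok:
                self.constrained = False
        return self.constrained

monitor = ConstraintMonitor()
for native, java in [(96, 50), (92, 50), (89, 50)]:
    state = monitor.update(native, java)
    # While constrained, new device syncs would be denied with status code 503.
    print(native, java, "constrained" if state else "normal")
```

In this run the state stays constrained at 92% native memory, because that value is below the entry level but not yet below the exit level.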

Stats

GetAlarm.Time.Histogram
NameLookup.Time.Histogram
DCA.DB_OPEN
DCA.DB_CLOSE
ERRORS.<UserId>
CPU.Pct.<% CPU Range in 10% increments> (000-010, 010-020, and so on)
DATABASE.QUERY.HISTOGRAM<SimpleName>.(000-001,001-002,002-005, and so on)

Configuration parameters

The table below shows all of the notes.ini parameters required to change the thresholds.

Table 1. Configuration parameters
Parameter name	Default	Description
NTS_STATUS_APNS_CERTIFICATE_EXPIRATION_YELLOW	60 days	If APNS notifications are enabled, this is the threshold for yellow status for the number of days remaining before the APNS certificate expiration date.
NTS_STATUS_APNS_CERTIFICATE_EXPIRATION_RED	7 days	If APNS notifications are enabled, this is the threshold for red status for the number of days remaining before the APNS certificate expiration date.
NTS_STATUS_CPU_PCT_RED_THRESHOLD	90	Red CPU percentage threshold.
NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD	70	Yellow CPU percentage threshold.
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_RED	5	Red threshold for gigabytes of free space on the data directory.
NTS_STATUS_DATA_DIR_FREE_GIGABYTES_YELLOW	10	Yellow threshold for gigabytes of free space on the data directory.
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_RED	5	Red threshold for percentage of free space on the data directory.
NTS_STATUS_DATA_DIR_FREE_PERCENTAGE_YELLOW	15	Yellow threshold for percentage of free space on the data directory.
NTS_STATUS_DB_ACCESS_INTERVAL	6	Defines which histogram bucket for the `Database.Query.Histogram` stat is considered acceptable. Any entries in buckets longer than this setting will count towards the percentage for Yellow or Red status.
NTS_STATUS_DB_ACCESS_PCT_OVER_RED	5	Sets the status to red if the percentage of the `Database.Query.Histogram` stat is in a bucket higher than what is defined in `NTS_STATUS_DB_ACCESS_INTERVAL`.
NTS_STATUS_DB_ACCESS_PCT_OVER_YELLOW	2	Sets the status to yellow if the percentage of the `Database.Query.Histogram` stat is in a bucket higher than what is defined in `NTS_STATUS_DB_ACCESS_INTERVAL`.
NTS_STATUS_DB_OPEN_INTERVAL_YELLOW	6	Lower time limit interval index to open Databases in `GENERAL_TIME_HISTOGRAM_BOUNDARIES_NAMES`. The intervals are (in milliseconds) "00000-00100", "00100-00200", "00200-00300", "00300-00400", "00400-00600", "00600-00800", "00800-01000","01000-02000", "02000-05000","05000-10000", "10000-30000","30000-60000", "60000-Inf".
NTS_STATUS_DB_OPEN_PCT_OVER_YELLOW	5	Percentage over the `NTS_STATUS_DB_OPEN_INTERVAL_YELLOW` to set status to Yellow.
NTS_STATUS_DB_TIME_DIFFERENCE_YELLOW_THRESHOLD	60000 milliseconds	Time difference threshold in milliseconds between the database server and the traveler server for a yellow status.
NTS_STATUS_DB_TIME_DIFFERENCE_RED_THRESHOLD	900000 milliseconds	Time difference threshold in milliseconds between the database server and the traveler server for a red status.
NTS_STATUS_DS_FAILURE_503_RED	10	Percentage of threads failing with a 503 error message to be considered in Red state.
NTS_STATUS_DS_FAILURE_503_YELLOW	5	Percentage of threads failing with a 503 error message to be considered in Yellow state.
NTS_STATUS_DS_FAILURE_NON_503_RED	10	Percentage of threads failing with a non-503 error message to be considered in Red state.
NTS_STATUS_DS_FAILURE_NON_503_YELLOW	5	Percentage of threads failing with a non-503 error message to be considered in Yellow state.
NTS_STATUS_ERROR_COUNT_RED	100	For each user, if their error count is above this value, their status will be set to Red.
NTS_STATUS_ERROR_COUNT_YELLOW	50	For each user, if their error count is above this value, the status will be set to Yellow.
NTS_STATUS_HTTP_THREAD_PCT_RED	90	If the peak HTTP thread usage is above this limit, the status will be set to Red.
NTS_STATUS_HTTP_THREAD_PCT_YELLOW	80	If the peak HTTP thread usage is above this limit, the status will be set to Yellow.
NTS_STATUS_IPC_DELAY_TIME_PCT_YELLOW	95	`IPC.DelayTime` statistics are a histogram measuring the delay for sending objects between HTTP and HCL Traveler. The status will be set to yellow if the number in the smallest histogram `IPC.DelayTime` bucket is over this percentage.
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_RED	5	Red threshold for gigabytes of free space on the logging directory.
NTS_STATUS_LOG_DIR_FREE_GIGABYTES_YELLOW	10	Yellow threshold for gigabytes of free space on the logging directory.
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_RED	5	Red threshold for percentage of free space on the logging directory.
NTS_STATUS_LOG_DIR_FREE_PERCENTAGE_YELLOW	15	Yellow threshold for percentage of free space on the logging directory.
NTS_STATUS_MEMORY_EXIT_CONSTRAINT_DELTA	5	When high memory usage causes HCL Traveler to enter the constrained state, the current memory usage must drop below the Red threshold by at least the percentage set here before the constrained state can be exited.
NTS_STATUS_MEMORY_JAVA_RED	85	Red Java™ memory percentage threshold.
NTS_STATUS_MEMORY_JAVA_YELLOW	75	Yellow Java™ memory percentage threshold.
NTS_STATUS_MEMORY_NATIVE_RED	95	Red native memory percentage threshold.
NTS_STATUS_MEMORY_NATIVE_YELLOW	85	Yellow native memory percentage threshold.
NTS_STATUS_MINIMUM_SAMPLES	100	The minimum number of samples that must be taken before the percentages are computed for red/yellow status.
NTS_STATUS_SSL_CERT_EXPIRATION_RED	7	If NTS_SSL is true, this is the threshold for red status for the number of days remaining before the SSL certificate expiration date.
NTS_STATUS_SSL_CERT_EXPIRATION_YELLOW	30	If NTS_SSL is true, this is the threshold for yellow status for the number of days remaining before the SSL certificate expiration date.
NTS_STATUS_THREAD_MAX_RUN_RED	120	If a thread runs longer than this number of minutes, the state will be considered to be Red.
NTS_STATUS_THREAD_MAX_RUN_YELLOW	30	If a thread runs longer than this number of minutes, the state will be considered to be Yellow.
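The parameters in the table are set as NAME=value lines in notes.ini. The following sketch shows how overrides combine with the defaults; it is an illustration only, not how HCL Traveler reads its configuration. Only a few defaults from the table are included, and the function name is hypothetical.

```python
# A subset of the defaults from Table 1.
DEFAULTS = {
    "NTS_STATUS_CPU_PCT_RED_THRESHOLD": 90,
    "NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD": 70,
    "NTS_STATUS_THREAD_MAX_RUN_RED": 120,
    "NTS_STATUS_THREAD_MAX_RUN_YELLOW": 30,
}

def apply_overrides(ini_text, defaults=DEFAULTS):
    """Return the defaults updated with any matching NAME=value lines.

    Lines for settings that are not in defaults are ignored.
    """
    settings = dict(defaults)
    for line in ini_text.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name in settings:
            settings[name] = int(value.strip())
    return settings

# Example notes.ini fragment: raise the red CPU threshold to 95 percent.
ini = "ServerTasks=Replica,Router\nNTS_STATUS_CPU_PCT_RED_THRESHOLD=95\n"
settings = apply_overrides(ini)
print(settings["NTS_STATUS_CPU_PCT_RED_THRESHOLD"],
      settings["NTS_STATUS_CPU_PCT_YELLOW_THRESHOLD"])
```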

Performance considerations

Highly efficient performance of the health check is not absolutely critical, because it runs only periodically (every 15 minutes by default). However, because it is executed repeatedly, the process should be as efficient as possible. The new method for determining whether the system is in constraint state is critical to performance, because it executes each time a new device sync begins.

The other critical piece for performance is the collection of additional stats. Because the current procedure already writes stats in batches, adding more stats should not cause any further degradation of performance.

The effect on Java™ memory usage is moderate: CPU and memory statistics are retrieved every 15 minutes and held in a cache of up to 100 entries. This is a small amount compared with the memory usage of the system as a whole.