Managing Always On Availability Groups
The content (automation plans, tasks, and scripts) for managing Windows 2008 R2 and later clusters can also be used to monitor and manage Windows 2012 and later clusters with Always On Availability Groups (AAGs). To improve patching performance by targeting the appropriate node, include the monitoring tasks described in this section in your automation plans.
Monitoring and patching process
Console Operators can plan patching of Windows 2012 and later clusters with Always On Availability Groups (AAGs) in two phases. In the first phase, before a given maintenance window, they monitor the nodes in the environment and manually patch the secondary nodes. In the second phase, within the maintenance window, they target the primary and standalone nodes. Because only the active node is patched during the maintenance window, patching completes within the scheduled time limit without spillover.
Status Collector
Status Collector is a service implemented as a Windows scheduled task that collects important information about the cluster nodes. The task checks the status of the nodes to monitor the health of the failover cluster, identifying which nodes are primary and which are secondary.
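The exact query that the shipped script runs is not documented here; the following is a minimal PowerShell sketch of how a node can determine its own AAG role, assuming the SqlServer module and a default local SQL Server instance.

  # Query the local replica's role through the AAG dynamic management view.
  Import-Module SqlServer
  $query = "SELECT role_desc FROM sys.dm_hadr_availability_replica_states WHERE is_local = 1"
  $role = Invoke-Sqlcmd -ServerInstance $env:COMPUTERNAME -Query $query
  # role_desc is PRIMARY or SECONDARY; print it with a timestamp, as the collector
  # does when it writes its output file.
  '{0} {1} {2}' -f (Get-Date -Format o), $env:COMPUTERNAME, $role.role_desc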
Prerequisites
The following are the prerequisites to install Status Collector.
- Operating System: The cluster nodes must run Windows® 2012 or later and must be configured with AAGs.
- Database: The database on the nodes must be Microsoft SQL Server Enterprise edition only. For detailed prerequisites, restrictions, and recommendations for Always On Availability Groups, see https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/prereqs-restrictions-recommendations-always-on-availability?view=sql-server-ver16
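Before deploying, you can verify these prerequisites manually on a node. A minimal sketch, assuming a default SQL Server instance and the SqlServer PowerShell module:

  # Windows Server 2012 reports kernel version 6.2; later releases report higher.
  $os = Get-CimInstance Win32_OperatingSystem
  $osOk = [version]$os.Version -ge [version]'6.2'
  # SERVERPROPERTY('Edition') returns, for example, 'Enterprise Edition (64-bit)'.
  $edition = Invoke-Sqlcmd -ServerInstance $env:COMPUTERNAME -Query "SELECT SERVERPROPERTY('Edition') AS Edition"
  $sqlOk = $edition.Edition -like 'Enterprise*'
  "OS ok: $osOk; SQL Enterprise edition: $sqlOk"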
Status Collector Tasks
Status Collector runs on the cluster nodes. The scheduled task periodically executes a PowerShell script that queries the node and reports the required information through Analysis 176 - Status of the Always-On failover cluster. The following Tasks are available to install, edit, and uninstall Status Collector.
- Task 173 - Deploy Status collector for Windows cluster
- Run this Task to install Status Collector. The Task checks whether the Windows Task Scheduler service is enabled and set to Automatic. If so, it creates a new scheduled task that invokes the provided PowerShell script, with a trigger schedule that reflects the frequency chosen by the Console user. The PowerShell script gathers the node status and writes this information, together with the execution timestamp, to a file saved in a fixed location (the output file location is <BES Client folder>\Applications\StatusCollector). You can configure the settings (Polling time for script execution, DB Instance name, and Threshold for Logs folder) while installing Status Collector; a sketch of an equivalent scheduled-task registration appears after this list. Important:
- This Task is relevant only for computers that have both the SQL Server service and the Windows Cluster service installed. The applicability of the install Task also checks whether Status Collector is already scheduled in the Windows Task Scheduler.
- This is a specific scheduled task to support AAGs. If this task is deployed on a type of Windows cluster other than AAGs, it does not report any information.
- When the scheduled task is created, the user account used to run it is set to SYSTEM by default. If you want it to run under a different user account after the installation, ensure that the account has the privileges required to execute the PowerShell script. To configure the user account, open the Windows scheduled task and, under General→Security options, change it to the required value.
- Task 174 - Edit Status collector for Windows cluster
- With this Task, you can modify the settings (Polling time for script execution, DB Instance name, Threshold for Logs folder) configured during the installation of Status Collector through Task 173 - Deploy Status collector for Windows cluster. This Task reconfigures Status Collector on the fly, without the need to uninstall or redeploy it.
- Task 175 - Remove Status collector for Windows cluster
- This Task uninstalls a previously installed Status Collector. After you uninstall it, you can no longer monitor the nodes through Status Collector.
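As a rough illustration of what the deploy Task automates, the following sketch registers a comparable scheduled task. The script path and the 15-minute interval are assumptions; the real Task uses the PowerShell script it ships and the frequency you choose in the Console.

  # Ensure the Task Scheduler service is running and set to Automatic.
  $svc = Get-Service -Name Schedule
  if ($svc.StartType -ne 'Automatic') { Set-Service -Name Schedule -StartupType Automatic }
  if ($svc.Status -ne 'Running') { Start-Service -Name Schedule }
  # Hypothetical script path under the default BES Client folder.
  $script = 'C:\Program Files (x86)\BigFix Enterprise\BES Client\Applications\StatusCollector\StatusCollector.ps1'
  $action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument "-NoProfile -ExecutionPolicy Bypass -File `"$script`""
  $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15)
  Register-ScheduledTask -TaskName 'StatusCollector' -Action $action -Trigger $trigger -User 'SYSTEM' -RunLevel Highest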
Analysis to monitor the status
With Analysis 176 - Status of the Always-On failover cluster, you can monitor the roles of the nodes, such as Primary, Secondary, or Standalone. It reports the timestamp at which the active node last reported. Based on this, the Console Operator can decide which nodes to patch within and outside the maintenance window and ensure high availability. A sketch of how an operator might act on the reported roles follows this list. This analysis displays the following information.
- Computer Name
- Computer name of the node in a cluster.
- Role
- Role of the node, as configured by the user: HA_Primary, HA_Secondary, or Standalone.
- Failover mode description
- The failover mode of the node:
- Manual – If the HA_Primary node goes down, it does not switch over to a secondary node automatically. The user must manually select another node as HA_Primary.
- Automatic – If the HA_Primary node goes down, it automatically switches over to the secondary node configured under the Preferred failover column.
- None – When the node is not configured in an AAG cluster, the analysis displays <None>.
- Preferred failover
- If the Automatic failover mode is selected, when the HA_Primary node goes down, the server named under Preferred failover becomes the HA_Primary node. Important: Ensure that the server named under Preferred failover is patched outside the maintenance window, so that it can respond immediately if the primary node fails.
- Timestamp
- The date and time at which the data was last collected from the node, that is, when the server last responded to the query. Based on this timestamp, Console Operators can determine whether the server status has changed, which helps decide which servers to patch within or outside the maintenance window.
- Number of nodes in the cluster
- The number of nodes configured in the cluster that the node belongs to.
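The analysis output maps directly onto the two patching phases described earlier. A minimal sketch, using hypothetical sample data in place of the roles reported by Analysis 176:

  # Sample data standing in for the Role property reported by the analysis.
  $nodes = @(
      [pscustomobject]@{ Computer = 'SQLNODE1'; Role = 'HA_Primary'   },
      [pscustomobject]@{ Computer = 'SQLNODE2'; Role = 'HA_Secondary' },
      [pscustomobject]@{ Computer = 'SQLNODE3'; Role = 'Standalone'   }
  )
  # Phase 1 (before the maintenance window): patch the secondary nodes.
  $phase1 = $nodes | Where-Object Role -eq 'HA_Secondary'
  # Phase 2 (within the maintenance window): patch primary and standalone nodes.
  $phase2 = $nodes | Where-Object { $_.Role -in 'HA_Primary', 'Standalone' }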
Task 344 – Make node unavailable as possible owner of resources in cluster with safety check
This Task is an updated version, available alongside the current version, of the fixlet responsible for marking cluster nodes unavailable.
This Task is available only for clusters with 3 or more nodes. It performs an additional safety check before marking any node unavailable for cluster failover. If the safety check fails, it retries two more times (three attempts in total).
This Task ensures that, at any point in time, more than half the nodes in the cluster remain available to respond to requests. Patching can proceed only if the number of available nodes stays greater than half the number of nodes in the cluster, which allows the failover process to work and prevents data loss. For example, in a 3-node cluster, half the number of nodes is 1.5, so at least 2 nodes must remain available; the maximum number of nodes that can be taken out for patching is therefore 1. A sketch of this check appears below.
In a 3-node cluster, the standalone node is a replica of the active cluster. The primary and standalone nodes are considered active, and both are patched inside the maintenance window. Because this algorithm does not apply to 2-node clusters (more than half the nodes can never remain available for failover), use the fixlet responsible for marking cluster nodes unavailable instead.
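A minimal sketch of such a safety check, assuming the FailoverClusters PowerShell module; the retry delay and the action taken on success are assumptions:

  Import-Module FailoverClusters
  for ($attempt = 1; $attempt -le 3; $attempt++) {
      $nodes = Get-ClusterNode
      $up = @($nodes | Where-Object State -eq 'Up')
      # Taking one more node out must still leave more than half the cluster available.
      if (($up.Count - 1) -gt ($nodes.Count / 2)) {
          # Safe to proceed, for example by clearing this node from possible-owners lists.
          break
      }
      Start-Sleep -Seconds 30  # assumed delay before the next attempt
  }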
Logs and troubleshooting
- Errors and debugging information about Status Collector are available in the folder <BES Client folder>\Applications\StatusCollector\Logs.
- One or more log files are saved with the name format StatusCollector.<date>.log; each log file name contains the corresponding date.
- The log files provide all the messages related to the queries to the cluster node.
- Depending on the disk space configured for accumulating the logs through the install or edit Tasks, the latest logs are kept and the oldest logs are automatically deleted.
- To troubleshoot Status Collector, the starting point is a timestamp in the analysis that does not refresh after the configured refresh period for a specific node. This can be caused by the Task Scheduler service not running properly (in this case, use the edit fixlet to reconfigure it to start automatically) or by other reasons. If the edit fixlet does not solve the issue, manually check the Status Collector log file of the node that is not updating, as in the sketch below.
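When checking a node manually, a sketch like the following can locate and tail the newest log (the BES Client path shown is the default install location, an assumption):

  $logDir = 'C:\Program Files (x86)\BigFix Enterprise\BES Client\Applications\StatusCollector\Logs'
  $latest = Get-ChildItem -Path $logDir -Filter 'StatusCollector.*.log' |
      Sort-Object LastWriteTime -Descending | Select-Object -First 1
  # Show the most recent entries to see the last queries and any errors.
  Get-Content -Path $latest.FullName -Tail 20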