HCL Commerce Version 9.1.17.0 or later

Auth/Live Separation and Dynamic Sharding in Elasticsearch

In an Elasticsearch deployment, all nodes often share responsibilities for ingestion, query processing, and master functions. While convenient, this approach can lead to performance degradation and resource contention, especially when authoring (Auth) operations such as reindexing or data updates consume significant CPU and memory, impairing the live (Live) environment's ability to handle real-time queries efficiently.

To mitigate this, the Auth/Live Separation configuration assigns dedicated node groups to each workload:
  • Authoring Nodes focus on indexing and store preview operations. They are ideal for near-real-time (NRT) requirements and ensure that these intensive operations do not affect query performance in the Live production environment.
  • Live Nodes are dedicated to serving query-heavy production environments, ensuring consistent performance and meeting Service Level Agreements (SLAs).
This separation:
  • Prevents ingestion spikes in the Auth environment from impacting Live query performance.
  • Allows each node group to be optimized for its specific workload, enhancing reliability and stability.

By isolating indices into separate Elasticsearch node groups, reindexing operations and other heavy processes in the Auth environment are prevented from degrading the Live storefront's responsiveness. This architecture forms the foundation for a robust and scalable cluster that meets the demands of modern commerce and search-driven applications.
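
As a concrete illustration of this isolation, shard placement for a node group can be enforced with Elasticsearch's standard allocation filtering. The sketch below assumes data nodes tagged with a custom node attribute (for example, node.attr.nodegroup: live in elasticsearch.yml), plus a hypothetical endpoint and index name; it shows the general mechanism rather than the exact settings HCL Commerce applies.

```python
# Minimal sketch: pin an index's shards onto "live" data nodes using
# Elasticsearch allocation filtering. The endpoint, index name, and the
# custom "nodegroup" node attribute are assumptions for this example.
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint
INDEX = "live.store.product"   # hypothetical index name

# Require every shard copy of this index to allocate onto data nodes
# started with node.attr.nodegroup: live in their elasticsearch.yml.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index.routing.allocation.require.nodegroup": "live"},
)
resp.raise_for_status()
print(resp.json())  # {'acknowledged': True} on success
```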

When to implement Auth/Live Separation in Elasticsearch

The decision to implement Auth/Live separation in Elasticsearch requires carefully assessing your workload patterns, operational needs, and overall cluster performance goals. While separating the authoring and live environments offers clear benefits, it is not always immediately necessary and can increase deployment costs.

Understanding the Need for Auth/Live Separation

In a unified Elasticsearch cluster where all nodes share responsibilities for ingestion, querying, and cluster management, resource contention can arise. Authoring operations such as reindexing and data updates demand significant CPU and memory resources. Without separation, these resource-intensive tasks may degrade the performance of the live environment, which serves real-time queries for production workloads.

However, not every Elasticsearch deployment encounters such challenges. To determine if Auth/Live separation is necessary, consider the following scenarios:
  1. High Resource Utilization during Indexing: If reindexing or ingestion tasks often lead to spikes in resource usage, causing slowdowns in query response times or impacting cluster stability, separating Auth and Live operations may help mitigate these issues.
  2. Performance Degradation in Production Queries: If you experience inconsistent query performance in your live environment, particularly during periods of heavy indexing activity, Auth/Live Separation can provide predictable and stable performance for production users.
  3. Critical SLA Requirements for Production: If your live environment must meet strict Service Level Agreements (SLAs) for query response times and uptime, dedicating resources to handle live queries can ensure compliance and improve customer satisfaction.
  4. Frequent Store Preview or NRT Requirements: Organizations with frequent near-real-time (NRT) indexing needs or store preview requirements can benefit from isolating these operations within the Auth environment, ensuring they do not interfere with live query workloads.
  5. Scalability and Growth Projections: If your data volume or query traffic is growing, separating workloads early on can help you scale your cluster more efficiently, optimizing resources for each specific task.
Pros and Cons of Auth/Live Separation

Following are the pros of the Auth/Live separation:

  • Enhanced Stability: Prevents resource contention by isolating heavy indexing tasks from live queries.
  • Optimized Performance: Each node group can be tuned specifically for its workload, improving efficiency.
  • Improved Scalability: Easier to scale individual node groups as workloads increase.
  • Better Resource Management: Reduced risk of unexpected bottlenecks during high-demand periods.

Following are the cons of the Auth/Live separation:

  • Increased Complexity: Managing separate node groups requires additional configuration and monitoring.
  • Higher Initial Costs: Dedicated node groups may increase infrastructure costs initially.
  • Dependency on Proper Planning: Separation might result in underutilized resources without accurate workload predictions.
When Auth/Live Separation May Not Be Necessary

Auth/Live Separation may not be required in smaller clusters or deployments with low query traffic and minimal indexing operations. In such cases, a unified node architecture might provide sufficient performance and simplicity. However, monitoring performance metrics is critical to determine if separation is warranted as workloads grow or become more complex.

Dynamic Sharding: Advanced Resource Optimization

Building on Auth/Live Separation, Dynamic Sharding introduces an on-demand Build node pool for reindexing and scaling, adding flexibility and efficiency to cluster operations. This approach allows users to dynamically adjust their Elasticsearch configuration based on workload demands.

How Dynamic Sharding Works
  1. Build Node Pool: Dedicated nodes are spun up to handle multi-shard reindexing tasks. These nodes can be brought online only when needed, reducing idle resource costs.
  2. Shard Shrinking: After reindexing, the newly built index is reduced to a single shard before moving to the Auth node pool. This shrinkage minimizes the runtime resource footprint and enhances query performance (see the sketch after this list).
  3. Optional Segment Optimization: The index can be optimized into a single segment to further improve read performance, reducing file system usage and query latency.
  4. Dynamic Scaling: All node pools (Build, Auth, Live) can scale up or down at runtime. The Build pool, in particular, can be completely shut down when not in use, offering significant cost savings.
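
To make the shrink step concrete, the following is a minimal sketch of the workflow using Elasticsearch's standard _shrink and _forcemerge APIs. The endpoint, index names, and node name are illustrative assumptions; in HCL Commerce these steps are driven by the NiFi ingest pipeline rather than run by hand.

```python
# Minimal sketch of shard shrinking after a multi-shard reindex.
# The endpoint, index names, and node name are illustrative assumptions.
import requests

ES = "http://localhost:9200"
SRC = "build.store.product"   # hypothetical multi-shard build index
DST = "auth.store.product"    # hypothetical single-shard target index

# 1. Shrink prerequisites: block writes and co-locate a copy of every
#    shard on one node (here a node named "build-node-0").
requests.put(f"{ES}/{SRC}/_settings", json={
    "index.blocks.write": True,
    "index.routing.allocation.require._name": "build-node-0",
}).raise_for_status()

# 2. Shrink the index down to a single primary shard.
requests.post(f"{ES}/{SRC}/_shrink/{DST}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.routing.allocation.require._name": None,  # clear the filter
    }
}).raise_for_status()

# 3. Optional: merge the shrunken index into a single segment (step 3 above).
requests.post(f"{ES}/{DST}/_forcemerge",
              params={"max_num_segments": 1}).raise_for_status()
```
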
Value Proposition
  1. Cost Efficiency: Dynamic Sharding minimizes operational costs by scaling resources on demand and shutting down unused nodes, keeping operating costs low without sacrificing performance.
  2. Improved Runtime Efficiency: Shrinking and optimizing indices before transitioning them to the Auth or Live environments significantly reduces memory and CPU usage, enhancing runtime query performance.
  3. Enhanced Scalability: Users can bring up any number of data nodes for indexing as needed and dynamically scale the cluster based on workload demands.

Ingest Configurations for Dynamic Sharding

Dynamic Sharding in Elasticsearch is driven by specific ingest configuration settings that enable fine-grained control over shard and replica management, node group roles, and resource allocation. These configurations ensure flexibility and scalability during indexing and runtime operations while aligning with the broader Auth/Live Separation framework.

  1. Node Group Role Configuration:
    1. cluster.index.nodegroup.build: The Elasticsearch node environment attribute property name that identifies the data nodes associated with the Build node group, used for reindexing tasks.
    2. cluster.index.nodegroup.auth: The Elasticsearch node environment attribute property name that identifies the data nodes associated with the Auth node group, used for NRT indexing and store preview operations.
    3. cluster.index.nodegroup.live: The Elasticsearch node environment attribute property name that identifies the data nodes associated with the Live node group, which hosts replica indices for production.

    These role definitions are foundational for allocating tasks to the appropriate nodes and ensuring resource isolation between Build, Auth, and Live environments.

  2. Shard Management Settings:
    1. cluster.index.shard.limit: Defines the maximum number of shards that can be used for indexing in the Build node group.
      1. The actual number of shards "cluster.index.shard.size" is dynamically determined at the start of the ingest operation based on the total available Elasticsearch data nodes in the Build node group.
      2. This calculated value is stored as a flowfile attribute and reflected in the Ingest Summary report.
  3. Replica Management Settings:
    1. cluster.index.replica.limit: Sets the maximum number of replicas for indices in the Live node group.
      1. The replica count "cluster.index.replica.size" is similarly calculated during ingestion and noted in the Ingest Summary report.
  4. Dynamic Shard Allocation:
    1. NiFi detects the availability of the node groups (Build, Auth, Live) during indexing and sets the shard allocation attribute "cluster.index.shard.allocation" based on the current environment.
    2. Note that the Shard and Replica Management settings are used only as threshold limits. At the beginning of each indexing operation, the actual number of shards is determined from the total number of Elasticsearch data nodes available within the node group used for indexing. NiFi detects the availability of the Build and Auth node groups at indexing time and stores the name of the node attribute used for indexing in a flowfile attribute called "cluster.index.shard.allocation"; this value also appears in the "attributes" section of the Ingest Summary report (a short sketch of this decision logic follows this list).
      Example:
      • If six Build nodes are available, "cluster.index.shard.allocation" is set to "build" with "cluster.index.shard.size" set to 6.
      • If the Build node group is shut down, the attribute shifts to "auth", and indexing tasks are allocated accordingly.
  5. Index Optimization Settings:
    1. cluster.index.merge.limit: To enable optional merging of index segments to optimize read performance, assign the maximum number of desired index segments to the Ingest Configuration called cluster.index.merge.limit. The default is "none", which means no merging is performed.

      For example, setting "cluster.index.merge.limit" to 1 merges all index segments into a single segment, reducing disk space and improving query efficiency.
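
The threshold behavior described above can be summarized in a few lines. In the following sketch the function and variable names are illustrative, not the actual NiFi implementation; it simply shows how the shard count is capped by cluster.index.shard.limit while tracking which node group is available.

```python
# Illustrative sketch of the shard-allocation decision at ingest start.
# Function and variable names are hypothetical; the real logic runs
# inside the NiFi ingest flow.
def resolve_shard_allocation(build_nodes: int, auth_nodes: int,
                             shard_limit: int) -> tuple[str, int]:
    """Pick the node group for indexing and cap the shard count."""
    if build_nodes > 0:
        # Build pool is up: one shard per Build node, up to the limit.
        return "build", min(build_nodes, shard_limit)
    # Build pool is shut down: fall back to the Auth node group.
    return "auth", min(auth_nodes, shard_limit)

# Six Build nodes, limit 10 -> ("build", 6), matching the example above.
print(resolve_shard_allocation(build_nodes=6, auth_nodes=2, shard_limit=10))
# Build pool shut down -> ("auth", 2).
print(resolve_shard_allocation(build_nodes=0, auth_nodes=2, shard_limit=10))
```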

Example: Using Dynamic Sharding with Build Nodepool

To better illustrate the usage of dynamic sharding, here’s a practical example setup:

You can configure a Kubernetes CronJob to automatically start and stop a dedicated Build nodepool, which is intended specifically for the Elasticsearch data nodes used during the Ingest operation. Alternatively, the Build nodepool can be managed manually through Kubernetes commands. When the Ingest process begins, NiFi automatically detects the currently available Build nodes and uses that count to determine the number of Elasticsearch index shards to allocate for the operation. More available Build nodes typically result in more shards being used, which can significantly improve Ingest performance.
Note: The Build nodepool is optional. Ingest operations will still function without it, but using it allows for faster processing, especially helpful when you're dealing with large volumes of data or long ingest durations. Once the ingest is complete, the Build nodes can be safely shut down to optimize resource usage and reduce costs.
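
As one way to automate this lifecycle, the sketch below resizes a nodepool before and after an ingest run. It assumes a GKE cluster and the gcloud CLI; the cluster and pool names are placeholders, and other Kubernetes platforms provide equivalent resize commands.

```python
# Illustrative sketch: resize a Build nodepool around an ingest run.
# Assumes a GKE cluster and the gcloud CLI; names are placeholders.
import subprocess

CLUSTER = "commerce-cluster"   # hypothetical cluster name
POOL = "es-build-pool"         # hypothetical Build nodepool name

def resize_build_pool(num_nodes: int) -> None:
    subprocess.run(
        ["gcloud", "container", "clusters", "resize", CLUSTER,
         "--node-pool", POOL, "--num-nodes", str(num_nodes), "--quiet"],
        check=True,
    )

resize_build_pool(6)   # bring six Build data nodes online before ingest
# ... trigger the ingest and wait for it to complete ...
resize_build_pool(0)   # shut the pool down afterwards to reduce costs
```
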
Understanding the Upper Shard Limit vs. Number of Data Nodes
It’s important to distinguish between the number of Elasticsearch data nodes and the shard count limit:
  • The Build nodepool can dynamically spin up any number of Elasticsearch data nodes during startup; there is no fixed cap on this number. In other words, the overall Ingest time can be reduced by adding more Elasticsearch data nodes to the Build nodepool.
  • However, NiFi uses a configurable upper limit to determine how many shards can be assigned to a given index, especially the product index. This limit is controlled by the setting cluster.index.shard.limit.
For example, if 10 data nodes are available but the shard limit is set to 5, only 5 shards will be used for the product index. The remaining nodes will still be utilized by Elasticsearch to manage other indexes or support distributed workload balancing.
Note: Increasing the number of data nodes in the Build nodepool can reduce ingest time, but only if the shard limit allows NiFi to distribute the workload across those nodes.

Segment Optimization in Elasticsearch

Segment Optimization is a powerful feature that improves query performance by reducing the number of index segments within a shard. While it works seamlessly with Dynamic Sharding, it is an independent optimization technique that can be applied to any Elasticsearch deployment, regardless of whether Auth/Live Separation is in use.

What is Segment Optimization?

Elasticsearch stores data in segments within each index shard. Over time, as indexing operations occur (like document updates or deletions), these segments increase in number and can become fragmented, leading to slower query performance. Segment Optimization reduces this fragmentation by merging smaller segments into fewer, larger segments.

Think of it like your computer's hard drive: even if it works fine at first, it slows down over time as data becomes fragmented. With Segment Optimization, smaller segments are merged into fewer, larger ones. This process cleans up the data, removes outdated or deleted documents, and ensures queries run faster and more efficiently.

How Segment Optimization Works
  1. Shard-Level Optimization:
    1. Elasticsearch indices are often divided into multiple shards for parallel processing and scalability.
    2. Each shard contains segments, which store portions of the index data. Over time, these segments may accumulate deleted or outdated documents, causing inefficiencies.
    3. Segment Optimization merges smaller segments into larger ones, removing deleted documents and reducing overall disk usage (a minimal sketch follows this list).
  2. Optimization Benefits:
    1. Improved Query Performance: Fewer segments mean less overhead during query execution, as Elasticsearch has to scan fewer files.
    2. Reduced Resource Usage: Optimization minimizes disk space and memory usage by cleaning up fragmented data.
  3. Compatibility with Sharding and Shrinking:
    1. Segment Optimization can be applied alongside sharding (to distribute data) and shrinking (to combine shards) for additional efficiency. For example:
      1. Sharding accelerates indexing by spreading data across nodes.
      2. Shrinking combines the results into a single, optimized shard for faster querying.
      3. Segment Optimization further defragments the data within each shard, making it even more responsive.
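
To illustrate, segment merging is exposed through Elasticsearch's standard force merge API, which can be applied to any index. The sketch below uses an assumed cluster endpoint and a hypothetical index name, and reads _cat/segments before and after to show the effect on segment counts.

```python
# Minimal sketch: inspect segment counts, then merge each shard of an
# index down to one segment. Endpoint and index name are assumptions.
import requests

ES = "http://localhost:9200"
INDEX = "live.store.product"  # hypothetical index name

def segment_count(index: str) -> int:
    # One row per segment per shard copy.
    rows = requests.get(f"{ES}/_cat/segments/{index}",
                        params={"format": "json"}).json()
    return len(rows)

print("segments before:", segment_count(INDEX))

# Force merge is I/O-heavy; run it after bulk indexing, not during it.
requests.post(f"{ES}/{INDEX}/_forcemerge",
              params={"max_num_segments": 1}).raise_for_status()

print("segments after:", segment_count(INDEX))
```
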
When to Use Segment Optimization
  1. After Heavy Indexing: Apply Segment Optimization once significant data ingestion, updates, or deletions have occurred. This ensures the index remains performant.
  2. As Part of Routine Maintenance: Run Segment Optimization regularly to keep query performance optimal, similar to periodic defragmentation of a hard drive.
  3. With Sharding and Shrinking: When leveraging Dynamic Sharding or other sharding strategies, optimize the resulting index to maximize efficiency.

Segment Optimization vs. Sharding and Shrinking

Table 1. Segment Optimization vs. Sharding and Shrinking

Feature              | Purpose                                                                          | Scope
---------------------|----------------------------------------------------------------------------------|---------------------------
Sharding             | Distributes data across multiple nodes for parallel processing.                 | Index-Level
Shrinking            | Combines multiple shards into a single shard for optimized query performance.   | Index-Level
Segment Optimization | Merges segments within a shard to reduce fragmentation and improve query speed. | Shard-Level (Within Index)

Key Difference: Sharding and shrinking focus on how data is distributed across nodes, whereas Segment Optimization improves data layout within each shard for faster queries.

Benefits of Segment Optimization

  1. Faster Query Performance: Optimized indices respond to queries more efficiently.
  2. Reduced Storage Overhead: Removes deleted documents and redundant segments.
  3. Independent and Flexible: Can be used with or without Dynamic Sharding or Auth/Live Separation.

Segment Optimization is a simple yet powerful way to keep your Elasticsearch indices efficient and performant. Whether managing a large-scale deployment with sharding or a single-node cluster, this feature ensures your queries remain fast, your storage is optimized, and your system runs smoothly.

You'll maintain top-tier performance by integrating Segment Optimization into your regular Elasticsearch maintenance routines. This ensures that indices remain efficient, queries execute faster, and resources are used optimally, delivering a seamless search experience for both small and large-scale deployments.