Optimizing index build and overall flow

Optmizations are provided full index builds and for tuning parameters. Potential improvements that can be implimented for Near Real Time (NRT) index updates are not described.

The index building process

A full index build consists of three major steps: retrieving data, processing data, and uploading data. There are several predefined connectors, which consist of several process groups for different purposes. Usually, each process group will contain process subgroups to handle the data retrieval, processing, and uploading stages that are associated with the build process.

Retrieving data group: Fetch data from database or Elasticsearch.
Processing data group: Based on the fetched data, build, update, and/or copy the index documents.
Uploading data group: Upload the index document to Elasticsearch. From the HCL Commerce 9.1.3.0 release, each index has one associated group.

Tuning

Generally, default settings will work for this process, but it may take a long time to finish the indexing process for a large data set. Depending on the data set size and hardware configuration (memory, CPU, disk, and network), improvements to the subgroups and process groups can be made so process flows are faster and more efficient within subgroups, and between connected subgroups.

Retrieving data group

In this group, there are three different sources that the data can be retrieved from.

Data can be:

Fetched from a database with SQL scroll;
Fetched from a database with SQL without scroll;
Fetched from Elasticsearch with scroll

Fetch data from a database with SQL scroll

Enter the SCROLL SQL group, then right click on the base canvas and select variables.

The scroll.page.size is the number of rows fetched from database by the SQL.
The scroll.bucket.size is the number of rows from the fetched data in each bucket for processing. The bucket.size will determine the size of the flow files (and the number of documents contained within each file).

Change the values of scroll.bucket.size and scroll.page.size based on the following considerations:

• Depending on the catalog size, the SQL can take a long time to get response data back to the NiFi. The purpose of the scroll SQL is to limit the data size that can be processed in NiFi at once, to avoid memory errors on large catalogs.
• The scroll settings are optimal when the time that it takes to process the data is evenly matched with the amount of time that the next SQL scroll takes to receive its data. With this optimization, unnecessary processing or I/O delay is minimized.
• The output from one subgroup is handed in turn to the next connected process subgroup. This process must be audited, to ensure that there are no bottlenecks which can impact the efficiency of the overall process.

Fetch data from database with SQL without scroll

In the process group, the data set is fetched from the database by using a single SQL stream. For example, the Find Associations at DatabaseProductStage1b.

Enter the Processor group that we are optimizing, right click on the base canvas, and select variables.

Set the scroll.bucket.size parameter to the number that you want.

The scroll.bucket.size is the number of rows from the fetched data that is placed in each bucket for processing. The bucket.size will determine the size of the flow files (and the number of documents contained within each file).

Fetch data from Elasticsearch

Since the index build is a staged process, some information may be added to existed index documents. In the case, NiFi needs to fetch data from Elasticsearch. Let us use the URL Stage 2 as the example.

Enter the SCROLL Elasticsearch group, right click on the base canvas, and select variables.

Change scroll.bucket.size and scroll.page.size to values that you want, based on the following considerations:

The scroll.page.size is the number of documents that are fetched from Elasticsearch. If the number is too small, NiFi must make more connections to Elasticsearch.

The scroll.bucket.size is the number of documents from the fetched data in each bucket for processing. The bucket.size will determine the size of the flow files (and the number of documents contained within each file).

Another parameter that is useful for tuning is scroll.duration. This value defines the amount of time that Elasticsearch will store the query result set in memory. This parameter is useful when dealing with many stores and languages running in parallel, where a running out of scroll error can be encountered. This error indicates that you are running out of scroll space, and reducing the scroll duration will force Elasticsearch to free older or obsolete buffers faster. Inversely, increasing the scroll duration in Elasticsearch for that index will provide extra time to complete processing operations.

Processing data group

In the following example, the process group DatabaseProductStage 1a is used.

Enter the Create Product Document from Database, right click Create Product Document from Database and select configure. Under the SCHEDULING tab, update the Concurrent Tasks value to set the number of threads that will be used for the process group. When increasing the number of concurrent tasks, the memory usage for the process group is also increased accordingly. Therefore, setting this value to a number that is greater than the number of CPUs that are allocated to the pod, or beyond the amount of memory that is allocated to the pod may not make sense, and can have a negative impact on performance.

Uploading documents group

Enter the Bulk Elasticsearch group. There are are several processes displayed. Right click on each process, and select configure. Under the SCHEDULING tab, update the Concurrent Tasks values to one that makes sense for your environment.

The Post Bulk Elasticsearch processor sends the created index documents to Elasticsearch. By default, Elasticsearch will use the same number of CPUs as the number of the connections. Considering the possible delay or pools, the number that is set for the Post Bulk Elasticsearch processor may be larger than number of CPUs that are allocated to the Elasticesearch pod.

It is a good rule of thumb to consider all processes in a group together. In general, it is best to try and make later processes complete faster than earlier ones. Otherwise, queued objects will consume extra memory and slow down overall processing efficiency.

Overall considerations for tuning

The general objective for tuning is to process and store the most data for a given time period, given a configured system’s resources, and their utilization. In general, we can tune and improve the following areas:

Concurrent tasks/threads

Increasing the number of threads that are processed can help to improve a processing group throughput, however this needs to be assessed carefully.

Do not multithread processors that use scrolling to getting data from database or Elasticsearch. Since scrolling approach is used to batch the data in sizes that are best fit, multithreading would have negative impact of the overal system processing efficiency.
Consider increases to bucket sizes for multithreaded processors that send bulk updates to Elasticsearch, or perform single reads from Elasticsearch.
More concurrent tasks/threads naturally consume more memory and vCPU resources.
If the cost of memory garbage collection is high, you may need to reduce the concurrent number of threads, or add additional memory resources.
• Monitor all servers (Database, NiFi, and Elasticsearch) during the ingest process to find the bottleneck in the pipeline.

Bucket size/Flow file size

Increasing the bucket size increases the flow file size, i.e., the number of documents that are processed as a group in NiFi. The bigger the flow file, the better the efficiency. However, limited operating system resources will limit the maximum size of the flow file.

Flow file size tuning is very visible and impactful on the system.
Large flow files have several negative side effects:
- They demand a large memory heap for NiFi.
- They require a matched funnel on Elasticsearch, to accept the data as it comes over.
NiFi GC overhead may become prohibitively high, or NiFi can run out of heap space with an Out of Memory error.

Back pressure in links between processes, process groups, and process subgroups

In the pipeline, there are many links between steps. Each link has its own queue for the resulting objects (queued items) for processing in the next step.

The link is a location which can be used to identify the bottleneck of a process or process group. If a link has large number of queued items, the next process will be the bottleneck. This is especially true for longer duration builds.

For example, the following link has 451 queued items for the process Analyze Successful SQL Response.

Back pressure is a configuration threshold that controls the overall data streaming speed. This threshold indicates how much data should be allowed to exist in the queue before the component (Processor or Processor Group) that is producing the data in the queue is no longer scheduled to run. This is designed to avoid the system from being overrun with data in motion.

NiFi provides two configuration elements to control back pressure:

Back Pressure Object Threshold - This is the number of objects that can be in the queue before back pressure control is applied.
Back Pressure Size Threshold - This specifies the size of the objects that can be in the queue before back pressure control is applied.

If you usually work with documents, use the back pressure object threshold to control the back pressure. To configure it, right click the link and select View Configuration.

The Back Pressure Object Threshold and Back Pressure Size threshold can be set from their default values.