Creating a custom NiFi process group

NiFi process groups and their connections are the building blocks of a dataflow pipeline that performs data ingestion and transformation tasks to prepare data for the search index. Creating your own custom process group lets you perform these tasks according to specific business requirements. Four default templates are provided to help you get started with creating a process group that matches your business logic.

You can create a custom pipeline in any one of the following ways:

Use the Ingest container to create a new connector by importing a connector template. For more details, refer to the profit margin tutorial.
Import the individual process group templates and connect them to create a custom pipeline.

The following four process group templates are shipped with the default NiFi Registry. To import any of these templates individually, open the NiFi UI (http://<hostname/IP>:30600/nifi/) and import the process group templates directly from the NiFi Registry. To learn how to import these default templates from the NiFi Registry, see the NiFi Registry documentation.

The following sections describe each default process group template and give its name in the NiFi Registry.

Schema update

Template name in the NiFi Registry: _Template-Schema

The Elasticsearch schema defines how your data is stored in Elasticsearch. This NiFi process group template provides the processes required to modify the existing Elasticsearch schema definitions. All processes are grouped in a process group for better organization.

This template does not provide processes to extract data and input it into Elasticsearch, so after importing it you need to build those processes yourself. It provides only the process to modify the data structure that Elasticsearch uses to organize the data. The Java templates support the complete process.

Java ETL (database)

Template name in the NiFi Registry: _Template-DatabaseETL

The Java template provides the processes required to extract, transform, and create the data for the Elasticsearch index. The template provides a process group comprising the processors that perform the data transformation logic. In this case, the extraction is done via a custom-written processor that you create in Java, using a Java editor of your choice. This custom Java processor must be deployed as a NAR file to be used in NiFi; refer to Building and deploying a custom NAR file. The custom Java processor consumes a NiFi flow message and transforms the data to create the part of the document (_doc) required to populate the Elasticsearch index.

HCL Commerce Version 9.1.11.0 or later: To use the new simple mapping processor, see Configuring the connector/pipeline in NiFi.

The Java template uses a custom processor, written in Java and deployed into NiFi, for the data transformation.
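The transformation step described above can be pictured with a minimal plain-Java sketch. It is not the shipped processor and does not use the NiFi processor API; the class name, the column names (CATENTRY_ID, NAME), and the document field names are hypothetical and chosen only for illustration. It shows one database row being turned into a partial document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical transformation step: turns one database row into a partial
 * document. Plain Java only; a real custom processor would wrap logic like
 * this in the NiFi processor API and be packaged as a NAR file.
 */
final class RowToDocumentSketch {

    /** Column and field names are illustrative assumptions, not the real schema. */
    static Map<String, Object> toPartialDocument(Map<String, Object> row) {
        Map<String, Object> doc = new LinkedHashMap<>();

        // Nest the identifier under an "id" object.
        Map<String, Object> id = new LinkedHashMap<>();
        id.put("catentry", String.valueOf(row.get("CATENTRY_ID")));
        doc.put("id", id);

        // Copy the name only when present, trimming surrounding whitespace.
        Object name = row.get("NAME");
        if (name != null) {
            doc.put("name", name.toString().trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("CATENTRY_ID", 10001L);
        row.put("NAME", "  Sample product ");
        System.out.println(toPartialDocument(row));
    }
}
```

Keeping the mapping in a small, side-effect-free method like this makes it easy to unit test before it is wired into a processor and deployed.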

This template consists of the custom connector pipe for performing ETL (Extract, Transform, Load) from HCL Commerce to an Elasticsearch index. It has the following pre-built components:

Execute SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database. This process group does not support database paging, so use it with a sample data set.
Custom connector pipe processor: This processor contains the transformation logic written in Java.
NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.
Route on Master Catalog process group: This process group allows only the dataflow for the master catalog to go through. It can also work with WaitLink for blocking.

This pipe can be re-used for the following three supported dataflows:

Full re-indexing
Near real-time update (NRT)
Dataload

Java ETL (database paging)

Template name in the NiFi Registry: _Template-DatabasePagingETL

Like the Java ETL (database) template, this template provides the processes required to extract, transform, and create the data for the Elasticsearch index. The template provides a process group comprising the processors that perform the data transformation logic. Here too, the extraction is done via a custom-written processor that you create in Java, using a Java editor of your choice. This custom Java processor must be deployed as a NAR file to be used in NiFi; refer to Building and deploying a custom NAR file. The custom Java processor consumes a NiFi flow message and transforms the data to create the part of the document (_doc) required to populate the Elasticsearch index.

HCL Commerce Version 9.1.11.0 or later: To use the new simple mapping processor, see Configuring the connector/pipeline in NiFi.

In addition, this template includes an SQL process that returns a large result set and a pagination process for the NiFi logic.
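To illustrate the general idea of paging through a large result set, here is a minimal sketch that splits a row count into fixed-size page windows. It is not the template's SCROLL SQL implementation: offset/limit windows are one common paging approach used here purely for illustration, and the class name, method name, and page size are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a large result set into page windows. */
final class PagingSketch {

    /** One page window covering rows [offset, offset + limit). */
    static final class Page {
        final long offset;
        final int limit;

        Page(long offset, int limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    /** Returns the windows needed to cover totalRows with at most pageSize rows each. */
    static List<Page> pages(long totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Page> result = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            // The last page may be shorter than pageSize.
            int limit = (int) Math.min(pageSize, totalRows - offset);
            result.add(new Page(offset, limit));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Page p : pages(250, 100)) {
            System.out.println("offset=" + p.offset + " limit=" + p.limit);
        }
    }
}
```

Each window could then drive one extraction query, so that no single query has to return the whole data set at once.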

This template consists of the custom connector pipe for performing ETL (Extract, Transform, Load) from HCL Commerce to an Elasticsearch index. It has the following pre-built components:

SCROLL SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database. It supports database paging, so it can be used for large data sets.
Custom connector pipe processor: This processor contains the transformation logic written in Java.
NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.
Route on Master Catalog process group: This process group allows only the dataflow for the master catalog to go through. It can also work with WaitLink for blocking.

This pipe can be re-used for the following three supported dataflows:

Full re-indexing
Near real-time update (NRT)
Dataload

Note:

Prior to HCL Commerce Search Version 9.1.15, the SQL was kept in the same template schema in the NiFi Registry as other customization code. To make it easier for you to customize the SQL by itself, the SQL has been moved out to separate files in a src/main/resources/sql directory under the version-specific NAR and JAR. For example, if commerce-search-processors-nar-9.1.15.0.nar is your current version-specific NAR, you can find DatabaseProductStage.sql under commerce-search-processors-nar-9.1.15.0.nar/bundled-dependencies/commerce-search-processors-9.1.15.0.jar/src/main/resources/sql.
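As a small illustration of the layout in this note, the following sketch assembles the location of an SQL file from a version string. It assumes that the naming pattern shown in the 9.1.15.0 example carries over unchanged to other versions, which the note does not state; the class and method names are hypothetical.

```java
/**
 * Hypothetical helper that rebuilds the SQL file location from the example
 * above. Assumes other versions follow the same naming pattern.
 */
final class SqlResourcePathSketch {

    static String sqlPath(String version, String sqlFileName) {
        return "commerce-search-processors-nar-" + version + ".nar"
                + "/bundled-dependencies/commerce-search-processors-" + version + ".jar"
                + "/src/main/resources/sql/" + sqlFileName;
    }

    public static void main(String[] args) {
        System.out.println(sqlPath("9.1.15.0", "DatabaseProductStage.sql"));
    }
}
```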

Groovy ETL (database)

Template name in the NiFi Registry: _Template-Groovy-DatabaseETL

You can use Groovy for prototyping, but it is not recommended for production environments.

The Groovy template provides the processes required to extract, transform, and create the data for the Elasticsearch index. The template provides a process group comprising the processors that perform the data transformation logic. The extraction is done via the default Apache NiFi processor (ExecuteSQLRecord), and the data is then transformed to create the part of the document (_doc) required to populate the Elasticsearch index. The data transformation is done via an in-code Apache NiFi processor (ScriptExecutor) that lets you embed code in the template.

Important: When you use this template, an error appears at server startup. NiFi recovers from the error within a few minutes and resumes working. The error does not reappear until the next time the NiFi container is restarted.

This template consists of the custom connector pipe for performing ETL (Extract, Transform, Load) from HCL Commerce to an Elasticsearch index. It has the following pre-built components:

Execute SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database.
Custom connector pipe processor: This processor contains the transformation logic written in Groovy.
NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.

This pipe can be re-used for the following three supported dataflows:

Full re-indexing
Near real-time update (NRT)
Dataload