Creating a custom NiFi process group
NiFi process groups and their connections are the building blocks of a dataflow pipeline that ingests and transforms data to prepare it for the search index. Creating your own custom process group lets you perform data ingestion and transformation according to specific business requirements. Four default templates are provided to help you get started with creating a process group that matches your business logic. You can use them in either of the following ways:
- Use the Ingest container to create a new connector by importing a connector template. For more details, refer to the profit margin tutorial.
- Import the individual process group templates and connect them to create a custom pipeline.
Open the NiFi user interface (http://<hostname/IP>:30600/nifi/) and then import the process group templates directly from the NiFi Registry. To learn how to import these default templates from the NiFi Registry, see the NiFi Registry documentation. The default process group templates are described below, along with their names.
Schema update
Template name in the NiFi Registry: _Template-Schema
The Elasticsearch schema defines how your data is stored in Elasticsearch. This NiFi process group template provides the processes required to modify the existing Elasticsearch schema definitions. All processes are grouped in a process group for better organization.
After importing this template, you still need to build the processes that extract data and load it into Elasticsearch, because this template does not provide them. It only provides the process for modifying the data structure that Elasticsearch uses to organize the data. The Java templates support the complete process.
Java ETL (database)
Template name in the NiFi Registry: _Template-DatabaseETL
The Java template provides the processes required to extract, transform, and create the data for the Elasticsearch index. The template provides a process group comprising the processors that perform the data transformation logic. In this case, the extraction is done by a custom-written processor that you create in Java, using a Java editor of your choice. This custom Java processor must be deployed as a NAR file before it can be used in NiFi. Refer to Building and deploying a custom NAR file. The custom Java processor consumes a NiFi flow message and transforms the data to create the part of the document (_doc) required to populate the Elasticsearch index.
To use the new simple mapping processor, see Configuring the connector/pipeline in NiFi.
The Java template uses a custom processor, written in Java and deployed into NiFi, for the data transformation.
- Execute SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database. This process group does not support database paging, so use it with a sample data set.
- Custom connector pipe processor: This processor contains the transformation logic written in Java.
- NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.
- Route on Master Catalog process group: This process group allows only the dataflow for the master catalog to pass through. It can also work with WaitLink for blocking.
- Full re-indexing
- Near real-time update (NRT)
- Dataload
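The transformation step described above, where a custom Java processor turns extracted data into part of the index document, can be illustrated with a minimal sketch. It is not the code shipped in the template and deliberately has no NiFi dependency; a real processor would wrap logic like this inside the NiFi processor API. The class name ProductDocMapper, the column names CATENTRY_ID and NAME, and the output field names are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for the transformation logic of a custom processor.
 * All names here are illustrative assumptions, not part of the template.
 */
public class ProductDocMapper {

    /** Maps one database row (column name to value) to a partial index document. */
    public static Map<String, Object> toPartialDoc(Map<String, Object> row) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Assumed column and field names; adapt them to your own SQL and schema.
        doc.put("id", String.valueOf(row.get("CATENTRY_ID")));
        Object name = row.get("NAME");
        if (name != null) {
            // Trim padding that fixed-width database columns may carry.
            doc.put("name", name.toString().trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("CATENTRY_ID", 10001L);
        row.put("NAME", "  Sample product ");
        System.out.println(toPartialDoc(row));
    }
}
```

Keeping the mapping in a plain static method makes it easy to unit test before packaging the processor as a NAR file.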
Java ETL (database paging)
Template name in the NiFi Registry: _Template-DatabasePagingETL
Like the Java template, this template provides the processes required to extract, transform, and create the data for the Elasticsearch index. The template provides a process group comprising the processors that perform the data transformation logic. Here too, the extraction is done by a custom-written processor that you create in Java, using a Java editor of your choice. This custom Java processor must be deployed as a NAR file before it can be used in NiFi. Refer to Building and deploying a custom NAR file. The custom Java processor consumes a NiFi flow message and transforms the data to create the part of the document (_doc) required to populate the Elasticsearch index.
To use the new simple mapping processor, see Configuring the connector/pipeline in NiFi.
In addition, this template includes an SQL process that returns a large result set and a pagination process for the NiFi logic.
- SCROLL SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database. It supports database paging, so it can be used for large data sets.
- Custom connector pipe processor: This processor contains the transformation logic written in Java.
- NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.
- Route on Master Catalog process group: This process group allows only the dataflow for the master catalog to pass through. It can also work with WaitLink for blocking.
- Full re-indexing
- Near real-time update (NRT)
- Dataload
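To show the general idea behind database paging, the sketch below splits one large query into fixed-size pages. It is only an illustration of the concept and does not reproduce how the SCROLL SQL process group pages internally. The class name PagedQueryPlanner, the sample query, and the use of the standard OFFSET ... FETCH clause are assumptions; the base query needs a stable ORDER BY so that pages do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that plans one SQL statement per page of a large result set. */
public class PagedQueryPlanner {

    /**
     * Returns one statement per page. Assumes baseSql already ends with a
     * deterministic ORDER BY and that the database supports OFFSET ... FETCH.
     */
    public static List<String> planPages(String baseSql, long totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> pages = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            pages.add(baseSql + " OFFSET " + offset
                    + " ROWS FETCH NEXT " + pageSize + " ROWS ONLY");
        }
        return pages;
    }

    public static void main(String[] args) {
        // Illustrative query only; the real SQL ships with the template.
        String sql = "SELECT CATENTRY_ID FROM CATENTRY ORDER BY CATENTRY_ID";
        for (String page : planPages(sql, 250, 100)) {
            System.out.println(page);
        }
    }
}
```

Processing one page at a time keeps memory use bounded, which is why the non-paging template is suggested only for sample data sets.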
Before HCL Commerce Search Version 9.1.15, the SQL was kept in the same template schema in the NiFi Registry as other customization code. To make the SQL easier to customize on its own, it has been moved to separate files in a src/main/resources/sql directory under the version-specific NAR and JAR. For example, if commerce-search-processors-nar-9.1.15.0.nar is your current version-specific NAR, you can find DatabaseProductStage.sql under commerce-search-processors-nar-9.1.15.0.nar/bundled-dependencies/commerce-search-processors-9.1.15.0.jar/src/main/resources/sql.
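Because NAR and JAR files are both zip archives, the SQL files can be located with the standard java.util.zip classes. The following is a minimal sketch under that assumption; the class name SqlResourceLister and its methods are hypothetical, and the NAR path and JAR file name are supplied by the caller rather than assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

/** Hypothetical utility that lists the SQL files bundled inside a NAR. */
public class SqlResourceLister {

    /** True for entries under a resources/sql directory that end with .sql. */
    public static boolean isSqlResource(String entryName) {
        return entryName.contains("resources/sql/") && entryName.endsWith(".sql");
    }

    /** Lists SQL entries of the nested JAR whose path inside the NAR ends with jarName. */
    public static List<String> listSqlFiles(String narPath, String jarName) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile nar = new ZipFile(narPath)) {
            Enumeration<? extends ZipEntry> entries = nar.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(jarName)) {
                    continue;
                }
                // The nested JAR is itself a zip archive, so stream through it.
                try (InputStream in = nar.getInputStream(entry);
                     ZipInputStream jar = new ZipInputStream(in)) {
                    ZipEntry inner;
                    while ((inner = jar.getNextEntry()) != null) {
                        if (isSqlResource(inner.getName())) {
                            found.add(inner.getName());
                        }
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: SqlResourceLister <path-to-nar> <jar-file-name>");
            return;
        }
        for (String name : listSqlFiles(args[0], args[1])) {
            System.out.println(name);
        }
    }
}
```

Running it with the path to your version-specific NAR and the name of the bundled JAR prints the SQL entries that are available for customization.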
Groovy ETL (database)
Template name in the NiFi Registry: _Template-Groovy-DatabaseETL
You can use Groovy for prototyping, but it is not recommended for production environments.
- Execute SQL process group: This process group contains the SQL used for extracting data from the HCL Commerce database.
- Custom connector pipe processor: This processor contains the transformation logic written in Groovy.
- NiFi flow remote process group: This process group routes the index-ready document to another bulk service for indexing to Elasticsearch.
- Full re-indexing
- Near real-time update (NRT)
- Dataload