Configuring a Distributor Node
- From File: The node reads data from the path specified in the File Path setting.
- From Map: The map executes and produces batches based on the defined input card.
- Parameterization: To use a flow variable for the File Path, enclose the variable name in percent signs (for example, %my_data_file%).
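The percent-sign convention can be pictured as a simple token substitution. The following is a minimal sketch, not the product's implementation: the function name `resolve_flow_variables` is hypothetical, and leaving unknown variables untouched is an assumption made for illustration.

```python
import re


def resolve_flow_variables(template: str, variables: dict[str, str]) -> str:
    """Replace each %name% token with its value (hypothetical helper).

    Tokens that name an unknown variable are left unchanged; this is an
    assumption for the sketch, not documented product behavior.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"%(\w+)%", substitute, template)


# Example: the File Path setting holds "%my_data_file%".
resolved = resolve_flow_variables(
    "%my_data_file%", {"my_data_file": "/data/in/orders.csv"}
)
print(resolved)
```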
Consuming Data from a Map
- Set the Action property (Fetch As) to Burst.
- Burst mode is supported by adapters including FILE, REST, and messaging adapters (for example, Kafka, JMS, and MQ).
- The Fetch Unit controls the number of records per batch. If it is not set (default 0), the node fetches all records.
- Map Batch Size: You can override the map’s fetch unit at runtime using the map_batch_size property, allowing for flexible tuning via flow variables.
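The relationship between the Fetch Unit, the map_batch_size override, and the resulting batches can be sketched as follows. This is an illustrative model only: the helper names `effective_fetch_unit` and `batches` are hypothetical, and treating an unset map_batch_size as `None` is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def effective_fetch_unit(map_fetch_unit: int, map_batch_size: Optional[int] = None) -> int:
    """Return the runtime override if one is given, else the map's own fetch unit."""
    return map_fetch_unit if map_batch_size is None else map_batch_size


def batches(records: Iterable[str], fetch_unit: int = 0) -> Iterator[list[str]]:
    """Yield batches of up to fetch_unit records; 0 means a single batch of all records."""
    iterator = iter(records)
    if fetch_unit <= 0:
        everything = list(iterator)
        if everything:
            yield everything
        return
    while True:
        chunk = list(islice(iterator, fetch_unit))
        if not chunk:
            return
        yield chunk


records = [f"row-{i}" for i in range(10)]
unit = effective_fetch_unit(map_fetch_unit=0, map_batch_size=4)
sizes = [len(batch) for batch in batches(records, unit)]
print(sizes)
```

With the fetch unit left at 0 and no override, the same ten records would arrive as one batch.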
Distributed Instances
- Parallel execution is limited by the Maximum Instances setting.
- If Maximum Instances exceeds the number of available REST runtime instances, parallel execution is limited to the number of available REST instances.
- A flow that distributes batches does not count toward the execution process limit for flow executions. However, for a single flow instance, the execution of distributed batches is restricted to one batch per flow executor.
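The limit described above amounts to taking the smaller of the two values. A minimal sketch, with the hypothetical function name `effective_parallelism`:

```python
def effective_parallelism(maximum_instances: int, available_rest_instances: int) -> int:
    """Number of batches that can run at once: capped by both settings."""
    return max(0, min(maximum_instances, available_rest_instances))


# Maximum Instances of 8 with only 5 REST runtime instances available.
print(effective_parallelism(8, 5))
```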
Example Scenario
Consider a system with 5 available executors and a Distributor node configured with the following parameters:
- Source: A CSV file containing 1,000,000 records.
- Maximum Instances: 5.
- Batch Size: 100,000 records (resulting in 10 total batches).
Execution Logic:
- Initial State: The Distributor node in the main flow generates 5 initial requests for distributed batches.
- Concurrency: Since each executor can process only one request at a time, all 5 executors are immediately engaged.
- Queueing: As each distributed batch completes, the main Distributor instance issues a new request. This cycle continues until all 10 distributed batches have been processed.
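The scenario can be traced with a small deterministic simulation. This is a sketch under simplifying assumptions: batches are assumed to complete in the order they were started, and `simulate_distribution` is a hypothetical name, not part of the product.

```python
from collections import deque


def simulate_distribution(total_records: int, batch_size: int,
                          maximum_instances: int, available_executors: int):
    """Return (total_batches, parallel_limit, events) for a distributed run.

    Assumes batches finish in start order, which real executors do not guarantee.
    """
    total_batches = (total_records + batch_size - 1) // batch_size  # ceiling division
    parallel = min(maximum_instances, available_executors)
    pending = deque(range(1, total_batches + 1))
    in_flight: deque[int] = deque()
    events: list[tuple[str, int]] = []

    # Initial state: issue requests until the parallel limit is reached.
    while pending and len(in_flight) < parallel:
        batch = pending.popleft()
        in_flight.append(batch)
        events.append(("start", batch))

    # Queueing: every completion frees an executor and triggers one new request.
    while in_flight:
        finished = in_flight.popleft()
        events.append(("done", finished))
        if pending:
            batch = pending.popleft()
            in_flight.append(batch)
            events.append(("start", batch))

    return total_batches, parallel, events


total, parallel, events = simulate_distribution(1_000_000, 100_000, 5, 5)
print(total, parallel)
for kind, batch in events:
    print(kind, batch)
```

The event list begins with five "start" entries, and each later "start" is preceded by a "done", which mirrors the request-on-completion cycle described above.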
New File Adapter Features
When the Distributor node splits a large CSV document, it would be inefficient to create a file for each batch and use it across distributed instances. Instead, the Distributor node identifies the start offset and data length and provides that information as part of the payload for the distributed instances. All distributed instances consume the same file but read different portions of data based on the specified offset and data size.
If the Map node is the first node that consumes the data in a distributed instance, it uses the legacy file adapter to read the data. The file adapter has been enhanced to support two additional properties: offset and data size.
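The offset-based approach can be illustrated with plain file I/O. The sketch below is not the file adapter itself: `batch_ranges` and `read_range` are hypothetical helpers, and the code assumes a file with no header row and no quoted fields that contain line breaks.

```python
from typing import Iterator


def batch_ranges(path: str, records_per_batch: int) -> Iterator[tuple[int, int]]:
    """Yield (offset, size) byte ranges, each covering up to records_per_batch lines."""
    offset = 0  # byte position where the current batch starts
    size = 0    # bytes accumulated for the current batch
    count = 0   # lines accumulated for the current batch
    with open(path, "rb") as handle:
        for line in handle:
            size += len(line)
            count += 1
            if count == records_per_batch:
                yield offset, size
                offset += size
                size = 0
                count = 0
    if size:
        # Final partial batch.
        yield offset, size


def read_range(path: str, offset: int, size: int) -> bytes:
    """Read only the bytes of one batch from the shared file."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

In this model the main instance would call `batch_ranges` once and include one `(offset, size)` pair in each payload, while every distributed instance calls `read_range` against the same file.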
Initialization Flow and Status Flow
If the flow includes an initialization flow, it runs only on the main instance, before the main flow begins distributing work.
Similarly, the status flow runs only once, when the flow finishes. Distributed instances execute neither the initialization flow nor the status flow.
Flow Variables
Flow variables available in the Distributor node, whether passed directly to the flow or generated during the initialization flow, are propagated to distributed instances through the flow run payload. Updates made to variables within a distributed instance are not shared with other instances. If shared state is required, global cache variables must be used.
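The propagation behavior resembles handing each distributed instance its own copy of the variables. The following sketch models that with a hypothetical `build_run_payload` function; the payload layout shown is an assumption for illustration, not the actual flow run payload format.

```python
import copy


def build_run_payload(flow_variables: dict, batch_id: int) -> dict:
    """Build a per-batch payload carrying an independent copy of the flow variables."""
    return {"batch_id": batch_id, "variables": copy.deepcopy(flow_variables)}


main_variables = {"my_data_file": "/data/in/orders.csv", "processed": 0}
payload_a = build_run_payload(main_variables, batch_id=1)
payload_b = build_run_payload(main_variables, batch_id=2)

# An update inside one distributed instance stays local to that instance.
payload_a["variables"]["processed"] = 100_000
print(payload_b["variables"]["processed"], main_variables["processed"])
```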