Data Load parallelization
The data load utility is improved in HCL Commerce Version 9.1 to allow for parallelization. Parallelization allows for certain data load jobs to complete much faster, by increasing the number of threads that are used to load data into the database.
- Performance tuning of the Data Load utility and your data is required to use this feature effectively. If parallelization is poorly configured for your data, this feature can reduce overall Data Load performance. You must carefully consider the relationship between parallelization, its configuration, the type of data, and its particular structure.
- Data Load parallelization is only compatible with CSV formatted data.
In previous versions of HCL Commerce, the data load utility was a single-threaded application constrained by singleton classes that were not designed for parallel use. This design limited the performance of the utility on some large data jobs, and in some instances long-running jobs impaired users' ability to get work done. With this upgrade to the data load utility, multiple users of the tool can load data concurrently. In addition, shorter jobs allow subsequent jobs to run sooner.
Architecture
The architectural enhancements made to the data load utility include the addition of a queue, into which the reader of the CSV file places batches of data to be processed. The queue has a maximum size. When that size is reached, the reader temporarily stops producing batches, and resumes adding batches to the queue as existing batches are consumed. After all of the data is read from the input file, the reader thread places an empty batch into the queue, and then exits with a data load summary report.
Each writer thread will remove one batch from the queue and process the batch to load data into the database. When a writer thread gets an empty batch from the queue, it will place the empty batch back into the queue and the writer thread will exit with a data load summary report.
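The reader, queue, and writer interaction described above can be sketched as a small bounded-queue program. This is an illustrative model only, not the utility's actual implementation: the function name run_load and the in-memory list that stands in for the database are assumptions, and the argument names simply mirror the numberOfThreads, inputDataListSize, and queueSize parameters.

```python
import queue
import threading

def run_load(rows, number_of_threads=4, input_data_list_size=20, queue_size=None):
    """Toy model: one reader batches rows, writer threads drain a bounded queue."""
    # Assumption: the queue size falls back to the number of writer threads.
    batches = queue.Queue(maxsize=queue_size or number_of_threads)
    loaded = []               # stands in for the database
    lock = threading.Lock()   # protects the shared list

    def reader():
        for start in range(0, len(rows), input_data_list_size):
            # put() blocks while the queue is full, pausing the reader
            batches.put(rows[start:start + input_data_list_size])
        batches.put([])       # an empty batch signals the end of the input

    def writer():
        while True:
            batch = batches.get()
            if not batch:
                batches.put(batch)   # put it back so other writers also stop
                return
            with lock:
                loaded.extend(batch)

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=writer) for _ in range(number_of_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded
```

Because several writers consume batches at the same time, the order in which rows reach the list is not guaranteed to match the input order, which is why the ordering of hierarchical data matters when parallelization is enabled.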
The Data Load utility waits until all writer threads finish and exit, and then checks each writer thread for errors. If the writer threads created error reprocessing CSV files, the Data Load utility merges them into a single error reprocessing CSV file, and then reloads this CSV file using a single writer thread. Once all writer threads finish, the Data Load utility produces a combined data load summary report.
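The merge-and-reload step can be pictured with a short sketch. It works on in-memory lists rather than CSV files, and the names merge_and_reload, load_row, and loaded_ids are hypothetical. It only shows the idea that rows which failed under parallel loading are collected and retried sequentially.

```python
def merge_and_reload(error_batches, load_row):
    """Merge per-writer failed rows, then retry them one at a time."""
    merged = [row for batch in error_batches for row in batch]
    still_failing = []
    for row in merged:
        try:
            load_row(row)
        except ValueError:
            still_failing.append(row)
    return merged, still_failing

# Toy "database": a child row loads only if its parent is already present.
loaded_ids = {"p1"}

def load_row(row):
    row_id, parent_id = row
    if parent_id and parent_id not in loaded_ids:
        raise ValueError(f"parent {parent_id} is not loaded")
    loaded_ids.add(row_id)

# Two writers each reported one failed row during the parallel pass.
merged, failed = merge_and_reload([[("c1", "p1")], [("c2", "p9")]], load_row)
```

In this toy run, the row whose parent arrived late succeeds on the sequential retry, while the row whose parent never existed still fails.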
Performance considerations and error handling
Due to the complex nature of loading hierarchical data with multiple threads, error handling must be carefully considered when enabling and configuring parallelization. This is especially true when it comes to performance tuning for your particular environment and dataset.
- Use the existing data load parameters commitCount, batchSize, and maxError per LoadItem to ensure that your data load utility performance is well tuned.
- Format your data to leverage parallelization appropriately. For example, if your data contains hierarchical data, place parent data together towards the beginning of the file. This will reduce the chances of attempting to load child data for which parent data is not yet present.
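As one way to prepare a file so that parent rows come first, the following sketch reorders CSV rows by their depth in the hierarchy. The column names id and parent_id and the function name parents_first are assumptions made for illustration. Real data load CSV files use their own column layouts.

```python
import csv
import io

def parents_first(csv_text, id_col="id", parent_col="parent_id"):
    """Return rows ordered so that shallower (parent) rows precede deeper ones."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    parent_of = {r[id_col]: r[parent_col] for r in rows}

    def depth(row_id):
        # Walk up the parent chain; the 'seen' set guards against cycles.
        d, seen = 0, set()
        while parent_of.get(row_id) and row_id not in seen:
            seen.add(row_id)
            row_id = parent_of[row_id]
            d += 1
        return d

    # sorted() is stable, so rows at the same depth keep their original order.
    return sorted(rows, key=lambda r: depth(r[id_col]))

sample = "id,parent_id\nc1,p1\np1,\ng1,c1\np2,\n"
ordered_ids = [r["id"] for r in parents_first(sample)]
```

Grouping rows by depth places all top-level parents at the start of the file, so child batches are less likely to be picked up by a writer before their parents are loaded.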
Configurable parameters and defaults
By default, the data load utility runs in single-thread mode. This preserves the job behavior and performance that users have come to expect. The following new parameters have been added to control the parallelization of the data load utility.
A sample data load configuration that includes the full use of these parameters is available here.
Parameter | Value type | Default value | Description |
---|---|---|---|
numberOfThreads | Integer | 1 | The maximum number of individual writer threads that take batches of data from the queue, process them in order, and write the processed data into the database. By default, the numberOfThreads parameter is set to 1, meaning that the data load utility runs in single-threaded (legacy) mode. The maximum number of threads is 8. Based on internal performance testing, HCL recommends using 4 threads. Using more than four threads has been shown to reduce overall load performance, and can result in errors. If a number greater than 8 is provided, the maximum number of threads is used. |
inputDataListSize | Integer | 20 | The maximum number of CSV line entries that are included in a batch of data to be added to the queue. Each writer thread handles a single batch of data from the queue. Once the batch is loaded, the thread is freed to process another batch from the queue. By default, the inputDataListSize parameter is set to 20. |
queueSize | Integer | numberOfThreads | The maximum number of batches that can exist in the queue. Once the queue is filled with the maximum number of batches, the reader waits for batches from the queue to be consumed before continuing to produce and queue further batches. By default, this is set to the numberOfThreads property value. |
multipleThreadsEnabled | Boolean | false | Defines whether parallelization is enabled for the specific load item. By setting this parameter to false for a specific LoadItem, you override the set parallelization parameters and force the data load utility into single-threaded operation. This parameter is set manually per LoadItem. If it is not specified, its default value of false is assumed. |