Data Load utility performance tuning

Scheduled Data Load utility jobs can affect HCL Commerce performance. You can reduce the impact of this process by tuning the data load performance appropriately for your implementation.

When you are considering how to tune your Data Load process, ensure that you review the Data Load summary reports that generate after you run the Data Load utility. These reports can be used to identify what elements of the Data Load process that require tuning to improve performance.

Before you begin

Ensure that you are familiar with and understand the following concepts and tasks that are related to the Data Load utility:
  • Running the Data Load utility
  • Configuring the Data Load utility

Data Load parallelization

When used with CSV formatted data, Data Load can be run with multiple writer threads. This can drastically increase load performance for large sets of data that are formatted with considerations made for parallelization.

For more information on Data Load parallelization, see Data Load parallelization.

Data Load mode

The Data Load mode parameter is used to set the type of load process that the Data Load utility is to run. You can set this mode to be Insert, Replace, or Delete in the wc-dataload.xml file for the data you are loading. Typically, Replace is used, however Insert and Delete can run faster. Running the utility in Insert or Delete mode does not require as many Ids to be resolved with the ID resolver utility. When you are using Insert or Delete, ensure that these actions are the only database operations that are required by your CSV file.

File difference preprocessing

You can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. By using this preprocessor that you can compare two input files, such as a previously loaded file and a new version of this file. The preprocessor generates a difference file that contains only the records in the new file that are not within the old file or that are changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previous loaded records, then running this file difference can result in shorter load times. Running a file difference can reduce the loading time that is required to load your routine updates to your HCL Commerce database, reduce server usage time, and improve server performance.

You can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. You can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.

For more information about this preprocessor, see Data Load file difference preprocessing.

Validation options

Configuring the Data Load utility to validate the data that you are loading can affect your Data Load performance. The validation of the data your are loading is performed against your HCL Commerce database. If you are validating many records, or the validation process encounters many invalid records, this validation process can affect performance. By default, the following validations options are available as configurable properties for the Data Load utility:
attributeValueValidation
Indicates whether to validate the attribute value. The attribute value is mandatory except within a product and defining attribute relationship.
validateAttribute
Validates whether a SKU and a product have compatible defining attributes when the SKU is moved under the product. The validation logic determines whether the attributes or allowed values that are to be created, updated, or deleted belong to the current store.
validateCatalog
Validates whether more than one master catalog is being created for a store. If the store supports sales catalogs, the validation checks whether a catalog entry belongs to more than one master category. The validation also checks whether an attribute allowed value can be set to default in the current store.
validateCatalogEntry
Validates whether to check the types of the SKU and product when the Data Load adds a SKU under a product. This check is to make sure that the SKU is really a SKU and the product is really a product.
validateCatalogGroup
Validates whether a catalog group belongs to a specified catalog.
validateUniqueDN
Validates the uniqueness of the distinguished name (DN) to identify a user in CSV file. By default, to optimize data load performance, users in the CSV file are identified by the logon ID instead of the distinguished name.
If the data that you are loading does not require any of the validation processes to occur, ensure that the configurable property is set false to ensure that no validation performs.

ID resolver cache

If the ID resolver cache size is large enough, the cache can store all of the required IDs for a database table that data is being loaded into. If the size of this cache is not large enough to store all of the required Ids, then none of the IDs are cached. When the IDs are not cached, the data load process requires that the IDs are resolved directly against the database. When you are configuring your cache setting, consider the following behavior:
  • To cache all of the IDs for a large table, the ID resolver cache can require a significant amount of time. If you are loading few records into a table with many individual records and IDs to cache, directly resolving the IDs against the database can require less time than caching all of the IDs for an entire table and resolving the IDs against the cache. You can configure the cache size to ensure that the cache is too small to store the IDs for large tables, but still large enough size to cache smaller tables to resolve the IDs against the cache. By reducing the size of this cache, you can reduce the time that is spent caching IDs for an entire table that you are loading only a few records into.
  • The ID resolver cache is cleared after a load item is completed in the load order configuration file. If you are running multiple CSV files, then the cache must be repopulated after each item completes. If multiple CSV files load data into the same tables, consider merging the files wherever possible to reduce the caching time for repopulating the same table ID data.
To tune your Id resolver cache, set an initial value that ensures all of the tables that you are loading data into can be cached. Then, set a second Data Load scenario with the Id resolver cache set to 0. In both instances, pass in the parameter -Dcom.ibm.commerce.foundation.dataload.idresolve.level=FINE when you call the Data Load utility to run. This parameter adds the resolving times to the trace, including the time that is required to populate the cache (if it is not set to 0), and the time that is required to resolve the Ids against the database or cache. With these times, you can identify whether you can increase or decrease your caching of Ids and reduce the time that is required to resolve the Ids for the data your are loading.
The size of the Id resolver cache and is set in the Data Load environment configuration file. The following is a sample declaration of a cache that can store 1 million records:
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl" cacheSize="1000000"/>
To help tune the ID resolver cache, review the data load summary report after a data load completes. This summary includes the total time that is necessary for the ID resolver to resolve and check the IDs for the objects that are loaded. The Data Load utility can also be configured to output more information about the data load ID resolver process. This addition to the summary report includes the following information:
  • The time that it takes to fetch and load the IDs for a table into the ID resolver cache
  • The number of entries per table that are stored in the cache
  • The number of hits to the cache per table to resolve IDs
  • The time that it takes to resolve IDs for a table directly from the database table
  • The number of hits to the database to resolve IDs for a table
With this information, you can identify whether it would be more efficient to resolve and check IDs for a table against the ID resolver cache or directly against the database. You can then adjust the size of your ID resolver cache as needed, or exclude IDs from certain tables from being included within the cache. For more information about the ID resolver information that can be included in the data load summary report, see Verifying the results of the data load.

Batch size and commit count

Change the Data Load utility batch size and commit count parameters to reduce the effect of network latency and reduce the processing load on the Transaction server. The commit count must be a multiple of the batch size. The commit count parameter specifies the number of rows that are flushed in a single transaction before a commit is issued. Database transactions are loaded in batches. These batches are kept in the Java memory until there are enough rows stored for a flush. Then, the batch contents are stored to the database as a single packet of data. The batches are stored in the database until the next commit occurs and the database loads the changes.

By increasing the batch size, you can reduce the effect that network latency can have on the Data Load process. Increasing the batch size can reduce the number of batches that are required to be sent to your database. The wait time for the database response can also be reduced by increasing the batch size.

By increasing the commit count, you can reduce the processing load on the Transaction server. Increasing the commit count increases the load on the database. Increasing the commit count causes more records to be committed to the database in a single transaction. This increase results in less uncommitted data that remains stored on your Transaction server and fewer overall transactions that are required to commit the data.

The values for the batch size and commit count parameters are set in the load order configuration file and are typically in the 500-1000 range. The following is a sample declaration of the batch size and commit count parameters:
<_config:LoadOrder commitCount="1000" batchSize="500" dataLoadMode="Replace">

Java virtual machine (JVM) heap size

When you are tuning the ID resolver cache, you can adjust the JVM heap allocation. The JVM heap size must be proportionate to the ID resolver cache. For instance, if the ID resolver cache is large, specify a large JVM heap size. If the ID resolver cache is large, the Data Load utility does not resolve the ID from the database directly. The ID resolver cache however might use much of the JVM heap memory. For 1 GB JVM heap size, set the ID resolver cache size to be less than 2 million to prevent a Java out of memory error. If you encounter an out of memory exception during the data load process, the allocated JVM heap size might be too small. Ensure that the JVM heap size is sufficient to accommodate the ID resolver cache and batches that are store in memory. Set the value for the JVM heap size in the following parameters within the Data Load utility, dataload.sh file: -Xms1024m -Xmx4096m

Network latency

If your environment is configured to have a large physical distance between servers, the data load process can be impacted. Your network latency can affect the flush time when the Data Load utility runs. The flush time can be viewed in the Data Load utility summary report. The flush time includes the time that is required to transport a batch, including the time that is required for the database to respond. This flush time also includes network latency. If you are experiencing a large flush time, your system can be experiencing either poor database or network performance, or both. If you tune your environment and the Data Load performance is not within the required performance range, then, installing HCL Commerce on a local system might be necessary to improve Data Load utility performance.

Database tuning

By tuning the database, you can improve the performance of the Data Load utility by reducing the time that is required to commit records. You can view the time that is required to commit loaded data records in the Data Load utility summary report. There are many performance tuning tools available for improving database performance. For more information about database performance tuning, see: