Data Load utility performance tuning
Scheduled Data Load utility jobs can affect HCL Commerce performance. You can reduce the impact of this process by tuning the data load performance appropriately for your implementation.
When you are considering how to tune your Data Load process, review the Data Load summary reports that are generated after you run the Data Load utility. You can use these reports to identify which elements of the Data Load process require tuning to improve performance.
Before you begin
- Running the Data Load utility
- Configuring the Data Load utility
Data Load parallelization
When it is used with CSV-formatted data, the Data Load utility can run with multiple writer threads. Running multiple writer threads can drastically increase load performance for large data sets that are formatted with parallelization in mind.
For more information on Data Load parallelization, see Data Load parallelization.
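As a conceptual sketch only (this is not the utility's implementation; the class, record, and method names are illustrative), the following Java fragment shows the idea behind data that is formatted with parallelization in mind: records are partitioned by a key, such as a parent product identifier, so that related rows always land on the same writer thread and threads do not contend over the same rows.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class ParallelWriterSketch {
    // Illustrative row type: the first CSV column is the partition key.
    record CsvRow(String partitionKey, String rawLine) {}

    static void loadInParallel(List<CsvRow> rows, int writerThreads)
            throws InterruptedException {
        // Group rows by key so that related records stay on one thread.
        Map<Integer, List<CsvRow>> partitions = rows.stream()
                .collect(Collectors.groupingBy(
                        row -> Math.floorMod(row.partitionKey().hashCode(), writerThreads)));

        ExecutorService pool = Executors.newFixedThreadPool(writerThreads);
        // Each partition is written by exactly one task.
        partitions.values().forEach(partition ->
                pool.submit(() -> partition.forEach(row ->
                        System.out.println("load: " + row.rawLine()))));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}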
Data Load mode
The Data Load mode parameter sets the type of load process that the Data Load utility runs. You can set this mode to Insert, Replace, or Delete with the dataLoadMode attribute in the wc-dataload.xml file for the data that you are loading (see the LoadOrder example later in this topic). Typically, Replace is used; however, Insert and Delete can run faster because they do not require as many IDs to be resolved with the ID resolver utility. When you use Insert or Delete, ensure that these actions are the only database operations that your CSV file requires.
File difference preprocessing
You can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. By using this preprocessor, you can compare two input files, such as a previously loaded file and a new version of that file. The preprocessor generates a difference file that contains only the records in the new file that are not in the old file or that changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previously loaded records, running this file difference can result in shorter load times. Running a file difference can reduce the time that is required to load your routine updates into your HCL Commerce database, reduce server usage time, and improve server performance.
You can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. You can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.
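As an illustration of what a file difference pass produces (this is a simplified sketch, not the preprocessor's implementation; the file names and the choice of the first column as the record key are assumptions for the example), the following Java fragment keeps only the records in the new file that are absent from, or changed since, the old file:

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class FileDiffSketch {
    // Assumes the first CSV column uniquely identifies a record.
    static Map<String, String> index(Path csv) throws IOException {
        Map<String, String> byKey = new HashMap<>();
        for (String line : Files.readAllLines(csv)) {
            byKey.put(line.split(",", 2)[0], line);
        }
        return byKey;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> oldRecords = index(Path.of("catalog-old.csv"));
        try (PrintWriter diff = new PrintWriter("catalog-diff.csv")) {
            for (String line : Files.readAllLines(Path.of("catalog-new.csv"))) {
                String key = line.split(",", 2)[0];
                // Keep records that are new or whose content changed.
                if (!line.equals(oldRecords.get(key))) {
                    diff.println(line);
                }
            }
        }
    }
}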
For more information about this preprocessor, see Data Load file difference preprocessing.
Validation options
The following validation options can affect Data Load utility performance:
- attributeValueValidation
- Indicates whether to validate the attribute value. The attribute value is mandatory except in a relationship between a product and a defining attribute.
- validateAttribute
- Validates whether a SKU and a product have compatible defining attributes when the SKU is moved under the product. The validation logic determines whether the attributes or allowed values that are to be created, updated, or deleted belong to the current store.
- validateCatalog
- Validates whether more than one master catalog is being created for a store. If the store supports sales catalogs, the validation checks whether a catalog entry belongs to more than one master category. The validation also checks whether an attribute allowed value can be set to default in the current store.
- validateCatalogEntry
- Indicates whether to check the types of the SKU and the product when the Data Load utility adds a SKU under a product. This check makes sure that the SKU is really a SKU and that the product is really a product.
- validateCatalogGroup
- Validates whether a catalog group belongs to a specified catalog.
- validateUniqueDN
- Validates the uniqueness of the distinguished name (DN) that identifies a user in the CSV file. By default, to optimize data load performance, users in the CSV file are identified by the logon ID instead of the distinguished name.
ID resolver cache
- Caching all of the IDs for a large table can require a significant amount of time. If you are loading only a few records into a table that contains many rows and IDs to cache, resolving the IDs directly against the database can require less time than caching the IDs for the entire table. You can configure the cache size so that it is too small to store the IDs for large tables, but still large enough to cache smaller tables and resolve their IDs against the cache. By reducing the size of this cache, you reduce the time that is spent caching IDs for an entire table that you are loading only a few records into. The cost sketch at the end of this section illustrates this tradeoff.
- The ID resolver cache is cleared after each load item in the load order configuration file completes. If you are running multiple CSV files, the cache must be repopulated after each item completes. If multiple CSV files load data into the same tables, consider merging the files wherever possible so that the same table ID data is not repopulated repeatedly.
To help you tune the ID resolver cache, include the following parameter when you call the Data Load utility to run:

-Dcom.ibm.commerce.foundation.dataload.idresolve.level=FINE

This parameter adds the resolving times to the trace, including the time that is required to populate the cache (if the cache size is not set to 0) and the time that is required to resolve the IDs against the database or cache. With these times, you can identify whether to increase or decrease your caching of IDs and reduce the time that is required to resolve the IDs for the data that you are loading. The cache size is set on the IDResolver element in the data load configuration file:

<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl" cacheSize="1000000"/>

The trace includes the following information:
- The time that it takes to fetch and load the IDs for a table into the ID resolver cache
- The number of entries per table that are stored in the cache
- The number of hits to the cache per table to resolve IDs
- The time that it takes to resolve IDs for a table directly from the database table
- The number of hits to the database to resolve IDs for a table
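As a back-of-the-envelope illustration of the caching tradeoff (the timings, names, and numbers here are invented for the example, not measurements from the utility), the following sketch compares the one-time cost of populating the cache for a table against the cost of resolving each loaded record directly:

public class IdResolverCostSketch {
    // Rough cost model: preloading pays a per-row caching cost for the whole
    // table; direct resolution pays a per-lookup round trip for each record
    // that is loaded. Preloading only wins when the load touches enough rows.
    static boolean preloadIsWorthIt(long tableRows, long recordsToLoad,
                                    double cacheCostPerRowMs, double directLookupMs) {
        double preloadCost = tableRows * cacheCostPerRowMs;
        double directCost = recordsToLoad * directLookupMs;
        return preloadCost < directCost;
    }

    public static void main(String[] args) {
        // Loading 500 records into a 5-million-row table: direct resolution wins.
        System.out.println(preloadIsWorthIt(5_000_000, 500, 0.01, 2.0));   // false
        // Loading 200,000 records into a 50,000-row table: preloading wins.
        System.out.println(preloadIsWorthIt(50_000, 200_000, 0.01, 2.0));  // true
    }
}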
Batch size and commit count
Change the Data Load utility batch size and commit count parameters to reduce the effect of network latency and the processing load on the Transaction server. The commit count must be a multiple of the batch size. The commit count parameter specifies the number of rows that are flushed in a single transaction before a commit is issued. Database changes are loaded in batches: rows are kept in Java memory until enough rows are stored for a flush, and then the batch contents are sent to the database as a single packet of data. The batched changes are held by the database until the next commit occurs, when the database applies the changes.
By increasing the batch size, you can reduce the effect that network latency can have on the Data Load process. Increasing the batch size can reduce the number of batches that are required to be sent to your database. The wait time for the database response can also be reduced by increasing the batch size.
By increasing the commit count, you can reduce the processing load on the Transaction server, although doing so increases the load on the database. A larger commit count causes more records to be committed to the database in a single transaction, which leaves less uncommitted data stored on your Transaction server and requires fewer overall transactions to commit the data.
For example, the following LoadOrder element sets the commit count, batch size, and data load mode:

<_config:LoadOrder commitCount="1000" batchSize="500" dataLoadMode="Replace">
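To make the relationship between the two parameters concrete, the following minimal JDBC sketch (not the Data Load utility's code; the table name and connection handling are placeholder assumptions) flushes a batch to the database every batchSize rows and commits every commitCount rows, which is why the commit count must be a multiple of the batch size:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchCommitSketch {
    // Call with commitCount as a multiple of batchSize, as with the utility.
    static void load(Connection conn, List<String> rows,
                     int batchSize, int commitCount) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement stmt =
                     conn.prepareStatement("INSERT INTO demo_table (val) VALUES (?)")) {
            int pending = 0;
            for (String row : rows) {
                stmt.setString(1, row);
                stmt.addBatch();
                pending++;
                if (pending % batchSize == 0) {
                    stmt.executeBatch();   // flush one batch over the network
                }
                if (pending % commitCount == 0) {
                    conn.commit();         // make the flushed batches durable
                }
            }
            stmt.executeBatch();           // flush any partial final batch
            conn.commit();
        }
    }
}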
Java virtual machine (JVM) heap size
When you are tuning the ID resolver cache, you can adjust the JVM heap allocation. The JVM heap size must be proportionate to the ID resolver cache: if the ID resolver cache is large, specify a large JVM heap size. When the ID resolver cache is large, the Data Load utility does not resolve the IDs directly from the database; the cache, however, might use much of the JVM heap memory. For a 1 GB JVM heap size, set the ID resolver cache size to less than 2 million entries to prevent a Java out-of-memory error. If you encounter an out-of-memory exception during the data load process, the allocated JVM heap size might be too small. Ensure that the JVM heap size is sufficient to accommodate the ID resolver cache and the batches that are stored in memory. Set the JVM heap size with the following parameters in the Data Load utility dataload.sh file:
-Xms1024m -Xmx4096m
Network latency
If your environment is configured with a large physical distance between servers, the data load process can be affected. Network latency can increase the flush time when the Data Load utility runs. You can view the flush time in the Data Load utility summary report. The flush time includes the time that is required to transport a batch, including the time that the database takes to respond, so it also includes network latency. If you are experiencing a long flush time, your system might be experiencing poor database performance, poor network performance, or both. If you tune your environment and the Data Load performance is still not within the required range, installing HCL Commerce on a local system might be necessary to improve Data Load utility performance.