Configuring the Data Load utility to run a file difference preprocess
If you routinely load the same generated Data Load input file from an external system or source, you can run a file difference preprocess as part of the Data Load process so that only the changes in your newest input file are loaded.
Before you begin
- Identify the two Data Load input files that you want to compare and generate a difference file from.
- Successfully load the old input file into your HCL Commerce database. Records that are in both the old file and the new file are left out of the generated difference file. If the old file was never loaded, those records are therefore never loaded into your database. To prevent records from being omitted in this way, verify that the contents of the old file are loaded into your database before you run the comparison.
About this task
A file difference tool is available as a data reader preprocessor when you run the Data Load utility. This file difference preprocessor can read and compare two CSV files or two XML files. The preprocessor uses a different class for CSV files (CSVFileDiffPreprocessor) than for XML files (XmlFileDiffPreprocessor), so you cannot compare a CSV file to an XML file.
Configuration properties for file difference preprocessor
The keyColumns property must be specified in the business object configuration file.
Configuration property | Description |
---|---|
keyColumns | Required. Key columns are the CSV columns or XML elements that uniquely identify a record in your input file. |
numberOfSplitFiles | Optional. Use this property to specify how many files the input files are to be split into when the old input file is too large to be stored in memory. It is recommended that the |
checkDuplicatedKeys | Optional. Specify this property as true to perform an extra check for duplicate entries. It is recommended that you specify |
diffFileDirectory | Optional. Use this property to change the directory where the generated difference file is saved. |
dataReaderPreprocessOnly | Optional. Specifying this property as true stops the Data Load process after the difference file is generated and saved. |
cleanupSplitFiles | Optional. If your input files are split, you can set this property to false to keep the temporary generated smaller files. If this property is set to true or omitted, the generated smaller files are deleted after the files are merged. |
columnBasedCompare | Optional. Indicates whether the preprocessor is to use a column-based comparison to compare files. You can set the following values for this property: Note: Configuring a column-based comparison can take longer to complete than using the default file difference preprocessor behavior. With a column-based comparison, the preprocessor must complete an extra look-up between the files. |
includeCompareColumns | Optional. Indicates whether the file difference preprocessor is to compare only specific columns. Use a comma-separated list as the value for this property to identify the columns to be compared. Any column that is not in this list is ignored during the file comparison. When you include this property, the columnBasedCompare property is configured by default with a value of true when the property is not explicitly configured. If you include both the If you include the Note: If you include the includeCompareColumns property and do not set a value and the excludeCompareColumns property is not set with a value, the file difference preprocessor compares only the key columns. The generated difference file then includes only the records from the new input file that have a key column value that is not in the old input file. |
excludeCompareColumns | Optional. Indicates whether the file difference preprocessor is to exclude specific columns from being compared. Use a comma-separated list as the value for this property to identify the columns to be excluded from comparison. All other columns are compared. When you include this property, the columnBasedCompare property is configured by default with a value of true when the property is not explicitly configured. If you include both the If you include the |
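The comparison properties in the table can be illustrated with a minimal sketch of a load order configuration file that excludes one column from the comparison. The sketch reuses the element layout and file paths from the example in this topic. The LastUpdate column name is hypothetical, and setting this property in the load order configuration file rather than in the business object configuration file is an assumption. Because columnBasedCompare is not set explicitly, it takes the default value of true, as the table describes.

```xml
<_config:DataLoadConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <!-- Hypothetical column: ignore LastUpdate when deciding whether a record changed. -->
      <_config:property name="excludeCompareColumns" value="LastUpdate"/>
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv"
                                  oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv"/>
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
```

With a setting like this, a record whose only change is in the excluded column would not be expected to appear in the difference file.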
Configuring the file difference preprocess to handle large input files
The file difference preprocess loads the old input file into a hash map in your system memory and compares this hash map to the new input file to generate a difference file. If the old file is too large to be loaded into your system memory, the file difference preprocessor splits the file into smaller files. The new input file is also split into the same number of smaller files. The preprocess generates a difference file for each pairing of these smaller files and then merges these files into a single larger difference file.
By default, the file difference preprocessor automatically determines the number of files that are required to split a large file. You can choose to configure the number of files that your large input files are split into. If you do configure this property, ensure that you specify a large enough number of files so that all records in the input file can be stored in memory.
Splitting the input files into smaller files requires processing time and disk space. If your system has sufficient physical memory and uses a 64-bit JVM, you can increase the JVM maximum heap size to handle large input files. When enough memory is allocated that the preprocess does not need to split the input files, the difference file is generated faster. For more information about tuning your JVM performance, including your JVM heap size, see JVM performance tuning.
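The following sketch shows one way that the split behavior might be configured explicitly. It assumes that numberOfSplitFiles and cleanupSplitFiles can be set as load item properties in the load order configuration file, in the same way as the dataReaderPreprocessOnly property in the example in this topic. The value 4 is illustrative only and is not a recommendation; choose a number large enough that each smaller file fits in memory.

```xml
<_config:DataLoadConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <!-- Illustrative value: split the old and new input files into 4 smaller files each. -->
      <_config:property name="numberOfSplitFiles" value="4"/>
      <!-- Keep the temporary smaller files after the merge so that they can be inspected. -->
      <_config:property name="cleanupSplitFiles" value="false"/>
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv"
                                  oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv"/>
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
```

Omitting numberOfSplitFiles leaves the preprocessor to determine the number of files automatically, which is the default behavior.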
Procedure
- Update the Data Load business object configuration file for your business object to include the file difference preprocessor when the Data Load utility runs.
- Update the Data Load load order configuration file for your load order to identify the location of the files to be compared. To run the file difference preprocess, you must identify two files, for example, the previously loaded version and the newest version of a file. When you specify the oldLocation for a file, you are indicating that the file difference preprocess is to run. The file type of the files that you identify for comparison determines the data reader (CSV or XML) that is used in the preprocessing.
- Run the Data Load utility. The file difference preprocess runs, generates the difference file, and saves it. Depending on your configuration, the Data Load utility can then load the difference file, or stop the Data Load process so that you can review the difference file and load it later.
Example
<_config:DataLoadConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <!-- Stop after the difference file is generated and saved. -->
      <_config:property name="dataReaderPreprocessOnly" value="true"/>
      <!-- location is the newest file; oldLocation is the previously loaded file. -->
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv"/>
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
In this example configuration file, the two files are located in a temporary sample directory. After the preprocessor completes, the generated difference file, CatalogEntryNew_diff_2013.03.28_12.01.01.001.csv, is saved in the same temporary directory. This sample includes the dataReaderPreprocessOnly configuration property that causes the Data Load utility to run only the file difference preprocessor. To run the preprocessor, the configuration file specifies that the Data Load utility is to run in Replace mode.
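As a variation on this example, the following sketch omits the dataReaderPreprocessOnly property so that the Data Load utility can continue and load the generated difference file, and it saves that file to a separate directory. The directory c:/temp/dataload/diff is hypothetical, and setting diffFileDirectory in the load order configuration file is an assumption based on how dataReaderPreprocessOnly is set in the example.

```xml
<_config:DataLoadConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <!-- Hypothetical directory for the generated difference file. -->
      <_config:property name="diffFileDirectory" value="c:/temp/dataload/diff"/>
      <!-- dataReaderPreprocessOnly is not set, so the load continues after the preprocess. -->
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv"
                                  oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv"/>
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
```

Running the preprocess only, as in the original example, remains the safer choice when you want to review the difference file before anything is loaded.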