Configuring the Data Load utility to run a file difference preprocess

If you routinely load the same generated Data Load input file from an external system or source, you can choose to run a file difference preprocess as part of the Data Load process to ensure that you are loading only new changes when you load your newest input file.

Note: Running a file difference preprocess is not required. If you are planning to configure a file difference to run, ensure that you review the best practices and scenarios for running this preprocess that is described in the following topic, Data Load file difference preprocessing.

Before you begin

Before you configure and run a file difference preprocessing, ensure that you complete the following prerequisites:

Identify the two Data Load input files that you want to compare and generate a difference file from.
Successfully load the old input file into your WebSphere Commerce database. If any records are in the old file and in the new file that the old file is compared with, these records are not in the generated difference file or in your database. To prevent records from being omitted without ever being loaded into your database, verify that the contents of the old file are loaded into your database.

About this task

A file difference tool is available as a data reader preprocessor when you run the Data Load utility. This file difference preprocessor can be used to read and compare two CSV or two XML files. The preprocessor uses a different data reader class to read CSV files (CSVFileDiffPreprocessor) and XML files (XmlFileDiffPreprocessor) so you cannot compare a CSV file to an XML file.

Configuration properties for file difference preprocessor

By default there are a number of configuration properties available for you to use to configure your preprocess to meet your requirements. The following table lists the available properties. These properties are all specified within the load order configuration file except for the keyColumns property, which must be specified in the business object configuration file.

Configuration property	Description
`keyColumns`	Required. Key columns are the CSV columns or XML elements that uniquely identify a record in your input file.
`numberOfSplitFiles`	Optional. Use this property to specify how many files that the input files are to be split into when the old input file is too large to be stored in memory. It is recommended that the `numberOfSplitFiles` is specified in the configuration file to avoid the extra calculation by the file difference preprocessor. For example, if your old input file is not large, for instance, less than 200 MB, you can specify the `numberOfSplitFiles` value to be 1. See the note in the following procedure for setting this property for details about determining the number of split files to specify.
`checkDuplicatedKeys`	Optional. Specify this property as true to perform an extra check for duplicate entries. It is recommended that you specify `checkDuplicatedKeys` to be false if you know that you do not have duplicate records in your input files. This setting avoids the extra checking by the file differences preprocessor.
`diffFileDirectory`	Optional. This property is for changing the directory where the generated difference file is saved.
`dataReaderPreprocessOnly`	Optional. Specify this property as true stops the Data Load process after the difference file is generated and saved.
`cleanupSplitFiles`	Optional. If your input files are split, you can set this property to false to save the temporary generated smaller files. If this property is set to true or omitted, the generated smaller files are deleted after the files are merged.
`columnBasedCompare`	Optional. Indicates whether the preprocessor is to use a column-based comparison to compare files. You can set the following values for this property: `true` Each line in the files are compared column by column instead of by the full records. The preprocessor determines whether the value in the same column in both files is the same. The preprocessor can ignore minor differences such as the following file contents: The same columns exist in both files, but in a different order. Extra commas at the end of a record in one file but not the other file. Quotation marks are found for a column value in one file but the value does not include quotation marks in the other file. Note: The difference between quotation marks in files applies only to CSV files, such as for CSV tokens. If a record in an XML file includes no quotation marks, but the record in the other XML file includes double quotation marks, the record is considered to be changed. `false` The default value. The preprocessor compares the full records in each file line by line. If columns are in a different arrangement, all records can be considered different. The preprocessor also considers records to be different when minor differences are found, such as extra commas at the end of a record are found. Note: Configuring a column-based comparison can take longer to complete than using the default file difference preprocessor behavior. With a column-based comparison, the preprocessor must complete an extra look-up between the files.
`includeCompareColumns`	Optional. Indicates whether the file difference preprocessor is to compare only specific columns. Use a comma-separated list as the value for this property to identify the columns to be compared. Any column that is not in this list is ignored during the file comparison. When you include this property, the `columnBasedCompare` property is configured by default with a value of `true` when the property is not explicitly configured. If you include both the `includeCompareColumns` and the `excludeCompareColumns` property, the `includeCompareColumns` property takes precedence. If you include the `includeCompareColumns` property without a value and include the `excludeCompareColumns` property with a value, the `excludeCompareColumns` property takes priority. Note: If you include the `includeCompareColumns` property and do not set a value and the `excludeCompareColumns` property is not set with a value, the file difference preprocessor compares only the key columns. The generated difference file then includes only the records from the new input file that have a key column value that is not in the old input file.
`excludeCompareColumns`	Optional. Indicates whether the file difference preprocessor is to exclude specific columns from being compared. Use a comma-separated list as the value for this property to identify the columns to be excluded from comparison. All other columns are compared. When you include this property, the `columnBasedCompare` property is configured by default with a value of `true` when the property is not explicitly configured. If you include both the `includeCompareColumns` and the `excludeCompareColumns` property, the `includeCompareColumns` property takes precedence. If you include the `excludeCompareColumns` property and do not set a value, the file difference preprocessor ignores the property.

Configuring the file difference to handling large input files

The file difference preprocess loads the old input file into a hash map in your system memory and compares this hash map to the new input file to generate a difference file. If the old file is too large to be loaded into your system memory, the file difference preprocessor splits the file into smaller files. The new input file is also split into the same number of smaller files. The preprocess generates a difference file for each pairing of these smaller files and then merges these files into a single larger difference file.

By default, the file difference preprocessor automatically determines the number of files that are required to split a large file. You can choose to configure the number of files that your large input files are split into. If you do configure this property, ensure that you specify a large enough number of files so that all records in the input file can be stored in memory.

Splitting the input files into smaller files does require processing time and disk space. If your system has sufficient physical memory and uses a 64-bit JVM, you increase the JVM maximum heap size to handle large input files. If your system does have sufficient memory that is allocated and the preprocess does not split the input files, the difference file can be generated faster. For more information about tuning your JVM performance, including your JVM heap size, see JVM performance tuning.

Procedure

Update the Data Load business object configuration file for your business object to include the file difference preprocessor when the Data Load utility runs.
1. Open the wc-loader-object.xml configuration file, where object is the business user object you are loading.
  For example, wc-loader-catalog-entries.xml.
2. Define the required property keyColumns for the data object that you are loading.
  To run a file difference preprocess, you must specify the key columns for your input files in the data reader element of the Data Load business object configuration file. These key columns are the CSV columns or XML elements that uniquely identify a record in your input file.
  - CSV files:
```
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" useHeaderAsColumnName="true" >
  <_config:property name="keyColumns" value="key_columns" />
</_config:DataReader>
```
  - XML files:
```
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.XmlReader" >
  <_config:property name="keyColumns" value="key_columns" />
</_config:DataReader>
```
  Where the code in bold text, adds the configuration for the property keyColumns. Replace the value "key_columns with a list of the columns or elements that are required to uniquely identify a record in your input files. If multiple columns or elements are required to identify a record, separate these key columns by a comma. For example, if you are loading catalog entries, your business object configuration file can specify a single key column, PartNumber.
3. Save and close your file.
Update the Data Load load order configuration file for your load order to identify the location of the files to be compared. To run the file difference preprocess, you must identify two files.
For example, the previously loaded version and the newest version of a file.
When you specify the oldLocation for a file, you are indicating that the file difference preprocess is to run. The file types of files you identify for comparison determines the data reader (CSV or XML) that is used in the preprocessing.
1. Open the wc-dataload-object.xml configuration file, where object is the business user object you are loading.
  For example, wc-dataload-catalog-entries.xml.
2. Add the configuration statement that identifies the directory location where the old and new input files exist:
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
   <_config:DataSourceLocation location="new_file_location" oldLocation="old_file_location" />
</_config:LoadItem>
```
  Where the code in bold text adds the configuration for identifying the input file locations. Replace the value for new_file_location with the value for the new input file, and replace the value for old_file_location with the value for the old (previously loaded) input file. For example,
```
 <_config:DataSourceLocation location="CatalogEntry_new.csv" oldLocation="CatalogEntry_old.csv" />
```
3. Optional: Add a configuration statement to change the directory where the generated difference file is saved. By default, this file is saved in the same directory as the identified new input file. This generated file is saved with the same file name as the new input file, but with _diff_ and a timestamp appended to the name
  For example, if your new input file is named catentry.xml, the generated difference file is named catentry_diff_2012.01.01_12.01.01.001.xml.
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
  <_config:property name="diffFileDirectory" value="difference_file_directory" />
  <_config:DataSourceLocation location="CatalogEntry_new.csv" oldLocation="CatalogEntry_old.csv" />
</_config:LoadItem>
```
  Where the code in bold text adds the configuration for changing the directory. Replace the value for difference_file_directory with the value for your directory. For example,
```
<_config:property name="diffFileDirectory" value="c:\dataload" />
```
4. Optional: Add the configuration statement to stop the Data Load process after the difference file is generated and saved. If you want to review the generated difference file before you load this file, add this configuration property. This configuration provides you with the ability to generate the difference file at any time, and then load the data into your system when running the Data Load utility has the least impact on your system performance.
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
  <_config:property name="dataReaderPreprocessOnly" value="true"/>
  <_config:DataSourceLocation location="new_file_location" oldLocation="old_file_location" />
</_config:LoadItem>
```
  Where the code in bold text adds the configuration property to run only the file difference preprocess. If you set the value for this property to true, the Data Load utility will stop running after the difference file is generated and saved. The default value for this property is false, which results in the Data Load process to continue after it saves the difference file and loads it into your database.
5. Optional: Add the configuration statement to specify how many files that the input files are to be split into when the old input file is too large to be stored in memory. If you split this file, each new file that the original old input file is split into is compared with the new file. A smaller difference file is generated for each comparison between the new file and the portions of the old input file. The file difference preprocessor, then merges the generated difference files into a single file.
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
  <_config:property name="numberOfSplitFiles" value="number_of_files" />
  <_config:DataSourceLocation location="new_file_location" oldLocation="old_file_location" />
</_config:LoadItem>
```
  Where the code in bold text, adds the configuration for the property numberOfSplitFiles. Replace the value "number_of_files with the number of files that you want the old input file split into. For example, the following statement configures the file difference preprocess to split the old input file into three files:
```
 <_config:property name="numberOfSplitFiles" value="3" />
```
  Note: If you do not specify this property, the file difference preprocessor calculates the number of files that are required to split your input files into. This calculation is based on an estimation of the memory that is required to load the old file into a hash map. You can run the file difference preprocessor with the following trace turned on to view the time that is required for the preprocessor to make this calculation and view the number of files that your input files are split into.
  -Dcom.ibm.commerce.foundation.dataload.level=CONFIG
  You can use this trace output to specify the value for the numberOfSplitFiles property when you next run a file difference for newer versions of your input files. By determining the value for this property in this method, you can save the time that is required for the preprocessor to calculate this value automatically when the preprocessor runs.
6. Optional: Add the configuration statement to perform an extra check for duplicate entries. If you know that your input files do not contain any duplicate keys, you can disable this extra check.
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
  <_config:property name="checkDuplicatedKeys" value="false" />
  <_config:DataSourceLocation location="new_file_location" oldLocation="old_file_location" />
</_config:LoadItem>
```
  Where the code in bold text adds the configuration property to perform an extra check for duplicate key column values during the file difference preprocess. If you set the value for this property to false, the extra check for duplicate key values does not run. If you do not specify this property, the default value is true, which results in an extra check for duplicate key values.
7. Optional: If the file difference preprocess splits input files to handle large files, the smaller generated difference files are deleted after the files are merged. If you require these files for debugging purposes, you can add a configuration statement to have the file difference preprocessor save the temporary files, not automatically delete them.
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
   <_config:property name="cleanupSplitFiles" value="false" />
  <_config:DataSourceLocation location="new_file_location" oldLocation="old_file_location" />
</_config:LoadItem>
```
  Where the code in bold text, adds the configuration for the property cleanupSplitFiles. If you specify the value for this property as false, the temporary files are not deleted after the files are merged. By default, the value for this property is true.
8. Optional: Configure file difference preprocessor to complete a column-based comparison of records between the files instead of comparing each full record as a string. With the column-based comparison, the arrangement of the columns or XML elements in the files can be ignored. The column-based comparison also provides you with the ability to set the columns to be compared by configuring the preprocess to include or exclude specified columns. If your files include columns or XML elements that are arranged differently between the files, configure the preprocess to use the column-based comparison.
  Add the columnBasedCompare property to the data reader configuration to configure the preprocessor to use a column-based comparison. For example:
```
<_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
  <_config:property name="dataReaderPreprocessOnly" value="true"/>
  <_config:property name="columnBasedCompare" value="true" />
  <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv" />
</_config:LoadItem>
```
  When you configure a column-based comparison, you can configure the preprocessor to ignore specific columns, or to compare only specific columns. You must indicate the columns to be excluded or included in the comparison with a comma-separated list. The column names are case-sensitive and must match the column headings in both files.
  - To specify a list of columns to exclude from being compared, add the excludeCompareColumns property to the data reader configuration. For example, the following configuration excludes the columns ListPrice, Price, and Thumbnail from being compared during the file difference:
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" > <_config:property name="dataReaderPreprocessOnly" value="true"/> <_config:property name="columnBasedCompare" value="true" /> <_config:property name="excludeCompareColumns" value="ListPrice, Price, Thumbnail" /> <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv" /> </_config:LoadItem>
  - To specify a list of columns that are to be the only columns that the file difference compares, add the includeCompareColumns property to the data reader configuration. For example, the following configuration identifies that the file difference should compare only the columns PartNumber, ShortDescription, and LongDescription. All other columns are not compared:
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" > <_config:property name="dataReaderPreprocessOnly" value="true"/> <_config:property name="columnBasedCompare" value="true" /> <_config:property name="includeCompareColumns" value="PartNumber, ShortDescription, LongDescription" /> <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv" /> </_config:LoadItem>
9. Save and close your file.
Run the Data Load utility. The file difference preprocess runs and generates and saves the difference file. Depending on your configurations the Data Load utility can then load the difference file, or stop the Data Load process so that you can review the difference file and load the file later.

Example

The following example demonstrates how to configure your business object configuration file to run a file difference on the following two files:

To run the file difference preprocessor to compare these files without proceeding to load the generated difference file, your business object configuration file can resemble the following code:

<_config:DataLoadConfiguration xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace" >
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml" >
      <_config:property name="dataReaderPreprocessOnly" value="true"/>
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv" />
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>

In this example configuration file, the two files are located in a temporary sample directory. After the preprocessor completes, the generated difference file, CatalogEntryNew_diff_2013.03.28_12.01.01.001.csv, is saved in the same temporary directory. This sample includes the dataReaderPreprocessOnly configuration property that causes the Data Load utility to run only the file difference preprocessor. To run the preprocessor the configuration file specifies that the Data Load utility is to run in Replace mode.