General data load best practices
The following best practices are recommended when you use the Data Load utility to load data.
- Configuration for the initial loads
- Configuration for the delta loads
- Running the data load script file
- File difference preprocessing
- Data Load utility configuration files
- Configuring the data load order file (wc-dataload.xml)
- Configuring the data load environment configuration file (wc-dataload-env.xml)
- Configuring the data load business object configuration file
- CSV input files
- Loading by unique ID
- Reversing a data load
- Tuning the Data Load utility
Configuration for the initial loads
For more information about recommended configurations during initial loads, see Scenario: Initial load.
Configuration for the delta loads
For more information about recommended configurations during delta loads, see Scenario: Delta load.
Running the data load script file
- Turn off XML validation if you are using variable substitution for integer attributes by specifying the following optional parameter:
  -DXmlValidation=false
- Turn on more tracing in the WC_installdir\logs\wc-dataload.log file by specifying the following option, which turns on FINEST tracing for all packages:
  -D.level=FINEST
  Note: For large loads, specifying the FINEST trace level causes too much tracing in the log file; instead, you can turn on tracing for a single package. Here are some examples that you might find useful:
- To get a finer log level for catalog information, specify the following trace:
  -Dcom.ibm.commerce.catalog.dataload.level=FINER
- To log SQL issues, specify the following trace:
  -Dcom.ibm.commerce.foundation.dataload.database.level=FINE
- To log Data Load performance issues, specify the following trace:
  -Dcom.ibm.commerce.foundation.dataload.level=CONFIG
- To log price-related load issues, specify the following trace:
  -Dcom.ibm.commerce.price.dataload.level=FINER
- To log inventory-related load issues, specify the following trace:
  -Dcom.ibm.commerce.inventory.dataload.level=FINER
- To log member-related load issues, specify the following trace:
  -Dcom.ibm.commerce.member.dataload.level=FINER
- To log marketing-related load issues, specify the following trace:
  -Dcom.ibm.commerce.marketing.dataload.level=FINER
- To log promotion-related load issues, specify the following trace:
  -Dcom.ibm.commerce.promotion.dataload.level=FINER
- To log Commerce Composer-related load issues, specify the following trace:
  -Dcom.ibm.commerce.pagelayout.dataload.level=FINER
- Customize the Java logging configuration file, WC_installdir\wc.ear\xml\config\dataload\logging.properties. For example, you can change the log file path, the maximum log file size and how many log files to cycle through. By default, you have one log file, and the log file is overwritten every time you run the Data Load utility.
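The Data Load utility uses standard java.util.logging handler settings, so the customizations described above can be sketched as follows. The values shown are illustrative examples, not the shipped defaults:

```properties
# Example adjustments to WC_installdir\wc.ear\xml\config\dataload\logging.properties.
# These are standard java.util.logging FileHandler properties; values are examples.

# Write the log to a different path (%g is the generation number of rotated files)
java.util.logging.FileHandler.pattern = C:/logs/wc-dataload%g.log

# Rotate after roughly 10 MB and keep 5 log files
# instead of overwriting a single file on every run
java.util.logging.FileHandler.limit = 10485760
java.util.logging.FileHandler.count = 5

# Append to an existing log rather than overwriting it
java.util.logging.FileHandler.append = true
```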
File difference preprocessing
You can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. This preprocessor compares two input files, such as a previously loaded file and a new version of that file. The preprocessor generates a difference file that contains only the records in the new file that are not in the old file, or that changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previously loaded records, running this file difference can result in shorter load times, reduce server usage time, and improve server performance.
You can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. You can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.
For more information about this preprocessor, see Data Load file difference preprocessing.
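As a simple illustration of what the preprocessor produces (file names, columns, and values here are hypothetical):

```text
previously-loaded.csv         new-version.csv               difference file
Identifier,Name,Price         Identifier,Name,Price         Identifier,Name,Price
SKU-001,Widget,9.99           SKU-001,Widget,9.99           SKU-002,Gadget,21.99
SKU-002,Gadget,19.99          SKU-002,Gadget,21.99          SKU-003,Gizmo,5.00
                              SKU-003,Gizmo,5.00
```

SKU-001 is unchanged, so it is omitted from the difference file; SKU-002 changed and SKU-003 is new, so only those two records are loaded by the Data Load utility.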
Data Load utility configuration files
- Load order configuration file (wc-dataload.xml): You can either have multiple load order configuration files or one load order configuration file that includes all the load items. If you want to load only a few load items, specify the following command-line parameter when you are running the Data Load utility:
  -DLoadOrder="loadItemName1, loadItemName2, loadItemName3"
- Environment configuration file (wc-dataload-env.xml): You need only one copy of this configuration file.
- Business object configuration files: Typically one business object configuration file corresponds to one type of input data that loads one type of business object. The file name convention is wc-loader-<business object>.xml. This file defines the DataReader, BusinessObjectBuilder, and BusinessObjectMediator used for the data load. It also defines all the customization configuration options.
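For orientation, a minimal load order file that ties these three configuration files together might look like the following sketch. The load item names, file names, and namespace are illustrative assumptions; verify them against the sample files shipped with your installation:

```xml
<_config:DataLoadConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config">
  <!-- Points to the single environment configuration file -->
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml" />
  <!-- One LoadItem per input file / business object configuration pair -->
  <_config:LoadOrder commitCount="100" batchSize="10" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogGroup"
        businessObjectConfigFile="wc-loader-catalog-group.xml">
      <_config:DataSourceLocation location="CatalogGroups.csv" />
    </_config:LoadItem>
    <_config:LoadItem name="CatalogEntry"
        businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <_config:DataSourceLocation location="CatalogEntries.csv" />
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
```

With this structure, -DLoadOrder="CatalogEntry" would load only the catalog entry item.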
Configuring the data load order file (wc-dataload.xml)
- Specify the commitCount, batchSize, and dataLoadMode properties at the LoadOrder level, so you do not need to specify them at each LoadItem level.
- Specify a commitCount that is greater than or equal to the batchSize; the commitCount must be a multiple of the batchSize.
- To minimize the impact on your production server, set both the commitCount and batchSize to 1. Specifying a large commitCount and batchSize improves data load performance; however, large values might affect the database and result in more database tables and rows being locked for a longer time.
- For easier debugging of some SQL errors, set the batchSize to 1 and turn on the database tracing. These settings can help you determine which SQL statement or input line caused the SQL error. If the batchSize is greater than 1, the JDBC batch update is enabled. With batch update enabled, relating the SQL error to the input line or SQL statement that caused the error can be difficult.
- Running the Data Load utility in insert mode can improve performance for initial loads of large amounts of data. However, in insert mode the utility does not check your database before it loads your input data; that is, it does not determine whether the data objects that you are creating exist in your database before it attempts to load and create them. This behavior can cause the load process to fail when the utility attempts to create data objects that already exist in your database. For example, suppose your input file is a CSV file that contains a line that creates a product and a product description, and a second line that loads the same product with a description in a different language. Because the utility creates the product and its description from the first line, when the utility encounters the second line, it attempts to create the product again. Since the product already exists, the load fails. To load this data, you can use one of the following methods:
- Run the Data Load utility in replace mode. In replace mode the Data Load utility generates insert, update, or delete SQL statements depending on the data that you are loading. This mode replaces existing data that is contained in your database with your input data.
- Run the Data Load utility in insert mode, but load your product and description information separately. Use one input file to load your product information, and a different input file to load all of your descriptions.
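For example, when you are debugging SQL errors on a production-like system, you might temporarily run with the most conservative settings described above (a sketch; the elements shown follow the wc-dataload.xml structure discussed on this page):

```xml
<!-- Debugging configuration: one SQL statement per JDBC batch, commit each batch,
     and replace mode so existing rows are updated instead of failing on insert -->
<_config:LoadOrder commitCount="1" batchSize="1" dataLoadMode="Replace">
  <_config:LoadItem name="CatalogEntry"
      businessObjectConfigFile="wc-loader-catalog-entry.xml">
    <_config:DataSourceLocation location="CatalogEntries.csv" />
  </_config:LoadItem>
</_config:LoadOrder>
```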
Configuring the data load environment configuration file (wc-dataload-env.xml)
- Do not specify the <_config:FilePath /> element if all your configuration files are relative to the wc-dataload.xml file.
- Encrypt your database password in your configuration file. You can run the wcs-encrypt utility in the WC_installdir\bin directory to find out the encrypted ASCII string:
  wcs-encrypt.bat/sh <plain password>
- You can also leave the setting for the database password empty. When the password is empty, you are prompted to enter the password when you run the Data Load utility. You must enter the plain unencrypted password; your password is not echoed on the console for security purposes.
- The IDResolver configuration is optional. You can specify the ID resolver cache size. For initial load, specify a large cache size, such as 1 million. For delta load, specify a cache size of 0. If you do not specify the ID resolver configuration, the default cache size is 0.
- Instead of hardcoding the data writer in each data load business object configuration file, specify the default data writer class in this configuration file.
- Set the cache size for the ID resolver to 0 when you run frequent loads that can include the same information in repeated loads, for instance, when you are running scheduled updates of your catalog data that override existing data. For more information about configuring this cache size and this file, see Configuring the data load environment settings.
- If an environment configuration file is specified on the command line, it takes precedence over the environment configuration file that is referenced in the data load order configuration file.
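Pulling these points together, an environment configuration file might look like the following sketch. The store, database values, and class names shown are illustrative assumptions; verify them against the sample files shipped with your installation:

```xml
<_config:DataLoadEnvConfiguration
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config">
  <_config:BusinessContext storeIdentifier="AuroraESite" />
  <!-- password is the wcs-encrypt output; leave password="" to be prompted at run time -->
  <_config:Database type="db2" name="mall" user="wcsadmin"
      password="EncryptedAsciiStringHere" schema="wcsadmin" />
  <!-- Large cache for initial loads; use cacheSize="0" for delta loads -->
  <_config:IDResolver
      className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl"
      cacheSize="1000000" />
  <!-- Default data writer, so each business object configuration file
       does not need to repeat it -->
  <_config:DataWriter
      className="com.ibm.commerce.foundation.dataload.datawriter.JDBCDataWriter" />
</_config:DataLoadEnvConfiguration>
```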
Configuring the data load business object configuration file
- If you are loading files in CSV format, instead of hardcoding the CSV column names in the <_config:Data> node of the <_config:DataReader> element, you can place all the column names in the first line of the CSV file. Then, you must specify firstLineIsHeader="true" in the <_config:DataReader> element.
  Note: This configuration applies only when you use the CSVReader interface as your data reader.
- When you define the mapping or table configuration in the <_config:BusinessObjectBuilder> element, if the value is from the input data, you do not need to specify the valueFrom attribute. The default is valueFrom="InputData".
- If you do not want the Data Load utility to update all columns of a table when you are loading data, you can configure a column exclusion list. Configuring a column exclusion list prevents the Data Load utility from loading data into the excluded columns of the table. This configuration lets you protect columns that are known to already be populated with data from being overwritten.
For more information, see Configuring a column exclusion list.
When you configure an exclusion list, you can set an optional attribute, forUpdateOnly="true". With this attribute set to true, when you update a table record with an input file that includes changes for an excluded column, the utility does not update the value for the excluded column with the input file value. If the same load operation includes data to create records in the table, the utility ignores the exclusion list and inserts the input file values into all columns for the new records.
- If you are loading CSV files that contain leading whitespace, you can avoid errors by
configuring the data load CSV reader to trim the token. To do so, set the trimTokenWhiteSpace property to true. For example:
  <_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="false" useHeaderAsColumnName="false">
    <_config:property name="trimTokenWhiteSpace" value="true" />
  </_config:DataReader>
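To make the valueFrom default concrete, the following mapping sketch shows how input columns might map into a business object. The class names, xpath values, and the valueFrom="Fixed" usage are illustrative of a catalog entry loader and should be checked against your own configuration files:

```xml
<_config:BusinessObjectBuilder
    className="com.ibm.commerce.foundation.dataload.businessobjectbuilder.BaseBusinessObjectBuilder"
    packageName="com.ibm.commerce.catalog.facade.datatypes.CatalogPackage"
    dataObjectType="CatalogEntryType">
  <_config:DataMapping>
    <!-- valueFrom="InputData" is the default, so it can be omitted -->
    <_config:mapping xpath="CatalogEntryIdentifier/ExternalIdentifier/PartNumber"
        value="PartNumber" />
    <_config:mapping xpath="Description/Name" value="Name" valueFrom="InputData" />
    <!-- A fixed value comes from the configuration instead of the input file -->
    <_config:mapping xpath="Description/language" value="-1" valueFrom="Fixed" />
  </_config:DataMapping>
</_config:BusinessObjectBuilder>
```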
CSV input files
- For clarity and readability, use the first line of the CSV file as a header. You can use spreadsheet software to open the CSV file to see whether your data matches the column headings.
  <_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" >
- For the flexibility of using or omitting optional columns, or rearranging the columns, use the header as the column names instead of hardcoding the columns:
  <_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" useHeaderAsColumnName="true" >
  instead of:
  <_config:Data>
    <_config:column number="1" name="Identifier" />
    <_config:column number="2" name="Name" />
    <_config:column number="3" name="ShortDescription" />
    <_config:column number="4" name="LongDescription" />
    <_config:column number="5" name="Thumbnail" />
    <_config:column number="6" name="FullImage" />
    <_config:column number="7" name="Delete" />
  </_config:Data>
- If you want to use spreadsheet software to edit and save the CSV file, ensure that you use an editor that can save in UTF-8 format, for example, OpenOffice Calc. Otherwise, the spreadsheet software might reformat your CSV file and invalidate your data for the CSVReader. For example, if you have a timestamp column in the CSV file, not saving in UTF-8 format can reformat the timestamp data according to your locale setting. The reformatted timestamp data might not be valid for use with your data load.
- By default, data load supports only the Java standard timestamp format "yyyy-MM-dd hh:mm:ss.nnnnnn". The Data Load utility can also support a customized timestamp format. If you have a customized timestamp format in your CSV file, you must specify the format in the load item configuration in the wc-dataload.xml file. For example, you can specify the following row in the <_config:LoadItem> element:
  <_config:property name="timestampPattern" value="yyyy-MM-dd HH.mm.ss" />
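Combining these practices, a CSV input file with a header line (reusing the column names from the reader configuration shown earlier; the data rows are invented samples) might look like:

```csv
Identifier,Name,ShortDescription,LongDescription,Thumbnail,FullImage,Delete
PRO-001,Classic Kettle,A stovetop kettle,A 1.5 litre stainless steel stovetop kettle,images/thumb/pro-001.jpg,images/full/pro-001.jpg,0
PRO-002,Travel Mug,An insulated mug,A 400 ml double-walled travel mug,images/thumb/pro-002.jpg,images/full/pro-002.jpg,0
```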
Loading by unique ID
Specifying the unique ID is optional when you are using the Data Load utility. However, if you specify the unique ID, you save the processing time that is required to resolve the ID, and performance is improved.
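For example, if your input data already carries the generated keys, say, exported from another environment, a CSV file could supply the unique ID directly so that ID resolution is skipped for those records. The column name that maps to the unique ID depends on your business object configuration file; the name and values below are assumptions for illustration:

```csv
CatalogEntryUniqueId,PartNumber,Name
10001,PRO-001,Classic Kettle
10002,PRO-002,Travel Mug
```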
Reversing a data load
To reverse a data load, run the Data Load utility with dataLoadMode="Delete" specified in the <_config:LoadOrder> element in your wc-dataload.xml data load configuration file. If you are reversing a load, specify the following configuration row within the <_config:LoadOrder> element:
  <_config:property name="actionOnError" value="1" />
This configuration row ensures that the Data Load utility continues the process when a soft delete error occurs. Such errors occur because dependent child records no longer exist as a result of a cascade delete.
Tuning the Data Load utility
To reduce any performance impact from running the Data Load utility, you can adjust the Idresolver cache size and parameters that are related to the utility. For more information, see Data Load utility performance tuning.