Unstructured content indexing and handling
The information in unstructured content can be stored in several locations, including the HCL Commerce database, the file systems of servers, and the internet. Therefore, the indexing process for unstructured content uses a hybrid of data sources to create indexing information within the existing HCL Commerce Search indexing framework.
Unstructured content organization and retrieval
Unstructured content includes catalog entry information and its associated attachments. HCL Commerce attachments are searched to find the attachments that are related to a catalog entry for a specific language. Because not all attachments have a language_id, the default behavior merges the results that have a null language_id with the results for the specific language_id.
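The default merge behavior can be pictured with a minimal JavaScript sketch. It is not the product's query: the row shape, the `findAttachments` function, and the sample data are hypothetical and only illustrate how rows with a null language_id are combined with rows for the requested language_id.

```javascript
// Hypothetical row shape: one object per catalog entry attachment.
// languageId is null when the attachment is not language-specific.
function findAttachments(rows, catentryIds, languageId) {
  const wanted = new Set(catentryIds);
  return rows.filter(
    (row) =>
      wanted.has(row.catentryId) &&
      // Keep language-neutral rows together with rows for the requested language.
      (row.languageId === null || row.languageId === languageId)
  );
}

// Sample data, invented for illustration only.
const sampleRows = [
  { catentryId: 10001, languageId: -1, path: "docs/manual_a.pdf", usage: "USERMANUAL", description: "User manual" },
  { catentryId: 10001, languageId: -2, path: "docs/manual_b.pdf", usage: "USERMANUAL", description: "User manual, other language" },
  { catentryId: 10001, languageId: null, path: "docs/warranty.pdf", usage: "WARRANTY", description: "Warranty" },
  { catentryId: 10002, languageId: -1, path: "docs/other.pdf", usage: "OTHER", description: "Other" },
];

const matches = findAttachments(sampleRows, [10001], -1);
console.log(matches.map((row) => row.path));
```

With this sample data, the call returns the attachment for language_id -1 and the language-neutral warranty attachment for catalog entry 10001, and skips the rows for the other language and the other catalog entry.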
In the attachment search, the language_id comes from the HCL Commerce table, where -1 represents United States English, and the catentry_id list is the input parameter. The result of the search contains the catentry_id, attachment path, attachment usage, and attachment description. The default attachment usage type list is DOCUMENTS, USERMANUAL, WARRANTY, and OTHER. These usage types are configurable to meet your specific search requirements.
<script><![CDATA[
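// Sets the writeToFile flag on rows whose RULENAME is one of the
// default attachment usage types.
function isWriteToFile(row) {
    var ruleName = row.get('RULENAME');
    var writeToFile = "false";
    if (ruleName != null) {
        // Edit this condition to change the attachment usage types.
        if (ruleName == 'DOCUMENTS' || ruleName == 'USERMANUAL'
                || ruleName == 'WARRANTY' || ruleName == 'OTHER') {
            writeToFile = "true";
        }
    }
    row.put('writeToFile', writeToFile);
    return row;
}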
]]></script>
Where the rule names are in the check conditions.
Data Import Handler and indexing unstructured content
The data import handler handles the indexing process of unstructured content by using a hybrid of data sources to create indexing information that uses the existing HCL Commerce Search indexing framework. The TikaEntityProcessor is used to support the hybrid data.
The following diagram illustrates the role of the TikaEntityProcessor in handling unstructured content in HCL Commerce:
- 1, 2: Catalog entries are indexed from the HCL Commerce database as structured content.
- 3, 4, 5: The logic reuses the features of the DIH framework, such as looping through the SQL result set rows and passing parameters in a format similar to ${Attachment.CATENTRY_ID}. The TikaEntityProcessor uses the sourceUrl parameter with commercebase parameters to fetch content from the internet, parses the binary content, and returns the results to the unstructured content index. Next, the TikaEntityProcessor appends the text content to a catentryId.txt file, which is located in the temp folder under the unstructured core root folder.
The commercebase parameters can be customized to meet your specific search requirements. The tikacontentfield and tikaprefix parameters are directly mapped to the fmap.content and uprefix Solr Cell parameters. For more information, see ExtractingRequestHandler.
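As a rough illustration of the parameter passing and the append step described above, the following Node.js sketch resolves a ${Attachment.CATENTRY_ID} style placeholder from the current row and appends already-extracted text to a catentryId.txt file. It does not call the DIH framework or the TikaEntityProcessor; the `resolveTemplate` and `appendExtractedText` functions, the temporary folder, and the sample values are all hypothetical.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Replace ${Entity.COLUMN} placeholders with values from the current rows.
// Placeholders that cannot be resolved are left untouched.
function resolveTemplate(template, context) {
  return template.replace(/\$\{(\w+)\.(\w+)\}/g, (match, entity, column) => {
    const row = context[entity];
    return row && row[column] !== undefined ? String(row[column]) : match;
  });
}

// Append extracted text to <tempDir>/<catentryId>.txt, creating the file if needed.
function appendExtractedText(tempDir, catentryId, text) {
  const target = path.join(tempDir, `${catentryId}.txt`);
  fs.appendFileSync(target, text + "\n", "utf8");
  return target;
}

// Hypothetical current row of an entity named Attachment.
const context = { Attachment: { CATENTRY_ID: 10001 } };
const fileName = resolveTemplate("${Attachment.CATENTRY_ID}.txt", context);

// A throwaway folder stands in for the temp folder under the core root.
const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), "unstructured-"));
appendExtractedText(tempDir, 10001, "text extracted from the user manual");
const written = appendExtractedText(tempDir, 10001, "text extracted from the warranty");
console.log(fileName, fs.readFileSync(written, "utf8"));
```

Appending, rather than overwriting, means that the text of every attachment of one catalog entry ends up in the same file, which is what lets a later step read a single file per catalog entry.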
The buildindex RESTful call is used to index the unstructured content.
Structured content data-import handler process updates to index unstructured content
Structured objects might need to be searched by the content of their related unstructured content. Therefore, the default structured content DIH indexing process must also read the unstructured content. The PlainTextEntityProcessor is used to read the content of the temporary files and index them in the defined field. It uses the temporary files that the TikaEntityProcessor creates during the unstructured content DIH process.
The basePath data source in the DIH configuration file defines the temporary folder location. In this configuration, onError="continue" is set so that if the file does not exist, or if other errors occur, the DIH process ignores the error and continues running. The column name of the unstructure field is a fixed value, plainText, and must not be changed.
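The read step and the effect of onError="continue" can be sketched in Node.js as follows. This is only an approximation of the behavior described above, not the PlainTextEntityProcessor itself; the `addPlainText` function, the row shape, and the throwaway folder that stands in for basePath are hypothetical.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Read <basePath>/<catentryId>.txt into the plainText column of the row.
// Mirrors onError="continue": a missing or unreadable file is skipped, not fatal.
function addPlainText(row, basePath) {
  const file = path.join(basePath, `${row.catentryId}.txt`);
  try {
    row.plainText = fs.readFileSync(file, "utf8");
  } catch (err) {
    // Leave the row without plainText and carry on with the next row.
  }
  return row;
}

// Only catalog entry 10001 has a temporary file in this sample.
const basePath = fs.mkdtempSync(path.join(os.tmpdir(), "unstructured-"));
fs.writeFileSync(path.join(basePath, "10001.txt"), "user manual text", "utf8");

const rows = [{ catentryId: 10001 }, { catentryId: 10002 }].map((row) =>
  addPlainText(row, basePath)
);
console.log(rows);
```

In this sample, the first row gains a plainText value and the second row is returned unchanged, so processing continues even though no file exists for it.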