The indexing process
The indexing process involves adding Documents to an IndexWriter. The searching process involves retrieving Documents from an index by using an IndexSearcher. Solr can index both structured and unstructured content.
Structured content is organized into predefined fields. For example, a product description has predefined fields such as title, manufacturer name, description, and color.
Unstructured content, in contrast, lacks structure and organization. For example, it can consist of PDF files or content from external sources (such as tweets) that do not follow any predefined patterns.
Data Import Handler
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wc-data-config.xml</str>
<str name="update.chain">wc-conditionalCopyFieldChain</str>
</lst>
</requestHandler>
dataImport is the only import handler that is used in HCL Commerce. The Solr project also includes the updateHandler method, but it is not supported in HCL Commerce Search.
Fetching, reading, and processing data
The Data Import Handler configuration file defines the following behavior:
- How to fetch data, such as using queries or URLs.
- What to read, such as result set columns or XML fields.
- How to process, such as modifying, adding, or removing fields.
The solrhome/v3/CatalogEntry/conf/wc-data-config.xml file contains the following content:
<dataConfig>
<dataSource name="WC database"
type="JdbcDataSource"
jndiName="jdbc/WCDB"
readOnly="true"
autoCommit="true"
transactionIsolation="TRANSACTION_READ_COMMITTED"
holdability="CLOSE_CURSORS_AT_COMMIT"
/>
<dataSource basePath="${solr.core.instanceDir}/../Unstructured/temp/"
name="unstructuretmpfile"
type="com.ibm.commerce.solr.handler.RequestFileDataSource"
/>
- The HCL Commerce database is the data source for structured data.
- The unstructuretmpfile data source specifies the path to the unstructured data.
In addition, the file contains the following types of content by default:
The following three documents exist: one for CatalogEntry, one for bundle, and one for dynamic kit.
The CatalogEntry document contains the following entities: Product and attachment_content.
- query: Identifies the data to populate fields of the Solr document during full imports.
- deltaImportQuery: Identifies the data to populate fields during delta imports.
- deltaQuery: Identifies the primary keys of the current entities that changed since the last index time.
- deletedPkQuery: Identifies the documents to remove.
- Transformer: Every set of fields that are fetched by the entity can be consumed either directly by the indexing process, or changed by using transformers to modify a field or create a new set of fields.
<field column="CATENTRY_ID" name="catentry_id" />
<field column="MEMBER_ID" name="member_id" />
<field column="CATENTTYPE_ID" name="catenttype_id_ntk_cs" />
<field column="PARTNUMBER" name="partNumber_ntk" />
In each mapping, column is the database column name and name is the index field name. For example, CATENTRY_ID is the database column name and catentry_id is the index field name.
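To show how the entity attributes and the field mappings fit together, the following is a minimal sketch of a single Data Import Handler entity. It is not the configuration that ships with HCL Commerce. The SQL statements, the use of the LASTUPDATE and MARKFORDELETE columns, the TemplateTransformer, and the doc_label field are assumptions that are made for illustration only. The ${dataimporter.last_index_time} and ${dataimporter.delta.CATENTRY_ID} variables are standard Data Import Handler variables.

```xml
<!-- Illustrative sketch only. The SQL, the LASTUPDATE and MARKFORDELETE
     columns, and the doc_label field are assumptions, not shipped content. -->
<document name="CatalogEntry">
  <entity name="Product"
          dataSource="WC database"
          pk="CATENTRY_ID"
          transformer="TemplateTransformer"
          query="SELECT CATENTRY_ID, MEMBER_ID, CATENTTYPE_ID, PARTNUMBER
                 FROM CATENTRY"
          deltaQuery="SELECT CATENTRY_ID FROM CATENTRY
                      WHERE LASTUPDATE > '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT CATENTRY_ID, MEMBER_ID, CATENTTYPE_ID, PARTNUMBER
                            FROM CATENTRY
                            WHERE CATENTRY_ID = ${dataimporter.delta.CATENTRY_ID}"
          deletedPkQuery="SELECT CATENTRY_ID FROM CATENTRY
                          WHERE MARKFORDELETE = 1">
    <!-- Fields that are consumed directly by the indexing process -->
    <field column="CATENTRY_ID" name="catentry_id" />
    <field column="MEMBER_ID" name="member_id" />
    <field column="CATENTTYPE_ID" name="catenttype_id_ntk_cs" />
    <field column="PARTNUMBER" name="partNumber_ntk" />
    <!-- Hypothetical field that the transformer creates from a fetched column -->
    <field column="doc_label" template="catentry-${Product.CATENTRY_ID}" />
  </entity>
</document>
```

In this sketch, a full import runs only query. A delta import first runs deltaQuery to collect the changed primary keys, then runs deltaImportQuery once for each key, and uses deletedPkQuery to find the documents to remove from the index.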
Crawling unstructured content
For unstructured content, the Solr ExtractingRequestHandler uses Apache Tika to allow users to upload binary files and unstructured data to Solr. Solr then extracts and indexes the content.
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
- Tika automatically determines the input document type and produces an XHTML stream that is then fed to a SAX ContentHandler.
- Solr then reacts to the Tika SAX events and creates the fields to index.
- Tika produces metadata information such as Title, Subject, and Author.
- All of the extracted text is added to the content field. Setting fmap.content to text causes the content to be added to the text field instead.
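The uprefix and fmap.div settings in the request handler send unwanted fields to names that start with ignored_. For example, with lowernames set to true, the Tika metadata name Content-Type becomes content_type. If the schema does not define that field, uprefix renames it to ignored_content_type. For such fields to be discarded, the schema needs a matching dynamic field. The following sketch is modeled on the stock Solr example schema. It is an assumption that the HCL Commerce index schema contains equivalent definitions.

```xml
<!-- Sketch based on the stock Solr example schema. The presence of these
     definitions in the HCL Commerce schema is an assumption. -->

<!-- A field type that is neither indexed nor stored, so its values are dropped -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true" />

<!-- Any field whose name starts with ignored_ uses the type above -->
<dynamicField name="ignored_*" type="ignored" multiValued="true" />
```

With definitions like these, metadata that Tika extracts but that the schema does not recognize is accepted without error and is not added to the index.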
For more information about unstructured content, see Unstructured and site content.
For more information about the HCL Commerce index schema, see HCL Commerce Search index schema and HCL Commerce Search index schema definition.