The indexing process
The indexing process involves adding Documents to an IndexWriter. The searching process involves retrieving Documents from an index by using an IndexSearcher. Solr can index both structured and unstructured content.
Structured content is organized into predefined fields. For example, a product description's predefined fields include title, manufacturer name, description, and color.
Unstructured content, in contrast, lacks structure and organization. For example, it can consist of PDF files or content from external sources (such as tweets) that do not follow any predefined patterns.
Data Import Handler
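WebSphere Commerce Search uses the Solr Data Import Handler (DIH) to import data into the search index. The handler is registered as a request handler in solrconfig.xml: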
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wc-data-config.xml</str>
<str name="update.chain">wc-conditionalCopyFieldChain</str>
</lst>
</requestHandler>
Fetching, reading, and processing data
The data configuration file (wc-data-config.xml) defines:
- How to fetch data, such as by using queries or URLs.
- What to read, such as result set columns or XML fields.
- How to process the data, such as by modifying, adding, or removing fields.
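For example, the default configuration defines the following data sources: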
<dataConfig>
<dataSource name="WC database"
type="JdbcDataSource"
jndiName="jdbc/WCDB"
readOnly="true"
autoCommit="true"
transactionIsolation="TRANSACTION_READ_COMMITTED"
holdability="CLOSE_CURSORS_AT_COMMIT"
/>
<dataSource name="unstructuretmpfile"
type="FileDataSource"
basepath="W:\WCDE_I~1\search\solr\home\MC_10001_\en_US\CatalogEntry\unstructured\temp/"
/>
- The WebSphere Commerce database is the data source for structured data.
- The unstructuretmpfile data source specifies the path to the unstructured data.
In addition, wc-data-config.xml contains the following content by default:
- Three documents: one for CatalogEntry, one for bundles, and one for dynamic kits.
- The CatalogEntry document contains the following entities: Product and attachment_content.
Each entity in wc-data-config.xml can define the following queries and transformers:
- query: Identifies the data that populates the fields of the Solr document during full imports.
- deltaImportQuery: Identifies the data that populates the fields during delta imports.
- deltaQuery: Identifies the primary keys of the current entity that changed since the last index time.
- deletedPkQuery: Identifies the documents to remove.
- Transformer: Each set of fields that an entity fetches can either be consumed directly by the indexing process or passed through transformers, which modify existing fields or create new ones.
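The columns that are returned by the entity queries are mapped to index fields. For example: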
<field column="CATENTRY_ID" name="catentry_id" />
<field column="MEMBER_ID" name="member_id" />
<field column="CATENTTYPE_ID" name="catenttype_id_ntk_cs" />
<field column="PARTNUMBER" name="partNumber_ntk" />
Where: CATENTRY_ID is the database column name and catentry_id is the index field name.
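The following sketch pulls the queries, field mappings, and transformer attribute described above together into a single entity definition inside a document. It is for illustration only and is not the configuration that ships with WebSphere Commerce: the simplified SQL and the table, column, and field names (TEMP_CATENTRY, TEMP_CATENTRY_DELETE, LASTUPDATE, CATGROUP_LIST, category) are placeholders that stand in for the flattened temporary tables that the preprocess step creates.
<document>
  <entity name="Product"
          dataSource="WC database"
          pk="CATENTRY_ID"
          transformer="RegexTransformer"
          query="SELECT CATENTRY_ID, PARTNUMBER, CATGROUP_LIST FROM TEMP_CATENTRY"
          deltaImportQuery="SELECT CATENTRY_ID, PARTNUMBER, CATGROUP_LIST FROM TEMP_CATENTRY
                            WHERE CATENTRY_ID = '${dataimporter.delta.CATENTRY_ID}'"
          deltaQuery="SELECT CATENTRY_ID FROM TEMP_CATENTRY
                      WHERE LASTUPDATE &gt; '${dataimporter.last_index_time}'"
          deletedPkQuery="SELECT CATENTRY_ID FROM TEMP_CATENTRY_DELETE
                          WHERE LASTUPDATE &gt; '${dataimporter.last_index_time}'">
    <!-- Map the columns that the queries return to index fields -->
    <field column="CATENTRY_ID" name="catentry_id" />
    <field column="PARTNUMBER" name="partNumber_ntk" />
    <!-- RegexTransformer example: split a comma-separated column into a multivalued field -->
    <field column="category" splitBy="," sourceColName="CATGROUP_LIST" />
  </entity>
</document>
During a delta import, DIH first runs deltaQuery to collect the primary keys that changed since ${dataimporter.last_index_time}, then runs deltaImportQuery once for each returned key; deletedPkQuery identifies the keys of documents that must be removed from the index.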
Setting up and building the search index
The search index is set up and built by using the following utilities:
- setupSearchIndex, which is run once per master catalog. For more information, see Setting up the search index.
- di-preprocess, which extracts and flattens WebSphere Commerce data and then outputs the data into a set of temporary tables inside the WebSphere Commerce database. The data in the temporary tables is then used by the index building utility to populate the data into search indexes by using the Data Import Handler (DIH).
- di-buildindex, which crawls the temporary tables that are populated by the preprocess utility and then populates the Solr index.
For more information, see Preprocessing and building the search index.
Crawling unstructured content
For unstructured content, the Solr ExtractingRequestHandler uses Apache Tika to allow users to upload binary files and other unstructured data to Solr, which then extracts and indexes the content.
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
- Tika automatically determines the type of the input document and produces an XHTML stream, which is then fed to a SAX ContentHandler.
- Solr then reacts to the Tika SAX events and creates the fields to index.
- Tika produces metadata information such as Title, Subject, and Author.
- All of the extracted text is added to the content field. Setting fmap.content to text causes that content to be added to the text field instead.
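The fields that these fmap.* settings map into must exist in the index schema. The following is a minimal sketch of what the corresponding schema.xml entries can look like; the actual field and type names in the WebSphere Commerce Search schema can differ, so treat it as an illustration of the pattern (a catch-all text field plus an ignored_* dynamic field that drops unmapped metadata) rather than the shipped schema.
<!-- Catch-all field that fmap.content maps the extracted text into -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />
<!-- Field that fmap.a maps captured link hrefs into -->
<field name="links" type="string" indexed="true" stored="true" multiValued="true" />
<!-- Metadata fields that are not explicitly mapped receive the "ignored_" prefix
     (uprefix) and are silently discarded by this dynamic field -->
<dynamicField name="ignored_*" type="ignored" multiValued="true" />
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true" />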
For more information about unstructured content, see Unstructured and site content.
For more information about the WebSphere Commerce index schema, see WebSphere Commerce Search index schema and WebSphere Commerce Search index schema definition.