Unstructured and site content
Unstructured site content includes documents that do not adhere to a specific data model, such as product attachments in various formats. For example, user manuals and warranty information are considered unstructured content, because their elements, construction, and organization are typically unknown and can vary by file type.
Although the WebSphere Commerce database might not store the unstructured content, the content can still be indexed and retrieved. For example, a search for laptop can return unstructured content, such as attachments in .pdf or .doc format, that contains the laptop keyword.
Site content
When working with search index types, site content is categorized under the catalog entry search index.
Site content includes HTML and other site files from WebSphere Commerce starter stores. It is fetched and crawled by the site content crawler.
By default, WebSphere Commerce provides sample static HTML files that the site content crawler fetches and crawls to help populate the site content search index. You can configure the site content crawler to fetch additional content from WebSphere Commerce starter stores.
For more information, see Indexing site content with WebSphere Commerce search.
Supported file types
WebSphere Commerce search uses parser libraries to detect and extract metadata and structured text content from documents.
- Microsoft Office
- Excel 97-2003 (.xls)
- Java
- Classes (.class)
- Documents and text
- OpenDocument (.odt, .odp, .ods)
- Tika 0.4
- Tika 0.8
- Tika 1.0
Unstructured content schema
WebSphere Commerce search can directly extract metadata and content from the unstructured data source. Differing unstructured data formats might contain varying metadata information. For example, Microsoft Word files contain metadata such as creator, company, and created date, whereas JPEG image files contain metadata such as width and height.
Solr Cell provides a mechanism to add a prefix to the generated metadata field names. Because of this behavior, a typical schema design for unstructured content must contain at least one dynamic field, such as tika_*, to store all metadata information. The main difference between structured and unstructured content is that the names and total number of fields can vary from one unstructured document to another.
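As a rough illustration of the dynamic field design described above, the following fragment sketches how a metadata prefix and a matching dynamic field might be configured in a plain Solr setup. The uprefix parameter is the standard Solr Cell option for prefixing extracted fields that are not defined in the schema. The handler name, the wc_text type (borrowed from the field example later in this topic), and the attribute values are assumptions for illustration; the configuration that WebSphere Commerce ships might differ.

```xml
<!-- solrconfig.xml (illustrative): Solr Cell handler that prefixes
     extracted metadata fields not defined in the schema with tika_.
     The handler name and defaults are assumptions, not product settings. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">tika_</str>
  </lst>
</requestHandler>

<!-- schema.xml (illustrative): one dynamic field that accepts any
     prefixed metadata name, for example tika_creator or tika_width.
     The type and attribute values are assumptions. -->
<dynamicField name="tika_*" type="wc_text" indexed="true" stored="true" multiValued="true" />
```

With a design like this, a Microsoft Word file and a JPEG image can each contribute a different set of tika_ fields without any further schema changes.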
WebSphere Commerce search manages unstructured content by requesting Tika to parse the documents before it processes them and sends them to the WebSphere Commerce search server for eventual indexing.
Schema changes when relating structured content with unstructured content
When structured content has a relationship with unstructured content, the structured schema.xml file must contain a new field to represent the unstructured information. This new field makes it possible to query the structured objects by their unstructured content.
<field name="unstructure" type="wc_text" indexed="true" stored="false" multiValued="true" />
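Where stored="false" prevents the unstructured content from being retrieved by queries. Because the field is still indexed (indexed="true"), it can be searched even though its value is not returned in results.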