Solr schema file (schema.xml)

Solr organizes its data into documents, which consist of fields. In WebSphere Commerce, one document consists of all the descriptions of a product.

The schema.xml file contains information about the Solr fields and how they are analyzed and filtered during searches. Different field types can contain different types of data. Solr uses the schema.xml file to determine how to build indexes from the input documents, and how to perform index and query time processing.
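As a rough orientation, the overall layout of a schema.xml file can be sketched as follows. This is an illustrative skeleton only: the schema name and version values are assumptions, and the wrapper elements shown follow the layout used by older Solr releases, so the file in your environment might differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative skeleton; the name and version values are assumptions -->
<schema name="example" version="1.5">
  <types>
    <!-- fieldType definitions: how values are stored, analyzed, and filtered -->
  </types>
  <fields>
    <!-- field and dynamicField definitions: named slots in each document -->
  </fields>
  <!-- Remaining options: uniqueKey, defaultSearchField, solrQueryParser, copyField -->
</schema>
```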

The schema.xml file contains the following fields and options:

Field type

Field types define a list of different data types for values. You can define strings, numeric types, or new types. For example:


<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

A more complicated field definition resembles the following example:


<fieldType name="wc_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" 
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" 
                catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
   </fieldType>

In this example, values of the wc_text field type are processed by the analyzers, tokenizers, and filters that are listed, which together influence search relevancy.

Text analyzers

Text analyzers map the source string of text to the final list of tokens. This process occurs during both indexing and querying. You can define different analyzer chains for indexing and querying, depending on your business needs. For example, the EdgeNGramFilterFactory is better suited to the index analyzer than to the query analyzer. Typically, however, the same analyzer chain is used for both, because a search succeeds only when the query tokens match the indexed tokens.

Different analyzers return different results. For example, an analyzer that lowercases tokens lets a search for flash return documents that contain the term Flash.
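The following fragment is a minimal sketch of a field type with different index and query chains. The field type name example_prefix_text and the gram sizes are assumptions chosen for illustration; they are not taken from the WebSphere Commerce schema.

```xml
<!-- Hypothetical field type: edge n-grams are generated at index time only -->
<fieldType name="example_prefix_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Indexes the token flash as fl, fla, flas, and flash -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercasing in both chains lets the query Flash match the indexed token flash -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Leaving the n-gram filter out of the query chain means that a partial query such as fla is compared as typed against the indexed prefixes, rather than being expanded into further grams of its own.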

Tokenizers

Tokenizers break field data into tokens. The following example shows how an input of O'Reilly's wi-fi guide is tokenized:

Tokenizer example
Tokenizer	Output
Standard tokenizer (StandardTokenizerFactory)	`O Reilly s wi fi guide`
Keyword tokenizer	`O'Reilly's wi-fi guide`
Whitespace tokenizer	`O'Reilly's wi-fi guide`
WordDelimiterFilterFactory	`o Reilly s oReillys Wi Fi WiFi guide`

By using the WordDelimiterFilterFactory, you can make a search for sailboat also match sail boat or sail-boat.
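One way to compare these behaviors is to declare field types that differ only in the tokenizer, and then inspect their output with the analysis tool of your Solr installation. The field type names in this sketch are hypothetical, and the fragment belongs in the field type section of schema.xml.

```xml
<!-- Hypothetical field types that differ only in how they tokenize -->
<fieldType name="example_standard" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="example_keyword" class="solr.TextField">
  <analyzer>
    <!-- Emits the entire field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="example_whitespace_wdf" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="1" adds the joined form, so sail-boat also yields sailboat -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            catenateWords="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Note that WordDelimiterFilterFactory is a filter, so it needs a tokenizer in front of it; the whitespace tokenizer is used here because it leaves the hyphen in place for the filter to split on.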

Filters

Filters are used after tokenizers to examine a stream of tokens and either keep them as-is, transform or discard them, or create new ones. Tokenizers and filters can be combined to form pipelines, or chains, where the output of one becomes the input for the next. A sequence of tokenizers and filters is called an analyzer, and the resulting output of an analyzer is used to match query results or build indexes.

The SnowballPorterFilterFactory is used for stemming, which reduces a word to a shorter base form. For example, increased, increasing, and increases all reduce to the same stem as increase.

Solr provides the Porter and KStem stemmers. Porter stems a word to a base form that is not necessarily a dictionary word. In contrast, KStem stems a word to a base form that is a dictionary word. Typically, Porter returns more matches, with less precision.

For example, Porter can identify territorial when searching for territory (KStem does not), but it also identifies visuals when searching for visualization (again, where KStem does not).

The SnowballPorterFilterFactory applies the Snowball Porter stemming algorithm to each word (token). It implements the Porter2 (Snowball) stemming algorithm, which is a slight improvement over the original Porter algorithm.

The most common stemming algorithms are Porter, Snowball (Porter2), and Lancaster. Porter is the least aggressive and Lancaster the most; some shorter words become unrecognizable after Lancaster stemming. Porter is the slowest and Lancaster the fastest. Porter and Porter2 stem approximately 5% of words to different forms. For more information about the Snowball algorithm, see The English (Porter2) stemming algorithm.

Output from SnowballPorterFilterFactory: o Reilli s oReilli Wi Fi WiFi guid
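If the Snowball stemmer proves too aggressive for a given catalog, a KStem-based chain is one alternative to evaluate. The following sketch assumes a Solr release that ships KStemFilterFactory and KeywordMarkerFilterFactory; the field type name is hypothetical.

```xml
<!-- Hypothetical field type that stems with KStem instead of Snowball -->
<fieldType name="example_text_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Marks terms listed in protwords.txt so that the stemmer leaves them unchanged -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element without a type attribute is used, the same chain applies at index time and at query time, which keeps the stemmed forms consistent on both sides.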

Field

Every field must declare a unique name and associate it with one of the previously-defined types. For example:


<field name="buyable" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="mfName" type="wc_text" indexed="true" stored="true"  multiValued="false"/>

Every field can define three important attributes:

multiValued: When set to true, a field can hold as many values as it is assigned, separated by characters defined in the wc-data-config.xml file.
indexed: When set to true, the field is used in the search index, and can be used for searching, sorting, and faceting in the storefront.
stored: When set to true, the field can be included in the search result set. It permanently saves the original data value for a field, whether or not the field is indexed (and used for searches). The stored text is not used for searching. It is stored separately and returned only when requested. That is, if a particular field is used for searching but is never displayed to shoppers, stored can be set to false.

For example, in the schema.xml file, the categoryname field is set to hold multiple values:


<field name="categoryname" type="wc_text" indexed="true" stored="false" multiValued="true" />

Then, in the wc-data-config.xml file, the splitBy attribute defines the character that separates the values:


<field column="categoryname" splitBy=";" sourceColName="CATGRPNAME" />
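The indexed and stored attributes can be combined in the same way. The following sketch shows one search-only field and one display-only field; both field names are hypothetical, and the string field type is assumed to be defined elsewhere in the schema.

```xml
<!-- Hypothetical fields, for illustration only -->
<!-- Searchable but never returned in results: indexed, not stored -->
<field name="example_keywords" type="wc_text" indexed="true" stored="false" multiValued="true"/>
<!-- Returned for display but not searchable: stored, not indexed -->
<field name="example_thumbnail" type="string" indexed="false" stored="true" multiValued="false"/>
```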

Copy fields

Copy fields are used when the content of a source field needs to be copied to, and indexed in, one or more different destination fields. The following example copies the name field to the name_suggest destination field:


<copyField source="name" dest="name_suggest"/>
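Both the source and the destination must be declared as fields for the copy to work. The declarations in the following sketch are assumptions made for illustration; in particular, the type of the name_suggest field in the actual schema might differ.

```xml
<!-- Source field, analyzed for regular keyword search -->
<field name="name" type="wc_text" indexed="true" stored="true" multiValued="false"/>
<!-- Destination field; the type and attributes shown here are assumptions -->
<field name="name_suggest" type="wc_text" indexed="true" stored="false" multiValued="false"/>
<copyField source="name" dest="name_suggest"/>
```

Solr copies the original value before any analysis takes place, so the destination field applies its own analyzer chain. This is what makes it practical to index the same text in several different ways.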

Dynamic fields

Dynamic fields allow you to index data without defining the name of the field. Instead, the name of the field is defined by a wildcard (*). You can use prefixes and suffixes so that the actual name of the field is accepted at runtime. However, the field type must be defined. For example, the following dynamic field accepts names such as price_USD="100" or price_EUR="120":


<dynamicField name="price_*" type="wc_price" indexed="true" stored="true" multiValued="false"/>

This approach is useful when you want to work with the Solr API without defining a final schema for the data.
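To illustrate how such names are resolved at runtime, the following sketch uses Solr's XML update format to add one document. The catentry_id value is made up, and in WebSphere Commerce the index is normally populated through the data import configuration rather than by posting documents directly, so treat this purely as an illustration.

```xml
<!-- Illustrative update message; the ID value is hypothetical -->
<add>
  <doc>
    <field name="catentry_id">10001</field>
    <!-- Neither name is declared explicitly; both match the price_* dynamic field -->
    <field name="price_USD">100</field>
    <field name="price_EUR">120</field>
  </doc>
</add>
```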

Unique key

Unique keys assign a unique identity to a specific Solr document, similar to primary keys in databases. In the CatalogEntry index's schema.xml file, the uniqueKey is defined as catentry_id.
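A unique key declaration generally pairs a field definition with a uniqueKey element, along the lines of the following sketch. The long field type and the required attribute are assumptions; check the actual schema for the exact definition.

```xml
<!-- The field type and attributes are assumptions made for illustration -->
<field name="catentry_id" type="long" indexed="true" stored="true" required="true" multiValued="false"/>
<!-- Names the field whose value identifies each document -->
<uniqueKey>catentry_id</uniqueKey>
```

The key field is single-valued, so each document carries exactly one identifier, and adding a document whose key already exists replaces the earlier document.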

Default search field

The default search field is the field that is searched when a request does not specify a field.
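In schema.xml files that use these options, the default search field and the default operator (described in the next section) are typically declared near the end of the file. The field name defaultSearch in this sketch is hypothetical.

```xml
<!-- Hypothetical field name; queries that do not name a field search this one -->
<defaultSearchField>defaultSearch</defaultSearchField>
<!-- Operator that is applied between query terms when none is given -->
<solrQueryParser defaultOperator="OR"/>
```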

Default operator

The default operator indicates the default behavior when handling multiple tokens in the search. The OR operator is defined by default and cannot be changed.

The following naming conventions are used in the schema.xml file:

fieldName: Tokenized and not case sensitive. For example, mfName.
fieldName_cs: Tokenized and case sensitive. For example, mfName_cs.
fieldName_ntk: Non-tokenized and not case sensitive. For example, mfName_ntk.
fieldName_ntk_cs: Non-tokenized and case sensitive. For example, catenttype_id_ntk_cs.

Use non-tokenized fields when you want to search for exact matches. For example, use a non-tokenized field for brand names, so that when a shopper searches for Econo Sense, products from Econo Sense appear in search results, instead of products from Sense.

Use case-sensitive fields for more accurate matches, for example, for fields such as part number.
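The naming conventions above might map to declarations such as the following. The two example_ field types are hypothetical stand-ins, not the types that the WebSphere Commerce schema actually uses, and the field attributes are assumptions.

```xml
<!-- Hypothetical field types (these belong with the other fieldType definitions) -->
<fieldType name="example_ntk" class="solr.TextField">
  <analyzer>
    <!-- Keeps the whole value as one token, then lowercases it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- StrField values are not analyzed: non-tokenized and case sensitive -->
<fieldType name="example_ntk_cs" class="solr.StrField" omitNorms="true"/>

<!-- Fields that follow the naming conventions -->
<field name="mfName" type="wc_text" indexed="true" stored="true" multiValued="false"/>
<field name="mfName_ntk" type="example_ntk" indexed="true" stored="false" multiValued="false"/>
<field name="catenttype_id_ntk_cs" type="example_ntk_cs" indexed="true" stored="true" multiValued="false"/>
<copyField source="mfName" dest="mfName_ntk"/>
```

With a chain like example_ntk, the value Econo Sense is indexed as the single token econo sense, so a query for Sense alone does not match the mfName_ntk field.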