Solr schema file (schema.xml)
Solr organizes its data into documents, which consist of fields. In WebSphere Commerce, one document consists of all the descriptions of a product.
The schema.xml file contains information about the Solr fields, and how they are analyzed and filtered during searches. Different field types can contain different types of data. Solr uses the schema.xml file to determine how to build indexes from the input documents, and how to perform index and query time processing.
The schema.xml file contains the following fields and options:
Field type
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="wc_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Where the wc_text fieldType is processed by analyzers, tokenizers, and filters, which are used to influence search relevancy.
Text analyzers
Text analyzers map the source string of text to the final list of tokens. This process occurs during both indexing and querying. You can use different analyzer chains for indexing and querying, depending on your business needs. For example, the EdgeNGramFilterFactory is better suited to the index analyzer than to the query analyzer. Typically, however, the same analyzer chain is used for both indexing and querying, because a search matches only when the query tokens match the indexed tokens.
Different analyzers return different results. For example, an analyzer that lowercases tokens can return documents that contain the term Flash when a shopper searches for flash.
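The difference between index-time and query-time chains can be sketched as follows. This is a minimal illustration, not a field type that ships with the product: the name example_prefix_text and the gram sizes are assumptions chosen for the example.

```xml
<!-- Hypothetical field type: the name and the gram sizes are illustrative only -->
<fieldType name="example_prefix_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time only: also index the leading fragments of each token (f, fl, fla, ...) -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query time: lowercase only, so the typed text is compared against the indexed fragments -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because both chains lowercase their tokens, a query for flash and an indexed value of Flash produce the same token. The edge n-grams are generated only at index time, which is why that filter is left out of the query chain.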
Tokenizers
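Tokenizers split the input text into tokens. The following table compares the output for sample text such as O'Reilly's wi-fi guide.

Tokenizer | Output |
---|---|
Standard tokenizer (StandardTokenizerFactory) | O Reilly s wi fi guide |
Keyword tokenizer | O'Reilly's wi-fi guide |
Whitespace tokenizer | O'Reilly's wi-fi guide |
WordDelimiterFilterFactory (a filter, shown for comparison) | o Reilly s oReillys Wi Fi WiFi guide |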
Filters
Filters are used after tokenizers to examine a stream of tokens and either keep them as-is, transform or discard them, or create new ones. Tokenizers and filters can be combined to form pipelines, or chains, where the output of one becomes the input for the next. A sequence of tokenizers and filters is called an analyzer, and the resulting output of an analyzer is used to match query results or build indexes.
The SnowballPorterFilterFactory is used for stemming, which reduces a word to a shorter base form. For example, increased, increasing, and increases all reduce to the same stem as increase.
Solr uses Porter and KStem by default when stemming. Porter stems the word to a base form that does not necessarily have to be a dictionary word. In contrast, KStem stems the word to a base form that matches a dictionary word. Typically, Porter results in more matches with less precision.
For example, Porter can identify territorial when searching for territory (KStem does not), but it also identifies visuals when searching for visualization (again, where KStem does not).
The SnowballPorterFilterFactory applies the Snowball Porter stemming algorithm to each word (token). It implements the Porter2 (Snowball) stemming algorithm, which is a slight improvement over the original Porter algorithm.
The most commonly used stemming algorithms are Porter, Snowball (Porter2), and Lancaster, where Porter is the least aggressive and Lancaster the most (some shorter words become totally obfuscated after Lancaster). Porter is the slowest and Lancaster the fastest. For more information about the Snowball algorithm, see The English (Porter2) stemming algorithm. Porter and Porter2 stem approximately 5% of words to different forms.
Output from SnowballPorterFilterFactory: o Reilli s oReilli Wi Fi WiFi guid
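To compare the stemmers described above, a field type that uses KStem instead of Snowball might look like the following sketch. The name example_text_kstem is hypothetical, and whether solr.KStemFilterFactory is available depends on the Solr version that is bundled with your installation.

```xml
<!-- Hypothetical field type for comparing stemmers; not part of the shipped schema -->
<fieldType name="example_text_kstem" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Less aggressive, dictionary-oriented stemming in place of SnowballPorterFilterFactory -->
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the stemmer changes the indexed tokens, so the index must be rebuilt after a change of this kind.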
Field
<field name="buyable" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="mfName" type="wc_text" indexed="true" stored="true" multiValued="false"/>
- multiValued
- When set to true, a field can hold as many values as it is assigned, separated by characters defined in the wc-data-config.xml file.
- indexed
- When set to true, the field is used in the search index, and can be used for searching, sorting, and faceting in the storefront.
- stored
- When set to true, the field can be included in the search result set. It is used to permanently save the original data value for a field, whether it is indexed (and used for searches) or not.
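For example, the following multivalued field is defined in the schema.xml file:
<field name="categoryname" type="wc_text" indexed="true" stored="false" multiValued="true" />
Then, the separator character is defined in the wc-data-config.xml file: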
<field column="categoryname" splitBy=";" sourceColName="CATGRPNAME" />
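As a rough illustration of the effect of splitBy=";", assume that the CATGRPNAME column returns the made-up value Coffee Makers;Kitchen Appliances. The resulting Solr document would then carry two separate values for the multivalued field, shown here in the Solr XML document format:

```xml
<!-- Illustrative document fragment; the category names are invented for this example -->
<doc>
  <field name="categoryname">Coffee Makers</field>
  <field name="categoryname">Kitchen Appliances</field>
</doc>
```

Each value is analyzed independently, which is why the field must be declared with multiValued="true".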
Copy fields
<copyField source="name" dest="name_suggest"/>
Dynamic fields
Dynamic fields are defined with a wildcard (*). You can use prefixes and suffixes so that the actual name of the field is accepted at runtime. However, the field type must be defined.
For example, the following dynamic field accepts names such as price_USD="100" or price_EUR="120".
<dynamicField name="price_*" type="wc_price" indexed="true" stored="true" multiValued="false"/>
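As a sketch of how the price_* rule is used, a document sent to Solr in its XML update format could carry price fields that were never declared individually. The catentry_id value and the prices are made up for illustration.

```xml
<!-- Illustrative update message; both price fields match the price_* dynamic field -->
<add>
  <doc>
    <field name="catentry_id">10001</field>
    <field name="price_USD">100</field>
    <field name="price_EUR">120</field>
  </doc>
</add>
```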
Dynamic fields such as price_* are ideal, as you can work with the Solr API without defining a final schema for the data.
Unique key
Unique keys assign a unique identity to a specific Solr document, similar to primary keys in databases. In the CategoryEntry index's schema.xml file, the uniqueKey is defined as catentry_id.
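In schema.xml terms, the unique key is a regular field that is then named in the uniqueKey element. The following is a minimal sketch; the field type long and the remaining attributes are assumptions, so check the shipped CatalogEntry schema for the actual definition.

```xml
<!-- Assumed field definition: type and attributes are illustrative -->
<field name="catentry_id" type="long" indexed="true" stored="true" required="true" multiValued="false"/>

<!-- Declares catentry_id as the unique identity of each document -->
<uniqueKey>catentry_id</uniqueKey>
```

When a document is added with a catentry_id that already exists in the index, Solr replaces the existing document instead of creating a duplicate.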
Default search field
This field is used when the request does not contain a specific field.
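In the Solr versions that use schema.xml this way, the default search field is typically declared with the defaultSearchField element. The field name defaultSearch in this sketch is an assumption; use the name that appears in your own schema.

```xml
<!-- Assumed field name; queries that do not name a field are run against it -->
<defaultSearchField>defaultSearch</defaultSearchField>
```

With this setting, a query such as q=coffee is treated as defaultSearch:coffee. Later Solr releases deprecate this element in favor of the df request parameter.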
Default operator
- fieldName
- Tokenized and not case sensitive. For example, mfName.
- fieldName_cs
- Tokenized and case sensitive. For example, mfName_cs.
- fieldName_ntk
- Non-tokenized and not case sensitive. For example, mfName_ntk.
- fieldName_ntk_cs
- Non-tokenized and case sensitive. For example, catenttype_id_ntk_cs.
Use non-tokenized fields when you want to search for exact matches. For example, use a non-tokenized field for brand names, so that when a shopper searches for Econo Sense, products from Econo Sense appear in search results, instead of products from Sense.
Use case-sensitive searches for more accurate matches, for example, for fields such as part numbers.
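The naming convention above can be approximated with field types along the following lines. This is a sketch under stated assumptions: the type names example_ntk and example_ntk_cs are hypothetical, and the shipped schema might define its non-tokenized types differently.

```xml
<!-- Hypothetical type for *_ntk fields: the whole value is kept as one token and lowercased -->
<fieldType name="example_ntk" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type for *_ntk_cs fields: the raw string is indexed without any analysis -->
<fieldType name="example_ntk_cs" class="solr.StrField" omitNorms="true"/>

<field name="mfName_ntk" type="example_ntk" indexed="true" stored="false" multiValued="false"/>
<field name="catenttype_id_ntk_cs" type="example_ntk_cs" indexed="true" stored="true" multiValued="false"/>
```

With types like these, a search on mfName_ntk for econo sense matches only documents whose entire manufacturer name is Econo Sense, regardless of case, while catenttype_id_ntk_cs matches only the exact string as it was indexed.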