Snowball analyzer

The Snowball analyzer converts words into stem words that are specific to a language and code set.

The Snowball analyzer is similar to the Standard analyzer except that it converts words to stem words.

The Snowball analyzer processes text characters in the following ways:

  • Converts words to stem word tokens.
  • Does not index stopwords.
  • Converts alphabetical characters to lower case.
  • Ignores colons, #, %, $, parentheses, and slashes.
  • Indexes underscores, hyphens, @, and & symbols when they are part of words or numbers.
  • Separately indexes numbers and words if numbers appear at the beginning of a word.
  • Indexes numbers as part of the word if they are within or at the end of the word.
  • Indexes apostrophes if they are in the middle of a word, but removes them if they are at the beginning or end of a word.
  • Ignores an apostrophe followed by the letter s at the end of a word.
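
The punctuation and number rules above can be approximated with a short Python sketch. This is an illustration only, not the analyzer's implementation: the function name tokenize and its stopwords parameter are hypothetical, only unaccented ASCII letters are handled, stemming is not performed, and keeping a period between letters is an assumption inferred from the e-mail address example in this topic.

```python
import re

# A token starts and ends with a letter or digit. The characters _ - @ & '
# are kept only when they sit between letters or digits. Keeping '.' there
# is an assumption based on the e-mail example, not a documented rule.
_TOKEN = re.compile(r"[a-z0-9]+(?:[_\-@&'.][a-z0-9]+)*")

# Digits at the start of a word, immediately followed by a letter.
_LEADING_DIGITS = re.compile(r"^(\d+)([a-z].*)$")


def tokenize(text, stopwords=()):
    """Approximate the listed rules; stemming is intentionally omitted."""
    tokens = []
    # Lower-casing first means the pattern only needs to match a-z.
    for match in _TOKEN.finditer(text.lower()):
        token = match.group(0)
        # Ignore an apostrophe followed by s at the end of a word.
        if token.endswith("'s"):
            token = token[:-2]
        # Index leading numbers separately from the rest of the word.
        split = _LEADING_DIGITS.match(token)
        parts = [split.group(1), split.group(2)] if split else [token]
        # The stopword list is supplied by the caller (hypothetical here).
        tokens.extend(p for p in parts if p not in stopwords)
    return tokens


print(tokenize("Prequ'ile Mark's 'cause"))
print(tokenize("c:/informix"))
print(tokenize("123abc abc123"))
```

Because the sketch does not stem, a word such as "Corporation" comes out as "corporation" rather than as its stem word.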

By default, the Snowball analyzer uses the language and code set that is specified by the DB_LOCALE environment variable. You can specify a different language by appending the language name or one of its synonyms to the Snowball analyzer name in the CREATE INDEX statement, in the form snowball.language. The language names and their synonyms are:

  • Danish, da, dan
  • Dutch, nl, nld, dut
  • English, en, eng
  • Porter, por (the original English stemmer)
  • Finnish, fi, fin
  • French, fr, fra, fre
  • German, de, deu, ger
  • Italian, it, ita
  • Norwegian, no, nor
  • Portuguese, pt
  • Spanish, es, esl, spa
  • Swedish, sv, swe
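
As a rough illustration of the snowball.language form, the following Python sketch checks a language name or synonym against the list above and builds the analyzer index parameter string. The helper name analyzer_parameter and the table name SNOWBALL_LANGUAGES are hypothetical, and accepting input case-insensitively and emitting it in lower case is an assumption; this topic does not state whether the name is case-sensitive.

```python
# Language names and synonyms, copied from the list in this topic.
SNOWBALL_LANGUAGES = {
    "danish": ("da", "dan"),
    "dutch": ("nl", "nld", "dut"),
    "english": ("en", "eng"),
    "porter": ("por",),
    "finnish": ("fi", "fin"),
    "french": ("fr", "fra", "fre"),
    "german": ("de", "deu", "ger"),
    "italian": ("it", "ita"),
    "norwegian": ("no", "nor"),
    "portuguese": ("pt",),
    "spanish": ("es", "esl", "spa"),
    "swedish": ("sv", "swe"),
}


def analyzer_parameter(language):
    """Return the analyzer index parameter for a listed name or synonym."""
    # Assumption: names are matched case-insensitively and emitted in lower case.
    key = language.strip().lower()
    for name, synonyms in SNOWBALL_LANGUAGES.items():
        if key == name or key in synonyms:
            return f'analyzer="snowball.{key}"'
    raise ValueError(f"not a listed Snowball language: {language!r}")


print(analyzer_parameter("en"))
print(analyzer_parameter("ger"))
```

The resulting string is the index parameter that would be supplied in the CREATE INDEX statement.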

The Snowball analyzer supports the 8859-1 code set.

Examples

In these examples, the input string is shown on the first line and the resulting tokens are shown on the second line, each surrounded by square brackets. These examples use the English language, specified by the analyzer="snowball.en" index parameter. For examples of how the Snowball analyzer uses word stemming in languages other than English, see the Snowball web site at http://snowball.tartarus.org.

In the following example, the stopword "the" is removed, the words are converted to lower case, and the words "jumped" and "lazy" are converted to their stem words:

The Quick Brown Fox Jumped Over The Lazy Dog
[quick] [brown] [fox] [jump] [over] [lazi] [dog]

In the following example, the apostrophe at the beginning of a word and the apostrophe followed by an s are ignored, but the apostrophe in the middle of a word is indexed:

Prequ'ile Mark's 'cause 
[prequ'ile] [mark] [cause]

In the following example, the colon and slash are ignored:

c:/informix 
[c] [informix]

In the following example, the ampersand is indexed as part of the company name:

XY&Z Corporation 
[xy&z] [corpor]

In the following example, the e-mail address is indexed as is:

xyz@example.com
[xyz@example.com]

In the following example, three different words are indexed as the same stem word:

accept
[accept]

acceptable
[accept]

acceptance
[accept]