Searching over Multiple Languages
The Search application provides globalization support by using different dictionaries for different languages. Each dictionary file must be enabled in the Search configuration file before indexing. By default, only the English language dictionary is enabled during installation.
Language dictionaries support the following text analysis tasks:
- Tokenizing terms correctly.
- Reducing terms to their base form.
- Matching, for example, singular and plural forms and verb tenses (depending on the specific language rules).
Adding extra dictionaries is a mandatory post-installation step that you perform before you start the IBM® Connections Search server for the first time.
How does Search choose the language dictionary to index content with?
When content is analyzed at indexing time, Search attempts to guess (or detect) which of the enabled IBM® LanguageWare® dictionaries to use when applying the text analysis process. If the attempt is unsuccessful or if the language guessed does not have a corresponding dictionary enabled, the default dictionary is used.
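The fallback rule described above can be sketched as a small function. This is only an illustration of the decision logic, not the product's implementation: the function name, the use of language codes such as "en" or "fr", and the convention that a failed detection is represented by None are all assumptions made for this example.

```python
from typing import Optional, Set


def choose_index_dictionary(detected: Optional[str],
                            enabled: Set[str],
                            default: str) -> str:
    """Pick the dictionary language for one piece of content at indexing time.

    detected: language code guessed for the content, or None if detection
              failed (hypothetical convention for this sketch).
    enabled:  language codes of the dictionaries that are enabled.
    default:  language code of the default dictionary.
    """
    # Fall back when detection failed or the guessed language
    # has no enabled dictionary.
    if detected is None or detected not in enabled:
        return default
    return detected
```

For example, with English and French enabled and English as the default, content detected as French is analyzed with the French dictionary, while content detected as Japanese (or not detected at all) is analyzed with the English one.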
How does Search choose the language dictionary to analyze the user's search queries with?
The language is taken from the user's browser settings (the Accept-Language HTTP header). If there is a problem when loading the dictionary that corresponds to the specified language, or if no corresponding dictionary is enabled, the default dictionary is used.
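As a rough illustration of the query-time rule, the sketch below reads an Accept-Language header, takes the most preferred language, and falls back to the default dictionary when that language has no enabled dictionary. The function names are hypothetical, and two details are assumptions rather than documented behavior: that only the top-ranked language is considered, and that a regional tag such as "fr-CA" is matched by its primary subtag "fr".

```python
from typing import List, Set


def parse_accept_language(header: str) -> List[str]:
    """Return primary language subtags from an Accept-Language header,
    ordered from most to least preferred (by q-value, then by position)."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # q defaults to 1 when it is not given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Keep only the primary subtag: "pt-BR" becomes "pt".
        weighted.append((-quality, position, tag.split("-")[0]))
    weighted.sort()
    return [lang for _, _, lang in weighted]


def choose_query_dictionary(header: str, enabled: Set[str], default: str) -> str:
    """Pick the dictionary language used to analyze a search query."""
    preferred = parse_accept_language(header)
    # Assumption: only the top-ranked browser language is considered.
    if preferred and preferred[0] in enabled:
        return preferred[0]
    return default
```

A real deployment might also consider lower-ranked languages from the header before falling back; the source text does not say, so this sketch keeps to the simplest reading.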
Choosing languages and setting the default dictionary
Enable only the languages in which users create a significant amount of content. Keep the number of languages to a minimum, because each additional language increases the risk of a language detection miss, which reduces search quality.
If you do not have any English content, remove English from the available dictionaries.
During indexing, the default language is used if Search cannot detect the content's language with high confidence, or if the detected language dictionary is not enabled.
At search time, the default dictionary is used if the dictionary for the user's browser language is not enabled.
- The default language should be the language in which most of your content is written.
- If the amount of content is the same in two languages, set the default language to the one with the more complex grammar rules. For example, choose Japanese over English, so that if the wrong language is detected, the dictionary with the more complex grammar rules is used.
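The two guidelines above amount to a simple selection rule: prefer the language with the most content, and break ties in favor of the language with the more complex grammar. The sketch below shows one way to express that rule. The function name, the content counts, and the numeric complexity ranking are all hypothetical inputs that an administrator would have to supply; the product does not provide them.

```python
from typing import Dict


def pick_default_language(content_counts: Dict[str, int],
                          complexity_rank: Dict[str, int]) -> str:
    """Suggest a default dictionary language.

    content_counts:  number of content items per language code.
    complexity_rank: higher number means more complex grammar rules
                     (a hand-assigned ranking, assumed for this sketch).
    """
    # Compare by amount of content first; grammar complexity breaks ties.
    return max(
        content_counts,
        key=lambda lang: (content_counts[lang], complexity_rank.get(lang, 0)),
    )
```

With equal amounts of English and Japanese content and Japanese ranked as more complex, the function suggests Japanese; if English content clearly dominates, it suggests English regardless of the ranking.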