Modifying file attachment indexing
Administrators can configure indexing processes for file attachments at the database and file levels. The controls answer three questions:
- Should attachments be indexed for this database?
- Should the particular attachment under examination be indexed?
- How will text be retrieved from this particular attachment?
Database-level controls
The following INI values can be set to control attachment indexing for every database, server-wide:
- FT_INDEX_ATTACHMENTS=1
Index attachments for every indexed database, even if that option was not chosen by the database manager. Additionally, filtering will never be performed on the attachments, only brute force text-stripping.
- FT_INDEX_ATTACHMENTS=2
Never index attachments for any indexed database, even if the database manager chose that option.
- FT_INDEX_ATTACHMENTS=3
Index attachments for every indexed database, even if that option was not chosen. The difference from FT_INDEX_ATTACHMENTS=1 is that filtering will be performed on attachments when applicable, and brute force text-stripping will be used based on the brute force list of file extensions.
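As a minimal illustration of the server-wide control, a server that should index attachments in every indexed database while still filtering where applicable might carry a single line such as the one below in notes.ini. The choice of value 3 is only an example; the setting takes one of the values described above.

```ini
; Example only (omit this line when copying): value 3 chosen for illustration.
FT_INDEX_ATTACHMENTS=3
```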
File-level controls
There are two coarse-grained devices that can be used to control whether a particular attachment is a candidate for indexing or not: the ignore list (enabled by default) and the white list (must be explicitly enabled). Both lists can be extended beyond their defaults and the white list can be entirely substituted if desired.
If an attachment file's extension matches an item in the ignore list, then it will typically not be indexed.
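If the white list is enabled and an attachment file's extension matches an item in it, then the attachment will always be indexed. If there is no match, it will not be indexed.
If an extension appears in both the ignore list and the white list, then the white list takes precedence. For example, *.jar and *.zip appear in both default lists, so they are indexed when the white list is enabled.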
- Ignore list
*.ap, *.au, *.bkf, *.bqy, *.cab, *.cca, *.dbd, *.dll, *.exe, *.gif, *.gz, *.img, *.jar, *.jpg, *.lwp, *.m4p, *.m4v, *.MIF, *.mov, *.mp3, *.mp4, *.mpg, *.msi, *.nsf, *.ntf, *.p7m, *.p7s, *.pag, *.pdb, *.pic, *.png, *.pst, *.rar, *.shw, *.sys, *.tar, *.tif, *.wav, *.wmf, *.wpl, *.wq1, *.z, *.zip
- White list
*.123, *.ami, *.as, *.aw, *.dca, *.doc*, *.dwg, *.emf, *.emz, *.fff, *.fft, *.flg, *.fm, *.htm*, *.hwp, *.jar, *.jtd, *.jtt, *.mime, *.oas, *.odp, *.ods, *.odt, *.pdf*, *.ppt*, *.qpw, *.r13, *.r14, *.rtf, *.sam, *.swp, *.vsd*, *.wk4, *.wks, *.wp*, *.wri, *.xlr, *.xls*, *.xml, *.xy*, *.zip
To modify the ignore list, white list, and other indexing processes, refer to the following actions:
Extending the ignore list
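To extend the ignore list, set FT_INDEX_IGNORE_ATTACHMENT_TYPES in notes.ini to a list of file type extensions with a wildcard character (*), separated by commas, using no space characters. For example:
FT_INDEX_IGNORE_ATTACHMENT_TYPES=*.asf,*.avi,*.bin,*.bmp,*.dat,*.iso,*.mpeg,*.ogg,*.qz,*.rm,*.so,*.swf,*.wmv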
Enabling the white list
- The FT_USE_ATTACHMENT_WHITE_LIST=1 setting enables the default white list, which contains the default file extensions listed earlier in this document. You can append to this default list as described in Extending the white list.
- The FT_USE_MY_ATTACHMENT_WHITE_LIST=1 setting discards the default list and uses only the entries in FT_INDEX_FILTER_ATTACHMENT_TYPES, as documented in Extending the white list.
Extending the white list
The white list can be extended in the same way as the ignore list. To do so, set the FT_INDEX_FILTER_ATTACHMENT_TYPES notes.ini setting by listing file type extensions with a wildcard character (*), separated by commas, using no space characters.
Additionally, FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB is a companion setting that enforces an upper limit on the size of files included in the white list. It accepts an integer value representing mebibytes (MiB).
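The settings above can be combined as in the following sketch, which enables the default white list, appends a few extra extensions, and caps the size of white-listed files. The extensions and the 50 MiB limit are hypothetical values chosen for illustration, not recommendations.

```ini
; Example only (omit this line when copying): extensions and size cap are hypothetical.
FT_USE_ATTACHMENT_WHITE_LIST=1
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.csv,*.md,*.json
FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB=50
```

With this combination, the default white list entries plus the three added extensions would be candidates for indexing.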
Overriding the white list
Set FT_USE_MY_ATTACHMENT_WHITE_LIST=1 along with FT_INDEX_FILTER_ATTACHMENT_TYPES to exclusively use a custom list of files to be indexed.
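A minimal sketch of such an override, assuming a site that wants only a handful of office document types indexed, could look like this. The extensions are hypothetical; the pairing of the two settings follows the description above.

```ini
; Example only (omit this line when copying): the extension list is hypothetical.
FT_USE_MY_ATTACHMENT_WHITE_LIST=1
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.pdf,*.docx,*.xlsx
```

Because the default list is discarded in this mode, any default extension that should still be indexed has to be repeated in the custom list.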
Extending the white list for a particular database
Whichever white list is in effect on the system (the default, or one extended or replaced via FT_INDEX_FILTER_ATTACHMENT_TYPES) can be further extended for a specific database via the setting FT_INDEX_FILTER_ATTACHMENT_TYPES_<database replica id>.
Any attachment file types appearing in this per-database list can also be size-capped by specifying the setting FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB_<database replica id>.
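The per-database settings might be sketched as follows. The placeholder <database replica id> is left unexpanded on purpose, because the exact form the replica ID takes inside the setting name is not shown here; the extensions and the 20 MiB cap are hypothetical.

```ini
; Example only (omit this line when copying): substitute the real replica ID; values are hypothetical.
FT_INDEX_FILTER_ATTACHMENT_TYPES_<database replica id>=*.csv,*.md
FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB_<database replica id>=20
```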
Controlling text retrieval
Once the full-text subsystem has determined that an attachment will be indexed, the next decision is how to extract text from that file attachment. Two methods exist: an intelligent parser (Tika) and ASCII text-stripping.
By default, files are sent to the intelligent parser unless their extensions are explicitly listed in the ASCII text-stripping list. While the intelligent parser typically returns more relevant text tokens to the indexer, it is slower than raw ASCII text-stripping. Text-stripping, however, can return many more superfluous tokens to the indexer (such as text formatting elements), which may decrease search accuracy.
The following is the default ASCII text-stripping list of file extensions:
*.ans,*.ascii,*.log,*.out,*.sms,*.text,*.txt,*.uni,*.utxt
Extending the ASCII text-stripping list
Similar to the ignore list and white list, the text-stripping list can be extended by adding entries via the FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES notes.ini setting. Again, list file type extensions with a wildcard character (*), separated by commas, using no space characters.
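For instance, a site that wants a few additional plain-text formats to bypass the intelligent parser could add a line like the one below. The extensions are hypothetical examples of files that are plain text in practice.

```ini
; Example only (omit this line when copying): the extension list is hypothetical.
FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES=*.csv,*.md,*.cfg
```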
Overriding the ASCII text-stripping
Starting in Domino 14, you can set FT_USE_MY_ATTACHMENT_BRUTE_LIST=1 along with FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES to exclusively use a custom list of file types to be text-stripped.
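A sketch of such an override, assuming only log and plain text files should be text-stripped, might look like this. The two extensions are illustrative; they happen to be taken from the default list, but any custom list follows the same shape.

```ini
; Example only (omit this line when copying): the extension list is illustrative.
FT_USE_MY_ATTACHMENT_BRUTE_LIST=1
FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES=*.log,*.txt
```

Under this configuration, other extensions from the default text-stripping list would presumably go to the intelligent parser instead, since files are sent there unless they are listed for text-stripping.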
Disabling the ASCII text-stripping
Set FT_DISABLE_BRUTE_FORCE=1 to prevent sending attachments through ASCII text-stripping.
Disabling attachment file name indexing
By default, both the intelligent parser and the ASCII text-stripper record the name of the file from which text is retrieved. If you prefer that users not search for attachment file names, then use the DISABLE_ATTACHMENT_SEARCH_BY_FILENAMES=1 setting.