The indexing process
The Search index is generated by retrieving information from each of the applications based on a schedule defined by the administrator. Search uses the IBM® WebSphere® Application Server scheduling service for creating and updating the Search index. The index must be deployed on each node running the Search enterprise application.
Indexing overview
Search indexing happens in several stages:
- Crawling
- Crawling is the process of accessing and reading content from
each application in order to create entries for indexing.
During the crawling process, the Search application requests a seedlist from each HCL Connections application. This seedlist is generated when each application runs queries on the data stored in its database, based on the parameters that the Search application submits in its HTTP request.
The contents of the seedlists are persisted to disk. They are deleted when the next incremental indexing task completes successfully.
- File content extraction
- Search provides a document conversion service to extract the content
of the files to be indexed. During the file content extraction stage,
the document conversion service downloads files to a temporary folder
in the index directory, converts them to plain text, and stores the
extracted text in the folder defined by the WebSphere® Application
Server variable EXTRACTED_FILE_STORE. The extracted text is then
indexed.
HCL Connections supports the indexing of file attachment content from the Files and Wikis applications, and of IBM® FileNet® documents.
File content extraction takes place on the schedule defined for the file content extraction task, which runs every 20 minutes by default. File content is not searchable until the file content conversion is complete and the next indexing task has also completed.
- Indexing
- During the indexing phase, the entries in the persisted seedlists
are processed into Lucene documents, which are serialized into a database
table that acts as an index cache.
When the indexing phase is complete, the seedlists are removed from disk. A resume token marks where the last seedlist request finished so that the Search application can start from this point on the next seedlist request. This resume token enables Search to retrieve only the new data that was added after the last seedlists were generated and crawled.
The crawling and indexing stages for multiple applications take place concurrently in incremental foreground indexing. For example, if an indexing task that indexes Files, Activities, and Blogs is created, each of these applications is crawled and added to the database cache at the same time. During initial and background indexing, only the crawling stage for multiple applications takes place concurrently.
During incremental foreground indexing, after the crawling and indexing stages are complete, all the nodes are notified that they can build their index. At this point, the index builder on each node begins extracting entries from the database cache and storing them in the index on the local file system.
- Index building
- Index building refers to the deserialization and writing of the
Lucene documents into the Search index. This process only occurs during
incremental foreground indexing. During index building, the index
builder takes entries from the database cache and stores them in an
index on the local file system. Each node has its own index builder,
so crawling and preparing entries only takes place once in a network
deployment, and then the index is created on each node from the information
that has already been processed.
During initial and background indexing, the indexing stage and the index building stage are merged, and no database serialization or deserialization occurs.
- Post processing
- After index building (for incremental foreground indexing) or
indexing (for initial or background indexing), post-processing work
takes place on the new index entries to add additional metadata to
the search results. This work includes bookmark rollup and the addition
of file content to Files search results.
Bookmark rollup refers to the process of aggregating the information for public bookmarks that point to the same URL. For example, if 1000 users create a public bookmark for the same URL, when someone searches for that URL, a single bookmark is returned instead of 1000 search results. The bookmark that is returned includes the information for all 1000 bookmarks rolled up into a single search result, so that all of the tags and people associated with each of the individual bookmarks are now associated with the one document.
In addition, if two users bookmark the same internal document, for example, a wiki page, the wiki page is rolled up with the bookmarks. When a user then searches for the wiki page, or for a bookmark that points to the wiki page, only one result is returned. The tags and people associated with the bookmarks and the wiki page are combined into a single document.
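The bookmark rollup described above can be illustrated with a minimal Python sketch. This is not the Search application's implementation; the `Bookmark` record, the `roll_up` function, and the example URLs are hypothetical names chosen only to show how entries that share a URL collapse into one result that carries the combined tags and people.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """Hypothetical, simplified record for one public bookmark."""
    url: str
    owner: str
    tags: set = field(default_factory=set)


def roll_up(bookmarks):
    """Merge all bookmarks that point to the same URL into one result."""
    rolled = defaultdict(lambda: {"people": set(), "tags": set()})
    for bookmark in bookmarks:
        entry = rolled[bookmark.url]
        entry["people"].add(bookmark.owner)   # everyone who bookmarked the URL
        entry["tags"].update(bookmark.tags)   # union of all tags for the URL
    return dict(rolled)


bookmarks = [
    Bookmark("https://example.com/wiki/page", "alice", {"docs", "wiki"}),
    Bookmark("https://example.com/wiki/page", "bob", {"howto"}),
    Bookmark("https://example.com/other", "carol", {"misc"}),
]

results = roll_up(bookmarks)
for url, merged in results.items():
    print(url, sorted(merged["people"]), sorted(merged["tags"]))
```

Three bookmarks go in and two results come out, one per distinct URL. The result for the shared URL lists both owners and all three tags.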
Indexing types
The following table explains the differences between the various types of indexing:
| | Foreground indexing | Background indexing |
|---|---|---|
| Initial indexing | The initial index is built by using the default 15min-search-indexing-task. Alternatively, it can be built by a custom indexing task that is created by the SearchService.addIndexingTask command or a command that is run once, such as SearchService.indexNow(String applicationNames). This index is used for searching and for further indexing. The database cache is not used. | An index is built by using the SearchService.startBackgroundIndex command. The background indexing command creates a one-off index in a specified location on disk. This index is not used for searching. The database cache is not used. |
| Incremental indexing | The index is updated by using the default 15min-search-indexing-task. Alternatively, the index can be updated by a custom indexing task that is created by the SearchService.addIndexingTask command or a command that is run once, such as SearchService.indexNow. This index is used for searching and for further indexing. The database cache is used. | A background index can be updated by using the SearchService.startBackgroundIndex command. This index is not used for searching. The database cache is not used. |
Indexing steps
The indexing process involves the following steps:
- Initial and background indexing
- Crawl all pages of the seedlist and persist them to disk.
- Extract the file content and persist it to disk.
- Crawl a seedlist page from disk.
- Index the seedlist entries into Lucene documents.
- Write the documents to the Lucene index.
- Repeat until all the persisted seedlist pages have been crawled.
- Incremental foreground indexing
- The node that has the scheduler lease crawls all the pages of the seedlist and persists them to disk.
- Crawl a seedlist page from disk.
- Index the seedlist entries into Lucene documents.
- Serialize the Lucene documents into the database cache.
- Send a JMS message to all Search nodes to alert them of the completion of the serialization.
- Each node deserializes the Lucene documents into the Lucene index.
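The incremental foreground steps can be sketched as a small Python simulation. Everything here is an illustrative assumption rather than the product's code: the seedlist is an in-memory list, the resume token is modeled as the highest entry ID already crawled, a dictionary stands in for the database cache, `pickle` stands in for Lucene document serialization, and a plain loop over the nodes stands in for the JMS notification.

```python
import pickle

# Hypothetical seedlist entries that an application would return.
SEEDLIST = [
    {"id": 1, "app": "files", "title": "Budget.xlsx"},
    {"id": 2, "app": "blogs", "title": "Release notes"},
    {"id": 3, "app": "files", "title": "Roadmap.pptx"},
]


def crawl(resume_token, page_size=2):
    """Return pages of entries added after resume_token, plus the new token."""
    new_entries = [e for e in SEEDLIST if e["id"] > resume_token]
    pages = [new_entries[i:i + page_size]
             for i in range(0, len(new_entries), page_size)]
    next_token = new_entries[-1]["id"] if new_entries else resume_token
    return pages, next_token


def index_to_cache(pages, db_cache):
    """Turn each entry into a document and serialize it into the cache."""
    for page in pages:
        for entry in page:
            document = {"id": entry["id"], "text": entry["title"].lower()}
            db_cache[entry["id"]] = pickle.dumps(document)


def build_index(db_cache, local_index):
    """Deserialize cached documents into one node's local index."""
    for doc_id, blob in db_cache.items():
        local_index[doc_id] = pickle.loads(blob)


db_cache = {}
nodes = {"node1": {}, "node2": {}}

# First run: crawl and index once, then let every node build its index.
pages, token = crawl(resume_token=0)
index_to_cache(pages, db_cache)
for local_index in nodes.values():  # stands in for the JMS notification
    build_index(db_cache, local_index)

# Second run: only the entry added since the last crawl is retrieved.
SEEDLIST.append({"id": 4, "app": "wikis", "title": "Onboarding"})
new_pages, new_token = crawl(resume_token=token)
print(new_pages, new_token)
```

The sketch crawls and serializes once, and each node then builds an identical local index from the shared cache, which mirrors why the database cache is used only for incremental foreground indexing.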