Parallel preprocessing and distributed indexing configuration options
- Large amounts of data: Maximize the available disk space. The heap size and memory buffer size must also be tuned, especially with a single JVM.
- Disk I/O: A high number of shards results in more disk I/O operations. Therefore, consider a disk type with fast write performance.
- Database configurations.
- Network latency.
Parallel preprocessing configurations
- Consider database tuning options, such as transactions cache, logger cache, and table space.
- Tune the JVM heap size to set the minimum and maximum memory usage of the JVM that is used for preprocessing. For example, -Xms<memory> -Xmx<memory>.
- Run each preprocessing job in a separate JVM (command-line session).
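As a sketch of the last two steps, each preprocessing job can be started in its own command-line session so that each gets a dedicated heap. The script name, configuration file names, and heap values below are assumptions for illustration, not the actual WebSphere Commerce invocation:

```shell
# Sketch only: one JVM per preprocessing job, each with its own heap limits.
# The script and file names here are placeholders, and whether the wrapper
# honors JAVA_OPTS is an assumption about your environment.

# Session 1:
JAVA_OPTS="-Xms1024m -Xmx2048m" ./di-preprocess.sh wc-dataimport-preprocess-a.xml &

# Session 2:
JAVA_OPTS="-Xms1024m -Xmx2048m" ./di-preprocess.sh wc-dataimport-preprocess-b.xml &

# Wait for both preprocessing JVMs to finish.
wait
```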
Distributed indexing configurations
The indexing configurations cover variations of single and multiple JVMs. They are typically affected by JVM heap size, disk space and speed, and network latency for a remote search server.
The following configurations can be used for distributed indexing:
Configuration 1: Single JVM
All shard index cores and the master index core are managed by a single Solr JVM. All of the shards' cores share the JVM memory resources, so it is important to calculate and allocate each core's maximum memory buffer size according to the JVM maximum heap size.
- Determine the JVM heap size, and set the value to at least 2 GB.
- Determine the number of shards, add one to account for the master index core, and divide the overall JVM heap size by that total. For example, a 12 GB heap with three shards gives 12 / (3 + 1) = 3 GB for each core. Remember this value.
- Set the solrconfig.xml ramBufferSizeMB property to the value that is obtained from the previous step.
- Disable any caches that are defined in the solrconfig.xml file for each of the shards, if enabled. For example, filterCache, queryResultCache, and documentCache.
- Optional: Disable any WebSphere Commerce components that are defined in the solrconfig.xml file. For example, comment out the following section:

  <arr name="components">
    <str>wc_query</str>
    <str>wc_facet</str>
    <str>mlt</str>
    <str>stats</str>
    <str>debug</str>
    <str>wc_spellcheck</str>
  </arr>

  Instead, use native Solr queries and facet components.
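Taken together, the solrconfig.xml changes in the steps above might look like the following sketch. The 512 MB buffer and the cache attribute values are assumed example values, not recommendations:

```xml
<!-- Sketch of the solrconfig.xml changes for each shard core.
     512 is an assumed per-core value from the heap calculation above. -->
<ramBufferSizeMB>512</ramBufferSizeMB>

<!-- Caches disabled by commenting them out: -->
<!--
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
-->
```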
A variation of this configuration is to use a different disk for each or some of the shards, for example, when disk I/O creates a bottleneck or other disk constraints exist. In that case, mount each of the shard directories under the Solr home on a different disk.
When you merge the index, the master JVM must have access to the master index core and all of the shards' index cores directories.
Configuration 2: Multiple JVMs
Each of the shard index cores and the master index core is managed by its own dedicated Solr JVM. If some of the JVMs share disk and memory resources, the memory resources must be distributed across all of the shards.
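When the JVMs share a machine, the available memory can be split evenly across the shard cores and the master core. A minimal sketch, assuming 16 GB of shared memory and three shards (both values are illustrative):

```shell
# Assumed example values: 16 GB of memory shared by 3 shard JVMs + 1 master JVM.
TOTAL_MB=16384
SHARD_COUNT=3
JVM_COUNT=$((SHARD_COUNT + 1))        # shard cores plus the master core
PER_JVM_MB=$((TOTAL_MB / JVM_COUNT))  # even split across all JVMs
echo "Heap per JVM: -Xmx${PER_JVM_MB}m"
# prints: Heap per JVM: -Xmx4096m
```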
A variation of this configuration is to use a dedicated disk and memory for each of the JVMs.
When you merge the index, the master JVM must have access to the master index core and all of the shards' index cores directories.