Crawling HCL Commerce site content
You can crawl HCL Commerce site content in starter stores using the REST
API.
Before you begin
- Build the HCL Commerce Search index.
- Important: Ensure that you configure the site content crawler configuration files for your site:
  - droidConfig.xml
  - filters.txt
Note: If you are crawling content in a clustered topology, run the crawler from a staging environment, not from a production environment. If the production content must be crawled, configure the crawler to visit the production site rather than running it directly in the production environment. This method simplifies the setup because the crawler runs only in an HCL Commerce staging environment and updates the index in the repeater.
Procedure
- You can run the utility from the following URL on the HCL Commerce Search server:
  http://searchHost:port/search/admin/resources/crawler?action=start&langId=langId&storeId=storeId&catalogId=catalogId
  Where the method is GET and authentication is spiuser. action is the action that the crawler should perform. The possible values are:
  - start
    Starts the crawler. Required parameters:
    - langId
      The language identifier that you want to use in building an unstructured index. For example, langId=-1.
    - storeId
      The store ID that you want to use in building an unstructured index. For example, storeId=10501.
    - catalogId
      The catalog identifier that you want to use in building an unstructured index. For example, catalogId=10001.
  - status
    Shows the crawler status.
  - stop
    Stops the crawler.
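The request described above can also be assembled programmatically. The following Python sketch builds the crawler URL for each action and checks that start receives its three required parameters. The function names crawler_url and call_crawler, the host search.example.com, and the port 3737 are illustrative placeholders and not part of the product. The call_crawler helper assumes that the spiuser credentials are sent as HTTP Basic authentication, which this topic does not state; adjust it to match how your Search server is secured.

```python
import base64
import urllib.request
from urllib.parse import urlencode

VALID_ACTIONS = ("start", "status", "stop")


def crawler_url(search_host, port, action, lang_id=None, store_id=None, catalog_id=None):
    """Build the crawler REST URL (hypothetical helper, not a product API)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}, got {action!r}")
    params = {"action": action}
    if action == "start":
        # start requires langId, storeId, and catalogId.
        required = {"langId": lang_id, "storeId": store_id, "catalogId": catalog_id}
        missing = [name for name, value in required.items() if value is None]
        if missing:
            raise ValueError(f"start requires: {', '.join(missing)}")
        params.update(required)
    query = urlencode(params)
    return f"http://{search_host}:{port}/search/admin/resources/crawler?{query}"


def call_crawler(url, password, user="spiuser", timeout=30):
    """Send the GET request.

    Assumption: the server accepts HTTP Basic authentication for spiuser.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url, method="GET", headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder host and port; the IDs are the example values from this topic.
    print(crawler_url("search.example.com", 3737, "start", -1, 10501, 10001))
```

A typical sequence would be to call the start URL once and then request the status URL until the crawler reports that it has finished. The format of the status response is not described here, so the sketch returns the raw response body.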
- Ensure that the utility runs successfully.
  Running the utility with all the parameters involves the following actions:
  - Crawling and downloading the crawled pages in HTML format into the destination directory.
  - Updating the database with the created manifest.txt file.
  - Invoking the indexer.
  Depending on the passed parameters, you can check that the utility runs successfully by:
  - Verifying that the crawled pages are downloaded into the destination directory, searchServerPath\search\index\crawler\cache\date\number, where date is the date that the utility was run, and number is the number of runs on that date, starting with 1.
  - Verifying that the SRCHCONFEXT table with indexsubtype='WebContent' has been updated with the correct manifest.txt location.
  - If auto index is set to true: Verifying that the crawled pages are also indexed.
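The first verification, checking that pages were downloaded into the destination directory, can be scripted. The following Python sketch finds the highest-numbered run directory for a given date and counts the files it contains. The function names latest_run_dir and count_downloaded_files are hypothetical. The name of the date directory is passed in as-is because this topic does not specify its format, and the sketch counts every regular file rather than assuming a particular file extension for the crawled pages.

```python
from pathlib import Path


def latest_run_dir(cache_dir, date_name):
    """Return the highest-numbered run directory under cache_dir/date_name.

    Returns None when the date directory is missing or has no numbered runs.
    date_name is used verbatim; its format is not assumed here.
    """
    date_dir = Path(cache_dir) / date_name
    if not date_dir.is_dir():
        return None
    runs = [p for p in date_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not runs:
        return None
    # Compare numerically so that run 10 sorts after run 2.
    return max(runs, key=lambda p: int(p.name))


def count_downloaded_files(run_dir):
    """Count the regular files anywhere below run_dir."""
    return sum(1 for p in Path(run_dir).rglob("*") if p.is_file())
```

A count of zero for the latest run suggests that the crawler did not download any pages, in which case the settings in droidConfig.xml and filters.txt are a reasonable place to start looking.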