Site content crawler configuration
The site content crawler uses configuration files and manifest files to determine its behavior.
You can start the site content crawler by accessing the following URL:
http://searchHost:port/search/admin/resources/crawler?action=start&langId=langId&storeId=storeId&catalogId=catalogId
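As a minimal sketch, the start URL above can be assembled and requested from a script. The function names `build_crawler_start_url` and `start_crawler` are hypothetical helpers, not part of the product, and the sketch assumes that a plain HTTP GET without extra authentication is accepted, which may not hold in your deployment.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_crawler_start_url(search_host, port, lang_id, store_id, catalog_id):
    """Fill in the placeholders of the documented crawler start URL."""
    query = urlencode({
        "action": "start",
        "langId": lang_id,
        "storeId": store_id,
        "catalogId": catalog_id,
    })
    return f"http://{search_host}:{port}/search/admin/resources/crawler?{query}"


def start_crawler(url, timeout=30):
    """Send a GET request to the start URL and return the HTTP status code.

    Assumption: the endpoint accepts an unauthenticated GET request.
    Add credentials or TLS settings as your environment requires.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.status
```

Building the URL separately from sending the request makes it easy to log or review the exact address before the crawler is started.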
The site content crawler relies on the following input configuration files, which are in the directory Liberty/usr/servers/searchServer/resources\search\index\crawler\ext\:
- droidConfig.xml
- The site content crawler configuration file contains variables and parameters that determine the site content crawler behavior. The variables that are specified in the site content crawler configuration file are then used to populate values further in the configuration file.
- filters.txt
- The filters configuration file determines whether URLs are included or ignored by the site content crawler.
- SiteMap.jsp
- The site map, which is used by web browsers and external search engines, contains pointers to the different starter store pages.
- StaticContentSitemap.jsp
- The static site map contains pointers to the static content files that are in the HCL Commerce database. You must update the static site map file to include your additional static content files that are in the HCL Commerce database. The URL that is passed from the configuration file to the site content crawler is:
http://host_name/webapp/wcs/stores/servlet/StaticContentSitemap?storeId=storeId&langId=-1&catalogId=catalogId
This file is used only by the site content crawler.
- Site content crawler manifest files
- The site content crawler manifest.txt output files are comma-separated values (CSV) formatted documents that contain generated information. You can find the files in the directory searchServerPath\resources\search\index\crawler\cache\date\number, where:
- date
- Is the date when the crawler utility was run.
- number
- Is the number of times the crawler was run, starting with 1.
- The manifest file that indicates which folder contains the downloaded site content files. It contains the following columns:
- Timestamp
- The time stamp for the column.
- Directory path
- The counter directory path.
- Initial location URLs
- The initial URLs, separated by commas.
- The manifest file that contains the mappings of downloaded files to URLs. It contains the following columns:
- ID
- The ID that distinguishes each file in the document. For example, a simple sequence.
- URL
- The relative URL to the current store, or full URL pointing to external resources.
- Local file path
- The file path, either in full format or relative format, of the stored site content.
- Content-type
- The content type of the file, for example, text/html.
- Encoding
- The encoding of the file, if it is a text-based file.
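The file-to-URL manifest described above can be read with a standard CSV parser. The following is a minimal sketch that assumes the columns appear in the order listed (ID, URL, local file path, content type, encoding) and that the file has no header row; neither assumption is confirmed here. `ManifestEntry`, `parse_file_manifest`, and `url_to_path` are hypothetical names chosen for illustration.

```python
import csv
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    id: str
    url: str
    local_path: str
    content_type: str
    encoding: str  # empty for files that are not text-based


def parse_file_manifest(lines):
    """Parse manifest rows from an iterable of CSV lines.

    Assumption: five columns in the documented order, no header row.
    Rows with fewer than five fields are padded with empty strings.
    """
    entries = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row] + [""] * 5
        entries.append(ManifestEntry(*fields[:5]))
    return entries


def url_to_path(entries):
    """Map each crawled URL to the local path of its downloaded file."""
    return {entry.url: entry.local_path for entry in entries}
```

To read a manifest from disk, pass an open file object, for example `parse_file_manifest(open(path, newline=""))`, where `path` points to the manifest file in the run directory.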