Crawling WebSphere Commerce site content from the repeater
You can use the site content crawler utility
from the repeater to crawl WebSphere Commerce site content in starter
stores.
Before you begin
- Ensure that the test server is started.
- Ensure that your administrative server is started. For example:
- If WebSphere Commerce is managed by WebSphere Application Server Deployment Manager (dmgr), start the deployment manager and all node agents. Your cluster can also be started.
- If WebSphere Commerce is not managed by WebSphere Application Server Deployment Manager (dmgr), start the WebSphere Application Server server1.
- Ensure that you complete the following task:
Procedure
- Copy the following script files from the WebSphere Commerce WC_installdir/bin directory
to the remote Solr server's remoteSearchHome/bin directory:
- configServerEnv
- crawler
- setdbenv.db2
- setenv
For example, where remoteSearchHome/bin is /opt/IBM/WebSphere/search/bin.
- Copy the WC_installdir/instances/instance_name/xml/config/dataimport directory to the remoteSearchHome/instance_name/search directory.
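The two copy steps above can be sketched as a shell session. The sketch below uses temporary placeholder directories (created with mktemp) and empty placeholder files so that it can run anywhere; on a real system, substitute your actual WC_installdir and remoteSearchHome paths, such as /opt/IBM/WebSphere/search.

```shell
# Sketch of the two copy steps, using temporary placeholder directories
# and empty placeholder files; substitute your real WC_installdir and
# remoteSearchHome paths in an actual deployment.
WC_installdir="$(mktemp -d)"
remoteSearchHome="$(mktemp -d)"
instance_name=demo

# Stand-in for the real installation layout (placeholder files only).
mkdir -p "$WC_installdir/bin" \
         "$WC_installdir/instances/$instance_name/xml/config/dataimport"
for f in configServerEnv crawler setdbenv.db2 setenv; do
    touch "$WC_installdir/bin/$f"
done

# Step 1: copy the four script files to remoteSearchHome/bin.
mkdir -p "$remoteSearchHome/bin"
for f in configServerEnv crawler setdbenv.db2 setenv; do
    cp "$WC_installdir/bin/$f" "$remoteSearchHome/bin/"
done

# Step 2: copy the dataimport directory to remoteSearchHome/instance_name/search.
mkdir -p "$remoteSearchHome/$instance_name/search"
cp -r "$WC_installdir/instances/$instance_name/xml/config/dataimport" \
      "$remoteSearchHome/$instance_name/search/"

ls "$remoteSearchHome/bin"
```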
- Edit the crawler script file and update
the CRAWLER_CONFIG and CRAWLER_CP values
to the appropriate paths. For example:
CRAWLER_CONFIG="remoteSearchHome/instance_name/search/dataimport"
CRAWLER_CP="remoteSearchHome/solr/Solr.war/WEB-INF/lib/*"
- Edit the setenv script file and update
the WAS_HOME, WCS_HOME,
and DB_HOME values to match your environment.
For example:
OS_WAS_HOME=/opt/IBM/WebSphere/AppServer
OS_WCS_HOME=/opt/IBM/WebSphere/search
OS_DB2_HOME=/home/wcsuser/sqllib
- Copy the following files from the WebSphere Commerce solrhome directory
to the remote Solr server's remoteSearchHome directory:
- droidConfig.xml
- filters.txt
- Edit the droidConfig.xml file to match
your environment.
- Update the values for storePathDirectory and filterDir. For example:
<var name="storePathDirectory">/opt/IBM/WebSphere/search/demo/</var>
<var name="filterDir">/opt/IBM/WebSphere/search/demo/search/solr/home</var>
- Define the following new variables: solrhostname and solrport. These variables are used when the search web server is on a different host than the WebSphere Commerce web server. For example:
<var name="solrhostname">searchWebServerHost.example.com</var>
<var name="solrport">3737</var>
- Update the value of the autoIndex URL to use the new variables. The ampersands in the URL are escaped as &amp;amp; because the value is inside an XML file. For example:
<autoIndex enable="true">http://${solrhostname}:${solrport}/solr/MC_${masterCatalogId}_CatalogEntry_Unstructured_${locale}/webdataimport?command=full-import&amp;storeId=${storeId}&amp;basePath=</autoIndex>
This update creates and writes the crawler output to the /opt/IBM/WebSphere/search/demo/StaticContent/en_US/date directories.
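To see what the crawler requests after substitution, the variable expansion in the autoIndex URL can be mimicked in shell. This is illustrative only; the crawler performs the substitution itself, and the host, port, master catalog ID, store ID, and locale below are placeholder values, not values from your site.

```shell
# Mimic the droidConfig.xml variable substitution (illustrative only;
# the crawler performs this substitution itself). All values are placeholders.
solrhostname=searchWebServerHost.example.com
solrport=3737
masterCatalogId=10001   # assumed master catalog ID
locale=en_US
storeId=10101           # assumed store ID

URL="http://${solrhostname}:${solrport}/solr/MC_${masterCatalogId}_CatalogEntry_Unstructured_${locale}/webdataimport?command=full-import&storeId=${storeId}&basePath="
echo "$URL"
```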
- Create the remoteSearchHome/logs directory so that the crawler.log file can be written.
- Update the BasePath value in the CONFIG column of the SRCHCONFEXT table, for the row where the INDEXSUBTYPE column is WebContent. For example:
BasePath=/opt/IBM/WebSphere/search/StaticContent/en_US/2012-09-18
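On DB2, this update can be issued from the command line. The following is a sketch only: it assumes the CONFIG column holds just the BasePath entry, so inspect the current value first, because overwriting the whole column would discard any other settings stored there. The db2 call itself is commented out because it needs a live database connection.

```shell
# Build the SQL for the BasePath update (sketch; assumes the CONFIG column
# holds only the BasePath entry -- inspect the current value before overwriting).
BASEPATH=/opt/IBM/WebSphere/search/StaticContent/en_US/2012-09-18
SQL="UPDATE SRCHCONFEXT SET CONFIG='BasePath=${BASEPATH}' WHERE INDEXSUBTYPE='WebContent'"
echo "$SQL"
# db2 connect to your_database   # placeholder database name
# db2 "$SQL"                     # requires a live DB2 connection
```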
- Complete one of the following tasks, depending on your operating system:
- Log on as a WebSphere Commerce non-root user.
- Log on with a user ID that is a member of the Windows Administrators group.
- Log on with a user profile that has *SECOFR authority.
- Go to the remoteSearchHome/bin directory.
- Run the crawler utility:
crawler.sh -cfg cfg -instance instance_name [-dbtype dbtype] [-dbname dbname] [-dbhost dbhost] [-dbport dbport] [-dbuser db_user] [-dbuserpwd db_password] [-searchuser searchuser] [-searchuserpwd searchuserpwd]
crawler.bat -cfg cfg -instance instance_name [-dbtype dbtype] [-dbname dbname] [-dbhost dbhost] [-dbport dbport] [-dbuser db_user] [-dbuserpwd db_password] [-searchuser searchuser] [-searchuserpwd searchuserpwd]
crawler.bat -cfg cfg [-searchuser searchuser] [-searchuserpwd searchuserpwd]
- cfg
- The location of the site content crawler configuration file. For example, /opt/WebSphere/CommerceServer70/instances/demo/search/solr/home/droidConfig.xml
- instance
- The name of the WebSphere Commerce instance with which you are working (for example, demo).
- dbtype
- Optional: The database type. For example, Cloudscape, db2, or oracle.
- dbname
- Optional: The database name to be connected.
- dbhost
- Optional: The database host to be connected.
- dbport
- Optional: The database port to be connected.
- dbuser
- Optional: The name of the user that is connecting to the database.
- dbuserpwd
- Optional: The password for the user that is connecting to the database.
- If the dbuser and dbuserpwd values are not specified, the crawler can run successfully, but cannot update the database.
- searchuser
- Optional: The user name for the search server.
- searchuserpwd
- Optional: The password for the search server user.
Note: If you specify any optional database information, such as dbuser, the related database information, such as dbuserpwd, must also be specified.
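Putting the parameters together, a concrete invocation might look like the following. All of the values (configuration path, instance name, database host, credentials) are examples, not values from your environment; the command is only echoed here, and would be run from the remoteSearchHome/bin directory on the search server.

```shell
# Example crawler invocation (all values are assumptions; adjust to your
# environment). The command is echoed rather than executed.
CFG=/opt/IBM/WebSphere/search/demo/droidConfig.xml
CMD="./crawler.sh -cfg $CFG -instance demo -dbtype db2 -dbname mall -dbhost dbhost.example.com -dbport 50000 -dbuser wcsuser -dbuserpwd mypassword"
echo "$CMD"
# cd remoteSearchHome/bin && eval "$CMD"   # run on the remote search server
```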
- You can also run the utility by requesting a URL on the WebSphere Commerce search server. This approach is recommended when a remote search server is used.
http://solrHost:port/solr/crawler?action=actionValue&cfg=pathOfdroidConfig
Where action is the action that the crawler performs. The possible values are:
- start
- Starts the crawler.
- status
- Shows the crawler status.
- stop
- Stops the crawler.
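The URL form can be driven from a script. In the following sketch the host, port, and configuration path are placeholders, and the curl call is commented out because it needs a running search server.

```shell
# Compose the crawler control URL (placeholder host, port, and path).
solrHost=searchWebServerHost.example.com
port=3737
action=status            # start | status | stop
cfg=/opt/IBM/WebSphere/search/demo/droidConfig.xml

URL="http://${solrHost}:${port}/solr/crawler?action=${action}&cfg=${cfg}"
echo "$URL"
# curl -s "$URL"   # requires a running search server
```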
- Ensure that the utility runs successfully. Running the utility with all the parameters involves the following operations:
- Crawling and downloading the crawled pages in HTML format into the destination directory.
- Updating the database with the created manifest.txt file.
- Invoking the indexer.
Depending on the passed parameters, you can check that the utility ran successfully by:
- Verifying that the crawled pages are downloaded into the destination directory.
- If you passed the database information: verifying that the database is updated with the correct manifest.txt location.
- If you set auto index to true: verifying that the crawled pages are also indexed.
- Update the BasePath value in the CONFIG column of the SRCHCONFEXT table to the correct output path of the manifest.txt file in the date directory. If the basePath value is missing, add it to the column. For example:
basePath=/opt/IBM/search/StaticContent/10052012/
Note: You must update the date directory in the basePath value if you run the crawler on a different day than its last run.
- Build the WebSphere Commerce search index.