Troubleshooting when file content is not found after searching
Connections uses the Apache Tika file conversion libraries for converting business documents from various types to plain text. The plain text is required before the content can be indexed. If search cannot find the content from a file, it could be due to an issue with this conversion from the business document format to plain text. This article describes some steps that can be used to troubleshoot this process.
Running the conversion manually
You can run a standalone Tika process, passing in the document to convert to plain text as a parameter. This should produce the same plain text that the Connections server creates when it runs the conversion, so it can be used to determine whether the Tika conversion is working properly.
You need to have a test document saved locally to run the conversion against. The following steps use Linux as an example, but similar steps can be used on Windows.
- Set up the command line. Using the terminal, go to the WebSphere DMgr/bin directory and execute the following:
cd /opt/IBM/WebSphere/AppServer/profiles/DMgr01/bin
. ./setupCmdline.sh
- On a WebSphere node, go to the Search.ear/tika directory by executing the following:
cd /opt/IBM/WebSphere/AppServer/profiles/AppSrv01/installedApps/bvtdb2Node01Cell/Search.ear/tika
- Run Java using the Tika jar that is present in this directory. The jar name may vary slightly depending on the version that is present.
java -jar tika-app-2.4.1.jar -t <filename>
For example:
java -jar tika-app-2.4.1.jar -t /home/lcuser/testdoc.docx
- This action will produce plain text output to the console which can be examined for the missing text.
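The manual check above can be wrapped in a small shell helper so that the console output does not have to be read by eye. This is only a sketch: the function name check_tika_text and the search phrase are hypothetical, and it assumes java is on the PATH (for example after sourcing setupCmdline.sh) and that the jar name matches the one in your Search.ear/tika directory.

```shell
#!/usr/bin/env bash
# Sketch: convert a document with the standalone Tika jar and report whether
# an expected phrase appears in the extracted plain text.
# check_tika_text is a hypothetical helper name, not part of Connections.
check_tika_text() {
    local jar="$1" doc="$2" phrase="$3"
    # -t asks the Tika app for plain-text output, as in the steps above.
    # grep -F treats the phrase as a fixed string; -i ignores case.
    if java -jar "$jar" -t "$doc" 2>/dev/null | grep -qiF -- "$phrase"; then
        echo "FOUND"
    else
        echo "MISSING"
        return 1
    fi
}
```

For example, running check_tika_text tika-app-2.4.1.jar /home/lcuser/testdoc.docx "some phrase from the document" prints FOUND if the phrase survived the conversion and MISSING otherwise. A MISSING result suggests the text is lost during conversion rather than later in indexing.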
cnx-tika-server process
Because file conversions are required, the HCL Connections server launches a Java process to perform the conversions. To avoid a long startup time for the JVM, Connections keeps these processes running for up to tikaFileConversion.maxDocConversionsPerProcess conversions before ending the process. You can see these processes running on your Connections server by checking the process list:
ps -ef|grep tika
The output shows up to tikaFileConversion.maxConversionThreads processes, such as:
[lcuser@lcauto100 ~]$ ps -ef | grep tika
lcuser 3353 1928 49 15:19 ? 00:00:18 /opt/IBM/WebSphere/AppServer/java/jre/bin/java -Dlog4j.configurationFile=/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/installedApps/bvtdb2Node01Cell/Search.ear/tika/cnxtikalog4j2.xml -jar /opt/IBM/WebSphere/AppServer/profiles/AppSrv01/installedApps/bvtdb2Node01Cell/Search.ear/tika/cnxtika-server.jar -server
lcuser 3356 1928 26 15:19 ? 00:00:09 /opt/IBM/WebSphere/AppServer/java/jre/bin/java -Dlog4j.configurationFile=/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/installedApps/bvtdb2Node01Cell/Search.ear/tika/cnxtikalog4j2.xml -jar /opt/IBM/WebSphere/AppServer/profiles/AppSrv01/installedApps/bvtdb2Node01Cell/Search.ear/tika/cnxtika-server.jar -server
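To compare the number of running conversion processes with the configured tikaFileConversion.maxConversionThreads value, the process list above can be reduced to a count. This is a minimal sketch: count_tika_servers is a hypothetical helper name, and it assumes the server jar is named cnxtika-server.jar as in the sample output.

```shell
#!/usr/bin/env bash
# Sketch: count running cnx-tika-server processes.
# count_tika_servers is a hypothetical helper name, not part of Connections.
count_tika_servers() {
    # The [c] bracket keeps grep from matching its own command line.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function from reporting a failure in that case.
    ps -ef | grep -c '[c]nxtika-server\.jar' || true
}
```

Running count_tika_servers prints a single number. A count of 0 while files are waiting to be indexed may indicate that the conversion processes are not starting.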