You can enable searching on more unstructured content types so that custom attachment
data can be processed by search and returned in store search results.
Important: HCL Commerce Search indexes only unencrypted unstructured data. Processing
encrypted data with HCL Commerce Search is not supported.
Before you begin
Ensure that your database contains customized content types.
Procedure
-
Create a parser for the new file type.
HCL Commerce supports extra parsers to enable searching on more file types.
-
Prepare for the extension.
Before you implement the logic for the new file type, select the MIME types that the new parser
must support.
- Open the tika-mimetypes.xml file. The file is in the
tika-core-0.4.jar file, under org/apache/tika/mime.
- Enter the MIME type that you want to implement. For example, for media of type
application/vnd.rn-realmedia:
<mime-type type="application/vnd.rn-realmedia">
<magic priority="50">
<match value=".RMF" type="string" offset="0" />
</magic>
<glob pattern="*.rm"/>
</mime-type>
- Find a reader that understands the file format so that it can be parsed successfully.
- If the parser must support more types, add an entry for each of them in the same way. These
MIME types are required when you implement the logic.
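To make the meaning of the mime-type entry concrete, the following minimal sketch checks the same two signals in plain Java: the magic string .RMF at offset 0 and the glob pattern *.rm. It is an illustration only and is not how Tika itself performs detection. The class name RealMediaSniffer, its method names, the case-insensitive file name comparison, and the application/octet-stream fallback are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative check that mirrors the mime-type entry; hypothetical helper, not part of Tika. */
final class RealMediaSniffer {
    // Magic value taken from <match value=".RMF" type="string" offset="0" />
    private static final byte[] MAGIC = ".RMF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the leading bytes equal the magic string at offset 0. */
    static boolean matchesMagic(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true when the file name matches *.rm (case-insensitive here, by assumption). */
    static boolean matchesGlob(String fileName) {
        return fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".rm");
    }

    /** Simplified decision: either signal selects the RealMedia type. */
    static String detect(byte[] header, String fileName) {
        return (matchesMagic(header) || matchesGlob(fileName))
                ? "application/vnd.rn-realmedia"
                : "application/octet-stream"; // assumed fallback for this sketch
    }

    public static void main(String[] args) {
        byte[] header = {'.', 'R', 'M', 'F', 0, 0, 0, 18};
        System.out.println(detect(header, "trailer.bin"));
        System.out.println(detect(new byte[] {1, 2, 3, 4}, "notes.txt"));
    }
}
```

A check like this can help you confirm that a sample file really starts with the magic value before you rely on the new entry.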
-
Implement the extension logic.
- Create a class that implements the org.apache.tika.parser.Parser interface. The
com.ibm.commerce.tika.parser.video.VideoParser.getSupportedTypes(ParseContext)
method must return the set of supported media types. For example:
private static final Set<MediaType> SUPPORTED_TYPES =
Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
MediaType.application("vnd.rn-realmedia"))));
public Set<MediaType> getSupportedTypes(ParseContext context) {
return SUPPORTED_TYPES;
}
The application media type is given the value vnd.rn-realmedia to match the previously
selected MIME type.
- The
com.ibm.commerce.tika.parser.video.VideoParser.parse(InputStream, ContentHandler,
Metadata, ParseContext)
method must handle the content of the media, which is passed as the InputStream parameter.
It must also handle the metadata container of the media, which is passed as the Metadata
parameter. For example:
metadata.set(Metadata.CONTENT_TYPE, "application/vnd.rn-realmedia");
metadata.add(Metadata.PUBLISHER, "Publisher");
metadata.add(Metadata.LANGUAGE, "RM_language");
metadata.add(Metadata.COMPANY, "IBM Commerce");
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
When this method returns, the metadata contains the extra publisher, language, and
company information. However, no content is extracted, because the example opens and closes
the XHTML document without writing anything to it.
-
Assemble the logic and enable HCL Commerce Search to recognize it.
A service registry file makes the new parser known to the
HCL Commerce Search framework.
- Create the following file:
- META-INF/services/org.apache.tika.parser.Parser
- Insert the parser's fully qualified class name into the file. For example:
com.ibm.commerce.tika.parser.video.VideoParser
- Export the code and the registry file into a JAR file and save it in the same directory as the
tika-parser-version.jar file.
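Before you deploy the JAR file, it can be useful to confirm that the registry file is packaged where the service lookup expects it. The following sketch, assuming Java 9 or later, reads META-INF/services/org.apache.tika.parser.Parser from a JAR file and lists the class names in it, using the standard provider-configuration format (one class name per line, # starts a comment, blank lines are ignored). The class name ServiceRegistryReader and its methods are hypothetical helpers for this example and are not part of HCL Commerce or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper that lists the parsers registered in a JAR file. */
final class ServiceRegistryReader {
    static final String REGISTRY_ENTRY = "META-INF/services/org.apache.tika.parser.Parser";

    /** Parses registry file content: strips # comments, trims, and skips blank lines. */
    static List<String> parseProviders(String fileContent) {
        List<String> providers = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    /** Reads the registry entry from the JAR file; returns an empty list if it is missing. */
    static List<String> providersInJar(Path jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            JarEntry entry = jar.getJarEntry(REGISTRY_ENTRY);
            if (entry == null) {
                return Collections.emptyList();
            }
            try (InputStream in = jar.getInputStream(entry)) {
                return parseProviders(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ServiceRegistryReader <path-to-jar>");
            return;
        }
        // Print each registered parser class name on its own line.
        providersInJar(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

If the output does not include com.ibm.commerce.tika.parser.video.VideoParser, the registry file is probably missing from the JAR file or stored under a different path.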
-
Confirm the results in HCL Commerce Search.
HCL Commerce Search automatically finds the proper parser for the file content. For
example, if a RealMedia file is in the extraction request, HCL Commerce Search returns the
result of the new parser. Solr Cell uses the result to compose a new document and sends it to the
search server for create and update commands.
For example, you can check the index content, where the result resembles the following
snippet:
content_type:=>application/vnd.rn-realmedia
tika_company:=>IBM Commerce
tika_publisher:=>Publisher
tika_language:=>RM_language
tika_stream_size:=>614135
What to do next
After you enable searching on more unstructured content types by creating a new parser, you can
search the storefront to confirm that the search results contain your custom unstructured content
types.