You can enable searching on additional unstructured
content types so that custom attachment data can be processed by
search and returned in store search results.
Important: By default, WebSphere Commerce search indexes only
unencrypted unstructured data. Processing encrypted data with
WebSphere Commerce search is not supported.
Before you begin
Ensure that you have completed the following tasks:
Procedure
- Create a new parser for the new file type.
WebSphere
Commerce supports using additional parsers to enable searching on
additional file types.
- Prepare for the extension.
Before you implement the logic for the new file type, select the
MIME types that the new parser must support.
- Open the tika-mimetypes.xml file. The file is located in the tika-core-0.4.jar
file, under org/apache/tika/mime.
- Select the MIME type that you want to implement. For example,
for media of type application/vnd.rn-realmedia:
<mime-type type="application/vnd.rn-realmedia">
  <magic priority="50">
    <match value=".RMF" type="string" offset="0" />
  </magic>
  <glob pattern="*.rm"/>
</mime-type>
- Find a reader that understands the file format so that it can
be parsed successfully.
- If the parser must support additional types, select more.
These MIME types are required when implementing the logic.
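The magic and glob rules in the MIME type definition can be illustrated in plain Java. The following is a minimal sketch only: RealMediaSniffer is a hypothetical helper that is not part of Tika or WebSphere Commerce, and Tika performs this detection itself from tika-mimetypes.xml. The sketch assumes a simple, case-sensitive suffix check for the *.rm glob, and shows what the match element (the string .RMF at offset 0) and the glob element express.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not a Tika or WebSphere Commerce class.
final class RealMediaSniffer {
    // Magic value and offset copied from the <match> element of the MIME type.
    private static final byte[] MAGIC = ".RMF".getBytes(StandardCharsets.US_ASCII);
    private static final int OFFSET = 0;

    /** Returns true when the bytes at OFFSET carry the ".RMF" signature. */
    static boolean hasRealMediaMagic(byte[] header) {
        if (header == null || header.length < OFFSET + MAGIC.length) {
            return false; // too short to contain the signature
        }
        byte[] candidate = Arrays.copyOfRange(header, OFFSET, OFFSET + MAGIC.length);
        return Arrays.equals(candidate, MAGIC);
    }

    /** Returns true when the file name matches the "*.rm" glob (assumed case-sensitive). */
    static boolean hasRealMediaExtension(String fileName) {
        return fileName != null && fileName.endsWith(".rm");
    }
}
```

A check such as RealMediaSniffer.hasRealMediaMagic(firstBytesOfFile) can help confirm that a sample file really matches the MIME type that you selected before you write the parser.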
- Implement the extension logic.
- Create a class that implements the org.apache.tika.parser.Parser
interface.
The getSupportedTypes(ParseContext) method of the class, in this example
com.ibm.commerce.tika.parser.video.VideoParser.getSupportedTypes(ParseContext),
must return the set of supported media types. For example:
// Media types that this parser handles (immutable set).
private static final Set<MediaType> SUPPORTED_TYPES =
    Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
        MediaType.application("vnd.rn-realmedia"))));

// Reports the supported media types to the framework.
public Set<MediaType> getSupportedTypes(ParseContext context) {
    return SUPPORTED_TYPES;
}
The application media type is given the value vnd.rn-realmedia to
match the previously selected MIME type.
- The
com.ibm.commerce.tika.parser.video.VideoParser.parse(InputStream,
ContentHandler, Metadata, ParseContext)
method must handle the content of the media, which is passed in as the
InputStream parameter.
It must also populate the metadata container of the media, which is
passed in as the Metadata parameter. For example:
// Record the content type and the descriptive metadata values.
metadata.set(Metadata.CONTENT_TYPE, "application/vnd.rn-realmedia");
metadata.add(Metadata.PUBLISHER, "Publisher");
metadata.add(Metadata.LANGUAGE, "RM_language");
metadata.add(Metadata.COMPANY, "IBM Commerce");
// Emit an empty XHTML document; no body text is written.
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
When this method returns, the metadata contains the additional
publisher, language, and company values. However, no content is
extracted, because the example writes only an empty document.
- Assemble the logic and enable WebSphere Commerce search
to recognize it.
A service registry file makes the new parser known to the
WebSphere Commerce search framework.
- Create the following file:
- META-INF/services/org.apache.tika.parser.Parser
- Insert the parser's full class name into the file. For example:
com.ibm.commerce.tika.parser.video.VideoParser
- Export the code and the registry file into a JAR file and save
it in the same directory as the tika-parser-version.jar file.
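Before you deploy the JAR file, it can be useful to confirm that the registry file is packaged at the expected path with the expected class name. The following is a minimal sketch that uses only the JDK; ParserRegistryCheck and its method names are hypothetical and are not part of Tika or WebSphere Commerce. It assumes the standard provider-configuration file format (one class name per line, with # starting a comment). For demonstration, writeRegistryJar creates a JAR file that contains only the registry file; a real JAR file also contains the compiled parser classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper for illustration; not a Tika or WebSphere Commerce class.
final class ParserRegistryCheck {
    static final String REGISTRY_ENTRY =
            "META-INF/services/org.apache.tika.parser.Parser";

    /** Writes a demo JAR that holds only the registry file (no compiled classes). */
    static void writeRegistryJar(Path jar, String parserClassName) throws IOException {
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry(REGISTRY_ENTRY));
            jos.write((parserClassName + "\n").getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
    }

    /** Lists the class names in the registry file, skipping blanks and # comments. */
    static List<String> registeredParsers(Path jar) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            JarEntry entry = jarFile.getJarEntry(REGISTRY_ENTRY);
            if (entry == null) {
                return names; // no registry file, so nothing is registered
            }
            try (InputStream in = jarFile.getInputStream(entry);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int hash = line.indexOf('#');
                    String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
                    if (!name.isEmpty()) {
                        names.add(name);
                    }
                }
            }
        }
        return names;
    }
}
```

If registeredParsers returns an empty list for your exported JAR file, the registry file is missing or is stored under a different path, and the parser is unlikely to be discovered.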
- Confirm the results in WebSphere Commerce search.
WebSphere Commerce search automatically finds the proper parser for
the file content. For example, if a RealMedia file is in the
extraction request, WebSphere Commerce search returns the parser
result. Solr Cell then uses the result to compose a new document and
sends it to the search server for create and update commands.
For example, you
can check the index content, where the result should resemble the
following snippet:
content_type:=>application/vnd.rn-realmedia
tika_company:=>IBM Commerce
tika_publisher:=>Publisher
tika_language:=>RM_language
tika_stream_size:=>614135
What to do next
After you enable searching on additional unstructured content
types by creating a new parser, you can search the storefront to confirm
that the search results contain your custom unstructured content types.