Preparing data for Natural Language Processing
Incoming data must be pre-processed to be usable by HCL Commerce Search's Natural Language Processing feature.
HCL Commerce Search uses the Stanford CoreNLP language parser to provide the Query service with multilingual support, full grammatical parsing, and extensibility. The enhancements provided by HCL Commerce Search specifically target the needs of online shoppers, giving greater responsiveness and intelligence to the search system.
The Matchmaker is also an important part of the Natural Language Processing feature's AI, and data must be prepared for its consumption as well. Pre-processing covers the following steps:
- Tokenization
- The process of breaking the text down into smaller units called tokens that can be worked with in various ways. For a complete discussion of the tokenization process, see Tokenization in the Stanford CoreNLP documentation.
- Stop word removal
- Common words are removed so that unique terms stand out to the processor. For more information, see Dropping common terms: stop words.
- Lemmatization and stemming
- Words are reduced to their base form, removing inflectional endings and other variations on the base word. See Stemming and lemmatization.
- Part-of-speech tagging
- Individual words and phrases are categorized by type: noun, verb, preposition, etc. See Parts of Speech.
- Named entity recognition (NER)
- Identifies people, companies, and products in the text. The Query service constructs a custom NER file, which is a tab-separated list of word and value, where value is the classification given to the word. For example, a search term "white shirt girls" will be broken into three tokens: white/color, shirt/category, and girls/category. "white shirt girls under $37" would add under37/filter as the fourth token.
- Preparing data for Matchmaker
- The Ingest service will analyze incoming data for three features relevant to the Matchmaker.
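
The first two steps in the list above, tokenization and stop word removal, can be illustrated with a minimal Python sketch. This is not how Stanford CoreNLP or HCL Commerce Search implements them. The function names, the regular expression, and the stop word list are all assumptions chosen for illustration.

```python
import re

# Hypothetical stop word list for illustration only; it is not the list
# used by HCL Commerce Search or Stanford CoreNLP.
STOP_WORDS = {"a", "an", "the", "for", "of", "in", "and", "with"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop common words so that the distinctive terms remain."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("A white shirt for the girls")
content_tokens = remove_stop_words(tokens)
print(tokens)          # ['a', 'white', 'shirt', 'for', 'the', 'girls']
print(content_tokens)  # ['white', 'shirt', 'girls']
```

A production tokenizer also handles punctuation, currency symbols, and language-specific rules, which this sketch deliberately ignores.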
The Query service initializes Stanford CoreNLP by passing the custom NER file to the CoreNLP object. When a query is made, the search term is passed to the SearchNLPSupportProvider method, which in turn passes it to the CoreNLP object. SearchNLPSupportProvider then returns the result.
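
To make the shape of the custom NER file concrete, the following Python sketch builds a small tab-separated list of word and value and uses it to classify the tokens of the example search term. It is a simplified dictionary lookup written under stated assumptions, not the Query service's code or the Stanford CoreNLP API. The names `load_ner_entries` and `tag_tokens` are hypothetical, and the `O` label for unmatched tokens is an assumption borrowed from common NER tagging practice. Price filters such as under37/filter are not covered.

```python
import csv
import io

# Illustrative NER file content: one "word<TAB>value" pair per line.
# The entries mirror the example in the text; a real file is generated
# by the Query service.
NER_FILE_CONTENT = "white\tcolor\nshirt\tcategory\ngirls\tcategory\n"


def load_ner_entries(handle):
    """Read tab-separated word/value rows into a lookup dictionary."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0].lower(): row[1] for row in reader if len(row) == 2}


def tag_tokens(search_term, entries):
    """Pair each whitespace-separated token with its classification.

    Tokens missing from the file get the label "O" (an assumption here).
    """
    return [(token, entries.get(token, "O")) for token in search_term.lower().split()]


entries = load_ner_entries(io.StringIO(NER_FILE_CONTENT))
tagged = tag_tokens("white shirt girls", entries)
print(tagged)  # [('white', 'color'), ('shirt', 'category'), ('girls', 'category')]
```

The lookup table is loaded once and reused for every search term, which loosely mirrors the described flow in which the NER file is supplied at initialization and each query is classified afterwards.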