Configuring Augmented Search
Configuration
Custom Elasticsearch configuration
In some situations, you might need to provide a custom Elasticsearch index mapping or custom index settings. To simplify the deployment of such files, we created a dedicated module available on GitHub (https://github.com/Jahia/augmented-search-custom-configuration).
Clone this module, modify the configuration files located in src/main/resources/META-INF/configurations/, then build and deploy the module to your Jahia instance. Any configuration file provided in this module takes precedence over the settings provided by Augmented Search.
Configuration file
After you install the search-provider-elasticsearch module, a configuration file named org.jahia.modules.augmentedsearch.cfg is created under digital-factory-data/karaf/etc/. This configuration file is used to specify the types of content to index in Elasticsearch.
Main Resources
During indexing, Augmented Search aggregates content from each main resource and its children (or subnodes) and pushes this into a single document in Elasticsearch. The resulting documents are then used while searching, returning one result per document matching the search criteria.
The main resources are defined using the indexedMainResourceTypes property and correspond to a full page accessible through the results of a search query. These content types need corresponding content templates so that they can be displayed individually, like pages. Only content templates without restrictions (mode, user, or permissions), which are set in the Studio, can be used to index content. By default, main resources are jnt:page and jmix:mainResource.
To index content in your site, you must declare nodes hierarchically: always declare jnt:page first, and then the other nodes in the hierarchy. You declare nodes by adding them under org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes, as shown in the following example.
org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes = \
    jnt:page, \
    jnt:news
More details about how to use these properties are available in the configuration file.
File changes are reloaded automatically. You do not need to restart Jahia. For the changes to take effect in your current dataset, you still need to trigger a full re-indexing through the UI or API.
Subnodes
Subnodes are defined using the indexedSubNodeTypes property and get aggregated alongside their main resource as part of the Elasticsearch document. Subnodes allow their parent page (main resource) to be searchable but cannot be used as a dedicated page. Their primary role is to contribute to the scoring of their parent by adding more data to the aggregate mentioned above.
By default, only content that has jmix:editorialContent as a supertype is indexed as a subnode of a main resource (a page or content with a content template).
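As a rough sketch, a subnode declaration could look like the following. The full key is an assumption, built by analogy with content.indexedMainResourceTypes, and mynt:teaser is a hypothetical node type used only for illustration. Check the comments in the configuration file for the exact key and its default value.

```properties
# Assumed key name (by analogy with content.indexedMainResourceTypes).
# mynt:teaser is a hypothetical custom node type, not part of Jahia.
org.jahia.modules.augmentedsearch.content.indexedSubNodeTypes = \
    jmix:editorialContent, \
    mynt:teaser
```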
Mapped nodetypes
By default, all text properties under indexedMainResourceTypes and indexedSubNodeTypes are pushed to Elasticsearch during the indexing process. In some situations, however, it is useful to index additional properties that can then be used to build facets, create filters, or retrieve data using the property node (under hits in GraphQL).
Two configuration properties, content.mappedNodeTypes and file.mappedNodeTypes, can be used to push additional properties to their corresponding indexes in Elasticsearch.
For example, if jnt:event is configured, all the text properties of the node type jnt:event will be indexed. You can also use the {nodeType}.{propertyName} notation to index only specific properties of a type.
Note that you must use the node type that declares the property, not a type that inherits it, for example, jnt:event.eventsType.
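The two notations described above can be combined in one declaration. This is a minimal sketch: the full key prefix is an assumption based on the other keys in this file, since only the short name content.mappedNodeTypes is given here.

```properties
# Assumed full key; only the short name content.mappedNodeTypes is documented above.
# jnt:news            -> all text properties of jnt:news are indexed
# jnt:event.eventsType -> only the eventsType property, declared on jnt:event
org.jahia.modules.augmentedsearch.content.mappedNodeTypes = \
    jnt:news, \
    jnt:event.eventsType
```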
Indexed files
A configuration setting is available to select which files are indexed into the FILES index. org.jahia.modules.augmentedsearch.content.indexedFileExtensions can take the following values. You can:
- Leave it empty to index no files at all (note that the index is still created, but will remain empty)
- Add * to indicate that all file extensions should be indexed
- Add a comma-separated list of extensions to index, for example: "jpg,png,doc"
As with any setting that defines the content to index, this setting has a direct impact on the size of your Elasticsearch indices and therefore on the resource requirements of your entire Elasticsearch cluster. Only specify the file types that you want to be searchable on your platform.
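The three possible values of indexedFileExtensions could be written as follows. The extensions in the first line are chosen for illustration only, and the alternatives are commented out so that a single value is active.

```properties
# Index only a restricted set of document types (illustrative list)
org.jahia.modules.augmentedsearch.content.indexedFileExtensions = pdf,doc,docx

# Alternative: index every file extension
# org.jahia.modules.augmentedsearch.content.indexedFileExtensions = *

# Alternative: index no files (the FILES index is still created, but stays empty)
# org.jahia.modules.augmentedsearch.content.indexedFileExtensions =
```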
Highlighting
Augmented Search supports highlighting of searched terms by automatically adding HTML tags (<em>) to content returned in the excerpt. Highlighting is a complex topic and is often considered a tradeoff between search convenience for the end user and performance.
Note that highlighting is independent of how search results are matched, so in some situations results are returned without highlights in the excerpt. In this case, you may want to adjust your Augmented Search configuration to refine the way that highlighting occurs.
Internally, Augmented Search processes search results for highlighting using a combination of three fields:
- jgql:content: The content in Jahia, which is useful for returning exact-match highlights on individual words
- jgql:content.ngram: The content after it has been processed by the ngram tokenizer (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html)
- jgql:content.phrase: The content after it has been analyzed by the shingle filter (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html)
The default configuration uses the ngram content for search terms of up to 12 characters. If the search term is longer than 12 characters but contains fewer than 3 words, Augmented Search uses the content itself. Finally, for longer search terms of 3 or more words, the results of the phrase analyzer are used.
Highlighting can be configured individually for each language using the highligtingfields configuration setting with the corresponding language suffix:
org.jahia.modules.augmentedsearch.language.highligtingfields.ja = \
    jgql:content, \
    jgql:content.phrase
This setting supports one or two field values.
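For comparison, a single-value declaration might look like the sketch below. The fr language suffix is an assumption used for illustration, and the key is spelled as it appears in the example above. With a single field, highlighting no longer switches between fields based on the length of the search term.

```properties
# "fr" is an assumed language suffix, shown for illustration only.
# A single field means the same field is always used for highlighting.
org.jahia.modules.augmentedsearch.language.highligtingfields.fr = jgql:content
```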
Taking the following text as an example:
Our documentation is here to help you deliver a great customer experience to the people who visit your site. Learn more depending on whether you create or manage content on your site, deploy and administer Jahia products, or develop and customize modules to extend functionality
The following table shows how search terms are highlighted based on the sample text.
Configuration | Search terms | Highlighting |
---|---|---|
Default | custom | The ngram content applies as the search term is shorter than 12 characters. Every occurrence of "custom" is highlighted (customer, customize). |
Default | custom content | The content field applies as more than 12 characters and fewer than 3 words are searched. Only "content" is highlighted: because ngram is not used, "custom" does not match "customer" or "customize". |
Default | customer content | The content field applies as more than 12 characters and fewer than 3 words are searched. Only "customer" and "content" are highlighted. |
Default | customize modules to extend | The phrase field applies as more than 12 characters and at least 3 words are searched. The "customize modules to extend" phrase is highlighted. |
Default | customize modules extend | The phrase field applies as more than 12 characters and 3 words are searched. The phrase field doesn't match due to the absence of "to" in the search term. Nothing is highlighted. |
jgql:content | customer content help | A static configuration applies as more than 12 characters and 3 words are searched. Every occurrence of "customer", "content", and "help" is highlighted. |
jgql:content, jgql:content.phrase | customize modules to extend | The phrase field applies as more than 3 words are searched. The "customize modules to extend" phrase is highlighted. |
Other configurations
Other parameters are available in the configuration files to customize Augmented Search behavior even further:
- Language analyzer: Defines the analyzer used for a specific language
- Buffering configuration: Defines the strategy and timing to use when Jahia checks for the Elasticsearch connection after Elasticsearch has become unreachable
- Reindexing requests batch size: Sets the number of requests sent at a time while reindexing
- and more (see comments in the configuration file itself)