Configuring Augmented Search

November 14, 2023

Configuration

Custom Elasticsearch configuration

In some situations, it might be necessary to provide a custom Elasticsearch index mapping or custom index settings. To simplify the deployment of such files, we created a dedicated module available on GitHub (https://github.com/Jahia/augmented-search-custom-configuration).

Simply clone this module, modify the configuration files located in src/main/resources/META-INF/configurations/, build and deploy the module to your Jahia instance. Any configuration file provided in this module will take precedence over settings provided by Augmented Search.

After uploading a new mapping (or setting), you will need to re-index all sites for such changes to take effect. Re-indexing a single site will not be sufficient to enable those settings.

Configuration file

After installing the search-provider-elasticsearch module, a configuration file named org.jahia.modules.augmentedsearch.cfg is created under digital-factory-data/karaf/etc/. This configuration file is used to specify the types of content to index in Elasticsearch.

Main Resources

During indexing, Augmented Search aggregates content from each main resource and its children (or subnodes) and pushes this into a single document in Elasticsearch. The resulting documents are then used while searching, returning one result per document matching the search criteria. 

The main resources are defined using the indexedMainResourceTypes property. Each one corresponds to a full page that can be reached from the results of a search query. These content types need corresponding content templates so that they can be displayed individually, like "pages". Only content templates without restrictions (mode / user / permissions), which are set in the Studio, can be used to index content. By default, the main resources are jnt:page and jmix:mainResource.

To index content in your site, you must declare nodes hierarchically. You must always declare jnt:page first and then other nodes in the hierarchy. You declare nodes by adding them under org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes, as shown in the following example.

org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes = \
   jnt:page, \
   jnt:news

More details about how to use these properties are available in the configuration file.

File changes are reloaded automatically. You do not need to restart Jahia. For the changes to take effect in your current dataset, you still need to trigger a full re-indexing through the UI or API.

Subnodes

Subnodes are defined using the indexedSubNodeTypes property and get aggregated alongside their main resource as part of the Elasticsearch document. Subnodes allow their parent page (main resource) to be searchable but cannot be used as a dedicated page. Their primary role is to contribute to the scoring of their parent by adding more data to the aggregate mentioned above.

By default, only content that has jmix:editorialContent as a supertype is indexed as subnodes of a main resource (pages or content with a content template).
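As an illustration, a subnode declaration could look like the following sketch. The full property key is an assumption, built by analogy with the indexedMainResourceTypes key shown earlier. Check the comments in the configuration file for the exact key and its default value.

```properties
# Assumed key: same prefix as indexedMainResourceTypes (verify in your .cfg file).
# Content extending jmix:editorialContent is aggregated into the document
# of its parent main resource instead of getting a document of its own.
org.jahia.modules.augmentedsearch.content.indexedSubNodeTypes = jmix:editorialContent
```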

Mapped nodetypes

By default, all text properties under indexedMainResourceTypes and indexedSubNodeTypes are pushed to Elasticsearch during indexing. In some situations, however, it might be useful to add additional properties that can then be used when building facets, creating filters, or retrieving data using the property node (under hits in GraphQL).
Two configuration properties, content.mappedNodeTypes and file.mappedNodeTypes, can be used to push additional properties to their corresponding indexes in Elasticsearch.

For example, if jnt:event is configured, all the text properties of the node type jnt:event will be indexed. You can also set a property name using {nodeType}.{propertyName} notation to specify exact property names from the type to be indexed. 

Note that you must use the node type that declares the property, not a node type that inherits it, for example, jnt:event.eventsType.
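The two notations described above could be written as in the following sketch. The full key prefix is an assumption (the text only names content.mappedNodeTypes), so confirm it against the comments in the configuration file before using it.

```properties
# Assumed full key; the documentation above only gives "content.mappedNodeTypes".

# Option 1: index all text properties of the jnt:event node type
org.jahia.modules.augmentedsearch.content.mappedNodeTypes = jnt:event

# Option 2 (alternative to option 1): index only the eventsType property,
# referenced through the node type that declares it
# org.jahia.modules.augmentedsearch.content.mappedNodeTypes = jnt:event.eventsType
```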

Indexed files

A configuration setting is available to select which files are indexed into the FILES index: org.jahia.modules.augmentedsearch.content.indexedFileExtensions. You can:

  • Leave it empty to index no files at all (note that the index is still created, but will remain empty)
  • Add * to indicate that all file extensions should be indexed
  • Add a comma-separated list of extensions to index, for example: "jpg,png,doc"
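The three options could be written as follows. This is a sketch only: whether the list must be quoted in the file is not stated here, so the unquoted form below is an assumption to check against the comments in the configuration file.

```properties
# Option 1: index only the listed extensions (unquoted list is an assumption)
org.jahia.modules.augmentedsearch.content.indexedFileExtensions = jpg,png,doc

# Option 2 (alternative): index every file extension
# org.jahia.modules.augmentedsearch.content.indexedFileExtensions = *

# Option 3 (alternative): index no files; the FILES index is created but stays empty
# org.jahia.modules.augmentedsearch.content.indexedFileExtensions =
```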

As with any setting defining content to be indexed, this setting has a direct impact on the size of your Elasticsearch indices and therefore on the resource requirements of your entire Elasticsearch cluster. You should only list the file types that you want to be searchable on your platform.

Highlighting

Augmented Search supports highlighting of searched terms by automatically adding HTML tags (<em>) to the content returned in the excerpt. Highlighting is a complex topic and is often a tradeoff between search convenience for the end user and performance.

Note that highlighting is independent of how search results are matched, so some results may be returned without any highlight in the excerpt. In that case, consider adjusting your Augmented Search configuration to refine how highlighting is applied.

Internally, Augmented Search processes search results for highlighting using a combination of three fields: the ngram field, the content field itself, and the phrase field.

By default, Augmented Search uses the ngram field while the search term entered by a site visitor is shorter than 12 characters. For search terms longer than 12 characters but with fewer than 3 words, it uses the content field itself. Finally, for search terms of 3 words or more, it uses the results of the phrase analyzer.

Highlighting can be configured individually for the different languages using the highligtingfields configuration setting with the corresponding language suffix:

org.jahia.modules.augmentedsearch.language.highligtingfields.ja = \
   jgql:content, \
   jgql:content.phrase

This setting supports one or two field values.
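For comparison, a single-field configuration might look like the sketch below. The .en language suffix is chosen for illustration only, and the property key is spelled exactly as it appears in this page. With a single field, that field is used for highlighting whatever the length of the search term.

```properties
# Illustrative only: ".en" is an assumed language suffix.
# A single value makes highlighting always rely on the content field.
org.jahia.modules.augmentedsearch.language.highligtingfields.en = jgql:content
```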

Take the following text as an example:

Our documentation is here to help you deliver a great customer experience to the people who visit your site.

Learn more depending on whether you create or manage content on your site, 
deploy and administer Jahia products, or develop and customize modules to extend functionality

The table shows how search terms are interpreted based on the sample text.

| Configuration | Search terms | Highlighting |
|---|---|---|
| Default | custom | The ngram field applies as the search term is less than 12 characters. Every occurrence of "custom" is highlighted (customer, customize). |
| Default | custom content | The content field applies as more than 12 characters and fewer than 3 words are searched. Only "content" is highlighted: because ngram is not used, "custom" does not match "customer" or "customize". |
| Default | customer content | The content field applies as more than 12 characters and fewer than 3 words are searched. Only "customer" and "content" are highlighted. |
| Default | customize modules to extend | The phrase field applies as more than 12 characters and at least 3 words are searched. The "customize modules to extend" phrase is highlighted. |
| Default | customize modules extend | The phrase field applies as more than 12 characters and 3 words are searched. The phrase field doesn't match due to the absence of "to" in the search term. Nothing is highlighted. |
| jgql:content | customer content help | The statically configured field applies; more than 12 characters and 3 words are searched. Every occurrence of "customer", "content", and "help" is highlighted. |
| jgql:content, jgql:content.phrase | customize modules to extend | The phrase field applies as more than 3 words are searched. The "customize modules to extend" phrase is highlighted. |

Other configurations

Other parameters are available in the configuration files to customize Augmented Search behavior even further:

  • Language analyzer
    Defines the analyzer used for a specific language
  • Buffering configuration
    Defines the strategy and timing to use when Jahia checks for the Elasticsearch connection after Elasticsearch has become unreachable
  • Reindexing requests batch size
    Sets the number of requests sent at a time while reindexing
  • and more (see comments in the configuration file itself)