Configuring Augmented Search

Configuration

Custom Elasticsearch configuration

In some situations, you might need to provide a custom Elasticsearch index mapping or custom index settings. To simplify the deployment of such files, we created a dedicated module available on GitHub (https://github.com/Jahia/augmented-search-custom-configuration).

Clone this module, modify the configuration files located in src/main/resources/META-INF/configurations/, then build and deploy the module to your Jahia instance. Any configuration file provided in this module takes precedence over the settings provided by Augmented Search.

After uploading a new mapping (or setting), you will need to re-index all sites for such changes to take effect. Re-indexing a single site will not be sufficient to enable those settings.

Configuration file

After installing the search-provider-elasticsearch module, a configuration file named org.jahia.modules.augmentedsearch.cfg is created under digital-factory-data/karaf/etc/. This configuration file is used to specify the types of content to index in Elasticsearch.

Main Resources

During indexing, Augmented Search aggregates content from each main resource and its children (or subnodes) and pushes this into a single document in Elasticsearch. The resulting documents are then used while searching, returning one result per document matching the search criteria.

The main resources are defined using the indexedMainResourceTypes property and correspond to a full page accessible through the results of a search query. These content types need to have corresponding content templates so that they can be displayed individually, like pages. Only content templates without restrictions (mode / user / permissions), which are set in the Studio, can be used to index content. By default, the main resources are jnt:page and jmix:mainResource.

To index content in your site, you must declare nodes hierarchically. You must always declare jnt:page first and then other nodes in the hierarchy. You declare nodes by adding them under org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes, as shown in the following example.

org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes = \
   jnt:page, \
   jnt:news

More details about how to use these properties are available in the configuration file.

File changes are reloaded automatically. You do not need to restart Jahia. For the changes to take effect in your current dataset, you still need to trigger a full re-indexing through the UI or API.

Subnodes

Subnodes are defined using the indexedSubNodeTypes property and get aggregated alongside their main resource as part of the Elasticsearch document. Subnodes allow their parent page (main resource) to be searchable but cannot be used as a dedicated page. Their primary role is to contribute to the scoring of their parent by adding more data to the aggregate mentioned above.

By default, only content that has jmix:editorialContent as a supertype is indexed as a subnode of a main resource (a page or content with a content template).

Mapped nodetypes

By default, all text properties under indexedMainResourceTypes and indexedSubNodeTypes are pushed to Elasticsearch during the indexation process. In some situations, however, it might be useful to add properties that can then be used when building facets, creating filters, or retrieving data using the property node (under hits in GraphQL).
Two configuration properties, content.mappedNodeTypes and file.mappedNodeTypes, can be used to push additional properties to their corresponding indexes in Elasticsearch.

For example, if jnt:event is configured, all the text properties of the node type jnt:event will be indexed. You can also set a property name using {nodeType}.{propertyName} notation to specify exact property names from the type to be indexed.
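To make the notation concrete, here is a minimal Python sketch of how such entries could be read. It is not the module's own parser: the function and class names are hypothetical, and it assumes that entries are comma-separated and that node type names contain no dot, so the first dot separates the type from the property.

```python
from typing import List, NamedTuple, Optional


class MappedEntry(NamedTuple):
    node_type: str
    # None means "all text properties of the node type"
    property_name: Optional[str]


def parse_mapped_entry(entry: str) -> MappedEntry:
    """Split one '{nodeType}.{propertyName}' entry; the property part is optional."""
    # Assumption: node types look like 'jnt:event' (a colon, no dot),
    # so the first dot, if present, introduces the property name.
    node_type, dot, prop = entry.strip().partition(".")
    return MappedEntry(node_type, prop if dot else None)


def parse_mapped_node_types(value: str) -> List[MappedEntry]:
    """Parse a comma-separated mappedNodeTypes value, skipping empty entries."""
    return [parse_mapped_entry(e) for e in value.split(",") if e.strip()]


entries = parse_mapped_node_types("jnt:event, jnt:event.eventsType")
# First entry maps every text property of jnt:event,
# second entry maps only the eventsType property.
```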

Important considerations when using this configuration setting:

You have to use the node type that declares the property, not a node type that inherits it, for example, jnt:event.eventsType.
Unless a full reindexation is performed after modifying this setting, the new property will be indexed, but not stored in Elasticsearch (`{"store": false}` in the mapping). It will still be possible to use it in search queries.

Indexed files

A configuration setting is available to select which files are indexed into the FILES index. org.jahia.modules.augmentedsearch.content.indexedFileExtensions can take the following values. You can:

Leave it empty to index no files at all (note that the index is still created, but will remain empty)
Add * to indicate that all file extensions should be indexed
Add a comma-separated list of extensions to index, for example: "jpg,png,doc"
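The three cases above can be sketched in Python as follows. This is only an illustration of the rules as described here, not the module's code: the function name is hypothetical, and the sketch assumes that extensions are compared case-insensitively and that the extension is whatever follows the last dot in the file name.

```python
def should_index_file(filename: str, setting: str) -> bool:
    """Decide whether a file is indexed for a given indexedFileExtensions value."""
    allowed = {ext.strip().lower() for ext in setting.split(",") if ext.strip()}
    if not allowed:
        # Empty setting: no files are indexed (the index still exists, but stays empty).
        return False
    if "*" in allowed:
        # Wildcard: every file extension is indexed.
        return True
    # Assumption: the extension is the text after the last dot, compared case-insensitively.
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in allowed
```

With the example value "jpg,png,doc", report.doc would be indexed and report.pdf would not.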

As with any setting defining content to be indexed, this setting has a direct impact on the size of your Elasticsearch indices and therefore on the resource requirements of your entire Elasticsearch cluster. Only specify the file types that you want to be searchable on your platform.

Note that if you want to index files mounted in Jahia using a VFS, you need to implement a searchable VFS Provider.

Highlighting

Augmented Search supports highlighting of searched terms by automatically adding HTML tags (<em>) to content returned in the excerpt. Highlighting is a complex topic and is often considered a tradeoff between search convenience for the end user and performance.

Note that highlighting is not associated with the definition of search results and there will be situations where results will be returned without highlights in the excerpt. In this case, you may want to consider adjusting your Augmented Search configuration to refine the way that highlighting occurs.

Internally, Augmented Search processes search results for highlighting using a combination of three fields:

jgql:content
The content as stored in Jahia, which is useful for highlighting exact matches on individual words
jgql:content.ngram
The content after it has been processed by the ngram tokenizer (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html)
jgql:content.phrase
The content after it has been analyzed by the shingle filter (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html)

The default configuration uses the ngram content as long as the site visitor searches for no more than 12 characters. If more than 12 characters but fewer than 3 words are searched, Augmented Search uses the content itself. Finally, with 3 or more words, the results of the phrase analyzer are used.
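As a rough illustration of this default behavior, the selection can be written as a small Python function. This is a sketch under stated assumptions, not the module's implementation: the function name is hypothetical, the character count is assumed to include spaces, a search of exactly 12 characters is assumed to still use ngram, and the phrase field is assumed to apply from 3 words upward, as the examples on this page suggest.

```python
NGRAM_MAX_CHARS = 12   # assumption: a search of exactly 12 characters still uses ngram
PHRASE_MIN_WORDS = 3   # assumption: the phrase field applies from 3 words upward


def default_highlight_field(search_terms: str) -> str:
    """Return the field assumed to drive highlighting for the given search terms."""
    text = search_terms.strip()
    if len(text) <= NGRAM_MAX_CHARS:
        return "jgql:content.ngram"
    if len(text.split()) < PHRASE_MIN_WORDS:
        return "jgql:content"
    return "jgql:content.phrase"


for terms in ("custom", "custom content", "customize modules to extend"):
    print(terms, "->", default_highlight_field(terms))
```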

Highlighting can be configured individually for the different languages using the highlightingfields configuration setting with the corresponding language suffix:

org.jahia.modules.augmentedsearch.language.highligtingfields.ja = \
   jgql:content, \
   jgql:content.phrase

This setting supports one or two field values.

Take the following text as an example:

Our documentation is here to help you deliver a great customer experience to the people who visit your site.

Learn more depending on whether you create or manage content on your site, 
deploy and administer Jahia products, or develop and customize modules to extend functionality

The table shows how search terms are interpreted based on the sample text.

Configuration	Search terms	Highlighting
Default	custom	The ngram field applies because the search term is shorter than 12 characters. Every occurrence of "custom" is highlighted (customer, customize).
Default	custom content	The content field applies because more than 12 characters and fewer than 3 words are searched. Only "content" is highlighted: because ngram is not used, "custom" does not match "customer" or "customize".
Default	customer content	The content field applies because more than 12 characters and fewer than 3 words are searched. Only "customer" and "content" are highlighted.
Default	customize modules to extend	The phrase field applies because more than 12 characters and at least 3 words are searched. The "customize modules to extend" phrase is highlighted.
Default	customize modules extend	The phrase field applies because more than 12 characters and at least 3 words are searched. The phrase field doesn't match because "to" is missing from the search terms. Nothing is highlighted.
jgql:content	customer content help	The static configuration applies because more than 12 characters and 3 words are searched. Every occurrence of "customer", "content", and "help" is highlighted.
jgql:content, jgql:content.phrase	customize modules to extend	The phrase field applies because more than 3 words are searched. The "customize modules to extend" phrase is highlighted.

Indexation Performance

Augmented Search provides several configuration options to optimize indexing performance and resource usage. These settings allow you to fine-tune how content is batched and sent to Elasticsearch.

Bulk operations batch size

The operations.batch.size setting controls how many operations are sent to Elasticsearch in a single bulk request. It also sets the scrolling pagination limit used when indexing main resources and subnodes during a full indexation:

org.jahia.modules.augmentedsearch.operations.batch.size = 500

Higher values can improve throughput but increase memory usage and the risk of timeouts. Lower values reduce memory pressure but may slow down indexing.
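To illustrate the tradeoff, the following Python sketch groups a stream of operations into batches of a given size, which is the general idea behind a bulk batch size. It is a generic illustration only: the function name is hypothetical and nothing here reflects how Augmented Search builds its bulk requests internally.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(operations: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size operations."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for op in operations:
        batch.append(op)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, batch
        yield batch


# 1200 operations with a batch size of 500 produce 3 bulk requests.
sizes = [len(b) for b in batched(range(1200), 500)]
print(sizes)  # [500, 500, 200]
```

A larger batch size means fewer requests, but each request holds more operations in memory at once.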

Queue request limit (4.1.x)

The operations.queueRequestLimit setting provides another way to throttle the number of node requests submitted to the bulk ingester at a given time. Following the 4.1.x changes to node processing throughput, it lets you go back to processing one node at a time, as in 3.x. If this setting is 0 or lower, Augmented Search processes as many nodes at a time as the scrolling pagination limit specified in operations.batch.size.

org.jahia.modules.augmentedsearch.operations.queueRequestLimit = 50

Higher values can improve throughput but increase memory usage and the risk of timeouts. Lower values reduce memory pressure but may slow down indexing.
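The fallback rule for this setting can be summarized with a short Python sketch. The function name is hypothetical and the sketch only restates the behavior described in this section.

```python
def effective_queue_limit(queue_request_limit: int, batch_size: int) -> int:
    """Number of node requests submitted at a time, per the rule described above."""
    # 0 or lower: fall back to the scrolling pagination limit (operations.batch.size).
    if queue_request_limit <= 0:
        return batch_size
    return queue_request_limit
```

For example, a limit of 1 with a batch size of 500 processes one node at a time, while a limit of 0 processes 500 nodes at a time.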

Max concurrent requests (4.1.x)

The maxConcurrentRequests setting limits how many bulk requests can be processed simultaneously:

org.jahia.modules.augmentedsearch.maxConcurrentRequests = 8

This value is automatically restricted to a range between 1 and 8. Higher values can speed up indexing when Elasticsearch has sufficient resources, but too many concurrent requests may overwhelm the cluster. Try reducing this value if you start to experience some timeouts.

Subnode batch processing (4.1.x)

For main resources with many subnodes, the bulkSubnodeRequestsLimit controls how many subnodes are accumulated before being submitted as a batch:

org.jahia.modules.augmentedsearch.bulkSubnodeRequestsLimit = 20

Increasing this value can improve indexing performance for content with many subnodes by reducing the number of updates to the main resource document. However, higher values also increase memory usage and processing time per batch. Lower this limit if you are running on an Elasticsearch environment with limited memory (less than 4 GB), or if you start getting "Data too large" exceptions on your bulk requests.

Retry wait strategy

When connection issues occur, Augmented Search uses a configurable retry strategy to control the delay between retry attempts. Three strategies are available:

Fibonacci (default)

Uses a Fibonacci sequence to calculate wait times, providing a balanced backoff that gradually increases delays:

org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy = FIBONACCI
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.FIBONACCI.multiplier = 100
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.FIBONACCI.maximum = 2
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.FIBONACCI.maximumTimeUnit = MINUTES

The multiplier defines the base time unit (in milliseconds) for the Fibonacci sequence, while maximum and maximumTimeUnit set the upper limit for retry delays.
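To get a feel for the resulting delays, here is a minimal Python sketch of a Fibonacci backoff. It assumes that the wait before retry n is the multiplier times the n-th Fibonacci number (1, 1, 2, 3, 5, ...), capped at the maximum. The exact formula used by Augmented Search is not stated here, so treat the function name and the numbers as illustrative.

```python
from datetime import timedelta


def fibonacci_wait_ms(attempt: int,
                      multiplier_ms: int = 100,
                      maximum: timedelta = timedelta(minutes=2)) -> int:
    """Assumed wait before retry number `attempt` (starting at 1), in milliseconds."""
    if attempt < 1:
        raise ValueError("attempt starts at 1")
    a, b = 1, 1  # fib(1), fib(2)
    for _ in range(attempt - 1):
        a, b = b, a + b
    max_ms = int(maximum.total_seconds() * 1000)
    return min(multiplier_ms * a, max_ms)


print([fibonacci_wait_ms(n) for n in range(1, 7)])
# [100, 100, 200, 300, 500, 800]
```

Under these assumptions, the default values reach the 2-minute cap at the 17th retry.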

Exponential

Uses exponential backoff for more aggressive retry delays:

org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy = EXPONENTIAL
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.EXPONENTIAL.multiplier = 100
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.EXPONENTIAL.maximum = 2
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.EXPONENTIAL.maximumTimeUnit = MINUTES
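
As an illustration, the following Python sketch computes exponential backoff delays. It assumes that the wait before retry n is the multiplier times 2 to the power n, capped at the maximum. Both the starting exponent and the function name are assumptions, not a description of the module's internals.

```python
from datetime import timedelta


def exponential_wait_ms(attempt: int,
                        multiplier_ms: int = 100,
                        maximum: timedelta = timedelta(minutes=2)) -> int:
    """Assumed wait before retry number `attempt` (starting at 1), in milliseconds."""
    if attempt < 1:
        raise ValueError("attempt starts at 1")
    max_ms = int(maximum.total_seconds() * 1000)
    # Doubles on every attempt until the cap is reached
    return min(multiplier_ms * 2 ** attempt, max_ms)


print([exponential_wait_ms(n) for n in range(1, 6)])
# [200, 400, 800, 1600, 3200]
```

Under these assumptions, the default values reach the 2-minute cap at the 11th retry, which is earlier than a Fibonacci sequence with the same multiplier would.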

Fixed

Uses a constant wait time between retries, suitable for predictable retry patterns:

org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy = FIXED
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.FIXED.sleep = 100
org.jahia.modules.augmentedsearch.indexation.retry.waitStrategy.FIXED.sleepTimeUnit = MILLISECONDS

Allowed time unit values: NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, MINUTES, HOURS, DAYS

Other configurations

Other parameters are available in the configuration files to customize Augmented Search behavior even further:

Language analyzer
Defines the analyzer used for a specific language
and more (see comments in the configuration file itself)