Installing and configuring the Elasticsearch search provider

November 11, 2022

Introduction

The Elasticsearch search provider modules make it possible to use the power of Elasticsearch to index and search content in your Digital Experience Manager websites. They act as a connector to an existing Elasticsearch environment: they send the index data and search queries, and retrieve the search results.

The Elasticsearch search provider module improves the relevance of the search results (compared to the default JCR search) as it includes a full-page search as opposed to a content-based search. Elasticsearch 5.6.3 is also based on a more recent version of Lucene (6.6.1) than the one embedded in the core of Digital Experience Manager.

Delegating the search capabilities to Elasticsearch has another advantage: Digital Experience Manager then consumes fewer resources, which improves the overall stability of the platform. You can also make the most of the Elasticsearch scalability design. This was validated by our performance tests, which showed a significant improvement in edit operation response times. Smaller improvements in live browsing were also observed.

In all our performance test scenarios, the Elasticsearch search provider easily met our acceptance criteria: 90% of the requests completed in a third of the time limit we had set, without any special tuning or configuration on the Elasticsearch side. Using Elasticsearch as the search provider also performs better than JCR search in scenarios that combine many searches with content contribution. However, in some other scenarios, for instance with no contribution at all, the JCR search scored slightly better than our Elasticsearch search provider. For this reason, before using this module in production, it is strongly advised to carefully test your Elasticsearch environment to ensure that your expected performance levels are met.

Capabilities

  • Buffered operations: if the connection to Elasticsearch is lost, the operations to perform (contents to be indexed) are queued until the Elasticsearch connection is re-established. In the current version of the module, the DX processing node is the one to manage the content indexing queue. This queue is stored in the RAM of the processing node, therefore it is strongly advised to stop the processing node only when there are no operations left to perform.
  • Search while reindexing: it is possible to contribute content and search in the existing index while reindexing.
  • ACLs are supported at content level. The full-page search is only available when the searched terms appear in contents which share the same ACL as the page.
  • Visibility conditions on contents are supported.
  • Facets are supported starting from version 2.1.0.

Note that any readable content can be found and listed as a result, even if one of its parent nodes is currently not readable (e.g. due to visibility conditions, publication/unpublication, broken inheritance for roles): if a view exists to display the child node on its own, the user should be able to access the searched content. If no view exists for the node, the content will not be displayed on the page.
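
The buffered operations capability listed above can be illustrated with a minimal Python sketch. It is not the module's implementation: the class name BufferedIndexer, the send callback and the use of ConnectionError are assumptions made for the example. It only shows the general idea of an in-memory queue that keeps operations in order until the connection is back, which is also why pending operations are lost if the process holding the queue is stopped.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BufferedIndexer:
    """Keeps index operations in memory while the search backend is unreachable."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        # `send` is a hypothetical function that pushes one operation to
        # Elasticsearch and raises ConnectionError when it cannot be reached.
        self._send = send
        self._pending: Deque[Dict] = deque()

    def submit(self, operation: Dict) -> None:
        """Queue the operation, then try to deliver everything in order."""
        self._pending.append(operation)
        self.flush()

    def flush(self) -> int:
        """Send queued operations first-in first-out; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # keep the operation queued for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        """Number of operations still waiting in RAM."""
        return len(self._pending)
```

In this sketch, checking that pending() returns 0 before shutting down mirrors the advice to stop the processing node only when no operations are left to perform.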

 

Installation requirements

Elasticsearch 

The Elasticsearch search provider module currently supports Elasticsearch 5.6.3, available here.

If your Elasticsearch cluster is using X-Pack, please consult our dedicated page related to X-Pack configuration.

DX modules

The required modules can be deployed on your Digital Experience Manager environment by using the Elasticsearch search provider package available on our store, or by installing its modules individually:

  • Database connector
  • Elasticsearch connector
  • Search provider elasticsearch

Setup

Elasticsearch connection

In the DX administration UI, go to Configuration > Database Connector.
Create a new connection by clicking on the “New connection” button:

ES-config-1.PNG

Select the “elastic” database type:

ES-config-2.PNG

 

Create a new Elasticsearch connection by filling in the following settings:

  • Host: the IP/hostname of your Elasticsearch server
  • Port: the port used by your Elasticsearch server
  • Id: the name of the Elasticsearch connection you are creating
  • Cluster Name: Your Elasticsearch cluster.name property
  • If using X-Pack: additionally open the "Advanced" tab, check "Use XPack Security" and enter username and password (default values are elastic/changeme) 

ES-config-3.PNG

Then the connection is created:

ES-config-4.PNG
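
Before filling in the "Cluster Name" setting, it can help to confirm the value reported by the cluster itself. The following Python sketch reads the cluster_name field that an Elasticsearch node returns on its REST root endpoint. It assumes that the REST API is reachable over plain HTTP on port 9200 and that X-Pack security is not enabled; the port entered in the connection form may be a different one. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def cluster_name_from_info(info_json: str) -> str:
    """Extract cluster_name from the JSON body returned by GET / on a node."""
    return json.loads(info_json)["cluster_name"]


def fetch_cluster_name(host: str, http_port: int = 9200, timeout: float = 5.0) -> str:
    """Query the REST root endpoint of a node and return its cluster name.

    Assumes plain HTTP without authentication (no X-Pack security).
    """
    with urlopen(f"http://{host}:{http_port}/", timeout=timeout) as response:
        return cluster_name_from_info(response.read().decode("utf-8"))
```

For example, fetch_cluster_name("localhost") would return the value to enter as the Cluster Name, provided a node is listening on that host.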

 

Elasticsearch setup

In the administration, go to Configuration > Elasticsearch management

Verify that the Elasticsearch database connection ID corresponds to the one previously created (in our example "esConnection"), and click on save: this will trigger a re-indexing of the platform.

ES-config-5.PNG

It is possible to start a re-indexing from the same screen: a job will perform the re-indexing in background.

Search provider setup

In order to enable the Elasticsearch search provider, go to Configuration > Search settings, and select “Elasticsearch search provider”, then save:

ES-config-6.PNG

As with any search provider, default (or custom) views for search results can be used with the Elasticsearch search provider.

Configuration

Configuration file

Upon installation of the search-provider-elasticsearch module, a configuration file named org.jahia.services.search.provider.elasticsearch.cfg is created under digital-factory-data/karaf/etc/ . This configuration file is used to specify the types of contents that need to be indexed in Elasticsearch. Whenever you modify this file in a DX cluster environment, the synchronization does not happen automatically, so you have to copy the modified file to each cluster node.

Main Resource types

Indexing and result retrieval work differently in the Elasticsearch search provider than in the default JCR search provider. The JCR search provider indexes each content individually, then performs an aggregation when collecting the search results. In contrast, the Elasticsearch search provider already aggregates, in the index, the contents which are displayable in full page: pages (jnt:page) and contents which come with a content template (e.g. jnt:news).

By default, only pages are indexed.

org.jahia.services.search.provider.elasticsearch.content.indexedMainResourceTypes defines the list of content types that can be indexed as full page contents. These content types need to have corresponding content templates in order to be displayed individually, like "pages". Only content templates without restriction (mode / user / permissions), which are set in the studio, can be used to index content.

Mapped nodetypes

Only fulltext and metadata are indexed by default. If you want to do a search on a specific property you have to list it in:

org.jahia.services.search.provider.elasticsearch.content.mappedNodeTypes

Indexing of contents

By default, only the contents which have jmix:editorialContent as a supertype are indexed as subnodes of the main resource (pages or contents with a content template).

Other content types can be indexed by listing them in:

org.jahia.services.search.provider.elasticsearch.content.indexedSubNodeTypes
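
To make the relationship between main resources and indexed subnodes more concrete, here is a simplified Python model of the aggregation described in this section. It is only a sketch under stated assumptions: the Node class, its fields and the two constants are hypothetical, the constants mirror the documented defaults, and the choice to leave sub-pages out of their parent's document is an assumption of the example, not a documented behavior of the module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Defaults described in this section; both sets are configurable in the module.
MAIN_RESOURCE_TYPES: Set[str] = {"jnt:page"}
SUB_NODE_SUPERTYPES: Set[str] = {"jmix:editorialContent"}


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a content node."""
    path: str
    node_type: str
    supertypes: Set[str] = field(default_factory=set)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def collect_text(node: Node) -> List[str]:
    """Gather the text of descendants that qualify as indexed subnodes."""
    parts: List[str] = []
    for child in node.children:
        if child.node_type == "jnt:page":
            continue  # assumption: a sub-page is indexed as its own document
        if child.supertypes & SUB_NODE_SUPERTYPES and child.text:
            parts.append(child.text)
        parts.extend(collect_text(child))
    return parts


def build_documents(root: Node) -> Dict[str, str]:
    """Build one aggregated document per main resource found under root."""
    documents: Dict[str, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.node_type in MAIN_RESOURCE_TYPES:
            own_text = [node.text] if node.text else []
            documents[node.path] = " ".join(own_text + collect_text(node))
        stack.extend(node.children)
    return documents
```

In this model, adding a type such as jnt:news to MAIN_RESOURCE_TYPES produces an additional document for each news item, and adding a supertype to SUB_NODE_SUPERTYPES makes more contents contribute text to the aggregated documents.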

Other configurations

  • Language analyzer: defines the analyzer used for a specific language
  • Buffering configuration: this property defines the strategy and timing to use when DX checks for the Elasticsearch connection after ES has become unreachable
  • Reindexing requests batch size: you can set the number of requests sent at a time while reindexing
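
As an illustration of what a reindexing batch size controls, the following Python sketch groups a stream of requests into fixed-size batches. The function name and the batch_size parameter are hypothetical and do not correspond to the property name used in the configuration file.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(requests: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size requests."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for request in requests:
        batch.append(request)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch
```

A larger batch size means fewer round trips to Elasticsearch but bigger requests; a smaller one does the opposite.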

Differences between Elasticsearch search provider and JCR search provider

The implementation of the Elasticsearch search provider differs from the implementation of the JCR search provider. Therefore, the search results may differ from a search provider to another. This section lists the differences.

Cluster

The JCR search provider accesses the Lucene index created by Jackrabbit, where each DX cluster node has its own index. Therefore the index is always up to date: it synchronously includes the content changes processed on the current cluster node, and obtains in near real time the changes from content write operations processed on other DX nodes in the cluster.

With the ES search provider, index operations are always sent to Elasticsearch exclusively through the DX processing node. So if the DX processing node is down while content is being modified, the index is not up to date with these latest content changes. On startup, the DX processing node catches up and sends to Elasticsearch all indexing requests for changes made during its absence. Note also that even when the DX processing node is up and running, Elasticsearch indexing is always asynchronous and thus only near real time.

Accessing the rawHit object

Using hit.rawHit in a JSP or hit.getRawHit() in Java code may not be compatible with the Elasticsearch provider, as the object returned is not the JCRNode but a SearchHit object from Elasticsearch. We avoided loading JCR nodes for each result for performance reasons. If you really need the node, you have to load it by path, which can be obtained from the SearchHit object by calling getField(ESConstants.NODE_PATH_KEY).

Elasticsearch provider is not considering indexing-configuration.xml

The configuration in indexing-configuration.xml is Jackrabbit specific and thus not considered by the Elasticsearch provider. If you use index rules to boost fields under certain conditions, or exclude certain nodes/sections from being indexed or override analyzers for specific fields (without the need to modify the CND file), then this configuration will not be used by Elasticsearch, but we will come up with alternatives in future releases of the module.

Contents

A displayable content (which has a content template) created inside a page will generate two results with the ES search provider: a link to the page where the content was created, and a link to the full page view of the content. The JCR search provider only displays the link to the full page view of the content.

Files

The ES search provider uses a different analyzer for file names than the JCR search provider. This means that the different elements of a file name are individually indexed in ES, and an exact search on these elements is performed. The following example illustrates this difference.

Three files are available on a site:

  • allBlueCars.zip
  • carsListWikipedia.txt
  • blueCar.png

Searched word(s)   JCR result                Elasticsearch result
car                allBlueCars.zip           blueCar.png
                   carsListWikipedia.txt
                   blueCar.png
blue car           allBlueCars.zip           blueCar.png
                   blueCar.png
cars               allBlueCars.zip           allBlueCars.zip
                   carsListWikipedia.txt     carsListWikipedia.txt
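
The difference shown in this example can be reproduced with a small Python model. It is a sketch, not the actual analyzers: it assumes that the ES side splits file names on camelCase boundaries and punctuation and matches query terms exactly against those elements, and it uses a plain case-insensitive substring match as a stand-in for the JCR side, simply because that reproduces the JCR results listed above. All function names are hypothetical.

```python
import re
from typing import List, Set

FILES = ["allBlueCars.zip", "carsListWikipedia.txt", "blueCar.png"]


def tokens(file_name: str) -> Set[str]:
    """Split a file name into lowercased elements (camelCase parts, extension, digits)."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", file_name)
    return {part.lower() for part in parts}


def exact_element_search(query: str, files: List[str]) -> List[str]:
    """ES-like model: every query term must equal one element of the file name."""
    terms = query.lower().split()
    return [f for f in files if all(term in tokens(f) for term in terms)]


def substring_search(query: str, files: List[str]) -> List[str]:
    """JCR-like stand-in: every query term must appear somewhere in the file name."""
    terms = query.lower().split()
    return [f for f in files if all(term in f.lower() for term in terms)]


for query in ("car", "blue car", "cars"):
    print(query, substring_search(query, FILES), exact_element_search(query, FILES))
```

In this model, "car" only matches blueCar.png on the exact side because allBlueCars.zip and carsListWikipedia.txt contain the element "cars", not "car".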

References

JCR

Files and contents can be found as references when searching for contents only. The content or file found is displayed as a search result, and the links to the pages where the content appears are displayed in the "Appears in" field.

Elasticsearch

In the current version (2.1.x), content references are part of the page or displayable content that contains them, meaning that the referenced content is found as if it were part of the page.
File usages in search results are currently not supported.

Search form

Search for date ranges

The ES search provider evaluates each date as the start of that day (midnight):
Date: 10.11.2017 -> 10.11.2017 00:00
This means that in order to search for all the contents created on 10.11.2017, you need to use the following date range: from 10.11.2017 to 11.11.2017
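
The date handling described above can be checked with a short Python sketch. The function names are hypothetical, and treating both bounds of the range as inclusive is an assumption of the example; the point is only that a date without a time is read as midnight, so a range that ends on the same day it starts excludes everything created later that day.

```python
from datetime import datetime, timedelta
from typing import Tuple


def parse_search_date(value: str) -> datetime:
    """Parse a DD.MM.YYYY date; the time part defaults to 00:00."""
    return datetime.strptime(value, "%d.%m.%Y")


def full_day_range(day: str) -> Tuple[str, str]:
    """Return the (from, to) dates needed to cover the whole given day."""
    start = parse_search_date(day)
    end = start + timedelta(days=1)
    return start.strftime("%d.%m.%Y"), end.strftime("%d.%m.%Y")


def in_range(moment: datetime, date_from: str, date_to: str) -> bool:
    """Check a timestamp against a range whose bounds are both read as midnight."""
    return parse_search_date(date_from) <= moment <= parse_search_date(date_to)


created = datetime(2017, 11, 10, 14, 30)
print(in_range(created, "10.11.2017", "10.11.2017"))  # same-day range
print(in_range(created, *full_day_range("10.11.2017")))  # range extended to the next day
```

With these definitions, a content created on 10.11.2017 at 14:30 falls outside the same-day range and inside the range extended to 11.11.2017.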

 

Display of last modification date

The ES and JCR search providers may not display the same last modification date in the search results (if this information is part of the search result view): ES uses the last modification date of the page (which is shown as the actual search result, therefore the information is relevant), whereas the JCR search shows the date corresponding to the content found.

For instance, a page is returned as it contains a richtext matching the search criteria:

  • with Elasticsearch the last modification date of the page is displayed (which may not correspond to the last modification date of the rich text, but corresponds to the last modification date of another content in the page)
  • with JCR search the last modification date of the richtext is displayed.