Written by The Jahia Team

Deploying a powerful search feature can greatly improve the end-user experience, but it can also introduce new challenges, primarily around the performance of the search backend.

Due to the nature of the operations being performed and the high performance of its querying system, Elasticsearch capacity planning can be complex as you try to achieve the right balance between the size of the dataset and cluster size.

This topic discusses factors to consider when sizing your environment (Elasticsearch primarily) for Augmented Search as well as design considerations potentially impacting performance.

Indexing vs. querying

As with any data store, there are two types of operations performed by Augmented Search:

  • Indexing
    includes operations associated with extracting data from a Jahia page to make it searchable.
  • Querying
    includes operations associated with fetching data either in the form of search results or aggregations (for example, the number of documents created by all authors on the platform).

In Augmented Search, indexing can be performed either in bulk (for example when a new site is imported), or on-the-fly as documents are saved (and/or published) by the content editor.

While querying and single document indexing must be sub-second operations (and generally sub-200ms operations), bulk indexing operations can sometimes take hours depending on the size of the dataset to index.

Indexing considerations

As the volume of data to index has a direct impact on bulk indexing time, ensure that only the content that needs to be searchable is sent for indexing. Keep in mind that what might not be critical for single document indexing can easily become a challenge when this operation has to be repeated a million times.

When configuring Augmented Search, be conservative: index only the data actually required to build your search experience, and avoid indexing data just because you might need it later.

The following settings in the Augmented Search configuration file (org.jahia.modules.augmentedsearch.cfg) have a direct impact on the amount of data that is indexed.

Property | Description
indexParentCategories | Indexes all ancestors as well as the category itself
content.mappedNodeTypes | Indexes additional properties to make them available for faceting or filtering
content.indexedSubNodeTypes | Aggregates subnode properties alongside their main resource in the indexed document
content.indexedMainResourceTypes | Nodes to be indexed as main resource
content.indexedFileExtensions | File extensions to be indexed
workspaces | Specifies the workspace to be indexed (ALL or LIVE only)

Indexing features

To provide more granularity for its indexing operation, Augmented Search offers the ability to trigger indexing by site and workspace.

Indexing per site

Using our configuration UI (Jahia 8.X only) or using our GraphQL API (Jahia 7.X or Jahia 8.X), you can trigger indexing for one site, multiple sites, or all sites at once. This example shows how to index only the digitall site.

mutation { admin { search { startIndex(siteKeys: ["digitall"]) } } }

Note that Augmented Search indexes sites serially (one site at a time). Therefore, the time it takes to index all your sites in bulk is equal to the sum of the times needed to index each site individually. The flexibility lies in the ability to reindex only a portion of the dataset to limit the impact indexing operations have on the infrastructure.

Indexing per workspace

You can also trigger bulk indexing only for a particular workspace. This example shows how to index the live workspace for the digitall site.

mutation { admin { search { startIndex(siteKeys: ["digitall"], workspace: LIVE) } } }
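
If you trigger indexing from a script, the two mutations above can be generated rather than typed by hand. The following is a minimal sketch in Python: the function names are hypothetical, and the {"query": ...} JSON body is the common GraphQL-over-HTTP convention, assumed here rather than taken from the Jahia documentation. The endpoint URL and authentication are deliberately left out.

```python
import json


def build_start_index_mutation(site_keys, workspace=None):
    """Return the startIndex mutation text for the given site keys.

    Hypothetical helper: it only assembles the mutation shown in this topic.
    """
    # json.dumps produces a double-quoted list such as ["digitall"],
    # which is also valid GraphQL list syntax.
    args = "siteKeys: " + json.dumps(list(site_keys))
    if workspace is not None:
        # The workspace is an enum value in the mutation, so it is not quoted.
        args += ", workspace: " + workspace
    return "mutation { admin { search { startIndex(" + args + ") } } }"


def build_request_body(site_keys, workspace=None):
    """Wrap the mutation in a JSON body (assumed GraphQL-over-HTTP convention)."""
    return json.dumps({"query": build_start_index_mutation(site_keys, workspace)})


print(build_start_index_mutation(["digitall"]))
print(build_start_index_mutation(["digitall"], "LIVE"))
```

Generating the text with json.dumps also avoids typographic (curly) quotes, which GraphQL does not accept as string delimiters.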

Cluster sizing

With Augmented Search, you can run bulk indexing operations without impacting querying capabilities. Bulk indexing performs the indexing operation and creates a new index. Once indexing is successful, Augmented Search redirects queries to the new index and deletes the old one. This means that your cluster must be able to hold at least twice the amount of data during the indexing operation.

capacity-planning-1.png

A reindexing operation across all sites
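
As a rough illustration of the "twice the amount of data" rule, the sketch below estimates the peak index storage needed while the old and new indices coexist. It is a simplification under stated assumptions: the function name is hypothetical, and replicas, operating system overhead, and free-space margins are ignored. The dataset sizes are the ones reported for runs #1 and #2 later in this topic.

```python
def peak_index_storage_gb(dataset_gb, copies_during_reindex=2):
    """Estimate storage needed while the old and the new index coexist.

    Simplified sketch: ignores replicas and any free-space safety margin.
    """
    return dataset_gb * copies_during_reindex


for dataset_gb in (9.9, 19):
    print(dataset_gb, "GB indexed ->", peak_index_storage_gb(dataset_gb), "GB at peak")
```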

Querying considerations

Query-time performance, key to a positive user experience when searching, is typically measured in the number of milliseconds it takes for the response to come back after user interaction (usually a user typing a search query).

In most cases, connectivity and operations performed by resources on the path between a user and the data store are not the source of increased latency (other than a few milliseconds). With Jahia, data transits between the user and a Jahia server (possibly through a reverse proxy). Jahia analyzes the GraphQL query, converts it into an Elasticsearch query, sends it to Elasticsearch for processing, and then returns the response along the same path.

The complexity of an operation also has an impact on performance. The more complex an operation is, the longer it takes for Elasticsearch to process it and return a response. For example, if performing an aggregation on a field takes 150ms, performing the same aggregation on two different fields is likely going to take 300ms, and so on. Also, when complex queries run, the load on Elasticsearch increases accordingly, reducing the available compute time for other users.

In short, if your server can process 10 units of processing per second and a query consumes 1 unit, you can serve 10 users in that interval. But if a query consumes 4 units, you can serve only 2 users in that same interval (2.5 queries per second on average).
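
The arithmetic behind this rule of thumb can be sketched as follows. This is a deliberately simplified model (a fixed processing budget per second and a constant cost per query), and the function name is hypothetical.

```python
def queries_per_interval(capacity_units, cost_per_query):
    """Average number of queries the server can process in one interval."""
    return capacity_units / cost_per_query


cheap = queries_per_interval(10, 1)   # 10.0 queries per second
costly = queries_per_interval(10, 4)  # 2.5 queries per second

print(cheap, costly)
# With the costly query, only two queries complete within the first second;
# the third is still being processed when the interval ends.
print(int(costly), "queries fully served in the interval")
```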

Finally, the size of the dataset compared to the actual technical specifications (CPU, RAM, and disk of the Elasticsearch cluster nodes) also plays an important role in the capacity of the server to process queries efficiently.

To sum up, the following factors are important (in no particular order):

  • Processing cost of the query
  • Size of the dataset
  • Load on the server
  • Cluster technical specifications

Elasticsearch sizing

The good news is that Elasticsearch scales very well horizontally. Adding more resources (more nodes) to an Elasticsearch cluster is generally an answer to performance or load issues.

With Jahia, the operations performed during search are not compute intensive, and we have not run into a situation where Jahia represented a significant bottleneck. However, adding more resources can also quickly become very costly, hence the need to optimize querying.

Query and UI optimization

When designing a search experience, developers should consider resource utilization and optimize querying to avoid unnecessary loads on the infrastructure, or even better, optimize querying to avoid performing operations that fetch data users are unlikely to need. This is not just a developer’s concern. UI/UX designers should also consider the impact their design has on the infrastructure.

Search-as-you-type, the Google example

Search-as-you-type is a good example of a search experience that can potentially become very expensive if not implemented properly.

For example, Google built their search experience to optimize consumption on their infrastructure (of course!) while providing a good user experience. As you start typing a query, Google gives you clues about search terms associated with what you typed, but does not actually give you more than the title, and possibly a subtitle and an image. As a user types search criteria, many micro-requests are generated for the Google infrastructure. By reducing the query scope, Google also reduces the load on its servers.

If we push the logic even further, it’s also more than likely that by recommending results during initial search, Google actually aims at redirecting users to cached content. This makes it less expensive when a user clicks on a recommended search term, rather than processing every custom search term entered by the visitor.

Then, and only then, when you press Enter (or click the magnifying glass), does Google perform a query fetching a lot more information about the search terms (including a title, thumbnails, search excerpt, and other metadata). This expensive operation is deferred until the very last moment, in the hope that the search term actually corresponds to what the user is looking for, which limits the need to refine further (and perform additional expensive queries).

Here are some factors to consider when building your search experience:

  • Use different queries (and different results) for search-as-you-type than your full search page (containing facets, filters, and more)
  • Don’t fetch facets by default. Let the user enter search terms first, then perform aggregation on the content.
  • Tweak the delay between keystrokes before actually triggering a search. If a user is interested in tomatoes, wait for them to enter the full word instead of triggering a search on each keystroke (“t”, “to”, “tom”). You can also wait for a particular delay after the last keystroke.
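
To illustrate the last recommendation, here is a small simulation of a debounce rule: a search is triggered only when the user pauses for at least a given delay after a keystroke. It is a sketch, not part of Augmented Search. The function name, the 300ms delay, and the keystroke timings are all hypothetical, and a real UI would implement this with timers in the browser.

```python
def debounced_queries(keystrokes, delay_ms):
    """Return the texts for which a search would be triggered.

    keystrokes: list of (timestamp_ms, text_after_keystroke), in typing order.
    A search fires for a keystroke only if no other keystroke follows
    within delay_ms (the last keystroke always fires).
    """
    fired = []
    for i, (timestamp, text) in enumerate(keystrokes):
        is_last = i == len(keystrokes) - 1
        if is_last or keystrokes[i + 1][0] - timestamp >= delay_ms:
            fired.append(text)
    return fired


# Hypothetical typing session: a short pause after "tom", then the rest.
typing = [(0, "t"), (90, "to"), (180, "tom"),
          (700, "toma"), (790, "tomat"), (880, "tomato")]

print(debounced_queries(typing, 300))  # ['tom', 'tomato']
```

In this session, six keystrokes result in two search requests instead of six.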

This doesn’t mean you absolutely have to follow these recommendations. If search is a key element of your platform and your infrastructure is sized accordingly, you could offer a very detailed search experience from the start.

No one-size-fits-all solution

As you might guess from the various factors covered previously, sizing an Elasticsearch cluster is a complex exercise and depends on a wide variety of factors unique to each environment. The structure of your data, the amount of it, and the queries you are going to perform, all play an important role in determining the ideal cluster size.

Before adopting Augmented Search, it is essential to understand its resource needs. Instead of providing generic numbers that might only be approximate, we’ve decided to provide benchmark results for a variety of scenarios so that you can relate your implementation needs to the dataset we tested and understand the associated boundaries.

Benchmarking

Approach

Our primary objective when creating our test runs (using JMeter) is to progressively increase the query load until the Elasticsearch cluster reaches 100% CPU utilization, and collect mean response times and error rates associated with a particular throughput during that time.

Using a ramp-up time of 90 seconds, we progressively increased the sample size (count of API calls) and monitored the behavior on a wide variety of queries.

Sample size | Error count | Latency (ms) | Throughput (queries/second)
1000 | 0 | 277 | 11
2000 | 0 | 278 | 22
3000 | 0 | 300 | 33
4000 | 0 | 323 | 43
5000 | 0 | 2157 | 55
6000 | 0 | 21127 | 54
7000 | 1465 | 27645 | 70

Results of a term aggregation query

Generally, the latency increases progressively alongside the load until the system reaches a point (usually at 100% CPU consumption) at which it cannot process the requests fast enough. Then, the system starts piling them up (which results in increased latency) until it cannot handle it anymore and starts erroring out.

The above results are a typical illustration: the latency increases slightly until the sample size reaches 4000 API calls (with a mean latency of 323ms at 43 queries per second). The next two samples (5000 and 6000) do not error out, but their latency is far too high to be usable in production. Also notice that these two runs have almost identical throughput, which corresponds to the system’s maximum capacity.

Finally, at 7000, the API starts failing, with 1,465 failed calls out of 7000. Interestingly, at this sample size the throughput increases, but that is simply because the server is refusing calls (“Nope, too busy! Come back later”).

When analyzing test results, we’re looking for the time when multiple runs are getting a similar throughput, which usually corresponds to the throughput of the system when operating at its maximum capacity.

For the results above, the number we’ll keep for our metrics is 55 queries per second.
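
The selection can be sketched in a few lines of Python using the term aggregation results above. The function names are hypothetical; the rule for the maximum (highest throughput without errors) and the 500ms latency limit for the recommended value follow the definitions given under “Understanding the data”. This is an illustration of the reasoning, not the tooling used for the benchmarks.

```python
# (sample size, error count, latency in ms, throughput in queries/second)
RUNS = [
    (1000, 0, 277, 11),
    (2000, 0, 278, 22),
    (3000, 0, 300, 33),
    (4000, 0, 323, 43),
    (5000, 0, 2157, 55),
    (6000, 0, 21127, 54),
    (7000, 1465, 27645, 70),
]


def max_throughput(runs):
    """Highest throughput among the runs that completed without errors."""
    return max(qps for _, errors, _, qps in runs if errors == 0)


def recommended_throughput(runs, latency_limit_ms=500):
    """Highest error-free throughput whose latency stays under the limit."""
    return max(
        qps
        for _, errors, latency_ms, qps in runs
        if errors == 0 and latency_ms < latency_limit_ms
    )


print(max_throughput(RUNS))          # 55
print(recommended_throughput(RUNS))  # 43
```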

Queries and search terms

To provide accurate results, we ensured that search terms match actual content (queries with 0 results use fewer resources). In the context of these benchmarks, we ran all queries in 4 different languages (German, English, French, and Portuguese) using 6 different search strings per language. When performing aggregations or applying filters, we ensured that these also matched content on the sites.

As for the queries, we performed the runs with 7 different queries with different levels of complexity:

Query Name | Description
Simple search | Performs a simple search in a set language and returns a name, excerpt, and path. This is the most common search experience.
Two filters | Adds two filters (on author and publication date range) to “Simple search”. Returns the pages created by John D. between November 15, 2020 and February 15, 2021 matching a set query string.
Range facet | Performs a range aggregation (over 8 buckets) in addition to “Simple search”.
Term facet | Performs a term aggregation, by author, in addition to “Simple search”.
Tree facet | Performs a tree aggregation by categories (similar to a term facet, but identifies whether the tree element has any children), in addition to “Simple search”.
Range facet with two filters | Performs a “Simple search”, “Range facet”, and “Two filters” in the same query.
Three facets | Performs a “Simple search”, “Range facet”, “Term facet”, and “Tree facet” in the same query.


During our benchmarks, we also performed a run while indexing was occurring to measure the impact of whole-site indexing on search queries.

Data caveats

These tests represent a snapshot of an environment at a particular point in time with a particular dataset and are not going to be perfectly reproducible (even with the same environment). You will get different results!

In these benchmarks, the throughput is expressed in queries/second, and should not be confused with the ability of the Jahia platform to serve a specific number of users per second. This throughput corresponds to the number of API calls processed by Augmented Search (for example, users all clicking on the search button at the same second).

Remember that these numbers correspond to the maximum throughput and should not be used to define a nominal production environment. We recommend lowering these numbers by 25-30% for nominal production to keep room for an unplanned increase of traffic.
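
As a worked example of this recommendation, the sketch below applies a 25% and a 30% reduction to the 55 queries per second measured for the term aggregation query. The function name is hypothetical.

```python
def nominal_target(max_qps, headroom_percent):
    """Throughput to plan for after keeping headroom_percent in reserve."""
    return max_qps * (100 - headroom_percent) / 100


print(nominal_target(55, 25))  # 41.25
print(nominal_target(55, 30))  # 38.5
```

In other words, a system that peaks at 55 queries per second would be planned for roughly 38 to 41 queries per second in nominal production.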

Testing infrastructure

The following environment was used for the tests:

  • Elasticsearch cluster: 3x 8GB RAM, 240GB SSD using AWS “aws.data.highio.i3” instance
  • Jahia cluster: 2x 8 Cores 8GB RAM, SSD storage (1 browsing/authoring + 1 processing)
  • Jahia database: 3x 8 Cores 8GB RAM, SSD storage
  • JMeter host: 1x 32 Cores, 64 GB RAM using AWS “c5a.8xlarge”

Understanding the data

Consider the following factors when reviewing the run results:

  • Search Hits
    Number of elements searchable through the Augmented Search API across the entire platform. One page translated into 15 different languages and available in both live and edit workspaces accounts for 30 search hits.
  • Mean
    Mean response time in milliseconds across all queries for the particular batch. For example, if performing a batch of 5000 API calls within 90 seconds, mean corresponds to the mean over the 5000 queries.
  • 95th %ile
    Response time in milliseconds below which 95% of the responses in the batch fall. In other words, only the slowest 5% of responses took longer than this value.
  • Recommended
    The recommended value corresponds to results with the 95th percentile below 500ms
  • Max
    Without considering latency, the maximum throughput obtained without API errors. This gives a notion of the potential maximum that the system can support.
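
To make the difference between the mean and the 95th percentile concrete, here is a small sketch using the nearest-rank definition of a percentile. The latency values are hypothetical, and a benchmarking tool may use a slightly different (for example, interpolated) percentile definition.

```python
def mean(values):
    """Arithmetic mean of the response times."""
    return sum(values) / len(values)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest value such that at least pct% of the samples
    are less than or equal to it.
    """
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical batch: nine fast responses and one very slow one.
latencies_ms = [180, 190, 200, 210, 220, 230, 240, 250, 260, 2000]

print(mean(latencies_ms))            # 398.0
print(percentile(latencies_ms, 50))  # 220
print(percentile(latencies_ms, 95))  # 2000
```

A single slow response barely registers in the median but dominates the 95th percentile, which is why the recommended values are based on the 95th percentile rather than on the mean.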

Since we execute our tests in batches of API calls (for example 1000, 2000, and 3000) with a fixed ramp-up time (90s), we get a good understanding of the maximum throughput supported by the system, but we can’t get precise latency measurements for every single increment in measured throughput.

For illustrative purposes, we performed a run in increments of 200 calls, which provides a good sample of results across all runs. Below, note the recommended and max values. Notice that soon after the max, the 95th percentile latency increases significantly, and the system then starts erroring out around the 6600 sample count.

capacity-planning-3.png

Run results

Remember, we are submitting the sample count within a 90-second window. Compare the measured throughput with the average rate of submitted calls (sampleCount/90s) and notice that the two lines diverge at 5000 samples. At this point, the API is unable to process the data in “real-time” and begins queuing API calls.

capacity-planning-4.png

Also notice that our recommendation is 50 q/s (4400 samples), while the real-time max is actually 55.5 q/s (5000 samples). The reason is that we want to avoid, as much as possible, going much beyond 500ms of latency.
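
The comparison between submitted and measured rates can be reproduced with a short sketch. It uses the error-free batches from the term aggregation results listed earlier (1000-call increments, not the 200-call run shown in the charts), and the 5 q/s tolerance used to flag saturation is an arbitrary assumption.

```python
RAMP_UP_SECONDS = 90

# sample size -> measured throughput (queries/second), error-free batches only
MEASURED = {1000: 11, 2000: 22, 3000: 33, 4000: 43, 5000: 55, 6000: 54}


def submitted_rate(sample_count, window_s=RAMP_UP_SECONDS):
    """Average rate at which calls are submitted during the ramp-up window."""
    return sample_count / window_s


def first_saturated(measured, tolerance_qps=5):
    """Smallest sample size whose measured throughput lags the submitted rate."""
    for samples in sorted(measured):
        if submitted_rate(samples) - measured[samples] > tolerance_qps:
            return samples
    return None


for samples, qps in MEASURED.items():
    print(f"{samples}: submitted {submitted_rate(samples):.1f} q/s, measured {qps} q/s")

print(first_saturated(MEASURED))  # 6000
```

Up to 5000 samples the measured throughput tracks the submitted rate closely; at 6000 samples about 66.7 q/s are submitted but only 54 q/s are processed, so calls are being queued.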

Run #1

Number of ES Documents | Dataset size | Search Hits
7,890,693 | 9.9 GB | 239,043

Indexing time: 1 hour 10 min (nominal), 1 hour 18 min (during query benchmark)

Query Throughput (queries/second)

Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max
Simple search | 67 q/s (95th %ile: 594ms, Mean: 195ms) | 90 q/s (95th %ile: 5214ms, Mean: 3795ms) | 63 q/s (95th %ile: 358ms, Mean: 174ms) | 67 q/s (95th %ile: 1024ms, Mean: 246ms)
Two filters | 56 q/s (95th %ile: 202ms, Mean: 156ms) | 89 q/s (95th %ile: 4634ms, Mean: 3209ms) | 63 q/s (95th %ile: 318ms, Mean: 176ms) | 77 q/s (95th %ile: 1264ms, Mean: 264ms)
Range facet | 43 q/s (95th %ile: 594ms, Mean: 269ms) | 50 q/s (95th %ile: 10637ms, Mean: 6915ms) | 33 q/s (95th %ile: 354ms, Mean: 253ms) | 36 q/s (95th %ile: 19876ms, Mean: 12026ms)
Term facet | 43 q/s (95th %ile: 323ms, Mean: 238ms) | 55 q/s (95th %ile: 2157ms, Mean: 1048ms) | 28 q/s (95th %ile: 508ms, Mean: 272ms) | 36 q/s (95th %ile: 21027ms, Mean: 14403ms)
Tree facet | 43 q/s (95th %ile: 427ms, Mean: 253ms) | 55 q/s (95th %ile: 2326ms, Mean: 964ms) | 28 q/s (95th %ile: 771ms, Mean: 305ms) | 51 q/s (95th %ile: 9362ms, Mean: 6253ms)
Range facet with two filters | 43 q/s (95th %ile: 314ms, Mean: 232ms) | 56 q/s (95th %ile: 926ms, Mean: 337ms) | 33 q/s (95th %ile: 316ms, Mean: 236ms) | 47 q/s (95th %ile: 16386ms, Mean: 8811ms)
Three facets | 22 q/s (95th %ile: 755ms, Mean: 438ms) | 24 q/s (95th %ile: 18954ms, Mean: 12531ms) | 11 q/s (95th %ile: 859ms, Mean: 495ms) | 17 q/s (95th %ile: 31109ms, Mean: 19099ms)

Run Analysis

The following deductions can be made from this run:

  • Adding filtering to a search query does not have a strong impact on performance.
  • The three aggregations (range, term, tree) have a similar query cost.
  • Adding two filters on top of the range facet is actually less costly. This is because the aggregation is performed on a smaller (filtered) dataset.
  • The query containing three aggregations is the most expensive operation of our entire suite.

With regard to indexing, the impact of stress-testing the system on indexing time is not significant. In contrast, the impact on querying of a system undergoing full re-indexing is more significant, with between 9% and 34% performance degradation.

Run #2

Number of ES Documents | Dataset size | Search Hits
14,585,551 | 19 GB | 549,171

Indexing time: 4 hours 53 min (nominal), 5 hours 3 min (during query benchmark)

Query Throughput (queries/second)

Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max
Simple search | 63 q/s (95th %ile: 488ms, Mean: 184ms) | 77 q/s (95th %ile: 2340ms, Mean: 409ms) | 63 q/s (95th %ile: 633ms, Mean: 237ms) | 67 q/s (95th %ile: 23547ms, Mean: 11358ms)
Two filters | 62 q/s (95th %ile: 482ms, Mean: 186ms) | 77 q/s (95th %ile: 780ms, Mean: 210ms) | 63 q/s (95th %ile: 458ms, Mean: 192ms) | 66 q/s (95th %ile: 929ms, Mean: 359ms)
Range facet | 43 q/s (95th %ile: 305ms, Mean: 242ms) | 55 q/s (95th %ile: 2924ms, Mean: 2281ms) | 28 q/s (95th %ile: 496ms, Mean: 267ms) | 40 q/s (95th %ile: 10324ms, Mean: 7411ms)
Term facet | 43 q/s (95th %ile: 395ms, Mean: 252ms) | 55 q/s (95th %ile: 12779ms, Mean: 7920ms) | 28 q/s (95th %ile: 305ms, Mean: 233ms) | 36 q/s (95th %ile: 19539ms, Mean: 6386ms)
Tree facet | 43 q/s (95th %ile: 519ms, Mean: 267ms) | 55 q/s (95th %ile: 13353ms, Mean: 8393ms) | 28 q/s (95th %ile: 362ms, Mean: 247ms) | 38 q/s (95th %ile: 14958ms, Mean: 11154ms)
Range facet with two filters | 43 q/s (95th %ile: 285ms, Mean: 225ms) | 60 q/s (95th %ile: 3181ms, Mean: 2554ms) | 33 q/s (95th %ile: 296ms, Mean: 236ms) | 42 q/s (95th %ile: 4961ms, Mean: 3215ms)
Three facets | 22 q/s (95th %ile: 686ms, Mean: 422ms) | 26 q/s (95th %ile: 9610ms, Mean: 6559ms) | 11 q/s (95th %ile: 1051ms, Mean: 529ms) | 17 q/s (95th %ile: 26802ms, Mean: 13018ms)

Run Analysis

For this run, the deductions from run #1 are still mostly valid. Note that the query that seems to be most impacted by the increase in dataset size is the simple search, while all the aggregations behave very similarly.

When performing a free text search, the system has to search through a wide variety of elements (which increased alongside the dataset), while the fields being aggregated on do not see much variation in their content. Since run #1 is a subset of the data in run #2, the number of unique values for the author field is similar. As the site is larger in run #2, there are more documents for each author but a similar number of authors.


Run #3

Number of ES Documents | Dataset size | Search Hits
21,772,140 | 347.1 GB | 1,660,897

Indexing time: n/a (nominal), 17 hours 22 min (during query benchmark)

Query Throughput (queries/second)

Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max
Simple search | 43 q/s (95th %ile: 251ms, Mean: 193ms) | 62 q/s (95th %ile: 1641ms, Mean: 664ms) | 44 q/s (95th %ile: 328ms, Mean: 192ms) | 62 q/s (95th %ile: 2419ms, Mean: 882ms)
Two filters | 63 q/s (95th %ile: 402ms, Mean: 189ms) | 77 q/s (95th %ile: 1719ms, Mean: 322ms) | 56 q/s (95th %ile: 522ms, Mean: 246ms) | 63 q/s (95th %ile: 1114ms, Mean: 350ms)
Range facet | 22 q/s (95th %ile: 491ms, Mean: 340ms) | 29 q/s (95th %ile: 15288ms, Mean: 9591ms) | 28 q/s (95th %ile: 424ms, Mean: 267ms) | Same as recommended, errors if above
Term facet | 22 q/s (95th %ile: 526ms, Mean: 333ms) | 31 q/s (95th %ile: 10326ms, Mean: 6668ms) | 28 q/s (95th %ile: 280ms, Mean: 232ms) | 32 q/s (95th %ile: 8056ms, Mean: 6444ms)
Tree facet | 22 q/s (95th %ile: 468ms, Mean: 326ms) | 32 q/s (95th %ile: 7388ms, Mean: 4679ms) | 28 q/s (95th %ile: 288ms, Mean: 239ms) | 33 q/s (95th %ile: 1915ms, Mean: 908ms)
Range facet with two filters | 22 q/s (95th %ile: 491ms, Mean: 340ms) | 29 q/s (95th %ile: 15288ms, Mean: 9591ms) | 28 q/s (95th %ile: 504ms, Mean: 282ms) | 37 q/s (95th %ile: 14138ms, Mean: 8394ms)
Three facets | 11 q/s (95th %ile: 1124ms, Mean: 686ms) | Same as recommended, errors if above | 11 q/s (95th %ile: 741ms, Mean: 454ms) | 25 q/s (95th %ile: 14294ms, Mean: 8740ms)

Run Analysis

This run stretches the system in terms of capacity. For such a project, we would recommend deploying additional Elasticsearch resources if building a search experience making use of aggregations.

Run #4

This last run was performed with increased Elasticsearch resources to measure impact on performance (3x 15GB RAM, 480GB SSD using AWS “aws.data.highio.i3” instance).

Number of ES Documents | Dataset size | Search Hits
21,772,140 | 347.1 GB | 1,660,897

Indexing time: n/a (nominal), n/a (during query benchmark)

Query Throughput (queries/second)

Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max
Simple search | 63 q/s (95th %ile: 589ms, Mean: 225ms) | 91 q/s (95th %ile: 4277ms, Mean: 1803ms) | n/a (not tested) | n/a (not tested)
Two filters | 63 q/s (95th %ile: 423ms, Mean: 181ms) | 90 q/s (95th %ile: 4034ms, Mean: 2172ms) | n/a (not tested) | n/a (not tested)
Range facet | 42 q/s (95th %ile: 422ms, Mean: 314ms) | 50 q/s (95th %ile: 2549ms, Mean: 938ms) | n/a (not tested) | n/a (not tested)
Term facet | 42 q/s (95th %ile: 490ms, Mean: 307ms) | 53 q/s (95th %ile: 1333ms, Mean: 568ms) | n/a (not tested) | n/a (not tested)
Tree facet | 42 q/s (95th %ile: 538ms, Mean: 329ms) | 53 q/s (95th %ile: 1194ms, Mean: 529ms) | n/a (not tested) | n/a (not tested)
Range facet with two filters | 56 q/s (95th %ile: 531ms, Mean: 261ms) | 76 q/s (95th %ile: 2068ms, Mean: 1568ms) | n/a (not tested) | n/a (not tested)
Three facets | 11 q/s (95th %ile: 687ms, Mean: 498ms) | 24 q/s (95th %ile: 1378ms, Mean: 737ms) | n/a (not tested) | n/a (not tested)

Run Analysis

Run #4 should be compared with run #3, since both use the same dataset and the only difference is the additional resources given to the Elasticsearch cluster. We can see an increase in throughput, which was expected, but the more interesting result is the ability of the system to cope under load, which is especially noticeable in the 95th %ile and mean latency in the “Max” column.

Example with range facet query

Metric | Run #3 | Run #4
Throughput | 29 q/s | 50 q/s
95th %ile | 15288ms | 2549ms
Mean | 9591ms | 938ms

Notice that although throughput nearly doubled, the latency was much better when the system was operating at its max capacity. At 938ms mean latency when operating at its max, the system was still able to provide a reasonable response time to the end user. This behavior was consistent across all of the queries. In a “real” production scenario, it would be worth spending some more time investigating possible tweaks to the infrastructure as there might be ways to improve the performance even further.
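
The size of the improvement can be quantified directly from the range facet figures above. The sketch below is only arithmetic on those published numbers; the dictionary keys and function names are hypothetical.

```python
run3 = {"throughput_qps": 29, "p95_ms": 15288, "mean_ms": 9591}
run4 = {"throughput_qps": 50, "p95_ms": 2549, "mean_ms": 938}


def gain_factor(before, after):
    """How many times larger the value is after the change."""
    return after / before


def reduction_percent(before, after):
    """Relative decrease, as a percentage of the initial value."""
    return 100 * (before - after) / before


print(f"Throughput: x{gain_factor(run3['throughput_qps'], run4['throughput_qps']):.2f}")
print(f"95th %ile latency: -{reduction_percent(run3['p95_ms'], run4['p95_ms']):.1f}%")
print(f"Mean latency: -{reduction_percent(run3['mean_ms'], run4['mean_ms']):.1f}%")
```

For this query, the max throughput is about 1.7 times higher in run #4, while the 95th percentile and mean latencies at max capacity drop by roughly 83% and 90% respectively.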

Test summary

The four runs that were executed during this series of benchmarks should provide you with a good baseline to understand the expected performance of your search environment.

Three different datasets were analyzed (239,043 hits, 549,171 hits, and 1,660,897 hits). By reviewing your dataset size, your desired performance, and the type of search queries being performed, you will get a good sense of either the resources needed or the expected performance.