Augmented Search capacity planning
Deploying a powerful search offers an opportunity to greatly improve the end-user experience, but it can also introduce new challenges, primarily around the performance of the search backend.
Due to the nature of the operations being performed and the performance expected of its querying system, Elasticsearch capacity planning can be complex: you must strike the right balance between dataset size and cluster size.
This topic discusses factors to consider when sizing your environment (Elasticsearch primarily) for Augmented Search as well as design considerations potentially impacting performance.
Indexing vs. querying
As with any data store, there are two types of operations performed by Augmented Search:
- Indexing includes operations associated with extracting data from a Jahia page to make it searchable.
- Querying includes operations associated with fetching data, either in the form of search results or aggregations (for example, the number of documents created by all authors on the platform).
In Augmented Search, indexing can be performed either in bulk (for example when a new site is imported), or on-the-fly as documents are saved (and/or published) by the content editor.
While querying and single document indexing must be sub-second operations (and generally sub-200ms operations), bulk indexing operations can sometimes take hours depending on the size of the dataset to index.
Indexing considerations
As the volume of data to index has a direct impact on bulk indexing time, ensure that only the content that needs to be searchable is sent for indexation. Keep in mind that what might not be critical for single document indexing can easily become a challenge when this operation has to be repeated a million times.
When configuring Augmented Search, be conservative: only index the data actually required to build your search experience, and avoid indexing data just because you might need it later.
The following settings in the Augmented Search configuration file (org.jahia.modules.augmentedsearch.cfg) have a direct impact on the amount of data that is indexed.
Property | Description |
---|---|
indexParentCategories | Indexes all ancestors as well as the category itself |
content.mappedNodeTypes | Indexes additional properties to make them available for faceting or filtering. |
content.indexedSubNodeTypes | Aggregates subnode properties alongside their main resource in the indexed document |
content.indexedMainResourceTypes | Node types to be indexed as main resources |
content.indexedFileExtensions | File extensions to be indexed |
workspaces | Specifies the workspace to be indexed (ALL or LIVE only) |
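As an illustration, here is a minimal sketch of what such a configuration excerpt might look like. The property names come from the table above; all values are assumptions for a hypothetical project and must be adapted to your own content model (depending on your version, keys may also need to be fully qualified):

```properties
# Illustrative excerpt of org.jahia.modules.augmentedsearch.cfg.
# All values below are assumptions, not defaults.

# Index only the live workspace to reduce the indexed volume
workspaces=LIVE

# Hypothetical node types whose properties should be available
# for faceting or filtering
content.mappedNodeTypes=jnt:news,jnt:event

# Only index these file extensions
content.indexedFileExtensions=pdf,doc,docx

# Index category ancestors along with the category itself
indexParentCategories=true
```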
Indexing features
To provide more granularity for its indexing operation, Augmented Search offers the ability to trigger indexing by site and workspace.
Indexing per site
Using the configuration UI (Jahia 8.X only) or the GraphQL API (Jahia 7.X and 8.X), you can trigger indexing for one site, multiple sites, or all sites at once. This example shows how to index only the digitall site:
```graphql
mutation {
  admin {
    search {
      startIndex(siteKeys: ["digitall"])
    }
  }
}
```
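Since startIndex accepts a list of site keys (and the text above notes you can target multiple sites), the same mutation can cover several sites in one call. The second site key below is a placeholder:

```graphql
mutation {
  admin {
    search {
      # "anotherSite" is a placeholder site key
      startIndex(siteKeys: ["digitall", "anotherSite"])
    }
  }
}
```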
Note that Augmented Search indexes sites serially (one site at a time), so the time it takes to index all of your sites in bulk equals the sum of the times to index each site individually. The flexibility lies in the ability to reindex only a portion of the dataset, limiting the impact indexing operations have on the infrastructure.
Indexing per workspace
You can also trigger bulk indexing only for a particular workspace. This example shows how to index the live workspace for the digitall site.
```graphql
mutation {
  admin {
    search {
      startIndex(siteKeys: ["digitall"], workspace: LIVE)
    }
  }
}
```
Cluster sizing
With Augmented Search, you can run bulk indexing operations without impacting querying capabilities. Bulk indexing performs the indexing operation and creates a new index. Once indexing is successful, Augmented Search redirects queries to the new index and deletes the old one. This means that your cluster must be able to hold at least twice the amount of data during the indexing operation.
A reindexing operation across all sites
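To make the arithmetic concrete: for the 9.9 GB dataset used in run #1 below, a full reindex would temporarily require roughly 20 GB of index storage on the cluster, until the old index is deleted.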
Querying considerations
Query-time performance, key to a positive user experience when searching, is typically measured as the number of milliseconds it takes for a response to come back after a user interaction (usually a user typing a search query).
In most cases, connectivity and the operations performed by resources on the path between the user and the data store add little latency (a few milliseconds at most). With Jahia, data transits between the user and a Jahia server (possibly through a reverse proxy). Jahia analyzes the GraphQL query, converts it into an Elasticsearch query, sends it to Elasticsearch for processing, and then returns the response along the same path.
The complexity of an operation also has an impact on performance. The more complex an operation is, the longer it takes for Elasticsearch to process it and return a response. For example, if performing an aggregation on a field takes 150ms, performing the same aggregation on two different fields is likely going to take 300ms, and so on. Also, when complex queries run, the load on Elasticsearch increases accordingly, reducing the available compute time for other users.
In short, if your server has 10 units of processing capacity per second and a query consumes 1 unit, you can serve 10 users in that interval. But if a query consumes 4 units, you can only serve 2 users in that same interval.
Finally, the size of the dataset compared to the actual technical specifications (CPU, RAM, and disk of the Elasticsearch cluster nodes) also plays an important role in the capacity of the server to process queries efficiently.
To sum up, the following factors are important (in no particular order):
- Processing cost of the query
- Size of the dataset
- Load on the server
- Cluster technical specifications
Elasticsearch sizing
The good news is that Elasticsearch scales very well horizontally. Adding more resources (more nodes) to an Elasticsearch cluster is generally an answer to performance or load issues.
With Jahia, the operations performed during search are not compute intensive, and we have not run into a situation where Jahia itself was a significant bottleneck. However, adding more resources can quickly become very costly, hence the need to optimize querying.
Query and UI optimization
When designing a search experience, developers should consider resource utilization and optimize queries to avoid unnecessary load on the infrastructure or, even better, avoid operations that fetch data users are unlikely to need. This is not just a developer's concern: UI/UX designers should also consider the impact their designs have on the infrastructure.
Search-as-you-type, the Google example
Search-as-you-type is a good example of a search experience that can potentially become very expensive if not implemented properly.
For example, Google built their search experience to optimize consumption of their infrastructure (of course!) while providing a good user experience. As you start typing a query, Google gives you clues about search terms associated with what you typed, but does not actually give you more than a title, and possibly a subtitle and an image. As a user types search criteria, many micro-requests are sent to the Google infrastructure. By reducing the query scope, Google also reduces the load on its servers.
Pushing the logic further, it is also likely that, by recommending search terms during the initial search, Google aims to redirect users to cached content: serving a user who clicks on a recommended search term is less expensive than processing every custom search term entered by a visitor.
Then, and only then, when you press Enter (or click the magnifying glass), does Google perform a query fetching far more information about the search terms (including a title, thumbnails, a search excerpt, and other metadata). This expensive operation is deferred until the very last moment, in the hope that the search term actually corresponds to what the user is looking for, limiting the need to refine further (and perform additional expensive queries).
Here are some factors to consider when building your search experience:
- Use different queries (and different results) for search-as-you-type than for your full search page (containing facets, filters, and more); see the sketch below
- Don’t fetch facets by default. Let the user enter search terms first, then perform aggregation on the content.
- Tweak the delay between keystrokes before actually triggering a search (debouncing). If a user is interested in tomatoes, wait for them to type the full word instead of triggering a search on each keystroke (“t”, “to”, “tom”), or wait for a short delay after the last keystroke.
This doesn’t mean you absolutely have to follow these recommendations. If search is a key element of your platform and your infrastructure is sized accordingly, you could offer a very detailed search experience from the start.
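For illustration only, the sketch below contrasts the two query shapes in GraphQL. The field names (q, results, hits, displayableName, excerpt, link) are assumptions and may not match your Augmented Search schema version; treat this as the shape to aim for rather than a copy-paste query:

```graphql
# Sketch only: field names are assumptions, not the actual schema.

# Lightweight search-as-you-type query: a handful of titles, no facets,
# no excerpts, triggered only after a debounce delay.
query SuggestAsYouType {
  search(q: "tom", workspace: LIVE) {
    results {
      hits {
        displayableName
      }
    }
  }
}

# Full search-page query, run only once the user submits their search:
# excerpts, links, and hit counts (facet aggregations would also go here).
query FullSearchPage {
  search(q: "tomatoes", workspace: LIVE) {
    results {
      totalHitCount
      hits {
        displayableName
        excerpt
        link
      }
    }
  }
}
```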
No one-size-fits-all solution
As you might guess from the various factors covered previously, sizing an Elasticsearch cluster is a complex exercise that depends on a wide variety of factors unique to each environment. The structure of your data, the amount of it, and the queries you are going to perform all play an important role in determining the ideal cluster size.
We understand that before jumping into Augmented Search, it is essential to understand its resource needs. Instead of providing generic numbers that might be inaccurate, we decided to provide benchmark results for a variety of scenarios, allowing you to relate your implementation needs to the datasets we tested and to understand the associated boundaries.
Benchmarking
Approach
Our primary objective when creating our test runs (using JMeter) was to progressively increase the query load until the Elasticsearch cluster reached 100% CPU utilization, collecting the mean response times and error rates associated with a particular throughput along the way.
Using a ramp-up time of 90 seconds, we progressively increased the sample size (the count of API calls) and monitored the behavior across a wide variety of queries. For example, a sample size of 1000 calls over 90 seconds corresponds to a target throughput of roughly 11 calls per second, which matches the measured throughput in the first row of the table below.
Sample size | Error count | Latency (ms) | Throughput (q/s) |
---|---|---|---|
1000 | 0 | 277 | 11 |
2000 | 0 | 278 | 22 |
3000 | 0 | 300 | 33 |
4000 | 0 | 323 | 43 |
5000 | 0 | 2157 | 55 |
6000 | 0 | 21127 | 54 |
7000 | 1465 | 27645 | 70 |
Results of a term aggregation query
Generally, latency increases progressively alongside the load until the system reaches a point (usually at 100% CPU consumption) where it cannot process requests fast enough. The system then starts queuing them (which further increases latency) until it cannot cope anymore and starts erroring out.
The above results are a typical illustration: latency increases only slightly until a sample size of 4000 API calls (with a mean latency of 323ms at 43 queries per second). The next two samples (5000 and 6000) do not error out, but latency goes through the roof (and would not be acceptable in production). Also notice that these two runs have an almost identical throughput, which corresponds to the system's maximum capacity.
Finally, at 7000, the API starts failing, with 1,465 failed calls out of 7000. Interestingly, at this sample size the throughput increases, but that is simply because the server is refusing calls (“Nope, too busy! Come back later”).
When analyzing test results, we look for the point where multiple runs reach a similar throughput, which usually corresponds to the throughput of the system operating at its maximum capacity.
For the results above, the number we’ll keep for our metrics is 55 queries per second.
Queries and search terms
To provide accurate results, we ensured that the search terms matched actual content (queries with 0 results use fewer resources). In the context of these benchmarks, we ran all queries in 4 different languages (German, English, French, and Portuguese) using 6 different search strings per language. When performing aggregations or applying filters, we ensured that these also matched content on the sites.
As for the queries themselves, we performed the runs with 7 queries of varying complexity:
Query Name | Description |
---|---|
Simple search | Performs a simple search in a set language and returns a name, excerpt, and path. This is the most common search experience. |
Two filters | Adds two filters (on author and publication date range) to “Simple search”. Returns the pages created by John D. between November 15, 2020 and February 15, 2021 matching a set query string. |
Range facet | Performs a range aggregation (over 8 buckets) in addition to “Simple search” |
Term facet | Performs a term aggregation, by author, in addition to “Simple search” |
Tree facet | Performs a tree aggregation by categories (similar to a term facet, but also identifies whether each tree element has children), in addition to “Simple search” |
Range facet with two filters | Combines “Simple search”, “Range facet”, and “Two filters” in the same query |
Three facets | Combines “Simple search”, “Range facet”, “Term facet”, and “Tree facet” in the same query |
During our benchmarks, we also performed a run while indexing was occurring to measure the impact of whole-site indexing on search queries.
Data caveats
These tests represent a snapshot of an environment at a particular point in time with a particular dataset and are not going to be perfectly reproducible (even with the same environment). You will get different results!
In these benchmarks, the throughput is expressed in queries/second, and should not be confused with the ability of the Jahia platform to serve a specific number of users per second. This throughput corresponds to the number of API calls processed by Augmented Search (for example, users all clicking on the search button at the same second).
Remember that these numbers correspond to the maximum throughput and should not be used to size a nominal production environment. We recommend lowering these numbers by 25-30% for nominal production to keep room for unplanned increases in traffic (for example, a measured maximum of 55 q/s translates to a nominal target of roughly 40 q/s).
Testing infrastructure
The following environment was used for the tests:
- Elasticsearch cluster: 3x 8GB RAM, 240GB SSD using AWS “aws.data.highio.i3” instance
- Jahia cluster: 2x 8 Cores 8GB RAM, SSD storage (1 browsing/authoring + 1 processing)
- Jahia database: 3x 8 Cores 8GB RAM, SSD storage
- JMeter host: 1x 32 Cores, 64 GB RAM using AWS “c5a.8xlarge”
Understanding the data
Consider the following factors when reviewing the run results:
- Search Hits: The number of elements searchable through the Augmented Search API across the entire platform. One page translated into 15 different languages and available in both the live and edit workspaces accounts for 30 search hits.
- Mean: The mean response time in milliseconds across all queries in a particular batch. For example, for a batch of 5000 API calls within 90 seconds, the mean is computed over those 5000 queries.
- 95th %ile: The response time in milliseconds below which 95% of the responses in the batch completed.
- Recommended: The highest throughput for which the 95th percentile stays below 500ms.
- Max: Without considering latency, the maximum throughput obtained without API errors. This gives a notion of the potential maximum that the system can support.
Since we execute our tests in batches of API calls (for example 1000, 2000, and 3000) with a fixed ramp-up time (90s), we get a good understanding of the maximum throughput supported by the system, but we cannot get precise latency measurements for every single increment in measured throughput.
For illustrative purposes, we performed a run with 200-call increments, which provides a good sample of results across all runs. Below, note the recommended and max values, and notice that soon after the max, the 95th percentile latency increases significantly and the system starts erroring out around the 6600 sample count.
Run results
Remember, we submit the sample count within a 90-second window. Compare the measured throughput with the average rate of submitted calls (sample count / 90s) and notice that the two lines diverge at 5000 samples. At this point, the API is unable to process the data in “real time” and begins queuing API calls.
Also notice that our recommendation is 50 q/s (4400 samples) while the real-time max is actually 55.5 q/s (5000 samples). The reason is that we want, as much as possible, to avoid going much over 500ms latency.
Run #1
Number of ES Documents | Dataset size | Search Hits |
---|---|---|
7,890,693 | 9.9 GB | 239,043 |
Metric | Nominal | During query benchmark |
---|---|---|
Indexing time | 1 hour 10 min | 1 hour 18 min |
Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max |
---|---|---|---|---|
Simple search | 67 q/s (95th %ile: 594ms, mean: 195ms) | 90 q/s (95th %ile: 5214ms, mean: 3795ms) | 63 q/s (95th %ile: 358ms, mean: 174ms) | 67 q/s (95th %ile: 1024ms, mean: 246ms) |
Two filters | 56 q/s (95th %ile: 202ms, mean: 156ms) | 89 q/s (95th %ile: 4634ms, mean: 3209ms) | 63 q/s (95th %ile: 318ms, mean: 176ms) | 77 q/s (95th %ile: 1264ms, mean: 264ms) |
Range facet | 43 q/s (95th %ile: 594ms, mean: 269ms) | 50 q/s (95th %ile: 10637ms, mean: 6915ms) | 33 q/s (95th %ile: 354ms, mean: 253ms) | 36 q/s (95th %ile: 19876ms, mean: 12026ms) |
Term facet | 43 q/s (95th %ile: 323ms, mean: 238ms) | 55 q/s (95th %ile: 2157ms, mean: 1048ms) | 28 q/s (95th %ile: 508ms, mean: 272ms) | 36 q/s (95th %ile: 21027ms, mean: 14403ms) |
Tree facet | 43 q/s (95th %ile: 427ms, mean: 253ms) | 55 q/s (95th %ile: 2326ms, mean: 964ms) | 28 q/s (95th %ile: 771ms, mean: 305ms) | 51 q/s (95th %ile: 9362ms, mean: 6253ms) |
Range facet with two filters | 43 q/s (95th %ile: 314ms, mean: 232ms) | 56 q/s (95th %ile: 926ms, mean: 337ms) | 33 q/s (95th %ile: 316ms, mean: 236ms) | 47 q/s (95th %ile: 16386ms, mean: 8811ms) |
Three facets | 22 q/s (95th %ile: 755ms, mean: 438ms) | 24 q/s (95th %ile: 18954ms, mean: 12531ms) | 11 q/s (95th %ile: 859ms, mean: 495ms) | 17 q/s (95th %ile: 31109ms, mean: 19099ms) |
Run Analysis

The following deductions can be made from this run:
Run #2
Number of ES Documents | Dataset size | Search Hits |
---|---|---|
14,585,551 | 19 GB | 549,171 |
Metric | Nominal | During query benchmark |
---|---|---|
Indexing time | 4 hours 53 min | 5 hours 3 min |
Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max |
---|---|---|---|---|
Simple search | 63 q/s (95th %ile: 488ms, mean: 184ms) | 77 q/s (95th %ile: 2340ms, mean: 409ms) | 63 q/s (95th %ile: 633ms, mean: 237ms) | 67 q/s (95th %ile: 23547ms, mean: 11358ms) |
Two filters | 62 q/s (95th %ile: 482ms, mean: 186ms) | 77 q/s (95th %ile: 780ms, mean: 210ms) | 63 q/s (95th %ile: 458ms, mean: 192ms) | 66 q/s (95th %ile: 929ms, mean: 359ms) |
Range facet | 43 q/s (95th %ile: 305ms, mean: 242ms) | 55 q/s (95th %ile: 2924ms, mean: 2281ms) | 28 q/s (95th %ile: 496ms, mean: 267ms) | 40 q/s (95th %ile: 10324ms, mean: 7411ms) |
Term facet | 43 q/s (95th %ile: 395ms, mean: 252ms) | 55 q/s (95th %ile: 12779ms, mean: 7920ms) | 28 q/s (95th %ile: 305ms, mean: 233ms) | 36 q/s (95th %ile: 19539ms, mean: 6386ms) |
Tree facet | 43 q/s (95th %ile: 519ms, mean: 267ms) | 55 q/s (95th %ile: 13353ms, mean: 8393ms) | 28 q/s (95th %ile: 362ms, mean: 247ms) | 38 q/s (95th %ile: 14958ms, mean: 11154ms) |
Range facet with two filters | 43 q/s (95th %ile: 285ms, mean: 225ms) | 60 q/s (95th %ile: 3181ms, mean: 2554ms) | 33 q/s (95th %ile: 296ms, mean: 236ms) | 42 q/s (95th %ile: 4961ms, mean: 3215ms) |
Three facets | 22 q/s (95th %ile: 686ms, mean: 422ms) | 26 q/s (95th %ile: 9610ms, mean: 6559ms) | 11 q/s (95th %ile: 1051ms, mean: 529ms) | 17 q/s (95th %ile: 26802ms, mean: 13018ms) |
Run Analysis

For this run, the deductions from run #1 remain mostly valid. Note that the query most impacted by the increase in dataset size is the simple search, while all the aggregations behave very similarly. When performing free-text search, the system has to search through a wide variety of elements (which grew alongside the dataset), while the fields being aggregated on do not see much variation in their content. Since run #1 is a subset of the data in run #2, the number of unique values for the author fields is similar: the site is larger in run #2, so there are more documents per author, but a similar number of authors.
Run #3
Number of ES Documents | Dataset size | Search Hits |
---|---|---|
21,772,140 | 347.1 GB | 1,660,897 |
Metric | Nominal | During query benchmark |
---|---|---|
Indexing time | n/a | 17 hours 22 min |
Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max |
---|---|---|---|---|
Simple search | 43 q/s (95th %ile: 251ms, mean: 193ms) | 62 q/s (95th %ile: 1641ms, mean: 664ms) | 44 q/s (95th %ile: 328ms, mean: 192ms) | 62 q/s (95th %ile: 2419ms, mean: 882ms) |
Two filters | 63 q/s (95th %ile: 402ms, mean: 189ms) | 77 q/s (95th %ile: 1719ms, mean: 322ms) | 56 q/s (95th %ile: 522ms, mean: 246ms) | 63 q/s (95th %ile: 1114ms, mean: 350ms) |
Range facet | 22 q/s (95th %ile: 491ms, mean: 340ms) | 29 q/s (95th %ile: 15288ms, mean: 9591ms) | 28 q/s (95th %ile: 424ms, mean: 267ms) | Same as recommended, errors if above |
Term facet | 22 q/s (95th %ile: 526ms, mean: 333ms) | 31 q/s (95th %ile: 10326ms, mean: 6668ms) | 28 q/s (95th %ile: 280ms, mean: 232ms) | 32 q/s (95th %ile: 8056ms, mean: 6444ms) |
Tree facet | 22 q/s (95th %ile: 468ms, mean: 326ms) | 32 q/s (95th %ile: 7388ms, mean: 4679ms) | 28 q/s (95th %ile: 288ms, mean: 239ms) | 33 q/s (95th %ile: 1915ms, mean: 908ms) |
Range facet with two filters | 22 q/s (95th %ile: 491ms, mean: 340ms) | 29 q/s (95th %ile: 15288ms, mean: 9591ms) | 28 q/s (95th %ile: 504ms, mean: 282ms) | 37 q/s (95th %ile: 14138ms, mean: 8394ms) |
Three facets | 11 q/s (95th %ile: 1124ms, mean: 686ms) | Same as recommended, errors if above | 11 q/s (95th %ile: 741ms, mean: 454ms) | 25 q/s (95th %ile: 14294ms, mean: 8740ms) |
Run Analysis

This run stretches the system in terms of capacity. For such a project, we would recommend deploying additional Elasticsearch resources when building a search experience that makes use of aggregations.
Run #4
This last run was performed with increased Elasticsearch resources to measure the impact on performance (3x 15GB RAM, 480GB SSD using the AWS “aws.data.highio.i3” instance).
Number of ES Documents | Dataset size | Search Hits |
---|---|---|
21,772,140 | 347.1 GB | 1,660,897 |
Metric | Nominal | During query benchmark |
---|---|---|
Indexing time | n/a | n/a |
Query | Nominal: Recommended | Nominal: Max | During indexation: Recommended | During indexation: Max |
---|---|---|---|---|
Simple search | 63 q/s (95th %ile: 589ms, mean: 225ms) | 91 q/s (95th %ile: 4277ms, mean: 1803ms) | n/a (not tested) | n/a (not tested) |
Two filters | 63 q/s (95th %ile: 423ms, mean: 181ms) | 90 q/s (95th %ile: 4034ms, mean: 2172ms) | n/a (not tested) | n/a (not tested) |
Range facet | 42 q/s (95th %ile: 422ms, mean: 314ms) | 50 q/s (95th %ile: 2549ms, mean: 938ms) | n/a (not tested) | n/a (not tested) |
Term facet | 42 q/s (95th %ile: 490ms, mean: 307ms) | 53 q/s (95th %ile: 1333ms, mean: 568ms) | n/a (not tested) | n/a (not tested) |
Tree facet | 42 q/s (95th %ile: 538ms, mean: 329ms) | 53 q/s (95th %ile: 1194ms, mean: 529ms) | n/a (not tested) | n/a (not tested) |
Range facet with two filters | 56 q/s (95th %ile: 531ms, mean: 261ms) | 76 q/s (95th %ile: 2068ms, mean: 1568ms) | n/a (not tested) | n/a (not tested) |
Three facets | 11 q/s (95th %ile: 687ms, mean: 498ms) | 24 q/s (95th %ile: 1378ms, mean: 737ms) | n/a (not tested) | n/a (not tested) |
Run Analysis

Run #4 should be looked at in comparison with run #3, since the two use the same dataset; the only difference is the additional resources given to the Elasticsearch cluster. We can see an increase in throughput, which was expected, but what is more interesting is the system's ability to cope under load, especially noticeable in the 95th %ile and mean latencies of the Max columns.

Example with range facet query
Test summary
The four runs that were executed during this series of benchmarks should provide you with a good baseline to understand the expected performance of your search environment.
Three different datasets were analyzed (239,043 hits, 549,171 hits, and 1,660,897 hits). By reviewing your dataset size, your desired performance, and the types of search queries being performed, you should get a good sense of either the resources needed or the performance you can expect.