Performance and capacity planning

November 14, 2023

While developing jCustomer, we regularly measure system performance to get a sense of its capacity characteristics, understand its behavior under load, and assess whether changes to our codebase have an impact on performance.

In this document, we present our approach to performance tests and provide results based on the environments we tested jCustomer in.

Performance testing is a complex domain, in which opinionated decisions are made. Please take the time to read and understand the approach we took, prior to reviewing the actual performance metrics.

Approach

Our primary objective when measuring performance is to progressively increase user traffic until we notice degradation. While the tests are running, we collect the response times and error rates associated with a given throughput, as well as the underlying resource utilization (CPU, RAM, disk, network…), allowing us to pinpoint performance bottlenecks.

Definitions

Some of the terms mentioned in this document are used in the context of performance testing and are worth a few more details:

  • ramp-up refers to the amount of time until all users within a run have been submitted.
  • users refers to the number of users submitted by the end of the ramp-up time. A scenario itself is composed of multiple actions. The number of actions in a scenario can easily be calculated by dividing the number of samples by the number of users (see the sketch after this list).
  • mean response time refers to the average response time across all of the scenario actions and all of the users.
  • 90%ile: 90% of the queries are below that response time while 10% are above
  • 95%ile: 95% of the queries are below that response time while 5% are above
  • 99%ile: 99% of the queries are below that response time while 1% are above
  • Throughput: number of queries processed by the system per second
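
To make these definitions concrete, the short Python sketch below derives each metric from a set of collected samples. The sample values, user count, and test duration are made up for illustration and do not come from the actual runs.

    # Minimal sketch: deriving the metrics defined above from raw samples.
    from statistics import mean, quantiles

    samples_ms = [38, 42, 51, 64, 70, 85, 99, 120, 150, 210]  # illustrative response times
    test_duration_s = 30   # time window during which the samples were collected
    users = 2              # users submitted by the end of the ramp-up

    actions_per_user = len(samples_ms) / users        # samples divided by users
    mean_ms = mean(samples_ms)                        # mean response time

    # quantiles(..., n=100) returns the 1st..99th percentile cut points
    pct = quantiles(samples_ms, n=100)
    p90, p95, p99 = pct[89], pct[94], pct[98]

    throughput_qps = len(samples_ms) / test_duration_s   # queries per second

    print(f"mean={mean_ms:.0f}ms p90={p90:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms "
          f"throughput={throughput_qps:.1f}q/s actions/user={actions_per_user:.0f}")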

Testing environment

For this session of performance tests, the test environment was deployed in AWS Ireland (eu-west-1) on resources created specifically for these tests. The entirety of the test environment was destroyed and re-created between each run.

When designing the testing environment, we made the decision to rely on EC2 instances, with each instance running one single resource using docker. This allows for segregation of services and ensures measured metrics (and in particular system load) are specific to the exact resource being tested.

The following hosts were used during the tests:

  • A (large) jMeter host
  • A Jahia host serving the site
  • One or more jCustomer hosts
  • One or more Elasticsearch hosts

All instances and resources used for the tests are colocated in the same VPC (same region); the exact specifications of the EC2 instances are provided next to the test results.

Anatomy of a run

A single performance test run is composed of elements that do not change during the run, such as the testing environment, the ramp-up time, and the test scenario. The only variable element during the run is the number of users, which is progressively increased until the environment begins to fail.

The number of users at each step is chosen to balance the wish for extensive data samples with the desired execution time for the test. It is therefore usually best to identify ranges of interest (see the profiles detailed below) and focus the analysis on these.

At the end of a run, we gather the data collected by jMeter in a format that facilitates its interpretation.
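
For illustration, a run of this kind could be driven by a small script that launches the same jMeter test plan with an increasing number of users and keeps each step's results file. This is a hedged sketch: the test plan name and the users/rampup properties are hypothetical, and only standard jMeter command-line flags (-n, -t, -J, -l) are used.

    # Hypothetical run driver: steps through increasing user counts with jMeter.
    import subprocess

    USER_STEPS = [25, 50, 100, 150, 250, 300, 400, 800]  # progressively increased load

    for users in USER_STEPS:
        subprocess.run(
            [
                "jmeter", "-n",                      # non-GUI mode
                "-t", "user-journey.jmx",            # hypothetical test plan implementing the journey
                f"-Jusers={users}",                  # hypothetical property read by the thread group
                "-Jrampup=10",                       # 10s ramp-up, identical for every step
                "-l", f"results-{users}-users.jtl",  # raw samples collected for this step
            ],
            check=True,                              # abort the run if jMeter fails
        )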

[Figure jcustomer-perf-metrics-run-1.png: Results (metrics) of performance run #1]

We can generally group the results in the 4 different profiles described below.

Result profile: Under-utilized

The system is under-utilized: the load it receives from jMeter is not sufficient to stress it. Results with this profile usually show very similar latency but different throughput.

Such results are still relevant as they provide us with a sense of the fastest response time supported by the environment. 

For example, on the results above, we can see that the environment’s response time behaves similarly between 25 and 100 users. 

In general, you would want to avoid being in that profile for extended periods as it means you are spending more on the environment than necessary.

Result profile: Optimum performances

The environment response time is situated within the optimum operating conditions as defined by the business.

For example, if the desired response time is set by the business at 250ms on the 90%ile, the optimum environment would be able to support the number of users at or next to that value. In the results above this would be somewhere close to 150 users (287ms at 90%ile).

This is the sweet spot and the capacity you should be aiming towards for your environment.
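
As a sketch, finding that sweet spot from a run's results amounts to picking the largest user count whose 90%ile latency stays at or close to the business threshold. The figures below are the run #1 results shown later in this document; the 20% tolerance is an illustrative choice, not part of our methodology.

    # Sketch: pick the "optimum" user count, i.e. the largest tested step whose
    # 90th percentile latency stays at or close to the business threshold.
    THRESHOLD_MS = 250  # 90%ile target defined by the business

    # user count -> 90%ile latency (ms), approximate figures from run #1
    p90_by_users = {100: 82, 150: 287, 250: 1064, 300: 1248}

    def optimum_users(p90_by_users, threshold_ms, tolerance=0.20):
        """Largest user count whose 90%ile is within `tolerance` of the threshold."""
        eligible = [users for users, p90 in p90_by_users.items()
                    if p90 <= threshold_ms * (1 + tolerance)]
        return max(eligible) if eligible else None

    print(optimum_users(p90_by_users, THRESHOLD_MS))  # -> 150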

Result profile: At capacity

The system is performing at its peak with limited queuing and without errors. This profile can be identified by looking for a flattening in the throughput curve (see chart above). 

For example, in the results above you will notice that throughput peaks at 300 users (213q/s). This means that the maximum capacity supported by the infrastructure is situated around 300 users. The environment should be able to sustain such a load over a long period of time without erroring out, albeit with degraded performance (above "optimum performances"), although you would want to monitor memory consumption.

Although you wouldn't want your environment to operate in this profile, it can still be acceptable in worst-case scenarios, for example during an unexpected increase in traffic while your infrastructure team is working on increasing capacity. This roughly corresponds to the load your environment should never exceed.
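
The flattening mentioned above can also be spotted programmatically. The sketch below flags the last step after which adding users no longer yields a meaningful throughput gain; the 10% gain threshold is an arbitrary illustrative value, and the figures approximate run #1.

    # Sketch: find the step where the throughput curve flattens, i.e. where adding
    # users yields less than `min_gain` relative throughput improvement.
    def at_capacity(throughput_by_users, min_gain=0.10):
        steps = sorted(throughput_by_users)
        for prev, curr in zip(steps, steps[1:]):
            gain = (throughput_by_users[curr] - throughput_by_users[prev]) / throughput_by_users[prev]
            if gain < min_gain:
                return prev   # last step that still increased throughput meaningfully
        return steps[-1]      # throughput never flattened within the tested range

    # user count -> throughput (q/s), approximate figures from run #1
    throughput_by_users = {100: 117, 150: 162, 250: 194, 300: 213}
    print(at_capacity(throughput_by_users))  # -> 250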

Result profile: Over-utilized

The system is performing over its capacity and will begin queuing requests, progressively increasing its memory usage and dramatically increasing response times until it starts erroring out.

This is typically the next step, just after the "at capacity" profile. In the example above, we reached that point above 300 users and began noticing errors at 800 users.

This is the no-go zone, reaching that type of load should trigger immediate actions from your operations team.

Test scenario - a user journey

Creating a user journey is complex in nature. For these performance tests, we used product-relevant features, navigating across them following a use case representative of the expected usage of these features in production.

Test site

The site supporting the user journey is composed of 3 pages:

  • Page 1 contains 3 personalized areas (with server-side rendering) using Geolocation, Scoring plan and Goal. It contains a tags aggregation component and is also considered the site’s home page.
  • Page 2 contains 3 personalized areas (with client-side rendering) using the segment “onePage”, the session property “browser” and the profile property “utm_campaign”.
  • Page 3 contains a form with 3 fields (email, firstname, lastname) mapped to the corresponding jExperience profile property.

User journey

When navigating, the test user will follow this scenario:

  • Open page 1 with a campaign parameter and stay on the page for 2 seconds
  • Navigate to page 2 and stay on the page for 2 seconds
  • Navigate to page 3, submit the form, which will fulfill a “newsletter” goal, and stay on the page for 2 seconds
  • Navigate to page 1, a new personalization is displayed due to the “newsletter” goal being fulfilled and stay on the page for 2 seconds
  • Navigate to page 2 and stay on the page for 2 seconds

For each of the steps in the user scenario, we verify that the personalization properties are correctly resolved and displayed.

The entire user journey was measured at 15 seconds during our tests; it is executed twice for each user, bringing the total user journey to 30 seconds per user.
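
For illustration, the journey described above translates to the following plain HTTP script. The base URL, the campaign value and the form data are hypothetical, and the actual tests execute this journey through a jMeter test plan rather than this script.

    # Hedged sketch of the user journey; URLs and form values are hypothetical.
    import time
    import requests

    BASE = "https://test-site.example.com"  # hypothetical test site
    session = requests.Session()            # one session (one simulated visitor) per journey

    # 1. Open page 1 with a campaign parameter, stay on the page for 2 seconds
    session.get(f"{BASE}/page1", params={"utm_campaign": "perf-test"})
    time.sleep(2)

    # 2. Navigate to page 2, stay for 2 seconds
    session.get(f"{BASE}/page2")
    time.sleep(2)

    # 3. Navigate to page 3 and submit the form, fulfilling the "newsletter" goal
    session.get(f"{BASE}/page3")
    session.post(f"{BASE}/page3", data={"email": "jane@example.com",
                                        "firstname": "Jane", "lastname": "Doe"})
    time.sleep(2)

    # 4. Back to page 1: a new personalization is expected now that the goal is fulfilled
    session.get(f"{BASE}/page1")
    time.sleep(2)

    # 5. Navigate to page 2 one last time, stay for 2 seconds
    session.get(f"{BASE}/page2")
    time.sleep(2)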

System load profile

Having a jMeter ramp-up time of 10 seconds jointly with a total user journey of 30 seconds means that the last users should complete their journey 40s after the start of the test.

[Figure load-profile.png: System load profile with 100 users]

For about 20s (between T+10 and T+30), all of the users will be on the platform at the same time, which gives us a pretty good understanding of the supported load in that use case.
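
A minimal sketch of how this load profile is derived, assuming users are started evenly over the ramp-up period:

    # Expected number of concurrent users over time for a 10s ramp-up and a 30s journey.
    def concurrent_users(t, total_users=100, ramp_up_s=10, journey_s=30):
        started = min(total_users, total_users * t / ramp_up_s) if t > 0 else 0
        finished = min(total_users, total_users * (t - journey_s) / ramp_up_s) if t > journey_s else 0
        return started - finished

    for t in (0, 5, 10, 20, 30, 35, 40):
        print(f"T+{t:>2}s: {concurrent_users(t):.0f} concurrent users")
    # All 100 users are active between T+10 and T+30; the last users finish at T+40.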

Another important element to keep in mind when reviewing performance metrics is that these metrics correspond to the average values across an entire scenario for one particular user. While the response time could be 250ms for a given run, this value is itself an average of all of the queries performed by the user within a scenario.

Note of caution

jCustomer is a complex product, offering flexibility in implementing a wide range of use cases. Although flexibility is key for such a platform, it also makes performance results very specific to the implemented use case on that precise environment being tested.

In other words, the results presented in this document offer a baseline to help you understand how performance metrics evolve as the underlying infrastructure is being modified. It also helps in pinpointing infrastructure bottlenecks as environments are being spec’d out.

Unless you run the exact same test scenario, YOU WILL NOT obtain the same performance metrics on your production environment. And although we designed the test scenario to be representative of a typical jCustomer use case, such a scenario remains a static implementation and by nature cannot be compared to the variability introduced by "real-world" users in a production environment.

Performance tests

In these tests, we’re going to start with very small resources, progressively increase their size and see how this impacts metrics.

The threshold for optimum performance is set at 250ms on the 90%ile.

All tests will be running with a c5.9xlarge jMeter instance (36 vCPU, 72GB RAM), which provides sufficient throughput without becoming the bottleneck in any of the planned runs. Jahia itself will be running on a small instance, and although Jahia is not on the critical path for this test, we make sure its resource utilization remains low during the test to confirm it does not become a bottleneck.

Please note that deploying a single-node Elasticsearch instance is not recommended by Elastic in production due to the risk of data loss; it is presented here because it is useful for understanding bottlenecks.

Run #1

Environment

  Name           Specs
  jCustomer      t2.medium (2 vCPU, 4GB RAM)
  Elasticsearch  t2.medium (2 vCPU, 4GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 100        150                    250           above 300
  Latency (mean)     38ms             111ms                  483ms         605ms
  Latency (90%ile)   82ms             287ms                  1064ms        1248ms
  Latency (95%ile)   112ms            359ms                  1208ms        1336ms
  Latency (99%ile)   199ms            524ms                  1414ms        1517ms
  Throughput         117q/s           162q/s                 194q/s        213q/s

Run Analysis

This run was performed using small and identical instances and serves as a baseline for all subsequent runs detailed in this document.

Resource usage on the host running Elasticsearch was quite high, which tends to indicate it was the bottleneck for that particular run. In the next run (#2), we increase resources on this node to confirm the impact on performance.

Run #2

Environment

  Name           Specs
  jCustomer      t2.medium (2 vCPU, 4GB RAM)
  Elasticsearch  t2.xlarge (4 vCPU, 16GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 200        250                    400           above 450
  Latency (mean)     99ms             116ms                  446ms         606ms
  Latency (90%ile)   239ms            288ms                  923ms         1188ms
  Latency (95%ile)   302ms            354ms                  1012ms        1354ms
  Latency (99%ile)   400ms            465ms                  1220ms        1577ms
  Throughput         221q/s           270q/s                 320q/s        317q/s

Run Analysis
As anticipated in the analysis of run #1, the Elasticsearch node was the bottleneck, and increasing its capacity did have a significant impact on measured performance. In this new run, the environment supported precisely 100 more users in the "optimum performances" profile.

Run #3

Environment

  Name           Specs
  jCustomer      t2.xlarge (4 vCPU, 16GB RAM)
  Elasticsearch  t2.xlarge (4 vCPU, 16GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 200        300                    450           above 500
  Latency (mean)     53ms             119ms                  389ms         502ms
  Latency (90%ile)   134ms            311ms                  778ms         1027ms
  Latency (95%ile)   183ms            386ms                  914ms         1385ms
  Latency (99%ile)   294ms            509ms                  1202ms        1926ms
  Throughput         232q/s           318q/s                 378q/s        373q/s

Run Analysis

The main purpose of this run is to increase capacity on the jCustomer host and compare the result with run #2 to see its impact on performance.

[Figure run2-run3-comparison.png: comparison between run #2 and run #3]

Looking at the chart above, jCustomer CPU usage decreases between the two runs while Elasticsearch CPU usage increases. This means that the jCustomer host (2 vCPU, 4GB RAM) was actually the performance bottleneck in run #2, although not by a significant margin. Increasing the size of the jCustomer host in run #3 did increase performance, but CPU usage shows that the bottleneck then became the Elasticsearch host.

Run #4

Environment

  Name           Specs
  jCustomer      3x t2.medium (2 vCPU, 4GB RAM)
  Elasticsearch  3x t2.medium (2 vCPU, 4GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 200        300                    800           above 900
  Latency (mean)     41ms             83ms                   875ms         995ms
  Latency (90%ile)   82ms             202ms                  1810ms        2068ms
  Latency (95%ile)   108ms            279ms                  1942ms        2407ms
  Latency (99%ile)   202ms            422ms                  2185ms        2888ms
  Throughput         231q/s           331q/s                 469q/s        469q/s

Run Analysis

In this run, we deployed an environment with the same hardware specs as run #1, but with jCustomer and Elasticsearch deployed as clusters. We can draw the following conclusions from the results:

Performance results for the "optimum performances" profile are very similar to run #3 (with slightly better latency). Although it would be inaccurate to make the blanket statement that 1x t2.xlarge equals 3x t2.medium, the performance metrics are indeed comparable.

The environment in run #4 is much more capable of handling loads above the "optimum performances" profile; for example, the "at capacity" profile in run #4 is at 800 users while it was at 450 users in run #3. This is due to the nature of cluster deployments, which allow the load to be spread across multiple hosts. So aside from providing redundancy, cluster-based environments are more capable of handling bursts in visitor traffic than single-node environments.

Run #5

Environment

  Name           Specs
  jCustomer      3x t2.xlarge (4 vCPU, 16GB RAM)
  Elasticsearch  3x t2.xlarge (4 vCPU, 16GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 500        700                    1100          above 1400
  Latency (mean)     43ms             102ms                  437ms         691ms
  Latency (90%ile)   89ms             216ms                  1017ms        1198ms
  Latency (95%ile)   128ms            274ms                  1232ms        1358ms
  Latency (99%ile)   241ms            397ms                  1690ms        1655ms
  Throughput         573q/s           775q/s                 813q/s        862q/s

Run Analysis
In this run, we start from the same base as run #4 but increase the capacity of the underlying EC2 instances. As expected, the user capacity supported by the environment grew (it almost doubled compared to run #4).
One notable element here is the latency for the "at capacity" and "over-utilized" profiles, which appears to be better than for the same profiles in run #4. This situation was not present when comparing run #1 and run #3, and is more than likely a consequence of operating the environment as a cluster.

Run #6

Environment

  Name           Specs
  jCustomer      5x t2.medium (2 vCPU, 4GB RAM)
  Elasticsearch  5x t2.medium (2 vCPU, 4GB RAM)

Profiles

                     Under-utilized   Optimum performances   At capacity   Over-utilized
  Range (users)      up to 600        700                    2000          above 2500
  Latency (mean)     60ms             104ms                  1185ms        1566ms
  Latency (90%ile)   122ms            256ms                  1720ms        2635ms
  Latency (95%ile)   175ms            357ms                  2066ms        3129ms
  Latency (99%ile)   352ms            603ms                  2775ms        4486ms
  Throughput         665q/s           731q/s                 903q/s        897q/s

Run Analysis
Finally, for the last run, we use the same base as run #4 but increase the number of nodes in the cluster.
Results in this run are consistent with the elements highlighted in the previous ones. Compared to run #5, we see similar results for the "optimum performances" profile, while the "at capacity" and "over-utilized" profiles support a significantly greater number of users.

Conclusions

Since the purpose of this page is to document performance metrics, we will begin with a summary table of the optimum capacity for the various environments tested.

  Run   Environment specs                                Capacity    Measured performance
  #1    jCustomer: t2.medium (2 vCPU, 4GB RAM)           150 users   Latency (mean): 111ms
        Elasticsearch: t2.medium (2 vCPU, 4GB RAM)                   Latency (90%ile): 287ms
  #2    jCustomer: t2.medium (2 vCPU, 4GB RAM)           250 users   Latency (mean): 116ms
        Elasticsearch: t2.xlarge (4 vCPU, 16GB RAM)                  Latency (90%ile): 288ms
  #3    jCustomer: t2.xlarge (4 vCPU, 16GB RAM)          300 users   Latency (mean): 119ms
        Elasticsearch: t2.xlarge (4 vCPU, 16GB RAM)                  Latency (90%ile): 311ms
  #4    jCustomer: 3x t2.medium (2 vCPU, 4GB RAM)        300 users   Latency (mean): 83ms
        Elasticsearch: 3x t2.medium (2 vCPU, 4GB RAM)                Latency (90%ile): 202ms
  #5    jCustomer: 3x t2.xlarge (4 vCPU, 16GB RAM)       700 users   Latency (mean): 102ms
        Elasticsearch: 3x t2.xlarge (4 vCPU, 16GB RAM)               Latency (90%ile): 216ms
  #6    jCustomer: 5x t2.medium (2 vCPU, 4GB RAM)        700 users   Latency (mean): 104ms
        Elasticsearch: 5x t2.medium (2 vCPU, 4GB RAM)                Latency (90%ile): 256ms

Although comparisons between the runs are available in the "Run Analysis" section of each run, this summary table and the following conclusions provide a good starting point for spec'ing out a jCustomer environment.

It is strongly recommended to run your jCustomer and Elasticsearch environment as a cluster: not only does it provide redundancy in case of failure, it also copes better with increased traffic (when compared to a single node with large resources).

The environment scales very well, both vertically (bigger resources) and horizontally (more resources). The choice between scaling a cluster vertically or horizontally depends on your infrastructure and its limitations. Vertical scaling offers simplicity of setup (fewer resources to run), but you are limited by the physical capacity of the underlying resource. Horizontal scaling does not have such limitations but can involve a larger number of resources to set up, monitor, and maintain.

In most cases, the Elasticsearch cluster's capacity will be the performance bottleneck; make sure it is sized properly.

Finally, although we insist that the tests performed for this document are specific and opinionated, and that they will not directly translate to your production environment, they still provide useful insight into the physical requirements of jCustomer.