Performance and capacity planning
While developing jCustomer, we regularly measure system performance to get a sense of its capacity characteristics, understand its behavior under load, and assess whether changes to our codebase have an impact on performance.
In this document, we present our approach to performance testing and provide results based on the environments in which we tested jCustomer.
Performance testing is a complex domain in which opinionated decisions are made. Please take the time to read and understand the approach we took before reviewing the actual performance metrics.
Approach
Our primary objective when measuring performance is to progressively increase user traffic until we notice degradation. While the tests are running, we collect the response times and error rates associated with a given throughput, as well as the underlying resource utilization (CPU, RAM, disk, network, etc.), allowing us to pinpoint performance bottlenecks.
Definitions
Some of the terms mentioned in this document are used in the context of performance testing and are worth a few more details:
- Ramp-up: the amount of time until all users within a run have been submitted.
- Users: the number of users submitted by the end of the ramp-up time. Each user executes a scenario composed of multiple actions; the number of actions in a scenario can be calculated by dividing the number of samples by the number of users.
- Mean response time: the average response time across all scenario actions and all users.
- 90%ile: 90% of the queries are below that response time while 10% are above.
- 95%ile: 95% of the queries are below that response time while 5% are above.
- 99%ile: 99% of the queries are below that response time while 1% are above.
- Throughput: the number of queries processed by the system per second.
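To make these definitions concrete, here is a minimal sketch showing how each metric can be derived from raw measurements. The response-time samples, test duration and user count below are hypothetical and do not come from our test runs; jMeter reports these values for you, so the code is purely illustrative.

```python
# Minimal sketch of how the metrics defined above relate to raw measurements.
# The sample data below is hypothetical; it does not come from our test runs.

def percentile(values_ms, pct):
    """Nearest-rank percentile: the response time below which pct% of queries fall."""
    ordered = sorted(values_ms)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

response_times_ms = [38, 42, 55, 61, 74, 88, 102, 130, 180, 240]  # one entry per sample
test_duration_s = 2.0   # wall-clock time over which the samples were collected
users = 2               # users submitted by the end of the ramp-up

samples = len(response_times_ms)
print("actions per user:", samples / users)
print("mean response time (ms):", sum(response_times_ms) / samples)
print("90%ile (ms):", percentile(response_times_ms, 90))
print("95%ile (ms):", percentile(response_times_ms, 95))
print("99%ile (ms):", percentile(response_times_ms, 99))
print("throughput (q/s):", samples / test_duration_s)
```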
Testing environment
For this session of performance tests, the test environment was deployed in AWS Ireland (eu-west-1) on resources created specifically for these tests. The entirety of the test environment was destroyed and re-created between each run.
When designing the testing environment, we decided to rely on EC2 instances, with each instance running a single service using Docker. This allows for segregation of services and ensures that measured metrics (and in particular system load) are specific to the exact resource being tested.
The following hosts were used during the tests:
- A (large) jMeter host
- A Jahia host serving the site
- One or more jCustomer hosts
- One or more Elasticsearch hosts
All instances and resources used for the tests are colocated in the same VPC (same region). Exact specifications of the EC2 instances are provided next to the test results.
Anatomy of a run
A single performance test run is composed of elements that do not change during the run, such as the testing environment, the ramp-up time, and the test scenario. The only variable element during the run is the number of users, which is progressively increased until the environment begins to fail.
The increments in the number of users are chosen to balance the wish for extensive data samples with the desired execution time of the test. It is therefore usually best to identify ranges of interest (see the profiles detailed below) and focus the analysis on these.
At the end of a run, we gather the data collected by jMeter in a format that facilitates its interpretation.
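As an illustration of this step, the sketch below summarizes a jMeter CSV results file. It assumes the default CSV (.jtl) output containing the `timeStamp` (epoch milliseconds), `elapsed` (response time in milliseconds), `label` and `success` columns, and it is not the exact tooling we used for these runs.

```python
# Illustrative sketch: aggregate a jMeter CSV results file (.jtl) per scenario
# action. Assumes the default CSV output with 'timeStamp' (epoch ms),
# 'elapsed' (ms), 'label' and 'success' columns; this is not the exact
# tooling used for the runs documented here.
import csv
from collections import defaultdict

def aggregate(jtl_path):
    per_label = defaultdict(list)
    timestamps, errors, total = [], 0, 0
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            per_label[row["label"]].append(int(row["elapsed"]))
            timestamps.append(int(row["timeStamp"]))
            total += 1
            if row.get("success", "true").lower() != "true":
                errors += 1

    duration_s = max((max(timestamps) - min(timestamps)) / 1000, 1)
    print(f"throughput: {total / duration_s:.1f} q/s, error rate: {errors / total:.2%}")
    for label, samples in sorted(per_label.items()):
        print(f"{label}: {len(samples)} samples, mean {sum(samples) / len(samples):.0f} ms")

aggregate("results.jtl")  # hypothetical path to a jMeter results file
```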
Results (metrics) of performance run #1
We can generally group the results into the four different profiles described below.
Result profile: Under-utilized
The system is under-utilized: the load it receives from jMeter is not sufficient to stress-test it. Results with this profile usually show very similar latency but different throughput.
Such results are still relevant as they provide us with a sense of the fastest response time supported by the environment.
For example, in the results above, we can see that the environment’s response time behaves similarly between 25 and 100 users.
In general, you would want to avoid being in that profile for extended periods as it means you are spending more on the environment than necessary.
Result profile: Optimum performances
The environment response time is situated within optimum operating conditions as defined by the business.
For example, if the desired response time is set by the business at 250ms on the 90%ile, the optimum environment would be able to support the number of users at or near that value. In the results above this would be somewhere close to 150 users (287ms at 90%ile).
This is the sweet spot and the capacity you should be aiming for with your environment.
Result profile: At capacity
The system is performing at its peak with limited queuing and without errors. This profile can be identified by looking for a flattening in the throughput curve (see chart above).
For example, in the results above you will notice that throughput is at its peak at 300 users (213q/s). This means that the maximum capacity supported by the infrastructure is situated around 300 users. The environment should be able to sustain such a load over a long period of time without erroring out, albeit with degraded performance (worse than in the “optimum performances” profile), although you would want to monitor memory consumption.
Although you wouldn’t want your environment to operate in this profile, it could still be acceptable in worst-case scenarios, for example during an unexpected increase in traffic while your infrastructure team is working on increasing capacity. It roughly corresponds to the load your environment should never exceed.
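As a rough illustration of how to spot this flattening programmatically, the sketch below compares throughput growth to user growth between consecutive test steps. The data points and the 0.5 factor are hypothetical and only serve to illustrate the idea; in practice we identify the plateau visually on the throughput chart.

```python
# Sketch: flag the step where throughput stops keeping up with the number of
# users. The data points and the 0.5 factor are hypothetical, for illustration.
steps = [  # (users, throughput in q/s)
    (100, 120), (200, 230), (300, 310), (400, 330), (500, 335),
]

for (prev_users, prev_qps), (users, qps) in zip(steps, steps[1:]):
    user_growth = (users - prev_users) / prev_users
    qps_growth = (qps - prev_qps) / prev_qps
    if qps_growth < 0.5 * user_growth:  # throughput curve is flattening
        print(f"throughput flattens around {users} users ({qps} q/s): at capacity")
        break
```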
Result profile: Over-utilized
The system is performing over its capacity and will begin queuing requests, progressively increasing its memory usage and dramatically increasing response times until it starts erroring out.
This typically is the next step just after the “at capacity” profile. In the example above we reached that point above 300 users and began noticing errors at 800 users.
This is the no-go zone; reaching this type of load should trigger immediate action from your operations team.
Test scenario - a user journey
Creating a user journey is complex in nature. For these performance tests, we exercised product-relevant features while navigating across them in a way that is representative of the expected usage of these features in production.
Test site
The site supporting the user journey is composed of 3 pages:
- Page 1 contains 3 personalized areas (with server-side rendering) using Geolocation, Scoring plan and Goal. It contains a tags aggregation component and is also considered the site’s home page.
- Page 2 contains 3 personalized areas (with client-side rendering) using the segment “onePage”, the session property “browser” and the profile property “utm_campaign”.
- Page 3 contains a form with 3 fields (email, firstname, lastname) mapped to the corresponding jExperience profile properties.
User journey
When navigating, the test user will follow this scenario:
- Open page 1 with a campaign parameter and stay on the page for 2 seconds
- Navigate to page 2 and stay on the page for 2 seconds
- Navigate to page 3, submit the form, which will fulfill a “newsletter” goal, and stay on the page for 2 seconds
- Navigate to page 1, where a new personalization is displayed because the “newsletter” goal has been fulfilled, and stay on the page for 2 seconds
- Navigate to page 2 and stay on the page for 2 seconds
For each of the steps in the user scenario, we verify that the personalization properties are correctly resolved and displayed.
The entire user journey was measured at 15 seconds during our tests. It is executed twice for each user, bringing the total journey to 30 seconds per user.
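For illustration only, the journey can be sketched as a plain HTTP client script. The base URL, page paths, campaign parameter and form action below are assumptions based on the site description above; the actual tests are driven by a jMeter test plan, not this script.

```python
# Illustrative sketch of the user journey as a plain HTTP client script.
# The base URL, page paths and form action are hypothetical placeholders;
# the actual tests are driven by a jMeter test plan, not this script.
import time
import requests

BASE = "https://test-site.example.com"  # hypothetical test site

def run_journey(session: requests.Session):
    # Step 1: open page 1 with a campaign parameter, stay 2 seconds
    session.get(f"{BASE}/page-1", params={"utm_campaign": "performance-test"})
    time.sleep(2)

    # Step 2: navigate to page 2, stay 2 seconds
    session.get(f"{BASE}/page-2")
    time.sleep(2)

    # Step 3: navigate to page 3 and submit the form (fulfills the "newsletter" goal)
    session.get(f"{BASE}/page-3")
    session.post(f"{BASE}/page-3", data={
        "email": "jane.doe@example.com",
        "firstname": "Jane",
        "lastname": "Doe",
    })
    time.sleep(2)

    # Step 4: back to page 1, where a new personalization is expected
    session.get(f"{BASE}/page-1")
    time.sleep(2)

    # Step 5: navigate to page 2 again, stay 2 seconds
    session.get(f"{BASE}/page-2")
    time.sleep(2)

with requests.Session() as s:
    # The journey is executed twice per user, as in the tests above.
    run_journey(s)
    run_journey(s)
```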
System load profile
A jMeter ramp-up time of 10 seconds combined with a total user journey of 30 seconds means that the last users should complete their journey 40 seconds after the start of the test.
System load profile with 100 users
For about 20s (between T+10 and T+30), all of the users will be on the platform at the same time, which gives us a pretty good understanding of the supported load in that use case.
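To make the timing explicit, here is the trivial arithmetic behind these numbers; the ramp-up and journey durations are the values used in these tests.

```python
# Worked example of the load profile timing used in these tests.
ramp_up_s = 10   # time until all users have been submitted
journey_s = 30   # total user journey (15s scenario executed twice)

last_user_done_s = ramp_up_s + journey_s   # T+40: last user completes its journey
full_load_start_s = ramp_up_s              # T+10: all users are on the platform
full_load_end_s = journey_s                # T+30: the first user finishes
full_load_window_s = full_load_end_s - full_load_start_s

print(f"last user finishes at T+{last_user_done_s}s")
print(f"all users concurrent between T+{full_load_start_s}s and T+{full_load_end_s}s "
      f"({full_load_window_s}s of full load)")
```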
Another important element to keep in mind when reviewing performance metrics is that these metrics correspond to the average values across an entire scenario for one particular user. While the response time could be 250ms for a given run, this value is itself an average of all of the queries performed by the user within a scenario.
Note of caution
jCustomer is a complex product, offering flexibility in implementing a wide range of use cases. Although flexibility is key for such a platform, it also makes performance results very specific to the implemented use case on that precise environment being tested.
In other words, the results presented in this document offer a baseline to help you understand how performance metrics evolve as the underlying infrastructure is being modified. It also helps in pinpointing infrastructure bottlenecks as environments are being spec’d out.
Unless you run the exact same test scenario, it is very important to emphasize that you WILL NOT obtain the same performance metrics on your production environment. And although we designed the test scenario to be representative of a typical jCustomer use case, such a scenario remains a static implementation and by nature cannot be compared to the variability introduced by “real-world” users in a production environment.
Performance tests
In these tests, we’re going to start with very small resources, progressively increase their size and see how this impacts metrics.
The threshold for optimum performance is set at 250ms on the 90%ile.
All tests will be running with a c5.9xlarge jMeter instance (36 vCPU, 72GB RAM), which provides sufficient throughput without becoming the bottleneck in any of the planned runs. Jahia itself will be running on a small instance, and although Jahia is not on the critical path for this test, we make sure Jahia's resource utilization remains low during the test to confirm it does not become a bottleneck.
Please note that deploying a single-node Elasticsearch instance in production is not recommended by Elastic due to the risk of data loss; it is used here because it is helpful for understanding bottlenecks.
Run #1
Environment

| Name | Specs |
|---|---|
| jCustomer | t2.medium (2 vCPU, 4GB RAM) |
| Elasticsearch | t2.medium (2 vCPU, 4GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 100 | 150 | 250 | above 300 |
| Latency (mean) | 38ms | 111ms | 483ms | 605ms |
| Latency (90%ile) | 82ms | 287ms | 1064ms | 1248ms |
| Latency (95%ile) | 112ms | 359ms | 1208ms | 1336ms |
| Latency (99%ile) | 199ms | 524ms | 1414ms | 1517ms |
| Throughput | 117q/s | 162q/s | 194q/s | 213q/s |
Run Analysis

This run was performed using small and identical instances and serves as a baseline for all subsequent runs detailed in this document. Resource usage on the host running Elasticsearch was quite high, which tends to indicate it was the bottleneck for this particular run. In the next run (#2), we increase resources on this node to confirm the impact on performance.
Run #2
Environment

| Name | Specs |
|---|---|
| jCustomer | t2.medium (2 vCPU, 4GB RAM) |
| Elasticsearch | t2.xlarge (4 vCPU, 16GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 200 | 250 | 400 | above 450 |
| Latency (mean) | 99ms | 116ms | 446ms | 606ms |
| Latency (90%ile) | 239ms | 288ms | 923ms | 1188ms |
| Latency (95%ile) | 302ms | 354ms | 1012ms | 1354ms |
| Latency (99%ile) | 400ms | 465ms | 1220ms | 1577ms |
| Throughput | 221q/s | 270q/s | 320q/s | 317q/s |
Run Analysis

As anticipated when analyzing the outcome of run #1, the Elasticsearch node was the bottleneck, and increasing its capacity did have a significant impact on measured performance. In this new run, the environment supported precisely 100 more users in the “optimum performances” profile.
Run #3
Environment

| Name | Specs |
|---|---|
| jCustomer | t2.xlarge (4 vCPU, 16GB RAM) |
| Elasticsearch | t2.xlarge (4 vCPU, 16GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 200 | 300 | 450 | above 500 |
| Latency (mean) | 53ms | 119ms | 389ms | 502ms |
| Latency (90%ile) | 134ms | 311ms | 778ms | 1027ms |
| Latency (95%ile) | 183ms | 386ms | 914ms | 1385ms |
| Latency (99%ile) | 294ms | 509ms | 1202ms | 1926ms |
| Throughput | 232q/s | 318q/s | 378q/s | 373q/s |
Run Analysis

The main purpose of this run is to increase capacity on the jCustomer host and compare the results with run #2 to see the impact on performance. Looking at resource utilization, jCustomer CPU usage decreases between the two runs while Elasticsearch CPU usage increases. This means that the jCustomer host (2 vCPU, 4GB RAM) in run #2 was actually the performance bottleneck, although not by a significant margin. Increasing the jCustomer host size in run #3 did indeed improve performance, but CPU usage shows that the bottleneck then became the Elasticsearch host.
Run #4
Environment

| Name | Specs |
|---|---|
| jCustomer | 3x t2.medium (2 vCPU, 4GB RAM) |
| Elasticsearch | 3x t2.medium (2 vCPU, 4GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 200 | 300 | 800 | above 900 |
| Latency (mean) | 41ms | 83ms | 875ms | 995ms |
| Latency (90%ile) | 82ms | 202ms | 1810ms | 2068ms |
| Latency (95%ile) | 108ms | 279ms | 1942ms | 2407ms |
| Latency (99%ile) | 202ms | 422ms | 2185ms | 2888ms |
| Throughput | 231q/s | 331q/s | 469q/s | 469q/s |
Run Analysis

In this run we deployed an environment with the same hardware specs as run #1, but with jCustomer and Elasticsearch deployed as clusters. We can draw the following conclusions from the results:
- Performance results for the “optimum performances” profile are very similar to run #3 (with slightly better latency). Although it would be inaccurate to make the blanket statement that 1x t2.xlarge equals 3x t2.medium, the performance metrics are indeed comparable.
- The environment in run #4 is much more capable of handling loads above the “optimum performances” profile: for example, the “at capacity” profile in run #4 sits at 800 users, while it was at 450 users in run #3. This is due to the nature of cluster deployments, which allow the load to be spread between multiple hosts.
So aside from providing redundancy, cluster-based environments are more capable of handling bursts in visitor traffic than single-node environments.
Run #5
Environment

| Name | Specs |
|---|---|
| jCustomer | 3x t2.xlarge (4 vCPU, 16GB RAM) |
| Elasticsearch | 3x t2.xlarge (4 vCPU, 16GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 500 | 700 | 1100 | above 1400 |
| Latency (mean) | 43ms | 102ms | 437ms | 691ms |
| Latency (90%ile) | 89ms | 216ms | 1017ms | 1198ms |
| Latency (95%ile) | 128ms | 274ms | 1232ms | 1358ms |
| Latency (99%ile) | 241ms | 397ms | 1690ms | 1655ms |
| Throughput | 573q/s | 775q/s | 813q/s | 862q/s |
Run Analysis

In this run, we start from the same base as run #4 but increase the capacity of the underlying EC2 instances. As expected, the user capacity supported by the environment grew (it almost doubled compared to run #4). One notable element here is the latency for the “at capacity” and “over-utilized” profiles, which appears to be better than for the same profiles in run #4, a situation that was not present when comparing run #1 and run #3. This is most likely a consequence of operating the environment as a cluster.
Run #6
Environment

| Name | Specs |
|---|---|
| jCustomer | 5x t2.medium (2 vCPU, 4GB RAM) |
| Elasticsearch | 5x t2.medium (2 vCPU, 4GB RAM) |

Profiles

| | Under-utilization | Optimum performances | At capacity | Over-utilized |
|---|---|---|---|---|
| Range (users) | up to 600 | 700 | 2000 | above 2500 |
| Latency (mean) | 60ms | 104ms | 1185ms | 1566ms |
| Latency (90%ile) | 122ms | 256ms | 1720ms | 2635ms |
| Latency (95%ile) | 175ms | 357ms | 2066ms | 3129ms |
| Latency (99%ile) | 352ms | 603ms | 2775ms | 4486ms |
| Throughput | 665q/s | 731q/s | 903q/s | 897q/s |
Run Analysis

Finally, for the last run we used the same base as run #4 but increased the number of nodes in the cluster. Results in this run are consistent with the observations highlighted in the previous runs. Compared to run #5, we see similar results for the “optimum performances” profile, while the “at capacity” and “over-utilized” profiles support a significantly greater number of users.
Conclusions
Since the purpose of this page is to document performance metrics, we will begin with a summary table of the optimum capacity for the various environments tested.
| Run | Environment specs | Capacity | Measured performance |
|---|---|---|---|
| #1 | jCustomer: t2.medium (2 vCPU, 4GB RAM), Elasticsearch: t2.medium (2 vCPU, 4GB RAM) | 150 users | Latency (mean): 111ms, Latency (90%ile): 287ms |
| #2 | jCustomer: t2.medium (2 vCPU, 4GB RAM), Elasticsearch: t2.xlarge (4 vCPU, 16GB RAM) | 250 users | Latency (mean): 116ms, Latency (90%ile): 288ms |
| #3 | jCustomer: t2.xlarge (4 vCPU, 16GB RAM), Elasticsearch: t2.xlarge (4 vCPU, 16GB RAM) | 300 users | Latency (mean): 119ms, Latency (90%ile): 311ms |
| #4 | jCustomer: 3x t2.medium (2 vCPU, 4GB RAM), Elasticsearch: 3x t2.medium (2 vCPU, 4GB RAM) | 300 users | Latency (mean): 83ms, Latency (90%ile): 202ms |
| #5 | jCustomer: 3x t2.xlarge (4 vCPU, 16GB RAM), Elasticsearch: 3x t2.xlarge (4 vCPU, 16GB RAM) | 700 users | Latency (mean): 102ms, Latency (90%ile): 216ms |
| #6 | jCustomer: 5x t2.medium (2 vCPU, 4GB RAM), Elasticsearch: 5x t2.medium (2 vCPU, 4GB RAM) | 700 users | Latency (mean): 104ms, Latency (90%ile): 256ms |
Although comparisons between the runs are available in the “Run Analysis” section of each run, this summary table and the following conclusions provide a good starting point for spec’ing out a jCustomer environment.
We obviously recommend running your jCustomer and Elasticsearch environment as a cluster: not only does it provide redundancy in case of failure, it also copes better with increased traffic than large single-node resources.
The environment scales very well, both vertically (bigger resources) and horizontally (more resources). The choice between scaling a cluster vertically or horizontally depends on your infrastructure and its limitations. Scaling vertically offers simplicity of setup (fewer resources to run), but you are limited by the physical capacity of the underlying resource. Scaling horizontally does not have this limitation, but can involve a larger number of resources to set up, monitor and maintain.
In most cases, the Elasticsearch cluster’s capacity will be the performance bottleneck, so make sure it is sized properly.
Finally, although we insist once more that the tests performed for this document are specific and opinionated and will not translate directly to your production environment, they do provide some insight into the physical requirements of jCustomer.