Monitoring Guide

March 22, 2024

Ownership of incident resolution on Jahia Cloud

Jahia is committed to ensuring 100% availability and fast response times for all platforms hosted on Jahia Cloud. Several measures are in place to reach that goal:

  • Multi datacenter active/active redundancy for all Production platforms
  • Built-in 24/7/365 support
  • Support by specialized Jahia/jCustomer Support engineers and dedicated infrastructure Operations teams

While Jahia handles almost all monitoring alerts directly, some alerts can only be resolved by Jahia Cloud customers. These alerts are automatically escalated to customers and are listed below.

Monitoring is essential for maintaining our 99.9% SLA. We continuously monitor your production environment for maximum uptime. To enable SLA checks, ensure that Datadog can directly access your production environment by adding the following rules to your front-end (HAProxy) configuration.

Real-life scenario example (note: this configuration is provided as an example and should be adjusted to your specific context).

My production environment isn't directly accessible to visitors because they must pass through a CDN and/or a WAF. However, it's crucial to ensure that /ping.jsp is always accessible directly. This allows synthetic checks to be performed, thereby maintaining our SLA.

## ALL WAF
acl service_waf req.hdr_ip(x-forwarded-for,-1) -m ip 1.3.4.5/19
## JAHIA VPN IPs
acl service_jahia_vpn req.hdr_ip(x-forwarded-for,1) -m ip 188.165.59.149 91.134.164.155 51.161.118.223
## DATADOG CHECKS
acl datadog_ping path /ping.jsp
## APPLY RULES TO REQUESTS
http-request deny if !service_waf !service_jahia_vpn !datadog_ping
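The access rules above can be illustrated with a small, hypothetical Python sketch using the example IPs from the snippet. Note that the real HAProxy rule reads different positions of the X-Forwarded-For header for the WAF and VPN checks (-1 is the last entry, 1 the first), which this sketch simplifies to a single client IP:

```python
# Hypothetical sketch mirroring the HAProxy ACLs above: a request is allowed if
# the path is /ping.jsp (Datadog synthetic checks), or if the client IP belongs
# to the example WAF range or the Jahia VPN list. All values are examples.
import ipaddress

WAF_RANGE = ipaddress.ip_network("1.3.4.5/19", strict=False)  # example WAF range
JAHIA_VPN_IPS = {"188.165.59.149", "91.134.164.155", "51.161.118.223"}

def is_allowed(client_ip: str, path: str) -> bool:
    if path == "/ping.jsp":              # always let synthetic checks through
        return True
    if ipaddress.ip_address(client_ip) in WAF_RANGE:  # traffic relayed by the WAF
        return True
    return client_ip in JAHIA_VPN_IPS    # Jahia VPN access
```

Any request that matches none of the three ACLs is denied, which is exactly what the `http-request deny` line expresses.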

List of monitoring alerts sent to Jahia Cloud customers

All browsing nodes are down according to HAProxy

This monitor triggers an alert when HAProxy considers all browsing nodes to be down for an environment, for a duration of ~20 seconds (2 consecutive datadog-agent checks).

HAProxy considers a node to be down if it is unreachable, or if the health check fails (because of time-out, invalid response or RED status). When all nodes are considered down, all requests will fail with a 502 error until at least one node is back up.
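The two-consecutive-checks rule can be sketched as follows. This is a hypothetical illustration of the triggering condition, not Datadog's actual evaluation logic:

```python
# Hypothetical sketch: an alert fires only after N consecutive failing checks,
# mirroring the "2 consecutive datadog-agent checks (~20 seconds)" rule above.
def should_alert(check_results: list[bool], consecutive: int = 2) -> bool:
    """check_results: True means 'all browsing nodes down' for that check."""
    return len(check_results) >= consecutive and all(check_results[-consecutive:])
```

Requiring two consecutive failures avoids paging on a single transient check failure.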

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

At least one browsing node is down according to HAProxy

This monitor triggers an alert when HAProxy considers one browsing node to be down for an environment, for a duration of ~20 seconds (2 consecutive datadog-agent checks).

HAProxy considers a node to be down if it is unreachable, or if the health check fails (because of a time-out, an invalid response, or a RED status). When a node is considered down, HAProxy stops routing requests to it and the remaining nodes absorb its traffic until it is back up.

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Processing node is down according to HAProxy

This monitor triggers an alert when HAProxy considers the Jahia processing node to be down for an environment, for a duration of ~20 seconds (2 consecutive datadog-agent checks).

HAProxy considers a node to be down if it is unreachable, or if the health check fails (because of time-out, invalid response or RED status).

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

[Jahia] Node cache load average

This monitor reports a high Jahia node cache load, for each Jahia host. Note that this metric is gathered from Jahia logs. Jahia only outputs the cache load when the value is above 2, so you will never see values below this threshold.

The Jahia Node Cache load average represents the average number of nodes opened per session.

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Threshold: an alert is raised when the cache load is over 1000 on average across 10 minutes.
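Thresholds of this kind (a value averaged over a 10-minute window) can be sketched with a simple rolling-average check. This is a hypothetical illustration of the rule, not the actual Datadog monitor definition:

```python
# Hypothetical sketch of the threshold rule: average per-minute cache load
# samples over a 10-minute window and alert when the mean exceeds 1000.
def alert_on_average(samples: list[float], threshold: float = 1000.0,
                     window: int = 10) -> bool:
    recent = samples[-window:]          # last `window` one-minute samples
    return len(recent) == window and sum(recent) / window > threshold
```

The same shape applies to the JCR session load and request load monitors below, with their own thresholds.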

[Jahia] JCR Session Load average

This monitor triggers an alert when the JCR Session Load Average is above 70 on average across 10 minutes, for each Jahia host. Note that this metric is gathered from Jahia logs. Jahia only outputs the JCR session load when the value is above 10, so you will never see values below this threshold.

The Jahia JCR Session Load average represents the average number of open sessions on a given Jahia server. Sessions are opened when there is low-level access to a content item, and are closed once the content has been read or written. A high JCR session count shows that many content operations are happening in parallel and stacking up, which degrades the service's performance.
A view of all opened JCR sessions can be found in the Jahia Tools and can help understand the situation.

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Threshold: an alert is raised when there are more than 70 opened sessions on average across 10 minutes.

[Jahia] Request load average

This monitor reports a high Jahia request load, for each Jahia host. Note that this metric is gathered from Jahia logs. Jahia only outputs the request load when the value is above 2, so you will never see values below this threshold.

The Jahia Request Load average represents the average number of HTTP requests being processed at a given time. Ideally, the servers should process the requests in real time and avoid handling too many requests at the same time.

An increasing request load shows that the server struggles to process the volume of requests it receives. This issue can have several root causes:

  • The service receives an unusually large amount of traffic
  • The custom code involved in serving the page / API requests needs to be optimized
  • The Jahia Cloud backends or underlying infrastructure are slowing down for unexpected reasons. Your Support team will soon reach out to you if that’s the case.

Datadog's APM (Application Performance Monitoring) often makes troubleshooting high request loads straightforward.

Creating and analyzing a thread dump is a fairly typical way to understand why processes are slowing down the request processing. Thread dumps can be captured in the Jahia Tools.

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Threshold: an alert is raised when there are more than 70 processed requests on average across 10 minutes.

[Jahia] JVM thread count

This monitor reports a high Java thread count, for each Jahia host.

The JVM thread count represents the number of active threads in the JVM, and should remain almost constant throughout the life of the Java application. A high thread count shows problematic thread management within the application, and can lead to poor performance or memory management errors.

An increasing thread count shows that the server struggles to keep up with the work it receives. This issue can have several root causes:

  • The service receives an unusually large amount of traffic
  • The custom code involved in serving the page / API requests needs to be optimized
  • The custom code deployed in the platform relies on external backends that are too slow
  • The custom code deployed in the platform contains problematic thread management
  • The Jahia Cloud backends or underlying infrastructure are slowing down for unexpected reasons. Your Support team will soon reach out to you if that’s the case.

Creating and analyzing a thread dump is a fairly typical way to understand what processes are slowing down the request processing. Thread dumps can be captured in the Jahia Tools.
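A HotSpot-style thread dump (as produced by jstack or the Jahia Tools) contains one `java.lang.Thread.State:` line per thread, so a quick first pass is to count threads per state. A minimal, hypothetical sketch:

```python
# Hypothetical sketch: summarize a HotSpot-style thread dump by counting
# threads per state. Each thread in such a dump is followed by a line like
# "   java.lang.Thread.State: TIMED_WAITING (on object monitor)".
from collections import Counter

def thread_states(dump: str) -> Counter:
    states = Counter()
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("java.lang.Thread.State:"):
            # keep only the state keyword, dropping qualifiers in parentheses
            states[line.split(":", 1)[1].strip().split()[0]] += 1
    return states
```

A large or growing count of BLOCKED or WAITING threads usually points at the contention worth investigating first.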

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Threshold: an alert is raised when the JVM has 1000 live threads or more for a duration of 5 minutes.

[Jahia] Tomcat threads percentage usage

This monitor checks the average percentage usage of available Tomcat threads, over 5 minutes, for each Jahia host. Tomcat can only create a limited number of concurrent threads. If this limit is reached, requests will start getting queued, which can lead to timeouts. Reaching the limit may come from a thread leak issue.

This monitor is handled directly by the Support teams at Jahia when it is fired for Production environments.

Alerts related to non-production environments are forwarded to Jahia Cloud customers, as the environment unavailability is generally related to custom code deployment.

Threshold: an alert is raised when more than 90% of the available threads are used on average over 5 minutes.

[Jahia] Page rendered duration

This monitor checks for poor page render performance.

Alerts are often related to an overloaded platform, poor garbage collection performance due to high heap memory usage, or a JSP implementation problem.

Threshold: an alert is raised when the top 1% longest pages to render take more than 10 seconds to display on average over 5 minutes.
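The "top 1% longest pages" rule is a check on the tail of the render-duration distribution. A hypothetical sketch of how such a check could be computed from a batch of durations:

```python
# Hypothetical sketch of the page-render threshold: alert when the slowest 1%
# of page renders take more than 10 seconds on average.
def p99_tail_average(durations_s: list[float]) -> float:
    ordered = sorted(durations_s)
    tail = ordered[int(len(ordered) * 0.99):] or ordered[-1:]  # slowest 1%
    return sum(tail) / len(tail)

def page_render_alert(durations_s: list[float], limit_s: float = 10.0) -> bool:
    return p99_tail_average(durations_s) > limit_s
```

Looking only at the slowest 1% keeps the alert sensitive to a few pathological pages that a global average would hide.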

[Jahia] API rendered duration

This monitor checks for poor API call performance.

Alerts are often related to an overloaded platform, poor garbage collection performance due to high heap memory usage, or an implementation problem in the code serving the API.

Threshold: an alert is raised when the top 1% longest API calls take more than 10 seconds to complete on average over 5 minutes.

[Jahia] Tomcat logs generation

This monitor checks for high log line count. High log counts are often associated with poor server performance and can highlight code or stability issues. The fix for this monitor is highly dependent on the type of log that is flooding the servers.

Log levels can be temporarily fine-tuned in the Jahia Tools, but modifications made in the Tools will be reverted after the next Jahia restart.

Note that a high log count also comes with additional log ingestion and storage costs.

Threshold: an alert is raised when a Jahia node outputs more than 200,000 log lines over 1 hour.
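The log-volume rule is a count over a sliding one-hour window. A hypothetical sketch, assuming per-minute line counts are available:

```python
# Hypothetical sketch of the log-volume rule: alert when a node emits more
# than 200,000 log lines within the last hour (here, 60 one-minute buckets).
def log_volume_alert(line_counts_per_minute: list[int],
                     limit: int = 200_000, window_min: int = 60) -> bool:
    return sum(line_counts_per_minute[-window_min:]) > limit
```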

[Jahia] Augmented Search connection

This monitor checks the Elasticsearch connection configured for the Augmented Search module in the Elasticsearch Connector. It triggers an alert if the connection in use is not the one defined by Jahia Cloud.

This monitor fires when a user mistakenly changes the configuration of the connection used by Augmented Search. The default configuration is deployed automatically and should never be changed manually.

[Jahia] Augmented Search healthcheck status

This monitor checks the status of the Augmented Search module's probe in the Jahia healthcheck. It triggers an alert if the probe returns a status other than "GREEN".