Monitoring performance and stability of a DX Platform

Written by the Jahia Team

Audience: Developers, Sysadmins
Applies to: DX 7.0, 7.1, 7.2
1 Why and what to monitor

Digital Experience Manager is often part of complex architectures composed of load balancers, web servers, databases, firewalls, and more. Monitoring each individual component separately is vital to speed up troubleshooting when the platform becomes unstable.

This chapter describes how and what to monitor with Digital Experience Manager to have a complete picture of the state of the platform at any given time. This monitoring should include a mixture of log analysis, JVM and OS metrics (CPU/memory), JMX information, file system and network monitoring, and server polling.

1.1 Logs

Two log files are usually worth monitoring:

  • jahia.log
  • catalina.out (when using Tomcat; the file name will differ with another application server)

While jahia.log contains all information regarding Digital Experience Manager, the application server’s log file might contain precious information about the system, such as memory-related messages.

What to look for specifically is described in the subsequent sections.


1.1.1 Errors and Exceptions

We consider any error in the logs an important matter; no exception or error should be ignored. If you see an exception you cannot explain, feel free to reach out to the Jahia team and ask for more details.

Digital Experience Manager uses Log4j, which classifies each log entry into one of these levels:

  • FATAL/SEVERE: the server is unstable and the service will probably be discontinued. Immediate attention required.
  • ERROR: DX or one of its modules is not behaving the way it should, probably leading to one or several pages not displaying correctly. The problem needs to be investigated. Errors are frequently associated with a Java exception, which usually provides more clarity on the source of the problem.
  • WARN: usually points out an error with a low degree of severity.
  • INFO: gives information about what the platform is doing. Can be used to trace platform activity and load.
  • DEBUG: disabled by default, debug logs are usually used by developers or administrators to output additional information about the way a specific feature works. To be activated only when required.
  • TRACE: disabled by default, trace logs are very verbose and need to be activated only when troubleshooting a specific problem.

Fatal logs should raise an immediate monitoring alert, while errors should be investigated in a timely manner. It is essential to troubleshoot errors frequently so that “normal” errors don’t become commonplace on the platform.
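As a simple illustration, such an alert can be raised by scanning jahia.log for entries at ERROR level or above. This is a minimal Python sketch; the sample lines are illustrative, not actual DX output:

```python
import re

# Log4j levels that should trigger an alert, per the guidance above.
ALERT_LEVELS = ("ERROR", "FATAL", "SEVERE")

def scan_log_lines(lines):
    """Return the log lines whose level warrants an alert."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(ALERT_LEVELS))
    return [line for line in lines if pattern.search(line)]

sample = [
    "2017-03-30 09:37:37,955: INFO  [Render] page rendered in [46ms]",
    "2017-03-30 09:38:01,120: ERROR [ModuleManager] module failed to start",
    "2017-03-30 09:38:02,334: WARN  [CacheManager] cache nearly full",
]
alerts = scan_log_lines(sample)
print(alerts)  # only the ERROR line
```

In practice this logic would live in a log-shipping or monitoring agent rather than a standalone script.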

 

1.1.2 Jahia Request Load / JCR Session Load

Both Jahia JCR Session Load and Jahia Request Load will show up in jahia.log past a particular threshold. Both indicators show average values and display as follows: 2.053000 0.907000 0.395000.
The first figure is the average over the past minute, the second over the past five minutes, and the third over the past fifteen minutes.

  • JCR Session Load indicates the average number of JCR Sessions opened at this time. JCR Sessions are used to interact with the content of the CMS and are closed once the user request is completed. The value is displayed in the logs when the last minute average is > 10.
  • Jahia Request Load indicates the average number of HTTP requests processed by the platform. A responsive platform usually processes its requests almost instantly and won’t accumulate a backlog. A high value (>30) is usually a sign of an overloaded platform. Troubleshooting poor performance when the Jahia Request Load is too high can be done by analysing a thread dump generated while the request load was high. Should you have any questions regarding this topic, your Jahia support team is able to assist. The value is displayed in the logs when the last-minute average is > 2.

Generating graphs based on the values displayed in the logs for these indicators can help identify peak periods and the need for optimization or architecture changes.

Real life examples of these logs:

JCRSessionLoadAverage: Jahia JCR Session Load = 11.000000 15.37400 19.23800
RequestLoadAverage: Jahia Request Load = 2.053000 0.907000 0.395000
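As a sketch of how a monitoring script could extract the three averages from such a line (the parsing logic simply follows the format shown above):

```python
import re

# Matches the three load averages (1, 5 and 15 minutes) in a
# RequestLoadAverage or JCRSessionLoadAverage log line.
LOAD_RE = re.compile(r"Load = ([\d.]+) ([\d.]+) ([\d.]+)")

def parse_load(line):
    """Return the (1min, 5min, 15min) averages, or None when absent."""
    m = LOAD_RE.search(line)
    if not m:
        return None
    return tuple(float(v) for v in m.groups())

line = "RequestLoadAverage: Jahia Request Load = 2.053000 0.907000 0.395000"
one_min, five_min, fifteen_min = parse_load(line)
# Threshold from the section above: a request load over 30 usually
# signals an overloaded platform.
overloaded = one_min > 30
print(one_min, overloaded)  # 2.053 False
```

The extracted values can then be pushed to a time-series database to build the graphs mentioned above.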


1.1.3 Bundle Caches

jahia.log will frequently output statistics about the Bundle Cache’s efficiency. As this cache is directly responsible for mitigating database accesses, its hit ratio should be as high as possible (that is, its miss ratio as low as possible). More information is available in the fine-tuning guide.

Generating graphs based on the values displayed in the logs for this indicator can help identify the need for a configuration change when the hit ratio is too low or when the currently used memory equals the max memory value (often an indicator that more memory should be allocated to this cache).

Real life example of these logs:

2011-03-15 14:00:35,739: INFO [BundleCache] - num=2331 mem=8191k max=8192k avg=3598 hits=574445 miss=55555
(In this case, the hit ratio is 574445 / (574445 + 55555) = 91.2%.)
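The hit-ratio computation above can be automated by parsing the statistics line; a minimal sketch based on the log format shown:

```python
import re

# Extracts the hits and misses counters from a BundleCache statistics line.
STATS_RE = re.compile(r"hits=(\d+) miss=(\d+)")

def hit_ratio(log_line):
    """Compute the cache hit ratio (hits / total lookups) from a stats line."""
    m = STATS_RE.search(log_line)
    hits, misses = int(m.group(1)), int(m.group(2))
    return hits / (hits + misses)

line = ("2011-03-15 14:00:35,739: INFO [BundleCache] - "
        "num=2331 mem=8191k max=8192k avg=3598 hits=574445 miss=55555")
print(round(hit_ratio(line) * 100, 1))  # 91.2
```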


1.1.4 Garbage Collection

Garbage Collection is a standard operation performed by the JVM. While it should occur on a regular basis to purge unused objects from memory, high-frequency GC operations can indicate that the server is struggling to allocate a sufficient amount of memory for its normal operations.

There are two types of Garbage Collection operation:

  • GC: quick and able to deallocate most unused objects
  • Full GC: scans the entire JVM memory to deallocate all unused objects

Both operations pause the JVM while they run (user requests are processed again once the operation finishes).

GC activity is a very good indicator of the health of the platform from a memory allocation standpoint.

A series of Full GC always shows a memory allocation issue and will dramatically reduce the overall platform performance.
[Full GC (Metadata GC Threshold)  279747K->220370K(2017767K), 2.5930379 secs]
[Full GC (Metadata GC Threshold)  279742K->220343K(2017766K), 3.8139272 secs]
[Full GC (Metadata GC Threshold)  279744K->220311K(2017787K), 2.5633357 secs]

Monitoring the frequency of GC and Full GC operations is a key aspect of the monitoring of a Java platform.
Garbage Collection logs can usually be found in the application server logs (catalina.out with Tomcat)
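A monitoring script could extract pause times and freed memory from Full GC lines such as the ones above; this sketch assumes the HotSpot log format shown:

```python
import re

# Matches "[Full GC (...) beforeK->afterK(totalK), N.NNN secs]" lines.
GC_RE = re.compile(r"\[Full GC .*?(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]")

def parse_full_gc(line):
    """Extract freed memory, heap size and pause time from a Full GC line."""
    m = GC_RE.search(line)
    if not m:
        return None
    before, after, total, secs = m.groups()
    return {
        "freed_kb": int(before) - int(after),
        "heap_kb": int(total),
        "pause_secs": float(secs),
    }

line = "[Full GC (Metadata GC Threshold)  279747K->220370K(2017767K), 2.5930379 secs]"
stats = parse_full_gc(line)
print(stats["freed_kb"], stats["pause_secs"])  # 59377 2.5930379
```

Counting how often such lines occur per hour gives the Full GC frequency metric discussed above.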


1.1.5 Page generation duration

Page loading time as seen by the end user is the combination of multiple factors (network, firewalls, load balancers, web servers, application server, client-side DOM execution time, JavaScript code execution, static asset download...). Slow pages are usually caused by a combination of several of these factors.

Digital Experience Manager logs the time it takes to render each requested page, before the overall infrastructure processing time is added. Monitoring the DX rendering time in order to generate graphs and send monitoring alerts is essential to spot performance problems before they become a problem for end users.

Real life example of these logs:

2017-03-30 09:37:37,955: INFO  [http-nio-8080-exec-3] org.jahia.bin.Render: Rendered [/cms/render/live/en/sites/digitall/home.html] user=[root] ip=[127.0.0.1] sessionID=[2A12474B0E498F816C24352DFA3814C3] in [46ms]

Some pages are usually more important than others (home page, search page…) and it can be worth monitoring their rendering time specifically.
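As a sketch, the rendering time can be extracted from Render log lines such as the one above and compared against a per-page threshold (the threshold values here are hypothetical):

```python
import re

# Matches "Rendered [<path>] ... in [<N>ms]" in org.jahia.bin.Render lines.
RENDER_RE = re.compile(r"Rendered \[([^\]]+)\].* in \[(\d+)ms\]")

def parse_render(line):
    """Return the (page path, render time in ms) from a Render log line."""
    m = RENDER_RE.search(line)
    if not m:
        return None
    return m.group(1), int(m.group(2))

line = ("2017-03-30 09:37:37,955: INFO  [http-nio-8080-exec-3] "
        "org.jahia.bin.Render: Rendered [/cms/render/live/en/sites/digitall/home.html] "
        "user=[root] ip=[127.0.0.1] sessionID=[2A12474B0E498F816C24352DFA3814C3] in [46ms]")
page, ms = parse_render(line)

# Hypothetical per-page thresholds: tighter limit on key pages.
THRESHOLDS_MS = {"/cms/render/live/en/sites/digitall/home.html": 500}
slow = ms > THRESHOLDS_MS.get(page, 1000)
print(page, ms, slow)  # ... 46 False
```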

 

1.2 Server (Physical Server or Virtual Machine)

Digital Experience Manager has no specific monitoring requirement regarding the server and its operating system. A typical monitoring strategy is to monitor the overall availability of the operating system as well as CPU, memory, open file descriptors, disk space and network usage.

 

1.3 JVM

The Java Virtual Machine has its own pool of allocated resources and thus needs to be monitored specifically. The JVM can be seen as an operating system of its own, with components such as a thread manager, a scheduler, a memory manager and many more.

From the OS standpoint the JVM is a black box, but it provides interfaces and tools to help monitor it.

Metrics worth monitoring are:

  • Memory
    • Memory usage vs max memory (-Xmx)
    • Garbage Collection frequency (frequent and long Full GC operations indicate a memory management problem)
  • Threads
    • DB Connection pools
    • Blocked threads
  • Exceptions
    • Average number of exceptions and errors

Most monitoring tools have plugins to connect to JMX, the JVM API that exposes these metrics.


1.4 File system

File System monitoring is often forgotten but needs to be taken into account. A lack of available space is likely to cause a service outage and corrupt files.

When Digital Experience Manager and its ecosystem are spread across several partitions, some might grow faster than others:

  • Log files
    • Log size will vary based on the traffic of the platform, log level and component implementation. A sufficient amount of free space should be available at any time to absorb any unexpected event. Keep in mind that log size can suddenly explode and that the server will crash if the log directory runs out of space.
  • Temporary folder
    • Digital Experience Manager uses the application server tmp folder (tomcat/tmp for Tomcat) when performing some operations (site import, module deployment, JSP compilation, file upload...). Importing huge sites or deploying a vast number of modules might make this folder temporarily bigger. It is essential that this folder can grow significantly without running out of space.
  • digital-factory-data folder
    • Digital Experience Manager’s data folder contains many important files, the biggest being a collection of Jackrabbit indexes under digital-factory-data/repository. This folder grows slowly but regularly over time, depending on the contribution activity on the platform, sometimes reaching 20-30GB of used space. It is essential that this folder never runs out of space, to prevent index corruption.
  • Datastore folder
    • Based on your configuration, you might or might not use a datastore on a file system. If you do, specific attention should be paid to the datastore’s partition. As it contains most contributed files, this folder can grow significantly based on the type and size of files uploaded by the content authors.
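A basic free-space check covering partitions like the ones above can be sketched with the Python standard library; the example path and thresholds are illustrative and should be adapted to your partition layout:

```python
import shutil

def check_free_space(path, min_free_gb):
    """Return True when the partition holding `path` has at least min_free_gb free."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3) >= min_free_gb

# Hypothetical usage: run the same check against the log, temporary, data
# and datastore partitions, e.g. check_free_space("/opt/digital-factory-data", 20).
print("OK" if check_free_space("/", 1) else "LOW DISK SPACE")
```

A monitoring agent would run such a check periodically and raise an alert well before a partition fills up, since corruption can occur once space runs out.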

 

1.5 Server Polling (Health Check)

It is considered good practice to have a probe poll the live site on a regular basis to make sure that the most important pages still display correctly and render in a timely manner. The most common strategy is to poll the home page and any other important pages of the site, and analyse the HTML code to look for predefined markup.

The server can also be browsed automatically with tests using technologies such as Selenium in order to check the integrity of the pages on a regular basis.

A health check is used both to control server performance (page generation time) and the content of the page (no error is displayed and the page renders correctly).
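The decision logic of such a health check (HTTP status, response time and a predefined marker in the HTML) can be sketched as follows; the polling function is a minimal example and the marker string is hypothetical:

```python
import time
import urllib.request

def page_is_healthy(status, body, elapsed_ms, marker, max_ms=2000):
    """Decide health from HTTP status, page body, response time and a marker."""
    return status == 200 and marker in body and elapsed_ms <= max_ms

def poll(url, marker, timeout=10):
    """Fetch a page (URL supplied by the caller) and time the round trip."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        status = resp.status
    elapsed_ms = (time.monotonic() - start) * 1000
    return page_is_healthy(status, body, elapsed_ms, marker)

# The decision logic can be checked without a live server:
print(page_is_healthy(200, "<html><body>Welcome to Digitall</body></html>", 120, "Digitall"))  # True
```

Dedicated probes in monitoring suites offer the same capability out of the box; the point is to verify both the response time and the presence of expected content, not just an HTTP 200.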

 

1.6 Caches

Digital Experience Manager uses multiple layers of caching to ensure a fast and scalable platform. While most caches are fully controlled by DX, some are application-specific and might be improved by the project’s developers.

The “Cache Management” entry of the Jahia Tools shows the size and hit ratio of each cache. The HTML cache is a key aspect of Digital Experience Manager’s performance and should have the highest possible hit-to-miss ratio. A low ratio can often be improved by the project developers.
When possible, saturation of the HTML cache should also be avoided by increasing its size.


2 What tools to use


The key to successful platform monitoring is often to use modular and interoperable tools to monitor the overall architecture. Key players in the monitoring industry often offer log analysis, JMX and VM performance analysis modules.

Monitoring tools help anticipate problems before they arise, and provide everything needed to troubleshoot a platform in a timely fashion. Platform instability and crashes often occur when key teams aren’t available, hence the need for a centralized platform providing everything from logs to operating system metrics and key resources.

This section describes some monitoring platforms that offer a state-of-the-art monitoring capability.

2.1 Nagios

Nagios is a leading monitoring platform with a modular approach. It provides many out-of-the-box modules to monitor anything from network availability to operating systems, file systems, logs and even JVMs. It has a licence-based pricing model depending on the number of servers there are to monitor.

  • Operating System monitoring: Nagios offers out-of-the-box OS monitoring capabilities for most common metrics. (CPU, Memory, Threads, I/Os…)
  • File System monitoring: Nagios offers out-of-the-box capabilities for file system monitoring, including capacity planning features
  • Logs monitoring: Nagios Log Server provides a centralized way to fetch logs across many servers as well as a log parser to detect the specific patterns documented earlier in this chapter. Many free alternatives to this plugin exist.
  • JVM monitoring: check_jvm plugin provides many metrics about the Java Virtual Machine, including used and available memory heap, thread states and database connection pool status.
  • Monitoring alerts: Nagios offers out-of-the-box support for email, SMS and custom alerts in case of an emergency.
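As an illustration, a custom Nagios check wrapping the Jahia Request Load indicator from section 1.1.2 could follow the standard plugin exit-code convention (the warning and critical thresholds here are hypothetical):

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_request_load(one_min_load, warn=10.0, crit=30.0):
    """Map a 1-minute Jahia Request Load value to a Nagios status and message."""
    if one_min_load >= crit:
        return CRITICAL, "CRITICAL - request load %.2f" % one_min_load
    if one_min_load >= warn:
        return WARNING, "WARNING - request load %.2f" % one_min_load
    return OK, "OK - request load %.2f" % one_min_load

code, message = check_request_load(2.053)
print(code, message)  # 0 OK - request load 2.05
```

A real plugin would print the message and terminate with `sys.exit(code)` so that Nagios can interpret the result.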

2.2 AppDynamics

AppDynamics comes as a SaaS monitoring platform or as an on-premise platform. It provides a wide range of native features with a transaction-based approach, making it possible to get a breakdown of the cost of a specific request across the whole infrastructure. Many built-in features streamline the monitoring of JVM platforms. AppDynamics has a per-server licensing price.

  • Operating System monitoring: out-of-the-box monitoring capabilities for most common metrics. (CPU, Memory, Threads, I/Os…)
  • File System monitoring: an out-of-the-box feature where “health rules” can be defined to generate monitoring alerts once a specific threshold is reached
  • Logs monitoring: AppDynamics has a native log consolidation capability and can be interfaced with Logstash. AppDynamics has a “Log Analytics” feature to parse the logs and send alerts based on keywords and values.
  • JVM monitoring: AppDynamics can natively monitor Java applications with deep inspection of stack traces, including statistics on the number of calls and method execution time.
  • Monitoring alerts: AppDynamics offers out-of-the-box support for email, SMS and custom alerts in case of an emergency.

2.3 ELK

Elasticsearch, Logstash, Kibana (ELK) is not a product per se, but rather a stack of products that combine very well for monitoring purposes. ELK architectures are free and benefit from a dynamic community as well as from a wide range of additional modules.

ELK has great momentum because of its scalability and modularity. The downside is that, not being a fully integrated product, more customization and module aggregation are necessary before getting a fully monitored platform.

An alternative is Elastic’s X-Pack which is a subscription-based fully integrated product with monitoring, reporting and alerting capabilities.

  • Operating System monitoring: Topbeat, Packetbeat and the wider Beats product family can gather all OS-related metrics, including CPU, memory, system events, disk I/O and network packets.
  • File System monitoring: Metricbeat can gather disk-related information such as disk usage and remaining space.
  • Logs monitoring: Logstash can natively collect log files from various servers, parse them and inject them into Elasticsearch for further analysis.
  • JVM monitoring: the JMX plugin can fetch all vitals from the JVM.
  • Monitoring alerts: many free plugins exist for email-based alerts. The X-Pack bundle is also able to push notifications through emails, SMS, Slack...