Monitoring your servers

November 14, 2023

The Server Availability Manager (SAM) module supports monitoring in complex containerized environments. SAM extends Jahia's GraphQL API and provides server monitoring, server availability, and maintenance operation functionality. Learn more below about how to use and extend Jahia APIs with complex containerized environments.

Note: SAM is compatible with Jahia 7.3.3.0 and higher.

Migrating from the Healthcheck module

Server Availability Manager improves on and replaces the Healthcheck module and adds more server availability functionality. As with Healthcheck, you can develop new probes in addition to the ones that SAM ships with by default.

When migrating from Healthcheck to SAM, remember that the data object has been removed from the health probes. These data elements (such as system load, node list, and URLs) are now provided by dedicated nodes of the API rather than the health probe. SAM actually includes nodes dedicated to system monitoring, but those nodes are located outside of the probes.

Monitoring system health

The SAM module provides insights into your platform's health and triggers alerts when key components need attention. The module is available through GraphQL or REST, and you can query it at will with minimal impact on the platform load. You can also develop additional probes to provide more information to your monitoring systems.

Probes are categorized by severity and report a status:

  • GREEN (Nominal status)
  • YELLOW (Non-critical issue)
  • RED (Critical issue)

The query below fetches the status of all probes with a severity of LOW or above:

query {
  admin {
    jahia {
      healthCheck(severity: LOW) {  # Minimum severity to return
        status {                    # Highest reported status across all probes
          health                    # GREEN, YELLOW or RED
          message                   # Explanation for the health level
        }
        probes {
          name          # Name of the probe
          status  {     # Status reported by the probe
            health      # GREEN, YELLOW or RED
            message     # Explanation for the health level
          }
          severity      # Severity of the probe (LOW to CRITICAL)
          description   # Description specified by the developer of the probe
        }
      }
    }
  }
}
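
As an illustration of how a monitoring script might consume the result of the query above, here is a minimal Python sketch. It assumes the JSON response mirrors the query structure under the standard GraphQL data key. The sample response is hand-written for this example and is not real server output, and the function names are hypothetical.

```python
# Rank the health levels so that the worst one can be selected with max().
HEALTH_RANK = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def _probes(response):
    """Extract the probe list, assuming the response mirrors the query shape."""
    return response["data"]["admin"]["jahia"]["healthCheck"]["probes"]


def worst_health(response):
    """Recompute the highest health level reported by any probe."""
    levels = (probe["status"]["health"] for probe in _probes(response))
    return max(levels, key=HEALTH_RANK.get, default="GREEN")


def probes_needing_attention(response, minimum="YELLOW"):
    """Return (name, health, message) for probes at or above `minimum`."""
    floor = HEALTH_RANK[minimum]
    return [
        (p["name"], p["status"]["health"], p["status"]["message"])
        for p in _probes(response)
        if HEALTH_RANK[p["status"]["health"]] >= floor
    ]


# Hand-written sample, for illustration only.
sample = {
    "data": {"admin": {"jahia": {"healthCheck": {
        "status": {"health": "YELLOW", "message": "Serverload is above normal"},
        "probes": [
            {"name": "DBConnectivity", "severity": "CRITICAL",
             "status": {"health": "GREEN", "message": "Connection established"}},
            {"name": "ServerLoad", "severity": "HIGH",
             "status": {"health": "YELLOW", "message": "Serverload is above normal"}},
        ],
    }}}}
}

print(worst_health(sample))
print(probes_needing_attention(sample))
```

In practice the top-level status field already carries the highest reported status, so recomputing it is mainly useful when you filter probes on the client side.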

As mentioned in About monitoring, this module replaces the previous Healthcheck module.

Permissions

By default, only the root user can perform GraphQL queries against Server Availability Manager. Although not directly associated with Server Availability Manager, a dedicated permission called "Graphql admin query" is available for querying GraphQL nodes under "query.admin".

With this permission, you can create a dedicated monitoring user (or role) that requires fewer privileges than the root user.

You can find this permission under "Roles and Permissions" > "Other Permissions" > "Admin".

Available probes

Server Availability Manager ships with the probes listed below, which you can configure by editing the org.jahia.modules.sam.healthcheck.ProbesRegistry.cfg file.

Note: In cluster mode, the PatchFailures probe only provides accurate results when run against the processing node.
  • DBConnectivity
    Verifies that the database is configured with a valid connection. This probe has a default severity of CRITICAL and only returns GREEN or RED.
    The following messages are available:
            1 Connection established (everything is OK)
            2 Could not connect (connection was not successful)
            3 Encountered exception while connecting (the server produced SQLException)
  • Datastore
    Verifies the connection to the JCR Datastore and the ability to write to it. This probe has a default severity of CRITICAL and only returns GREEN or RED.
    The following messages are available:
            1 Datastore is healthy (everything is OK)
            2 Could not perform write operation (could not write to jahia.jackrabbit.datastore.path)
  • ClusterConsistency
    Verifies that module states are consistent across a given cluster group and reports modules that have an inconsistent state. This probe has a default severity of HIGH and can return GREEN, YELLOW, or RED.
    By default, the check is performed against the default cluster group. You can change the group by assigning a value to the clusterGroup property of the probe in the configuration. The cellar section is added if the probe detects differences between the local state and the cellar state (from the default group).
    The following messages are available:
           1 Cluster is not activated (everything is OK)
           2 Failed to find ClusteredBundleService (the probe cannot function properly due to internal issues; the status is RED)
           3 No issues found on this node (everything is OK; all cluster nodes should report this message if the cluster is consistent)
           4 A message listing, for each cluster node, the module(s) that caused the issue (there is an issue and the status is YELLOW):
    {
    	"dx-cluster-node1-id": [
    		{
    			"clusterNodeId": "dx-cluster-node1-id",
    			"module": "org.jahia.modules/database-connector/1.5.0",
    			"osgiState": "RESOLVED",
    			"moduleState": "RESOLVED"
    		}
    	],
    	"dx-cluster-node2-id": [
    		{
    			"clusterNodeId": "dx-cluster-node2-id",
    			"module": "org.jahia.modules/database-connector/1.5.0",
    			"osgiState": "STARTING",
    			"moduleState": "RESOLVED"
    		}
    	],
    	"dx-cluster-node3-id": [
    		{
    			"clusterNodeId": "dx-cluster-node3-id",
    			"module": "org.jahia.modules/database-connector/1.5.0",
    			"osgiState": "ACTIVE",
    			"moduleState": "ACTIVE"
    		}
    	],
    	"cellar": [
    		{
    			"module": "org.jahia.modules/database-connector/1.5.0",
    			"clusterState": "ACTIVE"
    		}
    	]
    }
  • ServerLoad
    Verifies both request and session loads over one minute. This probe has a default severity of HIGH and can return a GREEN, YELLOW, or RED status based on configurable thresholds. The configuration file above contains examples of such configuration.
    The following messages are available (Note that request and session thresholds are set at 40 for YELLOW and 70 for RED):
             1 Serverload is normal (indicates normal load)
             2 Serverload is above normal (indicates above normal load)
             3 Serverload is high (indicates high load)
  • ModuleState
    Verifies whether any modules are in an inactive state. This probe has a default severity of MEDIUM and can return GREEN, YELLOW, or RED depending on module states. You can configure this probe with either a whitelist (only monitor specific modules) or a blacklist (monitor all modules except those in a provided list). The configuration file above contains examples of such configurations.
    The following messages are available:
             1 All modules are started (no module is in a non-started state)
             2 At least one module is not started. Module <module name> is in <state description> state (this can indicate either YELLOW or RED health, depending on whether another version of the module is started)
             3 At least one modules has an invalid start-level. Module <module name> has start-level <start level> (this is a RED indicator)

    You can correct the state of a module by navigating to either https://[YOUR_JAHIA_HOST]/jahia/administration/manageModules or https://[YOUR_JAHIA_HOST]/tools/osgi/console/bundles, locating the module, and then stopping, starting, or undeploying it.

    To correct the start level of a bundle, you can either log in to the Karaf console or use the latest provisioning API to send a Karaf command. For example, you can check and correct the start level by calling https://[YOUR_JAHIA_HOST]/modules/api/provisioning with the required level of authorization and the following JSON body:
     
    [
      {
        "karafCommand": "bundle:start-level article 90"
      }
    ]

    In the example above, the article module is given a start level of 90. To only check the start level, run the same command without the level at the end (that is, without 90). You can also send bundle:start-level --help for more information. Note that the results are written to the Tomcat console.

  • PatchFailures
    Verifies that the migration was successful. The probe has CRITICAL severity and can be in a GREEN or RED state.
    The following messages are available:
             1 No patching to report on (didn't find patches folder, everything is OK)
             2 Patch applied successfully (there are no failed patches in the patches folder, everything is OK)
             3 Following patches failed: <patch name1>, <patch name2> ... (some patches failed, migration was not successful)
  • ModulesDefinitions
    Verifies that the modules are compatible with the current deployed definitions (more details about Module Definition checks here). The probe has HIGH severity and can be in GREEN or RED state.
    The following messages are available:
             1 The definitions used by the started <module name1>, <module name2>, ... modules correspond to the definitions of higher, non started, versions of these modules (some modules have their definitions modified, there are compatibility issues)
             2 All modules are OK (there are no invalid modules, everything is OK)
  • SearchIndex
    Verifies if search indices are too fragmented (default values: 5ms for the yellow threshold and 50ms for the red one). The probe has HIGH severity and can be in GREEN, YELLOW or RED state.
    The following messages are available:
             1 Query AVG (... ms) is lower than ... ms over the last minute. All good here. (average query execution time in the last minute is under the yellow threshold, everything is OK)
             2 Query AVG (... ms) is greater than ... ms over the last minute. (average query execution time over the last minute is above the yellow threshold, this needs to be monitored)
             3 Query AVG (... ms) is greater than ... ms over the last minute. It might be time to reindex. (average query execution time over the last minute is above the red threshold, reindexing is needed)
  • JExperienceConnections
    Checks the status of the connection between jExperience and jCustomer. This probe is only available when jExperience 2.6.0+ is installed on the platform. The probe has HIGH severity and can be in a GREEN, YELLOW, or RED state.
    The following messages are available:
             1 <Number> JExperience connection(s) in error out of <Number>:  related to the following config key <Config Key>  (YELLOW)
             2 All JExperience connection(s) are OK (GREEN)
             3 All JExperience connection(s) are KO (RED)
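
The ClusterConsistency message format shown in the probe list above can be inspected programmatically. The following Python sketch is one possible reading of that message: it assumes that a node is out of sync when its osgiState differs from the clusterState listed in the cellar section. That comparison rule is an assumption made for illustration, not documented probe behavior, and the function name is hypothetical.

```python
import json


def modules_out_of_sync(message):
    """Map each module to the (node, osgiState) pairs that differ from cellar.

    Assumption: the "cellar" section holds the expected state per module.
    """
    report = json.loads(message)
    expected = {e["module"]: e["clusterState"] for e in report.get("cellar", [])}
    mismatches = {}
    for node_id, entries in report.items():
        if node_id == "cellar":
            continue
        for entry in entries:
            wanted = expected.get(entry["module"])
            if wanted is not None and entry["osgiState"] != wanted:
                mismatches.setdefault(entry["module"], []).append(
                    (node_id, entry["osgiState"])
                )
    return mismatches


# Shortened version of the sample message from the ClusterConsistency probe.
MODULE = "org.jahia.modules/database-connector/1.5.0"
sample_message = json.dumps({
    "dx-cluster-node1-id": [{"clusterNodeId": "dx-cluster-node1-id",
                             "module": MODULE,
                             "osgiState": "RESOLVED", "moduleState": "RESOLVED"}],
    "dx-cluster-node2-id": [{"clusterNodeId": "dx-cluster-node2-id",
                             "module": MODULE,
                             "osgiState": "STARTING", "moduleState": "RESOLVED"}],
    "dx-cluster-node3-id": [{"clusterNodeId": "dx-cluster-node3-id",
                             "module": MODULE,
                             "osgiState": "ACTIVE", "moduleState": "ACTIVE"}],
    "cellar": [{"module": MODULE, "clusterState": "ACTIVE"}],
})

print(modules_out_of_sync(sample_message))
```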

About the REST API

The module also contains a servlet providing REST (GET) capabilities to Jahia to help you transition from the Healthcheck to the Server Availability Manager module. Accessible at https://[YOUR_JAHIA_HOST]/modules/healthcheck?severity=low, a GET request to this URL returns the list of probes and their values.

You can further customize the configuration by editing the org.jahia.modules.sam.healthcheck.HealthCheckServlet.cfg file.

# Default severity level used when "?severity=LEVEL" is not provided
severity.default=MEDIUM

# Threshold above which an HTTP error code will be returned
status.threshold=RED
# Error code to be returned if above threshold
status.code=503
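
To show how an external monitor might call this endpoint, here is a minimal Python sketch using only the standard library. It builds the URL described above and treats the configured status.code (503 in the sample configuration) as the signal that the threshold was crossed. Authentication is deliberately left out, the host name is a placeholder, and the function names are hypothetical.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def healthcheck_url(host, severity=None):
    """Build the servlet URL; without `severity` the server default applies."""
    base = f"https://{host}/modules/healthcheck"
    if severity is None:
        return base
    query = urlencode({"severity": severity})
    return f"{base}?{query}"


def fetch_http_status(url, timeout=10):
    """Send a GET request and return the HTTP status code without raising on errors."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code


def threshold_crossed(http_status, configured_code=503):
    """True when the servlet answered with the configured error code."""
    return http_status == configured_code


# Placeholder host; no request is sent by this line.
print(healthcheck_url("jahia.example.com", "low"))
```

A monitor could call fetch_http_status(healthcheck_url(host, "low")) on a schedule and raise an alert when threshold_crossed returns True.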

Developing your own health probes

You can develop your own probes in much the same way as with the Healthcheck module, taking inspiration from the existing probes available on GitHub.

Monitoring background tasks

During its regular lifecycle, Jahia performs actions that shouldn't be interrupted by maintenance activities, such as server shutdown and database maintenance. By exposing such tasks, Jahia makes third-party platforms (or individuals) aware of when to avoid such interruptions.

The following query returns the list of critical tasks currently running on the server. The query returns the tasks running at the time the query was made. The Server Availability Manager does not keep a log of previously running tasks.

query {
  admin {
    jahia {
      tasks {
        service # Name of the service holding the task
        name    # Name of the task that should not be interrupted
        started # Datetime at which the task started (if available)
      }
    }
  }
}
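
A maintenance script could use the result of this query to decide whether it is safe to proceed. The Python sketch below assumes the response mirrors the query structure under the standard GraphQL data key. The sample responses are hand-written for illustration, and the function names are hypothetical.

```python
def running_tasks(response):
    """Return (service, name) pairs for the tasks currently reported."""
    tasks = response["data"]["admin"]["jahia"]["tasks"]
    return [(task["service"], task["name"]) for task in tasks]


def safe_for_maintenance(response):
    """Maintenance is considered safe only when no critical task is running."""
    return not running_tasks(response)


# Hand-written samples, for illustration only.
idle = {"data": {"admin": {"jahia": {"tasks": []}}}}
busy = {"data": {"admin": {"jahia": {"tasks": [
    {"service": "core", "name": "BUNDLE_INSTALL", "started": None},
]}}}}

print(safe_for_maintenance(idle))
print(running_tasks(busy))
```

Because the module does not keep a log of past tasks, a script like this has to poll the query shortly before starting the maintenance operation.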

Core

To provide backward compatibility with older versions of Jahia, the module implements two approaches for identifying running tasks:

  • For tasks triggered by modules (or Jahia versions) released prior to the availability of SAM, SAM uses pattern matching on the running threads to identify critical tasks. When listing tasks, those will appear grouped within the core service.
  • For future releases of Jahia and its modules, a registry is made available to allow such tasks to be registered when starting and unregistered once completed.

The following tasks are monitored under the core service:

  • IMPORT_ZIP
  • IMPORT_XML
  • BACKGROUND_JOB
  • BUNDLE_START
  • BUNDLE_INSTALL

Registering tasks

The module has a registry of running tasks allowing modules that are not part of the Jahia default distribution to declare their own tasks. This can also be extended to external services willing to prevent a server from being restarted.

Registering tasks from a Java module

To register long-running tasks, we use the Jahia FrameworkService and TaskRegisterEventHandler of Server Availability Manager.

To register tasks from a Java module:

  1. Before registering the task, we recommend that you unregister it first, in case a previous run failed and did not clean it up. Use the following command:
    FrameworkService.sendEvent(UNREGISTER_EVENT, constructTaskDetailsEvent(workspace, WORKSPACE_INDEXATION), true);
    where UNREGISTER_EVENT represents the event topic org/jahia/modules/sam/TaskRegistryService/UNREGISTER, and constructTaskDetailsEvent creates a map with three event properties:
    • name
    • service
    • started
    This example registers the workspace indexation task of Augmented Search. You can see the full example in the ESService.java class.
  2. Register the task when it starts, following the same approach as for unregistering. Just use REGISTER instead of UNREGISTER.
  3. Unregister the task when it ends.
Important: Tasks are identified by the combination of name and service, so this combination must be unique.

Registering tasks through the GraphQL API

You can use the GraphQL API to create and delete tasks when external services need to inform a server that it should not be restarted. This example shows how to use createTask. Using deleteTask with the same parameters deletes that particular task.

mutation {
  admin {
    jahia {
      createTask(service: "DevOps Team" name: "Network maintenance on Core VPC")
    }
  }
}

The registry is shared between GraphQL and Java modules, so you can create a task in a Java module and delete it using the GraphQL API.
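
One way an external script could pair createTask with deleteTask is sketched below in Python. The send argument stands in for whatever function posts a GraphQL document to your server; it is hypothetical, as are the helper names. The sketch relies on json.dumps to quote the string arguments, which is adequate for simple labels such as the ones in the example above.

```python
import json
from contextlib import contextmanager


def task_mutation(operation, service, name):
    """Build a createTask or deleteTask mutation for the given task."""
    if operation not in ("createTask", "deleteTask"):
        raise ValueError(f"unsupported operation: {operation}")
    # json.dumps yields a double-quoted, escaped string literal.
    arguments = f"service: {json.dumps(service)}, name: {json.dumps(name)}"
    return "mutation { admin { jahia { " + f"{operation}({arguments})" + " } } }"


@contextmanager
def registered_task(send, service, name):
    """Create the task on entry and always delete it on exit, even on failure."""
    send(task_mutation("createTask", service, name))
    try:
        yield
    finally:
        send(task_mutation("deleteTask", service, name))


# Usage with a stand-in sender that only records the documents it receives.
sent = []
with registered_task(sent.append, "DevOps Team", "Network maintenance on Core VPC"):
    pass  # the maintenance work would happen here

for document in sent:
    print(document)
```

Deleting the task in a finally clause keeps a failed maintenance run from leaving a stale entry in the registry.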

Performing other operations

Jahia also supports a set of additional operations associated with monitoring.

Server shutdown

Working jointly with the tasks registry, a shutdown service is exposed via the GraphQL API. This API node should be used with care as it shuts down the Jahia server.

This mutation is provided as an example. Don’t use timeout and force together, because force triggers an immediate shutdown without considering the timeout value.

mutation {
  admin {
    jahia {
      shutdown(
        # When dryRun is provided, the server is not shut down but
        # still returns the expected API response (true or false)
        dryRun: true,   
        timeout: 25,    # In seconds, maximum time to wait for server to be ready (empty list of tasks) to shutdown 
        force: true     # Force immediate shutdown even if tasks are running
      )
    }
  }
}
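
Since timeout and force should not be combined, a script that generates this mutation can reject the combination up front. The Python sketch below shows one way to do that. The function name and its defaults are hypothetical, and it defaults to a dry run as a precaution.

```python
def shutdown_mutation(dry_run=True, timeout=None, force=False):
    """Build the shutdown mutation, refusing timeout and force together."""
    if force and timeout is not None:
        raise ValueError("force ignores timeout; pass only one of them")
    arguments = []
    if dry_run:
        arguments.append("dryRun: true")
    if timeout is not None:
        arguments.append(f"timeout: {int(timeout)}")  # in seconds
    if force:
        arguments.append("force: true")
    call = "shutdown"
    if arguments:
        call += "(" + ", ".join(arguments) + ")"
    return "mutation { admin { jahia { " + call + " } } }"


# Dry run that would wait up to 25 seconds for running tasks to finish.
print(shutdown_mutation(timeout=25))
```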

Server load metrics

Load metrics are also available through the GraphQL API, as per the query below:

query {
  admin {
    jahia {
      load {
        requests {
          count
          # Interval can be ONE, FIVE, or FIFTEEN minutes
          average(interval: ONE)
        }
        sessions {
          count
          average(interval: FIVE)
        }
      }
    }
  }
}
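
As a rough illustration, the averages returned by this query can be compared against thresholds on the client side. The Python sketch below reuses the 40 (YELLOW) and 70 (RED) values quoted for the ServerLoad probe. Whether the probe compares averages in exactly this way, and whether its thresholds are inclusive, is not stated in this document, so treat both as assumptions. The sample response is hand-written and the function names are hypothetical.

```python
def classify(value, yellow=40, red=70):
    """Classify a load value; thresholds are treated as inclusive (assumption)."""
    if value >= red:
        return "RED"
    if value >= yellow:
        return "YELLOW"
    return "GREEN"


def load_summary(response):
    """Classify the request and session averages from a load query response."""
    load = response["data"]["admin"]["jahia"]["load"]
    return {
        "requests": classify(load["requests"]["average"]),
        "sessions": classify(load["sessions"]["average"]),
    }


# Hand-written sample, for illustration only.
sample_load = {"data": {"admin": {"jahia": {"load": {
    "requests": {"count": 12, "average": 45.5},
    "sessions": {"count": 80, "average": 12.0},
}}}}}

print(load_summary(sample_load))
```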