Filtering HTML tags from rich text fields

March 6, 2024

When editing a site, contributors often use an HTML-based rich text editor. Because contributors can type HTML tags and copy and paste HTML content, you may want to prevent the use of specific HTML tags in rich text fields.

Jahia provides a module called html-filtering (GitHub repository, Jahia Store), compatible with Jahia 8.1.5+, whose implementation is based on the OWASP sanitizer. Note that it is more restrictive than the previous HTML filtering feature, because it requires you to explicitly list the allowed (or disallowed) tags and attributes.

To help you start using this module, we’ve provided a default configuration with an exhaustive list of allowed tags and attributes (org.jahia.modules.htmlfiltering.config-default.yml). With this configuration, tags such as script, iframe, and form are filtered, so first make sure you are not using these tags, or add them to your configuration.

Activating/Deactivating the HTML Filtering

From the UI, go to the site settings (/jahia/administration/<site-key>/settings/properties) and enable or disable “markup filtering” (under HTML settings). Note that this relies on the same property as in previous versions (previously accessible in Administration>Sites>HTML Filtering).

As an alternative, you can use GraphQL mutations to activate or deactivate HTML filtering on a given site.

To disable HTML filtering on a given site, use the following GraphQL mutation:

mutation {
  htmlFilteringConfiguration {
    htmlFiltering {
      disableFiltering(siteKey:"mySite")
    }
  }
}
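
To activate filtering, a mutation mirroring the one above would presumably look like the following sketch. The field name enableFiltering is an assumption inferred from disableFiltering and is not confirmed by this page; check the module's GraphQL schema before relying on it.

```graphql
# Sketch only: "enableFiltering" is an assumed name, mirroring disableFiltering.
mutation {
  htmlFilteringConfiguration {
    htmlFiltering {
      enableFiltering(siteKey:"mySite")
    }
  }
}
```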

The following GraphQL query will return a list of sites where HTML filtering is activated: 

query {
  htmlFilteringConfiguration {
    htmlFiltering {
      sitesWithActiveFiltering
    }
  }
}

View existing configuration

To view the existing configuration for a site, run the following query:

query {
  htmlFilteringConfiguration {
    htmlFiltering {
      configuration(siteKey: "mySite") {
        elements
        protocols
        attributes {
          attribute
          elements
          isGlobal
          pattern
        }
        disallow {
          elements
          protocols
          attributes {
            attribute
            elements
            isGlobal
            pattern
          }
        }
      }
    }
  }
}

You can pass “default” as the siteKey to get the default configuration.
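
As a sketch of the default-configuration lookup, the query below passes "default" as the siteKey and selects only a subset of the fields shown in the full query above. The response depends on the configuration deployed on your instance.

```graphql
# Read the default configuration by passing "default" as the site key.
query {
  htmlFilteringConfiguration {
    htmlFiltering {
      configuration(siteKey: "default") {
        elements
        protocols
        disallow {
          elements
        }
      }
    }
  }
}
```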

Default configuration

The default configuration is used by sites that have HTML filtering activated and no site-specific configuration.

There is no way to update the configuration itself via a mutation; it can only be changed by editing the file manually (digital-factory-data/karaf/etc/org.jahia.modules.htmlfiltering.config-default.yml).

Providing a custom configuration per site

Create a custom configuration file

Create a configuration file in YAML format, named after your site key using this convention: org.jahia.modules.htmlfiltering.config-yourSiteKey.yml

Note that the default configuration is merged with the per-site one, so you have to explicitly declare the tags (and attributes) you want to allow or disallow. Disallow rules take precedence over all allow rules, which means you can keep a somewhat permissive default and restrict it as you see fit on a per-site basis.

htmlFiltering:
  protocols:
    - http
    - https
  attributes:
    - name: class
      pattern: "(myclass1|myclass2)"
      elements: a, p, i
    - name: dir
    - name: id
      pattern: HTML_ID
    ...
  elements:
    - name: h1, h2, ...
  disallow:
    protocols:
      ...
    attributes:
      ...
    elements:
      ...

Activate the custom configuration

  1. Deploy the configuration file to the Jahia /karaf/etc directory.
    You can use the provisioning API, see https://github.com/Jahia/jahia-private/tree/master/bundles/provisioning#install--edit-configuration
  2. Enable HTML filtering on your site.
  3. Edit a RichText component to have its content filtered.

Test your configuration

Dry-run in content edition

A dry-run mode can be activated by explicitly using the htmlSanitizerDryRun option either in the default config or in the per-site one: 

htmlFiltering:
  htmlSanitizerDryRun: true

The dry run logs what would be filtered, and only applies to content edited while dry-run mode is active. It can help you check that the behaviour matches your expectations.

Testing existing configuration

To test what your configuration does, use the following mutation:

mutation {
  htmlFilteringConfiguration {
    htmlFiltering {
      testFiltering(
        siteKey: "mySite"
        html: "<p>Your HTML</p>"
      ) {
        html
        removedElements
        removedAttributes {
          element
          attributes
        }
      }
    }
  }
}

You need to supply the site key (use default to test the default configuration) and the HTML you want to test.
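
For instance, a quick check of how the default configuration handles script tags could look like the sketch below. The sample HTML is illustrative only, and no result is shown because it depends on the configuration deployed on your instance.

```graphql
# Test the default configuration against a snippet containing a script tag.
mutation {
  htmlFilteringConfiguration {
    htmlFiltering {
      testFiltering(
        siteKey: "default"
        html: "<p>Hello</p><script>alert(1)</script>"
      ) {
        html
        removedElements
      }
    }
  }
}
```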

Default CKEditor toolbar configuration

Note that the html-filtering module might filter some tags that are allowed by the CKEditor toolbar configuration, because the two configurations are not kept in sync.

Legacy HTML filtering in Jahia versions prior to 8.2.0

In Jahia versions prior to 8.2.0, you can access the legacy HTML filtering option from Administration>Sites>HTML Filtering. It uses the same property.

To prevent the use of specific HTML tags, you specify a comma-separated list of forbidden tags. When the contributor’s HTML content is saved, Jahia automatically removes those tags from the submitted content.

The following example shows filtering of H1, script, table, tr, td, and strong tags.