HTML Filtering
Overview
The HTML Filtering module provides XSS (Cross-Site Scripting) filtering protection in Jahia when contributing content that have properties containing HTML markup. It helps ensure that user-generated HTML content is safe and compliant with your configuration.
Note: The module doesn't have to be enabled on a site to work and start filtering HTML. As soon as the module is installed and started, it will be applied and fully working based on the global configuration embedded with the module.
The HTML Filtering module in v2.0.0 is compatible with versions of Jahia starting from 8.1.8.0.
Configuration
Configuration Files and Priority
The module can be configured at three different levels, with a clear priority order:
- Site-specific configuration (highest priority)
Filename:org.jahia.modules.htmlfiltering.site-<SITE_KEY>.yml
(example:org.jahia.modules.htmlfiltering.site-digitall.yml
)
Purpose: Create separate configuration for a specific site - Global custom configuration (medium priority)
Filename:org.jahia.modules.htmlfiltering.global.custom.yml
Purpose: Override default settings for all Jahia sites. Managed by administrators to customize HTML filtering rules. - Global default configuration (lowest priority)
Filename:org.jahia.modules.htmlfiltering.global.default.yml
Purpose: Provide a minimal recommended configuration. Installed automatically by the module (it should not be modified as it will be reinstalled with each module update).
If a configuration file is invalid or contains incorrect values, it will not be loaded, and the system will fall back to the next configuration in the priority chain. Error logs will indicate which configuration was not loaded and why.
Configuration Structure
All configuration files follow the same structure, with separate sections for the EDIT and LIVE workspaces.
For example:
htmlFiltering:
formatDefinitions:
HTML_ID: '[a-zA-Z0-9\:\-_\.]+'
NUMBER_OR_PERCENT: '\d+%?'
editWorkspace:
strategy: REJECT
skipOnPermissions: []
process: ['nt:base.*']
skip: []
allowedRuleSet:
elements:
# Rules for allowed elements and attributes
protocols: [http, https, mailto]
liveWorkspace:
strategy: SANITIZE
skipOnPermissions: []
process: ['nt:base.*']
skip: []
allowedRuleSet:
elements:
# Rules for allowed elements and attributes
protocols: [http, https, mailto]
Configuration Components
Format Definitions
The formatDefinitions
section defines regular expression patterns with a unique name that can be referenced in rules:
formatDefinitions:
HTML_ID: '[a-zA-Z0-9\:\-_\.]+'
NUMBER_OR_PERCENT: '\d+%?'
LINKS_URL: '(?:(?:[\p{L}\p{N}\\\.#@$%\+&;\-_~,\?=/!{}:]+|#(\w)+)|(\s*(?:(?:ht|f)tps?://|mailto:)[\p{L}\p{N}][\p{L}\p{N}\p{Zs}\.#@$%\+&:\-_~,\?=/!\(\)]*+\s*))'
These patterns can be referenced to enforce specific formats for attribute values.
Workspace Configuration
The module provides separate sections for each Jahia workspace within the same configuration file:
- editWorkspace: Configuration for the default workspace (content editing)
- liveWorkspace: Configuration for the live workspace (direct content creation on live)
Having separate sections allows for different levels of protection based on context. The live workspace often requires stricter rules since it may include content created by users with fewer privileges (e.g., blog posts, forum comments). It is mandatory to have both workspaces present for a configuration file to be considered valid.
Strategy
The strategy setting defines how the module handles HTML content with potentially unsafe elements:
- SANITIZE: Automatically removes invalid markup when content is saved (recommended for liveWorkspace in which direct feedback may not be available).
- REJECT: Validates content and rejects the save operation if invalid markup is found (recommended for editWorkspace). A content is considered invalid if it contains tags and/or attributes that got removed after being sanitized by the module.
Sanitization
You can expect two different types of behavior when sanitization is performed. Block-level content will see the tags removed but the content will remain present. While other types of tags will see their content removed entirely.
For example <p>hello</p>
will be sanitized in hello
, while <script>alert('hello')</script>
will be removed entirely.
Process and Skip Settings
These settings define which node types and properties should be processed or skipped by the HTML filtering:
- process: Node types and properties to be processed
- Example:
process: ['nt:base.*']
applies filtering to all properties of all node types)
- Example:
- skip: Node types/properties to be excluded from filtering (
skip
takes precedence overprocess
)- Example:
skip: ['nt:myNodeType.*']
skips filtering for all properties of a specific node type - Example:
skip: ['nt:myNodeType.myProperty']
skips filtering for a specific property
- Example:
The skip
setting takes precedence over the process
setting.
Note: The node type and property notation doesn't necessarily follow a strict nodetype-property relation. For example, if a node is of type jnt:bigText
and has an additional mixin jmix:htmlReadme
(which contains property j:htmlContent
), then a configuration like skip: ['jnt:bigText.j:htmlContent']
is valid. It will skip processing of j:htmlContent
when set on a jnt:bigText
node, even if there's no direct relation between them.
Skip on Permissions
The skipOnPermissions
setting allows you to specify permissions that, when granted to a user, will cause HTML filtering to be bypassed:
skipOnPermissions: ['view-full-wysiwyg-editor', 'site-admin']
This is useful when certain privileged users need to include HTML markup that would normally be filtered out. For example:
view-full-wysiwyg-editor
: Users with access to the full WYSIWYG rich text editor experience could be granted the power to bypass HTML filteringsite-admin
: Site administrators might need to include specific HTML elements for site functionality
It's important to note that Jahia does not recommend any default permissions for this setting to keep content safe regardless of privilege level. It's up to each implementation to determine whether to use this feature and which permissions to include.
Note: Using skipOnPermissions
can create inconsistent content editor experiences. If a privileged user creates HTML content with markup that would normally be filtered, less privileged users attempting to edit that same content later may not be able to save their changes. This happens because the HTML filtering will be applied to their edits, causing the save operation to fail if the existing content contains elements not allowed for their permission level.
Rule Sets
The allowedRuleSet
and disallowedRuleSet
sections define permitted or prohibited HTML elements and attributes. Example:
allowedRuleSet:
elements:
- attributes: [class, dir, hidden, lang, role, style, title]
- attributes:
- id
format: HTML_ID
- attributes: [align]
tags: [caption, col, colgroup, hr, img, table, tbody, td, tfoot, th, thead, tr]
# More element rules...
protocols: [http, https, mailto]
Each rule can specify:
- tags: Which HTML tags the rule applies to
- attributes: Which attributes are allowed for those tags. Specifying attributes without tags means that those attributes are allowed on any tags
- format: A reference to a format definition that the attribute(s) values must match
The protocols
section defines which URL protocols are allowed in attributes like href
and src
.
Note: The allowedRuleSet
is mandatory and should contain at least one rule, while disallowedRuleSet
is optional.
GraphQL API
The module exposes a GraphQL API for validating and sanitizing HTML content:
query HtmlFiltering($html: String!, $workspace: Workspace = EDIT, $siteKey: String!) {
htmlFiltering {
validate(html: $html, workspace: $workspace, siteKey: $siteKey) {
removedTags
removedAttributes {
attributes
tag
}
sanitizedHtml
safe
}
}
}
This API can be used to validate or sanitize HTML strings, with the appropriate configuration selected based on the specified workspace and site key.
The API returns the following fields:
- removedTags: A list of tags that were removed during the sanitization process
- removedAttributes: A list of removed attributes along with the tags they were removed from
- sanitizedHtml: A sanitized version of the input HTML markup based on the underlying configuration
- safe: A boolean value that returns
true
if nothing was removed from the input HTML markup and it's valid according to the configuration that was used
Usage
For a property to be processed by HTML filtering, all the following must be true:
- The current user doesn't have any permission defined in
skipOnPermissions
- The property matches at least one pattern in
process
- The property doesn't match any pattern in
skip
- The property is defined as a RichText property in the Content Definition File (.cnd)
Example in a .cnd file:
[nt:myNodeType] > jnt:content, jmix:droppableContent
- myHTMLProperty (string, richtext)
- willNotBeProcessed (string)
In this example, myHTMLProperty
will be processed by HTML filtering because it has the richtext
constraint, while willNotBeProcessed
will not be processed because it lacks this constraint.
JSON Overrides and HTML Filtering
jContent supports .cnd overrides (called JSON overrides)
which allow you to override the .cnd constraints. For example, JSON Overrides can override property constraints to use a different field type when editing content in Jahia jContent edition UI.
It's important to note that HTML filtering is not able to process these JSON overrides, so they will be ignored by the filtering system. For example, consider a .cnd:
[nt:myNodeType] > jnt:content, jmix:droppableContent
- willNotProcessed (string)
With a JSON Override:
{
"nodeType": "nt:myNodeType",
"sections": [
{
"name": "content",
"fieldSets": [
{
"rank": 1.2,
"name": "willNotProcessed",
"selectorType": "RichText"
}
]
}
]
}
This configuration will have no effect on HTML filtering. jContent will display a WYSIWYG field for your property, allowing you to contribute HTML markup, but HTML filtering will simply ignore this property because it's not declared as richtext
in the .cnd definition.
Best Practices
- Don't modify the global default configuration file
- Create a custom global or site-specific configuration instead.
- Be careful with
skipOnPermissions
- Only grant HTML filtering bypass permissions to trusted users
- Be aware of the potential inconsistent editing experience it may create
- Monitor for rejected content saves
- Educate content creators about allowed HTML elements and attributes
- Consider expanding the allowed elements if legitimate use cases arise
- Always declare HTML properties with the
richtext
constraint in .cnd —- Remember that JSON overrides do not affect HTML filtering
- To protect all properties that can contain HTML markup, they must be declared using the RichText selector type in .cnd
- Check server logs when adding or editing configurations
- Verify that your configuration has been properly loaded
- Some configuration properties are mandatory and subject to validation
- If a configuration is invalid, it will not be loaded, and HTML filtering will fall back to another configuration in the priority chain
Migrating from HTML Filtering v1.x to v2.x
HTML Filtering v2 builds upon the foundation of v1 while introducing significant improvements to address limitations in the original version. While v1 and v2 use similar underlying technology for HTML sanitizing and validation, v2 offers enhanced flexibility and configuration options.
Important Note
As soon as HTML Filtering v2 is installed and started, it will take over from previous v1.x, and any custom v1 configurations will no longer be taken into account. It's therefore important to familiarize yourself with the new v2 configuration format and adapt any custom configurations you may have in place from v1 before upgrading. Please follow the step-by-step migration instructions below to ensure a smooth transition.
The most significant changes in v2 include:
- Support for two different filtering strategies:
SANITIZE
orREJECT
(v1 only supportedSANITIZE
by default with no option to change) - Separate configuration sections for different workspaces
- Improved configuration file organization to avoid conflicts with default settings
- Enhanced flexibility for content processing based on permissions, node types, and property names
- Configurable format definitions (patterns were hardcoded in v1)
Configuration Changes
The configuration system has been significantly revised in v2, with changes to both file naming conventions and the structure of the configuration itself.
File Naming and Organization
Version 2.x introduces a three-tier configuration system designed to make updates safer while allowing for customization:
V1.x Configuration Files:
- Default configuration:
org.jahia.modules.htmlfiltering.config-default.yml
- Site-specific configuration:
org.jahia.modules.htmlfiltering.config-SITE_KEY.yml
V2.x Configuration Files:
- Global default configuration:
org.jahia.modules.htmlfiltering.global.default.yml
- Installed automatically with the module
- Contains recommended security settings
- Should not be modified as it will be overwritten with each module update
- Global custom configuration:
org.jahia.modules.htmlfiltering.global.custom.yml
- New in v2.x
- Overrides the global default for all sites
- Intended for administrator customizations
- Site-specific configuration:
org.jahia.modules.htmlfiltering.site-SITE_KEY.yml
- Highest priority
- Only applies to the specific site
This new tiered approach allows the module to always ship with the latest recommended security settings in the global default configuration while giving administrators the ability to customize settings without risk of losing their changes during updates.
When migrating from v1 to v2:
- If you have customized the default configuration in v1, create a new
org.jahia.modules.htmlfiltering.global.custom.yml
file with your customizations - If you have site-specific configurations, create new files using the
org.jahia.modules.htmlfiltering.site-SITE_KEY.yml
naming pattern - Do not modify the
org.jahia.modules.htmlfiltering.global.default.yml
file
Configuration Structure
The structure and syntax of the configuration file has been revised to improve clarity and organization:
v1.x Configuration Example
htmlFiltering: protocols: - http - https - mailto attributes: - name: class - name: dir - name: hidden - name: id pattern: HTML_ID # Pattern names were hardcoded and not configurable - name: lang - name: role - name: style - name: title - name: align elements: caption, col, colgroup, hr, img, table, tbody, td, tfoot, th, thead, tr - name: alt elements: img # More attributes... elements: - name: h1, h2, h3, h4, h5, h6, hgroup, p, a, img, figure, figcaption, canvas, picture, br, strong, b, em, i, span, div, ul, ol, li, dl, dd, dt, table, tbody, thead, tfoot, tr, td, th, col, colgroup, caption, blockquote, q, cite, code, pre, var, abbr, address, del, s, details, summary, ins, sub, sup, small, mark, hr, button, legend, audio, video, source, track, nav, article, main, aside, section, time, header, footer, wbr, u htmlSanitizerDryRun: false
v2.x Equivalent Configuration
htmlFiltering: formatDefinitions: # New in v2: configurable format definitions HTML_ID: '[a-zA-Z0-9\:\-_\.]+' NUMBER_OR_PERCENT: '\d+%?' LINKS_URL: '(?:(?:[\p{L}\p{N}\\\.#@$%\+&;\-_~,\?=/!{}:]+|#(\w)+)|(\s*(?:(?:ht|f)tps?://|mailto:)[\p{L}\p{N}][\p{L}\p{N}\p{Zs}\.#@$%\+&:\-_~,\?=/!\(\)]*+\s*))' editWorkspace: strategy: SANITIZE # Same behavior as v1, or can be changed to REJECT skipOnPermissions: [] process: ['nt:base.*'] skip: [] allowedRuleSet: elements: - attributes: [class, dir, hidden, lang, role, style, title] - attributes: - id format: HTML_ID # References a format defined in formatDefinitions - attributes: [align] tags: [caption, col, colgroup, hr, img, table, tbody, td, tfoot, th, thead, tr] - attributes: [alt] tags: [img] # More element rules... - tags: [h1, h2, h3, h4, h5, h6, hgroup, p, a, img, figure, figcaption, canvas, picture, br, strong, b, em, i, span, div, ul, ol, li, dl, dd, dt, table, tbody, thead, tfoot, tr, td, th, col, colgroup, caption, blockquote, q, cite, code, pre, var, abbr, address, del, s, details, summary, ins, sub, sup, small, mark, hr, button, legend, audio, video, source, track, nav, article, main, aside, section, time, header, footer, wbr, u] protocols: [http, https, mailto] liveWorkspace: strategy: SANITIZE skipOnPermissions: [] process: ['nt:base.*'] skip: [] allowedRuleSet: # Similar configuration as editWorkspace but potentially more restrictive elements: # Element rules... protocols: [http, https, mailto]
Key structural differences:
- Format definitions are now configurable at the top of the configuration (in v1, patterns like HTML_ID were hardcoded and not customizable)
- Configuration is divided into workspace-specific sections (
editWorkspace
andliveWorkspace
) - v1's flat structure for attributes and elements has been reorganized into rule sets
htmlSanitizerDryRun
option has been removed- v1 only supported the
SANITIZE
behavior, while v2 allows choosing betweenSANITIZE
orREJECT
strategies for each workspace
GraphQL API Changes
The GraphQL API has been completely redesigned in v2:
v1.x GraphQL API
mutation PreviewFiltering($text: String!, $siteKey: String!) { htmlFilteringConfiguration { htmlFiltering { testFiltering( siteKey: $siteKey html: $text ) { html removedElements removedAttributes { element attributes } } } } }
v2.x GraphQL API
query HtmlFiltering($html: String!, $workspace: Workspace = EDIT, $siteKey: String) { htmlFiltering { validate(html: $html, workspace: $workspace, siteKey: $siteKey) { removedTags removedAttributes { attributes tag } sanitizedHtml safe } } }
Key differences in the GraphQL API:
- Changed from a mutation to a query operation
- Added
workspace
parameter to specify which workspace configuration to use - Renamed
html
tosanitizedHtml
in the response - Renamed
removedElements
toremovedTags
in the response - Added
safe
boolean field to quickly check if the HTML is valid - Simplified the namespace from
htmlFilteringConfiguration.htmlFiltering
to justhtmlFiltering
Additionally, v2 removes several configuration-related GraphQL APIs that were present in v1. This was done because providing such specific APIs for HTML filtering configuration doesn't make sense in the broader context. A more generic approach to configuration management via GraphQL would be better but is outside the scope of the HTML filtering module.
Migration Steps
-
Review your existing v1 configuration files
- Identify all custom configurations you have in place
- Note that v1 always used the SANITIZE strategy by default
-
Create new v2 configuration files
- For site-specific configurations: create new files using the
org.jahia.modules.htmlfiltering.site-SITE_KEY.yml
pattern - For global customizations: create a new
org.jahia.modules.htmlfiltering.global.custom.yml
file - Do not modify the default
org.jahia.modules.htmlfiltering.global.default.yml
file - Define your format definitions if you need custom patterns
- Restructure your configuration according to the v2 format with workspace-specific sections
- Choose appropriate strategies for each workspace (
SANITIZE
to maintain v1 behavior orREJECT
for stricter validation) - Note that the v2 global default configuration includes new options (
process
,skip
,skipOnPermissions
) with values that make it behave just like v1 did. These options don't need to be changed for basic migration but are available for additional customization if needed.
- For site-specific configurations: create new files using the
-
Update GraphQL API calls
- Refactor any code that uses the v1 GraphQL API to use the new v2 API
- Update field references to match the new response structure
- If you were using any configuration-related GraphQL APIs, these are no longer available
-
Test thoroughly
- Test your migrated configuration with various HTML content to ensure it behaves as expected
- Pay special attention to any content that might be affected by the new filtering behavior
- If you've chosen the
REJECT
strategy, ensure your content editors are prepared for potential validation errors
-
Remove old configuration files
- Once you have confirmed that your v2 configuration is working correctly, you can safely remove the old v1 configuration files
- This includes both the default configuration (
org.jahia.modules.htmlfiltering.config-default.yml
) and any site-specific configurations (org.jahia.modules.htmlfiltering.config-SITE_KEY.yml
)
By following these steps, you can successfully migrate from HTML Filtering v1.x to v2.x and take advantage of the enhanced features and flexibility offered by the new version.