Integrating external data sources

November 14, 2023

The External Data Provider module provides an API that integrates external systems as content providers like the JCR. Integration is done by implementing an External Data Source. The data source implementation just needs to manage the connection to the external system and the data retrieval. The External Data Provider does all the mapping work to make the external content appear in the JCR tree.

All data sources must provide content (reading). They can provide search capabilities and write access to create and update content. They can also be enhanceable, meaning that the ‘raw’ content they provide can be enhanced by Jahia content, for example by adding comments to an object provided by an External Data Provider.

How it works

Specify your mapping

Your external content must be mapped as nodes inside Jahia so they can be used by Jahia as regular nodes (edit, copy, paste, reference). This means that your external provider module must provide a CND definition file for each type of content that you plan to map into Jahia.

As a simple example, you can map a database table to a nodetype, defining each column as a JCR property.

Then, define a tree structure for your objects. Because they will appear in the Jahia repository, you must decide on a parent and children for each entry.

It is very important that each node has a unique path. You must be able to find an object from a path and also know the path from an object. The node returned by a path must always be the same, and not depend on contextual information. If your nodes depend on the context (for example, the current user), you’ll need to have different paths. To correctly create a node hierarchy, it’s perfectly acceptable to add virtual nodes which act as containers for organizing your data.

Optionally, you can define a unique identifier for every node. The External Data Provider will map this identifier to a JCR compatible UUID if needed, so that it can be used in Jahia as any other node.
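The path rules above can be sketched as follows. This is a minimal illustration, assuming hypothetical movie records grouped under virtual year folders; MoviePaths, pathOf, and idFromPath are made-up names and are not part of the External Data Provider API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: maps hypothetical movie records to stable, unique paths and back. */
final class MoviePaths {
    // Paths look like /<year>/<movieId>; the year folder is a virtual container node.
    private static final Pattern MOVIE_PATH = Pattern.compile("^/(\\d{4})/(\\d+)$");

    /** Object to path: depends only on the record itself, never on the current user or session. */
    static String pathOf(int releaseYear, long movieId) {
        return "/" + releaseYear + "/" + movieId;
    }

    /** Path to object key: returns the movie id, or -1 when the path is not a movie path. */
    static long idFromPath(String path) {
        Matcher m = MOVIE_PATH.matcher(path);
        return m.matches() ? Long.parseLong(m.group(2)) : -1L;
    }

    public static void main(String[] args) {
        String path = pathOf(1999, 603L);
        System.out.println(path + " resolves to movie " + idFromPath(path));
    }
}
```

Because both directions are pure functions of the record, the same path always resolves to the same object, which is the property the provider relies on.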

Declaring your data source

External data is accessed through a JCR provider declared as an OSGi service, where you set information such as the provider key, the mount point, and the data source implementation, as shown in this example.

/**
 * An external data provider data source implementation to expose TMDB movies as external nodes
 */
@Component(service = {TMDBDataSource.class, ExternalDataSource.class, ExternalDataSource.Searchable.class}, immediate = true)
public class TMDBDataSource implements ExternalDataSource, ExternalDataSource.Searchable {
    public static final String MOVIE_NODETYPE = "jnt:movie";
    private static final List<String> EXTENDABLE_TYPES = Arrays.asList(MOVIE_NODETYPE);
    private static final List<String> OVERRIDABLE_ITEMS = Collections.singletonList("*.*");
    public static final String NODETYPE_ROOT = "jnt:movie";
    private static final List<String> SUPPORTED_NODETYPES = Arrays.asList(MOVIE_NODETYPE);
    private ExternalContentStoreProvider provider;
    /**
     * On activate.
     *
     * @param configuration the configuration
     * @throws RepositoryException the repository exception
     */
    @Activate
    public void onActivate(Map<String, ?> configuration) throws RepositoryException, JahiaInitializationException {
        provider = (ExternalContentStoreProvider) SpringContextSingleton.getBean("ExternalStoreProviderPrototype");
        provider.setDataSource(this);
        provider.setExtendableTypes(EXTENDABLE_TYPES);
        provider.setOverridableItems(OVERRIDABLE_ITEMS);
        provider.setDynamicallyMounted(false);
        provider.setMountPoint("/sites/systemsite/contents/movies");
        provider.setKey("tmdb");
        provider.start();
    }
    @Deactivate
    public void onDeactivate() {
        provider.stop();
    }

    // ExternalDataSource and ExternalDataSource.Searchable method implementations omitted
}

Implementation

Providing and reading content

The main point for defining a new provider is to implement the ExternalDataSource interface provided by the external-provider module (org.jahia.modules.external.ExternalDataSource). This interface requires you to implement the following methods so that you can mount and browse your data as if they were part of the Jahia content tree.

  • getItemByPath
  • getChildren
  • getItemByIdentifier
  • itemExists
  • getSupportedNodeTypes
  • isSupportsUuid
  • isSupportsHierarchicalIdentifiers

The first method, getItemByPath(), is the entry point for the external data. It must return an ExternalData node for all valid paths, including the root path (/). ExternalData is a simple Java object that represents an entry of your external data. It contains the id, path, node types, and properties encoded as strings (or Binary objects for binary properties).

The getChildren method also needs to be implemented for all valid paths. It must return the names of all subnodes, as you want them to appear in the Jahia repository. For example, if you map a table or the result of a SQL query, this is the method that returns all the results. Note that it is not required that all valid nodes are listed here. If they don’t appear here, you won’t see them in the repository tree, but you can still access them directly by path or through a search. This is especially useful if you have thousands of nodes at the same level.

These two methods reflect the hierarchy you will give to Jahia.

The getItemByIdentifier() method returns the same ExternalData node, but based on the internal identifier you want to use.

The getSupportedNodeTypes() method simply returns the list of node types that your data source may contain.

isSupportsUuid() tells the External Data Provider that your external data has identifiers in the UUID format. When it returns true, Jahia does not create its own identifiers or maintain a mapping between its UUIDs and your identifiers. In most cases, it returns false.

isSupportsHierarchicalIdentifiers() specifies whether your identifier actually looks like the path of the node. This allows the provider to optimize some operations, such as move, where your identifier will be “updated”. This is useful, for example, in a file system provider, if you want to use the path of the file as its identifier.

itemExists() simply tests if the item at the given path exists.

Identifier Mapping

Every time Jahia reads an external node for the first time, it generates a unique identifier for it inside Jahia. Those mapped identifiers are stored in a table called jahia_external_mapping. This table maps the internal ID to the pair made of the provider key and the external ID returned by the ExternalData.getIdentifier method.
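As a rough illustration of the mapping idea, the sketch below uses an in-memory map as a stand-in for the mapping table. IdentifierMappingSketch and internalIdFor are hypothetical names; the real mapping is handled by the External Data Provider and persisted in the database, not by your data source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Illustrative in-memory stand-in for the (provider key, external ID) to internal ID mapping. */
final class IdentifierMappingSketch {
    // Key: [providerKey, externalId]; value: generated internal identifier.
    private final Map<List<String>, String> internalIds = new HashMap<>();

    /** Returns the internal ID for an external node, generating it only on the first read. */
    String internalIdFor(String providerKey, String externalId) {
        List<String> key = Arrays.asList(providerKey, externalId);
        return internalIds.computeIfAbsent(key, k -> UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        IdentifierMappingSketch mapping = new IdentifierMappingSketch();
        String first = mapping.internalIdFor("tmdb", "movie-603");
        String again = mapping.internalIdFor("tmdb", "movie-603");
        System.out.println("Stable across reads: " + first.equals(again));
    }
}
```

The same external ID under two different provider keys yields two different internal IDs, which is why the provider key is part of the pair.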

ExternalData

The External Data Source is responsible for mapping its data content into the ExternalData object. ExternalData provides access to the properties of your content. Those properties must be converted to one of two types: String or Binary. Strings can be internationalized or not, as they are declared in the cnd file.

Lazy Loading

If your external provider accesses data that is expensive to read (performance or memory wise), you can implement the ExternalDataSource.LazyProperty interface and fill the lazyProperties, lazyI18nProperties, and lazyBinaryProperties sets inside ExternalData. If somebody tries to get a property that is not in the properties map of ExternalData, but is in one of those sets, the system calls one of these methods to get the values:

  • getBinaryPropertyValues
  • getI18nPropertyValues
  • getPropertyValues

For example, the ModuleDataSource retrieves the source code as lazy properties so that the source code is read from the disk only when it is displayed, not when you display the file inside the tree for exploration. You must decide which type of loading you want to implement. For example, with a database it may be more efficient to read all the data at once (except binaries), depending on the number of rows and columns.
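The lookup order described in this section can be sketched with plain Java collections. This is a simplified model, not the Jahia classes: LazyNodeSketch is a hypothetical type, and its loader function stands in for a method such as getPropertyValues. The sketch does not cache loaded values.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Illustrative only: eager properties plus a set of names that are loaded on demand. */
final class LazyNodeSketch {
    final Map<String, String[]> properties = new HashMap<>();
    final Set<String> lazyProperties = new HashSet<>();
    private final Function<String, String[]> loader; // stands in for getPropertyValues
    int loaderCalls = 0;

    LazyNodeSketch(Function<String, String[]> loader) {
        this.loader = loader;
    }

    /** Returns eager values directly; calls the loader only for names declared as lazy. */
    String[] getProperty(String name) {
        String[] eager = properties.get(name);
        if (eager != null) {
            return eager;
        }
        if (lazyProperties.contains(name)) {
            loaderCalls++;
            return loader.apply(name);
        }
        return null; // neither eager nor declared lazy
    }

    public static void main(String[] args) {
        LazyNodeSketch node = new LazyNodeSketch(name -> new String[] {"(expensive value of " + name + ")"});
        node.properties.put("jcr:title", new String[] {"Main.java"});
        node.lazyProperties.add("sourceCode");
        System.out.println(node.getProperty("jcr:title")[0]);  // served from the map
        System.out.println(node.getProperty("sourceCode")[0]); // triggers the loader
    }
}
```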

Searching Content

Basic implementation

This capability requires you to implement the ExternalDataSource.Searchable interface, which defines only one method:

  • search (ExternalQuery query)

Here, query is an ExternalQuery. For more information, see http://svn.apache.org/repos/asf/jackrabbit/site/tags/pre-markdown/content/api/1.4/index.html?org/apache/jackrabbit/core/query/RelationQueryNode.html.

Your method should be able to handle a list of constraints from the query (such as AND, OR, NOT, DESCENDANTNODE, and CHILDNODE). You do not have to handle everything if it does not make sense in your case.

The QueryHelper class provides some helpful methods to parse the constraints:

  • getNodeType
  • getRootPath
  • includeSubChild
  • getSimpleAndConstraints
  • getSimpleOrConstraints

The getSimpleAndConstraints method returns a map of the properties and their expected values from the AND constraints in the query.

The getSimpleOrConstraints method returns a map of the properties and their expected values from the OR constraints in the query.

With these constraints, you build a query that is meaningful for your external provider. For example, if it is an SQL database, map those constraints to ‘AND’ conditions in the WHERE clause of your request. Queries are expressed using the JCR SQL-2 language definition.
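A minimal sketch of that last step for an SQL backend follows. It assumes the AND constraints have already been reduced to a map of property names to string values; WhereClauseSketch, the column mapping, and the movies table are hypothetical. Values are bound as parameters rather than concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: turns simple property constraints into a parameterized SQL WHERE clause. */
final class WhereClauseSketch {
    /**
     * @param constraints property name to expected value (simplified to strings in this sketch)
     * @param columns     hypothetical mapping from JCR property names to SQL column names
     * @param params      output list receiving the values to bind, in placeholder order
     */
    static String toWhereClause(Map<String, String> constraints, Map<String, String> columns,
                                List<String> params) {
        List<String> conditions = new ArrayList<>();
        for (Map.Entry<String, String> constraint : constraints.entrySet()) {
            String column = columns.get(constraint.getKey());
            if (column == null) {
                // Unmapped property: skipped here; a real data source may prefer to return no results.
                continue;
            }
            conditions.add(column + " = ?");
            params.add(constraint.getValue());
        }
        return conditions.isEmpty() ? "" : " WHERE " + String.join(" AND ", conditions);
    }

    public static void main(String[] args) {
        Map<String, String> constraints = new LinkedHashMap<>();
        constraints.put("jcr:title", "Deadpool");
        constraints.put("releaseYear", "2016");
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("jcr:title", "title");
        columns.put("releaseYear", "release_year");
        List<String> params = new ArrayList<>();
        String sql = "SELECT id FROM movies" + toWhereClause(constraints, columns, params);
        System.out.println(sql + " with parameters " + params);
    }
}
```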

Offset and Limit support

External data queries support the offset and limit query parameters. With multiple providers, results are returned by querying each provider in turn, in no specific order; however, the order stays the same once the providers are mounted. This means that, for the same query, limit and offset can be used to paginate the results.
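To see why a stable provider order is enough for pagination, consider the sketch below. It is a simplified model of sequential aggregation, not Jahia's implementation: PagedAggregationSketch is a hypothetical class, and each inner list represents the results one provider returns for the same query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: applies one offset and limit across results from several providers. */
final class PagedAggregationSketch {
    /** Walks the providers in a fixed order, skipping offset items and collecting up to limit items. */
    static List<String> page(List<List<String>> resultsPerProvider, long offset, long limit) {
        List<String> page = new ArrayList<>();
        long toSkip = offset;
        for (List<String> results : resultsPerProvider) {
            for (String path : results) {
                if (toSkip > 0) {
                    toSkip--;
                    continue;
                }
                if (page.size() >= limit) {
                    return page;
                }
                page.add(path);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<List<String>> providers = Arrays.asList(
                Arrays.asList("/a/1", "/a/2", "/a/3"),
                Arrays.asList("/b/1", "/b/2"));
        // Consecutive pages of size 2 cover every result exactly once.
        System.out.println(page(providers, 0, 2));
        System.out.println(page(providers, 2, 2));
        System.out.println(page(providers, 4, 2));
    }
}
```

If the provider order changed between two calls, the second page could repeat or miss results, which is why the consistent order matters.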

Count

You can provide your own count capability by implementing ExternalDataSource.SupportCount and the following method:

  • count(ExternalQuery query)

This should return the number of results for the provided query. The query can be parsed the same way as in the search method. Whether you have one or multiple providers, count() always returns one row containing the number of rows matching the query.

Enhancing and merging external content with Jahia content

Jahia allows you to extend your external data content with some of its own mixins or to override some properties of your nodes from Jahia. This allows you to mix data in your definition, for example external data and data defined in Jahia. In your OSGi service activation method, you can declare which of your nodes can be extended with additional mixins and properties, and which properties from your definition can be overridden or merged. The following call declares the extendable types; you can pass all your types or limit the list to certain node types only. Any mixin can be added on nodes that are extendable.

provider.setExtendableTypes(EXTENDABLE_TYPES);

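The following call declares which properties can be overridden inside Jahia. In the data source declared earlier, OVERRIDABLE_ITEMS contains "*.*", so every property of every type can be overridden. A narrower list can allow all properties of one type (for example, all properties from jtestnt:directory) or a single property (for example, only the firstclass_seats property from the airline definition).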

provider.setOverridableItems(OVERRIDABLE_ITEMS);

With regular usage, these nodes are only available to users and editors if the external provider is mounted. If you unmount your external provider, that data is only accessible from Jahia tools for administrative purposes. Like all content coming from the external provider, this content is not subject to publication. Any extension is visible in both the default and live workspaces immediately.

Writing and updating content

The external provider can be writeable. This means that you can create new content or update existing content from within Jahia. This requires you to implement the ExternalDataSource.Writable interface, which defines four methods:

  • move
  • order
  • removeItemByPath
  • saveItem

Your provider should at least implement saveItem. saveItem receives ExternalData with all modified properties. Note that if you are using lazy properties, modified properties will be moved from the set of lazy properties to the map of properties. Removed properties will be removed from both properties map and lazy properties set.

If content can be deleted, then you should implement removeItemByPath.

The other two methods (move and order) are optional behaviors that need to be implemented only if your provider supports them. For example, the VFSDataSource does not implement order, because files are not ordered on a filesystem, but it does implement move.

Here is an example of how to access binary data from ExternalData and save it to the filesystem using the VFS API (example from the VFSDataSource).

public void saveItem(ExternalData data) throws RepositoryException {
    try {
        ExtendedNodeType nodeType = NodeTypeRegistry.getInstance().getNodeType(data.getType());
        if (nodeType.isNodeType(Constants.NT_RESOURCE)) {
            OutputStream outputStream = null;
            try {
                final Binary[] binaries = data.getBinaryProperties().get(Constants.JCR_DATA);
                if (binaries != null && binaries.length > 0) {
                    outputStream = getFile(data.getPath().substring(0, data.getPath().indexOf("/" + Constants.JCR_CONTENT))).getContent().getOutputStream();
                    for (Binary binary : binaries) {
                        InputStream stream = null;
                        try {
                            stream = binary.getStream();
                            IOUtils.copy(stream, outputStream);
                        } finally {
                            IOUtils.closeQuietly(stream);
                            binary.dispose();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PathNotFoundException("I/O on file : " + data.getPath(),e);
            } catch (RepositoryException e) {
                throw new PathNotFoundException("unable to get outputStream of : " + data.getPath(),e);
            } finally {
                IOUtils.closeQuietly(outputStream);
            }
        } else if (nodeType.isNodeType("jnt:folder")) {
            try {
                getFile(data.getPath()).createFolder();
            } catch (FileSystemException e) {
                throw new PathNotFoundException(e);
            }
        }
    } catch (NoSuchNodeTypeException e) {
        throw new PathNotFoundException(e);
    }
}

Provider factories

You can create a configurable external data source that will be mounted and unmounted on demand by the server administrator. You do so by adding a bean implementing the ProviderFactory interface, which is responsible for mounting the provider.

The factory must be associated with a node type that inherits from jnt:mountPoint and defines all the properties required to correctly initialize the Data Source. Then, the mountProvider method instantiates the External Data Provider instance based on a prototype and initializes the Data Source. Here is the definition of a mount point from the VFS Provider:

[jnt:vfsMountPoint] > jnt:mountPoint
 - j:rootPath (string) nofulltext

And the associated code, which creates the provider by mounting the VFS URL passed in j:rootPath:

public JCRStoreProvider mountProvider(JCRNodeWrapper mountPoint) throws RepositoryException {
    ExternalContentStoreProvider provider = (ExternalContentStoreProvider) SpringContextSingleton.getBean("ExternalStoreProviderPrototype");
    provider.setKey(mountPoint.getIdentifier());
    provider.setMountPoint(mountPoint.getPath());

    VFSDataSource dataSource = new VFSDataSource();
    dataSource.setRoot(mountPoint.getProperty("j:rootPath").getString());
    provider.setDataSource(dataSource);
    provider.setDynamicallyMounted(true);
    provider.setSessionFactory(JCRSessionFactory.getInstance());
    try {
        provider.start();
    } catch (JahiaInitializationException e) {
        throw new RepositoryException(e);
    }
    return provider;
}

Once the provider factory is declared, the “w” button in document manager will display the new node type, allowing the administrator to create a new mount point with this Data Source.

External data ACL implementation

Since version 4.0 of the external provider, ACLs are supported on external content. They can either be provided by the external provider or be managed entirely by Jahia.

Default behavior

By default, you can use the Jahia ACL directly on external nodes the same way as other nodes. The ACL will be stored as extensions.

ACL or privileges?

Some external sources can provide all the ACLs for a resource, while others can only provide the allowed operations, or privileges, on it. Depending on which case applies, you can implement either ACL or privileges support for the provider.

ACL from the provider

You can let the provider get the ACL from the external source. To do so, the data source has to implement ExternalDataSource.AccessControllable and set an ExternalDataAcl on the ExternalData objects it returns.

ExternalDataAcl

ExternalDataAcl contains a list of roles, granted or denied, each associated with a user or group. The provider can supply new roles with custom permissions, as Jahia does not export the roles and has no means to save modifications on a running server. These roles should not be edited in the role manager (a future version will make the role editor read-only for such roles).

First create an ExternalDataAcl.

new ExternalDataAcl()

Then fill it with access control entries:

ExternalDataAcl.addAce(type, principal, roles)

type is one of:

ExternalDataAce.Type.GRANT or ExternalDataAce.Type.DENY

principal is a user or a group, in the format u:userKey or g:groupKey.
roles is a list of role names.

Note: The ExternalDataSource.AccessControllable interface has been updated and the method String[] getPrivilegesNames(String username, String path) has been removed.

ExternalData

To support the ACL you have to set the ExternalDataAcl in the ExternalData using the method ExternalData.setExternalDataAcl(ExternalDataAcl acl).

For example:

// acl
ExternalDataAcl userNodeAcl = new ExternalDataAcl();
userNodeAcl.addAce(ExternalDataAce.Type.GRANT, "u:" + user.getUsername(), Collections.singleton("owner"));
userExternalData.setExternalDataAcl(userNodeAcl);

Note that ACLs are read only on an external node when they are provided by the DataSource.

Privileges support

If your data source only provides allowed actions for a resource, implement ExternalDataSource.SupportPrivileges on your data source. Its getPrivilegesNames method returns, for a given user and path, the list of Jahia privilege names as strings. A Jahia privilege name is the concatenation of a privilege from javax.jcr.security.Privilege and, if necessary, the workspace where it applies. They are structured like this:

 privilegeName[_(live | default)]

For example:

Privilege.JCR_READ
Privilege.JCR_ADD_CHILD_NODES + "_default"

Note that the roles tab in Content Editor or managers is not accurate because it displays Jahia inherited roles. These roles are meaningless for the external source because they are not used. Also, as with the ACL implementation, the roles tab is read-only, so no operation can be done on it.

Disabling

By default, ACL support on external content is enabled. To disable it completely, set the aclSupport property to false on your instance of ExternalContentStoreProvider. Note that Jahia removes some permissions to keep behavior consistent in Content Editor when the ACL or the content cannot be edited.

Content Editor

If the external source is not writable or extendable and has no ACL support, the content is not editable and the edit menu entry is not available. If the external source does not support ACL but can be overridden or is writable, the roles panel displays as read-only in Content Editor.

Warnings

  • When the external data contains ACLs, you cannot update the ACL on the corresponding node (the roles tab in Content Editor is read-only).
  • If a module defines roles, they are imported each time the module is deployed. Do not edit them from the settings panel, because you will lose all your changes.

Comparison of data management between JCR, EDP and User Provider

This summary table lists, for key data types, the differences between their management within the JCR, which is unrestricted, and the available actions and expected results when they are managed within an External Data Provider or a Users provider connected to Jahia.

Within the JCR | What you can do with the External Data Provider | What you can do with the Users provider
Identifier | Can provide its own UUID or let the EDP generate one | Users and groups are identified by their name only. A UUID is generated by EDP for every user, group, and member node.
Property types | All types, i18n, multiple supported | No i18n or binary properties. Multiple values supported.
Reference properties | Internal or external references supported | No reference properties for users and groups. Group members internally references users and groups from the same provider.
Search - JCR-SQL2 queries | QOM model passed to EDP, up to implementation to parse and execute the query as it can (ISO-37). QueryHelper is provided to help parsing of simple query, but does not support all type of constraints (1 type of boolean operators, only = comparison). Queries results are aggregated sequentially for each provider, so global ordering may not be consistent. | QOM is interpreted and query is transformed into a simple key-value pair criteria, on users and groups nodes only. Only simple AND or OR search can be done, no combination of both (fix on and/or selection: QA-9046). Complex queries cannot be implemented in users provider. Cannot do query on member nodes. Ordering not supported (MAE-40)
Listeners and rules | Not supported yet |
Publication | Publication not supported. Content is visible in default and live. This applies on external nodes, but also on extensions. | Same as EDP, but it can be confusing that extensions nodes (content stored under users and groups) do not support publication.
ACL and permissions | Can set ACLs on any node as an extension (stored in JR) if the provider does not give its own ACL | User nodes have Extensions node (content added under the users and the groups) can have custom ACLs set by the user
Write operations | Can define a writeable provider | Not supported

 

Sending events to Jahia

Goals

In some cases, it can be useful to send information to Jahia that an item, mounted in an external provider, has been modified externally. This allows you to execute listeners in Jahia that can trigger indexation and cache flush. This event is sent by the external system to Jahia through a specific REST API.

EDP-event1.PNG

Listening to events

By default, event listeners do not receive events from the API. A listener must implement the ApiEventListener interface to receive them like any other event.
The EventWrapper class provides a method, isApiEvent(), that can be used to check whether an event comes from the API.

Sending events

The REST entry point is a URL of the form:

http://<server>/<context>/modules/external-provider/events/<provider-key>

To find the <provider-key> you can go to Administration>Server>Modules and Extensions>External providers. Your mount point should display in the table and the first column contains the <provider-key> of your provider.

This URL accepts a JSON formatted input defining the list of events to trigger in Jahia.

Events format

The events are a JSON serialization of javax.jcr.observation.Event (see docs.adobe.com/docs/en/spec/javax.jcr/javadocs/jcr-2.0/javax/jcr/observation/Event.html) and contain the following entries:

{
 type: string
 path: string
 identifier: string
 userID: string
 info: object
 date : string
}

 

The type of event is one of the values defined in javax.jcr.observation.Event:

  • NODE_ADDED
  • NODE_REMOVED
  • PROPERTY_ADDED
  • PROPERTY_REMOVED
  • PROPERTY_CHANGED
  • NODE_MOVED

If not specified, the event type is “NODE_ADDED” by default. 

  • Path is mandatory and should point to the node or property on which the event happened. Note that path is the local path in the external system.
  • Identifier is not mandatory. It’s the ID as known by the external system.
  • UserID is the username of the user who originally triggered the event.
  • Info contains optional data related to the event. For the “node moved” event, it contains the source and target of the move. 
  • Date is the timestamp of the event, in an ISO 8601 compatible format. If not specified, the default value is the current time.

 

For example:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '[
  {
    "path":"/deadpool.jpg",
    "userID":"root"
  },
  {
    "type":"NODE_REMOVED",
    "path":"/oldDeadpool.jpg",
    "userID":"root"
  }]' \
http://localhost:8080/modules/external-provider/events/2dbc3549-15ff-4b08-92b9-94fc78beeba1

Passing external data

In addition, the “info” field can contain an externalData object, which is a serialized version of the ExternalData object. This data is loaded into the session so that listeners can access the external node without requesting the data from the provider again. This avoids a complete round trip to the external data provider. For example, sending events without data could give this sequence:

EDP-event2.PNG

If the externalData is provided for both events, this would lead instead to this sequence:

EDP-event3.PNG

 

External data format

External data can contain the following entries:

{
 "id": string,
 "path":string,
 "type":string,
 "properties": object,
 "i18nProperties":object,
 "binaryProperties":object,
 "mixin": string[]
}

The fields id, path and type are mandatory.

  • properties is an object with the property names as keys and an array of serialized values as value (an array of one value for properties that are not multi-valued)
  • i18nProperties is an object with the language as key, and a structure like properties as value
  • binaryProperties is an object with the property names as keys and an array of base64-encoded values as value

For example:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '[
   {
    "path":"/deathstroke.jpg",
    "userID":"root",
    "info": {
      "externalData":{
        "id":"/deathstroke.jpg",
        "path":"/deathstroke.jpg",
        "type":"jnt:file",
        "properties": {
          "jcr:created": ["2017-10-10T10:50:43.000+02:00"],
          "jcr:lastModified": ["2017-10-10T10:50:43.000+02:00"]
        },
        "i18nProperties":{
          "en":{
            "jcr:title":["my title"]
          },
          "fr":{
            "jcr:title":["my title"]
          }
        },
        "mixin":["jmix:image"]
      }
    }
  },
  {
    "path":"/deathstroke.jpg/jcr:content",
    "userID":"root",
    "info": {
      "externalData":{
        "id":"/deathstroke.jpg/jcr:content",
        "path":"/deathstroke.jpg/jcr:content",
        "type":"jnt:resource",
        "binaryProperties": {
          "j:extractedText":["ZXh0cmFjdCBjb250ZW50IGJ1YmJsZWd1bQ=="]
        }
      }
    }
  }
]' \
http://localhost:8080/modules/external-provider/events/2dbc3549-15ff-4b08-92b9-94fc78beeba1
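For callers that build the payload in Java rather than with curl, the sketch below shows one way to base64-encode a binary property and prepare the POST request with the JDK HTTP client (Java 11 or later). EventPayloadSketch and its methods are hypothetical, the JSON is assembled by hand only to keep the example dependency-free, and the request is built but not sent. Passing the API key described under REST API Security is not shown.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative only: prepares an event carrying externalData with one base64-encoded binary property. */
final class EventPayloadSketch {
    /** Encodes text as base64, the encoding expected for binaryProperties values. */
    static String encodeBinary(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Minimal JSON string escaping for backslashes and double quotes. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a one-event JSON array for a jnt:resource node; type is omitted, so NODE_ADDED applies. */
    static String resourceEventJson(String path, String extractedText) {
        String p = escape(path);
        return "[{\"path\":\"" + p + "\",\"userID\":\"root\",\"info\":{\"externalData\":{"
                + "\"id\":\"" + p + "\",\"path\":\"" + p + "\",\"type\":\"jnt:resource\","
                + "\"binaryProperties\":{\"j:extractedText\":[\"" + encodeBinary(extractedText) + "\"]}}}}]";
    }

    /** Prepares the POST request; sending it is left to the caller. */
    static HttpRequest buildRequest(String eventsUrl, String json) {
        return HttpRequest.newBuilder(URI.create(eventsUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String json = resourceEventJson("/deathstroke.jpg/jcr:content", "extract content bubblegum");
        HttpRequest request = buildRequest(
                "http://localhost:8080/modules/external-provider/events/2dbc3549-15ff-4b08-92b9-94fc78beeba1", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```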

REST API Security

By default, access to the REST API is not allowed, so any request is denied. You must provide an API key. An apiKey is not generated automatically: you have to create it manually and declare it in a new configuration file named org.jahia.modules.api.external_provider.event.cfg, located in /digital-factory-data/karaf/etc.

Key declaration

<name>.event.api.key=<apiKey>
  • <apiKey>
    the apiKey
  • <name>
    the name you want, it's just used to group the other config options for the same key

For example:

global.event.api.key=42267ebc-f8d0-4f4d-ac98-21fb8eeda653

Restrict apiKey to some providers

By default an apiKey is used to protect all the providers, but you can restrict the providers allowed by an apiKey.

<name>.event.api.providers=<providerKeys>
  • <apiKey>
    the apiKey
  • <name>
    the name used to declare the key
  • <providerKeys>
    comma separated list of providerKeys

For example:

providers.event.api.key=42267ebc-f8d0-4f4d-ac98-21fb8eeda653
providers.event.api.providers=provider1,provider2,provider3