Integrating external data sources

November 11, 2022

1 Introduction

The External Data Provider is a module that provides an API for integrating external systems as content providers, in the same way as the JCR.

Integration is done by implementing an External Data Source. The data source only needs to handle the connection to the external system and the data retrieval, while the External Data Provider does all the mapping work that makes the external content appear in the JCR tree.

Every data source must provide content (reading). It can also provide search capabilities and write access to create or update content. A data source can also be enhanceable, meaning that the ‘raw’ content it provides can be enhanced by Digital Experience Manager content (for example, adding comments to an object provided by an External Data Provider).

2 How it works

2.1 Specify your mapping

Your external content has to be mapped as nodes inside Digital Experience Manager so that it can be used like regular nodes (edit/copy/paste/reference etc.). This means that your external provider module must provide a CND definition file for each type of content you plan to map into Digital Experience Manager.

As a simple example, you can map a database table to a node type, defining each column as a JCR property.

Then, you have to define a tree structure for your objects. As they will appear in the Digital Experience Manager repository, you’ll have to decide for each entry what will be its parent and children.

It is very important that each node has a unique path: you must be able to find an object from a path, and also to know the path from an object. The node returned for a path must always be the same and must not depend on contextual information. If your nodes depend on the context (for example, the current user), you’ll need to use different paths. In order to create a correct node hierarchy, you are free to add “virtual nodes” that act as containers to organize your data.

Optionally, you can define a unique identifier for every node. The External Data Provider will map this identifier to a JCR compatible UUID if needed, so that it can be used in Digital Experience Manager as any other node.

2.2 Declaring your Data Source

This external data is accessed through a JCR provider declared in Spring, where you set information such as the provider key, the mount point and the data source implementation.

<bean id="TMDBProvider" class="org.jahia.modules.external.ExternalContentStoreProvider"
      parent="AbstractJCRStoreProvider">
    <property name="key" value="TMDBProvider"/>
    <property name="mountPoint" value="/sites/movies/contents/tmdb"/>
    <property name="externalProviderInitializerService" ref="ExternalProviderInitializerService"/>
    <property name="extendableTypes">
        <list>
            <value>nt:base</value>
        </list>
    </property>
    <property name="dataSource" ref="TMDBDataSource"/>
</bean>

<bean name="TMDBDataSource" class="org.jahia.modules.tmdbprovider.TMDBDataSource" init-method="start">
    <property name="cacheProvider" ref="ehCacheProvider"/>
    <property name="apiKeyValue" value="${com.jahia.tmdb.apiKeyValue}"/>
</bean>

This provider then accesses the underlying data source, which implements ExternalDataSource and, if needed, other optional interfaces to search or save the data.

Your implementation of ExternalDataSource must also list the node types it handles, so that Digital Experience Manager knows which node types this data source is able to provide. This can be done programmatically or inside your Spring file. Here is an example of a declarative node type declaration from the ModulesDataSource:

<bean id="ModulesDataSourcePrototype" class="org.jahia.modules.external.modules.ModulesDataSource"
      scope="prototype">
    <property name="supportedNodeTypes">
        <set>
            <value>jnt:cssFolder</value>
            <value>jnt:cssFile</value>
            <value>jnt:javascriptFolder</value>
            <value>jnt:javascriptFile</value>
        </set>
    </property>
</bean>

3 Implementation

3.1 Providing/Reading Content

The main step in defining a new provider is to implement the ExternalDataSource interface provided by the external-provider module (org.jahia.modules.external.ExternalDataSource).

This interface requires you to implement 7 methods in order to mount and browse your data as if it were part of the Digital Experience Manager content tree.

Here is the list of those methods:

  • getItemByPath
  • getChildren
  • getItemByIdentifier
  • itemExists
  • getSupportedNodeTypes
  • isSupportsUuid
  • isSupportsHierarchicalIdentifiers

The first method, getItemByPath(), is the entry point for the external data. It has to return an ExternalData node for all valid paths, including the root path (/). ExternalData is a simple Java object that represents an entry of your external data. It contains the id, the path, the node types and the properties encoded as strings (or as Binary objects for binary properties).

The getChildren() method also needs to be implemented for all valid paths. It has to return the names of all sub-nodes, as you want them to appear in the Digital Experience Manager repository. For example, if you map a table or the result of a SQL query, this is the method that will return all the results. Note that it is not required that all valid nodes are listed here. If they don’t appear here, you won’t see them in the repository tree, but you may still be able to access them directly by path or through a search. This is especially useful if you have thousands of nodes at the same level.

These two methods reflect the hierarchy you expose to Digital Experience Manager.

The getItemByIdentifier() method returns the same ExternalData node, but based on the internal identifier you want to use.

The getSupportedNodeTypes() method simply returns the list of node types that your data source may contain.

isSupportsUuid() tells the External Data Provider that your external data has identifiers in the UUID format. In that case, Digital Experience Manager does not create its own identifiers and does not maintain a mapping between its UUIDs and your identifiers. In most cases, return false.

isSupportsHierarchicalIdentifiers() tells whether your identifier actually looks like the path of the node, and allows the provider to optimize some operations such as a move, where your identifier will be “updated”. This is useful, for example, in a file system provider, if you want to use the path of the file as its identifier.

itemExists() simply tests whether the item at the given path exists.
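The read-side contract described above can be illustrated with a small in-memory sketch. It does not use the Jahia API: the nested Item class is a hypothetical stand-in for ExternalData, the node type jnt:movie and the identifiers are invented for illustration, and IllegalArgumentException stands in for whatever exception a real data source would throw for an unknown path.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the read methods; not the real ExternalDataSource interface.
final class InMemorySourceSketch {

    // Hypothetical stand-in for ExternalData (id, path and type only).
    static final class Item {
        final String id;
        final String path;
        final String type;

        Item(String id, String path, String type) {
            this.id = id;
            this.path = path;
            this.type = type;
        }
    }

    private final Map<String, Item> byPath = new LinkedHashMap<>();
    private final Map<String, Item> byId = new LinkedHashMap<>();

    InMemorySourceSketch() {
        add("root", "/", "jnt:contentFolder");          // the root path must resolve too
        add("movies", "/movies", "jnt:contentFolder");  // "virtual" container node
        add("m-603", "/movies/603", "jnt:movie");       // jnt:movie is an invented type
        add("m-604", "/movies/604", "jnt:movie");
    }

    private void add(String id, String path, String type) {
        Item item = new Item(id, path, type);
        byPath.put(path, item);
        byId.put(id, item);
    }

    boolean itemExists(String path) {
        return byPath.containsKey(path);
    }

    // One path always maps to the same item, independent of any context.
    Item getItemByPath(String path) {
        Item item = byPath.get(path);
        if (item == null) {
            throw new IllegalArgumentException("No item at " + path);
        }
        return item;
    }

    Item getItemByIdentifier(String id) {
        Item item = byId.get(id);
        if (item == null) {
            throw new IllegalArgumentException("No item with id " + id);
        }
        return item;
    }

    // Returns the names (not the paths) of the direct children of a path.
    List<String> getChildren(String path) {
        String prefix = path.equals("/") ? "/" : path + "/";
        List<String> names = new ArrayList<>();
        for (String candidate : byPath.keySet()) {
            if (candidate.startsWith(prefix) && candidate.length() > prefix.length()) {
                String rest = candidate.substring(prefix.length());
                if (!rest.contains("/")) {
                    names.add(rest);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        InMemorySourceSketch source = new InMemorySourceSketch();
        System.out.println(source.getChildren("/movies"));
        System.out.println(source.getItemByPath("/movies/603").type);
    }
}
```

Keeping two indexes (by path and by identifier) is one simple way to guarantee that both lookups return the same object for the same entry.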

3.2 Identifier Mapping

Every time we read an external node for the first time we generate a unique identifier for it inside Digital Experience Manager. Those mapped identifiers are stored inside a table called jahia_external_mapping.

This table maps the internal id to a pair made of the provider key and the external id returned by the ExternalData.getIdentifier method.
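As a rough illustration of this mapping, the sketch below keeps the association in memory instead of in the jahia_external_mapping table. The class and method names are hypothetical, and the real module persists the mapping in the database; only the idea (one generated identifier per provider key and external id pair, created on first read) comes from the text above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// In-memory stand-in for the identifier mapping; names are illustrative only.
final class IdMappingSketch {
    // (providerKey, externalId) -> internal UUID
    private final Map<List<String>, String> internalByExternal = new HashMap<>();
    // internal UUID -> (providerKey, externalId)
    private final Map<String, List<String>> externalByInternal = new HashMap<>();

    // Returns the internal id for the pair, generating it the first time the node is read.
    String internalId(String providerKey, String externalId) {
        List<String> key = List.of(providerKey, externalId);
        return internalByExternal.computeIfAbsent(key, k -> {
            String uuid = UUID.randomUUID().toString();
            externalByInternal.put(uuid, k);
            return uuid;
        });
    }

    // Reverse lookup: returns [providerKey, externalId], or null if unknown.
    List<String> externalId(String internalUuid) {
        return externalByInternal.get(internalUuid);
    }

    public static void main(String[] args) {
        IdMappingSketch mapping = new IdMappingSketch();
        String first = mapping.internalId("TMDBProvider", "movie-603");
        String second = mapping.internalId("TMDBProvider", "movie-603");
        System.out.println(first.equals(second)); // same node, same internal id
    }
}
```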

3.3 ExternalData

The External Data Source is responsible for mapping its data into ExternalData objects. ExternalData provides access to the properties of your content. Those properties have to be converted to one of two types: String or Binary. String properties can be internationalized or not, as declared in the CND file.

3.4 Lazy Loading

If your external provider accesses data that is expensive to read (performance or memory wise), you can implement the ExternalDataSource.LazyProperty interface and fill the lazyProperties, lazyI18nProperties and lazyBinaryProperties sets inside ExternalData. If somebody tries to get a property that is not in the properties map of ExternalData but is in one of those sets, the system will call one of these methods to get the values:

  • getBinaryPropertyValues
  • getI18nPropertyValues
  • getPropertyValues

For example, the ModulesDataSource retrieves the source code as a lazy property, so the source code is read from disk only when it is displayed, not when the file is listed in the tree for exploration.

You have to decide which type of loading you want to implement. For example, with a database it may be more efficient to read all the data at once (except binaries), depending on the number of rows and columns.
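The lazy-loading idea can be sketched without the Jahia classes as follows. LazyItemSketch is a hypothetical stand-in for ExternalData, and the loader function plays the role of the getPropertyValues callback; caching the loaded value in the properties map is a choice made for this sketch, not a documented behavior.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Stand-in for an item with eager and lazy properties; not the real ExternalData.
final class LazyItemSketch {
    private final Map<String, String[]> properties = new HashMap<>();
    private final Set<String> lazyProperties = new HashSet<>();
    private final Function<String, String[]> loader;
    int loadCount = 0; // number of times the expensive loader was called

    LazyItemSketch(Function<String, String[]> loader) {
        this.loader = loader;
    }

    void setProperty(String name, String... values) {
        properties.put(name, values);
    }

    // Declares a property whose value is fetched only when first requested.
    void declareLazy(String name) {
        lazyProperties.add(name);
    }

    // Returns eager values directly; loads lazy values on first access.
    String[] get(String name) {
        String[] values = properties.get(name);
        if (values == null && lazyProperties.remove(name)) {
            values = loader.apply(name);
            loadCount++;
            properties.put(name, values); // keep the loaded value for later reads
        }
        return values;
    }

    public static void main(String[] args) {
        LazyItemSketch file = new LazyItemSketch(name -> new String[] {"contents of " + name});
        file.setProperty("jcr:title", "style.css");
        file.declareLazy("sourceCode");
        System.out.println(file.get("jcr:title")[0]); // no load needed
        System.out.println(file.loadCount);
        System.out.println(file.get("sourceCode")[0]); // triggers the loader
        System.out.println(file.loadCount);
    }
}
```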

3.5 Searching Content

3.5.1 Basic implementation

This capability requires you to implement the ExternalDataSource.Searchable interface, which defines only one method:

  • search (ExternalQuery query)

The query parameter is an ExternalQuery. More information is available here:

http://jackrabbit.apache.org/api/1.4/org/apache/jackrabbit/core/query/jsr283/qom/QueryObjectModel.html

Your method should be able to handle a list of constraints from the query (AND, OR, NOT, DESCENDANTNODE, CHILDNODE, etc.).

You do not have to handle everything if it does not make sense in your case.

The QueryHelper class provides some helpful methods to parse the constraints:

  • getNodeType
  • getRootPath
  • includeSubChild
  • getSimpleAndConstraints
  • getSimpleOrConstraints

The getSimpleAndConstraints method returns a map of the properties and their expected values from the AND constraints in the query.

The getSimpleOrConstraints method returns a map of the properties and their expected values from the OR constraints in the query.

From the constraints, you build a query that is meaningful for your external provider (for example, for an SQL database, map those constraints to ‘AND’ conditions in the WHERE clause of your request).

Queries are expressed using the JCR-SQL2 language definition.
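As a hedged illustration of that last step, the sketch below turns a map of simple AND constraints into a parameterized SQL statement. It assumes the constraints have already been extracted as a map of property names to string values; the table name, the property-to-column mapping and the class name are all hypothetical, and the table name is treated as trusted input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds "SELECT ... WHERE a = ? AND b = ?" from simple equality constraints.
final class WhereClauseSketch {
    final String sql;
    final List<String> parameters;

    private WhereClauseSketch(String sql, List<String> parameters) {
        this.sql = sql;
        this.parameters = parameters;
    }

    // columnByProperty acts as a whitelist: only mapped properties can be queried.
    static WhereClauseSketch fromAndConstraints(String table,
                                                Map<String, String> columnByProperty,
                                                Map<String, String> constraints) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        List<String> parameters = new ArrayList<>();
        String separator = " WHERE ";
        for (Map.Entry<String, String> constraint : constraints.entrySet()) {
            String column = columnByProperty.get(constraint.getKey());
            if (column == null) {
                throw new IllegalArgumentException("Unsupported property: " + constraint.getKey());
            }
            // Values are bound as parameters, never concatenated into the SQL text.
            sql.append(separator).append(column).append(" = ?");
            parameters.add(constraint.getValue());
            separator = " AND ";
        }
        return new WhereClauseSketch(sql.toString(), parameters);
    }

    public static void main(String[] args) {
        Map<String, String> columns = Map.of("jcr:title", "title", "firstclass_seats", "firstclass_seats");
        Map<String, String> constraints = new LinkedHashMap<>();
        constraints.put("jcr:title", "Example Air");
        constraints.put("firstclass_seats", "12");
        WhereClauseSketch query = fromAndConstraints("airline", columns, constraints);
        System.out.println(query.sql);
        System.out.println(query.parameters);
    }
}
```

The resulting SQL text and parameter list can then be passed to a prepared statement in the external system.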

3.5.2 Offset and Limit support

The external data queries support offset and limit query parameters. 

When there are multiple providers, the results are obtained by querying each provider in no specific order, but the order stays the same once the providers have been mounted.

This means that, for the same query, limit and offset can be used to paginate the results.
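The pagination behavior described above can be pictured with the following sketch. It assumes that each provider returns its results as a list and that providers are always visited in the same order; the class and method names are illustrative and are not part of the module.

```java
import java.util.ArrayList;
import java.util.List;

// Applies offset and limit across the concatenated results of several providers.
final class PagingSketch {

    static List<String> page(List<List<String>> resultsPerProvider, long offset, long limit) {
        List<String> page = new ArrayList<>();
        long skipped = 0;
        for (List<String> results : resultsPerProvider) {
            for (String result : results) {
                if (skipped < offset) {
                    skipped++;          // still consuming the offset
                    continue;
                }
                if (page.size() >= limit) {
                    return page;        // limit reached, stop early
                }
                page.add(result);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<List<String>> providers = List.of(
                List.of("a", "b", "c"),   // results of the first provider
                List.of("d", "e"));       // results of the second provider
        System.out.println(page(providers, 0, 2));
        System.out.println(page(providers, 2, 2));
        System.out.println(page(providers, 4, 2));
    }
}
```

Because the provider order is stable, consecutive pages neither overlap nor skip results, even when a page spans two providers.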

3.5.3 Count

You can provide your own count capability by implementing ExternalDataSource.SupportCount and the following method:

  • count(ExternalQuery query)

This should return the number of results for the provided query. The query can be parsed in the same way as in the search method.

Whether there is one provider or several, count() always returns a single row containing the number of rows matching the query.

3.6 Enhancing/Merging External Content with Digital Experience Manager Content

Digital Experience Manager allows you to extend your external content with some of its own mixins, or to override some properties of your nodes from Digital Experience Manager. This allows your definitions to mix, for example, external data and data defined in Digital Experience Manager.

In your Spring file you can declare two things: which of your nodes can be extended with additional mixins and properties, and which properties from your definitions can be overridden or merged. Here is how to do that:

<property name="extendableTypes">
    <list>
        <value>nt:base</value>
    </list>
</property>

This is saying that all your types are extendable, but you can limit that to only certain nodes by listing their definitions. Any mixin can be added on nodes that are extendable.

<property name="overridableItems">
    <list>
        <value>jtestnt:directory.*</value>
        <value>jtestnt:airline.firstclass_seats</value>
    </list>
</property>

This one is saying that all properties from jtestnt:directory can be overridden inside Digital Experience Manager. The next one is saying that only the property firstclass_seats from airline definition can be overridden.
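To make the two pattern forms concrete, here is a minimal sketch of how such entries could be matched. It is an assumption for illustration only: it handles exactly the two forms shown above (type.* and type.property) with exact node type names, and does not claim to reproduce how the provider resolves inheritance or other wildcard forms.

```java
import java.util.List;

// Matches "nodeType.property" and "nodeType.*" entries; illustrative only.
final class OverridableSketch {

    static boolean isOverridable(List<String> overridableItems, String nodeType, String property) {
        for (String item : overridableItems) {
            int dot = item.lastIndexOf('.');
            if (dot < 0) {
                continue; // not in the "type.property" form, ignored by this sketch
            }
            String type = item.substring(0, dot);
            String name = item.substring(dot + 1);
            if (type.equals(nodeType) && (name.equals("*") || name.equals(property))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> items = List.of("jtestnt:directory.*", "jtestnt:airline.firstclass_seats");
        System.out.println(isOverridable(items, "jtestnt:directory", "jcr:title"));
        System.out.println(isOverridable(items, "jtestnt:airline", "firstclass_seats"));
        System.out.println(isOverridable(items, "jtestnt:airline", "business_seats")); // hypothetical property
    }
}
```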

In regular usage, those nodes are only available to end users and editors if the external provider is mounted. If you unmount your external provider, that data is only accessible from the Jahia tools, for administrative purposes.

Like all content coming from the external provider, this content is not subject to publication. Any extension is immediately visible in both the default and live workspaces.

3.7 Writing/Updating Content

The external provider can be writable. This means that you will be able to create new content, or update existing content, from within Digital Experience Manager.

This capability requires you to implement the ExternalDataSource.Writable interface, which defines 4 methods:

  • move
  • order
  • removeItemByPath
  • saveItem

Your provider should at least implement saveItem. saveItem will receive ExternalData with all modified properties. Note that if you are using lazy properties, modified properties will be moved from the set of lazy properties to the map of properties. Removed properties will be removed from both properties map and lazy properties set.

If content can be deleted, then you should implement removeItemByPath.

The other two methods (move and order) are optional behaviors that need to be implemented only if your provider supports them. For example, the VFSDataSource does not implement order, as files are not ordered on a filesystem, but it does implement move.

Here is an example of how to access binary data from ExternalDataSource and save them into the filesystem using VFS API (example from the VFSDataSource).

public void saveItem(ExternalData data) throws RepositoryException {
    try {
        ExtendedNodeType nodeType = NodeTypeRegistry.getInstance().getNodeType(data.getType());
        if (nodeType.isNodeType(Constants.NT_RESOURCE)) {
            OutputStream outputStream = null;
            try {
                final Binary[] binaries = data.getBinaryProperties().get(Constants.JCR_DATA);
                if (binaries.length > 0) {
                    outputStream = getFile(data.getPath().substring(0, data.getPath().indexOf("/" + Constants.JCR_CONTENT))).getContent().getOutputStream();
                    for (Binary binary : binaries) {
                        InputStream stream = null;
                        try {
                            stream = binary.getStream();
                            IOUtils.copy(stream, outputStream);
                        } finally {
                            IOUtils.closeQuietly(stream);
                            binary.dispose();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PathNotFoundException("I/O on file : " + data.getPath(),e);
            } catch (RepositoryException e) {
                throw new PathNotFoundException("unable to get outputStream of : " + data.getPath(),e);
            } finally {
                IOUtils.closeQuietly(outputStream);
            }
        } else if (nodeType.isNodeType("jnt:folder")) {
            try {
                getFile(data.getPath()).createFolder();
            } catch (FileSystemException e) {
                throw new PathNotFoundException(e);
            }
        }
    } catch (NoSuchNodeTypeException e) {
        throw new PathNotFoundException(e);
    }
}

3.8 Provider factories

It is possible to create a configurable external data source that will be mounted and unmounted on demand by the server administrator. Instead of declaring a mount point in the Spring declaration, you can add a bean implementing the ProviderFactory interface, which will be responsible for mounting the provider.

The factory needs to be associated with a node type that inherits from jnt:mountPoint and defines all the properties required to correctly initialize the data source. The mountProvider method then instantiates the External Data Provider based on a prototype and initializes the data source. Here is the definition of a mount point from the VFS Provider:

[jnt:vfsMountPoint] > jnt:mountPoint
 - j:rootPath (string) nofulltext

And the associated code, which creates the provider by mounting the VFS URL passed in j:rootPath:

public JCRStoreProvider mountProvider(JCRNodeWrapper mountPoint) throws RepositoryException {
    ExternalContentStoreProvider provider = (ExternalContentStoreProvider) SpringContextSingleton.getBean("ExternalStoreProviderPrototype");
    provider.setKey(mountPoint.getIdentifier());
    provider.setMountPoint(mountPoint.getPath());

    VFSDataSource dataSource = new VFSDataSource();
    dataSource.setRoot(mountPoint.getProperty("j:rootPath").getString());
    provider.setDataSource(dataSource);
    provider.setDynamicallyMounted(true);
    provider.setSessionFactory(JCRSessionFactory.getInstance());
    try {
        provider.start();
    } catch (JahiaInitializationException e) {
        throw new RepositoryException(e);
    }
    return provider;
}

Once the provider factory is declared, the “w” button in document manager will display the new node type, allowing the administrator to create a new mount point with this Data Source.

4 External Data ACL implementation

Since revision 4.0 of the external provider, ACLs are supported on external content. They can either be provided by the external provider or be managed entirely by Digital Experience Manager.

4.1 Default behavior

By default, you can use the DX ACL directly on external nodes the same way as other nodes. The ACL will be stored as extensions.

4.2 ACL or privileges?

Some external sources can provide all the ACLs for a resource, while others can only provide the allowed operations, or privileges, on it. Depending on which applies, you can implement either ACL or privileges support for the provider.

4.3 ACL from the provider

You can let the provider get the ACL from the external source.
To do so, the data source has to implement ExternalDataSource.AccessControllable and supply an ExternalDataAcl, as described in the following sections.

4.3.1 ExternalDataAcl

ExternalDataAcl contains a list of roles, granted or denied, associated with a user or a group.

The provider can provide new roles with custom permissions. As we do not export these roles and have no means to save modifications on a running server, they should not be edited in the role manager (in a later version, we will make the role editor read-only for such roles).

First, create an ExternalDataAcl:

new ExternalDataAcl()

Then fill it with access control entries:

ExternalDataAcl.addAce(type, principal, roles)

  • type is one of ExternalDataAce.Type.GRANT or ExternalDataAce.Type.DENY
  • principal is a group or a user, in the format u:userKey or g:groupKey
  • roles is a list of role names

Note that the ExternalDataSource.AccessControllable interface has been updated, the method String[] getPrivilegesNames(String username, String path) has been removed.

4.3.2 ExternalData

To support the ACL, you have to set the ExternalDataAcl on the ExternalData using the method:
ExternalData.setExternalDataAcl(ExternalDataAcl acl)

Example:

// acl
ExternalDataAcl userNodeAcl = new ExternalDataAcl();
userNodeAcl.addAce(ExternalDataAce.Type.GRANT, "u:" + user.getUsername(), Collections.singleton("owner"));
userExternalData.setExternalDataAcl(userNodeAcl);

Note that ACLs are read-only on an external node when they are provided by the data source.

4.4 Privileges support

If your data source only provides the allowed actions for a resource, you have to implement ExternalDataSource.SupportPrivileges on your data source. You will have to implement the method getPrivilegesNames, which returns, for a user and a path, the list of DX privilege names as strings. A DX privilege name is the concatenation of a privilege from javax.jcr.security.Privilege and, if necessary, the workspace where it applies. Privilege names are structured like this:

 privilegeName[_(live | default)]

Examples:

Privilege.JCR_READ
Privilege.JCR_ADD_CHILD_NODES + "_default"

Note that the roles tabs in the edit engine or managers are not accurate, because they display DX inherited roles. These roles are meaningless for the external source, as they are not used. Also, as with the ACL implementation, the roles tab is read-only, so no operation can be done on it.
 

4.5 Disabling

By default, ACL support on external content is enabled. To disable it completely, set the property aclSupport to false in your instance of ExternalContentStoreProvider.
Note that some permissions are removed in order to keep the edit engine behavior consistent when the ACL or the content cannot be edited.

4.6 Edit engine

If the external source is not writable or extendable and has no ACL support, the content will not be editable and the edit menu entry will not be available.

If the external source does not support ACL but can be overridden or is writable, the roles panel will be displayed as read-only within the edit engine.

4.7 Warnings

  • When the external data contains ACLs, you cannot update the ACL on the corresponding node (the roles tab in the edit engine is read-only).
  • If a module defines roles, they are imported each time the module is deployed, so you should not edit them from the settings panel: you would lose all your changes.

5 Comparison of data management between JCR, EDP and User Provider

For a list of key data types, this summary table compares their management within the JCR, which has no such restrictions, with the available actions and expected results when they are managed within an External Data Provider or a Users provider connected to Jahia DX.

 

Within the JCR | What you can do with the External Data Provider | What you can do with the Users provider
Identifier | Can provide its own UUID, or let the EDP generate one. | Users/groups are identified by their name only. A UUID is generated by EDP for every user/group/member node.
Property types | All types/i18n/multiple supported | No i18n or binary properties. Multiple values supported.
Reference properties | Internal or external references supported. | No reference properties for users/groups. Group members internally references users and groups from the same provider.
Search - JCR-SQL2 queries | QOM model passed to EDP, up to implementation to parse and execute the query as it can (ISO-37). QueryHelper is provided to help parsing of simple query, but do not support all type of constraints (1 type of boolean operators, only = comparison). Queries results are aggregated sequentially for each provider, so global ordering may not be consistent (https://jira.jahia.org/browse/IDEAS-802). | QOM is interpreted and query is transformed into a simple key-value pair criteria, on users and groups nodes only. Only simple AND or OR search can be done, no combination of both (fix on and/or selection: QA-9046). Complex queries cannot be implemented in users provider. Cannot do query on member nodes. Ordering not supported (MAE-40)
Listeners and rules | Not supported yet - see https://jira.jahia.org/browse/BACKLOG-5678 | -
Publication | Publication not supported: content is visible in default and live. This applies on external nodes, but also on extensions | Same as EDP - but it can be confusing that extensions nodes (content stored under users and groups) do not support publication.
ACL and permissions | Can set ACLs on any node as an extension (stored in JR) if the provider does not give its own ACL. | User nodes have Extensions node (content added under the users and the groups) can have custom ACLs set by the user
Write operations | Can define a writeable provider | Not supported

6 Sending events to DX

6.1 Goals

In some cases, it can be useful to notify DX that an item mounted in an external provider has been modified externally. This allows listeners to be executed in DX, which can trigger indexation and cache flushes. The event is sent by the external system to DX through a specific REST API.

[Figure: EDP-event1.PNG]

6.2 Listening to events

By default, event listeners do not receive the events sent through the API. A listener must implement the ApiEventListener interface to receive them like any other event.
The EventWrapper class provides an isApiEvent() method that can be used to check whether the event comes from the API.

6.3 Sending events

The REST entry point is a URL of the form:

http://<server>/<context>/modules/external-provider/events/<provider-key>

To find the <provider-key> you can go to Administration -> System components -> external provider, your mount point should be displayed in the table, and the first column contains the <provider-key> of your provider.

This URL accepts a JSON formatted input defining the list of events to trigger in DX.

6.3.1 Events format

The events are a JSON serialization of javax.jcr.observation.Event ( see https://docs.adobe.com/docs/en/spec/javax.jcr/javadocs/jcr-2.0/javax/jcr/observation/Event.html ), and so contain the following entries:

{
 type: string
 path: string
 identifier: string
 userID: string
 info: object
 date : string
}

 

The type of event is one of the values defined in javax.jcr.observation.Event:

  • NODE_ADDED
  • NODE_REMOVED
  • PROPERTY_ADDED
  • PROPERTY_REMOVED
  • PROPERTY_CHANGED
  • NODE_MOVED

If not specified, the event type is “NODE_ADDED” by default. 

  • Path is mandatory and should point to the node or property on which the event happened. Note that path is the local path in the external system.
  • Identifier is not mandatory. It is the id as known by the external system.
  • UserID is the username of the user who originally triggered the event.
  • Info contains optional data related to the event. For the “node moved” event, it contains the source and target of the move.
  • Date is the timestamp of the event, in an ISO 8601 compatible format. If not specified, the default value is the current time.

 

6.3.2 Example

 

curl --header "Content-Type: application/json" \
  --request POST \
  --data '[
  {
    "path":"/deadpool.jpg",
    "userID":"root"
  },
  {
    "type":"NODE_REMOVED",
    "path":"/oldDeadpool.jpg",
    "userID":"root"
  }]' \
http://localhost:8080/modules/external-provider/events/2dbc3549-15ff-4b08-92b9-94fc78beeba1

6.4 Passing external data

In addition, the “info” field can also contain an “externalData” object, which holds a serialized version of the “ExternalData” object. This data will be loaded into the session, so that listeners can access the external node without requesting the data back from the provider. This avoids a complete round trip to the external data provider.
For example, sending events without data could give this sequence:

[Figure: EDP-event2.PNG]

If the externalData is provided for both events, this would lead instead to this sequence:

[Figure: EDP-event3.PNG]

 

6.4.1 External data format

External data can contain the following entries:

{
 "id": string,
 "path":string,
 "type":string,
 "properties": object,
 "i18nProperties":object,
 "binaryProperties":object,
 "mixin": string[]
}

 

The fields id, path and type are mandatory.

  • properties is an object with the property names as keys and an array of serialized values as value (an array of one value for properties that are not multi-valued)
  • i18nProperties is an object with the language as key, and a structure like properties as value
  • binaryProperties is an object with the property names as keys and an array of base64-encoded values as value

 

6.4.2 Example

curl --header "Content-Type: application/json" \
  --request POST \
  --data '[
   {
    "path":"/deathstroke.jpg",
    "userID":"root",
    "info": {
      "externalData":{
        "id":"/deathstroke.jpg",
        "path":"/deathstroke.jpg",
        "type":"jnt:file",
        "properties": {
          "jcr:created": ["2017-10-10T10:50:43.000+02:00"],
          "jcr:lastModified": ["2017-10-10T10:50:43.000+02:00"]
        },
        "i18nProperties":{
          "en":{
            "jcr:title":["my title"]
          },
          "fr":{
            "jcr:title":["my title"]
          }
        },
        "mixin":["jmix:image"]
      }
    }
  },
  {
    "path":"/deathstroke.jpg/jcr:content",
    "userID":"root",
    "info": {
      "externalData":{
        "id":"/deathstroke.jpg/jcr:content",
        "path":"/deathstroke.jpg/jcr:content",
        "type":"jnt:resource",
        "binaryProperties": {
          "j:extractedText":["ZXh0cmFjdCBjb250ZW50IGJ1YmJsZWd1bQ=="]
        }
      }
    }
  }
]' \
http://localhost:8080/modules/external-provider/events/2dbc3549-15ff-4b08-92b9-94fc78beeba1
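The j:extractedText value in the request above is a base64-encoded string. As a small illustration, the sketch below uses the standard java.util.Base64 class to encode and decode such a value; the class and method names are hypothetical, and UTF-8 is assumed as the text encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Encodes and decodes text values the way binaryProperties entries are serialized.
final class BinaryPropertySketch {

    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Value taken from the j:extractedText entry of the example request.
        System.out.println(decode("ZXh0cmFjdCBjb250ZW50IGJ1YmJsZWd1bQ=="));
        // prints "extract content bubblegum"
        System.out.println(encode("extract content bubblegum"));
    }
}
```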

6.5 REST API Security

By default, access to the REST API is not allowed, so any request will be denied. You must provide an API key.

An apiKey is not generated automatically. You have to create it manually and configure it in a new configuration file named org.jahia.modules.api.external_provider.event.cfg, located in /digital-factory-data/karaf/etc.

6.5.1 Key declaration

<name>.event.api.key=<apiKey>

  • <apiKey> : the apiKey
  • <name> : the name you want, it's just used to group the other config options for the same key

Example:
global.event.api.key=42267ebc-f8d0-4f4d-ac98-21fb8eeda653

6.5.2 Restrict apiKey to some providers

By default an apiKey is used to protect all the providers, but you can restrict the providers allowed by an apiKey

<name>.event.api.providers=<providerKeys>

  • <apiKey>: the apiKey
  • <name>: the name used to declare the key
  • <providerKeys>: comma-separated list of providerKeys

Example:
providers.event.api.key=42267ebc-f8d0-4f4d-ac98-21fb8eeda653
providers.event.api.providers=provider1,provider2,provider3
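To show how the two kinds of entries relate, the sketch below parses such a configuration with java.util.Properties and checks whether a given key is allowed for a provider. It is only an interpretation of the rules stated above (a key without a providers entry covers all providers); the class name and the checking logic are assumptions, not the module's implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;

// Interprets <name>.event.api.key and <name>.event.api.providers entries.
final class ApiKeyConfigSketch {
    private static final String KEY_SUFFIX = ".event.api.key";
    private static final String PROVIDERS_SUFFIX = ".event.api.providers";

    private final Properties config = new Properties();

    ApiKeyConfigSketch(String cfgText) {
        try {
            config.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isAllowed(String apiKey, String providerKey) {
        for (String property : config.stringPropertyNames()) {
            if (!property.endsWith(KEY_SUFFIX) || !config.getProperty(property).equals(apiKey)) {
                continue;
            }
            String name = property.substring(0, property.length() - KEY_SUFFIX.length());
            String providers = config.getProperty(name + PROVIDERS_SUFFIX);
            if (providers == null) {
                return true; // no restriction declared: the key covers all providers
            }
            if (Arrays.asList(providers.trim().split("\\s*,\\s*")).contains(providerKey)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ApiKeyConfigSketch config = new ApiKeyConfigSketch(
                "providers.event.api.key=42267ebc-f8d0-4f4d-ac98-21fb8eeda653\n"
              + "providers.event.api.providers=provider1,provider2,provider3\n");
        System.out.println(config.isAllowed("42267ebc-f8d0-4f4d-ac98-21fb8eeda653", "provider2"));
        System.out.println(config.isAllowed("42267ebc-f8d0-4f4d-ac98-21fb8eeda653", "provider9"));
    }
}
```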