Using an External Data Provider with Augmented Search

November 14, 2023

There are several ways to implement an External Data Provider (EDP) with Augmented Search (AS). This topic shows you how to do so in a way that is best for your environment. When you implement your EDP with Augmented Search:

  • First ensure that your external data can be indexed by AS.
  • Once your node types are ready to be indexed by AS, ensure that they can be found by:
    • Implementing an ExternalDataSource.Searchable interface. This is the preferred strategy if you have fewer than a few thousand data entries.
    • Sending your data through the EventService. The EventService can complement the ExternalDataSource.Searchable interface or be used on its own.
  • Run queries to validate that your data is indexed.
  • Declare your node types as mapped node types. You can use mapped properties to build filters, sorting, and facets.

Note: This document assumes that you are familiar with External Data Providers (EDPs). If not, first see Integrating external data sources.

To get the most out of Augmented Search with an external provider, first decide what to index: choose the node types and define their properties correctly to avoid noise and pollution in your documents. Then decide when to send the data to Augmented Search. Once indexing works the way you want, you can focus on mapped properties, dedicated queries, and rendering.

Indexing your external data

You can use different strategies to index your external data in Augmented Search. This document explains those strategies with their pros and cons.

First, ensure that your data can be indexed by Augmented Search. You can either:

  • Make sure your definitions extend jmix:mainResource.
  • Add your node types to the list of main resources node types to be indexed by Augmented Search. For more information, see Configuring Augmented Search>Main Resources.

For example, the jnt:movie definition from The Movie Database provider extends jmix:mainResource:

[jnt:movie] > jnt:content, jmix:structuredContent, mix:title, jmix:tagged, jmix:keywords, jmix:mainResource
- overview (string) i18n
- tagline (string) i18n
- original_title (string)
- backdrop_path (string) nofulltext
- poster_path (string) i18n nofulltext
- homepage (string) nofulltext
- release_date (date)
- status (string) nofulltext
- runtime (long)
- spoken_languages (string) multiple nofulltext
- adult (boolean)
- production_companies (string) multiple nofulltext
- imdb_id (string) nofulltext
- budget (long)
- revenue (double)
- vote_average (double)
- vote_count (long)
- popularity (double)
+ * (jnt:cast)
+ * (jnt:crew)

By default, all nodes of the jmix:mainResource type are indexed by Augmented Search.

It is extremely important that you correctly define your indexation options for each property. By default, every String property is part of the indexed node or document full-text excerpt. For information on all indexation options, see Content Definitions.

In the example, the backdrop_path, poster_path, homepage, status, spoken_languages, production_companies, and imdb_id fields use the nofulltext option to exclude them from the document excerpt. This ensures that they won’t add noise to the full-text excerpt of the document or pollute search results. Augmented Search can still use these fields for filtering and faceting, unless they are flagged as indexed=no. Properties other than strings are excluded from the full-text excerpt.

ExternalDataSource.Searchable Strategy

Once your node types are ready to be indexed by Augmented Search, they need to be found. You can implement an ExternalDataSource.Searchable interface to help Augmented Search find your data.

This is recommended if you have fewer than a few thousand data entries. Remember that the time taken to index your site includes the time spent fetching data from your provider.

In this approach, when Augmented Search performs a site or full indexation, it queries every node type defined in org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes.

For example, the movie database provider returns only the 2000 most popular movies during indexing. Augmented Search performs this query:

select * from jnt:movie where isdescendant('/sites/augmented-search-site')

If you add jnt:movie to the indexedMainResourceTypes property, AS will execute this query by default:

select * from jmix:mainResource where isdescendant('/sites/augmented-search-site')

Warning: When implementing the Searchable interface, you are responsible for handling pagination with your provider system as the query is a ScrollableQuery. You need to paginate based on query offset and limit.

private void getMostPopularMovies(ExternalQuery query, List<String> results) throws RepositoryException, JSONException {
   JSONArray tmdbResult;
   // TMDB returns 20 results per page, so convert the query offset into a zero-based page index
   long pageNumber = query.getOffset() / 20;
   if (pageNumber < 100) {
       //Return up to the first 2000 most popular movies
       // TMDB pages are 1-based, hence pageNumber + 1
       JSONObject discoverMovies = queryTMDB(API_DISCOVER_MOVIE, "sort_by", "popularity.desc", "page", String.valueOf(pageNumber + 1));
       if (discoverMovies.has("total_pages") && discoverMovies.has(RESULTS)) {
           int totalPages = discoverMovies.getInt("total_pages");
           tmdbResult = discoverMovies.getJSONArray(RESULTS);
           if (tmdbResult != null) {
               processResultsArray(results, tmdbResult);
           }
           // Fetch the following pages until the query limit is reached
           for (long i = pageNumber + 2; i <= totalPages; i++) {
               processResultsArray(results, queryTMDB(API_DISCOVER_MOVIE, "sort_by", "popularity.desc", "page", String.valueOf(i)).getJSONArray(RESULTS));
               if (results.size() >= query.getLimit()) {
                   break;
               }
           }
       }
       logger.info("Found {} results from TMDB", results.size());
   }
}
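
The offset and limit handling that the warning above asks for can be sketched independently of TMDB. The following is a minimal sketch, not the provider's actual code: PagedFetchSketch, its fetch method, and the fetchPage callback are hypothetical names, the page size and page cap are plain parameters, and treating a non-positive limit as "no limit" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical helper: maps an offset/limit window onto a page-based remote API. */
final class PagedFetchSketch {
    private PagedFetchSketch() {}

    /**
     * Collects up to limit items starting at offset.
     *
     * @param pageSize  fixed number of items per remote page
     * @param maxPages  hard cap on the remote pages that may be read
     * @param fetchPage returns the items of a 1-based page, or an empty list past the end
     */
    static List<String> fetch(long offset, long limit, int pageSize, int maxPages,
                              IntFunction<List<String>> fetchPage) {
        List<String> results = new ArrayList<>();
        // Assumption: a non-positive limit means "no limit".
        long wanted = limit > 0 ? limit : Long.MAX_VALUE;
        int page = (int) (offset / pageSize) + 1; // 1-based page that contains the offset
        int skip = (int) (offset % pageSize);     // items to drop on that first page
        while (page <= maxPages && results.size() < wanted) {
            List<String> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // remote source exhausted
            }
            for (int i = skip; i < items.size() && results.size() < wanted; i++) {
                results.add(items.get(i));
            }
            skip = 0; // only the first page is read from the middle
            page++;
        }
        return results;
    }
}
```

With a page size of 20 and a cap of 100 pages, this reads at most 2000 entries, which matches the limit used in the example above. Unlike that example, the sketch also skips the entries that precede the offset inside the first page.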

Your implementation should return a list of paths. Augmented Search then iterates over these paths and indexes each node, which translates into External Data Provider calls: after the call to search from Searchable, Augmented Search calls getItemByPath for each path and uses the returned data to populate the index.

EventService Strategy

You can send the data through the EventService to replace or complement ExternalDataSource.Searchable. With this strategy, a node is indexed only when it has been accessed by a user or by the system. In the movie database implementation, the EventService indexes movies when the user is browsing the provider’s content. If a movie has not already been indexed, it is indexed after the call to getItemByPath.

image1.jpg

This combined strategy ensures that at least 2000 movies are indexed when administrators index the whole site, and also lets you add more movies to the index over the lifespan of the platform. The following code shows how to index a node this way. Here, CompletableFuture sends the indexation information asynchronously so that the user is not blocked.

// Skip movies that the cache already marks as indexed
if (cache.get(INDEXEDFULLMOVIECACHEKEYPREFIX + movieId) == null) {
   EventService eventService = BundleUtils.getOsgiService(EventService.class, null);
   JCRStoreProvider jcrStoreProvider = JCRSessionFactory.getInstance().getProviders().get("TMDBProvider");
   // Send the node asynchronously so that the caller is not blocked
   CompletableFuture.runAsync(() -> {
       try {
           eventService.sendAddedNodes(Arrays.asList(data), jcrStoreProvider);
           // Mark the movie as indexed only after the event has been sent
           cache.put(new Element(INDEXEDFULLMOVIECACHEKEYPREFIX + movieId, "indexed"));
       } catch (RepositoryException e) {
           logger.error("Unable to send movie {} to Augmented Search", movieId, e);
       }
   });
}

The last indexation strategy depends on how your data provider is implemented. For example, if your implementation fetches all content from a product catalog, you might want to send the data to Augmented Search at the same time, instead of waiting for an indexation or for a user to browse every product. The code is similar to the previous example, but it sends a batch of ExternalData instead of a single node.
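
If the catalog is large, one option is to split it into fixed-size batches before sending. The following is a minimal sketch under that assumption: BatchSendSketch and sendInBatches are hypothetical names, and the sender callback stands in for a call such as eventService.sendAddedNodes(batch, jcrStoreProvider) from the earlier example. Nothing in the source says that batching is required or what batch size suits Augmented Search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: hands a list to a sender in fixed-size batches. */
final class BatchSendSketch {
    private BatchSendSketch() {}

    /**
     * Splits items into consecutive batches and passes each batch to sender.
     *
     * @return the number of batches that were sent
     */
    static <T> int sendInBatches(List<T> items, int batchSize, Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the window so the sender never sees a view of the source list
            sender.accept(new ArrayList<>(items.subList(start, end)));
            batches++;
        }
        return batches;
    }
}
```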

You need to handle changes made on the third-party side to keep your data accurate. With Augmented Search involved, you must reindex content when it is updated, while avoiding reindexing it every time the data is accessed. The TMDB implementation uses a dedicated cache to record the movies that are already indexed. If the cache is emptied, or if some keys are evicted from it, the corresponding movies are indexed again on their next access.
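
The "index once, reindex on change" bookkeeping can be sketched without any Jahia types. This is a minimal sketch and not the TMDB implementation: IndexOnceSketch, indexIfNeeded, and invalidate are hypothetical names, the sender callback stands in for the EventService call, and an in-memory set replaces the dedicated cache. Unlike the earlier example, which marks a movie as indexed after sending, this sketch marks the id first and rolls the mark back on failure, so that concurrent accesses to the same node trigger only one send.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical helper: sends each id at most once until it is invalidated. */
final class IndexOnceSketch {
    private final Set<String> indexed = ConcurrentHashMap.newKeySet();
    private final Consumer<String> sender;

    IndexOnceSketch(Consumer<String> sender) {
        this.sender = sender;
    }

    /** Completes with true if the id was sent, or false if it was already marked as indexed. */
    CompletableFuture<Boolean> indexIfNeeded(String id) {
        // Set.add is atomic here: only the first caller for a given id gets true
        if (!indexed.add(id)) {
            return CompletableFuture.completedFuture(false);
        }
        return CompletableFuture.supplyAsync(() -> {
            try {
                sender.accept(id);
                return true;
            } catch (RuntimeException e) {
                indexed.remove(id); // allow a retry on the next access
                throw e;
            }
        });
    }

    /** Forget an id, for example when the third-party system reports a change. */
    void invalidate(String id) {
        indexed.remove(id);
    }
}
```

Calling invalidate for an updated entry makes the next access send it again, which plays the same role as a key being evicted from the cache.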

Run queries to validate the indexation

Once your data is indexed, run queries to validate that Augmented Search can find it. The following example searches for “Star Wars” and uses a node type filter so that the search is performed only on the movie type.

query {
  search(q: "Star Wars", workspace: EDIT, siteKeys: "digitall", filters:{nodeType:{type:"jnt:movie"}}) {
    results(size:3) {
      took
      totalHits
      hits {
        displayableName
        score
        path
        excerpt
        link
        nodeType
      }
    }
  }
}

This query returns:

{
  "data": {
    "search": {
      "results": {
        "took": "24ms",
        "totalHits": 126,
        "hits": [
          {
            "displayableName": "Star Wars",
            "score": 333.220947265625,
            "path": "/sites/digitall/contents/tmdb/movies/1977/05/11",
            "excerpt": "Star Wars Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/1977/05/11.html",
            "nodeType": "jnt:movie"
          },
          {
            "displayableName": "Star Wars: Episode III - Revenge of the Sith",
            "score": 287.9112548828125,
            "path": "/sites/digitall/contents/tmdb/movies/2005/05/1895",
            "excerpt": "Star Wars: Episode III - Revenge of the Sith The evil Darth Sidious enacts his final plan for unlimited power -- and the heroic Jedi Anakin Skywalker must choose a side. The saga is complete.",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/2005/05/1895.html",
            "nodeType": "jnt:movie"
          },
          {
            "displayableName": "Star Wars: The Clone Wars",
            "score": 276.456298828125,
            "path": "/sites/digitall/contents/tmdb/movies/2008/08/12180",
            "excerpt": "Star Wars: The Clone Wars Set between Episode II and III, The Clone Wars is the first computer animated Star Wars film. Anakin and Obi Wan must find out who kidnapped Jabba the Hutt's son and return him safely. The Separatists will try anything to stop them and ruin any chance of a diplomatic agreement",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/2008/08/12180.html",
            "nodeType": "jnt:movie"
          }
        ]
      }
    }
  }
}

If you are using the Augmented Search UI, you will also see your data.

image2.jpg

Mapping

To be able to use filters, sorting, and facets with your data, declare your node types as mapped node types. For more information, see Custom Indexing.

For example, add “jnt:movie” to the list of mapped node types to perform this kind of query. This query does not filter on any node types as the search is performed on the entire index.

query {
  search(q: "Star Wars", workspace: EDIT, siteKeys: "digitall") {
    results(size:3) {
      took
      totalHits
      hits {
        displayableName
        score
        path
        excerpt
        link
        nodeType
        poster_path: property
        release_date: property
        tagline: property
      }
    }
  }
}

This query requests 3 extra properties (poster_path, release_date, and tagline). The properties are empty if the documents that the search finds do not contain a value for them.

The full form of the call is poster_path: property(name: "poster_path"). To keep queries short, the Augmented Search API uses the alias as the property name when no name is given, so the call can be shortened to poster_path: property when the alias and the property name match. If they do not match, you need to use the full form.

The query above returns:

{
  "data": {
    "search": {
      "results": {
        "took": "19ms",
        "totalHits": 128,
        "hits": [
          {
            "displayableName": "Star Wars",
            "score": 333.220947265625,
            "path": "/sites/digitall/contents/tmdb/movies/1977/05/11",
            "excerpt": "Star Wars Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/1977/05/11.html",
            "nodeType": "jnt:movie",
            "poster_path": "http://image.tmdb.org/t/p/w154/6FfCtAuVAW8XJjZ7eWeLibRLWTw.jpg",
            "release_date": "1977-05-25T00:00:00.000Z",
            "tagline": "A long time ago in a galaxy far, far away..."
          },
          {
            "displayableName": "Star Wars: Episode III - Revenge of the Sith",
            "score": 287.9112548828125,
            "path": "/sites/digitall/contents/tmdb/movies/2005/05/1895",
            "excerpt": "Star Wars: Episode III - Revenge of the Sith The evil Darth Sidious enacts his final plan for unlimited power -- and the heroic Jedi Anakin Skywalker must choose a side. The saga is complete.",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/2005/05/1895.html",
            "nodeType": "jnt:movie",
            "poster_path": "http://image.tmdb.org/t/p/w154/xfSAoBEm9MNBjmlNcDYLvLSMlnq.jpg",
            "release_date": "2005-05-17T00:00:00.000Z",
            "tagline": "The saga is complete."
          },
          {
            "displayableName": "Star Wars: The Clone Wars",
            "score": 276.456298828125,
            "path": "/sites/digitall/contents/tmdb/movies/2008/08/12180",
            "excerpt": "Star Wars: The Clone Wars Set between Episode II and III, The Clone Wars is the first computer animated Star Wars film. Anakin and Obi Wan must find out who kidnapped Jabba the Hutt's son and return him safely. The Separatists will try anything to stop them and ruin any chance of a diplomatic agreement",
            "link": "http://localhost/cms/render/default/en/sites/digitall/contents/tmdb/movies/2008/08/12180.html",
            "nodeType": "jnt:movie",
            "poster_path": "http://image.tmdb.org/t/p/w154/veee7dll1xMwK14dGt0xsQekYYs.jpg",
            "release_date": "2008-08-05T00:00:00.000Z",
            "tagline": ""
          }
        ]
      }
    }
  }
}
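
The alias rule for requesting properties can be expressed as a small helper that picks the short or the full form. This is a minimal sketch for illustration only: PropertySelectionSketch and select are hypothetical names, and the helper only builds the text of one selection line; it is not part of the Augmented Search API.

```java
/** Hypothetical helper: builds one property selection for a search query. */
final class PropertySelectionSketch {
    private PropertySelectionSketch() {}

    static String select(String alias, String propertyName) {
        if (alias.equals(propertyName)) {
            // Alias and property name match: the short form is enough
            return alias + ": property";
        }
        // Otherwise the property name has to be passed explicitly
        return alias + ": property(name: \"" + propertyName + "\")";
    }
}
```

For example, select("poster_path", "poster_path") produces the short form used in the query above, while select("poster", "poster_path") produces the full form with an explicit name.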

The mapped properties can be used to build filters, sorting, and facets.

query {
  search(q: "star wars", workspace: EDIT, siteKeys: "digitall") {
    results(size: 3) {
      took
      totalHits
      hits {
        displayableName
        score
        path
        excerpt
        link
        nodeType
        poster_path: property
        release_date: property
        tagline: property
        popularity: property
      }
    }
    numberRange(field: "popularity") {
      max
      min
    }
  }
}

This will return something like this for the facet:

"numberRange": {
        "max": 268.729,
        "min": 23.608
      }

Or a facet on the release date:

query {
  search(q: "star wars", workspace: EDIT, siteKeys: "digitall") {
    results(size: 3) {
      took
      totalHits
      hits {
        displayableName
        score
        path
        excerpt
        link
        nodeType
        poster_path: property
        release_date: property
        tagline: property
        popularity: property
      }
    }
    rangeFacet(field: "release_date", ranges: [
      {name: "2010's", from: "2010-01-01", to: "2019-12-31"}, 
      {name: "2000's", from: "2000-01-01", to: "2009-12-31"}, 
      {name: "1990's", from: "1990-01-01", to: "1999-12-31"}, 
      {name: "1980's", from: "1980-01-01", to: "1989-12-31"}, 
      {name: "1970's", from: "1970-01-01", to: "1979-12-31"}]) {
      data {
        count
        name
      }
    }
  }
}

This will return something like this:

"rangeFacet": {
        "data": [
          {
            "count": 1,
            "name": "1970's"
          },
          {
            "count": 10,
            "name": "1980's"
          },
          {
            "count": 10,
            "name": "1990's"
          },
          {
            "count": 19,
            "name": "2000's"
          },
          {
            "count": 76,
            "name": "2010's"
          }
        ]
      }

Now you can modify the Augmented Search UI to design a dedicated rendering for your movies.

First, modify the App.jsx file as follows to request all the extra properties you need for your movies:

let fields = [
   new Field(FieldType.HIT, 'link'),
   new Field(FieldType.HIT, 'displayableName', 'title'),
   new Field(FieldType.HIT, 'excerpt', null, true),
   new Field(FieldType.HIT, 'score'),
   new Field(FieldType.HIT, 'lastModified'),
   new Field(FieldType.HIT, 'lastModifiedBy'),
   new Field(FieldType.HIT, 'createdBy'),
   new Field(FieldType.HIT, 'created'),
   new Field(FieldType.HIT, 'nodeType'),
   new Field(FieldType.NODE, 'poster_path'),
   new Field(FieldType.NODE, 'tagline'),
   new Field(FieldType.NODE, 'release_date'),
   new Field(FieldType.NODE, 'vote_average'),
   new Field(FieldType.NODE, 'vote_count'),
   new Field(FieldType.NODE, 'popularity'),
   new Field(FieldType.NODE, 'overview')
];

The extra fields are the ones declared with FieldType.NODE. Next, you need to choose the rendering based on the value of nodeType.

Start by creating a Movie component to render the movies:

const Movie = ({movie}) => {
   return (
       <div style={{
           minHeight: '300px',
           maxHeight: '300px',
           display: 'flex',
           alignItems: 'center',
           justifyContent: 'space-evenly'
       }}
       >
           <img src={getRaw(movie, 'poster_path')} style={{maxHeight: '150px'}}/>
           <div style={{maxWidth: '500px', minWidth: '500px'}}>
               <a href={getRaw(movie, 'link')}><h2
                   dangerouslySetInnerHTML={{__html: getEscapedField(movie, 'title') + '<span style="font-size: medium; margin-left: 5px">(<i>' + getEscapedField(movie, 'popularity') + '</i>)</span>'}}/>
               </a>
               <h4 dangerouslySetInnerHTML={{__html: getEscapedField(movie, 'tagline')}}/>
               <p
                   dangerouslySetInnerHTML={{__html: getEscapedField(movie, 'overview')}}
                   style={{fontSize: 'smaller', maxHeight: '200px'}}/>
               <span><em>Released:</em>&nbsp;
                   <i>{moment(getEscapedField(movie, 'release_date')).format('LL')}</i>
               </span>
               <span style={{marginLeft: '10px'}}>
                   <em>Rating</em>&nbsp;
                   <i>{getEscapedField(movie, 'vote_average')}/10 ({getEscapedField(movie, 'vote_count')}&nbsp; votes)</i>
               </span>
           </div>
       </div>
   );
};

Movie.propTypes = {
   movie: PropTypes.object.isRequired
};

Then, in ResultView, add a test that renders a Movie when nodeType is jnt:movie. Helper functions such as getEscapedField are defined in ResultView.jsx to ease the rendering of fields that might contain HTML.

const type = getRaw(result, 'nodeType');

// Use the dedicated Movie component for movie hits
if (type === 'jnt:movie') {
   return (<Movie movie={result}/>);
}

This gives you a rendering like the following for each movie.

image3.jpg