About indexing

November 14, 2023

For documents to be searchable, they must first be indexed in an Elasticsearch cluster. Indexing works by identifying main resource nodes and their subnodes, as defined in the Augmented Search configuration file.

As part of the indexing process, the corresponding ACLs and roles (coming from the source nodes) are attached to the Elasticsearch documents. This allows Augmented Search to return only the search results that match the visitor's permissions.
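As a rough illustration of permission-aware results, the sketch below filters documents by the roles stored with them. The `IndexedDocument` class, its `allowed_roles` field, and the role names are hypothetical; the actual fields and permission model used by Augmented Search are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    """Hypothetical stand-in for an indexed document carrying ACL data."""
    path: str
    allowed_roles: frozenset = field(default_factory=frozenset)


def visible_documents(documents, visitor_roles):
    """Return the documents that share at least one role with the visitor."""
    visitor_roles = set(visitor_roles)
    return [doc for doc in documents if doc.allowed_roles & visitor_roles]


# Hypothetical sample data: one public page and one editor-only page.
docs = [
    IndexedDocument("/sites/demo/home", frozenset({"guest", "editor"})),
    IndexedDocument("/sites/demo/intranet", frozenset({"editor"})),
]
```

With this sample data, a visitor holding only the `guest` role would see the home page but not the intranet page.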

Triggering indexing

Indexing is currently triggered by the following types of events:

  • Clicking "Index All" in the Augmented Search section of Jahia Administration. This triggers (re-)indexing of all configured sites. Alternatively, clicking "Start" for a single site triggers (re-)indexing of that site only.
  • Creating or modifying content on a site. This triggers indexing of the new or modified content only (not the full site), which becomes available for search almost immediately after publication.
  • Importing content. This triggers indexing of the new or modified content.

Indexing time depends on the number of documents to index. Page updates become searchable almost immediately, while a full reindex of a large site can take a couple of minutes.

Per-site indexing vs indexing all

In the Augmented Search UI (as well as through its GraphQL API), you can trigger indexing either for an individual site or for all sites. The two actions handle indexing slightly differently, and it is critical to understand these differences.

Augmented Search uses Elasticsearch aliases when communicating with data indices and shares indices across multiple sites (with different indices based on language and content type). When indexing is triggered across all configured sites, Augmented Search creates a set of new indices and starts populating them with data. Once indexing is complete, the alias is updated to point to the new indices, and the old indices are deleted. Because new indices are created, new mappings or settings can be applied to them.
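The alias switch described above can be sketched as the body of an Elasticsearch `POST /_aliases` request, which applies all of its actions atomically. This is only an illustration of the general technique: the alias and index names are hypothetical, and the requests Augmented Search actually sends are not documented here.

```python
def build_alias_swap(alias, old_indices, new_indices):
    """Build a `POST /_aliases` request body that repoints `alias`
    from the old indices to the new ones in a single call."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in old_indices]
    actions += [{"add": {"index": name, "alias": alias}} for name in new_indices]
    return {"actions": actions}


body = build_alias_swap(
    alias="content-en",               # hypothetical alias name
    old_indices=["content-en-0001"],  # hypothetical index names
    new_indices=["content-en-0002"],
)
```

Because searches go through the alias, they keep hitting the old indices until the swap, and deleting the old indices afterwards is a separate step.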

When indexing is triggered for a single site, Augmented Search goes through all of the Jahia nodes for that site and pushes them to Elasticsearch. This results in either an update of existing documents or the creation of new documents (for new sites). The indices themselves are not recreated (only their content for the site being indexed changes), so their mappings and settings remain unchanged.
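To make the contrast concrete, here is a minimal in-memory sketch of per-site indexing as an upsert into an existing index. The dictionary-based index, the `reindex_site` function, and the document ID scheme are all assumptions made for illustration, not the module's implementation.

```python
def reindex_site(index, site_key, site_nodes):
    """Upsert every node of one site into an existing in-memory index.

    `index` maps document ID to document; documents that belong to
    other sites are left untouched.
    """
    created, updated = 0, 0
    for node in site_nodes:
        doc_id = f"{site_key}:{node['path']}"  # hypothetical ID scheme
        if doc_id in index:
            updated += 1
        else:
            created += 1
        index[doc_id] = {"site": site_key, **node}
    return created, updated


# Hypothetical shared index holding documents for two sites.
index = {
    "siteA:/home": {"site": "siteA", "path": "/home", "title": "Old title"},
    "siteB:/home": {"site": "siteB", "path": "/home", "title": "B home"},
}
result = reindex_site(index, "siteA", [
    {"path": "/home", "title": "New title"},
    {"path": "/news", "title": "News"},
])
```

Here one document for `siteA` is updated and one is created, while the `siteB` document and the index itself stay as they were.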

Excluding a document

In some situations, it might be necessary to exclude a document from Augmented Search, for example, if you want to make sure specific content cannot show up in the excerpt (see Search results and content filtering).

Excluding content can be done using the Remove From Augmented Search Results mixin. If the document was previously indexed, it will be removed from Augmented Search upon save.

augmented-search-skip-indexing.png

Two options are available:

  • Remove content only
    Removes only the selected content. If this content has subcontent defined as indexedMainResourcesTypes in the Augmented Search configuration file, the subcontent will be indexed. This is useful when a site contains a tree of documents, for example news organized by year. You might not want to have the page listing news for the year 2018 available in search results but would want each individual news item to be indexed.
  • Remove content and sub-content items
    Removes both content and subcontent, even if the subcontent is defined as indexedMainResourcesTypes.
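The two options above can be sketched as a tree walk. The `exclude` key, its two mode names, and the sample node types are hypothetical and only mirror the behavior described in the list; they are not the mixin's real property names.

```python
# Hypothetical values standing in for indexedMainResourcesTypes in the configuration file.
INDEXED_MAIN_RESOURCE_TYPES = {"jnt:page", "jnt:news"}


def indexable_paths(node):
    """Yield the paths of nodes that would still be indexed."""
    mode = node.get("exclude")  # None, "content_only" or "content_and_subcontent"
    if mode == "content_and_subcontent":
        return  # skip this node and its entire subtree
    if mode is None and node["type"] in INDEXED_MAIN_RESOURCE_TYPES:
        yield node["path"]
    for child in node.get("children", []):
        yield from indexable_paths(child)


# Hypothetical tree: a yearly listing page with two news items under it.
news_2018 = {
    "path": "/news/2018",
    "type": "jnt:page",
    "exclude": "content_only",
    "children": [
        {"path": "/news/2018/item-1", "type": "jnt:news"},
        {"path": "/news/2018/item-2", "type": "jnt:news"},
    ],
}
```

With `content_only`, the 2018 listing page is skipped but both news items are still yielded; switching the mode to `content_and_subcontent` yields nothing for this tree.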

Excluding a parent will also exclude all of its subpages.

Modifying Elasticsearch mapping

Note that any modifications to the Elasticsearch mapping require a reindexing of the content for all sites.