Index & Searchable content

November 14, 2023

This section explains how to configure the index so that search results return the expected pages or content. See the Search relevance and boosting page to learn how to fine-tune the search experience for your needs and dataset.

Indexing Pages

In a site, pages can be regular Jahia pages (of type jnt:page) or specific content types that can be displayed as a full page. Such content types have what is called a content template. As a best practice, add the jmix:mainResource mixin to these content types so that they are easier to identify.

By default, Jahia pages and content with the jmix:mainResource mixin are indexed as pages, and can therefore be returned as search results.

If you need more, or different, content types to appear in the search results, edit this list by declaring the content types hierarchically in the org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes property. The only constraint is that jnt:page must always come first, as shown in the following example:

org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes = jnt:page,jmix:mainResource,jnt:myCustomType

Note: if jmix:mainResource is declared in the indexedMainResourceTypes property, it is not necessary to declare the content types that have this mixin.

Indexing the page content

By default, only editorial content (content that has jmix:editorialContent as a supertype) created inside a page is indexed as part of the page.

If your pages contain other types of content that need to be indexed, so that they can be used to search for the page, declare them in the org.jahia.modules.augmentedsearch.content.indexedSubNodeTypes property. For instance:

org.jahia.modules.augmentedsearch.content.indexedSubNodeTypes = jmix:editorialContent, jnt:myNonEditorialContentType

Indexing nodetype properties

By default, all the text properties of the indexed content (see the Indexing Pages section above) are searchable through full-text search.

Additionally, the following default properties can be used when building specific queries, for example when building an advanced search form or when implementing facets:

  • tags: jgql:tags
  • keyword: jgql:keywords
  • categories: jgql:categorized
  • displayableName: jgql:displayableName
  • nodetype: jgql:nodeType
  • creation date: jgql:created
  • creator: jgql:createdBy
  • last modification date: jgql:lastModified
  • last contributor: jgql:lastModifiedBy
  • last publication date: jgql:lastPublished
  • last publisher: jgql:lastPublishedBy
  • mimetype: jgql:mimeType

To add custom node type properties to this list, declare them in the org.jahia.modules.augmentedsearch.content.mappedNodeTypes configuration property:

  • using the node type only will add all the content type properties to the previous list
  • using the {contentType}.{propertyName} notation will only add the designated properties

In the following example, all the properties of the jnt:news content type will be available to build queries and facets, as well as the eventsType property of jnt:event. The other jnt:event properties will not be available:

org.jahia.modules.augmentedsearch.content.mappedNodeTypes = jnt:news, jnt:event.eventsType

Similarly, if you need to add additional properties to this list when searching for files, you need to declare them in the org.jahia.modules.augmentedsearch.file.mappedNodeTypes configuration property.
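
A sketch of what such a declaration might look like, assuming that the file property accepts the same notation as the content property. The jmix:myFileMetadata mixin and its author property are hypothetical names used for illustration only:

```properties
# Hypothetical mixin and property names, for illustration only
org.jahia.modules.augmentedsearch.file.mappedNodeTypes = jmix:myFileMetadata.author
```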

All the node types declared in org.jahia.modules.augmentedsearch.content.mappedNodeTypes must also be included in the types declared in the org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes configuration property.
Elasticsearch limits the number of fields in an index to 1000 by default (the index.mapping.total_fields.limit index setting), which caps the number of properties that can be indexed this way. This limit can be increased, but doing so may affect performance.
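
The constraint on indexedMainResourceTypes stated above can be sketched as follows, reusing the jnt:news and jnt:event example. Both types are listed explicitly here, after the default jnt:page and jmix:mainResource entries. Treat this as an illustration rather than a recommended configuration:

```properties
# Types whose properties are mapped must also be indexed as main resources
org.jahia.modules.augmentedsearch.content.indexedMainResourceTypes = jnt:page,jmix:mainResource,jnt:news,jnt:event
org.jahia.modules.augmentedsearch.content.mappedNodeTypes = jnt:news, jnt:event.eventsType
```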

Indexing files

By default, only PDF files are indexed and searchable.

You can configure which files are indexed, based on their file extension, using the org.jahia.modules.augmentedsearch.content.indexedFileExtensions property:

  • Provide all the extension types in a comma-separated list:
    org.jahia.modules.augmentedsearch.content.indexedFileExtensions = pdf,docx,doc
  • Use * to index all files
  • Leave the property empty to not index files at all
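
The last two cases in the list above could be written as follows. This is a sketch that assumes the same key = value syntax as the pdf,docx,doc example. Only one of these lines would be present in an actual configuration:

```properties
# Index every file, whatever its extension
org.jahia.modules.augmentedsearch.content.indexedFileExtensions = *

# Alternatively, index no files at all (empty value)
org.jahia.modules.augmentedsearch.content.indexedFileExtensions =
```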

As with any setting that defines which content is indexed, this setting has a direct impact on the size of your Elasticsearch indices and therefore on the resource requirements of your entire Elasticsearch cluster. Only list the file types that you actually want to be searchable on your platform.

Workspace indexation

By default, the content is indexed in both live and staging. This means that you can search for content in preview mode.

If you do not need to search for content in preview, or if the index size is a concern, you can index only live content by setting the org.jahia.modules.augmentedsearch.workspaces property to LIVE:

org.jahia.modules.augmentedsearch.workspaces = LIVE

Use ALL to return to the default behavior of indexing both staging and live content.
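
For reference, the default behavior described above corresponds to the following setting, written with the same syntax as the LIVE example:

```properties
# Index both staging and live content (default behavior)
org.jahia.modules.augmentedsearch.workspaces = ALL
```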

Preventing specific contents from appearing in the search results

It is possible to exclude a specific section, page, content item, folder, or file from being indexed, and thus from appearing in the search results.

Using Content Editor

To do so, when editing the page or content, enable the Remove From Augmented Search Results mixin in the Options section:

augmented-search-skip-indexing.png

Two options are available:

  • Current content only
    Removes only the selected content from the index.
    If this content has subcontent whose type is listed in indexedMainResourceTypes in the Augmented Search configuration file, the subcontent will still be indexed. This is useful when a site contains a tree of documents, for example news organized by year. You might not want the page listing news for the year 2018 to be available in search results, but still want each individual news item to be indexed.
  • Current content and subcontent items
    Removes both content and subcontent from the index, even if the subcontent type is listed in indexedMainResourceTypes. When this option is used on a page, all its subpages are excluded from indexing.

If the document was previously indexed:

  • It will be removed from Augmented Search in preview upon save.
  • It will be removed from Augmented Search in live upon publication.

Programmatically

To remove a specific node from the index, add the jmix:skipESIndexation mixin to this node. This can be done using a GraphQL mutation. By default, only the content itself is removed (current-only). To remove the subcontent as well, set the value of the skipIndexationString property to current-subtree:

mutation excludeContent {
  jcr(workspace: EDIT) {
    mutateNode(pathOrId: "/sites/digitall/home/example") {
      addMixins(mixins: "jmix:skipESIndexation")
      mutateProperty(name: "skipIndexationString") {
        setValue(type: STRING, value: "current-subtree")
      }
    }
  }
}

This mutation can also be executed on the LIVE workspace.

Reindexing specific content

It is possible to use a GraphQL mutation to trigger the reindexing of specific nodes:

mutation indexNode($nodePaths: [String!], $workspace: Workspace, $inclDescendants: Boolean = false) {
    admin {
        search {
            startNodeIndex(nodePaths:$nodePaths, inclDescendants: $inclDescendants, workspace: $workspace) {
                jobs {
                    id
                    status
                }
            }
        }
    }
}

Use the following variables:

{
    "nodePaths": ["<pagePathA>"],
    "workspace": "LIVE",
    "inclDescendants": true
}

Set the inclDescendants parameter to true to also reindex the child nodes of the nodes listed in nodePaths, or to false to reindex only the given nodes. Note that when a page is reindexed with inclDescendants set to true, the content of the page is reindexed, but its subpages are not.

The node will be indexed in all the languages of the site.