Indexing FAQs

November 14, 2023

Elasticsearch and indexing

Infrastructure

Can I use the same Elasticsearch cluster for different Jahia platforms?

Indexation

Should I use content-based search or page-based search? Can I combine page-based search with content-based search?

Can I index content that is not displayed in full page?

Can I boost a custom property of a content type in the definitions.cnd file?

Can I exclude some content from being indexed, like the Google no-index option?

Can I add a negative boost on some content?

Can I set a negative boost for some content types?

Can I define boosts by website?

Can I configure synonyms?

What analyzer does Augmented Search use?

Can I use different analyzers?

Which language is used for lemmatization?

How can I display the facet for a custom field? 

How can I display categories facets?

Can I customize Elasticsearch settings and mappings?

How can I customize Ngrams?

What are the default boosts?

How is the fuzzy match configured?

How is content indexed in Augmented Search?

Definitions.cnd and Elasticsearch configurations

What parameters can I set in the definitions.cnd file to modify the Augmented Search indexation (for example, nofulltext, indexed=no, boost, analyzer=keyword)?

Legacy components

What are the legacy Jahia search-related components that continue to work in an Augmented Search setup (for example, glossary or pager)? If some are not working anymore what are the alternative ones?

Does the default behavior display all matching results or can I set a default limit?

Upgrading from JCR search

Can I have one site running with JCR search and another one with Augmented Search?

External Data Provider (EDP)

How can I index content from the External Data Provider? 

Is it possible to have search results including contents coming from the External Data Provider and from the JCR?

Answers

Can I use the same Elasticsearch cluster for different Jahia platforms?

Yes, you can add a prefix for each index, so one prefix per platform.

Should I use content-based search or page-based search? Can I combine page-based search with content-based search?

Yes, you can combine them. Depending on your site and business requirements, you can configure one part of your website with page-based search by using a filter on the path, and then index the rest of the website using content-based search.

Can I index content that is not displayed in full page?

Yes. Content is indexed by node type and sub type.

Can I boost a custom property of a content type in the definitions.cnd file?

No, it’s not possible to boost custom fields.

Can I exclude some content from being indexed, like the Google no-index option?

You can exclude content from indexing with the Remove From Augmented Search Results mixin. For more information, see Excluding a document.

Can I add a negative boost on some content?

This is not possible with Augmented Search.

Can I set a negative boost for some content types?

This is not possible with Augmented Search.

Can I define boosts by website?

No, indexation is done at the platform level and all sites are affected.

Can I configure synonyms?

Yes. You can configure synonyms using standard Elasticsearch configuration.
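
As an illustration of what standard Elasticsearch synonym configuration looks like, the following sketch builds an analysis fragment that uses the Elasticsearch synonym token filter. The filter name my_synonyms, the analyzer name my_synonym_analyzer, and the sample rules are hypothetical. How such a fragment fits into the settings file used by Augmented Search is not covered here, so treat it as a starting point rather than a drop-in configuration.

```python
import json


def build_synonym_analysis(synonym_rules):
    """Return an Elasticsearch 'analysis' fragment that declares a synonym filter.

    The names 'my_synonyms' and 'my_synonym_analyzer' are hypothetical and
    chosen for illustration only.
    """
    return {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    # Rules use the Solr synonym format supported by Elasticsearch.
                    "synonyms": list(synonym_rules),
                }
            },
            "analyzer": {
                "my_synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the rules match regardless of case.
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    }


if __name__ == "__main__":
    fragment = build_synonym_analysis(["laptop, notebook", "tv => television"])
    print(json.dumps(fragment, indent=2))
```

The first rule treats both terms as equivalent, while the second rule rewrites tv to television.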

What analyzer does Augmented Search use?

Each language uses its own index and dedicated analyzer.

Can I use different analyzers?

Yes. You can configure analyzers and stemmers by modifying the OSGi properties in the Augmented Search module. You can do this in the configuration file, in the Karaf console, or in Jahia Tools.

Which language is used for lemmatization?

The indexation process does not use lemmatization by default, as Elasticsearch and Lucene only provide stemming out-of-the-box.

How can I display the facet for a custom field? 

In jgql:nodes.audiences.keywords, add your field to the Elasticsearch mapping. If your field has a namespace, surround the namespace and field in quotes.

Can I customize Elasticsearch settings and mappings?

Yes. To customize Elasticsearch settings and mappings:

  1. Copy the embedded mapping.json and settings.json files from META-INF/configurations in the augmented-search modules to a location that your Jahia instance can reference.
  2. Update the configuration file to reflect the new paths to the files.
  3. Set the location properties. There is one property for the settings and one for the mapping, and each can be specified for both content and files, which gives the following four properties.
    
    org.jahia.modules.augmentedsearch.content.settingsFileLocation
    org.jahia.modules.augmentedsearch.file.settingsFileLocation
    org.jahia.modules.augmentedsearch.content.mappingFileLocation
    org.jahia.modules.augmentedsearch.file.mappingFileLocation
    # Example:
    # org.jahia.modules.augmentedsearch.content.settingsFileLocation = /opt/jahia/elasticsearch/settings.json

How can I customize Ngrams?

First, copy the embedded configuration files. Once you have copied the JSON files, edit the settings.json file. Locate the tokenizer definition at the end of the file.


"tokenizer": {
 ...
 "main_tokenizer": {
   "type": "edge_ngram",
   "min_gram": 1,
   "max_gram": 12,
   "token_chars": [
     "letter",
     "digit"
   ]
 },
 "metadata_tokenizer": {
   "type": "edge_ngram",
   "min_gram": 1,
   "max_gram": 12,
   "token_chars": [
     "letter",
     "digit"
   ]
 }
}

Here you can tune the min_gram and max_gram properties. 

  • min_gram
    Specifies when “instant search” applies to searches that your users perform. A value of 1 means that users get results on the first keyboard stroke. A value of 3 means results display when they type at least 3 characters.
  • max_gram
    Determines the maximum length of the groups of letters, by default up to 12 letters. This value depends on your dataset, the complexity of your vocabulary, and the different languages you are going to index. For example, some languages like German tend to compound words together.
Note: The max_gram property has a significant impact on the size of your index. With the default values, each word generates up to 12 tokens, ranging from 1 to 12 characters in length.
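
To make the effect of min_gram and max_gram more concrete, here is a minimal sketch that imitates edge n-gram tokenization in plain Python. It is a simplification and not the Elasticsearch implementation: it splits on whitespace only, and the function names edge_ngrams and count_tokens are hypothetical.

```python
def edge_ngrams(word, min_gram=1, max_gram=12):
    """Return the prefixes of `word` with a length between min_gram and max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


def count_tokens(text, min_gram=1, max_gram=12):
    """Count the edge n-gram tokens emitted for each whitespace-separated word."""
    return sum(len(edge_ngrams(w, min_gram, max_gram)) for w in text.split())


if __name__ == "__main__":
    print(edge_ngrams("wolf"))               # prefixes from 1 to 4 characters
    print(edge_ngrams("wolf", min_gram=3))   # nothing shorter than 3 characters
    print(count_tokens("internationalization of content"))
    print(count_tokens("internationalization of content", max_gram=6))
```

Comparing the last two counts gives a rough idea of how lowering max_gram reduces the number of tokens stored per word, at the cost of no longer matching long prefixes.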

What are the default boosts?

Boost settings are applied by default to the jgql:main, jgql:metadata, and jgql:content fields.


#
# Boost settings for fields: jgql:main, jgql:metadata and jgql:content
#
org.jahia.modules.augmentedsearch.field.main.boost = 2.0
org.jahia.modules.augmentedsearch.field.metadata.boost = 1.5
org.jahia.modules.augmentedsearch.field.content.boost = 1.5
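
The sketch below illustrates the relative effect of these values, assuming a boost simply multiplies the relevance score that a field would otherwise contribute. The raw scores are made-up inputs, the function name apply_boosts is hypothetical, and the way Elasticsearch combines the scores of several fields into a final document score is deliberately not modeled.

```python
# Default boost values copied from the properties above.
DEFAULT_BOOSTS = {"main": 2.0, "metadata": 1.5, "content": 1.5}


def apply_boosts(raw_scores, boosts=DEFAULT_BOOSTS):
    """Multiply each field's raw score by its boost (1.0 if the field has none)."""
    return {field: score * boosts.get(field, 1.0) for field, score in raw_scores.items()}


if __name__ == "__main__":
    # Hypothetical case: the same term matches equally well in the title and in the body.
    print(apply_boosts({"main": 1.0, "content": 1.0}))
```

With equal raw scores, a match in the main field weighs more than the same match in the content field.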


How is the fuzzy match configured?

By default, fuzzy matching starts at the 4th character. It can also permute one letter, starting at the 3rd character. The first 2 letters must match exactly.
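
The following sketch shows one possible reading of this behavior in plain Python. It is an interpretation for illustration and not the actual Elasticsearch or Augmented Search configuration. It assumes that terms shorter than 4 characters must match exactly, that the first 2 characters must be identical, and that the rest of the term tolerates a single edit (one inserted, deleted, replaced, or swapped letter). The function names and parameters are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: single-letter edits plus adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Two adjacent letters typed in the wrong order count as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def fuzzy_match(query, indexed, exact_prefix=2, min_length=4, max_edits=1):
    """Return True if `query` matches `indexed` under the assumed fuzzy rules."""
    if len(query) < min_length:
        return query == indexed  # assumed: short terms are not fuzzy-matched
    if query[:exact_prefix] != indexed[:exact_prefix]:
        return False  # the first letters must be exact
    return osa_distance(query[exact_prefix:], indexed[exact_prefix:]) <= max_edits


if __name__ == "__main__":
    print(fuzzy_match("serach", "search"))  # swap after the exact prefix
    print(fuzzy_match("esarch", "search"))  # swap inside the exact prefix
    print(fuzzy_match("cat", "car"))        # too short for fuzzy matching
```

Under these assumptions, a typo is tolerated only when it occurs after the second letter of a term that is at least 4 characters long.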

How is content indexed in Augmented Search?

All content is split into 3 fields:

  • Main
    Indexes the displayable name of the content, usually the title or, for rich text, the first 128 characters. By default, the weight = 2.
  • Metadata
    Indexes the categories, tags, and keywords that are set on each content item. By default, the weight = 1.5.
  • Content
    Aggregates all full-text properties into one field to provide an efficient full-text search. By default, the weight = 1.5.

Each of these fields is analyzed and stored in the following subfields to provide the best search relevance out of the box: 

  • Stemming
    Takes the searched term and tries to match it against the stem (for example, developer > develop). This subfield applies to all words in your searched term.
  • Ngram
    Edge Ngram analyzes each word and emits a token for each group of letters within the defined limit (1-12 by default, as set by min_gram and max_gram). For example, wolf -> [w, wo, wol, wolf]. This subfield is mainly used when the visitor starts typing words.
  • Phrase
    Matches the searched terms against the indexed content. If the searched terms have a match with the indexed content, then the order of the words has an impact.
  • Exact match
    Checks the exact match between the searched term and the indexed content. Exact match has a lot of weight. 

What parameters can I set in the definitions.cnd file to modify the Augmented Search indexation (for example, nofulltext, indexed=no, boost, analyzer=keyword)?

The query uses the main, content, and metadata fields, which do not take into account boost or analyzer. The properties that are not indexable are not indexed (indexed=no). The properties that are not full text are not copied in the field content and are not part of the query for search, but they can be used for filtering or faceting.

What are the legacy Jahia search-related components that continue to work in an Augmented Search setup (for example, glossary or pager)? If some are not working anymore what are the alternative ones?

No legacy Jahia search components will continue to work. Only the Augmented Search UI component uses the Search UI library from Elasticsearch. See the Elasticsearch documentation for components available for you to use with your search application.

  • SearchBox
  • Results
  • Result
  • ResultsPerPage
  • Facet
  • Sorting
  • Paging
  • PagingInfo
  • ErrorBoundary
  • Search results

Does the default behavior display all matching results or can I set a default limit?

The default limit is 10 results if nothing is specified in the GraphQL query.

Can I have one site running with JCR search and another one with Augmented Search?

Yes. Augmented Search is not based on a search provider, so JCR search is still available. You can add the Augmented Search UI on one site and not on another.

How can I index content from the External Data Provider? 

You can use the event API to index content from the External Data Provider. For more information, see Sending events to Jahia.

Is it possible to have search results including contents coming from the External Data Provider and from the JCR?

Yes. The search results are mixed, as if they came from the same content source (as opposed to JCR search today, where the JCR results are displayed before the EDP results).