Search relevance and boosting

November 14, 2023

How relevance works

By default, search results are ordered by relevance. Several factors contribute to the relevance score: for instance, whether the search terms appear in the page title or in the metadata, or whether a page contains not the exact search terms but other words sharing the same stem. You can configure the weight of these factors using boosts, so that the relevance calculation Elasticsearch performs on each query is adapted to your dataset. You can also use Function Score to dynamically adjust result scores at query time.

To better adjust the boost configuration to your dataset, or to fine-tune Function Score queries, it helps to understand how Elasticsearch calculates scores. Several aspects of your dataset affect the score calculation, such as the number of indexed documents (pages, main resource contents and files), their length, and the number of documents matching the searched terms. We therefore advise reviewing your boost configuration regularly, as a growing index may require updating it.

Several articles available online describe the Elasticsearch scoring algorithm in more depth (for example, The BM25 Algorithm and Understanding Similarity Scoring in Elasticsearch).
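
To illustrate why index size and document length matter, here is a minimal sketch of the textbook BM25 formula for a single search term. It is a simplification, not the exact Lucene implementation used by Elasticsearch, which differs in some details. The function name and the sample numbers are hypothetical; k1=1.2 and b=0.75 are the usual BM25 defaults.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term,
                    k1=1.2, b=0.75):
    """Score contribution of one search term for one document (textbook BM25)."""
    # Rare terms (few matching documents in a large index) get a higher weight.
    idf = math.log(1 + (n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))
    # Documents longer than average are penalized, shorter ones are favored.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same document and same term, but the index grows from 1,000 to 100,000 documents.
small_index = bm25_term_score(tf=2, doc_len=300, avg_doc_len=300,
                              n_docs=1_000, n_docs_with_term=50)
large_index = bm25_term_score(tf=2, doc_len=300, avg_doc_len=300,
                              n_docs=100_000, n_docs_with_term=50)

# Same term frequency, but in a document ten times longer than average.
long_doc = bm25_term_score(tf=2, doc_len=3_000, avg_doc_len=300,
                           n_docs=1_000, n_docs_with_term=50)

print(small_index, large_index, long_doc)
```

In this sketch the score of an unchanged document rises as the index grows, and a long document scores lower than a short one with the same number of matches. This is why boosts tuned on a small index can need revisiting later.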

Understanding the indexation

Each page, or content indexed as a main resource (see Indexing Pages), is split into the following three fields in the index:

  • Main
    Indexes the displayable name of the content, usually the title, or alternatively the first 128 characters for rich-text content.
  • Content
    Aggregates all full-text properties into one field to provide an efficient full-text search. 
  • Metadata
    Indexes the categories, tags and keywords that are set on each content. 

Each of these fields is analyzed and stored in the following subfields to provide the best search relevance out of the box: 

  • Stemming
    Takes the searched term and tries to match it against the stem (for example developer > develop). This subfield applies to all words in your searched term.
  • Ngram
    Edge Ngram analyzes each word and emits a token for each leading group of letters within the configured length limits (for example, wolf -> [w, wo, wol, wolf]). This subfield is mainly used while the visitor is still typing.
  • Phrase
    Matches the searched terms against the indexed content as a phrase: when the terms are found in the indexed content, the order of the words affects the score.
  • Exact match
    Checks the exact match between the searched term and the indexed content. Exact match has a lot of weight. 

Ngrams configuration

Edge Ngram analyzes each word and emits a token for each leading group of letters between min_gram and max_gram characters long (for example, wolf -> [w, wo, wol, wolf]). This subfield is mainly used while the visitor is still typing.

Ngrams are configured on the Elasticsearch side; see the Configuring Elasticsearch page to learn more.

The Ngram configuration can be done in the tokenizer section of the settings.json file:


"tokenizer": {
 ...
 "main_tokenizer": {
   "type": "edge_ngram",
   "min_gram": 1,
   "max_gram": 12,
   "token_chars": [
     "letter",
     "digit"
   ]
 },
 "metadata_tokenizer": {
   "type": "edge_ngram",
   "min_gram": 1,
   "max_gram": 12,
   "token_chars": [
     "letter",
     "digit"
   ]
 }
}

Here you can tune the min_gram and max_gram properties. 

  • min_gram
    Specifies when “instant search” applies to the searches your users perform. A value of 1 means that users get results on the first keystroke. A value of 3 means that results display once they have typed at least 3 characters.
  • max_gram
    Determines the maximum length of the groups of letters, 12 in the configuration above. The right value depends on your dataset, the complexity of your vocabulary, and the languages you are going to index. For example, some languages, such as German, tend to compound words together.

Note: The max_gram property has a significant impact on the size of your index. With a max_gram of 12, each word generates up to 12 tokens, ranging from 1 to 12 characters in length.
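
The effect of min_gram and max_gram can be sketched in a few lines of Python. This is a simplified illustration for a single word, not the Elasticsearch tokenizer itself (which also splits text on characters outside token_chars); the edge_ngrams function name is hypothetical.

```python
def edge_ngrams(word, min_gram=1, max_gram=12):
    """Return the leading substrings of `word`, from min_gram to max_gram characters."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("wolf"))              # ['w', 'wo', 'wol', 'wolf']
print(edge_ngrams("wolf", min_gram=3))  # ['wol', 'wolf']

# A long compound word is capped at max_gram tokens.
print(len(edge_ngrams("Donaudampfschifffahrt")))  # 12
```

With min_gram set to 3, no token exists for "w" or "wo", so a visitor typing only those letters gets no ngram match. Raising max_gram adds one more token per sufficiently long word, which is where the index size grows.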

Available boosts

Pages and content

You can configure the boost for each of the fields and subfields described above; in other words, you can adjust the weight of each field and subfield relative to the others. For instance, if you have tagged your content, you may want to boost the Metadata field so that, when a visitor searches for a tag name, content carrying that tag gets a higher score and appears higher in the search results.

The various boosts can be configured in the Augmented Search configuration file (see Editing the configuration file to learn how to edit it).

Boost                                                           Default value
org.jahia.modules.augmentedsearch.field.main.boost              2.0
org.jahia.modules.augmentedsearch.field.main.stem.boost         1.0
org.jahia.modules.augmentedsearch.field.main.phrase.boost       2.5
org.jahia.modules.augmentedsearch.field.main.content.boost      2.0
org.jahia.modules.augmentedsearch.field.main.ngram.boost        1.0
org.jahia.modules.augmentedsearch.field.content.boost           2.0
org.jahia.modules.augmentedsearch.field.content.stem.boost      1.0
org.jahia.modules.augmentedsearch.field.content.phrase.boost    2.5
org.jahia.modules.augmentedsearch.field.content.content.boost   2.0
org.jahia.modules.augmentedsearch.field.content.ngram.boost     1.0
org.jahia.modules.augmentedsearch.field.metadata.boost          2.0
org.jahia.modules.augmentedsearch.field.metadata.stem.boost     1.0
org.jahia.modules.augmentedsearch.field.metadata.phrase.boost   2.5
org.jahia.modules.augmentedsearch.field.metadata.content.boost  2.0
org.jahia.modules.augmentedsearch.field.metadata.ngram.boost    1.0

Files

You can adjust the weight of files in the search results by updating the file boost values.

  • If files appear too low in the search results, increase the file boost values.
  • If files appear too high, decrease the file boost values, or increase the content ones.

Boost                                                               Default value
org.jahia.modules.augmentedsearch.field.file_content.boost          1.5
org.jahia.modules.augmentedsearch.field.file_content.content.boost  1.0
org.jahia.modules.augmentedsearch.field.file_content.phrase.boost   2.5

Function Score

Function Score can be used to modify the score of the pages/contents/files retrieved by a query, at query time and without modifying the index. For instance, it can be used to:

  • boost contents of a given type
  • boost a specific content
  • boost specific sections of a site
  • make more recently published content more relevant
  • implement a personalized search
  • etc.

Implementations of these examples can be found in the Function Score Example GitHub repository.

Learn more about Function Score, and its capabilities, directly in the Elasticsearch Function Score documentation.

Example

The following contentTypeBoost function score boosts files (jnt:file) and news (jnt:news) over the other contents returned by the search query. File scores are multiplied by 10, while news scores are multiplied by 5.

{
  "contentTypeBoost": {
    "boost": "1",
    "functions": [
      {
        "filter": {
          "match": {
            "jgql:nodeType": "jnt:file"
          }
        },
        "weight": 10
      },
      {
        "filter": {
          "match": {
            "jgql:nodeType": "jnt:news"
          }
        },
        "weight": 5
      }
    ],
    "score_mode": "first",
    "boost_mode": "multiply",
    "min_score": 1
  }
}

It is then referenced in the search query with the functionScoreId parameter: 

     query{
       search(
         q: "searched terms"
         language: "en"
         searchIn: [CONTENT, FILES]
         workspace: LIVE
         functionScoreId: "contentTypeBoost"
       ) {
         results {
           totalHits
           hits {
             displayableName
             nodeType
             path
             score
           }
         }
       }
     }

Calling functionScoreId:"contentTypeBoost" will apply the function named contentTypeBoost to calculate the score for each document. The list of function score IDs can be retrieved through a GraphQL query (see below).
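
To make the effect of the contentTypeBoost definition concrete, the following Python sketch mimics how its settings combine, based on the documented Elasticsearch function_score semantics: with score_mode "first" only the first function whose filter matches applies, with boost_mode "multiply" its weight multiplies the query score, and min_score drops results below the threshold. It is an illustration only. The function name is hypothetical and the match filter is reduced to a plain equality test.

```python
def apply_content_type_boost(query_score, node_type, min_score=1.0):
    """Return the final score of a hit, or None if min_score excludes it."""
    # (filter value, weight) pairs, in the order they are declared.
    functions = [("jnt:file", 10), ("jnt:news", 5)]

    # score_mode "first": use the first function whose filter matches.
    # A hit matching no filter keeps a neutral factor of 1.
    factor = 1.0
    for filter_type, weight in functions:
        if node_type == filter_type:
            factor = weight
            break

    # boost_mode "multiply": the query score is multiplied by the function score.
    final_score = query_score * factor

    # min_score: hits scoring below the threshold are excluded from the results.
    return final_score if final_score >= min_score else None


print(apply_content_type_boost(2.0, "jnt:file"))  # 20.0
print(apply_content_type_boost(2.0, "jnt:news"))  # 10.0
print(apply_content_type_boost(2.0, "jnt:page"))  # 2.0
print(apply_content_type_boost(0.5, "jnt:page"))  # None
```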

Declaring a Function Score

Functions cannot be dynamically created, and must be registered first. 

Using a module

Functions can be declared in an augmented-search-score-functions.json file located in the src/main/resources/META-INF/ folder of a module.

An example is available in the Function Score Examples GitHub repository.

Using the GraphQL API

Functions can also be registered using the GraphQL API:

mutation {
  admin {
    search{
      functionScore{
        uploadFunction(id:"contentTypeBoost", file:file)
      }
    }
  }
}

The uploadFunction mutation must be sent as a multipart/form-data request. For example, with cURL:

curl --location --request POST 'http://localhost:8080/ctx/modules/graphql' \
--header 'Origin: http://localhost:8080' \
--header 'Authorization: Basic cm9vdDpyb290MTIzNA==' \
--header 'Cookie: JSESSIONID=6E00F0FDD1C9C5C0F280E5CC2DF9AB57' \
--form 'query="[{\"operationName\":\"create\",\"query\":\"mutation create { admin { search { functionScore { uploadFunction(id:\\\"contentTypeBoost\\\",description:\\\"Boost by content type\\\", file:\\\"file\\\") } } } }\"}]"' \
--form 'file=@"/tmp/contentTypeBoost.json";type=application/json'
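
The nested escaping in the query form field is easy to get wrong by hand. As a sketch, the same value can be generated with Python's json module; the build_upload_query_field helper is hypothetical, and it assumes the function ID and description are simple strings for which JSON and GraphQL string escaping coincide.

```python
import json


def build_upload_query_field(function_id, description):
    """Build the value of the `query` form field for the uploadFunction mutation."""
    mutation = (
        "mutation create { admin { search { functionScore { "
        f"uploadFunction(id: {json.dumps(function_id)}, "
        f'description: {json.dumps(description)}, file: "file") '
        "} } } }"
    )
    # The endpoint receives a JSON array of operations, as in the cURL example.
    return json.dumps([{"operationName": "create", "query": mutation}])


field = build_upload_query_field("contentTypeBoost", "Boost by content type")
print(field)
```

The resulting string can then be sent as the query part of a multipart/form-data request, alongside a file part containing the function definition, with the HTTP client of your choice.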

List all the functions

The ID of the functions deployed on your Jahia environment can be retrieved with the following GraphQL query:

query {
  admin {
    search{
      functionScore{
        functions {
          nodes {
            id
          }
        }
      }
    }
  }
}

Removing a function

Functions can be removed from the system with the following mutation:

mutation {
  admin {
    search{
      functionScore{
        removeFunction(id:"contentTypeBoost")
      }
    }
  }
}

Increasing or decreasing the relevance of a specific content

The relevance of a specific content can be increased or decreased using Function Score. An example implementation is available in the Function Score Example GitHub repository.

Increasing or decreasing the relevance of contents of a certain type

The relevance of a given content type can be increased or decreased using Function Score. An example implementation is available in the Function Score Example GitHub repository.

Increasing the relevance of the most recent contents

Contents can be made more or less relevant based on their last publication date using Function Score. An example implementation is available in the Function Score Example GitHub repository.
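
One common way to favor recent content in Elasticsearch is a decay function inside a function score. Whether the example repository uses this approach is an assumption. The sketch below reproduces the documented gauss decay formula in Python to show how the multiplier falls off with age; the function name, the 30-day scale, and the idea of measuring age in days since last publication are hypothetical choices for illustration.

```python
import math


def gauss_decay(age_days, scale_days=30.0, offset_days=0.0, decay=0.5):
    """Multiplier between 0 and 1 that shrinks as content gets older.

    At an age of offset_days + scale_days the multiplier equals `decay`.
    """
    sigma_sq = -(scale_days ** 2) / (2 * math.log(decay))
    # Content younger than the offset is not penalized at all.
    distance = max(0.0, abs(age_days) - offset_days)
    return math.exp(-(distance ** 2) / (2 * sigma_sq))


for age in (0, 15, 30, 60):
    print(age, round(gauss_decay(age), 3))
```

Content published today keeps its full score, content published 30 days ago keeps half of it, and older content falls off progressively faster.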

FAQ

Can I boost a custom property of a content type in the definitions.cnd file?

No, it’s not possible to boost custom fields.

Can I set negative boosts?

This is not possible with Augmented Search.

Can I define boosts by website?

No, indexation is done at the platform level and all sites are affected.

How is the fuzzy match configured?

By default, the fuzzy matching starts at the 4th character. Also, it can permute one letter, starting at the 3rd character. The first 2 letters need to be exact.