Search relevance and boosting
How relevance works
Search results are ordered by relevance by default. Several factors contribute to the relevance score, for instance whether the search terms appear in the page title or in the metadata, or whether a page contains not the exact search terms but other words sharing the same stem. You can configure the weight of these factors using boosts, so that the relevance calculation performed by Elasticsearch on each query is adapted to your data set. You can also use Function Score to dynamically adjust result scores at query time.
Several articles available online describe the Elasticsearch scoring algorithm in more depth (e.g. The BM25 Algorithm, Understanding Similarity Scoring in Elasticsearch).
Understanding the indexation
Each page, or content indexed as a main resource (see Indexing Pages), is split into the following 3 fields in the index:
- Main
Indexes the displayable name of the content, usually the title or, for rich text, the first 128 characters.
- Content
Aggregates all full-text properties into one field to provide an efficient full-text search.
- Metadata
Indexes the categories, tags and keywords that are set on each content.
Each of these fields is analyzed and stored in the following subfields to provide the best search relevance out of the box:
- Stemming
Takes the searched term and tries to match it against the stem (for example developer > develop). This subfield applies to all words in your searched term.
- Ngram
Edge Ngram analyzes each word and emits a token for each group of leading letters within the configured length limits (ex: wolf -> [w, wo, wol, wolf]). This subfield is mainly used when the visitor starts typing words.
- Phrase
Matches the searched terms against the indexed content as a phrase. When the searched terms match the indexed content, the order of the words affects the score.
- Exact match
Checks for an exact match between the searched term and the indexed content. Exact match has a lot of weight.
Ngrams configuration
Edge Ngram analyzes each word and emits a token for each group of leading letters within the configured length limits (ex: wolf -> [w, wo, wol, wolf]). This subfield is mainly used when the visitor starts typing words.
Ngrams are configured on the Elasticsearch side. See the Configuring Elasticsearch page to learn more.
The Ngram configuration is set in the tokenizer section of the settings.json file:
"tokenizer": {
  ...
  "main_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 12,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "metadata_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 12,
    "token_chars": [
      "letter",
      "digit"
    ]
  }
}
Here you can tune the min_gram and max_gram properties.
- min_gram
Specifies when “instant search” applies to the searches that your users perform. A value of 1 means that users get results on the first keystroke. A value of 3 means that results display once they have typed at least 3 characters.
- max_gram
Determines the maximum length of the groups of letters, by default up to 12 letters. The right value depends on your dataset, the complexity of your vocabulary, and the languages you are going to index. For example, some languages like German tend to compound words together, which produces longer words.
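To make the effect of min_gram and max_gram concrete, here is a minimal Python sketch of edge n-gram generation for a single word. The `edge_ngrams` function is a hypothetical helper written for illustration; it is not the Elasticsearch tokenizer, and it ignores details such as token_chars splitting and any lowercasing applied by the analyzer.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 12) -> list[str]:
    """Return the prefixes of `word` whose length lies between min_gram and max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


# With min_gram = 1, a token exists for the very first typed letter.
print(edge_ngrams("wolf"))              # ['w', 'wo', 'wol', 'wolf']

# With min_gram = 3, nothing can match before the third letter is typed.
print(edge_ngrams("wolf", min_gram=3))  # ['wol', 'wolf']

# With max_gram = 12, long compound words are only indexed up to 12 letters.
print(edge_ngrams("Krankenversicherung")[-1])  # 'Krankenversi'
```

The last call shows why max_gram matters for languages with long compound words: prefixes longer than max_gram produce no token in this subfield.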
Available boosts
Pages and content
You can configure the boost of each of the fields and subfields described above. In other words, you can adjust the weight and importance of each field and subfield compared to the others. For instance, if you have tagged your content, you may want to boost the Metadata field so that, when a visitor searches for a tag name, contents carrying that tag get an even higher score and appear higher in the search results.
The various boosts can be configured in the Augmented Search configuration file (see Editing the configuration file to learn how to edit it).
Boost | Description | Default value |
---|---|---|
org.jahia.modules.augmentedsearch.field.main.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.main.stem.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.main.phrase.boost | | 2.5 |
org.jahia.modules.augmentedsearch.field.main.content.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.main.ngram.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.content.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.content.stem.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.content.phrase.boost | | 2.5 |
org.jahia.modules.augmentedsearch.field.content.content.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.content.ngram.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.metdata.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.metdata.stem.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.metdata.phrase.boost | | 2.5 |
org.jahia.modules.augmentedsearch.field.metdata.content.boost | | 2.0 |
org.jahia.modules.augmentedsearch.field.metdata.ngram.boost | | 1.0 |
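To build intuition for how field boosts can change the ordering of results, here is a deliberately simplified Python model. It assumes that each field contributes a raw score multiplied by its boost and that the contributions are summed; the actual Elasticsearch query built by Augmented Search is more involved. The two documents, their raw scores, and the boost value of 4.0 are made up for illustration.

```python
# Simplified model (an assumption, not the actual Augmented Search query):
# final score = sum over fields of (raw field score * field boost).
DEFAULT_BOOSTS = {"main": 2.0, "content": 2.0, "metadata": 2.0}


def weighted_score(raw_scores: dict, boosts: dict) -> float:
    """Combine per-field raw scores using the given field boosts."""
    return sum(raw_scores.get(field, 0.0) * boost for field, boost in boosts.items())


# Hypothetical raw scores for two documents matching the same query.
DOCS = {
    "tagged_page": {"metadata": 1.0},  # matches only through one of its tags
    "body_page": {"content": 1.2},     # matches only in its body text
}


def rank(boosts: dict) -> list:
    """Return document names ordered from highest to lowest weighted score."""
    return sorted(DOCS, key=lambda name: weighted_score(DOCS[name], boosts), reverse=True)


print(rank(DEFAULT_BOOSTS))                       # ['body_page', 'tagged_page']
print(rank({**DEFAULT_BOOSTS, "metadata": 4.0}))  # ['tagged_page', 'body_page']
```

In this toy model, raising the Metadata boost is enough to move the tagged page above the page that only matches in its body text.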
Files
You can adjust the weight of files in the search results by updating the file boost values.
- if files appear too low in the search results, increase the file boost values
- if files appear too high, decrease the file boost values, or increase the content ones
Boost | Description | Default value |
---|---|---|
org.jahia.modules.augmentedsearch.field.file_content.boost | | 1.5 |
org.jahia.modules.augmentedsearch.field.file_content.content.boost | | 1.0 |
org.jahia.modules.augmentedsearch.field.file_content.phrase.boost | | 2.5 |
Function Score
Function Score can be used to modify the score of the pages/contents/files retrieved by a query, at query time and without modifying the index. For instance, it can be used to:
- boost contents of a given type
- boost a specific content
- boost specific sections of a site
- make more recently published content more relevant
- implement a personalized search
- etc.
An implementation of these examples can be found in the Function Score Example GitHub repository.
Learn more about Function Score and its capabilities in the Elasticsearch Function Score documentation.
Example
The following contentTypeBoost function score boosts files (jnt:file) and news (jnt:news) over the other contents returned by the search query. File scores are multiplied by 10, while news scores are multiplied by 5.
{
  "contentTypeBoost": {
    "boost": "1",
    "functions": [
      {
        "filter": {
          "match": {
            "jgql:nodeType": "jnt:file"
          }
        },
        "weight": 10
      },
      {
        "filter": {
          "match": {
            "jgql:nodeType": "jnt:news"
          }
        },
        "weight": 5
      }
    ],
    "score_mode": "first",
    "boost_mode": "multiply",
    "min_score": 1
  }
}
The function score is then referenced in the search query with the functionScoreId parameter:
query {
  search(
    q: "searched terms"
    language: "en"
    searchIn: [CONTENT, FILES]
    workspace: LIVE
    functionScoreId: "contentTypeBoost"
  ) {
    results {
      totalHits
      hits {
        displayableName
        nodeType
        path
        score
      }
    }
  }
}
Passing functionScoreId: "contentTypeBoost" applies the function named contentTypeBoost when calculating the score of each document. The list of function score IDs can be retrieved through a GraphQL query (see List all the functions).
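The effect of the contentTypeBoost definition on individual scores can be sketched in Python as follows. This is an illustration of the score_mode, boost_mode and min_score settings as described in the Elasticsearch Function Score documentation, not code from Augmented Search; the `boosted_score` helper and the sample query scores are hypothetical.

```python
# Mirrors the contentTypeBoost definition: (node type, weight) pairs, in order.
FUNCTIONS = [("jnt:file", 10.0), ("jnt:news", 5.0)]
MIN_SCORE = 1.0


def boosted_score(query_score: float, node_type: str):
    """Return the final score of a hit, or None if it falls below min_score."""
    # score_mode "first": use the first function whose filter matches.
    # A document matching no filter gets a neutral factor of 1.
    factor = next((weight for ntype, weight in FUNCTIONS if ntype == node_type), 1.0)
    # boost_mode "multiply": the query score is multiplied by the function score.
    final = query_score * factor
    # min_score: documents scoring below the threshold are excluded.
    return final if final >= MIN_SCORE else None


print(boosted_score(2.0, "jnt:file"))  # 20.0
print(boosted_score(2.0, "jnt:news"))  # 10.0
print(boosted_score(2.0, "jnt:page"))  # 2.0
print(boosted_score(0.5, "jnt:page"))  # None
```

Note that with "min_score": 1, a hit whose final score stays below 1 is dropped from the results rather than merely ranked lower.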
Declaring a Function Score
Functions cannot be dynamically created, and must be registered first.
Using a module
Functions can be declared in an augmented-search-score-functions.json file, located under the src/main/resources/META-INF/ folder of a module.
An example is available in the Function Score Examples GitHub repository.
Using the GraphQL API
Functions can also be registered using the GraphQL API:
mutation {
  admin {
    search {
      functionScore {
        uploadFunction(id:"contentTypeBoost", file:file)
      }
    }
  }
}
The uploadFunction mutation must be sent as a multipart/form-data request, for example with cURL:
curl --location --request POST 'http://localhost:8080/ctx/modules/graphql' \
--header 'Origin: http://localhost:8080' \
--header 'Authorization: Basic cm9vdDpyb290MTIzNA==' \
--header 'Cookie: JSESSIONID=6E00F0FDD1C9C5C0F280E5CC2DF9AB57' \
--form 'query="[{\"operationName\":\"create\",\"query\":\"mutation create { admin { search { functionScore { uploadFunction(id:\\\"contentTypeBoost\\\",description:\\\"Boost by content type\\\", file:\\\"file\\\") } } } }\"}]"' \
--form 'file=@"/tmp/contentTypeBoost.json";type=application/json'
List all the functions
The IDs of the functions deployed on your Jahia environment can be retrieved with the following GraphQL query:
query {
  admin {
    search {
      functionScore {
        functions {
          nodes {
            id
          }
        }
      }
    }
  }
}
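As a small illustration of how a client might consume the result, the Python sketch below extracts the IDs from a response body. The sample response is hypothetical (including the recencyBoost ID); its shape is assumed to mirror the query, as GraphQL responses do, wrapped in the standard top-level data key.

```python
import json

# Hypothetical response body, shaped like the query above.
RESPONSE_BODY = """
{"data": {"admin": {"search": {"functionScore": {"functions": {"nodes": [
  {"id": "contentTypeBoost"},
  {"id": "recencyBoost"}
]}}}}}}
"""


def function_ids(body: str) -> list[str]:
    """Return the function score IDs listed in a GraphQL response body."""
    data = json.loads(body)["data"]
    nodes = data["admin"]["search"]["functionScore"]["functions"]["nodes"]
    return [node["id"] for node in nodes]


print(function_ids(RESPONSE_BODY))  # ['contentTypeBoost', 'recencyBoost']
```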
Removing a function
Functions can be removed from the system with the following mutation:
mutation {
  admin {
    search {
      functionScore {
        removeFunction(id:"contentTypeBoost")
      }
    }
  }
}
Increasing or decreasing the relevance of a specific content
You can increase or decrease the relevance of a specific content item using Function Score. An example implementation is available in the Function Score Example GitHub repository.
Increasing or decreasing the relevance of contents of a certain type
You can increase or decrease the relevance of all contents of a certain type using Function Score. An example implementation is available in the Function Score Example GitHub repository.
Increasing the relevance of the most recent contents
You can make contents more or less relevant based on their last publication date using Function Score. An example implementation is available in the Function Score Example GitHub repository.
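One way Elasticsearch supports this kind of recency boost is through decay functions such as gauss. The Python sketch below reproduces the gauss decay formula from the Elasticsearch Function Score documentation so you can see how a score multiplier shrinks with age. Whether the example repository uses this exact approach is not stated here, and the 30-day scale, zero offset and 0.5 decay are assumed values chosen for illustration.

```python
import math


def gauss_decay(age_days: float, scale_days: float = 30.0,
                offset_days: float = 0.0, decay: float = 0.5) -> float:
    """Gaussian decay multiplier for a document published `age_days` ago.

    The multiplier is 1.0 within `offset_days` of now and equals `decay`
    when the document is `scale_days` beyond the offset.
    """
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(age_days) - offset_days)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


print(gauss_decay(0))             # 1.0
print(round(gauss_decay(30), 3))  # 0.5
print(round(gauss_decay(60), 2))  # 0.06
```

With these assumed parameters, a content item published 30 days ago keeps half of the multiplier, and one published 60 days ago keeps only about 6% of it.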
FAQ
Can I boost a custom property of a content type in the definitions.cnd file?
No, boosting custom fields is not possible with Augmented Search.
Can I define boosts by website?
No. Indexation is done at the platform level, so boosts apply to all sites.
How is the fuzzy match configured?
By default, fuzzy matching starts at the 4th character. It can also permute one letter, starting at the 3rd character. The first 2 letters must match exactly.