Stemmers usage and configuration

November 14, 2023

This topic shows how to configure the way Augmented Search analyzes your content. Augmented Search uses Elasticsearch as its text analysis engine to perform full-text searches: stemmers reduce words to their stem, filters remove stop words from your documents, and together they make searches more relevant.

Analyzing text in Augmented Search

By default, Elasticsearch uses text analysis to analyze a field in two ways:

  • First, Elasticsearch detects whether the field is something other than a string (for example, a number or a date).
  • If the field contains a string, it is analyzed both as text and as a keyword.

The text analysis applies a default analyzer for the language. The analyzer is a mix of filters and tokenizers. Elasticsearch uses text analysis to perform full-text searches and returns search results with all relevant results rather than just exact matches. For a full overview of text analysis, see the Elasticsearch documentation.
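As a quick illustration, you can call the _analyze API with the default standard analyzer to see how text is tokenized (the host and port are assumptions based on the examples below; adjust them to your installation):

```shell
# The host and port are assumptions; adjust them to your installation.
curl -XGET "http://localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
  "analyzer": "standard",
  "text": "The QUICK brown foxes!"
}'
# The standard analyzer lowercases and splits on word boundaries,
# producing the tokens: the, quick, brown, foxes
```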

The three fields (main, content, metadata) used by Augmented Search (AS) during a search are analyzed in four different ways, each using a different set of tokenizers and filters. The four analyzers are phrase_shingle_analyzer, main_analyzer, text_base, and text_stem.

To see them in action after indexing, you can use a curl query to check what Elasticsearch stores and to get a better grasp of what will match a search:

curl -XGET "http://localhost:9200/dx__content__default__en__read/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
  "analyzer": "phrase_shingle_analyzer",
  "text": "\n  Tomatoes, like peaches, are one of the many fruits and vegetables that will continue to ripen after they'\''ve been picked. They won'\''t ripen quickly and they won'\''t usually ripen perfectly, but you can coax an underripe tomato to ripen at home. While tomatoes won'\''t ripen as well as peaches do, underripe tomatoes can definitely be improved upon. And best of all: it'\''s beyond easy and requires barely any equipment so if you find yourself with not-quite-ripe tomatoes, there'\''s no reason not to give it a shot.\n  "
}'

Here we are testing the phrase_shingle_analyzer. This will return an array of the tokens to be stored by this analyzer.


"tokens" : [
    {
      "token" : "tomatoes",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "tomatoes like",
      "start_offset" : 3,
      "end_offset" : 17,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "tomatoes like peaches",
      "start_offset" : 3,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "like",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "like peaches",
      "start_offset" : 13,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 2
    }...

Those tokens are the ones used to match your queries: the underlying search engine analyzes your query, splits it into the same kind of tokens, and matches them against the ones in the indices.
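To see this matching in practice, you can run the same kind of _analyze query on a short search phrase and compare the resulting tokens with the stored ones shown above. Based on the output above, the analyzer should produce overlapping tokens such as the shingle "like peaches":

```shell
# Analyze a search phrase with the same analyzer used at indexing time;
# matching happens where query tokens overlap with stored tokens.
curl -XGET "http://localhost:9200/dx__content__default__en__read/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
  "analyzer": "phrase_shingle_analyzer",
  "text": "like peaches"
}'
```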

How to verify your current settings and mappings

The following two queries are useful for understanding the settings and mappings of an Augmented Search index:


curl -XGET "http://localhost:9200/dx__content__default__en__read/_settings"
curl -XGET "http://localhost:9200/dx__content__default__en__read/_mapping"

The settings curl query shows you all the configuration applied to this index (filters, tokenizers, and analyzers) with all of their customizations.
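If you only want the analyzer names rather than the full settings payload, you can filter the response, for example with jq (assuming jq is installed; the path below follows the standard Elasticsearch _settings response layout):

```shell
# Assumption: jq is installed. The _settings response is keyed by index
# name, so .[] reaches into the settings of whichever index is returned.
curl -s -XGET "http://localhost:9200/dx__content__default__en__read/_settings" \
  | jq '.[].settings.index.analysis.analyzer | keys'
```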

For example, all four analyzers declare a filter named “i18n-stop-words-filter”. This filter differs based on the language of your index.


"i18n-stop-words-filter" : {
    "type" : "stop",
    "stopwords" : "_english_"
    },
"i18n-stem-filter" : {
    "name" : "english",
    "type" : "stemmer"
}

The "i18n-stem-filter" is used only by the text_stem analyzer. These two filters are based on Elasticsearch's standard token filters (stop and stemmer), customized for the indexed language.
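You can observe the effect of these filters by analyzing a short sentence with the text_stem analyzer (the sample text is hypothetical; the index name matches the earlier examples):

```shell
# With the english stemmer and stop-word filter applied, inflected forms
# such as "tomatoes" are typically reduced to a stem like "tomato", and
# stop words such as "on" and "the" are dropped.
curl -XGET "http://localhost:9200/dx__content__default__en__read/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
  "analyzer": "text_stem",
  "text": "Tomatoes ripening on the vine"
}'
```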

The mappings curl query shows you all the analyzers applied on each field. For example, the “jgql:content” field mapping shows all the analyzed subfields (ngram, phrase, and stem) with their respective analyzers:


"jgql:content" : {
          "type" : "text",
          "store" : true,
          "term_vector" : "with_positions_offsets",
          "index_options" : "offsets",
          "fields" : {
            "ngram" : {
              "type" : "text",
              "store" : true,
              "term_vector" : "with_positions_offsets",
              "index_options" : "offsets",
              "analyzer" : "main_analyzer",
              "search_analyzer" : "text_base"
            },
            "phrase" : {
              "type" : "text",
              "store" : true,
              "term_vector" : "with_positions_offsets",
              "index_options" : "offsets",
              "analyzer" : "phrase_shingle_analyzer",
              "position_increment_gap" : 100
            },
            "stem" : {
              "type" : "text",
              "store" : true,
              "term_vector" : "with_positions_offsets",
              "index_options" : "offsets",
              "analyzer" : "text_stem",
              "position_increment_gap" : 100
            }
          },
          "analyzer" : "text_base",
          "position_increment_gap" : 100
        }

When you execute a search query using the Augmented Search GraphQL API, the query searches across those fields and applies a different weight to each one, so that a phrase match scores higher than an ngram match.
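As an illustration of this kind of weighting (this is not the exact query Augmented Search generates; the boost values here are arbitrary), a raw Elasticsearch multi_match query with per-field boosts over the subfields shown above looks like this:

```shell
# Illustrative only: per-field boosts (^3, ^2) make phrase matches score
# higher than stem matches, which in turn outscore ngram matches.
curl -XGET "http://localhost:9200/dx__content__default__en__read/_search" \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "multi_match": {
      "query": "ripe tomatoes",
      "fields": ["jgql:content.phrase^3", "jgql:content.stem^2", "jgql:content.ngram"]
    }
  }
}'
```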

How to specify custom settings and mappings

Augmented Search allows you to define your own settings per language in a settings_{language}.json file. In the following examples, the settings for Japanese and Polish only take effect if the required plugins are available on your Elasticsearch installation.

settings_ja.json:


{
 "requires": "analysis-kuromoji",
 "tokenizer": {
   "ja_tokenizer": {
     "type": "kuromoji_tokenizer",
     "mode": "extended"
   }
 },
 "analyzer": "kuromoji",
 "stopwords": {
   "type": "ja_stop"
 },
 "stemmer": "kuromoji_stemmer"
}

settings_pl.json:


{
 "requires": "analysis-stempel",
 "analyzer": "polish",
 "stopwords": {
   "type": "polish_stop"
 },
 "stemmer": "polish_stem"
}
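Before relying on these files, you can verify that the required plugins are actually installed by using the _cat API:

```shell
# Lists installed plugins; analysis-kuromoji and analysis-stempel must
# appear here for the Japanese and Polish settings above to take effect.
curl -XGET "http://localhost:9200/_cat/plugins?v"
```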

Augmented Search looks for those files next to your custom settings.json file, whose location is defined in the module configuration:


org.jahia.modules.augmentedsearch.content.settingsFileLocation=/opt/jahia/configurations/myplatform_augmented_search_settings.json

For example, if you have Chinese activated on your website, Augmented Search will look for /opt/jahia/configurations/myplatform_augmented_search_settings_zh.json. If not found, it will use the default settings defined in the module.