Applies to: Jahia 8, Jahia 7.3

Managing the size of Elasticsearch data with automatic data purging in jCustomer

Question

A customer has been using jExperience for a while and notices steadily increasing filesystem usage on their Elasticsearch server.

Does jExperience/jCustomer provide any mechanism to purge old data from Elasticsearch? Is this mechanism configurable?

In an outage situation where filesystem usage reaches 100%, is it possible to trigger an immediate purge of old data in Elasticsearch? And how can the size of the Elasticsearch indices be checked to support such a decision?

Answer

The Background Jobs section of the jExperience documentation describes some of the periodic jobs in a default installation of jExperience and jCustomer.

The following table describes the periodic purge jobs that can be configured in the file <JCUSTOMER_PATH>/etc/org.apache.unomi.services.cfg:

Job                                                                        Configuration                                 Unit    Default Value
Purge job interval                                                         org.apache.unomi.profile.purge.interval       Days    1
Purge profiles that have been inactive for a specific interval             org.apache.unomi.profile.purge.inactiveTime   Days    180
Purge profiles that have been created for a specific interval              org.apache.unomi.profile.purge.existTime      Days    -1 (disabled)
Purge all sessions/events that have been created for a specific interval   org.apache.unomi.event.purge.existTime        Months  12
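
For reference, the corresponding entries in <JCUSTOMER_PATH>/etc/org.apache.unomi.services.cfg look as follows. This is a minimal sketch using the property names and default values from the table above; your file may contain additional entries:

user$ cat <JCUSTOMER_PATH>/etc/org.apache.unomi.services.cfg

# How often the purge job runs, in days
org.apache.unomi.profile.purge.interval=1
# Purge profiles that have been inactive for more than this many days
org.apache.unomi.profile.purge.inactiveTime=180
# Purge profiles created more than this many days ago (-1 disables this purge)
org.apache.unomi.profile.purge.existTime=-1
# Purge sessions/events created more than this many months ago
org.apache.unomi.event.purge.existTime=12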


If you need to trigger an immediate purge of old Unomi sessions and events to release filesystem space on the Elasticsearch server, the best option is to use the purge service described in the Unomi API documentation.

The purge service takes a date as an argument; all data older than that date is removed from Elasticsearch.

Although the date argument uses the format YYYY-MM-DD, by the time the call reaches Elasticsearch it is truncated to YYYY-MM: for example, both 2019-12-01 and 2019-12-31 resolve to 2019-12 and therefore remove the whole context-2019-12 index. It is recommended to be cautious and leave a safe margin when choosing the date for this service.

First, it is helpful to get an idea of the size of each Elasticsearch index before choosing a date that might purge some of them. That information can be fetched by running the following command on your Elasticsearch server:

user$ curl -X GET "localhost:9200/_cat/indices"

green  open geonames              _OFEJm-DQCeHzlV2z2A0ow 5 0        0      0   955b   955b
green  open context               ZNQmfaaaaae3AQlt71flqA 5 0 20897583 334812  5.2gb  5.2gb
green  open context-2019-12       Tjbbbbu0QRiu0uf-6Aj96Q 5 0 11741552    408 11.9gb 11.9gb
green  open context-2019-11       tA2DxKxjR-67fP5Xt48uDA 5 0 11848605    501   11gb   11gb
green  open context-2019-10       NuGDlErKSaaVD5sbbPWsww 5 0  4698576    119  5.5gb  5.5gb
green  open context-2019-09       Tacce9yGQcezWbbWrtQIew 5 0 14463363   4010 12.1gb 12.1gb
green  open context-2019-08       VlllTkTsQTO-yslM91cVvw 5 0  3599435   1792  4.1gb  4.1gb
green  open context-2020-03       TsYbc0bba42L8dr7UvAHuw 5 0  8882950    221 11.1gb 11.1gb
green  open context-2020-02       9xc7hla5Rciggf-uZ-tgmQ 5 0 12488469    331 11.2gb 11.2gb
green  open context-2020-01       kMLRfdhEQTitKB3UG-HHHA 5 0 14136394    420 12.8gb 12.8gb
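
To focus on the monthly context indices only, the same _cat API accepts an index pattern and formatting parameters. The following variant (assuming Elasticsearch still listens on localhost:9200) prints a header line, restricts the output to the index name, document count, and size, and sorts by index name:

user$ curl -s "localhost:9200/_cat/indices/context-*?v&h=index,docs.count,store.size&s=index"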

So, to remove the oldest data in this scenario (everything up to and including December 2019), one can run the following command on the jCustomer server:

user$ curl -X GET -k -u <unomi_karaf_user>:<unomi_karaf_password> https://localhost:9443/cxs/cluster/purge/2019-12-31

This triggers log entries such as the following in Elasticsearch:

[2020-04-08T14:02:04,372][INFO ][o.e.c.m.MetaDataDeleteIndexService] [cSWM0Da] [context-2019-12/Tjbbbbu0QRiu0uf-6Aj96Q] deleting index
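
Afterwards, you can verify that the purge freed the expected space by listing the monthly indices again; the purged months should no longer appear (again assuming Elasticsearch on localhost:9200):

user$ curl -X GET "localhost:9200/_cat/indices/context-2019-*"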