Managing the size of Elasticsearch data with automatic data purging in jCustomer
Question
A customer has been using jExperience for a while and has noticed steadily increasing filesystem usage on their Elasticsearch server.
Does jExperience/jCustomer provide any mechanism to purge old data from Elasticsearch? Is this mechanism configurable?
In an outage situation where filesystem usage reaches 100%, is it possible to trigger an immediate purge of old data in Elasticsearch? And how can the size of the Elasticsearch indices be checked to support such a decision?
Answer
The Background Jobs section of the jExperience documentation describes some of the periodic jobs present in a default installation of jExperience and jCustomer.
The following table describes the periodic purge jobs that can be configured in the file <JCUSTOMER_PATH>/etc/org.apache.unomi.services.cfg:
Job | Configuration | Unit | Default Value |
---|---|---|---|
Purge job interval | org.apache.unomi.profile.purge.interval | Days | 1 |
Purge profiles that have been inactive for a specific interval | org.apache.unomi.profile.purge.inactiveTime | Days | 180 |
Purge profiles that have been created for a specific interval | org.apache.unomi.profile.purge.existTime | Days | 1 |
Purge all sessions/events that have been created for a specific interval | org.apache.unomi.event.purge.existTime | Months | 12 |
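For reference, here is a minimal sketch of how these entries might look in <JCUSTOMER_PATH>/etc/org.apache.unomi.services.cfg, using the default values from the table above (the property names come from the table; the real file contains additional settings that are omitted here):
# How often the purge job runs, in days
org.apache.unomi.profile.purge.interval=1
# Purge profiles that have been inactive for more than this many days
org.apache.unomi.profile.purge.inactiveTime=180
# Purge profiles created more than this many days ago
org.apache.unomi.profile.purge.existTime=1
# Purge sessions/events created more than this many months ago
org.apache.unomi.event.purge.existTime=12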
If one needs to trigger an immediate purge of old Unomi sessions and events to release filesystem space on the Elasticsearch server, the best option is to use the purge service described in the Unomi API documentation.
The purge service takes a Date as an argument; all data older than that date is removed from Elasticsearch.
Although the Date argument is in the format YYYY-MM-DD, by the time the call reaches Elasticsearch it gets truncated to YYYY-MM (for example, 2019-12-31 is effectively treated as 2019-12), so be cautious and leave a safe margin when choosing the date for this call.
First, it is helpful to get an idea of the size of each Elasticsearch index before choosing a date that might purge some of them. That information can be fetched by running the following command on your Elasticsearch server:
user$ curl -X GET "localhost:9200/_cat/indices"
green open geonames _OFEJm-DQCeHzlV2z2A0ow 5 0 0 0 955b 955b
green open context ZNQmfaaaaae3AQlt71flqA 5 0 20897583 334812 5.2gb 5.2gb
green open context-2019-12 Tjbbbbu0QRiu0uf-6Aj96Q 5 0 11741552 408 11.9gb 11.9gb
green open context-2019-11 tA2DxKxjR-67fP5Xt48uDA 5 0 11848605 501 11gb 11gb
green open context-2019-10 NuGDlErKSaaVD5sbbPWsww 5 0 4698576 119 5.5gb 5.5gb
green open context-2019-09 Tacce9yGQcezWbbWrtQIew 5 0 14463363 4010 12.1gb 12.1gb
green open context-2019-08 VlllTkTsQTO-yslM91cVvw 5 0 3599435 1792 4.1gb 4.1gb
green open context-2020-03 TsYbc0bba42L8dr7UvAHuw 5 0 8882950 221 11.1gb 11.1gb
green open context-2020-02 9xc7hla5Rciggf-uZ-tgmQ 5 0 12488469 331 11.2gb 11.2gb
green open context-2020-01 kMLRfdhEQTitKB3UG-HHHA 5 0 14136394 420 12.8gb 12.8gb
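If the plain listing is hard to scan, the _cat API also accepts parameters to add column headers, choose columns, and sort the output; for example, the following variant (supported by recent Elasticsearch versions) lists the indices sorted by on-disk size:
user$ curl -X GET "localhost:9200/_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc"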
So, for instance, to remove all data created up to December 2019 in this scenario, one can run the following command on the jCustomer server:
user$ curl -X GET -k -u <unomi_karaf_user>:<unomi_karaf_password> https://localhost:9443/cxs/cluster/purge/2019-12-31
This triggers a log entry like the following in Elasticsearch:
[2020-04-08T14:02:04,372][INFO ][o.e.c.m.MetaDataDeleteIndexService] [cSWM0Da] [context-2019-12/Tjbbbbu0QRiu0uf-6Aj96Q] deleting index
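Once the purge has completed, the released space can be verified by listing the indices again and checking per-node disk usage with the _cat/allocation endpoint (both are standard Elasticsearch APIs):
user$ curl -X GET "localhost:9200/_cat/indices"
user$ curl -X GET "localhost:9200/_cat/allocation?v"
In this example the context-2019-12 index should no longer appear in the first listing, and the available disk space reported by the second command should have grown accordingly.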