
Add ref store size threshold property to initiate purge of old streams #4610


Open

at055612 opened this issue Nov 21, 2024 · 3 comments · May be fixed by #4852
Comments

@at055612 (Member)

When a ref store reaches stroom.pipeline.referenceData.lmdb.maxStoreSize, loads will fail. It would be good to have an additional threshold percentage property (e.g. 90%) such that, prior to a load, if the store size (as a percentage of maxStoreSize) is greater than this threshold, old streams are purged until the store is below the threshold.
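A minimal sketch of the proposed pre-load check, assuming names like `RefStore`, `sizeOnDisk()` and `purgeOldestStream()` that are purely illustrative, not the actual Stroom API:

```java
// Hypothetical sketch: RefStore and its methods are illustrative names,
// not the real Stroom API.
interface RefStore {
    long sizeOnDisk();
    boolean purgeOldestStream(); // false when there is nothing left to purge
}

class PrePurgeCheck {
    static void purgeToThreshold(final RefStore store,
                                 final long maxStoreSizeBytes,
                                 final int thresholdPercent) {
        // e.g. 90% of maxStoreSize
        final long thresholdBytes = (maxStoreSizeBytes / 100) * thresholdPercent;
        // Purge oldest streams until under the threshold, then the load can proceed
        while (store.sizeOnDisk() > thresholdBytes && store.purgeOldestStream()) {
            // intentionally empty
        }
    }
}
```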

@at055612 at055612 added enhancement A new feature or enhancement to an existing feature f:ref data Issues relating to reference data loads or lookups labels Nov 21, 2024
@at055612 at055612 added this to the v7.7 milestone Nov 21, 2024
@at055612 (Member, Author)

Raised for @p-kimberley

@p-kimberley (Contributor)

For context, there are two issues here:

  1. Disk pressure caused by multiple ref DBs (one per feed), each growing at its own rate up to maxStoreSize. Once a volume fills up, ref data writes will fail. It would therefore be useful to be able to cap ref usage on a per-node basis as well as per-feed/DB.
  2. A single DB reaching maxStoreSize and subsequently failing to load until entries expire and are purged.

Firstly, I suggest maxStoreSize be renamed to maxDbSize. When ref stores were combined (not split into feeds), this property made sense as a combined limit. Now that there can be multiple DBs, there should be a separate property governing the size of each individual DB.

I propose two additional settings be created, both of which will cause streams to be purged until the DB(s) are within limits (see the sketch after this list):

  1. stroom.pipeline.referenceData.lmdb.maxStoreSize. Maximum size of the ref store, encompassing all ref DBs. This will enable a global cap to prevent disk pressure from affecting node health if ref data grows in an uncontrolled manner. If this limit is reached, Stroom should start purging from the largest ref DBs until the aggregate DB size falls below the limit, perhaps purging iteratively in batches, each time from the largest DB.
  2. stroom.pipeline.referenceData.lmdb.dbHighWaterMarkPercent. Once an individual ref DB reaches this percentage of maxDbSize, purge streams until its size falls below the threshold.
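A rough sketch of how the global cap in point 1 might behave, purging in batches from whichever DB is currently largest. `FeedDb` and its methods are hypothetical, and (as the next comment points out) "size" here would need to mean logical usage, or be paired with compaction, since purging alone does not shrink LMDB files:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: FeedDb and its methods are illustrative, not the
// real Stroom API.
interface FeedDb {
    long sizeOnDisk();
    boolean purgeOldestStreams(int batchSize); // false if nothing was purged
}

class GlobalCap {
    static void enforce(final List<FeedDb> dbs,
                        final long maxStoreSizeBytes,
                        final int batchSize) {
        long total = dbs.stream().mapToLong(FeedDb::sizeOnDisk).sum();
        while (total > maxStoreSizeBytes) {
            // Purge a batch from whichever DB is currently largest
            final FeedDb largest = dbs.stream()
                    .max(Comparator.comparingLong(FeedDb::sizeOnDisk))
                    .orElseThrow();
            if (!largest.purgeOldestStreams(batchSize)) {
                break; // nothing left to purge; the cap cannot be met
            }
            total = dbs.stream().mapToLong(FeedDb::sizeOnDisk).sum();
        }
    }
}
```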

@at055612 (Member, Author)

at055612 commented Apr 3, 2025

@p-kimberley

Note: I take store to mean one LMDB env, containing data for one feed. Reference data is a collection of stores. The store actually contains multiple DBs, so I will avoid your terminology to avoid confusion with how the code/LMDB is structured.

It will have to work slightly differently from what you have suggested due to the way LMDB allocates disk space but never frees it. The move to a store per feed has been good, but it actually makes things worse when it comes to freeing up space. Each feed store is an independent LMDB env, so loading a new stream for feed XYZ will either make the XYZ store grow to make room, or the XYZ store will remain unchanged because the stream can fit into space reclaimed from previous purges of XYZ streams. All this assumes readsBlockWrites is on to allow write txns to use reclaimed space.

So if you do a big re-process on a feed, causing it to load a lot of streams, that will use up a lot of disk that can never be used by other feeds. For example, a store could grow to 1GB on disk but contain no data due to purges; only feed XYZ can reuse that 1GB.

Your point 1 currently won't work, as purges won't free any space on disk. I'm going to look into making a compacted copy of the env and then swapping over. I think this will work to free space on disk, but I'm not sure how quick it is. It will also have to block all other writes to the store. Hopefully it could be a scheduled job, e.g. at a quiet time of day, or done at the end of the scheduled purge job.
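For illustration, a hedged sketch of the compact-and-swap idea, assuming lmdbjava, whose `Env.copy` with `MDB_CP_COMPACT` writes a compacted copy that omits free pages. The directory swap is simplified; real code would need crash-safety handling and to re-open the env afterwards:

```java
import java.io.File;
import java.nio.ByteBuffer;
import org.lmdbjava.CopyFlags;
import org.lmdbjava.Env;

class StoreCompactor {
    // Writes a compacted copy of the env into tmpDir (which must be an
    // existing, empty directory), then swaps the directories over.
    // All writes to the store must be blocked for the duration.
    static void compact(final Env<ByteBuffer> env, final File storeDir, final File tmpDir) {
        env.copy(tmpDir, CopyFlags.MDB_CP_COMPACT);
        env.close(); // release the old env before touching its files

        // Swap: not atomic across the two renames, so a crash mid-swap
        // would need recovery handling.
        final File oldDir = new File(storeDir.getParentFile(), storeDir.getName() + ".old");
        if (!storeDir.renameTo(oldDir) || !tmpDir.renameTo(storeDir)) {
            throw new IllegalStateException("Directory swap failed");
        }
        // oldDir can now be deleted to free the disk space
    }
}
```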

I think point 2 (the high water mark % on the store) is OK. I just need to be sure I can determine from LMDB the percentage of the store that is free space. A load of a stream for feed XYZ will need to check the % free against the HWM and, if necessary, purge LRU streams from the XYZ store until it is below the HWM. This could also be scheduled.
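As a starting point for measuring usage, a sketch assuming lmdbjava's `Env.stat()` and `Env.info()`. Note that `lastPageNumber` is a high-water mark of pages ever allocated, so this measures file usage rather than reusable free space; pages sitting on LMDB's internal freelist (reclaimed by purges) still count as "used" here, and reading the freelist itself would need lower-level access than plain lmdbjava exposes:

```java
import java.nio.ByteBuffer;
import org.lmdbjava.Env;
import org.lmdbjava.EnvInfo;
import org.lmdbjava.Stat;

class StoreUsage {
    // Approximates how much of the env's map has ever been allocated.
    // This overstates "real" usage once streams have been purged, because
    // freelist pages are still counted as allocated.
    static double percentOfMapAllocated(final Env<ByteBuffer> env) {
        final Stat stat = env.stat();
        final EnvInfo info = env.info();
        final long allocatedBytes = (info.lastPageNumber + 1) * stat.pageSize;
        return 100.0 * allocatedBytes / info.mapSize;
    }
}
```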

I think we could also do with the means in Stroom to set limits on a per-feed basis, as you likely have some ref feeds with much bigger streams than others.
