Add fallback in ILM to run cluster state steps periodically #126073
base: main
Conversation
ILM sometimes skips a policy/index for a cluster state update if the step is still running/enqueued when the update comes in. That on its own isn't a problem, but in very quiet clusters it could take arbitrarily long for the policy step to be run - i.e. not until the next cluster state comes in. We saw this happening in a few tests, but it could potentially happen in production too.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
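The fallback this PR describes can be sketched roughly as follows. This is a hypothetical simplification in plain Java, not the actual IndexLifecycleService code: `lastSeenState` and `processClusterState` mirror names from the PR's diff, but the class, the String-based "state", and the method signatures are stand-ins.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the periodic trigger re-runs cluster state steps
// only when no new cluster state has arrived since processing caught up.
public class IlmFallbackSketch {
    // Holds a cluster state that still needs processing; null once
    // processing has caught up. (Stands in for ILM's lastSeenState.)
    private final AtomicReference<String> lastSeenState = new AtomicReference<>(null);

    /** Called on every cluster state update. */
    public void clusterChanged(String newState) {
        lastSeenState.set(newState);
    }

    /**
     * Called by the periodic scheduler. Returns true if the fallback
     * kicked in and (re)processed the current state.
     */
    public boolean onPeriodicTrigger(String currentState) {
        // If lastSeenState was null, no update arrived since the last run:
        // install the current state and process it ourselves. Otherwise a
        // state is still being processed/enqueued, so don't double-run.
        String inFlight = lastSeenState.compareAndExchange(null, currentState);
        if (inFlight == null) {
            processClusterState();
            return true;
        }
        return false;
    }

    private void processClusterState() {
        lastSeenState.set(null); // processing caught up
    }
}
```

The `compareAndExchange` makes the "install only if idle" check and the installation a single atomic step, so a concurrent cluster state update cannot be lost between the check and the write.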
Pinging @elastic/es-data-management (Team:Data Management)
Hi @nielsbauman, I've created a changelog YAML for you.
After @nielsbauman walked me through this PR, I see the relative simplicity of using the periodic trigger to also (re)process the current cluster state when there have been no cluster state updates since the latest periodic ILM run. However, I am a little concerned about the extra complexity we are adding to an already complex system. ILM already has at least two mechanisms that ensure we do not miss any cluster updates and do not execute any steps twice. When triggering the policies after a cluster state update, we check at the end of the triggering whether the cluster state has changed. See: elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleService.java Lines 394 to 404 in e68587a
In order to avoid queueing and executing a step multiple times, we check if a task for this index and step combination has already been submitted: elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java Lines 618 to 632 in 10a8dcf
Now that we have seen these mechanisms, we can better understand the problem we are facing:
Before we move forward with this PR, I would like to see if we can tweak the mechanisms above to account for the issue we are trying to fix.
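The second mechanism mentioned above (skipping a step whose task is already submitted) can be illustrated with a small sketch. The class and method names here are hypothetical stand-ins, not the actual IndexLifecycleRunner API; the point is the set-based dedup of (index, step) pairs.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: before enqueueing a step task, record the
// (index, step) pair; if it is already present, the step is running
// or queued and the new submission is skipped.
public class StepDeduplicator {
    private final Set<String> busy = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if the task was enqueued, false if a task for the
     * same index/step combination was already pending.
     */
    public boolean maybeEnqueue(String index, String stepKey) {
        // Set.add is atomic here and returns false on duplicates,
        // so check-and-register is a single step.
        return busy.add(index + "|" + stepKey);
    }

    /** Called when the step task finishes executing. */
    public void markDone(String index, String stepKey) {
        busy.remove(index + "|" + stepKey);
    }
}
```

This is also what creates the gap the PR targets: if a cluster state update arrives while the pair is still in the busy set, that update is skipped for this index, and nothing re-runs the step until the next update arrives.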
logger.trace("job triggered: {}, {}, {}", event.jobName(), event.scheduledTime(), event.triggeredTime());
triggerPolicies(clusterService.state(), false);
if (event.jobName().equals(XPackField.INDEX_LIFECYCLE) == false) {
    assert false : "Expected scheduler event to be for ILM";
Suggested change:
- assert false : "Expected scheduler event to be for ILM";
+ assert false : "Expected scheduler event to be for ILM but it was for " + event.jobName();
// if it was null before - to avoid redundant processing.
final var stateCurrentlyBeingProcessed = lastSeenState.compareAndExchange(null, clusterService.state());
if (stateCurrentlyBeingProcessed == null) {
    logger.info("ILM didn't receive a new cluster state for [{}]. Running cluster state steps now", pollInterval);
I think we can make this debug level.
    processClusterState();
} else {
    logger.warn(
        "ILM didn't receive a new cluster state for [{}] but it was still processing cluster state version [{}]",
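For reference, the diff above hinges on `AtomicReference.compareAndExchange` (Java 9+) semantics: it returns the previous value when the exchange succeeds and the current, unchanged value when it fails. A minimal standalone demonstration:

```java
import java.util.concurrent.atomic.AtomicReference;

public class CompareAndExchangeDemo {
    public static void main(String[] args) {
        AtomicReference<String> ref = new AtomicReference<>(null);

        // Current value matches the expected value (null): the exchange
        // happens and the previous value (null) is returned.
        String witness = ref.compareAndExchange(null, "state-1");
        if (witness != null) throw new AssertionError();
        if (!"state-1".equals(ref.get())) throw new AssertionError();

        // Current value no longer matches null: the current value is
        // returned and the reference is left untouched.
        witness = ref.compareAndExchange(null, "state-2");
        if (!"state-1".equals(witness)) throw new AssertionError();
        if (!"state-1".equals(ref.get())) throw new AssertionError();
    }
}
```

So in the PR's code, a null witness means "nothing was in flight, and the current state is now installed for processing", while a non-null witness tells you exactly which state is still being processed (which is what the warn log prints).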
Should we clarify this log message to say something like "the poll interval should be increased", to give a hint to users running on-prem?
Hi @nielsbauman, I've updated the changelog YAML for you.
Fixes #126354