-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[FLINK-37605][runtime] Infer checkpoint id on endInput in sink #26433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a2cf486
to
e54f829
Compare
long completedCheckpointId = endInput ? EOI : lastCompletedCheckpointId; | ||
private void commitAndEmitCheckpoints(long checkpointId) | ||
throws IOException, InterruptedException { | ||
lastCompletedCheckpointId = checkpointId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably a basic question, but shouldn't we update the lastCompletedCheckpointId
variable after we have completed the checkpoint, which I assume happens in the subsequent for loop? I was expecting the lastCompletedCheckpointId
to be updated after the checkpointing loop in case there was an error during the checkpointing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, transient state is lost on error. So whether we update before or after the loop doesn't matter because the exception will lead to a fail-over and everything is recalculated on recovery. Since everything is called from the main task thread (mailbox thread), there is no interleaving possible of this call and another call like endInput
.
Now in this specific case, lastCompletedCheckpointId
refers to the completed checkpoint id of Flink as a whole. Since this value is primarily set through notifyCheckpointCompleted
, the checkpoint is already completed before the start of the method. So I'd like to keep it as the first statement because it's easier to read than if it's done at the end of the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation @AHeise - that makes sense
e54f829
to
023b6b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed the new method to infer the checkpoint id offline and it seems solid (at least less brittle than using the special EOI marker).
I didn't fully understand why the refactoring was needed for this PR but I'll leave that up to you.
@@ -397,6 +399,10 @@ public StreamConfig getStreamConfig() { | |||
return config; | |||
} | |||
|
|||
public void setRestoredCheckpointId(long restoredCheckpointId) { | |||
this.restoredCheckpointId = restoredCheckpointId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks unrelated to this commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll move the last commit.
With the removal of SinkV1, all adapter tests have also been testing V2. We can remove the adapter tests and simplify test hierarchy.
Remove factory methods and InspectableSink because we don't need the abstraction anymore. Make test setup and assertions more explicit by using sink builder directly in tests. Remove unused methods.
So far, we used a special value for the final checkpoint on endInput. However, as shown in the description of this ticket, final doesn't mean final. Hence, multiple committables with EOI could be created at different times. With this commit, we stop using a special value for such committables and instead try to guess the checkpoint id of the next checkpoint. There are various factors that influence the checkpoint id but we can mostly ignore them all because we just need to pick a checkpoint id that is - higher than all checkpoint ids of the previous, successful checkpoints of this attempt - higher than the checkpoint id of the restored checkpoint - lower than any future checkpoint id. Hence, we just remember the last observed checkpoint id (initialized with max(0, restored id)), and use last id + 1 for endInput. Naturally, multiple endInput calls happening through restarts will result in unique checkpoint ids. Note that aborted checkpoints before endInput may result in diverged checkpoint ids across subtasks. However, each of the id satisfies above requirements and any id of endInput1 will be smaller than any id of endInput2. Thus, diverged checkpoint ids will not impact correctness at all.
023b6b8
to
941e510
Compare
What is the purpose of the change
So far, we used a special value for the final checkpoint on endInput. However, as shown in the description of this ticket, final doesn't mean final. Hence, multiple committables with EOI could be created at different times.
With this commit, we stop using a special value for such committables and instead try to guess the checkpoint id of the next checkpoint. There are various factors that influence the checkpoint id but we can mostly ignore them all because we just need to pick a checkpoint id that is
Hence, we just remember the last observed checkpoint id (initialized with max(0, restored id)), and use last id + 1 for endInput. Naturally, multiple endInput calls happening through restarts will result in unique checkpoint ids. Note that aborted checkpoints before endInput may result in diverged checkpoint ids across subtasks. However, each of the id satisfies above requirements and any id of endInput1 will be smaller than any id of endInput2. Thus, diverged checkpoint ids will not impact correctness at all.
Brief change log
Verifying this change
Covered by existing tests. No new tests since it removes special case handling.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Documentation