[FLINK-37605][runtime] Infer checkpoint id on endInput in sink #26433

AHeise · 2025-04-09T13:57:09Z

What is the purpose of the change

So far, we used a special value for the final checkpoint on endInput. However, as shown in the description of this ticket, final doesn't mean final. Hence, multiple committables with EOI could be created at different times.

With this commit, we stop using a special value for such committables and instead try to guess the checkpoint id of the next checkpoint. There are various factors that influence the checkpoint id but we can mostly ignore them all because we just need to pick a checkpoint id that is

- higher than all checkpoint ids of the previous, successful checkpoints of this attempt,
- higher than the checkpoint id of the restored checkpoint, and
- lower than any future checkpoint id.

Hence, we just remember the last observed checkpoint id (initialized with max(0, restored id)) and use last id + 1 for endInput. Naturally, multiple endInput calls happening through restarts will result in unique checkpoint ids. Note that aborted checkpoints before endInput may result in diverging checkpoint ids across subtasks. However, each of these ids satisfies the above requirements, and any id produced by an earlier endInput (endInput1) will be smaller than any id produced by a later endInput after a restart (endInput2). Thus, diverging checkpoint ids do not impact correctness at all.
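The inference rule described above can be illustrated with a minimal sketch. It is not the code from this PR: the class `CheckpointIdTracker`, its method names, and the `NO_RESTORED_CHECKPOINT` sentinel are hypothetical and exist only to show the arithmetic for a single subtask.

```java
/** Hypothetical helper, not part of Flink: tracks the last observed checkpoint id of one subtask. */
final class CheckpointIdTracker {
    /** Assumed sentinel meaning "this attempt did not restore from a checkpoint". */
    static final long NO_RESTORED_CHECKPOINT = -1L;

    private long lastObservedCheckpointId;

    CheckpointIdTracker(long restoredCheckpointId) {
        // max(0, restored id): a fresh start begins at 0.
        this.lastObservedCheckpointId = Math.max(0L, restoredCheckpointId);
    }

    /** Called for every checkpoint this subtask observes during the current attempt. */
    void observeCheckpoint(long checkpointId) {
        lastObservedCheckpointId = checkpointId;
    }

    /** Id assigned to committables created on endInput: last observed id + 1. */
    long checkpointIdForEndInput() {
        return lastObservedCheckpointId + 1;
    }

    public static void main(String[] args) {
        // First attempt: no restore, checkpoints 1..3 observed, then endInput.
        CheckpointIdTracker firstAttempt = new CheckpointIdTracker(NO_RESTORED_CHECKPOINT);
        for (long id = 1; id <= 3; id++) {
            firstAttempt.observeCheckpoint(id);
        }
        long endInput1 = firstAttempt.checkpointIdForEndInput();

        // Second attempt: restored from checkpoint 4, endInput is reached again.
        CheckpointIdTracker secondAttempt = new CheckpointIdTracker(4L);
        long endInput2 = secondAttempt.checkpointIdForEndInput();

        System.out.println("endInput1 id = " + endInput1); // 4
        System.out.println("endInput2 id = " + endInput2); // 5
    }
}
```

In this sketch, a subtask that missed checkpoint 3 because it was aborted would infer 3 instead of 4 for endInput1. Both values are above the last successful checkpoint of the attempt and below the id inferred after restoring from a later checkpoint, which is the property the description relies on.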

Brief change log

Clarify contract of endInput
Infer checkpoint id on endInput

Verifying this change

Covered by existing tests. No new tests since it removes special case handling.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2025-04-09T14:12:18Z

CI report:

941e510 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

davidradl · 2025-04-09T15:36:35Z

...ntime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java

-        long completedCheckpointId = endInput ? EOI : lastCompletedCheckpointId;
+    private void commitAndEmitCheckpoints(long checkpointId)
+            throws IOException, InterruptedException {
+        lastCompletedCheckpointId = checkpointId;


nit: probably a basic question, but shouldn't we update the lastCompletedCheckpointId variable after we have completed the checkpoint, which I assume happens in the subsequent for loop? I was expecting the lastCompletedCheckpointId to be updated after the checkpointing loop in case there was an error during the checkpointing.

In general, transient state is lost on error. So whether we update before or after the loop doesn't matter, because the exception will lead to a fail-over and everything is recalculated on recovery. Since everything is called from the main task thread (mailbox thread), this call cannot interleave with another call such as endInput.

Now in this specific case, lastCompletedCheckpointId refers to the completed checkpoint id of Flink as a whole. Since this value is primarily set through notifyCheckpointComplete, the checkpoint is already completed before the start of the method. So I'd like to keep it as the first statement because that is easier to read than setting it at the end of the method.

Thanks for the explanation @AHeise - that makes sense

fapaul

We discussed the new method to infer the checkpoint id offline and it seems solid (at least less brittle than using the special EOI marker).

I didn't fully understand why the refactoring was needed for this PR but I'll leave that up to you.

fapaul · 2025-04-14T10:46:14Z

...runtime/src/test/java/org/apache/flink/streaming/util/AbstractStreamOperatorTestHarness.java

@@ -397,6 +399,10 @@ public StreamConfig getStreamConfig() {
        return config;
    }

+    public void setRestoredCheckpointId(long restoredCheckpointId) {
+        this.restoredCheckpointId = restoredCheckpointId;


Looks unrelated to this commit

Yes, I'll move the last commit.

[FLINK-37605][runtime] Clarify contract of endInput

778d976

AHeise force-pushed the FLINK-37605-fix-eoi-sink branch from a2cf486 to e54f829 Compare April 9, 2025 13:58

flinkbot added the component=API/Core label Apr 9, 2025

davidradl reviewed Apr 9, 2025

AHeise force-pushed the FLINK-37605-fix-eoi-sink branch from e54f829 to 023b6b8 Compare April 13, 2025 09:22

AHeise assigned fapaul Apr 13, 2025

fapaul approved these changes Apr 14, 2025

AHeise added 3 commits April 14, 2025 14:12

[FLINK-37605][runtime] Remove obsolete sink tests

17f0192

With the removal of SinkV1, all adapter tests have also been testing V2. We can remove the adapter tests and simplify test hierarchy.

[FLINK-37605][runtime] Cleanup writer test

e5f0df8

Remove factory methods and InspectableSink because we don't need the abstraction anymore. Make test setup and assertions more explicit by using sink builder directly in tests. Remove unused methods.

AHeise force-pushed the FLINK-37605-fix-eoi-sink branch from 023b6b8 to 941e510 Compare April 14, 2025 12:31

AHeise merged commit 9302545 into apache:master Apr 14, 2025

This was referenced Apr 14, 2025

[FLINK-37605][runtime] Infer checkpoint id on endInput in sink [2.0] #26456

Merged

[FLINK-37605][runtime] Infer checkpoint id on endInput in sink [1.20] #26457

Merged

[FLINK-37605][runtime] Infer checkpoint id on endInput in sink [1.19] #26458

Merged
