
Support more reliable async task retry to guarantee eventual execution (1/2) – Metastore Layer #1523


Open · wants to merge 2 commits into main

Conversation

@danielhumanmod (Contributor) commented May 4, 2025

Fixes #774

Context

Polaris uses async tasks to perform operations such as table and manifest file cleanup. These tasks are executed asynchronously in a separate thread within the same JVM, and retries are handled inline within the task execution. However, this mechanism does not guarantee eventual execution in the following cases:

  • The task fails repeatedly and hits the maximum retry limit.
  • The service crashes or shuts down before retrying.
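For reference, here is a minimal sketch of the inline-retry pattern described above; the class and constant names are hypothetical, not Polaris's actual implementation:

```java
import java.util.concurrent.ExecutorService;

// Minimal sketch of inline, in-JVM retries: retry state lives only in this
// thread, so a crash or an exhausted retry budget silently drops the task.
final class InlineRetryTaskRunner {
  private static final int MAX_ATTEMPTS = 3; // hypothetical retry limit

  void submit(ExecutorService executor, Runnable task) {
    executor.submit(() -> {
      for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        try {
          task.run();
          return; // success, nothing more to do
        } catch (RuntimeException e) {
          // swallow and retry; if the JVM dies here, the task is lost
        }
      }
      // Retry limit reached: no failure record is persisted, so the task
      // will never be executed again.
    });
  }
}
```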

Implementation Plan

Stage 1 (this PR):
Introduce per-task transactional leasing in the metastore layer via loadTasks(...). This enables fine-grained compensation by allowing tasks to be leased and updated one at a time, avoiding the all-or-nothing semantics of bulk operations (as also noted in an existing TODO comment). This is important for retry scenarios, where we want to isolate failures and ensure that tasks are retried independently without affecting each other.
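A rough sketch of what per-task leasing could look like; the interface, record, and method signatures below are illustrative assumptions, not the actual Polaris metastore API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative per-task leasing: each task is leased in its own conditional
// update, so a failure on one task does not prevent leasing the others.
final class PerTaskLeasingSketch {

  // Hypothetical task record and metastore handle.
  record TaskRecord(String taskId, String payload) {}

  interface Metastore {
    // Conditional update: returns the task if the lease was acquired
    // (task unleased or previous lease expired), otherwise null.
    TaskRecord leaseTask(String taskId, String executorId, long leaseDurationMs);
  }

  static List<TaskRecord> loadTasks(
      Metastore store, List<String> candidateTaskIds, String executorId, int limit) {
    List<TaskRecord> leased = new ArrayList<>();
    for (String taskId : candidateTaskIds) {
      if (leased.size() >= limit) {
        break;
      }
      try {
        TaskRecord task = store.leaseTask(taskId, executorId, /* leaseDurationMs= */ 300_000L);
        if (task != null) {
          leased.add(task);
        }
      } catch (RuntimeException e) {
        // Leasing this task failed; skip it and keep going so other tasks
        // are still picked up (the failure stays isolated to this task).
      }
    }
    return leased;
  }
}
```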

Stage 2 (next PR):
Persist failed tasks and introduce a retry mechanism triggered during Polaris startup and via periodic background checks.
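As a rough outline of that idea, assuming a simple scheduled executor (the names here are hypothetical and the actual mechanism belongs to the next PR):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical outline of Stage 2: run a recovery pass at startup and then
// periodically, resubmitting tasks whose lease has expired or that failed.
final class TaskRecoveryScheduler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(Runnable recoverExpiredOrFailedTasks) {
    // Initial delay of zero: the first pass runs at startup, then every 5 minutes.
    scheduler.scheduleAtFixedRate(recoverExpiredOrFailedTasks, 0, 5, TimeUnit.MINUTES);
  }

  void stop() {
    scheduler.shutdown();
  }
}
```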

@collado-mike (Contributor)

Introduce per-task transactional leasing in the metastore layer via loadTasks(...). This enables fine-grained compensation by allowing tasks to be leased and updated one at a time, avoiding the all-or-nothing semantics of bulk operations (as also noted in an existing TODO comment). This is important for retry scenarios, where we want to isolate failures and ensure that tasks are retried independently without affecting each other.

I don't understand how this PR enables isolation of task failures. This PR only reads the tasks from the metastore one at a time, so the only failure would be in loading the task. In a transactional database, the UPDATE ... WHERE statement would only update the task state when the task is not currently leased by another client, so I don't see how one or a few tasks would fail to be leased while the others succeed.

The PR description sounds like it intends to tackle task execution failure - is that right? If so, loading the tasks from the database isn't going to solve that problem.
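The conditional lease update in question, sketched as a JDBC-style UPDATE ... WHERE (the table and column names are made up for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of a conditional lease: the UPDATE only takes effect when the task
// is not currently leased (or its lease has expired), so concurrent clients
// cannot lease the same task twice.
final class ConditionalLeaseSketch {
  static boolean tryLease(Connection conn, String taskId, String executorId, long nowMs)
      throws SQLException {
    String sql =
        "UPDATE tasks SET lease_owner = ?, lease_expires_at = ? "
            + "WHERE task_id = ? AND (lease_owner IS NULL OR lease_expires_at < ?)";
    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, executorId);
      stmt.setLong(2, nowMs + 300_000L); // hypothetical 5-minute lease
      stmt.setString(3, taskId);
      stmt.setLong(4, nowMs);
      return stmt.executeUpdate() == 1; // 1 row updated => lease acquired
    }
  }
}
```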

@eric-maynard (Contributor)

The PR description sounds like it intends to tackle task execution failure - is that right? If so, loading the tasks from the database isn't going to solve that problem.

I think it could, just very lazily, right @collado-mike? The next time the service restarts, we could retry any orphaned tasks.

@danielhumanmod (Contributor, Author) commented May 8, 2025

Introduce per-task transactional leasing in the metastore layer via loadTasks(...). This enables fine-grained compensation by allowing tasks to be leased and updated one at a time, avoiding the all-or-nothing semantics of bulk operations (as also noted in an existing TODO comment). This is important for retry scenarios, where we want to isolate failures and ensure that tasks are retried independently without affecting each other.

I don't understand how this PR enables isolation of task failures. This PR only reads the tasks from the metastore one at a time, so the only failure would be in loading the task. In a transactional database, the UPDATE ... WHERE statement would only update the task state when the task is not currently leased by another client, so I don't see how one or a few tasks would fail to be leased while the others succeed.

The PR description sounds like it intends to tackle task execution failure - is that right? If so, loading the tasks from the database isn't going to solve that problem.

Sorry for the confusion: we actually have a second PR for this feature. I split the work into two parts to make review easier :)

This is the draft PR for the second phase (still WIP):
danielhumanmod#1, which is where pending tasks are loaded from the metastore and executed.

Regarding this PR's changes in the metastore, the goal is to allow each task entity to be read and leased individually. This ensures that if an exception occurs while reading or leasing one task, it won't affect the others. This improvement was also noted in the TODO comment of the previous implementation. It's not strictly required, but it is a nice-to-have for isolating failures.


Successfully merging this pull request may close these issues: Task handling is incomplete