Improve plan shutdown speed on cancel (improve performance on the cancellation benchmark) #15314

Open
carols10cents opened this issue Mar 19, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@carols10cents
Contributor

Describe the bug

Sometimes DataFusion queries can't be cancelled in a reasonable amount of time because DataFusion does CPU-bound work for too long without yielding.

Ideally, as documented on ExecutionPlan::execute, DataFusion should yield periodically, in a manner appropriate for the situation, so that queries can be cancelled.
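To make "yielding" concrete, here is a minimal std-only sketch of a CPU-bound task that returns `Poll::Pending` after a fixed amount of work, so whatever is polling it gets a chance to stop. This is not DataFusion code: `YieldingSum`, `NoopWake`, `drive`, and the per-poll `budget` are hypothetical names made up for illustration, and `drive` is only a stand-in for a real runtime such as tokio.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical CPU-bound task: sums 0..total, but returns `Poll::Pending`
/// after every `budget` iterations so the caller regains control.
struct YieldingSum {
    next: u64,
    total: u64,
    acc: u64,
    budget: u64,
}

impl YieldingSum {
    fn new(total: u64, budget: u64) -> Self {
        YieldingSum { next: 0, total, acc: 0, budget }
    }
}

impl Future for YieldingSum {
    type Output = u64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let this = self.get_mut(); // fine: all fields are `Unpin`
        let mut done_this_poll = 0;
        while this.next < this.total {
            if done_this_poll == this.budget {
                // Budget spent: ask to be polled again, then yield.
                cx.waker().wake_by_ref();
                return Poll::Pending;
            }
            this.acc += this.next;
            this.next += 1;
            done_this_poll += 1;
        }
        Poll::Ready(this.acc)
    }
}

/// Waker that does nothing; `drive` below simply polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for a runtime: polls at most `max_polls` times. Returning `None`
/// drops the future part-way through, which is how it gets cancelled.
fn drive<F: Future + Unpin>(mut fut: F, max_polls: usize) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    for _ in 0..max_polls {
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return Some(out);
        }
        // Each `Pending` is a point where cancellation can take effect.
    }
    None
}

fn main() {
    // Enough polls: the sum completes.
    println!("{:?}", drive(YieldingSum::new(1_000, 100), 10));
    // Too few polls: the task is dropped before finishing.
    println!("{:?}", drive(YieldingSum::new(1_000, 100), 3));
}
```

The point of the sketch is that a future can only be dropped between polls, so the longest single `poll` call bounds how quickly cancellation can take effect.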

The current behavior is captured in the cancellation benchmark, and the benchmark shows there are improvements to be made.

To Reproduce

Run the benchmark (which also creates the data it needs). It measures the time between requesting cancellation and the tokio runtime actually being dropped. This run uses the default wait time of 100ms (how long the benchmark waits after starting a query before attempting to cancel it):

./benchmarks/bench.sh run cancellation
[...]
     Running `target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json`
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 31.251709ms
Iteration 0 cancelled in 31.251709 ms
[...]

These results show DataFusion is not yielding regularly; ideally the 31ms would be lower.
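As a rough illustration of why the gap between yield points shows up directly in this number, here is a std-only sketch that measures "time from cancel request to worker stopped" for a worker that only checks a cancel flag between chunks of busy work. It is not the dfbench implementation and uses threads rather than tokio; `busy_work`, `cancel_latency`, and the chunk sizes are assumptions made up for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Spin for roughly `chunk` of wall-clock time, standing in for CPU-bound
/// work that never checks for cancellation.
fn busy_work(chunk: Duration) {
    let start = Instant::now();
    while start.elapsed() < chunk {
        std::hint::spin_loop();
    }
}

/// Start a worker that checks the cancel flag only between chunks, wait
/// `wait`, request cancellation, and return how long the worker took to stop.
fn cancel_latency(chunk: Duration, wait: Duration) -> Duration {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&cancel);
    let worker = thread::spawn(move || {
        while !flag.load(Ordering::Relaxed) {
            busy_work(chunk);
        }
    });

    thread::sleep(wait);
    let requested = Instant::now();
    cancel.store(true, Ordering::Relaxed);
    worker.join().expect("worker panicked");
    requested.elapsed()
}

fn main() {
    let wait = Duration::from_millis(100);
    for chunk_ms in [50u64, 1] {
        let latency = cancel_latency(Duration::from_millis(chunk_ms), wait);
        println!("chunk {} ms -> stopped {:?} after cancel", chunk_ms, latency);
    }
}
```

In this toy setup the measured latency is bounded by roughly one chunk plus scheduling noise, so shrinking the work done between checks is what brings the number down.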

Vary the wait time to cancel at different points in the query's execution. The results suggest there are likely multiple places that need to yield more often:

Waiting 200ms:

./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 200
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
Starting spawned
in main, sleeping
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 68.639709ms
Iteration 0 cancelled in 68.639709 ms
[...]

Waiting 300ms:

./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 300

Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 100.60675ms
Iteration 0 cancelled in 100.60674999999999 ms
[...]

Expected behavior

Ideally, any query cancelled at any point in the query should stop within ~1ms.

Additional context

Related issues:

This PR may be a fix, but there's some disagreement about whether this is the right way to fix this issue.

carols10cents added the bug label Mar 19, 2025
alamb changed the title from "Improve performance on the cancellation benchmark" to "Improve plan shutdown speed on cancel (improve performance on the cancellation benchmark)" Mar 19, 2025
@alamb
Contributor

alamb commented Mar 19, 2025

Thank you @carols10cents -- this is super helpful. I changed the title of this ticket to make it clearer that it isn't just about the benchmark; it is really about shutting down the query more quickly after cancellation (as measured by the benchmark).

Looks like a good project
