You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Sometimes datafusion queries can't be cancelled in a reasonable amount of time because datafusion is doing CPU-bound work for too long without yielding.
Ideally, as documented on ExecutionPlan::execute, datafusion should yield every so often in a manner appropriate for the situation to enable queries to be cancelled.
The current behavior is captured in the cancellation benchmark, and the benchmark shows there are improvements to be made.
To Reproduce
Run the benchmark (and create the needed data) with the default wait time of 100ms (the time after starting a query that the benchmark waits before attempting to cancel) that measures the time between requesting cancellation and the tokio runtime is actually dropped:
./benchmarks/bench.sh run cancellation
[...]
Running `target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json`
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 31.251709ms
Iteration 0 cancelled in 31.251709 ms
[...]
These results show datafusion is not yielding regularly; ideally the 31ms would be lower.
Vary the wait time to try cancelling at different points in the query's execution to see that there are likely multiple places that need to be yielding more often:
Waiting 200ms:
./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 200
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
Starting spawned
in main, sleeping
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 68.639709ms
Iteration 0 cancelled in 68.639709 ms
[...]
Waiting 300 ms:
./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 300
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 100.60675ms
Iteration 0 cancelled in 100.60674999999999 ms
[...]
Expected behavior
Ideally, any query cancelled at any point in the query should stop within ~1ms.
alamb
changed the title
Improve performance on the cancellation benchmark
Improve plan shutdown speed on cancel (improve performance on the cancellation benchmark)
Mar 19, 2025
Thank you @carols10cents -- this is super helpful. I changed the title of this ticket to make it clearer that it isn't just about the benchmark, but it is really about shutting down the query more quickly after cancellation (as measured by the benchmark)
Describe the bug
Sometimes datafusion queries can't be cancelled in a reasonable amount of time because datafusion is doing CPU-bound work for too long without yielding.
Ideally, as documented on
ExecutionPlan::execute
, datafusion should yield every so often in a manner appropriate for the situation to enable queries to be cancelled.The current behavior is captured in the cancellation benchmark, and the benchmark shows there are improvements to be made.
To Reproduce
Run the benchmark (and create the needed data) with the default wait time of 100ms (the time after starting a query that the benchmark waits before attempting to cancel) that measures the time between requesting cancellation and the tokio runtime is actually dropped:
These results show datafusion is not yielding regularly; ideally the 31ms would be lower.
Vary the wait time to try cancelling at different points in the query's execution to see that there are likely multiple places that need to be yielding more often:
Waiting 200ms:
Waiting 300 ms:
Expected behavior
Ideally, any query cancelled at any point in the query should stop within ~1ms.
Additional context
Related issues:
filter_pushdown
enabled filter takes too much time to process #13298This PR may be a fix, but there's some disagreement about whether this is the right way to fix this issue.
The text was updated successfully, but these errors were encountered: