
Use Tokio's task budget consistently, better APIs to support task cancellation #16398


Merged
merged 24 commits on Jun 20, 2025

Conversation

pepijnve
Contributor

@pepijnve commented Jun 13, 2025

Which issue does this PR close?

Rationale for this change

RecordBatchStreamReceiver supports cooperative scheduling implicitly by using Tokio's task budget. YieldStream currently uses a custom mechanism. It would be better to use a single mechanism consistently.
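For context, a minimal sketch (plain Tokio, not code from this PR) of the budget mechanism being standardized on: tokio::task::consume_budget debits the current task's cooperative budget and yields back to the scheduler once the budget (currently 128 operations per task tick) is exhausted.

use futures::StreamExt;

#[tokio::main]
async fn main() {
    let mut stream = futures::stream::iter(0..1_000_000u64);
    let mut sum = 0u64;
    while let Some(v) = stream.next().await {
        sum += v;
        // Without this line a long-running loop would starve other tasks
        // and never observe cancellation; with it, the task periodically
        // yields once Tokio's per-task budget runs out.
        tokio::task::consume_budget().await;
    }
    println!("{sum}");
}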

What changes are included in this PR?

  • Renamed YieldStream and related types to CooperativeStream
  • Removed configuration option which is no longer applicable
  • Enabled cooperative scheduling in spill manager

Note that the implementation of CooperativeStream in this PR is suboptimal. The final implementation requires tokio-rs/tokio#7405, which I'm trying to move along as best I can.

Are these changes tested?

Covered by infinite_cancelcoop test.

Are there any user-facing changes?

Yes, the datafusion.optimizer.yield_period configuration option is removed, but at the time of writing this has not been released yet.

@github-actions bot added the documentation, optimizer, core, sqllogictest, common, proto, datasource, and physical-plan labels Jun 13, 2025
@pepijnve force-pushed the task_budget branch 8 times, most recently from 75fd648 to 935db91 on June 13, 2025 at 13:57
@ozankabak
Contributor

Thanks for the draft -- this is in line with my understanding from your description. I think it will inch us closer to a good, lasting solution (especially after your upstream tokio PR also merges). Feel free to ping me for a more detailed review once you are done with it.

@pepijnve
Contributor Author

pepijnve commented Jun 13, 2025

@ozankabak I've pushed the optimizer rule changes I had in mind. This introduces two new execution plan properties that capture the evaluation type (how children are evaluated: eager vs lazy) and the scheduling type (how poll_next will behave with respect to scheduling: blocking vs cooperative).

With those two combined, the tree can be rewritten in a bottom-up fashion. Every leaf that is not cooperative gets wrapped as before. Additionally, any eagerly evaluating nodes (i.e. exchanges) that are not cooperative are wrapped. This should ensure the entire plan participates in cooperative scheduling; a sketch of the rule follows below.
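A hypothetical sketch of that rewrite rule (the enum and variant names follow the properties discussed in this thread; the actual code in the PR may differ):

#[derive(Clone, Copy, PartialEq)]
enum EvaluationType { Lazy, Eager }

#[derive(Clone, Copy, PartialEq)]
enum SchedulingType { Blocking, Cooperative }

// Wrap a node in a budget-consuming adapter when it does not already
// cooperate with the scheduler and it is either a leaf or an
// exchange-like node that eagerly drives its own inputs.
fn needs_wrapping(is_leaf: bool, eval: EvaluationType, sched: SchedulingType) -> bool {
    sched == SchedulingType::Blocking && (is_leaf || eval == EvaluationType::Eager)
}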

The only caveat that remains is dynamic stream creation. Operators that do that need to take the necessary precautions themselves. I already updated the spill manager for this in the previous commit.

While I was writing this I started wondering if evaluation type should be a per-child thing. In my spawn experiment branch, for instance, hash join is eager for the build side but lazy for the probe side. Perhaps it would be best to leave room for that.

@pepijnve force-pushed the task_budget branch 6 times, most recently from b593bfa to c648c0c on June 13, 2025 at 19:07
@ozankabak
Contributor

While I was writing this I started wondering if evaluation type should be a per-child thing. In my spawn experiment branch, for instance, hash join is eager for the build side but lazy for the probe side. Perhaps it would be best to leave room for that.

This is in alignment with what I was thinking, let's do it that way

@pepijnve
Contributor Author

pepijnve commented Jun 14, 2025

Thinking about it some more: the evaluation type is intended to describe how the operator computes record batches itself, either lazily on demand or by driving things itself. I'm kind of trying to refer to the terminology from the Volcano paper, which talks about demand-driven and data-driven operators. I had first called this 'drive type' with values 'demand' and 'data', but that felt a bit awkward. Since this is actually a property of how the operator prepares its output, one value per operator is probably fine after all.

What I'm trying to do with this is find the exchanges in the plan. The current set present in DataFusion is all fine, but if you were to implement one using std::sync::mpsc::channel instead of the one from tokio, explicit cooperation with the scheduler would be necessary again.
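To make that concrete, a sketch (standard library and Tokio APIs, not code from this PR) of why the channel choice matters:

use std::task::{Context, Poll};

// Tokio's receiver debits the task budget internally; once the budget is
// depleted, poll_recv returns Pending and arranges a wake-up, so an
// exchange built on it cooperates with the scheduler for free.
fn poll_tokio(rx: &mut tokio::sync::mpsc::Receiver<u32>, cx: &mut Context<'_>) -> Poll<Option<u32>> {
    rx.poll_recv(cx)
}

// A std channel knows nothing about tasks or budgets; an exchange built
// on it would have to consume budget or yield explicitly to remain
// cooperative.
fn poll_std(rx: &std::sync::mpsc::Receiver<u32>) -> Option<u32> {
    rx.try_recv().ok()
}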

@pepijnve
Contributor Author

Open to suggestions on better names for these properties.

@pepijnve
Contributor Author

pepijnve commented Jun 19, 2025

What puzzles me about the performance regressions is the results from #16398 (comment). At that point in time, the default behavior was what's now datafusion_coop="per_stream", which is the exact same strategy @zhuqi-lucas initially implemented. I kept that around to be able to easily compare that variant with the Tokio budget implementation. Performance should be essentially identical to main.

I'll have a look at the query plans for some of the queries to make sure the optimizer rule is not injecting coop operators needlessly.

The only other difference I can think of is that in this PR CooperativeStream (formerly YieldStream) is a generic type. If anything, I would expect that to be faster at runtime rather than slower. Maybe my intuition is completely off here.
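For reference, an illustrative sketch of that difference (hypothetical types): with a generic parameter the compiler sees the concrete inner stream and can inline its poll_next, whereas a boxed trait object pays a vtable call on every poll.

use std::pin::Pin;
use futures::Stream;

// Static dispatch: the concrete type of `inner` is known at compile time.
struct CooperativeGeneric<S: Stream> {
    inner: S,
}

// Dynamic dispatch: every poll goes through the trait object's vtable.
struct CooperativeBoxed<T> {
    inner: Pin<Box<dyn Stream<Item = T> + Send>>,
}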

@pepijnve
Contributor Author

Zooming in on query 4 (it would be super convenient if these had 1-based indices, BTW, to match line numbers in the file). The plan is:

ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT hits.UserID)]
  AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
    CoalescePartitionsExec
      AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
        AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]
          CoalesceBatchesExec: target_batch_size=8192
            RepartitionExec: partitioning=Hash([alias1@0], 10), input_partitions=10
              AggregateExec: mode=Partial, gby=[UserID@0 as alias1], aggr=[]
                DataSourceExec: file_groups={10 groups: ...]}, projection=[UserID], file_type=parquet

No redundant coop, so that's good. The only element that changed in this plan is DataSourceExec, which contains the yielding. I'm having a hard time coming up with a hypothesis for why this would become so much slower.

@pepijnve
Contributor Author

I added some output to be able to see what the coop logic was doing. Ignore the times; this is a dev-profile build.
What you can see is that the PR hardly ever forces yields, while main does. This is explained by the yield frequency being 64 on main and 128 in the PR (to match Tokio's budget). So it doesn't look like increased yielding is the culprit.

Yield output:

Main (same for datafusion_coop="per_stream" and datafusion_coop="tokio_fallback")
Polled 1146 times; 20 pending, 1118 ready, 8 forced yields
Polled 1262 times; 20 pending, 1230 ready, 12 forced yields
Polled 1076 times; 20 pending, 1048 ready, 8 forced yields
Polled 1124 times; 24 pending, 1095 ready, 5 forced yields
Polled 1183 times; 25 pending, 1153 ready, 5 forced yields
Polled 1209 times; 25 pending, 1179 ready, 5 forced yields
Polled 1157 times; 28 pending, 1127 ready, 2 forced yields
Polled 1218 times; 26 pending, 1187 ready, 5 forced yields
Polled 1572 times; 28 pending, 1535 ready, 9 forced yields
Polled 1594 times; 29 pending, 1556 ready, 9 forced yields
Query 4 iteration 2 took 6382.8 ms and returned 1 rows

vs

PR
Polled 1138 times; 20 pending, 1118 ready, 0 forced yields
Polled 1251 times; 20 pending, 1230 ready, 1 forced yields
Polled 1068 times; 19 pending, 1048 ready, 1 forced yields
Polled 1118 times; 23 pending, 1095 ready, 0 forced yields
Polled 1178 times; 25 pending, 1153 ready, 0 forced yields
Polled 1154 times; 27 pending, 1127 ready, 0 forced yields
Polled 1204 times; 25 pending, 1179 ready, 0 forced yields
Polled 1213 times; 26 pending, 1187 ready, 0 forced yields
Polled 1584 times; 28 pending, 1556 ready, 0 forced yields
Polled 1563 times; 28 pending, 1535 ready, 0 forced yields
Query 4 iteration 6 took 6362.7 ms and returned 1 rows

The next thing I can think of is the access to the budget thread-local. But that doesn't explain why we got a very similar benchmark delta with the non-thread-local counter.
Comparing the clickbench1 run we had in #16398 (comment) with per_stream

Total Time (HEAD)          │ 56209.26ms
Total Time (task_budget)   │ 57381.15ms
Average Time (HEAD)        │  1307.19ms
Average Time (task_budget) │  1334.45ms

vs the last result in #16398 (comment) with tokio_fallback

Total Time (HEAD)          │ 55925.88ms
Total Time (task_budget)   │ 56847.01ms
Average Time (HEAD)        │  1300.60ms
Average Time (task_budget) │  1322.02ms

@zhuqi-lucas
Contributor

zhuqi-lucas commented Jun 19, 2025

Could the shared budget count affect performance? I can see it only shows a regression for the clickbench partition case; I am not sure whether yielding more might actually improve performance for multi-partition cases.

@pepijnve
Contributor Author

I can see it only shows a regression for the clickbench partition case; I am not sure whether yielding more might actually improve performance for multi-partition cases.

I'll experiment with the multi file variant as well. FWIW I'm running this on a 10-core MacBook, so clickbench1 was running with 10 partitions but all from the same file.

@pepijnve
Contributor Author

pepijnve commented Jun 19, 2025

🤔 now that I write '10-core MacBook', I'm wondering if the 10 core part is where my variability is coming from. That's 6 performance and 4 efficiency cores. Ideally DataFusion keeps the CPU bound work on the perf cores, and uses the efficiency ones for IO. I had been wondering about that and NUMA effects already. A topic for a different thread though.

@pepijnve
Contributor Author

Could the shared budget count affect performance?

@zhuqi-lucas the situation where this can have an impact is when an operator switches to a different input stream after the polled input stream returns pending. Interleave is a simple example of that. With the per-stream budget, the second poll may end up returning ready. With the shared task budget, the second poll will see 'budget depleted' and return pending as well. We may incur some overhead by attempting to poll the second stream even though we could already know it will return pending.

Besides that type of situation, and disregarding any non-obvious side effects of using thread-locals for a moment, I don't see how shared vs non-shared budget would make a difference. There shouldn't be competition between tasks for the same budget; the counter is distinct per task and is reset each task 'tick'.
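A toy model of that Interleave scenario (hypothetical code, just to make the difference concrete):

use std::task::Poll;

// Returns Ready while the given budget lasts, Pending once it is spent.
fn poll_input(budget: &mut u32) -> Poll<u32> {
    if *budget == 0 {
        Poll::Pending
    } else {
        *budget -= 1;
        Poll::Ready(42)
    }
}

fn main() {
    // Per-stream budgets: input A is depleted, but input B still succeeds.
    let (mut a, mut b) = (0u32, 8u32);
    assert_eq!(poll_input(&mut a), Poll::Pending);
    assert_eq!(poll_input(&mut b), Poll::Ready(42));

    // Shared task budget: once depleted, polling B is wasted work too;
    // the whole task reports Pending until it yields and gets a fresh
    // budget on the next tick.
    let mut shared = 0u32;
    assert_eq!(poll_input(&mut shared), Poll::Pending);
    assert_eq!(poll_input(&mut shared), Poll::Pending);
}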

@zhuqi-lucas
Contributor

Could the shared budget count affect performance?

@zhuqi-lucas the situation where this can have an impact is when an operator switches to a different input stream after the polled input stream returns pending. […]

Got it, so we add Cooperative to some operators such as sort_preserving_merge when partition > 1; that looks similar to interleave, which has many inputs? I am not sure whether it affects performance. In the original solution we only added yield to leaf nodes, and a leaf node only has one input, I believe.

@pepijnve
Contributor Author

pepijnve commented Jun 19, 2025

Got it, so we add Cooperative to some operators such as sort_preserving_merge when partition > 1

No, that hasn't changed (or at least it shouldn't have). When SPM has more than 1 input, it acts as an exchange and drives the inputs itself. This is advertised using EvaluationType::Eager.
It makes use of Tokio mpsc channels, which are (and were) already consuming the Tokio task budget. Because of this implementation aspect, SPM advertises itself as SchedulingType::Cooperative.
The optimizer should not be adding CooperativeExec here; I'll double-check to be sure. The query I was investigating specifically doesn't use SPM, though.

Note that each exchange is also a task boundary, so task budget consumption does not leak from below the exchange to above it, or vice versa. Since the inputs to SPM are all driven as distinct tasks, there also shouldn't be any competition amongst the inputs.

(I'm stating things as fact above, but this only reflects my current understanding. Working on double checking everything.)
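A rough sketch of that task-boundary structure (hypothetical, not SPM's actual code):

use futures::{Stream, StreamExt};
use tokio::sync::mpsc;

// Each input partition runs in its own spawned task with its own Tokio
// budget; tokio's mpsc send/recv already debit that budget, so budget
// consumption below the exchange cannot leak into the consumer's task.
fn spawn_input<S>(mut input: S, tx: mpsc::Sender<u64>)
where
    S: Stream<Item = u64> + Send + Unpin + 'static,
{
    tokio::spawn(async move {
        while let Some(item) = input.next().await {
            if tx.send(item).await.is_err() {
                break; // receiver dropped, e.g. the query was cancelled
            }
        }
    });
}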

@zhuqi-lucas
Contributor

Got it, so we add Cooperative to some operators such as sort_preserving_merge when partition > 1

No, that hasn't changed (or at least it shouldn't have). When SPM has more than 1 input, it acts as an exchange and drives the inputs itself. […]

OK, I see: SPM already uses mpsc; we only add the EvaluationType::Eager type to it and don't add another yield/budget, so it should not affect performance.

@pepijnve
Contributor Author

@alamb @Dandandan I'm starting to get the feeling this is a wild goose chase. I adapted bench.sh a bit so that I can pass in --query. I then ran clickbench_1 query 4 multiple times, since that showed a particularly large slowdown in your run. Here's what I got. Depending on the whims of the benchmark gods, the runs are sometimes faster, sometimes slower.

I'm not sure you can draw any meaningful conclusion from these timings alone. I'll try to have a look at performance counters next; maybe that paints a clearer picture.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃                           baseline ┃                             branch ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 4     │ 408.42 / 416.77 ±10.99 / 437.63 ms │ 427.32 / 453.68 ±23.80 / 496.62 ms │    slower │
│ QQuery 4     │ 412.03 / 420.43 ±9.88 / 439.50 ms  │ 413.27 / 426.28 ±8.92 / 437.55 ms  │ no change │
│ QQuery 4     │ 416.20 / 426.53 ±5.67 / 432.79 ms  │ 426.05 / 431.55 ±7.22 / 444.22 ms  │ no change │
│ QQuery 4     │ 409.48 / 417.30 ±8.24 / 431.85 ms  │ 406.66 / 419.13 ±11.66 / 438.14 ms │ no change │
│ QQuery 4     │ 407.35 / 421.29 ±10.34 / 436.56 ms │ 418.82 / 433.02 ±14.18 / 458.39 ms │ no change │
│ QQuery 4     │ 409.18 / 421.46 ±9.34 / 437.74 ms  │ 409.45 / 418.36 ±10.57 / 438.11 ms │ no change │
│ QQuery 4     │ 407.54 / 418.17 ±9.50 / 435.26 ms  │ 413.72 / 423.48 ±7.94 / 434.72 ms  │ no change │
│ QQuery 4     │ 406.85 / 417.14 ±10.05 / 436.24 ms │ 406.06 / 416.47 ±8.68 / 428.46 ms  │ no change │
│ QQuery 4     │ 405.30 / 416.89 ±13.20 / 441.77 ms │ 405.34 / 415.79 ±12.40 / 439.34 ms │ no change │
└──────────────┴────────────────────────────────────┴────────────────────────────────────┴───────────┘

@pepijnve
Contributor Author

@alamb @Dandandan Another update. The realization that Tokio is not P/E-core aware out of the box, and that you have very little to no explicit control over thread affinity on macOS (QoS remains a hint), led me to try to find some other machine to run the perf tests on. We still had some Intel desktop machines lying around in our server room, so I installed a fresh Ubuntu Server 24.04 on one of them and ran the test. Results below. This strengthens my belief that I'm measuring noise.

Additionally, I ran some tests with Instruments yesterday comparing the total number of retired instructions, branch mispredictions, etc. There's some variability across runs, but it's more or less the same (as you would expect).

The hardware specs this ran on are:

  • i9-9900 @ 3.10GHz (no P vs E cores)
  • 32GiB memory
  • 1TB 970 EVO/PRO SSD
Comparing baseline and branch
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃                        baseline ┃                         branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │      1816.81 / 1846.02 ±16.82 / │    1487.24 / 1592.59 ±105.23 / │ +1.16x faster │
│              │                      1864.51 ms │                     1766.04 ms │               │
│ QQuery 1     │       1154.89 / 1168.67 ±9.90 / │      1170.36 / 1176.54 ±5.75 / │     no change │
│              │                      1180.63 ms │                     1185.78 ms │               │
│ QQuery 2     │      2149.19 / 2178.05 ±15.43 / │     2167.67 / 2205.78 ±30.69 / │     no change │
│              │                      2194.42 ms │                     2249.29 ms │               │
│ QQuery 3     │ 950.54 / 965.78 ±11.70 / 979.71 │       929.35 / 942.41 ±10.50 / │     no change │
│              │                              ms │                      956.59 ms │               │
│ QQuery 4     │      2318.32 / 2380.99 ±49.47 / │     2175.43 / 2251.66 ±47.23 / │ +1.06x faster │
│              │                      2426.34 ms │                     2323.00 ms │               │
│ QQuery 5     │   20654.23 / 20842.04 ±135.24 / │  20474.42 / 20693.12 ±278.58 / │     no change │
│              │                     21072.72 ms │                    21236.04 ms │               │
│ QQuery 6     │      3365.89 / 3380.75 ±14.32 / │    3303.32 / 3470.30 ±286.78 / │     no change │
│              │                      3407.26 ms │                     4042.51 ms │               │
│ QQuery 7     │      3667.60 / 3734.69 ±78.34 / │     3720.23 / 3752.53 ±20.55 / │     no change │
│              │                      3883.45 ms │                     3776.15 ms │               │
│ QQuery 8     │      1189.93 / 1214.20 ±21.58 / │     1167.36 / 1189.68 ±21.82 / │     no change │
│              │                      1254.50 ms │                     1226.90 ms │               │
└──────────────┴─────────────────────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 37711.20ms │
│ Total Time (branch)     │ 37274.61ms │
│ Average Time (baseline) │  4190.13ms │
│ Average Time (branch)   │  4141.62ms │
│ Queries Faster          │          2 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          7 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃                        baseline ┃                         branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  16.50 / 20.72 ±8.17 / 37.06 ms │ 16.10 / 16.77 ±0.61 / 17.75 ms │ +1.24x faster │
│ QQuery 1     │  37.95 / 39.92 ±2.24 / 44.24 ms │ 36.54 / 37.35 ±0.70 / 38.59 ms │ +1.07x faster │
│ QQuery 2     │  114.76 / 119.67 ±3.62 / 123.85 │ 93.35 / 96.31 ±1.79 / 98.98 ms │ +1.24x faster │
│              │                              ms │                                │               │
│ QQuery 3     │ 100.46 / 143.26 ±68.08 / 279.05 │   95.18 / 97.89 ±2.08 / 101.42 │ +1.46x faster │
│              │                              ms │                             ms │               │
│ QQuery 4     │      1019.10 / 1046.67 ±31.71 / │       967.72 / 998.89 ±17.07 / │     no change │
│              │                      1102.74 ms │                     1013.90 ms │               │
│ QQuery 5     │      1099.85 / 1126.37 ±22.00 / │      1075.93 / 1084.52 ±8.80 / │     no change │
│              │                      1164.14 ms │                     1096.82 ms │               │
│ QQuery 6     │  28.26 / 32.57 ±4.41 / 40.21 ms │ 26.20 / 29.64 ±2.27 / 33.33 ms │ +1.10x faster │
│ QQuery 7     │  48.48 / 51.67 ±1.66 / 52.95 ms │ 45.85 / 47.43 ±1.14 / 48.65 ms │ +1.09x faster │
│ QQuery 8     │      1301.94 / 1310.48 ±11.55 / │      1241.74 / 1251.86 ±7.81 / │     no change │
│              │                      1333.12 ms │                     1262.60 ms │               │
│ QQuery 9     │       1381.11 / 1391.08 ±7.87 / │      1334.99 / 1342.58 ±4.51 / │     no change │
│              │                      1400.07 ms │                     1348.99 ms │               │
│ QQuery 10    │  350.32 / 355.63 ±5.80 / 365.56 │ 348.43 / 351.11 ±1.43 / 352.37 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 11    │  386.57 / 395.36 ±5.94 / 403.46 │ 393.27 / 396.59 ±2.43 / 399.68 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 12    │       1155.30 / 1158.69 ±2.15 / │      1104.36 / 1107.24 ±2.54 / │     no change │
│              │                      1161.21 ms │                     1111.15 ms │               │
│ QQuery 13    │       1988.21 / 2007.19 ±9.94 / │     1898.72 / 1929.14 ±16.27 / │     no change │
│              │                      2015.68 ms │                     1946.54 ms │               │
│ QQuery 14    │       1212.31 / 1218.28 ±6.63 / │      1160.51 / 1168.44 ±6.35 / │     no change │
│              │                      1230.82 ms │                     1176.61 ms │               │
│ QQuery 15    │       1167.31 / 1173.50 ±9.25 / │      1101.69 / 1108.11 ±7.11 / │ +1.06x faster │
│              │                      1191.88 ms │                     1121.69 ms │               │
│ QQuery 16    │     2588.75 / 2734.29 ±138.85 / │     2494.71 / 2516.10 ±19.90 / │ +1.09x faster │
│              │                      2908.22 ms │                     2549.44 ms │               │
│ QQuery 17    │      2567.55 / 2584.06 ±19.09 / │      2473.06 / 2479.84 ±4.24 / │     no change │
│              │                      2620.85 ms │                     2483.78 ms │               │
│ QQuery 18    │     5061.62 / 5150.17 ±119.14 / │    4893.81 / 5094.60 ±259.69 / │     no change │
│              │                      5382.84 ms │                     5574.29 ms │               │
│ QQuery 19    │  104.53 / 111.22 ±7.69 / 126.24 │ 101.36 / 105.48 ±3.19 / 109.25 │ +1.05x faster │
│              │                              ms │                             ms │               │
│ QQuery 20    │      1608.60 / 1624.36 ±13.22 / │      1604.68 / 1622.16 ±9.65 / │     no change │
│              │                      1643.68 ms │                     1632.41 ms │               │
│ QQuery 21    │      1886.35 / 1908.10 ±17.39 / │     1883.89 / 1910.36 ±30.98 / │     no change │
│              │                      1939.48 ms │                     1969.64 ms │               │
│ QQuery 22    │      3230.90 / 3291.20 ±96.36 / │     3215.98 / 3269.94 ±73.17 / │     no change │
│              │                      3483.50 ms │                     3413.69 ms │               │
│ QQuery 23    │   12742.62 / 12986.18 ±234.60 / │  12887.51 / 13036.66 ±120.87 / │     no change │
│              │                     13407.25 ms │                    13246.23 ms │               │
│ QQuery 24    │ 641.82 / 659.54 ±19.55 / 696.41 │ 643.40 / 655.68 ±8.45 / 664.67 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 25    │  425.36 / 433.39 ±6.10 / 441.01 │ 429.37 / 435.14 ±5.45 / 445.48 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 26    │  635.17 / 647.46 ±7.49 / 654.76 │ 649.43 / 654.46 ±4.11 / 659.08 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 27    │      2362.95 / 2379.85 ±13.64 / │      2392.53 / 2399.95 ±6.61 / │     no change │
│              │                      2395.36 ms │                     2410.38 ms │               │
│ QQuery 28    │    18378.22 / 18533.23 ±95.66 / │  18441.65 / 18594.61 ±125.29 / │     no change │
│              │                     18668.42 ms │                    18821.72 ms │               │
│ QQuery 29    │      1150.49 / 1168.63 ±13.63 / │      1148.00 / 1156.96 ±8.18 / │     no change │
│              │                      1192.78 ms │                     1171.97 ms │               │
│ QQuery 30    │       1100.90 / 1105.88 ±4.96 / │      1064.98 / 1069.64 ±3.15 / │     no change │
│              │                      1113.19 ms │                     1072.83 ms │               │
│ QQuery 31    │       1254.63 / 1264.53 ±9.24 / │     1226.02 / 1241.47 ±15.83 / │     no change │
│              │                      1277.99 ms │                     1270.33 ms │               │
│ QQuery 32    │     4979.64 / 5108.61 ±189.24 / │    4980.73 / 5114.22 ±222.24 / │     no change │
│              │                      5481.98 ms │                     5557.95 ms │               │
│ QQuery 33    │      5183.53 / 5233.23 ±45.34 / │     5203.84 / 5271.23 ±71.45 / │     no change │
│              │                      5307.24 ms │                     5373.36 ms │               │
│ QQuery 34    │     5178.24 / 5257.93 ±110.05 / │     5275.85 / 5359.43 ±79.80 / │     no change │
│              │                      5474.69 ms │                     5463.29 ms │               │
│ QQuery 35    │      1862.37 / 1876.24 ±13.10 / │     1866.96 / 1881.38 ±17.30 / │     no change │
│              │                      1896.66 ms │                     1915.02 ms │               │
│ QQuery 36    │  99.14 / 111.55 ±22.24 / 155.97 │ 98.30 / 111.96 ±23.17 / 158.19 │     no change │
│              │                              ms │                             ms │               │
│ QQuery 37    │  50.94 / 55.96 ±8.58 / 73.09 ms │ 51.04 / 57.79 ±9.18 / 75.97 ms │     no change │
│ QQuery 38    │ 98.17 / 99.93 ±1.36 / 102.03 ms │ 100.21 / 103.01 ±2.28 / 106.51 │     no change │
│              │                                 │                             ms │               │
│ QQuery 39    │ 156.90 / 164.82 ±13.32 / 191.37 │       154.78 / 166.31 ±13.91 / │     no change │
│              │                              ms │                      193.63 ms │               │
│ QQuery 40    │ 43.33 / 50.75 ±10.62 / 71.86 ms │ 46.58 / 53.38 ±9.48 / 72.08 ms │  1.05x slower │
│ QQuery 41    │  42.01 / 45.40 ±2.33 / 48.16 ms │ 38.83 / 42.60 ±2.58 / 46.57 ms │ +1.07x faster │
│ QQuery 42    │  35.95 / 40.04 ±3.71 / 46.25 ms │ 35.48 / 40.16 ±2.77 / 43.74 ms │     no change │
└──────────────┴─────────────────────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline)   │ 86217.62ms │
│ Total Time (branch)     │ 85508.40ms │
│ Average Time (baseline) │  2005.06ms │
│ Average Time (branch)   │  1988.57ms │
│ Queries Faster          │         10 │
│ Queries Slower          │          1 │
│ Queries with No Change  │         32 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃                       baseline ┃                          branch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 184.98 / 189.83 ±2.53 / 191.91 │  186.28 / 191.11 ±3.04 / 194.56 │     no change │
│              │                             ms │                              ms │               │
│ QQuery 2     │ 30.88 / 33.57 ±2.75 / 38.50 ms │  25.66 / 29.63 ±3.12 / 33.86 ms │ +1.13x faster │
│ QQuery 3     │ 57.19 / 58.34 ±0.91 / 59.31 ms │  57.45 / 58.47 ±0.81 / 59.45 ms │     no change │
│ QQuery 4     │ 19.17 / 21.17 ±1.30 / 22.81 ms │  22.04 / 23.09 ±0.83 / 24.30 ms │  1.09x slower │
│ QQuery 5     │ 95.47 / 96.71 ±1.12 / 98.56 ms │  94.46 / 95.11 ±0.47 / 95.73 ms │     no change │
│ QQuery 6     │ 16.97 / 17.28 ±0.28 / 17.82 ms │  16.97 / 17.30 ±0.23 / 17.56 ms │     no change │
│ QQuery 7     │ 178.92 / 183.10 ±6.38 / 195.80 │  177.82 / 181.92 ±7.13 / 196.16 │     no change │
│              │                             ms │                              ms │               │
│ QQuery 8     │ 30.31 / 31.15 ±0.75 / 32.39 ms │  26.75 / 28.40 ±1.13 / 29.71 ms │ +1.10x faster │
│ QQuery 9     │ 80.96 / 105.44 ±43.82 / 192.90 │  84.11 / 85.20 ±0.72 / 86.30 ms │ +1.24x faster │
│              │                             ms │                                 │               │
│ QQuery 10    │ 69.22 / 69.53 ±0.31 / 70.07 ms │  69.66 / 72.36 ±1.50 / 73.73 ms │     no change │
│ QQuery 11    │ 12.78 / 14.23 ±1.42 / 16.07 ms │  14.70 / 15.91 ±1.02 / 17.44 ms │  1.12x slower │
│ QQuery 12    │ 46.39 / 49.56 ±2.85 / 53.70 ms │  46.59 / 49.78 ±2.35 / 52.93 ms │     no change │
│ QQuery 13    │ 38.61 / 39.09 ±0.35 / 39.52 ms │  37.73 / 38.36 ±0.47 / 39.17 ms │     no change │
│ QQuery 14    │ 14.05 / 14.55 ±0.43 / 15.22 ms │  13.30 / 13.88 ±0.53 / 14.79 ms │     no change │
│ QQuery 15    │ 22.79 / 23.77 ±0.62 / 24.72 ms │  21.87 / 21.93 ±0.05 / 22.00 ms │ +1.08x faster │
│ QQuery 16    │ 27.54 / 28.36 ±0.89 / 29.44 ms │  28.28 / 28.62 ±0.37 / 29.34 ms │     no change │
│ QQuery 17    │ 161.75 / 162.75 ±0.75 / 163.87 │  161.42 / 162.39 ±0.82 / 163.86 │     no change │
│              │                             ms │                              ms │               │
│ QQuery 18    │       359.51 / 371.92 ±20.52 / │ 359.23 / 371.58 ±22.78 / 417.12 │     no change │
│              │                      412.89 ms │                              ms │               │
│ QQuery 19    │ 35.48 / 37.25 ±1.68 / 40.40 ms │  35.82 / 38.29 ±3.46 / 45.14 ms │     no change │
│ QQuery 20    │ 52.43 / 53.09 ±0.69 / 54.27 ms │  51.79 / 52.69 ±0.50 / 53.09 ms │     no change │
│ QQuery 21    │ 237.65 / 240.66 ±1.77 / 242.82 │  236.26 / 240.45 ±6.19 / 252.67 │     no change │
│              │                             ms │                              ms │               │
│ QQuery 22    │   38.43 / 54.37 ±13.22 / 74.36 │  49.39 / 53.63 ±5.71 / 64.71 ms │     no change │
│              │                             ms │                                 │               │
└──────────────┴────────────────────────────────┴─────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (baseline)   │ 1895.71ms │
│ Total Time (branch)     │ 1870.10ms │
│ Average Time (baseline) │   86.17ms │
│ Average Time (branch)   │   85.00ms │
│ Queries Faster          │         4 │
│ Queries Slower          │         2 │
│ Queries with No Change  │        16 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Jun 20, 2025

Zooming in on query 4 (it would be super convenient if these had 1-based indices BTW to match line numbers in the file).

Yes, I also find the ClickHouse numbering schemes confusing (we follow the https://github.com/ClickHouse/ClickBench convention)

Thank you @pepijnve -- I have reviewed your analysis and agree with your conclusion (that the reported differences are likely just noise)

To double-check, I tried to manually reproduce the reported results locally:

│ QQuery 4 │ 614.68 ms │ 702.11 ms │ 1.14x slower │

I used a decidedly unscientific approach:

$ cat q4.sql
SELECT COUNT(DISTINCT "UserID") FROM hits;
$ datafusion-cli -f q4.sql  | grep Elapsed

Results on merge-base of this PR

Elapsed 0.311 seconds.
Elapsed 0.287 seconds.
Elapsed 0.292 seconds.
Elapsed 0.294 seconds.

Results on this PR

Elapsed 0.294 seconds.
Elapsed 0.301 seconds.
Elapsed 0.293 seconds.
Elapsed 0.287 seconds.

Thus I conclude there is no appreciable difference and we should merge this PR. I plan to do so after we get a clean CI run (I'll merge up to fix conflicts too).

The Power vs Efficiency cores point is a great (and fascinating) observation -- and one that I think deserves further study. I'll file another ticket to discuss that.

@alamb
Contributor

alamb commented Jun 20, 2025

Thank you also @zhuqi-lucas for all your help and attention to this PR

@alamb
Contributor

alamb commented Jun 20, 2025

Actually, @zhuqi-lucas it wasn't 100% clear to me from your comments -- are you ok with this PR being merged?

@pepijnve
Contributor Author

@alamb thanks for taking the time to verify a bit further on your end. Just FYI (and some shameless PR promotion), #16476 and #16477 will help make investigating individual queries a bit easier.

@zhuqi-lucas
Contributor

@alamb I agree with merging this PR; the change should not affect performance since we already use mpsc for SPM, etc. Thanks!

@alamb
Contributor

alamb commented Jun 20, 2025

I think the CI failures are due to

Merging up to get a clean run

@alamb
Contributor

alamb commented Jun 20, 2025

🚀

@alamb merged commit 8b03e5e into apache:main Jun 20, 2025
30 checks passed
@pepijnve
Contributor Author

Oh my, it landed 🎉. Thanks @alamb, great start of my weekend!

Still working on the Tokio PR. I'll create a follow-up issue referring to the todo I left in the code.

@alamb
Contributor

alamb commented Jun 20, 2025

🤔 now that I write '10-core MacBook', I'm wondering if the 10 core part is where my variability is coming from. That's 6 performance and 4 efficiency cores. Ideally DataFusion keeps the CPU bound work on the perf cores, and uses the efficiency ones for IO. I had been wondering about that and NUMA effects already. A topic for a different thread though.

@alamb
Contributor

alamb commented Jun 20, 2025

Oh my, it landed 🎉. Thanks @alamb, great start of my weekend!

Yeah, I think a bias towards action is the name of the game in DataFusion to keep things moving forward (rather than stalling due to apathy).

Basically I like to think: "if there is nothing actively stopping a PR from merging, we should merge it!"

Successfully merging this pull request may close these issues.

[Epic] Pipeline breaking cancellation support and improvement