Make ClickBench Q23 Go Faster #15177

Comments
OOOO -- here is the duckdb plan and it shows what they are doing! The key is this line:
What I think this is referring to is what @adriangb is describing in : specifically, the Top_N operator passes a filter down into the scan. The filter is "dynamic" in the sense that it is updated as the TopK operator accumulates rows, so only rows that could still make the top N pass the filter.
The topk dynamic filtering is described here:
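To make the idea concrete, here is a minimal sketch (not DataFusion's actual mechanism or API) of what the pushed-down predicate effectively looks like for Q23. The threshold below is a made-up snapshot value; in the real optimization it would be the largest "EventTime" currently held in the TopK heap, and it tightens as execution proceeds:

-- Hypothetical snapshot: assume the 10th-smallest "EventTime" seen so far is
-- 1372713600. Any row with a larger "EventTime" can never enter the top 10,
-- so the scan (and, via statistics, whole row groups) can skip it.
SELECT * FROM 'hits.parquet'
WHERE "URL" LIKE '%google%'
  AND "EventTime" <= 1372713600   -- dynamic bound, updated as the heap changes
ORDER BY "EventTime"
LIMIT 10;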
BTW apparently DuckDB uses the "late materialization" technique with its own native format. Here is an explain courtesy of Joe Issacs and Robert Kruszewski
This looks cool! Very interested in this.
There are two optimizations here that go together; if you check the ClickBench results, DuckDB on its own format is significantly faster than on Parquet. The two optimizer rules that do this are 1) TopN https://github.com/duckdb/duckdb/blob/main/src/optimizer/topn_optimizer.cpp#L105 and 2) late materialization https://github.com/duckdb/duckdb/blob/main/src/optimizer/late_materialization.cpp#L180 (join the filter result back to obtain the rest of the columns).
Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it, but it will take non-trivial effort). My suggested order of implementation is:
I actually think that will likely get us quite fast. I am not sure how much more improvement late-materialized joins will bring without a specialized file format. I don't have time to help plan out late-materializing joins at the moment, but I am quite interested in pushing along the predicate pushdown.
There is a similar thought. Even though it aims at filtering, the idea is similar, for example: Table
Back to topk, we can split the idea into this query:
WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
SELECT t.* FROM t JOIN ids WHERE t.row_id IN (SELECT row_id FROM ids)
I agree -- this is what I meant by "late materialization". Your example / explanation is much better than mine @xudong963 🙏
I did not fully get this part. DF has semi join support and some rewrites to utilize it in similar cases?
> CREATE TABLE t (a int, b int, row_id int);
0 row(s) fetched.
Elapsed 0.004 seconds.
> EXPLAIN (WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
SELECT t.* FROM t JOIN ids WHERE t.row_id IN (SELECT row_id FROM ids));
+---------------+--------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+--------------------------------------------------------------------------------------------+
| logical_plan | LeftSemi Join: t.row_id = __correlated_sq_1.row_id |
| | Cross Join: |
| | TableScan: t projection=[a, b, row_id] |
| | SubqueryAlias: ids |
| | Projection: |
| | Sort: t.a ASC NULLS LAST, fetch=10 |
| | TableScan: t projection=[a] |
| | SubqueryAlias: __correlated_sq_1 |
| | SubqueryAlias: ids |
| | Projection: t.row_id |
| | Sort: t.a ASC NULLS LAST, fetch=10 |
| | Projection: t.row_id, t.a |
| | TableScan: t projection=[a, row_id] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(row_id@2, row_id@0)] |
| | CrossJoinExec |
| | DataSourceExec: partitions=1, partition_sizes=[0] |
| | ProjectionExec: expr=[] |
| | SortExec: TopK(fetch=10), expr=[a@0 ASC NULLS LAST], preserve_partitioning=[false] |
| | DataSourceExec: partitions=1, partition_sizes=[0] |
| | ProjectionExec: expr=[row_id@0 as row_id] |
| | SortExec: TopK(fetch=10), expr=[a@1 ASC NULLS LAST], preserve_partitioning=[false] |
| | DataSourceExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+--------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.005 seconds.
Ah actually, the query given by @xudong963 is, I think, slightly off; I think it should be the following (without the explicit join), which yields the same plan as DuckDB:
> EXPLAIN (WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
SELECT t.* FROM t WHERE t.row_id IN (SELECT row_id FROM ids));
+---------------+------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------------------------------------+
| logical_plan | LeftSemi Join: t.row_id = __correlated_sq_1.row_id |
| | TableScan: t projection=[a, b, row_id] |
| | SubqueryAlias: __correlated_sq_1 |
| | SubqueryAlias: ids |
| | Projection: t.row_id |
| | Sort: t.a ASC NULLS LAST, fetch=10 |
| | Projection: t.row_id, t.a |
| | TableScan: t projection=[a, row_id] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(row_id@2, row_id@0)] |
| | DataSourceExec: partitions=1, partition_sizes=[0] |
| | ProjectionExec: expr=[row_id@0 as row_id] |
| | SortExec: TopK(fetch=10), expr=[a@1 ASC NULLS LAST], preserve_partitioning=[false] |
| | DataSourceExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.
@Dandandan -- I think it would be interesting to try and rewrite q23 manually to that pattern and see how fast it goes.

I suspect (but have not measured) that if we implemented this rewrite we would find it runs much more slowly than the existing code, because the entire input file (all columns) would be decoded and all but 10 rows thrown away.

To avoid this we need to push the join filters into the scan (and get predicate pushdown on by default).

Edit: although now that I say this, maybe it would be much better, as we have to decode all the columns now....
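As a point of reference, a sketch of the knob being referred to: filter evaluation inside the Parquet scan can already be enabled by hand today (assuming the standard `datafusion.execution.parquet.pushdown_filters` config key, whose name and default may vary across versions):

-- Evaluate pushed-down filters inside the Parquet decoder, so rows failing
-- the predicate are skipped before the remaining columns are decoded.
SET datafusion.execution.parquet.pushdown_filters = true;

SELECT * FROM 'hits_partitioned'
WHERE "URL" LIKE '%google%'
ORDER BY "EventTime"
LIMIT 10;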
I tried the rewrite into a semi join and indeed it is over 2x slower (5.3 sec vs 12 sec).
> SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
Elapsed 5.320 seconds.
Here is what I think the rewrite is:
> SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10
);
Elapsed 12.023 seconds.
WatchID is a unique key:
> select count(distinct "WatchID"), count(*) from 'hits_partitioned';
+------------------------------------------+----------+
| count(DISTINCT hits_partitioned.WatchID) | count(*) |
+------------------------------------------+----------+
| 99997493 | 99997497 |
+------------------------------------------+----------+
I also double checked the output:
## orig
datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
## rewrite
datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN (SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10);" > rewrite.out
## check
sort orig.out > orig.out.sort
sort rewrite.out > rewrite.out.sort
diff orig.out.sort rewrite.out.sort
7c7
< Elapsed 5.649 seconds.
---
> Elapsed 11.067 seconds.
I am not really sure where the time is going 🤔
Thanks for checking @alamb! I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it runs as
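One way to test that hypothesis (a sketch; `datafusion.optimizer.repartition_joins` is assumed to be the relevant knob and its effect may vary across versions) would be to rerun the rewrite with join repartitioning disabled and compare timings:

-- Disable repartitioning of join inputs to see how much of the 12 s is spent
-- in the Partitioned hash join exchange rather than in decoding / probing.
SET datafusion.optimizer.repartition_joins = false;

SELECT * FROM 'hits_partitioned' WHERE "WatchID" IN (
    SELECT "WatchID" FROM 'hits_partitioned'
    WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10
);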
I also think the
BTW, combined with @adriangb's PR here, it will likely go crazy fast 🚀
I traced this down to an issue in the planner, which uses
I recently took a detailed look at this optimization in ClickHouse, and it might offer you some insights @alamb.

Rewrite SQL in ClickHouse

First, in ClickHouse, each row of data can be located using the two virtual columns _part and _part_offset.

-- Q1:
SELECT * from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
10 rows in set. Elapsed: 34.907 sec. Processed 18.63 million rows, 11.77 GB (533.61 thousand rows/s., 337.25 MB/s.)
Peak memory usage: 1.31 GiB.
-- Q2:
SELECT * FROM hits WHERE (_part,_part_offset) in
(SELECT _part,_part_offset from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10);
10 rows in set. Elapsed: 7.262 sec. Processed 18.68 million rows, 3.13 GB (2.57 million rows/s., 431.28 MB/s.)
Peak memory usage: 190.45 MiB.

I measured that Q1 took 34 seconds, while Q2 only took 7.2 seconds (the page cache was cleared before each run). However, in earlier versions of ClickHouse (such as 23.12), query Q2 would actually degrade in performance. But even then, if I split Q2 into two separate statements and executed them manually, it still worked perfectly fine.

-- Initial (Clear page cache)
-- step 1
SELECT _part,_part_offset FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
10 rows in set. Elapsed: 6.254 sec. Processed 18.63 million rows, 3.10 GB (2.98 million rows/s., 495.76 MB/s.)
Peak memory usage: 190.79 MiB.
-- step 2
SELECT * FROM hits WHERE
(_part = 'all_1_210_3' AND _part_offset IN (20223140, 20223142, 20223144, 19725555, 15188338, 13322137, 19741966, 3076201))
OR
(_part = 'all_211_216_1' AND _part_offset IN (692957, 692958));
10 rows in set. Elapsed: 0.731 sec. Processed 65.04 thousand rows, 36.83 MB (89.02 thousand rows/s., 50.41 MB/s.)
Peak memory usage: 46.23 MiB.
-- The total time is 6.98 seconds.

Conclusion
Thank you so much for sharing this blog link, it's truly an excellent learning resource! I hadn't come across it before; I was just looking at the source code of this related optimization PR and running performance tests on the hits dataset. They introduced a new type called ColumnLazy (interestingly, the comments also mention optimizing the join process through this structure) to implement lazy materialization of columns. However, in my testing I noticed that performance starts to degrade when the LIMIT gets larger.

However, I noticed that using the following SQL rewrite approach hardly leads to any performance degradation, even when tested with LIMIT 100000:

-- Q1:
SELECT * from hits ORDER BY "EventTime" LIMIT 100000;
100000 rows in set. Elapsed: 70.314 sec. Processed 103.89 million rows, 70.28 GB (1.48 million rows/s., 999.50 MB/s.)
Peak memory usage: 27.66 GiB.
-- Q2:
SELECT * FROM hits WHERE (_part,_part_offset) in
(SELECT _part,_part_offset from hits ORDER BY "EventTime" LIMIT 100000);
100000 rows in set. Elapsed: 5.639 sec. Processed 82.02 million rows, 3.39 GB (14.55 million rows/s., 601.81 MB/s.)
Peak memory usage: 715.57 MiB.

The direct SQL rewrite approach mentioned above might still have some issues in ClickHouse at the moment. They are currently exploring the possibility of using it alongside projections (a feature in ClickHouse akin to materialized views) to create secondary indexes and similar functionality. Relevant PR: ClickHouse/ClickHouse#78429 (comment). I find this truly remarkable, and I'm also investigating why the direct SQL rewrite performs faster than ColumnLazy.
I have already discovered the reason: ClickHouse/ClickHouse#79645
There's a follow-up PR that makes the query in that comment work consistently: ClickHouse/ClickHouse#79471

Also, I had asked for something that sounds equivalent to a dynamic filter in TopK in ClickHouse in ClickHouse/ClickHouse#75098 and ClickHouse/ClickHouse#75774 (comment), so it seems that ClickHouse has yet to perform that optimization; at the moment it just has lazy materialization, and soon the manual version of it via query rewriting and "projection indexes".
Is your feature request related to a problem or challenge?
Comparing ClickBench on DataFusion 45 and DuckDB (link)
You can see that for Q23 DataFusion is almost 2x slower (around 10s, whereas DuckDB takes about 5s)

You can run this query like this:
Here is the explain plan
Something that immediately jumps out at me in the explain plan is this line
"Projection" I think means that all of those columns are being read/ decoded from parquet, which makes sense as the query has a
SELECT *
on it.However, in this case all but the top 10 rows are returned (out of 100M rows in the file)
So this means that most of the decoded data is decoded and thrown away immediately
Describe the solution you'd like
I would like to close the gap with DuckDB with some general-purpose improvement.
Describe alternatives you've considered
I think the way to improve performance here is to defer decoding ("Materializing") the other columns until we know what the top 10 rows are.
some wacky ideas:
Late materialization would look something like
row_id
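A hedged sketch of that idea in SQL, assuming a hypothetical row_id column identifying each row's position in the file (DataFusion does not expose such a column for Parquet today): the first pass reads only the columns needed by the filter and the sort, and the second pass fetches the remaining columns for just the winning 10 rows.

-- Pass 1: find the row ids of the top 10 rows, touching only "URL",
-- "EventTime" and the (hypothetical) row_id.
-- Pass 2: fetch all the other columns for exactly those 10 rows.
WITH top10 AS (
    SELECT row_id
    FROM hits
    WHERE "URL" LIKE '%google%'
    ORDER BY "EventTime"
    LIMIT 10
)
SELECT hits.*
FROM hits
WHERE hits.row_id IN (SELECT row_id FROM top10);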
Additional context
No response