-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Simplify predicates in PushDownFilter
optimizer rule
#16362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f12788d
to
fee2358
Compare
let predicate = split_conjunction_owned(filter.predicate.clone()); | ||
let old_predicate_len = predicate.len(); | ||
let new_predicates = simplify_predicates(predicate)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_predicates
operates on col comparison_operator literal
sometimes these are tied in conjuncts: a > -5 AND a < 5
or disjuncts: a < -5 OR a > 5
would it make sense to be able to process per-column predicates in both forms?
To do that we could have a function Expr -> captured per-column predicates and then be able to combine such functions on AND and on OR
This might be related https://github.com/trinodb/trino/blob/232916b75d415a5eb643cf922492eb8513d99aae/core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java#L365
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a > -5 AND a < 5
will be split to [a > -5
, a < 5
], it seems that the two split predicates can't be simplified, that is, we shoud keep the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but if I can have a < -5 OR a > 5
then maybe I can have a < -5 OR a > 5 OR a < -10 OR a > 10
. This is simplifiable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, good point, we can do this as the following PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could use the cp_solver approach to handle expressions more generally and I also agree this could / should be done in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I open an issue to track: #16511
fee2358
to
883f562
Compare
883f562
to
92d9658
Compare
PushDownFilter
optimizer rule
🤖 |
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 and @findepi -- Other than the duplicated equality predicate, I think this PR is ready to go.
I view this as a very nice first step towards a more sophisticated predicate simplification feature in DataFusion. Longer term I think using the cp_solver framework would enable more general analysis, but the approach in this PR is a good start and we can evolve in some follow on PRs
I also verified the newly added tests fail without the code changes in this PR
Completed 2 test files in 0 seconds External error: 14 errors in file /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/simplify_predicates.slt
1. query result mismatch:
[SQL] EXPLAIN SELECT * FROM test_data WHERE int_col > 5 AND int_col > 6;
[Diff] (-expected|+actual)
logical_plan
- 01)Filter: test_data.int_col > Int32(6)
+ 01)Filter: test_data.int_col > Int32(5) AND test_data.int_col > Int32(6)
02)--TableScan: test_data projection=[int_col, float_col, str_col, date_col, bool_col]
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/simplify_predicates.slt:34
Ok(result) | ||
} | ||
|
||
fn simplify_column_predicates(predicates: Vec<Expr>) -> Result<Vec<Expr>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow on PR, perhaps we can explore the possibility here of using the boundary analysis, basically
https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html
That could potentially be a more general form of analysis, rather than these particular rules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll open an issue to track the general way, it may involve more discussion, such as
- A new physical rule? Or leverage the current physical filter push down
- Some filters, such as partition columns related predicates, will be fully pushed down scan, that is, we don't have such predicates in some FilterExec. How to capture this?
- TBD
let predicate = split_conjunction_owned(filter.predicate.clone()); | ||
let old_predicate_len = predicate.len(); | ||
let new_predicates = simplify_predicates(predicate)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could use the cp_solver approach to handle expressions more generally and I also agree this could / should be done in a follow on PR
92d9658
to
96abd67
Compare
96abd67
to
253bdb8
Compare
9c77907
to
ac0ec27
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 -- I think this PR looks really nice and is a really nice step forward for datafusion (I actually noticed DuckDB doing this kind of simplification as well some time ago).
I also ran the extended sqlite tests with this branch locally
export RUST_MIN_STACK=30485760;
INCLUDE_SQLITE=true cargo test --test sqllogictests
And they all passed
Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-396d408d47dff008)
Completed 937 test files in 4 minutes
🚀
} | ||
} | ||
|
||
Ok(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow on, once a false
is added to result, this function can probably return immediately with a single lit(false)
8d392bd
to
123ef69
Compare
Which issue does this PR close?
Rationale for this change
This PR adds predicate simplification capabilities to the
PushDownFilter
optimizer rule. Currently, when filters contain redundant predicates (likex > 5 AND x > 6
), these are not simplified, leading to unnecessary work during query execution.The goal is to optimize filter predicates by removing redundant conditions and keeping only the most restrictive ones, which can improve query performance.
What changes are included in this PR?
New predicate simplification module (
simplify_predicates.rs
):x = 5 AND x = 6
becomesfalse
)Enhanced push_down_filter.rs:
simplify_predicates
before applying other optimizationsAre these changes tested?
20+ test cases in
simplify_predicates.slt
Are there any user-facing changes?