-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: Support parsing subqueries with OuterReferenceColumn
belongs to non-adjacent outer relations
#16186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
OuterRefColumn
belongs to non-adjacent outer relationOuterReferenceColumn
belongs to non-adjacent outer relation
// (i.e in the case of recursive subquery) | ||
// this function may accidentally pushdown the subquery expr as well | ||
// until then, we have to exclude these exprs here | ||
.partition(|pred| pred.is_volatile() || has_subquery(pred)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to satisfy the new test added in subquery.slt, else the predicate containing the scalar subquery will be accidentally pushed down and cause planning error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left a few comments below.
I will try to review the logic in depth later.
@@ -235,18 +235,27 @@ impl PlannerContext { | |||
} | |||
|
|||
// Return a reference to the outer query's schema | |||
pub fn outer_query_schema(&self) -> Option<&DFSchema> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a breaking change. You can either make a new function or have the PR marked with API changes
label. with the first option being more preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right 👍
} | ||
|
||
/// Sets the outer query schema, returning the existing one, if | ||
/// any | ||
pub fn set_outer_query_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, but this can simply be deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, old methods were deprecated
OuterReferenceColumn
belongs to non-adjacent outer relationOuterReferenceColumn
belongs to non-adjacent outer relations
// this function may accidentally pushdown the subquery expr as well | ||
// until then, we have to exclude these exprs here | ||
.partition(|pred| { | ||
pred.is_volatile() || has_scalar_subquery(pred) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we allow nested subquery, the final plan reaches this optimizor and the predicate on scalar_subquery can be accidentally push down
query TT | ||
explain select c_custkey from customer | ||
where c_acctbal < ( | ||
select sum(o_totalprice) from orders | ||
where o_custkey = c_custkey | ||
and o_totalprice in ( | ||
select l_extendedprice as price from lineitem where l_orderkey = o_orderkey | ||
and l_extendedprice < c_acctbal | ||
) | ||
) order by c_custkey; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still failed in datafusion-cli, but in sqllogictest it runs well:
> explain select c_custkey from customer
where c_acctbal < (
select sum(o_totalprice) from orders
where o_custkey = c_custkey
and o_totalprice in (
select l_extendedprice as price from lineitem where l_orderkey = o_orderkey
and l_extendedprice < c_acctbal
)
) order by c_custkey;
Schema error: No field named customer.c_acctbal. Valid fields are __correlated_sq_1.l_extendedprice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, the generated logical plan is still unexecutable, because all existing decorrelating optimizor cannot decorrelate it
Currently error look like:- > explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.dept_id = e1.dept_id
AND e2.salary > (
SELECT AVG(e3.salary)
FROM employees e3
WHERE e3.dept_id = e1.dept_id
)
);
Schema error: No field named e1.dept_id. Did you mean 'e3.dept_id'?.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.dept_id = e1.dept_id
AND e2.salary > (
SELECT AVG(e3.salary)
FROM employees e3
WHERE e3.dept_id = e1.depd
)
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
> After This PR:- > explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.dept_id = e1.dept_id
AND e2.salary > (
SELECT AVG(e3.salary)
FROM employees e3
WHERE e3.dept_id = e1.depd
)
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.dept_id = e1.dept_id
AND e2.salary > (
SELECT AVG(e3.salary)
FROM employees e3
WHERE e3.dept_id = e1.dept_id
)
);
Schema error: No field named e1.dept_id. Valid fields are e2.salary, __scalar_sq_2."avg(e3.salary)", __scalar_sq_2.dept_id. The results are a little inconsistent. |
Beware that this error is thrown after the planning stage has completed, and it is expected because the current limitation of subquery decorrelation. The columns shown in the error is the side-effect of If we print the output plan after this optimizor completes, here is the result
And the error is thrown at this line: Looks like we need a new plan to somehow let this Optimizor back-off during the implementation of new rules 🤔 |
So here are my thoughts (this plan is to split the work in smaller PRs) while avoid breaking things as much as possible:
|
or an easiest way is to have a large feature branch 🤔 |
Oh I was under the impression that it was still in planning stage. Thanks for letting me know.
We probably should not discriminate in terms of depths, as what works for higher depth should work at depth = 1. |
true, i've just realized it. Looks like a feature branch for us to work on is the way then? |
cc @alamb |
It looks like @duongcongtoai addressed the depth issue in #16016. Maybe this PR can be merged with #16016 to better verify the depth-related problem? |
yep, it should be merged after every point is clear, to reduce review burden |
i create a temp branch here to combine 2 PRs The test output a plan being rewritten into dependent join node. We only expect this as a intermediate plan during decorrelation http://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default |
tree formatter is used only in datafusion-cli, for sqllogical test, we only use indent explain.🤣 |
Which issue does this PR close?
There were some discussion going on regarding handle subquery with depth aware at the planning stage, which is a nice thing to have, but until we implement something like that, we cannot continue implement query decorrelation. But i realize that we can add some minor change to how we plan the subqueries so at least no error is thrown because of ambiguous schema as in #15558:
PlannerContext
maintains an optional outer schema, we just need to replace this field with a stack of outer schemaRationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?