feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relations #16186

Open
wants to merge 9 commits into
base: main

Conversation

duongcongtoai
Contributor

@duongcongtoai duongcongtoai commented May 25, 2025

Which issue does this PR close?

There has been some discussion about handling subqueries with depth awareness at the planning stage. That would be nice to have, but until something like it is implemented, we cannot continue implementing query decorrelation. However, I realized that a minor change to how we plan subqueries is enough to at least avoid the error caused by an ambiguous schema, as in #15558:

  • PlannerContext maintains an optional outer schema; we just need to replace this field with a stack of outer schemas
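
To make the idea concrete, here is a minimal sketch of a context that keeps a stack of outer schemas and resolves an outer reference by searching from the nearest enclosing query outwards. It does not use DataFusion types: `Schema`, `PlannerContextSketch`, and all method names are hypothetical stand-ins, and the lookup rule (nearest enclosing query wins) is an assumption for illustration, not necessarily what this PR implements.

```rust
/// Hypothetical, simplified stand-in for a schema:
/// one relation alias plus its column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub relation: String,
    pub columns: Vec<String>,
}

impl Schema {
    pub fn new(relation: &str, columns: &[&str]) -> Self {
        Self {
            relation: relation.to_string(),
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    fn has_column(&self, relation: &str, column: &str) -> bool {
        self.relation == relation && self.columns.iter().any(|c| c == column)
    }
}

/// Hypothetical planner context that keeps a stack of outer schemas
/// instead of a single optional one.
#[derive(Debug, Default)]
pub struct PlannerContextSketch {
    /// Outermost query first, nearest enclosing query last.
    outer_schemas: Vec<Schema>,
}

impl PlannerContextSketch {
    /// Called when planning descends into a subquery.
    pub fn push_outer_schema(&mut self, schema: Schema) {
        self.outer_schemas.push(schema);
    }

    /// Called when planning of that subquery has finished.
    pub fn pop_outer_schema(&mut self) -> Option<Schema> {
        self.outer_schemas.pop()
    }

    /// Resolve `relation.column` against the enclosing queries, nearest first.
    /// Returns how many levels up the column was found
    /// (1 = the adjacent outer query, 2 = one level further out, ...).
    pub fn resolve_outer_column(&self, relation: &str, column: &str) -> Option<usize> {
        self.outer_schemas
            .iter()
            .rev()
            .position(|s| s.has_column(relation, column))
            .map(|idx| idx + 1)
    }
}
```

With `e1` pushed first and `e2` pushed second, a reference to `e1.dept_id` from the innermost subquery resolves two levels up, which a single optional outer schema cannot express.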

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels May 25, 2025
@duongcongtoai duongcongtoai changed the title feat: Support parsing subqueries with OuterRefColumn belongs to non-adjacent outer relation feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relation May 25, 2025
// (i.e in the case of recursive subquery)
// this function may accidentally pushdown the subquery expr as well
// until then, we have to exclude these exprs here
.partition(|pred| pred.is_volatile() || has_subquery(pred));
Contributor Author

This is to satisfy the new test added in subquery.slt; otherwise the predicate containing the scalar subquery would be accidentally pushed down and cause a planning error.

Contributor

@logan-keede logan-keede left a comment

I have left a few comments below.
I will try to review the logic in depth later.

@@ -235,18 +235,27 @@ impl PlannerContext {
}

// Return a reference to the outer query's schema
pub fn outer_query_schema(&self) -> Option<&DFSchema> {
Contributor

@logan-keede logan-keede May 25, 2025

I think this is a breaking change. You can either add a new function or have the PR marked with the API change label; the first option is preferable.

Contributor Author

oh right 👍

}

/// Sets the outer query schema, returning the existing one, if
/// any
pub fn set_outer_query_schema(
Contributor

Same as above, but this one can simply be deprecated.

Contributor Author

Done, the old methods were deprecated.
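
For readers unfamiliar with the pattern discussed in this review thread, a rough sketch of keeping the old accessors while adding new ones could look like the following. The types are simplified stand-ins rather than DataFusion's, the new method names (`outer_queries_schemas`, `append_outer_query_schema`) are hypothetical, and mapping the old getter and setter onto the innermost stack entry is an assumption made for illustration.

```rust
/// Hypothetical stand-in for a schema; only the name matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema(pub String);

#[derive(Debug, Default)]
pub struct PlannerContextSketch {
    /// Outermost query first, nearest enclosing query last.
    outer_schemas: Vec<Schema>,
}

impl PlannerContextSketch {
    /// New accessor (hypothetical name): every enclosing schema, outermost first.
    pub fn outer_queries_schemas(&self) -> &[Schema] {
        &self.outer_schemas
    }

    /// New mutator (hypothetical name): enter one more level of nesting.
    pub fn append_outer_query_schema(&mut self, schema: Schema) {
        self.outer_schemas.push(schema);
    }

    /// Old accessor, kept with its original shape so callers still compile.
    /// Assumption: it reports the nearest enclosing schema.
    #[deprecated(note = "use `outer_queries_schemas` instead")]
    pub fn outer_query_schema(&self) -> Option<&Schema> {
        self.outer_schemas.last()
    }

    /// Old setter, kept for compatibility. Assumption: it replaces the
    /// nearest enclosing schema and returns the previous one, if any.
    #[deprecated(note = "use `append_outer_query_schema` instead")]
    pub fn set_outer_query_schema(&mut self, schema: Option<Schema>) -> Option<Schema> {
        let previous = self.outer_schemas.pop();
        if let Some(s) = schema {
            self.outer_schemas.push(s);
        }
        previous
    }
}
```

Callers of the deprecated methods get a compiler warning pointing at the replacement instead of a hard break.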

@duongcongtoai duongcongtoai changed the title feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relation feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relations May 25, 2025
// this function may accidentally pushdown the subquery expr as well
// until then, we have to exclude these exprs here
.partition(|pred| {
pred.is_volatile() || has_scalar_subquery(pred)
Contributor Author

When we allow nested subqueries, the final plan reaches this optimizer, and the predicate on the scalar subquery can be accidentally pushed down.
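
As an illustration of the exclusion described here, the sketch below splits a list of predicates so that volatile ones and those containing a scalar subquery stay in place, while the rest remain candidates for pushdown. `Pred` is a hypothetical, heavily simplified expression type (not DataFusion's `Expr`), and `split_for_pushdown` is an illustrative name.

```rust
/// Hypothetical, heavily simplified predicate expression.
#[derive(Debug, Clone, PartialEq)]
pub enum Pred {
    Column(String),
    Literal(i64),
    /// A volatile call such as `random()`.
    Volatile(String),
    /// Placeholder for a scalar subquery; its plan is omitted in this sketch.
    ScalarSubquery,
    Binary(Box<Pred>, &'static str, Box<Pred>),
}

impl Pred {
    /// True if this node or any descendant satisfies `f`.
    fn any(&self, f: &impl Fn(&Pred) -> bool) -> bool {
        if f(self) {
            return true;
        }
        match self {
            Pred::Binary(left, _, right) => left.any(f) || right.any(f),
            _ => false,
        }
    }

    pub fn is_volatile(&self) -> bool {
        self.any(&|p: &Pred| matches!(p, Pred::Volatile(_)))
    }

    pub fn has_scalar_subquery(&self) -> bool {
        self.any(&|p: &Pred| matches!(p, Pred::ScalarSubquery))
    }
}

/// Split predicates into (kept in place, candidates for pushdown).
/// Anything volatile or containing a scalar subquery is kept in place.
pub fn split_for_pushdown(preds: Vec<Pred>) -> (Vec<Pred>, Vec<Pred>) {
    preds
        .into_iter()
        .partition(|p| p.is_volatile() || p.has_scalar_subquery())
}
```

`Iterator::partition` puts the items for which the closure returns `true` into the first collection, so the first vector holds the predicates that must not be moved.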

Comment on lines +1542 to +1551
query TT
explain select c_custkey from customer
where c_acctbal < (
select sum(o_totalprice) from orders
where o_custkey = c_custkey
and o_totalprice in (
select l_extendedprice as price from lineitem where l_orderkey = o_orderkey
and l_extendedprice < c_acctbal
)
) order by c_custkey;
Contributor

@irenjj irenjj May 25, 2025

This still fails in datafusion-cli, but it runs fine in sqllogictest:

> explain select c_custkey from customer
where c_acctbal < (
    select sum(o_totalprice) from orders
    where o_custkey = c_custkey
    and o_totalprice in (
        select l_extendedprice as price from lineitem where l_orderkey = o_orderkey
        and l_extendedprice < c_acctbal
    )
) order by c_custkey;
Schema error: No field named customer.c_acctbal. Valid fields are __correlated_sq_1.l_extendedprice.

Contributor Author

Yep, the generated logical plan is still not executable, because none of the existing decorrelation optimizers can decorrelate it.

@duongcongtoai duongcongtoai marked this pull request as ready for review May 26, 2025 04:36
@logan-keede
Contributor

Currently the errors look like this:

> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.dept_id
    )
);
Schema error: No field named e1.dept_id. Did you mean 'e3.dept_id'?.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.depd
    )
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
>

After this PR:

> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.depd
    )
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.dept_id
    )
);
Schema error: No field named e1.dept_id. Valid fields are e2.salary, __scalar_sq_2."avg(e3.salary)", __scalar_sq_2.dept_id.

The results are a little inconsistent. __scalar_sq_2."avg(e3.salary)", __scalar_sq_2.dept_id are not valid fields in the above context.
Ideally, all the fields in e1, e2 and e3 should be listed here, as they are valid.

@duongcongtoai
Contributor Author

duongcongtoai commented May 28, 2025

The results are a little inconsistent. __scalar_sq_2."avg(e3.salary)", __scalar_sq_2.dept_id are not valid fields in the above context. Ideally, all the field in e1, e2 and e3 should come up here as they are valid.

Note that this error is thrown after the planning stage has completed, and it is expected because of the current limitations of subquery decorrelation.

The columns shown in the error are a side effect of the DecorrelateLateralJoin optimizer.

If we print the output plan after this optimizer completes, here is the result:

Inner Join: e1.dept_id = __scalar_sq_1.dept_id Filter: CAST(e1.salary AS Decimal128(38, 14)) > __scalar_sq_1.avg(e2.salary)
  SubqueryAlias: e1
    TableScan: employees projection=[employee_name, dept_id, salary]
  SubqueryAlias: __scalar_sq_1
    Projection: avg(e2.salary), e2.dept_id
      Aggregate: groupBy=[[e2.dept_id]], aggr=[[avg(e2.salary)]]
        Projection: e2.dept_id, e2.salary
          Inner Join:  Filter: CAST(e2.salary AS Decimal128(38, 14)) > __scalar_sq_2.avg(e3.salary) AND __scalar_sq_2.dept_id = e1.dept_id 
            SubqueryAlias: e2
              TableScan: employees projection=[dept_id, salary]
            SubqueryAlias: __scalar_sq_2
              Projection: avg(e3.salary), e3.dept_id
                Aggregate: groupBy=[[e3.dept_id]], aggr=[[avg(e3.salary)]]
                  SubqueryAlias: e3
                    TableScan: employees projection=[dept_id, salary]

And the error is thrown at this line: Inner Join: Filter: CAST(e2.salary AS Decimal128(38, 14)) > __scalar_sq_2.avg(e3.salary) AND __scalar_sq_2.dept_id = e1.dept_id

Looks like we need a new plan to somehow let this optimizer back off while the new rules are being implemented 🤔

@duongcongtoai
Contributor Author

duongcongtoai commented May 28, 2025

So here are my thoughts (the plan is to split the work into smaller PRs while avoiding breaking things as much as possible):

  1. We introduce 3 optimizers, declared in the order below:
  • DependentJoinRewriter
  • Old optimizers (DecorrelateLateralJoin, ScalarSubqueryToJoin, DecorrelatePredicateSubquery)
  • New optimizer to DecorrelateArbitrarySubquery
  2. We add an if condition inside DependentJoinRewriter so that, if the deepest subquery depth is 1, it doesn't change the query plan at all
    => This allows the old optimizers to work as expected
  3. Develop the new optimizer
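
The depth gate in the second step can be sketched as follows. This is a toy model only: `Plan` is a hypothetical minimal plan type, the `DependentJoin` node shape is invented for illustration, and the rewrite drops the filter predicate entirely because only the gating logic is being shown.

```rust
/// Hypothetical, minimal logical plan: just enough to measure subquery nesting.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan(&'static str),
    /// A filter whose predicate contains the given subquery plans.
    Filter { input: Box<Plan>, subqueries: Vec<Plan> },
    /// Intermediate node produced by the rewriter (invented shape).
    DependentJoin { left: Box<Plan>, right: Box<Plan> },
}

/// Deepest level of subquery nesting in the plan (0 = no subquery).
pub fn max_subquery_depth(plan: &Plan) -> usize {
    match plan {
        Plan::Scan(_) => 0,
        Plan::Filter { input, subqueries } => {
            let nested = subqueries
                .iter()
                .map(|sq| 1 + max_subquery_depth(sq))
                .max()
                .unwrap_or(0);
            nested.max(max_subquery_depth(input))
        }
        Plan::DependentJoin { left, right } => {
            max_subquery_depth(left).max(max_subquery_depth(right))
        }
    }
}

/// Gate: leave plans with nesting depth <= 1 untouched so the existing
/// decorrelation rules still see the shape they expect.
pub fn dependent_join_rewriter(plan: Plan) -> Plan {
    if max_subquery_depth(&plan) <= 1 {
        return plan;
    }
    rewrite(plan)
}

/// Toy rewrite: turn every subquery-bearing filter into a chain of
/// dependent joins. The predicate itself is not modelled here.
fn rewrite(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { input, subqueries } => subqueries
            .into_iter()
            .fold(rewrite(*input), |left, sq| Plan::DependentJoin {
                left: Box::new(left),
                right: Box::new(rewrite(sq)),
            }),
        Plan::DependentJoin { left, right } => Plan::DependentJoin {
            left: Box::new(rewrite(*left)),
            right: Box::new(rewrite(*right)),
        },
        other => other,
    }
}
```

A plan with a single level of subquery passes through unchanged, while the customer/orders/lineitem shape from the new test (depth 2) would be rewritten before the later rules run.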

@duongcongtoai
Contributor Author

Or the easiest way is to have a large feature branch 🤔

@logan-keede
Contributor

Beware that this error is thrown after the planning stage has completed, and it is expected because the current limitation of subquery decorrelation.

Oh, I was under the impression that it was still in the planning stage. Thanks for letting me know.

Looks like we need a new plan to somehow let this Optimizor back-off during the implementation of new rules 🤔

2. We add a if condition inside DependentJoinRewriter, that if the deepest subquery depth is 1, it doesn't change the query plan at all
=> This allow the old optimizors to work as expected.

We probably should not discriminate by depth, since what works for higher depths should also work at depth = 1.
I do remember Improving Unnesting of Complex Queries mentioning the case of simplistic unnesting, but I don't think that was limited by depth either.

@duongcongtoai
Contributor Author

True, I've just realized it. Looks like a feature branch for us to work on is the way to go, then?

@logan-keede
Contributor

cc @alamb

@irenjj
Contributor

irenjj commented May 28, 2025

It looks like @duongcongtoai addressed the depth issue in #16016. Maybe this PR can be merged with #16016 to better verify the depth-related problem?

@duongcongtoai
Contributor Author

Yep, it should be merged once every point is clear, to reduce the review burden.

@duongcongtoai
Contributor Author

duongcongtoai commented May 29, 2025

I created a temp branch here to combine the 2 PRs:
https://github.com/duongcongtoai/arrow-datafusion/blob/14554-subquery-unnest-framework-fixed-planner/datafusion/sqllogictest/test_files/dependent_join_temp.slt

The test outputs a plan that has been rewritten into a dependent join node. We only expect this as an intermediate plan during decorrelation.

http://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default
we can implement the tree formatter for this node type to visualize it. Edit: nope, this format is only applicable to physical_plan.

@irenjj
Contributor

irenjj commented May 29, 2025

we can implement the tree formatter for this node type to visualize it

The tree formatter is used only in datafusion-cli; for sqllogictest, we only use the indent explain. 🤣

Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
3 participants