feat: Support parsing subqueries with `OuterReferenceColumn` belongs to non-adjacent outer relations #16186

duongcongtoai · 2025-05-25T15:21:11Z

Which issue does this PR close?

Partially solves "Nested correlated subquery error with a depth exceeding 1" (#15558) and unblocks "feat: rewrite subquery into dependent join logical plan" (#16016).

There has been some discussion about making subquery handling depth-aware at the planning stage, which would be nice to have, but until something like that is implemented we cannot continue implementing query decorrelation. However, I realized that a minor change to how we plan subqueries is enough to at least avoid the ambiguous-schema error reported in #15558:

PlannerContext maintains an optional outer schema; we just need to replace this field with a stack of outer schemas.
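A minimal sketch of the idea, assuming nothing about the real `PlannerContext` API: the names `Schema`, `Context`, `push_outer_schema`, `pop_outer_schema` and `resolve_outer` below are hypothetical and exist only for illustration. Each time planning descends into a subquery, the enclosing query's schema is pushed; an outer reference is then looked up from the innermost enclosing query outwards, so a column from a non-adjacent outer relation can still be found.

```rust
/// Hypothetical stand-in for a schema: one relation alias plus its column names.
/// This is NOT DataFusion's `DFSchema`.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub relation: String,
    pub columns: Vec<String>,
}

impl Schema {
    pub fn new(relation: &str, columns: &[&str]) -> Self {
        Schema {
            relation: relation.to_string(),
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    fn has_column(&self, relation: &str, column: &str) -> bool {
        self.relation == relation && self.columns.iter().any(|c| c == column)
    }
}

/// Hypothetical planner context that keeps a stack of outer schemas
/// instead of a single optional one.
#[derive(Debug, Default)]
pub struct Context {
    outer_schemas: Vec<Schema>,
}

impl Context {
    /// Called when planning descends into a subquery.
    pub fn push_outer_schema(&mut self, schema: Schema) {
        self.outer_schemas.push(schema);
    }

    /// Called when planning of that subquery has finished.
    pub fn pop_outer_schema(&mut self) -> Option<Schema> {
        self.outer_schemas.pop()
    }

    /// Resolves `relation.column` against the enclosing queries, innermost first.
    /// Returns how many levels up the match was found (1 = adjacent outer query).
    pub fn resolve_outer(&self, relation: &str, column: &str) -> Option<usize> {
        self.outer_schemas
            .iter()
            .rev()
            .position(|s| s.has_column(relation, column))
            .map(|idx| idx + 1)
    }
}
```

With the schemas of `e1` and then `e2` pushed, `resolve_outer("e1", "dept_id")` finds the column two levels up, which is the non-adjacent case that a single optional outer schema cannot represent.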

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

duongcongtoai · 2025-05-25T15:26:38Z

datafusion/optimizer/src/push_down_filter.rs

+                        // (i.e in the case of recursive subquery)
+                        // this function may accidentally pushdown the subquery expr as well
+                        // until then, we have to exclude these exprs here
+                        .partition(|pred| pred.is_volatile() || has_subquery(pred));


This is to satisfy the new test added in subquery.slt; otherwise the predicate containing the scalar subquery would be accidentally pushed down and cause a planning error.

logan-keede

I have left a few comments below.
I will try to review the logic in depth later.

logan-keede · 2025-05-25T15:57:23Z

datafusion/sql/src/planner.rs

@@ -235,18 +235,27 @@ impl PlannerContext {
    }

    // Return a reference to the outer query's schema
-    pub fn outer_query_schema(&self) -> Option<&DFSchema> {


I think this is a breaking change. You can either add a new function or have the PR marked with the API change label; the first option is preferable.

oh right 👍

logan-keede · 2025-05-25T15:57:56Z

datafusion/sql/src/planner.rs

    }

    /// Sets the outer query schema, returning the existing one, if
    /// any
-    pub fn set_outer_query_schema(


Same as above, but this one can simply be deprecated.

Done, the old methods were deprecated.
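For readers unfamiliar with the pattern, here is a hedged sketch of keeping an old accessor alive through Rust's `#[deprecated]` attribute while adding a new one next to it. The type `Ctx` and the methods `outer_schema_stack` and `push_outer_schema` are hypothetical names chosen for illustration; the names actually introduced by this PR may differ.

```rust
/// Hypothetical context type; schemas are plain strings to keep the sketch small.
#[derive(Debug, Default)]
pub struct Ctx {
    schemas: Vec<String>,
}

impl Ctx {
    /// Old single-schema accessor, kept so existing callers still compile.
    /// It reports only the innermost (most recently pushed) outer schema.
    #[deprecated(note = "use `outer_schema_stack` instead")]
    pub fn outer_query_schema(&self) -> Option<&String> {
        self.schemas.last()
    }

    /// New accessor (hypothetical name) exposing every enclosing schema,
    /// outermost first.
    pub fn outer_schema_stack(&self) -> &[String] {
        &self.schemas
    }

    /// Records the schema of the query that encloses the subquery being planned.
    pub fn push_outer_schema(&mut self, schema: String) {
        self.schemas.push(schema);
    }
}
```

Callers of the old method get a compiler warning that points them to the replacement, so the change is not source-breaking.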

duongcongtoai · 2025-05-25T20:58:55Z

datafusion/optimizer/src/push_down_filter.rs

+                        // this function may accidentally pushdown the subquery expr as well
+                        // until then, we have to exclude these exprs here
+                        .partition(|pred| {
+                            pred.is_volatile() || has_scalar_subquery(pred)


When we allow nested subqueries, the final plan reaches this optimizer rule, and the predicate on the scalar subquery can be accidentally pushed down.
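The partition in the diff above can be illustrated with a small self-contained sketch. The `Pred` enum and the `split_pushable` function are hypothetical simplifications, not DataFusion's `Expr` or the real `push_down_filter` code; only the shape of the check (`is_volatile() || has_scalar_subquery(..)`) is taken from the diff.

```rust
/// Hypothetical, heavily simplified predicate type (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Pred {
    /// A plain comparison such as `a > 1`.
    Simple(String),
    /// A predicate calling a volatile function such as `random()`.
    Volatile(String),
    /// A scalar subquery used as a value.
    ScalarSubquery(String),
    /// Conjunction of two predicates.
    And(Box<Pred>, Box<Pred>),
}

impl Pred {
    /// True if any part of the predicate is volatile.
    pub fn is_volatile(&self) -> bool {
        match self {
            Pred::Volatile(_) => true,
            Pred::And(l, r) => l.is_volatile() || r.is_volatile(),
            _ => false,
        }
    }
}

/// True if any part of the predicate contains a scalar subquery.
pub fn has_scalar_subquery(pred: &Pred) -> bool {
    match pred {
        Pred::ScalarSubquery(_) => true,
        Pred::And(l, r) => has_scalar_subquery(l) || has_scalar_subquery(r),
        _ => false,
    }
}

/// Splits predicates into (kept in place, candidates for pushdown).
/// Volatile predicates and predicates containing a scalar subquery stay put.
pub fn split_pushable(preds: Vec<Pred>) -> (Vec<Pred>, Vec<Pred>) {
    preds
        .into_iter()
        .partition(|p| p.is_volatile() || has_scalar_subquery(p))
}
```

Only the plain comparison ends up in the pushdown candidates; anything that mentions a scalar subquery stays at its original position in the plan.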

irenjj · 2025-05-25T23:18:12Z

datafusion/sqllogictest/test_files/subquery.slt

+query TT
+explain select c_custkey from customer
+where c_acctbal < (
+    select sum(o_totalprice) from orders
+    where o_custkey = c_custkey
+    and o_totalprice in (
+        select l_extendedprice as price from lineitem where l_orderkey = o_orderkey
+        and l_extendedprice < c_acctbal
+    )
+) order by c_custkey;


This still fails in datafusion-cli, although it runs fine in sqllogictest:

> explain select c_custkey from customer where c_acctbal < ( select sum(o_totalprice) from orders where o_custkey = c_custkey and o_totalprice in ( select l_extendedprice as price from lineitem where l_orderkey = o_orderkey and l_extendedprice < c_acctbal ) ) order by c_custkey;
Schema error: No field named customer.c_acctbal. Valid fields are __correlated_sq_1.l_extendedprice.

Yep, the generated logical plan is still not executable, because none of the existing decorrelation optimizer rules can decorrelate it.

logan-keede · 2025-05-28T20:40:28Z

Currently the errors look like this:

> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.dept_id
    )
);
Schema error: No field named e1.dept_id. Did you mean 'e3.dept_id'?.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.depd
    )
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
>

After this PR:

> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.depd
    )
);
Schema error: No field named e1.depd. Valid fields are e3.employee_id, e3.employee_name, e3.dept_id, e3.salary.
> explain SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2 
    WHERE e2.dept_id = e1.dept_id
    AND e2.salary > (
        SELECT AVG(e3.salary)
        FROM employees e3
        WHERE e3.dept_id = e1.dept_id
    )
);
Schema error: No field named e1.dept_id. Valid fields are e2.salary, __scalar_sq_2."avg(e3.salary)", __scalar_sq_2.dept_id.

The results are a little inconsistent: __scalar_sq_2."avg(e3.salary)" and __scalar_sq_2.dept_id are not valid fields in the above context.
Ideally, all the fields of e1, e2 and e3 should be listed here, as they are valid.

duongcongtoai · 2025-05-28T21:29:00Z

The results are a little inconsistent: __scalar_sq_2."avg(e3.salary)" and __scalar_sq_2.dept_id are not valid fields in the above context. Ideally, all the fields of e1, e2 and e3 should be listed here, as they are valid.

Be aware that this error is thrown after the planning stage has completed; it is expected because of the current limitations of subquery decorrelation.

The columns shown in the error are a side effect of the DecorrelateLateralJoin optimizer.

If we print the plan after this optimizer completes, this is the result:

Inner Join: e1.dept_id = __scalar_sq_1.dept_id Filter: CAST(e1.salary AS Decimal128(38, 14)) > __scalar_sq_1.avg(e2.salary)
  SubqueryAlias: e1
    TableScan: employees projection=[employee_name, dept_id, salary]
  SubqueryAlias: __scalar_sq_1
    Projection: avg(e2.salary), e2.dept_id
      Aggregate: groupBy=[[e2.dept_id]], aggr=[[avg(e2.salary)]]
        Projection: e2.dept_id, e2.salary
          Inner Join:  Filter: CAST(e2.salary AS Decimal128(38, 14)) > __scalar_sq_2.avg(e3.salary) AND __scalar_sq_2.dept_id = e1.dept_id 
            SubqueryAlias: e2
              TableScan: employees projection=[dept_id, salary]
            SubqueryAlias: __scalar_sq_2
              Projection: avg(e3.salary), e3.dept_id
                Aggregate: groupBy=[[e3.dept_id]], aggr=[[avg(e3.salary)]]
                  SubqueryAlias: e3
                    TableScan: employees projection=[dept_id, salary]

And the error is thrown at this line: Inner Join: Filter: CAST(e2.salary AS Decimal128(38, 14)) > __scalar_sq_2.avg(e3.salary) AND __scalar_sq_2.dept_id = e1.dept_id

Looks like we need a new plan to somehow let this optimizer back off while the new rules are being implemented 🤔

duongcongtoai · 2025-05-28T21:45:44Z

So here are my thoughts (the plan is to split the work into smaller PRs while avoiding breaking things as much as possible):

1. We introduce 3 optimizers, declared in the order below:
   - DependentJoinRewriter
   - OldOptimizors (DecorrelateLateralJoin, ScalarSubqueryToJoin, DecorrelatePredicateSubquery)
   - NewOptimizor to DecorrelateArbitrarySubquery
2. We add an if condition inside DependentJoinRewriter so that, if the deepest subquery depth is 1, it doesn't change the query plan at all.
   => This allows the old optimizers to work as expected.
3. Develop the new optimizer.
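To make the proposed depth check concrete, here is a rough sketch under stated assumptions: `Node`, `max_subquery_depth` and `should_rewrite` are hypothetical and do not correspond to DataFusion's `LogicalPlan` or to any code in this PR. The rewriter would leave the plan untouched when no subquery is nested deeper than one level.

```rust
/// Hypothetical plan tree: a filter may carry subqueries in its predicate.
pub enum Node {
    /// A leaf table scan.
    Scan,
    /// A filter over `input` whose predicate references `subqueries`.
    Filter { input: Box<Node>, subqueries: Vec<Node> },
}

/// Returns the deepest subquery nesting level found in the plan
/// (0 = no subquery, 1 = subqueries that contain no further subquery).
pub fn max_subquery_depth(node: &Node) -> usize {
    match node {
        Node::Scan => 0,
        Node::Filter { input, subqueries } => {
            // Each subquery adds one level on top of its own nesting.
            let nested = subqueries
                .iter()
                .map(|s| 1 + max_subquery_depth(s))
                .max()
                .unwrap_or(0);
            nested.max(max_subquery_depth(input))
        }
    }
}

/// The proposed guard: only rewrite plans whose subqueries nest deeper than 1.
pub fn should_rewrite(node: &Node) -> bool {
    max_subquery_depth(node) > 1
}
```

For the `e1`/`e2`/`e3` query discussed above, the innermost subquery sits at depth 2, so this guard would hand the plan to the new rewriter, while a single-level correlated subquery would be left to the existing rules.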

duongcongtoai · 2025-05-28T22:03:29Z

Or the easiest way is to have a large feature branch 🤔

logan-keede · 2025-05-28T22:06:25Z

Be aware that this error is thrown after the planning stage has completed; it is expected because of the current limitations of subquery decorrelation.

Oh, I was under the impression that it was still in the planning stage. Thanks for letting me know.

Looks like we need a new plan to somehow let this optimizer back off while the new rules are being implemented 🤔

2. We add an if condition inside DependentJoinRewriter so that, if the deepest subquery depth is 1, it doesn't change the query plan at all.
=> This allows the old optimizers to work as expected.

We probably should not discriminate by depth, since whatever works for greater depths should also work at depth = 1.
I remember Improving Unnesting of Complex Queries mentioning the case of Simplistic Unnesting, but I don't think that was limited by depth either.

duongcongtoai · 2025-05-28T22:11:59Z

True, I've just realized that. Looks like a feature branch for us to work on is the way to go, then?

logan-keede · 2025-05-28T22:19:11Z

cc @alamb

irenjj · 2025-05-28T23:43:49Z

It looks like @duongcongtoai addressed the depth issue in #16016. Maybe this PR can be merged with #16016 to better verify the depth-related problem?

duongcongtoai · 2025-05-29T06:21:16Z

Yep, it should be merged once every point is clear, to reduce the review burden.

duongcongtoai · 2025-05-29T08:31:54Z

I created a temp branch here to combine the 2 PRs:
https://github.com/duongcongtoai/arrow-datafusion/blob/14554-subquery-unnest-framework-fixed-planner/datafusion/sqllogictest/test_files/dependent_join_temp.slt

The test outputs a plan that has been rewritten into dependent join nodes. We only expect this as an intermediate plan during decorrelation.

http://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default
~~we can implement the tree formatter for this node type to visualize it~~ Nope, this format is only applicable for physical_plan

irenjj · 2025-05-29T08:52:21Z

~~we can implement the tree formatter for this node type to visualize it~~

The tree formatter is only used in datafusion-cli; for sqllogictest we only use the indent explain format. 🤣

duongcongtoai added 3 commits May 25, 2025 16:10

fix: allow OuterRefColumn for non-adjacent outer relation

2a828ed

fix: accidentally pushdown filter with subquery

dea0b70

chore: clippy

5ed2d24

github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and sqllogictest (SQL Logic Tests (.slt)) labels May 25, 2025

duongcongtoai changed the title ~~feat: Support parsing subqueries with OuterRefColumn belongs to non-adjacent outer relation~~ feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relation May 25, 2025

chore: rm debug details

c2caf37

duongcongtoai commented May 25, 2025

logan-keede suggested changes May 25, 2025

duongcongtoai changed the title ~~feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relation~~ feat: Support parsing subqueries with OuterReferenceColumn belongs to non-adjacent outer relations May 25, 2025

duongcongtoai added 3 commits May 25, 2025 18:50

fix: breaking changes

cec566a

fix: lateral join losing its outer ref columns

699424d

test: more test case for other decorrelation

4edaf61

duongcongtoai commented May 25, 2025

irenjj reviewed May 25, 2025

doc: better comments

244a778

duongcongtoai marked this pull request as ready for review May 26, 2025 04:36

Merge branch 'main' into plann-recursive-subquery

f0a5f8c

duongcongtoai mentioned this pull request May 27, 2025

General framework to decorrelate the subqueries #5492

Open
