[SPARK-51885][SQL]Part 1.b Add analyzer support for nested correlated subqueries #50548

AveryQi115 · 2025-04-09T20:33:49Z

What changes were proposed in this pull request?

Add support for queries containing nested correlations to the multi-pass analyzer.
- Change AnalysisContext.outerPlan from a single LogicalPlan to multiple LogicalPlans, containing all the outer plans that outer references might refer to.
- Change the logic that updates AnalysisContext in ResolveSubquery.
- Change ResolveSubquery to update NestedOuterAttrs when subqueries are resolved.
- Change ResolveAggregateFunction to update NestedOuterAttrs for subqueries in the HAVING clause.
- Change UpdateOuterReferences to update NestedOuterAttrs as well.

Add new error types and check-analysis methods.
- Add a new error type, NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED, which prompts users to turn on the spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled config for queries containing nested correlations.
- Add new check-analysis methods to verify that the config is turned on for queries containing nested correlations.
- Add new check-analysis methods to ensure that the main query does not contain subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means that the subquery contains outer references that cannot be resolved in the subquery or in the query containing it, but might be resolved in queries further out. This is not allowed for the main query, as it is the outermost query.)

The config currently defaults to false because the optimizer changes will come in later PRs.
The behavior of LateralSubquery is unchanged: nested correlations in a LateralSubquery are not allowed for now.
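
As a rough illustration of the idea behind keeping several outer plans, the sketch below resolves a column name against a chain of enclosing scopes. It is a toy model under stated assumptions: `NestedScopeSketch`, `Resolution`, and `resolve` are hypothetical names, scopes are plain sets of column names, and none of this is Spark's Catalyst code.

```scala
// Toy model only: these are NOT Spark's Catalyst classes.
object NestedScopeSketch {
  sealed trait Resolution
  // Resolved by the subquery's own plan.
  case object Local extends Resolution
  // Resolved by the query that directly contains the subquery.
  case object DirectOuter extends Resolution
  // Resolved by a query further out (a nested correlation); depth >= 1.
  final case class OuterScope(depth: Int) extends Resolution
  // Not found in any scope.
  case object Unresolved extends Resolution

  // `outerScopes` is ordered from the innermost enclosing query outwards.
  def resolve(
      name: String,
      local: Set[String],
      outerScopes: Seq[Set[String]]): Resolution = {
    if (local.contains(name)) {
      Local
    } else {
      outerScopes.indexWhere(_.contains(name)) match {
        case -1 => Unresolved
        case 0 => DirectOuter
        case i => OuterScope(i)
      }
    }
  }
}
```

For a query shaped like the unsupported example in this description, the innermost subquery has no columns of its own, its direct outer scope is t2, and t1 sits one level further out, so `resolve("t1.col2", Set.empty, Seq(Set("t2.col1", "t2.col2"), Set("t1.col1", "t1.col2")))` yields `OuterScope(1)` in this model.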

Why are the changes needed?

Spark currently supports only one level of correlation and does not support nested correlations.
For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == MAX(t1.col2)
) GROUP BY col1;

is supported and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == (
   SELECT MAX(t1.col2)
 )
) GROUP BY col1;

is not supported.

Spark does not support it because the Analyzer and Optimizer resolve and plan subqueries recursively.

This PR adds Analyzer support for queries containing nested correlations.

Does this PR introduce any user-facing change?

Yes.

Previously, queries containing nested correlations were not supported, and Spark threw UNRESOLVED_COLUMN or FIELD_NOT_FOUND errors. Now, if such a query is otherwise valid and spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled = false, Spark throws a NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED error to prompt the user to turn on the flag.
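
The gating behavior described above can be pictured with a small sketch. This is an assumption-laden model, not the PR's CheckAnalysis code: `NestedCorrelationGateSketch` and `check` are hypothetical, the configuration is a plain `Map`, and only the flag name and the error name come from this description.

```scala
// Toy model only: a plain Map stands in for Spark's SQL configuration.
object NestedCorrelationGateSketch {
  // Flag name as given in the PR description.
  val FlagName = "spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled"

  // Returns Left(errorName) when analysis should fail, Right(()) otherwise.
  // An absent flag is treated as false, matching the default described above.
  def check(
      conf: Map[String, String],
      hasNestedCorrelation: Boolean): Either[String, Unit] = {
    val enabled = conf.get(FlagName).contains("true")
    if (hasNestedCorrelation && !enabled) {
      Left("NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED")
    } else {
      Right(())
    }
  }
}
```

In this model, a query without nested correlations passes regardless of the flag, so existing queries behave as before.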

How was this patch tested?

Existing unit tests and suites.
Tests about nested correlations extracted from DuckDB's repo:

- subquery/nestedcorrelation/scalar-subquery.sql
- subquery/nestedcorrelation/exists-subquery.sql
- subquery/nestedcorrelation/combined-subquery.sql
- subquery/nestedcorrelation/lateral-subquery.sql
- subquery/nestedcorrelation/subquery-not-supported.sql

As the optimizer changes are not merged yet, this PR only tests the analyzer results for these queries.

subquery-not-supported.sql contains queries that Spark does not support even with nested correlation resolution.
They are mainly:

1. Nested correlations in unsupported operators, e.g. Limit/Offset, OrderBy.
2. Subqueries containing nested correlations in unsupported positions, e.g. the FROM clause without an explicit LATERAL keyword, or subqueries in join conditions.
3. Subqueries containing outer references wrapped in aggregate expressions, e.g. max(outer(a)), where the subquery is not in the HAVING clause.
4. Nested correlations in the right child of a left join or in the left child of a right join.

For 2 and 4, the optimizer already supports them; we might want to support them in the analyzer later, with more tests added. For 3, PostgreSQL and DuckDB resolve them differently.

Was this patch authored or co-authored using generative AI tooling?

No

AveryQi115 · 2025-04-09T21:00:10Z

cc: @agubichev
This depends on the definition change in PR #50285.

vladimirg-db · 2025-04-22T07:49:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+      val outerPlanContext = AnalysisContext.get.outerPlans
+      val newSubqueryPlan = if (outerPlanContext.isDefined &&
+        // We don't allow lateral subquery having nested correlation
+        !e.isInstanceOf[LateralSubquery]


Ok, so LateralSubquery cannot reference nested scopes. But can the subqueries below the LateralSubquery reference attributes above that LateralSubquery?

No. Any subquery within or below a LateralSubquery can only refer to attributes in the LateralSubquery or in the query containing the LateralSubquery.
This is because, when resolving a LateralSubquery, we update AnalysisContext.outerPlans to clear all previously recorded outer plans and keep only the direct outer plan of the LateralSubquery's plan. This applies to any subqueries within the LateralSubquery as well.

For other subqueries that are not within the LateralSubquery but are resolved after it, the AnalysisContext is unchanged.

PostgreSQL and DuckDB support outerScopeAttrs in this position as long as the outerScopeAttrs and the LateralSubquery have a clear quantifier and alias.
We can do this later, but for now we simply disallow nested correlations in LateralSubquery.
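
The context update described in this reply can be sketched as follows. This is a simplified model, not the code in Analyzer.scala: `OuterPlansContextSketch` and `enterSubquery` are hypothetical names, and plans are represented by plain strings.

```scala
// Toy model only: plans are plain names, ordered innermost-first.
object OuterPlansContextSketch {
  // `current` holds the outer plans visible so far.
  // `directOuter` is the plan that directly contains the subquery expression.
  def enterSubquery(
      current: Seq[String],
      directOuter: String,
      isLateral: Boolean): Seq[String] = {
    if (isLateral) {
      // Lateral subquery: forget everything recorded before and keep
      // only the direct outer plan.
      Seq(directOuter)
    } else {
      // Other subqueries: the direct outer plan is added in front of
      // the outer plans that were already visible.
      directOuter +: current
    }
  }
}
```

Under these assumptions, a subquery nested inside a lateral subquery ends up seeing only the lateral subquery's plan and the plan containing the lateral subquery, which matches the restriction stated above.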

vladimirg-db · 2025-04-22T07:52:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+     * Returns the outer scope attributes referenced in the subquery expressions
+     *  in current plan and the children of the current plan.
+     */
+    private def getOuterAttrsNeedToBePropagated(plan: LogicalPlan): Seq[Expression] = {


It feels like this method solves the same problem as SubExprUtils.getOuterReferences. Can we instead update SubExprUtils.getOuterReferences to do that?

Better not, as many other places in the Analyzer use SubExprUtils.getOuterReferences, and I checked that getOuterAttrsNeedToBePropagated is not needed there.

We could return a pair from getOuterReferences (direct and indirect outer attrs).

Sounds good.
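
One way to picture the "return a pair" suggestion is sketched below. It is only a model of the idea under assumptions: `OuterReferencePairSketch`, `OuterRef`, and `split` are hypothetical, and the real SubExprUtils.getOuterReferences operates on Catalyst plans and expressions rather than on names.

```scala
// Toy model only: outer references are identified by name.
object OuterReferencePairSketch {
  final case class OuterRef(name: String)

  // Splits outer references into (direct, indirect):
  // direct   = provided by the plan that directly contains the subquery,
  // indirect = must be resolved by a scope further out.
  def split(
      refs: Seq[OuterRef],
      directOuterOutput: Set[String]): (Seq[OuterRef], Seq[OuterRef]) = {
    refs.partition(ref => directOuterOutput.contains(ref.name))
  }
}
```

Callers that only care about direct outer references could take the first element of the pair and ignore the second.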

vladimirg-db · 2025-04-22T07:55:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

@@ -228,6 +228,67 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
    }
  }

+  def checkNoNestedOuterReferencesInMainQuery(plan: LogicalPlan): Unit = {


I wonder if we can move this all to ValidateSubqueryExpression?

This check applies only to subqueries within the main query.

Some outer references can be resolved somewhere in the whole plan but are not found in any inputSet of the operators containing the subqueries.
In this case, the Analyzer treats these outer references as outerScopeAttrs for each subquery, even a subquery within the main query. But a subquery within the main query cannot have outerScopeAttrs, because there is no outer scope above the main query.

ValidateSubqueryExpression checks each subquery individually, so the two checks are different.

Aha, I see.

Please think in the background how to efficiently implement this check in the single-pass Analyzer.
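
The main-query check discussed in this thread can be sketched roughly as below. This is a simplified model, not the body of checkNoNestedOuterReferencesInMainQuery: `MainQueryCheckSketch`, `Plan`, `Subquery`, and `danglingOuterScopeAttrs` are hypothetical, and attributes are plain strings.

```scala
// Toy model only: these are NOT Catalyst's LogicalPlan or SubqueryExpression.
object MainQueryCheckSketch {
  // A subquery expression with its plan and the attributes it expects
  // some scope beyond its directly containing query to provide.
  final case class Subquery(plan: Plan, outerScopeAttrs: Seq[String])

  // An operator with the subqueries attached to it and its child operators.
  final case class Plan(
      name: String,
      subqueries: Seq[Subquery] = Nil,
      children: Seq[Plan] = Nil)

  // Collects outerScopeAttrs of subqueries attached to operators of the
  // main query. It walks the main query's operators only and does not
  // descend into subquery plans, where such attributes can be legitimate.
  def danglingOuterScopeAttrs(mainQuery: Plan): Seq[String] = {
    mainQuery.subqueries.flatMap(_.outerScopeAttrs) ++
      mainQuery.children.flatMap(danglingOuterScopeAttrs)
  }
}
```

A non-empty result corresponds to the case described above: a subquery of the main query expects an outer scope that does not exist.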

vladimirg-db · 2025-04-24T09:31:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        // We don't allow lateral subquery having nested correlation
+        !e.isInstanceOf[LateralSubquery]
+      ) {
+        // The previous outerPlanContext contains resolved outer scope plans


I'm not sure about the claim that "the plan is guaranteed to be resolved". The fixed-point Analyzer loops the rules over the partially resolved plan, and the plan can be considered resolved only after the analysis is done, after CheckAnalysis (well, kinda, if you don't take all kinds of bugs into account). This differs from the single-pass Analyzer, which guarantees that the tree that has already been traversed is properly resolved.

I see. Do you know of any corner test cases for the fixed-point Analyzer that I could use to test this?

Not sure... depends on what you are trying to find.

AveryQi115 added 10 commits March 14, 2025 17:22

add unresolved outer attrs

9a6f982

fix wrong number of arguments error; fix assertions

52d5ce7

fix wrong number of arguments error

e3bfef4

fix wrong number of arguments error

995ffdd

fix for mis-deleting ScalarSubquery.withNewOuterAttrs

08d3cce

fmt

471f084

fix wrong number of arguments error

4e0bf74

fix wrong number of arguments error

bc9179e

rename unresolved outer attrs to nested outer attrs

9559dbc

Analyzer support nested correlated subqueries

edd6828

github-actions bot added the SQL label Apr 9, 2025

fix compilation error

bbfdd7b

vladimirg-db reviewed Apr 9, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

vladimirg-db reviewed Apr 10, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

vladimirg-db reviewed Apr 10, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

AveryQi115 and others added 14 commits April 15, 2025 11:25

throw internalErrors and format

dbb2dd1

compile and format

4500892

testing

5886273

try to align errors

31937b6

remove temporary test first

4880813

restore FunctionTableSubqueryArgumentExpression, output UNRESOLVED_COLUMN.WITH_SUGGESTION error for main query with nested outer attrs

fc37a5e

update updateOuterReferences for nested correlation

9457a27

scalafmt

3acaafd

remove assertion as we might have duplicate column identifiers

5de70ee

new error type

1f6f000

format new error type

aba5e81

try regenerate golden files

2787777

update ResolveSubquerySuite

e51ce61

fix ResolveSubquerySuite.scala

27c909c

averyqi-db and others added 6 commits April 18, 2025 11:37

restore same behavior for lateral subqueyr

a318f1e

restore error msg for lateral subquery

154b1db

update subquery's nested outer references in resolveAggregateFunction for subquery in having clause

f9e2b23

remove temporary tests

c428df8

Merge branch 'master' into AveryQi115/analyzer_support_nested_correlation

8026f89

restore missing_attributes error

9932efe

AveryQi115 requested a review from vladimirg-db April 18, 2025 23:14

averyqi-db added 9 commits April 18, 2025 17:04

add test

508065d

generate test

4faca48

test

9a31a2f

deduplicate

bb97392

summarize not supported

26fb9fb

add new configs to control subquery type level feature

7f30dfa

queries returning nonderterministic results are also supported.

1465741

ignore tests under nestedcorrelation in ThriftServerQueryTestSuite

9794f0f

rename nestedOuterAttrs to outerScopeAttrs

8c3ce16

vladimirg-db reviewed Apr 22, 2025

resolve comments

a18e598

AveryQi115 requested a review from vladimirg-db April 22, 2025 18:37

AveryQi115 changed the title ~~[SPARK-50983][SQL]Part 1.b Add analyzer support for nested correlated subqueries~~ [SPARK-51885][SQL]Part 1.b Add analyzer support for nested correlated subqueries Apr 23, 2025

averyqi-db and others added 2 commits April 23, 2025 11:55

revert deduplication because we don't want to change current behavior

93d2003

vladimirg-db reviewed Apr 24, 2025

averyqi-db added 2 commits April 28, 2025 17:27

validateOuterScopeAttrs are used to check new outerScopeAttrs

0285abc

Fix errors for subqueries in the having clause

6432b05
