Skip to content

[SPARK-51885][SQL]Part 1.b Add analyzer support for nested correlated subqueries #50548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 45 commits into
base: master
Choose a base branch
from

Conversation

AveryQi115
Copy link
Contributor

@AveryQi115 AveryQi115 commented Apr 9, 2025

What changes were proposed in this pull request?

  • Add support for queries containing nested correlations in multi-pass analyzer.
    • Change the AnalysisContext.outerPlan from LogicalPlan to LogicalPlans, containing all the outer plans outer references might refer to.
    • Change the update AnalysisContext logic in ResolveSubquery.
    • Change ResolveSubquery to update NestedOuterAttrs when subquery are resolved.
    • Change ResolveAggregateFunction to update NestedOuterAttrs for subquery in the having clause.
    • Change UpdateOuterReferences to update NestedOuterAttrs as well.
  • Add new error types and check analysis methods.
    • Add new error type NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED which prompts users to turn on spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled configs for queries containing nested correlations.
    • Add new check analysis methods to check if the config is turned on for queries containing nested correlations.
    • Add new check analysis methods to ensure main query does not contain subqueries with nested outer attrs. (NestedOuterAttrs.nonEmpty means that subquery contains outer references can't be resolved in the subquery or the containing query of the subquery, but might be resolved in nested outer queries. This is not allowed for the main query as it is the outer most query.)

Currently the config is set to false by default as the optimizer changes would be in later prs.
And the behavior of lateralSubquery is not changed. We don't allow nested correlations in lateralSubquery for now.

Why are the changes needed?

Spark only supports one layer of correlation now and does not support nested correlation.
For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == MAX(t1.col2)
)GROUP BY col1;

is supported and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == (
   SELECT MAX(t1.col2)
 )
)GROUP BY col1;

is not supported.

The reason spark does not support it is because the Analyzer and Optimizer resolves and plans Subquery in a recursive way.

This pr is for add Analyzer support for queries containing nested correlations.

Does this PR introduce any user-facing change?

Yes,

  • Queries containing nested correlations are not supported before. Spark will throw UNRESOLVED_COLUMN or FIELD_NOT_FOUND errors, but now if they are valid with spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled = false, Spark will throw NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED error to prompt user to turn on the flag.

How was this patch tested?

Current UT and Suite.
Extracted tests about nested correlations from duckDB's repo.

  • subquery/nestedcorrelation/scalar-subquery.sql
  • subquery/nestedcorrelation/exists-subquery.sql
  • subquery/nestedcorrelation/combined-subquery.sql
  • subquery/nestedcorrelation/lateral-subquery.sql
  • subquery/nestedcorrelation/subquery-not-supported.sql
    As the optimizer changes are not merged yet, this pr only tests analyzer results for these queries.

The subquery-not-supported contains queries not supported by spark with the resolving nested correlation features.
They are mainly:

  1. Nested correlations in unsupported operators, eg: Limit/Offset, OrderBys.
  2. Subqueries containing nested correlations in unsupported positions, eg: From clause without explicit lateral keywords, subqueries in join conditions.
  3. Subqueries containing outer references wrapped in aggregate expressions, eg: max(outer(a)) and the subquery is not in the having clause.
  4. Nested correlations in the right child of left joins or nested correlations in the left child of the right joins.

For 2 and 4, the optimizer actually already supports them, we might want to support them later in analyzers with more tests added. For 3, Postgresql and DuckDB have different behaviors to resolve them.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Apr 9, 2025
@AveryQi115
Copy link
Contributor Author

cc: @agubichev
This depends on the definition change in this pr #50285

@AveryQi115 AveryQi115 requested a review from vladimirg-db April 18, 2025 23:14
val outerPlanContext = AnalysisContext.get.outerPlans
val newSubqueryPlan = if (outerPlanContext.isDefined &&
// We don't allow lateral subquery having nested correlation
!e.isInstanceOf[LateralSubquery]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so LateralSubquery cannot reference nested scopes. But can the subqueries below the LateralSubquery reference attributes above that LateralSubquery?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, any subqueries within/below LateralSubquery can refer up to attributes in the LateralSubquery or the containing query of the LateralSubquery.
This is becuase when we're resolving a LateralSubquery, we update the AnalysisContext.outerplans to clear all the outerPlans before and only leaves the direct outerPlan for the LateralSubquery's plan. This include any subqueries within the lateralSubquery.

For other subqueries not within the LateralSubquery but are resolved after resolving LateralSubquery, we didn't change the AnalysisContext for them.

Copy link
Contributor Author

@AveryQi115 AveryQi115 Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL and duckDB supports special outerScopeAttrs as long as the outerScopeAttrs and the LateralSubquery has clear quantifier and alias.
We can do this later, but for now, we'll just disallow nested correlations in LateralSubquery.

* Returns the outer scope attributes referenced in the subquery expressions
* in current plan and the children of the current plan.
*/
private def getOuterAttrsNeedToBePropagated(plan: LogicalPlan): Seq[Expression] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this method solves the same problem as SubExprUtils.getOuterReferences. Can we instead update SubExprUtils.getOuterReferences to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better not as there are many other places in the Analyzer use SubExprUtils.getOuterReferences. And I checked that we don't need this getOuterAttrsNeedToBePropagated there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return a pair from getOuterReferences (direct and indirect outer attrs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -228,6 +228,67 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
}
}

def checkNoNestedOuterReferencesInMainQuery(plan: LogicalPlan): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can move this all to ValidateSubqueryExpression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for any subquery within the mainQuery.

Because there can be some outer references can be resolved in the whole plan but are found not in any inputSet for the operators containing subqueries.
In this case, the Analyzer treats these outer references as outerScopeAttrs for each subquery, even the subquery within the mainQuery. But for the subquery within the mainQuery, they cannot have outerScopeAttrs as there are no outer scope above the mainQuery.

ValidateSubqueryExpression checks each subquery. They are different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please think in the background how to efficiently implement this check in the single-pass Analyzer.

@AveryQi115 AveryQi115 requested a review from vladimirg-db April 22, 2025 18:37
@AveryQi115 AveryQi115 changed the title [SPARK-50983][SQL]Part 1.b Add analyzer support for nested correlated subqueries [SPARK-51885][SQL]Part 1.b Add analyzer support for nested correlated subqueries Apr 23, 2025
averyqi-db and others added 2 commits April 23, 2025 11:55
fix wrong number of arguments error; fix assertions

fix wrong number of arguments error

fix wrong number of arguments error

fix for mis-deleting ScalarSubquery.withNewOuterAttrs

fmt

fix wrong number of arguments error

fix wrong number of arguments error

rename unresolved outer attrs to nested outer attrs

throw internalErrors and format

compile and format

resolve comments

rename nestedOuterAttrs to outerScopeAttrs

Update DynamicPruning.scala

Update FunctionTableSubqueryArgumentExpression.scala

add new lines for readability
// We don't allow lateral subquery having nested correlation
!e.isInstanceOf[LateralSubquery]
) {
// The previous outerPlanContext contains resolved outer scope plans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the claim that "the plan is guaranteed to be resolved". FIxed-point Analyzer loops the rules over the partially resolved plan, and the plan can be considered resolved only after the analysis is done, after CheckAnalysis (well, kinda, if you don't take all kinds of bugs into account). Which is different to the single-pass Analyzer, that guarantees that the tree that has been already traversed is properly resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Icic. Do you know any corner testcases for the fixed point analyzer which I can use to test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure... depends on what you are trying to find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants