[SPARK-51885][SQL]Part 1.b Add analyzer support for nested correlated subqueries #50548
Conversation
cc: @agubichev
```scala
val outerPlanContext = AnalysisContext.get.outerPlans
val newSubqueryPlan = if (outerPlanContext.isDefined &&
    // We don't allow lateral subquery having nested correlation
    !e.isInstanceOf[LateralSubquery]
```
Ok, so a `LateralSubquery` cannot reference nested scopes. But can the subqueries below the `LateralSubquery` reference attributes above that `LateralSubquery`?
No. Any subquery within/below a `LateralSubquery` can only refer to attributes in the `LateralSubquery` itself or in the direct containing query of the `LateralSubquery`.
This is because when we resolve a `LateralSubquery`, we update `AnalysisContext.outerPlans` to clear all the previous outer plans, leaving only the direct outer plan for the `LateralSubquery`'s plan. This includes any subqueries within the `LateralSubquery`.
For other subqueries that are not within the `LateralSubquery` but are resolved after it, we don't change the `AnalysisContext`.
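This outer-plan bookkeeping can be sketched as a toy model (the names `enterSubquery` / `enterLateralSubquery` and the `List[Plan]` stack are illustrative stand-ins, not Spark's actual `AnalysisContext` API):

```scala
// Toy model of how AnalysisContext.outerPlans behaves when the analyzer
// descends into subqueries. "Plan" is just a name here; in Spark it would
// be a LogicalPlan.
object OuterPlanContextSketch {
  type Plan = String

  // Entering an ordinary correlated subquery pushes the new direct outer
  // plan on top of the existing stack, so nested scopes stay visible.
  def enterSubquery(outerPlans: List[Plan], directOuter: Plan): List[Plan] =
    directOuter :: outerPlans

  // Entering a LateralSubquery clears everything except the direct outer
  // plan, so nothing inside it can see scopes above the lateral subquery.
  def enterLateralSubquery(directOuter: Plan): List[Plan] =
    List(directOuter)
}
```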
PostgreSQL and DuckDB support special outer-scope attributes as long as the outer-scope attributes and the `LateralSubquery` have clear quantifiers and aliases.
We can do this later, but for now we'll just disallow nested correlations in `LateralSubquery`.
```scala
 * Returns the outer scope attributes referenced in the subquery expressions
 * in the current plan and the children of the current plan.
 */
private def getOuterAttrsNeedToBePropagated(plan: LogicalPlan): Seq[Expression] = {
```
It feels like this method solves the same problem as `SubExprUtils.getOuterReferences`. Can we instead update `SubExprUtils.getOuterReferences` to do that?
Better not, as many other places in the Analyzer use `SubExprUtils.getOuterReferences`, and I checked that they don't need `getOuterAttrsNeedToBePropagated` there.
We could return a pair from `getOuterReferences` (direct and indirect outer attrs).
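A hypothetical sketch of that suggestion, using simplified stand-in types rather than Spark's real `getOuterReferences` signature: the helper returns a pair, partitioning outer references into those resolvable by the direct outer plan and those that must be propagated further up as outer-scope attributes:

```scala
// Illustrative stand-in types; Spark would use AttributeReference and
// the direct outer plan's output attribute set instead.
object OuterRefsSketch {
  case class Attr(name: String)

  // An outer reference is "direct" if the immediate outer plan can resolve
  // it; otherwise it is an indirect (outer-scope) reference that must be
  // propagated to an enclosing scope.
  def splitOuterReferences(
      refs: Seq[Attr],
      directOuterAttrs: Set[Attr]): (Seq[Attr], Seq[Attr]) =
    refs.partition(directOuterAttrs.contains)
}
```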
Sounds good.
```scala
@@ -228,6 +228,67 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString
    }
  }

  def checkNoNestedOuterReferencesInMainQuery(plan: LogicalPlan): Unit = {
```
I wonder if we can move this all to `ValidateSubqueryExpression`?
This check is only for subqueries within the main query.
Some outer references can be resolved against the whole plan but are not found in the `inputSet` of any operator containing subqueries. In that case, the Analyzer treats these outer references as `outerScopeAttrs` for each subquery, even for a subquery within the main query. But a subquery within the main query cannot have `outerScopeAttrs`, because there is no outer scope above the main query.
`ValidateSubqueryExpression` checks each subquery individually, so they are different.
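A minimal sketch of the idea (assumed, simplified types, not the real `CheckAnalysis` code): any subquery hanging directly off the main query must carry no outer-scope attributes, since nothing above the main query could supply them:

```scala
// Illustrative model: a subquery expression reduced to just the
// outer-scope attributes it claims to need from enclosing scopes.
object MainQueryCheckSketch {
  case class SubqueryExpr(outerScopeAttrs: Seq[String])

  // For subqueries attached directly to the main query, any outer-scope
  // attribute is an error: there is no scope above the main query.
  def checkNoNestedOuterReferencesInMainQuery(subqueries: Seq[SubqueryExpr]): Unit =
    subqueries.foreach { s =>
      require(
        s.outerScopeAttrs.isEmpty,
        s"Subquery in the main query must not reference outer scopes: ${s.outerScopeAttrs}")
    }
}
```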
Aha, I see.
Please think in the background about how to efficiently implement this check in the single-pass Analyzer.
- fix wrong number of arguments error; fix assertions
- fix wrong number of arguments error
- fix for mis-deleting ScalarSubquery.withNewOuterAttrs
- fmt
- rename unresolved outer attrs to nested outer attrs
- throw internalErrors and format
- compile and format
- resolve comments
- rename nestedOuterAttrs to outerScopeAttrs
- Update DynamicPruning.scala
- Update FunctionTableSubqueryArgumentExpression.scala
- add new lines for readability
```scala
    // We don't allow lateral subquery having nested correlation
    !e.isInstanceOf[LateralSubquery]
) {
  // The previous outerPlanContext contains resolved outer scope plans
```
I'm not sure about the claim that "the plan is guaranteed to be resolved". The fixed-point Analyzer loops the rules over the partially resolved plan, and the plan can be considered resolved only after analysis is done, after `CheckAnalysis` (well, kinda, if you don't take all kinds of bugs into account). This is different from the single-pass Analyzer, which guarantees that the tree that has already been traversed is properly resolved.
I see. Do you know any corner test cases for the fixed-point Analyzer that I can use to test this?
Not sure... depends on what you are trying to find.
### What changes were proposed in this pull request?
This PR adds a new error, `NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED`, which prompts users to turn on the `spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled` config for queries containing nested correlations. Currently the config is set to false by default, as the optimizer changes will come in later PRs.
The behavior of lateral subqueries is not changed: we don't allow nested correlations in lateral subqueries for now.
### Why are the changes needed?
Spark currently supports only one layer of correlation and does not support nested correlation. For example, a subquery that references columns of its direct outer query is supported, while a subquery that references columns from more than one level up is not. The reason Spark does not support this is that the Analyzer and Optimizer resolve and plan subqueries recursively.
This PR adds Analyzer support for queries containing nested correlations.
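To illustrate, assuming hypothetical tables `t1`, `t2`, and `t3` (names chosen for illustration only), the first query below correlates only one level up and is already supported, while the second contains a nested correlation that this PR adds analyzer support for:

```sql
-- Supported today: the subquery references t1.a from its direct outer query.
SELECT * FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a);

-- Nested correlation: the innermost subquery references t1.a, which is
-- defined two levels up, past the intermediate subquery over t2.
SELECT * FROM t1
WHERE EXISTS (
  SELECT 1 FROM t2
  WHERE EXISTS (
    SELECT 1 FROM t3 WHERE t3.a = t1.a));
```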
### Does this PR introduce any user-facing change?
Yes. Queries with nested correlated references used to fail with `UNRESOLVED_COLUMN` or `FIELD_NOT_FOUND` errors. Now, if such a query is valid but `spark.sql.optimizer.supportNestedCorrelatedSubqueries.enabled = false`, Spark throws the `NESTED_REFERENCES_IN_SUBQUERY_NOT_SUPPORTED` error to prompt the user to turn on the flag.
### How was this patch tested?
Current UTs and suites.
Extracted tests about nested correlations from DuckDB's repo. As the optimizer changes are not merged yet, this PR only tests analyzer results for these queries.
The subquery-not-supported file contains queries not supported by Spark even with the nested-correlation resolution feature. They are mainly:
For 2 and 4, the optimizer actually already supports them; we might want to support them in the analyzer later, with more tests added. For 3, PostgreSQL and DuckDB have different behaviors when resolving them.
### Was this patch authored or co-authored using generative AI tooling?
No