-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: add raw aggregate udf planner #11371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,7 @@ use datafusion_common::{ | |
config::ConfigOptions, file_options::file_type::FileType, not_impl_err, DFSchema, | ||
Result, TableReference, | ||
}; | ||
use sqlparser::ast::NullTreatment; | ||
|
||
use crate::{AggregateUDF, Expr, GetFieldAccess, ScalarUDF, TableSource, WindowUDF}; | ||
|
||
|
@@ -161,6 +162,28 @@ pub trait ExprPlanner: Send + Sync { | |
) -> Result<PlannerResult<Vec<Expr>>> { | ||
Ok(PlannerResult::Original(args)) | ||
} | ||
|
||
/// Plans a `RawAggregateUDF` based on the given input expressions. | ||
/// | ||
/// Returns a `PlannerResult` containing either the planned aggregate function or the original | ||
/// input expressions if planning is not possible. | ||
fn plan_aggregate_udf( | ||
&self, | ||
aggregate_function: RawAggregateUDF, | ||
) -> Result<PlannerResult<RawAggregateUDF>> { | ||
Ok(PlannerResult::Original(aggregate_function)) | ||
} | ||
} | ||
|
||
// An `AggregateUDF` to be planned. | ||
#[derive(Debug, Clone)] | ||
pub struct RawAggregateUDF { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One thought I had during this is it might be nice to also update the naming pattern for these intermediate structs... maybe from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that Plannable sounds better (perhaps we can propose the change as a follow on PR)? What do you think @samuelcolvin and @jayzhan211 ? |
||
pub udf: Arc<crate::AggregateUDF>, | ||
pub args: Vec<Expr>, | ||
pub distinct: bool, | ||
pub filter: Option<Box<Expr>>, | ||
pub order_by: Option<Vec<Expr>>, | ||
pub null_treatment: Option<NullTreatment>, | ||
} | ||
|
||
/// An operator with two arguments to plan | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use datafusion_expr::planner::{ExprPlanner, PlannerResult, RawAggregateUDF}; | ||
|
||
pub struct AggregateUDFPlanner; | ||
|
||
impl ExprPlanner for AggregateUDFPlanner { | ||
fn plan_aggregate_udf( | ||
&self, | ||
aggregate_function: RawAggregateUDF, | ||
) -> datafusion_common::Result<PlannerResult<RawAggregateUDF>> { | ||
Ok(PlannerResult::Original(aggregate_function)) | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ use datafusion_common::{ | |
internal_datafusion_err, not_impl_err, plan_datafusion_err, plan_err, DFSchema, | ||
Dependency, Result, | ||
}; | ||
use datafusion_expr::planner::{PlannerResult, RawAggregateUDF}; | ||
use datafusion_expr::window_frame::{check_window_frame, regularize_window_order_by}; | ||
use datafusion_expr::{ | ||
expr, AggregateFunction, Expr, ExprSchemable, WindowFrame, WindowFunctionDefinition, | ||
|
@@ -335,7 +336,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |
} | ||
} else { | ||
// User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function | ||
if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { | ||
if let Some(udf) = self.context_provider.get_aggregate_meta(&name) { | ||
let order_by = self.order_by_to_sort_expr( | ||
&order_by, | ||
schema, | ||
|
@@ -349,13 +350,31 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |
.map(|e| self.sql_expr_to_logical_expr(*e, schema, planner_context)) | ||
.transpose()? | ||
.map(Box::new); | ||
return Ok(Expr::AggregateFunction(expr::AggregateFunction::new_udf( | ||
fm, | ||
|
||
let raw_aggregate_function = RawAggregateUDF { | ||
udf, | ||
args, | ||
distinct, | ||
filter, | ||
order_by, | ||
null_treatment, | ||
}; | ||
|
||
for planner in self.planners.iter() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this means that any planner extensions would have precidence over the built in aggregate functions from the Registry. I think that is ok There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can avoid clone like this for planner in self.planners.iter() {
match planner.plan_aggregate_udf(raw_aggregate_function)? {
PlannerResult::Planned(e) => return Ok(e),
PlannerResult::Original(e) => {
raw_aggregate_function = e;
}
}
}
return Ok(Expr::AggregateFunction(expr::AggregateFunction::new_udf(
raw_aggregate_function.udf,
raw_aggregate_function.args,
raw_aggregate_function.distinct,
raw_aggregate_function.filter,
raw_aggregate_function.order_by,
raw_aggregate_function.null_treatment,
))); |
||
if let PlannerResult::Planned(aggregate_function) = | ||
planner.plan_aggregate_udf(raw_aggregate_function.clone())? | ||
{ | ||
return Ok(aggregate_function); | ||
} | ||
} | ||
|
||
return Ok(Expr::AggregateFunction(expr::AggregateFunction::new_udf( | ||
raw_aggregate_function.udf, | ||
raw_aggregate_function.args, | ||
raw_aggregate_function.distinct, | ||
raw_aggregate_function.filter, | ||
raw_aggregate_function.order_by, | ||
raw_aggregate_function.null_treatment, | ||
))); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we agree to convert
count()
tocount(1)
in planner. The conversion would be likeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the path is after that to getting from
count(1)
tocount()
as we're talking about in #11229 (that also works in.select
). I think until there's a clear path to that, I'm hesitant to expand the scope of this PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the conversion from
count(1)
tocount()
is more like an optinization (rather than something for hte sql planner) as it would apply equally to SQL and to dataframe apisPerhaps we could add it as part of https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.simplify 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm think of dataframe API is that we only provide one
count_star()
function, so we don't need to deal with function rewrite for dataframe APIs.For sql,
count()
,count(1)
andcount(*)
are equivalent things that can all to one single Expr, so I think it is possible to have the conversion in planner.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree they are equivalent things. But I am thinking it may be hard to teach everyone who builds
Exprs
to callcount_star
rather thancount(lit(1))
orcount(Expr::Wildcard)
To be clear I don't have a massively strong opinion, but it seems like since we don't control people's creation of
Expr
we aren't going to be able to prevent stuff likecount(1)
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rewrite could be simple enough that we don't need to push down to optimizer.
I think we need a clear role between expression rewrite in planner vs rewrite in optimizer.
ExprPlanner is the first spot we get the expressions from SQLExpr and build up datafusion::Expr. For syntax rewrite like dict
{a: b}
, compound ida.b
or equivalent args rewritecount(*)
are all good candidate to determine the expected datafusion::Expr in ExprPlanner. Others should push down to optimizer since they may need const evaluation or multiple passes.If the reason to rewrite expression in optimizer is because that benefit both sql and dataframe API, maybe redesign dataframe API is what we need 🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in my mind the distinction is
So in my mind the transformation of
count(1)
-->count()
is an optimization as it potentially avoids having to accumulate a known column in the execution. Maybe there is a better way to think about itThis might be a good distinction to clarify in the documentation 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this. Rewrite in logical optimizer results to the better one.
In the case of count,
count(*)
,count(1)
, andcount()
are equivalent things, not one is faster than the other.The optimization that counts the row number is not in the logical optimizer but in the physical optimizer. Therefore, the transformation from
count()
andcount(*)
tocount(1)
is not close to the second point IMO. What we are doing before the physical optimizer is to transform the equivalent expr to a single one (count(1)
), so we just need to process that one in the physical optimizer. Note that we could process those 3 expressions individually in the physical optimizer, but standardizing to a single expression as early as possible might reduce the complexity of the downstream query processing. It is more like rewriting expressions to the equivalent expression to me.It is nice to do so, we should clarify the role between these two.