Deprecate ScalarUDFImpl::return_type #13717

Closed
1 change: 1 addition & 0 deletions datafusion/core/src/catalog_common/information_schema.rs
@@ -406,6 +406,7 @@ fn get_udf_args_and_return_types(
.into_iter()
.map(|arg_types| {
// only handle the function which implemented [`ScalarUDFImpl::return_type`] method
#[allow(deprecated)]
**Member:**

cc @goldmedal for ideas on how (and whether) we will be able to resolve this deprecation.

**Contributor:**

Thanks for the reminder. I filed #13735 to track it.

**Contributor (@alamb), Dec 13, 2024:**

While reading #13735 and thinking about the churn we had recently with `invoke_with_batch`, `invoke_with_args`, etc.:

What if we made a new API that could accommodate both use cases? Something like:

```rust
struct ReturnTypeArgs {
    /// Arguments, if available. These may not be available in some
    /// contexts, such as information_schema queries.
    args: Option<&[Expr]>,
    /// Schema, if available.
    schema: Option<&dyn ExprSchema>,
    arg_types: &[DataType],
}

impl ScalarUdfImpl {
    /// Returns the result type, given `args`, if possible;
    /// if not possible, returns `Ok(None)`.
    fn return_type_with_args(&self, args: &ReturnTypeArgs) -> Result<Option<DataType>> {
        // ...
    }
}
```

🤔

This would also let us add other fields to the `ReturnTypeArgs` structure over time with less API churn.
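The `Ok(None)` contract in that sketch can be made concrete with a small, self-contained example. Everything below (`DataType`, `ReturnTypeArgs`, `ScalarUdfSketch`, `ConcatLike`) is a stand-in for illustration, not DataFusion's real definitions; the real proposal carries `Option<&[Expr]>` and `Option<&dyn ExprSchema>` fields as well:

```rust
// Stand-in types sketching the proposed `return_type_with_args` contract:
// return `Ok(None)` when the context lacks the information to decide.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

/// Simplified version of the proposed struct; the optional `args`/`schema`
/// fields are omitted to keep the sketch small.
struct ReturnTypeArgs<'a> {
    arg_types: &'a [DataType],
}

trait ScalarUdfSketch {
    /// Returns the result type if it can be determined in this context,
    /// otherwise `Ok(None)`.
    fn return_type_with_args(
        &self,
        args: &ReturnTypeArgs,
    ) -> Result<Option<DataType>, String>;
}

struct ConcatLike;

impl ScalarUdfSketch for ConcatLike {
    fn return_type_with_args(
        &self,
        args: &ReturnTypeArgs,
    ) -> Result<Option<DataType>, String> {
        if args.arg_types.is_empty() {
            // Not enough information in this context (e.g. an
            // information_schema query with no resolved argument types).
            return Ok(None);
        }
        // A concat-style function always returns Utf8.
        Ok(Some(DataType::Utf8))
    }
}

fn main() {
    let udf = ConcatLike;
    let known = ReturnTypeArgs {
        arg_types: &[DataType::Int32, DataType::Utf8],
    };
    assert_eq!(udf.return_type_with_args(&known).unwrap(), Some(DataType::Utf8));

    let unknown = ReturnTypeArgs { arg_types: &[] };
    assert_eq!(udf.return_type_with_args(&unknown).unwrap(), None);
}
```

The key design point is that callers like information_schema, which cannot supply expressions or a schema, get a graceful `None` instead of an error.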

**Contributor:**

Can we change the signature to `return_type(ReturnTypeArgs)` and remove `return_type_with_args` altogether?

I was thinking about this too, but I am not sure whether such a large breaking change is acceptable. I do think it would be the better change in the long term.

**Contributor:**

> Can we change the signature to `return_type(ReturnTypeArgs)` and remove `return_type_with_args` altogether?
>
> I was thinking about this too, but I am not sure whether such a large breaking change is acceptable. I do think it would be the better change in the long term.

I agree `return_type(ReturnTypeArgs)` would be the best in the long term.

However, in the short term (next 6 months) I think it would be nicer to downstream crates / users to add a new function `return_type_with_args` and leave `#[deprecated]` notices on `return_type` and `return_type_from_exprs` directing people to `return_type_with_args`.

Then once the deprecation period has passed we could rename / introduce a new function called `return_type` 🤔
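The staged deprecation described above can be sketched in isolation: the old method stays, marked `#[deprecated]`, and forwards to the new one so downstream code keeps compiling during the window. The trait, types, and `AddOne` UDF below are stand-ins, not DataFusion's real definitions:

```rust
// Sketch of the staged-deprecation pattern: old entry point forwards to the
// new one, and legacy callers silence the lint with `#[allow(deprecated)]`,
// just like the annotations added throughout this diff.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
}

trait ScalarUdfSketch {
    /// The new, preferred API.
    fn return_type_with_args(&self, arg_types: &[DataType]) -> Result<DataType, String>;

    /// The old API, kept as a deprecated default that forwards to the new
    /// one so existing implementations and callers keep working.
    #[deprecated(note = "Use `return_type_with_args` instead")]
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String> {
        self.return_type_with_args(arg_types)
    }
}

struct AddOne;

impl ScalarUdfSketch for AddOne {
    fn return_type_with_args(&self, _arg_types: &[DataType]) -> Result<DataType, String> {
        Ok(DataType::Int64)
    }
}

fn main() {
    // A not-yet-migrated caller: compiles, but only because the deprecation
    // warning is explicitly allowed here.
    #[allow(deprecated)]
    let old = AddOne.return_type(&[DataType::Int64]).unwrap();

    // A migrated caller.
    let new = AddOne.return_type_with_args(&[DataType::Int64]).unwrap();

    assert_eq!(old, new);
}
```

Once the deprecation period ends, the old method body can be deleted and the new behavior renamed back to `return_type`, as the comment suggests.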

**Contributor:**

I think one of the challenges of #13825 and #12604 is that the properties of an `Expr`, including its data type, are determined by the schema. Unless the schema stays the same throughout processing, we need to recompute the properties for each given schema, so we can't compute them once and store the information within `Expr`. Introducing a map from schema to properties does not seem practical.

**Contributor:**

For example, if we have a column `c`, we need the schema to know that its data type is `Int32`. With another schema, it may become `Int64`. How do we ensure the schema has not changed, so that we can happily reuse the result we computed before?

**Member:**

> For example, if we have a column `c`, we need the schema to know that its data type is `Int32`. With another schema, it may become `Int64`.

That's a good point.

Another way to view this is: what does an `Expr` represent? Is it a syntactic expression (doesn't know types, can be applied to multiple different inputs), or a semantic expression (anchored in the evaluation context, knows types)? The SQL handling process goes from syntax to semantics, and expressions (`Expr`) built in the dataframe layer are definitely syntactic, not semantic.

This may be a challenge for #13825, but less so for #12604. If we have a new IR with separate `Expr` types, it won't be used in contexts where we need syntactic expressions.

**Contributor (@jayzhan211), Jan 11, 2025:**

I reviewed this part again, and my conclusion is that `Expr` should stay as it is now, with the schema as a separate concept; given both, we compute the corresponding metadata such as data type and nullability. Therefore, I still think we need `return_type_with_args` to solve this issue.

The issue mentioned in #12604 should be solved another way. I think we can store such information in the `LogicalPlan` instead. An `Expr` has no type info until the schema is determined, which happens when we create the corresponding `LogicalPlan` from the `Expr` and `Schema`.

A `LogicalPlan` can be considered the container of `Expr` + `Schema`: whenever the schema is updated or an `Expr` is rewritten, we recompute the properties of the `Expr`. If nothing changed, we can reuse the previously computed properties.

**Contributor:**

I found that we don't even need `args: Vec<Expr>`; what we need is a `Vec<String>`:

```rust
pub struct ReturnTypeArgs<'a> {
    pub arg_types: &'a [DataType],
    pub arguments: &'a [String],
}
```
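To illustrate why string-rendered arguments can be enough, here is a self-contained sketch of a cast-like function whose return type is named by a literal argument, loosely in the spirit of `arrow_cast`. The `DataType` enum and `cast_like_return_type` helper are hypothetical stand-ins, not the real DataFusion implementation:

```rust
// Sketch: a UDF whose return type depends on an argument *value*, which a
// string rendering of the argument captures without needing a full `Expr`.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

struct ReturnTypeArgs<'a> {
    arg_types: &'a [DataType],
    arguments: &'a [String],
}

/// Resolves the return type from the second argument, interpreted as a
/// type name (as a cast-style function would).
fn cast_like_return_type(args: &ReturnTypeArgs) -> Result<DataType, String> {
    match args.arguments.get(1).map(String::as_str) {
        Some("Int32") => Ok(DataType::Int32),
        Some("Utf8") => Ok(DataType::Utf8),
        other => Err(format!("unsupported target type: {other:?}")),
    }
}

fn main() {
    let args = ReturnTypeArgs {
        arg_types: &[DataType::Utf8, DataType::Utf8],
        arguments: &["c".to_string(), "Int32".to_string()],
    };
    assert_eq!(cast_like_return_type(&args).unwrap(), DataType::Int32);
}
```

A plain string per argument carries the literal values such functions inspect, while avoiding the schema-dependence problems of passing `Expr` discussed above.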

let return_type = udf.return_type(&arg_types).ok().map(|t| t.to_string());
let arg_types = arg_types
.into_iter()
5 changes: 5 additions & 0 deletions datafusion/expr/src/udf.rs
@@ -173,7 +173,9 @@ impl ScalarUDF {
/// its [`ScalarUDFImpl::return_type`] should raise an error.
///
/// See [`ScalarUDFImpl::return_type`] for more details.
#[deprecated(since = "44.0.0", note = "Use return_type_from_exprs() instead")]
pub fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
#[allow(deprecated)]
**Member:**

Shouldn't we also deprecate this method (`ScalarUDF::return_type`) if we deprecate the underlying `ScalarUDFImpl` method?

**Contributor:**

Yes.

self.inner.return_type(arg_types)
}

@@ -450,6 +452,7 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
/// is recommended to return [`DataFusionError::Internal`].
///
/// [`DataFusionError::Internal`]: datafusion_common::DataFusionError::Internal
#[deprecated(since = "44.0.0", note = "Use `return_type_from_exprs` instead")]
**Contributor:**

If we are going to deprecate this API, I think we should add an example to `return_type_from_exprs` to help the migration effort.

I can help if we proceed.
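As a starting point for such a migration example, here is a self-contained sketch. `Expr`, `ExprSchema`, `SingleColumnSchema`, and `IdentityUdf` are simplified stand-ins; the real `return_type_from_exprs` takes `&[Expr]`, `&dyn ExprSchema`, and `&[DataType]` and returns DataFusion's `Result<DataType>`:

```rust
// Sketch of an expression/schema-aware return type: resolve a column
// argument's type through the schema, falling back to `arg_types`.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

#[derive(Debug, Clone)]
enum Expr {
    Column(String),
}

trait ExprSchema {
    fn data_type(&self, column: &str) -> Result<DataType, String>;
}

struct SingleColumnSchema {
    name: String,
    data_type: DataType,
}

impl ExprSchema for SingleColumnSchema {
    fn data_type(&self, column: &str) -> Result<DataType, String> {
        if column == self.name {
            Ok(self.data_type.clone())
        } else {
            Err(format!("unknown column {column}"))
        }
    }
}

struct IdentityUdf;

impl IdentityUdf {
    /// Returns the type of the first argument: through the schema when it
    /// is a column, otherwise from the pre-resolved `arg_types`.
    fn return_type_from_exprs(
        &self,
        args: &[Expr],
        schema: &dyn ExprSchema,
        arg_types: &[DataType],
    ) -> Result<DataType, String> {
        match args.first() {
            Some(Expr::Column(name)) => schema.data_type(name),
            None => arg_types
                .first()
                .cloned()
                .ok_or_else(|| "expected at least one argument".to_string()),
        }
    }
}

fn main() {
    let schema = SingleColumnSchema {
        name: "c".to_string(),
        data_type: DataType::Int32,
    };
    let rt = IdentityUdf
        .return_type_from_exprs(&[Expr::Column("c".to_string())], &schema, &[])
        .unwrap();
    assert_eq!(rt, DataType::Int32);
}
```

This shows the motivation for the schema parameter: as discussed above, the same column `c` could resolve to a different type under a different schema.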

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;

/// What [`DataType`] will be returned by this function, given the
@@ -483,6 +486,7 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
_schema: &dyn ExprSchema,
arg_types: &[DataType],
) -> Result<DataType> {
#[allow(deprecated)]
self.return_type(arg_types)
}

@@ -756,6 +760,7 @@ impl ScalarUDFImpl for AliasedScalarUDFImpl {
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
#[allow(deprecated)]
self.inner.return_type(arg_types)
}

1 change: 1 addition & 0 deletions datafusion/functions/src/string/concat.rs
@@ -305,6 +305,7 @@ pub fn simplify_concat(args: Vec<Expr>) -> Result<ExprSimplifyResult> {
_ => None,
})
.collect();
#[allow(deprecated)]
ConcatFunc::new().return_type(&data_types)
}?;
