
Introduce Async User Defined Functions #14837


Open
wants to merge 20 commits into main
Conversation

goldmedal
Contributor

Which issue does this PR close?

Rationale for this change

I have been working with @alamb to implement the functionality for async UDFs.

It introduces the following trait:

#[async_trait]
pub trait AsyncScalarUDFImpl: Debug + Send + Sync {
    /// the function cast as any
    fn as_any(&self) -> &dyn Any;

    /// The name of the function
    fn name(&self) -> &str;

    /// The signature of the function
    fn signature(&self) -> &Signature;

    /// The return type of the function
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType>;

    /// The ideal batch size for this function.
    ///
    /// This is used to determine how much data is evaluated at once.
    /// If None, the whole batch will be evaluated at once.
    fn ideal_batch_size(&self) -> Option<usize> {
        None
    }

    /// Invoke the function asynchronously with the async arguments
    async fn invoke_async_with_args(
        &self,
        args: AsyncScalarFunctionArgs,
        option: &ConfigOptions,
    ) -> Result<ArrayRef>;
}
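The `ideal_batch_size` hint above determines how an input batch is sliced before the async function is invoked. As a rough, self-contained illustration (a simplified stand-in for the real `AsyncFuncExec` slicing logic, not DataFusion's actual code), the chunk sizes could be computed like this:

```rust
// Computes the row counts of the chunks an input batch would be split into,
// given the function's ideal_batch_size() hint. None means "evaluate the
// whole batch at once"; Some(n) slices it into chunks of at most n rows.
// Illustrative sketch only, not DataFusion's actual implementation.
fn chunk_lens(num_rows: usize, ideal_batch_size: Option<usize>) -> Vec<usize> {
    match ideal_batch_size {
        None => vec![num_rows],
        Some(n) if n > 0 => {
            let mut lens = Vec::new();
            let mut remaining = num_rows;
            while remaining > 0 {
                let take = remaining.min(n);
                lens.push(take);
                remaining -= take;
            }
            lens
        }
        Some(_) => vec![], // a zero hint produces no work
    }
}

fn main() {
    // Whole batch at once when no hint is given.
    assert_eq!(chunk_lens(8192, None), vec![8192]);
    // 10 rows with an ideal size of 4 -> chunks of 4, 4, and 2 rows.
    assert_eq!(chunk_lens(10, Some(4)), vec![4, 4, 2]);
}
```

A smaller hint trades more async invocations for smaller per-call payloads, which matters when each call is a remote round trip.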

It allows users to implement a UDF that invokes an external remote function during the query.
Given an async UDF async_equal, the plan would look like:

> explain select async_equal(a.id, 1) from animal a
+---------------+----------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                   |
+---------------+----------------------------------------------------------------------------------------+
| logical_plan  | Projection: async_equal(a.id, Int64(1))                                                |
|               |   SubqueryAlias: a                                                                     |
|               |     TableScan: animal projection=[id]                                                  |
| physical_plan | ProjectionExec: expr=[__async_fn_0@1 as async_equal(a.id,Int64(1))]                    |
|               |   AsyncFuncExec: async_expr=[async_expr(name=__async_fn_0, expr=async_equal(id@0, 1))] |
|               |     CoalesceBatchesExec: target_batch_size=8192                                        |
|               |       DataSourceExec: partitions=1, partition_sizes=[1]                                |
|               |                                                                                        |
+---------------+----------------------------------------------------------------------------------------+

To reduce the number of async function invocations, the CoalesceAsyncExecInput rule coalesces the input batches of AsyncFuncExec.

See the example for detailed usage.

What changes are included in this PR?

Remaining Work

  • Support for ProjectExec
  • Support for FilterExec
  • Support for Join Expression

Maybe implement in the follow-up PR

  • Async aggregation function
  • Async window function
  • Async table function (?)

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Feb 23, 2025
@alamb
Contributor

alamb commented Feb 24, 2025

😮 -- thanks @goldmedal -- I'll put this on my list of things to review

@goldmedal goldmedal marked this pull request as ready for review March 12, 2025 02:54
@goldmedal
Contributor Author

@alamb Sorry for the delay. This PR is ready for review now.
I want to focus on Projection and Filter, which currently invoke the async UDF. After ensuring the approach makes sense, I'll create the follow-up PR for other plans.

@alamb
Contributor

alamb commented Mar 12, 2025

Thanks I'll put it on my list

@berkaysynnada
Contributor

What's the status of this PR?

@goldmedal
Contributor Author

What's the status of this PR?

It's ready for review. I'm still waiting for someone to help review it.

@berkaysynnada
Contributor

What's the status of this PR?

It's ready for review. I'm still waiting for someone to help review it.

Thanks @goldmedal. We'll need this as well, so let's revive it. I'm putting this into my review list.

Contributor

@berkaysynnada berkaysynnada left a comment


Hi again @goldmedal. I finally found some time to look into this. First of all, thank you for your work. This PR is in very good shape overall, and the idea is easy to follow.

However, when I first imagined the design of this feature, I was thinking of approaching the problem from a different angle, which I believe could simplify things quite a bit:

What if we just added a new method to the PhysicalExpr trait, like evaluate_async()? We could then call this from streams that might involve async work. The default implementation would delegate to evaluate(), but in the case of ScalarFunctionExpr, we could branch depending on the function type.

This way, we wouldn't need to introduce a new physical rule or operator, which adds overhead to both planning and execution. As I mentioned below, the special handling in the planner doesn't scale well IMO.

I'd love to hear your thoughts on my suggestion

@@ -775,12 +776,44 @@ impl DefaultPhysicalPlanner {

let runtime_expr =
self.create_physical_expr(predicate, input_dfschema, session_state)?;

let filter = match self.try_plan_async_exprs(
Contributor


Do we need to apply this pattern for every operator which has PhysicalExprs inside it that need to be evaluated at runtime? I think we can figure out another way so that people don't have to modify the planner code for every such operator

Contributor


I think at a really high level this pattern is basically the same as "Common Subexpression Elimination" and many of the other optimizer passes -- that is, pulling some subset of the expressions into a new node and rewriting the others.

If we want to avoid having to follow the same model I think we could follow the model of some of the other recent optimizer passes and add a method to ExecutionPlan -- something like this perhaps

trait ExecutionPlan {
    /// Factor all async expressions in this ExecutionPlan out of any internal expressions,
    /// returning a list of such async expressions and the rewritten plan
    ///
    /// The async expression values will be provided to the rewritten plan after all the existing
    /// input columns
    fn rewrite_async(&self) -> Transformed<(Vec<AsyncExpr>, Arc<dyn ExecutionPlan>)> {
        // default to not supporting async functions
        Transformed::no()
    }
}

Contributor


something like this perhaps

rewritten plan is (async_exec + original plan)?

I think at a really high level this pattern is basically the same as "Common Subexpression Elimination" and many of the other optimizer passes -- that is, pulling some subset of the expressions into a new node and rewriting the others.

I see the pattern now, but IMO for this async evaluation, adding a new operator for each async fn in the query seems a bit unnatural to me. I feel like we should encapsulate this feature in PhysicalExpr's level.

@adriangb
Contributor

What if we just added a new method to the PhysicalExpr trait, like evaluate_async()? We could then call this from streams that might involve async work. The default implementation would delegate to evaluate(), but in the case of ScalarFunctionExpr, we could branch depending on the function type.

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?
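To make the coloring concern above concrete, here is a minimal, self-contained sketch (simplified stand-in types, not DataFusion's real PhysicalExpr API, with the actual async machinery elided so it runs synchronously) of how a default `evaluate_async` that delegates to `evaluate()` loses the async path once a sync parent sits above the async leaf:

```rust
// Simplified stand-in for the PhysicalExpr tree (the real API differs).
trait Expr {
    fn evaluate(&self) -> Result<i64, String>;
    // Proposed hook: the default delegates to the sync path.
    fn evaluate_async(&self) -> Result<i64, String> {
        self.evaluate()
    }
}

struct Literal(i64);
impl Expr for Literal {
    fn evaluate(&self) -> Result<i64, String> {
        Ok(self.0)
    }
}

// Stand-in for a ScalarFunctionExpr wrapping an async UDF.
struct AsyncUdf;
impl Expr for AsyncUdf {
    fn evaluate(&self) -> Result<i64, String> {
        Err("async UDF invoked from the sync path".to_string())
    }
    fn evaluate_async(&self) -> Result<i64, String> {
        Ok(42) // imagine an awaited remote result here
    }
}

// Stand-in for BinaryExpr: its sync evaluate() calls children's evaluate().
struct Add(Box<dyn Expr>, Box<dyn Expr>);
impl Expr for Add {
    fn evaluate(&self) -> Result<i64, String> {
        Ok(self.0.evaluate()? + self.1.evaluate()?)
    }
}

fn main() {
    // Calling evaluate_async on the parent falls back to the default
    // delegation, which descends via sync evaluate() and never reaches
    // the child's async override -- the "back into async world" problem.
    let tree = Add(Box::new(Literal(1)), Box::new(AsyncUdf));
    assert!(tree.evaluate_async().is_err());
    // Only a direct call on the async leaf takes the async path.
    assert_eq!(AsyncUdf.evaluate_async(), Ok(42));
}
```

Avoiding this would require every composite expression to override `evaluate_async` and recurse through it, which is essentially the churn discussed below.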

Contributor

@alamb alamb left a comment


Thank you @goldmedal -- I am sorry I missed this PR for so long. I think it is a great extension for DataFusion and will make using DataFusion with various new LLMs / services easier

I am approving this PR as I think it follows the existing patterns for optimizers and adds some key functionality

However, note I am quite biased as I had something to do with this pattern here: goldmedal/datafusion-llm-function#1. Thus I believe that we should address @berkaysynnada's and @adriangb's concerns prior to merging

I think we should file some follow-on tickets to

  1. Add support for the remaining nodes
  2. Add some more documentation / examples

@@ -775,12 +776,44 @@ impl DefaultPhysicalPlanner {

let runtime_expr =
self.create_physical_expr(predicate, input_dfschema, session_state)?;

let filter = match self.try_plan_async_exprs(
Contributor


I think at a really high level this pattern is basically the same as "Common Subexpression Elimination" and many of the other optimizer passes -- that is, pulling some subset of the expressions into a new node and rewriting the others.

If we want to avoid having to follow the same model I think we could follow the model of some of the other recent optimizer passes and add a method to ExecutionPlan -- something like this perhaps

trait ExecutionPlan {
    /// Factor all async expressions in this ExecutionPlan out of any internal expressions,
    /// returning a list of such async expressions and the rewritten plan
    ///
    /// The async expression values will be provided to the rewritten plan after all the existing
    /// input columns
    fn rewrite_async(&self) -> Transformed<(Vec<AsyncExpr>, Arc<dyn ExecutionPlan>)> {
        // default to not supporting async functions
        Transformed::no()
    }
}

use std::any::Any;
use std::sync::Arc;

#[tokio::main]
Contributor


It would be nice to add some high level context to this example -- like an introduction saying that most functions are sync, but for some functions can be run as async ...

I can help with this potentially.

It would also be awesome to put this example / code in the docs https://datafusion.apache.org/library-user-guide/adding-udfs.html so it was easier to find

Contributor Author


Thanks @alamb for the suggestion.
I've been a bit busy with personal matters these past few days, but I should be able to complete the enhancements to the document and examples this weekend.

@alamb alamb mentioned this pull request May 11, 2025
24 tasks
@berkaysynnada
Contributor

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate its effects and challenges. Does anything come to your mind?

@adriangb
Contributor

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate its effects and challenges. Does anything come to your mind?

I mean, that makes sense, but it sounds like a lot of churn? I'm not sure, tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

@berkaysynnada
Contributor

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate its effects and challenges. Does anything come to your mind?

I mean, that makes sense, but it sounds like a lot of churn? I'm not sure, tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

I'll try a POC when I find some time, and I wonder about @alamb's opinion

@alamb
Contributor

alamb commented May 11, 2025

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate its effects and challenges. Does anything come to your mind?

I mean, that makes sense, but it sounds like a lot of churn? I'm not sure, tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

I'll try a POC when I find some time, and I wonder about @alamb's opinion

My feeling (without any solid data) is that using async functions is not ideal because:

  1. The async overhead (e.g. what it takes to await vs make a normal function call) could be noticeable, but maybe not that big a deal
  2. The fact that everything that calls a UDF would have to be async (as only async functions can call other async functions) -- the so-called "what color are your functions" problem -- would be quite disruptive.

Another benefit of the approach in this PR is that it requires no changes to any existing functions or APIs (in fact the original POC can be implemented entirely as a DataFusion user defined optimizer extension)

@alamb
Contributor

alamb commented May 15, 2025

My use of async in udf's currently is to query either an external system or datafusion itself.

That is interesting, it almost sounds like you are using async udfs to implement some sort of custom subquery. Very interesting

@goldmedal
Contributor Author

@goldmedal We discussed the aggregation scope with @ozankabak, and there are still a few open questions -- like which parts of the aggregation process should actually be async: is it just the evaluation stage, or do we also need to make the update and merge stages async?

Introducing a new operator like AsyncAggregateExec might be a natural next step, as you mentioned. But to me, that direction feels more like a workaround than a scalable, long-term solution -- it duplicates a lot of logic and risks fragmenting execution paths, and you're also aware of it IIUC.

👍

I’m also curious how others envision using this feature. Is the goal mainly to support I/O-bound workloads, like the LLM use case? Or are there also plans to handle CPU-bound, compute-heavy tasks in a more async-friendly way? Depending on the use cases we want to support, it might be worth considering a more foundational approach. Of course, those come with significant design and implementation challenges, but it could open the door to a more unified and flexible execution model.

I agree with @alamb's point. I/O workload is my main goal. Besides the LLM case, I think invoking a data API is another use case. For compute-heavy tasks, I have no further design yet. However, I think it's good to have a more efficient design if we know the specific case.

@Omega359
Contributor

My use of async in udf's currently is to query either an external system or datafusion itself.

That is interesting, it almost sounds like you are using async udfs to implement some sort of custom subquery. Very interesting

Pretty much, yes.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 18, 2025
@alamb alamb mentioned this pull request May 19, 2025
18 tasks
@alamb
Contributor

alamb commented May 21, 2025

Are there any remaining outstanding issues to merging this PR?

If not, perhaps we can merge it and file an epic / ticket for filling out the remaining features.

A blog post (perhaps based on the example here) would be 100% amazing

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label May 22, 2025
@alamb
Contributor

alamb commented May 22, 2025

Unless I hear anything else I plan to merge this tomorrow and will file a follow-on Epic for other tasks (docs / blogs / support in other types of plans)

Successfully merging this pull request may close these issues.

Async User Defined Functions (UDF)