Introduce Async User Defined Functions #14837
base: main
Conversation
😮 -- thanks @goldmedal -- I'll put this on my list of things to review
Force-pushed from 6b29489 to 848e175
@alamb Sorry for the delay. This PR is ready for review now.
Thanks, I'll put it on my list.
Force-pushed from 4d8b40e to 78ecbe6
What's the status of this PR?
It's ready for review. I'm still waiting for someone to help review it.
Force-pushed from 78ecbe6 to a493c33
Thanks @goldmedal. We'll need this as well, so let's revive it. I'm putting this into my review list.
Hi again @goldmedal. I finally found some time to look into this. First of all, thank you for your work. This PR is in very good shape overall, and the idea is easy to follow.
However, when I first imagined the design of this feature, I was thinking of approaching the problem from a different angle, which I believe could simplify things quite a bit:
What if we just added a new method to the `PhysicalExpr` trait, like `evaluate_async()`? We could then call this from streams that might involve async work. The default implementation would delegate to `evaluate()`, but in the case of `ScalarFunctionExpr`, we could branch depending on the function type.
This way, we wouldn't need to introduce a new physical rule or operator, which adds overhead to both planning and execution. As I mentioned below, the special handling in the planner doesn't scale well IMO.
I'd love to hear your thoughts on my suggestion.
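For illustration, here is a rough sketch of what this suggestion could look like. It is hypothetical: the trait below is only a stand-in for `PhysicalExpr`, and `evaluate_async` does not exist in DataFusion today; the point is just that the default implementation falls back to the synchronous path.

```rust
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion_common::Result;
use datafusion_expr::ColumnarValue;

/// Stand-in for the existing PhysicalExpr trait, used only to sketch the proposal.
#[async_trait]
pub trait ExprWithAsync: Send + Sync {
    /// The synchronous evaluation path that exists today.
    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>;

    /// Proposed async entry point: streams that may perform async work would call this.
    /// The default simply delegates to the sync path, while an expression such as
    /// ScalarFunctionExpr could override it and branch on the function type.
    async fn evaluate_async(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        self.evaluate(batch)
    }
}
```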
@@ -775,12 +776,44 @@ impl DefaultPhysicalPlanner {
    let runtime_expr =
        self.create_physical_expr(predicate, input_dfschema, session_state)?;
    let filter = match self.try_plan_async_exprs(
Do we need to apply this pattern for every operator which has `PhysicalExpr`s inside it that need to be evaluated at runtime? I think we can figure out another way so that people don't have to modify the planner code for every such operator.
I think at a really high level this pattern is basically the same as the "Common Subexpression Elimination" and many of the other optimizer passes -- that is, pulling some subset of the expressions into a new node and rewriting the others.
If we want to avoid having to follow the same model, I think we could follow the model of some of the other recent optimizer passes and add a method to `ExecutionPlan` -- something like this perhaps:
trait ExecutionPlan {
    /// Factor all async expressions in this ExecutionPlan out of any internal expressions,
    /// returning a list of such async expressions and the rewritten plan
    ///
    /// The async expression values will be provided to the rewritten plan after all the existing
    /// input columns
    fn rewrite_async(&self) -> Transformed<(Vec<AsyncExpr>, Arc<dyn ExecutionPlan>)> {
        // default to not supporting async functions
        Transformed::no()
    }
}
> something like this perhaps

So the rewritten plan is (async_exec + the original plan)?
> I think at a really high level this pattern is basically the same as the "Common Subexpression Elimination" and many of the other optimizer passes -- that is, pulling some subset of the expressions into a new node and rewriting the others.

I see the pattern now, but IMO, for this async evaluation, adding a new operator for each async fn in the query seems a bit unnatural to me. I feel like we should encapsulate this feature at the `PhysicalExpr` level.
How would that work going from sync -> async? For example:
Thank you @goldmedal -- I am sorry I missed this PR for so long. I think it is a great extension for DataFusion and will make using DataFusion with various new LLMs / services easier
I am approving this PR as I think it follows the existing patterns for optimizers and adds some key functionality
However, note I am quite biased as I had something to do with this pattern here: goldmedal/datafusion-llm-function#1. Thus I believe that we should address @berkaysynnada's and @adriangb's concerns prior to merging.
I think we should file some follow-on tickets to:
- Add support for the remaining nodes
- Add some more documentation / examples
use std::any::Any;
use std::sync::Arc;

#[tokio::main]
It would be nice to add some high-level context to this example -- like an introduction saying that most functions are sync, but some functions can be run async ...
I can help with this potentially.
It would also be awesome to put this example / code in the docs https://datafusion.apache.org/library-user-guide/adding-udfs.html so it is easier to find.
Thanks @alamb for the suggestion.
I've been a bit busy with personal matters these past few days, but I should be able to complete the enhancements to the documentation and examples this weekend.
Easy answer is converting original
I mean, that makes sense, but it sounds like a lot of churn? I'm not sure tbh; sync / async coloring is always a pain and I don't know of any good solutions :(
I'll try a POC when I find some time, and I'd like to hear @alamb's opinion.
My feeling (without any solid data) is that using
Another benefit of the approach in this PR is that it requires no changes to any existing functions or APIs (in fact the original POC can be implemented entirely as a DataFusion user-defined optimizer extension).
That is interesting, it almost sounds like you are using async udfs to implement some sort of custom subquery. Very interesting
👍
I agree with @alamb's point. I/O-bound workloads are my main goal. Besides the LLM case, I think invoking a data API is also a valid use case. For compute-heavy tasks, I have no further design yet. However, I think it's good to have a more efficient design once we know the specific case.
Pretty much, yes.
Are there any outstanding issues blocking merging this PR? If not, perhaps we can merge it and file an epic / ticket for filling out the remaining features. A blog post (perhaps based on the example here) would be 100% amazing.
Unless I hear anything else, I plan to merge this tomorrow and will file a follow-on epic for the other tasks (docs / blogs / support in other types of plans).
Which issue does this PR close?
Rationale for this change
I have been working with @alamb to implement the functionality for async UDFs.
It introduces the following trait:
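(The trait definition itself is not reproduced in this thread. As a rough, illustrative sketch only, an async scalar UDF trait could look roughly like the following; the name `AsyncScalarUDFImpl`, the method `invoke_async_with_args`, and its exact signature are assumptions here, so refer to the PR diff for the real definition.)

```rust
use arrow::array::ArrayRef;
use async_trait::async_trait;
use datafusion_common::Result;
use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};

/// Illustrative sketch: an async variant of a scalar UDF whose invocation can
/// await external work (for example a remote API or LLM call).
#[async_trait]
pub trait AsyncScalarUDFImpl: ScalarUDFImpl {
    /// Invoke the function on a batch of arguments, awaiting any external calls.
    async fn invoke_async_with_args(&self, args: ScalarFunctionArgs) -> Result<ArrayRef>;
}
```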
It allows the user to implement a UDF that invokes some external remote function in the query.
Given an async UDF `async_equal`, the plan would look like:

To reduce the number of async function invocations, the `CoalesceAsyncExecInput` rule is used to coalesce the input batches of `AsyncFuncExec`. See the detailed usage in the example.
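For orientation, here is a minimal usage sketch. The UDF-specific pieces are assumptions: the `AsyncScalarUDF` wrapper, the `AsyncEqual` type, and the `animal` table are illustrative placeholders (see the example in this PR for the real API); the point is that, once registered, an async UDF is called from SQL like any ordinary scalar function.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical registration (names are assumptions; see the PR example for the
    // real API): wrap the async implementation so it registers like a normal UDF.
    //
    //   let udf = AsyncScalarUDF::new(Arc::new(AsyncEqual::default()));
    //   ctx.register_udf(udf.into_scalar_udf());

    // Illustrative table so the query below has something to scan.
    ctx.sql("CREATE TABLE animal AS VALUES ('Horse'), ('Dog')").await?;

    // With the async UDF registered, calling it from SQL looks like any other
    // scalar function; the planner factors it into an AsyncFuncExec node.
    let df = ctx
        .sql("SELECT * FROM animal WHERE async_equal(column1, 'Horse')")
        .await?;
    df.show().await?;
    Ok(())
}
```

As described in the review discussion above, the async expression values are provided to the rewritten plan as additional input columns, so downstream operators such as the filter consume them like any other column.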
What changes are included in this PR?
Remaining Work
These may be implemented in a follow-up PR.
Are these changes tested?
Are there any user-facing changes?