Description
There are a few reasons why I want to implement this at the logical plan level.
Go through all the expressions
When working on #4, I noticed that it's not very convenient to visit all the expressions of the physical plan (to check whether any of them is an AsyncFuncExpr). In the logical plan phase, we can use something like LogicalPlan::map_expression to go through all expressions of the plan.
By the way, I considered implementing a similar tree node method for the physical plan. However, ExecutionPlan isn't an enum (it's a trait), so such an API would be hard to maintain whenever someone adds a new ExecutionPlan implementation. 🤔
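To illustrate why an enum-based plan makes this traversal straightforward, here is a minimal sketch. The Expr and LogicalPlan types below are toy stand-ins I made up for illustration, not DataFusion's real types, and the is_async flag is a hypothetical marker for an async scalar function.

```rust
// Toy stand-ins for illustration only; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    // `is_async` is a hypothetical marker for an async scalar function.
    ScalarFunction { name: String, is_async: bool, args: Vec<Expr> },
    BinaryExpr { left: Box<Expr>, op: String, right: Box<Expr> },
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Filter { predicate: Expr, input: Box<LogicalPlan> },
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
}

impl Expr {
    /// Returns true if this expression or any sub-expression is async.
    fn contains_async(&self) -> bool {
        match self {
            Expr::Column(_) | Expr::Literal(_) => false,
            Expr::ScalarFunction { is_async, args, .. } => {
                *is_async || args.iter().any(Expr::contains_async)
            }
            Expr::BinaryExpr { left, right, .. } => {
                left.contains_async() || right.contains_async()
            }
        }
    }
}

impl LogicalPlan {
    /// Expressions owned directly by this node. Because the plan is an enum,
    /// the compiler forces this match to cover every node type.
    fn expressions(&self) -> Vec<&Expr> {
        match self {
            LogicalPlan::TableScan { .. } => vec![],
            LogicalPlan::Filter { predicate, .. } => vec![predicate],
            LogicalPlan::Projection { exprs, .. } => exprs.iter().collect(),
        }
    }

    /// The single child of this node, if any.
    fn input(&self) -> Option<&LogicalPlan> {
        match self {
            LogicalPlan::TableScan { .. } => None,
            LogicalPlan::Filter { input, .. }
            | LogicalPlan::Projection { input, .. } => Some(&**input),
        }
    }

    /// Walks the whole plan and reports whether any expression is async.
    fn has_async_expr(&self) -> bool {
        self.expressions().iter().any(|e| e.contains_async())
            || self.input().map_or(false, LogicalPlan::has_async_expr)
    }
}
```

With a trait object instead of an enum, there is no exhaustive match to lean on: every implementation would have to expose its expressions itself, which is the maintenance concern described above.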
Decouple from the optimization rule
In my opinion, the optimization phase isn't required for SQL execution. A physical plan should be executable even if we don't apply any optimization rule. The physical planner can ensure that a logical AsyncExecute is planned to an AsyncFuncExec, and the optimizer can then add batch coalescing if necessary.
On the other hand, the ordering of the optimization rules is important. If we do the planning work in the optimization phase, I guess it may interfere with the effects of other optimizations.
Keep compatibility with the federation scenario
datafusion-federation attempts to unparse the logical plan into SQL and push it down to the external database. If we can't identify async scalar functions in the logical plan phase, we may generate incorrect SQL. Conversely, if async scalar functions can be recognized during logical planning or unparsing, we can push down only valid plans to the data source and apply the async scalar function to the results from the external database.
Consider the following case (pg_items is provided by an external Postgres):
SELECT t.name, llm_bool('Is {t.price} reasonable?', t.price) as reasonable FROM pg_items t WHERE create_date > '2024-01-01'
It will be planned as:
Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
    Filter create_date > '2024-01-01'
      TableScan pg_items
If we apply the concept of datafusion-federation, we can get a plan like:
Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
    Scan( SELECT name, price FROM pg_items WHERE create_date > '2024-01-01' ) as t // pushdown the scan and filter to Postgres
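The split described above can be sketched roughly as follows. This is only a toy model under my own assumptions: the Plan enum, the FederatedScan node, and the unparse and federate functions are hypothetical names and are not the datafusion-federation API. The toy unparser also emits SELECT * instead of pruning columns the way the example above does.

```rust
// Toy plan for illustration only; not DataFusion or datafusion-federation types.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    AsyncExecute { expr: String, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
    // Hypothetical node: a subtree that was pushed down as SQL.
    FederatedScan { sql: String },
}

/// Unparses a subtree to SQL when it only contains nodes the external
/// database understands (here: a scan, optionally under one filter).
/// Returns None for anything else, including AsyncExecute.
fn unparse(plan: &Plan) -> Option<String> {
    match plan {
        Plan::TableScan { table } => Some(format!("SELECT * FROM {}", table)),
        Plan::Filter { predicate, input } => match &**input {
            Plan::TableScan { table } => {
                Some(format!("SELECT * FROM {} WHERE {}", table, predicate))
            }
            _ => None,
        },
        _ => None,
    }
}

/// Replaces the largest unparsable-free subtrees with a FederatedScan and
/// keeps everything else (such as AsyncExecute) in the local plan.
fn federate(plan: Plan) -> Plan {
    if let Some(sql) = unparse(&plan) {
        return Plan::FederatedScan { sql };
    }
    match plan {
        Plan::AsyncExecute { expr, input } => Plan::AsyncExecute {
            expr,
            input: Box::new(federate(*input)),
        },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(federate(*input)),
        },
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(federate(*input)),
        },
        other => other,
    }
}
```

Because AsyncExecute is a distinct logical node, the unparser can simply refuse it, so the async function never leaks into the SQL sent to Postgres.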
What I may implement
- Add a logical plan for AsyncExec.
- Plan AsyncExec by the logical planner. (maybe an analyzer rule)
- Plan the logical AsyncExec to the physical AsyncExec. (maybe in the physical planner)
If we plan the logical plan mentioned above into a physical plan, we get:
ProjectExec
  AsyncFuncExec
    DataSourceExec // for the postgres source
Then, after applying the optimization rule:
ProjectExec
  AsyncFuncExec
    CoalesceBatchesExec
      DataSourceExec // for the postgres source
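The two steps above (plan first, optimize afterwards) could look roughly like this. It is a sketch with toy enums whose variants just mirror the operator names used in this description. The functions create_physical_plan and coalesce_async_input are hypothetical names for illustration, not DataFusion's planner or optimizer API.

```rust
// Toy plans for illustration only; variants mirror the names used above.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan,
    AsyncExecute(Box<LogicalPlan>),
    Projection(Box<LogicalPlan>),
}

#[derive(Debug, Clone, PartialEq)]
enum PhysicalPlan {
    DataSourceExec,
    CoalesceBatchesExec(Box<PhysicalPlan>),
    AsyncFuncExec(Box<PhysicalPlan>),
    ProjectExec(Box<PhysicalPlan>),
}

/// Physical planning: every logical node maps directly to an executable
/// physical node, so the result is valid without any optimization.
fn create_physical_plan(plan: &LogicalPlan) -> PhysicalPlan {
    match plan {
        LogicalPlan::TableScan => PhysicalPlan::DataSourceExec,
        LogicalPlan::AsyncExecute(input) => {
            PhysicalPlan::AsyncFuncExec(Box::new(create_physical_plan(input)))
        }
        LogicalPlan::Projection(input) => {
            PhysicalPlan::ProjectExec(Box::new(create_physical_plan(input)))
        }
    }
}

/// Optional optimization rule: put a CoalesceBatchesExec under each
/// AsyncFuncExec. It is idempotent, so running it twice changes nothing.
fn coalesce_async_input(plan: PhysicalPlan) -> PhysicalPlan {
    match plan {
        PhysicalPlan::DataSourceExec => PhysicalPlan::DataSourceExec,
        PhysicalPlan::CoalesceBatchesExec(input) => {
            PhysicalPlan::CoalesceBatchesExec(Box::new(coalesce_async_input(*input)))
        }
        PhysicalPlan::ProjectExec(input) => {
            PhysicalPlan::ProjectExec(Box::new(coalesce_async_input(*input)))
        }
        PhysicalPlan::AsyncFuncExec(input) => {
            let input = coalesce_async_input(*input);
            // Avoid stacking a second coalesce node if one is already there.
            let input = match input {
                already @ PhysicalPlan::CoalesceBatchesExec(_) => already,
                other => PhysicalPlan::CoalesceBatchesExec(Box::new(other)),
            };
            PhysicalPlan::AsyncFuncExec(Box::new(input))
        }
    }
}
```

The point of the split is that the output of create_physical_plan already contains AsyncFuncExec, so skipping or reordering the optimizer rule only affects batching, not correctness.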