Plan AsyncFuncExpr by physical planner #6

Closed
@goldmedal

Description

There are several reasons why I want to implement this at the logical plan level.

Go through all the expressions

While working on #4, I noticed that it isn't convenient to visit all the expressions of a physical plan (to check whether any of them is an AsyncFuncExpr). In the logical plan phase, we can use something like LogicalPlan::map_expressions to go through all the expressions of the plan.

By the way, I considered implementing a similar tree node method for the physical plan. However, ExecutionPlan is a trait, not an enum, so such an API would be hard to maintain whenever someone adds a new ExecutionPlan implementation. 🤔
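To illustrate why the enum shape helps, here is a minimal sketch using toy types. `Expr`, `Plan`, `is_async`, and `sample_plan` are hypothetical stand-ins invented for this example, not the DataFusion types. Because the plan is an enum, a single exhaustive `match` reaches every node's expressions, and the compiler reports the match as incomplete when a new variant is added.

```rust
// Toy stand-ins for illustration only; these are NOT the DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    // `is_async` marks a hypothetical async scalar function call.
    Call { name: String, args: Vec<Expr>, is_async: bool },
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: Expr, input: Box<Plan> },
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
}

impl Expr {
    /// True if this expression or any nested argument is an async call.
    fn contains_async(&self) -> bool {
        match self {
            Expr::Call { is_async, args, .. } => {
                *is_async || args.iter().any(Expr::contains_async)
            }
            Expr::Column(_) | Expr::Literal(_) => false,
        }
    }
}

impl Plan {
    /// Expressions held directly by this node. Adding a new `Plan` variant
    /// makes this match non-exhaustive, so the compiler points at it.
    fn expressions(&self) -> Vec<&Expr> {
        match self {
            Plan::Scan { .. } => vec![],
            Plan::Filter { predicate, .. } => vec![predicate],
            Plan::Projection { exprs, .. } => exprs.iter().collect(),
        }
    }

    /// The single child of this node, if it has one.
    fn input(&self) -> Option<&Plan> {
        match self {
            Plan::Scan { .. } => None,
            Plan::Filter { input, .. } | Plan::Projection { input, .. } => Some(input),
        }
    }

    /// Walk the whole plan and report whether any expression is async.
    fn has_async_call(&self) -> bool {
        self.expressions().iter().any(|e| e.contains_async())
            || self.input().map_or(false, Plan::has_async_call)
    }
}

/// Builds a small plan shaped like the llm_bool query in this issue.
fn sample_plan() -> Plan {
    Plan::Projection {
        exprs: vec![
            Expr::Column("name".to_string()),
            Expr::Call {
                name: "llm_bool".to_string(),
                args: vec![Expr::Column("price".to_string())],
                is_async: true,
            },
        ],
        input: Box::new(Plan::Filter {
            predicate: Expr::Call {
                name: "gt".to_string(),
                args: vec![
                    Expr::Column("create_date".to_string()),
                    Expr::Literal("2024-01-01".to_string()),
                ],
                is_async: false,
            },
            input: Box::new(Plan::Scan { table: "pg_items".to_string() }),
        }),
    }
}
```

With trait objects there is no equivalent of the `expressions` match: each implementation would have to expose its own expressions through the trait, which is the maintenance burden described above.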

Decouple from the optimization rules

In my opinion, the optimization phase shouldn't be required for SQL execution: a physical plan should be executable even if no optimization rules are applied. The physical planner can ensure that a logical AsyncExecute is planned into an AsyncFuncExec, and an optimization rule can then add batch coalescing if necessary.

On the other hand, the ordering of the optimization rules is important. If we do the planning in the optimization phase, I suspect it could undermine the effect of other optimization rules.

Keep compatibility with the federation scenario

datafusion-federation attempts to unparse the logical plan into SQL and push it down to the external database. If we can't identify async scalar functions in the logical plan phase, we may generate incorrect SQL. Conversely, if async scalar functions can be recognized during logical planning or unparsing, we can push down only valid plans to the data source and apply the async scalar function to the results from the external database.

Consider the following case (pg_items is provided by an external Postgres database):

SELECT t.name, llm_bool('Is {t.price} reasonable?', t.price) as reasonable FROM pg_items t WHERE create_date > '2024-01-01'

It will be planned as:

Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
    Filter create_date > '2024-01-01'
      TableScan pg_items

If we apply the datafusion-federation approach, we get a plan like:

Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
    Scan( SELECT name, price FROM pg_items WHERE create_date > '2024-01-01' ) as t   // push the scan and filter down to Postgres

What I may implement

  • Add a logical plan node, AsyncExecute.
  • Have the logical planner produce AsyncExecute (maybe via an analyzer rule).
  • Plan the logical AsyncExecute into the physical AsyncFuncExec (maybe in the physical planner).
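The first two steps in the list above could look roughly like the following sketch. It uses toy types rather than the DataFusion API: `Expr`, `Plan`, and `extract_async` are hypothetical names for this example, and only top-level async calls in a projection are handled. The rewrite moves each async call into a dedicated node below the projection and replaces it with a column reference to its alias.

```rust
// Toy stand-ins for illustration only; these are NOT the DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // A hypothetical async scalar function call with its output alias.
    AsyncCall { name: String, alias: String },
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
    // The new logical node that evaluates async calls over its input.
    AsyncExecute { calls: Vec<Expr>, input: Box<Plan> },
}

/// Analyzer-style rewrite (hypothetical): pull async calls out of each
/// projection into an `AsyncExecute` node placed directly below it.
fn extract_async(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { exprs, input } => {
            let input = extract_async(*input);
            let mut calls = Vec::new();
            let mut projected = Vec::new();
            for expr in exprs {
                match expr {
                    Expr::AsyncCall { name, alias } => {
                        // The projection now only references the result column.
                        projected.push(Expr::Column(alias.clone()));
                        calls.push(Expr::AsyncCall { name, alias });
                    }
                    other => projected.push(other),
                }
            }
            // Only insert the new node when there is something to evaluate.
            let input = if calls.is_empty() {
                input
            } else {
                Plan::AsyncExecute { calls, input: Box::new(input) }
            };
            Plan::Projection { exprs: projected, input: Box::new(input) }
        }
        Plan::AsyncExecute { calls, input } => Plan::AsyncExecute {
            calls,
            input: Box::new(extract_async(*input)),
        },
        Plan::Scan(table) => Plan::Scan(table),
    }
}
```

Once the async call lives in its own node, everything below that node contains no async function, which is the property the federation case relies on when unparsing a subtree to SQL.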

If we plan the logical plan above into a physical plan, we get:

ProjectionExec
   AsyncFuncExec
      DataSourceExec // for the Postgres source

Then, after applying the optimization rule:

ProjectionExec
   AsyncFuncExec
      CoalesceBatchesExec
         DataSourceExec // for the Postgres source
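The split between planning and optimization can be sketched as two separate functions. This is a toy model, not DataFusion code: the `Logical` and `Physical` enums and the function names `plan_physical` and `coalesce_async_input` are assumptions made for this example, and the variant names only mirror the operators in the plans above. The planner alone yields an executable plan, and the optional rule adds the coalescing node afterwards.

```rust
// Toy stand-ins for illustration only; these are NOT the DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Logical {
    Scan(String),
    AsyncExecute(Box<Logical>),
    Projection(Box<Logical>),
}

#[derive(Debug, Clone, PartialEq)]
enum Physical {
    DataSourceExec(String),
    CoalesceBatchesExec(Box<Physical>),
    AsyncFuncExec(Box<Physical>),
    ProjectionExec(Box<Physical>),
}

/// Physical planning: every logical node maps to an executable node,
/// with no dependency on any optimizer rule.
fn plan_physical(logical: &Logical) -> Physical {
    match logical {
        Logical::Scan(table) => Physical::DataSourceExec(table.clone()),
        Logical::AsyncExecute(input) => {
            Physical::AsyncFuncExec(Box::new(plan_physical(input)))
        }
        Logical::Projection(input) => {
            Physical::ProjectionExec(Box::new(plan_physical(input)))
        }
    }
}

/// Optional optimizer rule (hypothetical): make sure the input of every
/// `AsyncFuncExec` is coalesced. Applying it twice changes nothing.
fn coalesce_async_input(plan: Physical) -> Physical {
    match plan {
        Physical::AsyncFuncExec(input) => {
            let input = match coalesce_async_input(*input) {
                done @ Physical::CoalesceBatchesExec(_) => done,
                other => Physical::CoalesceBatchesExec(Box::new(other)),
            };
            Physical::AsyncFuncExec(Box::new(input))
        }
        Physical::ProjectionExec(input) => {
            Physical::ProjectionExec(Box::new(coalesce_async_input(*input)))
        }
        Physical::CoalesceBatchesExec(input) => {
            Physical::CoalesceBatchesExec(Box::new(coalesce_async_input(*input)))
        }
        leaf @ Physical::DataSourceExec(_) => leaf,
    }
}
```

Skipping `coalesce_async_input` still leaves a valid plan, just one that may call the async function on smaller batches.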
