
User Defined Table Function (udtf) support #2177


Closed
wants to merge 16 commits into from

Conversation

@gandronchik (Contributor) commented Apr 8, 2022

UDTF support (user-defined functions returning a table)

In my understanding, a table function returns multiple rows. For now, we only have UDFs, which return a scalar value.

I don't think it should return multiple columns; structures are usually used for that.

We have the following cases:

1. select table_fun(1, 5);

 generate_series(Int64(1),Int64(5))
------------------------------------
                                  1
                                  2
                                  3
                                  4
                                  5
(5 rows)
 Projection: #generate_series(Int64(1),Int64(5)) +
   TableUDFs: generate_series(Int64(1), Int64(5))+
     EmptyRelation

This is the easiest scenario. The function just returns a vec of values.

2. select table_fun(1, col) from (select 2 col union all select 3 col) t;

 generate_series(Int64(1),t.col)
---------------------------------
                               1
                               2
                               3
                               1
                               2
(5 rows)
Projection: #generate_series(Int64(1),t.col)  +
   TableUDFs: generate_series(Int64(1), #t.col)+
     Projection: #t.col, alias=t               +
       Union                                   +
         Projection: Int64(2) AS col           +
           EmptyRelation                       +
         Projection: Int64(3) AS col           +
           EmptyRelation

The function returns a batch.

3. select col, table_fun(1, col) from (select 2 col union all select 3 col) t;

col | generate_series(Int64(1),t.col)
-----+---------------------------------
   3 |                               1
   3 |                               2
   3 |                               3
   2 |                               1
   2 |                               2
(5 rows)
Projection: #t.col, #generate_series(Int64(1),t.col)+
   TableUDFs: generate_series(Int64(1), #t.col)      +
     Projection: #t.col, alias=t                     +
       Union                                         +
         Projection: Int64(2) AS col                 +
           EmptyRelation                             +
         Projection: Int64(3) AS col                 +
           EmptyRelation

This is the most difficult case: we have to transform the data flow because, as you can see from the result, col must be duplicated for each row of the table_fun result.
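The duplication step can be sketched as follows (a hypothetical illustration under the PR's "section sizes" convention, not the PR's actual code): each input value of col is repeated once for every row the table function generated from that input row.

```rust
// Hypothetical sketch (not the PR's actual code): repeat each value of `col`
// once per row the table function generated from that input row, using the
// per-input-row "section sizes" described in this PR.
fn repeat_by_sections(col: &[i64], section_sizes: &[usize]) -> Vec<i64> {
    col.iter()
        .zip(section_sizes)
        .flat_map(|(&v, &n)| std::iter::repeat(v).take(n))
        .collect()
}

fn main() {
    // Input rows col = [3, 2]; generate_series(1, col) yields 3 and 2 rows
    // respectively, matching the case 3 output above.
    let repeated = repeat_by_sections(&[3, 2], &[3, 2]);
    assert_eq!(repeated, vec![3, 3, 3, 2, 2]);
    println!("{:?}", repeated);
}
```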

4. select * from table_fun(1, 5);

 generate_series(Int64(1),Int64(5))
------------------------------------
                                  1
                                  2
                                  3
                                  4
                                  5
(5 rows)
Projection: #generate_series(Int64(1),Int64(5)) +
   TableUDFs: generate_series(Int64(1), Int64(5))+
     EmptyRelation

Here the result is the same as in the first case; however, the plan structure is different.

5. select * from table_fun(1, 5) t(n);

 n
---
 1
 2
 3
 4
 5
(5 rows)
Projection: #t.n                                               +
   Projection: #generate_series(Int64(1),Int64(5)) AS n, alias=t+
     TableUDFs: generate_series(Int64(1), Int64(5))             +
       EmptyRelation

It looks the same as the previous case; however, the plan is slightly different to support the alias (the table_fun node does not support aliases, so we have to add a projection).

Regarding the signature, I decided to use a single flat vector plus a vector of section sizes, instead of a vec of vecs, for better performance. A Vec<Vec<_>> would require a lot of memory when a request produces millions of rows.
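As a rough sketch of that flattened convention (function and variable names invented for illustration; the PR's real signature also carries ColumnarValue arguments and a batch size): one contiguous value buffer, plus a vector recording how many output rows each input row produced.

```rust
// Hypothetical sketch of the flattened return convention used in this PR:
// one contiguous value buffer plus per-input-row section sizes, instead of
// allocating a Vec<Vec<i64>>.
fn generate_series_flat(starts: &[i64], ends: &[i64]) -> (Vec<i64>, Vec<usize>) {
    let mut values = Vec::new();
    let mut section_sizes = Vec::new();
    for (&start, &end) in starts.iter().zip(ends) {
        let before = values.len();
        values.extend(start..=end); // the "section" for this input row
        section_sizes.push(values.len() - before);
    }
    (values, section_sizes)
}

fn main() {
    // Two input rows: generate_series(1, 2) and generate_series(1, 3).
    let (values, sizes) = generate_series_flat(&[1, 1], &[2, 3]);
    assert_eq!(values, vec![1, 2, 1, 2, 3]);
    assert_eq!(sizes, vec![2, 3]);
}
```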

@github-actions bot added the datafusion (Changes in the datafusion crate) label Apr 8, 2022
@@ -99,6 +99,7 @@ impl ExpressionVisitor for ApplicabilityVisitor<'_> {
Expr::ScalarUDF { fun, .. } => {
self.visit_volatility(fun.signature.volatility)
}
Expr::TableUDF { fun, .. } => self.visit_volatility(fun.signature.volatility),
Member

I recommend writing it like this:

            Expr::ScalarUDF { fun, .. } | Expr::TableUDF { fun, .. } => {
                self.visit_volatility(fun.signature.volatility)
            }

Contributor Author

Good point; however, it doesn't work in this case (the fun argument has different types for TableUDF and ScalarUDF).

@@ -381,6 +381,7 @@ impl<'a> ConstEvaluator<'a> {
| Expr::QualifiedWildcard { .. } => false,
Expr::ScalarFunction { fun, .. } => Self::volatility_ok(fun.volatility()),
Expr::ScalarUDF { fun, .. } => Self::volatility_ok(fun.signature.volatility),
Expr::TableUDF { .. } => false,
Member

ditto

@xudong963 (Member)

BTW, from clippy:

error: unneeded `return` statement
   --> datafusion/core/src/physical_plan/functions.rs:752:9
    |
752 |         return Ok(ColumnarValue::Array(result));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: remove `return`: `Ok(ColumnarValue::Array(result))`
    |
    = note: `-D clippy::needless-return` implied by `-D warnings`
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return

error: could not compile `datafusion` due to previous error
warning: build failed, waiting for other jobs to finish...
error: called `.nth(0)` on a `std::iter::Iterator`, when `.next()` is equivalent
    --> datafusion/core/src/execution/context.rs:3527:32
     |
3527 |             let start_number = start_arr.into_iter().nth(0).unwrap().unwrap_or(0);
     |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try calling `.next()` instead of `.nth(0)`: `start_arr.into_iter().next()`
     |
     = note: `-D clippy::iter-nth-zero` implied by `-D warnings`
     = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#iter_nth_zero

error: called `.nth(0)` on a `std::iter::Iterator`, when `.next()` is equivalent
    --> datafusion/core/src/execution/context.rs:3533:30
     |
3533 |             let end_number = end_arr.into_iter().nth(0).unwrap().unwrap_or(0) + 1;
     |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try calling `.next()` instead of `.nth(0)`: `end_arr.into_iter().next()`
     |
     = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#iter_nth_zero

error: build failed

@xudong963 xudong963 added the enhancement New feature or request label Apr 9, 2022
@doki23 (Contributor) commented Apr 10, 2022

Hmm, is TableFunction an expression 🤔?
Refer to https://docs.snowflake.com/en/sql-reference/functions-table.html
The SQL usually looks like:

select doi.date as "Date", record_temperatures.city, record_temperatures.temperature
    from dates_of_interest as doi,
         table(record_high_temperatures_for_date(doi.date)) as record_temperatures;

It shouldn't be an expression, right?

@gandronchik gandronchik requested a review from xudong963 April 12, 2022 14:02
@alamb (Contributor) left a comment

Thank you @gandronchik -- sorry for the delay in review. I think this PR is looking quite good 👌

Epic first PR

Would it be possible to add a test for a table function that gets no arguments (as there is code to handle that case, but I don't see coverage)?

I also had one relatively minor question related to zero-argument handling. Really nice.

Also it would be nice to add a note about supporting Table Functions in https://github.com/apache/arrow-datafusion/blob/master/docs/source/user-guide/sql/sql_status.md (but we can do so as a follow on PR)

Does anyone else have questions or concerns about merging this PR?

cc @andygrove @liukun4515 @yjshen

// specific language governing permissions and limitations
// under the License.

//! UDTF support
Contributor

Suggested change
//! UDTF support
//! User Defined Table Function (UDTF) support

// specific language governing permissions and limitations
// under the License.

//! Udtf module contains foundational types that are used to represent UDTFs in DataFusion.
Contributor

Suggested change
//! Udtf module contains foundational types that are used to represent UDTFs in DataFusion.
//! Contains foundational types that are used to represent User Defined Table Functions (UDTFs) in DataFusion.

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
// evaluate the arguments, if there are no arguments we'll instead pass in a null array
// indicating the batch size (as a convention)
let inputs = match (self.args.len(), self.name.parse::<BuiltinScalarFunction>()) {
Contributor

I don't understand why we are parsing the table function name using BuiltinScalarFunction? Don't we already have self.fun?

@doki23 (Contributor) commented Apr 14, 2022

Hmm, I have some questions about this PR.
If we treat a UDTF as an expression, does that mean it can only produce one column?
As I mentioned before (#2177 (comment)), it's more like a table, so we can select * from it and get any number of columns.
I'm confused; would you please explain it to me? @alamb @gandronchik

@thinkharderdev (Contributor)

Hmmmm...I have some problems about this pr. If we treat UDTF as an expression, does it mean that it can only produce one column? As I mentioned before (#2177 (comment)), it's more like a table so that we can select * from it and get any number of columns. I'm confused, would you please explain it to me? @alamb @gandronchik

I had the same question. I'm not sure I understand how this is different from a scalar function. It seems like a table function should produce RecordBatchs and effectively compile down to an ExecutionPlan.

@alamb (Contributor) commented Apr 15, 2022

It seems like a table function should produce RecordBatchs and effectively compile down to an ExecutionPlan.

I agree it should definitely produce RecordBatch

@gandronchik (Contributor Author) commented Apr 15, 2022

What about Result<Vec<ColumnarValue>>? I have already almost implemented it this way :)

It seems like a table function should produce RecordBatchs and effectively compile down to an ExecutionPlan.

I agree it should definitely produce RecordBatch

@thinkharderdev (Contributor)

what about Result<Vec<ColumnarValue>>. I already almost implemented it this way:)

It seems like a table function should produce RecordBatchs and effectively compile down to an ExecutionPlan.

I agree it should definitely produce RecordBatch

That's essentially a RecordBatch :)

You could have

pub type TableFunctionImplementation =
    Arc<dyn Fn(&[ColumnarValue]) -> Result<Vec<ColumnarValue>> + Send + Sync>;

// This is a terrible name but this would be analogous to ReturnTypeFunction/StateTypeFunction
pub type TableSchemaFunction = 
    Arc<dyn Fn(&[DataType]) -> Result<SchemaRef> + Send + Sync>; 

@Ted-Jiang (Member)

@alamb @thinkharderdev @doki23 I met the same problem in #2343.

If we treat it as an Expr, we need to change it to a PhysicalExpr, but:

/// Evaluate an expression against a RecordBatch
    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>;
pub enum ColumnarValue {
    /// Array of values
    Array(ArrayRef),
    /// A single value
    Scalar(ScalarValue),
}

Because it returns a ColumnarValue, we cannot return the result as a table. Am I right?

Should I implement a TablePhysicalExpr using

  fn evaluate(&self, batch: &RecordBatch) -> Result<Vec<ColumnarValue>>;

@alamb (Contributor) commented Apr 26, 2022

@alamb @thinkharderdev @doki23 i met the same problem in #2343

I left some thoughts in

#2343 (comment)

@gandronchik gandronchik deleted the support-udtf branch April 27, 2022 10:37
@gandronchik gandronchik restored the support-udtf branch April 27, 2022 11:49
@gandronchik gandronchik reopened this Apr 27, 2022
@gandronchik gandronchik requested a review from alamb April 27, 2022 14:13
@alamb alamb changed the title udtf support User Defined Table Function (udtf) support Apr 27, 2022
@alamb (Contributor) commented Apr 27, 2022

I plan to give this a more careful review tomorrow

@@ -39,6 +40,10 @@ use std::sync::Arc;
pub type ScalarFunctionImplementation =
Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;

/// Table function. Second tuple
pub type TableFunctionImplementation =
Arc<dyn Fn(&[ColumnarValue], usize) -> Result<(ArrayRef, Vec<usize>)> + Send + Sync>;
Member

As ArrayRef is one variant of ColumnarValue:

pub enum ColumnarValue {
    /// Array of values
    Array(ArrayRef),
    /// A single value
    Scalar(ScalarValue),
}

I think TableFunctionImplementation is the same as ScalarFunctionImplementation, and it only generates an N*1 table. If we instead use, as in #2177 (comment):

Arc<dyn Fn(&[ColumnarValue], usize) -> Result<(Vec< ColumnarValue >, Vec<usize>)> + Send + Sync>;

we could generate an N*M table.
If I'm wrong, please correct me.

Member

Or in this case it can generate an N*M table.

Contributor

I am also a little mystified by this signature. It looks like "Second tuple" was the start of a thought that didn't get finished? I also don't understand what the usize in the tuple represents -- perhaps you can add some comments explaining its purpose?

Also, I agree with @Ted-Jiang's analysis -- I would expect this signature to return a "table" (aka a RecordBatch, or a Vec<ColumnarValue> if preferred).

Perhaps something like

Arc<dyn Fn(&[ColumnarValue]) -> Result<RecordBatch> + Send + Sync>;

or

Arc<dyn Fn(&[ColumnarValue]) -> Result<Vec<ColumnarValue>> + Send + Sync>;

@doki23 (Contributor) Apr 29, 2022

I guess that @gandronchik wants to chain each result (ArrayRef) of TableFunctionImplementation into a multi-column result (see the code in TableFunStream::batch), which may mean the table UDF consists of multiple exprs. The reason should be that trait PhysicalExpr only provides fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>. But I agree that Arc<dyn Fn(&[ColumnarValue]) -> Result<Vec<ColumnarValue>> + Send + Sync> is more proper. So I believe the approach may be to directly invoke the table UDF in the TableFunStream without implementing trait PhysicalExpr for it, or to add fn evaluate(&self, batch: &RecordBatch) -> Result<Vec<ColumnarValue>> to PhysicalExpr.

Contributor Author

I updated the PR description. Hope it is clear enough now :)

@alamb (Contributor) left a comment

First of all, again thank you @gandronchik for this contribution

If you are implementing a table function, I would expect it to be able to return multiple rows and columns. I think this PR only implements a table function that produces multiple rows.

It may be that I have a different understanding of "table function" than what you are trying to implement. A writeup of what you are trying to do (not how you are implementing it) would likely help move this conversation forward.

As I am familiar with Table Functions, they are a little tricky as they can change the cardinality and schema of their input, and thus database systems restrict where in queries they may appear.

I think typical uses are in the FROM clause and in SELECT clause. I wonder if that sounds similar to what you are trying to do?

Comment on lines +2195 to +2207
let result = plan_and_collect(&ctx, "SELECT integer_series(1,5)").await?;

let expected = vec![
"+-----------------------------------+",
"| integer_series(Int64(1),Int64(5)) |",
"+-----------------------------------+",
"| 1 |",
"| 2 |",
"| 3 |",
"| 4 |",
"| 5 |",
"+-----------------------------------+",
];
Contributor

This is a good example of a UDTF producing more rows than went in 👍

Would it be possible to write an example that also produces a different number of columns than went in? I think that is what @Ted-Jiang and I are pointing out in our comments below.

Contributor Author

I didn't support that; you can use structures for it.

assert_batches_eq!(expected, &result);

let result =
plan_and_collect(&ctx, "SELECT * from integer_series(1,5) pos(n)").await?;
Contributor

Can you explain what this test is supposed to be demonstrating? I am not quite sure what it shows

Contributor Author

I have just explained it in the PR description. Hope I made it clear enough :)


@doki23 (Contributor) commented Apr 29, 2022

I don't think it should return multiply columns, structures are usually used for this.

I cannot agree. The result of a table function represents a temporary table. Since it's a table, it shouldn't have only one column. Of course, one column of struct type can solve the problem, but it's different: we cannot directly execute ORDER BY or other queries on it without first extracting the structure.

@alamb (Contributor) commented Apr 29, 2022

@gandronchik thank you for the explanation in this PR's description. It helps, though I will admit I still don't fully understand what is going on.

I agree with @doki23 -- I expect a table function to logically return a table (that is, something with both rows and columns).

Regarding signature, I decided to use a single vector and vector with sizes of sections instead of vec of vecs to have better performance. If we use Vec, this will require a lot of memory in case of a request for millions of rows.

The way the rest of DataFusion avoids buffering all the intermediate results in memory at once is with Streams, but that requires interacting with Rust's async ecosystem, which is non-trivial.

If you wanted a streaming solution, that would mean the signature might look something like the following (maybe)

Arc<dyn Fn(Box<dyn SendableRecordBatchStream>) -> Result<Box<dyn SendableRecordBatchStream>> + Send + Sync>;

@gandronchik (Contributor Author)

@gandronchik thank you for the explanation in this PR's description. It helps though I will admit I still don't fully understand what is going o.

I agree with @doki23 -- I expect a table function to logically return a table (that something with both rows and columns)

Regarding signature, I decided to use a single vector and vector with sizes of sections instead of vec of vecs to have better performance. If we use Vec, this will require a lot of memory in case of a request for millions of rows.

The way the rest of DataFusion avoids buffering all the intermediate results at once int memory is with Streams but then that requires interacting with rust's async ecosystem which is non trivial

If you wanted a streaming solution, that would mean the signature might look something like the following (maybe)

Arc<dyn Fn(Box<dyn SendableRecordBatchStream>) -> Result<Box<dyn SendableRecordBatchStream>> + Send + Sync>;

Looks like I got the title wrong. I have implemented a function that returns many rows; probably it is not a table function. If I rename it, will that be fine?

Regarding the function signature, I think my solution is a compromise between a vec and streaming. Actually, I don't think such a function can return that many rows. However, of course, I will rewrite it if you want. So which solution do we choose: the current Result<(ArrayRef, Vec<usize>)>, Result<Vec<ColumnarValue>>, or Result<Box<dyn SendableRecordBatchStream>>?

@gandronchik gandronchik requested a review from alamb May 24, 2022 06:53
@alamb (Contributor) commented May 24, 2022

I think adding UDTFs (aka user defined table functions) that produce a 2 dimensional table output (aka Vec<RecordBatch> or a SendableRecordBatchStream) would be a valuable addition to DataFusion.

I think Spark calls these "table value functions":

https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-qry-select-tvf.html

Postgres calls them table functions:

https://www.postgresql.org/docs/7.3/xfunc-tablefunctions.html

However, this PR does not implement table functions that I can see. I still don't fully understand the use case for the code in this PR for a function that returns a single column of values, and I don't know of any other system that implements such functions. Thus I feel that this PR adds a feature that is not widely usable to DataFusion users as a whole, and so I don't feel I can approve it.

If others (users or maintainers) have a perspective on this issue, I would love to hear them too. If there is broader support for this feature, I won't oppose merging it.

@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
@alamb alamb marked this pull request as draft June 7, 2022 17:22
@alamb (Contributor) commented Jun 7, 2022

Marking as draft until we figure out what to do with this.

@gandronchik (Contributor Author)

@alamb Hello! Sorry for the late response.

I am sorry for such a big PR with such a bad description.

Now I will try to explain what is happening here. Honestly, I made a mistake with the naming: what I implemented is a Set Returning Function (https://www.postgresql.org/docs/current/functions-srf.html).

As far as I know, DataFusion is oriented toward PostgreSQL behavior, so the functionality I provide here is Postgres functionality.

We already use it in Cube.js. We implemented several functions.

Please look at my PR closer. I am ready to improve it, rename some structures, etc.

Below, I provide the implementation of the generate_series function (a real Postgres function):

macro_rules! generate_series_udtf {
    ($ARGS:expr, $TYPE: ident, $PRIMITIVE_TYPE: ident) => {{
        let mut section_sizes: Vec<usize> = Vec::new();
        let l_arr = &$ARGS[0].as_any().downcast_ref::<PrimitiveArray<$TYPE>>();
        if l_arr.is_some() {
            let l_arr = l_arr.unwrap();
            let r_arr = downcast_primitive_arg!($ARGS[1], "right", $TYPE);
            let step_arr = PrimitiveArray::<$TYPE>::from_value(1 as $PRIMITIVE_TYPE, 1);
            let step_arr = if $ARGS.len() > 2 {
                downcast_primitive_arg!($ARGS[2], "step", $TYPE)
            } else {
                &step_arr
            };

            let mut builder = PrimitiveBuilder::<$TYPE>::new(1);
            for (i, (start, end)) in l_arr.iter().zip(r_arr.iter()).enumerate() {
                let step = if step_arr.len() > i {
                    step_arr.value(i)
                } else {
                    step_arr.value(0)
                };

                let start = start.unwrap();
                let end = end.unwrap();
                let mut section_size: i64 = 0;
                if start <= end && step > 0 as $PRIMITIVE_TYPE {
                    let mut current = start;
                    loop {
                        if current > end {
                            break;
                        }
                        builder.append_value(current).unwrap();

                        section_size += 1;
                        current += step;
                    }
                }
                section_sizes.push(section_size as usize);
            }

            return Ok((Arc::new(builder.finish()) as ArrayRef, section_sizes));
        }
    }};
}

pub fn create_generate_series_udtf() -> TableUDF {
    let fun = make_table_function(move |args: &[ArrayRef]| {
        assert!(args.len() == 2 || args.len() == 3);

        if args[0].as_any().downcast_ref::<Int64Array>().is_some() {
            generate_series_udtf!(args, Int64Type, i64)
        } else if args[0].as_any().downcast_ref::<Float64Array>().is_some() {
            generate_series_udtf!(args, Float64Type, f64)
        }

        Err(DataFusionError::Execution(format!("Unsupported type")))
    });

    let return_type: ReturnTypeFunction = Arc::new(move |tp| {
        if tp.len() > 0 {
            Ok(Arc::new(tp[0].clone()))
        } else {
            Ok(Arc::new(DataType::Int64))
        }
    });

    TableUDF::new(
        "generate_series",
        &Signature::one_of(
            vec![
                TypeSignature::Exact(vec![DataType::Int64, DataType::Int64]),
                TypeSignature::Exact(vec![DataType::Int64, DataType::Int64, DataType::Int64]),
                TypeSignature::Exact(vec![DataType::Float64, DataType::Float64]),
                TypeSignature::Exact(vec![
                    DataType::Float64,
                    DataType::Float64,
                    DataType::Float64,
                ]),
            ],
            Volatility::Immutable,
        ),
        &return_type,
        &fun,
    )
}

@alamb (Contributor) commented Jun 12, 2022

Thanks @gandronchik -- I will try and find time to re-review this PR over the next few days in light of the information above.

@gandronchik (Contributor Author)

Thanks @gandronchik -- I will try and find time to re-review this PR over the next few days in light of the information above.

@alamb Hello! Have you already had time to check the PR?

@alamb (Contributor) commented Jun 28, 2022

@alamb Hello! Have you had already time to check the PR?

Hi @gandronchik, sadly I have not had a chance. I apologize for my lack of bandwidth, but it is hard to find sufficient contiguous time to review such large PRs when I don't have the background context.

My core problem is that I don't understand (despite your admirable attempts to clarify) what this PR is trying to implement, so it is very hard to evaluate the code to see if it is implementing what is desired (because I don't understand what is desired).

For example, all the examples of "set returning functions" in the Postgres links you shared appear to use those functions as elements in the FROM clause:

select * from unnest(ARRAY[1,2], ARRAY['foo','bar','baz']) as x(a,b)

So I am struggling to understand examples you share in the PR's description that show using these functions in combination with a column 🤔

select table_fun(1, col) from (select 2 col union all select 3 col) t;

So what would you think about implementing more general user defined table functions (that can return RecordBatches / streams as we have discussed above)? I think others would also likely use such functionality, and it seems like it would satisfy the use cases from Cube.js (?)

@gandronchik (Contributor Author) commented Jun 29, 2022

@alamb Hello! I think it will be easier to understand what I implemented here if you check how the generate_series function works in Postgres. Just try the following queries:

1. select generate_series(1, 5);
2. select generate_series(1, n) from (select 2 n union all select 3 n) x;
3. select n, generate_series(1, n) from (select 2 n union all select 3 n) x;
4. select col from generate_series(1, 5) fun(col);

Before these changes, DataFusion had only UDFs (which return one row per input row) and UDAFs (which return one row for any number of input rows). My changes allow returning multiple rows per input row.

@alamb (Contributor) commented Jan 14, 2023

This PR is more than 6 months old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more.

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants