Skip to content

Allow default value on lag/lead if they are coerceable to value #8308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 25 additions & 7 deletions datafusion/physical-expr/src/window/lead_lag.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,10 +21,11 @@
use crate::window::BuiltInWindowFunctionExpr;
use crate::PhysicalExpr;
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::compute::{cast, cast_with_options, CastOptions};
use arrow::datatypes::{DataType, Field};
use datafusion_common::ScalarValue;
use datafusion_common::{internal_err, DataFusionError, Result};
use datafusion_common::{internal_err, ScalarValue};
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::type_coercion::binary::comparison_coercion;
use datafusion_expr::PartitionEvaluator;
use std::any::Any;
use std::cmp::min;
Expand Down Expand Up @@ -235,11 +236,28 @@ fn get_default_value(
default_value: Option<&ScalarValue>,
dtype: &DataType,
) -> Result<ScalarValue> {
if let Some(value) = default_value {
if let ScalarValue::Int64(Some(val)) = value {
ScalarValue::try_from_string(val.to_string(), dtype)
if let Some(default_value) = default_value {
let default_value_type = default_value.data_type();

if comparison_coercion(&default_value_type, dtype).is_some() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you just need to change integer, since I'm not sure having pub fn comparison_binary_numeric_coercion is a good idea or not. At least, we can add the comment for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't understand what you mean by "change integer". Could you rephrase, or provide a suggestion for a change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I mean coerce, not change.

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code from your previous commit

if can_coerce_from(dtype, &default_value_type)
            || (dtype.is_integer() && default_value_type.is_integer())

Comment similar to
// We only check coercion between integer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just to be clear, are you suggesting that the comparison should be reverted to

if can_coerce_from(dtype, &default_value_type)
            // We explicitely allow a lossy conversion between signed and unsigned integers
            || (dtype.is_integer() && default_value_type.is_integer())

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest comparison_coercion. I read the doc, it does not seem lag/lead should only be integer.

Copy link
Contributor

@jayzhan211 jayzhan211 Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that coerced_type might not be the same of dtype or default_value_type. So we need to cast both of them to coerced_type

if let(coerced_type) = comparison_coercion(dtype, &default_value_type) {}

For example, if we have (dtype: UInt64, default_value_type: Int32), coerced_type is Int64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @alamb to confirm my suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think coercing during physical evaluation is too late in the process -- most other coercion happens in the analyze phase so that subsequent transformations / optimizations use the same rules.

Would it be possible to move this coercion earlier into

https://github.com/apache/arrow-datafusion/blob/513fd052bdbf5c7a73de544a876961f780b90a92/datafusion/optimizer/src/analyzer/type_coercion.rs#L64-L63

?

ScalarValue::try_from_array(
&cast_with_options(
&default_value.to_array()?,
dtype,
&CastOptions {
safe: false,
format_options: Default::default(),
},
)?,
0,
)
} else {
internal_err!("Expects default value to have Int64 type")
internal_err!(
"Cannot coerce default value {:?} {} to {:?}",
default_value,
default_value_type,
dtype
)
}
} else {
Ok(ScalarValue::try_from(dtype)?)
Expand Down
20 changes: 20 additions & 0 deletions datafusion/sqllogictest/test_files/window.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1153,6 +1153,26 @@ SELECT
141680161 63044568 63044568 225513085 225513085
145294611 141047417 141047417 243203849 243203849

# can coerce given default value to lag/lead value type
query PP
SELECT
LAG(c9::timestamp, 99999999, '2000-11-22T16:17:06.624Z') OVER(ORDER BY c9) as lag_default_coerced,
LEAD(c9::timestamp, 99999999, '2023-11-22T16:17:06.624Z') OVER(ORDER BY c9) as lead_default_coerced
FROM aggregate_test_100
LIMIT 1
----
2000-11-22T16:17:06.624 2023-11-22T16:17:06.624

# cannot coerce given default value to lag value type
query error
SELECT
LAG(c9, 99999999, '2000-11-22T16:17:06.624Z') OVER(ORDER BY c9)
FROM aggregate_test_100
LIMIT 1
----
DataFusion error: Arrow error: Cast error: Cannot cast string '2000-11-22T16:17:06.624Z' to value of UInt64 type


#fn test_window_frame_first_value_last_value_aggregate
query IIII
SELECT
Expand Down