-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support bounds evaluation for temporal data types #14523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support bounds evaluation for temporal data types #14523
Conversation
Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me. |
Thanks @berkaysynnada! Yeah I'd like to get a review. There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the |
Seems like you're losing the order requirement somehow |
…t-temporal-types-in-interval-arithmetics
@berkaysynnada do you have time to take another look? :)
I debugged into this and noticed that the I removed |
Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?
Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you. I took a quick look, and you can use the union logic there: |
Sure, feel free to make any adjustment as you see fit. |
I thought a bit more about the union logic and came up with another solution. Sometimes intervals are not overlapping and sometimes have no bound in one direction ( Here are some examples of what I think should be correct, without interval sets support:
I adapted the union logic accordingly and added tests. Happy for your feedback! |
Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ch-sc. Both your implementations and clean-ups on the existing code seem very good to me. I suggested some further improvements, and have one point to discuss: Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API? IMO that will make things simpler, WDYT?
and sorry for the late response 😞
datafusion/common/src/scalar/mod.rs
Outdated
@@ -1583,6 +1583,17 @@ impl ScalarValue { | |||
} | |||
} | |||
|
|||
/// Returns negation for a boolean scalar value | |||
pub fn boolean_negate(&self) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can extend arithmetic_negate() with booleans, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that, but I think we should then rename arithmetic_negate
to just negate
to cause no ambiguity. WDYT?
Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => { | ||
NullableInterval::from(lhs) | ||
.apply_operator(op, &rhs.into()) | ||
.and_then(|x| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid cloning here actually introducing a new taker API to NullableInterval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how you would do that.
…t-temporal-types-in-interval-arithmetics
@berkaysynnada thank you for your review 🙂
I agree, something like that would be ideal. I'll look into it today |
Hi @ch-sc. Do you plan to complete the work here? (as it's so almost) |
Hi @berkaysynnada, I should be able to spend some time on this at the end of this week. |
…t-temporal-types-in-interval-arithmetics
Hi @berkaysynnada, can you take another look to move this forward? :) |
…t-temporal-types-in-interval-arithmetics
I will do but some tests are failing. I'll have some time probably in the evening or tomorrow morning |
Hi @ch-sc. Reviewing while tests are failing is not very efficient. Are you dealing with them? |
Sorry @berkaysynnada, I got side-tracked from this yesterday. The test is fixed. |
Thank you @ch-sc. I will review it today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again @ch-sc. I've looked the changes except cp_solver file. Your changes makes very sense overall, but I see some points to discuss. Let's continue iterating
col_stats: &ColumnStatistics, | ||
col_index: usize, | ||
) -> Result<Self> { | ||
let field = schema.fields().get(col_index).ok_or_else(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we spread this check over all caller side code?
@@ -126,6 +126,25 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash { | |||
not_impl_err!("Not implemented for {self}") | |||
} | |||
|
|||
/// Checks support of bounds evaluation for this expression, before evaluating bounds. | |||
/// Returns None if bounds evaluation is not supported. | |||
fn evaluate_bounds_checked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have supports_bounds_evaluation() there already. I think no need to introduce another API in PhysicalExpr. If we see lots of repetition of "first supports_bounds_evaluation, then evaluate_bounds" pattern, then we can make a standalone function, IMO
rhs.data_type() | ||
); | ||
}; | ||
BinaryTypeCoercer::new(&self.data_type(), &Operator::Plus, &rhs.data_type()).get_result_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check the result type is also supported by evaluate_bounds? WDYT
@@ -963,6 +961,23 @@ pub fn apply_operator(op: &Operator, lhs: &Interval, rhs: &Interval) -> Result<I | |||
Operator::Minus => lhs.sub(rhs), | |||
Operator::Multiply => lhs.mul(rhs), | |||
Operator::Divide => lhs.div(rhs), | |||
Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can apply_operator() return NullableInterval::Null or NullableInterval::MaybeNull here? isn't it strange for isDistinct and isNotDistinct?
@@ -1690,6 +1705,24 @@ impl Display for NullableInterval { | |||
} | |||
} | |||
|
|||
impl From<&Interval> for NullableInterval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversion is dangerous. Null means different things in Interval and NullableInterval. Interval's can be converted to NotNull types only in the current behavior. Where do we need such a conversion, and why? can you elaborate a bit
/// | ||
/// output interval: [`true`, `true`] | ||
/// ``` | ||
fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous concern still applies. I think there should be 3 different paths of this logic:
- expr_bounds is singleton:
check the inList expression separately. If any of them is also singleton, and has the same value with expr_bounds, then the result is CertainlyTrue - expr_bounds is not a singleton => intersect expr_bounds with all inList expressions.
a. if any of inList expressions intersect with expr_bounds, then the result is Uncertain
b. if any of inList does not intersect with expr_bounds, then the result is CertainlyFalse
Which issue does this PR close?
As discussed in 14237 temporal data types should be supported in bounds evaluation.
Rationale for this change
We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.
What changes are included in this PR?
Timestamp
,Date
,Time
,Duration
,Interval
,NotEq
IsDistinctFrom
,IsNotDistinctFrom
PhysicalExpr
trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.Are these changes tested?
yes
Are there any user-facing changes?
PhysicalExpr
andScalarUDFImpl
change, but both provide default implementations for additional functions.