Skip to content

Support bounds evaluation for temporal data types #14523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

ch-sc
Copy link
Contributor

@ch-sc ch-sc commented Feb 6, 2025

Which issue does this PR close?

As discussed in 14237 temporal data types should be supported in bounds evaluation.

Rationale for this change

We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.

What changes are included in this PR?

  • Additional data types are supported: Timestamp, Date, Time, Duration, Interval
  • Additional binary operators are supported: NotEq, IsDistinctFrom, IsNotDistinctFrom
  • Whether an expression supports bounds evaluation or not is now part of the PhysicalExpr trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.

Are these changes tested?

yes

Are there any user-facing changes?

PhysicalExpr and ScalarUDFImpl change, but both provide default implementations for additional functions.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Feb 6, 2025
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 6, 2025
@berkaysynnada
Copy link
Contributor

Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 7, 2025

Thanks @berkaysynnada! Yeah I'd like to get a review.

There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

@berkaysynnada
Copy link
Contributor

Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

Seems like you're losing the order requirement somehow

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 12, 2025

@berkaysynnada do you have time to take another look? :)

NotEq leads to the removal of the sort operator.

I debugged into this and noticed that the EnforceSorting optimiser removes SortExec. enforce_sorting/mod.rs:415
However, prior to that the order requirement somehow is lost and I don't understand why that happens.

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

@berkaysynnada
Copy link
Contributor

@berkaysynnada do you have time to take another look? :)

Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

I took a quick look, and you can use the union logic there:
https://github.com/Fly-Style/datafusion/blob/fee3023a63227f0a22ac2da1d040d373cc028c34/datafusion/expr-common/src/interval_arithmetic.rs#L667
It seems to follow a better style, IMO.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 13, 2025

I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

Sure, feel free to make any adjustment as you see fit.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 14, 2025

I took a quick look, and you can use the union logic there

I thought a bit more about the union logic and came up with another solution.

Sometimes intervals are not overlapping and sometimes have no bound in one direction (NULL). Typically the union of non-overlapping intervals would result in a set of intervals, but there is no support of this.

Here are some examples of what I think should be correct, without interval sets support:

(1, 2)    ∪ (3,4)        = (1,4)
(1, NULL) ∪ (NULL, NULL) = (1, NULL)
(NULL, 1) ∪ (NULL, NULL) = (NULL, 1)
(3, NULL) ∪ (NULL, 1)    = (NULL, NULL)
(5, NULL) ∪ (1, 2)       = (1, NULL)

I adapted the union logic accordingly and added tests. Happy for your feedback!

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 18, 2025

Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested.

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ch-sc. Both your implementations and clean-ups on the existing code seem very good to me. I suggested some further improvements, and have one point to discuss: Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API? IMO that will make things simpler, WDYT?

and sorry for the late response 😞

@@ -1583,6 +1583,17 @@ impl ScalarValue {
}
}

/// Returns negation for a boolean scalar value
pub fn boolean_negate(&self) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can extend arithmetic_negate() with booleans, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, but I think we should then rename arithmetic_negate to just negate to cause no ambiguity. WDYT?

Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => {
NullableInterval::from(lhs)
.apply_operator(op, &rhs.into())
.and_then(|x| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid cloning here actually introducing a new taker API to NullableInterval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how you would do that.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 25, 2025

@berkaysynnada thank you for your review 🙂

Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API?

I agree, something like that would be ideal. I'll look into it today

@berkaysynnada
Copy link
Contributor

Hi @ch-sc. Do you plan to complete the work here? (as it's so almost)

@ch-sc
Copy link
Contributor Author

ch-sc commented Mar 17, 2025

Hi @berkaysynnada, I should be able to spend some time on this at the end of this week.

@github-actions github-actions bot added the functions Changes to functions implementation label Mar 26, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 26, 2025
@ch-sc
Copy link
Contributor Author

ch-sc commented Apr 9, 2025

Hi @berkaysynnada, can you take another look to move this forward? :)

@github-actions github-actions bot added the sql SQL Planner label Apr 9, 2025
@berkaysynnada
Copy link
Contributor

Hi @berkaysynnada, can you take another look to move this forward? :)

I will do but some tests are failing. I'll have some time probably in the evening or tomorrow morning

@berkaysynnada
Copy link
Contributor

Hi @ch-sc. Reviewing while tests are failing is not very efficient. Are you dealing with them?

@ch-sc
Copy link
Contributor Author

ch-sc commented Apr 11, 2025

Sorry @berkaysynnada, I got side-tracked from this yesterday. The test is fixed.

@berkaysynnada
Copy link
Contributor

Thank you @ch-sc. I will review it today

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again @ch-sc. I've looked the changes except cp_solver file. Your changes makes very sense overall, but I see some points to discuss. Let's continue iterating

col_stats: &ColumnStatistics,
col_index: usize,
) -> Result<Self> {
let field = schema.fields().get(col_index).ok_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we spread this check over all caller side code?

@@ -126,6 +126,25 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash {
not_impl_err!("Not implemented for {self}")
}

/// Checks support of bounds evaluation for this expression, before evaluating bounds.
/// Returns None if bounds evaluation is not supported.
fn evaluate_bounds_checked(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have supports_bounds_evaluation() there already. I think no need to introduce another API in PhysicalExpr. If we see lots of repetition of "first supports_bounds_evaluation, then evaluate_bounds" pattern, then we can make a standalone function, IMO

rhs.data_type()
);
};
BinaryTypeCoercer::new(&self.data_type(), &Operator::Plus, &rhs.data_type()).get_result_type()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check the result type is also supported by evaluate_bounds? WDYT

@@ -963,6 +961,23 @@ pub fn apply_operator(op: &Operator, lhs: &Interval, rhs: &Interval) -> Result<I
Operator::Minus => lhs.sub(rhs),
Operator::Multiply => lhs.mul(rhs),
Operator::Divide => lhs.div(rhs),
Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can apply_operator() return NullableInterval::Null or NullableInterval::MaybeNull here? isn't it strange for isDistinct and isNotDistinct?

@@ -1690,6 +1705,24 @@ impl Display for NullableInterval {
}
}

impl From<&Interval> for NullableInterval {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion is dangerous. Null means different things in Interval and NullableInterval. Interval's can be converted to NotNull types only in the current behavior. Where do we need such a conversion, and why? can you elaborate a bit

///
/// output interval: [`true`, `true`]
/// ```
fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous concern still applies. I think there should be 3 different paths of this logic:

  1. expr_bounds is singleton:
    check the inList expression separately. If any of them is also singleton, and has the same value with expr_bounds, then the result is CertainlyTrue
  2. expr_bounds is not a singleton => intersect expr_bounds with all inList expressions.
    a. if any of inList expressions intersect with expr_bounds, then the result is Uncertain
    b. if any of inList does not intersect with expr_bounds, then the result is CertainlyFalse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for filter selectivity for more expression types
2 participants