-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Consolidate statistics merging code (try 2) #15661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -296,6 +311,24 @@ impl Statistics { | |||
.collect() | |||
} | |||
|
|||
/// Set the number of rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made these methods to reduce the boiler plate in the examples (otherwise I had to explicitly provide Precision::Absent
for every field. This let's the examples be simpler)
/// | ||
/// # Example | ||
/// ``` | ||
/// # use datafusion_common::{ColumnStatistics, ScalarValue, Statistics}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is an example showing how merge works
@@ -521,6 +649,36 @@ impl ColumnStatistics { | |||
} | |||
} | |||
|
|||
/// Set the null count | |||
pub fn with_null_count(mut self, null_count: Precision<usize>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, adding these setters to make it easier (less verbose) to write examples and tests
@@ -798,4 +958,193 @@ mod tests { | |||
distinct_count: Precision::Exact(100), | |||
} | |||
} | |||
|
|||
#[test] | |||
fn test_try_merge_basic() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests were moved from datasource
} | ||
|
||
#[test] | ||
fn test_try_merge_mismatched_size() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mismatched size is a new test
@@ -409,62 +406,6 @@ pub async fn get_statistics_with_limit( | |||
Ok((result_files, statistics)) | |||
} | |||
|
|||
/// Generic function to compute statistics across multiple items that have statistics | |||
fn compute_summary_statistics<T, I>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this PR is to remove this function and replace it with Statistics::try_merge_iter
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are moved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the PR still has the issue that is mentioned here: xudong963#5 (comment).
But from the comment: #15503, I think you plan to fix the issue after merging the PR.
Thank you
Yes I think you are right -- however I don't think this PR makes the problem worse (or better). I have filed this ticket to track: @xudong963 do you have time to look at that one? |
Sure, sorry for the late reply, I had a headache this weekend, so off my computer. |
I hope you feel better! |
Which issue does this PR close?
Precision
combination code in favor ofPrecision::min/max/add
#15659statistics_by_partition API
to ExecutionPlan #15503statistics_by_partition
API toExecutionPlan
#15495Rationale for this change
As @xudong963 works on getting statistics code into better shape we have to move the code to combine multiple
Statistics
objects (compute_summary_statistics
) and make it public so it can be reused (see #15503)While we are doing this, lets consolidate the functionality into
Statistics
where it isWhat changes are included in this PR?
Statistics::merge_iter
Statistics::merge
compute_summary_statistics
Are these changes tested?
Yes, by existing tests and newly added doc test and unit test
Are there any user-facing changes?