Skip to content

Consolidate statistics merging code (try 2) #15661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 11, 2025
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Apr 9, 2025

Which issue does this PR close?

Rationale for this change

As @xudong963 works on getting statistics code into better shape we have to move the code to combine multiple Statistics objects (compute_summary_statistics) and make it public so it can be reused (see #15503)

While we are doing this, lets consolidate the functionality into Statistics where it is

  1. Easier to discover (and thus so we don't end up with multiple copies)
  2. Easier to test (e.g for statistics with different schemas)
  3. Easier to document

What changes are included in this PR?

  1. Add Statistics::merge_iter
  2. Add Statistics::merge
  3. Remove compute_summary_statistics

Are these changes tested?

Yes, by existing tests and newly added doc test and unit test

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels Apr 9, 2025
@xudong963 xudong963 self-requested a review April 10, 2025 12:03
@alamb alamb force-pushed the alamb/add_merge branch from 97ca42f to 95f7235 Compare April 10, 2025 18:14
@github-actions github-actions bot removed the core Core DataFusion crate label Apr 10, 2025
@@ -296,6 +311,24 @@ impl Statistics {
.collect()
}

/// Set the number of rows
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made these methods to reduce the boiler plate in the examples (otherwise I had to explicitly provide Precision::Absent for every field. This let's the examples be simpler)

///
/// # Example
/// ```
/// # use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is an example showing how merge works

@@ -521,6 +649,36 @@ impl ColumnStatistics {
}
}

/// Set the null count
pub fn with_null_count(mut self, null_count: Precision<usize>) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, adding these setters to make it easier (less verbose) to write examples and tests

@@ -798,4 +958,193 @@ mod tests {
distinct_count: Precision::Exact(100),
}
}

#[test]
fn test_try_merge_basic() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests were moved from datasource

}

#[test]
fn test_try_merge_mismatched_size() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mismatched size is a new test

@@ -409,62 +406,6 @@ pub async fn get_statistics_with_limit(
Ok((result_files, statistics))
}

/// Generic function to compute statistics across multiple items that have statistics
fn compute_summary_statistics<T, I>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is to remove this function and replace it with Statistics::try_merge_iter

}

#[cfg(test)]
mod tests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are moved

@alamb alamb marked this pull request as ready for review April 10, 2025 18:30
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the PR still has the issue that is mentioned here: xudong963#5 (comment).

But from the comment: #15503, I think you plan to fix the issue after merging the PR.

Thank you

@alamb
Copy link
Contributor Author

alamb commented Apr 11, 2025

It seems that the PR still has the issue that is mentioned here: xudong963#5 (comment).

Yes I think you are right -- however I don't think this PR makes the problem worse (or better).

I have filed this ticket to track:

@xudong963 do you have time to look at that one?

@alamb alamb merged commit 6380556 into apache:main Apr 11, 2025
28 checks passed
@alamb alamb deleted the alamb/add_merge branch April 11, 2025 18:08
@xudong963
Copy link
Member

do you have time to look at that one?

Sure, sorry for the late reply, I had a headache this weekend, so off my computer.

@alamb
Copy link
Contributor Author

alamb commented Apr 14, 2025

I hope you feel better!

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants