Support computing statistics for FileGroup #15432

Merged
merged 4 commits into from
Apr 1, 2025

Conversation

xudong963
Member

@xudong963 xudong963 commented Mar 26, 2025

Which issue does this PR close?

Rationale for this change

Compute the FileGroup's statistics when computing all files' statistics

What changes are included in this PR?

Improving File Statistics Handling:

  1. Replacing the get_statistics_with_limit function with get_files_with_limit and compute_all_files_statistics, which makes the code logic clearer.

  2. Adding new functions to better handle statistics for a FileGroup and for all files, including compute_file_group_statistics and compute_all_files_statistics. Also adding a generic function, compute_summary_statistics, to compute statistics across multiple items that have statistics.
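
To illustrate the two levels of statistics described above, here is a minimal sketch that computes statistics per file group and then across all groups. It does not use DataFusion's real types or signatures: `Precision`, `Stats`, `File`, `FileGroup`, `summarize`, and `compute_groups_and_total` are simplified, hypothetical stand-ins, and the precision rules are an assumption made for this example.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's actual types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Add two row counts. Assumed rule for this sketch: the sum is exact only
    /// if both inputs are exact, and absent if either input is absent.
    fn add(self, other: Precision) -> Precision {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a + b),
            (Exact(a) | Inexact(a), Exact(b) | Inexact(b)) => Inexact(a + b),
            _ => Absent,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Precision,
}

struct File {
    stats: Stats,
}

struct FileGroup {
    files: Vec<File>,
    /// Filled in by `compute_groups_and_total`.
    stats: Option<Stats>,
}

/// Generic summary over any sequence of statistics.
/// An empty input yields an unknown (absent) row count.
fn summarize<'a, I>(stats: I) -> Stats
where
    I: IntoIterator<Item = &'a Stats>,
{
    let mut iter = stats.into_iter();
    match iter.next() {
        None => Stats { num_rows: Precision::Absent },
        Some(first) => iter.fold(first.clone(), |acc, s| Stats {
            num_rows: acc.num_rows.add(s.num_rows),
        }),
    }
}

/// Store a summary on each group, then summarize the group summaries.
fn compute_groups_and_total(groups: &mut [FileGroup]) -> Stats {
    for group in groups.iter_mut() {
        group.stats = Some(summarize(group.files.iter().map(|f| &f.stats)));
    }
    summarize(groups.iter().filter_map(|g| g.stats.as_ref()))
}
```

Because the same `summarize` helper serves both levels, the per-group and overall results cannot drift apart, which is the kind of reuse a generic summary function allows.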

Enhancing Documentation:

  1. Adding detailed comments and documentation for new functions to explain their purpose and usage.

Are these changes tested?

Yes, by existing tests

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate datasource Changes to the datasource crate labels Mar 26, 2025
@xudong963 xudong963 added the api change Changes the API exposed to users of the crate label Mar 26, 2025
@jayzhan211
Contributor

Do you think it is a good idea to add another FileGroups struct and compute the statistics across all the files when we create FileGroups, to be less error-prone (for example, accidentally getting incorrect statistics)?

@xudong963
Member Author

> when we create FileGroups to be less error-prone (accidentally get the incorrect statistics)

Sorry, I don't get it. Why would adding a new FileGroups struct make this less error-prone?

Contributor

@suremarc suremarc left a comment

LGTM. As far as I can tell, it is a refactor of existing logic, made more flexible so that we can get statistics at the global level as well as at the file group level. Nice 👍

Contributor

@alamb alamb left a comment

Thank you @xudong963 @suremarc and @jayzhan211

I think this is a good refactor and the code is well structured and documented.

Would it be possible to add some unit tests for compute_summary_statistics? Something like:

  1. Create a Vec
  2. Call compute_summary_statistics
  3. Verify the summarized statistics?

I think that would put us in a better position to continue working on the statistics collection code with confidence that we aren't breaking something, especially as we start relying on statistics more and more for correctness.
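
The three steps suggested above (build a Vec, call the summary function, verify the result) might look roughly like the sketch below. It is self-contained and does not call the real compute_summary_statistics: `Precision`, `ColStats`, and `summarize_column` are hypothetical stand-ins, and the rule that mixing exact and inexact inputs yields an inexact result is an assumption of this example, so DataFusion's actual precision rules may differ.

```rust
// Hypothetical stand-ins for illustration; not DataFusion's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T: Copy + Ord> Precision<T> {
    /// Combine two values with `pick`, keeping `Exact` only if both are exact.
    fn combine(self, other: Self, pick: fn(T, T) -> T) -> Self {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(pick(a, b)),
            (Exact(a) | Inexact(a), Exact(b) | Inexact(b)) => Inexact(pick(a, b)),
            _ => Absent,
        }
    }

    fn min_with(self, other: Self) -> Self {
        self.combine(other, std::cmp::min)
    }

    fn max_with(self, other: Self) -> Self {
        self.combine(other, std::cmp::max)
    }
}

/// Per-column statistics for a single file (simplified to i32 values).
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColStats {
    min: Precision<i32>,
    max: Precision<i32>,
}

/// Summarize one column across several files; `None` for an empty input.
fn summarize_column(stats: &[ColStats]) -> Option<ColStats> {
    stats.iter().copied().reduce(|acc, s| ColStats {
        min: acc.min.min_with(s.min),
        max: acc.max.max_with(s.max),
    })
}

#[test]
fn summary_takes_min_and_max() {
    // 1. Create a Vec of per-file statistics.
    let files = vec![
        ColStats { min: Precision::Exact(-10), max: Precision::Exact(5) },
        ColStats { min: Precision::Inexact(0), max: Precision::Exact(20) },
    ];
    // 2. Call the summary function.
    let summary = summarize_column(&files).expect("non-empty input");
    // 3. Verify the summarized statistics.
    assert_eq!(summary.min, Precision::Inexact(-10));
    assert_eq!(summary.max, Precision::Exact(20));
}
```

A test of this shape pins down both the merged values and their precision, which matters once planning decisions depend on whether a bound is exact.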

@xudong963
Member Author

xudong963 commented Mar 30, 2025

> Would it be possible to add some unit tests for compute_summary_statistics? Something like:

Thanks @alamb! I'm cooking it

Done in d8090fe

Contributor

@alamb alamb left a comment

Thank you @xudong963 -- this looks good to me

);
assert_eq!(
col_stats.min_value,
Precision::Inexact(ScalarValue::Int32(Some(-10)))
Contributor

👍

@@ -106,7 +106,7 @@ pub struct PartitionedFile {
     ///
     /// DataFusion relies on these statistics for planning (in particular to sort file groups),
     /// so if they are incorrect, incorrect answers may result.
-    pub statistics: Option<Statistics>,
+    pub statistics: Option<Arc<Statistics>>,
Contributor

💯 for adding Arc
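
As a rough illustration of why wrapping the field in an Arc helps, the sketch below shows that handing out a file's statistics only bumps a reference count instead of deep-copying the data. The `Statistics` and `PartitionedFile` structs here are trimmed-down, hypothetical stand-ins with made-up fields (only the `statistics: Option<Arc<Statistics>>` field mirrors the change above), and `shared_statistics` is a helper invented for this example.

```rust
use std::sync::Arc;

// Hypothetical stand-in; the real Statistics type carries much more data.
#[derive(Debug)]
struct Statistics {
    num_rows: usize,
}

// Trimmed-down stand-in for illustration only.
struct PartitionedFile {
    path: String,
    statistics: Option<Arc<Statistics>>,
}

/// Hand out the file's statistics. Cloning an `Option<Arc<_>>` copies a
/// pointer and increments a reference count; the `Statistics` value itself
/// is not duplicated.
fn shared_statistics(file: &PartitionedFile) -> Option<Arc<Statistics>> {
    file.statistics.clone()
}
```

With a plain `Option<Statistics>`, the same call would have to clone the whole value each time it is passed to another owner.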

@xudong963 xudong963 mentioned this pull request Apr 1, 2025
@xudong963
Member Author

Thanks for your review! Let's go

@xudong963 xudong963 merged commit 507f6b6 into apache:main Apr 1, 2025
27 checks passed
@alamb
Contributor

alamb commented Apr 1, 2025

For anyone else following along, this PR is part of a larger plan to optimize ORDER BY queries operating on pre-sorted inputs. See this ticket for more detail.

@alamb alamb mentioned this pull request Apr 14, 2025
9 tasks
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* Support computing statistics for FileGroup

* avoid clone

* add tests

* fix conflicts
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate
5 participants