-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support computing statistics for FileGroup #15432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do you think it is a good idea to add another |
Sorry, I don't get it. Why does adding a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, as far as I can tell it is a refactor of existing logic, but made more flexible so we can get statistics at the global level as well as file group level. Nice 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 @suremarc and @jayzhan211
I think this is a good refactor and the code is well structured and documented.
Would it be possible to add some unit tests for compute_summary_statistics
? Something like:
- Create a Vec
- Call
compute_summary_statistics
- Verify the summarized statistics?
I think that would put us in a better position to continue working on the statistics collection code with confidence that we aren't breaking something, especially as we start relying on statistics more and more for correctness
ac5d16e
to
d8090fe
Compare
d8090fe
to
04bb1d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 -- this looks good to me
); | ||
assert_eq!( | ||
col_stats.min_value, | ||
Precision::Inexact(ScalarValue::Int32(Some(-10))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -106,7 +106,7 @@ pub struct PartitionedFile { | |||
/// | |||
/// DataFusion relies on these statistics for planning (in particular to sort file groups), | |||
/// so if they are incorrect, incorrect answers may result. | |||
pub statistics: Option<Statistics>, | |||
pub statistics: Option<Arc<Statistics>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for adding Arc
04bb1d6
to
e775493
Compare
Thanks for your review! Lets go |
For anyone else following along, this PR is part of a larger plan top optimize ORDER BY queries operating on pre-sorted inputs. See this ticket for more detail |
* Support computing statistics for FileGroup * avoid clone * add tests * fix conflicts
Which issue does this PR close?
FileGroup
structure forVec<PartitionedFile>
#15379Rationale for this change
Compute the
FileGroup
's statistics when computing all files' statisticsWhat changes are included in this PR?
Improving File Statistics Handling:
Replacing the
get_statistics_with_limit
function withget_files_with_limit and
compute_all_files_statistics`, make code logic clearer.Adding new functions to handle better statistics of
FileGroup
and all files, includingcompute_file_group_statistics
andcompute_all_files_statistics
. Also adding a generic functioncompute_summary_statistics
to compute statistics across multiple items that have statistics.Enhancing Documentation:
Are these changes tested?
Yes, by existing tests
Are there any user-facing changes?