-
Notifications
You must be signed in to change notification settings - Fork 1.5k
perf: Reuse row converter during sort #15302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
.map(|expr| expr.evaluate_to_sort_column(&batch)) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
let sorted = if is_multi_column_with_lists(&sort_columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is took from sort_batch()
and the same logic inside sort_batch()
is kept unchanged. This is due to sort_batch()
is a public interface and I want to avoid changing its behavior.
After we move to always using row format, we can clean it up by deprecating sort_batch()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @2010YOUY01 -- this makes sense. Did you run any benchmark numbers for this change?
It seems like we have an external aggregation benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#external-aggregation but not an external sorting benchmark 🤔
.collect::<Result<Vec<_>>>() | ||
.expect("Valid sort fields"); | ||
|
||
let converter = RowConverter::new(sort_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is probably good to return a runtime error here rather than panic'ing (for example if someone tried to sort an REE array or UnionArray it might panic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 399966b, I should have noticed it
Thank you for the review! Now
After
|
.map(|expr| expr.evaluate_to_sort_column(&batch)) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
let sorted = if is_multi_column_with_lists(&sort_columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a good heuristic is to use RowConverter for multi-column + no limit cases as documented in lexsort_to_indices
/// Note: for multi-column sorts without a limit, using the [row format](https://docs.rs/arrow-row/latest/arrow_row/)
/// may be significantly faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, now always using row format shows a mixed benchmark result, because converted rows are not reused during sorting and sort-preserving merging, and the conversion overhead outweighs the benefits.
We'd better do it as a follow-up with more benchmark analysis.
f923838
to
0ee4fd1
Compare
The test submodule issue should be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @2010YOUY01 and @Dandandan
.map(|expr| expr.evaluate_to_sort_column(&batch)) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
let sorted = if is_multi_column_with_lists(&sort_columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please
This reverts commit 14635da.
* reuse row converter during sort * review * update submodule pin
Which issue does this PR close?
This is a refactor towards #14748 and #7053
Rationale for this change
Arrow Row format speeds up comparison between multiple ORDER BY keys, and now it's only used in one special case that column-by-column comparison is not working, and a new converter will be constructed for each incoming
RecordBatch
.This PR: A more efficient way is to construct a
RowConverter
when initializing the sort operator, and reuse the same converter during execution.Note:
ExternalSorter
, instead of only do so for the special case.What changes are included in this PR?
ExternalSorter
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No