Skip to content

perf: Reuse row converter during sort #15302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 30, 2025

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

This is a refactor towards #14748 and #7053

Rationale for this change

Arrow Row format speeds up comparison between multiple ORDER BY keys, and now it's only used in one special case that column-by-column comparison is not working, and a new converter will be constructed for each incoming RecordBatch.
This PR: A more efficient way is to construct a RowConverter when initializing the sort operator, and reuse the same converter during execution.
Note:

  • The old logic is kept: only using row format for one special case, enabling it by default requires more benchmarking and thus should be done as a follow-up.
  • Since we hope to use row format by default, for simplicity, this PR always constructs a converter when initializing ExternalSorter, instead of only do so for the special case.

What changes are included in this PR?

  • Construct a converter when initializing ExternalSorter
  • Reuse the row converter during execution

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No

.map(|expr| expr.evaluate_to_sort_column(&batch))
.collect::<Result<Vec<_>>>()?;

let sorted = if is_multi_column_with_lists(&sort_columns) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is took from sort_batch() and the same logic inside sort_batch() is kept unchanged. This is due to sort_batch() is a public interface and I want to avoid changing its behavior.
After we move to always using row format, we can clean it up by deprecating sort_batch()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @2010YOUY01 -- this makes sense. Did you run any benchmark numbers for this change?

It seems like we have an external aggregation benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#external-aggregation but not an external sorting benchmark 🤔

.collect::<Result<Vec<_>>>()
.expect("Valid sort fields");

let converter = RowConverter::new(sort_fields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is probably good to return a runtime error here rather than panic'ing (for example if someone tried to sort an REE array or UnionArray it might panic)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 399966b, I should have noticed it

@2010YOUY01
Copy link
Contributor Author

Thanks @2010YOUY01 -- this makes sense. Did you run any benchmark numbers for this change?

Thank you for the review! Now RowConverter is only used when sort key includes a List type, I have run the sort_tpch benchmark and verified the run time is unchanged.

It seems like we have an external aggregation benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#external-aggregation but not an external sorting benchmark 🤔

After Rows are used by default for sorting more benchmarking is definitely required, I think for external sorting, an easy way to extend the benchmark will be:

  1. Profile each query in sort_tpch benchmark for memory consumption
  2. Include a new configuration --memory-limit-tier to let each query run in 50%, 20% of the actual memory consumption, and see how the performance change

.map(|expr| expr.evaluate_to_sort_column(&batch))
.collect::<Result<Vec<_>>>()?;

let sorted = if is_multi_column_with_lists(&sort_columns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good heuristic is to use RowConverter for multi-column + no limit cases as documented in lexsort_to_indices

/// Note: for multi-column sorts without a limit, using the [row format](https://docs.rs/arrow-row/latest/arrow_row/)
/// may be significantly faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, now always using row format shows a mixed benchmark result, because converted rows are not reused during sorting and sort-preserving merging, and the conversion overhead outweighs the benefits.
We'd better do it as a follow-up with more benchmark analysis.

@alamb
Copy link
Contributor

alamb commented Mar 26, 2025

There appears to be a change to the testing pin in this PR as well which will cause the extended tests on main to fail

Screenshot 2025-03-26 at 4 37 14 PM

@2010YOUY01 2010YOUY01 force-pushed the sort-reuse-row-converter branch from f923838 to 0ee4fd1 Compare March 27, 2025 03:20
@2010YOUY01
Copy link
Contributor Author

The test submodule issue should be fixed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @2010YOUY01 and @Dandandan

.map(|expr| expr.evaluate_to_sort_column(&batch))
.collect::<Result<Vec<_>>>()?;

let sorted = if is_multi_column_with_lists(&sort_columns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

@2010YOUY01 2010YOUY01 merged commit 14635da into apache:main Mar 30, 2025
27 checks passed
2010YOUY01 added a commit that referenced this pull request Mar 30, 2025
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* reuse row converter during sort

* review

* update submodule pin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants