perf: Reuse row converter during sort #15302

2010YOUY01 · 2025-03-19T06:56:29Z

Which issue does this PR close?

This is a refactor towards #14748 and #7053

Rationale for this change

The Arrow row format speeds up comparisons across multiple ORDER BY keys. Currently it is used in only one special case, where column-by-column comparison does not work, and a new converter is constructed for each incoming RecordBatch.
This PR: a more efficient approach is to construct a RowConverter when initializing the sort operator and reuse the same converter during execution.
Note:

The old logic is kept: the row format is still used only for that one special case. Enabling it by default requires more benchmarking and should be done as a follow-up.
Since we hope to use the row format by default, for simplicity this PR always constructs a converter when initializing ExternalSorter, instead of only doing so for the special case.

What changes are included in this PR?

Construct a converter when initializing ExternalSorter
Reuse the row converter during execution
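
To illustrate the idea of building a converter once and reusing it for every batch, here is a minimal std-only sketch. It does not use the real arrow-row API: `ToyRowConverter`, `ToySorter`, and `SortOptions` are hypothetical stand-ins, and the encoding handles only `i32` key columns.

```rust
/// Sort direction for one key column (hypothetical, for illustration).
#[derive(Clone, Copy)]
pub struct SortOptions {
    pub descending: bool,
}

/// Toy stand-in for a row converter: it encodes several i32 key columns
/// into one byte string per row, so rows compare with a plain byte comparison.
/// This is NOT the arrow-row API.
pub struct ToyRowConverter {
    options: Vec<SortOptions>,
}

impl ToyRowConverter {
    pub fn new(options: Vec<SortOptions>) -> Self {
        Self { options }
    }

    /// Encode a batch given as one Vec<i32> per key column.
    pub fn convert_columns(&self, columns: &[Vec<i32>]) -> Result<Vec<Vec<u8>>, String> {
        if columns.len() != self.options.len() {
            return Err(format!(
                "expected {} columns, got {}",
                self.options.len(),
                columns.len()
            ));
        }
        let num_rows = columns.first().map_or(0, |c| c.len());
        if columns.iter().any(|c| c.len() != num_rows) {
            return Err("columns have different lengths".to_string());
        }
        let mut rows: Vec<Vec<u8>> = vec![Vec::new(); num_rows];
        for (col, opt) in columns.iter().zip(&self.options) {
            for (row, value) in rows.iter_mut().zip(col) {
                // Flip the sign bit so big-endian bytes order like signed integers.
                let mut bytes = ((*value as u32) ^ 0x8000_0000).to_be_bytes();
                if opt.descending {
                    // Invert the bytes to reverse the order for this column.
                    for b in bytes.iter_mut() {
                        *b = !*b;
                    }
                }
                row.extend_from_slice(&bytes);
            }
        }
        Ok(rows)
    }
}

/// Sorter that constructs its converter once, at initialization.
pub struct ToySorter {
    converter: ToyRowConverter,
}

impl ToySorter {
    pub fn new(options: Vec<SortOptions>) -> Self {
        Self { converter: ToyRowConverter::new(options) }
    }

    /// Returns the row indices of one batch in sorted order,
    /// reusing the stored converter instead of building a new one.
    pub fn sort_indices(&self, columns: &[Vec<i32>]) -> Result<Vec<usize>, String> {
        let rows = self.converter.convert_columns(columns)?;
        let mut indices: Vec<usize> = (0..rows.len()).collect();
        indices.sort_by(|&a, &b| rows[a].cmp(&rows[b]));
        Ok(indices)
    }
}
```

A caller would create one `ToySorter` and call `sort_indices` for each incoming batch; the per-batch cost is then only the encoding and the sort, not the converter setup.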

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No

2010YOUY01 · 2025-03-19T06:59:54Z

datafusion/physical-plan/src/sorts/sort.rs

+                .map(|expr| expr.evaluate_to_sort_column(&batch))
+                .collect::<Result<Vec<_>>>()?;
+
+            let sorted = if is_multi_column_with_lists(&sort_columns) {


This logic is taken from sort_batch(), and the same logic inside sort_batch() is kept unchanged, because sort_batch() is a public interface and I want to avoid changing its behavior.
After we move to always using the row format, we can clean it up by deprecating sort_batch().

alamb

Thanks @2010YOUY01 -- this makes sense. Did you run any benchmark numbers for this change?

It seems like we have an external aggregation benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#external-aggregation but not an external sorting benchmark 🤔

alamb · 2025-03-19T19:11:19Z

datafusion/physical-plan/src/sorts/sort.rs

+            .collect::<Result<Vec<_>>>()
+            .expect("Valid sort fields");
+
+        let converter = RowConverter::new(sort_fields)


It is probably better to return a runtime error here rather than panicking (for example, if someone tried to sort an REE array or UnionArray, it might panic).

Updated in 399966b, I should have noticed it
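
The pattern suggested in this review, propagating a setup error with `?` instead of calling `.expect(...)`, can be sketched as follows. This is a std-only illustration, not the code from 399966b: `ToyType`, `SortSetupError`, `ToyConverter`, and `FallibleSorter` are hypothetical names, and which types are rejected is an assumption made for the example.

```rust
use std::fmt;

/// Hypothetical subset of data types, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyType {
    Int32,
    Utf8,
    RunEndEncoded,
    Union,
}

/// Error returned when the sorter cannot be set up.
#[derive(Debug, PartialEq)]
pub enum SortSetupError {
    UnsupportedType(ToyType),
}

impl fmt::Display for SortSetupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SortSetupError::UnsupportedType(t) => {
                write!(f, "unsupported sort key type: {:?}", t)
            }
        }
    }
}

impl std::error::Error for SortSetupError {}

/// Toy converter whose constructor can fail (assumed behaviour).
pub struct ToyConverter {
    fields: Vec<ToyType>,
}

impl ToyConverter {
    pub fn try_new(fields: Vec<ToyType>) -> Result<Self, SortSetupError> {
        for field in &fields {
            if matches!(field, ToyType::RunEndEncoded | ToyType::Union) {
                return Err(SortSetupError::UnsupportedType(field.clone()));
            }
        }
        Ok(Self { fields })
    }
}

pub struct FallibleSorter {
    converter: ToyConverter,
}

impl FallibleSorter {
    pub fn try_new(fields: Vec<ToyType>) -> Result<Self, SortSetupError> {
        // `?` hands the error back to the caller;
        // `.expect("Valid sort fields")` at this point would panic instead.
        let converter = ToyConverter::try_new(fields)?;
        Ok(Self { converter })
    }

    pub fn num_sort_fields(&self) -> usize {
        self.converter.fields.len()
    }
}
```

With this shape, an unsupported sort key surfaces as an ordinary error that the query engine can report, rather than aborting the process.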

2010YOUY01 · 2025-03-20T03:45:26Z

Thanks @2010YOUY01 -- this makes sense. Did you run any benchmark numbers for this change?

Thank you for the review! RowConverter is currently used only when the sort key includes a List type. I have run the sort_tpch benchmark and verified that the run time is unchanged.

It seems like we have an external aggregation benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#external-aggregation but not an external sorting benchmark 🤔

Once Rows are used by default for sorting, more benchmarking is definitely required. For external sorting, I think an easy way to extend the benchmark would be:

Profile each query in the sort_tpch benchmark for memory consumption
Add a new configuration, --memory-limit-tier, that runs each query with 50% and 20% of its actual memory consumption, to see how the performance changes
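
The arithmetic behind the proposed tiers is simple; a minimal sketch follows. The function name `memory_limits_for_tiers` is hypothetical, and the --memory-limit-tier option is only a proposal in this thread, not an existing benchmark flag.

```rust
/// Derive memory limits (in bytes) from a query's profiled peak memory usage.
/// `tiers_percent` holds percentages of the peak, such as 50 and 20.
pub fn memory_limits_for_tiers(peak_bytes: u64, tiers_percent: &[u64]) -> Vec<u64> {
    tiers_percent
        .iter()
        // Widen to u128 so that very large peaks cannot overflow the multiplication.
        .map(|pct| ((peak_bytes as u128 * *pct as u128) / 100) as u64)
        .collect()
}
```

For a query that peaks at 1 GB, the 50% and 20% tiers would give limits of 500 MB and 200 MB, which should force progressively more spilling.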

Dandandan · 2025-03-20T14:18:19Z

datafusion/physical-plan/src/sorts/sort.rs

+                .map(|expr| expr.evaluate_to_sort_column(&batch))
+                .collect::<Result<Vec<_>>>()?;
+
+            let sorted = if is_multi_column_with_lists(&sort_columns) {


Probably a good heuristic is to use RowConverter for multi-column + no limit cases as documented in lexsort_to_indices

/// Note: for multi-column sorts without a limit, using the [row format](https://docs.rs/arrow-row/latest/arrow_row/)
/// may be significantly faster

Good point. Right now, always using the row format shows mixed benchmark results, because converted rows are not reused between sorting and sort-preserving merging, so the conversion overhead outweighs the benefits.
We had better do it as a follow-up, with more benchmark analysis.
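
The heuristic suggested above (row format for multi-column sorts without a limit) can be written as a small decision function. This is a sketch of the suggestion only, not what this PR implements; `SortStrategy` and `choose_strategy` are hypothetical names.

```rust
/// Which comparison strategy to use for a sort (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
pub enum SortStrategy {
    RowFormat,
    ColumnByColumn,
}

/// Pick the row format only for multi-column sorts that have no limit (`fetch`).
pub fn choose_strategy(num_sort_columns: usize, fetch: Option<usize>) -> SortStrategy {
    if num_sort_columns > 1 && fetch.is_none() {
        SortStrategy::RowFormat
    } else {
        SortStrategy::ColumnByColumn
    }
}
```

Whether this rule pays off in practice depends on the conversion overhead described in the reply above, so it would need benchmarking before being enabled.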

alamb · 2025-03-26T20:37:54Z

There appears to be a change to the testing pin in this PR as well which will cause the extended tests on main to fail

2010YOUY01 · 2025-03-27T03:21:15Z

The test submodule issue should be fixed.

alamb

Thanks @2010YOUY01 and @Dandandan

alamb · 2025-03-27T19:04:30Z

datafusion/physical-plan/src/sorts/sort.rs

+                .map(|expr| expr.evaluate_to_sort_column(&batch))
+                .collect::<Result<Vec<_>>>()?;
+
+            let sorted = if is_multi_column_with_lists(&sort_columns) {


This reverts commit 14635da.

* reuse row converter during sort
* review
* update submodule pin

reuse row converter during sort

8c8f6d9

2010YOUY01 commented Mar 19, 2025

alamb reviewed Mar 19, 2025

review

399966b

Dandandan reviewed Mar 20, 2025

Merge branch 'main' into sort-reuse-row-converter

41ce267

update submodule pin

0ee4fd1

2010YOUY01 force-pushed the sort-reuse-row-converter branch from f923838 to 0ee4fd1 Compare March 27, 2025 03:20

alamb approved these changes Mar 27, 2025

2010YOUY01 merged commit 14635da into apache:main Mar 30, 2025
27 checks passed

2010YOUY01 added a commit that referenced this pull request Mar 30, 2025

Revert "perf: Reuse row converter during sort (#15302)"

433a50a

This reverts commit 14635da.

This was referenced Mar 30, 2025

Revert "perf: Reuse row converter during sort" due to panic in extended tests #15494

Closed

custom_datasource example panicked during RepartitionExec planning #15493

Closed

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025

perf: Reuse row converter during sort (apache#15302)

da709df

* reuse row converter during sort
* review
* update submodule pin
