fix: External sort failing on an edge case #15017

2010YOUY01 · 2025-03-05T09:52:23Z

Which issue does this PR close?

Closes #.

Rationale for this change

I came across a sorting query that fails under a memory limit. Here is a simpler reproducer (running datafusion-cli at commit 7597769):

# Compile datafusion-cli
cargo run --profile release-nonlto -- --mem-pool-type fair -m 10M

Two sort queries are executed: Q1 runs without issue; Q2 has a smaller input than Q1, yet it fails.

DataFusion CLI v46.0.0
> set datafusion.execution.sort_spill_reservation_bytes = 3000000;
0 row(s) fetched.
Elapsed 0.001 seconds.

> select * from generate_series(1,10000000) as t1(v1) order by v1;
...Query succeeds

> select * from generate_series(1,9000000) as t1(v1) order by v1;
Resources exhausted: Failed to allocate additional 65536 bytes for ExternalSorterMerge[0] with 0 bytes already allocated for this reservation - 49152 bytes remain available for the total pool

Query failure reason

At the final stage of sorting, all buffered in-memory batches and all spilled files are sort-preserving merged at the same time, and this is what causes the failure.

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Lines 342 to 355 in 7597769

    
    let mut streams = vec![];
    if !self.in_mem_batches.is_empty() {
        let in_mem_stream =
            self.in_mem_sort_stream(self.metrics.baseline.intermediate())?;
        streams.push(in_mem_stream);
    }

    for spill in self.spills.drain(..) {
        if !spill.path().exists() {
            return internal_err!("Spill file {:?} does not exist", spill.path());
        }
        let stream = read_spill_as_stream(spill, Arc::clone(&self.schema), 2)?;
        streams.push(stream);
    }

For example, consider a workload executing in a single partition, whose memory limit can hold 10 batches.

Sorting 100 batches succeeds:
- Every time 10 batches are read, the memory pool is full and one spill file is written to disk.
- At the end there are 10 spill files. Only one batch from each file needs to be in memory at a time, so there is enough memory budget for the final merge.

Sorting 49 batches fails:
- When the input is exhausted, there are 9 in-memory batches and 4 spill files. The final merge needs 9 + 4 = 13 batches in memory at once, which exceeds the memory pool limit of around 10 batches.
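
The arithmetic in the two cases above can be checked with a small model. This is only a sketch under simplifying assumptions, not DataFusion's actual memory accounting: every batch is assumed to have the same size, the pool is assumed to hold exactly `pool_batches` batches, a spill is assumed to happen whenever the buffer fills, and the final merge is assumed to need every buffered batch plus one batch per spill file. All names below are hypothetical.

```rust
/// State of the sorter once the input is exhausted (toy model).
#[derive(Debug, PartialEq)]
pub struct SortState {
    pub in_mem_batches: usize,
    pub spill_files: usize,
}

/// Feed `input_batches` equal-sized batches through a buffer that holds
/// `pool_batches` batches; a full buffer is written out as one spill file.
pub fn simulate_input(input_batches: usize, pool_batches: usize) -> SortState {
    let mut state = SortState { in_mem_batches: 0, spill_files: 0 };
    for _ in 0..input_batches {
        state.in_mem_batches += 1;
        if state.in_mem_batches == pool_batches {
            state.spill_files += 1;
            state.in_mem_batches = 0;
        }
    }
    state
}

/// Batches that must be resident at once when buffered batches and spill
/// files are merged together: every buffered batch plus one per spill file.
pub fn merge_demand(state: &SortState) -> usize {
    state.in_mem_batches + state.spill_files
}

/// Whether the final merge fits in a pool of `pool_batches` batches.
pub fn final_merge_fits(state: &SortState, pool_batches: usize) -> bool {
    merge_demand(state) <= pool_batches
}
```

Under these assumptions, `simulate_input(100, 10)` ends with 10 spill files and an empty buffer (demand 10, which fits), while `simulate_input(49, 10)` ends with 9 buffered batches and 4 spill files (demand 13, which does not).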

A common workaround, I believe, is to increase datafusion.execution.sort_spill_reservation_bytes, which reserves extra space for merging. However, row and batch sizes vary drastically across workloads, and the in-memory batches may have almost reached the memory limit without yet triggering a spill, so this parameter is very tricky to configure correctly.
To make this simpler, we can always spill the in-memory batches at the final stage (if the sort has spilled previously).

What changes are included in this PR?

Change the final sort-preserving merge logic of sorting: when it has spilled before, always spill all in-mem batches first, then start the merging phase.
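
The control flow described above can be illustrated with a toy model. This is a hedged sketch, not the code in this PR: `ToySorter`, its fields, and its methods are hypothetical stand-ins, batches are plain `Vec<i64>`, and a "spill file" is just a sorted vector kept in memory.

```rust
/// Toy stand-in for the sorter state. Hypothetical; not DataFusion's `ExternalSorter`.
pub struct ToySorter {
    /// Unsorted batches still buffered in memory.
    pub in_mem_batches: Vec<Vec<i64>>,
    /// Each inner `Vec` models one sorted spill file.
    pub spills: Vec<Vec<i64>>,
}

impl ToySorter {
    /// True if at least one spill file has been written.
    pub fn spilled_before(&self) -> bool {
        !self.spills.is_empty()
    }

    /// Sort all buffered batches into one run and "write" it as a new spill file.
    fn sort_and_spill_in_mem(&mut self) {
        let mut run: Vec<i64> = self.in_mem_batches.drain(..).flatten().collect();
        run.sort_unstable();
        self.spills.push(run);
    }

    /// Prepare the final merge and return
    /// `(buffered batches to merge, spill streams to merge)`.
    pub fn prepare_final_merge(&mut self) -> (usize, usize) {
        if self.spilled_before() {
            // Behaviour described in this PR: flush the buffer first, so the
            // merge phase only reads from spill files.
            if !self.in_mem_batches.is_empty() {
                self.sort_and_spill_in_mem();
            }
            (0, self.spills.len())
        } else {
            // Nothing was spilled: everything is sorted in memory.
            (self.in_mem_batches.len(), 0)
        }
    }
}
```

In this model, a sorter holding 9 buffered batches and 4 spill files ends up with 5 spill streams and no buffered batches, so a merge that keeps one batch per stream resident would need about 5 batches instead of 13.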

Are these changes tested?

A regression test is added.

Are there any user-facing changes?

No

alamb

Thank you @2010YOUY01

While this likely will result in slightly slower performance in some cases (as there is additional spilling) making sure the queries won't error seems like a very valuable property.

Change the final sort-preserving merge logic of sorting: when it has spilled before, always spill all in-mem batches first, then start the merging phase.

Thank you for the super clear writeup and code. This PR was a pleasure to read.

alamb · 2025-03-05T11:38:11Z

datafusion/core/tests/memory_limit/mod.rs

@@ -468,6 +468,31 @@ async fn test_stringview_external_sort() {
    let _ = df.collect().await.expect("Query execution failed");
 }

+/// This test case is for the regression case:


I don't understand the reference here to "regression" (which refers normally to something that stopped working when it worked before)

Maybe a better description would be "test_in_mem_buffer_almost_full" or something 🤔

I agree, updated.

alamb · 2025-03-05T11:39:12Z

datafusion/physical-plan/src/sorts/sort.rs

        // Release the memory reserved for merge back to the pool so
        // there is some left when `in_mem_sort_stream` requests an
        // allocation.
        self.merge_reservation.free();

        if self.spilled_before() {
            let mut streams = vec![];
+
+            // Sort `in_mem_batches` and spill it first. If there are many


2010YOUY01 · 2025-03-05T11:55:53Z

Thank you @2010YOUY01

While this likely will result in slightly slower performance in some cases (as there is additional spilling) making sure the queries won't error seems like a very valuable property.

Thank you for the review.
For efficiency, it's possible to avoid the final spill by checking the number of buffered batches and the spilled file count, but this is better done after we have accurate memory estimation (#14748). I'll update that issue with this idea.
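
A rough sketch of what such a check might look like, assuming byte sizes were known accurately (which is exactly what the comment above says is not yet the case). The function name and all parameters are hypothetical, and the per-stream footprint is an assumed constant rather than something DataFusion reports.

```rust
/// Hypothetical check: can the final merge run without first spilling
/// the buffered batches?
pub fn can_skip_final_spill(
    in_mem_bytes: usize,
    spill_files: usize,
    per_stream_bytes: usize,
    available_bytes: usize,
) -> bool {
    // Each spill stream is assumed to keep `per_stream_bytes` resident.
    let spill_demand = spill_files.saturating_mul(per_stream_bytes);
    // The buffered batches stay in memory in addition to the spill streams.
    in_mem_bytes.saturating_add(spill_demand) <= available_bytes
}
```

With 1 MiB batches and a 10 MiB budget, 9 MiB of buffered data plus 4 spill files would not pass this check, whereas 2 MiB of buffered data plus 4 spill files would.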

comphead · 2025-03-05T20:04:58Z

Thanks @2010YOUY01

cc @andygrove, as this might also be related to OOMs in Comet during the sort phase

* fix external sort failure * clippy * review

fix external sort failure

6e9d459

github-actions bot added the core Core DataFusion crate label Mar 5, 2025

clippy

0f26bb2

alamb mentioned this pull request Mar 5, 2025

Weekly Plan (Andrew Lamb) March 3, 2025 #14978

Closed

12 tasks

alamb approved these changes Mar 5, 2025

review

b7896ef

2010YOUY01 mentioned this pull request Mar 5, 2025

More accurate memory accounting in external sort #14748

Open

comphead merged commit d288b80 into apache:main Mar 5, 2025
24 checks passed

danila-b pushed a commit to danila-b/datafusion that referenced this pull request Mar 8, 2025

fix: External sort failing on an edge case (apache#15017)

c18cbf3

* fix external sort failure * clippy * review
