Skip to content

fix: External sort failing on an edge case #15017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 5, 2025

Conversation

2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Mar 5, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

I came across one sorting query with memory limit fail indefinitely, here is a simpler reproducer (running in datafusion-cli with commit 7597769)

# Compile datafusion-cli
cargo run --profile release-nonlto -- --mem-pool-type fair -m 10M

2 sort queries are executed: Q1 get executed with no issue, Q2 has smaller input size than Q1, but it failed.

DataFusion CLI v46.0.0
> set datafusion.execution.sort_spill_reservation_bytes = 3000000;
0 row(s) fetched.
Elapsed 0.001 seconds.

> select * from generate_series(1,10000000) as t1(v1) order by v1;
...Query succeed

> select * from generate_series(1,9000000) as t1(v1) order by v1;
Resources exhausted: Failed to allocate additional 65536 bytes for ExternalSorterMerge[0] with 0 bytes already allocated for this reservation - 49152 bytes remain available for the total pool

Query failure reason

At the final stage of sorting, all buffered in-memory batches and all the spilled files will be sort-preserving merged at the same time, and this caused the issue.

let mut streams = vec![];
if !self.in_mem_batches.is_empty() {
let in_mem_stream =
self.in_mem_sort_stream(self.metrics.baseline.intermediate())?;
streams.push(in_mem_stream);
}
for spill in self.spills.drain(..) {
if !spill.path().exists() {
return internal_err!("Spill file {:?} does not exist", spill.path());
}
let stream = read_spill_as_stream(spill, Arc::clone(&self.schema), 2)?;
streams.push(stream);
}

For example, there is one workload, let's say it's executing in a single partition. It's memory limit can hold 10 batches.

  • Sorting 100 batches can be executed without issue:
    • Every time 10 batches are read, mempool is full and one spill file will be written to disk
    • Finally, there are 10 spill files, only one batch of each file is required to load to memory at the same time, so there is enough memory budget to do the final merging.
  • Sorting 49 batches fails:
    • When the input is exhausted, there are 9 in-mem batches and 4 spill files. 9 + 4 batches are required to load to memory for final merging, it exceeds the memory pool limit which is around 10 batches.

A common workaround I believe is to set datafusion.execution.sort_spill_reservation_bytes to larger, its used for extra space for merging. However, workloads' row/batch size can vary drastically, also it's possible to see the case in-memory batches has almost reached the memory limit but not yet triggered on spilling, so this parameter is very tricky to configure it correct.
To make this simpler, we can always spill the in-memory batches (if it has spilled previously) at the final stage.

What changes are included in this PR?

Change the final sort-preserving merge logic of sorting: when it has spilled before, always spill all in-mem batches first, then start the merging phase.

Are these changes tested?

Regression test is added

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Mar 5, 2025
@alamb alamb mentioned this pull request Mar 5, 2025
12 tasks
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01

While this likely will result in slightly slower performance in some cases (as there is additional spilling) making sure the queries won't error seems like a very valuable property.

Change the final sort-preserving merge logic of sorting: when it has spilled before, always spill all in-mem batches first, then start the merging phase.

Thank you for the super clear writeup and code. this PR was a pleasure to read.

@@ -468,6 +468,31 @@ async fn test_stringview_external_sort() {
let _ = df.collect().await.expect("Query execution failed");
}

/// This test case is for the regression case:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the reference here to "regression" (which refers normally to something that stopped working when it worked before)

Maybe a better description would be "test_in_mem_buffer_almost_full" or something 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, updated.

// Release the memory reserved for merge back to the pool so
// there is some left when `in_mem_sort_stream` requests an
// allocation.
self.merge_reservation.free();

if self.spilled_before() {
let mut streams = vec![];

// Sort `in_mem_batches` and spill it first. If there are many
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@2010YOUY01
Copy link
Contributor Author

Thank you @2010YOUY01

While this likely will result in slightly slower performance in some cases (as there is additional spilling) making sure the queries won't error seems like a very valuable property.

Thank you for the review.
For efficiency, it's possible to avoid the final spill by checking the buffered batch number and spilled file count, but this should better be done after we have accurate memory estimation #14748, I'll update the issue for this idea.

@comphead
Copy link
Contributor

comphead commented Mar 5, 2025

Thanks @2010YOUY01

@andygrove as it might be also related to OOMs in Comet on Sort phase

@comphead comphead merged commit d288b80 into apache:main Mar 5, 2025
24 checks passed
danila-b pushed a commit to danila-b/datafusion that referenced this pull request Mar 8, 2025
* fix external sort failure

* clippy

* review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants