Implement a way to preserve partitioning through UnionExec without losing ordering #10314

Open
@alamb

Is your feature request related to a problem or challenge?

In some cases, the EnforceDistribution physical optimizer pass in DataFusion introduces an InterleaveExec to increase partitioning when data passes through a UnionExec:

plan = if plan.as_any().is::<UnionExec>()
    && !config.optimizer.prefer_existing_union
    && can_interleave(children_plans.iter())
{
    // Add a special case for [`UnionExec`] since we want to "bubble up"
    // hash-partitioned data. So instead of
    //
    // Agg:
    //   Repartition (hash):
    //     Union:
    //       - Agg:
    //           Repartition (hash):
    //             Data
    //       - Agg:
    //           Repartition (hash):
    //             Data
    //
    // we can use:
    //
    // Agg:
    //   Interleave:
    //     - Agg:
    //         Repartition (hash):
    //           Data
    //     - Agg:
    //         Repartition (hash):
    //           Data
    Arc::new(InterleaveExec::try_new(children_plans)?)
} else {
    plan.with_new_children(children_plans)?
};
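
The guard on this branch can be read as a small decision rule. The sketch below models it with stand-in types: the `Partitioning` enum, `UnionStrategy`, and `choose_union_strategy` are hypothetical simplifications written for illustration, not DataFusion's actual definitions. It assumes inputs can be interleaved only when every child reports the same hash partitioning.

```rust
/// Simplified stand-in for a plan's output partitioning (hypothetical;
/// DataFusion's real `Partitioning` carries physical expressions).
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    RoundRobin(usize),
    Hash(Vec<String>, usize),
    Unknown(usize),
}

/// What this model does with the children of a union (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum UnionStrategy {
    Interleave,
    KeepUnion,
}

/// True only if there is at least one child and every child has the same
/// hash partitioning (same keys, same partition count).
pub fn can_interleave(children: &[Partitioning]) -> bool {
    let Some((first, rest)) = children.split_first() else {
        return false;
    };
    matches!(first, Partitioning::Hash(_, _)) && rest.iter().all(|p| p == first)
}

/// Mirrors the shape of the `if` guard: interleave unless the user asked to
/// keep the existing union or the children cannot be interleaved.
pub fn choose_union_strategy(
    prefer_existing_union: bool,
    children: &[Partitioning],
) -> UnionStrategy {
    if !prefer_existing_union && can_interleave(children) {
        UnionStrategy::Interleave
    } else {
        UnionStrategy::KeepUnion
    }
}
```

In this model, two children that are both hash-partitioned on the same key into the same number of partitions yield `Interleave`, while setting `prefer_existing_union` to `true` always yields `KeepUnion`.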

Here is what InterleaveExec does:

/// Combines multiple input streams by interleaving them.
///
/// This only works if all inputs have the same hash-partitioning.
///
/// # Data Flow
/// ```text
/// +---------+
/// |         |---+
/// | Input 1 |   |
/// |         |-------------+
/// +---------+   |         |
///               |         |         +---------+
///               +------------------>|         |
///                 +---------------->| Combine |-->
///                 | +-------------->|         |
///                 | |     |         +---------+
/// +---------+     | |     |
/// |         |-----+ |     |
/// | Input 2 |       |     |
/// |         |---------------+
/// +---------+       |     | |
///                   |     | |       +---------+
///                   |     +-------->|         |
///                   |       +------>| Combine |-->
///                   |         +---->|         |
///                   |         |     +---------+
/// +---------+       |         |
/// |         |-------+         |
/// | Input 3 |                 |
/// |         |-----------------+
/// +---------+
/// ```

However, this has the potential downside of destroying any pre-existing ordering, and keeping that ordering is sometimes preferable to increasing or improving partitioning (e.g. see #10257 and the datafusion.optimizer.prefer_existing_sort setting).
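
To make the ordering loss concrete, here is a minimal model in plain Rust. It is not DataFusion code: `interleave_round_robin` and `is_ascending` are hypothetical helpers, and round-robin stands in for whatever arrival order the real streams happen to produce. Each input is sorted on its own, yet the combined output is not.

```rust
/// Model of one output partition of an interleaving operator: it emits
/// values from every input in arrival order. Round-robin is used here as
/// one possible arrival order (an assumption made for this sketch).
pub fn interleave_round_robin(inputs: &[Vec<i64>]) -> Vec<i64> {
    let mut cursors = vec![0usize; inputs.len()];
    let total: usize = inputs.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    while out.len() < total {
        // Take at most one value from each input per pass.
        for (input, cursor) in inputs.iter().zip(cursors.iter_mut()) {
            if *cursor < input.len() {
                out.push(input[*cursor]);
                *cursor += 1;
            }
        }
    }
    out
}

/// Returns true if the values are in non-decreasing order.
pub fn is_ascending(values: &[i64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

For the sorted inputs `[1, 5, 9]` and `[2, 3, 4]`, this model emits `[1, 2, 5, 3, 9, 4]`, so any downstream operator that relied on the sort order would need a new sort.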

Describe the solution you'd like

I would like there to be some way to preserve the partitioning after a UnionExec without losing the ordering information, and then to remove the prefer_existing_union flag.

Describe alternatives you've considered

One possibility is to add a preserve_order flag to InterleaveExec the same way as RepartitionExec has a preserve_order flag:

/// Maps `N` input partitions to `M` output partitions based on a
/// [`Partitioning`] scheme.
///
/// # Background
///
/// DataFusion, like most other commercial systems, with the
/// notable exception of DuckDB, uses the "Exchange Operator" based
/// approach to parallelism which works well in practice given
/// sufficient care in implementation.
///
/// DataFusion's planner picks the target number of partitions and
/// then `RepartitionExec` redistributes [`RecordBatch`]es to that number
/// of output partitions.
///
/// For example, given `target_partitions=3` (trying to use 3 cores)
/// but scanning an input with 2 partitions, `RepartitionExec` can be
/// used to get 3 even streams of `RecordBatch`es
///
///
///```text
///        ▲                  ▲                  ▲
///        │                  │                  │
///        │                  │                  │
///        │                  │                  │
///┌───────────────┐  ┌───────────────┐  ┌───────────────┐
///│    GroupBy    │  │    GroupBy    │  │    GroupBy    │
///│   (Partial)   │  │   (Partial)   │  │   (Partial)   │
///└───────────────┘  └───────────────┘  └───────────────┘
///        ▲                  ▲                  ▲
///        └──────────────────┼──────────────────┘
///                           │
///              ┌─────────────────────────┐
///              │     RepartitionExec     │
///              │   (hash/round robin)    │
///              └─────────────────────────┘
///                         ▲   ▲
///             ┌───────────┘   └───────────┐
///             │                           │
///             │                           │
///        .─────────.                 .─────────.
///     ,─'           '─.           ,─'           '─.
///    ;      Input      :         ;      Input      :
///    :   Partition 0   ;         :   Partition 1   ;
///     ╲               ╱           ╲               ╱
///      '─.         ,─'             '─.         ,─'
///         `───────'                   `───────'
///```
///
/// # Output Ordering
///
/// If more than one stream is being repartitioned, the output will be some
/// arbitrary interleaving (and thus unordered) unless
/// [`Self::with_preserve_order`] specifies otherwise.
///
/// # Footnote
///
/// The "Exchange Operator" was first described in the 1989 paper
/// [Encapsulation of parallelism in the Volcano query processing
/// system
/// Paper](https://w6113.github.io/files/papers/volcanoparallelism-89.pdf)
/// which uses the term "Exchange" for the concept of repartitioning
/// data across threads.
#[derive(Debug)]
pub struct RepartitionExec {
    /// Input execution plan
    input: Arc<dyn ExecutionPlan>,
    /// Partitioning scheme to use
    partitioning: Partitioning,
    /// Inner state that is initialized when the first output stream is created.
    state: LazyState,
    /// Execution metrics
    metrics: ExecutionPlanMetricsSet,
    /// Boolean flag to decide whether to preserve ordering. If true means
    /// `SortPreservingRepartitionExec`, false means `RepartitionExec`.
    preserve_order: bool,
    /// Cache holding plan properties like equivalences, output partitioning etc.
    cache: PlanProperties,
}

#[derive(Debug, Clone)]
struct RepartitionMetrics {
    /// Time in nanos to execute child operator and fetch batches
    fetch_time: metrics::Time,
    /// Time in nanos to perform repartitioning
    repartition_time: metrics::Time,
    /// Time in nanos for sending resulting batches to channels.
    ///
    /// One metric per output partition.
    send_time: Vec<metrics::Time>,
}
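
As a rough illustration of what a preserve_order flag on InterleaveExec might do, the sketch below combines already-sorted inputs with a k-way merge when the flag is set. This is an assumption about one possible design, not an existing DataFusion API: `combine` and `merge_preserving_order` are hypothetical names, and plain `Vec<i64>` values stand in for streams of record batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges inputs that are each already sorted ascending into one sorted
/// output, using a min-heap keyed on each input's next value.
pub fn merge_preserving_order(inputs: &[Vec<i64>]) -> Vec<i64> {
    // Heap entries are (next value, input index, position in that input).
    let mut heap = BinaryHeap::new();
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&first) = input.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(inputs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the input that just produced a value.
        if let Some(&next) = inputs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}

/// Hypothetical combine step for one output partition. With
/// `preserve_order` set it merges; otherwise it concatenates, which
/// stands in for an arbitrary interleaving.
pub fn combine(inputs: &[Vec<i64>], preserve_order: bool) -> Vec<i64> {
    if preserve_order {
        merge_preserving_order(inputs)
    } else {
        inputs.iter().flatten().copied().collect()
    }
}
```

With inputs `[1, 5, 9]` and `[2, 3, 4]`, `combine(inputs, true)` yields `[1, 2, 3, 4, 5, 9]`, whereas `combine(inputs, false)` yields `[1, 5, 9, 2, 3, 4]`. The merge only produces sorted output if every input is itself sorted, and it has to wait for the next value from every input before emitting, which is the usual cost of preserving order.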

Additional context

We encountered this while working on #10259. @mustafasrepo and @phillipleblanc pointed out that the config flag prefer_existing_union is effectively the same as prefer_existing_sort.

Labels: enhancement (New feature or request)