Introduce ProjectionMask To Allow Nested Projection Pushdown #2581

Open
@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently projection indices are pushed down to scans as Vec<usize>. This creates some ambiguities:

To demonstrate how these problems intertwine, consider the case of

Struct {
   first: Struct {
      a: Integer,
      b: Integer,
   },
   second: Struct {
      c: Integer
   }
}

If I project ["first.a", "second.c", "first.b"] what is the resulting schema?

Describe the solution you'd like

I would like to propose that we instead push down a leaf column mask, where leaf columns are fields with no children, enumerated by a depth-first scan of the schema tree. A mask only records which leaves are selected, not the order in which they were requested, so it avoids any ordering ambiguity while remaining relatively straightforward to implement and interpret.
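
To make the idea concrete, here is a minimal sketch of leaf enumeration and masking for the example schema above. It is illustrative only: `Node`, `example_schema`, `leaf_paths`, `leaf_mask` and `project` are hypothetical names made up for this sketch, not the arrow-rs or parquet API, and the real schema types carry far more information than a name.

```rust
/// Hypothetical, simplified schema node (NOT the arrow-rs `Field` type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Leaf(String),
    Struct(String, Vec<Node>),
}

/// The example schema from this issue: first { a, b }, second { c }.
fn example_schema() -> Vec<Node> {
    vec![
        Node::Struct(
            "first".to_string(),
            vec![Node::Leaf("a".to_string()), Node::Leaf("b".to_string())],
        ),
        Node::Struct("second".to_string(), vec![Node::Leaf("c".to_string())]),
    ]
}

/// Dotted paths of all leaf columns, in depth-first order.
fn leaf_paths(fields: &[Node]) -> Vec<String> {
    fn visit(node: &Node, prefix: &str, out: &mut Vec<String>) {
        let (name, children): (&String, &[Node]) = match node {
            Node::Leaf(name) => (name, &[]),
            Node::Struct(name, children) => (name, children.as_slice()),
        };
        let path = if prefix.is_empty() {
            name.clone()
        } else {
            format!("{}.{}", prefix, name)
        };
        match node {
            Node::Leaf(_) => out.push(path),
            Node::Struct(..) => {
                for child in children {
                    visit(child, &path, out);
                }
            }
        }
    }
    let mut out = Vec::new();
    for field in fields {
        visit(field, "", &mut out);
    }
    out
}

/// One bool per leaf: true if the leaf was requested.
/// The order of `wanted` has no effect on the result.
fn leaf_mask(fields: &[Node], wanted: &[&str]) -> Vec<bool> {
    leaf_paths(fields)
        .iter()
        .map(|path| wanted.iter().any(|w| *w == path.as_str()))
        .collect()
}

/// Prune the schema to the masked leaves, keeping the original field order.
/// Structs left with no selected leaves are dropped.
/// Panics if `mask` has fewer entries than the schema has leaves.
fn project(fields: &[Node], mask: &[bool]) -> Vec<Node> {
    fn visit(node: &Node, mask: &[bool], next: &mut usize) -> Option<Node> {
        match node {
            Node::Leaf(_) => {
                let keep = mask[*next];
                *next += 1;
                if keep { Some(node.clone()) } else { None }
            }
            Node::Struct(name, children) => {
                let mut kept = Vec::new();
                for child in children {
                    if let Some(c) = visit(child, mask, next) {
                        kept.push(c);
                    }
                }
                if kept.is_empty() {
                    None
                } else {
                    Some(Node::Struct(name.clone(), kept))
                }
            }
        }
    }
    let mut next = 0;
    fields
        .iter()
        .filter_map(|field| visit(field, mask, &mut next))
        .collect()
}
```

In this sketch, requesting `["first.a", "second.c"]` and `["second.c", "first.a"]` produce the same mask, and `project` always returns fields in schema order, so the resulting schema is unambiguous.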

I recently introduced a similar concept to the parquet reader in apache/arrow-rs#1716. We could theoretically lift this into arrow-rs, potentially adding support for it to RecordBatch, and then use it in DataFusion.

Describe alternatives you've considered

We could choose not to support nested pushdown.

Additional context

Pushdown for nested types in ParquetExec is currently broken, see #2453.

Thoughts @andygrove @alamb
