No efficient way to load a subset of files from partitioned table #8906

Open
@rspears74

Is your feature request related to a problem or challenge?

As far as I can tell, there is no efficient way to load a subset of files from a partitioned table. With ListingTable, or another TableProvider such as DeltaTableProvider from deltalake, I can call read_table, but that loads the entire table. I can also pass a list of parquet files to read_parquet, but that does not work for partitioned tables when the partition columns are not "materialized" in the raw parquet files. The only way I've found to load a subset of partitioned files is to iterate over the list of file paths, run the entire TableProvider/read_table process on each file individually, and union the results together.
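To illustrate what "not materialized" means here: in a Hive-style layout the partition values exist only in the directory names, so a reader that is handed bare file paths never sees them. The sketch below assumes that layout (key=value directory segments under the table root) and uses only the Rust standard library. The function name partition_values and the example paths are made up for illustration and are not part of DataFusion or deltalake.

```rust
use std::path::Path;

/// Illustrative helper (not a DataFusion API): recover Hive-style `key=value`
/// partition pairs from the directories between `table_root` and the file.
/// Returns `None` if the file is not located under `table_root`.
pub fn partition_values(table_root: &str, file_path: &str) -> Option<Vec<(String, String)>> {
    // Path of the file relative to the table root, e.g. "year=2023/month=01/part-0.parquet".
    let relative = Path::new(file_path).strip_prefix(table_root).ok()?;

    let mut pairs = Vec::new();
    // Only directory components can carry partition values, so skip the file name.
    if let Some(parent) = relative.parent() {
        for component in parent.components() {
            let segment = component.as_os_str().to_str()?;
            // Directories that are not of the form `key=value` are ignored.
            if let Some((key, value)) = segment.split_once('=') {
                pairs.push((key.to_string(), value.to_string()));
            }
        }
    }
    Some(pairs)
}
```

A table provider does this kind of bookkeeping once per table and attaches the values as columns, which is why reading the same files directly with read_parquet loses them.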

Describe the solution you'd like

It would be useful to be able to create a TableProvider from a table path and then pass in some sort of file "whitelist", so that only those files are scanned. Maybe something like read_table_files(TableProvider, impl IntoIterator<Item = String>).
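A rough sketch of the whitelist semantics I have in mind, using only the standard library. The name select_files is hypothetical and nothing here is an existing DataFusion API. It only shows the filtering step: the provider would list the table's files as usual, keep the ones in the whitelist, and hand the survivors to a single scan.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only the table files named in `whitelist`,
/// preserving the order of the table's own file listing. Whitelist entries
/// that do not belong to the table are silently ignored.
pub fn select_files<I>(table_files: &[String], whitelist: I) -> Vec<String>
where
    I: IntoIterator<Item = String>,
{
    // Build a set once so each membership check is O(1) on average.
    let allowed: HashSet<String> = whitelist.into_iter().collect();
    table_files
        .iter()
        .filter(|file| allowed.contains(*file))
        .cloned()
        .collect()
}
```

Whether unknown whitelist entries should be ignored or reported as an error is an open design question; this sketch ignores them.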

Describe alternatives you've considered

As stated above, I've tried reading the files one by one and unioning the results, but this is shockingly inefficient compared to reading all files at once.

Additional context

No response

Labels: enhancement
