`TableProvider` to skip files in the folder which non relevant to selected reader #16487

comphead · 2025-06-20T16:05:20Z

Which issue does this PR close?

Closes #16460 (Make datafusion read parquet folders if non parquet files exists).

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead · 2025-06-24T00:29:51Z

datafusion/core/src/datasource/listing_table_factory.rs

@@ -125,6 +125,13 @@ impl TableProviderFactory for ListingTableFactory {
            // specifically for parquet file format.
            // See: https://github.com/apache/datafusion/issues/7317
            None => {
+                // if the folder then rewrite a file path as 'path/*.parquet'


This is the actual fix.

This will mean a directory of files like foo/my_file.parquet.snappy would not be readable anymore. I think that Spark creates files like my_file.snappy.parquet, so it should be OK.

It should be OK: compressed files are usually named *.codec.parquet, and the broader *.parquet wildcard still matches them. I tested locally against part-00000-9b95f137-d11f-44b6-84b7-d49c95bc7c5b-c000.snappy.parquet.
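To make the naming concern above concrete, here is a minimal sketch of how a `*.parquet` suffix filter treats the two file-name styles discussed. It is an illustration only, not the code in `listing_table_factory.rs`: both helper names are hypothetical, and the sketch assumes case-sensitive matching on the file name alone.

```rust
/// Hypothetical helper: true if `file_name` would match the glob `*.parquet`,
/// i.e. the name ends with the `.parquet` extension (case-sensitive).
fn matches_parquet_glob(file_name: &str) -> bool {
    file_name.ends_with(".parquet")
}

/// Hypothetical helper mirroring the idea in the diff comment: rewrite a
/// folder path as 'path/*.parquet', whether or not it ends with a slash.
fn folder_to_parquet_glob(folder: &str) -> String {
    format!("{}/*.parquet", folder.trim_end_matches('/'))
}
```

Under these assumptions a Spark-style name such as `my_file.snappy.parquet` still matches, while `my_file.parquet.snappy` is skipped, which is the trade-off raised in the review.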

alamb

This makes sense to me -- thank you @comphead

cc @hendrikmakait

alamb · 2025-06-24T10:18:42Z

This will mean a directory of files like foo/my_file.parquet.snappy would not be readable anymore -- I think that spark creates files like my_file.snappy.parquet so it should be ok

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Jun 20, 2025

comphead changed the title ~~Dev~~ WIP: Datafusion to skip non parquet files for parquet reader Jun 20, 2025

comphead mentioned this pull request Jun 20, 2025

Make datafusion read parquet folders if non parquet files exists #16460

Closed

comphead added 2 commits June 23, 2025 17:07

minor: Avoid parquet read failure if the folder doesn't end with slash

05d3ef2

minor: Avoid parquet read failure if the folder doesn't end with slash

a78de19

comphead force-pushed the dev branch from dd23b8f to 34d7cd5 Compare June 24, 2025 00:07

minor: Avoid parquet read failure if the folder contains non parquet files

742b135

comphead force-pushed the dev branch from 34d7cd5 to 742b135 Compare June 24, 2025 00:11

minor: Avoid parquet read failure if the folder contains non parquet files

c299358

comphead marked this pull request as ready for review June 24, 2025 00:15

comphead changed the title ~~WIP: Datafusion to skip non parquet files for parquet reader~~ TableProvider to skip files in the folder which non relevant to selected reader Jun 24, 2025

minor: Avoid parquet read failure if the folder contains non parquet files

b550df0

comphead commented Jun 24, 2025

comphead and others added 2 commits June 23, 2025 18:22

minor: Avoid parquet read failure if the folder contains non parquet files

79cb057

Merge branch 'main' into dev

09bc231

alamb approved these changes Jun 24, 2025

comphead merged commit 59143c1 into apache:main Jun 24, 2025
27 checks passed
