Map file-level column statistics to the table-level #15865

xudong963 · 2025-04-26T07:47:03Z

Which issue does this PR close?

Closes #15689 (ListingTable statistics improperly merges statistics when files have different schemas)

Rationale for this change

As described in #15689.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

xudong963 · 2025-04-27T09:52:04Z

FYI @friendlymatthew

alamb

Thank you @xudong963 -- this looks good to me except for one more test.

Can you please add a test that verifies the statistics of a ListingTable created from two parquet files with different schemas? I think you could write an SLT-level test with something like:

> select * from  values (1, 'a'), (2, 'b') t(int_col, str_col);
+---------+---------+
| int_col | str_col |
+---------+---------+
| 1       | a       |
| 2       | b       |
+---------+---------+
2 row(s) fetched.
Elapsed 0.006 seconds.

> COPY (SELECT * FROM values (1, 'a'), (2, 'b') t(int_col, str_col)) to '/tmp/table/1.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> COPY (SELECT * FROM values ('c', 3), ('d', -1) t(str_col, int_col)) to '/tmp/table/2.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

And then verify the statistics with

> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> set datafusion.explain.show_statistics = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> create external table t stored as parquet location '/tmp/table';
0 row(s) fetched.
Elapsed 0.006 seconds.

> explain format indent select * from t;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                 |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | TableScan: t projection=[int_col, str_col]                                                                                                                                                                                                                                                                                           |
| physical_plan | DataSourceExec: file_groups={2 groups: [[tmp/table/1.parquet], [tmp/table/2.parquet]]}, projection=[int_col, str_col], file_type=parquet, statistics=[Rows=Exact(4), Bytes=Exact(288), [(Col[0]: Min=Exact(Int64(-1)) Max=Exact(Int64(3)) Null=Exact(0)),(Col[1]: Min=Exact(Utf8View("a")) Max=Exact(Utf8View("d")) Null=Exact(0))]] |
|               |                                                                                                                                                                                                                                                                                                                                      |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.
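
For illustration only, the merge that such a test would verify can be sketched in plain Rust. This is a simplified model, not DataFusion code: `Value`, `ColStats`, `FileStats`, `map_to_table`, `merge`, and `table_stats` are hypothetical names, and the real statistics types also track precision (exact vs. inexact), which is left out here. The sketch assumes file columns are matched to table columns by name and that a column missing from a file counts as all-NULL for that file.

```rust
use std::collections::HashMap;

/// A column value; only the two types used in the example are modeled.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Int(i64),
    Str(String),
}

/// Simplified per-column statistics (hypothetical type, not DataFusion's).
/// `None` for min/max means "no non-null value seen".
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<Value>,
    max: Option<Value>,
    null_count: usize,
}

/// Statistics of one file: `stats[i]` describes `columns[i]`, in file order.
struct FileStats {
    columns: Vec<String>,
    stats: Vec<ColStats>,
    num_rows: usize,
}

/// Reorder one file's statistics into table-schema order, matching by name.
/// A table column the file lacks is treated as all-NULL for that file.
fn map_to_table(table_schema: &[&str], file: &FileStats) -> Vec<ColStats> {
    let by_name: HashMap<&str, &ColStats> = file
        .columns
        .iter()
        .map(String::as_str)
        .zip(file.stats.iter())
        .collect();
    table_schema
        .iter()
        .map(|name| match by_name.get(name) {
            Some(stats) => (*stats).clone(),
            None => ColStats { min: None, max: None, null_count: file.num_rows },
        })
        .collect()
}

/// Keep the smaller (or larger) of two optional bounds, ignoring absent ones.
fn pick(a: &Option<Value>, b: &Option<Value>, want_min: bool) -> Option<Value> {
    match (a, b) {
        (Some(x), Some(y)) => {
            let chosen = if want_min { std::cmp::min(x, y) } else { std::cmp::max(x, y) };
            Some(chosen.clone())
        }
        (Some(x), None) | (None, Some(x)) => Some(x.clone()),
        (None, None) => None,
    }
}

/// Merge the statistics of the same table column from two files.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: pick(&a.min, &b.min, true),
        max: pick(&a.max, &b.max, false),
        null_count: a.null_count + b.null_count,
    }
}

/// Table-level statistics: map every file to table order first, then merge.
fn table_stats(table_schema: &[&str], files: &[FileStats]) -> Vec<ColStats> {
    let empty = ColStats { min: None, max: None, null_count: 0 };
    let mut acc = vec![empty; table_schema.len()];
    for file in files {
        let mapped = map_to_table(table_schema, file);
        acc = acc.iter().zip(&mapped).map(|(a, b)| merge(a, b)).collect();
    }
    acc
}
```

Feeding this model the two files from the example (`int_col, str_col` in the first, `str_col, int_col` in the second) gives min -1 / max 3 for `int_col` and min "a" / max "d" for `str_col`, which is what the explain output above reports. Merging by position without the mapping step would instead compare integers against strings.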

alamb · 2025-04-29T00:50:45Z

datafusion/core/src/datasource/listing/table.rs

@@ -1129,7 +1130,17 @@ impl ListingTable {
        let (file_group, inexact_stats) =
            get_files_with_limit(files, limit, self.options.collect_stat).await?;

-        let file_groups = file_group.split_files(self.options.target_partitions);
+        let mut file_groups = file_group.split_files(self.options.target_partitions);
+        let (schema_mapper, _) = DefaultSchemaAdapterFactory::from_schema(self.schema())


I think using the default schema mapper makes sense for now (in this PR), but in general it would make sense to let users provide their own schema mapping rules here via their own mapper (so that, for example, a default value other than NULL can be used).

However, we would have to add a schema mapper factory to ListingOptions.

https://github.com/apache/datafusion/blob/f1bbb1d636650c7f28f52dc507f36e64d71e1aa8/datafusion/core/src/datasource/listing/table.rs#L256-L255

(this is not a change needed for this PR, I just noticed it while reviewing this PR)

Makes sense. I filed an issue to track it: #15889
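
To make the idea of a user-provided mapping rule concrete, here is a rough sketch of how a non-NULL default would change the statistics reported for a column that a file lacks. Everything here is an assumption for illustration: `MissingColumnPolicy`, `FillNull`, `FillConstant`, and `adapt_file_stats` are hypothetical names, not existing DataFusion APIs, and the design tracked in #15889 may look quite different.

```rust
/// Simplified integer column statistics (hypothetical type, not DataFusion's).
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: usize,
}

/// Hypothetical hook: statistics for a table column that a file lacks.
trait MissingColumnPolicy {
    fn stats_for_missing(&self, num_rows: usize) -> ColStats;
}

/// Default-like behavior: the missing column is filled with NULLs.
struct FillNull;

impl MissingColumnPolicy for FillNull {
    fn stats_for_missing(&self, num_rows: usize) -> ColStats {
        // Every row is NULL, so there is no min/max.
        ColStats { min: None, max: None, null_count: num_rows }
    }
}

/// User-chosen behavior: the missing column is filled with a constant.
struct FillConstant(i64);

impl MissingColumnPolicy for FillConstant {
    fn stats_for_missing(&self, num_rows: usize) -> ColStats {
        if num_rows == 0 {
            return ColStats { min: None, max: None, null_count: 0 };
        }
        // Every row holds the same value, so min == max and nothing is NULL.
        ColStats { min: Some(self.0), max: Some(self.0), null_count: 0 }
    }
}

/// Produce statistics in table-schema order, asking the policy for any
/// column that the file does not contain.
fn adapt_file_stats(
    table_schema: &[&str],
    file_stats: &[(&str, ColStats)],
    num_rows: usize,
    policy: &dyn MissingColumnPolicy,
) -> Vec<ColStats> {
    table_schema
        .iter()
        .map(|name| {
            file_stats
                .iter()
                .find(|(col, _)| col == name)
                .map(|(_, stats)| stats.clone())
                .unwrap_or_else(|| policy.stats_for_missing(num_rows))
        })
        .collect()
}
```

The point of the sketch is that the fill rule and the statistics have to agree: if a mapper fills a missing column with a constant while the statistics still report an all-NULL column, the reported statistics would no longer describe the rows actually produced.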

xudong963 · 2025-04-29T05:32:01Z

> Can you please add a test that verifies the statistics of a ListingTable that was created with two parquet files of different schemas? I think you could write a SLT level test with something like

Yes, cool tests

xudong963 · 2025-04-29T07:05:13Z

@alamb I'll merge the PR and continue to fix tests in #15852

xudong963 · 2025-04-29T09:06:20Z

datafusion/core/src/datasource/listing/table.rs

+        let mut file_groups = file_group.split_files(self.options.target_partitions);
+        let (schema_mapper, _) = DefaultSchemaAdapterFactory::from_schema(self.schema())


@alamb While working on #15852, I found that the listing table does not actually have the issue described in #15689. All files here already have the same schema, because when the table is created, every fetched file already goes through the SchemaMapper to reorder its schema. See https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs#L206.

What we should fix instead is making the file schema match the listing table schema. Typically, if users specify partition columns, the table schema carries the extra partition column information. So I moved the mapper down into the compute_all_files_statistics method in commit 689fc66.
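
As a rough illustration of that mismatch (a simplified model, not the code in 689fc66): the file-level statistics have one entry per file column, while the table schema may carry extra partition columns, so the statistics need to be widened to the table schema before they are merged. The names `ColStats` and `file_stats_to_table_stats` are hypothetical, and reporting unknown statistics for columns absent from the file is an assumption of this sketch.

```rust
/// Simplified statistics; `None` means "unknown" (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<usize>,
}

impl ColStats {
    /// Statistics carrying no information at all.
    fn unknown() -> Self {
        ColStats { min: None, max: None, null_count: None }
    }
}

/// Expand file-schema statistics to table-schema width. Columns present in
/// the file keep their statistics; columns that exist only in the table
/// schema (for example partition columns) get unknown statistics.
fn file_stats_to_table_stats(
    file_columns: &[&str],
    file_stats: &[ColStats],
    table_columns: &[&str],
) -> Vec<ColStats> {
    table_columns
        .iter()
        .map(|table_col| {
            match file_columns.iter().position(|file_col| file_col == table_col) {
                Some(idx) => file_stats[idx].clone(),
                None => ColStats::unknown(),
            }
        })
        .collect()
}
```

With a file that has only `id` and a table schema of `id, year` (where `year` is a partition column), the result has two entries: the file's statistics for `id` and an unknown entry for `year`.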

* init * fix clippy * add test

xudong963 marked this pull request as draft April 26, 2025 07:47

github-actions bot added optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate labels Apr 26, 2025

xudong963 force-pushed the fix_listing_schema branch from bb0f430 to 26c5050 Compare April 26, 2025 08:09

init

f1bbb1d

xudong963 force-pushed the fix_listing_schema branch from 26c5050 to 777c5e7 Compare April 27, 2025 09:51

github-actions bot added common Related to common crate and removed optimizer Optimizer rules labels Apr 27, 2025

xudong963 marked this pull request as ready for review April 27, 2025 09:51

xudong963 requested a review from alamb April 27, 2025 10:14

fix clippy

df1db6a

xudong963 force-pushed the fix_listing_schema branch from 777c5e7 to df1db6a Compare April 28, 2025 06:46

xudong963 added the bug Something isn't working label Apr 28, 2025

alamb mentioned this pull request Apr 28, 2025

Weekly Plan: Andrew Lamb 2025-04-28 #15880

Open

26 tasks

alamb approved these changes Apr 29, 2025


add test

ab25831

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2025

xudong963 mentioned this pull request Apr 29, 2025

Add schema mapper factory to ListingOptions #15889

Open

xudong963 merged commit 54302ac into apache:main Apr 29, 2025
29 checks passed


nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025

Map file-level column statistics to the table-level (apache#15865)

9f2ef4d

* init * fix clippy * add test
