Map file-level column statistics to the table-level #15865

Merged
merged 3 commits into from
Apr 29, 2025

Conversation

xudong963
Member

Which issue does this PR close?

Rationale for this change

As said in #15689

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@xudong963 xudong963 marked this pull request as draft April 26, 2025 07:47
@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate labels Apr 26, 2025
@xudong963 xudong963 force-pushed the fix_listing_schema branch from bb0f430 to 26c5050 Compare April 26, 2025 08:09
@xudong963 xudong963 force-pushed the fix_listing_schema branch from 26c5050 to 777c5e7 Compare April 27, 2025 09:51
@github-actions github-actions bot added common Related to common crate and removed optimizer Optimizer rules labels Apr 27, 2025
@xudong963 xudong963 marked this pull request as ready for review April 27, 2025 09:51
@xudong963
Member Author

Fyi @friendlymatthew

@xudong963 xudong963 requested a review from alamb April 27, 2025 10:14
@xudong963 xudong963 force-pushed the fix_listing_schema branch from 777c5e7 to df1db6a Compare April 28, 2025 06:46
@xudong963 xudong963 added the bug Something isn't working label Apr 28, 2025
@alamb alamb mentioned this pull request Apr 28, 2025
26 tasks
Contributor

@alamb alamb left a comment

Thank you @xudong963 -- this looks good to me except for one more test.

Can you please add a test that verifies the statistics of a ListingTable that was created with two parquet files of different schemas? I think you could write a SLT level test with something like

> select * from  values (1, 'a'), (2, 'b') t(int_col, str_col);
+---------+---------+
| int_col | str_col |
+---------+---------+
| 1       | a       |
| 2       | b       |
+---------+---------+
2 row(s) fetched.
Elapsed 0.006 seconds.

> COPY (SELECT * FROM values (1, 'a'), (2, 'b') t(int_col, str_col)) to '/tmp/table/1.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> COPY (SELECT * FROM values ('c', 3), ('d', -1) t(str_col, int_col)) to '/tmp/table/2.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

And then verify the statistics with

> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> set datafusion.explain.show_statistics = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> create external table t stored as parquet location '/tmp/table';
0 row(s) fetched.
Elapsed 0.006 seconds.

> explain format indent select * from t;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                 |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | TableScan: t projection=[int_col, str_col]                                                                                                                                                                                                                                                                                           |
| physical_plan | DataSourceExec: file_groups={2 groups: [[tmp/table/1.parquet], [tmp/table/2.parquet]]}, projection=[int_col, str_col], file_type=parquet, statistics=[Rows=Exact(4), Bytes=Exact(288), [(Col[0]: Min=Exact(Int64(-1)) Max=Exact(Int64(3)) Null=Exact(0)),(Col[1]: Min=Exact(Utf8View("a")) Max=Exact(Utf8View("d")) Null=Exact(0))]] |
|               |                                                                                                                                                                                                                                                                                                                                      |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.
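
For illustration, here is a minimal, self-contained Rust sketch of why column order matters when the statistics of the two files above are merged. It is not DataFusion code: `Scalar`, `ColStats`, `map_to_table_order`, `merge`, and `table_stats` are hypothetical names, and the statistics model is deliberately simplified. Each file's column statistics are first reordered by column name into table-schema order, and only then merged position by position.

```rust
use std::collections::HashMap;

/// Simplified scalar value (hypothetical, stands in for a real scalar type).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Scalar {
    Int(i64),
    Str(String),
}

/// Simplified per-column statistics; `None` means "unknown".
#[derive(Debug, Clone, PartialEq, Default)]
struct ColStats {
    min: Option<Scalar>,
    max: Option<Scalar>,
    null_count: Option<usize>,
}

/// Reorder one file's column statistics into table-schema order.
/// Table columns that the file does not contain get unknown statistics.
fn map_to_table_order(
    table_cols: &[&str],
    file_cols: &[&str],
    file_stats: &[ColStats],
) -> Vec<ColStats> {
    let by_name: HashMap<&str, &ColStats> =
        file_cols.iter().copied().zip(file_stats.iter()).collect();
    table_cols
        .iter()
        .map(|name| by_name.get(name).copied().cloned().unwrap_or_default())
        .collect()
}

/// Merge the statistics of the same column from two files.
/// A statistic stays known only if it is known on both sides.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: a.min.clone().zip(b.min.clone()).map(|(x, y)| x.min(y)),
        max: a.max.clone().zip(b.max.clone()).map(|(x, y)| x.max(y)),
        null_count: a.null_count.zip(b.null_count).map(|(x, y)| x + y),
    }
}

/// Table-level statistics: map every file to table order, then merge.
/// Returns `None` when there are no files.
fn table_stats(
    table_cols: &[&str],
    files: &[(Vec<&str>, Vec<ColStats>)],
) -> Option<Vec<ColStats>> {
    let mut mapped = files
        .iter()
        .map(|(cols, stats)| map_to_table_order(table_cols, cols, stats));
    let first = mapped.next()?;
    Some(mapped.fold(first, |acc, next| {
        acc.iter().zip(next.iter()).map(|(a, b)| merge(a, b)).collect()
    }))
}
```

Fed with the statistics of `1.parquet` (columns `int_col, str_col`) and `2.parquet` (columns `str_col, int_col`), this sketch produces min -1 / max 3 for `int_col` and min "a" / max "d" for `str_col`, which matches the explain output above. Merging by position without the reordering step would instead combine integer statistics with string statistics.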

@@ -1129,7 +1130,17 @@ impl ListingTable {
         let (file_group, inexact_stats) =
             get_files_with_limit(files, limit, self.options.collect_stat).await?;

-        let file_groups = file_group.split_files(self.options.target_partitions);
+        let mut file_groups = file_group.split_files(self.options.target_partitions);
+        let (schema_mapper, _) = DefaultSchemaAdapterFactory::from_schema(self.schema())
Contributor

I think using the default schema mapper makes sense for now / in this PR, but in general I think it would make sense to allow the user to provide their own schema mapping rules here (so a default value that is not NULL can be used, for example) via their own mapper.

However, we would have to add a schema mapper factory to ListingOptions

https://github.com/apache/datafusion/blob/f1bbb1d636650c7f28f52dc507f36e64d71e1aa8/datafusion/core/src/datasource/listing/table.rs#L256-L255

(this is not a change needed for this PR, I just noticed it while reviewing this PR)
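
As a rough sketch of what user-provided mapping rules could mean for statistics, the Rust snippet below models the choice as a small trait. Everything in it is hypothetical (`IntColumnStats`, `MissingColumnStats`, `NullFill`, `ConstantFill`); none of these names exist in DataFusion, and the statistics are restricted to integer columns to keep the example short. The assumption is that a column missing from a file is either filled with NULL, or with a user-chosen constant.

```rust
use std::collections::HashMap;

/// Simplified statistics for one integer column; `None` means "unknown".
#[derive(Debug, Clone, PartialEq)]
struct IntColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<usize>,
}

/// Hypothetical extension point: decides which statistics a table column
/// gets for a file that does not contain that column.
trait MissingColumnStats {
    fn stats_for_missing(&self, column: &str, num_rows: usize) -> IntColumnStats;
}

/// Missing columns are filled with NULL, so every row counts as null.
struct NullFill;

impl MissingColumnStats for NullFill {
    fn stats_for_missing(&self, _column: &str, num_rows: usize) -> IntColumnStats {
        IntColumnStats { min: None, max: None, null_count: Some(num_rows) }
    }
}

/// Missing columns are filled with a per-column constant chosen by the user.
struct ConstantFill {
    defaults: HashMap<String, i64>,
}

impl MissingColumnStats for ConstantFill {
    fn stats_for_missing(&self, column: &str, num_rows: usize) -> IntColumnStats {
        match self.defaults.get(column) {
            // Every row holds the same non-null value, so min == max.
            Some(&v) if num_rows > 0 => IntColumnStats {
                min: Some(v),
                max: Some(v),
                null_count: Some(0),
            },
            // No default configured (or no rows): fall back to NULL filling.
            _ => NullFill.stats_for_missing(column, num_rows),
        }
    }
}
```

The point of the sketch is only that the statistics of a missing column depend on the fill rule, so the statistics mapping and the schema mapping would need to agree on it.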

Member Author

Makes sense. I filed an issue to track it: #15889

@xudong963
Member Author

> Can you please add a test that verifies the statistics of a ListingTable that was created with two parquet files of different schemas? I think you could write a SLT level test with something like

Yes, cool tests

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2025
@xudong963
Member Author

@alamb I'll merge the PR and continue to fix tests in #15852

@xudong963 xudong963 merged commit 54302ac into apache:main Apr 29, 2025
29 checks passed
Comment on lines +1133 to +1134
let mut file_groups = file_group.split_files(self.options.target_partitions);
let (schema_mapper, _) = DefaultSchemaAdapterFactory::from_schema(self.schema())
Member Author

@alamb While working on #15852, I found that the listing table does not actually have the issue described in #15689. All files here already share the same schema, because when the table is created, every fetched file has its schema reordered by the SchemaMapper; see here: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs#L206.

What we should fix instead is making the file schema match the listing table schema. Typically, if users specify partition columns, the table schema carries the extra partition column information, so I moved the mapper down into the compute_all_files_statistics method in commit 689fc66.
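
To make the file-schema versus table-schema mismatch concrete, here is a tiny Rust sketch, not the actual change in the commit. `MinMax` and `to_table_width` are hypothetical names, and the sketch assumes that partition columns come after the file columns in the table schema and that their statistics are simply left unknown.

```rust
/// Min/max of one column; `None` means the statistic is unknown.
type MinMax = Option<(i64, i64)>;

/// Extend file-level statistics (one entry per file-schema column) to the
/// width of the table schema by appending an unknown entry for every
/// partition column.
fn to_table_width(mut file_stats: Vec<MinMax>, num_partition_cols: usize) -> Vec<MinMax> {
    file_stats.extend(std::iter::repeat(None).take(num_partition_cols));
    file_stats
}
```

After this step the statistics vector has one entry per table column, so it can be indexed with table-schema column positions.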

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
Labels
bug Something isn't working common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

ListingTable statistics improperly merges statistics when files have different schemas
2 participants