Map file-level column statistics to the table-level #15865
@@ -33,6 +33,7 @@ use crate::execution::context::SessionState;
 use datafusion_catalog::TableProvider;
 use datafusion_common::{config_err, DataFusionError, Result};
 use datafusion_datasource::file_scan_config::{FileScanConfig, FileScanConfigBuilder};
+use datafusion_datasource::schema_adapter::DefaultSchemaAdapterFactory;
 use datafusion_expr::dml::InsertOp;
 use datafusion_expr::{utils::conjunction, Expr, TableProviderFilterPushDown};
 use datafusion_expr::{SortExpr, TableType};
@@ -1129,7 +1130,17 @@ impl ListingTable {
         let (file_group, inexact_stats) =
             get_files_with_limit(files, limit, self.options.collect_stat).await?;

-        let file_groups = file_group.split_files(self.options.target_partitions);
+        let mut file_groups = file_group.split_files(self.options.target_partitions);
+        let (schema_mapper, _) = DefaultSchemaAdapterFactory::from_schema(self.schema())
Comment on lines +1133 to +1134
@alamb While working on #15852, I found that the listing table does not actually have the issue described in #15689. All files here share the same schema, because when the table is created, all fetched files already use the … What we should fix is making the file schema match the listing table schema: usually, if users specify partition columns, the table schema carries the extra partition column info, so I moved the mapper down the …
+            .map_schema(self.file_schema.as_ref())?;
+        // Use schema_mapper to map each file-level column statistics to table-level column statistics
+        file_groups.iter_mut().try_for_each(|file_group| {
+            if let Some(stat) = file_group.statistics_mut() {
+                stat.column_statistics =
+                    schema_mapper.map_column_statistics(&stat.column_statistics)?;
+            }
+            Ok::<_, DataFusionError>(())
+        })?;
         compute_all_files_statistics(
             file_groups,
             self.schema(),
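To illustrate what the mapping step in this hunk does conceptually, here is a minimal, std-only sketch. It assumes that mapping is by column name and that a table column with no counterpart in the file gets "unknown" statistics. `ColStats`, `field_mappings`, and this `map_column_statistics` are hypothetical stand-ins written for this illustration; they are not the DataFusion types or the actual `SchemaMapper` implementation.

```rust
// Hypothetical, simplified stand-in for per-column statistics.
// `None` models "unknown"; this is NOT DataFusion's ColumnStatistics.
#[derive(Debug, Clone, PartialEq, Default)]
struct ColStats {
    null_count: Option<usize>,
    min: Option<i64>,
    max: Option<i64>,
}

/// For each table column, find the index of the file column with the
/// same name, or `None` if the file does not contain that column.
fn field_mappings(table_cols: &[&str], file_cols: &[&str]) -> Vec<Option<usize>> {
    table_cols
        .iter()
        .map(|t| file_cols.iter().position(|f| f == t))
        .collect()
}

/// Reorder file-level statistics into table-column order.
/// Columns missing from the file receive unknown statistics.
fn map_column_statistics(
    mappings: &[Option<usize>],
    file_stats: &[ColStats],
) -> Result<Vec<ColStats>, String> {
    mappings
        .iter()
        .map(|m| match m {
            Some(i) => file_stats
                .get(*i)
                .cloned()
                .ok_or_else(|| format!("no file statistics for column index {}", i)),
            None => Ok(ColStats::default()),
        })
        .collect()
}
```

With a table schema of `["a", "b", "part"]` and a file schema of `["b", "a"]`, the mappings are `[Some(1), Some(0), None]`, so the output has one entry per table column and the extra `part` column ends up with unknown statistics.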
I think using the default schema mapper makes sense for now / in this PR, but in general I think it would make sense to allow the user to provide their own schema mapping rules here (so a default value that is not `NULL` can be used, for example) via their own mapper.
However, we would have to add a schema mapper factory to `ListingOptions`:
https://github.com/apache/datafusion/blob/f1bbb1d636650c7f28f52dc507f36e64d71e1aa8/datafusion/core/src/datasource/listing/table.rs#L256-L255
(This is not a change needed for this PR; I just noticed it while reviewing.)
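As a rough sketch of the idea in this comment, a pluggable rule could decide which statistics a column gets when it is missing from a file. Everything below is hypothetical and std-only: `MissingColumnRule`, `NullFill`, `ConstantFill`, and `map_with_rule` are names invented for this illustration and are not part of DataFusion. The statistics chosen for a constant fill (no nulls, min equal to max) are an assumption about what such a rule would report.

```rust
// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    null_count: Option<usize>,
    min: Option<i64>,
    max: Option<i64>,
}

/// Hypothetical user-supplied rule for columns absent from a file.
trait MissingColumnRule {
    fn stats_for_missing(&self, column: &str) -> ColStats;
}

/// Default-like behavior: nothing is known about a missing column.
struct NullFill;

impl MissingColumnRule for NullFill {
    fn stats_for_missing(&self, _column: &str) -> ColStats {
        ColStats { null_count: None, min: None, max: None }
    }
}

/// Custom behavior: a missing column is filled with one constant value,
/// so (by assumption) it has no nulls and min == max == that value.
struct ConstantFill {
    value: i64,
}

impl MissingColumnRule for ConstantFill {
    fn stats_for_missing(&self, _column: &str) -> ColStats {
        ColStats {
            null_count: Some(0),
            min: Some(self.value),
            max: Some(self.value),
        }
    }
}

/// Map file-level statistics to table-column order, delegating to `rule`
/// for any table column that has no statistics in the file.
fn map_with_rule(
    table_cols: &[&str],
    file_cols: &[&str],
    file_stats: &[ColStats],
    rule: &dyn MissingColumnRule,
) -> Vec<ColStats> {
    table_cols
        .iter()
        .map(|t| match file_cols.iter().position(|f| f == t) {
            Some(i) if i < file_stats.len() => file_stats[i].clone(),
            _ => rule.stats_for_missing(t),
        })
        .collect()
}
```

Passing `&NullFill` mirrors the unknown-statistics behavior, while `&ConstantFill { value: 0 }` shows how a non-NULL default could yield tighter statistics for the filled column.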
Makes sense, I filed an issue to track this: #15889