Implement Parquet filter pushdown via new filter pushdown APIs #15769
Changes to the `ListingTable` implementation:
```diff
@@ -24,9 +24,7 @@ use std::{any::Any, str::FromStr, sync::Arc};
 use crate::datasource::{
     create_ordering,
-    file_format::{
-        file_compression_type::FileCompressionType, FileFormat, FilePushdownSupport,
-    },
+    file_format::{file_compression_type::FileCompressionType, FileFormat},
     physical_plan::FileSinkConfig,
 };
 use crate::execution::context::SessionState;
```
```diff
@@ -35,22 +33,19 @@ use datafusion_common::{config_err, DataFusionError, Result};
 use datafusion_datasource::file_scan_config::{FileScanConfig, FileScanConfigBuilder};
 use datafusion_datasource::schema_adapter::DefaultSchemaAdapterFactory;
 use datafusion_expr::dml::InsertOp;
-use datafusion_expr::{utils::conjunction, Expr, TableProviderFilterPushDown};
+use datafusion_expr::{Expr, TableProviderFilterPushDown};
 use datafusion_expr::{SortExpr, TableType};
 use datafusion_physical_plan::empty::EmptyExec;
 use datafusion_physical_plan::{ExecutionPlan, Statistics};

 use arrow::datatypes::{DataType, Field, Schema, SchemaBuilder, SchemaRef};
 use datafusion_common::{
-    config_datafusion_err, internal_err, plan_err, project_schema, Constraints,
-    SchemaExt, ToDFSchema,
+    config_datafusion_err, internal_err, plan_err, project_schema, Constraints, SchemaExt,
 };
 use datafusion_execution::cache::{
     cache_manager::FileStatisticsCache, cache_unit::DefaultFileStatisticsCache,
 };
-use datafusion_physical_expr::{
-    create_physical_expr, LexOrdering, PhysicalSortRequirement,
-};
+use datafusion_physical_expr::{LexOrdering, PhysicalSortRequirement};

 use async_trait::async_trait;
 use datafusion_catalog::Session;
```
```diff
@@ -941,19 +936,6 @@ impl TableProvider for ListingTable {
             None => {} // no ordering required
         };

-        let filters = match conjunction(filters.to_vec()) {
-            Some(expr) => {
-                let table_df_schema = self.table_schema.as_ref().clone().to_dfschema()?;
-                let filters = create_physical_expr(
-                    &expr,
-                    &table_df_schema,
-                    state.execution_props(),
-                )?;
-                Some(filters)
-            }
-            None => None,
-        };
-
         let Some(object_store_url) =
             self.table_paths.first().map(ListingTableUrl::object_store)
         else {
```
```diff
@@ -978,7 +960,6 @@ impl TableProvider for ListingTable {
                     .with_output_ordering(output_ordering)
                     .with_table_partition_cols(table_partition_cols)
                     .build(),
-                filters.as_ref(),
             )
             .await
     }
```
```diff
@@ -1002,18 +983,6 @@ impl TableProvider for ListingTable {
                 return Ok(TableProviderFilterPushDown::Exact);
             }

-            // if we can't push it down completely with only the filename-based/path-based
-            // column names, then we should check if we can do parquet predicate pushdown
-            let supports_pushdown = self.options.format.supports_filters_pushdown(
-                &self.file_schema,
-                &self.table_schema,
-                &[filter],
-            )?;
-
-            if supports_pushdown == FilePushdownSupport::Supported {
-                return Ok(TableProviderFilterPushDown::Exact);
-            }
-
             Ok(TableProviderFilterPushDown::Inexact)
         })
         .collect()
```

Review thread on lines -1005 to -1015:

- The point of this PR is that this moves from being something specialized that …
- I have one question: aren't we expecting/preparing for people to use ListingTable if they read Parquet files? Are we eventually planning to remove all format-specific handling? Or is this a case only for filter pushdown?
- If that's the case, why don't we fully remove …
- I think many users of DataFusion (based on our usage, talks I've seen, and examples we have) use custom … I would keep …
- We can justify implementing other TableProviders for Parquet, but I still cannot understand why we need to degrade the capabilities of our ListingTable. Isn't it always better to prune/simplify things at the highest level possible?
- For what it is worth, we (InfluxData) don't use ListingTable to read parquet files; instead we provide our own equivalent and create the DataSourceExecs directly.
- Yes, I think that is important too -- I don't think we should be removing any APIs from ListingTable.
- I don't think this degrades the capabilities of the current listing table. I think the only implications are for anyone who used a custom …
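To make the new classification concrete, here is a rough sketch of the shape `supports_filters_pushdown` takes after this diff. It is not the verbatim implementation: `expr_applicable_for_cols` is the helper DataFusion's listing-table code uses for the partition-column check (its import path is elided here), and `partition_column_names` is an illustrative stand-in for the names derived from the table's partition columns.

```rust
use datafusion_common::Result;
use datafusion_expr::{Expr, TableProviderFilterPushDown};

// Sketch of the post-change logic; `expr_applicable_for_cols` lives in
// DataFusion's listing-table helpers and is assumed to be in scope.
fn classify_filters(
    partition_column_names: &[&str],
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    filters
        .iter()
        .map(|filter| {
            // Filters that reference only partition/path columns are applied
            // exactly while listing files, so no FilterExec is needed above.
            if expr_applicable_for_cols(partition_column_names, filter) {
                return Ok(TableProviderFilterPushDown::Exact);
            }
            // Everything else is now Inexact: the Parquet source may still
            // prune with the predicate via the new filter pushdown APIs, but
            // the filter node stays in the plan above the scan for safety.
            Ok(TableProviderFilterPushDown::Inexact)
        })
        .collect()
}
```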
Changes to the Parquet round-trip test helpers:
```diff
@@ -54,6 +54,7 @@ mod tests {
     use datafusion_datasource::file_scan_config::FileScanConfigBuilder;
     use datafusion_datasource::source::DataSourceExec;

+    use datafusion_datasource::file::FileSource;
     use datafusion_datasource::{FileRange, PartitionedFile};
     use datafusion_datasource_parquet::source::ParquetSource;
     use datafusion_datasource_parquet::{
```
```diff
@@ -139,7 +140,7 @@ mod tests {
             self.round_trip(batches).await.batches
         }

-        fn build_file_source(&self, file_schema: SchemaRef) -> Arc<ParquetSource> {
+        fn build_file_source(&self, file_schema: SchemaRef) -> Arc<dyn FileSource> {
             // set up predicate (this is normally done by a layer higher up)
             let predicate = self
                 .predicate
```
```diff
@@ -148,7 +149,7 @@ mod tests {

             let mut source = ParquetSource::default();
             if let Some(predicate) = predicate {
-                source = source.with_predicate(Arc::clone(&file_schema), predicate);
+                source = source.with_predicate(predicate);
             }

             if self.pushdown_predicate {
```

Review comment on lines -151 to +152:

- This seemed like an easy win since I was able to just change this so that the schema is always passed in by the …
```diff
@@ -161,14 +162,14 @@ mod tests {
                 source = source.with_enable_page_index(true);
             }

-            Arc::new(source)
+            source.with_schema(Arc::clone(&file_schema))
         }

         fn build_parquet_exec(
             &self,
             file_schema: SchemaRef,
             file_group: FileGroup,
-            source: Arc<ParquetSource>,
+            source: Arc<dyn FileSource>,
         ) -> Arc<DataSourceExec> {
             let base_config = FileScanConfigBuilder::new(
                 ObjectStoreUrl::local_filesystem(),
```
Review comment:

This change makes sense to me -- when @itsjunetime originally implemented this code, there was some complexity because there was no way to do filter pushdown in ExecutionPlans, so in my mind this approach was a (clever) workaround. The comments even hint that this is a parquet-specific special case.

I think the new pattern of handling predicates more generally in this PR is cleaner and will support more cases. Since this code is only currently executed …

Perhaps @cisaacson has some other thoughts.
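For downstream users who, like the reviewer above, skip `ListingTable` and build scans directly, a minimal sketch of the post-change flow follows. The argument order of `FileScanConfigBuilder::new` and the `DataSourceExec::from_data_source` constructor are inferred from the surrounding test code rather than stated in this diff, so treat them as assumptions.

```rust
use std::sync::Arc;

use arrow::datatypes::SchemaRef;
use datafusion_datasource::file::FileSource;
use datafusion_datasource::file_scan_config::FileScanConfigBuilder;
use datafusion_datasource::source::DataSourceExec;
use datafusion_datasource_parquet::source::ParquetSource;
use datafusion_execution::object_store::ObjectStoreUrl;
use datafusion_physical_expr::PhysicalExpr;

// Assumed inputs: a physical predicate and the file schema, prepared by the
// caller (as the test helpers above do).
fn build_scan(
    predicate: Arc<dyn PhysicalExpr>,
    file_schema: SchemaRef,
) -> Arc<DataSourceExec> {
    // predicate first, schema attached once at the end (the new API shape)
    let source: Arc<dyn FileSource> = ParquetSource::default()
        .with_predicate(predicate)
        .with_schema(Arc::clone(&file_schema));

    let config = FileScanConfigBuilder::new(
        ObjectStoreUrl::local_filesystem(), // as in the tests above
        Arc::clone(&file_schema),           // assumed second argument
        source,
    )
    .build();

    DataSourceExec::from_data_source(config)
}
```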