Per file filter evaluation #15057

Conversation
The example is not working yet. It gets
Ok, the example is now working, and I think the overall approach is interesting, but I don't think it's quite close to a workable solution.
Force-pushed from a21f316 to 3bb4c36.
The example is now working and even does stats pruning of shredded columns 🚀
```rust
let parquet_source = ParquetSource::default()
    .with_predicate(self.schema.clone(), filter)
    .with_pushdown_filters(true)
    .with_filter_expression_rewriter(Arc::new(StructFieldRewriter) as _);
```
This is the API for users to attach this rewriter to their plan.
```rust
struct StructFieldRewriterImpl {
    file_schema: SchemaRef,
}

impl TreeNodeRewriter for StructFieldRewriterImpl {
    type Node = Arc<dyn PhysicalExpr>;

    fn f_down(
        &mut self,
        expr: Arc<dyn PhysicalExpr>,
    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
        if let Some(scalar_function) = expr.as_any().downcast_ref::<ScalarFunctionExpr>() {
            if scalar_function.name() == "get_field" {
                if scalar_function.args().len() == 2 {
                    // First argument is the column, second argument is the field name
                    let column = scalar_function.args()[0].clone();
                    let field_name = scalar_function.args()[1].clone();
                    if let Some(literal) =
                        field_name.as_any().downcast_ref::<expressions::Literal>()
                    {
                        if let Some(field_name) = literal.value().try_as_str().flatten() {
                            if let Some(column) =
                                column.as_any().downcast_ref::<expressions::Column>()
                            {
                                let column_name = column.name();
                                let source_field =
                                    self.file_schema.field_with_name(column_name)?;
                                let expected_flattened_column_name =
                                    format!("_{}.{}", column_name, field_name);
                                // Check if the flattened column exists in the file schema
                                // and has the same type
                                if let Ok(shredded_field) = self
                                    .file_schema
                                    .field_with_name(&expected_flattened_column_name)
                                {
                                    if source_field.data_type() == shredded_field.data_type()
                                    {
                                        // Rewrite the expression to use the flattened column
                                        let rewritten_expr = expressions::col(
                                            &expected_flattened_column_name,
                                            &self.file_schema,
                                        )?;
                                        return Ok(Transformed::yes(rewritten_expr));
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }

        Ok(Transformed::no(expr))
    }
}
```
Example implementation of a rewriter.
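For reference, a minimal sketch of driving this rewriter with the `TreeNode` API (assuming `predicate: Arc<dyn PhysicalExpr>` and `file_schema: SchemaRef` are in scope; this usage is my illustration, not code from the PR):

```rust
use datafusion_common::tree_node::TreeNode;

// Apply the rewriter to a predicate; `Transformed::data` holds the
// (possibly rewritten) expression.
let mut rewriter = StructFieldRewriterImpl {
    file_schema: file_schema.clone(),
};
let rewritten = predicate.rewrite(&mut rewriter)?.data;
```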
@alamb I think this is ready for a first round of review when you have a chance!
The main issue I've found with this approach is marking filters as exact (see datafusion/datafusion/datasource-parquet/src/row_filter.rs, lines 333 to 336 in 9382add).
Okay, I think I can answer my own question: https://github.com/pydantic/datafusion/blob/38356998059a2d08113401ea8111f238899ab0b8/datafusion/core/src/datasource/listing/table.rs#L961-L995

Based on this it seems like it's safe to mark filters as exact if they are getting pushed down 😄
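For context, a minimal sketch (an assumed shape, not the code at the link above) of how a `TableProvider` can report pushed-down filters as exact, which lets the planner skip re-applying them in a `FilterExec`; `pushdown_enabled` is a hypothetical flag standing in for the real configuration check:

```rust
use datafusion_common::Result;
use datafusion_expr::{Expr, TableProviderFilterPushDown};

/// Sketch: report every filter as `Exact` when pushdown is enabled, so the
/// planner does not re-apply them after the scan.
fn supports_filters_pushdown(
    filters: &[&Expr],
    pushdown_enabled: bool,
) -> Result<Vec<TableProviderFilterPushDown>> {
    let answer = if pushdown_enabled {
        TableProviderFilterPushDown::Exact
    } else {
        TableProviderFilterPushDown::Inexact
    };
    Ok(vec![answer; filters.len()])
}
```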
Force-pushed from 008eba0 to 1878f59.
Okay folks, sorry for the churn; I thought this was in a better state than it ended up being. I've now reworked it to minimize the diff and make sure all existing tests pass. I'm going to add tests for the new functionality now to complement the example.
```rust
// Note about schemas: we are actually dealing with _4_ different schemas here:
// - The table schema as defined by the TableProvider. This is what the user sees,
//   what they get when they `SELECT * FROM table`, etc.
// - The "virtual" file schema: the table schema minus any hive partition columns.
//   This is what the file schema is coerced to.
// - The physical file schema: the schema as defined by the parquet file, i.e. what
//   the parquet file actually contains.
// - The filter schema: a hybrid of the virtual file schema and the physical file
//   schema. If a filter is rewritten to reference columns that are in the physical
//   file schema but not the virtual file schema, we need to add those columns to the
//   filter schema so that the filter can be evaluated. This schema is generated by
//   taking any columns from the virtual file schema that are referenced by the filter
//   and adding any columns from the physical file schema that are referenced by the
//   filter but are not in the virtual file schema. Columns from the virtual file
//   schema are added in the order they appear in the virtual file schema; columns
//   from the physical file schema are always appended at the end, in the order they
//   appear in the physical file schema.
//
// I think it might be wise to do some renaming of parameters where possible, e.g.
// rename `file_schema` to `table_schema_without_partition_columns` and
// `physical_file_schema`, or something like that.
```
This is an interesting bit to ponder.
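A minimal sketch of the hybrid filter-schema construction described in that comment (the function and parameter names here are mine, not the PR's; `referenced` is the set of column names the possibly-rewritten filter uses):

```rust
use std::collections::HashSet;
use std::sync::Arc;
use arrow_schema::{Field, Schema, SchemaRef};

/// Sketch: build the hybrid "filter schema" from the comment above.
fn build_filter_schema(
    virtual_file_schema: &SchemaRef,
    physical_file_schema: &SchemaRef,
    referenced: &HashSet<String>,
) -> SchemaRef {
    let mut fields: Vec<Field> = Vec::new();
    // Referenced columns from the virtual file schema, in schema order.
    for field in virtual_file_schema.fields() {
        if referenced.contains(field.name()) {
            fields.push(field.as_ref().clone());
        }
    }
    // Referenced columns that exist only in the physical file schema
    // (e.g. shredded variant columns), appended at the end in schema order.
    for field in physical_file_schema.fields() {
        if referenced.contains(field.name())
            && virtual_file_schema.field_with_name(field.name()).is_err()
        {
            fields.push(field.as_ref().clone());
        }
    }
    Arc::new(Schema::new(fields))
}
```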
```rust
/// Rewrite an expression to take into account this file's particular schema.
/// This can be used to evaluate expressions against shredded variant columns
/// or columns that pre-compute expressions (e.g. `day(timestamp)`).
pub trait FileExpressionRewriter: Debug + Send + Sync {
    /// Rewrite an expression in the context of a file schema.
    fn rewrite(
        &self,
        file_schema: SchemaRef,
        expr: Arc<dyn PhysicalExpr>,
    ) -> Result<Arc<dyn PhysicalExpr>>;
}
```
Note: if users need the `table_schema`, they can bind that inside of `TableProvider::scan`.
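As a usage sketch of that suggestion (`MyRewriter` is a hypothetical name, not part of the PR), a rewriter that captures the table schema at `scan` time might look like this, with the actual rewriting logic elided:

```rust
/// Sketch: bind the table schema into the rewriter when constructing it in
/// `TableProvider::scan`. A real implementation would walk `expr` (e.g. with
/// a TreeNodeRewriter like StructFieldRewriterImpl above), comparing
/// `file_schema` against `self.table_schema`.
#[derive(Debug)]
struct MyRewriter {
    table_schema: SchemaRef,
}

impl FileExpressionRewriter for MyRewriter {
    fn rewrite(
        &self,
        file_schema: SchemaRef,
        expr: Arc<dyn PhysicalExpr>,
    ) -> Result<Arc<dyn PhysicalExpr>> {
        // No-op placeholder: return the expression unchanged.
        let _ = (&self.table_schema, &file_schema);
        Ok(expr)
    }
}
```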
I will try and give this a look over the next few days.
Force-pushed from a310528 to 59ec143.
I would like to resume this work. Some thoughts: should the rewrite happen via a new trait, as I'm currently doing, or should we add a method to `PhysicalExpr`?

I suspect the hard bit with this approach will be edge cases: what if a filter cannot adapt itself to the file schema, but we could cast the column to make it work? I'm thinking something like a UDF that only accepts certain argument types.

I think @jayzhan211 proposed something similar in https://github.com/apache/datafusion/pull/15685/files#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42. @alamb any thoughts?
This method is too general, and it is unclear what we need to do with the provided schema for each `PhysicalExpr`; it is not a good idea.
I think it is unavoidable that we need to cast the columns to be able to evaluate the filter. Another question: isn't the filter created based on the table schema? The batch is then read with the file schema, cast to the table schema, and evaluated by the filter. What we could do is rewrite the filter based on the file schema. Assume we have
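A sketch of the cast-based rewrite being described (assuming `CastExpr` from `datafusion_physical_expr::expressions`; the helper name is mine):

```rust
use std::sync::Arc;
use arrow_schema::DataType;
use datafusion_physical_expr::expressions::{CastExpr, Column};
use datafusion_physical_expr::PhysicalExpr;

/// Sketch: when a column's type differs between the file schema and the
/// table schema, wrap the file-schema column in a cast to the table type so
/// a filter built against the table schema can still be evaluated.
fn cast_column_to_table_type(
    column: Column,
    table_type: DataType,
) -> Arc<dyn PhysicalExpr> {
    // `None` uses the default cast options.
    Arc::new(CastExpr::new(Arc::new(column), table_type, None))
}
```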
Yes, this is exactly the case.
Yes, that is exactly what I am proposing above; thank you for giving a more concrete example.

The other point is whether we can use this same mechanism to handle shredding for the variant type. In other words, can we "optimize" a `get_field` call on a variant column into a direct reference to a shredded column?

And if that all makes sense... how do we do those optimizations? Is it something like an optimizer that has to downcast match on the expressions, or do we add methods to `PhysicalExpr` for each expression to describe how it handles this behavior?
Probably.

This is likely only applied to the parquet filter, so we can rewrite the filter when we know the filter + file_schema + table_schema (probably
Yes agreed, that's basically what's in this PR currently: a custom trait to implement an optimizer pass with all of that information available. My questions are:
Rewrite with file schema is specialized to the filter, if you add

Makes sense to me if we have many rules inside the
Force-pushed from a241488 to 32de4dd.
```rust
/// Schema of the file as far as the rest of the system is concerned.
pub logical_file_schema: SchemaRef,
```
I think this rename is worth it: there's been constant confusion, even amongst maintainers, about this. And it is only public for internal use.
A step towards #14993.
I decided to tackle filter pushdown first because it's less work: the idea is that we can experiment with filters here, then later re-use `FileExpressionRewriter` to do projection pushdown once we flesh out the details and apply learnings from this piece of work.