Evaluate filter pushdown against the physical schema for performance and correctness #15780
Comments
I would love to work on this task.
I can confirm this is currently being done at the LogicalPlan level. I'd say the first step is to understand how it happens there, then check whether something similar exists for PhysicalExpr, and if it doesn't, create it.
To understand how this happens in the logical optimizer, as part of the …
I think the first thing to do would be to try and write some tests that show the error happening. Perhaps we could use the existing statistics. Here is an example test that shows how to use those statistics: https://github.com/search?q=repo%3Aapache%2Fdatafusion%20predicate_evaluation_errors&type=code
@alamb there is no error AFAIK. It currently works, but it works by casting the data to match the types of the table. The point I'm making is that we could instead cast the expression to match the type of the file, possibly saving a lot of copying / blowing up dictionaries.
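A tiny sketch of the two shapes, written with DataFusion's logical expression builders for readability (the rewrite under discussion would actually operate on `PhysicalExpr`s; the column `c` and the `Int8`/`Int32` types follow the example in the issue body below):

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::logical_expr::{cast, col, lit};

fn main() {
    // Today: the table-level plan casts the *column*, touching every row
    let today = cast(col("c"), DataType::Int32).eq(lit(1i32));

    // Proposed: cast the *literal* to the file's type once, at plan time,
    // so the column data is compared at its native type
    let proposed = col("c").eq(lit(1i8));

    println!("{today}\n{proposed}");
}
```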
As discussed a bit in #16086 (comment), there is a fundamental problem: all of the predicates are planned at the table level, so, for example, a predicate ends up typed against the table schema even though the files it runs against may store other types. My feeling is that we eventually have to do this for performance / reducing extra work, to add important new features, and for correctness reasons, but evidently my initial attempt was too naive, so we had to revert it.
If I said exactly that, I stand corrected. Casts are in the same category as function calls -- the optimizer may reorganize or replace function calls with other expressions as long as they are equivalent (and are believed to "be better"). Casts can be removed or replaced the same way (again: as long as the resulting expression is well formed and equivalent). From the issue description:
Where does Int8 come back? Anyway, as the example shows, two different files may have two different internal representations for the same SQL-level column. I.e., the table may declare Int64, but the file may contain Int32 or Int16. (And this is not limited to the various Int widths.)
Thanks for correcting me! That's the sort of distinction I was lacking and knew you'd be able to make. It's a helpful way to think about it.
That's the point: we need to do logic similar to what the logical optimizer already does. Basically, I think we all need to agree that this complexity is the right way to go, and then agree on what to do in the different scenarios.
It's very natural to think about file-level vs. table-level as the same thing as SQL coercions, but there is an important distinction. SQL has its own semantics, and the table provider has its own semantics. From Parquet to the table level, the semantics of the operation are defined by a read. What happens if the file has …
What about the situation described above? What happens now is basically that both columns get cast to the table type.
I predict that the 'extra fields' use case is going to be the one that is most important and will drive this feature (as performance will be disastrous without it).
Describe the bug
Consider the following test:
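A minimal sketch of the kind of test meant here, assuming a Parquet file that physically stores `c` as `Int8` registered under a table schema that declares `Int32` (the path, table name, and setup details are illustrative, not the original test):

```rust
use std::fs::File;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int8Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Write a Parquet file whose physical schema stores `c` as Int8
    let file_schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Int8, false)]));
    let batch = RecordBatch::try_new(
        Arc::clone(&file_schema),
        vec![Arc::new(Int8Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;
    let mut writer = ArrowWriter::try_new(File::create("/tmp/t.parquet")?, file_schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Register the table with a wider declared type: Int32
    let table_schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Int32, false)]));
    let ctx = SessionContext::new();
    ctx.register_listing_table(
        "t",
        "/tmp/t.parquet",
        ListingOptions::new(Arc::new(ParquetFormat::default())),
        Some(table_schema),
        None,
    )
    .await?;

    // The literal is coerced to Int32 to match the table schema, so the
    // filter that reaches ParquetSource is an Int32 comparison even though
    // the file stores Int8
    ctx.sql("SELECT * FROM t WHERE c = 1").await?.show().await?;
    Ok(())
}
```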
At some point DataFusion optimizes the `Int8` filter by casting the filter to `Int32` (matching the table schema, thus avoiding having to cast the column). So when the filter gets into `ParquetSource` it's an `Int32` filter. But when we read the file schema, the column is actually `Int8`! Since we now build pruning predicates, etc. on a per-file basis using the physical file schema, this can introduce casting of the data from `Int8` to `Int32`, which is unnecessary because (1) we could cast the filter instead, which would be much cheaper, and (2) if the file type and filter type were both `Int8` or `Int16` in this example (as might happen if one changes the table schema but not old data or old queries), we would actually be closer to the original intent of the query.

To be clear, I do not mean that this is a new regression. I believe this has always been the case, but now we can actually fix it, whereas before we could not.
This applies not only to stats filtering (where the impact is likely negligible) but also to predicate pushdown, where I expect the impact may be much larger, especially for cases where we never end up materializing the columns (and thus don't have to cast them to the table's data type at all). I don't know that any benchmark measures this case at the moment, though.
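To make the cost difference concrete, here is a small sketch using arrow kernels directly (not DataFusion's actual pruning or filtering code) contrasting the two strategies on the example types:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array, Int8Array};
use datafusion::arrow::compute::cast;
use datafusion::arrow::compute::kernels::cmp::eq;
use datafusion::arrow::datatypes::DataType;
use datafusion::arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // The column as it is physically stored in the file
    let col: ArrayRef = Arc::new(Int8Array::from(vec![1, 2, 3]));

    // (a) today: widen every value of the column to Int32, then compare
    //     against the Int32 literal from the table-level plan
    let widened = cast(&col, &DataType::Int32)?;
    let today = eq(&widened, &Int32Array::new_scalar(1))?;

    // (b) proposed: cast the literal down once and compare at Int8,
    //     never touching the column data
    let proposed = eq(&col, &Int8Array::new_scalar(1))?;

    assert_eq!(today, proposed);
    Ok(())
}
```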
To resolve this I think we just need to call `optimize_casts(physical_expr, physical_file_schema)` (a made-up function), but I don't know where or how `optimize_casts` exists (I feel like it must already exist, maybe at the logical expr level?). Does anyone know where this exists?