Skip to content

Test DataFusion 45.0.0 with Sail #14408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shehabgamin opened this issue Feb 2, 2025 · 12 comments
Closed

Test DataFusion 45.0.0 with Sail #14408

shehabgamin opened this issue Feb 2, 2025 · 12 comments
Assignees
Labels
enhancement New feature or request

Comments

@shehabgamin
Copy link
Contributor

shehabgamin commented Feb 2, 2025

Is your feature request related to a problem or challenge?

We would like to make sure that DataFusion 45.0.0 works with Sail before creating the release candidate.

Describe the solution you'd like

All CI tests should pass in a PR that upgrades to DataFusion 45.

Current PR:
lakehq/sail#365

Current issues being investigated:

Test 1: Commit 68e372f (DataFusion main)

1. Derived TPC-DS Query 66 fails on Sail, when it previously did not.

Exception: Error executing query q66 with error: Internal error:
Physical input schema should be the same as the one converted from logical input schema. 
Differences:
 - field nullability at index 7 [#98]: (physical) false vs (logical) true.

2. Another regression is that implementing the is_nullable function in the ScalarUDFImpl trait no longer works. For example:

impl ScalarUDFImpl for SparkArray {
    ...
    fn is_nullable(&self, _args: &[Expr], _schema: &dyn ExprSchema) -> bool {
        false
    }
    ...
}

Digging into the code, I see that it's deprecated (it should still work even though it's deprecated). What's strange, however, is that the deprecation warning is not propagating to me as a downstream user. I only found this out due to a failed test.

Test 2: Commit 26058ac (DataFusion 45.0.0-rc1)

1. Derived TPC-DS Query 66 still failing, with same error.

2. Same issue still. I ended up removing is_nullable and implementing return_type_from_args. This seems to be a breaking change rather than is_nullable simply being deprecated.

Describe alternatives you've considered

No response

Additional context

No response

@shehabgamin shehabgamin added the enhancement New feature or request label Feb 2, 2025
@shehabgamin
Copy link
Contributor Author

take

@Omega359
Copy link
Contributor

Omega359 commented Feb 2, 2025

I'd be curious if this could be narrowed down via git bisect. Probably would need a small script to do it and a quick test case

@findepi
Copy link
Member

findepi commented Feb 2, 2025

Physical input schema should be the same as the one converted from logical input schema.
Differences:

I think this should be allowed, especially since we run further optimizations on the physical plans.
For example a query

SELECT nullable_value FROM ...
UNOIN ALL
SELECT non_null_value FROM ...

may have it's first union branch eventually eliminated at some point, and this may happen on the physical stage. As a result, nullable field becomes non-null.

@alamb
Copy link
Contributor

alamb commented Feb 3, 2025

In terms of TPC-DS errors, I think we have tests for those queries as part of the regression tests

https://github.com/apache/datafusion/blob/main/datafusion/core/tests/tpc-ds/66.sql

Seems to pass on main

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo test --test tpcds_planning -- 66
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running tests/tpcds_planning.rs (target/debug/deps/tpcds_planning-2ddd48cf45e9df2d)

running 2 tests
test tpcds_logical_q66 ... ok
test tpcds_physical_q66 ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 196 filtered out; finished in 0.23s

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$

Maybe we can look into what is different in sail 🤔

@shehabgamin
Copy link
Contributor Author

@alamb This doesn't surprise me considering that Sail doesn't use SQLToRel. My concern isn't necessarily with TPC-DS 66, but rather the implications of this error. There is either some bug or break that has been surfaced by this test and we don't know the impact that this has. IMO it would be risky to do this release without further debugging this. I am available this evening to debug though.

@shehabgamin
Copy link
Contributor Author

Not sure if this is the right file to be looking at, but it's where the error comes from (Physical input schema should be the same as the one converted from logical input schema).
There are only a handful of PRs that have touched /datafusion/core/src/physical_planner.rs since 44:

  1. chore: deprecate ValuesExec in favour of MemoryExec #14032
  2. Chore: refactor DataSink traits to avoid duplication #14121
  3. NestedLoopJoin Projection Pushdown #14120
  4. feat: Use SchemaRef in JoinFilter #14182
  5. Interface for physical plan invariant checking. #13986

@alamb
Copy link
Contributor

alamb commented Feb 6, 2025

Not sure if this is the right file to be looking at, but it's where the error comes from (Physical input schema should be the same as the one converted from logical input schema). There are only a handful of PRs that have touched /datafusion/core/src/physical_planner.rs since 44:

  1. chore: deprecate ValuesExec in favour of MemoryExec #14032
  2. Chore: refactor DataSink traits to avoid duplication #14121
  3. NestedLoopJoin Projection Pushdown #14120
  4. feat: Use SchemaRef in JoinFilter #14182
  5. Interface for physical plan invariant checking. #13986

It seems like the issue may be from 5 -- which now adds additional checks to catch errors earlier rather than later (for example now the error will happen during planning rather than when RecordBatchs are created during execution)

So the question then in my mind is if the error is due to:

  1. Some code in Sail (and we just happen now to catch the error earlier)
  2. Some code in DataFusion

I think this should be allowed, especially since we run further optimizations on the physical plans.

@findepi do you mean we should relax the check to ignore nullable / non nullable annotations? -- I think that would probably be ok too.

@wiedld perhaps you have some comments here

I think in general the split between what a Phyiscal optimizer is allowed to do / not (the "invariants" are not very clear). #13986 took a first step at formalizing them but maybe we have found other corner cases where it is not reasonable

@findepi
Copy link
Member

findepi commented Feb 6, 2025

@findepi do you mean we should relax the check to ignore nullable / non nullable annotations? -- I think that would probably be ok too.

yes, i think so -- #14519

@alamb
Copy link
Contributor

alamb commented Feb 7, 2025

From my perspective it seems like all the tests in DataFusion are passing, so clearly we have some sort of test coverage / gap

@shehabgamin I wonder if this would be a good time to add more tests that cover your usecase in Sail (or perhaps that would be part of adding Spark functions as discussed in #5600) 🤔

@shehabgamin
Copy link
Contributor Author

@findepi do you mean we should relax the check to ignore nullable / non nullable annotations? -- I think that would probably be ok too.

yes, i think so -- #14519

@findepi @alamb I Just confirmed that this fixes TPC-DS 66 on Sail! Unfortunately this fix didn't make it into the 45 release.

From my perspective it seems like all the tests in DataFusion are passing, so clearly we have some sort of test coverage / gap

@shehabgamin I wonder if this would be a good time to add more tests that cover your usecase in Sail (or perhaps that would be part of adding Spark functions as discussed in #5600) 🤔

@alamb I agree this would add value. While Spark function test coverage is a starting point, I believe what we're looking for here is broader. I'm not entirely sure about the best approach yet, but I suspect Sail's pattern of catching these bugs/breaks stems from not using DataFusion out-of-the-box.

For example, in the TPC-DS 66 case, adding a DataFusion test wouldn't have helped since there was already a test in place. Maybe one approach could be exporting Logical Plans from Sail's tests and using their evaluation as DataFusion test cases? In the case of TPC-DS 66, we would take the Logical Plan that Sail produced and make the evaluation of that plan a DataFusion test case.

P.S. I'm recovering from a harsh cold so sorry if I'm not making much sense lol

@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

@findepi do you mean we should relax the check to ignore nullable / non nullable annotations? -- I think that would probably be ok too.

yes, i think so -- #14519

@findepi @alamb I Just confirmed that this fixes TPC-DS 66 on Sail! Unfortunately this fix didn't make it into the 45 release.

From my perspective it seems like all the tests in DataFusion are passing, so clearly we have some sort of test coverage / gap
@shehabgamin I wonder if this would be a good time to add more tests that cover your usecase in Sail (or perhaps that would be part of adding Spark functions as discussed in #5600) 🤔

@alamb I agree this would add value. While Spark function test coverage is a starting point, I believe what we're looking for here is broader. I'm not entirely sure about the best approach yet, but I suspect Sail's pattern of catching these bugs/breaks stems from not using DataFusion out-of-the-box.

Yes, we have a similar issue downstream in InfluxDB IOx as well (we make extensive use of the DataFusion APIs and so often find bugs that are not covered by existing tests in DataFusion)

What we have tried to do is to add the appropriate test coverage for such issues when we find them, though what "appropriate" means varies case by case

For example, in the TPC-DS 66 case, adding a DataFusion test wouldn't have helped since there was already a test in place. Maybe one approach could be exporting Logical Plans from Sail's tests and using their evaluation as DataFusion test cases? In the case of TPC-DS 66, we would take the Logical Plan that Sail produced and make the evaluation of that plan a DataFusion test case.

That is one approach for sure.

If there is a broader pattern of how Sail is creating / using Logical Plans too that we could get tested that would also help a lot.

P.S. I'm recovering from a harsh cold so sorry if I'm not making much sense lol

@alamb
Copy link
Contributor

alamb commented Feb 13, 2025

Since we have released DF 45 I think we can claim the testing is complete

@alamb alamb closed this as completed Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants