-
Notifications
You must be signed in to change notification settings - Fork 255
Support retrieving the latest Iceberg table on table scan #1297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support retrieving the latest Iceberg table on table scan #1297
Conversation
* Allow resolving the current snapshot ID to use on a scan from a callback function * Use table_fn * Fix * Just pass a reference to the catalog * make public * Just take table ident * lint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall, just some small things. Thanks @phillipleblanc!
@@ -43,6 +43,8 @@ pub struct IcebergTableProvider { | |||
snapshot_id: Option<i64>, | |||
/// A reference-counted arrow `Schema`. | |||
schema: ArrowSchemaRef, | |||
/// A reference to the catalog that this table provider belongs to. | |||
catalog: Option<Arc<dyn Catalog>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice, we need this for future transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make it required? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it might make sense. I think the use-cases for a static table are fairly limited, and that could be a separate TableProvider. But it would be a breaking change for all users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think breaking change should still be fine at this early stage
@@ -130,8 +131,19 @@ impl TableProvider for IcebergTableProvider { | |||
filters: &[Expr], | |||
_limit: Option<usize>, | |||
) -> DFResult<Arc<dyn ExecutionPlan>> { | |||
// Get the latest table metadata from the catalog if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can change to a refresh function similar to python + java implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean just extract this logic out into a refresh function that also updates self.table
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was thinking of having a refresh function inside the table struct that can refresh metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed an update that uses a refresh function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! cc @liurenjie1024 @Fokko @sdd
Which issue does this PR close?
What changes are included in this PR?
Makes the IcebergTableProvider::try_new method public that takes an Arc and a TableIdent. It uses that to get the current table metadata when the DataFusion TableProvider is created - but it also stores a reference to the Arc. When the DataFusion TableProvider is asked to scan the table, it uses the catalog to fetch the latest table metadata.
This allows the TableProvider to get the latest changes to the Iceberg table, as opposed to being stuck on the snapshot when the table was created. This aligns closer to the expectation of using DataFusion TableProviders, where the scan is expected to scan the latest data.
Are these changes tested?
Covered by the existing integration tests at
crates/integrations/datafusion/tests/integration_datafusion_test.rs
- but I have also added a dedicated test for this.