Skip to content

Support retrieving the latest Iceberg table on table scan #1297

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

phillipleblanc
Copy link
Contributor

@phillipleblanc phillipleblanc commented May 7, 2025

Which issue does this PR close?

What changes are included in this PR?

Makes the IcebergTableProvider::try_new method public that takes an Arc and a TableIdent. It uses that to get the current table metadata when the DataFusion TableProvider is created - but it also stores a reference to the Arc. When the DataFusion TableProvider is asked to scan the table, it uses the catalog to fetch the latest table metadata.

This allows the TableProvider to get the latest changes to the Iceberg table, as opposed to being stuck on the snapshot when the table was created. This aligns closer to the expectation of using DataFusion TableProviders, where the scan is expected to scan the latest data.

Are these changes tested?

Covered by the existing integration tests at crates/integrations/datafusion/tests/integration_datafusion_test.rs - but I have also added a dedicated test for this.

* Allow resolving the current snapshot ID to use on a scan from a callback function

* Use table_fn

* Fix

* Just pass a reference to the catalog

* make public

* Just take table ident

* lint
Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall, just some small things. Thanks @phillipleblanc!

@@ -43,6 +43,8 @@ pub struct IcebergTableProvider {
snapshot_id: Option<i64>,
/// A reference-counted arrow `Schema`.
schema: ArrowSchemaRef,
/// A reference to the catalog that this table provider belongs to.
catalog: Option<Arc<dyn Catalog>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, we need this for future transactions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make it required? 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it might make sense. I think the use-cases for a static table are fairly limited, and that could be a separate TableProvider. But it would be a breaking change for all users

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think breaking change should still be fine at this early stage

@@ -130,8 +131,19 @@ impl TableProvider for IcebergTableProvider {
filters: &[Expr],
_limit: Option<usize>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
// Get the latest table metadata from the catalog if it exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change to a refresh function similar to python + java implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean just extract this logic out into a refresh function that also updates self.table?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking of having a refresh function inside the table struct that can refresh metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed an update that uses a refresh function

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The iceberg-datafusion Table Provider has a way to scan the latest snapshots dynamically
3 participants