Skip to content

feat: Declarative TableMetadata Builder #1362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

c-thiel
Copy link
Collaborator

@c-thiel c-thiel commented May 21, 2025

Which issue does this PR close?

Currently we don't have a way to create TableMetadata in a declarative way other than deserializing it from JSON.
The TableMetadataBuilder mostly works imperatively with methods like add_schema and add_partition_spec that mutate the state incrementally. While incremental modification is what most users need, it is also helpful to offer a type safe way to create a new TableMetadata object given all of its attributes, other than deserialization from JSON.

My concrete use-case is: Lakekeeper stores TableMetadata in Postgres. We extract data typesafe, so we end up with various objects such as a list of schemas, partition_specs, the last_updated_ms and all other fields required to build a new TableMetadata. As we already have rust-types, we don't want to serialize them to JSON first. We also can't use the TableMetadata builder, as it's incremental nature would result in significant overhead.

Design considerations

Instead of using the builder approach, we could also add another method try_from_parts to impl TableMetadata. We have this alternative approach implemented here: https://github.com/lakekeeper/iceberg-rust/blob/a8b6509775b139c92772a999b0cbca637e274a03/crates/iceberg/src/spec/table_metadata.rs#L676-L740

I feel that a builder is less intrusive though. I don't like adding the UnnormalizedTableMetadata, but didn't find a way to skip this type without writing significantly more code that needs to be maintained. Because of the derived builder in this PR, the approach is almost maintenance free.

What changes are included in this PR?

  • Introduce a new derived TableMetadataDeclarativeBuilder
  • Introduce UnnormalizedTableMetadata as an intermediate type that is .try_normalize() to TableMetadata

Are these changes tested?

yes

@c-thiel
Copy link
Collaborator Author

c-thiel commented May 21, 2025

@liurenjie1024 @Xuanwo Looking forward to your thoughts!

Comment on lines +183 to +184
#[builder(default, setter(transform = |stats: Vec<StatisticsFile>| {
stats.into_iter().map(|s| (s.snapshot_id, s)).collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why also default? How is this different from snapshots?
Accepting Vec<StatisticsFile> should be enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not - I missed the default for snapshots and current_snapshot_id and a few others.
My basic reasoning is that all fields that must be set in order to produce valid metadata should not have a default, even if the Rust type would allow it. An example for a non-default field would be schemas, as we require at least one entry in the HashMap to be able to produce valid Metadata (current_schema_id must be set and be present).
Valid TableMetadata must not contain snapshots though, and current_snapshot_id can be None, so those should have defaults.

.metadata_log(vec![])
.refs(HashMap::new())
.build()
.try_normalize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test for the declarative builder would be better to avoid unnecessary calls. Do we need metadata.try_normalize? (the test passes without metadata.try_normalize call)

OTOH it's somewhat weird that try_normalize is the only API to finish building.
Could build() do the try_normalize implicitly, so that user doesn't have to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this design as well, however I can't find an easy way to implement a custom build method without implementing it on the very complex generic type that is derived by the macro (see end of this message) and is hard to keep in sync.

typed_builders only usable customization of the build method is using Into to convert the type. There is no native way to inject try_normalize into the build method. Especially, there is no way to make the generated build method return a Result.

I see the following options:

  1. Just build TableMetadata without normalization - this would allow to create a potentially invalid TableMetadata, as all other constructors (existing TableMetadataBuilder & Deserialize) use normalization. Hence, I discarded it.
  2. Implement a fallible try_build (with normalize) method and make the derived build method private. This would require to implement the try_build on the above type, which isn't very obvious, and would need to be kept in sync with fields in TableMetadata and even worse, with which have a default builder implementation. Due to this complexity, and a high probability of state drift, I discarded this as well.
  3. Don't use typed builder - which requires us to write significantly more code, but it would be a viable option.
  4. Extend typed builder to support TryInto in addition to Into in the build method.
  5. Use a very small custom type that wraps the notion of unvalidated metadata and offers an easy way to validate it - which is what I did.

If there are any other ideas here or if I missed something, let me know!

Here is the type: Note that default and non-default fields are treated differently:

#[allow(dead_code, non_camel_case_types, missing_docs)]
#[automatically_derived]
impl<
    __properties: ::typed_builder::Optional<HashMap<String, String>>,
    __current_snapshot_id: ::typed_builder::Optional<Option<i64>>,
    __snapshots: ::typed_builder::Optional<HashMap<i64, SnapshotRef>>,
    __snapshot_log: ::typed_builder::Optional<Vec<SnapshotLog>>,
    __metadata_log: ::typed_builder::Optional<Vec<MetadataLog>>,
    __statistics: ::typed_builder::Optional<HashMap<i64, StatisticsFile>>,
    __partition_statistics: ::typed_builder::Optional<HashMap<i64, PartitionStatisticsFile>>,
    __encryption_keys: ::typed_builder::Optional<HashMap<String, String>>,
>
    TableMetadataDeclarativeBuilder<(
        (FormatVersion,),
        (Uuid,),
        (String,),
        (i64,),
        (i64,),
        (i32,),
        (HashMap<i32, SchemaRef>,),
        (i32,),
        (HashMap<i32, PartitionSpecRef>,),
        (PartitionSpecRef,),
        (StructType,),
        (i32,),
        __properties,
        __current_snapshot_id,
        __snapshots,
        __snapshot_log,
        __metadata_log,
        (HashMap<i64, SortOrderRef>,),
        (i64,),
        (HashMap<String, SnapshotReference>,),
        __statistics,
        __partition_statistics,
        __encryption_keys,
    )>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typed_builders only usable customization of the build method is using Into to convert the type. There is no native way to inject try_normalize into the build method. Especially, there is no way to make the generated build method return a Result.

Looks there is draft design for this problem: idanarye/rust-typed-builder#95 . Another option maybe we can try to make the progress of feature. 😄

#[derive(Debug, PartialEq, Deserialize, Eq, Clone)]
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)]
#[builder(builder_method(name=declarative_builder))]
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced "declarative" vs "imperative" is the defining distinction.
The TableMetadataBuilder is not "a builder pattern for TableMetadata". It's a facade to perform changes on the TableMetadata to be invoked by higher layer. For that reason, it includes implicit changes that are supposed to be carried on when a metadata change happens, such as updating last_updated_ms when a new snapshot is added

self.last_updated_ms = Some(snapshot.timestamp_ms());

The use-case described in the PR description is "the" standard builder pattern. To me, this is the defining difference.

I suggest changing the doc string here and align the type names.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely agree with this from a rust perspective. From an iceberg perspective other languages use the world Builder for this functionality unfortunately. I am uncertain if renaming the TableMetadataBuilder to a clearer name is worth the breaking change today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I feel the TypedBuilder is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I feel the TypedBuilder is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?

I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata? How would implementing it on another struct help?

Regarding names, @liurenjie1024 would you be OK with the breaking change that Piotrs proposal implies? Renaming the existing TableMetadataBuilder to something else, then naming the new builder TableMetadataBuilder (which does something completely different and is used by far less people?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @liurenjie1024 and @findepi you mean different things.

@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder to TableMetadataConstructor - correct? I am onboard with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder to TableMetadataConstructor - correct? I am onboard with that.

Yes. Feel free to suggest anything more intuitive.

I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata? How would implementing it on another struct help?

I mean not using TypedBuilder to generate code automatically, rather than manually maintaining the builder/construct data structure. This pr contains many advanced usage of TypeBuilder, and as a reviewer I find it's sth how difficult to catch up all the usages. I also worry about the maintainceness of using too much auto generated code.


impl UnnormalizedTableMetadata {
/// Build a new [`UnnormalizedTableMetadata`] using the [`TableMetadataDeclarativeBuilder`].
pub fn builder() -> TableMetadataDeclarativeBuilder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused? remove?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a library and the constructor can be used outside of the crate. I like to attach some constructor to a type if available.
As it's an auxiliary type, I am OK with removing it as well.

/// Unnormalized table metadata, used as an intermediate type
/// to build table metadata in a declarative way.
#[derive(Debug, PartialEq, Deserialize, Eq, Clone)]
pub struct UnnormalizedTableMetadata(TableMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why we need this. Why not build TableMeadata from a declarative builder directly?🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly due to my point 1) in this comment: #1362 (comment)

I don't think there should be any constructor that allows to create unvalidated / potentially corrupt TableMetadata.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Jun 11, 2025

@findepi @ZENOTME thanks for the Reviews!
I addressed most of your comments in my latest commit.

The most unclear point is probably the return type of the Build method.
Getting more Feedback on my #1362 (comment) would be great!

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for this pr! In general I agree that the requirement is reasonable. But I have concerns with the naming and TypedBuilder usage, left some comments for discussion.

#[derive(Debug, PartialEq, Deserialize, Eq, Clone)]
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)]
#[builder(builder_method(name=declarative_builder))]
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.

#[derive(Debug, PartialEq, Deserialize, Eq, Clone)]
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)]
#[builder(builder_method(name=declarative_builder))]
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I feel the TypedBuilder is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants