Make FileIO a Trait #1314

tustvold · 2025-05-12T11:33:42Z

Is your feature request related to a problem or challenge?

Originally proposed on #172 (comment), making FileIO a trait would allow for more pluggable storage access. This in turn would potentially allow better integration where people already have an existing storage setup, e.g. based on object_store, that they want to use.

Describe the solution you'd like

I would like FileIO to be changed to a trait, allowing for pluggable storage engines.

The major remaining questions from the linked issue are:

Should implementations pass around Arc<dyn FileIO> or something else - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)
How important is preserving compatibility with iceberg-java - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)
Can we make breaking changes to iceberg-rust - Consider Using object_store as IO Abstraction #172 (comment)

If the decision is that we can break compatibility with iceberg-java and are happy to use a trait object, the next question is

How should the interface differ from ObjectStore - Consider Using object_store as IO Abstraction #172 (comment)

My 2 cents is that designing a custom abstraction at this granularity seems unnecessary when one already exists and is well adopted within the Rust data ecosystem.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community

Additional Context

See also apache/datafusion#15018 (comment)


Xuanwo · 2025-05-14T09:31:24Z

Thank you for getting this started. I’ve been thinking about this as well. I believe the FileIO trait is important for iceberg-rust, as it helps separate our abstraction from the underlying implementation.

I'm willing to work on this.

tustvold · 2025-05-14T10:00:12Z

IMO it would be good to get consensus on the outstanding questions before proceeding with an implementation. I think it would be good to articulate what differences this FileIO trait would have from existing ecosystem abstractions and therefore what problems it seeks to solve. I'm not sure people are on the same page here

linhr · 2025-05-14T11:02:10Z

It seems that FileIO internally manages Storage which is currently an enum. Maybe we should revisit the design of both together?

The only other thing in FileIO is FileIOBuilder, which seems to be configuration for the storage. As I mentioned earlier, I was wondering if the Iceberg library really needs to manage such configuration. It might be more flexible to simply accept a storage implementation (possibly some Arc<dyn ...>) without knowing how the storage is configured. This is the idea of "dependency injection".

FWIW, FileIO::into_builder() does not seem to be used anywhere in the project.

(I think it's fine to have helper functions to build FileIO from configuration, but the configuration does not need to be stored once the data storage is instantiated.)
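To make the dependency-injection idea concrete, here is a minimal standard-library-only sketch. Everything in it (`Storage`, `PrefixStorage`, `resolve`, `from_props`, the `root` property key) is a hypothetical stand-in rather than the iceberg-rust API, and a real interface would presumably be async.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical storage interface; not the iceberg-rust API.
pub trait Storage: Send + Sync {
    /// Resolve a relative path against whatever the backend was configured with.
    fn resolve(&self, path: &str) -> String;
}

/// Stand-in backend that the caller configures on their own terms.
pub struct PrefixStorage {
    pub root: String,
}

impl Storage for PrefixStorage {
    fn resolve(&self, path: &str) -> String {
        format!("{}/{}", self.root.trim_end_matches('/'), path)
    }
}

/// FileIO holds only the injected handle: no builder, no properties.
#[derive(Clone)]
pub struct FileIO {
    inner: Arc<dyn Storage>,
}

impl FileIO {
    /// Dependency injection: accept any already-built storage.
    pub fn new(inner: Arc<dyn Storage>) -> Self {
        Self { inner }
    }

    /// Optional convenience helper (assumed, for illustration): the
    /// properties are consumed here and not retained afterwards.
    pub fn from_props(props: &HashMap<String, String>) -> Self {
        let root = props.get("root").cloned().unwrap_or_default();
        Self::new(Arc::new(PrefixStorage { root }))
    }

    pub fn resolve(&self, path: &str) -> String {
        self.inner.resolve(path)
    }
}
```

With this shape, a caller who already owns a configured store passes it to `FileIO::new`, while `from_props` remains available for callers who only have key/value properties.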

linhr · 2025-05-14T12:08:12Z

If we consider FileIO and Storage together, I realized that there is actually an alternative design for the IO abstraction. (This assumes that we remove FileIOBuilder from FileIO.)

#[derive(Clone, Debug)]
pub struct FileIO {
    inner: Storage,
}

#[derive(Clone, Debug)]
pub(crate) enum Storage {
    /// An OpenDAL operator.
    Operator(Operator),
    /// An object_store implementation.
    ObjectStore(Arc<dyn ObjectStore>),
}

This alternative has a few benefits. (Let me know if there is any drawback that I'm not aware of.)

There is no change to the usage of FileIO.
Storage can be cheaply cloned and be part of InputFile and OutputFile.
We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store. (This is a discussion point raised by @tustvold.)
The intent is conveyed clearly that FileIO should work with both OpenDAL and object_store.
There is no longer need to wrap ObjectStore as an OpenDAL operator. In fact, after looking deeper into this, I'm not sure if the wrapper is in general well-defined since some of the OpenDAL methods (e.g. creating directory) are not part of the ObjectStore contract.

Happy to discuss!

tustvold · 2025-05-14T14:35:07Z

The major downside I can see of an enum based approach is that it forces the variants to be enumerated, which in turn limits downstream extensibility. This can be fudged over with feature flags but being able to have separate crates implementing a common interface typically ends up being easier to maintain for all involved.

We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store.

TBC it is still another abstraction, regardless of if implemented as a trait or a crate private enum. It will entail building custom parquet readers, config handling, path representation, etc... and in turn limit the ability for people to bring their own existing implementations and setups.
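The extensibility trade-off between the two approaches can be sketched as follows. The type and method names (`StorageEnum`, `StorageTrait`, `scheme`, `CustomBackend`) are made up for illustration and do not correspond to anything in iceberg-rust, OpenDAL, or object_store.

```rust
/// Closed set: every backend must be listed here, inside the defining crate.
pub enum StorageEnum {
    Memory,
    LocalFs,
}

impl StorageEnum {
    pub fn scheme(&self) -> &'static str {
        // Adding a backend means editing this enum and every match on it.
        match self {
            StorageEnum::Memory => "memory",
            StorageEnum::LocalFs => "file",
        }
    }
}

/// Open set: any crate can implement the trait for its own type.
pub trait StorageTrait {
    fn scheme(&self) -> &'static str;
}

/// Imagine this type living in a separate, downstream crate.
pub struct CustomBackend;

impl StorageTrait for CustomBackend {
    fn scheme(&self) -> &'static str {
        "custom"
    }
}

/// Code written against the trait accepts backends it has never heard of.
pub fn describe(storage: &dyn StorageTrait) -> String {
    format!("{}://", storage.scheme())
}
```

Feature flags can widen the enum, but every new backend still lands in the crate that owns it, whereas the trait lets the backend live wherever its maintainers are.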

linhr · 2025-05-14T15:01:04Z

Thanks @tustvold. Yeah the downside of the Storage enum makes a lot of sense to me. I can see that a trait would be more extensible in general for downstream crates.

The solution I proposed was more like a short-term solution. I found this would result in less code change and smaller blast radius, given the current status of how FileIO is used in the Iceberg library. Although Storage is still an enum, it supports a wide range of use cases, assuming that OpenDAL and object_store have emerged as the top standards for storage abstraction in the Rust community. This would give us a quick path for object_store integration, while we evaluate the best path forward.

linhr · 2025-05-14T15:55:00Z

Or maybe we can have Storage as a trait while FileIO stays as a struct? (Again, I can see that the indirection is less ideal. What I was looking for is a migration path with least disruption, considering there are also InputFile and OutputFile etc.)

Sl1mb0 · 2025-05-15T21:55:44Z

It's worth mentioning that I've raised a somewhat related issue in the past regarding the decoupling of building & serialization. On that note - I don't think Iceberg rust needs to necessarily even provide a storage implementation - that's something I would argue users generally already have covered beforehand. If the entire set of metadata types had their building and serialization/de-serialization decoupled, users would have more control over where the Iceberg metadata they build gets written.

One thing I think that ties into that though is that it's been mentioned by some of the developers that users of Iceberg Rust should use the transaction API in order to create tables - this effectively means that (and please correct me if I'm wrong here):

Users need to provide an Iceberg Catalog implementation
Users write their manifests and manifest lists
- This means that the user is responsible for handling building and serde for these types (manifest-list.avro & manifest-file.avro)
Users use the catalog to generate table metadata
- This means that the catalog is responsible for building and serde of table metadata (eg v1.metadata.json)

With this understanding in mind - it is not clear to me why there is a storage API at all. If users are intended to build & write their own manifest lists, then Iceberg Rust should only need to provide the infrastructure for building and serializing those types, as users should be able to decide where & how they are written. In addition to that, if the Iceberg catalogs are responsible for the building and serialization of table metadata, again, why does Iceberg Rust care about the storage aspect of the table metadata, as that's a catalog implementation's responsibility.

liurenjie1024 · 2025-05-20T03:05:52Z

Hi, sorry for being late for this party, and thanks @tustvold for the summary of the discussions.

Should implementations pass around Arc<dyn FileIO> or something else - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)

While we could abstract out the underlying implementation using different providers/implementations, I still recommend wrapping it in a struct when passing around in the crate. This helps to avoid the limitations of object safety in Rust.

How important is preserving compatibility with iceberg-java - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)

I believe it's important to keep the abstractions/functionality required by the Java API, such as InputFile/OutputFile/delete, as these are required by other components. But it doesn't have to be exactly the same, and it should be idiomatic for Rust developers.

Can we make breaking changes to iceberg-rust - Consider Using object_store as IO Abstraction #172 (comment)

I think it's fine to make breaking changes if we can't avoid them.

How should the interface differ from ObjectStore - Consider Using object_store as IO Abstraction #172 (comment)

The reason we should keep a FileIO trait rather than using the ObjectStore trait directly is that we need to stay as consistent as possible with the Java implementation (the reference implementation). For example, Java's FileIO has recently gained the following extensions:

https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java#L25
https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsBulkOperations.java#L21
https://github.com/apache/iceberg/blob/50d310aef17908f03f595d520cd751527483752a/api/src/main/java/org/apache/iceberg/encryption/EncryptingFileIO.java#L37
https://github.com/apache/iceberg/blob/e9364faabcc67eef6c61af2ecdf7bcf9a3fef602/api/src/main/java/org/apache/iceberg/io/SupportsRecoveryOperations.java#L27
https://github.com/apache/iceberg/blob/817dc35a924b403716d2eb899aba46f3398a5ca9/core/src/main/java/org/apache/iceberg/io/SupportsStorageCredentials.java#L27

I believe keeping a FileIO trait rather than using object_store would make evolving easier.

liurenjie1024 · 2025-05-20T03:17:37Z

Besides the discussion points raised by @tustvold, I have other things to discuss:

Should we have a unified FileIOBuilder trait, just like what we did for CatalogBuilder in Experiment implementation for catalog builder #1231?
Should we keep only traits like FileIO (and potentially FileIOBuilder) in the core crate, and move concrete implementations to other crates, just like what we did for Catalog?

tustvold · 2025-05-20T05:44:33Z

I still recommend wrapping it in a struct when passing around in the crate

So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?

Should we have a unified FileIOBuilder trait

FWIW we are upstreaming the ObjectStoreRegistry abstraction from DF that may serve as inspiration -
apache/arrow-rs-object-store#375

Should we keep only traits like FileIO (and potentially FileIOBuilder) in the core crate, and move concrete implementations to other crates, just like what we did for Catalog?

This seems sensible to me FWIW

liurenjie1024 · 2025-05-20T07:05:50Z

I still recommend wrapping it in a struct when passing around in the crate

So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?

I still have concerns for using ObjectStore directly for several reasons:

Since the goal of this refactoring is to make the underlying implementation extensible and to allow users to choose a provider freely, I don't think it's a good idea to bind the interface to a specific vendor. For example, we also want to allow users to use OpenDAL.
Another important goal of this refactoring is to allow users to provide their own FileIO provider (or Storage) implementation. I took a look at the ObjectStore trait, and it's not well aligned with FileIO's requirements. For example, it makes copy/list required methods, which FileIO does not require (this interface is optional).

tustvold · 2025-05-20T07:46:47Z

Since the goal of this refactoring is to make the underlying implementation extensible and to allow users to choose a provider freely, I don't think it's a good idea to bind the interface to a specific vendor.

This is the goal of ObjectStore, it represents an abstraction that has been developed and iterated on by the Rust data ecosystem to allow decoupling things like DataFusion and polars from specific implementations.

Now I'm not claiming it is perfect, but it does represent a non-trivial accumulation of knowledge up to this point.

As with anything there are engineering tradeoffs, and it is possible iceberg has different requirements, and that's fine. But my hope is that by finding out what these are:

We can ensure the proposed design actually delivers these requirements
We can potentially take some learnings back to ObjectStore to benefit the ecosystem more broadly
We avoid NotInventedHere syndrome and can potentially short-circuit a drawn out design process

For example, we also want to allow users to use OpenDAL.

There is object_store_opendal

For example, it makes copy/list required methods, which FileIO does not require (this interface is optional).

You can and people do leave methods unimplemented.

liurenjie1024 · 2025-05-22T06:33:28Z

For example, it makes copy/list required methods, which FileIO does not require (this interface is optional).

You can and people do leave methods unimplemented.

That's the problem. We would be defining a richer interface than Iceberg actually uses, and then it would be confusing for people who want to define their own FileIO implementation: which parts should they implement, and which should they not? It would also be confusing for users, who must be careful when using FileIO since some implementations may not implement some methods.

Also another concern is that, if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduced when object_store evolves.

About the design part, I'm thinking about the following design considerations:

1. We should define small traits rather than one large trait. For example, we could define traits such as FileIO, SupportsBulkOperations, etc., so that concrete implementations can choose which traits to implement according to their capabilities.
2. We could have an erased trait object definition for the traits defined in 1, say DynFileIO, which is object safe.
3. For end users, we could provide a struct that wraps the DynFileIO trait object.

(We could discuss naming of each part)

Here is an example:

trait InputFile {
    ...
}

trait FileIO {
    type I: InputFile;
}

trait DynFileIO {
}

struct FileIOWrapper {
    inner: Arc<dyn DynFileIO>,
}
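One way the three layers above could fit together is sketched below. This is only a guess at the intended shape, not the proposed implementation: the method names (`location`, `new_input`, `new_input_dyn`), the blanket impl, and the toy `LocalIO`/`LocalInput` types are all assumptions added for illustration, and the sketch is synchronous to stay within the standard library.

```rust
use std::sync::Arc;

pub trait InputFile {
    fn location(&self) -> String;
}

/// Statically typed trait. Because of the associated type, `dyn FileIO`
/// cannot be named without also fixing `I`.
pub trait FileIO {
    type I: InputFile + 'static;
    fn new_input(&self, path: &str) -> Self::I;
}

/// Object-safe mirror that returns boxed trait objects instead.
pub trait DynFileIO: Send + Sync {
    fn new_input_dyn(&self, path: &str) -> Box<dyn InputFile>;
}

/// Blanket impl: every typed FileIO gets the erased form for free.
impl<T: FileIO + Send + Sync> DynFileIO for T {
    fn new_input_dyn(&self, path: &str) -> Box<dyn InputFile> {
        Box::new(self.new_input(path))
    }
}

/// Concrete struct handed to end users.
#[derive(Clone)]
pub struct FileIOWrapper {
    inner: Arc<dyn DynFileIO>,
}

impl FileIOWrapper {
    pub fn new(io: impl FileIO + Send + Sync + 'static) -> Self {
        Self { inner: Arc::new(io) }
    }

    pub fn new_input(&self, path: &str) -> Box<dyn InputFile> {
        self.inner.new_input_dyn(path)
    }
}

// Toy implementation used only to exercise the sketch.
pub struct LocalInput(String);

impl InputFile for LocalInput {
    fn location(&self) -> String {
        self.0.clone()
    }
}

pub struct LocalIO;

impl FileIO for LocalIO {
    type I = LocalInput;
    fn new_input(&self, path: &str) -> LocalInput {
        LocalInput(format!("file://{path}"))
    }
}
```

Implementers would write only the typed `FileIO`; the erased trait and the wrapper exist so the rest of the crate can pass around a single concrete, cloneable type.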

linhr · 2025-05-23T03:10:51Z

That's the problem. We would be defining a richer interface than Iceberg actually uses, and then it would be confusing for people who want to define their own FileIO implementation: which parts should they implement, and which should they not? It would also be confusing for users, who must be careful when using FileIO since some implementations may not implement some methods.

This seems a good point. I feel an explicit interface conveys the intent if Iceberg only uses a small set of file/object operations.

Here is an example:

I'm not sure if I understand this though, especially the difference between FileIO and DynFileIO. Also it seems InputFile etc. are traits now, and I'm worried that it may add a burden for implementers.

Following my earlier thinking around making Storage a trait, would the following be a viable option?

pub struct FileIO {
    inner: Arc<dyn Storage>,
}

pub(crate) trait Storage {
}

pub struct InputFile {
    inner: Arc<dyn Storage>,
    path: String,
}
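Filling in the outline above with some behaviour gives the following sketch. The `read`/`write` methods and `MemoryStorage` are hypothetical additions, the trait is made `pub` here only so that the constructor can be public, and a real version would likely be async.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical minimal trait; the method set is an assumption.
pub trait Storage: Send + Sync {
    fn read(&self, path: &str) -> Option<Vec<u8>>;
    fn write(&self, path: &str, data: &[u8]);
}

#[derive(Clone)]
pub struct FileIO {
    inner: Arc<dyn Storage>,
}

pub struct InputFile {
    inner: Arc<dyn Storage>,
    path: String,
}

impl FileIO {
    pub fn new(inner: Arc<dyn Storage>) -> Self {
        Self { inner }
    }

    /// Creating an InputFile only bumps the Arc's reference count.
    pub fn new_input(&self, path: &str) -> InputFile {
        InputFile {
            inner: Arc::clone(&self.inner),
            path: path.to_string(),
        }
    }
}

impl InputFile {
    pub fn location(&self) -> &str {
        &self.path
    }

    pub fn read(&self) -> Option<Vec<u8>> {
        self.inner.read(&self.path)
    }
}

/// In-memory backend used only to exercise the sketch.
#[derive(Default)]
pub struct MemoryStorage(Mutex<HashMap<String, Vec<u8>>>);

impl Storage for MemoryStorage {
    fn read(&self, path: &str) -> Option<Vec<u8>> {
        self.0.lock().unwrap().get(path).cloned()
    }

    fn write(&self, path: &str, data: &[u8]) {
        self.0.lock().unwrap().insert(path.to_string(), data.to_vec());
    }
}
```

FileIO and InputFile stay plain structs in this shape, so existing call sites would not need to change, and implementers have a single trait to write.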

roeap · 2025-05-23T10:18:08Z

Just sharing some experiences from the delta world which may not be immediately applicable to the question of which trait to use, but may be food for thought as to where things could be heading.

One thing that repeatedly comes up when talking about table formats is "Metadata is Data". To me, the logical consequence of that is to treat it as such, meaning process it with the same tools that you would use for processing data. To that end, delta-rs currently keeps all metadata around as arrow record batches, and delta-kernel goes even further, abstracting away the specific data representation.

As such the higher level abstractions we chose are on the level of file formats. I.e. read this {parquet,json,avro,..} file into arrow with this schema. The internal logic processing the metadata either visits individual fields or applies expressions on the data to generate the plans for scans etc. I think to a certain degree this thinking is actually baked into the Iceberg spec via the metadata tables.

By default we provide an arrow (arrays and kernels) and object_store based implementation using many of the same tools used here to read data. Currently I am working on a datafusion engine for kernel, where datafusion's execution plans are used to read data and datafusion's native expressions are used for evaluation.

As a consequence virtually all resource management is under full control of the query engine which is also free to apply any more advanced optimisations (caching, etc.) as it sees fit.

All that said, I am about to start a PoC to find out how much of the query planning and eventually also maintenance that is implemented in aforementioned datafusion engine can be applied to both delta and iceberg.

One thing I am fairly certain of is that the work discussed here will be making my life much easier, and if we end up in a place where we can at least do something like ...

impl<T: ObjectStore> FileIo for T {
    ...
}

that would be awesome!
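For readers less familiar with blanket impls, here is a self-contained sketch of that shape. `ObjectStoreLike` is a deliberately tiny, synchronous stand-in for `object_store::ObjectStore` (whose real methods are async and far more numerous), and `read`, `InMemory`, and `load_metadata` are hypothetical names added for illustration.

```rust
use std::collections::HashMap;

/// Stand-in for `object_store::ObjectStore`, reduced to one sync method
/// so the sketch compiles without external crates.
pub trait ObjectStoreLike {
    fn get(&self, location: &str) -> Result<Vec<u8>, String>;
}

/// Hypothetical Iceberg-side trait.
pub trait FileIo {
    fn read(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// The blanket impl: anything that is an object store is also a FileIo.
impl<T: ObjectStoreLike> FileIo for T {
    fn read(&self, path: &str) -> Result<Vec<u8>, String> {
        self.get(path)
    }
}

/// An "existing" store a user might already have.
pub struct InMemory(pub HashMap<String, Vec<u8>>);

impl ObjectStoreLike for InMemory {
    fn get(&self, location: &str) -> Result<Vec<u8>, String> {
        self.0
            .get(location)
            .cloned()
            .ok_or_else(|| format!("not found: {location}"))
    }
}

/// Iceberg-side code only names `FileIo`; it returns the byte length
/// of the file purely to have something observable.
pub fn load_metadata(io: &impl FileIo, path: &str) -> Result<usize, String> {
    io.read(path).map(|bytes| bytes.len())
}
```

The user never writes an adapter: an existing store can be passed directly wherever a `FileIo` is expected.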

Once we have a consensus here, I am happy to offer my support driving this forward!

tustvold · 2025-05-24T09:22:17Z

Also another concern is that, if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduced when object_store evolves.

FWIW we try very hard to avoid these now, aiming for ~2 breaking releases per year. I would eventually like to release a v1.0.0, but that is unlikely to be this year.

We should define small traits rather than one large trait. For example, we could define traits such as FileIO, SupportsBulkOperations, etc., so that concrete implementations can choose which traits to implement according to their capabilities.

At least historically this sort of trait composition hasn't worked very well with trait objects. In particular you can't write Box<dyn A + B>, only auto-traits are supported. You could get around this by defining trait AB: A + B and then Rust versions >= 1.86 can upcast, but this would quickly get unmanageable as the number of traits grows.

Whilst returning Err(Error::Unimplemented) is not as nice as getting a compile error, it is the best approach we've been able to devise.
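The two options being weighed here can be put side by side in a short sketch. All names (`ReadFile`, `BulkDelete`, `ReadAndBulkDelete`, `WideStore`, and this `Error` enum) are hypothetical; the sketch does not use trait upcasting, so it does not depend on a recent compiler.

```rust
/// Hypothetical error type mirroring the `Error::Unimplemented` mentioned above.
#[derive(Debug, PartialEq)]
pub enum Error {
    Unimplemented(&'static str),
}

// Option A: small capability traits.
pub trait ReadFile {
    fn read(&self, path: &str) -> Vec<u8>;
}

pub trait BulkDelete {
    fn delete_all(&self, paths: &[&str]) -> usize;
}

// `Box<dyn ReadFile + BulkDelete>` does not compile: only auto traits such
// as `Send` or `Sync` may follow the first trait. Each combination needs a
// named supertrait instead, which multiplies as capabilities are added.
pub trait ReadAndBulkDelete: ReadFile + BulkDelete {}
impl<T: ReadFile + BulkDelete> ReadAndBulkDelete for T {}

pub struct Full;

impl ReadFile for Full {
    fn read(&self, path: &str) -> Vec<u8> {
        path.as_bytes().to_vec()
    }
}

impl BulkDelete for Full {
    fn delete_all(&self, paths: &[&str]) -> usize {
        paths.len()
    }
}

// Option B: one wide trait; optional methods default to a runtime error.
pub trait WideStore {
    fn read(&self, path: &str) -> Result<Vec<u8>, Error>;

    fn delete_all(&self, _paths: &[&str]) -> Result<usize, Error> {
        Err(Error::Unimplemented("delete_all"))
    }
}

pub struct ReadOnly;

impl WideStore for ReadOnly {
    fn read(&self, path: &str) -> Result<Vec<u8>, Error> {
        Ok(path.as_bytes().to_vec())
    }
}
```

Option A catches a missing capability at compile time but needs one named supertrait per combination; Option B keeps a single trait object type at the cost of discovering the gap at run time.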

We could have an erased trait object definition for the traits defined in 1, say DynFileIO, which is object safe.

My experience with ObjectStore is that over time this interface will likely grow. Ultimately ObjectStore started out as precisely this, a very targeted IO abstraction for InfluxDB IOx; it then got donated to arrow-rs, and over time, as more people have used it, the interface has grown to its current state to accommodate their requirements. I suspect such a DynFileIO would likely have to go through the same process, especially if the goal is for people to use it as more than just a shim to this crate.

then it would be confusing for people who want to define their own FileIO implementation: which parts should they implement, and which should they not?

I had been viewing FileIO purely as a mechanism to shim into whatever IO abstraction is used by the broader system they're integrating with and not really as something people would generally be implementing or interacting with themselves.

However, I think this is the key perspective difference that is making consensus hard to come by, as I think there are two possible goals being discussed here and on the other linked tickets:

Provide an extensible IO interface inspired by iceberg-java people can use across their applications
Provide a way to integrate their existing ObjectStore based systems with this crate

Ultimately there are a number of people with a demonstrable need for the latter, whereas I believe the former is largely theoretical at this stage. Further, as both @roeap and @Sl1mb0 allude to, people want to integrate iceberg into a broader system, and so even if iceberg devised a FileIO interface, other systems will likely continue to use something more expressive.

In the interests of making some progress perhaps we might find some way to decouple these objectives?

Some suggestions from the various tickets:

Add an ObjectStore variant to the storage enum - Make FileIO a Trait #1314 (comment)
Add an extension variant to the storage enum - Consider Using object_store as IO Abstraction #172 (comment)
Make Storage a trait - Make FileIO a Trait #1314 (comment)
Replace Storage with ObjectStore - Make FileIO a Trait #1314 (comment)

All of these would provide a way to solve the pain point that people are currently running into without requiring complex design work, and I'd be happy to help out once we achieve consensus on what it is we would like to do.

tustvold added the enhancement New feature or request label May 12, 2025

tustvold mentioned this issue May 12, 2025

Consider Using object_store as IO Abstraction #172

Open

Xuanwo self-assigned this May 14, 2025
