
Make FileIO a Trait #1314


Open
tustvold opened this issue May 12, 2025 · 17 comments

@tustvold

tustvold commented May 12, 2025

Is your feature request related to a problem or challenge?

Originally proposed on #172 (comment) making FileIO a trait would allow for more pluggable storage access. This in turn would potentially allow better integration where people already have an existing storage setup, e.g. based on object_store, that they want to use.

Describe the solution you'd like

I would like FileIO to be changed to a trait, allowing for pluggable storage engines.

The major remaining questions from the linked issue are:

  1. Should implementations pass around Arc<dyn FileIO> or something else - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)
  2. How important is preserving compatibility with iceberg-java - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)
  3. Can we make breaking changes to iceberg-rust - Consider Using object_store as IO Abstraction #172 (comment)

If the decision is that we can break compatibility with iceberg-java and are happy to use a trait object, the next question is

  1. How should the interface differ from ObjectStore - Consider Using object_store as IO Abstraction #172 (comment)

My 2 cents is that designing a custom abstraction at this granularity seems unnecessary when one already exists and is well adopted within the Rust data ecosystem.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community

Additional Context

See also apache/datafusion#15018 (comment)

@tustvold tustvold added the enhancement New feature or request label May 12, 2025
@Xuanwo Xuanwo self-assigned this May 14, 2025
@Xuanwo
Member

Xuanwo commented May 14, 2025

Thank you for getting this started. I’ve been thinking about this as well. I believe the FileIO trait is important for iceberg-rust, as it helps separate our abstraction from the underlying implementation.

I'm willing to work on this.

@tustvold
Author

IMO it would be good to get consensus on the outstanding questions before proceeding with an implementation. I think it would be good to articulate what differences this FileIO trait would have from existing ecosystem abstractions, and therefore what problems it seeks to solve. I'm not sure people are on the same page here.

@linhr

linhr commented May 14, 2025

It seems that FileIO internally manages Storage which is currently an enum. Maybe we should revisit the design of both together?

The only other thing in FileIO is FileIOBuilder, which seems to be configuration for the storage. As I mentioned earlier, I was wondering if the Iceberg library really needs to manage such configuration. It might be more flexible to simply accept a storage implementation (possibly some Arc<dyn ...>) without knowing how the storage is configured. This is the idea of "dependency injection".

FWIW, FileIO::into_builder() does not seem to be used anywhere in the project.

(I think it's fine to have helper functions to build FileIO from configuration, but the configuration does not need to be stored once the data storage is instantiated.)

@linhr

linhr commented May 14, 2025

If we consider FileIO and Storage together, I realized that there is actually an alternative design for the IO abstraction. (This assumes that we remove FileIOBuilder from FileIO.)

use std::sync::Arc;

use object_store::ObjectStore;
use opendal::Operator;

#[derive(Clone, Debug)]
pub struct FileIO {
    inner: Storage,
}

#[derive(Clone, Debug)]
pub(crate) enum Storage {
    /// An OpenDAL operator.
    Operator(Operator),
    /// An object_store implementation.
    ObjectStore(Arc<dyn ObjectStore>),
}

This alternative has a few benefits. (Let me know if there is any drawback that I'm not aware of.)

  1. There is no change to the usage of FileIO.
  2. Storage can be cheaply cloned and be part of InputFile and OutputFile.
  3. We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store. (This is a discussion point raised by @tustvold.)
  4. The intent is conveyed clearly that FileIO should work with both OpenDAL and object_store.
  5. There is no longer a need to wrap ObjectStore as an OpenDAL operator. In fact, after looking deeper into this, I'm not sure the wrapper is well-defined in general, since some of the OpenDAL methods (e.g. creating a directory) are not part of the ObjectStore contract.
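
As a rough illustration of how the enum-based layout above would dispatch operations, here is a minimal, synchronous sketch. `FakeOperator`, `FakeObjectStore`, `StaticStore`, and the `read`/`put` methods are hypothetical stand-ins invented for this example; they are not the real OpenDAL or object_store APIs, which are async and considerably richer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an OpenDAL `Operator` (not the real type).
#[derive(Clone, Debug, Default)]
pub struct FakeOperator {
    files: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl FakeOperator {
    /// Store bytes under a path in the in-memory map.
    pub fn put(&self, path: &str, data: Vec<u8>) {
        self.files.lock().unwrap().insert(path.to_string(), data);
    }
}

/// Hypothetical stand-in for the `ObjectStore` trait (not the real trait).
pub trait FakeObjectStore: std::fmt::Debug + Send + Sync {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
}

/// Toy store that returns the same bytes for every path.
#[derive(Debug)]
pub struct StaticStore(pub Vec<u8>);

impl FakeObjectStore for StaticStore {
    fn get(&self, _path: &str) -> Option<Vec<u8>> {
        Some(self.0.clone())
    }
}

/// Closed set of backends: cheap to clone, but every backend must be listed here.
#[derive(Clone, Debug)]
pub enum Storage {
    Operator(FakeOperator),
    ObjectStore(Arc<dyn FakeObjectStore>),
}

#[derive(Clone, Debug)]
pub struct FileIO {
    inner: Storage,
}

impl FileIO {
    pub fn new(inner: Storage) -> Self {
        Self { inner }
    }

    /// Each operation matches on the variant and forwards to that backend.
    pub fn read(&self, path: &str) -> Option<Vec<u8>> {
        match &self.inner {
            Storage::Operator(op) => op.files.lock().unwrap().get(path).cloned(),
            Storage::ObjectStore(store) => store.get(path),
        }
    }
}
```

Callers only ever see `FileIO`, so usage stays unchanged; the cost is that adding a third backend means editing the `Storage` enum and every `match` on it.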

Happy to discuss!

@tustvold
Author

The major downside I can see of an enum based approach is that it forces the variants to be enumerated, which in turn limits downstream extensibility. This can be fudged over with feature flags but being able to have separate crates implementing a common interface typically ends up being easier to maintain for all involved.

We do not need yet another abstraction for file/object operations on top of OpenDAL or object_store.

To be clear, it is still another abstraction, regardless of whether it is implemented as a trait or a crate-private enum. It will entail building custom parquet readers, config handling, path representation, etc., and in turn limit people's ability to bring their own existing implementations and setups.

@linhr

linhr commented May 14, 2025

Thanks @tustvold. Yeah, the downside of the Storage enum makes a lot of sense to me. I can see that a trait would be more extensible in general for downstream crates.

The solution I proposed was more of a short-term solution. I found this would result in fewer code changes and a smaller blast radius, given the current status of how FileIO is used in the Iceberg library. Although Storage is still an enum, it supports a wide range of use cases, assuming that OpenDAL and object_store have emerged as the top standards for storage abstraction in the Rust community. This would give us a quick path for object_store integration, while we evaluate the best path forward.

@linhr

linhr commented May 14, 2025

Or maybe we can have Storage as a trait while FileIO stays as a struct? (Again, I can see that the indirection is less than ideal. What I was looking for is a migration path with the least disruption, considering there are also InputFile and OutputFile etc.)

@Sl1mb0
Contributor

Sl1mb0 commented May 15, 2025

It's worth mentioning that I've raised a somewhat related issue in the past regarding the decoupling of building & serialization. On that note - I don't think Iceberg rust needs to necessarily even provide a storage implementation - that's something I would argue users generally already have covered beforehand. If the entire set of metadata types had their building and serialization/de-serialization decoupled, users would have more control over where the Iceberg metadata they build gets written.

One thing I think that ties into that though is that it's been mentioned by some of the developers that users of Iceberg Rust should use the transaction API in order to create tables - this effectively means that (and please correct me if I'm wrong here):

  • Users need to provide an Iceberg Catalog implementation
  • Users write their manifests and manifest lists
    • This means that the user is responsible for handling building and serde for these types (manifest-list.avro & manifest-file.avro)
  • Users use the catalog to generate table metadata
    • This means that the catalog is responsible for building and serde of table metadata (eg v1.metadata.json)

With this understanding in mind - it is not clear to me why there is a storage API at all. If users are intended to build & write their own manifest lists, then Iceberg Rust should only need to provide the infrastructure for building and serializing those types, as users should be able to decide where & how they are written. In addition to that, if the Iceberg catalogs are responsible for the building and serialization of table metadata, again, why does Iceberg Rust care about the storage aspect of the table metadata, as that's a catalog implementation's responsibility.

@liurenjie1024
Contributor

Hi, sorry for being late to this party, and thanks @tustvold for the summary of the discussions.

  1. Should implementations pass around Arc<dyn FileIO> or something else - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)

While we could abstract out the underlying implementation using different providers/implementations, I still recommend wrapping it in a struct when passing around in the crate. This helps to avoid the limitations of object safety in Rust.

  2. How important is preserving compatibility with iceberg-java - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)

I believe it's important to keep the abstractions/functionality required by the Java API, such as InputFile/OutputFile/delete, as these are required by other components. But it doesn't have to be exactly the same, and it should be idiomatic for Rust developers.

  3. Can we make breaking changes to iceberg-rust - Consider Using object_store as IO Abstraction #172 (comment)

I think it's fine to make breaking changes if we can't avoid them.

  1. How should the interface differ from ObjectStore - Consider Using object_store as IO Abstraction #172 (comment)

The reason we should keep a FileIO trait rather than using the ObjectStore trait directly is that we need to stay as consistent as possible with the Java implementation (the reference implementation). For example, Java's FileIO has recently gained the following extensions:

https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java#L25
https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsBulkOperations.java#L21
https://github.com/apache/iceberg/blob/50d310aef17908f03f595d520cd751527483752a/api/src/main/java/org/apache/iceberg/encryption/EncryptingFileIO.java#L37
https://github.com/apache/iceberg/blob/e9364faabcc67eef6c61af2ecdf7bcf9a3fef602/api/src/main/java/org/apache/iceberg/io/SupportsRecoveryOperations.java#L27
https://github.com/apache/iceberg/blob/817dc35a924b403716d2eb899aba46f3398a5ca9/core/src/main/java/org/apache/iceberg/io/SupportsStorageCredentials.java#L27

I believe keeping a FileIO trait rather than using object_store would make it easier to evolve.

@liurenjie1024
Contributor

Besides the discussion points raised by @tustvold, I have other things to discuss:

  1. Should we have a unified FileIOBuilder trait, just like what we did for CatalogBuilder in Experiment implementation for catalog builder #1231?
  2. Should we keep only traits like FileIO (and potentially FileIOBuilder) in the core crate, and move concrete implementations to other crates, just like what we did for Catalog?

@tustvold
Author

I still recommend wrapping it in a struct when passing around in the crate

So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?

Should we have a unified FileIOBuilder trait

FWIW we are upstreaming the ObjectStoreRegistry abstraction from DF that may serve as inspiration -
apache/arrow-rs-object-store#375

Should we keep only traits like FileIO (and potentially FileIOBuilder) in the core crate, and move concrete implementations to other crates, just like what we did for Catalog?

This seems sensible to me FWIW

@liurenjie1024
Contributor

I still recommend wrapping it in a struct when passing around in the crate

So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?

I still have concerns for using ObjectStore directly for several reasons:

  1. Since the goal of this refactoring is to make the underlying implementation extensible and to allow users to choose a provider freely, I don't think it's a good idea to bind the interface to a specific vendor. For example, we also want to allow users to use OpenDAL.
  2. Another important goal of this refactoring is to allow users to provide their own FileIO provider (or Storage) implementation. I took a look at the ObjectStore trait, and it's not well aligned with FileIO's requirements. For example, it requires copy/list as required methods, which are not required by FileIO (this interface is optional).

@tustvold
Author

tustvold commented May 20, 2025

Since the goal of this refactoring is to make the underlying implementation extensible and to allow users to choose a provider freely, I don't think it's a good idea to bind the interface to a specific vendor.

This is the goal of ObjectStore, it represents an abstraction that has been developed and iterated on by the Rust data ecosystem to allow decoupling things like DataFusion and polars from specific implementations.

Now I'm not claiming it is perfect, but it does represent a non-trivial accumulation of knowledge up to this point.

As with anything there are engineering tradeoffs, and it is possible iceberg has different requirements, and that's fine. But my hope is that by finding out what these are:

  • We can ensure the proposed design actually delivers these requirements
  • We can potentially take some learnings back to ObjectStore to benefit the ecosystem more broadly
  • We avoid NotInventedHere syndrome and can potentially short-circuit a drawn out design process

For example, we also want to allow users to use OpenDAL.

There is object_store_opendal

For example, it requires copy/list as required methods, which are not required by FileIO (this interface is optional).

You can and people do leave methods unimplemented.
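
To make the "leave methods unimplemented" approach concrete, here is a minimal sketch. `WideStore`, `ReadOnlyStore`, and this `Error` enum are hypothetical names made up for illustration; they are not the real `ObjectStore` trait or its error type, and the real methods are async.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum Error {
    /// The backend does not support the named operation.
    Unimplemented(&'static str),
    /// No object exists at the given path.
    NotFound(String),
}

/// Hypothetical wide storage trait where every method is required.
pub trait WideStore {
    fn get(&self, path: &str) -> Result<Vec<u8>, Error>;
    fn copy(&self, from: &str, to: &str) -> Result<(), Error>;
    fn list(&self, prefix: &str) -> Result<Vec<String>, Error>;
}

/// A backend that can only read a single known object.
pub struct ReadOnlyStore;

impl WideStore for ReadOnlyStore {
    fn get(&self, path: &str) -> Result<Vec<u8>, Error> {
        if path == "hello.txt" {
            Ok(b"hello".to_vec())
        } else {
            Err(Error::NotFound(path.to_string()))
        }
    }

    // The trait forces these methods to exist, so the implementer
    // reports the missing capability at runtime instead.
    fn copy(&self, _from: &str, _to: &str) -> Result<(), Error> {
        Err(Error::Unimplemented("copy"))
    }

    fn list(&self, _prefix: &str) -> Result<Vec<String>, Error> {
        Err(Error::Unimplemented("list"))
    }
}
```

The trade-off is that a caller only finds out about a missing capability when the call returns an error, not at compile time.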

@liurenjie1024
Contributor

For example, it requires copy/list as required methods, which are not required by FileIO (this interface is optional).

You can and people do leave methods unimplemented.

That's the problem: we would be defining a richer interface than iceberg uses, which would be confusing for people who want to define their own FileIO implementation. Which parts should they implement, and which should they not? It is also confusing for users, who would need to be careful when using FileIO since some implementations may not implement some methods.

Another concern is that if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduced when object_store evolves.

On the design side, I'm thinking about the following considerations:

  1. We should define small traits rather than one large trait. For example, we could define traits such as FileIO, SupportsBulkOperations, etc., so that concrete implementations can choose which traits to implement according to their capabilities.
  2. We could have an erased trait object definition for the traits defined in 1, say DynFileIO, which is object safe.
  3. For end users, we could provide a struct that wraps the DynFileIO trait object.

(We could discuss naming of each part)

Here is an example:

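use std::sync::Arc;

trait InputFile {
    // ...
}

// Typed trait that implementers write against.
trait FileIO {
    type I: InputFile;
}

// Object-safe, type-erased counterpart of FileIO.
trait DynFileIO {}

// Struct handed to end users; hides the trait object.
struct FileIOWrapper {
    inner: Arc<dyn DynFileIO>,
}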
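
One possible reading of the FileIO / DynFileIO split above is the common "typed trait plus erased trait with a blanket impl" pattern. The sketch below is an assumption about the intent, not the proposed design itself: the method names (`new_input`, `new_input_dyn`, `read`) and the `EchoIO` toy backend are hypothetical, and everything is synchronous for brevity.

```rust
use std::sync::Arc;

pub trait InputFile {
    fn read(&self) -> Vec<u8>;
}

/// Typed trait that implementers write. The associated type means
/// `dyn FileIO` cannot be used without naming `I`.
pub trait FileIO {
    type I: InputFile + 'static;
    fn new_input(&self, path: &str) -> Self::I;
}

/// Erased counterpart with no associated types, so `dyn DynFileIO` works.
pub trait DynFileIO: Send + Sync {
    fn new_input_dyn(&self, path: &str) -> Box<dyn InputFile>;
}

/// Blanket impl: every typed `FileIO` is automatically a `DynFileIO`.
impl<T> DynFileIO for T
where
    T: FileIO + Send + Sync,
{
    fn new_input_dyn(&self, path: &str) -> Box<dyn InputFile> {
        Box::new(self.new_input(path))
    }
}

/// Struct handed to end users; it hides the trait object.
#[derive(Clone)]
pub struct FileIOWrapper {
    inner: Arc<dyn DynFileIO>,
}

impl FileIOWrapper {
    pub fn new<T: FileIO + Send + Sync + 'static>(io: T) -> Self {
        Self { inner: Arc::new(io) }
    }

    pub fn read(&self, path: &str) -> Vec<u8> {
        self.inner.new_input_dyn(path).read()
    }
}

/// Toy backend whose "file content" is just the path bytes.
pub struct EchoIO;
pub struct EchoInput(String);

impl InputFile for EchoInput {
    fn read(&self) -> Vec<u8> {
        self.0.clone().into_bytes()
    }
}

impl FileIO for EchoIO {
    type I = EchoInput;
    fn new_input(&self, path: &str) -> EchoInput {
        EchoInput(path.to_string())
    }
}
```

Under this reading, implementers only write the typed `FileIO`; the erased trait and the wrapper come for free, at the cost of one boxed allocation per input file.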

@linhr

linhr commented May 23, 2025

That's the problem: we would be defining a richer interface than iceberg uses, which would be confusing for people who want to define their own FileIO implementation. Which parts should they implement, and which should they not? It is also confusing for users, who would need to be careful when using FileIO since some implementations may not implement some methods.

This seems like a good point. I feel an explicit interface conveys the intent if Iceberg only uses a small set of file/object operations.

Here is an example:

I'm not sure I understand this though, especially the difference between FileIO and DynFileIO. Also, it seems InputFile etc. are traits now, and I'm worried that this may add a burden for implementers.

Following my earlier thinking around making Storage a trait, would the following be a viable option?

use std::sync::Arc;

pub struct FileIO {
    inner: Arc<dyn Storage>,
}

// Methods elided; this is the extension point for backends.
pub(crate) trait Storage {
}

pub struct InputFile {
    inner: Arc<dyn Storage>,
    path: String,
}
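
A minimal sketch of how this layout might be fleshed out follows. The `read`/`write` methods on `Storage`, the `new_input` constructor, and the in-memory `MemoryStorage` backend are all assumptions added for illustration, since the snippet above leaves the trait body empty. The trait is made `pub` here so that downstream code can inject its own implementation, and everything is synchronous for brevity.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical minimal storage contract (methods are assumptions).
pub trait Storage: Send + Sync {
    fn read(&self, path: &str) -> Option<Vec<u8>>;
    fn write(&self, path: &str, data: Vec<u8>);
}

/// `FileIO` stays a concrete struct; only the backend is dynamic.
#[derive(Clone)]
pub struct FileIO {
    inner: Arc<dyn Storage>,
}

pub struct InputFile {
    inner: Arc<dyn Storage>,
    path: String,
}

impl FileIO {
    /// The storage is injected already configured; no builder is kept around.
    pub fn new(storage: Arc<dyn Storage>) -> Self {
        Self { inner: storage }
    }

    /// Sharing the backend with an `InputFile` is just an `Arc` clone.
    pub fn new_input(&self, path: &str) -> InputFile {
        InputFile {
            inner: Arc::clone(&self.inner),
            path: path.to_string(),
        }
    }
}

impl InputFile {
    pub fn read(&self) -> Option<Vec<u8>> {
        self.inner.read(&self.path)
    }
}

/// Toy in-memory backend, standing in for a user-provided implementation.
#[derive(Default)]
pub struct MemoryStorage {
    files: Mutex<HashMap<String, Vec<u8>>>,
}

impl Storage for MemoryStorage {
    fn read(&self, path: &str) -> Option<Vec<u8>> {
        self.files.lock().unwrap().get(path).cloned()
    }

    fn write(&self, path: &str, data: Vec<u8>) {
        self.files.lock().unwrap().insert(path.to_string(), data);
    }
}
```

With this shape, existing call sites that hold a `FileIO` or `InputFile` keep their types, and only construction changes.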

@roeap

roeap commented May 23, 2025

Just sharing some experiences from the delta world which may not be immediately applicable to the question of which trait to use, but may be food for thought as to where things could be heading.

One thing that repeatedly comes up when talking about table formats is "Metadata is Data". To me, the logical consequence of that is to treat it as such, meaning to process it with the same tools that you would use for processing data. To that end, delta-rs currently keeps all metadata around as arrow record batches, and delta-kernel goes even further by abstracting away the specific data representation.

As such the higher level abstractions we chose are on the level of file formats. I.e. read this {parquet,json,avro,..} file into arrow with this schema. The internal logic processing the metadata either visits individual fields or applies expressions on the data to generate the plans for scans etc. I think to a certain degree this thinking is actually baked into the Iceberg spec via the metadata tables.

By default we provide an arrow (arrays and kernels) and object_store based implementation using many of the same tools used here to read data. Currently I am working on a datafusion engine for kernel, where datafusion's execution plans are used to read data and datafusion's native expressions are used for evaluation.

As a consequence virtually all resource management is under full control of the query engine which is also free to apply any more advanced optimisations (caching, etc.) as it sees fit.

All that said, I am about to start a PoC to find out how much of the query planning and eventually also maintenance that is implemented in aforementioned datafusion engine can be applied to both delta and iceberg.

One thing I am fairly certain of is that the work discussed here will be making my life much easier, and if we end up in a place where we can at least do something like ...

impl<T: ObjectStore> FileIo for T {
    ...
}

that would be awesome!
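
For readers less familiar with blanket impls, here is a small sketch of what such an adapter could look like. `ObjectStoreLike` is a hypothetical synchronous stand-in for the real `ObjectStore` trait, and the single `read` method on `FileIo` is an assumption, since the shape of that trait is exactly what is under discussion.

```rust
/// Hypothetical stand-in for `ObjectStore` (not the real trait).
pub trait ObjectStoreLike {
    fn get(&self, location: &str) -> Option<Vec<u8>>;
}

/// Hypothetical iceberg-side trait with a single assumed method.
pub trait FileIo {
    fn read(&self, path: &str) -> Option<Vec<u8>>;
}

/// Blanket impl: anything that is an object store is usable as `FileIo`
/// without the user writing any adapter code.
impl<T: ObjectStoreLike> FileIo for T {
    fn read(&self, path: &str) -> Option<Vec<u8>> {
        self.get(path)
    }
}

/// An existing store a user might already have in their system.
pub struct OneFileStore;

impl ObjectStoreLike for OneFileStore {
    fn get(&self, location: &str) -> Option<Vec<u8>> {
        if location == "data.parquet" {
            Some(vec![0x50, 0x41, 0x52, 0x31]) // "PAR1" magic bytes
        } else {
            None
        }
    }
}

/// Code written only against `FileIo` accepts the existing store directly.
pub fn file_len(io: &impl FileIo, path: &str) -> Option<usize> {
    io.read(path).map(|bytes| bytes.len())
}
```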

Once we have a consensus here, I am happy to offer my support driving this forward!

@tustvold
Author

Another concern is that if we delegate a core abstraction like FileIO to object_store, we may experience unnecessary breaking changes introduced when object_store evolves.

FWIW we try very hard to avoid these now, aiming for ~2 breaking releases per year. I would eventually like to release a v1.0.0, but that is unlikely to be this year.

We should define small traits rather than one large trait. For example, we could define traits such as FileIO, SupportsBulkOperations, etc., so that concrete implementations can choose which traits to implement according to their capabilities.

At least historically this sort of trait composition hasn't worked very well with trait objects. In particular, you can't write Box<dyn A + B>; only auto traits are supported as additional bounds. You could get around this by defining trait AB: A + B, and Rust 1.86 and later can then upcast, but this would quickly get unmanageable as the number of traits grows.
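
The combination-trait workaround can be sketched as follows. The trait names echo the ones mentioned in the thread, but the method signatures and the `Local` backend are hypothetical. The sketch deliberately avoids trait upcasting so that it does not depend on a recent compiler.

```rust
pub trait FileIO {
    fn read(&self, path: &str) -> String;
}

pub trait SupportsBulkOperations {
    /// Returns how many paths were deleted.
    fn delete_files(&self, paths: &[&str]) -> usize;
}

// `Box<dyn FileIO + SupportsBulkOperations>` does not compile: only auto
// traits such as `Send` and `Sync` may be added to a trait object type.

/// Combination trait standing in for `trait AB: A + B`.
pub trait BulkFileIO: FileIO + SupportsBulkOperations {}

/// Anything implementing both parts gets the combination for free.
impl<T: FileIO + SupportsBulkOperations> BulkFileIO for T {}

pub struct Local;

impl FileIO for Local {
    fn read(&self, path: &str) -> String {
        format!("contents of {path}")
    }
}

impl SupportsBulkOperations for Local {
    fn delete_files(&self, paths: &[&str]) -> usize {
        paths.len()
    }
}

/// Supertrait methods are callable through the combined trait object.
pub fn use_both(io: Box<dyn BulkFileIO>) -> (String, usize) {
    (io.read("a"), io.delete_files(&["a", "b"]))
}
```

Each additional capability trait needs its own named combination for every subset that callers want as a trait object, which is where the growth in the number of traits comes from.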

Whilst returning Err(Error::Unimplemented) is not as nice as getting a compile error, it is the best approach we've been able to devise.

We could have an erased trait object definition for the traits defined in 1, say DynFileIO, which is object safe.

My experience with ObjectStore is that over time this interface will likely grow. Ultimately ObjectStore started out as precisely this, a very targeted IO abstraction for InfluxDB IOx; it then got donated to arrow-rs, and over time, as more people have used it, the interface has grown to its current state to accommodate their requirements. I suspect such a DynFileIO would likely have to go through the same process, especially if the goal is for people to use it as more than just a shim to this crate.

which would be confusing for people who want to define their own FileIO implementation. Which parts should they implement, and which should they not?

I had been viewing FileIO purely as a mechanism to shim into whatever IO abstraction is used by the broader system they're integrating with and not really as something people would generally be implementing or interacting with themselves.

However, I think this is the key perspective difference that is making consensus hard to come by, as I think there are two possible goals being discussed here and on the other linked tickets:

  1. Provide an extensible IO interface inspired by iceberg-java people can use across their applications
  2. Provide a way to integrate their existing ObjectStore based systems with this crate

Ultimately there are a number of people with a demonstrable need for the latter, whereas I believe the former is largely theoretical at this stage. Further, as both @roeap and @Sl1mb0 allude to, people want to integrate iceberg into a broader system, and so even if iceberg devised a FileIO interface, other systems will likely continue to use something more expressive.

In the interests of making some progress perhaps we might find some way to decouple these objectives?

Some suggestions from the various tickets:

All of these would provide a way to solve the pain point that people are currently running into without requiring complex design work, and I'd be happy to help out once we achieve consensus on what it is we would like to do.


6 participants