-
Notifications
You must be signed in to change notification settings - Fork 260
Make FileIO a Trait #1314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thank you for getting this started. I’ve been thinking about this as well. I believe the I'm willing to work on this. |
IMO it would be good to get consensus on the outstanding questions before proceeding with an implementation. I think it would be good to articulate what differences this FileIO trait would have from existing ecosystem abstractions and therefore what problems it seeks to solve. I'm not sure people are on the same page here |
It seems that The only other thing in FWIW, (I think it's fine to have helper functions to build |
If we consider #[derive(Clone, Debug)]
pub struct FileIO {
inner: Storage,
}
#[derive(Clone, Debug)]
pub(crate) enum Storage {
/// An OpenDAL operator.
Operator(Operator),
/// An object_store implementation.
ObjectStore(Arc<dyn ObjectStore>),
} This alternative has a few benefits. (Let me know if there is any drawback that I'm not aware of.)
Happy to discuss! |
The major downside I can see of an enum based approach is that it forces the variants to be enumerated, which in turn limits downstream extensibility. This can be fudged over with feature flags but being able to have separate crates implementing a common interface typically ends up being easier to maintain for all involved.
TBC it is still another abstraction, regardless of if implemented as a trait or a crate private enum. It will entail building custom parquet readers, config handling, path representation, etc... and in turn limit the ability for people to bring their own existing implementations and setups. |
Thanks @tustvold. Yeah the downside of The solution I proposed was more like a short-term solution. I found this would result in less code change and smaller blast radius, given the current status of how |
Or maybe we can have |
It's worth mentioning that I've raised a somewhat related issue in the past regarding the decoupling of building & serialization. On that note - I don't think Iceberg rust needs to necessarily even provide a storage implementation - that's something I would argue users generally already have covered beforehand. If the entire set of metadata types had their building and serialization/de-serialization decoupled, users would have more control over where the Iceberg metadata they build gets written. One thing I think that ties into that though is that it's been mentioned by some of the developers that users of Iceberg Rust should use the transaction API in order to create tables - this effectively means that (and please correct me if I'm wrong here):
With this understanding in mind - it is not clear to me why there is a storage API at all. If users are intended to build & write their own manifest lists, then Iceberg Rust should only need to provide the infrastructure for building and serializing those types, as users should be able to decide where & how they are written. In addition to that, if the Iceberg catalogs are responsible for the building and serialization of table metadata, again, why does Iceberg Rust care about the storage aspect of the table metadata, as that's a catalog implementation's responsibility. |
Hi, sorry for being late for this party, and thanks @tustvold for the summary of the discussions.
While we could abstract out underlying implementation using different providers/implemtations, I still recommend wrapping it in a struct when passing around in the crate. This helps to eliminate limitations of object safety in rust.
I believe it's important to keep the abstractions/functionality required by java api, such as InputFile/OutputFile/delete, as these are required by other components. But it doesn't have be exactly same as, and should be idiomatic for rust developers.
I think it's fine to make breaking changes if we can't avoid.
The reason we should keep https://github.com/apache/iceberg/blob/f06c4f7dfc98cc944a0e1d3a7b38ade0aaa52ce3/api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java#L25 I believe keeping a |
Despite the discussion points raised by @tustvold , I have other things to discuss:
|
So are you proposing making Storage a trait instead? If so for the sake of argument, could this just be ObjectStore?
FWIW we are upstreaming the ObjectStoreRegistry abstraction from DF that may serve as inspiration -
This seems sensible to me FWIW |
I still have concerns for using
|
This is the goal of ObjectStore, it represents an abstraction that has been developed and iterated on by the Rust data ecosystem to allow decoupling things like DataFusion and polars from specific implementations. Now I'm not claiming it is perfect, but it does represent a non-trivial accumulation of knowledge up to this point. As with anything there are engineering tradeoffs, and it is possible iceberg has different requirements, and that's fine. But my hope is that by finding out what these are:
There is object_store_opendal
You can and people do leave methods unimplemented. |
That's the problem, we are defining a richer interface that's not used by iceberg, then it would be confusing for people who want to define their own Also another concern is that, if we delegate a core abstraction like About the design part, I'm thinking about following design considerations:
(We could discuss naming of each part) Here is an example: trait InputFile {
...
}
trait FileIO {
type I: InputFile;
}
trait DynFileIO {
}
struct FileIOWrapper {
inner: Arc<dyn DynFileIO>
} |
This seems a good point. I feel an explicit interface conveys the intent if Iceberg only uses a small set of file/object operations.
I'm not sure if I understand this though, especially the difference between Following my earlier thinking around making pub struct FileIO {
inner: Arc<dyn Storage>,
}
pub(crate) trait Storage {
}
pub struct InputFile {
inner: Arc<dyn Storage>,
path: String,
} |
Just sharing some experiences from the delta world which may not immediately applicable to the question around which trait to use, but maybe be food for thought as to where things could be heading? One thing that repeatedly comes up when talking about table formats is "Metadata is Data". The to me logical consequence of that is to treat it as such, meaning process it with the same tools that you would use processing data. To that avail delta-rs currently keeps all metadata around as arrow record batches, and delta-kernel goes even further abstracting away the specific data representation. As such the higher level abstractions we chose are on the level of file formats. I.e. read this {parquet,json,avro,..} file into arrow with this schema. The internal logic processing the metadata either visits individual fields or applies expressions on the data to generate the plans for scans etc. I think to a certain degree this thinking is actually baked into the Iceberg spec via the metadata tables. By default we provide an arrow (arrays and kernels) and object_store based implementation using many of the same tools used here to read data. Currently I am working on a datafusion engine for kernel, where datafusions execution plans are used to read data and datafusions' native expression for evaluation. As a consequence virtually all resource management is under full control of the query engine which is also free to apply any more advanced optimisations (caching, etc.) as it sees fit. All that said, I am about to start a PoC to find out how much of the query planning and eventually also maintenance that is implemented in aforementioned datafusion engine can be applied to both delta and iceberg. One thing I am fairly certain of is that the work discussed here will be making my life much easier, and if we end up in a place where we can at least do something like ... impl<T: ObjectStore> FileIo for T {
...
} that would be awesome! Once we have a consensus here, I am happy to offer my support driving this forward! |
FWIW we try very hard to avoid these now, aiming for ~2 breaking releases per year. I would eventually like to release a v1.0.0, but that is unlikely to be this year.
At least historically this sort of trait composition hasn't worked very well with trait objects. In particular you can't write Whilst returning
My experience with ObjectStore, is that overtime this interface will likely grow. Ultimately ObjectStore started out as precisely this, a very targetted IO abstraction for InfluxDB IOx, it then got donated to arrow-rs and overtime as more people have used it the interface has grown to its current state to accommodate their requirements. I suspect such a
I had been viewing FileIO purely as a mechanism to shim into whatever IO abstraction is used by the broader system they're integrating with and not really as something people would generally be implementing or interacting with themselves. However, I think this is the key perspective difference that is making consensus hard to come by, as I think there are two possible goals being discussed here and on the other linked tickets:
Ultimately there are number of people with a demonstrable need for the latter, whereas I believe the former is largely theoretical at this stage. Further, as both @roeap and @Sl1mb0 allude to, people want to integrate iceberg into a broader system, and so even if iceberg devised a FileIO interface, other systems will likely continue to use something more expressive. In the interests of making some progress perhaps we might find some way to decouple these objectives? Some suggestions from the various tickets:
All of these would provide a way to solve the pain point that people are currently running into without requiring complex design work, and I'd be happy to help out once we achieve consensus on what it is we would like to do. |
Uh oh!
There was an error while loading. Please reload this page.
Is your feature request related to a problem or challenge?
Originally proposed on #172 (comment) making FileIO a trait would allow for more pluggable storage access. This in turn would potentially allow better integration where people already have an existing storage setup, e.g. based on object_store, that they want to use.
Describe the solution you'd like
I would like FileIO to be changed to a trait, allowing for pluggable storage engines.
The major remaining questions from the linked issue are:
Arc<dyn FileIO
or something else - Consider Using object_store as IO Abstraction #172 (comment) Consider Using object_store as IO Abstraction #172 (comment)If the decision is that we can break compatibility with iceberg-java and are happy to use a trait object, the next question is
My 2 cents is that designing a custom abstraction at this granularity when one already exists and is well adopted within the Rust data ecosystem, seems unnecessary.
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community
Additional Context
See also apache/datafusion#15018 (comment)
The text was updated successfully, but these errors were encountered: