Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Since the much-needed cleanup and rationalization of ListingTableUrl in #2578, glob patterns are only supported when no scheme is provided (in practice: only on the local filesystem, and no longer on other object_stores).
Describe the solution you'd like
To have proper support for glob patterns, e.g. by updating the documentation (and implementation) for ListingTableUrl to the following:
/// Parse a provided string as a `ListingTableUrl`
///
/// # Glob File Paths
///
/// If the path contains any of `'?', '*', '['`, it will be considered
/// a glob expression and resolved as follows:
///
/// The string up to the first path segment containing a glob expression will be extracted,
/// and resolved as any other provided string.
///
/// The remaining string will be interpreted as a [`glob::Pattern`] and used as a
/// filter when listing files from object storage
///
/// # Paths without a Scheme
///
/// If no scheme is provided, or the string is an absolute filesystem path
/// as determined by [`std::path::Path::is_absolute`], the string will be
/// interpreted as a path on the local filesystem using the operating
/// system's standard path delimiter, i.e. `\` on Windows, `/` on Unix.
///
/// If you wish to specify a path that does not exist on the local
/// machine you must provide it as a fully-qualified [file URI]
/// e.g. `file:///myfile.txt`
///
/// [file URI]: https://en.wikipedia.org/wiki/File_URI_scheme
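To make the proposed resolution rule concrete, here is a minimal sketch of the split it describes: everything before the first path segment containing `?`, `*` or `[` becomes the prefix, and the rest becomes the glob. The helper name `split_glob` is hypothetical and is not part of DataFusion. The sketch only handles `/` as a delimiter, and it treats every `?` as a glob character, so a URL with a query string would be misread.

```rust
/// Splits `s` into `(prefix, glob)`, where `prefix` ends just before the first
/// path segment that contains one of the glob characters `?`, `*` or `[`.
///
/// Hypothetical helper for illustration only; not DataFusion API.
/// Assumes `/` is the path delimiter and that `?` never starts a query string.
pub fn split_glob(s: &str) -> (&str, Option<&str>) {
    match s.find(|c: char| matches!(c, '?' | '*' | '[')) {
        // No glob character: the whole string is resolved as-is.
        None => (s, None),
        Some(idx) => {
            // Back up to the start of the segment holding the glob character.
            let split = s[..idx].rfind('/').map(|i| i + 1).unwrap_or(0);
            (&s[..split], Some(&s[split..]))
        }
    }
}
```

With this, `split_glob("s3://bucket/path/*.parquet")` yields the prefix `s3://bucket/path/` and the glob `*.parquet`. The prefix could then be resolved like any other string and the glob applied as a filter while listing.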
Describe alternatives you've considered
We could keep things as they are and push support for globbing further into user-space.
In that case I suggest removing the support for glob altogether in ListingTableUrl.
Today, when a path/string contains an '*' or '[' the user is greeted with a BadSegment error anyway.
The reason I didn't do this is that glob characters aren't URL-safe, so something like s3://bucket/path/*.parquet isn't a valid URL. I could only find examples of systems that support glob expressions on the local filesystem, so I wasn't really sure how best to encode globs in URLs and opted to just punt on it.
Some possible ideas:
Just ignore that it isn't a valid URL and accept the fact it is potentially very confusing (what this ticket proposes)
Provide a programmatic interface to construct a ListingTableUrl with a custom scheme and glob
Encode the glob expression as a URL-encoded query parameter
Something else
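As a rough illustration of the query-parameter idea, the sketch below percent-encodes the glob and appends it to the base URL so the result contains only URL-safe characters. The parameter name `glob` and both function names are assumptions made up for this example; nothing here is an existing DataFusion or object_store API, and it assumes the base URL has no query string of its own.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set
/// (ASCII letters, digits, `-`, `.`, `_`, `~`).
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Appends `glob` to `base` as a percent-encoded query parameter.
/// Hypothetical: the parameter name `glob` is an assumption, and `base`
/// is assumed to carry no existing query string.
pub fn with_glob_param(base: &str, glob: &str) -> String {
    format!("{}?glob={}", base, percent_encode(glob))
}
```

For example, `with_glob_param("s3://bucket/path/", "*.parquet")` gives `s3://bucket/path/?glob=%2A.parquet`, which is a valid URL but arguably harder to read and type than the raw glob.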
It is also potentially worth highlighting that IIRC the logical plan serialization currently doesn't handle glob expressions and just drops them on the floor.
I think it would really help move this forward if we could find an example of a system that supports glob expressions on object stores. Otherwise we end up having to design something custom, which we will inevitably get wrong.
A ListingTableUrl is currently different than just a valid Url.
** When this is not true, why not simply use Url?
** Perhaps TablePath was a more suitable name for this concept?
** And the url field could have been an object_store::path
Anyway, instead of making all these breaking changes without too much thinking, I propose to introduce a GlobbingTable which has Globs (similar to ListingTable and its ListingTableUrl) in datafusion-contrib and see how it works out...
@tustvold WDYT?