-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add nested struct casting support and integrate into SchemaAdapter #16371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This reverts commit c1059e7.
Summary use the table's SchemaAdapterFactory in list_files_for_scan add a test for statistics mapping via a custom factory Testing cargo test -p datafusion --lib test_statistics_mapping_with_custom_factory -- --nocapture cargo test -p datafusion --lib --quiet (fails: missing test data)
allow SchemaMapping to use a custom per-column adaptation function update DefaultSchemaAdapter and NestedStructSchemaAdapter to pass closures remove NestedStructSchemaMapping adjust tests
…streamline new methods
…gTable for improved file source creation
…a error handling in ListingTable
… adapter.rs, and mapping.rs—to improve navigability and reduce merge conflicts.
…larity and functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this code looks very well written and has excellent unit tests. My one concern is whether it is correct to assume that structs should always be allowed to cast to other structs. I can see two ways forward:
- We continue with this assertion but add additional documentation to make this clear to users.
- Provide a "strict" mode where we fail if we have missing fields or fields that cannot be cast (maybe the second half is already done here)
} else { | ||
// If source is not a struct, return null array with target struct type | ||
Ok(new_null_array( | ||
&Struct(target_fields.to_vec().into()), | ||
source_col.len(), | ||
)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the function signature it seems like this is an improper usage - would it make sense that we would return an error in the case of calling this function for a non-struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be an error. I think at this point (reading of the data) we should only be doing casting, not coercion or in this case replacement with nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid point.
I'll make the change.
}; | ||
use std::sync::Arc; | ||
/// Adapt a struct column to match the target field type, handling nested structs recursively | ||
fn adapt_struct_column( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would cast_struct_column
be a more intuitive name? Same for cast_column
below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with adapt_.. because these were created for schema_adapter.
Checking around, I do see a lot of fn cast...
, and none of fn adapt..
.
I'll make the change
// Special handling for struct fields - always include them even if the | ||
// internal structure differs, as we'll adapt them later | ||
match (file_field.data_type(), table_field.data_type()) { | ||
(Struct(_), Struct(_)) => Ok(true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it right that this presumes that we can always cast from struct to struct? It feels like that isn't necessarily true. For example, if I have an input struct that contains two fields, both of which are strings and I attempt to cast it to an output struct with one field of type u32 then I think the current approach would give me a pass but an empty array. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out.
I'll add code to validate compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great! I would split the changes to ListingTable out and address https://github.com/apache/datafusion/pull/16371/files#r2156903806
} else { | ||
// If source is not a struct, return null array with target struct type | ||
Ok(new_null_array( | ||
&Struct(target_fields.to_vec().into()), | ||
source_col.len(), | ||
)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be an error. I think at this point (reading of the data) we should only be doing casting, not coercion or in this case replacement with nulls.
/// Optional [`SchemaAdapterFactory`] for creating schema adapters | ||
schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this part can be split into its own PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, let's look at that but as part of it's own PR
a2edcee
to
8874489
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good on my end now. Let's have @xudong963 give their review and then we can merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use datafusion_common::Result; | ||
use std::sync::Arc; | ||
|
||
/// A SchemaAdapter that handles schema evolution for nested struct types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for adding a new NestedStructAdapter here rather than just extending the behavior of the existing DefaultSchemaAdapter
?
It seems like the ability to coerce nested structs is a good one of everyone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this into DefaultSchemaAdapter as suggested.
/// - Arrow's cast function fails for non-struct types | ||
/// - Memory allocation fails during struct construction | ||
/// - Invalid data type combinations are encountered | ||
pub fn cast_column(source_col: &ArrayRef, target_field: &Field) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ability to cast from one struct to another type is really nice
However, I am worried that by only supporting struct -> struct casting in this module (and then in the NestedSchemaAdapter below) most DataFusion users won't be able to take advantage of this feature
Have you considered changing all uses of
use arrow::compute::cast;
To use this function instead?
I think that woudl nicely close #14757 as well as improve the situation with #14396?
For example
> select {'a': 1};
+----------------------------------+
| named_struct(Utf8("a"),Int64(1)) |
+----------------------------------+
| {a: 1} |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.037 seconds.
> select {'a': 1} = {'a': 2, 'b': NULL};
Error during planning: Cannot infer common argument type for comparison operation Struct(a Int64) = Struct(a Int64, b Null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I logged this as issue in Expose and generalize
cast_column
to enable struct → struct casting in more contexts #16579
I will address it in another PR.
@alamb I asked @kosiew to move that out of the PR to keep it a bit smaller. The plan is to wire it up to ListingTable. I know we want to keep ListingTable simple but it's a relatively small change: #16371 (comment) |
ebd57af
to
33f8bdc
Compare
extend DefaultSchemaAdapter to handle nested struct fields remove NestedStructSchemaAdapter module update re-exports and tests adjust nested stats mapping test
2b6fee2
to
4456419
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…pache#16371) This commit introduces a new `nested_struct` module to support recursive struct-to-struct casting in DataFusion. It enables safe and flexible schema evolution through: - Recursive casting of nested structs via `cast_column` and `cast_struct_column` - Filling missing target fields with null values - Ignoring extra fields present in the source but absent in the target - Validating field compatibility using `validate_struct_compatibility`, including nested levels Integration updates include: - Enhancing `SchemaAdapter` and `SchemaMapping` to use the new nested struct casting logic - Injecting a customizable `cast_column` function into `SchemaMapping` - Updating schema mapping to support nested structs for file-to-table schema projection Test coverage: - Unit tests for simple, nested, missing, and incompatible struct scenarios - Adapter tests for batch mapping and statistics mapping with nested fields - Ensures existing functionality remains intact These changes provide robust support for evolving complex data schemas and improve the safety of nested data transformations, especially when working with Parquet and JSON data sources.
Which issue does this PR close?
This is the part of a series of PRs re-implementing #15295 to close #14757 by adding schema-evolution support for:
in DataFusion.
Rationale for this change
DataFusion’s existing casting logic could handle only primitive and top-level struct conversions via Arrow’s
cast
. Complex nested structs could not be projected or evolved safely across schema changes. This PR introduces a dedicatednested_struct
module to recursively cast and validate arbitrary-depth structs, ensuring:This enables robust schema evolution for Parquet, JSON, and other nested-schema file formats.
What changes are included in this PR?
datafusion/common/src/nested_struct.rs
cast_struct_column
&cast_column
for recursive struct castingvalidate_struct_compatibility
for compatibility checksdatafusion/datasource/src/schema_adapter.rs
)validate_struct_compatibility
when comparing struct fieldscast_column
intoSchemaMapping
for file → table conversionsdatafusion/core/src/datasource/mod.rs
)object_store
import for Parquet testsAre these changes tested?
Yes, this PR adds and updates multiple test suites:
nested_struct.rs
for casting and validationAll new and existing tests pass locally.
Are there any user-facing changes?
datafusion_common::nested_struct
SchemaAdapter
can now transparently cast nested structs without manual null-filling