Skip to content

Add nested struct casting support and integrate into SchemaAdapter #16371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 67 commits into from
Jun 27, 2025

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Jun 11, 2025

Which issue does this PR close?

This is the part of a series of PRs re-implementing #15295 to close #14757 by adding schema-evolution support for:

  • listing‐based tables
  • with nested structs
    in DataFusion.

Rationale for this change

DataFusion’s existing casting logic could handle only primitive and top-level struct conversions via Arrow’s cast. Complex nested structs could not be projected or evolved safely across schema changes. This PR introduces a dedicated nested_struct module to recursively cast and validate arbitrary-depth structs, ensuring:

  • Missing target fields are filled with nulls
  • Extra source fields are ignored
  • Nested structs are handled depth-first
  • SchemaAdapter uses the new casting logic for file-to-table mappings

This enables robust schema evolution for Parquet, JSON, and other nested-schema file formats.


What changes are included in this PR?

  • New module datafusion/common/src/nested_struct.rs
    • cast_struct_column & cast_column for recursive struct casting
    • validate_struct_compatibility for compatibility checks
    • Comprehensive unit tests covering simple, missing, extra, nested, and incompatible scenarios
  • Common imports updated to expose the nested-struct API
  • SchemaAdapter (datafusion/datasource/src/schema_adapter.rs)
    • Use validate_struct_compatibility when comparing struct fields
    • Inject cast_column into SchemaMapping for file → table conversions
    • Update tests to cover nested struct adaptation and statistics mapping
  • Core datasource tests cleanup (datafusion/core/src/datasource/mod.rs)
    • Reintroduced missing object_store import for Parquet tests

Are these changes tested?

Yes, this PR adds and updates multiple test suites:

  1. Unit tests in nested_struct.rs for casting and validation
  2. SchemaAdapter tests covering nested struct adaptation, batch mapping, and statistics translation
  3. Core datasource tests ensuring Parquet integration still compiles and runs

All new and existing tests pass locally.


Are there any user-facing changes?

  • No breaking changes to public APIs; all new functions live under datafusion_common::nested_struct
  • Users leveraging custom SchemaAdapter can now transparently cast nested structs without manual null-filling
  • Documentation for nested struct casting will be added in a follow-up PR if needed

kosiew added 26 commits June 10, 2025 18:13
Summary
use the table's SchemaAdapterFactory in list_files_for_scan
add a test for statistics mapping via a custom factory
Testing
cargo test -p datafusion --lib test_statistics_mapping_with_custom_factory -- --nocapture
cargo test -p datafusion --lib --quiet (fails: missing test data)
allow SchemaMapping to use a custom per-column adaptation function
update DefaultSchemaAdapter and NestedStructSchemaAdapter to pass closures
remove NestedStructSchemaMapping
adjust tests
… adapter.rs, and mapping.rs—to improve navigability and reduce merge conflicts.
@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Jun 11, 2025
Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this code looks very well written and has excellent unit tests. My one concern is whether it is correct to assume that structs should always be allowed to cast to other structs. I can see two ways forward:

  1. We continue with this assertion but add additional documentation to make this clear to users.
  2. Provide a "strict" mode where we fail if we have missing fields or fields that cannot be cast (maybe the second half is already done here)

Comment on lines 53 to 59
} else {
// If source is not a struct, return null array with target struct type
Ok(new_null_array(
&Struct(target_fields.to_vec().into()),
source_col.len(),
))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the function signature it seems like this is an improper usage - would it make sense that we would return an error in the case of calling this function for a non-struct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be an error. I think at this point (reading of the data) we should only be doing casting, not coercion or in this case replacement with nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point.
I'll make the change.

};
use std::sync::Arc;
/// Adapt a struct column to match the target field type, handling nested structs recursively
fn adapt_struct_column(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would cast_struct_column be a more intuitive name? Same for cast_column below

Copy link
Contributor Author

@kosiew kosiew Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with adapt_.. because these were created for schema_adapter.
Checking around, I do see a lot of fn cast..., and none of fn adapt...
I'll make the change

// Special handling for struct fields - always include them even if the
// internal structure differs, as we'll adapt them later
match (file_field.data_type(), table_field.data_type()) {
(Struct(_), Struct(_)) => Ok(true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right that this presumes that we can always cast from struct to struct? It feels like that isn't necessarily true. For example, if I have an input struct that contains two fields, both of which are strings and I attempt to cast it to an output struct with one field of type u32 then I think the current approach would give me a pass but an empty array. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out.
I'll add code to validate compatibility.

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! I would split the changes to ListingTable out and address https://github.com/apache/datafusion/pull/16371/files#r2156903806

Comment on lines 53 to 59
} else {
// If source is not a struct, return null array with target struct type
Ok(new_null_array(
&Struct(target_fields.to_vec().into()),
source_col.len(),
))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be an error. I think at this point (reading of the data) we should only be doing casting, not coercion or in this case replacement with nulls.

Comment on lines 88 to 89
/// Optional [`SchemaAdapterFactory`] for creating schema adapters
schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this part can be split into its own PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change.

Copy link
Contributor

@adriangb adriangb Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, let's look at that but as part of it's own PR

@kosiew kosiew force-pushed the nested-struct-14757 branch from a2edcee to 8874489 Compare June 20, 2025 04:55
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good on my end now. Let's have @xudong963 give their review and then we can merge.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew and @adriangb

I think the idea for a custom struct --> struct cast operator is really nice

However, I am concerned with how this is proposed to be wired in (it seems like it isn't connected to anything 🤔 )

use datafusion_common::Result;
use std::sync::Arc;

/// A SchemaAdapter that handles schema evolution for nested struct types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for adding a new NestedStructAdapter here rather than just extending the behavior of the existing DefaultSchemaAdapter?

It seems like the ability to coerce nested structs is a good one of everyone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// - Arrow's cast function fails for non-struct types
/// - Memory allocation fails during struct construction
/// - Invalid data type combinations are encountered
pub fn cast_column(source_col: &ArrayRef, target_field: &Field) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ability to cast from one struct to another type is really nice

However, I am worried that by only supporting struct -> struct casting in this module (and then in the NestedSchemaAdapter below) most DataFusion users won't be able to take advantage of this feature

Have you considered changing all uses of

use arrow::compute::cast;

To use this function instead?

I think that woudl nicely close #14757 as well as improve the situation with #14396?

For example

> select {'a': 1};
+----------------------------------+
| named_struct(Utf8("a"),Int64(1)) |
+----------------------------------+
| {a: 1}                           |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.037 seconds.

> select {'a': 1} = {'a': 2, 'b': NULL};
Error during planning: Cannot infer common argument type for comparison operation Struct(a Int64) = Struct(a Int64, b Null)

Copy link
Contributor Author

@kosiew kosiew Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb
Copy link
Contributor

However, I am concerned with how this is proposed to be wired in (it seems like it isn't connected to anything 🤔 )

@alamb I asked @kosiew to move that out of the PR to keep it a bit smaller. The plan is to wire it up to ListingTable. I know we want to keep ListingTable simple but it's a relatively small change: #16371 (comment)

extend DefaultSchemaAdapter to handle nested struct fields
remove NestedStructSchemaAdapter module
update re-exports and tests
adjust nested stats mapping test
@kosiew kosiew force-pushed the nested-struct-14757 branch from 2b6fee2 to 4456419 Compare June 27, 2025 10:20
@kosiew kosiew changed the title Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable Add nested struct casting support and integrate into SchemaAdapter Jun 27, 2025
@kosiew kosiew merged commit 0c30f13 into apache:main Jun 27, 2025
29 checks passed
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew and @adriangb

adriangb pushed a commit to pydantic/datafusion that referenced this pull request Jun 27, 2025
…pache#16371)

This commit introduces a new `nested_struct` module to support recursive struct-to-struct casting in DataFusion. It enables safe and flexible schema evolution through:

- Recursive casting of nested structs via `cast_column` and `cast_struct_column`
- Filling missing target fields with null values
- Ignoring extra fields present in the source but absent in the target
- Validating field compatibility using `validate_struct_compatibility`, including nested levels

Integration updates include:

- Enhancing `SchemaAdapter` and `SchemaMapping` to use the new nested struct casting logic
- Injecting a customizable `cast_column` function into `SchemaMapping`
- Updating schema mapping to support nested structs for file-to-table schema projection

Test coverage:

- Unit tests for simple, nested, missing, and incompatible struct scenarios
- Adapter tests for batch mapping and statistics mapping with nested fields
- Ensures existing functionality remains intact

These changes provide robust support for evolving complex data schemas and improve the safety of nested data transformations, especially when working with Parquet and JSON data sources.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datafusion can't seem to cast evolving structs
4 participants