Fix duplicate unqualified Field name (schema error) on join queries #15438

LiaCastaneda · 2025-03-26T12:55:04Z

Which issue does this PR close?

Closes Duplicate unqualified field names error on queries with multiple JOIN #15439

Rationale for this change

As mentioned in the issue, when using the Substrait consumer, queries with multiple JOINs fail because the renaming logic fails to make column names unique.

What changes are included in this PR?

Are these changes tested?

All tests pass, and new test cases were added to consumer_integration.rs and to test_change_redundant_column.

Are there any user-facing changes?

LiaCastaneda · 2025-03-26T12:57:03Z

datafusion/expr/src/logical_plan/builder.rs

 pub fn change_redundant_column(fields: &Fields) -> Vec<Field> {
    let mut name_map = HashMap::new();
+    let mut seen: HashSet<String> = HashSet::new();
+
    fields
        .into_iter()
        .map(|field| {
-            let counter = name_map.entry(field.name().to_string()).or_insert(0);
-            *counter += 1;
-            if *counter > 1 {
-                let new_name = format!("{}:{}", field.name(), *counter - 1);
-                Field::new(new_name, field.data_type().clone(), field.is_nullable())
-            } else {
-                field.as_ref().clone()


This seems to be the root cause of the issue. When doing joins, the function requalify_sides_if_needed aliases the columns so that the resulting join schema (let in_join_schema = left.schema().join(right.schema())?;) can be created. However, consider a query like:

select * from first_agg LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id LEFT JOIN second_agg ON first_agg.id = second_agg.id LEFT JOIN third_agg ON first_agg.id = third_agg.id

The first JOIN to be converted to a logical plan, LEFT JOIN third_agg ON first_agg.id = third_agg.id, will work: the join schema column names stay as they are, with an alias. The subsequent JOINs fail, however, because the consumer does the following for each JOIN:

After handling the innermost join the resulting join schema is [left.id] [right.id] ✅

For the second join we "carry" the previous schema, so in requalify_sides_if_needed we would have [id, left.id] [id, right.id]. We would have to alias again -> [left.id, left.id] [right.id, right.id], and because of this function we would end up with [left.id:1 , left.id] [right.id:1 , right.id] ✅

On the outermost and final join the process is repeated: [id, left.id:1 , left.id] [id, right.id:1 , right.id] ->
[left.id:1, left.id:1 , left.id] [right.id:1, right.id:1 , right.id]. Because id:1 is repeated under the current change_redundant_column algorithm, the query fails with Schema contains duplicate unqualified field name "id:1" 🟥

Moreover, we can observe that with just two levels of joins we get no error:

select * from first_agg LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id LEFT JOIN second_agg ON first_agg.id = second_agg.id
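To make the collision concrete, here is a minimal sketch on plain strings. It is an assumption-laden model, not the DataFusion code: the function names rename_by_counter and rename_until_unseen are hypothetical, and the second function only follows the general shape suggested by the seen set in the diff above (real fields also carry data types, nullability, and metadata).

```rust
use std::collections::{HashMap, HashSet};

/// Models the removed logic: the n-th repeat of a name becomes `name:n-1`.
/// It never checks whether the generated name already exists in the input.
fn rename_by_counter(names: &[&str]) -> Vec<String> {
    let mut name_map: HashMap<String, usize> = HashMap::new();
    names
        .iter()
        .map(|name| {
            let counter = name_map.entry(name.to_string()).or_insert(0);
            *counter += 1;
            if *counter > 1 {
                format!("{}:{}", name, *counter - 1)
            } else {
                name.to_string()
            }
        })
        .collect()
}

/// Hypothetical variant: keep bumping the suffix until the candidate
/// name has not been emitted before, tracking emitted names in `seen`.
fn rename_until_unseen(names: &[&str]) -> Vec<String> {
    let mut name_map: HashMap<String, usize> = HashMap::new();
    let mut seen: HashSet<String> = HashSet::new();
    names
        .iter()
        .map(|name| {
            let count = name_map.entry(name.to_string()).or_insert(0);
            let mut new_name = name.to_string();
            while seen.contains(&new_name) {
                *count += 1;
                new_name = format!("{}:{}", name, count);
            }
            seen.insert(new_name.clone());
            new_name
        })
        .collect()
}
```

With the input ["id", "id:1", "id"], rename_by_counter produces "id:1" twice, because the suffix it generates for the second "id" collides with a name carried over from an earlier join. rename_until_unseen yields ["id", "id:1", "id:2"] for the same input.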

LiaCastaneda · 2025-03-26T12:58:24Z

datafusion/physical-expr/src/equivalence/projection.rs

                            if col.name() != matching_input_field.name() {
-                                return internal_err!("Input field name {} does not match with the projection expression {}",
-                                    matching_input_field.name(),col.name())
-                                }
+                                let fixed_col = Column::new(col.name(), idx);
+                                return Ok(Transformed::yes(Arc::new(fixed_col)))
+                            }


If this check is skipped the query still works, just as it can be skipped here for the aggregate node schema check. Without this change we would get the error: Input field name count(Int64(1)) does not match with the projection expression count(Int64(1)):1. Still, it would be nice to know if this is the correct approach.

berkaysynnada · 2025-04-02

I think we should revert this change. This check was helpful in catching many errors, especially while developing projection-related code (e.g. projection pushdown). Sorry for my delayed response, but @LiaCastaneda, could you please address the root cause of the issue and revert this change?

As you mentioned, the problem likely stems from inconsistent naming conventions between columns and fields. I recall encountering similar issues with aggregation functions in the past, and we resolved them by unifying the naming. I believe the correct fix shouldn’t require too much effort.

LiaCastaneda · 2025-04-02

Hi, sorry about this. I'm trying to understand: in that change, what I did was to amend the issue by unifying the naming instead of skipping the check.

I checked how we build the LogicalPlan, and the input fields have the same names as the projection column expressions: both have count(Int64(1)):1, so I don't know where that field is being set to count(Int64(1)) for the physical input_schema. In the physical planner I also printed the logical input schema here, and it appears as count(Int64(1)):1. I also noticed that the same function creates a physical Expr for the projection based on the logical input_schema and not the physical input_exec schema, which is why it fails later on the check.

I think one approach could be to move the check here and add an option to the runtime config, which we can get through the session_state, to skip it (set to false by default) so errors can still be caught while developing (iiuc this doesn't cause errors during execution). Something similar was done for aggregate nodes. I was looking into your PR that adds that check, but apparently we don't have create_physical_name anymore after #11977.

edit: I have a solution in mind based on what I mentioned above. Can this wait a couple of days instead of reverting the PR commit, since reverting would also undo the duplicate schema names fix? I will try to open another PR this week that brings the check back.
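The check discussed in this thread, and the proposed opt-out, could be modeled roughly as follows. This is only a sketch under stated assumptions: Column here is a hypothetical stand-in rather than the DataFusion type, the input schema is reduced to a slice of field names, and skip_name_check is an invented parameter standing in for the suggested config option.

```rust
/// Hypothetical stand-in for a physical column reference: a name plus
/// the index of the input field it points at.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    index: usize,
}

/// Resolves `col` against the input field names.
/// With `skip_name_check == false` (the suggested default), a name mismatch
/// at the column's index is reported as an error, so bugs still surface
/// during development. With `skip_name_check == true`, the column keeps
/// its own name and index even if the input field is named differently.
fn resolve_column(
    col: &Column,
    input_fields: &[&str],
    skip_name_check: bool,
) -> Result<Column, String> {
    let field = input_fields
        .get(col.index)
        .ok_or_else(|| format!("column index {} is out of bounds", col.index))?;
    if col.name != *field && !skip_name_check {
        return Err(format!(
            "Input field name {} does not match with the projection expression {}",
            field, col.name
        ));
    }
    Ok(col.clone())
}
```

Calling resolve_column with a column named count(Int64(1)):1 at index 0 and an input field named count(Int64(1)) reproduces the error message quoted above when the check is enabled, and returns the column unchanged when it is skipped.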

gabotechs

Nice! This one seems like a very hard-to-catch problem.

datafusion/substrait/tests/cases/consumer_integration.rs

datafusion/expr/src/logical_plan/builder.rs

LiaCastaneda · 2025-04-01T10:37:26Z

👋 Hi @alamb, I was wondering if I could get a review on this PR. The issue happens while using the Substrait consumer, but the PR also fixes a potential issue you observed here 🙇‍♀️

alamb

Thank you @LiaCastaneda and @gabotechs -- this PR looks good to me

alamb · 2025-04-01T16:42:41Z

datafusion/expr/src/logical_plan/builder.rs

-                field.as_ref().clone()
+            let base_name = field.name();
+            let count = name_map.entry(base_name.clone()).or_insert(0);
+            let mut new_name = base_name.clone();


I played around with trying to avoid this clone, but I could not come up with anything that was reasonable

Since this function is only called when creating subqueries I think it is fine

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20change_redundant_column&type=code

alamb · 2025-04-01T16:43:42Z

datafusion/physical-expr/src/equivalence/projection.rs

-                                return internal_err!("Input field name {} does not match with the projection expression {}",
-                                    matching_input_field.name(),col.name())
-                                }
+                                let fixed_col = Column::new(col.name(), idx);


FYI @berkaysynnada and @akurmustafa

sorry for being late :( https://github.com/apache/datafusion/pull/15438/files#discussion_r2025001167

alamb · 2025-04-01T19:33:11Z

Hi @LiaCastaneda -- I believe the CI has failed on this PR due to a change in the CI actions. Can you please merge the PR up to main, which I think will address the issue?

LiaCastaneda · 2025-04-02T09:00:37Z

Thanks for the reviews @alamb and @gabotechs

…pache#15438) * Fix duplicate unqualified field name issue * Adjust Projection Properly * Add reproducer plan * Adjust comment * Set metadata to be the same as well * Fix substrait reproducer + Add test case * Format * Add explanation comment * Add test case to change_redundant_column

…pache#15438) (#15) * Fix duplicate unqualified field name issue * Adjust Projection Properly * Add reproducer plan * Adjust comment * Set metadata to be the same as well * Fix substrait reproducer + Add test case * Format * Add explanation comment * Add test case to change_redundant_column

LiaCastaneda added 3 commits March 26, 2025 13:47

Fix duplicate unqualified field name issue

ddbf54e

Adjust Projection Properly

8d6ac43

Add reproducer plan

36645ef

github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates), and substrait (Changes to the substrait crate) labels Mar 26, 2025

LiaCastaneda mentioned this pull request Mar 26, 2025

Duplicate unqualified field names error on queries with multiple JOIN #15439

Closed

LiaCastaneda force-pushed the liacastaneda/fix-schema-error-on-join-queries branch from 2144459 to 3ec9665 Compare March 26, 2025 15:06

LiaCastaneda added 3 commits March 27, 2025 09:25

Adjust comment

a9d9949

Set metadata to be the same as well

c3c6abb

Fix substrait reproducer + Add test case

63cd54d

LiaCastaneda force-pushed the liacastaneda/fix-schema-error-on-join-queries branch 2 times, most recently from a11fba0 to 4867827 Compare March 27, 2025 15:33

LiaCastaneda marked this pull request as ready for review March 28, 2025 08:23

Format

2f925ce

LiaCastaneda force-pushed the liacastaneda/fix-schema-error-on-join-queries branch from 4867827 to 2f925ce Compare March 29, 2025 11:38

gabotechs approved these changes Mar 31, 2025

LiaCastaneda added 2 commits April 1, 2025 12:15

Add explanation comment

926fcb7

Add test case to change_redundant_column

97ef820

alamb approved these changes Apr 1, 2025

Merge remote-tracking branch 'upstreamDF/main' into liacastaneda/fix-…

301e232

…schema-error-on-join-queries

alamb merged commit b9decd7 into apache:main Apr 2, 2025
27 checks passed

LiaCastaneda mentioned this pull request Apr 3, 2025

[branch-46] Fix duplicate unqualified Field name (schema error) on join queries DataDog/datafusion#15

Merged

LiaCastaneda mentioned this pull request Apr 4, 2025

Move back schema not matching check and workaround #15580

Merged
