BUG: join unexpectedly created extra column start with "key_" #61294
Comments
This seems to be the expected behavior if you look at the documentation examples. Moreover, it seems to be why we pass lsuffix and rsuffix. Please let me know if I am missing something.
In this case the "key_0" column seems redundant, as the logic is straightforward. DataFrame a's column 0 is the target column for the join operation. DataFrame b also has a column 0, but that one can simply be renamed to 0_test (as specified by rsuffix) and b joined by its index. I would expect a result with only the columns 0 and 0_test. Interestingly, the behavior is different when the column name is a letter: the result then has only the columns a and a_test.
take
Thanks for clarifying! I think I have found the issue. The join function is basically a wrapper around the merge function, so downstream the merge code runs:
result.insert(i, name or f"key_{i}", key_col)
This adds the key column to the DataFrame under name, falling back to f"key_{i}". The intent was to catch the case where name is None. But because the expression uses or, a falsy name such as the integer 0 also triggers the fallback. The fix would be:
result.insert(i, name if name is not None else f"key_{i}", key_col)
However, this is not the actual issue. The real problem comes a bit earlier and is specifically due to how join passes the suffix "" to merge instead of None. As a result, a column whose name is an integer is not joined like its string counterpart; instead two columns are produced, one named with the integer and one with a string.
There are two solutions to this. One would be to change the suffix defaults in join from "" to None; the other would be to change how merge handles an empty suffix. Both would have some minor backward compatibility issues. The suffix change in join would be localized to the join function, while the latter would affect both merge and join. However, it seems pretty uncommon to pass an empty suffix to a merge operation, and even less common to use integer column names. Changing join to None would also be more in line with users' expectations when calling .join. Thus, changing the join signature would impact backward compatibility the least: developers using merge with empty-string suffixes keep their current behavior, while join behaves the same way it already does for string column names.
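To make the truthiness point above concrete, here is a minimal pure-Python sketch. It does not call pandas; the two helper names are hypothetical and exist only to isolate the `name or f"key_{i}"` expression quoted in the comment and the proposed `is not None` replacement.

```python
def key_name_with_or(name, i):
    # Same expression as in the quoted line: any falsy name
    # (None, 0, "") falls through to the "key_{i}" fallback.
    return name or f"key_{i}"


def key_name_with_none_check(name, i):
    # Proposed replacement: only a missing name (None) gets the fallback.
    return name if name is not None else f"key_{i}"


if __name__ == "__main__":
    for name in (None, "a", 0):
        print(
            repr(name),
            "->",
            repr(key_name_with_or(name, 0)),
            "vs",
            repr(key_name_with_none_check(name, 0)),
        )
```

With the `or` form, the integer column name 0 is indistinguishable from a missing name, which is consistent with the extra "key_0" label described in the report; the `is not None` form keeps 0 as the name.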
+1. In addition we should document in these args that when either is specified, any non-string columns will be converted to strings before applying the suffix.
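The difference between an empty-string suffix and None can be sketched as follows. This is an illustration under stated assumptions, not pandas' actual code: `apply_suffix` is a hypothetical helper that assumes a suffix of None means "leave the label alone", while any string suffix, including "", formats the label into a string.

```python
def apply_suffix(label, suffix):
    # Hypothetical renamer, assumed for illustration only.
    if suffix is None:
        # No suffix requested: the label keeps its original type.
        return label
    # Any string suffix, even "", turns the label into a str.
    return f"{label}{suffix}"


if __name__ == "__main__":
    for suffix in (None, "", "_test"):
        renamed = apply_suffix(0, suffix)
        print(repr(suffix), "->", repr(renamed), type(renamed).__name__)
```

Under this assumption, an integer label 0 with suffix "" becomes the string "0", so it no longer compares equal to the original integer label. That matches the thread's description of ending up with one integer-named and one string-named column.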
@rhshadrach I'm not sure how I should go about this. Should I raise a
@ShayanG9 - it should start as a
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When dataframe a and b have the same column name, a key_ column is created unexpectedly after the join operation.
   key_0  0  0_test
0      1  1       4
1      2  2       5
2      3  3       6
Expected Behavior
Expecting result without the key_0 column.
Installed Versions
pandas : 2.2.3
numpy : 1.26.4