Skip to content

BUG: join unexpectedly created extra column start with "key_" #61294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
albb318 opened this issue Apr 15, 2025 · 7 comments
Open
3 tasks done

BUG: join unexpectedly created extra column start with "key_" #61294

albb318 opened this issue Apr 15, 2025 · 7 comments
Assignees
Labels
Bug Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@albb318
Copy link

albb318 commented Apr 15, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

a = pd.DataFrame({0:[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={0:[4,5,6]})
a.join(b, on=0, rsuffix='_test')

Issue Description

When dataframe a and b have the same column name, a key_ column is created unexpectedly after the join operation.

key_0 0 0_test
0 1 1 4
1 2 2 5
2 3 3 6

Expected Behavior

Expecting result without the key_0 column.

Installed Versions

pandas : 2.2.3
numpy : 1.26.4

@albb318 albb318 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2025
@albb318 albb318 changed the title BUG: join unexpected created extra column start with "key_" BUG: join unexpectedly created extra column start with "key_" Apr 15, 2025
@ShayanG9
Copy link

This seems to be the expected behavior if you look at the documentation examples. Moreover, it seems it is why we pass lsuffix and rsuffix. Please let me know if I am missing somthing.

@albb318
Copy link
Author

albb318 commented Apr 15, 2025

In this case the "key_0" columns seems redundant, as the logic is straightforward. Dataframe a's column 0 is the target column for the join operation, although Dataframe b also has column 0, but it can just rename it to 0_test (as specifed by rsuffix), and join by its index. Expecting result as the following

0 0_test
0 1 4
1 2 5
2 3 6

Interestingly, the behavior is different with letter column name, see the following example

a = pd.DataFrame({'a':[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={'a':[4,5,6]})
a.join(b, on='a', rsuffix='_test')

it will get result as

a a_test
0 1 4
1 2 5
2 3 6

@ShayanG9
Copy link

take

@ShayanG9
Copy link

Thanks for clarifying! I think I have found the issue. The join function is basically a wrapper for the merge function, so down stream in pandas/core/reshape/merge.py There is a line of code that does

result.insert(i, name or f"key_{i}", key_col)

This basically adds the column to the DataFrame, with the variable on or "key_{column_name}". The intent for this was to catch when name was None. But, because 0 is truthy, while every other number is falsey the result is that you get the latter for columns with 0. This seems like just a oversight from the person who implemented this.

The fix would be

result.insert(i, name if name is not None else f"key_{i}", key_col)

However, this is not the actual issue the issue comes a bit earlier and it is specifically due to how join passes the suffix "" to merge instead of None. This causes behavior like if any of the column names is an integer it does not join like their string counter parts instead making two columns: one an integer and one a string.

There are two solutions to this. One would be to instead of defining lsuffix and rsuffix to be "" it should be None. The other is changing the renamer function to not touch the column if the suffix string is empty.

Both would have some minor backward compatibility issues, the suffix changes in join would be more localized to just the join function while the latter would affect the merge feature and the join feature. However, Its seems pretty uncommon to add a empty suffix to a merge operation and even less common to use integer columns. And, changing join to None would be more inline with users expectations when using .join function.

Thus, changing the join signature would impact backwards compatibility the least ensuring developers using merge with empty strings maintain their functionality, while aligning the join feature to be more like what happens when a string counterpart is used.

@rhshadrach
Copy link
Member

One would be to instead of defining lsuffix and rsuffix to be "" it should be None.

+1. In addition we should document in these args that when either is specified, any non-string columns will be converted to strings before applying the suffix.

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 16, 2025
@ShayanG9
Copy link

@rhshadrach I'm not sure how I should go about this. Should I raise a FutureWarning and leave the default value as is. Or, should I change the default value to None and put up a DepreciationWarning?

@rhshadrach
Copy link
Member

@ShayanG9 - it should start as a DeprecationWarning. You will likely need to change the default value lib.no_default to be able to tell if the user is passing None or not. You can search the code for other uses of lib.no_default as examples.

@rhshadrach rhshadrach added the Deprecate Functionality to remove in pandas label Apr 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants