FIFO JoinHashMap for HashJoin #8130

Closed
@korowa

Description

@korowa

Is your feature request related to a problem or challenge?

This problem is related to #8020.

At the moment, the collect_left_input function in the hash join implementation produces a JoinHashMap that is optimized for reverse iteration: for each hash_value, the actual HashMap stores the index of the last build-side row, and the remaining indices for the same hash are stored in the JoinHashMap.next vector.
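To make the current layout concrete, here is a minimal sketch of a last-row-first chain. It is a simplification, not the actual implementation: `LifoJoinMap`, `build` and `chain` are hypothetical names, a plain `std::collections::HashMap` stands in for the real map, and indices are stored as `row + 1` so that `0` can mark the end of a chain (an assumption of this sketch).

```rust
use std::collections::HashMap;

/// Simplified stand-in for the current map layout (hypothetical type).
struct LifoJoinMap {
    /// hash value -> (index of the LAST build row with that hash) + 1
    map: HashMap<u64, u64>,
    /// next[i] = (index of the previous build row with the same hash) + 1,
    /// or 0 when row i is the end of its chain
    next: Vec<u64>,
}

impl LifoJoinMap {
    /// Build the map from precomputed build-side hash values.
    fn build(hashes: &[u64]) -> Self {
        let mut map = HashMap::new();
        let mut next = vec![0u64; hashes.len()];
        for (row, &hash) in hashes.iter().enumerate() {
            // The newest row becomes the map entry; the old entry is chained behind it.
            if let Some(prev) = map.insert(hash, row as u64 + 1) {
                next[row] = prev;
            }
        }
        Self { map, next }
    }

    /// Walk the chain for one hash value; rows come out newest first.
    fn chain(&self, hash: u64) -> Vec<usize> {
        let mut rows = Vec::new();
        let mut cur = self.map.get(&hash).copied().unwrap_or(0);
        while cur != 0 {
            let row = (cur - 1) as usize;
            rows.push(row);
            cur = self.next[row];
        }
        rows
    }
}
```

For build-side hashes `[10, 20, 10, 10]`, walking the chain for hash `10` visits rows 3, 2, 0, which is the reverse of build-side order.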

To preserve input order in the join output, HashJoinStream iterates over the probe batch in reverse order and, after processing the whole batch, reverses the order of the matched build and probe indices again.

The problem is that reverse iteration doesn't allow HashJoinStream to produce a partial output batch (one emitted before all matched indices for the current probe batch are known): in that case the join output order would be distorted.
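The reverse-then-flip scheme described above can be sketched as follows. This is an illustration under assumptions, not the HashJoinStream code: `build_lifo` and `probe_reverse_then_flip` are hypothetical helpers, hashes are assumed to be precomputed, and equality checks on the join keys are omitted.

```rust
use std::collections::HashMap;

/// Build a last-row-first chain: map holds (last row + 1), next holds (previous row + 1).
fn build_lifo(build_hashes: &[u64]) -> (HashMap<u64, u64>, Vec<u64>) {
    let mut map = HashMap::new();
    let mut next = vec![0u64; build_hashes.len()];
    for (row, &hash) in build_hashes.iter().enumerate() {
        if let Some(prev) = map.insert(hash, row as u64 + 1) {
            next[row] = prev;
        }
    }
    (map, next)
}

/// Return matched (build_row, probe_row) pairs in natural order.
fn probe_reverse_then_flip(build_hashes: &[u64], probe_hashes: &[u64]) -> Vec<(usize, usize)> {
    let (map, next) = build_lifo(build_hashes);
    let mut pairs = Vec::new();
    // Probe rows are visited last to first, and each chain yields build rows last to first.
    for (probe_row, hash) in probe_hashes.iter().enumerate().rev() {
        let mut cur = map.get(hash).copied().unwrap_or(0);
        while cur != 0 {
            let build_row = (cur - 1) as usize;
            pairs.push((build_row, probe_row));
            cur = next[build_row];
        }
    }
    // Only after the WHOLE probe batch is processed does one flip restore natural order.
    pairs.reverse();
    pairs
}
```

Before the final `reverse()`, any prefix of `pairs` holds the tail of the eventual output, so cutting the result off early to emit a partial batch would emit rows out of order.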

Describe the solution you'd like

The desired behaviour of HashJoinStream is to preserve the natural order of probe-side and build-side records without an intermediate inversion of indices, so that it is ready to produce an output batch at any moment.

This could be achieved by modifying update_hash to save the "head" of each hash chain in the map. It may also be necessary to keep tracking chain tails while building the JoinHashMap in order to keep performance at the same level. The tails could be stored in the JoinHashMap itself (probably not the best decision in terms of memory utilization), or in a separate data structure that exists only during build-side collection and is not as memory-greedy as one more integer in each HashMap tuple.
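A rough sketch of the second option (heads in the map, tails in a build-only structure) might look like the code below. Everything here is an assumption made for illustration: `FifoJoinMap`, `build` and `probe` are hypothetical names, a `std::collections::HashMap` replaces the real map, the temporary `tails` map is just one possible choice of separate structure, and key equality checks are omitted.

```rust
use std::collections::HashMap;

/// Hypothetical first-row-first variant of the join map.
struct FifoJoinMap {
    /// hash value -> (index of the FIRST build row with that hash) + 1
    map: HashMap<u64, u64>,
    /// next[i] = (index of the following build row with the same hash) + 1, or 0 at chain end
    next: Vec<u64>,
}

impl FifoJoinMap {
    fn build(hashes: &[u64]) -> Self {
        let mut map = HashMap::new();
        let mut next = vec![0u64; hashes.len()];
        // Build-only helper: hash -> current tail row. Dropped when `build` returns,
        // so the finished map carries no extra integer per entry.
        let mut tails: HashMap<u64, usize> = HashMap::new();
        for (row, &hash) in hashes.iter().enumerate() {
            match tails.insert(hash, row) {
                // Chain exists: append the new row behind the old tail.
                Some(prev_tail) => next[prev_tail] = row as u64 + 1,
                // First row for this hash: it becomes the chain head.
                None => {
                    map.insert(hash, row as u64 + 1);
                }
            }
        }
        Self { map, next }
    }

    /// Forward probe: pairs come out in natural order with no final reversal.
    fn probe(&self, probe_hashes: &[u64]) -> Vec<(usize, usize)> {
        let mut pairs = Vec::new();
        for (probe_row, hash) in probe_hashes.iter().enumerate() {
            let mut cur = self.map.get(hash).copied().unwrap_or(0);
            while cur != 0 {
                let build_row = (cur - 1) as usize;
                pairs.push((build_row, probe_row));
                cur = self.next[build_row];
            }
        }
        pairs
    }
}
```

Because `pairs` grows in output order, any prefix of it is a valid partial output batch. The extra `tails` lookup per build row is a cost of this particular sketch; whether it keeps performance at the same level would need to be measured.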

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement
