RowConverter keeps growing in size while merging streams on high-cardinality dictionary fields #7200

Closed
@JayjeetAtGithub

Description

Describe the bug

When SortPreservingMerge runs over multiple streams of record batches with a high-cardinality dictionary field as the sort key, the RowConverter instance used to merge the RowCursorStreams keeps growing in memory, because it keeps accumulating the dictionary mappings internally in its OrderPreservingInterner structure. This unbounded memory growth eventually causes DataFusion to be killed by the OOM killer.

To Reproduce

Detailed steps to reproduce this issue are given here.

Expected behavior

SortPreservingMerge on streams of record batches with high-cardinality dictionary-encoded sort keys should be memory aware and keep memory usage within a user-defined limit.

Additional context

Possible solution:

  1. Track the memory usage of the RowConverter using its size() method, which for Dictionary fields returns the size of the OrderPreservingInterner.
  2. If the size of the RowConverter exceeds a user-defined memory limit, take note of the RowCursorStreams that are still being converted, drop the converter, create a new one, and redo the aborted conversions.
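
The two steps above could look roughly like the sketch below. It is only an illustration under stated assumptions: `ToyConverter` and `BoundedConverter` are hypothetical stand-ins written for this issue, not arrow-rs or DataFusion types, and the toy interner assigns ids in insertion order rather than preserving sort order the way OrderPreservingInterner does. The point is the control flow: check `size()` after each conversion and, when over the limit, rebuild the converter and redo the conversions for batches whose cursors are still in flight.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a row converter: it interns every distinct
/// dictionary value it sees and keeps it, so `size()` only ever grows.
#[derive(Default)]
struct ToyConverter {
    interned: HashMap<String, u32>,
    bytes: usize,
}

impl ToyConverter {
    /// Convert a batch of dictionary values into "rows" (here: interner ids).
    fn convert(&mut self, batch: &[&str]) -> Vec<u32> {
        let mut rows = Vec::with_capacity(batch.len());
        for value in batch {
            let id = match self.interned.get(*value) {
                Some(&id) => id,
                None => {
                    let id = self.interned.len() as u32;
                    self.interned.insert((*value).to_string(), id);
                    // Rough accounting: value bytes plus the id.
                    self.bytes += value.len() + std::mem::size_of::<u32>();
                    id
                }
            };
            rows.push(id);
        }
        rows
    }

    /// Approximate memory held by the interner.
    fn size(&self) -> usize {
        self.bytes
    }
}

/// Hypothetical wrapper that enforces a user-defined memory limit.
struct BoundedConverter {
    inner: ToyConverter,
    limit: usize,
    resets: usize,
}

impl BoundedConverter {
    fn new(limit: usize) -> Self {
        Self { inner: ToyConverter::default(), limit, resets: 0 }
    }

    /// Convert `batch`. `in_flight` holds the source batches of cursors that
    /// are still being merged. If the converter exceeds the limit, it is
    /// replaced and the in-flight batches are converted again, because rows
    /// from the old converter are not comparable with rows from the new one.
    fn convert(
        &mut self,
        batch: &[&str],
        in_flight: &[Vec<&str>],
    ) -> (Vec<u32>, Option<Vec<Vec<u32>>>) {
        let rows = self.inner.convert(batch);
        if self.inner.size() <= self.limit {
            return (rows, None);
        }
        // Over budget: drop the converter and start from an empty interner.
        self.inner = ToyConverter::default();
        self.resets += 1;
        let redone: Vec<Vec<u32>> =
            in_flight.iter().map(|b| self.inner.convert(b)).collect();
        let rows = self.inner.convert(batch);
        (rows, Some(redone))
    }
}
```

Note that a reset only helps when the in-flight batches plus the new batch fit within the limit on their own; if they do not, the fresh converter ends up over the limit as well, so a real implementation would need a fallback for that case.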

Labels: bug