Closed
Description
Describe the bug
On executing SortPreservingMerge
on multiple streams of record batches and using a high-cardinality dictionary field as the sort key, the RowConverter
instance used to merge the multiple RowCursorStreams
keeps growing in memory (as it keeps accumulating the dict mappings internally in the OrderPreservingInterner
structure). This unbounded memory growth eventually causes data fusion to get killed by the OOM killer.
To Reproduce
Detailed steps to reproduce this issue is given here.
Expected behavior
SortPreservingMerge
on streams of record batches with high-cardinality dictionary-encoded sort keys should be memory aware and keep memory usage within a user-defined limit.
Additional context
Possible solution:
- Keep track of the memory usage for the
RowConverter
using thesize()
method which in the case ofDictionary
fields returns the size of theOrderPreservingInterner
. - If the size of the
RowConverter
grows more than a user-defined memory limit, take note of theRowCursorStream
that are still getting converted, delete the converter, create a new one, and re-do the aborted conversions.