Skip to content

Memory Limited GroupBy (Externalized / Spill) #1570

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Grouping "arbitrarily" large inputs (e.g. when the group by hash table doesn't fit in RAM or within some user defined budget)

This typically happens when there are a "large" number of distinct groups. For example, with queries like

-- Find the top ten users by event count
SELECT user_id, count(events) 
FROM users 
GROUP BY user_id 
ORDER BY count(events) DESC
limit 10

When there are large number of groups

This ticket concerns the memory used the HashAggregateExec operator -- it doesn't cover other potential targets (e.g. externalized sort or join). That will be covered by other tasks tracked by #587

Describe the solution you'd like

  1. Allow DataFusion users to specify a RAM budget (aka via the config introduced in Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526) and have their queries complete running without the group by exceeding the budget allocated to it via the memory manager.

For the HashAggregateExec operator, I think the best behavior would be:

  1. Use an in memory hash table (as is done today in HashAggregateExec ), if the memory budget allows
  2. If all the input does not fit in the RAM budget, the hash table and partial aggregates are sorted by group key and written to temporary disk files
  3. Temporary disk files are read / merged to produce the final results

Hopefully after #1568 is complete we'll have an efficient N-way merge that can be reused.

Some ideas of how to break this task down

Describe alternatives you've considered
TBD

Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions