Implement intermediate result blocked approach to aggregation memory management #15591

Conversation
Hi @Rachelint I think I have an alternative proposal that seems relatively easy to implement.
Really thanks. The design in this PR indeed still introduces quite a few code changes... I tried not to modify anything about
But I found this way would introduce too much extra cost... Maybe we place the
I have finished development (and testing) of all the needed common structs!
It is very close; I just need to add more tests!
👏 👏 -- this PR is amazing @Rachelint -- it looks really nice and the code is very well structured and easy to read and understand (which is non trivial, given that this is some of the most complicated and performance sensitive parts of the code). I very much enjoyed reading this PR
I will run some benchmarks shortly, but from my perspective, if this shows a performance improvement we could merge this PR. I have a few suggestions, but I think they can all be done in a follow-on PR.
My largest concern with this PR is that it makes aggregation, one of the most complex parts of the code, even more complex, as there is now another potential code path through all the aggregates.
I think a new codepath is ok if it is temporary while we port the other GroupsAccumulators / GroupValues over to use blocked management, after which we remove the non-blocked version. However, if there is some reason we can't port the other accumulators over to use blocked management, I think we should reconsider.
I would be more than happy to help organize the effort (aka file tickets!) to port the remaining GroupsAccumulators over and remove the non-blocked version.
Now that I review the code, it seems like we have the same basic complexity creeping in via supports_convert_to_state 🤔 -- maybe we should also work on removing that (force all groups accumulators to implement convert_to_state) -- which would also likely improve performance -- I can work on filing tickets for that too
@@ -5,3 +5,4 @@ SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"), AVG("ParamPric
SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin, MEDIAN("ResponseStartTiming") tmed, MAX("ResponseStartTiming") tmax FROM hits WHERE "JavaEnable" = 0 GROUP BY "ClientIP", "WatchID" HAVING c > 1 ORDER BY tmed DESC LIMIT 10;
SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, MAX("ResponseStartTiming") tmax FROM 'hits' WHERE "JavaEnable" = 0 GROUP BY "ClientIP", "WatchID" HAVING c > 1 ORDER BY tp95 DESC LIMIT 10;
SELECT COUNT(*) AS ShareCount FROM hits WHERE "IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%' AND "SocialAction" = 'share' AND "SocialSourceNetworkID" IN (5, 12) AND "ClientTimeZone" BETWEEN -5 AND 5 AND regexp_match("Referer", '\/campaign\/(spring|summer)_promo') IS NOT NULL AND CASE WHEN split_part(split_part("URL", 'resolution=', 2), '&', 1) ~ '^\d+$' THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT ELSE 0 END > 1920 AND levenshtein(CAST("UTMSource" AS STRING), CAST("UTMCampaign" AS STRING)) < 3;
SELECT "WatchID", MIN("ResolutionWidth"), MAX("ResolutionWidth"), SUM("IsRefresh") FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10;
If we are going to add a new query to the extended benchmarks, can we please also document the query here?
https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries
Addressed with #15936
datafusion/common/src/config.rs
@@ -405,6 +405,18 @@ config_namespace! {
/// in joins can reduce memory usage when joining large
/// tables with a highly-selective join filter, but is also slightly slower.
pub enforce_batch_size_in_joins: bool, default = false
Do you expect users will ever disable this feature? Or does this setting exist as an "escape" valve in case we hit a problem with the new behavior and want to go back?
Yes, it is just a way to go back, plus for testing.
}

impl EmitTo {
    /// Remove and return `needed values` from `values`.
    pub fn take_needed<T>(
Can you document what the `is_blocked_groups` parameter means too?
Addressed
/// Set the block size to `blk_size`; the attempt will only succeed
/// when the accumulator supports blocked mode.
///
/// NOTICE: After altering the block size, all previous data will be cleared.
Does this mean that all existing accumulators will be cleared?
Yes, it will usually be used in the load back + merge step of spilling:
- Emit the rest of the blocks first
- Clear all stale data, switch to flat mode, and perform sorted aggregation
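The emit-then-switch sequence described here could be sketched roughly as below. All names are hypothetical and illustrative only, not DataFusion's actual API:

```rust
// Illustrative-only sketch of the spill sequence described above:
// emit the remaining blocked data, then clear state and fall back to a
// single flat Vec for the sorted-aggregation pass over spilled data.
enum GroupStorage {
    // Blocked mode: values live in multiple fixed-size blocks.
    Blocked(Vec<Vec<u64>>),
    // Flat mode: a single resizable Vec, used for sorted aggregation
    // after spilled data is loaded back.
    Flat(Vec<u64>),
}

impl GroupStorage {
    /// Step 1: emit the remaining blocks before switching modes.
    fn emit_all_blocks(&mut self) -> Vec<Vec<u64>> {
        match self {
            GroupStorage::Blocked(blocks) => std::mem::take(blocks),
            GroupStorage::Flat(values) => vec![std::mem::take(values)],
        }
    }

    /// Step 2: clear any stale data and switch to flat mode.
    fn switch_to_flat(&mut self) {
        *self = GroupStorage::Flat(Vec::new());
    }
}
```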
@@ -51,6 +51,7 @@ datafusion-common = { workspace = true, default-features = true }
datafusion-common-runtime = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate-common = { workspace = true }
I think this new dependency is ok, as it doesn't use any specific aggregate implementation, which is what we are trying to avoid.
/// `group values` will be stored in multiple `Vec`s, and each
/// `Vec` is of `blk_size` len; we call each one a `block`
///
block_size: Option<usize>,
See my comment above about making a `Blocks` struct, which I think would avoid some non-trivial duplication.
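One possible shape for such a shared `Blocks` container is sketched below. This is purely illustrative (not the actual DataFusion type): fixed-size blocks held in a `VecDeque`, so the push/emit logic lives in one place and emitting the oldest block is a cheap `pop_front` rather than a copy:

```rust
use std::collections::VecDeque;

/// Illustrative sketch of a shared `Blocks` container: fixed-size blocks
/// stored in a VecDeque, so that emitting the first (oldest) block is
/// O(1) and never copies the remaining data.
struct Blocks<T> {
    inner: VecDeque<Vec<T>>,
    block_size: usize,
}

impl<T> Blocks<T> {
    fn new(block_size: usize) -> Self {
        Self { inner: VecDeque::new(), block_size }
    }

    /// Append a value, allocating a new block when the current one is full.
    fn push(&mut self, value: T) {
        let need_new = self
            .inner
            .back()
            .map_or(true, |b| b.len() == self.block_size);
        if need_new {
            self.inner.push_back(Vec::with_capacity(self.block_size));
        }
        self.inner.back_mut().unwrap().push(value);
    }

    /// Emit the oldest block without touching the others.
    fn emit_first(&mut self) -> Option<Vec<T>> {
        self.inner.pop_front()
    }
}
```

Both `GroupValues` and `GroupsAccumulator` implementations could then delegate their block bookkeeping to one such type instead of duplicating it.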
@@ -982,6 +1099,9 @@ impl GroupedHashAggregateStream {
    && self.update_memory_reservation().is_err()
{
    assert_ne!(self.mode, AggregateMode::Partial);
    // TODO: support spilling when blocked group optimization is on
We may want to file a ticket to track this -- but I think in general figuring out how to handle spilling for hashing in a better way is worth considering, so maybe this particular task would become irrelevant.
}
}

pub(crate) trait Block {
I think it would be good to add some comments here describing what the trait is for
/// If `row_x group_index_x` is not filtered (`group_index_x` is seen),
/// `seen_values[group_index_x]` will be set to `true`.
///
/// For `set_bit(block_id, block_offset, value)`, `block_id` is unused,
Something I didn't see documented anywhere was what `block_id` and `block_offset` meant -- maybe we could add something on the HashAggregateStream 🤔
Addressed. I added related comments in `GroupsAccumulator::supports_blocked_groups` and `GroupValues::supports_blocked_groups`, and also linked them in `HashAggregateStream`.
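For readers following the thread: a common way to represent such an index (sketched here as an assumption, not necessarily DataFusion's exact encoding) is to pack `block_id` into the high bits and `block_offset` into the low bits of a single integer, so that flat mode is just the special case where `block_id` is always 0:

```rust
// Hypothetical packing scheme for a blocked group index; the real
// DataFusion encoding may use different bit widths.
const BLOCK_OFFSET_BITS: u32 = 32;
const BLOCK_OFFSET_MASK: u64 = (1u64 << BLOCK_OFFSET_BITS) - 1;

/// Combine a block id and an offset within that block into one index.
fn pack(block_id: u64, block_offset: u64) -> u64 {
    (block_id << BLOCK_OFFSET_BITS) | block_offset
}

/// Which block the group lives in (always 0 in flat mode).
fn block_id(packed: u64) -> u64 {
    packed >> BLOCK_OFFSET_BITS
}

/// Position of the group within its block.
fn block_offset(packed: u64) -> u64 {
    packed & BLOCK_OFFSET_MASK
}
```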
🤖: Benchmark completed
How is the benchmark triggered, and can we run clickbench extended too? Upd: I didn't find an improvement for the extended queries locally.
I am using some scripts in https://github.com/alamb/datafusion-benchmarking on a GCP machine (I haven't figured out how we would do this for the community in general). I didn't run the extended tests because they ran into the following issue (which is how I found it initially). Now that they are fixed, I think I can run the extended tests and will kick them off.
(I am surprised this PR didn't yield better results; I am rerunning now to see if the results are reproducible.)
🤖
I am back from holiday, and will continue to work on this today. For simplicity in the initial PR, I just:
And actually I still don't have a simple query that can show the improvement now... To show the improvement, I added a new query:
SELECT "WatchID", MIN("ResolutionWidth"), MAX("ResolutionWidth"), SUM("IsRefresh") FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10;
Also for the newly added one?
🤖: Benchmark completed
Emmm... As expected, the newly added query is not covered. I think I should submit a new PR adding the query:
SELECT "WatchID", MIN("ResolutionWidth"), MAX("ResolutionWidth"), SUM("IsRefresh") FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10;
Welcome back!
Makes sense
What would be required to improve the performance for one or more of the real clickbench queries? Implementing group management for other data types? Given we have past evidence this approach will work, I think we could merge it before the queries really speed up. My big concern is that we have a plan to eventually avoid maintaining both single and multi blocked management.
🤖
Yes. But we need to implement group management for multiple data types. I think it may be too complex for the initial PR...
The main hard point for removing single management is about
I try to reuse code as much as possible to ease this problem (like
Actually I encountered the same problem in #12996 ... I think maybe it is a common problem about how to support
I suspect if we can remove
🤖: Benchmark completed
@alamb I have submitted a PR adding the new query for this PR: #15936. After merging #15936, the benchmark result on my local machine (as expected, Q7 gets faster):
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?

Rationale for this change

As mentioned in #7065, we use a single `Vec` to manage `aggregation intermediate results` both in `GroupAccumulator` and `GroupValues`. It is simple but not efficient enough in high-cardinality aggregation, because when the `Vec` is not large enough, we need to allocate a new `Vec` and copy all data from the old one.

So this PR introduces a `blocked approach` to manage the `aggregation intermediate results`. We never resize the `Vec` in this approach; instead, we split the data into blocks, and when the capacity is not enough, we just allocate a new block. Details can be seen in #7065.

What changes are included in this PR?

- `PrimitiveGroupsAccumulator` and `GroupValuesPrimitive` as the example

Are these changes tested?

Tested by existing tests, plus new unit tests and new fuzz tests.

Are there any user-facing changes?

Two functions are added to the `GroupValues` and `GroupAccumulator` traits. But as you can see, there are default implementations for them, and users can choose to actually support the blocked approach when they want better performance for their `udaf`s.
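The core idea in the rationale above can be sketched as follows. This is a minimal illustration, not DataFusion's actual code: the flat approach must reallocate and copy everything when it outgrows its capacity, while the blocked approach only ever allocates one new fixed-size block:

```rust
// Minimal illustration of flat vs blocked intermediate-result storage.
const BLOCK_SIZE: usize = 4;

fn push_flat(values: &mut Vec<u64>, v: u64) {
    // When len == capacity, Vec reallocates and copies all existing data.
    values.push(v);
}

fn push_blocked(blocks: &mut Vec<Vec<u64>>, v: u64) {
    if blocks.last().map_or(true, |b| b.len() == BLOCK_SIZE) {
        // Capacity exhausted: allocate a fresh block, copy nothing.
        blocks.push(Vec::with_capacity(BLOCK_SIZE));
    }
    blocks.last_mut().unwrap().push(v);
}

/// Read back a value by its global index across all blocks.
fn get_blocked(blocks: &[Vec<u64>], index: usize) -> u64 {
    blocks[index / BLOCK_SIZE][index % BLOCK_SIZE]
}
```

Since blocks are never moved once allocated, each block can also be emitted independently, which is what the new `EmitTo` handling in the PR builds on.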