layered cpuset support #1747
base: main
Conversation
Force-pushed 612f186 to 64f4056.
So AFAICT the changes are:
- Machinery for forwarding cpuset information to the BPF side
- The corresponding BPF code
- Logic that replicates allow_node_aligned but for cpusets.
If this fixes cpuset-related problems I think it is reasonable, but I'm not sure about the naming: "container enable" is a bit confusing because containers aren't really a thing at this level of abstraction. Maybe replace "container" with "cpuset-based workloads"? That way it's clear what the code does concretely.
     !(layer->allow_node_aligned && taskc->cpus_node_aligned)) ||
    !layer->nr_cpus) {
     !(layer->allow_node_aligned && taskc->cpus_node_aligned)) ||
     !(enable_container && taskc->cpus_cpuset_aligned) ||
Wrt the bug, maybe wrap lines 1354-1355 in parentheses? && has a higher precedence than ||, so right now all tasks without cpus_cpuset_aligned are getting put into the fallback queue regardless of the value of taskc->all_cpus_allowed.
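For reference, a minimal sketch of the suggested parenthesization, assuming the surrounding condition matches the hunk quoted above; how the node-aligned and cpuset-aligned exemptions should combine (|| here) is an assumption, not something the PR states:

/* Group both alignment exemptions under the !all_cpus_allowed guard so the
 * cpuset check no longer escapes it via operator precedence. */
if ((!taskc->all_cpus_allowed &&
     !((layer->allow_node_aligned && taskc->cpus_node_aligned) ||
       (enable_container && taskc->cpus_cpuset_aligned))) ||
    !layer->nr_cpus) {
	/* ... existing lo fallback path ... */
}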
Think that was it, thanks 1000x for that.
Well, part of it: it went from exclusively using the lo fallback to largely using it.
@@ -30,6 +30,7 @@ enum consts {
	MAX_TASKS = 131072,
	MAX_PATH = 4096,
	MAX_NUMA_NODES = 64,
	MAX_CONTAINERS = 64,
Can we use cpuset / CPUSET consistently instead of using both container and cpuset, and document the implemented behavior in a comment?
Force-pushed 495915d to 33cb330.
Force-pushed 33cb330 to ed4c527.
		return -ENOMEM;

	bpf_for(j, 0, MAX_CPUS/64) {
		bpf_for(cpu, 0, 63) {
This effectively skips the last CPU of each 64-bit word, since bpf_for's upper bound is exclusive.
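A minimal sketch of the fix: bpf_for(i, start, end) iterates start..end-1, so the inner bound needs to be 64 for bit 63 to be visited (body taken from the hunks quoted in this review):

bpf_for(j, 0, MAX_CPUS/64) {
	/* upper bound is exclusive, so 64 covers bits 0..63 of each word */
	bpf_for(cpu, 0, 64) {
		if (cpuset_fakemasks[i][j] & (1LLU << cpu))
			bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
	}
}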
Ooh right, I think that's true on the Rust side also, thx.
	bpf_for(j, 0, MAX_CPUS/64) {
		bpf_for(cpu, 0, 63) {
			if (cpu < 0 || cpu >= 64 || j < 0 || j >= (MAX_CPUS/64) || i < 0 || i >= MAX_CPUSETS) {
Are the bounds for j necessary? This is fine if they are, but really surprising since it's the loop index and that tends to work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it only needs the i checks atm.
				return -1;
			}
			if (cpuset_fakemasks[i][j] & (1LLU << cpu)) {
				bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
So AFAICT the cpumask should fit all node cpumasks? This looks like it works because we clobber vmlinux.h to hold 128 64-bit numbers, which is fine for bypassing verifier behavior, but I'm not sure it's great to depend on.
Ideally we'll rip out all the unprintable trusted-ptr cpumasks and replace them with printable arena cpumasks eventually, I think, maybe. I really wish these were printable lol...
dump_layer_cpumask() prints the trusted cpumasks. It shouldn't be too difficult to separate out a generic helper from it.
This loop is unnecessarily convoluted. Just iterate MAX_CPUS and index cpuset_fakemasks accordingly?
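A minimal sketch of that flattening; the reversed word order (word 0 holding the highest CPUs) is carried over from the existing index math and is an assumption about how the Rust side packs cpuset_fakemasks:

bpf_for(cpu, 0, MAX_CPUS) {
	u32 word = MAX_CPUS/64 - 1 - cpu/64;

	/* keep the index checks the verifier needs */
	if (i < 0 || i >= MAX_CPUSETS || word >= MAX_CPUS/64)
		break;
	if (cpuset_fakemasks[i][word] & (1LLU << (cpu % 64)))
		bpf_cpumask_set_cpu(cpu, cpumask);
}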
	// pay init cost once for faster lookups later.
	bpf_for(cpu, 0, nr_possible_cpus) {
		cpumask_box = bpf_map_lookup_percpu_elem(&cpuset_cpumask, &i, cpu);
Minor nit: Could you maybe rename to cpumask_wrapper to go with the lock_wrapper we use for our spinlocks across the schedulers?
Misc btw: is that a thing for the same reason as this (i.e., verifier pointer tracking)?
Yeah, I guess wrapping the type makes it possible to reason about what type the pointer is? Not sure though.
I still have trouble understanding how this is supposed to work. Node aligned is easier because DSQs are LLC aligned and LLCs are node aligned. We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?
@@ -1381,9 +1399,11 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
	 * without making the whole scheduler node aware and should only be used
	 * with open layers on non-saturated machines to avoid possible stalls.
	 */
Please update the comment to explain cpus_cpuset_aligned.
@@ -2658,7 +2678,7 @@ static void refresh_cpus_flags(struct task_ctx *taskc,

		if (!(nodec = lookup_node_ctx(node_id)) ||
		    !(node_cpumask = cast_mask(nodec->cpumask)))
			return;
			break;
This is an scx_bpf_error() condition. There's no point in continuing.
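A minimal sketch of what that would look like in refresh_cpus_flags(); the message text is illustrative, and lookup_node_ctx() may already report the error internally:

if (!(nodec = lookup_node_ctx(node_id)) ||
    !(node_cpumask = cast_mask(nodec->cpumask))) {
	scx_bpf_error("failed to look up cpumask for node %d", node_id);
	return;
}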
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
Maybe a blank line above?
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
		bpf_for(cpuset_id, 0, nr_cpusets) {
			struct cpumask_wrapper* wrapper;
Blank line.
			}
			if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
				taskc->cpus_cpuset_aligned = true;
				return;
break; here so that it's consistent with the node-aligned block and the function can be extended in the future? Note that this would require moving the false setting. BTW, why not use the same partial-overlap test the node alignment check uses instead of an equality test? Is that not sufficient for the forward-progress guarantee? If not, it'd probably be worthwhile to explain why.
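A minimal sketch of the restructuring being asked for: default to false up front and break on a match, mirroring the node-aligned block. The lookup_cpuset_cpumask() helper is hypothetical shorthand for the map lookup the PR does:

taskc->cpus_cpuset_aligned = false;
bpf_for(cpuset_id, 0, nr_cpusets) {
	struct cpumask_wrapper *wrapper;

	if (!(wrapper = lookup_cpuset_cpumask(cpuset_id)))
		break;
	if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
		taskc->cpus_cpuset_aligned = true;
		break;
	}
}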
	if (enable_cpuset) {
		bpf_for(i, 0, nr_cpusets) {
			cpumask = bpf_cpumask_create();
It's also customary not to have a blank line between a variable assignment and the test on it. In scheduler BPF code, we've been doing if ((var = expression)) a lot, so maybe adopt that style?
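Applied to the allocation in the hunk above, the suggested style would read roughly:

if (!(cpumask = bpf_cpumask_create()))
	return -ENOMEM;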
	if (!cpumask)
		return -ENOMEM;

	bpf_for(j, 0, MAX_CPUS/64) {
Maybe add comments explaining what each block is doing?
		}
	}

	// pay init cost once for faster lookups later.
Why do we need per-CPU copies? Can't this be a part of cpu_ctx? Note that percpu maps have a limited number of hot caches per task and have a performance cliff beyond that.
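For reference, a minimal sketch of the non-per-CPU alternative: one shared kptr per cpuset in a plain array map, looked up by cpuset id only. The wrapper struct and map name mirror the ones visible in the hunks above; whether this should instead live in cpu_ctx is left open:

struct cpumask_wrapper {
	struct bpf_cpumask __kptr *mask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_CPUSETS);
	__type(key, u32);
	__type(value, struct cpumask_wrapper);
} cpuset_cpumask SEC(".maps");

static struct cpumask_wrapper *lookup_cpuset_cpumask(u32 cpuset_id)
{
	/* single shared copy; no per-CPU dimension to initialize or keep hot */
	return bpf_map_lookup_elem(&cpuset_cpumask, &cpuset_id);
}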
Will add documentation w/ the other updates. The TL;DR wrt the theory of operation is that, if cpusets are being used, the onus is on whoever is setting them to ensure they are LLC aligned, or perf will be affected, and the scheduler can't (or perhaps shouldn't) really fix that.
I need to debug this because it still has all tasks going to lo fallback when I run:
https://github.com/likewhatevs/perfstuff/blob/main/noisy-workload.compose.yml
That being said, this does run/pass the verifier and has all the components needed to make this work, so lmk thoughts on the approach etc.