
layered cpuset support #1747


Draft: likewhatevs wants to merge 8 commits into main from layered-container-support-2

Conversation

likewhatevs (Contributor)

I need to debug this because it still has all tasks going to the lo fallback when I run:

https://github.com/likewhatevs/perfstuff/blob/main/noisy-workload.compose.yml

That being said, this does run and pass the verifier, and it has all the components needed to make this work, so lmk thoughts on the approach etc.

@likewhatevs likewhatevs marked this pull request as draft April 25, 2025 06:42
@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch from 612f186 to 64f4056 Compare April 25, 2025 13:53
@etsal etsal (Contributor) left a comment

So AFAICT the changes are:

  1. Machinery for forwarding cpuset information to the BPF side
  2. The corresponding BPF code
  3. Logic that replicates allow_node_aligned but for cpusets.

If this fixes cpuset-related problems I think it is reasonable, but I'm not sure about the naming - container enable is a bit confusing because containers aren't really a thing at this level of abstraction. Maybe replace "container" with "cpuset-based workloads"? This way it's clear what the code does concretely.

 	     !(layer->allow_node_aligned && taskc->cpus_node_aligned)) ||
+	    !(enable_container && taskc->cpus_cpuset_aligned) ||
 	    !layer->nr_cpus) {
Contributor

Wrt the bug, maybe wrap lines 1354-1355 in parentheses? && has a higher precedence than ||, so right now all tasks without cpus_cpuset_aligned are getting put into the fallback queue regardless of the value of taskc->all_cpus_allowed.
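I.e. something like this, so the cpuset escape only applies when all_cpus_allowed is false (a sketch of the intended grouping; the exact surrounding condition is assumed from the diff above):

	if ((!taskc->all_cpus_allowed &&
	     !(layer->allow_node_aligned && taskc->cpus_node_aligned) &&
	     !(enable_container && taskc->cpus_cpuset_aligned)) ||
	    !layer->nr_cpus) {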

@likewhatevs likewhatevs (Contributor Author) Apr 30, 2025

I think that was it, thanks 1000x for that.

Well, part of it - it went from exclusively using the lo fallback to largely using it.

@@ -30,6 +30,7 @@ enum consts {
 	MAX_TASKS = 131072,
 	MAX_PATH = 4096,
 	MAX_NUMA_NODES = 64,
+	MAX_CONTAINERS = 64,
@htejun htejun (Contributor) Apr 28, 2025

Can we use cpuset / CPUSET consistently instead of using both container and cpuset and document the implemented behavior in a comment?

@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch 2 times, most recently from 495915d to 33cb330 Compare May 6, 2025 15:06
@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch from 33cb330 to ed4c527 Compare May 12, 2025 03:58
@likewhatevs likewhatevs marked this pull request as ready for review May 12, 2025 04:01
likewhatevs (Contributor Author) commented May 12, 2025

Got this working right (finally, lol), I think.

[screenshot]

Most of the changes are gymnastics around getting bitmasks from Rust to BPF cpumasks in a way that keeps the verifier happy w/o messing up the bitmask.

EDIT -- I also ran stress-ng w/ tasks affinitized w/ a mask not matching that of any of the containers running on the system, and those tasks went to lo fallback as they should:

[screenshot]

@likewhatevs likewhatevs changed the title layered container support layered cpuset support May 12, 2025
return -ENOMEM;

bpf_for(j, 0, MAX_CPUS/64) {
	bpf_for(cpu, 0, 63) {
Contributor

This effectively skips the last CPU.
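bpf_for's upper bound is exclusive, so covering every bit of each word would look something like this (a sketch, keeping the index math and the outer cpuset index i from the block above):

	bpf_for(j, 0, MAX_CPUS/64) {
		bpf_for(cpu, 0, 64) {
			if (cpuset_fakemasks[i][j] & (1LLU << cpu))
				bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
		}
	}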

Contributor Author

Ooh right, I think that's true on the Rust side also, thx.


bpf_for(j, 0, MAX_CPUS/64) {
	bpf_for(cpu, 0, 63) {
		if (cpu < 0 || cpu >= 64 || j < 0 || j >= (MAX_CPUS/64) || i < 0 || i >= MAX_CPUSETS) {
Contributor

Are the bounds for j necessary? This is fine if they are, but really surprising since it's the loop index and that tends to work well.

Contributor Author

Yeah, it only needs the i checks atm.

			return -1;
		}
		if (cpuset_fakemasks[i][j] & (1LLU << cpu)) {
			bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
Contributor

So AFAICT the cpumask should fit all node cpumasks? This looks like it works because we clobber vmlinux.h to hold 128 64-bit numbers, which is fine to bypass verifier behavior but I'm not sure is great to depend on.

Contributor Author

Ideally we'll replace all the unprintable trusted-ptr cpumasks with printable arena cpumasks eventually, I think, maybe. I really wish these were printable lol...

Contributor

dump_layer_cpumask() prints the trusted cpumasks. It shouldn't be too difficult to separate out a generic helper from it.

Contributor

This loop is unnecessarily convoluted. Just iterate MAX_CPUS and index cpuset_fakemasks accordingly?
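A sketch of that flattening, preserving the reversed word order the original index math uses (i is assumed to be the cpuset index, as above):

	bpf_for(cpu, 0, MAX_CPUS) {
		if (cpuset_fakemasks[i][MAX_CPUS/64 - 1 - cpu/64] & (1LLU << (cpu % 64)))
			bpf_cpumask_set_cpu(cpu, cpumask);
	}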


// pay init cost once for faster lookups later.
bpf_for(cpu, 0, nr_possible_cpus) {
	cpumask_box = bpf_map_lookup_percpu_elem(&cpuset_cpumask, &i, cpu);
Contributor

Minor nit: Could you maybe rename to cpumask_wrapper to go with the lock_wrapper we use for our spinlocks across the schedulers?

Contributor Author

Misc btw -- is that a thing for the same reason as this (i.e. verifier pointer tracking)?

Contributor

Yeah, I guess the wrapper type makes it possible to reason about what type the pointer is? Not sure though.
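For reference, the pattern is just a single-member struct around the kptr, which gives the verifier a typed map value to track (a sketch, assuming it mirrors lock_wrapper's shape):

	struct cpumask_wrapper {
		struct bpf_cpumask __kptr *mask;
	};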

likewhatevs (Contributor Author)

Still works w/ the indexes fixed, renames, etc.:

cpuset workload only:
[screenshot]

cpuset workload + random affinity workload:
[screenshot]

@htejun htejun (Contributor) left a comment

I still have trouble understanding how this is supposed to work. Node aligned is easier because DSQs are LLC aligned and LLCs are node aligned. We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?

@@ -1381,9 +1399,11 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
* without making the whole scheduler node aware and should only be used
* with open layers on non-saturated machines to avoid possible stalls.
*/
Contributor

Please update the comment to explain cpus_cpuset_aligned.

@@ -2658,7 +2678,7 @@ static void refresh_cpus_flags(struct task_ctx *taskc,

 	if (!(nodec = lookup_node_ctx(node_id)) ||
 	    !(node_cpumask = cast_mask(nodec->cpumask)))
-		return;
+		break;
Contributor

This is an scx_bpf_error() condition. There's no point in continuing.

@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
break;
}
}
if (enable_cpuset) {
Contributor

Maybe a blank line above?

@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
break;
}
}
if (enable_cpuset) {
	bpf_for(cpuset_id, 0, nr_cpusets) {
		struct cpumask_wrapper* wrapper;
Contributor

Blank line.

		}
		if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
			taskc->cpus_cpuset_aligned = true;
			return;
Contributor

break; here so that it's consistent with the node-aligned block and the function can be extended in the future? Note that this would require moving the false setting. BTW, why not use the same partial-overlap test used by the node alignment test instead of an equality test? Is that not sufficient for the forward progress guarantee? If not, it'd probably be worthwhile to explain why.
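E.g. something like this, if a subset test is in fact sufficient (a sketch; bpf_cpumask_subset() is an assumption about which kfunc the node-aligned test uses):

	if (bpf_cpumask_subset(cpumask, cast_mask(wrapper->mask))) {
		taskc->cpus_cpuset_aligned = true;
		break;
	}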

if (enable_cpuset) {
	bpf_for(i, 0, nr_cpusets) {
		cpumask = bpf_cpumask_create();

Contributor

It's also customary to not have a blank line between setting a variable and testing it. In scheduler BPF code, we've been doing if ((var = expression)) a lot, so maybe adopt that style?
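E.g., applied to the block above (a sketch, using the create/check pair from this diff):

	if (!(cpumask = bpf_cpumask_create()))
		return -ENOMEM;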


		if (!cpumask)
			return -ENOMEM;

		bpf_for(j, 0, MAX_CPUS/64) {
Contributor

Maybe add comments explaining what each block is doing?

}
}

// pay init cost once for faster lookups later.
Contributor

Why do we need per-CPU copies? Can't this be a part of cpu_ctx? Note that percpu maps have a limited number of hot caches per task and have a performance cliff beyond that.

@likewhatevs likewhatevs marked this pull request as draft May 12, 2025 18:16
likewhatevs (Contributor Author)

We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?

Will add documentation w/ the other updates. The TL;DR wrt theory of operation is that, if cpusets are being used, the onus is on whoever sets them to ensure they are LLC-aligned; otherwise perf will be affected, and the scheduler can't (or perhaps shouldn't) really fix that.
