
layered cpuset support #1747


Draft: likewhatevs wants to merge 8 commits into main from layered-container-support-2

Conversation

likewhatevs (Contributor)

I need to debug this because it still has all tasks going to the lo fallback when I run:

https://github.com/likewhatevs/perfstuff/blob/main/noisy-workload.compose.yml

That being said, this does run and pass the verifier, and it has all the components needed to make this work, so lmk thoughts on the approach etc.

@likewhatevs likewhatevs marked this pull request as draft April 25, 2025 06:42
@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch from 612f186 to 64f4056 Compare April 25, 2025 13:53
@etsal etsal (Contributor) left a comment

So AFAICT the changes are:

  1. Machinery for forwarding cpuset information to the BPF side
  2. The corresponding BPF code
  3. Logic that replicates allow_node_aligned but for cpusets.

If this fixes cpuset-related problems I think it is reasonable, but I'm not sure about the naming - container enable is a bit confusing because containers aren't really a thing at this level of abstraction. Maybe replace "container" with "cpuset-based workloads"? This way it's clear what the code does concretely.

 	     !(layer->allow_node_aligned && taskc->cpus_node_aligned)) ||
+	    !(enable_container && taskc->cpus_cpuset_aligned) ||
 	    !layer->nr_cpus) {
Contributor

Wrt the bug, maybe wrap lines 1354-1355 in parentheses? && has a higher precedence than ||, so right now all tasks without cpus_cpuset_aligned are getting put into the fallback queue regardless of the value of taskc->all_cpus_allowed.
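I.e. something like this, so the cpuset escape only applies when all_cpus_allowed is false (a sketch of the intended grouping; the exact surrounding condition is assumed from the diff above):

	if ((!taskc->all_cpus_allowed &&
	     !(layer->allow_node_aligned && taskc->cpus_node_aligned) &&
	     !(enable_container && taskc->cpus_cpuset_aligned)) ||
	    !layer->nr_cpus) {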

@likewhatevs likewhatevs (Contributor Author) Apr 30, 2025

I think that was it, thanks 1000x for that.

Well, part of it - it went from exclusively using the lo fallback to largely using it.

@@ -30,6 +30,7 @@ enum consts {
 	MAX_TASKS = 131072,
 	MAX_PATH = 4096,
 	MAX_NUMA_NODES = 64,
+	MAX_CONTAINERS = 64,
@htejun htejun (Contributor) Apr 28, 2025

Can we use cpuset / CPUSET consistently instead of using both container and cpuset and document the implemented behavior in a comment?

@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch 2 times, most recently from 495915d to 33cb330 Compare May 6, 2025 15:06
@likewhatevs likewhatevs force-pushed the layered-container-support-2 branch from 33cb330 to ed4c527 Compare May 12, 2025 03:58
@likewhatevs likewhatevs marked this pull request as ready for review May 12, 2025 04:01
likewhatevs (Contributor Author) commented May 12, 2025

Got this working right (finally, lol), I think.

[screenshot]

Most of the changes are gymnastics around getting bitmasks from Rust to BPF cpumasks in a way that keeps the verifier happy w/o messing up the bitmask.

EDIT -- I also ran stress-ng w/ tasks affinitized w/ a mask not matching that of any of the containers running on the system, and those tasks went to lo fallback as they should:

[screenshot]

@likewhatevs likewhatevs changed the title layered container support layered cpuset support May 12, 2025
return -ENOMEM;

bpf_for(j, 0, MAX_CPUS/64) {
	bpf_for(cpu, 0, 63) {
Contributor

This effectively skips the last CPU.
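bpf_for's upper bound is exclusive, so covering every bit of each word would look something like this (a sketch, keeping the index math and the outer cpuset index i from the block above):

	bpf_for(j, 0, MAX_CPUS/64) {
		bpf_for(cpu, 0, 64) {
			if (cpuset_fakemasks[i][j] & (1LLU << cpu))
				bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
		}
	}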

Contributor Author

Ooh right, I think that's true on the Rust side also, thx.


bpf_for(j, 0, MAX_CPUS/64) {
	bpf_for(cpu, 0, 63) {
		if (cpu < 0 || cpu >= 64 || j < 0 || j >= (MAX_CPUS/64) || i < 0 || i >= MAX_CPUSETS) {
Contributor

Are the bounds for j necessary? This is fine if they are, but really surprising since it's the loop index and that tends to work well.

Contributor Author

Yeah, it only needs the i checks atm.

			return -1;
		}
		if (cpuset_fakemasks[i][j] & (1LLU << cpu)) {
			bpf_cpumask_set_cpu((MAX_CPUS/64 - j - 1) * 64 + cpu, cpumask);
Contributor

So AFAICT the cpumask should fit all node cpumasks? This looks like it works because we clobber vmlinux.h to hold 128 64-bit numbers, which is fine to bypass verifier behavior but I'm not sure is great to depend on.

Contributor Author

Ideally we'll replace all the unprintable trusted-ptr cpumasks with printable arena cpumasks eventually, I think, maybe. I really wish these were printable lol...

Contributor

dump_layer_cpumask() prints the trusted cpumasks. It shouldn't be too difficult to separate out a generic helper from it.

Contributor

This loop is unnecessarily convoluted. Just iterate MAX_CPUS and index cpuset_fakemasks accordingly?
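A sketch of that flattening, preserving the reversed word order the original index math uses (i is assumed to be the cpuset index, as above):

	bpf_for(cpu, 0, MAX_CPUS) {
		if (cpuset_fakemasks[i][MAX_CPUS/64 - 1 - cpu/64] & (1LLU << (cpu % 64)))
			bpf_cpumask_set_cpu(cpu, cpumask);
	}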


// pay init cost once for faster lookups later.
bpf_for(cpu, 0, nr_possible_cpus) {
	cpumask_box = bpf_map_lookup_percpu_elem(&cpuset_cpumask, &i, cpu);
Contributor

Minor nit: Could you maybe rename to cpumask_wrapper to go with the lock_wrapper we use for our spinlocks across the schedulers?

Contributor Author

Misc btw -- is that a thing for the same reason as this (i.e. verifier pointer tracking)?

Contributor

Yeah, I guess the wrapper type makes it possible to reason about what type the pointer is? Not sure though.
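For reference, the pattern is just a single-member struct around the kptr, which gives the verifier a typed map value to track (a sketch, assuming it mirrors lock_wrapper's shape):

	struct cpumask_wrapper {
		struct bpf_cpumask __kptr *mask;
	};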

likewhatevs (Contributor Author)

Still works w/ the indexes fixed, renames, etc.:

cpuset workload only:
[screenshot]

cpuset workload + random affinity workload:
[screenshot]

@htejun htejun (Contributor) left a comment

I still have trouble understanding how this is supposed to work. Node aligned is easier because DSQs are LLC aligned and LLCs are node aligned. We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?

@@ -1381,9 +1399,11 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
* without making the whole scheduler node aware and should only be used
* with open layers on non-saturated machines to avoid possible stalls.
*/
Contributor

Please update the comment to explain cpus_cpuset_aligned.

@@ -2658,7 +2678,7 @@ static void refresh_cpus_flags(struct task_ctx *taskc,

 	if (!(nodec = lookup_node_ctx(node_id)) ||
 	    !(node_cpumask = cast_mask(nodec->cpumask)))
-		return;
+		break;
Contributor

This is an scx_bpf_error() condition. There's no point in continuing.

@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
break;
}
}
if (enable_cpuset) {
Contributor

Maybe a blank line above?

@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
break;
}
}
if (enable_cpuset) {
	bpf_for(cpuset_id, 0, nr_cpusets) {
		struct cpumask_wrapper* wrapper;
Contributor

Blank line.

		}
		if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
			taskc->cpus_cpuset_aligned = true;
			return;
Contributor

break; here so that it's consistent with the node-aligned block and the function can be extended in the future? Note that this would require moving the false setting. BTW, why not use the same partial-overlap test used by the node alignment test instead of an equality test? Is that not sufficient for the forward progress guarantee? If not, it'd probably be worthwhile to explain why.
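E.g. something like this, if a subset test is in fact sufficient (a sketch; bpf_cpumask_subset() is an assumption about which kfunc the node-aligned test uses):

	if (bpf_cpumask_subset(cpumask, cast_mask(wrapper->mask))) {
		taskc->cpus_cpuset_aligned = true;
		break;
	}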

if (enable_cpuset) {
	bpf_for(i, 0, nr_cpusets) {
		cpumask = bpf_cpumask_create();

Contributor

It's also customary to not have a blank line between setting a variable and testing it. In scheduler BPF code, we've been doing if ((var = expression)) a lot, so maybe adopt that style?
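E.g., applied to the block above (a sketch, using the create/check pair from this diff):

	if (!(cpumask = bpf_cpumask_create()))
		return -ENOMEM;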


		if (!cpumask)
			return -ENOMEM;

		bpf_for(j, 0, MAX_CPUS/64) {
Contributor

Maybe add comments explaining what each block is doing?

}
}

// pay init cost once for faster lookups later.
Contributor

Why do we need per-CPU copies? Can't this be a part of cpu_ctx? Note that percpu maps have a limited number of hot caches per task and have a performance cliff beyond that.

@likewhatevs likewhatevs marked this pull request as draft May 12, 2025 18:16
likewhatevs (Contributor Author)

We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?

Will add documentation w/ the other updates. The TL;DR wrt theory of operation is that, if cpusets are being used, the onus is on whoever sets them to ensure they are LLC-aligned; otherwise perf will be affected, and the scheduler can't (or perhaps shouldn't) really fix that.
