[libc] Perform bitfield zero initialization wave-parallel #143607

Merged
merged 2 commits into llvm:main on Jun 11, 2025

Conversation

@jhuber6 (Contributor) commented Jun 10, 2025

Summary:
We need to set the bitfield memory to zero because the system does not
guarantee zeroed-out memory. Even if fresh pages are zero, the system
allows re-use, so we would need a `kfd`-level API to skip this step.

Because we can't, this patch updates the logic to perform the zero
initialization wave-parallel. This reduces the amount of time it takes
to allocate a fresh slab by up to a tenth.

This has the unfortunate side effect that the control flow is more
convoluted and we waste some extra registers, but it's worth it to
reduce the slab allocation latency.
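
As a rough illustration of the indexing that the new Slab::initialize() (shown in the diff below) relies on: each participating lane computes its rank within the uniform mask (roughly what impl::lane_count(lane_mask & uniform) provides) and then strides by the number of workers, so the lanes collectively zero every 32-bit word of the bitfield exactly once. The following host-side simulation is illustrative only, not the libc GPU code itself; kWords and the chosen lane mask are made-up values.

    // Host-side simulation of the wave-parallel zeroing loop (illustrative only).
    #include <bit>
    #include <cstdint>
    #include <vector>

    int main() {
      constexpr uint32_t kWords = 1024; // stands in for bitfield_bytes / sizeof(uint32_t)
      std::vector<uint32_t> bitfield(kWords, 0xffffffffu);

      const uint64_t uniform = 0x00000000ffffffffull; // lanes 0..31 share this slab
      const uint32_t workers = std::popcount(uniform);

      for (uint32_t lane = 0; lane < 64; ++lane) {
        if (!((uniform >> lane) & 1))
          continue; // lane is not part of this uniform group
        // Rank of this lane among the participants with lower lane IDs.
        uint32_t rank = std::popcount(uniform & ((uint64_t(1) << lane) - 1));
        for (uint32_t i = rank; i < kWords; i += workers)
          bitfield[i] = 0; // on the GPU these stores happen concurrently across lanes
      }

      for (uint32_t w : bitfield)
        if (w != 0)
          return 1; // the strided loops cover exactly what a memset would
      return 0;
    }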

@llvmbot (Member) commented Jun 10, 2025

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/143607.diff

1 file affected:

  • (modified) libc/src/__support/GPU/allocator.cpp (+45-26)
diff --git a/libc/src/__support/GPU/allocator.cpp b/libc/src/__support/GPU/allocator.cpp
index 135ced3df704c..73febad1c14d6 100644
--- a/libc/src/__support/GPU/allocator.cpp
+++ b/libc/src/__support/GPU/allocator.cpp
@@ -157,10 +157,18 @@ struct Slab {
     Header *header = reinterpret_cast<Header *>(memory);
     header->chunk_size = chunk_size;
     header->global_index = global_index;
+  }
 
-    // This memset is expensive and likely not necessary for the current 'kfd'
-    // driver. Until zeroed pages are exposed by the API we must be careful.
-    __builtin_memset(get_bitfield(), 0, bitfield_bytes(chunk_size));
+  // Set the necessary bitfield bytes to zero in parallel using many lanes. This
+  // must be called before the bitfield can be accessed safely, memory is not
+  // guaranteed to be zero initialized in the current implementation.
+  void initialize(uint64_t uniform) {
+    uint64_t lane_mask = gpu::get_lane_mask();
+    uint32_t *bitfield = get_bitfield();
+    uint32_t workers = cpp::popcount(uniform);
+    for (uint32_t i = impl::lane_count(lane_mask & uniform);
+         i < bitfield_bytes(get_chunk_size()) / sizeof(uint32_t); i += workers)
+      bitfield[i] = 0;
   }
 
   // Get the number of chunks that can theoretically fit inside this slab.
@@ -283,7 +291,7 @@ struct Slab {
 
 /// A wait-free guard around a pointer resource to be created dynamically if
 /// space is available and freed once there are no more users.
-template <typename T> struct GuardPtr {
+struct GuardPtr {
 private:
   struct RefCounter {
     // Indicates that the object is in its deallocation phase and thus invalid.
@@ -339,32 +347,25 @@ template <typename T> struct GuardPtr {
     cpp::Atomic<uint64_t> counter{0};
   };
 
-  cpp::Atomic<T *> ptr{nullptr};
+  cpp::Atomic<Slab *> ptr{nullptr};
   RefCounter ref{};
 
   // Should be called be a single lane for each different pointer.
   template <typename... Args>
-  T *try_lock_impl(uint32_t n, uint64_t &count, Args &&...args) {
-    T *expected = ptr.load(cpp::MemoryOrder::RELAXED);
+  Slab *try_lock_impl(uint32_t n, uint64_t &count, Args &&...args) {
+    Slab *expected = ptr.load(cpp::MemoryOrder::RELAXED);
     if (!expected &&
-        ptr.compare_exchange_strong(expected, reinterpret_cast<T *>(SENTINEL),
-                                    cpp::MemoryOrder::RELAXED,
-                                    cpp::MemoryOrder::RELAXED)) {
+        ptr.compare_exchange_strong(
+            expected, reinterpret_cast<Slab *>(SENTINEL),
+            cpp::MemoryOrder::RELAXED, cpp::MemoryOrder::RELAXED)) {
       count = cpp::numeric_limits<uint64_t>::max();
-      void *raw = impl::rpc_allocate(sizeof(T));
+      void *raw = impl::rpc_allocate(sizeof(Slab));
       if (!raw)
         return nullptr;
-      T *mem = new (raw) T(cpp::forward<Args>(args)...);
-
-      cpp::atomic_thread_fence(cpp::MemoryOrder::RELEASE);
-      ptr.store(mem, cpp::MemoryOrder::RELAXED);
-      cpp::atomic_thread_fence(cpp::MemoryOrder::ACQUIRE);
-      if (!ref.acquire(n, count))
-        ref.reset(n, count);
-      return mem;
+      return new (raw) Slab(cpp::forward<Args>(args)...);
     }
 
-    if (!expected || expected == reinterpret_cast<T *>(SENTINEL))
+    if (!expected || expected == reinterpret_cast<Slab *>(SENTINEL))
       return nullptr;
 
     if (!ref.acquire(n, count))
@@ -374,15 +375,25 @@ template <typename T> struct GuardPtr {
     return ptr.load(cpp::MemoryOrder::RELAXED);
   }
 
+  // Finalize the associated memory and signal that it is ready to use by
+  // resetting the counter.
+  void finalize(Slab *mem, uint32_t n, uint64_t &count) {
+    cpp::atomic_thread_fence(cpp::MemoryOrder::RELEASE);
+    ptr.store(mem, cpp::MemoryOrder::RELAXED);
+    cpp::atomic_thread_fence(cpp::MemoryOrder::ACQUIRE);
+    if (!ref.acquire(n, count))
+      ref.reset(n, count);
+  }
+
 public:
   // Attempt to lock access to the pointer, potentially creating it if empty.
   // The uniform mask represents which lanes share the same pointer. For each
   // uniform value we elect a leader to handle it on behalf of the other lanes.
   template <typename... Args>
-  T *try_lock(uint64_t lane_mask, uint64_t uniform, uint64_t &count,
-              Args &&...args) {
+  Slab *try_lock(uint64_t lane_mask, uint64_t uniform, uint64_t &count,
+                 Args &&...args) {
     count = 0;
-    T *result = nullptr;
+    Slab *result = nullptr;
     if (gpu::get_lane_id() == uint32_t(cpp::countr_zero(uniform)))
       result = try_lock_impl(cpp::popcount(uniform), count,
                              cpp::forward<Args>(args)...);
@@ -392,6 +403,14 @@ template <typename T> struct GuardPtr {
     if (!result)
       return nullptr;
 
+    // We defer storing the newly allocated slab until now so that we can use
+    // multiple lanes to initialize it and release it for use.
+    if (count == cpp::numeric_limits<uint64_t>::max()) {
+      result->initialize(uniform);
+      if (gpu::get_lane_id() == uint32_t(cpp::countr_zero(uniform)))
+        finalize(result, cpp::popcount(uniform), count);
+    }
+
     if (count != cpp::numeric_limits<uint64_t>::max())
       count = count - cpp::popcount(uniform) + impl::lane_count(uniform) + 1;
 
@@ -403,8 +422,8 @@ template <typename T> struct GuardPtr {
     cpp::atomic_thread_fence(cpp::MemoryOrder::RELEASE);
     if (gpu::get_lane_id() == uint32_t(cpp::countr_zero(mask)) &&
         ref.release(cpp::popcount(mask))) {
-      T *p = ptr.load(cpp::MemoryOrder::RELAXED);
-      p->~T();
+      Slab *p = ptr.load(cpp::MemoryOrder::RELAXED);
+      p->~Slab();
       impl::rpc_free(p);
       cpp::atomic_thread_fence(cpp::MemoryOrder::RELEASE);
       ptr.store(nullptr, cpp::MemoryOrder::RELAXED);
@@ -417,7 +436,7 @@ template <typename T> struct GuardPtr {
 };
 
 // The global array used to search for a valid slab to allocate from.
-static GuardPtr<Slab> slots[ARRAY_SIZE] = {};
+static GuardPtr slots[ARRAY_SIZE] = {};
 
 // Tries to find a slab in the table that can support the given chunk size.
 static Slab *find_slab(uint32_t chunk_size) {
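
Beyond the parallel zeroing itself, the diff above splits slab creation into two steps: try_lock_impl() now returns the freshly constructed but still unpublished Slab, and finalize() publishes the pointer only after every participating lane has run initialize(). Below is a minimal sketch of that publish-after-initialize pattern in standard C++ atomics; the names (SlabLike, make_slab, publish, try_get) are illustrative assumptions, and the real code uses relaxed operations bracketed by explicit fences plus a reference counter rather than a plain release/acquire pair.

    // Illustrative sketch of deferred publication (not the allocator's actual API).
    #include <atomic>
    #include <cstdint>

    struct SlabLike {
      uint32_t chunk_size;
      uint32_t bitfield[8];
    };

    std::atomic<SlabLike *> g_ptr{nullptr};

    // Construct the object without making it visible to other threads yet.
    SlabLike *make_slab(uint32_t chunk_size) {
      SlabLike *raw = new SlabLike{chunk_size, {}};
      // ...at this point many workers could zero raw->bitfield in parallel...
      return raw;
    }

    // Publish the pointer; the release store ensures the initialization above is
    // visible to any thread that loads g_ptr with acquire and then dereferences it.
    void publish(SlabLike *mem) { g_ptr.store(mem, std::memory_order_release); }

    SlabLike *try_get() { return g_ptr.load(std::memory_order_acquire); }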

@JonChesterfield (Collaborator) commented Jun 11, 2025

Would you mind splitting this into one patch that does the T->Slab and similar stuff and a second which directly replaces the builtin_memset? I want to reference this in a bug raised against codegen, and it would be useful to have a minimal before/after with the open-coded memset improving performance.

Raised #143741; link management is fractionally easier if you leave this PR doing what is described in the commit message and move the drive-by fixes to a different one.

@jhuber6 (Contributor, Author) commented Jun 11, 2025

Sure, I'll just precommit it and rebase.

@JonChesterfield (Collaborator) left a review comment

Nice one, thanks. Marking this as approve since changing memset lowering might take a little while.

jhuber6 added a commit that referenced this pull request Jun 11, 2025
Summary:
We don't need this to be generic, precommit for
#143607
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 11, 2025
…ounter

Summary:
We don't need this to be generic, precommit for
llvm/llvm-project#143607
@jhuber6 merged commit dc4335a into llvm:main Jun 11, 2025
13 checks passed
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
Summary:
We don't need this to be generic, precommit for
llvm#143607
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025