
[spike] evaluate + prototype interaction of unified memory abstraction with custom_ops #1556


Closed · 3 tasks

Titus-von-Koeller opened this issue Mar 5, 2025 · 2 comments

Labels: Optimizers (Issues or feature requests relating to optimizers), Spike

@Titus-von-Koeller
Collaborator

Unified memory isn't supported in PyTorch and was considered a potential blocker for the custom ops refactor.

We found a workaround at the time, along with a simple proof of viability.

However, it's not clear how this fits together with the currently open PR #1544 and RFC #1545; this needs to be fleshed out.

Questions:

  • Are the needed changes to the code base deeply rooted or relatively superficial?
  • Is it impactful to work on this right now, or should we focus on finalizing the non-optimizer-related custom_ops first?
  • Is it maybe straightforward to implement this already while prototyping? If yes, we can open a PR right away or make it part of the open PR.
@Titus-von-Koeller Titus-von-Koeller self-assigned this Mar 5, 2025
@Titus-von-Koeller Titus-von-Koeller added the Optimizers (Issues or feature requests relating to optimizers) and Spike labels Mar 5, 2025
@Titus-von-Koeller
Collaborator Author

  1. Key Requirements:
  • Create tensors using CUDA unified memory that appear as CUDA tensors
  • Maintain existing paged optimizer functionality
  • Integrate with PyTorch's device-based dispatch
  • Preserve memory management through existing cget_managed_ptr/cprefetch
  2. Code Analysis:
  • Existing paged tensors are created by get_paged() in functional.py (203-211) with a CPU device (condensed sketch below)
  • The C interface in cextension.py handles library loading
  • CUDA memory management lives in pythonInterface.cpp via cget_managed_ptr
  • The current implementation creates CPU tensors backed by managed memory
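For reference, a condensed sketch of that current CPU-device path (paraphrased from functional.py; exact details may differ slightly). The raw pointer from lib.cget_managed_ptr is wrapped through NumPy, so the resulting tensor reports device=cpu even though the allocation is managed memory:

import ctypes as ct
from math import prod

import numpy as np
import torch

from bitsandbytes.cextension import lib  # existing ctypes binding to the native library

lib.cget_managed_ptr.restype = ct.c_void_p  # pointer-sized return value

def get_paged_current(*shape, dtype=torch.float32, device=torch.device("cuda", index=0)):
    # Allocate CUDA unified (managed) memory through the existing C entry point.
    num_bytes = dtype.itemsize * prod(shape)
    cuda_ptr = lib.cget_managed_ptr(ct.c_size_t(num_bytes))

    # Wrap the raw pointer via NumPy / torch.frombuffer -- this yields a CPU tensor,
    # which is exactly the dispatch problem described above.
    c_ptr = ct.cast(cuda_ptr, ct.POINTER(ct.c_int))
    arr = np.ctypeslib.as_array(c_ptr, shape=shape)
    out = torch.frombuffer(arr, dtype=dtype, count=prod(shape)).view(shape)

    out.is_paged = True
    out.page_deviceid = device.index
    return out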
  3. Critical Insight from alband:
  • Use torch.from_blob with CUDA device specification to create "CUDA" tensors with unified memory
  • This satisfies PyTorch's dispatch requirements while using managed memory
  4. Implementation Strategy:
# Revised get_paged in functional.py
def get_paged(*shape, dtype=torch.float32, device=torch.device("cuda", index=0)):
    num_bytes = dtype.itemsize * prod(shape)
    # Use the C++ extension instead of direct lib.cget_managed_ptr calls
    tensor = cpp_extension.get_managed_tensor(
        num_bytes,
        list(shape),
        dtype,
    )
    tensor.is_paged = True                # keep the flag existing optimizer logic checks
    tensor.page_deviceid = device.index   # keep multi-GPU bookkeeping
    return tensor
// Enhanced C++ implementation in pythonInterface.cpp
torch::Tensor get_managed_tensor(
    size_t nb_bytes,
    c10::IntArrayRef sizes,
    c10::ScalarType dtype
) {
    // Allocate CUDA unified (managed) memory, initially attached to the host.
    void* cuda_ptr;
    CUDA_CHECK(cudaMallocManaged(&cuda_ptr, nb_bytes, cudaMemAttachHost));

    auto options = torch::TensorOptions()
        .device(torch::kCUDA)  // Critical: report a CUDA device so dispatch selects CUDA kernels
        .dtype(dtype)
        .requires_grad(false);

    // Wrap the raw pointer without copying; the deleter frees the managed allocation.
    return torch::from_blob(
        cuda_ptr,
        sizes,
        [](void* ptr) { cudaFree(ptr); },
        options
    );
}
  5. Integration Points:
  • Replace the existing lib.cget_managed_ptr calls with the new C++ extension (loading sketch below)
  • Maintain the is_paged flag for backward compatibility
  • Preserve the prefetch logic through the existing cprefetch calls
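One possible way to wire this in on the Python side, as a minimal sketch only: the module name, source path, and exposed get_managed_tensor binding are assumptions based on the snippet above, not existing bitsandbytes API (the .cpp file would also need a PYBIND11_MODULE block exposing get_managed_tensor, or it could be built with load_inline(..., functions=[...]) as in the follow-up comment):

import torch
from torch.utils.cpp_extension import load

# Hypothetical wiring: compile the C++ sketch above into a loadable torch extension.
# The name and source path are placeholders, not the existing bitsandbytes layout.
cpp_extension = load(
    name="bnb_managed_tensor",
    sources=["csrc/managed_tensor.cpp"],  # would contain get_managed_tensor() plus its pybind binding
    with_cuda=True,
)

# The revised get_paged() above then calls this instead of lib.cget_managed_ptr:
t = cpp_extension.get_managed_tensor(1024 * 4, [1024], torch.float32)
print(t.device)  # expected: cuda:0, while the storage lives in unified (managed) memory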
  6. Dispatch Mechanism:
# The tensor reports device=cuda
t = get_paged(1024, 1024, dtype=torch.float32)
print(t.device)  # Output: cuda:0

# Operations dispatch to CUDA kernels;
# unified memory handles page migration automatically
y = t @ t.T  # Dispatches to CUDA GEMM
  7. Memory Management:
  • Retain existing page_mng.paged_tensors tracking
  • Keep prefetch logic in optimizers' prefetch_state()
  • Unified memory allows access from both CPU/GPU with automatic migration
  8. Backward Compatibility:
  • Maintain is_paged flag for existing optimizer logic
  • Keep page_deviceid for multi-GPU support
  • Preserve block-wise quantization logic that uses paged tensors
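To make the prefetch and is_paged points above concrete, here is a hedged sketch of how the existing paged path uses them (paraphrased from the current functional.py / optimizer logic; names and call sites may differ slightly):

import ctypes as ct
import torch

from bitsandbytes.cextension import lib  # existing ctypes binding

def prefetch_tensor(A: torch.Tensor, to_cpu: bool = False):
    # Only paged (managed-memory) tensors can be prefetched; regular tensors are left alone.
    assert getattr(A, "is_paged", False), "only paged tensors can be prefetched"
    # -1 is CUDA's cudaCpuDeviceId; otherwise prefetch to the GPU the tensor is paged on.
    deviceid = -1 if to_cpu else A.page_deviceid
    num_bytes = A.dtype.itemsize * A.numel()
    lib.cprefetch(ct.c_void_p(A.data_ptr()), ct.c_size_t(num_bytes), ct.c_int32(deviceid))

# In the optimizers' prefetch_state(), state tensors flagged as paged would be prefetched
# onto the GPU before the update step, roughly:
#
#     if getattr(state["state1"], "is_paged", False):
#         prefetch_tensor(state["state1"])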

This approach satisfies all requirements while maintaining compatibility with existing optimizer infrastructure. The key innovation is creating CUDA device tensors that transparently use unified memory, enabling proper dispatch while retaining paged memory benefits.

@Titus-von-Koeller
Collaborator Author

We closed this line of investigation because the proposed solution:

import torch
from torch.utils.cpp_extension import load_inline


dontlook = """
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>


at::Tensor getManagedTensor(size_t nb_bytes, c10::IntArrayRef sizes) {

    float* cuda_ptr;
    cudaMallocManaged(&cuda_ptr, nb_bytes);

    at::Tensor tensor = at::for_blob((void*)cuda_ptr, sizes)
             .deleter([](void *ptr) {
               cudaFree(ptr);
             })
             .options(at::device(at::kCUDA).dtype(at::kFloat))
             .make_tensor();

    return tensor;
}
"""

# load_inline compiles the C++ above as a torch extension and auto-generates the Python binding
mod = load_inline("mod", cpp_sources=dontlook, with_cuda=True, functions=["getManagedTensor"])


size = 10
t = mod.getManagedTensor(size * 4, (size,))  # size * 4 bytes for a float32 tensor of shape (size,)
t.fill_(0)  # the tensor reports a CUDA device while its storage is unified memory

print(t)

del t
print("Done!")

has a dependency on the torch C++ library, which we don't want to depend on, and because so far no implementation other than AMD actually wants to use unified memory, while AMD is dispatched under the "cuda" dispatch key anyway.
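
For context on the AMD point: ROCm builds of PyTorch expose HIP devices under the "cuda" device type, so device-based dispatch already lands on the same key. A minimal illustration (not bitsandbytes code):

import torch

# On a ROCm build of PyTorch, AMD GPUs are reported under the "cuda" device type,
# so device-based dispatch treats them the same way as NVIDIA GPUs.
if torch.cuda.is_available():
    t = torch.ones(4, device="cuda")
    print(t.device.type)      # "cuda" on both CUDA and ROCm builds
    print(torch.version.hip)  # a version string on ROCm builds, None on CUDA builds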

Therefore, it doesn't make sense to investigate this further.
