
[spike] evaluate + prototype interaction of unified memory abstraction with custom_ops #1556


Closed · 3 tasks

Titus-von-Koeller opened this issue Mar 5, 2025 · 2 comments

Labels: Optimizers (Issues or feature requests relating to optimizers), Spike

@Titus-von-Koeller
Collaborator

Unified memory isn't supported in PyTorch and was considered a potential blocker for the custom ops refactor.

We found a workaround at the time, along with a simple proof of viability.

However, it's not clear how this fits together with the currently open PR #1544 and RFC #1545; this needs to be fleshed out.

Questions:

  • Are the needed changes to the code base deeply rooted or relatively superficial?
  • Is it impactful to work on this right now, or should we focus on finalizing the non-optimizer-related custom_ops first?
  • Is it maybe straightforward to implement this already while prototyping? If yes, we can open a PR right away or make it part of the open PR.
@Titus-von-Koeller Titus-von-Koeller self-assigned this Mar 5, 2025
@Titus-von-Koeller Titus-von-Koeller added the Optimizers (Issues or feature requests relating to optimizers) and Spike labels Mar 5, 2025
@Titus-von-Koeller
Collaborator Author

  1. Key Requirements:
  • Create tensors using CUDA unified memory that appear as CUDA tensors
  • Maintain existing paged optimizer functionality
  • Integrate with PyTorch's device-based dispatch
  • Preserve memory management through existing cget_managed_ptr/cprefetch
  2. Code Analysis:
  • Existing paged tensors are created by get_paged() in functional.py (203-211) with a CPU device (condensed sketch below)
  • The C interface in cextension.py handles library loading
  • CUDA memory management lives in pythonInterface.cpp via cget_managed_ptr
  • The current implementation creates CPU tensors backed by managed memory
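For reference, a condensed sketch of that current CPU-device path (paraphrased from functional.py; exact details may differ slightly). The raw pointer from lib.cget_managed_ptr is wrapped through NumPy, so the resulting tensor reports device=cpu even though the allocation is managed memory:

import ctypes as ct
from math import prod

import numpy as np
import torch

from bitsandbytes.cextension import lib  # existing ctypes binding to the native library

lib.cget_managed_ptr.restype = ct.c_void_p  # pointer-sized return value

def get_paged_current(*shape, dtype=torch.float32, device=torch.device("cuda", index=0)):
    # Allocate CUDA unified (managed) memory through the existing C entry point.
    num_bytes = dtype.itemsize * prod(shape)
    cuda_ptr = lib.cget_managed_ptr(ct.c_size_t(num_bytes))

    # Wrap the raw pointer via NumPy / torch.frombuffer -- this yields a CPU tensor,
    # which is exactly the dispatch problem described above.
    c_ptr = ct.cast(cuda_ptr, ct.POINTER(ct.c_int))
    arr = np.ctypeslib.as_array(c_ptr, shape=shape)
    out = torch.frombuffer(arr, dtype=dtype, count=prod(shape)).view(shape)

    out.is_paged = True
    out.page_deviceid = device.index
    return out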
  3. Critical Insight from alband:
  • Use torch.from_blob with CUDA device specification to create "CUDA" tensors with unified memory
  • This satisfies PyTorch's dispatch requirements while using managed memory
  4. Implementation Strategy:
# Revised get_paged in functional.py
def get_paged(*shape, dtype=torch.float32, device=torch.device("cuda", index=0)):
    num_bytes = dtype.itemsize * prod(shape)
    # Use the C++ extension instead of direct lib.cget_managed_ptr calls
    tensor = cpp_extension.get_managed_tensor(
        num_bytes,
        list(shape),
        dtype,
    )
    tensor.is_paged = True                # keep the flag existing optimizer logic checks
    tensor.page_deviceid = device.index   # keep multi-GPU bookkeeping
    return tensor
// Enhanced C++ implementation in pythonInterface.cpp
torch::Tensor get_managed_tensor(
    size_t nb_bytes,
    c10::IntArrayRef sizes,
    c10::ScalarType dtype
) {
    // Allocate CUDA unified (managed) memory, initially attached to the host.
    void* cuda_ptr;
    CUDA_CHECK(cudaMallocManaged(&cuda_ptr, nb_bytes, cudaMemAttachHost));

    auto options = torch::TensorOptions()
        .device(torch::kCUDA)  // Critical: report a CUDA device so dispatch selects CUDA kernels
        .dtype(dtype)
        .requires_grad(false);

    // Wrap the raw pointer without copying; the deleter frees the managed allocation.
    return torch::from_blob(
        cuda_ptr,
        sizes,
        [](void* ptr) { cudaFree(ptr); },
        options
    );
}
  5. Integration Points:
  • Replace the existing lib.cget_managed_ptr calls with the new C++ extension (loading sketch below)
  • Maintain the is_paged flag for backward compatibility
  • Preserve the prefetch logic through the existing cprefetch calls
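One possible way to wire this in on the Python side, as a minimal sketch only: the module name, source path, and exposed get_managed_tensor binding are assumptions based on the snippet above, not existing bitsandbytes API (the .cpp file would also need a PYBIND11_MODULE block exposing get_managed_tensor, or it could be built with load_inline(..., functions=[...]) as in the follow-up comment):

import torch
from torch.utils.cpp_extension import load

# Hypothetical wiring: compile the C++ sketch above into a loadable torch extension.
# The name and source path are placeholders, not the existing bitsandbytes layout.
cpp_extension = load(
    name="bnb_managed_tensor",
    sources=["csrc/managed_tensor.cpp"],  # would contain get_managed_tensor() plus its pybind binding
    with_cuda=True,
)

# The revised get_paged() above then calls this instead of lib.cget_managed_ptr:
t = cpp_extension.get_managed_tensor(1024 * 4, [1024], torch.float32)
print(t.device)  # expected: cuda:0, while the storage lives in unified (managed) memory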
  6. Dispatch Mechanism:
# The tensor reports device=cuda
t = get_paged(1024, 1024, dtype=torch.float32)
print(t.device)  # Output: cuda:0

# Operations dispatch to CUDA kernels;
# unified memory handles page migration automatically
y = t @ t.T  # Dispatches to CUDA GEMM
  7. Memory Management:
  • Retain existing page_mng.paged_tensors tracking
  • Keep prefetch logic in optimizers' prefetch_state()
  • Unified memory allows access from both CPU/GPU with automatic migration
  8. Backward Compatibility:
  • Maintain is_paged flag for existing optimizer logic
  • Keep page_deviceid for multi-GPU support
  • Preserve block-wise quantization logic that uses paged tensors
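To make the prefetch and is_paged points above concrete, here is a hedged sketch of how the existing paged path uses them (paraphrased from the current functional.py / optimizer logic; names and call sites may differ slightly):

import ctypes as ct
import torch

from bitsandbytes.cextension import lib  # existing ctypes binding

def prefetch_tensor(A: torch.Tensor, to_cpu: bool = False):
    # Only paged (managed-memory) tensors can be prefetched; regular tensors are left alone.
    assert getattr(A, "is_paged", False), "only paged tensors can be prefetched"
    # -1 is CUDA's cudaCpuDeviceId; otherwise prefetch to the GPU the tensor is paged on.
    deviceid = -1 if to_cpu else A.page_deviceid
    num_bytes = A.dtype.itemsize * A.numel()
    lib.cprefetch(ct.c_void_p(A.data_ptr()), ct.c_size_t(num_bytes), ct.c_int32(deviceid))

# In the optimizers' prefetch_state(), state tensors flagged as paged would be prefetched
# onto the GPU before the update step, roughly:
#
#     if getattr(state["state1"], "is_paged", False):
#         prefetch_tensor(state["state1"])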

This approach satisfies all requirements while maintaining compatibility with existing optimizer infrastructure. The key innovation is creating CUDA device tensors that transparently use unified memory, enabling proper dispatch while retaining paged memory benefits.

@Titus-von-Koeller
Collaborator Author

We closed this line of investigation because the proposed solution:

import torch
from torch.utils.cpp_extension import load_inline


dontlook = """
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>


at::Tensor getManagedTensor(size_t nb_bytes, c10::IntArrayRef sizes) {

    float* cuda_ptr;
    cudaMallocManaged(&cuda_ptr, nb_bytes);

    at::Tensor tensor = at::for_blob((void*)cuda_ptr, sizes)
             .deleter([](void *ptr) {
               cudaFree(ptr);
             })
             .options(at::device(at::kCUDA).dtype(at::kFloat))
             .make_tensor();

    return tensor;
}
"""

# load_inline compiles the C++ above as a torch extension and auto-generates the Python binding
mod = load_inline("mod", cpp_sources=dontlook, with_cuda=True, functions=["getManagedTensor"])


size = 10
t = mod.getManagedTensor(size * 4, (size,))  # size * 4 bytes for a float32 tensor of shape (size,)
t.fill_(0)  # the tensor reports a CUDA device while its storage is unified memory

print(t)

del t
print("Done!")

has a dependency on the torch C++ library, which we don't want to depend on, and because so far no implementation other than AMD actually wants to use unified memory, while AMD is dispatched under the "cuda" dispatch key anyway.
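
For context on the AMD point: ROCm builds of PyTorch expose HIP devices under the "cuda" device type, so device-based dispatch already lands on the same key. A minimal illustration (not bitsandbytes code):

import torch

# On a ROCm build of PyTorch, AMD GPUs are reported under the "cuda" device type,
# so device-based dispatch treats them the same way as NVIDIA GPUs.
if torch.cuda.is_available():
    t = torch.ones(4, device="cuda")
    print(t.device.type)      # "cuda" on both CUDA and ROCm builds
    print(torch.version.hip)  # a version string on ROCm builds, None on CUDA builds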

Therefore, it doesn't make sense to investigate this further.
