[Bug]: Inconsistency between is_contiguous and stride API in HIPGRAPH #2020

Open
tjtanaa opened this issue Apr 7, 2025 · 1 comment

tjtanaa commented Apr 7, 2025

🐛 Describe the bug

I could not reproduce this with a simple test script. The bug was found in vLLM on ROCm when running meta-llama/Llama-4-Scout-17B-16E-Instruct. It occurs only in HIPGraph mode + torch.compile. In EAGER mode + torch.compile, the is_contiguous() API and the stride() API are consistent.

In HIPGraph mode, a tensor A can end up with the following properties:
.shape: ([1024, 1])
.is_contiguous(): True
.stride() : [1,1024]
.is_contiguous(memory_format=torch.channels_last) is False
.is_contiguous(memory_format=torch.contiguous_format) is True

Expected behaviour: since is_contiguous() is True, .stride(1) == 1 should also be True.
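For triage, it may help to spell out what a row-major contiguity check looks at. The sketch below is a pure-Python approximation, assuming (as I understand PyTorch's check) that dimensions of size 1 are skipped because their stride is never used for indexing. `is_contiguous_like` is a hypothetical helper name, not a PyTorch API.

```python
def is_contiguous_like(shape, stride):
    """Hypothetical helper: approximate row-major contiguity check.

    Assumption: size-1 dimensions are ignored, since any stride
    addresses the same single element along that dimension.
    """
    if any(size == 0 for size in shape):
        return True  # empty tensors are treated as contiguous
    expected = 1
    # Walk from the innermost dimension outwards.
    for size, st in zip(reversed(shape), reversed(stride)):
        if size == 1:
            continue  # stride of a size-1 dim is not checked
        if st != expected:
            return False
        expected *= size
    return True


reported = is_contiguous_like((1024, 1), (1, 1024))    # metadata seen on ROCm
normalized = is_contiguous_like((1024, 1), (1, 1))     # metadata seen on CUDA
two_columns = is_contiguous_like((1024, 2), (1, 1024)) # same strides, 2 columns
print(reported, normalized, two_columns)
```

Under that assumption, both stride layouts for shape [1024, 1] pass the check, so code that reads .stride(1) directly (for example a custom kernel) would see different values on the two platforms even though is_contiguous() agrees.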

Calling A = A.contiguous() does not fix the issue; .stride() is still [1,1024].

The current workaround is A = A.view(-1).reshape(A.shape).
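A minimal sketch of the workaround on a hand-built tensor. It does not reproduce the HIPGraph + torch.compile path: it assumes the same metadata can be constructed on CPU with torch.empty_strided, only to show how contiguous() and the view/reshape round trip treat it.

```python
import torch

# Assumption: construct the reported shape/stride by hand on CPU.
# This does not exercise HIPGraph or torch.compile.
a = torch.empty_strided((1024, 1), (1, 1024))

# The stride of the size-1 dimension does not affect the contiguity check.
print(a.is_contiguous())

# contiguous() returns the tensor unchanged when it is already
# contiguous, so the unusual stride survives.
print(a.contiguous().stride())

# Flattening and reshaping recomputes the strides from the flat view
# without copying the data.
b = a.view(-1).reshape(a.shape)
print(b.stride())
```

The round trip only rewrites metadata, so b shares storage with a.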

On CUDA, stride() returns (1,1), but on ROCm it returns (1,1024).

Versions

Collecting environment information...
PyTorch version: 2.7.0a0+git295f2ed
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42133-1b9c17779

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42133
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.7.0a0+git295f2ed
[pip3] torchvision==0.21.0+7af6987
[pip3] triton==3.2.0+gite5be006a
tjtanaa changed the title from "[Bug]: Inconsistency between is_contiguous and Stride API in HIPGRAPH" to "[Bug]: Inconsistency between is_contiguous and stride API in HIPGRAPH" on Apr 7, 2025
@hongxiayang
Collaborator

cc @jeffdaily @jataylo
