Skip to content

Update MIOpen to support gfx1151 and use older version of frugally-deep #392

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

scottt
Copy link
Contributor

@scottt scottt commented Apr 10, 2025

I combined and reworked the MIOpen patches for gfx1151 from myself and @jammm and tested it a bit more with MIOpenDriver and pytorch.

# Originally mirrored from: https://github.com/Dobiasd/frugally-deep/archive/refs/tags/v0.16.2.tar.gz
URL https://rocm-third-party-deps.s3.us-east-2.amazonaws.com/frugally-deep-0.16.2.tar.gz
URL_HASH SHA256=b16af09606dcf02359de53b7c47323baaeda9a174e1c87e126c3127c55571971
# TODO: copy tarball to rocm-third-party-deps.s3.us-east-2.amazonaws.com
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to upload the specified version but can you elaborate why the downgraded is needed?

Copy link
Contributor Author

@scottt scottt Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbre , without this frugally-deep downgrade, code using torch.nn.Conv2d from pytorch would produce MIOpen Error: tensor_shape_variable needs to be an array RuntimeError: miopenStatusUnknownError.
See
#244 (comment)

(Really appreciate your quick reviews!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbre , I've amended the PR with a comment with links to the upstream MIOpen issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbre as @scottt mentioned, more context available at ROCm/MIOpen#3588 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing me to it. Let me upload the artifact for you asap.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. @BradPepersAMD if you're not aware of this yet, you probably should be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: copy tarball to rocm-third-party-deps.s3.us-east-2.amazonaws.com

Uploaded it, thus you can drop the TODO :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbre , commit added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 seems you still need to drop the TODO note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, @marbre thanks for catching that. PR updated.

@scottt scottt force-pushed the gfx1151-for-merge branch from b923341 to 4220d64 Compare April 10, 2025 17:01
@marbre
Copy link
Member

marbre commented Apr 16, 2025

I have not much context on the MIOpen specific patches. @stellaraccident would you be able to review or can assign someone?

@scottt
Copy link
Contributor Author

scottt commented Apr 16, 2025

Some further explainations of the patches to help review:

  1. Enable MIOpenDriver

MIOpenDriver is an optional CLI program that's part of MIOpen
MIOpenDriver is handy to have during hardware bring up as it can be used to reproduce e.g. conv2d fp16 calculation request that would otherwise require Pytorch to reproduce

patches/amd-mainline/MIOpen/0001-driver-CMakeLists.txt-allow-MIOpenDriver-to-be-built.patch allows MIOpenDriver to be bulit as part of TheRock.

This patch changes third-party/nlohmann-json/CMakeLists.txt to build with -DJSON_MultipleHeaders=ON because that's how MIOpen upstream does it: https://github.com/ROCm/MIOpen/blob/develop/requirements.txt

  1. Fix gpu device specific macro definitions to build MIOpen for gfx1151

patches/amd-mainline/MIOpen/0002-Fix-build-for-gfx1151-gfx1036.patch

gfx1151 wants a BUFFER_RESOURCE_3RD_DWORD definition. The specific value used, 0x31014000, was taken from the hipBLASLt patch.

I'd personally love to learn what a BUFFER_RESOURCE_3RD_DWORD is and why the ROCm code base seemingly choose to open code a magic register value in headers copied in hipBLASLt, MIOpen, and Pytorch.

I made MIOpen build for gfx1036 as well, as that was the iGPU available in my 9950X3D CPU and I manually tested the MIOpen GPU assembly kernels on both.

  1. Disable conv_inwo_fury_RxS for gfx1151

This conv2d fp16 specific kernel was causing illegal opcode errors on the gfx1151

patches/amd-mainline/MIOpen/0003-Disable-conv_wino_fury_RxS-for-gfx115x.patch

I should probably file an issue here. I'm looking forward to learning how to find out which GPU instruction produced the illegal opcode error and fixing it.

@stellaraccident
Copy link
Collaborator

Ok, there are a few things that need to be done to land this. For the substantive changes to miopen, I at least need to get a PR opened upstream for feedback.

I'll piece this apart and start landing today.

@jammm
Copy link
Contributor

jammm commented Apr 16, 2025

This conv2d fp16 specific kernel was causing illegal opcode errors on the gfx1151

patches/amd-mainline/MIOpen/0003-Disable-conv_wino_fury_RxS-for-gfx115x.patch

I should probably file an issue here. I'm looking forward to learning how to find out which GPU instruction produced the illegal opcode error and fixing it.

I tried debugging this and am stumped - the ISA delta between RDNA3 and RDNA3.5 is pretty minimal to non-existent. hipcc doesn't complain about it during assembly either. I think there might be some very subtle differences but it's not clearly documented publicly yet. Ideally this hand-written ISA should be rewritten as MLIR which can then be lowered down to navi3/3.5 to handle those subtle idiosyncrasies.

@stellaraccident
Copy link
Collaborator

@BradPepersAMD is going to help advise on how to land these

stellaraccident added a commit that referenced this pull request Apr 16, 2025
* Both changes were found to be required upon additional testing of MIOpen.
* Includes a workaround for ROCm/MIOpen#3588

Part of breaking #392 up for landing.
stellaraccident added a commit that referenced this pull request Apr 16, 2025
* Both changes were found to be required upon additional testing of
MIOpen.
* Includes a workaround for ROCm/MIOpen#3588

Part of breaking #392 up for landing.
@stellaraccident
Copy link
Collaborator

Ok, I've broken this down into a couple of patches and either landed or set up to land. #432 is the one with the key functional changes and includes links to issues and upstream PRs I created for each. Closing this as everything it included has been pushed through an appropriate channel.

@github-project-automation github-project-automation bot moved this from TODO to Done in TheRock Triage Apr 16, 2025
stellaraccident added a commit that referenced this pull request Apr 17, 2025
* Enables MIOpenDriver (requires patch to correct system deps). Always
sets MIOPEN_BUILD_DRIVER=ON and propagates BUILD_TESTING from the parent
project #433
* Patches some device feature words for gfx1151 and gfx1036 and
simplifies macro logic (0002-Fix-build-for-gfx1151-gfx1036.patch) #434
* Patches conv_wino_fury_RxS to disable on gfx1151
(0003-Disable-conv_wino_fury_RxS-for-gfx115x.patch) #435
* Makes MIOpenDriver exit with failing status
(0004-MIOpenDriver-exit-with-status-1-on-usage-error.patch) #436

Lands remaining parts of #392.

Co-Authored-By: Scott Tsai <[email protected]>
@stellaraccident
Copy link
Collaborator

Ok, the PRs are landed in TheRock. Upstream patches are being reviewed. @scottt you may want to make sure you are subscribed to the gfx1151 specific MIOpen patches as they said they may want to discuss more.

@scottt
Copy link
Contributor Author

scottt commented Apr 17, 2025

@scottt you may want to make sure you are subscribed to the gfx1151 specific MIOpen patches as they said they may want to discuss more.

Thanks for help landing this and the heads up. I've subscribed to #433, #434, #435, #436

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants