[CI][HIP] Tracker ticket for flaky HIP CI issues #17464

Open
@npmiller

Describe the bug

This ticket tracks all of the other tickets and disabled tests related to the flaky CI issues on the AMD runner.

It focuses specifically on runs in which one or more tests hang while, in the same run, one or more other tests fail with a memory access fault.
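
When triaging a new CI failure, it can help to check mechanically whether a run matches this signature. The following is a minimal sketch, not the project's tooling: it assumes per-test results are already available as a mapping from test path to a `(status, output)` pair, that hung tests are reported with the status `TIMEOUT`, and that the fault appears in the test output as text containing "Memory access fault". The function name `classify_run` and the sample data are hypothetical.

```python
import re

# Assumption: the fault shows up in test output as text containing
# "Memory access fault"; adjust the pattern if the CI logs differ.
FAULT_RE = re.compile(r"Memory access fault", re.IGNORECASE)


def classify_run(results):
    """Split one CI run into hanging and faulting tests.

    `results` maps a test path to a (status, output) pair. The status
    strings "TIMEOUT" and "FAIL" are assumptions about the report format.
    """
    hanging = sorted(
        test for test, (status, _) in results.items() if status == "TIMEOUT"
    )
    faulting = sorted(
        test
        for test, (status, output) in results.items()
        if status == "FAIL" and FAULT_RE.search(output)
    )
    return {
        "hanging": hanging,
        "faulting": faulting,
        # The signature tracked here needs both kinds in the same run.
        "matches_signature": bool(hanging) and bool(faulting),
    }


# Illustrative, made-up run data (not taken from a real CI log).
example_run = {
    "SubGroup/scan.cpp": ("TIMEOUT", ""),
    "Basic/built-ins/vec_math.cpp": ("FAIL", "Memory access fault by GPU node-1"),
    "Reduction/reduction_big_data.cpp": ("PASS", ""),
}
summary = classify_run(example_run)
```

A run with only hangs, or only memory access faults, would not match and may belong in a different ticket.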

The issue has been worked around by limiting the AMD CI to a single thread, so it should no longer occur. This ticket remains open to investigate the root cause, close the related tickets, and re-enable the disabled tests once the underlying problem is understood.

Tests with memory access fault

Experimental/launch_queries/max_work_group_size.cpp
FreeFunctionCommands/mem_advise.cpp
Regression/commandlist/gpu.cpp
WeakObject/weak_object_expired.cpp
Reduction/reduction_nd_ext_half.cpp
Reduction/reduction_big_data.cpp
Basic/built-ins/vec_relational.cpp
Basic/built-ins/vec_math.cpp
SubGroup/sub_group_as.cpp
HostInteropTask/host-task-dependency3.cpp
SharedLib/use_with_dlopen_verify_cache.cpp

Tests hanging

WorkGroupMemory/basic_usage.cpp
syclcompat/memory/memory_management_test2_usmnone.cpp
SubGroup/reduce_spirv13.cpp
Adapters/retain_events.cpp
SubGroup/scan.cpp

List of related tickets

PR disabling related tests

Workaround PR with -j1

To reproduce

Environment

Additional context

No response

Labels: bug (Something isn't working), hip (Issues related to execution on HIP backend)
