Description
Describe the bug
This ticket tracks all of the other tickets and disabled tests related to the flaky CI issues on the AMD runner.
It focuses specifically on cases where one or more tests hang in the same run in which one or more other tests fail with a memory access fault.
The issue has been worked around by limiting the AMD CI to run on a single thread, so it should no longer occur. This ticket exists to investigate the root cause and, once it is understood, to help close the related tickets and re-enable the disabled tests.
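For triaging old CI runs, the pattern this ticket focuses on (at least one hang and at least one memory access fault in the same run) can be checked mechanically. The following is only a rough sketch: the `RunResult` record, the `classify_run` helper, the `"TIMEOUT"`/`"FAIL"` status strings, and the `"Memory access fault"` marker are assumptions for illustration and would need to be adapted to what the CI logs actually contain. The run data at the bottom is made up.

```python
from dataclasses import dataclass

# Assumed marker: text expected in the output of a test that died with a
# GPU memory access fault. Adjust to match the actual CI logs.
FAULT_MARKER = "Memory access fault"


@dataclass
class RunResult:
    """One test outcome from a single CI run (hypothetical record type)."""
    name: str         # test path, e.g. "SubGroup/scan.cpp"
    status: str       # assumed lit-style code: "PASS", "FAIL", "TIMEOUT"
    output: str = ""  # captured test output, if any


def classify_run(results):
    """Return (matches, hung, faulted) for one CI run.

    matches is True only when the run contains at least one hanging test
    and at least one test that failed with a memory access fault.
    """
    hung = [r.name for r in results if r.status == "TIMEOUT"]
    faulted = [
        r.name
        for r in results
        if r.status == "FAIL" and FAULT_MARKER in r.output
    ]
    return bool(hung) and bool(faulted), hung, faulted


if __name__ == "__main__":
    # Made-up run data, for illustration only.
    example_run = [
        RunResult("SubGroup/scan.cpp", "TIMEOUT"),
        RunResult("Basic/built-ins/vec_math.cpp", "FAIL",
                  "Memory access fault by GPU"),
        RunResult("Some/other_test.cpp", "PASS"),
    ]
    print(classify_run(example_run))
```

A run with only hangs, or only memory access faults, would not match and probably belongs in a separate ticket.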
Tests with memory access fault
Experimental/launch_queries/max_work_group_size.cpp
FreeFunctionCommands/mem_advise.cpp
Regression/commandlist/gpu.cpp
WeakObject/weak_object_expired.cpp
Reduction/reduction_nd_ext_half.cpp
Reduction/reduction_big_data.cpp
Basic/built-ins/vec_relational.cpp
Basic/built-ins/vec_math.cpp
SubGroup/sub_group_as.cpp
HostInteropTask/host-task-dependency3.cpp
SharedLib/use_with_dlopen_verify_cache.cpp
Tests hanging
WorkGroupMemory/basic_usage.cpp
syclcompat/memory/memory_management_test2_usmnone.cpp
SubGroup/reduce_spirv13.cpp
Adapters/retain_events.cpp
SubGroup/scan.cpp
List of related tickets
- CI Failed Tests on AMD/HIP #17441
- e2e weak_object_expired GPU crash #17415
- tests failing on AMD/Hip ( flaky? ) #17339
- SubGroup/reduce_spirv13.cpp timeouts on unrelated changes in post-commit on HIP #17284
- SYCL :: SubGroup/sub_group_as.cpp fails on unrelated changes in post-commit #17283
- SYCL :: Adapters/retain_events.cpp flaky timeout on AMD CI #17236
- SYCL :: HostInteropTask/host-task-dependency3.cpp flaky fails on AMD pre-commit #17235
- SubGroup/scan.cpp, Reduction/reduction_big_data.cpp, SharedLib/use_with_dlopen_verify_cache.cpp fail in pre-commit on AMD on unrelated change #17194
PR disabling related tests
Workaround PR with -j1
To reproduce
Environment
Additional context
No response