Test Suite Flag `--rdma-mpi` Implemented (#598) #878

Malmahrouqi3 · 2025-06-11T15:50:32Z

Description

I added the flag under ./mfc.sh test. It can be invoked with -r or --rdma-mpi
It can be used on any test cases. It is intended to skip tests whenever the rdma_mpi is false or unspecified (false by default).

/rdma_mpi/MFC-mo2$ ./mfc.sh test  -r -o HLLD

Test Summary: 0 passed, 0 failed, 2 skipped.

 Skipped Cases

 1D -> MHD -> HLLD
 2D -> MHD -> HLLD

sbryngelson

There is an RDMA flag T/F in the case files right now (there's probably an example of this in examples/). What you've added is very useful for testing, but perhaps not for running cases (since RDMA_MPI doesn't need to be known at compile-time, we can just put it in the case file).

For testing (the purpose of this GH issue), it's useful to be able to flip RDMA_MPI on/off from the command line since some compilers/computers support it and some do not.

I would suggest:

Making this flag only an option in the ./mfc.sh test environment
If --rdma-mpi is provided as an argument, then we run three additional tests. They would be copies of the current 2-rank test cases, but have RDMA_MPI = 'T' in the case file. The goldenfiles would be identical. The hashes for the current 2-rank tests are below. Note that if the flag is enabled, we should still run the existing non-RDMA 2-rank tests (if the compiler supports RDMA_MPI then it also supports not using it). Current 2-rank tests:

  0FCCE9F1   1D -> 2 MPI Ranks
  8C7AA13B   2D -> 2 MPI Ranks
  CE232828   3D -> 2 MPI Ranks

If --rdma-mpi flag is not provided, then the new/extra cases are 'skipped'. This is already a feature of the test suite.
This flag should be used in the Frontier CI workflow. ./mfc.sh test -a --rdma-mpi (or whatever) since Frontier supports this feature.

codecov · 2025-06-11T18:49:34Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.77%. Comparing base (a34f0ca) to head (edbd532).
Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #878   +/-   ##
=======================================
  Coverage   45.77%   45.77%           
=======================================
  Files          68       68           
  Lines       18664    18664           
  Branches     2251     2251           
=======================================
  Hits         8543     8543           
  Misses       8763     8763           
  Partials     1358     1358

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Malmahrouqi3 · 2025-06-11T20:33:42Z

Six 2-Rank MPI cases will be created. Pairs with the RDMA enabled and disabled. RDMA-On cases will be run only when the flag is given. Otherwise, they will be written but skipped.

0FCCE9F1 1D -> 2 MPI Ranks
8C7AA13B 2D -> 2 MPI Ranks
CE232828 3D -> 2 MPI Ranks

Example Usage:

./mfc.sh test --rdma-mpi                                        #only MPI Ranks cases.
./mfc.sh test -o Hypoelasticity --rdma-mpi                      #Hypoelasticity cases with the RDMA enabled.
./mfc.sh test -a --rdma-mpi                                     #all tests along with RDMA 2-Ranks MPI cases.

./mfc.sh test --help

  -r, --rdma-mpi        Enable RDMA MPI for tests (default: False)
  --no-build            (Testing) Do not rebuild MFC. (default: False)
  --no-examples         Do not test example cases. (default: False)
  --case-optimization   (GPU Optimization) Compile MFC targets with some case parameters hard-coded. (default: False)
  --dry-run             Build and generate case files but do not run tests. (default: False)
  --generate            (Test Generation) Generate golden files. (default: False)
  --add-new-variables   (Test Generation) If new variables are found in D/ when running tests, add them to the golden files. (default: False)
  --remove-old-tests    (Test Generation) Delete tests directories that are no longer. (default: False)

Malmahrouqi3 · 2025-06-11T20:36:18Z

Just need to update the testing docs for the flag and add ./mfc.sh test -a --rdma-mpi to Frontier CI.

…into rdma_mpi

sbryngelson · 2025-06-11T23:33:35Z

I'm not sure if this is the case right now, but if one enables the flag --rdma-mpi (i would remove the -r option for it) then it should run all the cases, not only the ones with RDMA MPI. that's what i was trying to suggest.

Malmahrouqi3 · 2025-06-11T23:39:46Z

@sbryngelson
When the flag is enabled, it should run whatever category specified or just all + RDMA cases.

Malmahrouqi3 · 2025-06-13T19:33:52Z

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

sbryngelson · 2025-06-13T19:37:05Z

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

You can ask the MFC Slack what other computers have compilers that support it. But the Frontier failure appears to have nothing to do with your PR. It's having some network issues.

Malmahrouqi3 · 2025-06-15T16:58:33Z

Working on it right now on Phoenix

…into rdma_mpi

Malmahrouqi3 · 2025-06-15T19:09:30Z

This should do it hopefully. The difference between the two commands is right now 3 tests. I will keep an eye on the test suite
./mfc.sh test -j $(nproc) -a
./mfc.sh test -j $(nproc) -a --rdma-mpi

I removed -r so it only can be run by --rdma-mpi at this moment.

Malmahrouqi3 · 2025-06-15T20:24:24Z

Self-Hosted Frontier failed due to irrelevant reasons at all. I will rerun momentarily.

Malmahrouqi3 · 2025-06-15T22:01:02Z

@anandrdbz Self-Hosted Frontier has failed on multiple re-runs for unclear reasons. Do you mind checking it out?

sbryngelson · 2025-06-16T00:17:31Z

the problem seems to be that it's trying to use the internet on a compute node (in the test part of the workflow). the compute nodes don't have internet access. it's trying to access silo. I am not sure why it thinks it isn't available, because all of the other PR workflows are working fine... which suggests something specific to this PR

Malmahrouqi3 · 2025-06-16T00:22:01Z

I have not clue to be honest

sbryngelson · 2025-06-16T00:28:22Z

frankly me either, your PR doesn't seem to tickle that part of the code. i would try commenting out some of your changes until it works. not an elegant method. probably leave the flag but comment out the extra test cases and how you modify the case files, and make sure that works. if that works, add in how you modify the case files but not the tests themselves.

as i look closer, i'm unsure if how you added the RDMA flag is proper - though i'm not entirely sure. what happens if you try to use it on Phoenix or something? If you run the ./mfc.sh test --dry-run -a and then run ./mfc.sh test --rdma-mpi -a does it try to build silo again? You can just try that out on Phoenix on your own (or whatever computer, for that matter).

Malmahrouqi3 · 2025-06-16T00:30:11Z

on Phoenix it ran just fine, but failed on the RDMA cases only

Malmahrouqi3 · 2025-06-16T00:30:36Z

I will keep playing with it tonight

sbryngelson · 2025-06-16T00:31:36Z

basically the idea of --dry-run is that it builds everything needed to run all of the tests. if you messed up the flags somehow, then it may try to build MFC again on the compute node automatically. (e.g., it sees a new flag dry-run and it says "i need to rebuild!" and that's the problem).

frankly i'm pretty confident now that something like that is happening. MFC rebuilds itself when it sees certain flag differences.
for example, if you compile with --gpu and then run with --no-gpu it will rebuild MFC for no-gpu cases. i think it may see the flag difference during testing/running and think it needs to rebuild. i could be wrong.

sbryngelson · 2025-06-16T00:32:35Z

on Phoenix it ran just fine, but failed on the RDMA cases only

well, in this case i know it should run fine, but the question is did it try to rebuild silo when you run test after you run ./mfc.sh test --dry-run

Malmahrouqi3 · 2025-06-16T00:35:29Z

I will wait for the re-run now and if it does not pass, I will try the flag on different clusters to investigate the issue further

sbryngelson · 2025-06-16T00:39:59Z

Yeah i don't think you even need a cluster. you could probably even do this on your laptop. you just need to see if it tries to fetch silo from github because that's the failure mode.

toolchain/mfc/test/test.py

Malmahrouqi3 · 2025-06-18T18:37:23Z

checked the logs and
Phoenix gpu benchmark failed since the master branch oddly did not complete all benchmark tests.
Self-Hosted Frontier cpu failed from the compiler end.

Copilot

Pull Request Overview

This PR introduces a new --rdma-mpi/-r flag to the test suite, allowing users to skip or run tests with RDMA MPI enabled by default off.

Adds a filter to skip tests labeled “RDMA MPI” unless --rdma-mpi is specified
Passes rdma_mpi=True to case.run() for simulation steps when appropriate
Updates test case definitions, CLI args, documentation, and CI workflow to support the new flag

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
toolchain/mfc/test/test.py	Skip or enable RDMA MPI cases in filtering and test execution
toolchain/mfc/test/cases.py	Add “RDMA MPI” variants to 3D and IB test case definitions
toolchain/mfc/args.py	Define the `--rdma-mpi` argument
docs/documentation/testing.md	Document the new `--rdma-mpi` flag
.github/workflows/frontier/test.sh	Enable `--rdma-mpi` in CI test invocation

Comments suppressed due to low confidence (4)

toolchain/mfc/args.py:83

The implementation documentation mentions a short form -r for this flag, but it isn’t defined here. Consider adding -r alongside --rdma-mpi for consistency with the description.

    test.add_argument(      "--rdma-mpi",     action="store_true",     default=False,     help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/test.py:60

New filtering logic skips RDMA MPI cases when the flag is unset; consider adding unit or integration tests for __filter to verify both skip and include behaviors.

    for case in cases[:]:

toolchain/mfc/test/test.py:234

The RDMA-enabled path for the post-process step is commented out under test_all, so --rdma-mpi has no effect there. You should either restore that logic or remove the commented block to avoid misleading code.

        # if ARG("rdma_mpi"):

toolchain/mfc/test/test.py:190

[nitpick] This else branch is indented with two spaces instead of four, which breaks the project’s indentation style. Please adjust for consistency.

        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

Copilot · 2025-06-21T16:46:46Z

toolchain/mfc/test/test.py

+    if "RDMA MPI" in case.trace:
+        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True)
+    else:
+        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)


[nitpick] Consider consolidating the RDMA branch with the standard case.run invocation by merging it into a single call that conditionally adds rdma_mpi=True, to reduce duplication.

Suggested change

if "RDMA MPI" in case.trace:

cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True)

else:

cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

rdma_mpi = "RDMA MPI" in case.trace

cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=rdma_mpi)

sbryngelson · 2025-06-21T16:50:12Z

interesting copilot also generated these but left them in a dropdown not a comment because it had low confidence

Comments suppressed due to low confidence (4)

toolchain/mfc/args.py:83

The implementation documentation mentions a short form -r for this flag, but it isn’t defined here. Consider adding -r alongside --rdma-mpi for consistency with the description.

    test.add_argument(      "--rdma-mpi",     action="store_true",     default=False,     help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/test.py:60

New filtering logic skips RDMA MPI cases when the flag is unset; consider adding unit or integration tests for __filter to verify both skip and include behaviors.

    for case in cases[:]:

toolchain/mfc/test/test.py:234

The RDMA-enabled path for the post-process step is commented out under test_all, so --rdma-mpi has no effect there. You should either restore that logic or remove the commented block to avoid misleading code.

        # if ARG("rdma_mpi"):

toolchain/mfc/test/test.py:190

[nitpick] This else branch is indented with two spaces instead of four, which breaks the project’s indentation style. Please adjust for consistency.

        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

Malmahrouqi3 · 2025-06-21T16:54:58Z

Will cruise back into this PR in a sec

Malmahrouqi3 · 2025-06-21T17:03:30Z

github/workflows/frontier/test.sh has nothing sketchy looking tbh, just normal intended use of the flag.
docs/documentation/testing.md removed -r as you requested from both code and docs.
toolchain/mfc/args.py recognized the flag in general.
test.add_argument( "--rdma-mpi", action="store_true", default=False, help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/cases.py intended to append the additional cases to the existing list.

    def alter_ppn(dimInfo):
        if len(dimInfo[0]) == 3:
            cases.append(define_case_d(stack, '2 MPI Ranks', {'m': 29, 'n': 29, 'p': 49}, ppn=2))
            cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'m': 29, 'n': 29, 'p': 49, 'rdma_mpi': 'T'}, ppn=2))
        else:
            cases.append(define_case_d(stack, '2 MPI Ranks', {}, ppn=2))
            cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'rdma_mpi': 'T'}, ppn=2))

toolchain/mfc/test/test.py skips RMDA cases if not enabled

    for case in cases[:]:
        if "RDMA MPI" in case.trace and not ARG("rdma_mpi"):
            cases.remove(case)
            skipped_cases.append(case)

added flag --rdma-mpi

14706c2

Malmahrouqi3 requested a review from a team as a code owner June 11, 2025 15:50

Malmahrouqi3 changed the title ~~added flag --rdma-mpi~~ Test Suite Flag --rdma-mpi Implemented (#598) Jun 11, 2025

sbryngelson reviewed Jun 11, 2025

View reviewed changes

Malmahrouqi3 and others added 3 commits June 11, 2025 15:23

modifications on the flag use

0dbef83

flag use changes

3c492d4

lint

8e27e41

Malmahrouqi3 added 3 commits June 11, 2025 16:42

updated testing docs

7fe5f32

RDMAP MPI added to Frontier CI

f788dfe

Merge branch 'rdma_mpi' of https://github.com/mohdsaid497566/MFC-mo2 …

db9d7a0

…into rdma_mpi

Merge branch 'master' into rdma_mpi

8a665b0

Mohammed Said Hamed Humaid Al-Mahrouqi and others added 2 commits June 15, 2025 14:27

fixed stuff

6aac583

Merge branch 'rdma_mpi' of https://github.com/mohdsaid497566/MFC-mo2 …

efa85fb

…into rdma_mpi

Malmahrouqi3 changed the title ~~Test Suite Flag --rdma-mpi Implemented (#598)~~ Test Suite Flag --rdma-mpi Implemented (#598) Jun 15, 2025

re-run with lint/format

902b1ea

Malmahrouqi3 commented Jun 16, 2025

View reviewed changes

toolchain/mfc/test/test.py Show resolved Hide resolved

toolchain/mfc/test/test.py Show resolved Hide resolved

Malmahrouqi3 and others added 5 commits June 15, 2025 20:47

removed duplicate assignment of CMD

c9b1599

redundant code

7839916

Merge branch 'master' into rdma_mpi

43dcc34

Merge branch 'master' into rdma_mpi

518bf23

Merge branch 'master' into rdma_mpi

57b600b

sbryngelson requested a review from Copilot June 21, 2025 16:45

Copilot AI reviewed Jun 21, 2025

View reviewed changes

Merge branch 'master' into rdma_mpi

50d5aea

sbryngelson self-requested a review June 21, 2025 17:03

sbryngelson and others added 2 commits June 21, 2025 13:27

Merge branch 'master' into rdma_mpi

df29256

Merge branch 'master' into rdma_mpi

edbd532

Test Suite Flag --rdma-mpi Implemented (#598) #878

Are you sure you want to change the base?

Test Suite Flag --rdma-mpi Implemented (#598) #878

Conversation

Malmahrouqi3 commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

sbryngelson left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Malmahrouqi3 commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Malmahrouqi3 commented Jun 11, 2025

Uh oh!

sbryngelson commented Jun 11, 2025

Uh oh!

Malmahrouqi3 commented Jun 11, 2025

Uh oh!

Malmahrouqi3 commented Jun 13, 2025

Uh oh!

sbryngelson commented Jun 13, 2025

Uh oh!

Malmahrouqi3 commented Jun 15, 2025

Uh oh!

Malmahrouqi3 commented Jun 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Malmahrouqi3 commented Jun 15, 2025

Uh oh!

Malmahrouqi3 commented Jun 15, 2025

Uh oh!

sbryngelson commented Jun 16, 2025

Uh oh!

Malmahrouqi3 commented Jun 16, 2025

Uh oh!

sbryngelson commented Jun 16, 2025

Uh oh!

Malmahrouqi3 commented Jun 16, 2025

Uh oh!

Malmahrouqi3 commented Jun 16, 2025

Uh oh!

sbryngelson commented Jun 16, 2025

Uh oh!

sbryngelson commented Jun 16, 2025

Uh oh!

Malmahrouqi3 commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbryngelson commented Jun 16, 2025

Uh oh!

Uh oh!

Uh oh!

Malmahrouqi3 commented Jun 18, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jun 21, 2025

Choose a reason for hiding this comment

Uh oh!

sbryngelson commented Jun 21, 2025

Uh oh!

Malmahrouqi3 commented Jun 21, 2025

Uh oh!

Malmahrouqi3 commented Jun 21, 2025

Uh oh!

Uh oh!

Test Suite Flag `--rdma-mpi` Implemented (#598) #878

Test Suite Flag `--rdma-mpi` Implemented (#598) #878

Malmahrouqi3 commented Jun 11, 2025 •

edited

Loading

codecov bot commented Jun 11, 2025 •

edited

Loading

Malmahrouqi3 commented Jun 11, 2025 •

edited

Loading

Malmahrouqi3 commented Jun 15, 2025 •

edited

Loading

Malmahrouqi3 commented Jun 16, 2025 •

edited

Loading