Skip to content

Test Suite Flag --rdma-mpi Implemented (#598) #878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

Malmahrouqi3
Copy link
Collaborator

@Malmahrouqi3 Malmahrouqi3 commented Jun 11, 2025

Description

I added the flag under ./mfc.sh test. It can be invoked with -r or --rdma-mpi
It can be used on any test cases. It is intended to skip tests whenever the rdma_mpi is false or unspecified (false by default).

/rdma_mpi/MFC-mo2$ ./mfc.sh test  -r -o HLLD

Test Summary: 0 passed, 0 failed, 2 skipped.

 Skipped Cases

 1D -> MHD -> HLLD
 2D -> MHD -> HLLD

@Malmahrouqi3 Malmahrouqi3 requested a review from a team as a code owner June 11, 2025 15:50
@Malmahrouqi3 Malmahrouqi3 changed the title added flag --rdma-mpi Test Suite Flag --rdma-mpi Implemented (#598) Jun 11, 2025
Copy link
Member

@sbryngelson sbryngelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an RDMA flag T/F in the case files right now (there's probably an example of this in examples/). What you've added is very useful for testing, but perhaps not for running cases (since RDMA_MPI doesn't need to be known at compile-time, we can just put it in the case file).

For testing (the purpose of this GH issue), it's useful to be able to flip RDMA_MPI on/off from the command line since some compilers/computers support it and some do not.

I would suggest:

  1. Making this flag only an option in the ./mfc.sh test environment
  2. If --rdma-mpi is provided as an argument, then we run three additional tests. They would be copies of the current 2-rank test cases, but have RDMA_MPI = 'T' in the case file. The goldenfiles would be identical. The hashes for the current 2-rank tests are below. Note that if the flag is enabled, we should still run the existing non-RDMA 2-rank tests (if the compiler supports RDMA_MPI then it also supports not using it). Current 2-rank tests:
  0FCCE9F1   1D -> 2 MPI Ranks
  8C7AA13B   2D -> 2 MPI Ranks
  CE232828   3D -> 2 MPI Ranks
  1. If --rdma-mpi flag is not provided, then the new/extra cases are 'skipped'. This is already a feature of the test suite.
  2. This flag should be used in the Frontier CI workflow. ./mfc.sh test -a --rdma-mpi (or whatever) since Frontier supports this feature.

Copy link

codecov bot commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.77%. Comparing base (a34f0ca) to head (edbd532).
Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #878   +/-   ##
=======================================
  Coverage   45.77%   45.77%           
=======================================
  Files          68       68           
  Lines       18664    18664           
  Branches     2251     2251           
=======================================
  Hits         8543     8543           
  Misses       8763     8763           
  Partials     1358     1358           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Malmahrouqi3
Copy link
Collaborator Author

Malmahrouqi3 commented Jun 11, 2025

Six 2-Rank MPI cases will be created. Pairs with the RDMA enabled and disabled. RDMA-On cases will be run only when the flag is given. Otherwise, they will be written but skipped.

0FCCE9F1 1D -> 2 MPI Ranks
8C7AA13B 2D -> 2 MPI Ranks
CE232828 3D -> 2 MPI Ranks

Example Usage:

./mfc.sh test --rdma-mpi                                        #only MPI Ranks cases.
./mfc.sh test -o Hypoelasticity --rdma-mpi                      #Hypoelasticity cases with the RDMA enabled.
./mfc.sh test -a --rdma-mpi                                     #all tests along with RDMA 2-Ranks MPI cases.

./mfc.sh test --help
  -r, --rdma-mpi        Enable RDMA MPI for tests (default: False)
  --no-build            (Testing) Do not rebuild MFC. (default: False)
  --no-examples         Do not test example cases. (default: False)
  --case-optimization   (GPU Optimization) Compile MFC targets with some case parameters hard-coded. (default: False)
  --dry-run             Build and generate case files but do not run tests. (default: False)
  --generate            (Test Generation) Generate golden files. (default: False)
  --add-new-variables   (Test Generation) If new variables are found in D/ when running tests, add them to the golden files. (default: False)
  --remove-old-tests    (Test Generation) Delete tests directories that are no longer. (default: False)

@Malmahrouqi3
Copy link
Collaborator Author

Just need to update the testing docs for the flag and add ./mfc.sh test -a --rdma-mpi to Frontier CI.

@sbryngelson
Copy link
Member

I'm not sure if this is the case right now, but if one enables the flag --rdma-mpi (i would remove the -r option for it) then it should run all the cases, not only the ones with RDMA MPI. that's what i was trying to suggest.

@Malmahrouqi3
Copy link
Collaborator Author

@sbryngelson
When the flag is enabled, it should run whatever category specified or just all + RDMA cases.

@Malmahrouqi3
Copy link
Collaborator Author

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

@sbryngelson
Copy link
Member

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

You can ask the MFC Slack what other computers have compilers that support it. But the Frontier failure appears to have nothing to do with your PR. It's having some network issues.

@Malmahrouqi3
Copy link
Collaborator Author

Working on it right now on Phoenix

Mohammed Said Hamed Humaid Al-Mahrouqi and others added 2 commits June 15, 2025 14:27
@Malmahrouqi3
Copy link
Collaborator Author

Malmahrouqi3 commented Jun 15, 2025

This should do it hopefully. The difference between the two commands is right now 3 tests. I will keep an eye on the test suite
./mfc.sh test -j $(nproc) -a
./mfc.sh test -j $(nproc) -a --rdma-mpi

I removed -r so it only can be run by --rdma-mpi at this moment.

@Malmahrouqi3 Malmahrouqi3 changed the title Test Suite Flag --rdma-mpi Implemented (#598) Test Suite Flag --rdma-mpi Implemented (#598) Jun 15, 2025
@Malmahrouqi3
Copy link
Collaborator Author

Self-Hosted Frontier failed due to irrelevant reasons at all. I will rerun momentarily.

@Malmahrouqi3
Copy link
Collaborator Author

@anandrdbz Self-Hosted Frontier has failed on multiple re-runs for unclear reasons. Do you mind checking it out?

@sbryngelson
Copy link
Member

the problem seems to be that it's trying to use the internet on a compute node (in the test part of the workflow). the compute nodes don't have internet access. it's trying to access silo. I am not sure why it thinks it isn't available, because all of the other PR workflows are working fine... which suggests something specific to this PR

@Malmahrouqi3
Copy link
Collaborator Author

I have not clue to be honest

@sbryngelson
Copy link
Member

frankly me either, your PR doesn't seem to tickle that part of the code. i would try commenting out some of your changes until it works. not an elegant method. probably leave the flag but comment out the extra test cases and how you modify the case files, and make sure that works. if that works, add in how you modify the case files but not the tests themselves.

as i look closer, i'm unsure if how you added the RDMA flag is proper - though i'm not entirely sure. what happens if you try to use it on Phoenix or something? If you run the ./mfc.sh test --dry-run -a and then run ./mfc.sh test --rdma-mpi -a does it try to build silo again? You can just try that out on Phoenix on your own (or whatever computer, for that matter).

@Malmahrouqi3
Copy link
Collaborator Author

on Phoenix it ran just fine, but failed on the RDMA cases only

@Malmahrouqi3
Copy link
Collaborator Author

I will keep playing with it tonight

@sbryngelson
Copy link
Member

basically the idea of --dry-run is that it builds everything needed to run all of the tests. if you messed up the flags somehow, then it may try to build MFC again on the compute node automatically. (e.g., it sees a new flag dry-run and it says "i need to rebuild!" and that's the problem).

frankly i'm pretty confident now that something like that is happening. MFC rebuilds itself when it sees certain flag differences.
for example, if you compile with --gpu and then run with --no-gpu it will rebuild MFC for no-gpu cases. i think it may see the flag difference during testing/running and think it needs to rebuild. i could be wrong.

@sbryngelson
Copy link
Member

on Phoenix it ran just fine, but failed on the RDMA cases only

well, in this case i know it should run fine, but the question is did it try to rebuild silo when you run test after you run ./mfc.sh test --dry-run

@Malmahrouqi3
Copy link
Collaborator Author

Malmahrouqi3 commented Jun 16, 2025

I will wait for the re-run now and if it does not pass, I will try the flag on different clusters to investigate the issue further

@sbryngelson
Copy link
Member

Yeah i don't think you even need a cluster. you could probably even do this on your laptop. you just need to see if it tries to fetch silo from github because that's the failure mode.

@Malmahrouqi3
Copy link
Collaborator Author

checked the logs and
Phoenix gpu benchmark failed since the master branch oddly did not complete all benchmark tests.
Self-Hosted Frontier cpu failed from the compiler end.

@sbryngelson sbryngelson requested a review from Copilot June 21, 2025 16:45
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new --rdma-mpi/-r flag to the test suite, allowing users to skip or run tests with RDMA MPI enabled by default off.

  • Adds a filter to skip tests labeled “RDMA MPI” unless --rdma-mpi is specified
  • Passes rdma_mpi=True to case.run() for simulation steps when appropriate
  • Updates test case definitions, CLI args, documentation, and CI workflow to support the new flag

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
toolchain/mfc/test/test.py Skip or enable RDMA MPI cases in filtering and test execution
toolchain/mfc/test/cases.py Add “RDMA MPI” variants to 3D and IB test case definitions
toolchain/mfc/args.py Define the --rdma-mpi argument
docs/documentation/testing.md Document the new --rdma-mpi flag
.github/workflows/frontier/test.sh Enable --rdma-mpi in CI test invocation
Comments suppressed due to low confidence (4)

toolchain/mfc/args.py:83

  • The implementation documentation mentions a short form -r for this flag, but it isn’t defined here. Consider adding -r alongside --rdma-mpi for consistency with the description.
    test.add_argument(      "--rdma-mpi",     action="store_true",     default=False,     help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/test.py:60

  • New filtering logic skips RDMA MPI cases when the flag is unset; consider adding unit or integration tests for __filter to verify both skip and include behaviors.
    for case in cases[:]:

toolchain/mfc/test/test.py:234

  • The RDMA-enabled path for the post-process step is commented out under test_all, so --rdma-mpi has no effect there. You should either restore that logic or remove the commented block to avoid misleading code.
        # if ARG("rdma_mpi"):

toolchain/mfc/test/test.py:190

  • [nitpick] This else branch is indented with two spaces instead of four, which breaks the project’s indentation style. Please adjust for consistency.
        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

Comment on lines +187 to +190
if "RDMA MPI" in case.trace:
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True)
else:
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)
Copy link
Preview

Copilot AI Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider consolidating the RDMA branch with the standard case.run invocation by merging it into a single call that conditionally adds rdma_mpi=True, to reduce duplication.

Suggested change
if "RDMA MPI" in case.trace:
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True)
else:
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)
rdma_mpi = "RDMA MPI" in case.trace
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=rdma_mpi)

Copilot uses AI. Check for mistakes.

@sbryngelson
Copy link
Member

interesting copilot also generated these but left them in a dropdown not a comment because it had low confidence

Comments suppressed due to low confidence (4)

toolchain/mfc/args.py:83

  • The implementation documentation mentions a short form -r for this flag, but it isn’t defined here. Consider adding -r alongside --rdma-mpi for consistency with the description.
    test.add_argument(      "--rdma-mpi",     action="store_true",     default=False,     help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/test.py:60

  • New filtering logic skips RDMA MPI cases when the flag is unset; consider adding unit or integration tests for __filter to verify both skip and include behaviors.
    for case in cases[:]:

toolchain/mfc/test/test.py:234

  • The RDMA-enabled path for the post-process step is commented out under test_all, so --rdma-mpi has no effect there. You should either restore that logic or remove the commented block to avoid misleading code.
        # if ARG("rdma_mpi"):

toolchain/mfc/test/test.py:190

  • [nitpick] This else branch is indented with two spaces instead of four, which breaks the project’s indentation style. Please adjust for consistency.
        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

@Malmahrouqi3
Copy link
Collaborator Author

Will cruise back into this PR in a sec

@sbryngelson sbryngelson self-requested a review June 21, 2025 17:03
@Malmahrouqi3
Copy link
Collaborator Author

github/workflows/frontier/test.sh has nothing sketchy looking tbh, just normal intended use of the flag.
docs/documentation/testing.md removed -r as you requested from both code and docs.
toolchain/mfc/args.py recognized the flag in general.
test.add_argument( "--rdma-mpi", action="store_true", default=False, help="Run tests with RDMA MPI enabled")

toolchain/mfc/test/cases.py intended to append the additional cases to the existing list.

    def alter_ppn(dimInfo):
        if len(dimInfo[0]) == 3:
            cases.append(define_case_d(stack, '2 MPI Ranks', {'m': 29, 'n': 29, 'p': 49}, ppn=2))
            cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'m': 29, 'n': 29, 'p': 49, 'rdma_mpi': 'T'}, ppn=2))
        else:
            cases.append(define_case_d(stack, '2 MPI Ranks', {}, ppn=2))
            cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'rdma_mpi': 'T'}, ppn=2))

toolchain/mfc/test/test.py skips RMDA cases if not enabled

    for case in cases[:]:
        if "RDMA MPI" in case.trace and not ARG("rdma_mpi"):
            cases.remove(case)
            skipped_cases.append(case)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants