-
Notifications
You must be signed in to change notification settings - Fork 106
Test Suite Flag --rdma-mpi
Implemented (#598)
#878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an RDMA flag T/F
in the case files right now (there's probably an example of this in examples/
). What you've added is very useful for testing, but perhaps not for running cases (since RDMA_MPI doesn't need to be known at compile-time, we can just put it in the case file).
For testing (the purpose of this GH issue), it's useful to be able to flip RDMA_MPI
on/off from the command line since some compilers/computers support it and some do not.
I would suggest:
- Making this flag only an option in the
./mfc.sh test
environment - If
--rdma-mpi
is provided as an argument, then we run three additional tests. They would be copies of the current 2-rank test cases, but haveRDMA_MPI = 'T'
in the case file. The goldenfiles would be identical. The hashes for the current 2-rank tests are below. Note that if the flag is enabled, we should still run the existing non-RDMA 2-rank tests (if the compiler supports RDMA_MPI then it also supports not using it). Current 2-rank tests:
0FCCE9F1 1D -> 2 MPI Ranks
8C7AA13B 2D -> 2 MPI Ranks
CE232828 3D -> 2 MPI Ranks
- If
--rdma-mpi
flag is not provided, then the new/extra cases are 'skipped'. This is already a feature of the test suite. - This flag should be used in the Frontier CI workflow.
./mfc.sh test -a --rdma-mpi
(or whatever) since Frontier supports this feature.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #878 +/- ##
=======================================
Coverage 45.77% 45.77%
=======================================
Files 68 68
Lines 18664 18664
Branches 2251 2251
=======================================
Hits 8543 8543
Misses 8763 8763
Partials 1358 1358 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Six 2-Rank MPI cases will be created. Pairs with the RDMA enabled and disabled. RDMA-On cases will be run only when the flag is given. Otherwise, they will be written but skipped.
Example Usage: ./mfc.sh test --rdma-mpi #only MPI Ranks cases.
./mfc.sh test -o Hypoelasticity --rdma-mpi #Hypoelasticity cases with the RDMA enabled.
./mfc.sh test -a --rdma-mpi #all tests along with RDMA 2-Ranks MPI cases. ./mfc.sh test --help -r, --rdma-mpi Enable RDMA MPI for tests (default: False)
--no-build (Testing) Do not rebuild MFC. (default: False)
--no-examples Do not test example cases. (default: False)
--case-optimization (GPU Optimization) Compile MFC targets with some case parameters hard-coded. (default: False)
--dry-run Build and generate case files but do not run tests. (default: False)
--generate (Test Generation) Generate golden files. (default: False)
--add-new-variables (Test Generation) If new variables are found in D/ when running tests, add them to the golden files. (default: False)
--remove-old-tests (Test Generation) Delete tests directories that are no longer. (default: False) |
Just need to update the testing docs for the flag and add |
I'm not sure if this is the case right now, but if one enables the flag |
@sbryngelson |
@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)?? |
You can ask the MFC Slack what other computers have compilers that support it. But the Frontier failure appears to have nothing to do with your PR. It's having some network issues. |
Working on it right now on Phoenix |
This should do it hopefully. The difference between the two commands is right now 3 tests. I will keep an eye on the test suite I removed |
--rdma-mpi
Implemented (#598)
Self-Hosted Frontier failed due to irrelevant reasons at all. I will rerun momentarily. |
@anandrdbz Self-Hosted Frontier has failed on multiple re-runs for unclear reasons. Do you mind checking it out? |
the problem seems to be that it's trying to use the internet on a compute node (in the test part of the workflow). the compute nodes don't have internet access. it's trying to access silo. I am not sure why it thinks it isn't available, because all of the other PR workflows are working fine... which suggests something specific to this PR |
I have not clue to be honest |
frankly me either, your PR doesn't seem to tickle that part of the code. i would try commenting out some of your changes until it works. not an elegant method. probably leave the flag but comment out the extra test cases and how you modify the case files, and make sure that works. if that works, add in how you modify the case files but not the tests themselves. as i look closer, i'm unsure if how you added the RDMA flag is proper - though i'm not entirely sure. what happens if you try to use it on Phoenix or something? If you run the |
on Phoenix it ran just fine, but failed on the RDMA cases only |
I will keep playing with it tonight |
basically the idea of frankly i'm pretty confident now that something like that is happening. MFC rebuilds itself when it sees certain flag differences. |
well, in this case i know it should run fine, but the question is did it try to rebuild silo when you run test after you run |
I will wait for the re-run now and if it does not pass, I will try the flag on different clusters to investigate the issue further |
Yeah i don't think you even need a cluster. you could probably even do this on your laptop. you just need to see if it tries to fetch silo from github because that's the failure mode. |
checked the logs and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new --rdma-mpi
/-r
flag to the test suite, allowing users to skip or run tests with RDMA MPI enabled by default off.
- Adds a filter to skip tests labeled “RDMA MPI” unless
--rdma-mpi
is specified - Passes
rdma_mpi=True
tocase.run()
for simulation steps when appropriate - Updates test case definitions, CLI args, documentation, and CI workflow to support the new flag
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
toolchain/mfc/test/test.py | Skip or enable RDMA MPI cases in filtering and test execution |
toolchain/mfc/test/cases.py | Add “RDMA MPI” variants to 3D and IB test case definitions |
toolchain/mfc/args.py | Define the --rdma-mpi argument |
docs/documentation/testing.md | Document the new --rdma-mpi flag |
.github/workflows/frontier/test.sh | Enable --rdma-mpi in CI test invocation |
Comments suppressed due to low confidence (4)
toolchain/mfc/args.py:83
- The implementation documentation mentions a short form
-r
for this flag, but it isn’t defined here. Consider adding-r
alongside--rdma-mpi
for consistency with the description.
test.add_argument( "--rdma-mpi", action="store_true", default=False, help="Run tests with RDMA MPI enabled")
toolchain/mfc/test/test.py:60
- New filtering logic skips RDMA MPI cases when the flag is unset; consider adding unit or integration tests for
__filter
to verify both skip and include behaviors.
for case in cases[:]:
toolchain/mfc/test/test.py:234
- The RDMA-enabled path for the post-process step is commented out under
test_all
, so--rdma-mpi
has no effect there. You should either restore that logic or remove the commented block to avoid misleading code.
# if ARG("rdma_mpi"):
toolchain/mfc/test/test.py:190
- [nitpick] This
else
branch is indented with two spaces instead of four, which breaks the project’s indentation style. Please adjust for consistency.
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)
if "RDMA MPI" in case.trace: | ||
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True) | ||
else: | ||
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider consolidating the RDMA branch with the standard case.run
invocation by merging it into a single call that conditionally adds rdma_mpi=True
, to reduce duplication.
if "RDMA MPI" in case.trace: | |
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=True) | |
else: | |
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices) | |
rdma_mpi = "RDMA MPI" in case.trace | |
cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices, rdma_mpi=rdma_mpi) |
Copilot uses AI. Check for mistakes.
interesting copilot also generated these but left them in a dropdown not a comment because it had low confidence toolchain/mfc/args.py:83
toolchain/mfc/test/test.py:60
toolchain/mfc/test/test.py:234
toolchain/mfc/test/test.py:190
|
Will cruise back into this PR in a sec |
github/workflows/frontier/test.sh has nothing sketchy looking tbh, just normal intended use of the flag. toolchain/mfc/test/cases.py intended to append the additional cases to the existing list. def alter_ppn(dimInfo):
if len(dimInfo[0]) == 3:
cases.append(define_case_d(stack, '2 MPI Ranks', {'m': 29, 'n': 29, 'p': 49}, ppn=2))
cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'m': 29, 'n': 29, 'p': 49, 'rdma_mpi': 'T'}, ppn=2))
else:
cases.append(define_case_d(stack, '2 MPI Ranks', {}, ppn=2))
cases.append(define_case_d(stack, '2 MPI Ranks -> RDMA MPI', {'rdma_mpi': 'T'}, ppn=2)) toolchain/mfc/test/test.py skips RMDA cases if not enabled for case in cases[:]:
if "RDMA MPI" in case.trace and not ARG("rdma_mpi"):
cases.remove(case)
skipped_cases.append(case) |
Description
I added the flag under
./mfc.sh test
. It can be invoked with-r
or--rdma-mpi
It can be used on any test cases. It is intended to skip tests whenever the
rdma_mpi
is false or unspecified (false by default).