Replies: 1 comment
-
The new version automatically scans a range of payload sizes, but it is essentially equivalent to running a single 4GiB payload. Most likely your inter-node setup is misconfigured; see the following to help you diagnose the problem: Otherwise, https://github.com/NVIDIA/nccl would be your destination for such questions.
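For context on what "higher busbw" means here: the benchmark derives bus bandwidth from the measured algorithmic bandwidth using the standard ring all-reduce correction factor. This is a sketch of that conversion under the usual nccl-tests convention; the exact helper names in `all_reduce_bench.py` may differ.

```python
def busbw_from_algbw(payload_bytes: float, seconds: float, n_ranks: int) -> float:
    """Convert a measured all_reduce time to bus bandwidth (bytes/s).

    algbw = bytes moved / time; in a ring all-reduce each rank effectively
    sends and receives 2*(n-1)/n of the payload, hence the correction factor.
    """
    algbw = payload_bytes / seconds               # algorithmic bandwidth, bytes/s
    return algbw * (2 * (n_ranks - 1) / n_ranks)  # bus bandwidth, bytes/s

# Example: a 4 GB payload reduced across 16 ranks in 0.1 s
print(busbw_from_algbw(4e9, 0.1, 16) / 1e9)  # 75.0 GB/s
```

So a low busbw on 16 ranks almost always means the inter-node links, not the GPUs, are the bottleneck.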
-
Hi,
I was wondering if you could help me with the older version of all_reduce_bench.py, which initially had the parameters below:

```python
# these emulate the payload which will become a M * N * 4-sized tensor below
N = 500000
M = 2000
```
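With those defaults the payload works out to roughly 4 GB per all_reduce — a quick sanity check on my part, not taken from the script itself:

```python
# Payload size of the older all_reduce_bench.py defaults:
# an M x N float32 tensor, 4 bytes per element.
N = 500000
M = 2000
payload_bytes = M * N * 4
print(payload_bytes)          # 4000000000 bytes
print(payload_bytes / 2**30)  # ~3.73 GiB
```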
The newer all_reduce_bench.py is not helping me run my test, and I am not sure which parameters I need to change to reach a higher busbw.
I am running the commands below:

```shell
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=0 --master_addr=192.168.1.91 --nnode=2 all_reduce_bench.py
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=1 --master_addr=192.168.1.91 --nnode=2 all_reduce_bench.py
```
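One common diagnostic when chasing low inter-node busbw is to enable NCCL's debug output before launching. These are standard NCCL environment variables, not anything specific to all_reduce_bench.py; the log will show which transport (InfiniBand vs. TCP sockets) the ConnectX-7 links are actually using.

```shell
# Standard NCCL diagnostics env vars (config fragment, set before launch):
export NCCL_DEBUG=INFO              # print transport/ring setup at init time
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus the log on init and network subsystems

# then launch the benchmark as before, e.g. on node 0:
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=0 \
  --master_addr=192.168.1.91 --nnodes=2 all_reduce_bench.py
```

If the log shows NCCL falling back to `NET/Socket` instead of `NET/IB`, the RDMA path is not being picked up, which would explain a low busbw.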
Device info: _CudaDeviceProperties(name='NVIDIA H100 80GB HBM3', major=9, minor=0, total_memory=81105MB, multi_processor_count=132, uuid=6a390cce-5191-1dbe-2a0f-897cc7b3beb4, L2_cache_size=50MB)
The average bandwidth of all_reduce over 16 ranks (5 warmups / 20 trials):
I am getting the above result with the host configuration below:
2 x Intel(R) Xeon(R) Platinum 8480+ (56 Core each processor)
2TB Memory
8 x MT2910 Family [ConnectX-7]
8 x 3.84TB NVMe Drives
It would be great if you could shed some light on how to achieve a higher busbw.
Regards,
Amit Vyas