Replies: 1 comment
-
The new version automatically scans a range of payload sizes, but it is essentially equivalent to running a single 4GiB payload. Most likely your inter-node setup is misconfigured; see the following to help you diagnose the problem: Otherwise, https://github.com/NVIDIA/nccl would be your destination for such questions.
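For context on what "higher busbw" means here: the benchmark derives bus bandwidth from the measured algorithmic bandwidth using the standard ring all-reduce correction factor. This is a sketch of that conversion under the usual nccl-tests convention; the exact helper names in `all_reduce_bench.py` may differ.

```python
def busbw_from_algbw(payload_bytes: float, seconds: float, n_ranks: int) -> float:
    """Convert a measured all_reduce time to bus bandwidth (bytes/s).

    algbw = bytes moved / time; in a ring all-reduce each rank effectively
    sends and receives 2*(n-1)/n of the payload, hence the correction factor.
    """
    algbw = payload_bytes / seconds               # algorithmic bandwidth, bytes/s
    return algbw * (2 * (n_ranks - 1) / n_ranks)  # bus bandwidth, bytes/s

# Example: a 4 GB payload reduced across 16 ranks in 0.1 s
print(busbw_from_algbw(4e9, 0.1, 16) / 1e9)  # 75.0 GB/s
```

So a low busbw on 16 ranks almost always means the inter-node links, not the GPUs, are the bottleneck.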
-
Hi,
I was wondering if you could help me with the older version of all_reduce_bench.py, which initially had the parameters below:

```python
# these emulate the payload which will become a M * N * 4-sized tensor below
N = 500000
M = 2000
```
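With those defaults the payload works out to roughly 4 GB per all_reduce — a quick sanity check on my part, not taken from the script itself:

```python
# Payload size of the older all_reduce_bench.py defaults:
# an M x N float32 tensor, 4 bytes per element.
N = 500000
M = 2000
payload_bytes = M * N * 4
print(payload_bytes)          # 4000000000 bytes
print(payload_bytes / 2**30)  # ~3.73 GiB
```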
The newer all_reduce_bench.py is not helping me run my test, and I am not sure which parameters I need to change to reach a higher busbw.
I am running the commands below:

```shell
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=0 --master_addr=192.168.1.91 --nnode=2 all_reduce_bench.py
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=1 --master_addr=192.168.1.91 --nnode=2 all_reduce_bench.py
```
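One common diagnostic when chasing low inter-node busbw is to enable NCCL's debug output before launching. These are standard NCCL environment variables, not anything specific to all_reduce_bench.py; the log will show which transport (InfiniBand vs. TCP sockets) the ConnectX-7 links are actually using.

```shell
# Standard NCCL diagnostics env vars (config fragment, set before launch):
export NCCL_DEBUG=INFO              # print transport/ring setup at init time
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus the log on init and network subsystems

# then launch the benchmark as before, e.g. on node 0:
python -u -m torch.distributed.run --nproc_per_node=8 --node_rank=0 \
  --master_addr=192.168.1.91 --nnodes=2 all_reduce_bench.py
```

If the log shows NCCL falling back to `NET/Socket` instead of `NET/IB`, the RDMA path is not being picked up, which would explain a low busbw.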
Device info: _CudaDeviceProperties(name='NVIDIA H100 80GB HBM3', major=9, minor=0, total_memory=81105MB, multi_processor_count=132, uuid=6a390cce-5191-1dbe-2a0f-897cc7b3beb4, L2_cache_size=50MB)
The average bandwidth of all_reduce over 16 ranks (5 warmups / 20 trials):
I am getting the above result with the host configuration below:
2 x Intel(R) Xeon(R) Platinum 8480+ (56 Core each processor)
2TB Memory
8 x MT2910 Family [ConnectX-7]
8 x 3.84TB NVMe Drives
It would be great if you could shed some light on how to achieve a higher busbw.
Regards,
Amit Vyas