
Deepspeed zero-3 failures in main #37300

Closed

winglian opened this issue Apr 5, 2025 · 2 comments


winglian commented Apr 5, 2025

System Info

  • transformers version: 4.51.0.dev0
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • Huggingface_hub version: 0.30.1
  • Safetensors version: 0.5.3
  • Accelerate version: 1.5.2
  • Accelerate config: not found
  • DeepSpeed version: 0.15.4
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When running the axolotl multi-GPU test suite, I get multiple regressions on main, mostly involving DeepSpeed ZeRO-3.
Bisecting the git history and rerunning the tests, everything passes up to commit 880560040609b03e62cb2ee7ad505825efb158bb and starts failing at the merge of PR #36963.
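The bisect workflow described above can be sketched as follows. This is a toy illustration: a throwaway repo stands in for huggingface/transformers, and a trivial `grep` stands in for the axolotl multi-GPU test suite (both are hypothetical substitutions, not the actual commands used here).

```shell
# Build a throwaway repo with a known-good history, one commit that
# introduces a regression, and a later commit on top of it.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
echo ok > status.txt
git add status.txt
git commit -qm "good 1"
git commit -qm "good 2" --allow-empty
echo fail > status.txt
git commit -qam "introduces regression"
git commit -qm "later commit" --allow-empty

# Mark the current HEAD bad and the oldest commit good, then let
# `git bisect run` drive: exit 0 marks a commit good, non-zero bad.
git bisect start HEAD HEAD~3
git bisect run sh -c 'grep -q ok status.txt' | tee bisect.log
git bisect reset
```

With the real repository, the test command would be the failing axolotl test invocation instead of the `grep`.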

[gw0] [ 43%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16.json-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16.json-2]
[gw1] [ 45%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16_cpuoffload_all.json-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16.json-1]
[gw0] [ 48%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16.json-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16_cpuoffload_all.json-1]
[gw1] [ 51%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16.json-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16.json-2]
[gw0] [ 54%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[True-deepspeed_configs/zero3_bf16_cpuoffload_all.json-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16_cpuoffload_all.json-1]
[gw1] [ 56%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16.json-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[True-2]
[gw0] [ 59%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16_cpuoffload_all.json-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16_cpuoffload_all.json-2]
[gw0] [ 62%] FAILED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero3_packed[False-deepspeed_configs/zero3_bf16_cpuoffload_all.json-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[True-1]
[gw1] [ 64%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[True-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[False-1]
[gw0] [ 67%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[True-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[False-2]
[gw1] [ 70%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[False-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[True-1]
[gw0] [ 72%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero2_packed[False-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[True-2]
[gw1] [ 75%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[True-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[False-1]
[gw0] [ 78%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[True-2]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[False-2]
[gw1] [ 81%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[False-1]
tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_fix_untrained_tokens 
[gw0] [ 83%] PASSED tests/e2e/multigpu/test_llama.py::TestMultiGPULlama::test_ds_zero1_packed[False-2]
tests/e2e/multigpu/test_qwen2.py::TestMultiGPUQwen2::test_qlora_fsdp_dpo[Qwen/Qwen2-0.5B]
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.q_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.k_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.v_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.o_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.gate_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.up_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.down_proj.weight", whose dimensions in the model are torch.Size([576, 1536]) and whose dimensions in the checkpoint are torch.Size([576, 1536]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.input_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.post_attention_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.q_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.k_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.v_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.o_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.gate_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.up_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.down_proj.weight", whose dimensions in the model are torch.Size([576, 1536]) and whose dimensions in the checkpoint are torch.Size([576, 1536]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.input_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.post_attention_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.norm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
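The repeated failure in the log is PyTorch refusing to copy out of a tensor on the "meta" device, which records only shape and dtype and allocates no storage. A minimal sketch of the underlying error (the shapes are taken from the log; this is an illustration of the PyTorch behavior, not of the transformers loading path itself):

```python
import torch

# A meta tensor has shape/dtype metadata but no backing storage.
meta_weight = torch.empty(576, 576, device="meta")
real_weight = torch.empty(576, 576)

try:
    # Copying *from* a meta tensor fails: there is no data to read.
    real_weight.copy_(meta_weight)
except (NotImplementedError, RuntimeError) as exc:
    print(exc)  # message includes "Cannot copy out of meta tensor; no data!"
```

This matches the symptom above: under ZeRO-3, some parameters apparently remained on the meta device when the checkpoint load tried to populate them.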

Expected behavior

DeepSpeed ZeRO-3 training should work as it did before the regression.

winglian added the bug label Apr 5, 2025

winglian commented Apr 5, 2025

Looks like you're on this, and it will get addressed with #37281.

ArthurZucker (Collaborator) commented:

Closing as fixed!
