Reproduction
When running the axolotl multi-GPU test suite against transformers main, I get multiple regressions, mostly involving DeepSpeed ZeRO-3.
Bisecting with git and rerunning the tests shows that everything passes up to commit 880560040609b03e62cb2ee7ad505825efb158bb and starts failing at the merge of PR #36963.
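For readers who have not bisected a regression before, the search above can be sketched as a binary search over the commit range. This is only an illustration of the idea: the commit labels and the `is_good` check below are hypothetical stand-ins, and in practice `git bisect run <test command>` performs the same search by checking out each candidate and running the tests.

```python
def first_bad_commit(commits, is_good):
    """Return the first commit for which is_good() is False.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad (the same assumption git bisect makes).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good index, hi: known bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression was introduced at or before mid
    return commits[hi]


# Hypothetical history: labels are placeholders, not real commit hashes.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
# Stand-in for "check out the commit and run the multi-GPU tests".
culprit = first_bad_commit(history, lambda c: history.index(c) < 5)
print(culprit)
```

Each step halves the remaining range, so even a few hundred commits need only a handful of test runs.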
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.q_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.k_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.v_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.o_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.gate_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.up_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.mlp.down_proj.weight", whose dimensions in the model are torch.Size([576, 1536]) and whose dimensions in the checkpoint are torch.Size([576, 1536]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.input_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.28.post_attention_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.q_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.k_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.v_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.self_attn.o_proj.weight", whose dimensions in the model are torch.Size([576, 576]) and whose dimensions in the checkpoint are torch.Size([576, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.gate_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.up_proj.weight", whose dimensions in the model are torch.Size([1536, 576]) and whose dimensions in the checkpoint are torch.Size([1536, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.mlp.down_proj.weight", whose dimensions in the model are torch.Size([576, 1536]) and whose dimensions in the checkpoint are torch.Size([576, 1536]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.input_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.layers.29.post_attention_layernorm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.norm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
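In every line of this excerpt the model and checkpoint shapes agree, so the failure does not look like a shape mismatch; the only reported error is the meta-tensor copy. For a longer log, a small parser can confirm this quickly. The sketch below is a hypothetical helper, not part of transformers or axolotl, and it assumes each log entry sits on a single unwrapped line in the format shown above.

```python
import re

# Matches the per-parameter lines seen in the stderr output above.
LINE_RE = re.compile(
    r'While copying the parameter named "(?P<name>[^"]+)", '
    r"whose dimensions in the model are torch\.Size\(\[(?P<model>[^\]]*)\]\) "
    r"and whose dimensions in the checkpoint are torch\.Size\(\[(?P<ckpt>[^\]]*)\]\), "
    r"an exception occurred : \('(?P<error>[^']*)'"
)


def parse_shape(text):
    """Turn '192, 576' into (192, 576)."""
    return tuple(int(dim) for dim in text.split(",") if dim.strip())


def summarize(log_text):
    """Extract parameter name, both shapes, and the error message per entry."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        rows.append({
            "name": match["name"],
            "model_shape": parse_shape(match["model"]),
            "ckpt_shape": parse_shape(match["ckpt"]),
            "error": match["error"],
        })
    return rows


# Two entries copied from the log above.
SAMPLE = """
stderr: [rank0]: While copying the parameter named "model.layers.28.self_attn.k_proj.weight", whose dimensions in the model are torch.Size([192, 576]) and whose dimensions in the checkpoint are torch.Size([192, 576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
stderr: [rank0]: While copying the parameter named "model.norm.weight", whose dimensions in the model are torch.Size([576]) and whose dimensions in the checkpoint are torch.Size([576]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
"""

rows = summarize(SAMPLE)
mismatched = [r["name"] for r in rows if r["model_shape"] != r["ckpt_shape"]]
distinct_errors = {r["error"] for r in rows}
print(f"{len(rows)} entries, {len(mismatched)} shape mismatches, errors: {distinct_errors}")
```

An empty `mismatched` list together with a single distinct error message points at how the tensors are materialized during loading rather than at the checkpoint contents.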
Expected behavior
DeepSpeed ZeRO-3 should work as it did before PR #36963 was merged.
System Info
transformers version: 4.51.0.dev0
Who can help?
No response