OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named X #37314
Comments
There was an error with your download. Delete the model from ~/.cache/huggingface/hub and download it again.
@aroradrishan I've tried that a few times with no luck
This does seem to be an issue with reading the weights from many processes (hundreds) at once; loading from a single 8-H100 node works fine.
Might be
+1, can't get this to work
It looks like the issue is that if you don't have enough disk space, not all of the files will download, but no warning or error is printed.
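One way to test the disk-space theory is to diff the shards on disk against the checkpoint's index file. A sketch, assuming the standard sharded-safetensors layout where model.safetensors.index.json maps weight names to shard filenames:

```python
import json
import os

def find_missing_shards(snapshot_dir):
    """Compare the shard files on disk against model.safetensors.index.json
    and return any that are absent or truncated to zero bytes (a sign of a
    download that silently ran out of space)."""
    with open(os.path.join(snapshot_dir, "model.safetensors.index.json")) as f:
        index = json.load(f)
    expected = sorted(set(index["weight_map"].values()))
    missing = []
    for name in expected:
        path = os.path.join(snapshot_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

Pointing this at the snapshot directory under the cache (the one containing the index json) would show whether the error is really about files that never made it to disk.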
Same error with 9.5 TB of free space. Update: created a new env with no
Yeah, I have been able to load the model in a Python REPL with no deepspeed. However, when I try to load from the cache in a distributed setting I get the reported error. Using the same env, loading Mistral 24B doesn't have an issue either. So it seems to be something around the log of fetching files, which might be creating some lock on the files? I have had race-condition issues loading HF checkpoints in a distributed setting previously...
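If this is a race between ranks touching a cache that is still being written, a common workaround is to let only one process per node fetch the weights and make everyone else wait. A stdlib-only sketch: download_fn is a stand-in for whatever actually fetches the checkpoint (e.g. huggingface_hub.snapshot_download), and the sentinel filename is made up for illustration:

```python
import os
import time

def coordinate_download(snapshot_dir, local_rank, download_fn,
                        timeout_s=3600, poll_s=5):
    """Let local rank 0 fetch the weights while the other ranks wait on a
    sentinel file, instead of all processes racing on half-written shards.
    download_fn is a placeholder for the real fetch."""
    sentinel = os.path.join(snapshot_dir, ".download_complete")
    if local_rank == 0:
        download_fn()
        # Only drop the sentinel after the fetch finished cleanly.
        with open(sentinel, "w") as f:
            f.write("ok")
    else:
        deadline = time.time() + timeout_s
        while not os.path.exists(sentinel):
            if time.time() > deadline:
                raise TimeoutError("timed out waiting for weights to download")
            time.sleep(poll_s)
```

With a launcher that sets LOCAL_RANK, each process would call this before from_pretrained, so the later loads only ever see a fully materialized cache.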
System Info
I have transformers 4.51.0 and am trying to load the Llama 4 Scout model for training. I downloaded the safetensors files to disk and am pointing to that location with the cache_dir arg. It seems like each time I load the model with AutoModelForCausalLM I get a different error. One run it is

OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named model-00008-of-00050.safetensors. Checkout 'https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/tree/main' for available files.

and the next run it is
It seems to change each time I run the training script. I'm wondering if there is some error with loading this many files from the cache.
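The failing filename follows the standard model-XXXXX-of-YYYYY.safetensors sharding pattern, so the full set of files the loader expects can be reconstructed from any one of them and compared against the cache directory. A small sketch of that reconstruction:

```python
import re

def expected_shards(sample_name):
    """Given one shard name such as 'model-00008-of-00050.safetensors',
    list every shard filename the loader will look for."""
    m = re.fullmatch(r"(.+)-(\d+)-of-(\d+)\.safetensors", sample_name)
    if m is None:
        raise ValueError(f"not a sharded checkpoint name: {sample_name}")
    prefix, total = m.group(1), m.group(3)
    width = len(total)
    return [f"{prefix}-{i:0{width}d}-of-{total}.safetensors"
            for i in range(1, int(total) + 1)]
```

Diffing this list against os.listdir of the snapshot directory on each failing run would show whether the "missing" shard is genuinely absent from disk or whether a different shard is reported missing each time.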
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Load the model with
Expected behavior
Model is loaded into memory