
OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named X #37314

Open
sam-h-bean opened this issue Apr 6, 2025 · 8 comments

@sam-h-bean
Contributor

System Info

I have transformers 4.51.0 and am trying to load the Llama 4 Scout model for training. I downloaded the safetensors files to disk and am pointing to that location with the cache_dir arg. Each time I load the model with AutoModelForCausalLM I get a different error. On one run it is

OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named model-00008-of-00050.safetensors. Checkout 'https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/tree/main' for available files.

and the next run it is

OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have files named ('model-00002-of-00050.safetensors', 'model-00003-of-00050.safetensors', 'model-00004-of-00050.safetensors', 'model-00005-of-00050.safetensors', 'model-00006-of-00050.safetensors', 'model-00007-of-00050.safetensors', 'model-00008-of-00050.safetensors', 'model-00011-of-00050.safetensors', 'model-00012-of-00050.safetensors', 'model-00014-of-00050.safetensors', 'model-00016-of-00050.safetensors', 'model-00017-of-00050.safetensors'). Checkout 'https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/tree/main' for available files.

The set of reportedly missing files changes each time I run the training script. I'm wondering if there is some error in loading this many files from the cache.
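
For reference, a minimal sketch of checking that every shard listed in the safetensors index is actually present in the local snapshot (the cache path below is the default layout and an assumption; point it at whatever cache_dir is actually used):

import json
from pathlib import Path

# Assumption: default HF cache layout for this repo; adjust to the actual cache_dir.
snapshot_root = Path("~/.cache/huggingface/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct").expanduser()
index_files = list(snapshot_root.rglob("model.safetensors.index.json"))
assert index_files, "no safetensors index found under the snapshot"

with open(index_files[0]) as f:
  weight_map = json.load(f)["weight_map"]

expected = set(weight_map.values())  # e.g. model-00008-of-00050.safetensors
present = {p.name for p in index_files[0].parent.glob("*.safetensors")}
print("missing shards:", sorted(expected - present))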

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Load the model with

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  torch_dtype=torch.bfloat16,
  token=HF_TOKEN,  # a valid Hugging Face access token
  attn_implementation="flash_attention_2",
)
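
The System Info above mentions pointing at a local download via cache_dir; a minimal sketch of that variant (the path is hypothetical):

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  torch_dtype=torch.bfloat16,
  token=HF_TOKEN,
  attn_implementation="flash_attention_2",
  cache_dir="/data/hf-cache",  # hypothetical local directory holding the downloaded safetensors
)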

Expected behavior

Model is loaded into memory

sam-h-bean added the bug label Apr 6, 2025
@aroradrishan

Delete the model from ~/.cache/huggingface/hub and download again. There was an error with your download.

@sam-h-bean
Contributor Author

@aroradrishan I've tried that a few times with no luck.

@sam-h-bean
Contributor Author

This does seem to be an issue with reading the weights from many processes (hundreds) at once. A single 8-H100 node can load the model with device_map='auto'; however, loading from many processes in a Ray cluster leads to this error. Pretty odd...
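
Not something confirmed in this thread, but one pattern that avoids hundreds of processes writing to the cache at once is to materialize the snapshot a single time and then load it read-only everywhere; a sketch, assuming huggingface_hub is installed and all workers see the same filesystem:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM
import torch

# Run once (e.g. on the Ray driver) so only one process writes to the cache.
local_path = snapshot_download(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  token=HF_TOKEN,
)

# Every worker then loads from the already-complete local snapshot and never
# touches the network or writes to the cache.
model = AutoModelForCausalLM.from_pretrained(
  local_path,
  torch_dtype=torch.bfloat16,
  attn_implementation="flash_attention_2",
  local_files_only=True,
)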

@ArthurZucker
Collaborator

Might be hf_transfer!

@pb-sameereddy

+1 can't get this to work

@pb-sameereddy

It looks like the issue is that if you don't have enough disk space, not all of the files will download, but no warning or error is printed.
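
A quick way to check that before kicking off the download; a minimal sketch, with the default cache path as an assumption:

import os
import shutil

# Point this at wherever the model is actually being downloaded.
cache = os.path.expanduser("~/.cache/huggingface")
total, used, free = shutil.disk_usage(cache)
print(f"free: {free / 1e9:.0f} GB of {total / 1e9:.0f} GB")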

@yashkumaratri

yashkumaratri commented Apr 8, 2025

Same error with 9.5 TB space

Update:

Created new env with no deepspeed and hf_xet and it works.

@sam-h-bean
Contributor Author

Yeah, I have been able to load the model in a Python REPL with no deepspeed. However, when I try to load from the cache in a distributed setting I get the reported error. Using the same env to load Mistral 24B doesn't hit the issue either. So it seems to be something around this file-fetching step that might be creating some lock on the files? I have had race-condition issues loading HF checkpoints in a distributed setting before...
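
If the concurrent-access suspicion is right, one workaround is to serialize the first load per node behind a file lock; a sketch, assuming the filelock package (already a transformers dependency) and a hypothetical lock path:

import torch
from filelock import FileLock
from transformers import AutoModelForCausalLM

# Only one process per node populates the cache at a time; the rest wait on
# the lock and then read from a fully-downloaded cache.
with FileLock("/tmp/llama4_scout_load.lock"):  # hypothetical per-node lock path
  model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    torch_dtype=torch.bfloat16,
    token=HF_TOKEN,
    attn_implementation="flash_attention_2",
  )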
