
OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named X #37314

Open
sam-h-bean opened this issue Apr 6, 2025 · 8 comments

@sam-h-bean
Contributor

System Info

I have transformers 4.51.0 and am trying to load the Llama 4 Scout model for training. I downloaded the safetensors files to disk and am pointing to that location with the cache_dir arg. Each time I load the model with AutoModelForCausalLM I get a different error. On one run it is

OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have a file named model-00008-of-00050.safetensors. Checkout 'https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/tree/main' for available files.

and the next run it is

OSError: meta-llama/Llama-4-Scout-17B-16E-Instruct does not appear to have files named ('model-00002-of-00050.safetensors', 'model-00003-of-00050.safetensors', 'model-00004-of-00050.safetensors', 'model-00005-of-00050.safetensors', 'model-00006-of-00050.safetensors', 'model-00007-of-00050.safetensors', 'model-00008-of-00050.safetensors', 'model-00011-of-00050.safetensors', 'model-00012-of-00050.safetensors', 'model-00014-of-00050.safetensors', 'model-00016-of-00050.safetensors', 'model-00017-of-00050.safetensors'). Checkout 'https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/tree/main' for available files.

The set of reportedly missing files changes each time I run the training script. I'm wondering if there is some error in loading this many files from the cache.
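
For reference, a minimal sketch of checking that every shard listed in the safetensors index is actually present in the local snapshot (the cache path below is the default layout and an assumption; point it at whatever cache_dir is actually used):

import json
from pathlib import Path

# Assumption: default HF cache layout for this repo; adjust to the actual cache_dir.
snapshot_root = Path("~/.cache/huggingface/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct").expanduser()
index_files = list(snapshot_root.rglob("model.safetensors.index.json"))
assert index_files, "no safetensors index found under the snapshot"

with open(index_files[0]) as f:
  weight_map = json.load(f)["weight_map"]

expected = set(weight_map.values())  # e.g. model-00008-of-00050.safetensors
present = {p.name for p in index_files[0].parent.glob("*.safetensors")}
print("missing shards:", sorted(expected - present))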

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Load the model with

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  torch_dtype=torch.bfloat16,
  token=HF_TOKEN,  # a valid Hugging Face access token
  attn_implementation="flash_attention_2",
)
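
The System Info above mentions pointing at a local download via cache_dir; a minimal sketch of that variant (the path is hypothetical):

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  torch_dtype=torch.bfloat16,
  token=HF_TOKEN,
  attn_implementation="flash_attention_2",
  cache_dir="/data/hf-cache",  # hypothetical local directory holding the downloaded safetensors
)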

Expected behavior

Model is loaded into memory

sam-h-bean added the bug label Apr 6, 2025
@aroradrishan

Delete the model from ~/.cache/huggingface/hub and download again. There was an error with your download.

@sam-h-bean
Contributor Author

@aroradrishan I've tried that a few times with no luck.

@sam-h-bean
Contributor Author

This does seem to be an issue with reading the weights from many processes (hundreds) at once. A single 8-H100 node can load the model with device_map='auto'; however, loading from many processes in a Ray cluster leads to this error. Pretty odd...
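
Not something confirmed in this thread, but one pattern that avoids hundreds of processes writing to the cache at once is to materialize the snapshot a single time and then load it read-only everywhere; a sketch, assuming huggingface_hub is installed and all workers see the same filesystem:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM
import torch

# Run once (e.g. on the Ray driver) so only one process writes to the cache.
local_path = snapshot_download(
  "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  token=HF_TOKEN,
)

# Every worker then loads from the already-complete local snapshot and never
# touches the network or writes to the cache.
model = AutoModelForCausalLM.from_pretrained(
  local_path,
  torch_dtype=torch.bfloat16,
  attn_implementation="flash_attention_2",
  local_files_only=True,
)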

@ArthurZucker
Collaborator

Might be hf_transfer!

@pb-sameereddy

+1 can't get this to work

@pb-sameereddy

It looks like the issue is that if you don't have enough disk space, not all of the files will download, but no warning or error is printed.
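
A quick way to check that before kicking off the download; a minimal sketch, with the default cache path as an assumption:

import os
import shutil

# Point this at wherever the model is actually being downloaded.
cache = os.path.expanduser("~/.cache/huggingface")
total, used, free = shutil.disk_usage(cache)
print(f"free: {free / 1e9:.0f} GB of {total / 1e9:.0f} GB")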

@yashkumaratri

yashkumaratri commented Apr 8, 2025

Same error with 9.5 TB space

Update:

Created new env with no deepspeed and hf_xet and it works.

@sam-h-bean
Contributor Author

Yeah, I have been able to load the model in a Python REPL with no deepspeed. However, when I try to load from the cache in a distributed setting I get the reported error. Using the same env to load Mistral 24B doesn't hit the issue either. So it seems to be something around this file-fetching step that might be creating some lock on the files? I have had race-condition issues loading HF checkpoints in a distributed setting before...
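
If the concurrent-access suspicion is right, one workaround is to serialize the first load per node behind a file lock; a sketch, assuming the filelock package (already a transformers dependency) and a hypothetical lock path:

import torch
from filelock import FileLock
from transformers import AutoModelForCausalLM

# Only one process per node populates the cache at a time; the rest wait on
# the lock and then read from a fully-downloaded cache.
with FileLock("/tmp/llama4_scout_load.lock"):  # hypothetical per-node lock path
  model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    torch_dtype=torch.bfloat16,
    token=HF_TOKEN,
    attn_implementation="flash_attention_2",
  )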
