[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3 #16197
Comments
Is there an env variable for torch.compile that you can set to false?
|
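There is no single documented switch confirmed in this thread, but two knobs are commonly used; their exact effect on v0.8.3 is an assumption here. `--enforce-eager` skips CUDA graph capture and compiled execution, while `VLLM_DISABLE_COMPILE_CACHE=1` (which appears in later comments) only bypasses the compile cache rather than disabling compilation:

```bash
# Hedged sketch: the effect of these knobs on vllm-openai v0.8.3 is an assumption, not verified here.
# --enforce-eager skips CUDA graph capture / compiled execution;
# VLLM_DISABLE_COMPILE_CACHE=1 only bypasses the compile cache, not compilation itself.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 4 \
  --enforce-eager
```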
Having the same issue with the Maverick-FP8 model on 4 x H100 (96GB). My configuration is as follows:
Running the latest v0.8.3 docker image. I have a strong feeling that the memory is not sufficient, because increasing the number of nodes to two (8 GPUs in total) fixes the problem and the model runs just fine. A more informative error message would have been great here. @rabaja check if increasing the number of GPUs helps. The debug logs are as follows
|
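For anyone reproducing the two-node workaround, a minimal sketch of the usual Ray-based multi-node launch follows; the IP address, port, and model name are placeholders, and it assumes both nodes can reach each other and pull the same image and weights:

```bash
# Hedged sketch of a 2-node x 4-GPU launch; 10.0.0.1 and port 6379 are placeholders.
# On the head node:
ray start --head --port=6379
# On the second node, join the cluster:
ray start --address=10.0.0.1:6379
# Back on the head node, spread tensor parallelism across all 8 GPUs:
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8
```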
I suspect this is an OOM error that vLLM could do a better job of surfacing; TP=2 wouldn't work due to insufficient memory. I recommend trying with TP=8 if available. @rabaja I also believe you aren't running v0.8.3, judging by a reference in your stack trace. cc @njhill since you changed the engine startup code recently. I think we can improve the error handling here |
Based on our tests, Scout BF16 can run on 4 x H100/A100 80GB, and Maverick FP8 requires 8 x H100 80GB. We are working on an int4 Scout model, which may fit on a single H100/A100 80GB card. |
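Those numbers line up with a rough weights-only estimate. Taking the published total parameter counts (~109B for Scout, ~400B for Maverick) as approximations:

```bash
# Weights-only estimate; ~109B (Scout) and ~400B (Maverick) total parameters are published
# figures, treated here as approximations. KV cache and activations need extra headroom on top.
echo "Scout BF16:   ~$((109 * 2)) GB of weights vs $((4 * 80)) GB of HBM on 4 x 80GB"
echo "Maverick FP8: ~$((400 * 1)) GB of weights vs $((8 * 80)) GB of HBM on 8 x 80GB"
```

The headroom left over after loading weights is what the KV cache and activations have to fit into, which is why Scout BF16 is tight but workable on 4 x 80GB while Maverick FP8 needs the full 8 x 80GB.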
@sarckk I am using v0.8.3. The pod was scheduled without issue:
Normal Scheduled 84s default-scheduler Successfully assigned ncgr-llama-4-scout-17b-16e-instruct-ns/vllm-meta-llama-4-scout-17b-16e-instruct-77f7db54cc-w6pgn to aks-gpu2-29549835-vmss00000h |
Running into this same error when running the unsloth bnb fp8 Scout model on 4xH100. Here is my command: docker run |
I will look into the error visibility issue today. I think it's preexisting and not caused by the recent #15906 PR, which was reverted for 0.8.3. That PR did introduce a new related issue where the server might not exit in this case, which has since been fixed by #16137. |
@houseroad @sarckk FWIW #11737 is the PR where the error visibility and related robustness changes are being done. |
+1 getting the same error. |
@brandonbiggs what is your command and how many / what GPUs are you running on? |
I tried on 2 and 4 80GB H100s. I won't be able to get the command until tomorrow morning, but I was using the latest vLLM container from Docker Hub. |
Facing the same issue with the following command:
CUDA_VISIBLE_DEVICE=2,3,4,5 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /model/Llama-4-Scout-17B-16E-Instruct/ --device cuda --override-generation-config='{"attn_temperature_tuning": true}' --max-model-len 524288 --tensor-parallel-size 4 --host 0.0.0.0 --port 8006
Specs: 4 x A100 (80GB) |
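One detail worth double-checking in the command above: the variable is written `CUDA_VISIBLE_DEVICE` without the trailing `S`, so CUDA ignores it and all GPUs remain visible. A corrected sketch, keeping the rest of the arguments unchanged:

```bash
# The CUDA env var that restricts visible GPUs is CUDA_VISIBLE_DEVICES (with the trailing S);
# everything else is kept as in the comment above.
CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /model/Llama-4-Scout-17B-16E-Instruct/ \
  --device cuda \
  --override-generation-config='{"attn_temperature_tuning": true}' \
  --max-model-len 524288 \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 --port 8006
```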
I think this error comes from the flash attention utils file not being found properly. I currently add this in my Dockerfile:
|
@sarckk sorry it took me so long to get back to this. Command:
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_DISABLE_COMPILE_CACHE=1
apptainer exec --nv \
docker://vllm/vllm-openai:v0.8.3 \
vllm serve Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 4 \
--max-model-len 64000 \
--override-generation-config='{"attn_temperature_tuning": true}'
Hardware: 4 x 80GB NVIDIA H100s |
I got this error with all models that have a tensor parallel size greater than 1, not just Llama 4. This is my docker container script
|
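Since the failure only appears with tensor parallelism greater than 1, one common Docker-side culprit is insufficient shared memory for NCCL between the worker processes. A generic sketch of the flags usually needed (this is not the script referenced above; the image tag and model are illustrative):

```bash
# Generic sketch, not the commenter's actual script: tensor parallelism inside Docker typically
# needs shared memory for NCCL, provided via --ipc=host or a large --shm-size.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.8.3 \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 4
```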
Your current environment
Any help will be appreciated.
🐛 Describe the bug
args:
Any help will be appreciated.