
[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3 #16197


Open
1 task done
rabaja opened this issue Apr 7, 2025 · 16 comments
Labels
bug Something isn't working

Comments

@rabaja

rabaja commented Apr 7, 2025

Your current environment

  1. Download Llama-4-Scout-17B-16E-Instruct onto a PVC.
  2. Deploy the model on Azure Kubernetes Service (AKS) on A100s, 2 GPUs with 80 GB each.
  3. Use the arguments below:
args:
        - "--model"
        - "/mnt/models/meta-llama-4-scout-17b-16e-instruct"
        - "--api-key"
        - "$(VLLM_API_KEY)"
        - "--tensor-parallel-size"
        - "2"
        - "--dtype"
        - "bfloat16"
        - "--port"
        - "8000"
        - "--max-model-len"
        - "32768"
        - "--max-num-batched-tokens"
        - "32768"
        - "--max-num-seqs"
        - "16"
        - "--gpu-memory-utilization"
        - "0.99"
        - "--served-model-name"
        - "Llama-4-Scout-17B-16E-Instruct"
        - "--trust-remote-code"
        - "--disable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
  4. Getting the following error:
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_57eb3085'), local_subscribe_addr='ipc:///tmp/8f0dd0fa-95b6-4959-8738-3b5acb47a883', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) Process SpawnProcess-1:1:
CRITICAL 04-07 08:59:27 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-07 08:59:27 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaf1be216fd147d3d - Init COMPLETE
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 69, in make_client
    return AsyncMPClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 570, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 401, in __init__
    engine.proc_handle.wait_for_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 127, in wait_for_startup
    if self.reader.recv()["status"] != "READY":
       ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
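
For reference, the same flags expressed as a standalone vllm serve command (a sketch only; the flags mirror the Kubernetes args above, and the API key value comes from the environment):

vllm serve /mnt/models/meta-llama-4-scout-17b-16e-instruct \
    --api-key "$VLLM_API_KEY" \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --port 8000 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.99 \
    --served-model-name Llama-4-Scout-17B-16E-Instruct \
    --trust-remote-code \
    --disable-log-requests \
    --enable-chunked-prefill \
    --enable-prefix-caching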

Any help will be appreciated.

🐛 Describe the bug

The steps and error output are the same as described under "Your current environment" above.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@rabaja rabaja added the bug Something isn't working label Apr 7, 2025
@SquadUpSquid

(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] torch.compile is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.

Is there an environment variable for torch.compile that you can set to false, e.g. TORCH_COMPILE=0?

@lazariv

lazariv commented Apr 7, 2025

Having the same issue with the Maverick FP8 model on 4 x H100 (96 GB). My configuration is as follows:

singularity exec --nv vllm.sif vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --port 3010 --trust-remote-code --disable-log-requests --download-dir $volume --max-model-len 1000 --max-num-batched-tokens 1024 --kv-cache-dtype fp8 --tensor-parallel-size 4

I'm running the latest v0.8.3 Docker image. I have a strong feeling that the memory is not sufficient, because increasing the number of nodes to two (8 GPUs in total) fixes the problem and the model runs just fine. A more informative error message would have been great here.

@rabaja check if increasing the number of GPUs helps.

The debug logs are as follows

DEBUG 04-07 19:38:47 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:38:47 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:38:47 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:38:47 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:38:47 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:38:47 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:38:47 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:38:47 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:38:47 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:38:47 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:38:47 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:38:47 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:38:47 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:38:47 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:38:47 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:38:47 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:38:47 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-07 19:38:50 [utils.py:135] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-07 19:38:50 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-07 19:38:50 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-07 19:38:50 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', config='', host=None, port=3009, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir='/data/cat/ws/scadsllm-llm-infrastructure-models-cat/data', load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8', max_model_len=1000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=1024, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f319f428f40>)
INFO 04-07 19:39:02 [config.py:600] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
DEBUG 04-07 19:39:09 [arg_utils.py:1673] Setting max_num_seqs to 1024 for OPENAI_API_SERVER usage context.
INFO 04-07 19:39:09 [config.py:1222] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 04-07 19:39:09 [config.py:1600] Defaulting to use mp for distributed inference
INFO 04-07 19:39:09 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=1024.
/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
DEBUG 04-07 19:39:16 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:39:16 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:39:16 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:39:16 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:16 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:39:16 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:39:16 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:39:16 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:39:16 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:39:16 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:39:16 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:39:16 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:39:16 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:39:16 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:39:16 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:16 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:39:16 [__init__.py:239] Automatically detected platform cuda.
INFO 04-07 19:39:21 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1000, download_dir='/data/cat/ws/scadsllm-llm-infrastructure-models-cat/data', load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
DEBUG 04-07 19:39:21 [shm_broadcast.py:219] Binding to ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c
INFO 04-07 19:39:21 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_e75e114a'), local_subscribe_addr='ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c', remote_subscribe_addr=None, remote_addr_ipv6=False)
/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
DEBUG 04-07 19:39:26 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:26 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:39:26 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:39:26 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:39:26 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:26 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:39:26 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:39:26 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:39:26 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:39:26 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:39:26 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:39:26 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:39:26 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:39:26 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:39:26 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:39:26 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:27 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:39:27 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-07 19:39:31 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:31 [__init__.py:28] No plugins for group vllm.general_plugins found.
WARNING 04-07 19:39:33 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f5fa42640b0>
DEBUG 04-07 19:39:33 [config.py:3773] enabled custom ops: Counter()
DEBUG 04-07 19:39:33 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=0 pid=16624) DEBUG 04-07 19:39:33 [shm_broadcast.py:288] Connecting to ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c
(VllmWorker rank=0 pid=16624) DEBUG 04-07 19:39:33 [shm_broadcast.py:219] Binding to ipc:///tmp/c96a2995-1342-40c9-948e-a4712e87c263
(VllmWorker rank=0 pid=16624) INFO 04-07 19:39:33 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_920bbb50'), local_subscribe_addr='ipc:///tmp/c96a2995-1342-40c9-948e-a4712e87c263', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 04-07 19:39:33 [shm_broadcast.py:288] Connecting to ipc:///tmp/c96a2995-1342-40c9-948e-a4712e87c263
(VllmWorker rank=0 pid=16624) DEBUG 04-07 19:39:33 [parallel_state.py:820] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40055 backend=nccl
[W407 19:39:33.061058051 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W407 19:39:33.069730764 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
DEBUG 04-07 19:39:38 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:38 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:39:38 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:39:38 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:39:38 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:38 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:39:38 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:39:38 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:39:38 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:39:38 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:39:38 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:39:38 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:39:38 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:39:38 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:39:38 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:39:38 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:39 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:39:39 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-07 19:39:43 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:43 [__init__.py:28] No plugins for group vllm.general_plugins found.
WARNING 04-07 19:39:44 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9d9e8d8bc0>
DEBUG 04-07 19:39:44 [config.py:3773] enabled custom ops: Counter()
DEBUG 04-07 19:39:44 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:39:44 [shm_broadcast.py:288] Connecting to ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:39:44 [shm_broadcast.py:219] Binding to ipc:///tmp/778b5d0c-3c3a-405c-a5b3-cf68629eea06
(VllmWorker rank=1 pid=16656) INFO 04-07 19:39:44 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5f61059a'), local_subscribe_addr='ipc:///tmp/778b5d0c-3c3a-405c-a5b3-cf68629eea06', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 04-07 19:39:44 [shm_broadcast.py:288] Connecting to ipc:///tmp/778b5d0c-3c3a-405c-a5b3-cf68629eea06
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:39:44 [parallel_state.py:820] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40055 backend=nccl
[W407 19:39:44.514501293 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W407 19:39:44.521468510 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
DEBUG 04-07 19:39:49 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:50 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:39:50 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:39:50 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:39:50 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:50 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:39:50 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:39:50 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:39:50 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:39:50 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:39:50 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:39:50 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:39:50 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:39:50 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:39:50 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:39:50 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:39:50 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:39:50 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-07 19:39:54 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:39:54 [__init__.py:28] No plugins for group vllm.general_plugins found.
WARNING 04-07 19:39:55 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f316a5040b0>
DEBUG 04-07 19:39:55 [config.py:3773] enabled custom ops: Counter()
DEBUG 04-07 19:39:55 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:39:55 [shm_broadcast.py:288] Connecting to ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:39:55 [shm_broadcast.py:219] Binding to ipc:///tmp/71d6f7ed-6a46-4380-a16c-d10bafcf7e4c
(VllmWorker rank=2 pid=16683) INFO 04-07 19:39:55 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6d8261ab'), local_subscribe_addr='ipc:///tmp/71d6f7ed-6a46-4380-a16c-d10bafcf7e4c', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 04-07 19:39:55 [shm_broadcast.py:288] Connecting to ipc:///tmp/71d6f7ed-6a46-4380-a16c-d10bafcf7e4c
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:39:56 [parallel_state.py:820] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:40055 backend=nccl
[W407 19:39:56.682040965 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W407 19:39:56.689849060 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
DEBUG 04-07 19:40:00 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
DEBUG 04-07 19:40:01 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-07 19:40:01 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-07 19:40:01 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-07 19:40:01 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:40:01 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-07 19:40:01 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-07 19:40:01 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-07 19:40:01 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-07 19:40:01 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-07 19:40:01 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-07 19:40:01 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-07 19:40:01 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-07 19:40:01 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-07 19:40:01 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-07 19:40:01 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-07 19:40:01 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-07 19:40:01 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-07 19:40:05 [__init__.py:28] No plugins for group vllm.general_plugins found.
DEBUG 04-07 19:40:05 [multiproc_executor.py:351] Waiting for WorkerProc to startup.
WARNING 04-07 19:40:06 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fc50f5404d0>
DEBUG 04-07 19:40:06 [config.py:3773] enabled custom ops: Counter()
DEBUG 04-07 19:40:06 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:06 [shm_broadcast.py:288] Connecting to ipc:///tmp/711ca5bd-2093-4d56-b761-f77c150d1a9c
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:06 [shm_broadcast.py:219] Binding to ipc:///tmp/d0078120-5d13-4375-a91f-9062ba5540af
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:06 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dd922aac'), local_subscribe_addr='ipc:///tmp/d0078120-5d13-4375-a91f-9062ba5540af', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 04-07 19:40:06 [shm_broadcast.py:288] Connecting to ipc:///tmp/d0078120-5d13-4375-a91f-9062ba5540af
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:06 [parallel_state.py:820] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:40055 backend=nccl
[W407 19:40:07.633721022 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W407 19:40:07.649928281 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/scadsllm/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/scadsllm/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/scadsllm/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:10 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/scadsllm/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=16624) DEBUG 04-07 19:40:10 [shm_broadcast.py:219] Binding to ipc:///tmp/20b14d25-44b1-4395-bcb9-887d599567e1
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:10 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_a6c33163'), local_subscribe_addr='ipc:///tmp/20b14d25-44b1-4395-bcb9-887d599567e1', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:10 [shm_broadcast.py:288] Connecting to ipc:///tmp/20b14d25-44b1-4395-bcb9-887d599567e1
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:10 [shm_broadcast.py:288] Connecting to ipc:///tmp/20b14d25-44b1-4395-bcb9-887d599567e1
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:10 [shm_broadcast.py:288] Connecting to ipc:///tmp/20b14d25-44b1-4395-bcb9-887d599567e1
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:10 [parallel_state.py:957] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:10 [parallel_state.py:957] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:10 [parallel_state.py:957] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:10 [parallel_state.py:957] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:10 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:10 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=16624) INFO 04-07 19:40:10 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:10 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:15 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8...
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:15 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8...
(VllmWorker rank=2 pid=16683) INFO 04-07 19:40:15 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama4.Llama4Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=3 pid=16719) INFO 04-07 19:40:15 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama4.Llama4Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rms_norm': 90, 'silu_and_mul': 45, 'rotary_embedding': 2})
(VllmWorker rank=2 pid=16683) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=2 pid=16683) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rms_norm': 90, 'silu_and_mul': 45, 'rotary_embedding': 2})
(VllmWorker rank=2 pid=16683) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rms_norm': 90, 'silu_and_mul': 45, 'rotary_embedding': 2})
(VllmWorker rank=3 pid=16719) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rms_norm': 90, 'silu_and_mul': 45, 'rotary_embedding': 2})
(VllmWorker rank=3 pid=16719) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter()
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:15 [gpu_model_runner.py:1258] Starting to load model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8...
(VllmWorker rank=1 pid=16656) INFO 04-07 19:40:15 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama4.Llama4Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorker rank=2 pid=16683) Process SpawnProcess-1:3:
CRITICAL 04-07 19:40:15 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-07 19:40:15 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rotary_embedding': 1})
(VllmWorker rank=1 pid=16656) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3773] enabled custom ops: Counter()
(VllmWorker rank=1 pid=16656) DEBUG 04-07 19:40:15 [config.py:3775] disabled custom ops: Counter({'rotary_embedding': 1})
(VllmWorker rank=1 pid=16656) WARNING 04-07 19:40:15 [config.py:3785] `torch.compile` is turned on, but the model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=3 pid=16719) DEBUG 04-07 19:40:15 [multiproc_executor.py:327] Worker interrupted.
/var/spool/slurmd/job196612/slurm_script: line 76: 16489 Killed                  singularity exec --nv vllm.sif vllm serve 

@w013nad

w013nad commented Apr 7, 2025

#16127

@sarckk
Collaborator

sarckk commented Apr 7, 2025

I suspect this is an OOM error that vLLM could do a better job of surfacing; TP=2 wouldn't work due to insufficient memory. I recommend trying with TP=8 if available.

@rabaja I also believe you aren't running on v0.8.3 because the stack trace contains a reference to engine.proc_handle.wait_for_startup() which was removed in #15906 as part of the release.

cc @njhill since you changed the engine startup code recently. I think we can improve the error handling here
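
For example, a minimal sketch of that suggestion applied to the command from the original report (assuming a node with 8 visible GPUs; only the most relevant flags are shown):

vllm serve /mnt/models/meta-llama-4-scout-17b-16e-instruct \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --served-model-name Llama-4-Scout-17B-16E-Instruct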

@sarckk sarckk moved this from Backlog to In progress in Llama-4 Issues & Bugs Apr 7, 2025
@houseroad
Collaborator

Based on our tests, Scout BF16 can run on 4 x H100/A100 80GB, and Maverick FP8 requires 8 x H100 80GB.

We are working on an int4 Scout model, which may fit on a single H100/A100 80GB card.
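
For example, a 4-GPU Scout BF16 launch along those lines might look like this (a sketch only; the Hugging Face model ID and context length are illustrative):

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 32768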

@rabaja
Author

rabaja commented Apr 8, 2025

@sarckk I am using v0.8.3.
Events:
Type Reason Age From Message


Normal Scheduled 84s default-scheduler Successfully assigned ncgr-llama-4-scout-17b-16e-instruct-ns/vllm-meta-llama-4-scout-17b-16e-instruct-77f7db54cc-w6pgn to aks-gpu2-29549835-vmss00000h
Normal Pulling 81s kubelet Pulling image "vllm/vllm-openai:v0.8.3"
Normal Pulled 81s kubelet Successfully pulled image "vllm/vllm-openai:v0.8.3" in 635ms (635ms including waiting)
Normal Created 27s (x2 over 81s) kubelet Created container vllm-server
Normal Pulled 27s kubelet Container image "vllm/vllm-openai:v0.8.3" already present on machine
Normal Started 26s (x2 over 81s) kubelet Started container vllm-server

@magdyksaleh

Running into this same error when running the unsloth bnb fp8 Scout model on 4xH100.

Here is my command:

docker run \
    -e VLLM_DISABLE_COMPILE_CACHE=1 \
    -e CUDA_VISIBLE_DEVICES=2,3,5,7 \
    --gpus all \
    --runtime=nvidia \
    vllm/vllm-openai:v0.8.3 \
    --model unsloth/Llama-4-Scout-17B-16E-unsloth-bnb-8bit \
    --quantization bitsandbytes --load-format bitsandbytes \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 4

@njhill
Member

njhill commented Apr 8, 2025

I suspect this is an OOM error that vLLM could do a better job of surfacing; TP=2 wouldn't work due to insufficient memory. I recommend trying with TP=8 if available.

@rabaja I also believe you aren't running on v0.8.3 because the stack trace contains a reference to engine.proc_handle.wait_for_startup() which was removed in #15906 as part of the release.

cc @njhill since you changed the engine startup code recently. I think we can improve the error handling here

I will look into the error visibility issue today. I think it's pre-existing and not caused by the recent #15906 PR, which was reverted for 0.8.3. That PR did introduce a new related issue where the server might not exit in this case, which has since been fixed by #16137.

@njhill
Member

njhill commented Apr 9, 2025

@houseroad @sarckk FWIW #11737 is the PR where the error visibility and related robustness changes are being done.

@brandonbiggs

+1 getting the same error.

@sarckk
Collaborator

sarckk commented Apr 9, 2025

@brandonbiggs what is your command and how many / what GPUs are you running on?

@brandonbiggs

I tried on 2 and 4 80 GB H100s. I won't be able to get the command until tomorrow morning, but I was using the latest vLLM container from Docker Hub.

@harishd1998

Facing the same issue using this command:

CUDA_VISIBLE_DEVICE=2,3,4,5 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /model/Llama-4-Scout-17B-16E-Instruct/ --device cuda --override-generation-config='{"attn_temperature_tuning": true}' --max-model-len 524288 --tensor-parallel-size 4 --host 0.0.0.0 --port 8006

Specs: 4 x A100 (80 GB)
vLLM version: v0.8.3

@rakshith-writer

I think this error comes from the flash attention utils file not being found properly. I currently add this to my Dockerfile:

COPY vllm/vllm_flash_attn/fa_utils.py /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/

@brandonbiggs

@sarckk sorry it took me so long to get back to this.

Command:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_DISABLE_COMPILE_CACHE=1
apptainer exec --nv \
    docker://vllm/vllm-openai:v0.8.3 \
    vllm serve Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 64000 \
    --override-generation-config='{"attn_temperature_tuning": true}'

4 x 80 GB NVIDIA H100s

@ypeng1

ypeng1 commented Apr 15, 2025

I got this error with all models that have a tensor parallel size greater than 1, not just Llama 4. It only started happening when I upgraded to 0.8.3; 0.8.2 is fine.

This is my Docker container script:

#!/bin/bash

# Set Hugging Face token
HUGGING_FACE_HUB_TOKEN="<fill in your huggingface token>"

# Run Llama 3.3-70B-Instruct with 8 GPUs (tensor parallel size 8)
docker run -d --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
    -p 8001:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --swap-space 16 \
    --disable-log-requests \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --max-num-batched-tokens 65536 \
    --enable-chunked-prefill=True \
    --kv-cache-dtype=auto \
    --enable-prefix-caching &
