Description
Your current environment
The output of python collect_env.py
INFO 06-24 09:18:17 [__init__.py:244] Automatically detected platform cuda.
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 3.31.6
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.15.0-139-generic-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.9.41
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
Nvidia driver version : 535.183.06
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 88%
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4499.62
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
NUMA node4 CPU(s): 64-79
NUMA node5 CPU(s): 80-95
NUMA node6 CPU(s): 96-111
NUMA node7 CPU(s): 112-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow: Mitigation; SMT disabled
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cudnn-frontend==1.11.0
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-dali-cuda120==1.49.0
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-modelopt==0.27.1
[pip3] nvidia-modelopt-core==0.27.1
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvcomp-cu12==4.2.0.14
[pip3] nvidia-nvimgcodec-cu12==0.5.0.13
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvjpeg-cu12==12.4.0.16
[pip3] nvidia-nvjpeg2k-cu12==0.8.1.40
[pip3] nvidia-nvtiff-cu12==0.5.0.67
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] nvidia-resiliency-ext==0.3.0
[pip3] onnx==1.17.0
[pip3] optree==0.15.0
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce52.nvinternal
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torch_tensorrt==2.8.0a0
[pip3] torchaudio==2.7.0
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1
vLLM Build Flags:
CUDA Archs: 7.5 8.0 8.6 9.0 10.0 12.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 48-63 3 N/A
GPU1 SYS X SYS SYS 32-47 2 N/A
GPU2 SYS SYS X SYS 16-31 1 N/A
GPU3 SYS SYS SYS X 64-79 4 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=void
CUBLAS_VERSION=12.9.0.13
NVIDIA_REQUIRE_CUDA=cuda>=9.0
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0+PTX
NCCL_VERSION=2.26.5
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
TORCH_NCCL_USE_COMM_NONBLOCKING=0
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.9.0.043
PYTORCH_VERSION=2.8.0a0+5228986
PYTORCH_BUILD_NUMBER=0
CUBLASMP_VERSION=0.4.0.789
CUDNN_FRONTEND_VERSION=1.11.0
CUDNN_VERSION=9.10.1.4
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=170559088
CUDA_DRIVER_VERSION=575.51.03
PYTORCH_BUILD_VERSION=2.8.0a0+5228986
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=25.05
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
- How to reproduce
Run a Llama model with data_parallel.py from the examples and disable expert parallelism by changing enable_expert_parallel=True to enable_expert_parallel=False in the script; that single change is enough to reproduce the error (see the sketch below).
I run data_parallel.py with the following command:
python data_parallel.py --model=/Path/to/Llama-3.2-1B-Instruct --dp-size=2 --tp-size=2 --enforce-eager
An AssertionError is then raised after all DP groups have finished their work.
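For reference, the only edit to data_parallel.py is the flag passed to the LLM constructor. A minimal sketch of the relevant call (the surrounding arguments are abbreviated and, in the real script, come from the command-line options above, so they may differ slightly by vLLM version):

```python
# Sketch of the edit in examples/data_parallel.py; other constructor arguments
# are abbreviated and may differ slightly from the actual example script.
from vllm import LLM

llm = LLM(
    model="/Path/to/Llama-3.2-1B-Instruct",
    tensor_parallel_size=2,
    enforce_eager=True,
    enable_expert_parallel=False,  # changed from True; this triggers the AssertionError below
)
```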
- Error logs
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 4616.00it/s]
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 4672.36it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 467.61it/s, est. speed input: 3039.73 toks/s, output: 7482.31 toks/s]
DP rank 0, Prompt: 'Hello, my name is', Generated text: " Emily and I'm a huge fan of your YouTube channel! Your content is so"
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the head of state and government. This is the most obvious job in the United'
DP rank 0, Prompt: 'The capital of France is', Generated text: " Paris. That's all you need to know.\n\nThis response is a simple example"
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being shaped by many factors, including advancements in computing power, increased data availability,'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Ish, and I am a big fan of your work. I have been following'
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 356.21it/s, est. speed input: 2315.58 toks/s, output: 7117.69 toks/s]
DP rank 1, Prompt: 'Hello, my name is', Generated text: " Emily and I'm a huge fan of your YouTube channel! Your content is so engaging, informative,"
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the head of state and government. This is the most obvious job in the United States. The President'
DP rank 1, Prompt: 'The capital of France is', Generated text: " Paris. That's all you need to know.\n\nThis response is a simple example of a prompt that"
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being shaped by many factors, including advancements in computing power, increased data availability, and advancements in machine'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Ish, and I am a big fan of your work. I have been following your blog for a'
(EngineCore_0 pid=7862) Exception in thread Thread-2 (process_output_sockets):
(EngineCore_0 pid=7862) Traceback (most recent call last):
(EngineCore_0 pid=7862) File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
(EngineCore_0 pid=7862) self.run()
(EngineCore_0 pid=7862) File "/usr/lib/python3.12/threading.py", line 1010, in run
(EngineCore_0 pid=7862) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=7862) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in process_output_sockets
(EngineCore_0 pid=7862) assert coord_socket is not None
(EngineCore_0 pid=7862) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7862) AssertionError
- Full error logs
INFO 06-24 09:23:54 [__init__.py:244] Automatically detected platform cuda.
DP rank 0 needs to process 200 prompts
DP rank 1 needs to process 200 prompts
INFO 06-24 09:24:05 [config.py:823] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 06-24 09:24:05 [config.py:823] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 06-24 09:24:05 [config.py:1946] Defaulting to use mp for distributed inference
INFO 06-24 09:24:05 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 06-24 09:24:05 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-24 09:24:05 [config.py:1946] Defaulting to use mp for distributed inference
INFO 06-24 09:24:05 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 06-24 09:24:05 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(EngineCore_1 pid=7854) INFO 06-24 09:24:06 [core.py:455] Waiting for init message from front-end.
(EngineCore_0 pid=7862) INFO 06-24 09:24:06 [core.py:455] Waiting for init message from front-end.
(EngineCore_1 pid=7854) INFO 06-24 09:24:07 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_1 pid=7854) WARNING 06-24 09:24:07 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_1 pid=7854) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_f783857c'), local_subscribe_addr='ipc:///tmp/63e3ad62-63a3-4aa4-bd9c-22cbec42672d', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=7862) INFO 06-24 09:24:07 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=7862) WARNING 06-24 09:24:07 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=7862) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_9954119f'), local_subscribe_addr='ipc:///tmp/e42dd325-4ed8-42a1-ab4d-ffc51b185728', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_1 pid=7854) WARNING 06-24 09:24:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff3edf3dd0>
(EngineCore_1 pid=7854) WARNING 06-24 09:24:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff3edf36e0>
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6b9caf2f'), local_subscribe_addr='ipc:///tmp/3f2329ac-5d8c-4e40-a644-e98126f91b8a', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1dec3a75'), local_subscribe_addr='ipc:///tmp/ef6d1180-dd03-47e1-98af-6eab4f3312ed', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=7862) WARNING 06-24 09:24:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff3ebff500>
(EngineCore_0 pid=7862) WARNING 06-24 09:24:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff3abdf3b0>
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_01e3c3d4'), local_subscribe_addr='ipc:///tmp/37ab59c0-a0b3-4f00-a4ab-7384db8af311', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_298dcf90'), local_subscribe_addr='ipc:///tmp/0ef16bf1-6694-4523-8ef8-a8762d833559', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:09 [parallel_state.py:934] Adjusting world_size=4 rank=1 distributed_init_method=tcp://127.0.0.1:49774 for DP
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:09 [parallel_state.py:934] Adjusting world_size=4 rank=3 distributed_init_method=tcp://127.0.0.1:49774 for DP
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:09 [parallel_state.py:934] Adjusting world_size=4 rank=2 distributed_init_method=tcp://127.0.0.1:49774 for DP
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:09 [parallel_state.py:934] Adjusting world_size=4 rank=0 distributed_init_method=tcp://127.0.0.1:49774 for DP
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:09 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:09 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:09 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:09 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:09 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:09 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:09 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:09 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3.json
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3.json
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_50a881da'), local_subscribe_addr='ipc:///tmp/690de017-7700-4b3e-b959-178a48ba1bc5', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_c3ae5421'), local_subscribe_addr='ipc:///tmp/86d2bc1b-dc9c-49e4-8133-4d37190e6017', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [utils.py:1126] Found nccl from library libnccl.so.2
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [parallel_state.py:1065] rank 2 in world size 4 is assigned as DP rank 1, PP rank 0, TP rank 0, EP rank 2
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [parallel_state.py:1065] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [parallel_state.py:1065] rank 3 in world size 4 is assigned as DP rank 1, PP rank 0, TP rank 1, EP rank 3
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [parallel_state.py:1065] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) WARNING 06-24 09:24:10 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) WARNING 06-24 09:24:10 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) WARNING 06-24 09:24:10 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) WARNING 06-24 09:24:10 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [gpu_model_runner.py:1595] Starting to load model /data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct...
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [gpu_model_runner.py:1595] Starting to load model /data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct...
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [gpu_model_runner.py:1595] Starting to load model /data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct...
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [gpu_model_runner.py:1595] Starting to load model /data/vllm/models/LLM-Research/Llama-3.2-1B-Instruct...
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [gpu_model_runner.py:1600] Loading model from scratch...
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:10 [cuda.py:252] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [gpu_model_runner.py:1600] Loading model from scratch...
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [gpu_model_runner.py:1600] Loading model from scratch...
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [gpu_model_runner.py:1600] Loading model from scratch...
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:10 [cuda.py:252] Using Flash Attention backend on V1 engine.
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:10 [cuda.py:252] Using Flash Attention backend on V1 engine.
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:10 [cuda.py:252] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.40it/s]
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879)
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:11 [default_loader.py:272] Loading weights took 0.34 seconds
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:11 [default_loader.py:272] Loading weights took 0.33 seconds
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:11 [default_loader.py:272] Loading weights took 0.38 seconds
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:11 [default_loader.py:272] Loading weights took 0.37 seconds
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:11 [gpu_model_runner.py:1624] Model loading took 1.1667 GiB and 0.463341 seconds
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:11 [gpu_model_runner.py:1624] Model loading took 1.1667 GiB and 0.465979 seconds
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:11 [gpu_model_runner.py:1624] Model loading took 1.1667 GiB and 0.507217 seconds
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:11 [gpu_model_runner.py:1624] Model loading took 1.1667 GiB and 0.508762 seconds
(EngineCore_0 pid=7862) (VllmWorker rank=0 pid=7879) INFO 06-24 09:24:13 [gpu_worker.py:227] Available KV cache memory: 32.97 GiB
(EngineCore_1 pid=7854) (VllmWorker rank=0 pid=7878) INFO 06-24 09:24:13 [gpu_worker.py:227] Available KV cache memory: 32.97 GiB
(EngineCore_0 pid=7862) (VllmWorker rank=1 pid=7881) INFO 06-24 09:24:13 [gpu_worker.py:227] Available KV cache memory: 32.97 GiB
(EngineCore_1 pid=7854) (VllmWorker rank=1 pid=7880) INFO 06-24 09:24:13 [gpu_worker.py:227] Available KV cache memory: 32.97 GiB
(EngineCore_0 pid=7862) INFO 06-24 09:24:13 [kv_cache_utils.py:715] GPU KV cache size: 2,160,416 tokens
(EngineCore_0 pid=7862) INFO 06-24 09:24:13 [kv_cache_utils.py:719] Maximum concurrency for 131,072 tokens per request: 16.48x
(EngineCore_0 pid=7862) INFO 06-24 09:24:13 [kv_cache_utils.py:715] GPU KV cache size: 2,160,416 tokens
(EngineCore_0 pid=7862) INFO 06-24 09:24:13 [kv_cache_utils.py:719] Maximum concurrency for 131,072 tokens per request: 16.48x
(EngineCore_1 pid=7854) INFO 06-24 09:24:13 [kv_cache_utils.py:715] GPU KV cache size: 2,160,416 tokens
(EngineCore_1 pid=7854) INFO 06-24 09:24:13 [kv_cache_utils.py:719] Maximum concurrency for 131,072 tokens per request: 16.48x
(EngineCore_1 pid=7854) INFO 06-24 09:24:13 [kv_cache_utils.py:715] GPU KV cache size: 2,160,416 tokens
(EngineCore_1 pid=7854) INFO 06-24 09:24:13 [kv_cache_utils.py:719] Maximum concurrency for 131,072 tokens per request: 16.48x
(EngineCore_0 pid=7862) INFO 06-24 09:24:14 [core.py:171] init engine (profile, create kv cache, warmup model) took 2.26 seconds
(EngineCore_1 pid=7854) INFO 06-24 09:24:14 [core.py:171] init engine (profile, create kv cache, warmup model) took 2.19 seconds
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 4616.00it/s]
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 4672.36it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 467.61it/s, est. speed input: 3039.73 toks/s, output: 7482.31 toks/s]
DP rank 0, Prompt: 'Hello, my name is', Generated text: " Emily and I'm a huge fan of your YouTube channel! Your content is so"
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the head of state and government. This is the most obvious job in the United'
DP rank 0, Prompt: 'The capital of France is', Generated text: " Paris. That's all you need to know.\n\nThis response is a simple example"
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being shaped by many factors, including advancements in computing power, increased data availability,'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Ish, and I am a big fan of your work. I have been following'
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 356.21it/s, est. speed input: 2315.58 toks/s, output: 7117.69 toks/s]
DP rank 1, Prompt: 'Hello, my name is', Generated text: " Emily and I'm a huge fan of your YouTube channel! Your content is so engaging, informative,"
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the head of state and government. This is the most obvious job in the United States. The President'
DP rank 1, Prompt: 'The capital of France is', Generated text: " Paris. That's all you need to know.\n\nThis response is a simple example of a prompt that"
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being shaped by many factors, including advancements in computing power, increased data availability, and advancements in machine'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Ish, and I am a big fan of your work. I have been following your blog for a'
(EngineCore_0 pid=7862) Exception in thread Thread-2 (process_output_sockets):
(EngineCore_0 pid=7862) Traceback (most recent call last):
(EngineCore_0 pid=7862) File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
(EngineCore_0 pid=7862) self.run()
(EngineCore_0 pid=7862) File "/usr/lib/python3.12/threading.py", line 1010, in run
(EngineCore_0 pid=7862) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=7862) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in process_output_sockets
(EngineCore_0 pid=7862) assert coord_socket is not None
(EngineCore_0 pid=7862) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7862) AssertionError
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.