
[Bug]: Calling /wake_up after /sleep and then sending a request leads to improper LLM response #16234

Closed
1 task done
akshayqylis opened this issue Apr 8, 2025 · 3 comments
Labels
bug Something isn't working

Comments


akshayqylis commented Apr 8, 2025

Your current environment

Please note that I am using the Docker image directly; I generated the log below from inside the Docker container using docker exec.

INFO 04-07 22:57:45 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb 5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1024-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4

Nvidia driver version: 550.144.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 96 MiB (3 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.1.3
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.0
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 0-47 0 N/A
GPU1 SYS X 0-47 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.0
VLLM_SERVER_DEV_MODE=1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Hi,

I am following the discussion in issue #299 and trying out the newly introduced /sleep and /wake_up endpoints. I believe there might be a bug, since I do not get a proper response from the LLM after calling /sleep followed by /wake_up. Please see the description below:

I am using the vllm/vllm-openai:latest Docker image, which I downloaded today (around 09:30 AM IST, 08/04/2025), to serve the meta-llama/Llama-3.1-8B-Instruct model.

vllm/vllm-openai  latest   24d76f8822cb   2 days ago    17.1GB

The docker-compose file docker-compose-sleep-llm.yaml I am using is shown below:

name: llmserver
services:
    vllm-openai:
        runtime: nvidia
        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          device_ids: ['2', '3']
                          capabilities: [gpu]
        volumes:
            - <my_cache_path>:/root/.cache/huggingface
        environment:
            - HUGGING_FACE_HUB_TOKEN=<mytoken>
            - VLLM_SERVER_DEV_MODE=1
        ports:
            - 8000:8000
        ipc: host
        image: vllm/vllm-openai:latest
        command: --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --enable-sleep-mode

The Python script test_llm.py I am using to send a request is shown below:

from openai import OpenAI

# Initialize client
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

# Send request
response1 = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a paragraph on India."},
        ]
    }],
    stream=True
)

print('Response in chunks.')
print('\n\n')
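# Stream the SSE chunks and print the generated text as it arrives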
for sse_chunk in response1:
    content = sse_chunk.choices[0].delta.content
    print(content, end='', flush=True)
print('\n\n')

The steps in my experiment:

  1. Start the container:
docker compose -f docker-compose-sleep-llm.yaml up
  2. Send a request using Python:
python3 test_llm.py

Response:

Response in chunks.

India is a vast and diverse country located in South Asia, known for its rich cultural heritage and breathtaking natural beauty. With a population of over 1.3 billion people, it is the second-most populous country in the world. From the snow-capped Himalayan mountains in the north to the tropical beaches of the south, India's geography is incredibly varied. The country is home to numerous UNESCO World Heritage Sites, including the Taj Mahal, a stunning white marble mausoleum in Agra, and the ancient city of Varanasi, one of the oldest continuously inhabited cities in the world. India is also known for its vibrant cities, such as Mumbai and Delhi, which are hubs for business, technology, and entertainment. With its diverse languages, cuisines, and festivals, India is a truly unique and fascinating country that offers something for everyone.
  3. Send the sleep request:
curl --data "level=1" http://localhost:8000/sleep
  4. Send the wake-up request:
curl --data "" http://localhost:8000/wake_up
  5. Send the chat request again:
python3 test_llm.py

Response:

Response in chunks.



The brown water of the Ganges flowed through the landscape, and the people lived and farmed the land. The dark soil was rich and fertile, and the people were poor. The brown water of the Ganges flowed through the landscape, and the people bathed. The dark people were poor and rich, and the brown water of the Ganges flowed through the landscape.

The Ganges, a river, flowed through the landscape, and the people bathed in the dark, rich soil. The Ganges river flowed through the brown landscape, and the people lived and farmed in the dark, rich soil.

The people of the Ganges were dark and rich, and they lived and farmed in the dark, rich soil. The Ganges river flowed through the brown landscape, and the people bathed in the dark, rich soil.

The people of the Ganges were dark and rich, and they lived and farmed in the dark, rich soil. The Ganges river flowed through the brown landscape, and the people bathed in the dark, rich soil.

The people of the Ganges were dark and rich, and they lived and farmed in the dark, rich soil. The Ganges river flowed through the brown landscape, and the people bathed in the dark, rich soil.

The last sentence repeats indefinitely. This does not happen if I repeat the experiment without steps 3 and 4; in that case the response from python3 test_llm.py is identical to the first one.
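(For reproducibility, a deterministic variant of the request can also be used to compare the before/after outputs exactly; the sketch below uses the standard temperature and seed parameters of the OpenAI client, which are not part of my original test_llm.py.)

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Same prompt as test_llm.py, but non-streaming and with greedy decoding
# (temperature=0) plus a fixed seed, so outputs before and after
# /sleep + /wake_up can be compared exactly.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a paragraph on India."}],
    temperature=0,
    seed=0,
)
print(response.choices[0].message.content)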

Thanks in advance.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
akshayqylis added the bug label Apr 8, 2025
@DarkLight1337
Member

cc @youkaichao

@youkaichao
Member

cc @comaniac would it be related to how we reset prefix caching?

@akshayqylis can you try to add --no-enable-prefix-caching to see if it helps?

@akshayqylis
Author

> cc @comaniac would it be related to how we reset prefix caching?
>
> @akshayqylis can you try to add --no-enable-prefix-caching to see if it helps?

Adding the --no-enable-prefix-caching option works. Thanks a lot @youkaichao and your team for your work on vLLM.
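For reference, a minimal sketch of the change, assuming the flag is simply appended to the command: line of docker-compose-sleep-llm.yaml above (the rest of the service definition is unchanged):

        command: --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --enable-sleep-mode --no-enable-prefix-caching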
