
[Core] feat: Implement Priority Scheduling in V1 Engine #19057


Merged
47 commits merged into vllm-project:main on Jun 23, 2025

Conversation

@amitm02 (Contributor) commented Jun 3, 2025

This commit introduces priority scheduling capabilities to the V1 LLM engine.

Key changes include:

- EngineCoreRequest and Request updates:
  - Added a priority field to the EngineCoreRequest and Request classes to carry priority information.
- Processor update:
  - Modified Processor.process_inputs to accept the priority and pass it through to EngineCoreRequest.
- V1 Scheduler modifications:
  - The scheduler now respects the --scheduling-policy argument.
  - When policy="priority", self.waiting is managed as a min-heap, ordering requests by their assigned priority value (a lower value means higher priority) and then by arrival time (FCFS). A minimal sketch of this ordering follows the description below.
  - Preemption logic now correctly identifies and preempts the lowest-priority running request when space is needed for higher-priority or new requests.
  - FCFS behavior is maintained when policy="fcfs".
- Documentation:
  - Updated docs/usage/v1_guide.md and docs/serving/openai_compatible_server.md to reflect the V1 engine's support for priority scheduling.
- Unit Tests:
  - Added a new test suite in tests/v1/core/test_scheduler.py.

This lets you influence the order of request processing in the V1 engine by assigning priorities, which is particularly useful when requests vary in importance.

FIX #14002
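
For illustration, a minimal, self-contained sketch of the ordering described above (not the PR's actual scheduler code): waiting requests are kept in a min-heap keyed by (priority, arrival_time), so a lower priority value is served first and ties fall back to arrival order. The Request stand-in and queue class below are illustrative names, not vLLM's classes.

import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Request:
    # Illustrative stand-in for vLLM's Request; only the fields needed to
    # demonstrate the ordering are included.
    request_id: str
    priority: int
    arrival_time: float


class PriorityWaitingQueue:
    """Min-heap of waiting requests: lower priority value first, then FCFS."""

    def __init__(self) -> None:
        # The monotonically increasing counter breaks ties so that Request
        # objects themselves are never compared.
        self._heap: list[tuple[int, float, int, Request]] = []
        self._counter = itertools.count()

    def add_request(self, request: Request) -> None:
        heapq.heappush(
            self._heap,
            (request.priority, request.arrival_time, next(self._counter), request),
        )

    def pop_request(self) -> Request:
        return heapq.heappop(self._heap)[-1]

    def __len__(self) -> int:
        return len(self._heap)


# A later-arriving but higher-priority (lower value) request is served first.
queue = PriorityWaitingQueue()
queue.add_request(Request("a", priority=1, arrival_time=0.0))
queue.add_request(Request("b", priority=0, arrival_time=1.0))
assert queue.pop_request().request_id == "b"
assert queue.pop_request().request_id == "a"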

@mergify mergify bot added the documentation and v1 labels Jun 3, 2025

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @amitm02.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 4, 2025
@aarnphm (Collaborator) left a comment

Just a few quick notes about formatting changes, to help with reviewing the core logic of the scheduler.

For the types of the waiting queue, I think we can implement a deque subclass that carries a priority ranking per request (see the sketch below).
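
One possible reading of this suggestion, as a hedged sketch rather than the PR's final design: a small abstract queue interface plus an FCFS implementation that subclasses deque directly. The RequestQueue and FCFSRequestQueue names match the snippet quoted later in this thread; the exact method surface shown here is an assumption.

from abc import ABC, abstractmethod
from collections import deque


class RequestQueue(ABC):
    """Assumed minimal interface for the scheduler's waiting queue."""

    @abstractmethod
    def add_request(self, request) -> None: ...

    @abstractmethod
    def pop_request(self): ...

    @abstractmethod
    def peek_request(self): ...

    @abstractmethod
    def __len__(self) -> int: ...


class FCFSRequestQueue(deque, RequestQueue):
    """FCFS policy as a thin deque subclass."""

    def add_request(self, request) -> None:
        self.append(request)

    def pop_request(self):
        return self.popleft()

    def peek_request(self):
        if len(self) == 0:
            raise IndexError("peek from an empty queue")
        return self[0]


# Usage: plain FIFO behavior through the queue interface.
q = FCFSRequestQueue()
q.add_request("req-1")
q.add_request("req-2")
assert q.peek_request() == "req-1"
assert q.pop_request() == "req-1"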

@amitm02 amitm02 requested review from ywang96 and WoosukKwon June 19, 2025 16:01
@amitm02 amitm02 requested a review from njhill June 19, 2025 17:32
@aarnphm (Collaborator) left a comment

I think the code looks a lot cleaner. Thanks for this.

But I'm not sure about the degradation in FCFS. 5% is quite a huge margin...

@amitm02 amitm02 requested a review from aarnphm June 20, 2025 09:44
@amitm02 (Contributor, Author) commented Jun 20, 2025

> I think the code looks a lot cleaner. Thanks for this.
>
> But I'm not sure about the degradation in FCFS. 5% is quite a huge margin...

Thanks! Glad the cleanup looks better.

Regarding the FCFS degradation: while it's around 5%, we're still seeing ~18k ops/sec, which is far beyond where vLLM's actual bottleneck lies. I don't think it's worth optimizing something that already performs well beyond what the system can realistically handle; it feels like premature optimization at this point.

@amitm02 amitm02 requested a review from hmellor June 22, 2025 15:35
@WoosukKwon WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 22, 2025

def peek_request(self) -> Request:
    """Peek at the next request in the queue without removing it."""
    if not self:

A collaborator left an inline review comment with a suggested change:

-    if not self:
+    if len(self) == 0:

@WoosukKwon (Collaborator) left a comment

@amitm02 Thanks for all the fixes! Overall, I think the PR is in good enough shape to be merged. I have small concerns about the overhead of finish_request and preemption, but I think it's acceptable (and we don't have a better approach at hand right now).

One last issue is the performance overhead in the default FCFS setting. I found that this PR causes a ~1% perf degradation in the sharegpt benchmark, but the overhead is eliminated by the following change:

# Note: Request and RequestQueue are vLLM-internal types (the request object
# and the abstract waiting-queue interface introduced in this PR); the stdlib
# imports below are added here for context.
from collections import deque
from collections.abc import Iterable


class FCFSRequestQueue(deque[Request], RequestQueue):
    """A first-come-first-served queue that supports deque operations."""

    # Avoid the Python function call overheads by directly calling the parent
    # methods.
    add_request = deque.append
    pop_request = deque.popleft
    prepend_request = deque.appendleft
    prepend_requests = deque.extendleft
    remove_request = deque.remove
    __len__ = deque.__len__
    __iter__ = deque.__iter__
    __reversed__ = deque.__reversed__

    def __bool__(self) -> bool:
        """Check if queue has any requests."""
        return len(self) > 0

    def peek_request(self) -> Request:
        """Peek at the next request in the queue without removing it."""
        if len(self) == 0:
            raise IndexError("peek from an empty queue")
        return self[0]

    def remove_requests(self, requests: Iterable[Request]) -> None:
        """Remove multiple specific requests from the queue."""
        requests_to_remove = set(requests)
        filtered_requests = [
            req for req in self if req not in requests_to_remove
        ]
        # deque does not support in-place filtering, so we need to clear
        # and extend
        self.clear()
        self.extend(filtered_requests)

The code doesn't look so good, but we can fix it later after we introduce async scheduling.
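
For what it's worth, the trick in the snippet above is binding the parent class's C-implemented methods directly as class attributes, so a call like queue.add_request(req) dispatches straight into deque.append without an extra Python-level frame. A rough, hedged illustration of the difference (the class names below are made up for the comparison):

from collections import deque
from timeit import timeit


class WrappedQueue(deque):
    """Delegates through a Python method, adding one frame per call."""

    def add_request(self, item) -> None:
        self.append(item)


class BoundQueue(deque):
    """Binds the C-implemented parent method directly: no extra frame."""

    add_request = deque.append


wrapped, bound = WrappedQueue(), BoundQueue()
print("wrapped:", timeit(lambda: wrapped.add_request(1), number=1_000_000))
print("bound:  ", timeit(lambda: bound.add_request(1), number=1_000_000))

On CPython the directly bound variant is typically noticeably cheaper per call, which is consistent with the ~1% end-to-end gap reported above disappearing.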

@WoosukKwon WoosukKwon merged commit 4a0f788 into vllm-project:main Jun 23, 2025
76 of 77 checks passed
juncheoll pushed a commit to juncheoll/vllm that referenced this pull request Jun 23, 2025
fhl2000 pushed a commit to fhl2000/vllm that referenced this pull request Jun 25, 2025
Labels
ci/build, documentation, frontend, llama, multi-modality, ready, speculative-decoding, structured-output, tool-calling, v1
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Feature]: Implement Priority Scheduling In V1 Engine
7 participants