[Core] feat: Implement Priority Scheduling in V1 Engine #19057
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Just a few quick notes about formatting changes to help with reviewing the core logic of the scheduler.
For the type of the waiting queue, I think we can implement a deque subclass that carries a priority ranking per request.
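Not part of the PR as submitted, but here is a minimal sketch of that idea, using heapq rather than a literal deque subclass and assuming a Request type with priority and arrival_time fields (all names here are illustrative, not the PR's actual implementation):

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for vLLM's Request with only the fields this sketch needs."""
    request_id: str
    priority: int = 0
    arrival_time: float = 0.0


class PriorityRequestQueue:
    """Waiting queue ordered by (priority, arrival_time).

    Lower priority values are served first; ties fall back to FCFS order.
    """

    def __init__(self) -> None:
        # Heap entries carry a monotonically increasing counter so that
        # Request objects themselves never need to be comparable.
        self._heap: list[tuple[int, float, int, Request]] = []
        self._counter = itertools.count()

    def add_request(self, request: Request) -> None:
        heapq.heappush(
            self._heap,
            (request.priority, request.arrival_time, next(self._counter), request),
        )

    def pop_request(self) -> Request:
        """Remove and return the highest-priority (lowest value) request."""
        return heapq.heappop(self._heap)[-1]

    def peek_request(self) -> Request:
        """Return the next request without removing it."""
        if not self._heap:
            raise IndexError("peek from an empty queue")
        return self._heap[0][-1]

    def __len__(self) -> int:
        return len(self._heap)
```

A heap keeps add/pop at O(log n), which is the main reason to prefer it over repeatedly inserting into a sorted deque.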
I think the code looks a lot cleaner. Thanks for this.
But I'm not sure about the degradation in FCFS. 5% is quite a large margin...
Thanks! Glad the cleanup looks better. Regarding the FCFS degradation: while it's around 5%, we're still seeing ~18k ops/sec, which is far above the real bottleneck in vLLM. I don't think it's worth optimizing something that already performs well beyond what the system can realistically handle; it feels like premature optimization at this point.
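For context, a figure like "~18k ops/sec" comes from a micro-benchmark over the queue operations themselves. The harness below is only a sketch of that shape; the function name and counts are made up, and raw deque numbers will be far higher than an end-to-end scheduler measurement:

```python
import time
from collections import deque


def measure_ops_per_sec(n: int = 100_000) -> float:
    """Time n append/popleft cycles on a bare deque and report ops/sec.

    Illustrative only: the figure quoted in this thread measures the
    scheduler's queue handling, not a bare deque.
    """
    q: deque[int] = deque()
    start = time.perf_counter()
    for i in range(n):
        q.append(i)       # stand-in for add_request
    for _ in range(n):
        q.popleft()       # stand-in for pop_request
    elapsed = time.perf_counter() - start
    return (2 * n) / elapsed


print(f"{measure_ops_per_sec():,.0f} ops/sec")
```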
The reviewed lines from peek_request:

    def peek_request(self) -> Request:
        """Peek at the next request in the queue without removing it."""
        if not self:

Suggested change (replace the last line):

    if len(self) == 0:
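The suggestion itself does not state a motivation; one plausible reason (an assumption on my part) is that "not self" goes through a Python-level __bool__ on the subclass, while "len(self) == 0" with __len__ bound directly to deque.__len__ stays in C. A quick way to check the difference:

```python
import timeit
from collections import deque


class Q(deque):
    """Deque subclass mirroring the queue in this PR: __len__ stays the C
    implementation, __bool__ is a Python-level method."""

    __len__ = deque.__len__

    def __bool__(self) -> bool:
        return len(self) > 0


q = Q(range(8))
# 'not q' pays the Python-level __bool__ call; 'len(q) == 0' does not.
print("not q       :", timeit.timeit("not q", globals=globals(), number=1_000_000))
print("len(q) == 0 :", timeit.timeit("len(q) == 0", globals=globals(), number=1_000_000))
```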
@amitm02 Thanks for all the fixes! Overall, I think the PR is in good enough shape to be merged. I have small concerns about the overhead of finish_request and preemption, but I think it's acceptable (and we don't have a good alternative at hand right now).
One last issue is the performance overheads in the default FCFS setting. I found that this PR does cause ~1% perf degradation in the sharegpt benchmark, but the overhead is eliminated by the following change:
from collections import deque
from collections.abc import Iterable

# Request and RequestQueue are defined in the surrounding vLLM scheduler module.


class FCFSRequestQueue(deque[Request], RequestQueue):
    """A first-come-first-served queue that supports deque operations."""

    # Avoid the Python function call overheads by directly calling the parent
    # methods.
    add_request = deque.append
    pop_request = deque.popleft
    prepend_request = deque.appendleft
    prepend_requests = deque.extendleft
    remove_request = deque.remove
    __len__ = deque.__len__
    __iter__ = deque.__iter__
    __reversed__ = deque.__reversed__

    def __bool__(self) -> bool:
        """Check if queue has any requests."""
        return len(self) > 0

    def peek_request(self) -> Request:
        """Peek at the next request in the queue without removing it."""
        if len(self) == 0:
            raise IndexError("peek from an empty queue")
        return self[0]

    def remove_requests(self, requests: Iterable[Request]) -> None:
        """Remove multiple specific requests from the queue."""
        requests_to_remove = set(requests)
        filtered_requests = [
            req for req in self if req not in requests_to_remove
        ]
        # deque does not support in-place filtering, so we need to clear
        # and extend.
        self.clear()
        self.extend(filtered_requests)
The code doesn't look so good, but we can fix it later after we introduce async scheduling.
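To illustrate how the queue above behaves, here is a short usage sketch. It assumes the class from the comment above (and its RequestQueue base) can be instantiated as-is; the stand-in request class is made up so the snippet runs outside the vLLM tree:

```python
class FakeRequest:
    """Trivial stand-in for vLLM's Request; only identity matters here."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id


q = FCFSRequestQueue()
a, b, c = FakeRequest("a"), FakeRequest("b"), FakeRequest("c")
for req in (a, b, c):
    q.add_request(req)        # deque.append under the hood

assert q.peek_request() is a  # FCFS: oldest request is served first
assert q.pop_request() is a   # deque.popleft under the hood

q.remove_requests([c])        # bulk removal rebuilds the deque contents
assert [r.request_id for r in q] == ["b"]
```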
This commit introduces priority scheduling capabilities to the V1 LLM engine.

Key changes include:

- EngineCoreRequest and Request updates: added a priority field to the EngineCoreRequest and Request classes to carry priority information.
- Processor update: modified Processor.process_inputs to accept the priority and pass it through to EngineCoreRequest.
- V1 Scheduler modifications:
  - The scheduler now respects the --scheduling-policy argument.
  - When policy="priority", self.waiting is managed as a min-heap, prioritizing requests by their assigned priority value (lower value means higher priority) and then by arrival time (FCFS).
  - Preemption logic now correctly identifies and preempts the actual lowest-priority running request when space is needed for higher-priority or new requests (see the sketch after this summary).
  - FCFS behavior is maintained when policy="fcfs".
- Documentation: updated docs/usage/v1_guide.md and docs/serving/openai_compatible_server.md to reflect the V1 engine's support for priority scheduling.
- Unit tests: added a new test suite in tests/v1/core/test_scheduler.py.

This allows you to influence the order of request processing in the V1 engine by assigning priorities, which is particularly useful in scenarios with varying request importance.
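As a small illustration of the preemption rule described above (a sketch with a made-up stand-in class, not the PR's scheduler code; the arrival-time tie-break is an assumption), the victim is simply the running request with the worst (priority, arrival_time) key:

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in with only the fields the rule needs."""
    request_id: str
    priority: int
    arrival_time: float


def select_preemption_victim(running: list[Request]) -> Request:
    """Pick the lowest-priority running request (largest priority value),
    breaking ties by the most recent arrival."""
    return max(running, key=lambda r: (r.priority, r.arrival_time))


running = [
    Request("a", priority=0, arrival_time=1.0),
    Request("b", priority=2, arrival_time=2.0),
    Request("c", priority=2, arrival_time=3.0),
]
# Priority 0 is protected; among the two priority-2 requests, the later
# arrival ("c") is preempted first.
assert select_preemption_victim(running).request_id == "c"
```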
FIX #14002