More accurate RPM limit enforcement on keys #10037

Closed
krrishdholakia wants to merge 6 commits into main from litellm_dev_04_15_2025_p1

Conversation

krrishdholakia
Contributor

Title

More accurate RPM limit enforcement on keys

Relevant issues

  • Fixes an issue where instances were overwriting each other's increment values in the Redis cache
  • Improves RPM checking by incrementing on check (reducing spillover from 66 to 2 in a multi-instance, high-traffic setup)

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests with make test-unit (https://docs.litellm.ai/docs/extras/contributing_code)
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🆕 New Feature
🐛 Bug Fix

Changes

  • migrate parallel_request_limiter.py to inherit from base_routing_strategy.py (which already solves the RPM increment / Redis syncing problem; see the sketch below)
  • new check_key_in_limits_v2 function

…llel request handler to use base routing strategy

allows for better Redis / in-memory cache usage
uses the Redis increment cache logic

ensures TPM/RPM logic works well across instances
reduces spillover (from 66 to 2 at 10k+ requests in 1 min.)
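
For context, here is a minimal sketch of the increment-on-check pattern described above, where the counter is bumped atomically as part of the limit check so concurrent instances cannot slip requests in between a read and a later write. This is illustrative only, not the actual LiteLLM implementation; the function name and key format are made up:

import time

import redis.asyncio as redis


async def check_and_increment_rpm(client: redis.Redis, api_key: str, rpm_limit: int) -> bool:
    # Counter is scoped to the current 1-minute window.
    window = int(time.time() // 60)
    counter_key = f"rpm:{api_key}:{window}"

    # INCR is atomic in Redis, so every instance sees a consistent count
    # and the increment happens as part of the check itself.
    count = await client.incr(counter_key)
    if count == 1:
        # First request in this window: let the key expire once the window ends.
        await client.expire(counter_key, 60)

    # The request is allowed only if the post-increment count is within the limit.
    return count <= rpm_limit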

@krrishdholakia krrishdholakia marked this pull request as draft April 16, 2025 01:13
@harvardfly

When will this issue be fixed? It is reliably reproducible in version 1.65.4, and multi-instance rate limiting is failing. @krrishdholakia

@krrishdholakia
Contributor Author

Hey @harvardfly, acknowledging this. I hope to have this done over the next 2 weeks.

Need to do:

  • pass unit testing
  • run load testing to confirm this passes as expected

Follow-up PR:

  • migrate team, user, model-level rate limiting as well

@harvardfly

harvardfly commented Apr 27, 2025

Got it, thank you very much @krrishdholakia . I hope you can fix it in the stable version as soon as possible, as TPM and RPM limits are crucial features.

@ScGPS

ScGPS commented Apr 29, 2025

@krrishdholakia,
From your PR code, there are 2 issues:

    1. A 500 error occurs. To reproduce it, use the same key to request the service twice.
10:55:03 - LiteLLM Proxy:ERROR: proxy_server.py:3642 - litellm.proxy.proxy_server.completion(): Exception occured - unsupported operand type(s) for +: 'dict' and 'int'
Traceback (most recent call last):
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/proxy/proxy_server.py", line 3524, in completion
    data = await proxy_logging_obj.pre_call_hook(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/proxy/utils.py", line 565, in pre_call_hook
    raise e
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/proxy/utils.py", line 552, in pre_call_hook
    response = await _callback.async_pre_call_hook(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/proxy/hooks/parallel_request_limiter.py", line 381, in async_pre_call_hook
    await self.check_key_in_limits_v2(
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/proxy/hooks/parallel_request_limiter.py", line 128, in check_key_in_limits_v2
    results = await self._increment_value_list_in_current_window(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/router_strategy/base_routing_strategy.py", line 64, in _increment_value_list_in_current_window
    result = await self._increment_value_in_current_window(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/router_strategy/base_routing_strategy.py", line 81, in _increment_value_in_current_window
    result = await self.dual_cache.in_memory_cache.async_increment(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/projects/AIAPIChannel/litellm_oss/litellm/caching/in_memory_cache.py", line 184, in async_increment
    value = init_value + value
            ~~~~~~~~~~~^~~~~~~
TypeError: unsupported operand type(s) for +: 'dict' and 'int'
INFO:     127.0.0.1:36024 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
    2. Suggestion: the atomic increment should only be used for the Redis cache, not the in-memory cache. When the Redis cache is updated by the atomic increment, the Redis value should then be used to update the in-memory cache in the multi-instance case.
      # in-memory increment (problematic when each instance holds its own copy):
      result = await self.dual_cache.in_memory_cache.async_increment(
          key=key,
          value=value,
          ttl=ttl,
      )
      # Redis pipeline increment operation (the atomic, shared counter):
      increment_op = RedisPipelineIncrementOperation(
          key=key,
          increment_value=value,
          ttl=ttl,
      )
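
A minimal sketch of that suggestion, assuming the dual cache exposes an async increment on its Redis cache and an async set on its in-memory cache (these method names are assumptions for illustration, not confirmed LiteLLM APIs):

from typing import Optional


async def increment_with_redis_authority(dual_cache, key: str, value: int, ttl: Optional[int] = None) -> int:
    # 1. Atomic increment in Redis: the shared, authoritative counter.
    new_value = await dual_cache.redis_cache.async_increment(key=key, value=value, ttl=ttl)

    # 2. Overwrite (rather than increment) the local in-memory copy with the Redis
    #    result, so stale or non-integer local values never corrupt the count.
    await dual_cache.in_memory_cache.async_set_cache(key, new_value, ttl=ttl)
    return new_value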

@krrishdholakia
Contributor Author

Closing as this is now on main -

Designed to work in a multi-instance setup, where multiple instances are writing to Redis simultaneously

@krrishdholakia krrishdholakia deleted the litellm_dev_04_15_2025_p1 branch May 2, 2025 05:10
@krrishdholakia
Contributor Author

@ScGPS we do sync the Redis value to the in-memory cache (handled by the BaseRoutingStrategy class)

We do this periodically (every 0.01s) to avoid calling Redis on each request
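
For illustration, a rough sketch of such a periodic sync loop (the method names on dual_cache are assumptions, not the exact BaseRoutingStrategy code):

import asyncio


async def periodic_redis_sync(dual_cache, keys_to_sync, interval: float = 0.01):
    # Every `interval` seconds, mirror the authoritative Redis counters into the
    # in-memory cache, so per-request checks can read local memory instead of Redis.
    while True:
        for key in list(keys_to_sync):
            redis_value = await dual_cache.redis_cache.async_get_cache(key)
            if redis_value is not None:
                await dual_cache.in_memory_cache.async_set_cache(key, redis_value)
        await asyncio.sleep(interval)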
