Add proposal for tenant limits API #6818

bogdan-at-adobe · 2025-06-14T00:08:07Z

What this PR does:

This PR adds a proposal for a tenant limits API.

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

bogdan-at-adobe · 2025-06-14T00:12:32Z

Hello, this has been bothering me and my team for a while.
I would love to work on this, though I don't know much about what implementing it would take and will probably need some guidance.

friedrichg

Thanks for this. It's a great idea.

docs/proposals/limits-api.md

Signed-off-by: Bogdan Stancu <[email protected]>

harry671003 · 2025-06-20T20:26:20Z

docs/proposals/limits-api.md

+}
+```
+
+#### 2. PUT /api/v1/user-limits


The limits are managed by the runtime-config, which is either stored on a volume backed by a ConfigMap or loaded from an S3/GCS/Azure bucket.

How would this API work in the former case?

The end goal is to remove admin intervention for user limits. My initial idea was to write to either the ConfigMap or the S3/GCS/Azure bucket, but I'm not 100% sure of all the implications, other than it requiring more access.

Good question. @bogdan-at-adobe, ConfigMaps are normally read-only. Put it in the spec that the API will not support ConfigMaps, only block storage backends.
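
To illustrate the restriction, here is a minimal sketch of how a write endpoint might refuse read-only backends. The backend names (`s3`, `gcs`, `azure`, `filesystem`) and the `ensure_writable_backend` helper are assumptions made for illustration; they are not taken from the proposal or from the Cortex code base.

```python
# Hypothetical set of runtime-config backends the API could write to.
# The thread only mentions S3/GCS/Azure buckets and ConfigMap-backed volumes.
WRITABLE_BACKENDS = {"s3", "gcs", "azure"}


class UnsupportedBackendError(Exception):
    """Raised when the runtime-config backend cannot be written through the API."""


def ensure_writable_backend(backend: str) -> None:
    """Reject write requests when the runtime-config is not stored in a bucket.

    A ConfigMap-backed volume would show up here under an assumed name such
    as "filesystem" and is treated as read-only.
    """
    if backend.lower() not in WRITABLE_BACKENDS:
        raise UnsupportedBackendError(
            f"runtime-config backend {backend!r} is read-only; "
            "the API only supports bucket backends"
        )
```

A PUT handler could call this check first and return an error response when it raises, so that a deployment using ConfigMaps fails loudly instead of silently dropping the update.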

harry671003 · 2025-06-20T20:27:05Z

docs/proposals/limits-api.md

+### Endpoints
+
+#### 1. GET /api/v1/user-limits
+Returns the current limits configuration for a specific tenant.


Currently the limits are loaded periodically in an interval. Would this API read the config directly from storage?

I think reading from the currently loaded config is good enough for this. Using the API to make changes will also trigger a reload of the loaded limits, so the only issue I see is a manual change to the config: until it gets reloaded, the API will return a stale answer for at most 10 seconds (assuming the default). Such a change is probably made by an admin who is aware of this implication. I might be wrong on this. The GET /runtime_config endpoint makes the same assumptions.

I think the answer is yes, S3 is our only source of truth. This is similar to how the Cortex Alertmanager API works.

harry671003 · 2025-06-20T20:29:25Z

docs/proposals/limits-api.md

+
+### Endpoints
+
+#### 1. GET /api/v1/user-limits


Which component in Cortex will serve this API? Maybe a new admin service in Cortex for this purpose?

As @friedrichg said:

We already have a component that reads limits, so it's perfect for this use case.

So my guess is that cortex-overrides is a good place. Given that GET /runtime_config is on all components, though, I don't see a reason why the limits API wouldn't be the same.

We only want this in cortex-overrides; there is no need to put it in the other components.

Signed-off-by: Bogdan Stancu <[email protected]>

friedrichg

Looks good so far.

In terms of design, I have a question on soft vs hard limits. So far all our limits are hard and this API modifies those limits.

But what if a malicious tenant tries to set unreasonable limits for themselves? Do we allow it, or do we also think of a way to prevent tenants from doing this in the API?

You can answer the question or leave it as an open question. I think it will need to be tackled eventually.

harry671003 · 2025-06-26T18:12:04Z

I think this API needs some validation to check whether a limit update is within a safe range.

Also, not all limits should be modifiable through this API. Limits like shard_size should only be modified by the admin.
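
As a rough sketch of the two checks described above (an allow-list of tenant-modifiable limits plus a safe range per limit), something like the following could work. The `ALLOWED_LIMITS` table, its bounds, and the `validate_update` helper are hypothetical and only meant to make the idea concrete.

```python
# Assumed allow-list of tenant-modifiable limits with (min, max) bounds.
# The numbers are illustrative only, not recommended values.
ALLOWED_LIMITS = {
    "ingestion_rate": (1, 1_000_000),
    "max_global_series_per_user": (1, 5_000_000),
}


def validate_update(update: dict) -> list:
    """Return a list of error messages; an empty list means the update is acceptable."""
    errors = []
    for name, value in update.items():
        if name not in ALLOWED_LIMITS:
            # Anything outside the allow-list (for example shard_size) stays admin-only.
            errors.append(f"{name}: can only be changed by an admin")
            continue
        low, high = ALLOWED_LIMITS[name]
        if not low <= value <= high:
            errors.append(f"{name}: {value} is outside the safe range [{low}, {high}]")
    return errors
```

Returning all errors at once, rather than failing on the first one, lets a tenant fix a rejected request in a single round trip.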

bogdan-at-adobe · 2025-06-26T21:53:43Z

I agree that many limits should only be modified by an admin. We use the tiers defined in the cortex-jsonnet repo and I wrote this proposal with those in mind; there have been few occasions when users needed limit increases other than those.

As for setting unreasonable limits, I thought a bit about this and, at least in our use case, defining "reasonable" would be pretty hard. It might be perfectly reasonable for a huge user to double their data overnight and not very reasonable for a small one to do the same thing.
Are you thinking about dynamically changing the upper limit based on their current usage? I'd like that, but I don't really know the implications.
Setting some hard limit "in the middle" for everyone might encourage all small users to max it out (why not?), while big users won't get any benefit.
I guess having a "limit of limits" per tenant might be the correct answer here, as I have seen that work in other systems. That would turn 5 limit increase requests into 1-2 hard limit increases.
Do you have any other suggestions for this? I'd like to add it to this doc, as I think it is a pretty important matter.

friedrichg · 2025-06-26T22:26:16Z

docs/proposals/limits-api.md

@@ -0,0 +1,71 @@
+---
+title: "Limits API"


Suggested change

title: "Limits API"

title: "Overrides API"

friedrichg · 2025-06-26T22:26:23Z

docs/proposals/limits-api.md

@@ -0,0 +1,71 @@
+---
+title: "Limits API"
+linkTitle: "Limits API"


Suggested change

linkTitle: "Limits API"

linkTitle: "Overrides API"

friedrichg · 2025-06-26T22:28:29Z

docs/proposals/limits-api.md

+
+### Endpoints
+
+#### 1. GET /api/v1/user-limits


Suggested change

#### 1. GET /api/v1/user-limits

#### 1. GET /api/v1/user-overrides

I just noticed you keep saying limits, but this is more about overrides. That is the name used in Cortex: https://cortexmetrics.io/docs/guides/overrides-exporter/

friedrichg · 2025-06-26T22:34:59Z

Call it hard overrides.

For example:
# file: runtime.yaml
# In this example, we're overriding ingestion limits for a single tenant.
overrides:
  "user1":
    ingestion_burst_size: 350000
    ingestion_rate: 350000
    max_global_series_per_metric: 300000
    max_global_series_per_user: 300000
    max_series_per_metric: 0
    max_series_per_user: 0
    max_samples_per_query: 100000
    max_series_per_query: 100000
configurable-overrides:
  "user1":
    ingestion_rate: 700000
    max_global_series_per_user: 700000

configurable-overrides or hard-overrides: I don't know which name communicates the situation better.
Everything defined in configurable-overrides can be modified in the overrides.
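
One way to read the example above, sketched under the assumption that each value in configurable-overrides is an upper bound for what the tenant may set through the API, and that settings missing from it are not tenant-configurable. The `check_against_hard_overrides` helper is hypothetical.

```python
def check_against_hard_overrides(requested: dict, hard: dict) -> dict:
    """Split a requested update into accepted and rejected settings.

    Assumption: `hard` holds per-tenant upper bounds (the configurable-overrides
    section), and a setting absent from it cannot be changed by the tenant.
    """
    accepted, rejected = {}, {}
    for name, value in requested.items():
        cap = hard.get(name)
        if cap is None or value > cap:
            rejected[name] = value
        else:
            accepted[name] = value
    return {"accepted": accepted, "rejected": rejected}


# Hard overrides for "user1", copied from the runtime.yaml example.
hard_user1 = {"ingestion_rate": 700000, "max_global_series_per_user": 700000}

result = check_against_hard_overrides(
    {
        "ingestion_rate": 500000,              # below the cap
        "max_global_series_per_user": 900000,  # above the cap
        "max_series_per_query": 200000,        # not tenant-configurable
    },
    hard_user1,
)
```

Here only `ingestion_rate` would be accepted; the other two settings would still need an admin.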

bogdan-at-adobe · 2025-06-27T12:49:19Z

What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach?
I think changing the limits individually is also useful but since they usually scale as a group I would love this to support these quota units as well, as hard limits for what can be configured but also as a way to batch increase overrides using the api.
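
A small sketch of what I mean by quota units, assuming one unit is a fixed set of default values and the per-tenant hard limit is an integer number of units. The `QUOTA_UNIT` values and the `overrides_for_units` helper are placeholders, not actual Cortex defaults or APIs.

```python
# One quota unit: placeholder values, not real Cortex defaults.
QUOTA_UNIT = {
    "ingestion_rate": 25000,
    "max_global_series_per_user": 150000,
}


def overrides_for_units(units: int, max_units: int) -> dict:
    """Scale every limit in the quota unit by `units`, capped at `max_units`.

    `max_units` plays the role of the tenant's hard limit.
    """
    if not 1 <= units <= max_units:
        raise ValueError(f"units must be between 1 and {max_units}, got {units}")
    return {name: value * units for name, value in QUOTA_UNIT.items()}
```

With this shape, a single request for more units raises the whole group of limits together, while the per-limit API remains available for one-off changes.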

friedrichg · 2025-06-27T18:58:39Z

> What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach?
> I think changing the limits individually is also useful but since they usually scale as a group I would love this to support these quota units as well, as hard limits for what can be configured but also as a way to batch increase overrides using the api.

I believe you mean that increasing the quota would increase a couple of limits. But I think there is a misunderstanding: overrides are more than just limits; they also include configuration like DisabledRuleGroups and OutOfOrderTimeWindow.

pull-request-size bot added the size/M label Jun 14, 2025

dosubot bot added the component/documentation label Jun 14, 2025

friedrichg reviewed Jun 16, 2025

Bogdan Stancu added 3 commits June 19, 2025 20:07

Add proposal for tenant limits API

65b4b79

Signed-off-by: Bogdan Stancu <[email protected]>

Change endpoints

c7b0867

Signed-off-by: Bogdan Stancu <[email protected]>

suggestions

e2eca89

Signed-off-by: Bogdan Stancu <[email protected]>

bogdan-at-adobe force-pushed the limits-api-proposal branch from 2ca6d4b to e2eca89 Compare June 19, 2025 17:09

harry671003 reviewed Jun 20, 2025

support only block storage

3544ba3

Signed-off-by: Bogdan Stancu <[email protected]>

bogdan-at-adobe force-pushed the limits-api-proposal branch from 7990645 to 3544ba3 Compare June 24, 2025 17:16

friedrichg reviewed Jun 26, 2025
