Add proposal for tenant limits API #6818

Open
wants to merge 4 commits into
base: master
from

Conversation

bogdan-at-adobe

@bogdan-at-adobe bogdan-at-adobe commented Jun 14, 2025

What this PR does:

This PR adds a proposal for a tenant limits API.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@bogdan-at-adobe
Author

Hello, this has been bothering my team and me for a while.
I would love to work on this, even though I don't know much about what it would take to implement and will probably need some guidance.

Member

@friedrichg friedrichg left a comment

Thanks for this. It's a great idea.

Bogdan Stancu added 3 commits June 19, 2025 20:07
Signed-off-by: Bogdan Stancu <[email protected]>
Signed-off-by: Bogdan Stancu <[email protected]>
}
```

#### 2. PUT /api/v1/user-limits
Contributor

The limits are managed by the runtime-config, which is either stored on a volume backed by a ConfigMap or loaded from an S3/GCS/Azure bucket.

  • How would this API work in the former case?

Author

The end goal is to remove admin intervention for user limits. My initial idea was to write to either the ConfigMap or the S3/GCS/Azure bucket, but I'm not 100% sure of all the implications, other than that it requires more access.

Member

Good question. @bogdan-at-adobe ConfigMaps are normally read-only. Put it in the spec that the API will not support ConfigMaps, only block storage backends.
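
A minimal Go sketch of that rule, assuming the service knows which kind of backend holds the runtime config. The `Backend` type, its constants, and `CheckWritable` are hypothetical names used only for illustration; they are not existing Cortex identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical label for where the runtime config is stored.
type Backend string

const (
	BackendConfigMap Backend = "configmap"
	BackendS3        Backend = "s3"
	BackendGCS       Backend = "gcs"
	BackendAzure     Backend = "azure"
)

// ErrReadOnlyBackend is returned when the backend cannot be written by the API.
var ErrReadOnlyBackend = errors.New("runtime config backend is read-only; overrides cannot be written through the API")

// CheckWritable reports whether a write request may proceed for the given
// backend. Only the bucket backends are treated as writable; anything else,
// including a ConfigMap-backed volume, is rejected.
func CheckWritable(b Backend) error {
	switch b {
	case BackendS3, BackendGCS, BackendAzure:
		return nil
	default:
		return ErrReadOnlyBackend
	}
}

func main() {
	for _, b := range []Backend{BackendConfigMap, BackendS3} {
		fmt.Printf("%s: %v\n", b, CheckWritable(b))
	}
}
```

A PUT handler could call such a check first and return an error response before touching storage.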

### Endpoints

#### 1. GET /api/v1/user-limits
Returns the current limits configuration for a specific tenant.
Contributor

Currently the limits are reloaded periodically at a fixed interval. Would this API read the config directly from storage?

Author

I think reading from the currently loaded config is good enough for this. Changes made through the API will also trigger a reload of the loaded limits, so the only issue I see is a manual config change that has not been reloaded yet, which would make the API return a stale answer for at most 10 seconds (assuming the default reload period). Such a change is probably made by an admin who is aware of this implication. I might be wrong on this. The GET /runtime_config endpoint makes the same assumptions.

Member

I think the answer is yes, S3 is our only source of truth. This is similar to how the Cortex Alertmanager API works.
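
To make the difference concrete, here is a minimal Go sketch of a read path that goes to storage on every request instead of using a periodically reloaded snapshot. `OverridesStore`, `memStore`, `TenantOverrides`, and `GetOverrides` are hypothetical names; the in-memory store only stands in for a bucket so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// TenantOverrides maps an override name to its numeric value (illustrative only).
type TenantOverrides map[string]float64

// OverridesStore is a hypothetical abstraction over the bucket that holds
// the runtime config.
type OverridesStore interface {
	ReadAll() (map[string]TenantOverrides, error)
}

// memStore is an in-memory stand-in for a real bucket client.
type memStore struct {
	data map[string]TenantOverrides
}

func (m *memStore) ReadAll() (map[string]TenantOverrides, error) {
	return m.data, nil
}

// ErrTenantNotFound is returned when storage holds no overrides for a tenant.
var ErrTenantNotFound = errors.New("no overrides for tenant")

// GetOverrides reads from storage on every call, so the answer always
// reflects what is stored rather than the last reloaded snapshot.
func GetOverrides(s OverridesStore, tenant string) (TenantOverrides, error) {
	all, err := s.ReadAll()
	if err != nil {
		return nil, fmt.Errorf("read overrides from storage: %w", err)
	}
	o, ok := all[tenant]
	if !ok {
		return nil, ErrTenantNotFound
	}
	return o, nil
}

func main() {
	store := &memStore{data: map[string]TenantOverrides{
		"user1": {"ingestion_rate": 350000},
	}}
	o, _ := GetOverrides(store, "user1")
	fmt.Println(o["ingestion_rate"])

	// A change in storage is visible on the very next read.
	store.data["user1"]["ingestion_rate"] = 700000
	o, _ = GetOverrides(store, "user1")
	fmt.Println(o["ingestion_rate"])
}
```

The trade-off is one storage read per request in exchange for never serving a value that storage no longer holds.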


### Endpoints

#### 1. GET /api/v1/user-limits
Contributor

Which component in Cortex will serve this API? Maybe a new admin service in Cortex for this purpose?

Author

@bogdan-at-adobe bogdan-at-adobe Jun 23, 2025

As @friedrichg said:

> We already have a component that reads limits, so it's perfect for this use case.

So my guess is that cortex-overrides is a good place. That said, since GET /runtime_config is exposed on all components, I don't see a reason why the limits API couldn't be as well.

Member

We only want this in cortex-overrides; there is no need to put it in the other components.

Signed-off-by: Bogdan Stancu <[email protected]>
Member

@friedrichg friedrichg left a comment

Looks good so far.

In terms of design, I have a question on soft vs hard limits. So far all our limits are hard and this API modifies those limits.

But what if a malicious tenant tries to set unreasonable limits for themselves? Do we allow it, or do we also find a way to prevent tenants from doing this through the API?

You can answer the question or leave it as an open question. I think it will need to be tackled eventually.

@harry671003
Contributor

I think we need some validation in this API to check whether a limit update is within a safe range.

Also, not all limits should be modifiable through this API. Limits like shard_size should only be modified by the admin.
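
A minimal Go sketch of both checks, assuming an allow-list keyed by limit name with an inclusive safe range for each entry. The `Range` type, the `tenantEditable` table, its bounds, and `ValidateUpdate` are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Range is an inclusive safe range for a numeric limit.
type Range struct {
	Min, Max float64
}

// tenantEditable is a hypothetical allow-list of limits that may be changed
// through the API. Anything absent from it (for example shard_size) stays
// admin-only. The bounds are made-up values.
var tenantEditable = map[string]Range{
	"ingestion_rate":       {Min: 1, Max: 1000000},
	"max_series_per_query": {Min: 1, Max: 500000},
}

// ValidateUpdate rejects an update that touches a limit outside the
// allow-list or sets a value outside that limit's safe range.
func ValidateUpdate(update map[string]float64) error {
	for name, v := range update {
		r, ok := tenantEditable[name]
		if !ok {
			return fmt.Errorf("%q cannot be modified through this API", name)
		}
		if v < r.Min || v > r.Max {
			return fmt.Errorf("%q=%v is outside the allowed range [%v, %v]", name, v, r.Min, r.Max)
		}
	}
	return nil
}

func main() {
	fmt.Println(ValidateUpdate(map[string]float64{"ingestion_rate": 500000}))
	fmt.Println(ValidateUpdate(map[string]float64{"ingestion_rate": 5000000}))
	fmt.Println(ValidateUpdate(map[string]float64{"shard_size": 3}))
}
```

Keeping the allow-list as data rather than code would let an admin adjust it without a new release, though where that table should live is an open design question.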

@bogdan-at-adobe
Author

I agree that many limits should only be modified by an admin. We use the tiers defined in the cortex-jsonnet repo, and I wrote this proposal with those in mind; there have been few occasions when users needed increases for limits other than those.

Regarding unreasonable limits, I thought a bit about this and, at least in our use case, defining "reasonable" would be pretty hard. It might be perfectly reasonable for a huge user to double their data overnight and not very reasonable for a small one to do the same thing.
Are you thinking about dynamically changing the upper limit based on their current usage? I'd like that, but I don't really know the implications.
Setting some hard limit "in the middle" for everyone might encourage all small users to max it out, because why not, and big users won't get any benefit.
I guess having a "limit of limits" per tenant might be the correct answer here, as I have seen that work in other systems. It would turn 5 limit increase requests into 1-2 hard limit increases.
Do you have any other suggestions for this? I'd like to add it to this doc, as I think it is a pretty important matter.

@@ -0,0 +1,71 @@
---
title: "Limits API"
Member

Suggested change
-title: "Limits API"
+title: "Overrides API"

@@ -0,0 +1,71 @@
---
title: "Limits API"
linkTitle: "Limits API"
Member

Suggested change
-linkTitle: "Limits API"
+linkTitle: "Overrides API"


### Endpoints

#### 1. GET /api/v1/user-limits
Member

Suggested change
-#### 1. GET /api/v1/user-limits
+#### 1. GET /api/v1/user-overrides

I just noticed you keep saying limits, but this is more about overrides. That is the name used in Cortex: https://cortexmetrics.io/docs/guides/overrides-exporter/

@friedrichg
Member

Call it hard overrides.

For example:
# file: runtime.yaml
# In this example, we're overriding ingestion limits for a single tenant.
overrides:
  "user1":
    ingestion_burst_size: 350000
    ingestion_rate: 350000
    max_global_series_per_metric: 300000
    max_global_series_per_user: 300000
    max_series_per_metric: 0
    max_series_per_user: 0
    max_samples_per_query: 100000
    max_series_per_query: 100000
configurable-overrides:
  "user1":
    ingestion_rate: 700000
    max_global_series_per_user: 700000

configurable-overrides or hard-overrides: I don't know which one communicates the situation better.
Everything defined in configurable-overrides can be modified in the overrides.
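
A minimal Go sketch of how such a check might look, using the values from the example above. It assumes that each `configurable-overrides` entry acts as an upper bound for the matching override and that overrides without an entry stay admin-only; both are interpretations of the comment, not confirmed behavior. `CheckAgainstCeilings` is a hypothetical name.

```go
package main

import "fmt"

// CheckAgainstCeilings validates a tenant's requested overrides against that
// tenant's configurable-overrides entries.
// Assumption: a requested name with no ceiling entry is not tenant-editable,
// and a requested value may not exceed its ceiling.
func CheckAgainstCeilings(requested, ceilings map[string]float64) error {
	for name, v := range requested {
		ceiling, ok := ceilings[name]
		if !ok {
			return fmt.Errorf("%q has no configurable-overrides entry and cannot be changed", name)
		}
		if v > ceiling {
			return fmt.Errorf("%q=%v exceeds the hard override %v", name, v, ceiling)
		}
	}
	return nil
}

func main() {
	// Ceilings for "user1", taken from the runtime.yaml example.
	ceilings := map[string]float64{
		"ingestion_rate":             700000,
		"max_global_series_per_user": 700000,
	}
	fmt.Println(CheckAgainstCeilings(map[string]float64{"ingestion_rate": 500000}, ceilings))
	fmt.Println(CheckAgainstCeilings(map[string]float64{"ingestion_rate": 800000}, ceilings))
	fmt.Println(CheckAgainstCeilings(map[string]float64{"max_samples_per_query": 200000}, ceilings))
}
```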

@bogdan-at-adobe
Author

What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach?
I think changing the limits individually is also useful, but since they usually scale as a group, I would love this to support quota units as well, both as hard limits on what can be configured and as a way to batch-increase overrides through the API.
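
A minimal Go sketch of the quota-unit idea, assuming one unit is a fixed set of default values and a tenant's hard limit is a maximum number of units. `ScaleQuota`, the limit names, and the numbers are hypothetical.

```go
package main

import "fmt"

// ScaleQuota multiplies every value in one quota unit by the requested number
// of units. The request is rejected when it falls outside 1..maxUnits, where
// maxUnits plays the role of the tenant's hard limit.
func ScaleQuota(unit map[string]float64, units, maxUnits int) (map[string]float64, error) {
	if units < 1 || units > maxUnits {
		return nil, fmt.Errorf("requested %d quota units, allowed range is 1..%d", units, maxUnits)
	}
	out := make(map[string]float64, len(unit))
	for name, v := range unit {
		out[name] = v * float64(units)
	}
	return out, nil
}

func main() {
	// One quota unit: made-up default values.
	unit := map[string]float64{
		"ingestion_rate":       25000,
		"max_series_per_query": 100000,
	}
	overrides, err := ScaleQuota(unit, 4, 10)
	fmt.Println(overrides, err)

	_, err = ScaleQuota(unit, 11, 10)
	fmt.Println(err)
}
```

This only covers numeric limits that scale together; settings that are not numeric would need separate handling.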

@friedrichg
Member

> What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach?
> I think changing the limits individually is also useful but since they usually scale as a group I would love this to support these quota units as well, as hard limits for what can be configured but also as a way to batch increase overrides using the api.

I believe you mean that increasing the quota would increase several limits at once. But I think there is a misunderstanding: overrides are more than just limits. They also include configuration such as DisabledRuleGroups and OutOfOrderTimeWindow.

3 participants