Skip to content

Restrict cluster prometheus rules to metrics of the according cluster #555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

fpfuetsch
Copy link
Contributor

Closes #554

@fpfuetsch fpfuetsch requested a review from itay-grudev as a code owner April 11, 2025 10:36
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. chart( cluster ) Related to the cluster chart labels Apr 11, 2025
@itay-grudev
Copy link
Collaborator

itay-grudev commented Apr 11, 2025

@fpfuetsch I have an even better solution. I rewrote almost all the alerts so they can be deployed by the operator chart and the rules are rewritten to match against the cluster name.

Here is an example for CNPGClusterInstancesOnSameNode:

count by (namespace, created_by_name, node) (kube_pod_info{created_by_kind="Cluster"}) > 1

or with a cluster label:

count by ({{ .Values.monitoring.prometheusRule.clusterLabel }}, namespace, created_by_name, node) (kube_pod_info{created_by_kind="Cluster"}) > 1

Here the cluster name is:

labels:
  severity: warning
  {{ .Values.monitoring.prometheusRule.clusterLabel }}: $labels.{{ .Values.monitoring.prometheusRule.clusterLabel }}
  namespace: {{ "{{ $labels.namespace }}" }}
  cnpg_cluster: {{ "{{ $labels.created_by_name }}" }}

I already have these. I just need to test them.

@itay-grudev
Copy link
Collaborator

And rewrite them in the context of the operator chart.

@fpfuetsch
Copy link
Contributor Author

fpfuetsch commented Apr 11, 2025

@itay-grudev I have two questions regarding your approach:

  • Would it be then still possible to exclude certain alerts for certain clusters (like test clusters with only a single instance)?
  • Think about a multi tenant k8s environments, with one team providing the operator while multiple other teams are consuming it to create pg cluster. For the consuming teams it would be useful to add custom labels to the alert rules of their clusters (currently the labels are static: namespace, cnpg_cluster, severity) to filter relevant alerts for them, would this be possible to implement when the rules are managed centralized?

I personally like the idea that alert rules for the clusters are part of the cluster deployment while alert rules for the operator are part of the operator deployment.

@itay-grudev
Copy link
Collaborator

No. You'll have to implement the ignore logic in your alert routing or escalation software - like Grafana Oncall.

@fpfuetsch
Copy link
Contributor Author

fpfuetsch commented Apr 11, 2025

So you prefer opt-out (ignoring, silencing) instead of opt-in for alerts?

@itay-grudev
Copy link
Collaborator

I am not sure, but it would make sense to have a set of alerts shipped with the operator. The only reason I implemented them at the cluster level is because I never figured out how to implement a Cluster Offline Alert without knowledge of the existence of a cluster.

@fpfuetsch
Copy link
Contributor Author

@itay-grudev What do you think about merging this change until we found a good way to translate the alerts for the operator chart?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chart( cluster ) Related to the cluster chart size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster Chart: Prometheus rule expressions are not cluster specific
2 participants