
Commit 01b5314

[WIP] OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE alert
1 parent c6ad107 commit 01b5314

9 files changed: +638 -5 lines changed

pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE/machineHealthCheckUnterminatedShortCircuitSRE.go

+436
Large diffs are not rendered by default.

pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE/metadata.yaml

+1 -1
@@ -18,4 +18,4 @@ rbac:
         - ""
       resources:
         - "nodes"
-customerDataAccess: true
+customerDataAccess: false
@@ -0,0 +1,90 @@

# Testing MachineHealthCheckUnterminatedShortCircuitSRE

The `MachineHealthCheckUnterminatedShortCircuitSRE` alert is derived from the `MachineHealthCheck` objects in the `openshift-machine-api` namespace on the cluster. Specifically, it is triggered when the number of nodes matching one of the `.spec.unhealthyConditions` meets or exceeds the `.spec.maxUnhealthy` value for a [duration of time](https://github.com/openshift/managed-cluster-config/blob/3338dd375fa6517d7768eca985c3ca115bbc1484/deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml#L16).
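
To see how close a given `MachineHealthCheck` is to short-circuiting, its threshold and current health counts can be compared directly. A minimal sketch, assuming backplane access to the cluster (the `currentHealthy`/`expectedMachines` status fields come from the `machine.openshift.io/v1beta1` API):

```sh
# Compare each MachineHealthCheck's maxUnhealthy threshold against its current health counts
oc get machinehealthcheck -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,MAXUNHEALTHY:.spec.maxUnhealthy,CURRENTHEALTHY:.status.currentHealthy,EXPECTED:.status.expectedMachines
```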

## Setup

Before applying any test configuration, be sure to [pause the hive syncset](https://github.com/openshift/ops-sop/blob/master/v4/knowledge_base/pause-syncset.md) to avoid having these changes overwritten mid-test.

Then, apply the following changes to the `MachineHealthCheck` object(s) you wish to test against, to make testing a little easier:
- Reducing the `.spec.maxUnhealthy` value lowers the number of nodes that need to be "broken" to short-circuit the machine-api-operator and halt remediation
- Reducing the `.spec.unhealthyConditions` timeouts ensures that the machine-api short-circuits much more quickly after the nodes are modified

A patched version of the default `MachineHealthCheck/srep-worker-healthcheck` object is pre-configured [here](./srep-worker-healthcheck_machinehealthcheck.yaml). Use the following command to apply it to your test cluster:

```sh
ocm backplane elevate "testing CAD" -- replace -f ./srep-worker-healthcheck_machinehealthcheck.yaml
```
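
To confirm the patched object is in place before continuing, a quick sanity check of the lowered threshold and timeouts (a sketch; the fields correspond to the spec in the file above):

```sh
# Should report maxUnhealthy=0 and 10s timeouts if the replace succeeded
oc get machinehealthcheck srep-worker-healthcheck -n openshift-machine-api \
  -o jsonpath='{.spec.maxUnhealthy}{"\n"}{.spec.unhealthyConditions}{"\n"}'
```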

## Test Cases

Because it has a fairly broad definition, a `MachineHealthCheckUnterminatedShortCircuitSRE` alert could fire as a result of several different scenarios. A few are outlined below, along with methods to reproduce them.

### Nodes `NotReady`, machines `Running`

While the `machine-api-operator` owns and operates the `machine` object-type, it's important to note that its `MachineHealthCheck` objects actually utilize the `.status` of the corresponding **node** to determine whether a machine is healthy. This is because a machine's status only reflects whether the VM in the cloud provider is running, while the node's status indicates whether the instance is a functional part of the cluster. As a result, a `MachineHealthCheckUnterminatedShortCircuitSRE` alert can fire while all `machines` have a `.status.phase` of `Running`.

The simplest way to reproduce this is to log in to multiple nodes at once and stop the `kubelet.service` on each.

This can be done via the debug command:

```sh
ocm backplane elevate "testing CAD" -- debug node/$NODE
```

Inside the debug container, run:

```sh
chroot /host
systemctl stop kubelet.service
```

Stopping the kubelet also terminates the debug pod automatically. The node's status should flip to `NotReady` shortly thereafter.
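
Once enough nodes have gone `NotReady`, the node/machine mismatch described above can be observed side by side (a sketch):

```sh
# Nodes report NotReady...
oc get nodes
# ...while the corresponding machines still report a Running phase
oc get machines -n openshift-machine-api -o wide
```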

### Nodes stuck deleting due to customer workloads not draining

Components like the cluster-autoscaler will try to automatically manage cluster machines via scale-up/scale-down actions based on overall resource utilization. If workloads cannot be drained during a scale-down operation, several nodes can get stuck in a `SchedulingDisabled` phase concurrently, triggering the alert.

To simulate this, create a pod that will have to be drained prior to deleting the machine, alongside an (incorrectly configured) PDB that will prevent it from being drained:

```sh
ocm backplane elevate "testing CAD" -- create -f ./unstoppable_workload.yaml -f ./unstoppable_pdb.yaml
```
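
The PDB's `maxUnavailable: 0` is what blocks the drain: it allows zero voluntary disruptions. This can be confirmed before triggering the scale-down (a sketch):

```sh
# ALLOWED DISRUPTIONS should be 0, so eviction of the test-cad pod will be blocked
oc get pdb test-cad -n default
oc get po -n default -l app=test-cad -o wide
```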

Next, simulate a scale-down by patching the machineset the pod is running on. *NOTE*: Simply deleting the machine will result in it being replaced, and will not trigger a `MachineHealthCheckUnterminatedShortCircuitSRE` alert.

```sh
# Get the pod's node
NODE=$(oc get po -n default -l app=test-cad -o jsonpath="{.items[].spec.nodeName}")
# Get the node's machineset
MACHINESET=$(oc get machines -A -o json | jq --arg NODE "${NODE}" -r '.items[] | select(.status.nodeRef.name == $NODE) | .metadata.labels["machine.openshift.io/cluster-api-machineset"]')
# Scale down the node's machineset
oc scale machineset/${MACHINESET} --replicas=0 -n openshift-machine-api
```
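
If the drain is blocked as intended, the machine should hang in a `Deleting` phase while its node stays cordoned. One way to watch for this (a sketch, reusing the `$NODE` and `$MACHINESET` variables from above):

```sh
# The machine stays in Deleting while the blocked drain is retried
oc get machines -n openshift-machine-api -l machine.openshift.io/cluster-api-machineset=${MACHINESET}
# The node shows Ready,SchedulingDisabled once it has been cordoned for the drain
oc get node "${NODE}"
```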

### Machines in `Failed` phase

Having several machines in a `Failed` state still violates a `MachineHealthCheck`'s `.spec.maxUnhealthy` threshold, despite the machines not having any corresponding nodes to check against.

One method to simulate this is to edit the machineset so that it contains an invalid configuration. The following patch updates a worker machineset to use the nonexistent `fakeinstancetype` instance type, for example:

```sh
ocm backplane elevate "testing CAD" -- patch machinesets $MACHINESET -n openshift-machine-api --type merge -p '{"spec": {"template": {"spec": {"providerSpec": {"value": {"instanceType": "fakeinstancetype"}}}}}}'
oc delete machine -n openshift-machine-api -l machine.openshift.io/cluster-api-machineset=$MACHINESET
```
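
Once the old machines are deleted, their replacements should fail to provision and report a `Failed` phase with no node reference (a sketch):

```sh
# Replacement machines should show PHASE=Failed and an empty NODE column
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name
```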

### Machines with no phase

TODO - does this trigger a healthcheck alert?

Remove the operator role from the AWS IAM > Roles page and see if it triggers the alert.

## Additional Resources

The following pages may be useful if the information in this guide is insufficient or has become stale.

- [Machine API Brief](https://github.com/openshift/machine-api-operator/blob/main/docs/user/machine-api-operator-overview.md)
- [Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/main/FAQ.md)
- [MachineHealthCheck documentation](https://docs.redhat.com/en/documentation/openshift_container_platform/4.17/html/machine_management/deploying-machine-health-checks#machine-health-checks-resource_deploying-machine-health-checks)
- [Alert SOP](https://github.com/openshift/ops-sop/blob/master/v4/alerts/MachineHealthCheckUnterminatedShortCircuitSRE.md)

srep-worker-healthcheck_machinehealthcheck.yaml

@@ -0,0 +1,53 @@

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: srep-worker-healthcheck
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 0
  nodeStartupTimeout: 25m
  selector:
    matchExpressions:
      - key: machine.openshift.io/cluster-api-machine-role
        operator: NotIn
        values:
          - infra
          - master
      - key: machine.openshift.io/cluster-api-machineset
        operator: Exists
      - key: machine.openshift.io/instance-type
        operator: NotIn
        values:
          - m5.metal
          - m5d.metal
          - m5n.metal
          - m5dn.metal
          - m5zn.metal
          - m6a.metal
          - m6i.metal
          - m6id.metal
          - r5.metal
          - r5d.metal
          - r5n.metal
          - r5dn.metal
          - r6a.metal
          - r6i.metal
          - r6id.metal
          - x2iezn.metal
          - z1d.metal
          - c5.metal
          - c5d.metal
          - c5n.metal
          - c6a.metal
          - c6i.metal
          - c6id.metal
          - i3.metal
          - i3en.metal
          - r7i.48xlarge
  unhealthyConditions:
    - status: "False"
      timeout: 10s
      type: Ready
    - status: Unknown
      timeout: 10s
      type: Ready

unstoppable_pdb.yaml

@@ -0,0 +1,11 @@

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-cad
  namespace: default
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: "test-cad"
  unhealthyPodEvictionPolicy: AlwaysAllow

unstoppable_workload.yaml

@@ -0,0 +1,33 @@

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: "test-cad"
  name: test-cad
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "test-cad"
  template:
    metadata:
      labels:
        app: "test-cad"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists
              weight: 1
      containers:
        - command:
            - "sleep"
            - "infinity"
          image: "quay.io/app-sre/ubi8-ubi:latest"
          imagePullPolicy: IfNotPresent
          name: test
      restartPolicy: Always

pkg/investigations/registry.go

+2
@@ -6,6 +6,7 @@ import (
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/clustermonitoringerrorbudgetburn"
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/cpd"
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/investigation"
+    machinehealthcheckunterminatedshortcircuitsre "github.com/openshift/configuration-anomaly-detection/pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE"
 )

 // availableInvestigations holds all Investigation implementations.
@@ -14,6 +15,7 @@ var availableInvestigations = []investigation.Investigation{
     &chgm.Investiation{},
     &clustermonitoringerrorbudgetburn.Investigation{},
     &cpd.Investigation{},
+    &machinehealthcheckunterminatedshortcircuitsre.Investigation{},
 }

 // GetInvestigation returns the first Investigation that applies to the given alert title.

pkg/k8s/client.go

+8 -1
@@ -5,6 +5,7 @@ import (
     "os"

     configv1 "github.com/openshift/api/config/v1"
+    machinev1beta1 "github.com/openshift/api/machine/v1beta1"
     "github.com/openshift/backplane-cli/pkg/cli/config"
     bpremediation "github.com/openshift/backplane-cli/pkg/remediation"
     "github.com/openshift/configuration-anomaly-detection/pkg/ocm"
@@ -21,7 +22,7 @@ func New(clusterID string, ocmClient ocm.Client, remediation string) (client.Cli
 
     cfg, err := bpremediation.CreateRemediationWithConn(config.BackplaneConfiguration{URL: backplaneURL}, ocmClient.GetConnection(), clusterID, remediation)
     if err != nil {
-        return nil, err
+        return nil, fmt.Errorf("failed to create remediation: %w", err)
     }

     scheme, err := initScheme()
@@ -53,5 +54,11 @@ func initScheme() (*runtime.Scheme, error) {
     if err := configv1.Install(scheme); err != nil {
         return nil, fmt.Errorf("unable to add openshift/api/config scheme: %w", err)
     }
+
+    // Add machine.openshift.io/v1beta1 to scheme for machine and machinehealthcheck objects
+    if err := machinev1beta1.Install(scheme); err != nil {
+        return nil, fmt.Errorf("unable to add openshift/api/machine/v1beta1 scheme: %w", err)
+    }
+
     return scheme, nil
 }

test/generate_incident.sh

+4 -3
@@ -16,6 +16,7 @@ declare -A alert_mapping=(
     ["ClusterHasGoneMissing"]="cadtest has gone missing"
     ["ClusterProvisioningDelay"]="ClusterProvisioningDelay -"
     ["ClusterMonitoringErrorBudgetBurnSRE"]="ClusterMonitoringErrorBudgetBurnSRE Critical (1)"
+    ["MachineHealthCheckUnterminatedShortCircuitSRE"]="MachineHealthCheckUnterminatedShortCircuitSRE CRITICAL (1)"
 )

 # Function to print help message
@@ -25,7 +26,7 @@ print_help() {
     for alert_name in "${!alert_mapping[@]}"; do
         echo -n "$alert_name, "
     done
-    echo
+    echo
 }
 # Check if the correct number of arguments is provided
 if [ "$#" -ne 2 ]; then
@@ -49,9 +50,9 @@ alert_title="${alert_mapping[$alert_name]}"
 # Load testing routing key and test service url from vault
 export VAULT_ADDR="https://vault.devshift.net"
 export VAULT_TOKEN="$(vault login -method=oidc -token-only)"
-for v in $(vault kv get -format=json osd-sre/configuration-anomaly-detection/cad-testing | jq -r ".data.data|to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]"); do export $v; done
+for v in $(vault kv get -format=json osd-sre/configuration-anomaly-detection/cad-testing | jq -r ".data.data|to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]"); do export $v; done
 unset VAULT_ADDR VAULT_TOKEN
-echo
+echo

 dedup_key=$(uuidgen)
