
Commit 01b5314

[WIP] OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE alert
1 parent c6ad107 commit 01b5314

9 files changed: +638 -5 lines changed

pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE/machineHealthCheckUnterminatedShortCircuitSRE.go

+436
Large diffs are not rendered by default.

pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE/metadata.yaml

+1 -1
@@ -18,4 +18,4 @@ rbac:
         - ""
       resources:
         - "nodes"
-customerDataAccess: true
+customerDataAccess: false
@@ -0,0 +1,90 @@

# Testing MachineHealthCheckUnterminatedShortCircuitSRE

The `MachineHealthCheckUnterminatedShortCircuitSRE` alert is derived from the `MachineHealthCheck` objects in the `openshift-machine-api` namespace on the cluster. Specifically, it is triggered when the number of nodes matching one of the `.spec.unhealthyConditions` meets or exceeds the `.spec.maxUnhealthy` value for a [duration of time](https://github.com/openshift/managed-cluster-config/blob/3338dd375fa6517d7768eca985c3ca115bbc1484/deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml#L16).
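
To see how close a given `MachineHealthCheck` is to short-circuiting, its threshold and current health counts can be compared directly. A minimal sketch, assuming backplane access to the cluster (the `currentHealthy`/`expectedMachines` status fields come from the `machine.openshift.io/v1beta1` API):

```sh
# Compare each MachineHealthCheck's maxUnhealthy threshold against its current health counts
oc get machinehealthcheck -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,MAXUNHEALTHY:.spec.maxUnhealthy,CURRENTHEALTHY:.status.currentHealthy,EXPECTED:.status.expectedMachines
```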

## Setup

Before applying any test configuration, be sure to [pause the hive syncset](https://github.com/openshift/ops-sop/blob/master/v4/knowledge_base/pause-syncset.md) to avoid having these changes overwritten mid-test.

Then, apply the following changes to the `MachineHealthCheck` object(s) you wish to test against, to make testing a little easier:
- Reducing the `.spec.maxUnhealthy` value lowers the number of nodes that need to be "broken" to short-circuit the machine-api-operator and halt remediation
- Reducing the `.spec.unhealthyConditions` timeouts ensures that the machine-api short-circuits much more quickly after the nodes are modified

A patched version of the default `MachineHealthCheck/srep-worker-healthcheck` object is pre-configured [here](./srep-worker-healthcheck_machinehealthcheck.yaml). Use the following command to apply it to your test cluster:

```sh
ocm backplane elevate "testing CAD" -- replace -f ./srep-worker-healthcheck_machinehealthcheck.yaml
```
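
To confirm the patched object is in place before continuing, a quick sanity check of the lowered threshold and timeouts (a sketch; the fields correspond to the spec in the file above):

```sh
# Should report maxUnhealthy=0 and 10s timeouts if the replace succeeded
oc get machinehealthcheck srep-worker-healthcheck -n openshift-machine-api \
  -o jsonpath='{.spec.maxUnhealthy}{"\n"}{.spec.unhealthyConditions}{"\n"}'
```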

## Test Cases

Because it has a fairly broad definition, a `MachineHealthCheckUnterminatedShortCircuitSRE` alert could fire as a result of several different scenarios. A few are outlined below, along with methods to reproduce them.

### Nodes `NotReady`, machines `Running`

While the `machine-api-operator` owns and operates the `machine` object-type, it's important to note that its `MachineHealthCheck` objects actually utilize the `.status` of the corresponding **node** to determine whether a machine is healthy. This is because a machine's status only reflects whether the VM in the cloud provider is running, while the node's status indicates whether the instance is a functional part of the cluster. As a result, a `MachineHealthCheckUnterminatedShortCircuitSRE` alert can fire while all `machines` have a `.status.phase` of `Running`.

The simplest way to reproduce this is to log in to multiple nodes at once and stop the `kubelet.service` on each.

This can be done via the debug command:

```sh
ocm backplane elevate "testing CAD" -- debug node/$NODE
```

Inside the debug container, run:

```sh
chroot /host
systemctl stop kubelet.service
```

Stopping the kubelet also terminates the debug pod automatically. The node's status should flip to `NotReady` shortly thereafter.
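
Once enough nodes have gone `NotReady`, the node/machine mismatch described above can be observed side by side (a sketch):

```sh
# Nodes report NotReady...
oc get nodes
# ...while the corresponding machines still report a Running phase
oc get machines -n openshift-machine-api -o wide
```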

### Nodes stuck deleting due to customer workloads not draining

Components like the cluster-autoscaler will try to automatically manage cluster machines via scale-up/scale-down actions based on overall resource utilization. If workloads cannot be drained during a scale-down operation, several nodes can get stuck in a `SchedulingDisabled` phase concurrently, triggering the alert.

To simulate this, create a pod that will have to be drained prior to deleting the machine, alongside an (incorrectly configured) PDB that will prevent it from being drained:

```sh
ocm backplane elevate "testing CAD" -- create -f ./unstoppable_workload.yaml -f ./unstoppable_pdb.yaml
```
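
The PDB's `maxUnavailable: 0` is what blocks the drain: it allows zero voluntary disruptions. This can be confirmed before triggering the scale-down (a sketch):

```sh
# ALLOWED DISRUPTIONS should be 0, so eviction of the test-cad pod will be blocked
oc get pdb test-cad -n default
oc get po -n default -l app=test-cad -o wide
```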

Next, simulate a scale-down by patching the machineset the pod is running on. *NOTE*: Simply deleting the machine will result in it being replaced, and will not trigger a `MachineHealthCheckUnterminatedShortCircuitSRE` alert.

```sh
# Get the pod's node
NODE=$(oc get po -n default -l app=test-cad -o jsonpath="{.items[].spec.nodeName}")
# Get the node's machineset
MACHINESET=$(oc get machines -A -o json | jq --arg NODE "${NODE}" -r '.items[] | select(.status.nodeRef.name == $NODE) | .metadata.labels["machine.openshift.io/cluster-api-machineset"]')
# Scale down the node's machineset
oc scale machineset/${MACHINESET} --replicas=0 -n openshift-machine-api
```
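
If the drain is blocked as intended, the machine should hang in a `Deleting` phase while its node stays cordoned. One way to watch for this (a sketch, reusing the `$NODE` and `$MACHINESET` variables from above):

```sh
# The machine stays in Deleting while the blocked drain is retried
oc get machines -n openshift-machine-api -l machine.openshift.io/cluster-api-machineset=${MACHINESET}
# The node shows Ready,SchedulingDisabled once it has been cordoned for the drain
oc get node "${NODE}"
```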

### Machines in `Failed` phase

Having several machines in a `Failed` state still violates a `MachineHealthCheck`'s `.spec.maxUnhealthy` threshold, despite the machines not having any corresponding nodes to check against.

One method to simulate this is to edit the machineset so that it contains an invalid configuration. The following patch updates a worker machineset to use the nonexistent `fakeinstancetype` instance type, for example:

```sh
ocm backplane elevate "testing CAD" -- patch machinesets $MACHINESET -n openshift-machine-api --type merge -p '{"spec": {"template": {"spec": {"providerSpec": {"value": {"instanceType": "fakeinstancetype"}}}}}}'
oc delete machine -n openshift-machine-api -l machine.openshift.io/cluster-api-machineset=$MACHINESET
```
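
Once the old machines are deleted, their replacements should fail to provision and report a `Failed` phase with no node reference (a sketch):

```sh
# Replacement machines should show PHASE=Failed and an empty NODE column
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name
```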

### Machines with no phase

TODO - does this trigger a healthcheck alert?

Remove the operator role from the AWS IAM > Roles page and see if it triggers the alert.

## Additional Resources

The following pages may be useful if the information in this guide is insufficient or has become stale.

- [Machine API Brief](https://github.com/openshift/machine-api-operator/blob/main/docs/user/machine-api-operator-overview.md)
- [Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/main/FAQ.md)
- [MachineHealthCheck documentation](https://docs.redhat.com/en/documentation/openshift_container_platform/4.17/html/machine_management/deploying-machine-health-checks#machine-health-checks-resource_deploying-machine-health-checks)
- [Alert SOP](https://github.com/openshift/ops-sop/blob/master/v4/alerts/MachineHealthCheckUnterminatedShortCircuitSRE.md)

srep-worker-healthcheck_machinehealthcheck.yaml

@@ -0,0 +1,53 @@

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: srep-worker-healthcheck
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 0
  nodeStartupTimeout: 25m
  selector:
    matchExpressions:
      - key: machine.openshift.io/cluster-api-machine-role
        operator: NotIn
        values:
          - infra
          - master
      - key: machine.openshift.io/cluster-api-machineset
        operator: Exists
      - key: machine.openshift.io/instance-type
        operator: NotIn
        values:
          - m5.metal
          - m5d.metal
          - m5n.metal
          - m5dn.metal
          - m5zn.metal
          - m6a.metal
          - m6i.metal
          - m6id.metal
          - r5.metal
          - r5d.metal
          - r5n.metal
          - r5dn.metal
          - r6a.metal
          - r6i.metal
          - r6id.metal
          - x2iezn.metal
          - z1d.metal
          - c5.metal
          - c5d.metal
          - c5n.metal
          - c6a.metal
          - c6i.metal
          - c6id.metal
          - i3.metal
          - i3en.metal
          - r7i.48xlarge
  unhealthyConditions:
    - status: "False"
      timeout: 10s
      type: Ready
    - status: Unknown
      timeout: 10s
      type: Ready

unstoppable_pdb.yaml

@@ -0,0 +1,11 @@

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-cad
  namespace: default
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: "test-cad"
  unhealthyPodEvictionPolicy: AlwaysAllow

unstoppable_workload.yaml

@@ -0,0 +1,33 @@

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: "test-cad"
  name: test-cad
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "test-cad"
  template:
    metadata:
      labels:
        app: "test-cad"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists
              weight: 1
      containers:
        - command:
            - "sleep"
            - "infinity"
          image: "quay.io/app-sre/ubi8-ubi:latest"
          imagePullPolicy: IfNotPresent
          name: test
      restartPolicy: Always

pkg/investigations/registry.go

+2
@@ -6,6 +6,7 @@ import (
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/clustermonitoringerrorbudgetburn"
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/cpd"
     "github.com/openshift/configuration-anomaly-detection/pkg/investigations/investigation"
+    machinehealthcheckunterminatedshortcircuitsre "github.com/openshift/configuration-anomaly-detection/pkg/investigations/machineHealthCheckUnterminatedShortCircuitSRE"
 )

 // availableInvestigations holds all Investigation implementations.
@@ -14,6 +15,7 @@ var availableInvestigations = []investigation.Investigation{
     &chgm.Investiation{},
     &clustermonitoringerrorbudgetburn.Investigation{},
     &cpd.Investigation{},
+    &machinehealthcheckunterminatedshortcircuitsre.Investigation{},
 }

 // GetInvestigation returns the first Investigation that applies to the given alert title.

pkg/k8s/client.go

+8 -1
@@ -5,6 +5,7 @@ import (
     "os"

     configv1 "github.com/openshift/api/config/v1"
+    machinev1beta1 "github.com/openshift/api/machine/v1beta1"
     "github.com/openshift/backplane-cli/pkg/cli/config"
     bpremediation "github.com/openshift/backplane-cli/pkg/remediation"
     "github.com/openshift/configuration-anomaly-detection/pkg/ocm"
@@ -21,7 +22,7 @@ func New(clusterID string, ocmClient ocm.Client, remediation string) (client.Cli
 
     cfg, err := bpremediation.CreateRemediationWithConn(config.BackplaneConfiguration{URL: backplaneURL}, ocmClient.GetConnection(), clusterID, remediation)
     if err != nil {
-        return nil, err
+        return nil, fmt.Errorf("failed to create remediation: %w", err)
     }

     scheme, err := initScheme()
@@ -53,5 +54,11 @@ func initScheme() (*runtime.Scheme, error) {
     if err := configv1.Install(scheme); err != nil {
         return nil, fmt.Errorf("unable to add openshift/api/config scheme: %w", err)
     }
+
+    // Add machine.openshift.io/v1beta1 to scheme for machine and machinehealthcheck objects
+    if err := machinev1beta1.Install(scheme); err != nil {
+        return nil, fmt.Errorf("unable to add openshift/api/machine/v1beta1 scheme: %w", err)
+    }
+
     return scheme, nil
 }

test/generate_incident.sh

+4 -3
@@ -16,6 +16,7 @@ declare -A alert_mapping=(
     ["ClusterHasGoneMissing"]="cadtest has gone missing"
     ["ClusterProvisioningDelay"]="ClusterProvisioningDelay -"
     ["ClusterMonitoringErrorBudgetBurnSRE"]="ClusterMonitoringErrorBudgetBurnSRE Critical (1)"
+    ["MachineHealthCheckUnterminatedShortCircuitSRE"]="MachineHealthCheckUnterminatedShortCircuitSRE CRITICAL (1)"
 )

 # Function to print help message
@@ -25,7 +26,7 @@ print_help() {
     for alert_name in "${!alert_mapping[@]}"; do
         echo -n "$alert_name, "
     done
-    echo
+    echo
 }
 # Check if the correct number of arguments is provided
 if [ "$#" -ne 2 ]; then
@@ -49,9 +50,9 @@ alert_title="${alert_mapping[$alert_name]}"
 # Load testing routing key and test service url from vault
 export VAULT_ADDR="https://vault.devshift.net"
 export VAULT_TOKEN="$(vault login -method=oidc -token-only)"
-for v in $(vault kv get -format=json osd-sre/configuration-anomaly-detection/cad-testing | jq -r ".data.data|to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]"); do export $v; done
+for v in $(vault kv get -format=json osd-sre/configuration-anomaly-detection/cad-testing | jq -r ".data.data|to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]"); do export $v; done
 unset VAULT_ADDR VAULT_TOKEN
-echo
+echo

 dedup_key=$(uuidgen)
