
OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation #395


Merged — 1 commit merged into openshift:main from the osd-28525 branch on May 7, 2025

Conversation

tnierman
Member

@tnierman tnierman commented Mar 27, 2025

https://issues.redhat.com/browse/OSD-28525


These changes introduce the initial implementation for the machineHealthCheckUnterminatedShortCircuitSRE alert. The alert investigation takes the following steps:

  • Retrieves the machines managed by any failing machinehealthcheck object. Failing machines & nodes that are not managed by a short-circuited machinehealthcheck (i.e., masters) are not investigated
    • Machines that are deleting will have their node checked to see if it's stuck draining. At the moment, this check is imperfect: once the ability to query metrics is introduced, we will be able to more precisely confirm which workloads are blocking a node drain based on the pods_preventing_node_drain metric
    • Machines that are in a failed state will have their .Status.ErrorReason and .Status.ErrorMessage examined
    • For each machine that is examined and found to be problematic, a recommended action is added to the investigation's recommendations map. This map is used to holistically evaluate the state of the cluster following the investigation
  • Once all machines have been evaluated, the investigation moves on to the machines' nodes.
    • Nodes whose machines have already been identified as problematic are not evaluated: this cuts down on noise and prevents us from accidentally recommending two different courses of action for the same underlying compute instance
    • Currently, the node investigation is fairly bare: it primarily determines whether a specific node warrants further human investigation. Future iterations involving metrics may be able to expand on this aspect of the investigation
  • Finally, once all objects have been evaluated, the investigation's recommendations are summarized. By compiling the summary at the end of the investigation, rather than during each iteration, a single action can be recommended for a number of different nodes & machines (i.e., "send this one servicelog regarding machines A, B, and C" rather than "send servicelog regarding machine A" + "send servicelog regarding machine B" + "send servicelog regarding machine C", etc). A rough sketch of this aggregation follows this list.
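For illustration, here is a minimal sketch of how such a recommendations map could aggregate findings into a per-action summary. The map shape, the summarize method, and the constant values are assumptions made for this sketch; only the addRecommendation call and the recommendation* action names appear in the actual diff:

package investigation // illustrative package name for this sketch

import (
	"fmt"
	"sort"
	"strings"
)

type recommendedAction string

// Illustrative action values; the real constants live in recommendation.go.
const (
	recommendationDeleteMachine      recommendedAction = "delete machine"
	recommendationInvestigateMachine recommendedAction = "investigate machine"
	recommendationQuotaServiceLog    recommendedAction = "send quota servicelog"
)

// recommendations maps each action to the affected objects and a note per
// object, so one summary line can cover several machines or nodes.
type recommendations map[recommendedAction]map[string]string

func (r recommendations) addRecommendation(action recommendedAction, name, notes string) {
	if r[action] == nil {
		r[action] = map[string]string{}
	}
	r[action][name] = notes
}

// summarize emits one line per recommended action ("delete machine: A, B, C")
// rather than one line per object.
func (r recommendations) summarize() string {
	var sb strings.Builder
	for action, targets := range r {
		names := make([]string, 0, len(targets))
		for name := range targets {
			names = append(names, name)
		}
		sort.Strings(names) // map iteration order is random; sort for stable output
		fmt.Fprintf(&sb, "%s: %s\n", action, strings.Join(names, ", "))
	}
	return sb.String()
}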

Additionally, recommended test methods are supplied in the testing/ directory, along with a (somewhat) detailed README to help newcomers get started. This includes test objects that should be directly applicable to any OSD/ROSA staging cluster.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@openshift-ci openshift-ci bot requested review from Nikokolas3270 and typeid March 27, 2025 20:34
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from b2e6e6d to 68504de on March 27, 2025 at 20:47
@codecov-commenter

codecov-commenter commented Mar 27, 2025

Codecov Report

Attention: Patch coverage is 52.54902% with 121 lines in your changes missing coverage. Please review.

Project coverage is 33.44%. Comparing base (7b0f2f0) to head (c1da1a1).

Files with missing lines Patch % Lines
...e/machinehealthcheckunterminatedshortcircuitsre.go 50.88% 79 Missing and 4 partials ⚠️
...checkunterminatedshortcircuitsre/recommendation.go 21.73% 18 Missing ⚠️
pkg/investigations/utils/machine/machine.go 66.66% 12 Missing and 2 partials ⚠️
pkg/investigations/utils/node/node.go 78.94% 4 Missing ⚠️
pkg/k8s/scheme.go 0.00% 2 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
+ Coverage   31.09%   33.44%   +2.34%     
==========================================
  Files          29       33       +4     
  Lines        2074     2329     +255     
==========================================
+ Hits          645      779     +134     
- Misses       1376     1491     +115     
- Partials       53       59       +6     
Files with missing lines Coverage Δ
pkg/investigations/registry.go 0.00% <ø> (ø)
pkg/k8s/scheme.go 0.00% <0.00%> (ø)
pkg/investigations/utils/node/node.go 78.94% <78.94%> (ø)
pkg/investigations/utils/machine/machine.go 66.66% <66.66%> (ø)
...checkunterminatedshortcircuitsre/recommendation.go 21.73% <21.73%> (ø)
...e/machinehealthcheckunterminatedshortcircuitsre.go 50.88% <50.88%> (ø)

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2025
@tnierman tnierman force-pushed the osd-28525 branch 10 times, most recently from ae62e23 to 5140dfb on May 1, 2025 at 18:16
@tnierman tnierman changed the title [WIP] OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation May 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from 0e11b24 to 75ebed2 on May 1, 2025 at 20:21
@bergmannf
Contributor

I think this looks really good already. I guess it could check more things, but the PR is already quite big, and it's better to get this in and running in stage before adding even more checks.
(One failure mode this could eventually catch is customers running out of available CIDRs for new nodes: the machine reports Running, but the node never becomes Ready. Diagnosing that would require inspecting the network configuration.)
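A check for that symptom could look roughly like this (a sketch only; the helper name and the "Running" phase comparison are assumptions, not code from this PR, and it assumes the usual machinev1beta1 (github.com/openshift/api/machine/v1beta1) and corev1 (k8s.io/api/core/v1) imports):

// nodeNeverBecameReady flags the scenario above: the machine reports Running,
// but its node has not reached the Ready condition.
func nodeNeverBecameReady(machine machinev1beta1.Machine, node corev1.Node) bool {
	if machine.Status.Phase == nil || *machine.Status.Phase != "Running" {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status != corev1.ConditionTrue
		}
	}
	// A node that never reported a Ready condition at all is also suspect.
	return true
}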

Member

@typeid typeid left a comment


Added some suggestions/nit comments.

Additional suggestions/thoughts (non-blocking):
I generally find it a bit odd that we're going through all machines and looking for available MHC remediations. Would it be more logical to look at the MHCs, triage everything that has unhealthyCount > maxUnhealthy, and focus on those? (A rough sketch of that filtering follows the list below.)

As the file grew quite large, we could also consider splitting it up into something like the following:

  • machine_investigation.go: InvestigateMachines, investigateFailingMachine, investigateDeletingMachine, getMachineRole
  • node_investigation.go: InvestigateNode, InvestigateNodes, findNodeReadyCondition, checkForStuckDrain
  • utils.go: findNoScheduleTaint, getNodeRole
  • recommendations.go: recommendations, investigationResult, and String() logic
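
On the MHC-first triage idea above, the filtering could look roughly like this (a sketch; the helper name and nil-handling are assumptions, and it assumes controller-runtime's client plus machinev1beta1 and k8s.io/apimachinery/pkg/util/intstr imports):

// failingMHCs starts from the MachineHealthChecks and keeps only those that
// have short-circuited, i.e. unhealthyCount > maxUnhealthy.
func failingMHCs(ctx context.Context, c client.Client) ([]machinev1beta1.MachineHealthCheck, error) {
	var list machinev1beta1.MachineHealthCheckList
	if err := c.List(ctx, &list); err != nil {
		return nil, err
	}

	var failing []machinev1beta1.MachineHealthCheck
	for _, mhc := range list.Items {
		if mhc.Status.ExpectedMachines == nil || mhc.Status.CurrentHealthy == nil || mhc.Spec.MaxUnhealthy == nil {
			continue
		}
		unhealthy := *mhc.Status.ExpectedMachines - *mhc.Status.CurrentHealthy

		// MaxUnhealthy can be an absolute count or a percentage; resolve it
		// against the number of machines the MHC expects to manage.
		maxUnhealthy, err := intstr.GetScaledValueFromIntOrPercent(mhc.Spec.MaxUnhealthy, *mhc.Status.ExpectedMachines, false)
		if err != nil {
			continue
		}
		if unhealthy > maxUnhealthy {
			failing = append(failing, mhc)
		}
	}
	return failing, nil
}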

Comment on lines 230 to 239
switch errorReason {
case machinev1beta1.IPAddressInvalidReason:
	notes := fmt.Sprintf("invalid IP address: %q. Deleting machine may allow the cloud provider to assign a valid IP address", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)

case machinev1beta1.CreateMachineError:
	notes := fmt.Sprintf("machine failed to create: %q. Deleting machine may resolve any transient issues with the cloud provider", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)

case machinev1beta1.InvalidConfigurationMachineError:
	notes := fmt.Sprintf("the machine configuration is invalid: %q. Checking splunk audit logs may indicate whether the customer has modified the machine or its machineset", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)

case machinev1beta1.DeleteMachineError:
	notes := fmt.Sprintf("the machine's node could not be gracefully terminated automatically: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)

case machinev1beta1.InsufficientResourcesMachineError:
	notes := fmt.Sprintf("a servicelog should be sent because there is insufficient quota to provision the machine: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationQuotaServiceLog, machine.Name, notes)

default:
	notes := "no .Status.ErrorReason found for machine"
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)
}
Member

@typeid typeid May 5, 2025


Suggestion: Instead of the big switch here, you could centralize the mapping of errorReason to recommendation logic, e.g.:

var machineFailureHandlers = map[machinev1beta1.MachineStatusError]func(machinev1beta1.Machine) (recommendedAction, string){
	machinev1beta1.IPAddressInvalidReason: func(m machinev1beta1.Machine) (recommendedAction, string) {
		return recommendationDeleteMachine, fmt.Sprintf("invalid IP: %q", *m.Status.ErrorMessage)
	},
	// etc...
}

Then you could replace the long switch case:

handler, ok := machineFailureHandlers[errorReason]
if !ok {
	i.notes.AppendInfo("Unknown error reason %s on failed machine %s", errorReason, machine.Name)
	return nil
}

action, note := handler(machine)
i.recommendations.addRecommendation(action, machine.Name, note)
return nil

This would decouple things and should be easier on the eyes :)

Member Author

@tnierman tnierman May 5, 2025


After thinking on it for a bit, I personally prefer the big switch: it's easier to tell at a glance what recommendation we're making for each machine failure state, and it feels like less complexity for the same end result.

If this is something that you or others feel strongly about, I'm happy to convert this, however 🙂

Member Author


I added newlines between the case statements in an effort to improve readability

@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from f7b0de3 to 5d5c05d on May 5, 2025 at 18:27
@tnierman
Member Author

tnierman commented May 5, 2025

@typeid - thanks for the recommendations!

Specifically regarding

I generally find it a bit odd we're going through all machines and looking for MHC available remediation. Would it be more logical to look at the MHCs and triaging by everything that has unhealthyCount > maxUnhealthy and to focus on those?

It sounds like you're suggesting we remediate only the machines owned by a failing machineHealthCheck, correct?

That was the goal of getMachinesFromFailingMHC() (previously getTargetMachines()), though upon re-examining that logic, I found I had mismatched the return value to the name, and then misused the value to still get the right set of machines (realistically, I probably renamed the function and forgot to change the behavior 😬). I've pushed a fix for that now, but let me know if your concern extends beyond that error.
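For reference, matching machines to a failing MHC generally goes through the MHC's label selector; roughly like this (a sketch of the idea, not the exact code that was pushed; assumes controller-runtime's client and apimachinery's metav1 imports):

// getMachinesFromFailingMHC lists the machines selected by a short-circuited
// MachineHealthCheck via its label selector.
func getMachinesFromFailingMHC(ctx context.Context, c client.Client, mhc machinev1beta1.MachineHealthCheck) ([]machinev1beta1.Machine, error) {
	selector, err := metav1.LabelSelectorAsSelector(&mhc.Spec.Selector)
	if err != nil {
		return nil, fmt.Errorf("invalid selector on machinehealthcheck %q: %w", mhc.Name, err)
	}

	var machines machinev1beta1.MachineList
	if err := c.List(ctx, &machines, client.InNamespace(mhc.Namespace), client.MatchingLabelsSelector{Selector: selector}); err != nil {
		return nil, err
	}
	return machines.Items, nil
}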

@tnierman
Member Author

tnierman commented May 5, 2025

/hold to implement some additional recommendations

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from 08989ef to f284e28 on May 5, 2025 at 21:37
@tnierman
Member Author

tnierman commented May 5, 2025

/unhold - I think most recommendations have been accounted for

@tnierman tnierman force-pushed the osd-28525 branch 3 times, most recently from 82d7b7f to cb1b303 on May 5, 2025 at 23:17
Contributor

openshift-ci bot commented May 6, 2025

@tnierman: all tests passed!


@tnierman
Member Author

tnierman commented May 6, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2025
Member

@typeid typeid left a comment


/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2025
Contributor

openshift-ci bot commented May 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnierman, typeid


@openshift-merge-bot openshift-merge-bot bot merged commit f842c77 into openshift:main May 7, 2025
5 checks passed