
OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation #395


Merged — 1 commit merged into openshift:main from the osd-28525 branch on May 7, 2025

Conversation

tnierman
Member

@tnierman tnierman commented Mar 27, 2025

https://issues.redhat.com/browse/OSD-28525


These changes introduce the initial implementation for the machineHealthCheckUnterminatedShortCircuitSRE alert. The alert investigation takes the following steps:

  • Retrieves the machines managed by any failing machinehealthcheck object. Failing machines & nodes that are not managed by a short-circuited machinehealthcheck (i.e., masters) are not investigated
    • Machines that are deleting will have their node checked to see if it's stuck draining. At the moment, this check is imperfect: once the ability to query metrics is introduced, we will be able to more precisely confirm which workloads are blocking a node drain based on the pods_preventing_node_drain metric
    • Machines that are in a failed state will have their .Status.ErrorReason and .Status.ErrorMessage examined
    • For each machine that is examined and found to be problematic, a recommended action is added to the investigation's recommendations map. This map is used to holistically evaluate the state of the cluster following the investigation
  • Once all machines have been evaluated, the investigation moves on to the machines' nodes.
    • Nodes whose machines have already been identified as problematic are not evaluated: this cuts down on noise and prevents us from accidentally recommending two different courses of action for the same underlying compute instance
    • Currently, the node investigation is fairly bare: it primarily determines whether a specific node warrants further human investigation. Future iterations involving metrics may be able to expand on this aspect of the investigation
  • Finally, once all objects have been evaluated, the investigation's recommendations are summarized. By compiling the summary at the end of the investigation, rather than during each iteration, a single action can be recommended for a number of different nodes & machines (i.e., "send this one servicelog regarding machines A, B, and C" rather than "send servicelog regarding machine A" + "send servicelog regarding machine B" + "send servicelog regarding machine C", etc). A rough sketch of this aggregation follows this list.
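For illustration, here is a minimal sketch of how such a recommendations map could aggregate findings into a per-action summary. The map shape, the summarize method, and the constant values are assumptions made for this sketch; only the addRecommendation call and the recommendation* action names appear in the actual diff:

package investigation // illustrative package name for this sketch

import (
	"fmt"
	"sort"
	"strings"
)

type recommendedAction string

// Illustrative action values; the real constants live in recommendation.go.
const (
	recommendationDeleteMachine      recommendedAction = "delete machine"
	recommendationInvestigateMachine recommendedAction = "investigate machine"
	recommendationQuotaServiceLog    recommendedAction = "send quota servicelog"
)

// recommendations maps each action to the affected objects and a note per
// object, so one summary line can cover several machines or nodes.
type recommendations map[recommendedAction]map[string]string

func (r recommendations) addRecommendation(action recommendedAction, name, notes string) {
	if r[action] == nil {
		r[action] = map[string]string{}
	}
	r[action][name] = notes
}

// summarize emits one line per recommended action ("delete machine: A, B, C")
// rather than one line per object.
func (r recommendations) summarize() string {
	var sb strings.Builder
	for action, targets := range r {
		names := make([]string, 0, len(targets))
		for name := range targets {
			names = append(names, name)
		}
		sort.Strings(names) // map iteration order is random; sort for stable output
		fmt.Fprintf(&sb, "%s: %s\n", action, strings.Join(names, ", "))
	}
	return sb.String()
}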

Additionally, recommended test methods are supplied in the testing/ directory, along with a (somewhat) detailed README to help newcomers get started. This includes test objects that should be directly applicable to any OSD/ROSA staging cluster.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@openshift-ci openshift-ci bot requested review from Nikokolas3270 and typeid March 27, 2025 20:34
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from b2e6e6d to 68504de on March 27, 2025 at 20:47
@codecov-commenter

codecov-commenter commented Mar 27, 2025

Codecov Report

Attention: Patch coverage is 52.54902% with 121 lines in your changes missing coverage. Please review.

Project coverage is 33.44%. Comparing base (7b0f2f0) to head (c1da1a1).

Files with missing lines Patch % Lines
...e/machinehealthcheckunterminatedshortcircuitsre.go 50.88% 79 Missing and 4 partials ⚠️
...checkunterminatedshortcircuitsre/recommendation.go 21.73% 18 Missing ⚠️
pkg/investigations/utils/machine/machine.go 66.66% 12 Missing and 2 partials ⚠️
pkg/investigations/utils/node/node.go 78.94% 4 Missing ⚠️
pkg/k8s/scheme.go 0.00% 2 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
+ Coverage   31.09%   33.44%   +2.34%     
==========================================
  Files          29       33       +4     
  Lines        2074     2329     +255     
==========================================
+ Hits          645      779     +134     
- Misses       1376     1491     +115     
- Partials       53       59       +6     
Files with missing lines Coverage Δ
pkg/investigations/registry.go 0.00% <ø> (ø)
pkg/k8s/scheme.go 0.00% <0.00%> (ø)
pkg/investigations/utils/node/node.go 78.94% <78.94%> (ø)
pkg/investigations/utils/machine/machine.go 66.66% <66.66%> (ø)
...checkunterminatedshortcircuitsre/recommendation.go 21.73% <21.73%> (ø)
...e/machinehealthcheckunterminatedshortcircuitsre.go 50.88% <50.88%> (ø)

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2025
@tnierman tnierman force-pushed the osd-28525 branch 10 times, most recently from ae62e23 to 5140dfb on May 1, 2025 at 18:16
@tnierman tnierman changed the title [WIP] OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation May 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from 0e11b24 to 75ebed2 on May 1, 2025 at 20:21
@bergmannf
Contributor

I think this looks really good already. I guess it could check more things, but the PR is already quite big, and it's better to get this in and running in stage before adding even more checks.
(One failure mode this could eventually catch is customers running out of available CIDRs for new nodes: the machine reports Running, but the node never becomes Ready. Diagnosing that would require inspecting the network configuration.)
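A check for that symptom could look roughly like this (a sketch only; the helper name and the "Running" phase comparison are assumptions, not code from this PR, and it assumes the usual machinev1beta1 (github.com/openshift/api/machine/v1beta1) and corev1 (k8s.io/api/core/v1) imports):

// nodeNeverBecameReady flags the scenario above: the machine reports Running,
// but its node has not reached the Ready condition.
func nodeNeverBecameReady(machine machinev1beta1.Machine, node corev1.Node) bool {
	if machine.Status.Phase == nil || *machine.Status.Phase != "Running" {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status != corev1.ConditionTrue
		}
	}
	// A node that never reported a Ready condition at all is also suspect.
	return true
}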

Member

@typeid typeid left a comment


Added some suggestions/nit comments.

Additional suggestions/thoughts (non-blocking):
I generally find it a bit odd that we're going through all machines and looking for available MHC remediations. Would it be more logical to look at the MHCs, triage everything that has unhealthyCount > maxUnhealthy, and focus on those? (A rough sketch of that filtering follows the list below.)

As the file grew quite large, we could also consider splitting it up into something like the following:

  • machine_investigation.go: InvestigateMachines, investigateFailingMachine, investigateDeletingMachine, getMachineRole
  • node_investigation.go: InvestigateNode, InvestigateNodes, findNodeReadyCondition, checkForStuckDrain
  • utils.go: findNoScheduleTaint, getNodeRole
  • recommendations.go: recommendations, investigationResult, and String() logic
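
On the MHC-first triage idea above, the filtering could look roughly like this (a sketch; the helper name and nil-handling are assumptions, and it assumes controller-runtime's client plus machinev1beta1 and k8s.io/apimachinery/pkg/util/intstr imports):

// failingMHCs starts from the MachineHealthChecks and keeps only those that
// have short-circuited, i.e. unhealthyCount > maxUnhealthy.
func failingMHCs(ctx context.Context, c client.Client) ([]machinev1beta1.MachineHealthCheck, error) {
	var list machinev1beta1.MachineHealthCheckList
	if err := c.List(ctx, &list); err != nil {
		return nil, err
	}

	var failing []machinev1beta1.MachineHealthCheck
	for _, mhc := range list.Items {
		if mhc.Status.ExpectedMachines == nil || mhc.Status.CurrentHealthy == nil || mhc.Spec.MaxUnhealthy == nil {
			continue
		}
		unhealthy := *mhc.Status.ExpectedMachines - *mhc.Status.CurrentHealthy

		// MaxUnhealthy can be an absolute count or a percentage; resolve it
		// against the number of machines the MHC expects to manage.
		maxUnhealthy, err := intstr.GetScaledValueFromIntOrPercent(mhc.Spec.MaxUnhealthy, *mhc.Status.ExpectedMachines, false)
		if err != nil {
			continue
		}
		if unhealthy > maxUnhealthy {
			failing = append(failing, mhc)
		}
	}
	return failing, nil
}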

Comment on lines 230 to 239
switch errorReason {
case machinev1beta1.IPAddressInvalidReason:
	notes := fmt.Sprintf("invalid IP address: %q. Deleting machine may allow the cloud provider to assign a valid IP address", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)

case machinev1beta1.CreateMachineError:
	notes := fmt.Sprintf("machine failed to create: %q. Deleting machine may resolve any transient issues with the cloud provider", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)

case machinev1beta1.InvalidConfigurationMachineError:
	notes := fmt.Sprintf("the machine configuration is invalid: %q. Checking splunk audit logs may indicate whether the customer has modified the machine or its machineset", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)

case machinev1beta1.DeleteMachineError:
	notes := fmt.Sprintf("the machine's node could not be gracefully terminated automatically: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)

case machinev1beta1.InsufficientResourcesMachineError:
	notes := fmt.Sprintf("a servicelog should be sent because there is insufficient quota to provision the machine: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationQuotaServiceLog, machine.Name, notes)

default:
	notes := "no .Status.ErrorReason found for machine"
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)
}
Member

@typeid typeid May 5, 2025


Suggestion: Instead of the big switch here, you could centralize the mapping of errorReason to recommendation logic, e.g.:

var machineFailureHandlers = map[machinev1beta1.MachineStatusError]func(machinev1beta1.Machine) (recommendedAction, string){
	machinev1beta1.IPAddressInvalidReason: func(m machinev1beta1.Machine) (recommendedAction, string) {
		return recommendationDeleteMachine, fmt.Sprintf("invalid IP: %q", *m.Status.ErrorMessage)
	},
	// etc...
}

Then you could replace the long switch case:

handler, ok := machineFailureHandlers[errorReason]
if !ok {
	i.notes.AppendInfo("Unknown error reason %s on failed machine %s", errorReason, machine.Name)
	return nil
}

action, note := handler(machine)
i.recommendations.addRecommendation(action, machine.Name, note)
return nil

This would decouple things and should be easier on the eyes :)

Member Author

@tnierman tnierman May 5, 2025


After thinking on it for a bit, I personally prefer the big switch: it's easier to tell at a glance what recommendation we're making for each machine failure state, and it feels like less complexity for the same end result.

If this is something that you or others feel strongly about, I'm happy to convert this, however 🙂

Member Author


I added newlines between the case statements in an effort to improve readability

@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from f7b0de3 to 5d5c05d on May 5, 2025 at 18:27
@tnierman
Member Author

tnierman commented May 5, 2025

@typeid - thanks for the recommendations!

Specifically regarding

I generally find it a bit odd we're going through all machines and looking for MHC available remediation. Would it be more logical to look at the MHCs and triaging by everything that has unhealthyCount > maxUnhealthy and to focus on those?

It sounds like you're suggesting we remediate only the machines owned by a failing machineHealthCheck, correct?

That was the goal of getMachinesFromFailingMHC() (previously getTargetMachines()), though upon re-examining that logic, I found I had mismatched the return value to the name, and then misused the value to still get the right set of machines (realistically, I probably renamed the function and forgot to change the behavior 😬). I've pushed a fix for that now, but let me know if your concern extends beyond that error.
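For reference, matching machines to a failing MHC generally goes through the MHC's label selector; roughly like this (a sketch of the idea, not the exact code that was pushed; assumes controller-runtime's client and apimachinery's metav1 imports):

// getMachinesFromFailingMHC lists the machines selected by a short-circuited
// MachineHealthCheck via its label selector.
func getMachinesFromFailingMHC(ctx context.Context, c client.Client, mhc machinev1beta1.MachineHealthCheck) ([]machinev1beta1.Machine, error) {
	selector, err := metav1.LabelSelectorAsSelector(&mhc.Spec.Selector)
	if err != nil {
		return nil, fmt.Errorf("invalid selector on machinehealthcheck %q: %w", mhc.Name, err)
	}

	var machines machinev1beta1.MachineList
	if err := c.List(ctx, &machines, client.InNamespace(mhc.Namespace), client.MatchingLabelsSelector{Selector: selector}); err != nil {
		return nil, err
	}
	return machines.Items, nil
}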

@tnierman
Member Author

tnierman commented May 5, 2025

/hold to implement some additional recommendations

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2025
@tnierman tnierman force-pushed the osd-28525 branch 2 times, most recently from 08989ef to f284e28 on May 5, 2025 at 21:37
@tnierman
Member Author

tnierman commented May 5, 2025

/unhold - I think most recommendations have been accounted for

@tnierman tnierman force-pushed the osd-28525 branch 3 times, most recently from 82d7b7f to cb1b303 on May 5, 2025 at 23:17
Contributor

openshift-ci bot commented May 6, 2025

@tnierman: all tests passed!


@tnierman
Member Author

tnierman commented May 6, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2025
Member

@typeid typeid left a comment


/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2025
Contributor

openshift-ci bot commented May 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnierman, typeid


@openshift-merge-bot openshift-merge-bot bot merged commit f842c77 into openshift:main May 7, 2025
5 checks passed