OSD-28525 - Initial implementation for MachineHealthCheckUnterminatedShortCircuitSRE investigation #395
Conversation
Codecov Report
Attention: Patch coverage is …

@@ Coverage Diff @@
##             main     #395      +/-   ##
==========================================
+ Coverage   31.09%   33.44%   +2.34%
==========================================
  Files          29       33       +4
  Lines        2074     2329     +255
==========================================
+ Hits          645      779     +134
- Misses       1376     1491     +115
- Partials       53       59       +6
I think this looks really good already - I guess it could check more things, but I think the PR is already quite big and it's better to get this in and running in stage before adding even more checks.
Added some suggestions/nit comments.
Additional suggestions/thoughts (non-blocking):
I generally find it a bit odd that we're going through all machines and looking for available MHC remediation. Would it be more logical to look at the MHCs, triage everything that has unhealthyCount > maxUnhealthy, and focus on those?
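For illustration, a rough sketch of what that MHC-first triage might look like. This is not code from the PR: it assumes the openshift/api machine/v1beta1 types and a controller-runtime client, and findShortCircuitedMHCs is a hypothetical helper name.

// Hypothetical sketch: triage by MachineHealthCheck instead of iterating every machine.
package investigation

import (
	"context"
	"fmt"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findShortCircuitedMHCs returns the MachineHealthChecks whose unhealthy machine
// count exceeds their maxUnhealthy threshold, i.e. the ones that have short-circuited
// and stopped remediating. Only machines owned by these MHCs would then be investigated.
func findShortCircuitedMHCs(ctx context.Context, c client.Client) ([]machinev1beta1.MachineHealthCheck, error) {
	mhcList := &machinev1beta1.MachineHealthCheckList{}
	if err := c.List(ctx, mhcList); err != nil {
		return nil, fmt.Errorf("failed to list MachineHealthChecks: %w", err)
	}

	var shortCircuited []machinev1beta1.MachineHealthCheck
	for _, mhc := range mhcList.Items {
		if mhc.Status.ExpectedMachines == nil || mhc.Status.CurrentHealthy == nil || mhc.Spec.MaxUnhealthy == nil {
			continue
		}
		unhealthy := *mhc.Status.ExpectedMachines - *mhc.Status.CurrentHealthy
		// maxUnhealthy may be an absolute count or a percentage of expected machines
		maxUnhealthy, err := intstr.GetScaledValueFromIntOrPercent(mhc.Spec.MaxUnhealthy, *mhc.Status.ExpectedMachines, false)
		if err != nil {
			return nil, fmt.Errorf("resolving maxUnhealthy for %s: %w", mhc.Name, err)
		}
		if unhealthy > maxUnhealthy {
			shortCircuited = append(shortCircuited, mhc)
		}
	}
	return shortCircuited, nil
}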
As the file grew quite large, we could also consider splitting up the large file here into something like the following:
machine_investigation.go: InvestigateMachines, investigateFailingMachine, investigateDeletingMachine, getMachineRole
node_investigation.go: InvestigateNode, InvestigateNodes, findNodeReadyCondition, checkForStuckDrain
utils.go: findNoScheduleTaint, getNodeRole
recommendations.go: recommendations, investigationResult, and the String() logic
switch errorReason {
case machinev1beta1.IPAddressInvalidReason:
	notes := fmt.Sprintf("invalid IP address: %q. Deleting machine may allow the cloud provider to assign a valid IP address", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)
case machinev1beta1.CreateMachineError:
	notes := fmt.Sprintf("machine failed to create: %q. Deleting machine may resolve any transient issues with the cloud provider", errorMsg)
	i.recommendations.addRecommendation(recommendationDeleteMachine, machine.Name, notes)
case machinev1beta1.InvalidConfigurationMachineError:
	notes := fmt.Sprintf("the machine configuration is invalid: %q. Checking splunk audit logs may indicate whether the customer has modified the machine or its machineset", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)
case machinev1beta1.DeleteMachineError:
	notes := fmt.Sprintf("the machine's node could not be gracefully terminated automatically: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)
case machinev1beta1.InsufficientResourcesMachineError:
	notes := fmt.Sprintf("a servicelog should be sent because there is insufficient quota to provision the machine: %q", errorMsg)
	i.recommendations.addRecommendation(recommendationQuotaServiceLog, machine.Name, notes)
default:
	notes := "no .Status.ErrorReason found for machine"
	i.recommendations.addRecommendation(recommendationInvestigateMachine, machine.Name, notes)
}
Suggestion: Instead of the big switch here, you could centralize the mapping of errorReason to recommendation logic, e.g.:

var machineFailureHandlers = map[machinev1beta1.MachineStatusError]func(machinev1beta1.Machine) (recommendedAction, string){
	machinev1beta1.IPAddressInvalidReason: func(m machinev1beta1.Machine) (recommendedAction, string) {
		return recommendationDeleteMachine, fmt.Sprintf("invalid IP: %q", *m.Status.ErrorMessage)
	},
	// etc...
}
Then you could replace the long switch with:

handler, ok := machineFailureHandlers[errorReason]
if !ok {
	i.notes.AppendInfo("Unknown error reason %s on failed machine %s", errorReason, machine.Name)
	return nil
}
action, note := handler(machine)
i.recommendations.addRecommendation(action, machine.Name, note)
return nil

This would decouple things and should be easier on the eyes :)
After thinking on it for a bit, I personally prefer the big switch: it's easier to tell at a glance what recommendation we're making for each machine failure-state, and it feels like less complexity for the same end result.
If this is something that you or others feel strongly about, I'm happy to convert it, however 🙂
I added newlines between the case statements in an effort to improve readability.
@typeid - thanks for the recommendations! Specifically regarding the suggestion to triage by MHC rather than iterating over all machines:
It sounds like you're suggesting we remediate only the machines owned by a failing machineHealthCheck, correct? That was the goal of …
/hold to implement some additional recommendations
/unhold - I think most recommendations have been accounted for
@tnierman: all tests passed!
/unhold
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tnierman, typeid
https://issues.redhat.com/browse/OSD-28525
These changes introduce the initial implementation for the machineHealthCheckUnterminatedShortCircuitSRE alert. The alert investigation takes the following steps:

- … the pods_preventing_node_drain metric
- … a recommendations map. This map is used to holistically evaluate the state of the cluster following the investigation
- recommendations are summarized. By compiling the summary at the end of the investigation, rather than during each iteration, a single action can be recommended for a number of different nodes & machines (ie - "send this one servicelog regarding machine A, B, and C" rather than "send servicelog regarding machine A" + "send servicelog regarding machine B" + "send servicelog regarding machine C", etc). A minimal sketch of this summarization pattern follows below.

Additionally, recommended test methods are supplied in the testing/ directory, along with a (somewhat) detailed README to help newcomers get started. This includes test objects that should be directly applicable to any OSD/ROSA staging cluster.
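To illustrate the summarize-at-the-end design described above: a minimal sketch of a recommendations map that groups objects per action. The names recommendedAction, addRecommendation, and String are identifiers visible in this PR, but the grouping logic shown here is an assumption, not the PR's actual implementation.

// Assumed sketch of grouping recommendations so one summary line covers several objects.
package investigation

import (
	"fmt"
	"sort"
	"strings"
)

type recommendedAction string

const (
	recommendationDeleteMachine      recommendedAction = "delete machine"
	recommendationInvestigateMachine recommendedAction = "investigate machine"
	recommendationQuotaServiceLog    recommendedAction = "send quota servicelog"
)

// recommendations collects per-object notes under each action so that a single
// summary line can cover several machines or nodes.
type recommendations map[recommendedAction]map[string]string

func (r recommendations) addRecommendation(action recommendedAction, name, notes string) {
	if r[action] == nil {
		r[action] = map[string]string{}
	}
	r[action][name] = notes
}

// String compiles the summary once, at the end of the investigation, producing
// e.g. "delete machine: machine-a, machine-b" instead of one line per machine.
func (r recommendations) String() string {
	actions := make([]string, 0, len(r))
	for action := range r {
		actions = append(actions, string(action))
	}
	sort.Strings(actions) // deterministic output order

	var b strings.Builder
	for _, action := range actions {
		names := make([]string, 0, len(r[recommendedAction(action)]))
		for name := range r[recommendedAction(action)] {
			names = append(names, name)
		}
		sort.Strings(names)
		fmt.Fprintf(&b, "%s: %s\n", action, strings.Join(names, ", "))
	}
	return b.String()
}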