[DRAFT] - OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE #404
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: anispate. The full list of commands accepted by this bot can be found here.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
- Coverage   32.42%   30.90%   -1.52%
==========================================
  Files          26       27       +1
  Lines        1937     2032      +95
==========================================
  Hits          628      628
- Misses       1261     1356      +95
  Partials       48       48
bcb9b3c to f246d1f
pkg/investigations/CannotRetrieveUpdatesSRE/CannotRetrieveUpdatesSRE.go
logging.Infof("Network verification completed with result: %v", verifierResult)
switch verifierResult {
case networkverifier.Success:
	i.notes.AppendSuccess("Network verifier passed")
case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)
	result.ServiceLogPrepared = investigation.InvestigationStep{Performed: true, Labels: nil}
	i.notes.AppendWarning("NetworkVerifier found unreachable targets. \n \n Verify and send service log if necessary: \n osdctl servicelog post %s -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/required_network_egresses_are_blocked.json -p URLS=%s",
		r.Cluster.ID(), failureReason)
}
This is a bit tight. I would suggest something like the following to make it easier to read each thing occurring in this else block:
logging.Infof("Network verification completed with result: %v", verifierResult)

switch verifierResult {
case networkverifier.Success:
	i.notes.AppendSuccess("Network verifier passed")

case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)

	result.ServiceLogPrepared = investigation.InvestigationStep{Performed: true, Labels: nil}

	i.notes.AppendWarning("NetworkVerifier found unreachable targets. \n \n Verify and send service log if necessary: \n osdctl servicelog post %s -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/required_network_egresses_are_blocked.json -p URLS=%s",
		r.Cluster.ID(), failureReason)
}
switch {
case err != nil:
	logging.Errorf("Failed to list ClusterVersion: %v", err)
	i.notes.AppendWarning("Failed to list ClusterVersion: %v\nThis may indicate cluster access issues", err)
case len(cvList.Items) != 1:
	logging.Warnf("Found %d ClusterVersions, expected 1", len(cvList.Items))
	i.notes.AppendWarning("Found %d ClusterVersions, expected 1", len(cvList.Items))
default:
	versionCv := cvList.Items[0]
	logging.Infof("ClusterVersion found: %s", versionCv.Status.Desired.Version)
	for _, condition := range versionCv.Status.Conditions {
		logging.Debugf("Checking ClusterVersion condition: Type=%s, Status=%s, Reason=%s, Message=%s",
			condition.Type, condition.Status, condition.Reason, condition.Message)
		if condition.Type == "RetrievedUpdates" &&
			condition.Status == "False" &&
			condition.Reason == "VersionNotFound" &&
			strings.Contains(condition.Message, "Unable to retrieve available updates") {
			i.notes.AppendWarning("ClusterVersion error detected: %s\nThis indicates the current version %s is not found in the specified channel %s",
				condition.Message, versionCv.Status.Desired.Version, versionCv.Spec.Channel)
		}
	}
	fmt.Printf("Cluster version: %s\n", versionCv.Status.Desired.Version)
}
This would be cleaner without a switch block. Just perform conditional checks for each of your failures before doing the default behavior.
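For illustration, here is a minimal sketch of that guard-clause style, assuming simplified stand-in types. `condition`, `clusterVersion`, `clusterVersionWarning`, and `errForbidden` are hypothetical names, not part of this PR or the OpenShift API; only the condition fields and string values are taken from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// condition and clusterVersion are hypothetical, simplified stand-ins for the
// real ClusterVersion types; only the fields used in the diff are modeled.
type condition struct {
	Type, Status, Reason, Message string
}

type clusterVersion struct {
	Version    string
	Channel    string
	Conditions []condition
}

// errForbidden is a sample list error used by the demo below.
var errForbidden = errors.New("forbidden")

// clusterVersionWarning handles each failure with an early return, so the
// happy path stays unindented at the bottom. It returns a warning note when
// the RetrievedUpdates condition matches, or an empty string otherwise.
func clusterVersionWarning(items []clusterVersion, listErr error) (string, error) {
	if listErr != nil {
		return "", fmt.Errorf("failed to list ClusterVersion: %w", listErr)
	}
	if len(items) != 1 {
		return "", fmt.Errorf("found %d ClusterVersions, expected 1", len(items))
	}

	cv := items[0]
	for _, c := range cv.Conditions {
		if c.Type == "RetrievedUpdates" &&
			c.Status == "False" &&
			c.Reason == "VersionNotFound" &&
			strings.Contains(c.Message, "Unable to retrieve available updates") {
			return fmt.Sprintf("version %s not found in channel %s", cv.Version, cv.Channel), nil
		}
	}
	return "", nil
}

func main() {
	_, err := clusterVersionWarning(nil, errForbidden)
	fmt.Println(err) // failed to list ClusterVersion: forbidden

	note, _ := clusterVersionWarning([]clusterVersion{{
		Version: "4.14.1",
		Channel: "stable-4.14",
		Conditions: []condition{{
			Type: "RetrievedUpdates", Status: "False", Reason: "VersionNotFound",
			Message: "Unable to retrieve available updates: version not in channel",
		}},
	}}, nil)
	fmt.Println(note) // version 4.14.1 not found in channel stable-4.14
}
```

Because each failure returns on its own, the caller decides how to log it and which note to write, and the function itself stays free of logging.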
I tried to do it without a switch block, but then one of the tests was failing and it recommended me to use the switch case that time. This was the test that failed: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_configuration-anomaly-detection/404/pull-ci-openshift-configuration-anomaly-detection-main-lint/1909316667297501184
I don't follow, what do you mean by "it recommended me to use the switch case that time"?
I think this is pointing to a broader issue: your Run here has a lot of business logic in one function, including performing multiple "validations" and needing to never return early.
If you break each "validation" into a private function and call each in order from Run, then you can handle the failure modes independently and it might be cleaner. Here is some pseudocode to work from. Let me know what you think and if you have any questions 👍
func Run() {
	err := checkClusterVersion()
	if err != nil {
		// log warning
		// write warning note for pagerduty
	}

	err = runNetworkVerifier()
	if err != nil {
		// log warning
		// write warning note for pagerduty
	}

	// cleanup
}

func checkClusterVersion() error {
	cvList, err := getCvs()
	if err != nil {
		return err
	}
	if len(cvList) != 1 {
		return fmt.Errorf("found %d cluster versions, expected 1", len(cvList))
	}

	// do the rest of the happy path CV validation
	return nil
}

func runNetworkVerifier() error {
	result, err := runVerifier()
	if err != nil {
		return err
	}
	if result.Failed {
		// a failed verification must not come back as a nil error
		return errors.New("network verifier reported a failure")
	}
	return nil
}
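A runnable variant of the pseudocode above might look like the following sketch. It assumes a stripped-down `investigation` type whose checks are injected as function fields so the example runs without a cluster. The `notes` slice, the step names, and `errTwoCVs` are hypothetical and do not reflect the PR's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// investigation is a hypothetical, stripped-down stand-in for the PR's
// investigation type. The checks are function fields so they can be stubbed.
type investigation struct {
	checkClusterVersion func() error
	runNetworkVerifier  func() error
	notes               []string
}

// errTwoCVs is a sample validation error used by the demo below.
var errTwoCVs = errors.New("found 2 ClusterVersions, expected 1")

// Run calls every validation in order and never returns early: a failed
// step is recorded as a warning note and the next step still runs.
func (i *investigation) Run() {
	steps := []struct {
		name string
		fn   func() error
	}{
		{"ClusterVersion check", i.checkClusterVersion},
		{"network verifier", i.runNetworkVerifier},
	}
	for _, s := range steps {
		if err := s.fn(); err != nil {
			i.notes = append(i.notes, fmt.Sprintf("warning: %s: %v", s.name, err))
			continue
		}
		i.notes = append(i.notes, fmt.Sprintf("ok: %s", s.name))
	}
}

func main() {
	inv := &investigation{
		checkClusterVersion: func() error { return errTwoCVs },
		runNetworkVerifier:  func() error { return nil },
	}
	inv.Run()
	for _, n := range inv.notes {
		fmt.Println(n)
	}
	// Prints:
	// warning: ClusterVersion check: found 2 ClusterVersions, expected 1
	// ok: network verifier
}
```

Keeping the steps in a slice is only one option. Two explicit `if err := ...; err != nil` blocks, as in the pseudocode, behave the same and may read more plainly with only two validations.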
553c4b0 to c933fc3
@anispate: all tests passed!
	}

	if err := i.runNetworkVerifier(r, &result); err != nil {
		logging.Errorf("Network verification failed: %v", err)
We should probably handle the case where the network-verifier fails, right?
I.e., add a note to the PD incident, prepare an SL (eventually), return a result and escalate to on-call, etc.
On second thought, the network-verifier could fail on unrelated domains, so perhaps it is better to continue the investigation 🤔
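One way to act on that concern would be to separate unreachable targets that matter for update retrieval from unrelated ones before deciding whether to stop. The sketch below is only an illustration under assumptions: the verifier is assumed to report targets as `host:port` strings, and `updateHosts` with its `updates.example.com` entry is a hypothetical placeholder, not a real endpoint list.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// updateHosts is a hypothetical allow-list of hosts that matter for
// retrieving updates. The entry is a placeholder, not a real endpoint.
var updateHosts = map[string]bool{"updates.example.com": true}

// relevantFailures returns the unreachable targets whose host is listed in
// updateHosts. Targets may be "host:port" or a bare host.
func relevantFailures(unreachable []string) []string {
	var out []string
	for _, target := range unreachable {
		host := target
		// SplitHostPort fails for a bare host; fall back to the raw target.
		if h, _, err := net.SplitHostPort(target); err == nil {
			host = h
		}
		if updateHosts[strings.ToLower(host)] {
			out = append(out, target)
		}
	}
	return out
}

func main() {
	unreachable := []string{"updates.example.com:443", "telemetry.example.com:443"}
	fmt.Println(relevantFailures(unreachable)) // [updates.example.com:443]
}
```

With a split like this, unrelated failures could still be written to the notes as a warning while the investigation continues.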
OSD-18645 - CAD implementation for CannotRetrieveUpdatesSRE
Sample ticket: https://redhat.pagerduty.com/incidents/Q1S45W54TK1QKU#:~:text=%E2%9A%A0%EF%B8%8F%20ClusterVersion%20error%20detected,primary%20for%20review