[DRAFT] - OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE #404

Open
wants to merge 1 commit into
base: main

@anispate anispate commented Apr 4, 2025

openshift-ci bot commented Apr 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anispate
Once this PR has been reviewed and has the lgtm label, please assign dustman9000 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov-commenter commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 0% with 95 lines in your changes missing coverage. Please review.

Project coverage is 30.90%. Comparing base (53d07dc) to head (02d2c44).
Report is 16 commits behind head on main.

Files with missing lines | Patch % | Lines
...cannotretrieveupdatesre/cannotRetrieveUpdateSRE.go | 0.00% | 95 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
- Coverage   32.42%   30.90%   -1.52%     
==========================================
  Files          26       27       +1     
  Lines        1937     2032      +95     
==========================================
  Hits          628      628              
- Misses       1261     1356      +95     
  Partials       48       48              
Files with missing lines | Coverage Δ
pkg/investigations/registry.go | 0.00% <ø> (ø)
...cannotretrieveupdatesre/cannotRetrieveUpdateSRE.go | 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes

@anispate force-pushed the OSD-18645 branch 4 times, most recently from bcb9b3c to f246d1f (April 8, 2025 18:17)
Comment on lines 86 to 95
logging.Infof("Network verification completed with result: %v", verifierResult)
switch verifierResult {
case networkverifier.Success:
	i.notes.AppendSuccess("Network verifier passed")
case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)
	result.ServiceLogPrepared = investigation.InvestigationStep{Performed: true, Labels: nil}
	i.notes.AppendWarning("NetworkVerifier found unreachable targets. \n \n Verify and send service log if necessary: \n osdctl servicelog post %s -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/required_network_egresses_are_blocked.json -p URLS=%s",
		r.Cluster.ID(), failureReason)
}
Contributor

This is a bit tight. I would suggest something like the following to make each step in this else branch easier to read.

Suggested change
logging.Infof("Network verification completed with result: %v", verifierResult)
switch verifierResult {
case networkverifier.Success:
	i.notes.AppendSuccess("Network verifier passed")
case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)
	result.ServiceLogPrepared = investigation.InvestigationStep{Performed: true, Labels: nil}
	i.notes.AppendWarning("NetworkVerifier found unreachable targets. \n \n Verify and send service log if necessary: \n osdctl servicelog post %s -t https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/required_network_egresses_are_blocked.json -p URLS=%s",
		r.Cluster.ID(), failureReason)
}

Comment on lines 103 to 125
switch {
case err != nil:
	logging.Errorf("Failed to list ClusterVersion: %v", err)
	i.notes.AppendWarning("Failed to list ClusterVersion: %v\nThis may indicate cluster access issues", err)
case len(cvList.Items) != 1:
	logging.Warnf("Found %d ClusterVersions, expected 1", len(cvList.Items))
	i.notes.AppendWarning("Found %d ClusterVersions, expected 1", len(cvList.Items))
default:
	versionCv := cvList.Items[0]
	logging.Infof("ClusterVersion found: %s", versionCv.Status.Desired.Version)
	for _, condition := range versionCv.Status.Conditions {
		logging.Debugf("Checking ClusterVersion condition: Type=%s, Status=%s, Reason=%s, Message=%s",
			condition.Type, condition.Status, condition.Reason, condition.Message)
		if condition.Type == "RetrievedUpdates" &&
			condition.Status == "False" &&
			condition.Reason == "VersionNotFound" &&
			strings.Contains(condition.Message, "Unable to retrieve available updates") {
			i.notes.AppendWarning("ClusterVersion error detected: %s\nThis indicates the current version %s is not found in the specified channel %s",
				condition.Message, versionCv.Status.Desired.Version, versionCv.Spec.Channel)
		}
	}
	fmt.Printf("Cluster version: %s\n", versionCv.Status.Desired.Version)
}
Contributor

This would be cleaner without a switch block. Just perform a conditional check for each of your failure cases before doing the default behavior.
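As a rough illustration of that suggestion, here is a minimal sketch of the same ClusterVersion checks written as guard clauses. The `condition` and `clusterVersion` types and the `describeClusterVersions` helper are hypothetical stand-ins invented for this example, not the real API types or anything in this PR, and the version and channel values in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// condition and clusterVersion are hypothetical stand-ins for the real
// ClusterVersion API types; only the fields used below are modelled.
type condition struct {
	Type, Status, Reason, Message string
}

type clusterVersion struct {
	Version    string
	Channel    string
	Conditions []condition
}

// describeClusterVersions checks each failure case first and returns early,
// then falls through to the default behavior without a switch block.
func describeClusterVersions(items []clusterVersion, listErr error) []string {
	if listErr != nil {
		return []string{fmt.Sprintf("Failed to list ClusterVersion: %v", listErr)}
	}
	if len(items) != 1 {
		return []string{fmt.Sprintf("Found %d ClusterVersions, expected 1", len(items))}
	}

	// Default behavior: inspect the single ClusterVersion's conditions.
	cv := items[0]
	var warnings []string
	for _, c := range cv.Conditions {
		if c.Type == "RetrievedUpdates" &&
			c.Status == "False" &&
			c.Reason == "VersionNotFound" &&
			strings.Contains(c.Message, "Unable to retrieve available updates") {
			warnings = append(warnings, fmt.Sprintf(
				"ClusterVersion error detected: %s (version %s, channel %s)",
				c.Message, cv.Version, cv.Channel))
		}
	}
	return warnings
}

func main() {
	// Illustrative values only.
	cv := clusterVersion{
		Version: "4.15.0",
		Channel: "stable-4.16",
		Conditions: []condition{{
			Type:    "RetrievedUpdates",
			Status:  "False",
			Reason:  "VersionNotFound",
			Message: "Unable to retrieve available updates: version not found",
		}},
	}
	for _, w := range describeClusterVersions([]clusterVersion{cv}, nil) {
		fmt.Println(w)
	}
}
```

Returning the warnings as a slice keeps the helper free of logging and note-writing side effects, so the caller decides what goes into the incident notes.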

Contributor Author

@anispate anispate Apr 10, 2025

I tried doing it without a switch block, but then one of the tests failed and recommended using a switch statement instead. This was the test that failed: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_configuration-anomaly-detection/404/pull-ci-openshift-configuration-anomaly-detection-main-lint/1909316667297501184

Contributor

@joshbranham joshbranham Apr 10, 2025

I don't follow. What do you mean by "it recommended me to use the switch case that time"?

I think this points to a broader issue: your Run function holds a lot of business logic in one place, performs multiple "validations", and must never return early.

If you break each "validation" into a private function and call them in order from Run, you can handle the failure modes independently, and it might be cleaner. Here is some pseudocode to work from. Let me know what you think and whether you have any questions 👍

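func Run() {
	err := checkClusterVersion()
	if err != nil {
		// log warning
		// write warning note for pagerduty
	}

	err = runNetworkVerifier()
	if err != nil {
		// log warning
		// write warning note for pagerduty
	}

	// cleanup
}

// checkClusterVersion returns an error for each failure mode and nil on the happy path.
func checkClusterVersion() error {
	cvList, err := getCvs()
	if err != nil {
		return err
	}

	// Exactly one ClusterVersion is expected, so an empty list is a failure too.
	if len(cvList) != 1 {
		return fmt.Errorf("expected exactly one cluster version, found %d", len(cvList))
	}

	// do the rest of the happy path CV validation

	return nil
}

// runNetworkVerifier reports both an execution error and a failed result as an error.
func runNetworkVerifier() error {
	result, err := runVerifier()
	if err != nil {
		return err
	}

	// err is nil here, so a failed result needs its own error value.
	if result.Failed {
		return errors.New("network verifier reported a failure")
	}

	return nil
}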
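For anyone who wants to try the shape of this out, below is a small runnable sketch of the "run every validation, never return early" idea. The `check` type, `runAll`, and `errNoClusterVersion` are hypothetical names made up for this example. They are not part of this repository, and the real Run would write notes to the incident rather than return strings.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoClusterVersion is a hypothetical sentinel error used only in this sketch.
var errNoClusterVersion = errors.New("expected exactly one cluster version")

// check is a hypothetical named validation step.
type check struct {
	name string
	run  func() error
}

// runAll executes every check in order. A failing check is recorded as a
// warning and does not stop the remaining checks from running.
func runAll(checks []check) []string {
	var warnings []string
	for _, c := range checks {
		if err := c.run(); err != nil {
			warnings = append(warnings, fmt.Sprintf("%s: %v", c.name, err))
		}
	}
	return warnings
}

func main() {
	checks := []check{
		{name: "cluster version", run: func() error { return errNoClusterVersion }},
		{name: "network verifier", run: func() error { return nil }},
	}
	for _, w := range runAll(checks) {
		fmt.Println("warning:", w)
	}
}
```

Because each validation only returns an error, the logging and note-writing policy lives in one place, and adding a third validation means adding one entry to the slice.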

@anispate force-pushed the OSD-18645 branch 2 times, most recently from 553c4b0 to c933fc3 (April 10, 2025 20:12)
openshift-ci bot commented Apr 10, 2025

@anispate: all tests passed!

}

if err := i.runNetworkVerifier(r, &result); err != nil {
	logging.Errorf("Network verification failed: %v", err)
Member

We should probably handle the case where the network-verifier fails, right?

i.e., add a note to the PD incident, prepare an SL (eventually), return a result and escalate to on-call, etc.

Member

On second thought, the network-verifier could fail on unrelated domains, so perhaps it is better to continue the investigation 🤔
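One possible way to act on that observation is to look only at the unreachable endpoints that matter for update retrieval and keep investigating otherwise. The sketch below rests on two assumptions that are not confirmed anywhere in this PR: that `failureReason` is a comma-separated list of `host:port` entries, and that `api.openshift.com` is the host the update check depends on. `updateHosts` and `relevantFailures` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// updateHosts is an assumed list of hosts that update retrieval depends on.
var updateHosts = []string{"api.openshift.com"}

// relevantFailures returns the unreachable endpoints whose host appears in
// updateHosts. It assumes failureReason is a comma-separated list of
// host:port entries, which is a guess at the verifier's output format.
func relevantFailures(failureReason string) []string {
	var hits []string
	for _, entry := range strings.Split(failureReason, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// Drop the port, if any, before comparing hosts.
		host, _, _ := strings.Cut(entry, ":")
		for _, h := range updateHosts {
			if host == h {
				hits = append(hits, entry)
				break
			}
		}
	}
	return hits
}

func main() {
	reason := "example.invalid:443, api.openshift.com:443"
	if hits := relevantFailures(reason); len(hits) > 0 {
		fmt.Println("update-related endpoints unreachable:", hits)
	} else {
		fmt.Println("no update-related endpoints blocked, continuing investigation")
	}
}
```

With something like this, a verifier failure on an unrelated domain would still be noted, but only a hit on an update-related host would justify preparing the SL.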
