Hotfix dev env #372

Draft: wants to merge 5 commits into base: main
3 changes: 2 additions & 1 deletion .gitignore
@@ -5,4 +5,5 @@ dist
.envrc
.idea
.vscode
cad_testing
cad_testing
backplane-api
38 changes: 28 additions & 10 deletions README.md
@@ -90,23 +90,41 @@ They are initialized for you and passed to the investigation via investigation.R
## Testing locally

### Pre-requirements
- an existing cluster
- an existing PagerDuty incident for the cluster and alert type that is being tested
- An existing stage cluster
- A PagerDuty incident

To quickly create an incident for a cluster_id, you can run `./test/generate_incident.sh <alertname> <clusterid>`.
Example usage:`./test/generate_incident.sh ClusterHasGoneMissing 2b94brrrrrrrrrrrrrrrrrrhkaj`.
```bash
# (Optional) Export your PagerDuty token to automatically retrieve the incident ID
export pd_token=<your_pd_token>
# Generates incident and creates payload file with incident ID
./test/generate_incident.sh <alertname> <clusterid>
```

### Running cadctl for an incident ID
1) Export the required ENV variables, see [required ENV variables](#required-env-variables).
2) Create a payload file containing the incident ID
If you are not using `pd_token`, create the payload file with the incident ID manually:
```bash
export INCIDENT_ID=
echo '{"__pd_metadata":{"incident":{"id":"'${INCIDENT_ID}'"}}}' > ./payload
```
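
For tooling that needs to build the same payload programmatically, here is a minimal Go sketch. The JSON shape is taken from the `echo` command above; the type and function names (`payload`, `buildPayload`) and the incident ID `Q0EXAMPLE` are hypothetical and not part of this repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload mirrors the JSON written by the echo command above.
// The type name is hypothetical; only the field layout comes from the README.
type payload struct {
	Metadata struct {
		Incident struct {
			ID string `json:"id"`
		} `json:"incident"`
	} `json:"__pd_metadata"`
}

// buildPayload returns the JSON document for the given incident ID.
func buildPayload(incidentID string) ([]byte, error) {
	var p payload
	p.Metadata.Incident.ID = incidentID
	return json.Marshal(p)
}

func main() {
	b, err := buildPayload("Q0EXAMPLE") // placeholder incident ID
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Using `encoding/json` instead of string concatenation keeps the document valid even if the ID contains characters that need escaping.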

### Running cadctl

1) Run backplane-api locally in a second terminal (requires being logged in to OCM)

```
./test/backplane.sh
```

> If this step fails, comment out the `BACKPLANE_URL` env variable in `set_stage_env.sh`. You will then run against stage backplane, which means backplane won't see any local changes to metadata files; expect errors like `file not found`.
2) Export the required ENV variables, see [required ENV variables](#required-env-variables).
```
source test/set_stage_env.sh
```
3) Run `cadctl` using the payload file
```bash
./bin/cadctl investigate --payload-path payload
```
```bash
./bin/cadctl investigate --payload-path payload
```

> If you are testing a new investigation that uses k8sclient, you need to run backplane locally, and the metadata file needs to be temporarily committed to main.

### Logging levels

22 changes: 14 additions & 8 deletions cadctl/cmd/investigate/investigate.go
@@ -22,6 +22,7 @@ import (
"path/filepath"

cmv1 "github.com/openshift-online/ocm-sdk-go/clustersmgmt/v1"
"github.com/openshift/configuration-anomaly-detection/pkg/aws"
investigations "github.com/openshift/configuration-anomaly-detection/pkg/investigations"
"github.com/openshift/configuration-anomaly-detection/pkg/investigations/ccam"
investigation "github.com/openshift/configuration-anomaly-detection/pkg/investigations/investigation"
@@ -129,18 +130,23 @@ func run(cmd *cobra.Command, _ []string) error {
return fmt.Errorf("could not retrieve Cluster Deployment for %s: %w", internalClusterID, err)
}

customerAwsClient, err := managedcloud.CreateCustomerAWSClient(cluster, ocmClient)
if err != nil {
ccamResources := &investigation.Resources{Name: "ccam", Cluster: cluster, ClusterDeployment: clusterDeployment, AwsClient: customerAwsClient, OcmClient: ocmClient, PdClient: pdClient, AdditionalResources: map[string]interface{}{"error": err}}
inv := ccam.Investigation{}
result, err := inv.Run(ccamResources)
updateMetrics(alertInvestigation.Name(), &result)
return err
var customerAwsClient *aws.SdkClient
if alertInvestigation.RequiresAwsClient() {
[Review comment, Member] Question: how would we test AWS access + kube-api access short term?
This is an interesting side effect of running the local backplane-api.

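// Assign with "=" so the outer customerAwsClient is set; ":=" would declare a new variable scoped to this block.
customerAwsClient, err = managedcloud.CreateCustomerAWSClient(cluster, ocmClient)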
if err != nil {
ccamResources := &investigation.Resources{Name: "ccam", Cluster: cluster, ClusterDeployment: clusterDeployment, AwsClient: customerAwsClient, OcmClient: ocmClient, PdClient: pdClient, AdditionalResources: map[string]interface{}{"error": err}}
inv := ccam.Investigation{}
result, err := inv.Run(ccamResources)
updateMetrics(alertInvestigation.Name(), &result)
return err
}
} else {
customerAwsClient = &aws.SdkClient{}
}

investigationResources := &investigation.Resources{Name: alertInvestigation.Name(), Cluster: cluster, ClusterDeployment: clusterDeployment, AwsClient: customerAwsClient, OcmClient: ocmClient, PdClient: pdClient}

logging.Infof("Starting investigation for %s", alertInvestigation.Name)
logging.Infof("Starting investigation for %s", alertInvestigation.Name())
result, err := alertInvestigation.Run(investigationResources)
updateMetrics(alertInvestigation.Name(), &result)
return err
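
One thing to watch in this gating pattern: when a variable is declared before the `if` block, using `:=` inside the block declares a new variable that shadows the outer one, so the outer variable stays `nil` after the block. The sketch below shows the difference in isolation. `client` and `newClient` are hypothetical stand-ins for `*aws.SdkClient` and `managedcloud.CreateCustomerAWSClient`, not the real types.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a hypothetical stand-in for *aws.SdkClient.
type client struct{ name string }

// newClient is a hypothetical stand-in for managedcloud.CreateCustomerAWSClient.
func newClient(fail bool) (*client, error) {
	if fail {
		return nil, errors.New("no credentials")
	}
	return &client{name: "customer"}, nil
}

// buildShadowed uses ":=" inside the block, so the outer c is never assigned.
func buildShadowed(needsClient bool) *client {
	var c *client
	if needsClient {
		c, err := newClient(false) // declares a new c scoped to this block
		if err != nil {
			return nil
		}
		_ = c // the inner c goes out of scope at the end of the block
	}
	return c
}

// buildAssigned uses "=", so the outer c receives the new client.
func buildAssigned(needsClient bool) (*client, error) {
	var (
		c   *client
		err error
	)
	if needsClient {
		c, err = newClient(false)
		if err != nil {
			return nil, err
		}
	} else {
		c = &client{} // empty placeholder when no client is required
	}
	return c, nil
}

func main() {
	fmt.Println(buildShadowed(true) == nil) // true: the outer variable was never set
	c, _ := buildAssigned(true)
	fmt.Println(c.name) // customer
}
```

The shadowed version compiles without complaint because both variables are used, which is why this kind of bug is easy to miss in review.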
2 changes: 1 addition & 1 deletion pkg/aws/aws.go
@@ -625,7 +625,7 @@ func eventContainsInstances(instances []ec2v2types.Instance, event cloudtrailv2t

func getTime(rawReason string) (time.Time, error) {
subMatches := stopInstanceDateRegex.FindStringSubmatch(rawReason)
if subMatches == nil || len(subMatches) < 2 {
if len(subMatches) < 2 {
return time.Time{}, fmt.Errorf("did not find matches: raw data %s", rawReason)
}
if len(subMatches) != 2 {
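
The simplified check in this hunk is safe because `len` of a nil slice is 0 in Go, and `FindStringSubmatch` returns nil when nothing matches. A minimal sketch of the same idea follows; `dateRegex` and `firstDate` are hypothetical stand-ins, not the actual `stopInstanceDateRegex` or `getTime`.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRegex is a hypothetical pattern used only for this illustration.
var dateRegex = regexp.MustCompile(`\((\d{4}-\d{2}-\d{2})`)

// firstDate returns the first capture group, or an error when the pattern does not match.
func firstDate(raw string) (string, error) {
	subMatches := dateRegex.FindStringSubmatch(raw)
	// FindStringSubmatch returns nil on no match; len(nil) is 0, so this
	// single check covers both the nil case and the too-short case.
	if len(subMatches) < 2 {
		return "", fmt.Errorf("did not find matches: raw data %s", raw)
	}
	return subMatches[1], nil
}

func main() {
	date, err := firstDate("User initiated (2023-01-02 03:04:05 GMT)")
	fmt.Println(date, err)

	_, err = firstDate("no date here")
	fmt.Println(err)
}
```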
4 changes: 4 additions & 0 deletions pkg/investigations/ccam/ccam.go
@@ -19,6 +19,10 @@ var ccamLimitedSupport = &ocm.LimitedSupportReason{
Details: "Your cluster requires you to take action because Red Hat is not able to access the infrastructure with the provided credentials. Please restore the credentials and permissions provided during install",
}

func (c *Investigation) RequiresAwsClient() bool {
return false
}

// Run evaluates whether the awsError is a "cluster credentials are missing" error. If it determines that it is,
// the cluster is placed into limited support (if the cluster state allows it), otherwise an error is returned.
func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
16 changes: 10 additions & 6 deletions pkg/investigations/chgm/chgm.go
@@ -36,10 +36,14 @@ var (
}
)

type Investiation struct{}
type Investigation struct{}

func (c *Investigation) RequiresAwsClient() bool {
return true
}

// Run runs the investigation for a triggered chgm pagerduty event
func (c *Investiation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
result := investigation.InvestigationResult{}
notes := notewriter.New("CHGM", logging.RawLogger)

@@ -118,19 +122,19 @@ func (c *Investiation) Run(r *investigation.Resources) (investigation.Investigat
return result, r.PdClient.EscalateIncidentWithNote(notes.String())
}

func (c *Investiation) Name() string {
func (c *Investigation) Name() string {
return "Cluster Has Gone Missing (CHGM)"
}

func (c *Investiation) Description() string {
func (c *Investigation) Description() string {
return "Detects reason for clusters that have gone missing"
}

func (c *Investiation) ShouldInvestigateAlert(alert string) bool {
func (c *Investigation) ShouldInvestigateAlert(alert string) bool {
return strings.Contains(alert, "has gone missing")
}

func (c *Investiation) IsExperimental() bool {
func (c *Investigation) IsExperimental() bool {
return false
}

2 changes: 1 addition & 1 deletion pkg/investigations/chgm/chgm_test.go
@@ -92,7 +92,7 @@ var _ = Describe("chgm", func() {
mockCtrl.Finish()
})

inv := Investiation{}
inv := Investigation{}

Describe("Triggered", func() {
When("Triggered finds instances stopped by the customer", func() {
@@ -27,6 +27,10 @@ var uwmMisconfiguredSL = ocm.ServiceLog{

type Investigation struct{}

func (c *Investigation) RequiresAwsClient() bool {
return false
}

func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
// Initialize k8s client
// This would be better suited to be passed in with the investigation resources
4 changes: 4 additions & 0 deletions pkg/investigations/cpd/cpd.go
@@ -14,6 +14,10 @@ import (

type Investigation struct{}

func (c *Investigation) RequiresAwsClient() bool {
return true
}

// https://raw.githubusercontent.com/openshift/managed-notifications/master/osd/aws/InstallFailed_NoRouteToInternet.json
var byovpcRoutingSL = &ocm.ServiceLog{Severity: "Major", Summary: "Installation blocked: Missing route to internet", Description: "Your cluster's installation is blocked because of the missing route to internet in the route table(s) associated with the supplied subnet(s) for cluster installation. Please review and validate the routes by following documentation and re-install the cluster: https://docs.openshift.com/container-platform/latest/installing/installing_aws/installing-aws-vpc.html#installation-custom-aws-vpc-requirements_installing-aws-vpc.", InternalOnly: false, ServiceName: "SREManualAction"}

1 change: 1 addition & 0 deletions pkg/investigations/investigation/investigation.go
@@ -28,6 +28,7 @@ type Investigation interface {
Description() string
IsExperimental() bool
ShouldInvestigateAlert(string) bool
RequiresAwsClient() bool
}

// Resources holds all resources/tools required for alert investigations
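
Adding `RequiresAwsClient()` to the interface means every implementation must gain the method, which is why this PR touches ccam, chgm, cpd and the others. A compile-time assertion is one way to surface a missing method at the implementation site. The sketch below assumes a trimmed-down interface; the types `needsAws` and `noAws` and the helper `namesRequiringAws` are hypothetical.

```go
package main

import "fmt"

// Investigation mirrors only the part of the interface relevant here.
type Investigation interface {
	Name() string
	RequiresAwsClient() bool
}

// needsAws is a hypothetical investigation that requires an AWS client.
type needsAws struct{}

func (n *needsAws) Name() string            { return "needs-aws" }
func (n *needsAws) RequiresAwsClient() bool { return true }

// noAws is a hypothetical investigation that works without an AWS client.
type noAws struct{}

func (n *noAws) Name() string            { return "no-aws" }
func (n *noAws) RequiresAwsClient() bool { return false }

// Compile-time assertions: the build fails here if a type stops
// satisfying the interface, for example after a new method is added.
var (
	_ Investigation = (*needsAws)(nil)
	_ Investigation = (*noAws)(nil)
)

// namesRequiringAws lists the investigations that need an AWS client.
func namesRequiringAws(invs []Investigation) []string {
	var names []string
	for _, inv := range invs {
		if inv.RequiresAwsClient() {
			names = append(names, inv.Name())
		}
	}
	return names
}

func main() {
	fmt.Println(namesRequiringAws([]Investigation{&needsAws{}, &noAws{}}))
}
```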
4 changes: 3 additions & 1 deletion pkg/investigations/registry.go
@@ -6,12 +6,13 @@ import (
"github.com/openshift/configuration-anomaly-detection/pkg/investigations/clustermonitoringerrorbudgetburn"
"github.com/openshift/configuration-anomaly-detection/pkg/investigations/cpd"
"github.com/openshift/configuration-anomaly-detection/pkg/investigations/investigation"
"github.com/openshift/configuration-anomaly-detection/pkg/logging"
)

// availableInvestigations holds all Investigation implementations.
var availableInvestigations = []investigation.Investigation{
&ccam.Investigation{},
&chgm.Investiation{},
&chgm.Investigation{},
&clustermonitoringerrorbudgetburn.Investigation{},
&cpd.Investigation{},
}
@@ -26,5 +27,6 @@ func GetInvestigation(title string, experimental bool) investigation.Investigati
return inv
}
}
logging.Debugf("No investigation found for: %s", title)
return nil
}
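
Since `GetInvestigation` returns `nil` when nothing matches, callers need a nil check before calling methods on the result. A minimal sketch of that lookup-and-check pattern follows. It assumes a simplified interface and omits the experimental flag; `lookup`, `available`, and `missingCluster` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Investigation is a simplified stand-in for the real interface.
type Investigation interface {
	Name() string
	ShouldInvestigateAlert(string) bool
}

// missingCluster is a hypothetical investigation used for illustration.
type missingCluster struct{}

func (m *missingCluster) Name() string { return "missing-cluster" }
func (m *missingCluster) ShouldInvestigateAlert(alert string) bool {
	return strings.Contains(alert, "has gone missing")
}

// available plays the role of the registry slice.
var available = []Investigation{&missingCluster{}}

// lookup returns the first investigation that accepts the alert title, or nil.
func lookup(title string) Investigation {
	for _, inv := range available {
		if inv.ShouldInvestigateAlert(title) {
			return inv
		}
	}
	// Returning the untyped nil keeps "inv == nil" checks in callers reliable.
	return nil
}

func main() {
	if inv := lookup("cluster has gone missing"); inv != nil {
		fmt.Println("found:", inv.Name())
	}
	if inv := lookup("unrelated alert"); inv == nil {
		fmt.Println("no investigation found")
	}
}
```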
17 changes: 17 additions & 0 deletions test/backplane.sh
@@ -0,0 +1,17 @@
#!/bin/bash
set -euo pipefail

# clone
git -C backplane-api pull || git clone --depth 1 --branch master [email protected]:service/backplane-api.git
# build
cd backplane-api
make build
# setup, this does not look too good :D
sudo make dev-certs
sudo chmod 644 localhost.key
# setup ocm config
cp "$HOME/.config/ocm/ocm.json" configs/ocm.json
# run, in background? second terminal ?
RUN_ARGS=--cloud-config=./configs/cloud-config.yml make run-local-with-testremediation GIT_REPO="../"


3 changes: 2 additions & 1 deletion test/set_stage_env.sh
@@ -10,6 +10,7 @@ for v in $(vault kv get -format=json osd-sre/configuration-anomaly-detection/pd
unset VAULT_ADDR VAULT_TOKEN

export CAD_EXPERIMENTAL_ENABLED=true
export BACKPLANE_PROXY=http://squid.corp.redhat.com:3128
# export BACKPLANE_PROXY=http://squid.corp.redhat.com:3128
export BACKPLANE_URL=https://localhost:8001

set +euo pipefail