|
11 | 11 | - [Contributing](#contributing)
|
12 | 12 | - [Building](#building)
|
13 | 13 | - [Adding a new investigation](#adding-a-new-investigation)
|
| 14 | + - [Graduating an investigation](#graduating-an-investigation) |
14 | 15 | - [Testing locally](#testing-locally)
|
15 | 16 | - [Pre-requirements](#pre-requirements)
|
16 | 17 | - [Running cadctl for an incident ID](#running-cadctl-for-an-incident-id)
|
|
20 | 21 | - [Templates](#templates)
|
21 | 22 | - [Dashboards](#dashboards)
|
22 | 23 | - [Deployment](#deployment)
|
| 24 | + - [Progressive Delivery](#progressive-delivery) |
23 | 25 | - [Boilerplate](#boilerplate)
|
24 | 26 | - [PipelinePruner](#pipelinepruner)
|
25 | 27 | - [Required ENV variables](#required-env-variables)
|
@@ -71,6 +73,27 @@ To add a new alert investigation:
|
71 | 73 | - investigation.Resources contains initialized clients for the cluster's AWS environment, OCM and more. See [Integrations](#integrations)
|
72 | 74 | - Add test objects or scripts used to recreate the alert symptoms to the `pkg/investigations/$INVESTIGATION_NAME/testing/` directory for future use. Be sure to clearly document the testing procedure under the `Testing` section of the investigation-specific README.md file
|
73 | 75 |
|
| 76 | +### Graduating an investigation |
| 77 | + |
| 78 | +New investigations and their remediation steps are deployed in advancing stages through a progressive delivery strategy (see [Progressive Delivery](#progressive-delivery)). |
| 79 | + |
| 80 | +1. **Informing stage (Read-only):** |
| 81 | + The investigation is merely informative through PagerDuty at this stage; remediation _**does not involve any write operations**_. Notes are collected throughout the investigation, and upon the investigation's conclusion are posted to PagerDuty. |
| 82 | + |
| 83 | + **Aim**: Validating the investigation's accuracy and usefulness **without performing any write actions**. |
| 84 | + |
| 85 | + **Validation Criteria:** The investigation successfully carries out each step on its respective incident type, over a span of several days. It provides useful information (equivalent to a manual investigation) to SREs through PagerDuty. |
| 86 | + |
| 87 | +2. **Incubation / Canary (Limited Write):** |
| 88 | + The remediation continues to be limited to information gathering on the majority of clusters; however, write operations are validated on a small subset of clusters, based on region (TODO). |
| 89 | + |
| 90 | + **Aim:** Validating the remediation's **_write operations_** on a controlled subset of the fleet. |
| 91 | + |
| 92 | + **Validation Criteria:** Write operations perform successfully and as expected on the defined subset of clusters; potential issues with write actions should be caught at this stage. |
| 93 | + |
| 94 | +3. **Graduation (Read & Write):** |
| 95 | + The investigation's remediation functions, including **read and write**, are performed on all applicable clusters (high-impact clusters should remain read-only). |
| 96 | + |
74 | 97 | ### Integrations
|
75 | 98 |
|
76 | 99 | > **Note:** When writing an investigation, you can use them right away.
|
@@ -147,6 +170,24 @@ Grafana dashboard configmaps are stored in the [Dashboards](./dashboards/) direc
|
147 | 170 | * [Skip Webhooks](./deploy/skip-webhook/README.md) -- Skipping the eventlistener and creating the pipelinerun directly.
|
148 | 171 | * [Namespace](./deploy/namespace/README.md) -- Allowing the code to ignore the namespace.
|
149 | 172 |
|
| 173 | +### Progressive Delivery |
| 174 | + |
| 175 | +New investigations are deployed following a "Canary Deployment Strategy". This allows for a monitored, progressive deployment of new investigations and remediative steps, and limits fleet-wide issues. |
| 176 | + |
| 177 | +Investigations and their respective remediation capabilities are promoted as follows: |
| 178 | + |
| 179 | +1. Read-only (informing) on stage & production |
| 180 | + |
| 181 | +2. Read/Write investigation implemented on stage |
| 182 | + |
| 183 | +3. Read/Write promoted to production canary clusters |
| 184 | + |
| 185 | +4. Soak time |
| 186 | + |
| 187 | +5. Full investigation (read/write) promoted fleet-wide |
| 188 | + |
| 189 | +> **Note:** When necessary, a "fast-track" graduation can be used as a workaround to force changes fleet-wide as quickly as possible. |
| 190 | +
|
150 | 191 | ### Boilerplate
|
151 | 192 |
|
152 | 193 | * [Boilerplate](./boilerplate/openshift/osd-container-image/README.md) -- Conventions for OSD containers.
|
|