Commit 21d09e4

Add documentation for graduation strategy
1 parent b8f493d commit 21d09e4

1 file changed: +41 -0 lines changed

README.md

Lines changed: 41 additions & 0 deletions
@@ -11,6 +11,7 @@
 - [Contributing](#contributing)
 - [Building](#building)
 - [Adding a new investigation](#adding-a-new-investigation)
+- [Graduating an investigation](#graduating-an-investigation)
 - [Testing locally](#testing-locally)
 - [Pre-requirements](#pre-requirements)
 - [Running cadctl for an incident ID](#running-cadctl-for-an-incident-id)
@@ -20,6 +21,7 @@
 - [Templates](#templates)
 - [Dashboards](#dashboards)
 - [Deployment](#deployment)
+- [Progressive Delivery](#progressive-delivery)
 - [Boilerplate](#boilerplate)
 - [PipelinePruner](#pipelinepruner)
 - [Required ENV variables](#required-env-variables)
@@ -71,6 +73,27 @@ To add a new alert investigation:
 - investigation.Resources contains initialized clients for the cluster's AWS environment, OCM and more. See [Integrations](#integrations)
 - Add test objects or scripts used to recreate the alert symptoms to the `pkg/investigations/$INVESTIGATION_NAME/testing/` directory for future use. Be sure to clearly document the testing procedure under the `Testing` section of the investigation-specific README.md file
 
+### Graduating an investigation
+
+New investigations and their remediation steps are deployed in advancing stages through a progressive delivery strategy (see [Progressive Delivery](#progressive-delivery)).
+
+1. **Informing stage (Read-only):**
+At this stage the investigation only reports its findings through PagerDuty; remediation _**does not involve any write operations**_. Notes are collected throughout the investigation and posted to PagerDuty when it concludes.
+
+**Aim:** Validating the investigation's accuracy and usefulness **without performing any write actions**.
+
+**Validation Criteria:** The investigation successfully carries out each step on its respective incident type over a span of several days. It provides useful information (equivalent to a manual investigation) to SREs through PagerDuty.
+
+2. **Incubation / Canary (Limited Write):**
+The remediation remains limited to information gathering on the majority of clusters; write operations are validated on a small subset of clusters, selected by region (TODO).
+
+**Aim:** Validating the remediation's **_write operations_** on a controlled subset of the fleet.
+
+**Validation Criteria:** Write operations perform successfully and as expected on the defined subset of clusters; potential issues with write actions should be caught at this stage.
+
+3. **Graduation (Read & Write):**
+The investigation's remediation functions, including **read and write**, are performed on all applicable clusters (high-impact clusters should remain read-only).
+
 ### Integrations
 
 > **Note:** When writing an investigation, you can use them right away.
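
The three graduation stages added in the hunk above amount to a gate on write operations. The following is a minimal sketch of such a gate, not code from this repository: `Stage`, `Cluster`, `WritesAllowed` and the `canaryRegions` map are all hypothetical names, the region-based subset is still marked TODO in the README, and treating high-impact clusters as read-only at every stage (not only after graduation) is an assumption.

```go
package main

import "fmt"

// Stage is a hypothetical enum for the three graduation stages.
type Stage int

const (
	Informing Stage = iota // read-only: findings are only posted to PagerDuty
	Canary                 // limited write: writes on a small subset of clusters
	Graduated              // read and write on all applicable clusters
)

// Cluster is a hypothetical, minimal cluster description used for gating.
type Cluster struct {
	ID         string
	Region     string
	HighImpact bool
}

// WritesAllowed reports whether write operations may run on c at the given stage.
// canaryRegions is an assumed stand-in for the region-based subset (a TODO in the README).
func WritesAllowed(stage Stage, c Cluster, canaryRegions map[string]bool) bool {
	// Assumption: high-impact clusters stay read-only at every stage.
	if c.HighImpact {
		return false
	}
	switch stage {
	case Canary:
		return canaryRegions[c.Region]
	case Graduated:
		return true
	default: // Informing, or any unknown stage, never writes
		return false
	}
}

func main() {
	canaryRegions := map[string]bool{"region-a": true}
	clusters := []Cluster{
		{ID: "c1", Region: "region-a"},
		{ID: "c2", Region: "region-b"},
		{ID: "c3", Region: "region-a", HighImpact: true},
	}
	for _, stage := range []Stage{Informing, Canary, Graduated} {
		for _, c := range clusters {
			fmt.Printf("stage=%d cluster=%s writes=%t\n",
				stage, c.ID, WritesAllowed(stage, c, canaryRegions))
		}
	}
}
```

Keeping the decision in one pure function means an investigation could ask a single question before any write, and defaulting to read-only for unknown stages fails safe.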
@@ -147,6 +170,24 @@ Grafana dashboard configmaps are stored in the [Dashboards](./dashboards/) direc
 * [Skip Webhooks](./deploy/skip-webhook/README.md) -- Skipping the eventlistener and creating the pipelinerun directly.
 * [Namespace](./deploy/namespace/README.md) -- Allowing the code to ignore the namespace.
 
+### Progressive Delivery
+
+New investigations are deployed following a "Canary Deployment Strategy". This allows for a monitored, progressive deployment of new investigations and remediation steps, and limits the risk of fleet-wide issues.
+
+Investigations and their respective remediation capabilities are promoted as follows:
+
+1. Read-only (informing) on stage & production
+
+2. Read/Write investigation implemented on stage
+
+3. Read/Write promoted to production canary clusters
+
+4. Soak time
+
+5. Full investigation (read/write) promoted fleet-wide
+
+> **Note:** When necessary, a "fast-track" graduation can be used as a workaround to force changes fleet-wide as quickly as possible.
+
 ### Boilerplate
 
 * [Boilerplate](./boilerplate/openshift/osd-container-image/README.md) -- Conventions for OSD containers.
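
The promotion order in the Progressive Delivery section added above is a strictly linear sequence, which can be sketched as an ordered enum with a `Next` function. This is only an illustration: `Step`, its constants, `stepNames` and `Next` are hypothetical names, and the "fast-track" path is left out because the README does not describe how it works.

```go
package main

import "fmt"

// Step is a hypothetical enum for the promotion order in the README.
type Step int

const (
	ReadOnlyStageAndProd Step = iota // 1. read-only (informing) on stage & production
	ReadWriteStage                   // 2. read/write implemented on stage
	ReadWriteProdCanary              // 3. read/write on production canary clusters
	SoakTime                         // 4. soak time
	ReadWriteFleetWide               // 5. read/write promoted fleet-wide
)

var stepNames = [...]string{
	"read-only on stage and production",
	"read/write on stage",
	"read/write on production canary clusters",
	"soak time",
	"read/write fleet-wide",
}

// String returns a human-readable name for the step.
func (s Step) String() string {
	if s < 0 || int(s) >= len(stepNames) {
		return "unknown step"
	}
	return stepNames[s]
}

// Next returns the step that follows s, or (s, false) when s is the
// final step or not a known step.
func Next(s Step) (Step, bool) {
	if s < ReadOnlyStageAndProd || s >= ReadWriteFleetWide {
		return s, false
	}
	return s + 1, true
}

func main() {
	// Walk the promotion order from the first step to the last.
	for s, ok := ReadOnlyStageAndProd, true; ok; s, ok = Next(s) {
		fmt.Println(s)
	}
}
```

Because `Next` only ever advances by one step, skipping the canary or soak phases would require an explicit, separate code path, which matches the README's framing of fast-track as a workaround.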
