Enhancements to reliability/chaos testing sections (microsoft#938)

nairmai · web-flow · commit ad04710f27c7 · 2023-02-26T10:24:27.000+02:00
* Minor spelling/grammar fixes in CONTRIBUTING.md

* Update reliability and fault injection testing docs.

* Minor docs edits for testing, reliability, etc.

* Fix typo

* Update list of chaos testing tools in reliability/README

* Resolving cspell issues

* Change cspell manually (without vscode extension)
diff --git a/.cspell.json b/.cspell.json
@@ -456,7 +456,8 @@
     "Kubeseal",
     "KSOPS",
     "OTLP",
-    "Quic"
+    "Quic",
+    "Simmy"
   ],
   "version": "0.2"
-}
+}
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -76,7 +76,7 @@ When this occurs, do the following
 1. Verify that the link is OK, if it redirects, change the path to be the final link.
 1. If the link is not ok, fix the link (even if it is not in your document) if you find a good equivalent link. If you can't find a good equivalent link, contact one of the [maintainers](#maintainers) for a solution.
 1. Re-run the job, or ask to have the job re-run (if you are a first time contributor). Sometimes the link checker fails due to temporary connectivity issues.
-1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markodownlinkcheck.json` file in the root of the repository.
+1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markdownlinkcheck.json` file in the root of the repository.
 
 ## Running Locally (*Remotely*)
 
@@ -90,7 +90,7 @@ Finally, launch the site locally using the `mkdocs serve` command from the root
 
 ## Maintainers
 
-For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera)
+For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera).
 
 ## Legal Notices
 
@@ -99,7 +99,7 @@ Microsoft and any contributors grant you a license to the Microsoft documentatio
 Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft
 names, logos, or trademarks. Microsoft's general trademark guidelines can be found at <https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks>.
 
-Privacy information can be found at <https://privacy.microsoft.com/en-us/>
+Privacy information can be found at <https://privacy.microsoft.com/en-us/>.
 
 Microsoft and any contributors reserve all others rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel
 or otherwise.
diff --git a/docs/automated-testing/README.md b/docs/automated-testing/README.md
@@ -43,11 +43,12 @@ The table below maps outcomes -- the results that you may want to achieve in you
 | Staging; Operation                                                | Create/Exercise runbook for increasing/reducing provisioning                                                                                                                                                                                                                                                                                                      | Scale drills                                                                                                                                                                                                                      |
 | Staging; Operation                                                | Measure behavior under rapid changes in traffic                                                                                                                                                                                                                                                                                                                   | Spike                                                                                                                                                                                                                             |
 | Staging; Optimizing                                               | Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. cost per million concurrent users)                                                                                                                                                                                                                              | Load (stress)                                                                                                                                                                                                                     |
-| Development; Operation                                            | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) | Chaos                                                                                                                                                                                                                             |
+| Development; Operation                                            | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) | [Fault injection/chaos testing](fault-injection-testing/README.md)                                                                                                                                                                      |
 | Development                                                       | Perform unit testing on Power platform custom connectors                                                                                                                                                                                                                                                                                                          | [Custom Connector Testing](unit-testing/custom-connector.md)                                                                                                                                                                      |
 
 ## Sections within Testing
 
+- [Consumer-driven contract (CDC) testing](cdc-testing/README.md)
 - [End-to-End testing](e2e-testing/README.md)
 - [Fault Injection testing](fault-injection-testing/README.md)
 - [Integration testing](integration-testing/README.md)
diff --git a/docs/automated-testing/fault-injection-testing/README.md b/docs/automated-testing/fault-injection-testing/README.md
@@ -1,6 +1,6 @@
 # Fault Injection Testing
 
-Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability. The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
+Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its [stability and reliability](../../reliability/README.md). The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
 
 ## When To Use
 
@@ -60,11 +60,11 @@ Much like [Synthetic Monitoring Tests](../synthetic-monitoring-tests/README.md),
 Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering:
 
 * Measure and define a steady (healthy) state for the system's interoperability.
-* hypothesize based on a fault mode.
+* Create hypotheses based on predicted behavior when a fault is introduced.
 * Introduce real-world fault-events to the system.
 * Measure the state and compare it to the baseline state.
-* Document the process and the observations
-* Identify and act on the result
+* Document the process and the observations.
+* Identify and act on the result.
 
 ## Best Practices and Advice
 
@@ -89,9 +89,11 @@ A test can either succeed or fail. In the event of failure, there will likely be
 
 ### Chaos
 
+* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
 * [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
 * [Kraken](https://github.com/openshift-scale/kraken) - An Openshift-specific chaos tool, maintained by Redhat.
 * [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
+* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
 
 ## Conclusion
 
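The experiment steps listed in the fault injection README hunk above (measure a steady state, state a hypothesis, introduce a fault, compare against the baseline) can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not code from the playbook or from any tool listed in the diff: `flaky_dependency`, `call_with_retry`, `run_experiment`, the retry count, and the 5% error budget are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    faulted_error_rate: float
    hypothesis_holds: bool


def flaky_dependency(fail_probability: float, rng: random.Random) -> str:
    """Hypothetical downstream call that fails with the given probability."""
    if rng.random() < fail_probability:
        raise ConnectionError("injected fault")
    return "ok"


def call_with_retry(dependency: Callable[[], str], attempts: int = 3) -> bool:
    """Hypothetical system under test: retry the dependency, then give up."""
    for _ in range(attempts):
        try:
            dependency()
            return True
        except ConnectionError:
            continue
    return False


def error_rate(dependency: Callable[[], str], requests: int = 1000) -> float:
    """Fraction of requests that still fail after retries."""
    failures = sum(1 for _ in range(requests) if not call_with_retry(dependency))
    return failures / requests


def run_experiment(
    fault_probability: float = 0.3,
    max_error_rate: float = 0.05,  # assumed error budget for the hypothesis
    seed: int = 42,
) -> ExperimentResult:
    rng = random.Random(seed)

    # 1. Measure the steady (healthy) state with no fault injected.
    baseline = error_rate(lambda: flaky_dependency(0.0, rng))

    # 2. Hypothesis: with retries in place, the error rate stays at or
    #    below max_error_rate even when the dependency fails intermittently.

    # 3. Introduce the fault and measure again.
    faulted = error_rate(lambda: flaky_dependency(fault_probability, rng))

    # 4. Compare against the hypothesis; documenting and acting on the
    #    result is left to the caller.
    return ExperimentResult(baseline, faulted, faulted <= max_error_rate)
```

Calling `run_experiment()` returns both measurements and whether the hypothesis held, which is the information the "document" and "act on the result" steps need. A real experiment would replace the simulated dependency with faults injected by one of the tools listed above and would read the error rate from the system's metrics.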
diff --git a/docs/reliability/README.md b/docs/reliability/README.md
@@ -64,7 +64,8 @@ We can build graceful failure (or graceful degradation) into our software stack
   * [Leader Election](https://en.wikipedia.org/wiki/Leader_election) can be used to keep healthy services on standby in case the leader experiences issues.
   * Entire cluster failover can redirect traffic to another region or availability zone.
   * Propagate downstream failures of **dependent services** up the stack via health checks, so that your ingress points can re-route to healthy services.
-* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system
+* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system.
+  Consider using a well-known, tested library such as [Polly](https://github.com/App-vNext/Polly) (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns.
 
 ## Practice
 
@@ -80,12 +81,15 @@ Take the time to fabricate scenarios, and run a D&D style campaign to solve your
 
 ### Chaos Testing
 
-Leverage automated chaos testing to see how things break. Check out the list of the following tools:
+Leverage automated chaos testing to see how things break. You can read this playbook's [article on fault injection testing](../automated-testing/fault-injection-testing/README.md) for more information on developing a hypothesis-driven suite of automated chaos tests. The following list of chaos testing tools, as well as [this section in the article linked above](../automated-testing/fault-injection-testing/README.md#chaos), has more details on available platforms and tooling for this purpose:
 
-* [Chaos Monkey](https://netflix.github.io/chaosmonkey/)
-* [Kraken](https://github.com/cloud-bulldozer/kraken)
+* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
+* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
+* [Kraken](https://github.com/openshift-scale/kraken) - An OpenShift-specific chaos tool, maintained by Red Hat.
+* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
 * Many services meshes, like [Linkerd](https://linkerd.io/2/features/fault-injection/), offer fault injection tooling through the use of their sidecars.
 * [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh)
+* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
 
 ## Analyze all Failures
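The circuit breaker bullet added to `docs/reliability/README.md` above describes bailing early on requests instead of propagating errors. A minimal Python sketch of that pattern is shown below, assuming a simple consecutive-failure threshold and a fixed reset timeout. It is not Polly's API and not a production implementation; `CircuitBreaker`, `CircuitOpenError`, and their parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the unhealthy service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Re-open on a failed trial call, or open once the number of
            # consecutive failures reaches the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(lambda: client.get(...))` (where `client` is whatever HTTP client the service uses) lets callers catch `CircuitOpenError` and fall back immediately. As the diff itself advises, a well-tested library is preferable to a hand-rolled breaker in real services.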