Commit ad04710

Enhancements to reliability/chaos testing sections (microsoft#938)
* Minor spelling/grammar fixes in CONTRIBUTING.md
* Update reliability and fault injection testing docs.
* Minor docs edits for testing, reliability, etc.
* Fix typo
* Update list of chaos testing tools in reliability/README
* Resolving cspell issues
* Change cspell manually (without vscode extension)
1 parent db0e1eb commit ad04710

5 files changed, +22 -14 lines changed

.cspell.json

+3 -2
@@ -456,7 +456,8 @@
456456
"Kubeseal",
457457
"KSOPS",
458458
"OTLP",
459-
"Quic"
459+
"Quic",
460+
"Simmy"
460461
],
461462
"version": "0.2"
462-
}
463+
}

CONTRIBUTING.md

+3 -3
@@ -76,7 +76,7 @@ When this occurs, do the following
7676
1. Verify that the link is OK, if it redirects, change the path to be the final link.
7777
1. If the link is not ok, fix the link (even if it is not in your document) if you find a good equivalent link. If you can't find a good equivalent link, contact one of the [maintainers](#maintainers) for a solution.
7878
1. Re-run the job, or ask to have the job re-run (if you are a first time contributor). Sometimes the link checker fails due to temporary connectivity issues.
79-
1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markodownlinkcheck.json` file in the root of the repository.
79+
1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markdownlinkcheck.json` file in the root of the repository.
8080

8181
## Running Locally (*Remotely*)
8282

@@ -90,7 +90,7 @@ Finally, launch the site locally using the `mkdocs serve` command from the root
9090

9191
## Maintainers
9292

93-
For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera)
93+
For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera).
9494

9595
## Legal Notices
9696

@@ -99,7 +99,7 @@ Microsoft and any contributors grant you a license to the Microsoft documentatio
9999
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft
100100
names, logos, or trademarks. Microsoft's general trademark guidelines can be found at <https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks>.
101101

102-
Privacy information can be found at <https://privacy.microsoft.com/en-us/>
102+
Privacy information can be found at <https://privacy.microsoft.com/en-us/>.
103103

104104
Microsoft and any contributors reserve all others rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel
105105
or otherwise.

docs/automated-testing/README.md

+2 -1
@@ -43,11 +43,12 @@ The table below maps outcomes -- the results that you may want to achieve in you
4343
| Staging; Operation | Create/Exercise runbook for increasing/reducing provisioning | Scale drills |
4444
| Staging; Operation | Measure behavior under rapid changes in traffic | Spike |
4545
| Staging; Optimizing | Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. cost per million concurrent users) | Load (stress) |
46-
| Development; Operation | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) | Chaos |
46+
| Development; Operation | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) | [Fault injection/chaos testing](fault-injection-testing/README.md) |
4747
| Development | Perform unit testing on Power platform custom connectors | [Custom Connector Testing](unit-testing/custom-connector.md) |
4848

4949
## Sections within Testing
5050

51+
- [Consumer-driven contract (CDC) testing](cdc-testing/README.md)
5152
- [End-to-End testing](e2e-testing/README.md)
5253
- [Fault Injection testing](fault-injection-testing/README.md)
5354
- [Integration testing](integration-testing/README.md)

docs/automated-testing/fault-injection-testing/README.md

+6 -4
@@ -1,6 +1,6 @@
11
# Fault Injection Testing
22

3-
Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability. The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
3+
Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its [stability and reliability](../../reliability/README.md). The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
44

55
## When To Use
66

@@ -60,11 +60,11 @@ Much like [Synthetic Monitoring Tests](../synthetic-monitoring-tests/README.md),
6060
Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering:
6161

6262
* Measure and define a steady (healthy) state for the system's interoperability.
63-
* hypothesize based on a fault mode.
63+
* Create hypotheses based on predicted behavior when a fault is introduced.
6464
* Introduce real-world fault-events to the system.
6565
* Measure the state and compare it to the baseline state.
66-
* Document the process and the observations
67-
* Identify and act on the result
66+
* Document the process and the observations.
67+
* Identify and act on the result.
6868

6969
## Best Practices and Advice
7070

@@ -89,9 +89,11 @@ A test can either succeed or fail. In the event of failure, there will likely be
8989

9090
### Chaos
9191

92+
* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
9293
* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
9394
* [Kraken](https://github.com/openshift-scale/kraken) - An Openshift-specific chaos tool, maintained by Redhat.
9495
* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
96+
* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
9597

9698
## Conclusion
9799
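
The high-level steps listed in the hunk above (measure a steady state, form a hypothesis, introduce a fault, compare against the baseline, document the result) can be sketched as a small simulation. This is only an illustration under stated assumptions: `call_dependency`, `handle_request`, `success_rate`, and `run_experiment` are hypothetical names that appear nowhere in the playbook, the "fault" is a simulated random failure rather than a real outage, and a real experiment would use one of the tools listed under "Chaos" together with the system's actual metrics.

```python
import random


def call_dependency(fault_rate, rng):
    """Hypothetical stand-in for a downstream call; True means success."""
    return rng.random() >= fault_rate


def handle_request(fault_rate, rng, attempts=3):
    """Simulated system under test: retry the dependency up to `attempts` times."""
    return any(call_dependency(fault_rate, rng) for _ in range(attempts))


def success_rate(fault_rate, rng, samples=2000):
    """Measure the fraction of requests that succeed over `samples` requests."""
    successes = sum(handle_request(fault_rate, rng) for _ in range(samples))
    return successes / samples


def run_experiment(injected_fault_rate, tolerance, seed=0):
    rng = random.Random(seed)
    # Step 1: measure the steady (healthy) state with no fault injected.
    baseline = success_rate(0.0, rng)
    # Step 2: hypothesis: with the fault present, the success rate stays
    # within `tolerance` of the baseline because of the retries.
    # Step 3: introduce the fault (here: a simulated failure rate).
    observed = success_rate(injected_fault_rate, rng)
    # Step 4: compare the measured state to the baseline.
    hypothesis_holds = (baseline - observed) <= tolerance
    # Step 5: document the observations so they can be acted on.
    return {
        "baseline": baseline,
        "observed": observed,
        "injected_fault_rate": injected_fault_rate,
        "hypothesis_holds": hypothesis_holds,
    }


if __name__ == "__main__":
    print(run_experiment(injected_fault_rate=0.3, tolerance=0.05))
```

Because the comparison is statistical, the sketch samples many requests and compares rates rather than asserting on a single call. A fixed seed keeps the run repeatable, which makes it easier to document.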

docs/reliability/README.md

+8 -4
@@ -64,7 +64,8 @@ We can build graceful failure (or graceful degradation) into our software stack
6464
* [Leader Election](https://en.wikipedia.org/wiki/Leader_election) can be used to keep healthy services on standby in case the leader experiences issues.
6565
* Entire cluster failover can redirect traffic to another region or availability zone.
6666
* Propagate downstream failures of **dependent services** up the stack via health checks, so that your ingress points can re-route to healthy services.
67-
* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system
67+
* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system.
68+
Consider using a well-known, tested library such as [Polly](https://github.com/App-vNext/Polly) (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns.
6869

6970
## Practice
7071

@@ -80,12 +81,15 @@ Take the time to fabricate scenarios, and run a D&D style campaign to solve your
8081

8182
### Chaos Testing
8283

83-
Leverage automated chaos testing to see how things break. Check out the list of the following tools:
84+
Leverage automated chaos testing to see how things break. You can read this playbook's [article on fault injection testing](../automated-testing/fault-injection-testing/README.md) for more information on developing a hypothesis-driven suite of automated chaos test. The following list of chaos testing tools as well as [this section in the article linked above](../automated-testing/fault-injection-testing/README.md#chaos) have more details on available platforms and tooling for this purpose:
8485

85-
* [Chaos Monkey](https://netflix.github.io/chaosmonkey/)
86-
* [Kraken](https://github.com/cloud-bulldozer/kraken)
86+
* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
87+
* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
88+
* [Kraken](https://github.com/openshift-scale/kraken) - An Openshift-specific chaos tool, maintained by Redhat.
89+
* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
8790
* Many services meshes, like [Linkerd](https://linkerd.io/2/features/fault-injection/), offer fault injection tooling through the use of their sidecars.
8891
* [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh)
92+
* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
8993

9094
## Analyze all Failures
9195
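
The circuit breaker bullet added in the first hunk of this file says a breaker can bail early on requests instead of propagating errors through the system. A minimal sketch of that pattern follows. It is not Polly's API and is not taken from the playbook: `CircuitBreaker`, `CircuitOpenError`, and their parameters are hypothetical names chosen for illustration, and a production service would more likely rely on a tested library, as the hunk itself recommends.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Illustrative breaker: opens after consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: bail early instead of hitting the failing dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function performs the request, and treat `CircuitOpenError` as a signal to serve a degraded response.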
