Enhancements to reliability/chaos testing sections (microsoft#938)
* Minor spelling/grammar fixes in CONTRIBUTING.md
* Update reliability and fault injection testing docs.
* Minor docs edits for testing, reliability, etc.
* Fix typo
* Update list of chaos testing tools in reliability/README
* Resolving cspell issues
* Change cspell manually (without vscode extension)
CONTRIBUTING.md
+3-3
@@ -76,7 +76,7 @@ When this occurs, do the following
76
76
1. Verify that the link is OK, if it redirects, change the path to be the final link.
77
77
1. If the link is not ok, fix the link (even if it is not in your document) if you find a good equivalent link. If you can't find a good equivalent link, contact one of the [maintainers](#maintainers) for a solution.
78
78
1. Re-run the job, or ask to have the job re-run (if you are a first time contributor). Sometimes the link checker fails due to temporary connectivity issues.
79
-
1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markodownlinkcheck.json` file in the root of the repository.
79
+
1. If the link checker still fails, and you have confirmed that the link is ok, exclude the link from checking, in the `.markdownlinkcheck.json` file in the root of the repository.
80
80
81
81
## Running Locally (*Remotely*)
82
82
@@ -90,7 +90,7 @@ Finally, launch the site locally using the `mkdocs serve` command from the root
90
90
91
91
## Maintainers
92
92
93
-
For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera)
93
+
For any questions or concerns, please contact [Tess Ferrandez](https://github.com/TessFerrandez), [Shiran Rubin](https://github.com/shiranr) or [Federica Nocera](https://github.com/fnocera).
94
94
95
95
## Legal Notices
96
96
@@ -99,7 +99,7 @@ Microsoft and any contributors grant you a license to the Microsoft documentatio
99
99
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft
100
100
names, logos, or trademarks. Microsoft's general trademark guidelines can be found at <https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks>.
101
101
102
-
Privacy information can be found at <https://privacy.microsoft.com/en-us/>
102
+
Privacy information can be found at <https://privacy.microsoft.com/en-us/>.
103
103
104
104
Microsoft and any contributors reserve all others rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel
| Staging; Operation | Measure behavior under rapid changes in traffic | Spike |
45
45
| Staging; Optimizing | Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. cost per million concurrent users) | Load (stress) |
46
-
| Development; Operation | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) |Chaos |
46
+
| Development; Operation | Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, …) |[Fault injection/chaos testing](fault-injection-testing/README.md)|
47
47
| Development | Perform unit testing on Power platform custom connectors |[Custom Connector Testing](unit-testing/custom-connector.md)|
docs/automated-testing/fault-injection-testing/README.md
+6-4
@@ -1,6 +1,6 @@
1
1
# Fault Injection Testing
2
2
3
-
Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability. The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
3
+
Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its [stability and reliability](../../reliability/README.md). The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.
4
4
5
5
## When To Use
6
6
@@ -60,11 +60,11 @@ Much like [Synthetic Monitoring Tests](../synthetic-monitoring-tests/README.md),
60
60
Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering:
61
61
62
62
* Measure and define a steady (healthy) state for the system's interoperability.
63
-
* hypothesize based on a fault mode.
63
+
* Create hypotheses based on predicted behavior when a fault is introduced.
64
64
* Introduce real-world fault-events to the system.
65
65
* Measure the state and compare it to the baseline state.
66
-
* Document the process and the observations
67
-
* Identify and act on the result
66
+
* Document the process and the observations.
67
+
* Identify and act on the result.
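The steps above can be sketched in a few lines of Python. This is a minimal, simulated illustration rather than a real chaos tool: `call_service`, `error_rate`, `run_experiment`, the retry behavior, and the tolerated error rate are all hypothetical names and values chosen for the example. A real experiment would read its metrics from the system's observability stack and inject faults with one of the tools listed in this document.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    faulted_error_rate: float
    hypothesis_holds: bool


def call_service(fault_probability: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for a real request; returns True on success.

    Each attempt fails with `fault_probability`. The caller retries once,
    which is the resilience behavior the hypothesis is about.
    """
    for _attempt in range(2):  # first try plus one retry
        if rng.random() >= fault_probability:
            return True
    return False


def error_rate(fault_probability: float, requests: int, rng: random.Random) -> float:
    """Fraction of simulated requests that fail after the retry."""
    failures = sum(not call_service(fault_probability, rng) for _ in range(requests))
    return failures / requests


def run_experiment(
    fault_probability: float,
    tolerated_error_rate: float,
    requests: int = 5000,
    seed: int = 0,
) -> ExperimentResult:
    rng = random.Random(seed)
    # 1. Measure the steady (healthy) state with no fault injected.
    baseline = error_rate(0.0, requests, rng)
    # 2. Hypothesis: with the fault present, the error rate stays at or
    #    below `tolerated_error_rate` because of the retry.
    # 3. Introduce the fault, then 4. measure the state again.
    faulted = error_rate(fault_probability, requests, rng)
    # 5./6. Return the observations so they can be documented and acted on.
    return ExperimentResult(baseline, faulted, faulted <= tolerated_error_rate)


if __name__ == "__main__":
    print(run_experiment(fault_probability=0.1, tolerated_error_rate=0.05))
```

Comparing `faulted_error_rate` against the baseline, rather than against zero, reflects the statistical nature of these tests: the question is whether the system stays within an agreed tolerance, not whether it never fails.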
68
68
69
69
## Best Practices and Advice
70
70
@@ -89,9 +89,11 @@ A test can either succeed or fail. In the event of failure, there will likely be
89
89
90
90
### Chaos
91
91
92
+
* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
92
93
* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
93
94
* [Kraken](https://github.com/openshift-scale/kraken) - An Openshift-specific chaos tool, maintained by Redhat.
94
95
* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
96
+
* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
docs/reliability/README.md
+8-4
Original file line number
Diff line number
Diff line change
@@ -64,7 +64,8 @@ We can build graceful failure (or graceful degradation) into our software stack
64
64
* [Leader Election](https://en.wikipedia.org/wiki/Leader_election) can be used to keep healthy services on standby in case the leader experiences issues.
65
65
* Entire cluster failover can redirect traffic to another region or availability zone.
66
66
* Propagate downstream failures of **dependent services** up the stack via health checks, so that your ingress points can re-route to healthy services.
67
-
* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system
67
+
* [Circuit breakers](https://techblog.constantcontact.com/software-development/circuit-breakers-and-microservices/#:~:text=The%20Circuit%20breaker%20pattern%20helps,unavailable%20or%20have%20high%20latency.) can bail early on requests vs. propagating errors throughout the system.
68
+
Consider using a well-known, tested library such as [Polly](https://github.com/App-vNext/Polly) (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns.
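To make the circuit breaker idea concrete, here is a minimal Python sketch of the pattern. It is an illustration only and is not Polly's API (Polly is a .NET library); the `CircuitBreaker` and `CircuitOpenError` names, the default thresholds, and the injectable `clock` parameter are assumptions made for this example. In production code, prefer a tested library as advised above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Fail fast after repeated failures, then retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self._clock = clock                 # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: bail early instead of hitting the failing dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function reaches the dependency, and treat `CircuitOpenError` as a signal to serve a fallback or degrade gracefully.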
68
69
69
70
## Practice
70
71
@@ -80,12 +81,15 @@ Take the time to fabricate scenarios, and run a D&D style campaign to solve your
80
81
81
82
### Chaos Testing
82
83
83
-
Leverage automated chaos testing to see how things break. Check out the list of the following tools:
84
+
Leverage automated chaos testing to see how things break. You can read this playbook's [article on fault injection testing](../automated-testing/fault-injection-testing/README.md) for more information on developing a hypothesis-driven suite of automated chaos test. The following list of chaos testing tools as well as [this section in the article linked above](../automated-testing/fault-injection-testing/README.md#chaos) have more details on available platforms and tooling for this purpose:
* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
87
+
* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
88
+
* [Kraken](https://github.com/openshift-scale/kraken) - An Openshift-specific chaos tool, maintained by Redhat.
89
+
* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
87
90
* Many services meshes, like [Linkerd](https://linkerd.io/2/features/fault-injection/), offer fault injection tooling through the use of their sidecars.
* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.