% Atomics

Rust pretty blatantly just inherits C11's memory model for atomics. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling
and research around C.

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out [C's specification (Section 7.17)][C11-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.

The C11 memory model is fundamentally about trying to bridge the gap between
the semantics we want, the optimizations compilers want, and the inconsistent
chaos our hardware wants. *We* would like to just write programs and have them
do exactly what we said but, you know, *fast*. Wouldn't that be great?


# Compiler Reordering

Compilers fundamentally want to be able to do all sorts of crazy
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make
events never occur! If we write something like

```rust,ignore
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would *really* be best if your program did

```rust,ignore
x = 2;
y = 3;
```

This has inverted the order of events *and* completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` *actually* being
assigned 1 before `y` was assigned. We would *really* like the compiler to be
able to make these kinds of optimizations, because they can seriously improve
performance. On the other hand, we'd really like to be able to depend on our
program *doing the thing we said*.
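To make the single-threaded argument concrete, here is a minimal sketch (the
function and names are purely illustrative) of those writes in a form where
the compiler is free to perform exactly that rewrite:

```rust
// From this function's own perspective, nothing can observe `x` between
// the writes, so the compiler may delete `*x = 1` as a dead store and
// reorder what remains: exactly the rewrite shown above.
fn writes(x: &mut i32, y: &mut i32) {
    *x = 1;
    *y = 3;
    *x = 2;
}

fn main() {
    let (mut x, mut y) = (0, 0);
    writes(&mut x, &mut y);
    // Either version of the function leaves us in the same state.
    assert_eq!((x, y), (2, 3));
}
```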


# Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our *hardware* might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of
each CPU core it is *so very far away* and *so very slow*. Each CPU would
rather work with its local cache of the data, and only go through the
*anguish* of talking to shared memory when it doesn't actually have that
memory in cache.

After all, that's the whole *point* of the cache, right? If every read from
the cache had to run back to shared memory to double check that it hadn't
changed, what would the point be? The end result is that the hardware doesn't
guarantee that events that occur in the same order on *one* thread occur in
the same order on *another* thread. To guarantee this, we must issue special
instructions to the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

```text
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)

However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
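As a sketch of what that logic might look like as real Rust (using the
`Relaxed` atomic ordering introduced later in this chapter as the closest
legal stand-in for plain shared writes), note that the `y = 2` outcome is
still allowed here:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        Y.store(3, Ordering::Relaxed);
        // Nothing stops this store from becoming visible before Y's!
        X.store(1, Ordering::Relaxed);
    });
    let t2 = thread::spawn(|| {
        if X.load(Ordering::Relaxed) == 1 {
            // This load may still see the stale y = 1, producing y = 2.
            Y.store(Y.load(Ordering::Relaxed) * 2, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("y = {}", Y.load(Ordering::Relaxed));
}
```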

It's worth noting that different kinds of CPUs provide different guarantees.
It is common to separate hardware into two categories: strongly-ordered and
weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
ARM provides weak ordering guarantees. This has two consequences for
concurrent programming:

* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even *free* because they already provide strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are *too* weak on strongly-ordered hardware is
  more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.


# Data Accesses

The C11 memory model attempts to bridge the gap by allowing us to talk about
the *causality* of our program. Generally, this is done by establishing
*happens-before* relationships between parts of the program and the threads
that are running them. This gives the hardware and compiler room to optimize
the program more aggressively where a strict happens-before relationship isn't
established, but forces them to be more careful where one *is* established.
The way we communicate these relationships is through *data accesses* and
*atomic accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free
to propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code
with. Actually, that's too weak. *It is literally impossible to write correct
synchronized code using only data accesses*.
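As a purely illustrative sketch of why: here is a flag-based handoff attempted
with nothing but data accesses. Rust refuses to compile it, and bypassing the
checker with `unsafe` code wouldn't give you a slow-but-working program; it
would give you a data race, which is Undefined Behavior:

```rust,ignore
use std::thread;

fn main() {
    let mut ready = false;
    // ERROR: does not compile. The spawned thread would need mutable
    // access to `ready` while this thread keeps reading it, and even if
    // we forced that through with `unsafe`, an unsynchronized write
    // racing an unsynchronized read is a data race.
    let t = thread::spawn(|| ready = true);
    while !ready {} // spin, hoping to observe the other thread's write
    t.join().unwrap();
}
```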

Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around re-ordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The orderings Rust exposes are:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C11 *consume* ordering)

TODO: negative reasoning vs positive reasoning? TODO: "can't forget to
synchronize"


# Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
*cannot* be reordered: all accesses on one thread that happen before and after
a SeqCst access *stay* before and after it. A data-race-free program that uses
only sequentially consistent atomics and data accesses has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions. This
*does not* hold if you start using the weaker atomic orderings.
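To illustrate, here is a sketch of the classic "store buffering" litmus test.
With `SeqCst` everywhere, the single global order means the two threads cannot
*both* miss each other's store; weaker orderings would not justify the final
assertion:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let a = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let b = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (a_saw_y, b_saw_x) = (a.join().unwrap(), b.join().unwrap());
    // In the global order all threads agree on, whichever store comes
    // second is preceded by the other store, so at least one of the two
    // loads must have returned true.
    assert!(a_saw_y || b_saw_x);
}
```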

The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms, sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program
correctness. However, sequential consistency is definitely the right choice if
you're not confident about the other memory orders. Having your program run a
bit slower than it needs to is certainly better than it running incorrectly!
It's also *mechanically* trivial to downgrade atomic operations to have a
weaker consistency later on. Just change `SeqCst` to e.g. `Relaxed` and you're
done! Of course, proving that this transformation is *correct* is a whole
other matter.


# Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at
their use case: they're perfectly suited for acquiring and releasing locks,
and ensuring that critical sections don't overlap.

Intuitively, an acquire access ensures that every access after it *stays*
after it. However, operations that occur before an acquire are free to be
reordered to occur after it. Similarly, a release access ensures that every
access before it *stays* before it. However, operations that occur after a
release are free to be reordered to occur before it.

When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
that happened *before* A's release will be observed by B *after* its acquire.
However, no causality is established with any other threads. Similarly, no
causality is established if A and B access *different* locations in memory.
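For example, here is a minimal sketch of the message-passing pattern this
describes (the names and values are just illustrative):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // happens before the release...
        READY.store(true, Ordering::Release); // ...which publishes it
    });
    // Spin until our acquire load observes the release store.
    while !READY.load(Ordering::Acquire) {}
    // Causality established: the write to DATA is guaranteed to be visible.
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
    t.join().unwrap();
}
```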

Basic use of release-acquire is therefore simple: you acquire a location of
memory to begin the critical section, and then release that location to end
it. For instance, a simple spinlock might look like:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"

    // ... distribute lock to threads somehow ...

    // Try to acquire the lock by setting it to true. The failure ordering
    // can be Relaxed because a failed swap acquires nothing.
    while lock
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err() {}
    // broke out of the loop, so we successfully acquired the lock!

    // ... scary data accesses ...

    // ok we're done, release the lock
    lock.store(false, Ordering::Release);
}
```

On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.


# Relaxed

Relaxed accesses are the absolute weakest. They can be freely re-ordered and
provide no happens-before relationship. Even so, relaxed operations *are*
still atomic. That is, they don't count as data accesses, and any
read-modify-write operations done to them occur atomically. Relaxed operations
are appropriate for things that you definitely want to happen, but don't
particularly otherwise care about. For instance, incrementing a counter can be
safely done by multiple threads using a relaxed `fetch_add` if you're not
using the counter to synchronize any other accesses.
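A minimal sketch of such a counter (the thread and iteration counts are
arbitrary):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4).map(|_| {
        thread::spawn(|| {
            for _ in 0..1000 {
                // Atomic, so no increments are lost; Relaxed, because the
                // counter isn't used to synchronize any other memory.
                COUNTER.fetch_add(1, Ordering::Relaxed);
            }
        })
    }).collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```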

There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway.
However, relaxed operations can be cheaper on weakly-ordered platforms.




[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C11-model]: http://www.open-std.org/jtc1/sc22/wg14/www/standards.html#9899