Skip to content

Commit caac584

Browse files
mark-i-mtshepangnikomatsakis
authored
Add a bit about various type system concepts (#697)
* add a bit on dataflow analysis * add a bit on quanitification * add a bit on debruijn index * add a bit on early and late bound params * add missing link * Typos Co-authored-by: Tshepang Lekhonkhobe <[email protected]> * clarify dataflow example * fix formatting * fix typos * Typos Co-authored-by: Tshepang Lekhonkhobe <[email protected]> * fix errors in background * remove dup material and make early/late intro short * adjust intro * Niko's intro Co-authored-by: Niko Matsakis <[email protected]> Co-authored-by: Tshepang Lekhonkhobe <[email protected]> Co-authored-by: Niko Matsakis <[email protected]>
1 parent 8a9a1f9 commit caac584

File tree

4 files changed

+253
-2
lines changed

4 files changed

+253
-2
lines changed

src/SUMMARY.md

+1
Original file line numberDiff line numberDiff line change
@@ -76,6 +76,7 @@
7676
- [Generic arguments](./generic_arguments.md)
7777
- [Type inference](./type-inference.md)
7878
- [Trait solving](./traits/resolution.md)
79+
- [Early and Late Bound Parameters](./early-late-bound.md)
7980
- [Higher-ranked trait bounds](./traits/hrtb.md)
8081
- [Caching subtleties](./traits/caching.md)
8182
- [Specialization](./traits/specialization.md)

src/appendix/background.md

+144-2
Original file line numberDiff line numberDiff line change
@@ -79,16 +79,158 @@ cycle.
7979
[*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller
8080
and Michael I. Schwartzbach is an incredible resource!
8181

82-
*to be written*
82+
_Dataflow analysis_ is a type of static analysis that is common in many
83+
compilers. It describes a general technique, rather than a particular analysis.
84+
85+
The basic idea is that we can walk over a [CFG](#cfg) and keep track of what
86+
some value could be. At the end of the walk, we might have shown that some
87+
claim is true or not necessarily true (e.g. "this variable must be
88+
initialized"). `rustc` tends to do dataflow analyses over the MIR, since that
89+
is already a CFG.
90+
91+
For example, suppose we want to check that `x` is initialized before it is used
92+
in this snippet:
93+
94+
```rust,ignore
95+
fn foo() {
96+
let mut x;
97+
98+
if some_cond {
99+
x = 1;
100+
}
101+
102+
dbg!(x);
103+
}
104+
```
105+
106+
A CFG for this code might look like this:
107+
108+
```txt
109+
+------+
110+
| Init | (A)
111+
+------+
112+
| \
113+
| if some_cond
114+
else \ +-------+
115+
| \| x = 1 | (B)
116+
| +-------+
117+
| /
118+
+---------+
119+
| dbg!(x) | (C)
120+
+---------+
121+
```
122+
123+
We can do the dataflow analysis as follows: we will start off with a flag `init`
124+
which indicates if we know `x` is initialized. As we walk the CFG, we will
125+
update the flag. At the end, we can check its value.
126+
127+
So first, in block (A), the variable `x` is declared but not initialized, so
128+
`init = false`. In block (B), we initialize the value, so we know that `x` is
129+
initialized. So at the end of (B), `init = true`.
130+
131+
Block (C) is where things get interesting. Notice that there are two incoming
132+
edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not.
133+
But we cannot know that! It could be the case the `some_cond` is always true,
134+
so that `x` is actually always initialized. It could also be the case that
135+
`some_cond` depends on something random (e.g. the time), so `x` may not be
136+
initialized. In general, we cannot know statically (due to [Rice's
137+
Theorem][rice]). So what should the value of `init` be in block (C)?
138+
139+
[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem
140+
141+
Generally, in dataflow analyses, if a block has multiple parents (like (C) in
142+
our example), its dataflow value will be some function of all its parents (and
143+
of course, what happens in (C)). Which function we use depends on the analysis
144+
we are doing.
145+
146+
In this case, we want to be able to prove definitively that `x` must be
147+
initialized before use. This forces us to be conservative and assume that
148+
`some_cond` might be false sometimes. So our "merging function" is "and". That
149+
is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is
150+
initialized in (C)). But this is not the case; in particular, `init = false` in
151+
(A), and `x` is not initialized in (C). Thus, `init = false` in (C); we can
152+
report an error that "`x` may not be initialized before use".
153+
154+
There is definitely a lot more that can be said about dataflow analyses. There is an
155+
extensive body of research literature on the topic, including a lot of theory.
156+
We only discussed a forwards analysis, but backwards dataflow analysis is also
157+
useful. For example, rather than starting from block (A) and moving forwards,
158+
we might have started with the usage of `x` and moved backwards to try to find
159+
its initialization.
83160

84161
<a name="quantified"></a>
85162

86163
## What is "universally quantified"? What about "existentially quantified"?
87164

88-
*to be written*
165+
In math, a predicate may be _universally quantified_ or _existentially
166+
quantified_:
167+
168+
- _Universal_ quantification:
169+
- the predicate holds if it is true for all possible inputs.
170+
- Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds".
171+
- _Existential_ quantification:
172+
- the predicate holds if there is any input where it is true, i.e., there
173+
only has to be a single input.
174+
- Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds".
175+
176+
In Rust, they come up in type checking and trait solving. For example,
177+
178+
```rust,ignore
179+
fn foo<T>()
180+
```
181+
This function claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`.
182+
183+
Another example:
184+
185+
```rust,ignore
186+
fn foo<'a>(_: &'a usize)
187+
```
188+
This function claims that for any lifetime `'a` (determined by the
189+
caller), it is well-typed: `∀ 'a: well_typed(foo)`.
190+
191+
Another example:
192+
193+
```rust,ignore
194+
fn foo<F>()
195+
where for<'a> F: Fn(&'a u8)
196+
```
197+
This function claims that it is well-typed for all types `F` such that for all
198+
lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`.
199+
200+
One more example:
201+
202+
```rust,ignore
203+
fn foo(_: dyn Debug)
204+
```
205+
This function claims that there exists some type `T` that implements `Debug`
206+
such that the function is well-typed: `∃ T: (T: Debug) and well_typed(foo)`.
89207

90208
<a name="variance"></a>
91209

210+
## What is a DeBruijn Index?
211+
212+
DeBruijn indices are a way of representing which variables are bound in
213+
which binders using only integers. They were [originally invented][wikideb] for
214+
use in lambda calculus evaluation. In `rustc`, we use a similar idea for the
215+
[representation of generic types][sub].
216+
217+
[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index
218+
[sub]: ../generics.md
219+
220+
Here is a basic example of how DeBruijn indices might be used for closures (we
221+
don't actually do this in `rustc` though):
222+
223+
```rust,ignore
224+
|x| {
225+
f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up
226+
227+
|y| {
228+
g(x, y) // index of `x` is 2 because it is bound 2 levels up
229+
// index of `y` is 1 because it is bound 1 level up
230+
}
231+
}
232+
```
233+
92234
## What is co- and contra-variance?
93235

94236
Check out the subtyping chapter from the

src/appendix/glossary.md

+1
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ CTFE <div id="ctfe"/> | Short for Compile-Time Function Eval
1515
cx <div id="cx"/> | We tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc.
1616
DAG <div id="dag"/> | A directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html))
1717
data-flow analysis <div id="data-flow"/> | A static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow).
18+
DeBruijn Index <div id="debruijn"> | A technique for describing which binder a variable is bound by using only integers. It has the benefit that it is invariant under variable renaming. ([see more](./background.md#debruijn))
1819
DefId <div id="def-id"/> | An index identifying a definition (see `librustc_middle/hir/def_id.rs`). Uniquely identifies a `DefPath`. See [the HIR chapter for more](../hir.html#identifiers-in-the-hir).
1920
Double pointer <div id="double-ptr"/> | A pointer with additional metadata. See "fat pointer" for more.
2021
drop glue <div id="drop-glue"/> | (internal) compiler-generated instructions that handle calling the destructors (`Drop`) for data types.

src/early-late-bound.md

+107
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
# Early and Late Bound Variables
2+
3+
In Rust, item definitions (like `fn`) can often have generic parameters, which
4+
are always [_universally_ quantified][quant]. That is, if you have a function
5+
like
6+
7+
```rust
8+
fn foo<T>(x: T) { }
9+
```
10+
11+
this function is defined "for all T" (not "for some specific T", which would be
12+
[_existentially_ quantified][quant]).
13+
14+
[quant]: ./appendix/background.md#quantified
15+
16+
While Rust *items* can be quantified over types, lifetimes, and constants, the
17+
types of values in Rust are only ever quantified over lifetimes. So you can
18+
have a type like `for<'a> fn(&'a u32)`, which represents a function pointer
19+
that takes a reference with any lifetime, or `for<'a> dyn Trait<'a>`, which is
20+
a `dyn` trait for a trait implemented for any lifetime; but we have no type
21+
like `for<T> fn(T)`, which would be a function that takes a value of *any type*
22+
as a parameter. This is a consequence of monomorphization -- to support a value
23+
of type `for<T> fn(T)`, we would need a single function pointer that can be
24+
used for a parameter of any type, but in Rust we generate customized code for
25+
each parameter type.
26+
27+
One consequence of this asymmetry is a weird split in how we represesent some
28+
generic types: _early-_ and _late-_ bound parameters.
29+
Basically, if we cannot represent a type (e.g. a universally quantified type),
30+
we have to bind it _early_ so that the unrepresentable type is never around.
31+
32+
Consider the following example:
33+
34+
```rust,ignore
35+
fn foo<'a, 'b, T>(x: &'a u32, y: &'b T) where T: 'b { ... }
36+
```
37+
38+
We cannot treat `'a`, `'b`, and `T` in the same way. Types in Rust can't have
39+
`for<T> { .. }`, only `for<'a> {...}`, so whenever you reference `foo` the type
40+
you get back can't be `for<'a, 'b, T> fn(&'a u32, y: &'b T)`. Instead, the `T`
41+
must be substituted early. In particular, you have:
42+
43+
```rust,ignore
44+
let x = foo; // T, 'b have to be substituted here
45+
x(...); // 'a substituted here, at the point of call
46+
x(...); // 'a substituted here with a different value
47+
```
48+
49+
## Early-bound parameters
50+
51+
Early-bound parameters in rustc are identified by an index, stored in the
52+
[`ParamTy`] struct for types or the [`EarlyBoundRegion`] struct for lifetimes.
53+
The index counts from the outermost declaration in scope. This means that as you
54+
add more binders inside, the index doesn't change.
55+
56+
For example,
57+
58+
```rust,ignore
59+
trait Foo<T> {
60+
type Bar<U> = (Self, T, U);
61+
}
62+
```
63+
64+
Here, the type `(Self, T, U)` would be `($0, $1, $2)`, where `$N` means a
65+
[`ParamTy`] with the index of `N`.
66+
67+
In rustc, the [`Generics`] structure carries the this information. So the
68+
[`Generics`] for `Bar` above would be just like for `U` and would indicate the
69+
'parent' generics of `Foo`, which declares `Self` and `T`. You can read more
70+
in [this chapter](./generics.md).
71+
72+
[`ParamTy`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.ParamTy.html
73+
[`EarlyBoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.EarlyBoundRegion.html
74+
[`Generics`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Generics.html
75+
76+
## Late-bound parameters
77+
78+
Late-bound parameters in `rustc` are handled quite differently (they are also
79+
specialized to lifetimes since, right now, only late-bound lifetimes are
80+
supported, though with GATs that has to change). We indicate their potential
81+
presence by a [`Binder`] type. The [`Binder`] doesn't know how many variables
82+
there are at that binding level. This can only be determined by walking the
83+
type itself and collecting them. So a type like `for<'a, 'b> ('a, 'b)` would be
84+
`for (^0.a, ^0.b)`. Here, we just write `for` because we don't know the names
85+
of the things bound within.
86+
87+
Moreover, a reference to a late-bound lifetime is written `^0.a`:
88+
89+
- The `0` is the index; it identifies that this lifetime is bound in the
90+
innermost binder (the `for`).
91+
- The `a` is the "name"; late-bound lifetimes in rustc are identified by a
92+
"name" -- the [`BoundRegion`] struct. This struct can contain a
93+
[`DefId`][defid] or it might have various "anonymous" numbered names. The
94+
latter arise from types like `fn(&u32, &u32)`, which are equivalent to
95+
something like `for<'a, 'b> fn(&'a u32, &'b u32)`, but the names of those
96+
lifetimes must be generated.
97+
98+
This setup of not knowing the full set of variables at a binding level has some
99+
advantages and some disadvantages. The disadvantage is that you must walk the
100+
type to find out what is bound at the given level and so forth. The advantage
101+
is primarily that, when constructing types from Rust syntax, if we encounter
102+
anonymous regions like in `fn(&u32)`, we just create a fresh index and don't have
103+
to update the binder.
104+
105+
[`Binder`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Binder.html
106+
[`BoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/enum.BoundRegion.html
107+
[defid]: ./hir.html#identifiers-in-the-hir

0 commit comments

Comments
 (0)