Add a bit about various type system concepts (#697)

mark-i-m · tshepang · nikomatsakis · web-flow · commit caac584a9e88 · 2020-06-05T13:11:09.000-03:00
* add a bit on dataflow analysis

* add a bit on quanitification

* add a bit on debruijn index

* add a bit on early and late bound params

* add missing link

* Typos

Co-authored-by: Tshepang Lekhonkhobe &lt;tshepang@gmail.com&gt;

* clarify dataflow example

* fix formatting

* fix typos

* Typos

Co-authored-by: Tshepang Lekhonkhobe &lt;tshepang@gmail.com&gt;

* fix errors in background

* remove dup material and make early/late intro short

* adjust intro

* Niko's intro

Co-authored-by: Niko Matsakis &lt;niko@alum.mit.edu&gt;

Co-authored-by: Tshepang Lekhonkhobe &lt;tshepang@gmail.com&gt;
Co-authored-by: Niko Matsakis &lt;niko@alum.mit.edu&gt;
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -76,6 +76,7 @@
         - [Generic arguments](./generic_arguments.md)
     - [Type inference](./type-inference.md)
     - [Trait solving](./traits/resolution.md)
+        - [Early and Late Bound Parameters](./early-late-bound.md)
         - [Higher-ranked trait bounds](./traits/hrtb.md)
         - [Caching subtleties](./traits/caching.md)
         - [Specialization](./traits/specialization.md)
diff --git a/src/appendix/background.md b/src/appendix/background.md
@@ -79,16 +79,158 @@ cycle.
 [*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller
 and Michael I. Schwartzbach is an incredible resource!
 
-*to be written*
+_Dataflow analysis_ is a type of static analysis that is common in many
+compilers. It describes a general technique, rather than a particular analysis.
+
+The basic idea is that we can walk over a [CFG](#cfg) and keep track of what
+some value could be. At the end of the walk, we might have shown that some
+claim is true or not necessarily true (e.g. "this variable must be
+initialized"). `rustc` tends to do dataflow analyses over the MIR, since that
+is already a CFG.
+
+For example, suppose we want to check that `x` is initialized before it is used
+in this snippet:
+
+```rust,ignore
+fn foo() {
+    let mut x;
+
+    if some_cond {
+        x = 1;
+    }
+
+    dbg!(x);
+}
+```
+
+A CFG for this code might look like this:
+
+```txt
+ +------+
+ | Init | (A)
+ +------+
+    |   \
+    |   if some_cond
+  else    \ +-------+
+    |      \| x = 1 | (B)
+    |       +-------+
+    |      /
+ +---------+
+ | dbg!(x) | (C)
+ +---------+
+```
+
+We can do the dataflow analysis as follows: we will start off with a flag `init`
+which indicates if we know `x` is initialized. As we walk the CFG, we will
+update the flag. At the end, we can check its value.
+
+So first, in block (A), the variable `x` is declared but not initialized, so
+`init = false`. In block (B), we initialize the value, so we know that `x` is
+initialized. So at the end of (B), `init = true`.
+
+Block (C) is where things get interesting. Notice that there are two incoming
+edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not.
+But we cannot know that! It could be the case the `some_cond` is always true,
+so that `x` is actually always initialized. It could also be the case that
+`some_cond` depends on something random (e.g. the time), so `x` may not be
+initialized. In general, we cannot know statically (due to [Rice's
+Theorem][rice]).  So what should the value of `init` be in block (C)?
+
+[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem
+
+Generally, in dataflow analyses, if a block has multiple parents (like (C) in
+our example), its dataflow value will be some function of all its parents (and
+of course, what happens in (C)).  Which function we use depends on the analysis
+we are doing.
+
+In this case, we want to be able to prove definitively that `x` must be
+initialized before use. This forces us to be conservative and assume that
+`some_cond` might be false sometimes. So our "merging function" is "and". That
+is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is
+initialized in (C)). But this is not the case; in particular, `init = false` in
+(A), and `x` is not initialized in (C).  Thus, `init = false` in (C); we can
+report an error that "`x` may not be initialized before use".
+
+There is definitely a lot more that can be said about dataflow analyses. There is an
+extensive body of research literature on the topic, including a lot of theory.
+We only discussed a forwards analysis, but backwards dataflow analysis is also
+useful. For example, rather than starting from block (A) and moving forwards,
+we might have started with the usage of `x` and moved backwards to try to find
+its initialization.
 
 <a name="quantified"></a>
 
 ## What is "universally quantified"? What about "existentially quantified"?
 
-*to be written*
+In math, a predicate may be _universally quantified_ or _existentially
+quantified_:
+
+- _Universal_ quantification:
+  - the predicate holds if it is true for all possible inputs.
+  - Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds".
+- _Existential_ quantification:
+  - the predicate holds if there is any input where it is true, i.e., there
+    only has to be a single input.
+  - Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds".
+
+In Rust, they come up in type checking and trait solving. For example,
+
+```rust,ignore
+fn foo<T>()
+```
+This function claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`.
+
+Another example:
+
+```rust,ignore
+fn foo<'a>(_: &'a usize)
+```
+This function claims that for any lifetime `'a` (determined by the
+caller), it is well-typed: `∀ 'a: well_typed(foo)`.
+
+Another example:
+
+```rust,ignore
+fn foo<F>()
+where for<'a> F: Fn(&'a u8)
+```
+This function claims that it is well-typed for all types `F` such that for all
+lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`.
+
+One more example:
+
+```rust,ignore
+fn foo(_: dyn Debug)
+```
+This function claims that there exists some type `T` that implements `Debug`
+such that the function is well-typed: `∃ T:  (T: Debug) and well_typed(foo)`.
 
 <a name="variance"></a>
 
+## What is a DeBruijn Index?
+
+DeBruijn indices are a way of representing which variables are bound in
+which binders using only integers. They were [originally invented][wikideb] for
+use in lambda calculus evaluation. In `rustc`, we use a similar idea for the
+[representation of generic types][sub].
+
+[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index
+[sub]: ../generics.md
+
+Here is a basic example of how DeBruijn indices might be used for closures (we
+don't actually do this in `rustc` though):
+
+```rust,ignore
+|x| {
+    f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up
+
+    |y| {
+        g(x, y) // index of `x` is 2 because it is bound 2 levels up
+                // index of `y` is 1 because it is bound 1 level up
+    }
+}
+```
+
 ## What is co- and contra-variance?
 
 Check out the subtyping chapter from the
diff --git a/src/appendix/glossary.md b/src/appendix/glossary.md
@@ -15,6 +15,7 @@ CTFE <div id="ctfe"/>                    |  Short for Compile-Time Function Eval
 cx <div id="cx"/>                        |  We tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc.
 DAG <div id="dag"/>                      |  A directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html))
 data-flow analysis <div id="data-flow"/> |  A static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow).
+DeBruijn Index <div id="debruijn">       |  A technique for describing which binder a variable is bound by using only integers. It has the benefit that it is invariant under variable renaming. ([see more](./background.md#debruijn))
 DefId <div id="def-id"/>                 |  An index identifying a definition (see `librustc_middle/hir/def_id.rs`). Uniquely identifies a `DefPath`. See [the HIR chapter for more](../hir.html#identifiers-in-the-hir).
 Double pointer <div id="double-ptr"/>    |  A pointer with additional metadata. See "fat pointer" for more.
 drop glue <div id="drop-glue"/>          |  (internal) compiler-generated instructions that handle calling the destructors (`Drop`) for data types.
diff --git a/src/early-late-bound.md b/src/early-late-bound.md
@@ -0,0 +1,107 @@
+# Early and Late Bound Variables
+
+In Rust, item definitions (like `fn`) can often have generic parameters, which
+are always [_universally_ quantified][quant]. That is, if you have a function
+like
+
+```rust
+fn foo<T>(x: T) { }
+```
+
+this function is defined "for all T" (not "for some specific T", which would be
+[_existentially_ quantified][quant]).
+
+[quant]: ./appendix/background.md#quantified
+
+While Rust *items* can be quantified over types, lifetimes, and constants, the
+types of values in Rust are only ever quantified over lifetimes. So you can
+have a type like `for<'a> fn(&'a u32)`, which represents a function pointer
+that takes a reference with any lifetime, or `for<'a> dyn Trait<'a>`, which is
+a `dyn` trait for a trait implemented for any lifetime; but we have no type
+like `for<T> fn(T)`, which would be a function that takes a value of *any type*
+as a parameter. This is a consequence of monomorphization -- to support a value
+of type `for<T> fn(T)`, we would need a single function pointer that can be
+used for a parameter of any type, but in Rust we generate customized code for
+each parameter type.
+
+One consequence of this asymmetry is a weird split in how we represesent some
+generic types: _early-_ and _late-_ bound parameters.
+Basically, if we cannot represent a type (e.g. a universally quantified type),
+we have to bind it _early_ so that the unrepresentable type is never around.
+
+Consider the following example:
+
+```rust,ignore
+fn foo<'a, 'b, T>(x: &'a u32, y: &'b T) where T: 'b { ... }
+```
+
+We cannot treat `'a`, `'b`, and `T` in the same way.  Types in Rust can't have
+`for<T> { .. }`, only `for<'a> {...}`, so whenever you reference `foo` the type
+you get back can't be `for<'a, 'b, T> fn(&'a u32, y: &'b T)`. Instead, the `T`
+must be substituted early. In particular, you have:
+
+```rust,ignore
+let x = foo; // T, 'b have to be substituted here
+x(...);      // 'a substituted here, at the point of call
+x(...);      // 'a substituted here with a different value
+```
+
+## Early-bound parameters
+
+Early-bound parameters in rustc are identified by an index, stored in the
+[`ParamTy`] struct for types or the [`EarlyBoundRegion`] struct for lifetimes.
+The index counts from the outermost declaration in scope. This means that as you
+add more binders inside, the index doesn't change.
+
+For example,
+
+```rust,ignore
+trait Foo<T> {
+  type Bar<U> = (Self, T, U);
+}
+```
+
+Here, the type `(Self, T, U)` would be `($0, $1, $2)`, where `$N` means a
+[`ParamTy`] with the index of `N`.
+
+In rustc, the [`Generics`] structure carries the this information. So the
+[`Generics`] for `Bar` above would be just like for `U` and would indicate the
+'parent' generics of `Foo`, which declares `Self` and `T`.  You can read more
+in [this chapter](./generics.md).
+
+[`ParamTy`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.ParamTy.html
+[`EarlyBoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.EarlyBoundRegion.html
+[`Generics`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Generics.html
+
+## Late-bound parameters
+
+Late-bound parameters in `rustc` are handled quite differently (they are also
+specialized to lifetimes since, right now, only late-bound lifetimes are
+supported, though with GATs that has to change). We indicate their potential
+presence by a [`Binder`] type. The [`Binder`] doesn't know how many variables
+there are at that binding level. This can only be determined by walking the
+type itself and collecting them. So a type like `for<'a, 'b> ('a, 'b)` would be
+`for (^0.a, ^0.b)`. Here, we just write `for` because we don't know the names
+of the things bound within.
+
+Moreover, a reference to a late-bound lifetime is written `^0.a`:
+
+- The `0` is the index; it identifies that this lifetime is bound in the
+  innermost binder (the `for`).
+- The `a` is the "name"; late-bound lifetimes in rustc are identified by a
+  "name" -- the [`BoundRegion`] struct. This struct can contain a
+  [`DefId`][defid] or it might have various "anonymous" numbered names. The
+  latter arise from types like `fn(&u32, &u32)`, which are equivalent to
+  something like `for<'a, 'b> fn(&'a u32, &'b u32)`, but the names of those
+  lifetimes must be generated.
+
+This setup of not knowing the full set of variables at a binding level has some
+advantages and some disadvantages. The disadvantage is that you must walk the
+type to find out what is bound at the given level and so forth. The advantage
+is primarily that, when constructing types from Rust syntax, if we encounter
+anonymous regions like in `fn(&u32)`, we just create a fresh index and don't have
+to update the binder.
+
+[`Binder`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Binder.html
+[`BoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/enum.BoundRegion.html
+[defid]: ./hir.html#identifiers-in-the-hir