Add docs for reachability.h.

PiperOrigin-RevId: 340330922
google · Nov 3, 2020 · a51a99f · a51a99f
1 parent 9dbc716
commit a51a99f
Showing 1 changed file with 94 additions and 7 deletions.
diff --git a/docs/developers/typegraph.md b/docs/developers/typegraph.md
@@ -1,7 +1,7 @@
 # The Typegraph
 
 <!--*
-freshness: { owner: 'tsudol' reviewed: '2020-09-11' }
+freshness: { owner: 'tsudol' reviewed: '2020-11-02' }
 *-->
 
 <!--ts-->
@@ -12,8 +12,10 @@ freshness: { owner: 'tsudol' reviewed: '2020-09-11' }
          * [Default Data](#default-data)
       * [Sets in the Typegraph](#sets-in-the-typegraph)
          * [std::set or std::unordered_set?](#stdset-or-stdunordered_set)
+      * [Reachability](#reachability)
+         * [Implementation](#implementation)
 
-<!-- Added by: rechen, at: 2020-09-14T12:22-07:00 -->
+<!-- Added by: tsudol, at: 2020-11-02T11:52-08:00 -->
 
 <!--te-->
 
@@ -150,12 +152,97 @@ cannot guarantee that the items in a set won't be modified, which violates the
 `std::unordered_set` invariants. Finally, iterating over and comparing sets is
 more efficient for `std::set`.
 
-<!--
-Reachability (reachable.h and .cc)
+## Reachability
+
+A basic operation when analyzing the CFG is checking if there is a path between
+two nodes. This is used when checking if a binding's origin is visible at a
+given node, for example. Queries always progress from a child node to the
+parent, in the opposite direction of the edges of the CFG. To make these lookups
+faster, the `Program` class uses a **backwards reachability** cache that's
+handled by the `ReachabilityAnalyzer` class. Backwards reachability means that
+the reachability analysis proceeds from child to parent, in the opposite
+direction of the directed edges of the graph.
+
+The `ReachabilityAnalyzer` tracks the list of adjacent nodes for each CFG node.
+For node `i`, `reacahble[i][j]` indicates whether `j` is reachable from `i`.
+When an edge is added to the reachability cache, the cache updates every node to
+see if connections are possible.
 
--   we know it's to make parts of the solver faster.
--   But how
--   and does it actually?
+For example, consider three nodes `A -> B -> C`. The cache would be initialised
+as:
+
+```
+reachable[A] = [True, False, False]
+reachable[B] = [False, True, False]
+reacahble[C] = [False, False, True]
+```
+
+(A node can always be reached from itself.) Since there's a connection from `B`
+to `C`, the backwards edge `C -> B` is added to the cache:
+
+```
+reachable[A] = [True, False, False]
+reachable[B] = [False, True, False]
+reacahble[C] = [False, True, True]
+```
+
+Then the backwards edge `B -> A` is added:
+
+```
+reachable[A] = [True, False, False]
+reachable[B] = [True, True, False]
+reacahble[C] = [True, True, True]
+```
+
+The cache now shows that `A` can only reach itself, `B` can reach itself and
+`A`, and `C` can reach all three nodes.
+
+### Implementation
+
+Every CFG node has an ID number of type `size_t`, which these docs will assume
+is 64 bits. A naive implementation would make `reachable[n]` a list of the IDs
+of the nodes reachable from node `n`, but at 8 bytes per node and potentially
+thousands of nodes, that will get too expensive quickly. Instead, nodes are
+split into _buckets_ based on their ID. Each bucket tracks 64 nodes, so the top
+58 bits of the ID determine the node's bucket and the bottom 6 bits determine
+the node's index within the bucket.
+
+These buckets are implemented using bit vectors. Since a bucket covers 64 nodes,
+it's represented by an `int64_t`. A node's reachable list is then a
+`std::vector<int64_t>`, and the reachability cache that tracks every node's
+reachable list is a `std::vector<std::vector<int64_t>>`. All together, node `j`
+is reachable from node `i` if `reachable[i][bucket(j)] & bit(j) == 1`.
+
+For example, consider a CFG with 100 nodes, which have IDs from 0 to 99. There
+will be 100 entries in the reachability cache, one for each node, such that
+`reachable[n]` corresponds to the nodes that are backwards reachable from node
+`n`. `reachable[n]` is a `std::vector<int64_t>` with two elements, the first
+tracking nodes 0 - 63 and the second tracking nodes 64 - 99, with room to track
+another 28 nodes.
+
+Let's check if node 75 is backwards reachable from node 30: `is_reachable(30,
+75)`.
+
+1.  Find node 75's bucket: `bucket = 75 >> 6 = 1`. (This is equivalent to `75 /
+    64`.)
+1.  Find node 75's bit: `bit = 1 << (75 & 0x3F) = 1 << 11`.
+1.  Check node 30's reachability: `reachable[30][bucket] & bit`.
+
+Adding a new node to the reachability cache is accomplished by adding another
+entry to `reachable`. There is a catch: the cache must check if a new bucket is
+needed to track the new node. If one is, then every node's reachable list is
+extended with one more bucket. Finally, `reachable[n][bucket(n)] = bit[n]` is
+set, indicating that node `n` is reachable from itself.
+
+Adding an edge between two nodes is only slightly more complex. Because the
+cache tracks reachability, adding an edge may update every node. Remember the `A
+-> B -> C` example previously: `add_edge(B, A)` updated both `B` and `C`,
+because `B` is reachable from `C`. For `add_edge(src, dst)`, the cache checks if
+`src` is reachable from each node `i`, and if so, bitwise-ORs `reachable[i]` and
+`reachable[dst]` together. Because `src` is reachable from itself, this will
+also update `reachable[src]` when `i == src`.
+
+<!--
 
 Hashing and Sets