From 6ec6207717b73ffcb5672c1490b854a35d5d02d1 Mon Sep 17 00:00:00 2001 From: Chris Fallin Date: Fri, 18 Jun 2021 13:59:12 -0700 Subject: [PATCH] Add design document. --- README.md | 138 +---- doc/DESIGN.md | 1625 +++++++++++++++++++++++++++++++++++++++++++++++++ src/lib.rs | 2 +- 3 files changed, 1635 insertions(+), 130 deletions(-) create mode 100644 doc/DESIGN.md diff --git a/README.md b/README.md index c187fe9..a160ed3 100644 --- a/README.md +++ b/README.md @@ -1,139 +1,19 @@ ## regalloc2: another register allocator -This is a register allocator that started life as, and is about 75% +This is a register allocator that started life as, and is about 50% still, a port of IonMonkey's backtracking register allocator to -Rust. The data structures and invariants have been simplified a little -bit, and the interfaces made a little more generic and reusable. In -addition, it contains substantial amounts of testing infrastructure +Rust. In many regards, it has been generalized, optimized, and +improved since the initial port, and now supports both SSA and non-SSA +use-cases. + +In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator. -### Design Overview +See the [design overview](doc/DESIGN.md) for (much!) more detail on +how the allocator works. -TODO - -- SSA with blockparams - -- Operands with constraints, and clobbers, and reused regs; contrast - with regalloc.rs approach of vregs and pregs and many moves that get - coalesced/elided - -### Differences from IonMonkey Backtracking Allocator - -There are a number of differences between the [IonMonkey -allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp) -and this one: - -* Most significantly, there are [fuzz/fuzz_targets/](many different - fuzz targets) that exercise the allocator, including a full symbolic - checker (`ion_checker` target) based on the [symbolic checker in - regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/) - and, e.g., a targetted fuzzer for the parallel move-resolution - algorithm (`moves`) and the SSA generator used for generating cases - for the other fuzz targets (`ssagen`). - -* The data-structure invariants are simplified. While the IonMonkey - allocator allowed for LiveRanges and Bundles to overlap in certain - cases, this allocator sticks to a strict invariant: ranges do not - overlap in bundles, and bundles do not overlap. There are other - examples too: e.g., the definition of minimal bundles is very simple - and does not depend on scanning the code at all. In general, we - should be able to state simple invariants and see by inspection (as - well as fuzzing -- see above) that they hold. - -* Many of the algorithms in the IonMonkey allocator are built with - helper functions that do linear scans. These "small quadratic" loops - are likely not a huge issue in practice, but nevertheless have the - potential to be in corner cases. As much as possible, all work in - this allocator is done in linear scans. For example, bundle - splitting is done in a single compound scan over a bundle, ranges in - the bundle, and a sorted list of split-points. - -* There are novel schemes for solving certain interesting design - challenges. One example: in IonMonkey, liveranges are connected - across blocks by, when reaching one end of a control-flow edge in a - scan, doing a lookup of the allocation at the other end. 
This is in - principle a linear lookup (so quadratic overall). We instead - generate a list of "half-moves", keyed on the edge and from/to - vregs, with each holding one of the allocations. By sorting and then - scanning this list, we can generate all edge moves in one linear - scan. There are a number of other examples of simplifications: for - example, we handle multiple conflicting - physical-register-constrained uses of a vreg in a single instruction - by recording a copy to do in a side-table, then removing constraints - for the core regalloc. Ion instead has to tweak its definition of - minimal bundles and create two liveranges that overlap (!) to - represent the two uses. - -* Using block parameters rather than phi-nodes significantly - simplifies handling of inter-block data movement. IonMonkey had to - special-case phis in many ways because they are actually quite - weird: their uses happen semantically in other blocks, and their - defs happen in parallel at the top of the block. Block parameters - naturally and explicitly reprsent these semantics in a direct way. - -* The allocator supports irreducible control flow and arbitrary block - ordering (its only CFG requirement is that critical edges are - split). It handles loops during live-range computation in a way that - is similar in spirit to IonMonkey's allocator -- in a single pass, - when we discover a loop, we just mark the whole loop as a liverange - for values live at the top of the loop -- but we find the loop body - without the fixpoint workqueue loop that IonMonkey uses, instead - doing a single linear scan for backedges and finding the minimal - extent that covers all intermingled loops. In order to support - arbitrary block order and irreducible control flow, we relax the - invariant that the first liverange for a vreg always starts at its - def; instead, the def can happen anywhere, and a liverange may - overapproximate. It turns out this is not too hard to handle and is - a more robust invariant. (It also means that non-SSA code *may* not - be too hard to adapt to, though I haven't seriously thought about - this.) - -### Rough Performance Comparison with Regalloc.rs - -The allocator has not yet been wired up to a suitable compiler backend -(such as Cranelift) to perform a true apples-to-apples compile-time -and runtime comparison. However, we can get some idea of compile speed -by running suitable test cases through the allocator and measuring -*throughput*: that is, instructions per second for which registers are -allocated. 
-
-To do so, I measured the `qsort2` benchmark in
-[regalloc.rs](https://github.com/bytecodealliance/regalloc.rs),
-register-allocated with default options in that crate's backtracking
-allocator, using the Criterion benchmark framework to measure ~620K
-instructions per second:
-
-
-```plain
-benches/0 time: [365.68 us 367.36 us 369.04 us]
- thrpt: [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]
-```
-
-I then measured three different fuzztest-SSA-generator test cases in
-this allocator, `regalloc2`, measuring between 1.1M and 2.3M
-instructions per second (closer to the former for larger functions):
-
-```plain
-==== 459 instructions
-benches/0 time: [377.91 us 378.09 us 378.27 us]
- thrpt: [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]
-
-==== 225 instructions
-benches/1 time: [202.03 us 202.14 us 202.27 us]
- thrpt: [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]
-
-==== 21 instructions
-benches/2 time: [9.5605 us 9.5655 us 9.5702 us]
- thrpt: [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]
-```
-
-Though not apples-to-apples (SSA vs. non-SSA, completely different
-code only with similar length), this is at least some evidence that
-`regalloc2` is likely to lead to at least a compile-time improvement
-when used in e.g. Cranelift.
-
-### License
+## License
 
 Unless otherwise specified, code in this crate is licensed under the
 Apache 2.0 License with LLVM Exception. This license text can be found in the file
diff --git a/doc/DESIGN.md b/doc/DESIGN.md
new file mode 100644
index 0000000..c55887a
--- /dev/null
+++ b/doc/DESIGN.md
@@ -0,0 +1,1625 @@
+# regalloc2 Design Overview
+
+This document describes the basic architecture of the regalloc2
+register allocator. It describes the externally-visible interface
+(input CFG, instructions, operands, with their invariants; meaning of
+various parts of the output); core data structures; and the allocation
+pipeline, or series of algorithms that compute an allocation. It ends
+with a description of future work and expectations, as well as an
+appendix that notes design influences and similarities to the
+IonMonkey backtracking allocator.
+
+# API, Input IR and Invariants
+
+The toplevel API to regalloc2 consists of a single entry point `run()`
+that takes a register environment, which specifies all physical
+registers, and the input program. The function returns either an error
+or an `Output` struct that provides allocations for each operand and a
+list of additional instructions (moves, loads, stores) to insert.
+
+## Register Environment
+
+The allocator takes a `MachineEnv` which specifies, for each of the
+two register classes `Int` and `Float`, a list of `PReg`s by index. A
+`PReg` is nothing more than the class and index within the class; the
+allocator does not need to know anything more.
+
+The `MachineEnv` provides a list of preferred and non-preferred
+physical registers per class. Any register not on either list will not
+be allocated. Usually, registers that do not need to be saved in the
+prologue if used (i.e., caller-save registers) are given in the
+"preferred" list. The environment also provides exactly one scratch
+register per class. This register must not be in the preferred or
+non-preferred lists, and is used whenever a set of moves that need to
+occur logically in parallel has a cycle (for a simple example,
+consider a swap `r0, r1 := r1, r0`).
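+
+To make the environment concrete, here is a minimal sketch of the
+information it carries. This is illustrative only -- the real
+`MachineEnv` type differs in its details (for example, it keeps
+separate per-class lists), and all names here are hypothetical:
+
+```rust
+// Simplified stand-ins for the allocator's types; not the real API.
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum RegClass { Int, Float }
+
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+struct PReg { class: RegClass, index: u8 }
+
+struct MachineEnv {
+    preferred_regs: Vec<PReg>,     // e.g., caller-save registers
+    non_preferred_regs: Vec<PReg>, // e.g., callee-save registers
+    scratch: [PReg; 2],            // exactly one scratch reg per class
+}
+
+fn example_env() -> MachineEnv {
+    let p = |class, index| PReg { class, index };
+    MachineEnv {
+        preferred_regs: (0..8).map(|i| p(RegClass::Int, i)).collect(),
+        non_preferred_regs: (8..15).map(|i| p(RegClass::Int, i)).collect(),
+        // The scratch register must not appear in either list above.
+        scratch: [p(RegClass::Int, 15), p(RegClass::Float, 31)],
+    }
+}
+```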
+
+With some more work, we could potentially remove the need for the
+scratch register by requiring support for an additional edit type from
+the client ("swap"), but we have not pursued this.
+
+## CFG and Instructions
+
+The allocator operates on an input program that is in a standard CFG
+representation: the function body is a list of basic blocks, and each
+block has a list of instructions and zero or more successors. The
+allocator also requires the client to provide predecessors for each
+block, and these must be consistent with the successor
+lists.
+
+Instructions are opaque to the allocator except for a few important
+bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
+branch instruction); (3) `is_call` (is a call instruction, for
+heuristic purposes only); (4) `is_move` (is a move between registers);
+and (5) a list of Operands, covered below. Every block must end in a
+return or branch.
+
+Both instructions and blocks are named by indices in contiguous index
+spaces. A block's instructions must be a contiguous range of
+instruction indices, and block i's first instruction must come
+immediately after block i-1's last instruction.
+
+The CFG must have *no critical edges*. A critical edge is an edge from
+block A to block B such that A has more than one successor *and* B has
+more than one predecessor. For this definition, the entry block has an
+implicit predecessor, and any block that ends in a return has an
+implicit successor.
+
+Note that there are *no* requirements related to the ordering of
+blocks, and there is no requirement that the control flow be
+reducible. Some *heuristics* used by the allocator will perform better
+if the code is reducible and ordered in reverse postorder (RPO),
+however: in particular, (1) this interacts better with the
+contiguous-range-of-instruction-indices live range representation that
+we use, and (2) the "approximate loop depth" metric will actually be
+exact if both these conditions are met.
+
+## Operands and VRegs
+
+Every instruction operates on values by way of `Operand`s. An operand
+consists of the following fields:
+
+- VReg, or virtual register. *Every* operand mentions a virtual
+  register, even if it is constrained to a single physical register in
+  practice. This is because we track liveranges uniformly by vreg.
+
+- Policy, or "constraint". Every reference to a vreg can apply some
+  constraint to the vreg at that point in the program. Valid policies are:
+
+  - Any location;
+  - Any register of the vreg's class;
+  - Any stack slot;
+  - A particular fixed physical register; or
+  - For a def (output), a *reuse* of an input register.
+
+- The "kind" of reference to this vreg: Def, Use, Mod. A def
+  (definition) writes to the vreg, and disregards any possible earlier
+  value. A mod (modify) reads the current value then writes a new
+  one. A use simply reads the vreg's value.
+
+- The position: before or after the instruction.
+  - Note that to have a def (output) register available in a way that
+    does not conflict with inputs, the def should be placed at the
+    "before" position. Similarly, to have a use (input) register
+    available in a way that does not conflict with outputs, the use
+    should be placed at the "after" position.
+
+This operand-specification design allows for SSA and non-SSA code (see
+section below for details).
+
+VRegs, or virtual registers, are specified by an index and a register
+class (Float or Int). The classes are not given separately; they are
+encoded on every mention of the vreg. (In a sense, the class is an
+extra index bit, or part of the register name.) The input function
+trait does require the client to provide the exact vreg count,
+however.
+
+Implementation note: both vregs and operands are bit-packed into
+u32s. This is essential for memory efficiency. As a result of the
+operand bit-packing in particular (including the policy constraints!),
+the allocator supports up to 2^20 (1M) vregs per function, and 2^5
+(32) physical registers per class. Later we will also see a limit of
+2^20 (1M) instructions per function. These limits are considered
+sufficient for the anticipated use-cases (e.g., compiling Wasm, which
+also has function-size implementation limits); for larger functions,
+it is likely better to use a simpler register allocator in any case.
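+
+As a concrete illustration of this kind of bit-packing (the actual
+layout in regalloc2 differs; this sketch just shows the technique):
+
+```rust
+// A vreg index in the low 20 bits and the class in bit 20; the
+// remaining bits are free for other operand fields. Illustrative only.
+#[derive(Clone, Copy)]
+struct VReg(u32);
+
+impl VReg {
+    const INDEX_BITS: u32 = 20;
+
+    fn new(index: u32, is_float: bool) -> Self {
+        assert!(index < (1 << Self::INDEX_BITS));
+        VReg(index | ((is_float as u32) << Self::INDEX_BITS))
+    }
+    fn index(self) -> u32 {
+        self.0 & ((1 << Self::INDEX_BITS) - 1)
+    }
+    fn is_float(self) -> bool {
+        (self.0 >> Self::INDEX_BITS) & 1 != 0
+    }
+}
+```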
+
+## Reuses and Two-Address ISAs
+
+Some instruction sets primarily have instructions that name only two
+registers for a binary operator, rather than three: both registers are
+inputs, and the result is placed in one of the registers, clobbering
+its original value. The most well-known modern example is x86. It is
+thus imperative that we support this pattern well in the register
+allocator.
+
+This instruction-set design is somewhat at odds with an SSA
+representation, where a value cannot be redefined. Even in non-SSA
+code, it is awkward to overwrite a vreg that may need to be used again
+later.
+
+Thus, the allocator supports a useful fiction of sorts: the
+instruction can be described as if it has three register mentions --
+two inputs and a separate output -- and neither input will be
+clobbered. The output, however, is special: its register-placement
+policy is "reuse input i" (where i == 0 or 1). The allocator
+guarantees that the register assignment for that input and the output
+will be the same, so the instruction can use that register as its
+"modifies" operand. If the input is needed again later, the allocator
+will take care of the necessary copying.
+
+We will see below how the allocator makes this work by doing some
+preprocessing so that the core allocation algorithms do not need to
+worry about this constraint.
+
+Note that some non-SSA clients, such as Cranelift using the
+regalloc.rs-to-regalloc2 compatibility shim, will instead generate
+their own copies (copying to the output vreg first) and then use "mod"
+operand kinds, which allow the output vreg to be both read and
+written. regalloc2 works hard to make this as efficient as the
+reused-input scheme by treating moves specially (see below).
+
+## SSA
+
+regalloc2 was originally designed to take an SSA IR as input, where
+the usual definitions apply: every vreg is defined exactly once, and
+every vreg use is dominated by its one def. (Using blockparams means
+that we do not need additional conditions for phi-nodes.)
+
+The allocator then evolved to support non-SSA inputs as well. As a
+result, the input is maximally flexible right now: it does not check
+for or enforce the single-def rule, nor try to take advantage of
+it. However, blockparams are still available.
+
+In the future, we hope to change this, however, once compilation of
+non-SSA inputs is no longer needed. Specifically, if we can migrate
+Cranelift to the native regalloc2 API rather than the regalloc.rs
+compatibility shim, we will be able to remove "mod" operand kinds,
+assume (and verify) single defs, and take advantage of this when
+reasoning about various algorithms in the allocator.
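+
+To illustrate the reused-input scheme described above: for a
+two-address add `v2 = add v0, v1` lowered to an x86-style `dst op=
+src`, the operands might look like the following sketch (hypothetical
+names, not the exact regalloc2 API):
+
+```rust
+enum Policy { AnyReg, Reuse(usize) } // Reuse(i): share input i's register
+enum Kind { Use, Def }
+
+struct Operand { vreg: u32, kind: Kind, policy: Policy }
+
+fn two_address_add_operands() -> Vec<Operand> {
+    vec![
+        Operand { vreg: 0, kind: Kind::Use, policy: Policy::AnyReg },
+        Operand { vreg: 1, kind: Kind::Use, policy: Policy::AnyReg },
+        // The def reuses input 0's register; if v0 is live afterward,
+        // the allocator inserts the necessary copy itself.
+        Operand { vreg: 2, kind: Kind::Def, policy: Policy::Reuse(0) },
+    ]
+}
+```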
+ +## Block Parameters + +Every block can have *block parameters*, and a branch to a block with +block parameters must provide values for those parameters via +operands. When a branch has more than one successor, it provides +separate operands for each possible successor. These block parameters +are equivalent to phi-nodes; we chose this representation because they +are in many ways a more consistent representation of SSA. + +To see why we believe block parameters are a slightly nicer design +choice than use of phi nodes, consider: phis are special +pseudoinstructions that must come first in a block, are all defined in +parallel, and whose uses occur on the edge of a particular +predecessor. All of these facts complicate any analysis that scans +instructions and reasons about uses and defs. It is much closer to the +truth to actually put those uses *in* the predecessor, on the branch, +and put all the defs at the top of the block as a separate kind of +def. The tradeoff is that a vreg's def now has two possibilities -- +ordinary instruction def or blockparam def -- but this is fairly +reasonable to handle. + +## Non-SSA + +As mentioned, regalloc2 supports non-SSA inputs as well. No special +flag is needed to place the allocator in this mode or disable SSA +verification. However, we hope to eventually remove this functionality +when it is no longer needed. + +## Program Moves + +As an especially useful feature for non-SSA IR, regalloc2 supports +special handling of "move" instructions: it will try to merge the +input and output allocations to elide the move altogether. + +It turns out that moves are used frequently in the non-SSA input that +we observe from Cranelift via the regalloc.rs compatibility shim. They +are used in three different ways: + +- Moves to or from physical registers, used to implement ABI details + or place values in particular registers required by certain + instructions. +- Moves between vregs on program edges, as lowered from phi/blockparam + dataflow in the higher-level SSA IR (CLIF). +- Moves just prior to two-address-form instructions that modify an + input to form an output: the input is moved to the output vreg to + avoid clobbering the input. + +Note that, strictly speaking, special handling of program moves is +redundant because each of these kinds of uses has an equivalent in the +"native" regalloc2 API: + +- Moves to/from physical registers can become operand constraints, + either on a particular instruction that requires/produces the values + in certain registers (e.g., a call or ret with args/results in regs, + or a special instruction with fixed register args), or on a ghost + instruction at the top of function that defs vregs for all in-reg + args. + +- Moves between vregs as a lowering of blockparams/phi nodes can be + replaced with use of regalloc2's native blockparam support. + +- Moves prior to two-address-form instructions can be replaced with + the reused-input mechanism. + +Thus, eventually, special handling of program moves should be +removed. However, it is very important for performance at the moment. + +## Output + +The allocator produces two main data structures as output: an array of +`Allocation`s and a list of edits. Some other data, such as stackmap +slot info, is also provided. + +### Allocations + +The allocator provides an array of `Allocation` values, one per +`Operand`. Each `Allocation` has a kind and an index. 
The kind may +indicate that this is a physical register or a stack slot, and the +index gives the respective register or slot. All allocations will +conform to the constraints given, and will faithfully preserve the +dataflow of the input program. + +### Inserted Moves + +In order to implement the necessary movement of data between +allocations, the allocator needs to insert moves at various program +points. + +The list of inserted moves contains tuples that name a program point +and an "edit". The edit is either a move, from one `Allocation` to +another, or else a kind of metadata used by the checker to know which +VReg is live in a given allocation at any particular time. The latter +sort of edit can be ignored by a backend that is just interested in +generating machine code. + +Note that the allocator will never generate a move from one stackslot +directly to another, by design. Instead, if it needs to do so, it will +make use of the scratch register. (Sometimes such a move occurs when +the scratch register is already holding a value, e.g. to resolve a +cycle of moves; in this case, it will allocate another spillslot and +spill the original scratch value around the move.) + +Thus, the single "edit" type can become either a register-to-register +move, a load from a stackslot into a register, or a store from a +register into a stackslot. + +# Data Structures + +We now review the data structures that regalloc2 uses to track its +state. + +## Program-Derived Alloc-Invariant Data + +There are a number of data structures that are computed in a +deterministic way from the input program and then subsequently used +only as read-only data during the core allocation procedure. + +### Livein/Liveout Bitsets + +The livein and liveout bitsets (`liveins` and `liveouts` on the `Env`) +are allocated one per basic block and record, per block, which vregs +are live entering and leaving that block. They are computed using a +standard backward iterative dataflow analysis and are exact; they do +not over-approximate (this turns out to be important for performance, +and is also necessary for correctness in the case of stackmaps). + +### Blockparam Lists: Source-Side and Dest-Side + +The initialization stage scans the input program and produces two +lists that represent blockparam flows from branches to destination +blocks: `blockparam_ins` and `blockparam_outs`. + +These two lists are the first instance we will see of a recurring +pattern: the lists contain tuples that are carefully ordered in a way +such that their sort-order is meaningful. "Build a list lazily then +sort" is a common idiom: it batches the O(n log n) cost into one +operation that the stdlib has aggressively optimized, it provides +dense storage, and it allows for a scan in a certain order that often +lines up with a scan over the program. + +In this particular case, we will build lists of (vreg, block) points +that are meaningful either at the start or end of a block, so that +later, when we scan over a particular vreg's allocations in block +order, we can generate another list of allocations. One side (the +"outs") also contains enough information that it can line up with the +other side (the "ins") in a later sort. + +To make this work, `blockparam_ins` contains a list of (to-vreg, +to-block, from-block) tuples, and has an entry for every blockparam of +every block. Note that we can compute this without actually observing +from-blocks; we only need to iterate over `block_preds` at any given +block. 
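+
+A sketch of how the "ins" list is built and then sorted in bulk
+(helper names and types are hypothetical):
+
+```rust
+type VRegIdx = u32;
+type BlockIdx = u32;
+
+/// Build (to-vreg, to-block, from-block) tuples: one per blockparam
+/// per predecessor. `params[b]` lists block b's blockparams and
+/// `preds[b]` its predecessors.
+fn build_blockparam_ins(
+    params: &[Vec<VRegIdx>],
+    preds: &[Vec<BlockIdx>],
+) -> Vec<(VRegIdx, BlockIdx, BlockIdx)> {
+    let mut ins = Vec::new();
+    for (block, blockparams) in params.iter().enumerate() {
+        for &to_vreg in blockparams {
+            for &from_block in &preds[block] {
+                ins.push((to_vreg, block as BlockIdx, from_block));
+            }
+        }
+    }
+    // Build lazily, then sort once in bulk: a later in-order scan can
+    // line these tuples up with the matching "outs" entries.
+    ins.sort_unstable();
+    ins
+}
+```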
+
+Then, `blockparam_outs` contains a list of (from-vreg, from-block,
+to-block, to-vreg), and has an entry for every parameter on every
+branch that ends a block. There is exactly one "out" tuple for every
+"in" tuple. As mentioned above, we will later scan over both to
+generate moves.
+
+### Program-Move Lists: Source-Side and Dest-Side
+
+Similar to blockparams, we handle moves specially. In fact, we ingest
+all moves in the input program into a set of lists -- "move sources"
+and "move dests", analogous to the "ins" and "outs" blockparam lists
+described above -- and then completely ignore the moves in the program
+thereafter. The semantics of the API are such that all program moves
+will be recreated with regalloc-inserted edits, and should not still
+be emitted after regalloc. This may seem inefficient, but in fact it
+allows for better code because it integrates program-moves with the
+move resolution that handles other forms of vreg movement. We
+previously took the simpler approach of handling program-moves as
+opaque instructions with a source and dest, and we found that there
+were many redundant move-chains (A->B, B->C) that are eliminated when
+everything is handled centrally.
+
+We also construct a `prog_move_merges` list of live-range index pairs
+to attempt to merge when we reach that stage of allocation.
+
+## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs
+
+We now come to the core data structures: live-ranges, bundles, virtual
+registers and their state, and physical registers and their state.
+
+First we must define a `ProgPoint` precisely: a `ProgPoint` is an
+instruction index and a `Before` or `After` suffix. We pack the
+before/after suffix into the LSB of a `u32`, so a `ProgPoint` can be
+incremented and compared as a simple integer.
+
+A live-range is a contiguous range of program points (half-open,
+i.e. including `from` and excluding `to`) for which a particular vreg
+is live with a value.
+
+A live-range contains a list of uses. Each use contains four parts:
+the Operand word (directly copied, so there is no need to dereference
+it); the ProgPoint at which the use occurs; the operand slot on that
+instruction, if any, that the operand comes from; and the use's
+"weight". (It's possible to have "ghost uses" that do not derive from
+any slot on the instruction.) These four parts are packed into three
+`u32`s: the slot can fit in 8 bits, and the weight in 16.
+
+The live-range carries its program-point range, uses, vreg index,
+bundle index (see below), and some metadata: spill weight and
+flags. The spill weight is the sum of weights of each use. The flags
+set currently carries one flag only: whether the live-range starts at
+a Def-kind operand. (This is equivalent to whether the range consumes
+a value at its start or not.)
+
+Uses are owned only by live-ranges and have no separate identity, but
+live-ranges live in a toplevel array and are known by `LiveRangeIndex`
+values throughout the allocator. New live-ranges can be created
+(e.g. during splitting); old ones are not cleaned up, but rather, all
+state is bulk-freed at the end.
+
+Live-ranges are aggregated into "bundles". A bundle is a collection of
+ranges that does not overlap.
Each bundle carries: a list (inline +SmallVec) of (range, live-range index) tuples, an allocation (starts +as "none"), a "spillset" (more below), and some metadata, including a +spill weight (sum of ranges' weights), a priority (sum of ranges' +lengths), and three property flags: "minimal", "contains fixed +constraints", "contains stack constraints". + +VRegs also contain their lists of live-ranges, in the same form as a +bundle does (inline SmallVec that has inline (from, to) range bounds +and range indices). + +There are two important overlap invariants: (i) no liveranges within a +bundle overlap, and (ii) no liveranges within a vreg overlap. These +are extremely important and we rely on them implicitly in many places. + +The live-range lists in bundles and vregs, and use-lists in ranges, +have various sorting invariants as well. These invariants differ +according to the phase of the allocator's computation. First, during +live-range construction, live-ranges are placed into vregs in reverse +order (because the computation is a reverse scan) and uses into ranges +in reverse order; these are sorted into forward order at the end of +live-range computation. When bundles are first constructed, their +range lists are sorted, and they remain so for the rest of allocation, +as we need for interference testing. However, as ranges are created +and split, sortedness of vreg ranges is *not* maintained; they are +sorted once more, in bulk, when allocation is done and we start to +resolve moves. + +Finally, we have physical registers. The main data associated with +each is the allocation map. This map is a standard BTree, indexed by +ranges (`from` and `to` ProgPoints) and yielding a LiveRange for each +location range. The ranges have a custom comparison operator defined +that compares equal for any overlap. + +This comparison operator allows us to determine whether a range is +free, i.e. has no overlap with a particular range, in one probe -- the +btree will not contain a match. However, it makes iteration over *all* +overlapping ranges somewhat tricky to get right. Notably, Rust's +BTreeMap does not guarantee that the lookup result will be the *first* +equal key, if multiple keys are equal to the probe key. Thus, when we +want to enumerate all overlapping ranges, we probe with a range that +consists of the single program point *before* the start of the actual +query range, using the API that returns an iterator over a range in +the BTree, and then iterate through the resulting iterator to gather +all overlapping ranges (which will be contiguous). + +## Spill Bundles + +It is worth describing "spill bundles" separately. Every spillset (see +below; a group of bundles that originated from one bundle) optionally +points to a single bundle that we designate the "spill bundle" for +that spillset. Contrary to the name, this bundle is not +unconditionally spilled. Rather, one can see it as a sort of fallback: +it is where liveranges go when we give up on processing them via the +normal backtracking loop, and will only process them once more in the +"second-chance" stage. + +The spill bundle acquires liveranges in two ways. First, as we split +bundles, we will trim the split pieces in certain ways so that some +liveranges are immediately placed in the spill bundle. 
Intuitively, +the "empty" regions that just carry a value, but do not satisfy any +operands, should be in the spill bundle: it is better to have a single +consistent location for the value than to move it between lots of +different split pieces without using it, as moves carry a cost. + +Second, the spill bundle acquires the liveranges of a bundle that has +no requirement to be in a register when that bundle is processed, but +only if the spill bundle already exists. In other words, we won't +create a second-chance spill bundle just for a liverange with an "Any" +use; but if it was already forced into existence by splitting and +trimming, then we might as well use it. + +Note that unlike other bundles, a spill bundle's liverange list +remains unsorted until we do the second-chance allocation. This allows +quick appends of more liveranges. + +## Allocation Queue + +The allocation queue is simply a priority queue (built with a binary +max-heap) of (prio, bundle-index) tuples. + +## Spillsets and Spillslots + +Every bundle contains a reference to a spillset. Spillsets are used to +assign spillslots near the end of allocation, but before then, they +are also a convenient place to store information that is common among +*all bundles* that share the spillset. In particular, spillsets are +initially assigned 1-to-1 to bundles after all bundle-merging is +complete; so spillsets represent in some sense the "original bundles", +and as splitting commences, the smaller bundle-pieces continue to +refer to their original spillsets. + +We stash some useful information on the spillset because of this: a +register hint, used to create some "stickiness" between pieces of an +original bundle that are assigned separately after splitting; the +spill bundle; the common register class of all vregs in this bundle; +the vregs whose liveranges are contained in this bundle; and then some +information actually used if this is spilled to the stack (`required` +indicates actual stack use; `size` is the spillslot count; `slot` is +the actual stack slot). + +Spill *sets* are later allocated to spill *slots*. Multiple spillsets +can be assigned to one spillslot; the only constraint is that +spillsets assigned to a spillslot must not overlap. When we look up +the allocation for a bundle, if the bundle is not given a specific +allocation (its `alloc` field is `Allocation::none()`), this means it +is spilled, and we traverse to the spillset then spillslot. + +## Other: Fixups, Stats, Debug Annotations + +There are a few fixup lists that we will cover in more detail +later. Of particular note is the "multi-fixed-reg fixup list": this +handles instructions that constrain the same input vreg to multiple, +different, fixed registers for different operands at the same program +point. The only way to satisfy such a set of constraints is to +decouple all but one of the inputs (make them no longer refer to the +vreg) and then later insert copies from the first fixed use of the +vreg to the other fixed regs. + +The `Env` also carries a statistics structure with counters that are +incremented, which can be useful for evaluating the effects of +changes; and a "debug annotations" hashmap from program point to +arbitrary strings that is filled out with various useful diagnostic +information if enabled, so that an annotated view of the program with +its liveranges, bundle assignments, inserted moves, merge and split +decisions, etc. can be viewed. + +# Allocation Pipeline + +We now describe the pipeline that computes register allocations. 
+
+## Live-range Construction
+
+The first step in performing allocation is to analyze the input
+program to understand its dataflow: that is, the ranges during which
+virtual registers must be assigned to physical registers. Computing
+these ranges is what allows us to do better than a trivial "every vreg
+lives in a different location, always" allocation.
+
+We compute precise liveness first using an iterative dataflow
+algorithm with BitVecs. (See below for our sparse chunked BitVec
+description.) This produces the `liveins` and `liveouts` vectors of
+BitVecs per block.
+
+We then perform a single pass over blocks in reverse order, and scan
+instructions in each block in reverse order. Why reverse order? We
+must see instructions within a block in reverse to properly compute
+liveness (a value is live backward from a use to a def). Because we
+want to keep liveranges in-order as we build them, to enable
+coalescing, we visit blocks in reverse order as well, so overall this
+is simply a scan over the whole instruction index space in reverse
+order.
+
+For each block, we perform a scan with the following state:
+
+- A liveness bitvec, initialized at the start from `liveouts`.
+- A vector of live-range indices, with one entry per vreg, initially
+  "invalid" (this vector is allocated once and reused at each block).
+- In-progress list of live-range indices per vreg in the vreg state,
+  in *reverse* order (we will reverse it when we're done).
+
+A vreg is live at the current point in the scan if its bit is set in
+the bitvec; its entry in the vreg-to-liverange vec may be stale, but
+if the bit is not set, we ignore it.
+
+We initially create a liverange for all vregs that are live out of the
+block, spanning the whole block. We will trim this below if it is
+locally def'd and does not pass through the block.
+
+For each instruction, we process its effects on the scan state:
+
+- For all clobbers (which logically happen at the end of the
+  instruction), add a single-program-point liverange to each clobbered
+  preg.
+
+- If not a move:
+  - for each program point [after, before]:
+    - for each operand at this point(\*):
+      - if a def or mod:
+        - if not currently live, this is a dead def; create an empty
+          LR.
+        - if a def:
+          - set the start of the LR for this vreg to this point.
+          - set as dead.
+      - if a use:
+        - create LR if not live, with start at beginning of block.
+
+- Else, if a move:
+  - simple case (no pinned vregs):
+    - add to `prog_move` data structures, and update LRs as above.
+    - effective point for the use is *after* the move, and for the def
+      is *before* the *next* instruction. Why not more conventional
+      use-before, def-after? Because this allows the move to happen in
+      parallel with other moves that the move-resolution inserts
+      (between split fragments of a vreg); these moves always happen
+      at the gaps between instructions. We place it after, not before,
+      because before may land at a block-start and interfere with edge
+      moves, while after is always a "normal" gap (a move cannot end a
+      block).
+  - otherwise: see below (pinned vregs).
+
+
+(\*) an instruction operand's effective point is adjusted in a few
+cases. If the instruction is a branch, its uses (which are
+blockparams) are extended to the "after" point. If there is a reused
+input, all *other* inputs are extended to "after": this ensures proper
+interference (as we explain more below).
+
+We then treat blockparams as defs at the end of the scan (beginning of
+the block), and create the "ins" tuples.
(The uses for the other side +of the edge are already handled as normal uses on a branch +instruction.) + +### Optimization: Pinned VRegs and Moves + +In order to efficiently handle the translation from the regalloc.rs +API, which uses named RealRegs that are distinct from VirtualRegs +rather than operand constraints, we need to implement a few +optimizations. The translation layer translates RealRegs as particular +vregs at the regalloc2 layer, because we need to track their liveness +properly. Handling these as "normal" vregs, with massive bundles of +many liveranges throughout the function, turns out to be a very +inefficient solution. So we mark them as "pinned" with a hook in the +RA2 API. Semantically, this means they are always assigned to a +particular preg whenever mentioned in an operand (but *NOT* between +those points; it is possible for a pinned vreg to move all about +registers and stackslots as long as it eventually makes it back to its +home preg in time for its next use). + +This has a few implications during liverange construction. First, when +we see an operand that mentions a pinned vreg, we translate this to an +operand constraint that names a fixed preg. Later, when we build +bundles, we will not create a bundle for the pinned vreg; instead we +will transfer its liveranges directly as unmoveable reservations in +pregs' allocation maps. Finally, we need to handle moves specially. + +With the caveat that "this is a massive hack and I am very very +sorry", here is how it works. A move between two pinned vregs is easy: +we add that to the inserted-moves list right away because we know the +Allocation on both sides. A move from a pinned vreg to a normal vreg +is the first interesting case. In this case, we (i) create a ghost def +with a fixed-register policy on the normal vreg, doing the other +liverange-maintenance bits as above, and (ii) adjust the liveranges on +the pinned vreg (so the preg) in a particular way. If the preg is live +flowing downward, then this move implies a copy, because the normal +vreg and the pinned vreg are both used in the future and cannot +overlap. But we cannot keep the preg continuously live, because at +exactly one program point, the normal vreg is pinned to it. So we cut +the downward-flowing liverange just *after* the normal vreg's +fixed-reg ghost def. Then, whether it is live downward or not, we +create an upward-flowing liverange on the pinned vreg that ends just +*before* the ghost def. + +The move-from-normal-to-pinned case is similar. First, we create a +ghost use on the normal vreg that pins its value at this program point +to the fixed preg. Then, if the preg is live flowing downward, we trim +its downward liverange to start just after the fixed use. + +There are also some tricky metadata-maintenance records that we emit +so that the checker can keep this all straight. + +The outcome of this hack, together with the operand-constraint +translation on normal uses/defs/mods on pinned vregs, is that we +essentially are translating regalloc.rs's means of referring to real +registers to regalloc2's preferred abstractions by doing a bit of +reverse-engineering. It is not perfect, but it works. Still, we hope +to rip it all out once we get rid of the need for the compatibility +shim. + +### Handling Reused Inputs + +Reused inputs are also handled a bit specially. 
We have already
+described how we essentially translate the idiom so that the output's
+allocation is used for input and output, and there is a move just
+before the instruction that copies the actual input (which will not be
+clobbered) to the output. Together with an attempt to merge the
+bundles for the two, to elide the move if possible, this works
+perfectly well as long as we ignore all of the other inputs.
+
+But we can't do that: we have to ensure that other inputs' allocations
+are correct too. Note that using the output's allocation as the input
+is actually potentially incorrect if the output is at the After point
+and the input is at the Before: the output might share a register with
+one of the *other* (normal, non-reused) inputs if that input's vreg
+were dead afterward. This would mean that we clobber the other input.
+
+So, to get the interference right, we *extend* all other (non-reused)
+inputs of an instruction with a reused input to the After point. This
+ensures that the other inputs are *not* clobbered by the slightly
+premature use of the output register.
+
+The source has a link to a comment in IonMonkey that implies that it
+uses a similar solution to this problem, though it's not entirely
+clear.
+
+(This odd dance, like many of the others above and below, is "written
+in fuzzbug failures", so to speak. It's not entirely obvious until one
+sees the corner case where it's necessary!)
+
+## Bundle Merging
+
+Once we have built the liverange lists for every vreg, we can reverse
+these lists (recall, they were built in strict reverse order) and
+initially assign one bundle per (non-pinned) vreg. We then try to
+merge bundles together as long as we find pairs of bundles that do not
+overlap and that (heuristically) make sense to merge.
+
+Note that this is the only point in the allocation pipeline where
+bundles get larger. We initially merge as large as we dare (but not
+too large, because then we'll just cause lots of conflicts and
+splitting later), and then try out assignments, backtrack via
+eviction, and split continuously to chip away at the problem until we
+have a working set of allocation assignments.
+
+We attempt to merge three kinds of bundle pairs: reused-input to
+corresponding output; across program moves; and across blockparam
+assignments.
+
+To merge two bundles, we traverse over both their sorted liverange
+lists at once, checking for overlaps. Note that we can do this without
+pointer-chasing to the liverange data; the (from, to) range is in the
+liverange list itself.
+
+We also check whether the merged bundle would have conflicting
+requirements (see below for more on requirements). We do a coarse
+check first, checking 1-bit flags that indicate whether either bundle
+has any fixed-reg constraints or stack-only constraints. If so, we
+need to do a detailed check by actually computing merged requirements
+on both sides, merging, and checking for Conflict (the lattice bottom
+value). If no conflict, we merge.
+
+A performance note: merging is extremely performance-sensitive, and it
+turns out that a mergesort-like merge of the liverange lists is too
+expensive, partly because it requires allocating a separate result
+vector (in-place merge in mergesort is infamously complex). Instead,
+we simply append one vector onto the end of the other and invoke
+Rust's builtin sort. We could special-case "one bundle is completely
+before the other", but we currently don't do that (performance idea!).
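+
+The overlap check during merging can be a single forward walk over the
+two sorted range lists, along these lines (a sketch; names are
+illustrative):
+
+```rust
+type ProgPoint = u32; // instruction index with before/after in the LSB
+
+/// Do two sorted lists of disjoint half-open (from, to) ranges
+/// intersect anywhere? Runs in O(|a| + |b|) with no pointer-chasing.
+fn ranges_overlap(a: &[(ProgPoint, ProgPoint)], b: &[(ProgPoint, ProgPoint)]) -> bool {
+    let (mut i, mut j) = (0, 0);
+    while i < a.len() && j < b.len() {
+        if a[i].1 <= b[j].0 {
+            i += 1; // a[i] ends before b[j] begins
+        } else if b[j].1 <= a[i].0 {
+            j += 1; // b[j] ends before a[i] begins
+        } else {
+            return true; // the two ranges intersect
+        }
+    }
+    false
+}
+```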
+
+Once all bundles are merged as far as they will go, we compute cached
+bundle properties (priorities and weights) and enqueue them on the
+priority queue for allocation.
+
+## Recurring: Bundle Property Computation
+
+The core allocation loop is a recurring iteration of the following: we
+take the highest-priority bundle from the allocation queue; we compute
+its requirements; we try to find it a register according to those
+requirements; if no fit, we either evict some other bundle(s) from
+their allocations and try again, or we split the bundle and put the
+parts back on the queue. We record all the information we need to make
+the evict-or-split decision (and where to split) *during* the physical
+register allocation-map scans, so we don't need to go back again to
+compute that.
+
+Termination is nontrivial to see, because of eviction. How do we
+guarantee we don't get into an infinite loop where two bundles fight
+over a register forever? In fact, this can easily happen if there is a
+bug; we fixed many fuzzbugs like this, and we have a check for
+"infinite loop" based on an upper bound on iterations. But if the
+allocator is correct, it should never happen.
+
+Termination is guaranteed because (i) bundles always get smaller, (ii)
+eviction only occurs when a bundle is *strictly* higher weight (not
+higher-or-equal), and (iii) once a bundle gets down to its "minimal"
+size, it has an extremely high weight that is guaranteed to evict any
+non-minimal bundle. A minimal bundle is one that covers only one
+instruction. As long as the input program does not have impossible
+constraints that require more than one vreg to exist in one preg, an
+allocation problem of all minimal bundles will always have a solution.
+
+## Bundle Processing
+
+Let's now talk about what happens when we take a bundle off the
+allocation queue. The three basic outcomes are: allocate; split and
+requeue; or evict and try again immediately (and eventually allocate
+or split/requeue).
+
+### Properties: Weight, Priority, and Requirements
+
+To process a bundle, we have to compute a few properties. In fact we
+will have already computed a few of these beforehand, but we describe
+them all here.
+
+- Priority: a bundle's priority determines the order in which it is
+  considered for allocation. RA2 defines it as the sum of the lengths
+  (in instruction index space) of each liverange. This causes the
+  allocator to consider larger bundles first, when the allocation maps
+  are generally more free; they can always be evicted and split later.
+
+- Weight: a bundle's weight indicates how important (in terms of
+  runtime) its uses/register mentions are. In an approximate sense,
+  inner loop bodies create higher-weight uses. Fixed register
+  constraints add some weight, and defs add some weight. Finally,
+  weight is divided by priority, so a very large bundle that happens
+  to have a few important uses does not uniformly exert its weight
+  across its entire range. This has the effect of causing bundles to
+  be more important (more likely to evict others) the more they are
+  split.
+
+- Requirement: a bundle's requirement is a value in a lattice that we
+  have defined, where top is "Unknown" and bottom is
+  "Conflict". Between these two, we have: any register (of a class);
+  any stackslot (of a class); a particular register. "Any register"
+  can degrade to "a particular register", but any other pair of
+  different requirements meets to Conflict.
Requirements are derived
+  from the operand constraints for all uses in all liveranges in a
+  bundle, and then merged with the lattice meet-function.
+
+Once we have the Requirement for a bundle, we can decide what to do.
+
+### No-Register-Required Cases
+
+If the requirement indicates that no register is needed (`Unknown` or
+`Any`), *and* if the spill bundle already exists for this bundle's
+spillset, then we move all the liveranges over to the spill bundle, as
+described above.
+
+If the requirement indicates that the stack is needed explicitly
+(e.g., for a safepoint), we set our spillset as "required" (this will
+cause it to allocate a spillslot) and return; because the bundle has
+no other allocation set, it will look to the spillset's spillslot by
+default.
+
+If the requirement indicates a conflict, we immediately split and
+requeue the split pieces. This split is a special one: rather than
+split in a way informed by conflicts (see below), we unconditionally
+split off the first use. This is a heuristic and we could in theory do
+better by finding the source of the conflict; but in practice this
+works well enough. Note that a bundle can reach this stage with a
+conflicting requirement only if the original liverange had conflicting
+uses (e.g., a liverange from a def in a register to a use on stack, or
+a liverange between two different fixed-reg-constrained operands); our
+bundle merging logic explicitly avoids merging two bundles if it would
+create a conflict.
+
+### Allocation-Map Probing
+
+If we did not immediately dispose of the bundle as described above,
+then we *can* use a register (either `Any`, which accepts a register
+as one of several options, or `Reg`, which must have one, or `Fixed`,
+which must have a particular one).
+
+We determine the list of physical registers whose allocation maps we
+will probe, and in what order. If a particular fixed register is
+required, we probe only that register. Otherwise, we probe all
+registers in the required class.
+
+The order in which we probe, if we are not constrained to a single
+register, is carefully chosen. First, if there is a hint register from
+the spillset (this is set by the last allocation into a register of
+any other bundle in this spillset), we probe that. Then, we probe all
+preferred registers; then all non-preferred registers.
+
+For each of the preferred and non-preferred register lists, we probe
+in an *offset* manner: we start at some index partway through the
+list, determined by some heuristic number that is random and
+well-distributed. (In practice, we use the sum of the bundle index and
+the instruction index of the start of the first range in the bundle.)
+We then march through the list and wrap around, stopping before we hit
+our starting point again.
+
+The purpose of this offset is to distribute the contention and speed
+up the allocation process. In the common case where there are enough
+registers to hold values without spilling (for small functions), we
+are more likely to choose a free register right away if we throw the
+dart at random than if we start *every* probe at register 0, in
+order. This has a large allocation performance impact in practice.
+
+For each register in probe order, we probe the allocation map, and
+gather, simultaneously, several results: (i) whether the entire range
+is free; (ii) if not, the list of all conflicting bundles, *and* the
+highest weight among those bundles; (iii) if not, the *first* conflict
+point.
+ +We do this by iterating over all liveranges in the preg's btree that +overlap with each range in the current bundle. This iteration is +somewhat subtle due to multiple "equal" keys (see above where we +describe the use of the btree). It is also adaptive for performance +reasons: it initially obtains an iterator into the btree corresponding +to the start of the first range in the bundle, and concurrently +iterates through both the btree and the bundle. However, if there is a +large gap in the bundle, this might require skipping many irrelevant +entries in the btree. So, if we skip too many entries (heuristically, +16, right now), we do another lookup from scratch in the btree for the +start of the next range in the bundle. This balances between the two +cases: dense bundle, where O(1) iteration through the btree is faster, +and sparse bundle, where O(log n) lookup for each entry is better. + +### Decision: Allocate, Evict, or Split + +First, the "allocate" case is easy: if, during our register probe +loop, we find a physical register whose allocations do not overlap +this bundle, then we allocate this register; done! + +If not, then we need to decide whether to evict some conflicting +bundles and retry, or to split the current bundle into smaller pieces +that may have better luck. + +A bit about our split strategy first: contrary to the IonMonkey +allocator which inspired much of our design, we do *not* have a list +of split strategies that split one bundle into many pieces at +once. Instead, each iteration of the allocation loop splits at most +*once*. This simplifies the splitting code greatly, but also turns out +to be a nice heuristic: we split at the point that the bundle first +encounters a conflict for a particular preg assignment, then we hint +that preg for the first (pre-conflict) piece when we retry. In this +way, we always make forward progress -- one piece of the bundle is +always allocated -- and splits are informed by the actual situation at +hand, rather than best guesses. Also note that while this may appear +at first to be a greedy algorithm, it still allows backtracking: the +first half of the split bundle, which we *can* now assign to a preg, +does not necessarily remain on that preg forever (it can still be +evicted later). It is just a split that is known to make at least one +part of the allocation problem solvable. + +To determine whether to split or evict, we track our best options: as +we probe, we track the "lowest cost eviction option", which is a set +of bundles and the maximum weight in that set of bundles. We also +track the "lowest cost split option", which is the cost (more below), +the point at which to split, and the register for this option. + +For each register we probe, if there is a conflict but none of the +conflicts are fixed allocations, we receive a list of bundles that +conflicted, and also separately, the first conflicting program +point. We update the lowest-cost eviction option if the cost (max +weight) of the conflicting bundles is less than the current best. We +update the lowest-cost split option if the cost is less as well, +according to the following definition of cost: a split's cost is the +cost of its move, as defined by the weight of a normal def operand at +the split program point, plus the cost of all bundles beyond the split +point (which will still be conflicts even after the split). 
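+
+A sketch of the bookkeeping this implies during the probe loop (field
+and type names here are hypothetical):
+
+```rust
+struct BestOptions {
+    // Initialized to u32::MAX / None before probing begins.
+    evict_cost: u32,       // lowest max-weight among conflict sets seen
+    evict_preg: Option<u8>,
+    split_cost: u32,       // lowest split cost seen so far
+    split_at: Option<u32>, // first-conflict ProgPoint for that option
+    split_preg: Option<u8>,
+}
+
+impl BestOptions {
+    fn note_conflict(
+        &mut self,
+        preg: u8,
+        max_conflict_weight: u32,
+        any_conflict_is_fixed: bool,
+        first_conflict_point: u32,
+        split_cost: u32,
+    ) {
+        // Eviction is an option only if no conflict is a fixed reservation.
+        if !any_conflict_is_fixed && max_conflict_weight < self.evict_cost {
+            self.evict_cost = max_conflict_weight;
+            self.evict_preg = Some(preg);
+        }
+        // A split candidate exists either way; keep the cheapest one.
+        if split_cost < self.split_cost {
+            self.split_cost = split_cost;
+            self.split_at = Some(first_conflict_point);
+            self.split_preg = Some(preg);
+        }
+    }
+}
+```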
+ +If there is a conflict with a fixed allocation, then eviction is not +an option, but we can still compute the candidate split point and cost +in the same way as above. + +Finally, as an optimization, we pass in the current best cost to the +btree probe inner loop; if, while probing, we have already exceeded +the best cost, we stop early (this improves allocation time without +affecting the result). + +Once we have the best cost for split and evict options, we split if +(i) the bundle is not already a minimal bundle, and (ii) we've already +evicted once in this toplevel iteration without success, or the weight +of the current bundle is less than the eviction cost. We then requeue +*both* resulting halves of the bundle with the preg that resulted in +this option as the register hint. Otherwise, we evict all conflicting +bundles and try again. + +Note that the split cost does not actually play into the above (split +vs. evict) decision; it is only used to choose *which* split is +best. This is equivalent to saying: we never evict if the current +bundle is less important than the evicted bundles, even if the split +is more expensive still. This is important for forward progress, and +the case where the split would be even more expensive should be very +very rare (it would have to come from a costly move in the middle of +an inner loop). + +### How to Split + +The actual split procedure is fairly simple. We are given a bundle and +a split-point. We create a new bundle to take on the second half +("rest") of the original. We find the point in the liverange list that +corresponds to the split, and distribute appropriately. If the +split-point lands in the middle of a liverange, then we split that +liverange as well. + +In the case that a new liverange is created, we add the liverange to +the corresponding vreg liverange list as well. Note that, as described +above, the vreg's liverange list is unsorted while splitting is +occurring (because we do not need to traverse it or do any lookups +during this phase); so we just append. + +The splitting code also supports a "minimal split", in which it simply +peels off the first use. This is used to ensure forward progress when +a bundle has conflicting requirements within it (see above). + +#### Spill Bundle and Splitting + +Once a split occurs, however, it turns out that we can improve results +by doing a little cleanup. Once we distribute a bundle's liveranges +across two half-bundles, we postprocess by trimming a bit. + +In particular, if we see that the "loose ends" around the split point +extend beyond uses, we will create and move ranges to a spill +bundle. That is: if the last liverange in the first-half bundle +extends beyond its last use, we trim that part off into an empty (no +uses) liverange and place that liverange in the spill +bundle. Likewise, if the first liverange in the second-half bundle +starts before its first use, we trim that part off into an empty +liverange and place it in the spill bundle. + +This is, empirically, an improvement: it reduces register contention +and makes splitting more effective. 
The intuition is twofold: (i) it
+is better to put all of the "flow-through" parts of a vreg's liveness
+into one bundle that is never split, and can be spilled to the stack
+if needed, to avoid unnecessary moves; and (ii) if contention is high
+enough to cause splitting, it is more likely there will be an actual
+stack spill, and if so, it is better to do the store just after the
+last use of one bundle and the reload just before the first use of the
+next.
+
+Unfortunately, this heuristic choice does interact somewhat poorly
+with program moves: moves between two normal (non-pinned) vregs do not
+create ghost uses or defs, and so these points of the ranges can be
+spilled, turning a normal register move into a move from or to the
+stack. However, empirically, we have found that adding such ghost
+uses/defs actually regresses some cases as well, because it pulls
+values back into registers when we could have had a stack-to-stack
+move (which might even be a no-op if both sides share the same
+spillset); overall, it seems better to trim. Trimming also improves
+allocation performance by reducing contention in the registers during
+the core loop (before second-chance allocation).
+
+## Second-Chance Allocation: Spilled Bundles
+
+Once the main allocation loop terminates, when all bundles have either
+been allocated or punted to the "spilled bundles" list, we do
+second-chance allocation. This is a simpler loop that never evicts and
+never splits. Instead, each bundle gets one second chance, in which it
+can probe pregs and attempt to allocate. If it fails, it will actually
+live on the stack.
+
+This is correct because we are careful to place on the spilled-bundles
+list only bundles that are *allowed* to live on the
+stack. Specifically, only the canonical spill bundles (which will
+contain only empty ranges) and other bundles that have an "any" or
+"unknown" requirement are placed here (but *not* bundles with "stack"
+requirements; those *must* be on the stack, so they do not undergo
+second-chance allocation).
+
+At the end of this process, we have marked spillsets as required
+whenever at least one bundle in the spillset actually requires a stack
+slot. We can then allocate slots to the spillsets.
+
+## Spillslot Allocation
+
+We must allocate space on the stack, denoted by an abstract index
+space, to each spillset that requires it, covering the liveranges
+during which it is required.
+
+To facilitate this, we keep a btree per spillslot in the same way we
+do per preg. We will allocate spillsets to slots in a way that avoids
+interference.
+
+Note that we actually overapproximate the required ranges for each
+spillset in order to improve the behavior of a later phase
+(redundant-move elimination). Specifically, when we allocate a slot
+for a spillset, we reserve that slot for *all* of the liveranges of
+*every* vreg that is assigned to that spillset (the merging rules that
+build final bundles from initial one-vreg bundles guarantee that there
+will be no overlaps here). In other words, we rule out interleaving of
+completely different values in the same slot, though bundle merging
+does mean that potentially many (non-interfering) vregs may share
+it. This provides the important property that if a vreg has been
+reloaded, but not modified, its spillslot *still contains the
+up-to-date value* (because the slot is reserved for all liveranges of
+the vreg). This enables us to avoid another store to the spillslot
+later if there is another spilled range.
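+
+As an illustration, here is one way the per-slot structure and the
+"reserve everything" rule could look. This is a simplified sketch with
+invented names; the real code presumably shares the liverange-key
+machinery described earlier for the preg btrees:
+
+```rust
+use std::collections::BTreeMap;
+
+// Half-open [from, to) ranges, keyed by start point. Reserved ranges in
+// one slot never overlap, which is what makes the lookup below valid.
+struct SpillSlot {
+    ranges: BTreeMap<u32, u32>,
+}
+
+impl SpillSlot {
+    fn overlaps(&self, from: u32, to: u32) -> bool {
+        // Because reserved ranges are disjoint, the only candidate for an
+        // overlap is the range with the greatest start below `to`.
+        self.ranges
+            .range(..to)
+            .next_back()
+            .map_or(false, |(_, &end)| end > from)
+    }
+
+    // Reserve *all* ranges of *all* vregs in the spillset, or nothing.
+    fn try_reserve(&mut self, spillset_ranges: &[(u32, u32)]) -> bool {
+        if spillset_ranges.iter().any(|&(f, t)| self.overlaps(f, t)) {
+            return false;
+        }
+        for &(f, t) in spillset_ranges {
+            self.ranges.insert(f, t);
+        }
+        true
+    }
+}
+```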
+
+We perform probing in a way that differs somewhat from register
+probing, because the spillslot space is conceptually infinite. We can
+thus trade a little space for allocation speed: we are free to give up
+and allocate a fresh slot at any time.
+
+For each size class, we keep a linked list of slots. When we need to
+allocate a spillset to a slot, we traverse the list and try a fixed
+number of slots. If we find one that fits the spillset's ranges, we
+allocate, remove the slot from its current place in the list, and
+append it to the end. In this way, it is deprioritized from probing
+"for a while", which tends to reduce contention. This is a simple way
+to round-robin between slots. If we don't find one that fits after a
+fixed number of probes, we allocate a new slot.
+
+And with that, we have valid allocations for all vregs for all points
+at which they are live! Now we just need to modify the program to
+reify these choices.
+
+## Allocation Assignment
+
+The first step in reifying the allocation is to iterate through all
+mentions of a vreg and fill in the resulting `Allocation` array with
+the appropriate allocations. We do this by simply traversing the
+liveranges of each vreg, looking up the allocation via the bundle (or
+the spillset, if the bundle has no specific allocation), and, for each
+use, filling in the slot according to the saved progpoint/slot info in
+the use data.
+
+## Move Generation
+
+The more difficult half of the reification step is generating the
+*moves* that will put the values in the right spots.
+
+There are two sources of moves that we must generate. The first are
+moves between different ranges of the same vreg, as the split pieces
+of that vreg's original bundle may have been assigned to different
+locations. The second are moves that result from move semantics in the
+input program: either assignments from blockparam args on branches to
+the target block's params, or program move instructions. (Recall that
+we reify program moves in a unified way with all other moves, so the
+client should not generate any machine code for the original moves in
+the pre-allocation program.)
+
+Moves are tricky to handle efficiently because they join two
+potentially very different locations in the program (in the case of
+control-flow-edge moves). In order to avoid the need for random
+lookups, which are a cache-locality nightmare even if we have O(log n)
+lookups, we instead take a scan-sort-scan approach.
+
+First, we scan over each vreg's liveranges, find the allocation for
+each, and for each move that comes *to* or *from* this liverange,
+generate a "half-move". The key idea is that we generate a record for
+each "side" of the move, and these records are keyed in a way that,
+after a sort, the "from" and "to" ends will be consecutive. We can
+sort the list of half-moves once (this is expensive, but not as
+expensive as many separate pointer-chasing lookups), then scan it
+again to actually generate the move instructions.
+
+To enable the sort to work, half-moves are sorted by a key that is
+equivalent to the tuple (from-block, to-block, to-vreg, kind), where
+`kind` is "source" or "dest". For each key, the payload is an
+allocation. The fields in this tuple are carefully chosen: we know all
+of them at every location where we generate a half-move, without
+expensive lookups, and sorting by this key will make the source and
+all dests (there can be more than one) contiguous in the final order.
+
+Half-moves are generated in several situations.
First, at the start
+of every block covered by a liverange, we can generate "dest"
+half-moves for blockparams, and at the end of every block covered by a
+liverange, we can generate "source" half-moves for blockparam args on
+branches. Incidentally, this is the reason that `blockparam_ins` and
+`blockparam_outs` are sorted tuple-lists whose tuples begin with
+(vreg, block, ...): this is the order in which we do the toplevel scan
+over allocations.
+
+Second, at every block edge, if the vreg is live in any pred (at
+block-start) or succ (at block-end), we generate a half-move to
+transfer the vreg to its own location in the connected block.
+
+This completes the "edge-moves". We sort the half-move array and then
+have all of the alloc-to-alloc pairs on a given (from-block, to-block)
+edge.
+
+There are also two kinds of moves that happen within blocks. First,
+when a liverange ends and another begins for the same vreg in the same
+block (i.e., a split in the middle of a block), we know both sides of
+the move immediately (because it is the same vreg and we can look up
+the adjacent allocation easily), and we can generate that move.
+
+Second, program moves occur within blocks. Here we do something
+similar to the block-edge half-moves, but keyed on program point
+instead. This is why the `prog_move_srcs` and `prog_move_dsts` arrays
+are initially sorted by their (vreg, inst) keys: we can directly fill
+in their allocation slots during our main scan. Note that when sorted
+this way, the source and dest for a given move instruction will be at
+different indices. After the main scan, we *re-sort* the arrays by
+just the instruction, so the two sides of a move line up at the same
+index; we can then traverse both arrays, zipped together, and generate
+moves.
+
+Finally, we generate moves to fix up multi-fixed-reg-constraint
+situations and to make reused inputs work, as described earlier.
+
+## Move Resolution
+
+Throughout this discussion, we have spoken of "generating moves", but
+we have not said what that means concretely. Note that in many cases,
+there are several moves at a particular program point that
+semantically happen *in parallel*. For example, if multiple vregs
+change allocations between two instructions, all of those moves happen
+as part of one parallel permutation. Similarly, blockparams have
+parallel-assignment semantics. We thus enqueue all the moves that we
+generate at program points and resolve them into lists of sequential
+moves that can actually be lowered to move instructions in the machine
+code.
+
+First, a word on *move priorities*. There are different kinds of moves
+that are generated between instructions, and we have to ensure that
+some happen before others, i.e., *not* in parallel. For example, a
+vreg might change allocation (due to a split) before an instruction,
+then be copied to an output register for an output with a reused-input
+policy. The latter move must happen *after* the vreg has been moved
+into its location for this instruction.
+
+To enable this, we define "move priorities", which are a logical
+extension of program points (i.e., they are sub-points) that enable
+finer-grained ordering of moves. We currently have the following
+priorities:
+
+- In-edge moves, to place edge-moves before the first instruction in a
+  block.
+- Block-param metadata, used for the checker only.
+- Regular, used for vreg movement between allocations.
+- Post-regular, used for checker metadata related to pinned-vreg moves.
+- Multi-fixed-reg, used for moves that handle the
+  single-vreg-in-multiple-fixed-pregs constraint case.
+- Reused-input, used for implementing outputs with reused-input policies.
+- Out-edge moves, to place edge-moves after the last instruction
+  (prior to the branch) in a block.
+
+Every move is statically given one of these priorities by the code
+that generates it.
+
+We collect moves with (prog-point, prio) keys, and we sort by those
+keys. We then have, for each such key, a list of moves that
+semantically happen in parallel.
+
+We then resolve those moves using a parallel-move resolver, as we now
+describe.
+
+### Parallel-Move Resolver
+
+The fundamental issue that arises when resolving parallel moves to
+sequential moves is *overlap*: some of the moves may overwrite
+registers that other moves use as sources. We must carefully order
+moves so that this does not clobber values incorrectly.
+
+We first check whether such overlap occurs. If it does not (this is
+actually the most common case), the list of parallel moves can be
+emitted as sequential moves directly. Done!
+
+Otherwise, we have to order the moves carefully. Furthermore, if there
+is a *cycle* anywhere among the moves, we will need a scratch
+register. (Consider, e.g., t0 := t1 and t1 := t0 in parallel: with
+only move instructions and no direct "exchange" instruction, we cannot
+reify this without a third register.)
+
+We first compute a mapping from each move instruction to the move
+instruction, if any, that it must precede. Note that there can be only
+one such move for a given move, because each destination can be
+written only once; so a move can be constrained to precede only the
+single move that overwrites its source. (This will be important in a
+bit!)
+
+Our task is now to find an ordering of moves that respects these
+dependencies. To do so, we perform a depth-first search on the graph
+induced by the dependencies, which will generate a list of sequential
+moves in reverse order. We keep a stack of moves; we start with any
+move that has not been visited yet. In each iteration, if the
+top-of-stack move has no out-edge to another move (i.e., it does not
+need to come before any other), we push it to a result vector,
+followed by all others on the stack (in popped order). If it does have
+an out-edge and the target is already visited and no longer on the
+stack (so already emitted), we likewise emit this move and the rest on
+the stack. If it has an out-edge to a move not yet visited, we push
+that move on the stack and continue. Otherwise, if it has an out-edge
+to a move currently on the stack, we have found a cycle. In this case,
+we emit the moves on the stack with a modification: the first move
+writes to a scratch register, and we emit an additional move that
+moves from the scratch to the first move's dest. This breaks the
+cycle.
+
+The astute reader may notice that this sounds like a canonical
+application of Tarjan's algorithm for finding SCCs (strongly-connected
+components). Why don't we have the full complexity of that algorithm?
+In particular, *why* can we emit the cycle *right away* once we find
+it, rather than ensuring that we have gotten all of the SCC first?
+
+The answer is that, because each move has at most *one* out-edge (a
+move can be required to precede only *one* other move), all SCCs must
+be simple cycles. This means that once we have found a cycle, no other
+nodes (moves) can be part of the SCC, because every node's single
+out-edge is already accounted for. This is what allows us to avoid a
+fully general SCC algorithm.
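+
+The following is a compact, self-contained sketch of this DFS. The
+names and the plain-`u32` location representation are invented for
+illustration (the real resolver works over `Allocation`s and carries
+extra metadata), and self-moves are assumed to have been filtered out:
+
+```rust
+// Sequentialize parallel moves `(src, dst)`, where each `dst` is written
+// at most once and `scratch` is a location no move touches.
+fn sequentialize(moves: &[(u32, u32)], scratch: u32) -> Vec<(u32, u32)> {
+    #[derive(Clone, Copy, PartialEq)]
+    enum Mark { Unvisited, OnStack, Done }
+    // Each move has at most one out-edge: the unique move (if any) that
+    // overwrites its source, which it must precede.
+    let succ: Vec<Option<usize>> = moves
+        .iter()
+        .map(|&(src, _)| moves.iter().position(|&(_, dst)| dst == src))
+        .collect();
+    let mut mark = vec![Mark::Unvisited; moves.len()];
+    let mut rev: Vec<(u32, u32)> = Vec::new(); // built in reverse order
+    for start in 0..moves.len() {
+        if mark[start] != Mark::Unvisited {
+            continue;
+        }
+        let mut stack = vec![start];
+        mark[start] = Mark::OnStack;
+        while let Some(&top) = stack.last() {
+            match succ[top] {
+                Some(j) if mark[j] == Mark::Unvisited => {
+                    mark[j] = Mark::OnStack;
+                    stack.push(j);
+                }
+                Some(j) if mark[j] == Mark::OnStack => {
+                    // Cycle: stack[m..] is a simple cycle through `j`. Emit
+                    // (in reverse) `dst_j := scratch`, then the cycle body,
+                    // then `scratch := src_j`, which breaks the cycle.
+                    let m = stack.iter().position(|&x| x == j).unwrap();
+                    rev.push((scratch, moves[j].1));
+                    while stack.len() > m + 1 {
+                        let i = stack.pop().unwrap();
+                        mark[i] = Mark::Done;
+                        rev.push(moves[i]);
+                    }
+                    stack.pop();
+                    mark[j] = Mark::Done;
+                    rev.push((moves[j].0, scratch));
+                }
+                // No out-edge, or successor already emitted: emit the whole
+                // stack in popped order.
+                _ => {
+                    while let Some(i) = stack.pop() {
+                        mark[i] = Mark::Done;
+                        rev.push(moves[i]);
+                    }
+                }
+            }
+        }
+    }
+    rev.reverse();
+    rev
+}
+```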
+ +Once the list of moves in-reverse has been constructed, we reverse it +and return. + +Note that this "move resolver" is fuzzed separately with a simple +symbolic move simulator (the `moves` fuzz-target). + +### Stack-to-Stack Moves + +There is one potentially difficult situation that could arise from the +move-resolution logic so far: if a vreg moves from one spillslot to +another, this implies a memory-to-memory move, which most machine +architectures cannot handle natively. It would be much nicer if we +could ensure within the regalloc that this never occurs. + +This is in fact possible to do in a postprocessing step. We iterate +through the sequential moves, tracking whether the scratch register is +in use (has been written). When we see a stack-to-stack move: (i) if +the scratch register is not in use, generate a stack-to-scratch move +and scratch-to-stack move; otherwise, (ii) if the scratch register is +in use, allocate an "extra spillslot" if one has not already been +allocated, move the scratch reg to that, do the above stack-to-scratch +/ scratch-to-stack sequence, then reload the scratch reg from the +extra spillslot. + +## Redundant-Move Elimination + +As a final step before returning the list of program edits to the +client, we perform one optimization: redundant-move elimination. + +To understand the need for this, consider what will occur when a vreg +is (i) defined once, (ii) used many times, and (iii) spilled multiple +times between some of the uses: with the design described above, we +will move the value from the preg to the stack after every segment of +uses, and then reload it when the next use occurs. However, only the +first spill is actually needed; as we noted above, we allocate +spillslots so that the slot that corresponded to the vreg at the first +spill will always be reserved for that vreg as long as it is live. If +no other defs or mods occur, the value in the slot can be reloaded, +and need not be written back every time. + +This inefficiency is a result of our invariant that a vreg lives in +exactly one place at a time, and these locations are joined by +moves. This is a simple and effective design to use for most of the +allocation pipeline, but falls flat here. It is especially inefficient +when the unnecessary spill occurs in an inner loop. (E.g.: value +defined at top of function is spilled, then used once in the middle of +an inner loop body.) + +The opposite case can also sometimes occur, though it is rarer: a +value is loaded into a register, spilled, and then reloaded into the +same register. This can happen when hinting is successful at getting +several segments of a vreg to use the same preg, but splitting has +trimmed part of the liverange between uses and put it in the spill +bundle, and the spill bundle did not get a reg. + +In order to resolve this inefficiency, we implement a general +redundant-move elimination pass. This pass tracks, for every +allocation (reg or spillslot), whether it is a copy of another +allocation. This state is invalidated whenever either that allocation +or the allocation of which it is a copy is overwritten. When we see a +move instruction, if the destination is already a copy of the source, +we elide the move. (There are some additional complexities to preserve +checker metadata which we do not describe here.) + +Note that this could, in principle, be done as a fixpoint analysis +over the CFG; it must be, if we try to preserve state across +blocks. 
This is because a location is only a copy of another if that +is true on every incoming edge. However, to avoid the cost and +complexity of doing such an analysis, we instead take the much simpler +approach of doing only an intra-block analysis. This turns out to be +sufficient to remove most redundant moves, especially in the common +case of a single use of an otherwise-spilled value. + +Note that we could do better *if* we accepted only SSA code, because +we would know that a value could not be redefined once written. We +should consider this again once we clean up and remove the non-SSA +support. + +# Future Plans + +## SSA-Only Cleanup + +When the major user (Cranelift via the regalloc.rs shim) migrates to +generate SSA code and native regalloc2 operands, there are many bits +of complexity we can remove, as noted throughout this +writeup. Briefly, we could (i) remove special handling of program +moves, (ii) remove the pinned-vreg hack, (iii) simplify redundant-move +elimination, (iv) remove special handling of "mod" operands, and (v) +probably simplify plenty of code given the invariant that a def always +starts a range. + +More importantly, we expect this change to result in potentially much +better allocation performance. The use of special pinned vregs and +moves to/from them instead of fixed-reg constraints, explicit moves +for every reused-input constraint, and already-sequentialized series +of move instructions on edges for phi nodes, are all expensive ways of +encoding regalloc2's native input primitives that have to be +reverse-engineered. Removing that translation layer would be +ideal. Also, allowing regalloc2 to handle phi-node (blockparam) +lowering in a way that is integrated with other moves will likely +generate better code than the way that program-move handling interacts +with Cranelift's manually lowered phi-moves at the moment. + +## Better Split Heuristics + +We have spent quite some effort trying to improve splitting behavior, +and it is now generally decent, but more work could be done here, +especially with regard to the interaction between splits and the loop +nest. + +## Native Debuginfo Output + +Cranelift currently computes value locations (in registers and +stack-slots) for detailed debuginfo with an expensive post-pass, after +regalloc is complete. This is because the existing register allocator +does not support returning this information directly. However, +providing such information by generating it while we scan over +liveranges in each vreg would be relatively simple, and has the +potential to be much faster and more reliable for Cranelift. We should +investigate adding an interface for this to regalloc2 and using it. + +# Appendix: Comparison to IonMonkey Allocator + +There are a number of differences between the [IonMonkey +allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp) +and this one. While this allocator initially began as an attempt to +clone IonMonkey's, it has drifted significantly as we optimized the +design (especially after we built the regalloc.rs shim and had to +adapt to its code style); it is easier at this point to name the +similarities than the differences. + +* The core abstractions of "liverange", "bundle", "vreg", "preg", and + "operand" (with policies/constraints) are the same. + +* The overall allocator pipeline is the same, and the top-level + structure of each stage should look similar. 
Both allocators begin
+  by computing liveranges, then merging bundles, then handling bundles
+  and splitting/evicting as necessary, then doing second-chance
+  allocation, then reifying the decisions.
+
+* The cost functions are very similar, though the heuristics that make
+  decisions based on them are not.
+
+Several notable high-level differences are:
+
+* There are [many different fuzz targets](fuzz/fuzz_targets/) that
+  exercise the allocator, including a full symbolic checker
+  (`ion_checker` target) based on the [symbolic checker in
+  regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
+  and, e.g., a targeted fuzzer for the parallel move-resolution
+  algorithm (`moves`) and the SSA generator used for generating cases
+  for the other fuzz targets (`ssagen`).
+
+* The data-structure invariants are simplified. While the IonMonkey
+  allocator allowed for LiveRanges and Bundles to overlap in certain
+  cases, this allocator sticks to a strict invariant: ranges do not
+  overlap in bundles, and bundles do not overlap. There are other
+  examples too: e.g., the definition of minimal bundles is very simple
+  and does not depend on scanning the code at all. In general, we
+  should be able to state simple invariants and see by inspection (as
+  well as fuzzing -- see above) that they hold.
+
+* The data structures themselves are simplified. Where IonMonkey uses
+  linked lists in many places, this allocator stores simple inline
+  smallvecs of liveranges on bundles and vregs, and smallvecs of uses
+  on liveranges. We also (i) find a way to construct liveranges
+  in-order immediately, without any need for splicing, unlike
+  IonMonkey, and (ii) relax sorting invariants where possible to allow
+  for cheap append operations in many cases.
+
+* The splitting heuristics are significantly reworked. Whereas
+  IonMonkey has an all-at-once approach to splitting an entire bundle,
+  and has a list of complex heuristics to choose where to split, this
+  allocator does conflict-based splitting, and tries to decide whether
+  to split or evict and which split to take based on cost heuristics.
+
+* The liverange computation is exact, whereas IonMonkey approximates
+  using a single-pass algorithm that makes vregs live across entire
+  loop bodies. We have found that precise liveness improves allocation
+  performance and generated code quality, even though the liveness
+  itself is slightly more expensive to compute.
+
+* Many of the algorithms in the IonMonkey allocator are built with
+  helper functions that do linear scans. These "small quadratic" loops
+  are likely not a huge issue in practice, but nevertheless have the
+  potential to be so in corner cases. As much as possible, all work in
+  this allocator is done in linear scans.
+
+* There are novel schemes for solving certain interesting design
+  challenges. One example: in IonMonkey, liveranges are connected
+  across blocks by looking up, when a scan reaches one end of a
+  control-flow edge, the allocation at the other end. This is in
+  principle a linear lookup (so quadratic overall). We instead
+  generate a list of "half-moves", keyed on the edge and from/to
+  vregs, with each holding one of the allocations. By sorting and then
+  scanning this list, we can generate all edge moves in one linear
+  scan.
There are a number of other examples of simplifications: for
+  example, we handle multiple conflicting
+  physical-register-constrained uses of a vreg in a single instruction
+  by recording a copy to perform in a side-table, then removing the
+  constraints for the core regalloc. Ion instead has to tweak its
+  definition of minimal bundles and create two liveranges that overlap
+  (!) to represent the two uses.
+
+* Using block parameters rather than phi-nodes significantly
+  simplifies handling of inter-block data movement. IonMonkey had to
+  special-case phis in many ways because they are actually quite
+  weird: their uses happen semantically in other blocks, and their
+  defs happen in parallel at the top of the block. Block parameters
+  naturally and explicitly represent these semantics in a direct way.
+
+* The allocator supports irreducible control flow and arbitrary block
+  ordering (its only CFG requirement is that critical edges are
+  split).
+
+* The allocator supports non-SSA code, and has native support for
+  handling program moves specially.
+
+# Appendix: Performance-Tuning Lessons
+
+In the course of optimizing the allocator's performance, we found a
+number of general principles:
+
+* We got substantial performance speedups from using vectors rather
+  than linked lists everywhere. This is well-known, but nevertheless,
+  it took some thought to work out how to avoid the need for any
+  splicing, and it turns out that even when our design is slightly
+  less efficient asymptotically (e.g., append-and-re-sort rather than
+  linear-time merge of two sorted liverange lists when merging
+  bundles), it is faster.
+
+* We initially used a direct translation of IonMonkey's splay tree as
+  an allocation map for each PReg. This turned out to be significantly
+  (!) less efficient than Rust's built-in BTree data structures, for
+  the usual cache-efficiency vs. pointer-chasing reasons.
+
+* We initially used dense bitvecs, as IonMonkey does, for
+  livein/liveout bits. It turned out that a chunked sparse design (see
+  below) was much more efficient.
+
+* Precise liveness significantly improves performance because it
+  reduces the size of liveranges (i.e., interference), and probing
+  registers with liveranges is the most significant hot inner
+  loop. Paying a fraction of a percent of runtime for the iterative
+  dataflow algorithm to get precise bitsets is more than worth it.
+
+* The randomized probing of registers was a huge win: as above, the
+  probing is very expensive, and reducing the average number of probes
+  it takes to find a free register is very important.
+
+* In general, single-pass algorithms and design of data structures to
+  enable them are important. For example, the half-move technique
+  avoids the need to do any O(log n) search at all, and is relatively
+  cache-efficient. As another example, a side-effect of the precise
+  liveness was that we could then process operands within blocks in
+  actual instruction order (in reverse), which allowed us to simply
+  append liveranges to in-progress vreg liverange lists and then
+  reverse at the end. The expensive part is a single pass; only the
+  bitset computation is a fixpoint loop.
+
+* Sorts are better than always-sorted data structures (like btrees):
+  they amortize all the comparison and update cost to one phase, and
+  this phase is much more cache-friendly than a bunch of spread-out
+  updates.
+
+* Take care of basic data structures and their operator definitions!
We initially used the auto-derived comparator on ProgPoint, and let
+  ProgPoint be a normal struct (with a u32 inst index and a
+  Before/After enum). The comparator for this, used in many sorting
+  inner loops, was a compound thing with conditionals. Instead, we now
+  pack both into a single u32 and do a simple integer compare (and
+  save half the memory as well). Likewise, the half-move key is a
+  single value packed in a u64; this is far more efficient than the
+  tuple comparator on a 4-tuple, and the half-move sort (which can be
+  a few percent or more of total allocation time) became multiple
+  times cheaper.
+
+# Appendix: Data Structure: Chunked Sparse BitVec
+
+We use a "chunked sparse bitvec" to store liveness information, which
+is just a set of VReg indices. The design is fairly simple: the
+toplevel is a HashMap from "chunk" to a `u64`, and each `u64`
+represents 64 contiguous indices.
+
+The intuition is that while the vreg sets are likely sparse overall,
+they will probably be dense within small regions of the index
+space. For example, in the Nth block in a function, the values that
+flow from block N-1 will largely be almost-contiguous vreg indices, if
+vregs are allocated in sequence down the function body. Or, at least,
+they will be some local vregs together with a few defined at the top
+of the function; two separate chunks will cover that.
+
+We tried a number of other designs as well. Initially we used a simple
+dense bitvec, but this was prohibitively expensive: O(n^2) space when
+the real need is closer to O(n) (i.e., a classic sparse matrix). We
+also tried a hybrid scheme that kept a list of indices when small and
+used either a bitvec or a hashset when large. This did not perform as
+well because (i) it was less memory-efficient (the chunking helps with
+this) and (ii) insertions are more expensive when they always require
+a full hashset/hashmap insert.
+
+# Appendix: Fuzzing
+
+We have five fuzz targets: `ssagen`, `domtree`, `moves`, `ion`, and
+`ion_checker`.
+
+## SSAGen
+
+The SSAGen target tests our SSA generator, which generates cases for
+the full allocator fuzz targets. The SSA generator is careful to
+always generate a valid CFG, with split critical edges, and valid SSA,
+so that we never have to throw out a test input before we reach the
+allocator itself. (An alternative fuzzing approach randomly generates
+programs and then throws out those that do not meet certain conditions
+before using them as legitimate testcases; this is much simpler, but
+less efficient.)
+
+To generate a valid CFG, with no unreachable blocks and with no
+critical edges, the generator glues together units of either one or
+three blocks (A->B, A->C), each forming either a straight-through
+section or a conditional. These are glued together into a "spine", and
+the conditionals (the "C" block), where they exist, are then linked to
+a random target block chosen among the main blocks of these one- or
+three-block units. The targets are chosen either randomly, for
+potentially irreducible CFGs, or in a way that ensures proper nesting
+of loop backedges, if a structured CFG is requested.
+
+SSA is generated by first choosing which vregs will be defined in each
+block, and which will be defined as blockparams vs. instruction
+defs. Instructions are then generated, with operands chosen among the
+"available" vregs: those defined so far in the current block and all
+of those in any other block that dominates this one.
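+
+A minimal sketch of that availability rule (the names and flat data
+layout here are invented for illustration; the real generator lives in
+the fuzzing support code):
+
+```rust
+// Operands may use any vreg defined earlier in the current block, or any
+// vreg defined in a block that strictly dominates the current block.
+fn available_vregs(
+    defs_so_far_in_block: &[u32],
+    defs_per_block: &[Vec<u32>],
+    strict_dominators: &[Vec<usize>], // per block, its strict dominators
+    block: usize,
+) -> Vec<u32> {
+    let mut avail = defs_so_far_in_block.to_vec();
+    for &dom in &strict_dominators[block] {
+        avail.extend_from_slice(&defs_per_block[dom]);
+    }
+    avail
+}
+```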
+ +The SSAGen fuzz target runs the above code generator against an SSA +validator, and thus ensures that it will only generate valid SSA code. + +## Domtree + +The `domtree` fuzz target computes dominance using the algorithm that +we use elsewhere in our CFG analysis, and then walks a +randomly-generated path through the CFG. It checks that the dominance +definition ("a dom b if any path from entry to b must pass through a") +is consistent with this particular randomly-chosen path. + +## Moves + +The `moves` fuzz target tests the parallel move resolver. It generates +a random sequence of parallel moves, careful to ensure that each +destination is written only once. It then runs the parallel move +resolver, and then *abstractly interprets* the resulting sequential +series of moves, thus determining which inputs flow to which +outputs. This must match the original set of parallel moves. + +## Ion and Ion-checker + +The `ion` fuzz target runs the allocator over test programs generated +by SSAGen. It does not validate the output; it only tests that the +allocator runs to completion and does not panic. This was used mainly +during development, and is now less useful than the checker-based +target. + +The `ion_checker` fuzz target runs the allocator's result through a +symbolic checker, which is adapted from the one developed for +regalloc.rs (see [this blog +post](https://cfallin.org/blog/2021/01/22/cranelift-isel-2/) for more +details). This is the most useful fuzz target in the fuzzing suite, +and has found many bugs in development. diff --git a/src/lib.rs b/src/lib.rs index 7d55624..3a6ecb4 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,5 +1,5 @@ /* - * The fellowing license applies to this file, which derives many + * The following license applies to this file, which derives many * details (register and constraint definitions, for example) from the * files `BacktrackingAllocator.h`, `BacktrackingAllocator.cpp`, * `LIR.h`, and possibly definitions in other related files in