# regalloc2 Design Overview

This document describes the basic architecture of the regalloc2
register allocator. It describes the externally-visible interface
(input CFG, instructions, operands, with their invariants; meaning of
various parts of the output); core data structures; and the allocation
pipeline, or series of algorithms that compute an allocation. It ends
with a description of future work and expectations, as well as an
appendix that notes design influences and similarities to the
IonMonkey backtracking allocator.

# API, Input IR and Invariants

The toplevel API to regalloc2 consists of a single entry point `run()`
that takes a register environment, which specifies all physical
registers, and the input program. The function returns either an error
or an `Output` struct that provides allocations for each operand and a
vector of additional instructions (moves, loads, stores) to insert.

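Schematically, driving the allocator looks like the following sketch
(hedged: the exact `run()` signature has varied across regalloc2
versions, and newer versions also take an options struct):

```rust
// `F` is any client type implementing regalloc2's `Function` trait.
fn allocate<F: regalloc2::Function>(
    func: &F,
    env: &regalloc2::MachineEnv,
) -> regalloc2::Output {
    let output = regalloc2::run(func, env).expect("allocation failed");
    // `output.allocs` holds one `Allocation` per `Operand` in the input;
    // `output.edits` lists (ProgPoint, Edit) moves/loads/stores to insert.
    output
}
```
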
## Register Environment

The allocator takes a `MachineEnv` which specifies, for each of the
two register classes `Int` and `Float`, a vector of `PReg`s by index. A
`PReg` is nothing more than the class and index within the class; the
allocator does not need to know anything more.

The `MachineEnv` provides a vector of preferred and non-preferred
physical registers per class. Any register not in either vector will
not be allocated. Usually, registers that do not need to be saved in
the prologue if used (i.e., caller-save registers) are given in the
"preferred" vector. The environment also provides exactly one scratch
register per class. This register must not be in the preferred or
non-preferred vectors, and is used whenever a set of moves that need
to occur logically in parallel have a cycle (for a simple example,
consider a swap `r0, r1 := r1, r0`).

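To make the cycle concrete: neither move of the swap can be emitted
first without destroying a value the other still needs. A minimal
sketch of the standard resolution through the scratch register
(allocation names here are purely illustrative):

```rust
// Sequentialize the cyclic parallel move `r0, r1 := r1, r0`.
// Returned pairs are (destination, source), in emission order.
fn sequentialize_swap() -> Vec<(&'static str, &'static str)> {
    vec![
        ("scratch", "r0"), // save one end of the cycle
        ("r0", "r1"),
        ("r1", "scratch"), // complete the cycle with the saved value
    ]
}
```
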
With some more work, we could potentially remove the need for the
scratch register by requiring support for an additional edit type from
the client ("swap"), but we have not pursued this.

## CFG and Instructions

The allocator operates on an input program that is in a standard CFG
representation: the function body is a sequence of basic blocks, and
each block has a sequence of instructions and zero or more
successors. The allocator also requires the client to provide
predecessors for each block, and these must be consistent with the
successors.

Instructions are opaque to the allocator except for a few important
bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
branch instruction); (3) `is_move` (is a move between registers); and
(4) a vector of Operands, covered below. Every block must end in a
return or branch.

Both instructions and blocks are named by indices in contiguous index
spaces. A block's instructions must be a contiguous range of
instruction indices, and block i's first instruction must come
immediately after block i-1's last instruction.

The CFG must have *no critical edges*. A critical edge is an edge from
block A to block B such that A has more than one successor *and* B has
more than one predecessor. For this definition, the entry block has an
implicit predecessor, and any block that ends in a return has an
implicit successor.

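A minimal sketch of the corresponding check, ignoring the implicit
entry predecessor and return successor for brevity:

```rust
// Returns true if any edge A -> B is critical: A has more than one
// successor and B has more than one predecessor.
fn has_critical_edge(succs: &[Vec<usize>], preds: &[Vec<usize>]) -> bool {
    succs
        .iter()
        .any(|ss| ss.len() > 1 && ss.iter().any(|&b| preds[b].len() > 1))
}
```
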
Note that there are *no* requirements related to the ordering of
blocks, and there is no requirement that the control flow be
reducible. Some *heuristics* used by the allocator will perform better
if the code is reducible and ordered in reverse postorder (RPO),
however: in particular, (1) this interacts better with the
contiguous-range-of-instruction-indices live range representation that
we use, and (2) the "approximate loop depth" metric will actually be
exact if both these conditions are met.

## Operands and VRegs

Every instruction operates on values by way of `Operand`s. An operand
consists of the following fields:

- VReg, or virtual register. *Every* operand mentions a virtual
  register, even if it is constrained to a single physical register in
  practice. This is because we track liveranges uniformly by vreg.

- Policy, or "constraint". Every reference to a vreg can apply some
  constraint to the vreg at that point in the program. Valid policies
  are:

  - Any location;
  - Any register of the vreg's class;
  - Any stack slot;
  - A particular fixed physical register; or
  - For a def (output), a *reuse* of an input register.

- The "kind" of reference to this vreg: Def, Use, Mod. A def
  (definition) writes to the vreg, and disregards any possible earlier
  value. A mod (modify) reads the current value then writes a new
  one. A use simply reads the vreg's value.

- The position: before or after the instruction.
  - Note that to have a def (output) register available in a way that
    does not conflict with inputs, the def should be placed at the
    "before" position. Similarly, to have a use (input) register
    available in a way that does not conflict with outputs, the use
    should be placed at the "after" position.

This operand-specification design allows for SSA and non-SSA code (see
section below for details).

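As a concrete illustration, a few of these policies expressed with
regalloc2's `Operand` constructors (a sketch; the exact set of helper
constructors may vary by version):

```rust
use regalloc2::{Operand, PReg, RegClass, VReg};

// Three operands for a hypothetical instruction.
fn example_operands() -> Vec<Operand> {
    let v0 = VReg::new(0, RegClass::Int);
    let v1 = VReg::new(1, RegClass::Int);
    vec![
        // v0: read, any register of its class.
        Operand::reg_use(v0),
        // v1: written, any register of its class.
        Operand::reg_def(v1),
        // v0: read, constrained to a particular physical register.
        Operand::reg_fixed_use(v0, PReg::new(0, RegClass::Int)),
    ]
}
```
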
VRegs, or virtual registers, are specified by an index and a register
class (Float or Int). The classes are not given separately; they are
encoded on every mention of the vreg. (In a sense, the class is an
extra index bit, or part of the register name.) The input function
trait does require the client to provide the exact vreg count,
however.

Implementation note: both vregs and operands are bit-packed into
u32s. This is essential for memory-efficiency. As a result of the
operand bit-packing in particular (including the policy constraints!),
the allocator supports up to 2^21 (2M) vregs per function, and 2^6
(64) physical registers per class. Later we will also see a limit of
2^20 (1M) instructions per function. These limits are considered
sufficient for the anticipated use-cases (e.g., compiling Wasm, which
also has function-size implementation limits); for larger functions,
it is likely better to use a simpler register allocator in any case.

## Reuses and Two-Address ISAs

Some instruction sets primarily have instructions that name only two
registers for a binary operator, rather than three: both registers are
inputs, and the result is placed in one of the registers, clobbering
its original value. The most well-known modern example is x86. It is
thus imperative that we support this pattern well in the register
allocator.

This instruction-set design is somewhat at odds with an SSA
representation, where a value cannot be redefined. Even in non-SSA
code, it is awkward to overwrite a vreg that may need to be used again
later.

Thus, the allocator supports a useful fiction of sorts: the
instruction can be described as if it has three register mentions --
two inputs and a separate output -- and neither input will be
clobbered. The output, however, is special: its register-placement
policy is "reuse input i" (where i == 0 or 1). The allocator
guarantees that the register assignment for that input and the output
will be the same, so the instruction can use that register as its
"modifies" operand. If the input is needed again later, the allocator
will take care of the necessary copying.

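For example, a lowering of `v2 = v0 + v1` to a two-address add could
describe its operands as below (a sketch using the reused-input
constructor from regalloc2's API):

```rust
use regalloc2::{Operand, VReg};

// The def reuses input 0: the allocator assigns v0's operand and v2
// the same register, and copies v0 elsewhere first if v0 lives on.
fn add_operands(v0: VReg, v1: VReg, v2: VReg) -> Vec<Operand> {
    vec![
        Operand::reg_use(v0),
        Operand::reg_use(v1),
        Operand::reg_reuse_def(v2, 0), // "reuse input 0" policy
    ]
}
```
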
We will see below how the allocator makes this work by doing some
preprocessing so that the core allocation algorithms do not need to
worry about this constraint.

Note that some non-SSA clients, such as Cranelift using the
regalloc.rs-to-regalloc2 compatibility shim, will instead generate
their own copies (copying to the output vreg first) and then use "mod"
operand kinds, which allow the output vreg to be both read and
written. regalloc2 works hard to make this as efficient as the
reused-input scheme by treating moves specially (see below).

## SSA

regalloc2 was originally designed to take an SSA IR as input, where
the usual definitions apply: every vreg is defined exactly once, and
every vreg use is dominated by its one def. (Using blockparams means
that we do not need additional conditions for phi-nodes.)

The allocator then evolved to support non-SSA inputs as well. As a
result, the input is maximally flexible right now: it does not check
for and enforce, nor try to take advantage of, the single-def
rule. However, blockparams are still available.

In the future, we hope to change this once compilation of non-SSA
inputs is no longer needed. Specifically, if we can migrate Cranelift
to the native regalloc2 API rather than the regalloc.rs compatibility
shim, we will be able to remove "mod" operand kinds, assume (and
verify) single defs, and take advantage of this when reasoning about
various algorithms in the allocator.

## Block Parameters

Every block can have *block parameters*, and a branch to a block with
block parameters must provide values for those parameters via
operands. When a branch has more than one successor, it provides
separate operands for each possible successor. These block parameters
are equivalent to phi-nodes; we chose this representation because they
are in many ways a more consistent representation of SSA.

To see why we believe block parameters are a slightly nicer design
choice than use of phi nodes, consider: phis are special
pseudoinstructions that must come first in a block, are all defined in
parallel, and whose uses occur on the edge of a particular
predecessor. All of these facts complicate any analysis that scans
instructions and reasons about uses and defs. It is much closer to the
truth to actually put those uses *in* the predecessor, on the branch,
and put all the defs at the top of the block as a separate kind of
def. The tradeoff is that a vreg's def now has two possibilities --
ordinary instruction def or blockparam def -- but this is fairly
reasonable to handle.

## Non-SSA

As mentioned, regalloc2 supports non-SSA inputs as well. No special
flag is needed to place the allocator in this mode or disable SSA
verification. However, we hope to eventually remove this functionality
when it is no longer needed.

## Program Moves

As an especially useful feature for non-SSA IR, regalloc2 supports
special handling of "move" instructions: it will try to merge the
input and output allocations to elide the move altogether.

It turns out that moves are used frequently in the non-SSA input that
we observe from Cranelift via the regalloc.rs compatibility shim. They
are used in three different ways:

- Moves to or from physical registers, used to implement ABI details
  or place values in particular registers required by certain
  instructions.
- Moves between vregs on program edges, as lowered from phi/blockparam
  dataflow in the higher-level SSA IR (CLIF).
- Moves just prior to two-address-form instructions that modify an
  input to form an output: the input is moved to the output vreg to
  avoid clobbering the input.

Note that, strictly speaking, special handling of program moves is
redundant because each of these kinds of uses has an equivalent in the
"native" regalloc2 API:

- Moves to/from physical registers can become operand constraints,
  either on a particular instruction that requires/produces the values
  in certain registers (e.g., a call or ret with args/results in regs,
  or a special instruction with fixed register args), or on a ghost
  instruction at the top of the function that defs vregs for all
  in-reg args.

- Moves between vregs as a lowering of blockparams/phi nodes can be
  replaced with use of regalloc2's native blockparam support.

- Moves prior to two-address-form instructions can be replaced with
  the reused-input mechanism.

Thus, eventually, special handling of program moves should be
removed. However, it is very important for performance at the moment.

## Output

The allocator produces two main data structures as output: an array of
`Allocation`s and a sequence of edits. Some other data, such as
stackmap slot info, is also provided.

### Allocations

The allocator provides an array of `Allocation` values, one per
`Operand`. Each `Allocation` has a kind and an index. The kind may
indicate that this is a physical register or a stack slot, and the
index gives the respective register or slot. All allocations will
conform to the constraints given, and will faithfully preserve the
dataflow of the input program.

### Inserted Moves

In order to implement the necessary movement of data between
allocations, the allocator needs to insert moves at various program
points.

The vector of inserted moves contains tuples that name a program point
and an "edit". The edit is either a move, from one `Allocation` to
another, or else a kind of metadata used by the checker to know which
VReg is live in a given allocation at any particular time. The latter
sort of edit can be ignored by a backend that is just interested in
generating machine code.

Note that the allocator will never generate a move from one stackslot
directly to another, by design. Instead, if it needs to do so, it will
make use of the scratch register. (Sometimes such a move occurs when
the scratch register is already holding a value, e.g. to resolve a
cycle of moves; in this case, it will allocate another spillslot and
spill the original scratch value around the move.)

Thus, the single "edit" type can become either a register-to-register
move, a load from a stackslot into a register, or a store from a
register into a stackslot.

# Data Structures

We now review the data structures that regalloc2 uses to track its
state.

## Program-Derived Alloc-Invariant Data

There are a number of data structures that are computed in a
deterministic way from the input program and then subsequently used
only as read-only data during the core allocation procedure.

### Livein/Liveout Bitsets

The livein and liveout bitsets (`liveins` and `liveouts` on the `Env`)
are allocated one per basic block and record, per block, which vregs
are live entering and leaving that block. They are computed using a
standard backward iterative dataflow analysis and are exact; they do
not over-approximate (this turns out to be important for performance,
and is also necessary for correctness in the case of stackmaps).

### Blockparam Vectors: Source-Side and Dest-Side

The initialization stage scans the input program and produces two
vectors that represent blockparam flows from branches to destination
blocks: `blockparam_ins` and `blockparam_outs`.

These two vectors are the first instance we will see of a recurring
pattern: the vectors contain tuples that are carefully ordered in a
way such that their sort-order is meaningful. "Build a vector lazily
then sort" is a common idiom: it batches the O(n log n) cost into one
operation that the stdlib has aggressively optimized, it provides
dense storage, and it allows for a scan in a certain order that often
lines up with a scan over the program.

In this particular case, we will build vectors of (vreg, block) points
that are meaningful either at the start or end of a block, so that
later, when we scan over a particular vreg's allocations in block
order, we can generate another vector of allocations. One side (the
"outs") also contains enough information that it can line up with the
other side (the "ins") in a later sort.

To make this work, `blockparam_ins` contains a vector of (to-vreg,
to-block, from-block) tuples, and has an entry for every blockparam of
every block. Note that we can compute this without actually observing
from-blocks; we only need to iterate over `block_preds` at any given
block.

Then, `blockparam_outs` contains a vector of (from-vreg, from-block,
to-block, to-vreg) tuples, and has an entry for every parameter on
every branch that ends a block. There is exactly one "out" tuple for
every "in" tuple. As mentioned above, we will later scan over both to
generate moves.

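Sketched as types (illustrative): field order *is* the sort key, so a
derived lexicographic ordering yields exactly the scan order described:

```rust
// One entry per blockparam per predecessor edge.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct BlockparamIn {
    to_vreg: u32,
    to_block: u32,
    from_block: u32,
}

// One entry per argument on each branch; pairs 1:1 with the "ins".
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct BlockparamOut {
    from_vreg: u32,
    from_block: u32,
    to_block: u32,
    to_vreg: u32,
}
```
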
### Program-Move Vectors: Source-Side and Dest-Side

Similar to blockparams, we handle moves specially. In fact, we ingest
all moves in the input program into a set of vectors -- "move sources"
and "move dests", analogous to the "ins" and "outs" blockparam vectors
described above -- and then completely ignore the moves in the program
thereafter. The semantics of the API are such that all program moves
will be recreated with regalloc-inserted edits, and should not still
be emitted after regalloc. This may seem inefficient, but in fact it
allows for better code because it integrates program-moves with the
move resolution that handles other forms of vreg movement. We
previously took the simpler approach of handling program-moves as
opaque instructions with a source and dest, and we found that there
were many redundant move-chains (A->B, B->C) that are eliminated when
everything is handled centrally.

We also construct a `prog_move_merges` vector of live-range index pairs
to attempt to merge when we reach that stage of allocation.

## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs

We now come to the core data structures: live-ranges, bundles, virtual
registers and their state, and physical registers and their state.

First we must define a `ProgPoint` precisely: a `ProgPoint` is an
instruction index and a `Before` or `After` suffix. We pack the
before/after suffix into the LSB of a `u32`, so a `ProgPoint` can be
incremented and compared as a simple integer.

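A sketch of this packing (close to, but not necessarily identical to,
the real definition):

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ProgPoint(u32); // (inst << 1) | (0 = Before, 1 = After)

impl ProgPoint {
    fn before(inst: u32) -> Self { ProgPoint(inst << 1) }
    fn after(inst: u32) -> Self { ProgPoint((inst << 1) | 1) }
    // Adding one alternates Before -> After and then advances to the
    // next instruction: the "increment as a simple integer" property.
    fn next(self) -> Self { ProgPoint(self.0 + 1) }
}
```
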
A live-range is a contiguous range of program points (half-open,
i.e. including `from` and excluding `to`) for which a particular vreg
is live with a value.

A live-range contains a vector of uses. Each use contains four parts:
the Operand word (directly copied, so there is no need to dereference
it); the ProgPoint at which the use occurs; the operand slot on that
instruction, if any, that the operand comes from; and the use's
"weight". (It's possible to have "ghost uses" that do not derive from
any slot on the instruction.) These four parts are packed into three
`u32`s: the slot can fit in 8 bits, and the weight in 16.

The live-range carries its program-point range, uses, vreg index,
bundle index (see below), and some metadata: spill weight and
flags. The spill weight is the sum of weights of each use. The flags
set currently carries one flag only: whether the live-range starts at
a Def-kind operand. (This is equivalent to whether the range consumes
a value at its start or not.)

Uses are owned only by live-ranges and have no separate identity, but
live-ranges live in a toplevel array and are known by `LiveRangeIndex`
values throughout the allocator. New live-ranges can be created
(e.g. during splitting); old ones are not cleaned up, but rather, all
state is bulk-freed at the end.

Live-ranges are aggregated into "bundles". A bundle is a collection of
ranges that do not overlap. Each bundle carries: a vector (inline
SmallVec) of (range, live-range index) tuples, an allocation (starts
as "none"), a "spillset" (more below), and some metadata, including a
spill weight (sum of ranges' weights), a priority (sum of ranges'
lengths), and three property flags: "minimal", "contains fixed
constraints", "contains stack constraints".

VRegs also contain their vectors of live-ranges, in the same form as a
bundle does (inline SmallVec that has inline (from, to) range bounds
and range indices).

There are two important overlap invariants: (i) no liveranges within a
bundle overlap, and (ii) no liveranges within a vreg overlap. These
are extremely important and we rely on them implicitly in many places.

The live-range vectors in bundles and vregs, and use-vectors in ranges,
have various sorting invariants as well. These invariants differ
according to the phase of the allocator's computation. First, during
live-range construction, live-ranges are placed into vregs in reverse
order (because the computation is a reverse scan) and uses into ranges
in reverse order; these are sorted into forward order at the end of
live-range computation. When bundles are first constructed, their
range vectors are sorted, and they remain so for the rest of allocation,
as we need for interference testing. However, as ranges are created
and split, sortedness of vreg ranges is *not* maintained; they are
sorted once more, in bulk, when allocation is done and we start to
resolve moves.

Finally, we have physical registers. The main data associated with
each is the allocation map. This map is a standard BTree, indexed by
ranges (`from` and `to` ProgPoints) and yielding a LiveRange for each
location range. The ranges have a custom comparison operator defined
that compares equal for any overlap.

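A sketch of such a key type, in the spirit of the real one: equality
(and `Ordering::Equal`) means "overlaps", so one probe answers whether
a candidate range is free:

```rust
use std::cmp::Ordering;

// Half-open range [from, to) in program-point index space.
#[derive(Clone, Copy)]
struct LiveRangeKey {
    from: u32,
    to: u32,
}

impl PartialEq for LiveRangeKey {
    fn eq(&self, other: &Self) -> bool {
        // "Equal" is defined as any overlap.
        self.from < other.to && other.from < self.to
    }
}
impl Eq for LiveRangeKey {}

impl Ord for LiveRangeKey {
    fn cmp(&self, other: &Self) -> Ordering {
        if self.to <= other.from {
            Ordering::Less
        } else if other.to <= self.from {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    }
}
impl PartialOrd for LiveRangeKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```
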
This comparison operator allows us to determine whether a range is
free, i.e. has no overlap with a particular range, in one probe -- the
btree will not contain a match. However, it makes iteration over *all*
overlapping ranges somewhat tricky to get right. Notably, Rust's
BTreeMap does not guarantee that the lookup result will be the *first*
equal key, if multiple keys are equal to the probe key. Thus, when we
want to enumerate all overlapping ranges, we probe with a range that
consists of the single program point *before* the start of the actual
query range, using the API that returns an iterator over a range in
the BTree, and then iterate through the resulting iterator to gather
all overlapping ranges (which will be contiguous).

## Spill Bundles

It is worth describing "spill bundles" separately. Every spillset (see
below; a group of bundles that originated from one bundle) optionally
points to a single bundle that we designate the "spill bundle" for
that spillset. Contrary to the name, this bundle is not
unconditionally spilled. Rather, one can see it as a sort of fallback:
it is where liveranges go when we give up on processing them via the
normal backtracking loop, and we will process them only once more, in
the "second-chance" stage.

This fallback behavior implies that the spill bundle must always be
able to accept a spillslot allocation, i.e., it cannot require a
register. This invariant is what allows spill bundles to be processed
in a different way, after backtracking has completed.

The spill bundle acquires liveranges in two ways. First, as we split
bundles, we will trim the split pieces in certain ways so that some
liveranges are immediately placed in the spill bundle. Intuitively,
the "empty" regions that just carry a value, but do not satisfy any
operands, should be in the spill bundle: it is better to have a single
consistent location for the value than to move it between lots of
different split pieces without using it, as moves carry a cost.

Second, the spill bundle acquires the liveranges of a bundle that has
no requirement to be in a register when that bundle is processed, but
only if the spill bundle already exists. In other words, we won't
create a second-chance spill bundle just for a liverange with an "Any"
use; but if it was already forced into existence by splitting and
trimming, then we might as well use it.

Note that unlike other bundles, a spill bundle's liverange vector
remains unsorted until we do the second-chance allocation. This allows
quick appends of more liveranges.

## Allocation Queue

The allocation queue is simply a priority queue (built with a binary
max-heap) of (prio, bundle-index) tuples.

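A minimal sketch:

```rust
use std::collections::BinaryHeap;

fn demo() {
    // Max-heap of (priority, bundle index); tuples compare
    // lexicographically, so priority dominates.
    let mut queue: BinaryHeap<(u32, u32)> = BinaryHeap::new();
    queue.push((10, 3));
    queue.push((25, 7));
    assert_eq!(queue.pop(), Some((25, 7))); // highest priority first
}
```
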
## Spillsets and Spillslots

Every bundle contains a reference to a spillset. Spillsets are used to
assign spillslots near the end of allocation, but before then, they
are also a convenient place to store information that is common among
*all bundles* that share the spillset. In particular, spillsets are
initially assigned 1-to-1 to bundles after all bundle-merging is
complete; so spillsets represent in some sense the "original bundles",
and as splitting commences, the smaller bundle-pieces continue to
refer to their original spillsets.

We stash some useful information on the spillset because of this: a
register hint, used to create some "stickiness" between pieces of an
original bundle that are assigned separately after splitting; the
spill bundle; the common register class of all vregs in this bundle;
the vregs whose liveranges are contained in this bundle; and then some
information actually used if this is spilled to the stack (`required`
indicates actual stack use; `size` is the spillslot count; `slot` is
the actual stack slot).

Spill *sets* are later allocated to spill *slots*. Multiple spillsets
can be assigned to one spillslot; the only constraint is that
spillsets assigned to a spillslot must not overlap. When we look up
the allocation for a bundle, if the bundle is not given a specific
allocation (its `alloc` field is `Allocation::none()`), this means it
is spilled, and we traverse to the spillset then spillslot.

## Other: Fixups, Stats, Debug Annotations

There are a few fixup vectors that we will cover in more detail
later. Of particular note is the "multi-fixed-reg fixup vector": this
handles instructions that constrain the same input vreg to multiple,
different, fixed registers for different operands at the same program
point. The only way to satisfy such a set of constraints is to
decouple all but one of the inputs (make them no longer refer to the
vreg) and then later insert copies from the first fixed use of the
vreg to the other fixed regs.

The `Env` also carries a statistics structure with counters that are
incremented, which can be useful for evaluating the effects of
changes; and a "debug annotations" hashmap from program point to
arbitrary strings that is filled out with various useful diagnostic
information if enabled, so that an annotated view of the program with
its liveranges, bundle assignments, inserted moves, merge and split
decisions, etc. can be viewed.

# Allocation Pipeline

We now describe the pipeline that computes register allocations.

## Live-range Construction

The first step in performing allocation is to analyze the input
program to understand its dataflow: that is, the ranges during which
virtual registers must be assigned to physical registers. Computing
these ranges is what allows us to do better than a trivial "every vreg
lives in a different location, always" allocation.

We compute precise liveness first using an iterative dataflow
algorithm with BitVecs. (See below for our sparse chunked BitVec
description.) This produces the `liveins` and `liveouts` vectors of
BitVecs per block.

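A minimal sketch of that fixpoint, with `HashSet` standing in for the
chunked bitvecs (illustrative only): `uses` holds vregs read before
any redefinition in a block, and `defs` holds vregs written in it.

```rust
use std::collections::HashSet;

fn liveness(
    nblocks: usize,
    succs: &[Vec<usize>],
    uses: &[HashSet<u32>],
    defs: &[HashSet<u32>],
) -> (Vec<HashSet<u32>>, Vec<HashSet<u32>>) {
    let mut live_in = vec![HashSet::new(); nblocks];
    let mut live_out = vec![HashSet::new(); nblocks];
    let mut changed = true;
    while changed {
        changed = false;
        for b in (0..nblocks).rev() {
            // live-out[b] = union of successors' live-ins.
            let mut out: HashSet<u32> = HashSet::new();
            for &s in &succs[b] {
                out.extend(live_in[s].iter().copied());
            }
            // live-in[b] = uses[b] ∪ (live-out[b] − defs[b]).
            let mut inn: HashSet<u32> = uses[b].clone();
            inn.extend(out.difference(&defs[b]).copied());
            if inn != live_in[b] || out != live_out[b] {
                live_in[b] = inn;
                live_out[b] = out;
                changed = true;
            }
        }
    }
    (live_in, live_out)
}
```
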
We then perform a single pass over blocks in reverse order, and scan
instructions in each block in reverse order. Why reverse order? We
must see instructions within a block in reverse to properly compute
liveness (a value is live backward from a use to a def). Because we
want to keep liveranges in-order as we build them, to enable
coalescing, we visit blocks in reverse order as well, so overall this
is simply a scan over the whole instruction index space in reverse
order.

For each block, we perform a scan with the following state:

- A liveness bitvec, initialized at the start from `liveouts`.
- A vector of live-range indices, with one entry per vreg, initially
  "invalid" (this vector is allocated once and reused at each block).
- An in-progress vector of live-range indices per vreg in the vreg
  state, in *reverse* order (we will reverse it when we're done).

A vreg is live at the current point in the scan if its bit is set in
the bitvec; its entry in the vreg-to-liverange vec may be stale, but
if the bit is not set, we ignore it.

We initially create a liverange for all vregs that are live out of the
block, spanning the whole block. We will trim this below if it is
locally def'd and does not pass through the block.

For each instruction, we process its effects on the scan state:

- For all clobbers (which logically happen at the end of the
  instruction), add a single-program-point liverange to each clobbered
  preg.

- If not a move:
  - for each program point [after, before], for each operand at
    this point(\*):
    - if a def or mod:
      - if not currently live, this is a dead def; create an empty
        LR.
      - if a def:
        - set the start of the LR for this vreg to this point.
        - set as dead.
    - if a use:
      - create LR if not live, with start at beginning of block.

- Else, if a move:
  - simple case (no pinned vregs):
    - add to `prog_move` data structures, and update LRs as above.
    - the effective point for the use is *after* the move, and for the
      def is *before* the *next* instruction. Why not the more
      conventional use-before, def-after? Because this allows the move
      to happen in parallel with other moves that the move-resolution
      inserts (between split fragments of a vreg); these moves always
      happen at the gaps between instructions. We place it after, not
      before, because before may land at a block-start and interfere
      with edge moves, while after is always a "normal" gap (a move
      cannot end a block).
  - otherwise: see below (pinned vregs).

(\*) An instruction operand's effective point is adjusted in a few
cases. If the instruction is a branch, its uses (which are
blockparams) are extended to the "after" point. If there is a reused
input, all *other* inputs are extended to "after": this ensures proper
interference (as we explain more below).

We then treat blockparams as defs at the end of the scan (beginning of
the block), and create the "ins" tuples. (The uses for the other side
of the edge are already handled as normal uses on a branch
instruction.)

### Optimization: Pinned VRegs and Moves

In order to efficiently handle the translation from the regalloc.rs
API, which uses named RealRegs that are distinct from VirtualRegs
rather than operand constraints, we need to implement a few
optimizations. The translation layer translates RealRegs as particular
vregs at the regalloc2 layer, because we need to track their liveness
properly. Handling these as "normal" vregs, with massive bundles of
many liveranges throughout the function, turns out to be a very
inefficient solution. So we mark them as "pinned" with a hook in the
RA2 API. Semantically, this means they are always assigned to a
particular preg whenever mentioned in an operand (but *NOT* between
those points; it is possible for a pinned vreg to move all about
registers and stackslots as long as it eventually makes it back to its
home preg in time for its next use).

This has a few implications during liverange construction. First, when
we see an operand that mentions a pinned vreg, we translate this to an
operand constraint that names a fixed preg. Later, when we build
bundles, we will not create a bundle for the pinned vreg; instead we
will transfer its liveranges directly as unmoveable reservations in
pregs' allocation maps. Finally, we need to handle moves specially.

With the caveat that "this is a massive hack and I am very very
sorry", here is how it works. A move between two pinned vregs is easy:
we add that to the inserted-moves vector right away because we know the
Allocation on both sides. A move from a pinned vreg to a normal vreg
is the first interesting case. In this case, we (i) create a ghost def
with a fixed-register policy on the normal vreg, doing the other
liverange-maintenance bits as above, and (ii) adjust the liveranges on
the pinned vreg (so the preg) in a particular way. If the preg is live
flowing downward, then this move implies a copy, because the normal
vreg and the pinned vreg are both used in the future and cannot
overlap. But we cannot keep the preg continuously live, because at
exactly one program point, the normal vreg is pinned to it. So we cut
the downward-flowing liverange just *after* the normal vreg's
fixed-reg ghost def. Then, whether it is live downward or not, we
create an upward-flowing liverange on the pinned vreg that ends just
*before* the ghost def.

The move-from-normal-to-pinned case is similar. First, we create a
ghost use on the normal vreg that pins its value at this program point
to the fixed preg. Then, if the preg is live flowing downward, we trim
its downward liverange to start just after the fixed use.

There are also some tricky metadata-maintenance records that we emit
so that the checker can keep this all straight.

The outcome of this hack, together with the operand-constraint
translation on normal uses/defs/mods on pinned vregs, is that we
essentially translate regalloc.rs's means of referring to real
registers to regalloc2's preferred abstractions by doing a bit of
reverse-engineering. It is not perfect, but it works. Still, we hope
to rip it all out once we get rid of the need for the compatibility
shim.

### Handling Reused Inputs

Reused inputs are also handled a bit specially. We have already
described how we essentially translate the idiom so that the output's
allocation is used for input and output, and there is a move just
before the instruction that copies the actual input (which will not be
clobbered) to the output. Together with an attempt to merge the
bundles for the two, to elide the move if possible, this works
perfectly well as long as we ignore all of the other inputs.

But we can't do that: we have to ensure that other inputs' allocations
are correct too. Note that using the output's allocation as the input
is actually potentially incorrect if the output is at the After point
and the input is at the Before: the output might share a register with
one of the *other* (normal, non-reused) inputs if that input's vreg
were dead afterward. This would mean that we clobber the other input.

So, to get the interference right, we *extend* all other (non-reused)
inputs of an instruction with a reused input to the After point. This
ensures that the other inputs are *not* clobbered by the slightly
premature use of the output register.

The source has a link to a comment in IonMonkey that implies that it
uses a similar solution to this problem, though it's not entirely
clear.

(This odd dance, like many of the others above and below, is "written
in fuzzbug failures", so to speak. It's not entirely obvious until one
sees the corner case where it's necessary!)

## Bundle Merging

Once we have built the liverange vectors for every vreg, we can reverse
these vectors (recall, they were built in strict reverse order) and
initially assign one bundle per (non-pinned) vreg. We then try to
merge bundles together as long as we find pairs of bundles that do not
overlap and that (heuristically) make sense to merge.

Note that this is the only point in the allocation pipeline where
bundles get larger. We initially merge as large as we dare (but not
too large, because then we'll just cause lots of conflicts and
splitting later), and then try out assignments, backtrack via
eviction, and split continuously to chip away at the problem until we
have a working set of allocation assignments.

We attempt to merge three kinds of bundle pairs: reused-input to
corresponding output; across program moves; and across blockparam
assignments.

To merge two bundles, we traverse over both their sorted liverange
vectors at once, checking for overlaps. Note that we can do this without
pointer-chasing to the liverange data; the (from, to) range is in the
liverange vector itself.

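That traversal is a linear merge-style walk over the two sorted
vectors (sketch):

```rust
// Ranges are half-open (from, to) pairs; both slices sorted by start.
fn bundles_overlap(a: &[(u32, u32)], b: &[(u32, u32)]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].1 <= b[j].0 {
            i += 1; // a[i] ends before b[j] starts
        } else if b[j].1 <= a[i].0 {
            j += 1; // b[j] ends before a[i] starts
        } else {
            return true; // ranges intersect
        }
    }
    false
}
```
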
We also check whether the merged bundle would have conflicting
requirements (see below for more on requirements). We do a coarse
check first, checking 1-bit flags that indicate whether either bundle
has any fixed-reg constraints or stack-only constraints. If so, we
need to do a detailed check by actually computing merged requirements
on both sides, merging, and checking for Conflict (the lattice bottom
value). If no conflict, we merge.

A performance note: merging is extremely performance-sensitive, and it
turns out that a mergesort-like merge of the liverange vectors is too
expensive, partly because it requires allocating a separate result
vector (in-place merge in mergesort is infamously complex). Instead,
we simply append one vector onto the end of the other and invoke
Rust's builtin sort. We could special-case "one bundle is completely
before the other", but we currently don't do that (performance idea!).

Once all bundles are merged as far as they will go, we compute cached
bundle properties (priorities and weights) and enqueue them on the
priority queue for allocation.

## Recurring: Bundle Property Computation

The core allocation loop is a recurring iteration of the following: we
take the highest-priority bundle from the allocation queue; we compute
its requirements; we try to find it a register according to those
requirements; if no fit, we either evict some other bundle(s) from
their allocations and try again, or we split the bundle and put the
parts back on the queue. We record all the information we need to make
the evict-or-split decision (and where to split) *during* the physical
register allocation-map scans, so we don't need to go back again to
compute that.

Termination is nontrivial to see, because of eviction. How do we
guarantee we don't get into an infinite loop where two bundles fight
over a register forever? In fact, this can easily happen if there is a
bug; we fixed many fuzzbugs like this, and we have a check for
"infinite loop" based on an upper bound on iterations. But if the
allocator is correct, it should never happen.

Termination is guaranteed because (i) bundles always get smaller, (ii)
eviction only occurs when a bundle is *strictly* higher weight (not
higher-or-equal), and (iii) once a bundle gets down to its "minimal"
size, it has an extremely high weight that is guaranteed to evict any
non-minimal bundle. A minimal bundle is one that covers only one
instruction. As long as the input program does not have impossible
constraints that require more than one vreg to exist in one preg, an
allocation problem of all minimal bundles will always have a solution.

## Bundle Processing

Let's now talk about what happens when we take a bundle off the
allocation queue. The three basic outcomes are: allocate; split and
requeue; or evict and try again immediately (and eventually allocate
or split/requeue).

### Properties: Weight, Priority, and Requirements

To process a bundle, we have to compute a few properties. In fact we
will have already computed a few of these beforehand, but we describe
them all here.

- Priority: a bundle's priority determines the order in which it is
  considered for allocation. RA2 defines it as the sum of the lengths
  (in instruction index space) of each liverange. This causes the
  allocator to consider larger bundles first, when the allocation maps
  are generally more free; they can always be evicted and split later.

- Weight: a bundle's weight indicates how important (in terms of
  runtime) its uses/register mentions are. In an approximate sense,
  inner loop bodies create higher-weight uses. Fixed register
  constraints add some weight, and defs add some weight. Finally,
  weight is divided by priority, so a very large bundle that happens
  to have a few important uses does not uniformly exert its weight
  across its entire range. This has the effect of causing bundles to
  be more important (more likely to evict others) the more they are
  split.

- Requirement: a bundle's requirement is a value in a lattice that we
  have defined, where top is "Unknown" and bottom is
  "Conflict". Between these two, we have: any register (of a class);
  any stackslot (of a class); a particular register. "Any register"
  can degrade to "a particular register", but any other pair of
  different requirements meets to Conflict. Requirements are derived
  from the operand constraints for all uses in all liveranges in a
  bundle, and then merged with the lattice meet-function.

The lattice is as follows (diagram simplified to remove multiple
classes and multiple fixed registers which parameterize nodes; any two
differently-parameterized values are unordered with respect to each
other):

```plain
      ___Unknown_____
     |       |       |
     |       |       |
     |  ____Any(rc)  |
     |/      |       |
  Stack(rc)  FixedReg(reg)
        \       /
        Conflict
```

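The corresponding meet function, sketched below with the class and
register parameters reduced to plain integers (the real implementation
also checks that classes match; differently-parameterized values go to
`Conflict`):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Requirement {
    Unknown,
    Any(u8),      // rc: may degrade to a more specific requirement
    Stack(u8),    // rc
    FixedReg(u8), // reg
    Conflict,
}

fn meet(a: Requirement, b: Requirement) -> Requirement {
    use Requirement::*;
    match (a, b) {
        (Unknown, r) | (r, Unknown) => r,
        (x, y) if x == y => x,
        // `Any` degrades to the more specific side.
        (Any(_), r) | (r, Any(_)) => r,
        // Any other pair of different requirements conflicts.
        _ => Conflict,
    }
}
```
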
Once we have the Requirement for a bundle, we can decide what to do.

### No-Register-Required Cases

If the requirement indicates that no register is needed (`Unknown` or
`Any`, i.e. a register or stack slot would be OK), *and* if the spill
bundle already exists for this bundle's spillset, then we move all the
liveranges over to the spill bundle, as described above.

If the requirement indicates that the stack is needed explicitly
(e.g., for a safepoint), we set our spillset as "required" (this will
cause it to allocate a spillslot) and return; because the bundle has
no other allocation set, it will look to the spillset's spillslot by
default.

If the requirement indicates a conflict, we immediately split and
requeue the split pieces. This split is performed at the point at
which the conflict is first introduced, i.e. just before the first use
whose requirement, when merged into the requirement for all prior uses
combined, goes to `Conflict`. In this way, we always guarantee forward
progress. Note also that a bundle can reach this stage with a
conflicting requirement only if the original liverange had conflicting
uses (e.g., a liverange from a def in a register to a use on stack, or
a liverange between two different fixed-reg-constrained operands); our
bundle merging logic explicitly avoids merging two bundles if it would
create a conflict.

### Allocation-Map Probing

If we did not immediately dispose of the bundle as described above,
then we *can* use a register (either `Any`, which accepts a register
as one of several options, or `Reg`, which must have one, or `Fixed`,
which must have a particular one).

We determine which physical registers' allocation maps we will probe,
and in what order. If a particular fixed register is required, we
probe only that register. Otherwise, we probe all registers in the
required class.

The order in which we probe, if we are not constrained to a single
register, is carefully chosen. First, if there is a hint register from
the spillset (this is set by the last allocation into a register of
any other bundle in this spillset), we probe that. Then, we probe all
preferred registers; then all non-preferred registers.

For each of the preferred and non-preferred register sequences, we
probe in an *offset* manner: we start at some index partway through
the sequence, determined by some heuristic number that is random and
well-distributed. (In practice, we use the sum of the bundle index and
the instruction index of the start of the first range in the bundle.)
We then march through the sequence and wrap around, stopping before we
hit our starting point again.

The purpose of this offset is to distribute the contention and speed
up the allocation process. In the common case where there are enough
registers to hold values without spilling (for small functions), we
are more likely to choose a free register right away if we throw the
dart at random than if we start *every* probe at register 0, in
order. This has a large allocation performance impact in practice.

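A sketch of the offset probe order (hypothetical helper; `R: Copy`
stands in for `PReg`):

```rust
// Yield each register once, starting at a heuristic offset into the
// sequence and wrapping around.
fn probe_order<R: Copy>(
    regs: &[R],
    bundle_idx: usize,
    first_inst: usize,
) -> impl Iterator<Item = R> + '_ {
    let n = regs.len().max(1);
    let start = (bundle_idx + first_inst) % n;
    (0..regs.len()).map(move |i| regs[(start + i) % n])
}
```
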
For each register in probe order, we probe the allocation map, and
gather, simultaneously, several results: (i) whether the entire range
is free; (ii) if not, the vector of all conflicting bundles, *and* the
highest weight among those bundles; (iii) if not, the *first* conflict
point.

We do this by iterating over all liveranges in the preg's btree that
overlap with each range in the current bundle. This iteration is
somewhat subtle due to multiple "equal" keys (see above where we
describe the use of the btree). It is also adaptive for performance
reasons: it initially obtains an iterator into the btree corresponding
to the start of the first range in the bundle, and concurrently
iterates through both the btree and the bundle. However, if there is a
large gap in the bundle, this might require skipping many irrelevant
entries in the btree. So, if we skip too many entries (heuristically,
16, right now), we do another lookup from scratch in the btree for the
start of the next range in the bundle. This balances between the two
cases: dense bundle, where O(1) iteration through the btree is faster,
and sparse bundle, where O(log n) lookup for each entry is better.

### Decision: Allocate, Evict, or Split

First, the "allocate" case is easy: if, during our register probe
loop, we find a physical register whose allocations do not overlap
this bundle, then we allocate this register; done!

If not, then we need to decide whether to evict some conflicting
bundles and retry, or to split the current bundle into smaller pieces
that may have better luck.

A bit about our split strategy first: contrary to the IonMonkey
allocator which inspired much of our design, we do *not* have a list
of split strategies that split one bundle into many pieces at
once. Instead, each iteration of the allocation loop splits at most
*once*. This simplifies the splitting code greatly, but also turns out
to be a nice heuristic: we split at the point that the bundle first
encounters a conflict for a particular preg assignment, then we hint
that preg for the first (pre-conflict) piece when we retry. In this
way, we always make forward progress -- one piece of the bundle is
always allocated -- and splits are informed by the actual situation at
hand, rather than best guesses. Also note that while this may appear
at first to be a greedy algorithm, it still allows backtracking: the
first half of the split bundle, which we *can* now assign to a preg,
does not necessarily remain on that preg forever (it can still be
evicted later). It is just a split that is known to make at least one
part of the allocation problem solvable.

To determine whether to split or evict, we track our best options: as
we probe, we track the "lowest cost eviction option", which is a set
of bundles and the maximum weight in that set of bundles. We also
track the "lowest cost split option", which is the cost (more below),
the point at which to split, and the register for this option.

For each register we probe, if there is a conflict but none of the
conflicts are fixed allocations, we receive a vector of bundles that
conflicted, and also, separately, the first conflicting program
point. We update the lowest-cost eviction option if the cost (max
weight) of the conflicting bundles is less than the current best. We
update the lowest-cost split option if the cost is less as well,
according to the following definition of cost: a split's cost is the
cost of its move, as defined by the weight of a normal def operand at
the split program point, plus the cost of all bundles beyond the split
point (which will still be conflicts even after the split).

If there is a conflict with a fixed allocation, then eviction is not
an option, but we can still compute the candidate split point and cost
in the same way as above.

Finally, as an optimization, we pass in the current best cost to the
btree probe inner loop; if, while probing, we have already exceeded
the best cost, we stop early (this improves allocation time without
affecting the result).

Once we have the best cost for the split and evict options, we split
if (i) the bundle is not already a minimal bundle, and (ii) we've
already evicted once in this toplevel iteration without success, or
the weight of the current bundle is less than the eviction cost. We
then requeue *both* resulting halves of the bundle with the preg that
resulted in this option as the register hint. Otherwise, we evict all
conflicting bundles and try again.

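In pseudocode, the decision reads roughly as follows (sketch):

```rust
enum Decision {
    Split,
    Evict,
}

// `evict_cost` is the maximum weight among conflicting bundles at the
// best eviction option. Minimal bundles are never split further.
fn decide(
    bundle_weight: u64,
    evict_cost: u64,
    is_minimal: bool,
    evicted_already: bool,
) -> Decision {
    if !is_minimal && (evicted_already || bundle_weight < evict_cost) {
        Decision::Split
    } else {
        Decision::Evict
    }
}
```
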
Note that the split cost does not actually play into the above (split
vs. evict) decision; it is only used to choose *which* split is
best. This is equivalent to saying: we never evict if the current
bundle is less important than the evicted bundles, even if the split
is more expensive still. This is important for forward progress, and
the case where the split would be even more expensive should be very
very rare (it would have to come from a costly move in the middle of
an inner loop).

### How to Split

The actual split procedure is fairly simple. We are given a bundle and
a split-point. We create a new bundle to take on the second half
("rest") of the original. We find the point in the liverange vector
that corresponds to the split, and distribute appropriately. If the
split-point lands in the middle of a liverange, then we split that
liverange as well.

In the case that a new liverange is created, we add the liverange to
the corresponding vreg liverange vector as well. Note that, as described
above, the vreg's liverange vector is unsorted while splitting is
occurring (because we do not need to traverse it or do any lookups
during this phase); so we just append.

The splitting code also supports a "minimal split", in which it simply
peels off the first use. This is used to ensure forward progress when
a bundle has conflicting requirements within it (see above).

#### Spill Bundle and Splitting

Once a split occurs, however, it turns out that we can improve results
by doing a little cleanup. Once we distribute a bundle's liveranges
across two half-bundles, we postprocess by trimming a bit.

In particular, if we see that the "loose ends" around the split point
extend beyond uses, we will create and move ranges to a spill
bundle. That is: if the last liverange in the first-half bundle
extends beyond its last use, we trim that part off into an empty (no
uses) liverange and place that liverange in the spill
bundle. Likewise, if the first liverange in the second-half bundle
starts before its first use, we trim that part off into an empty
liverange and place it in the spill bundle.

This is, empirically, an improvement: it reduces register contention
and makes splitting more effective. The intuition is twofold: (i) it
is better to put all of the "flow-through" parts of a vreg's liveness
into one bundle that is never split, and can be spilled to the stack
if needed, to avoid unnecessary moves; and (ii) if contention is high
enough to cause splitting, it is more likely there will be an actual
stack spill, and if this is the case, it is better to do the store
just after the last use and reload just before the first use of the
respective bundles.

Unfortunately, this heuristic choice does interact somewhat poorly
with program moves: moves between two normal (non-pinned) vregs do not
create ghost uses or defs, and so these points of the ranges can be
spilled, turning a normal register move into a move from or to the
stack. However, empirically, we have found that adding such ghost
uses/defs actually regresses some cases as well, because it pulls
values back into registers when we could have had a stack-to-stack
move (that might even be a no-op if within the same spillset); overall,
it seems better to trim. It also improves allocation performance by
reducing contention in the registers during the core loop (before
second-chance allocation).

## Second-Chance Allocation: Spilled Bundles

Once the main allocation loop terminates, when all bundles have either
been allocated or punted to the "spilled bundles" vector, we do
second-chance allocation. This is a simpler loop that never evicts and
never splits. Instead, each bundle gets one second chance, in which it
can probe pregs and attempt to allocate. If it fails, it will actually
live on the stack.

This is correct because we are careful to only place bundles on the
spilled-bundles vector that are *allowed* to live on the
stack. Specifically, only the canonical spill bundles (which will
contain only empty ranges) and other bundles that have an "any" or
"unknown" requirement are placed here (but *not* "stack" requirements;
those *must* be on the stack, so do not undergo second-chance
allocation).

At the end of this process, we have marked spillsets as required
whenever at least one bundle in the spillset actually requires a stack
slot. We can then allocate slots to the spillsets.

## Spillslot Allocation

We must allocate space on the stack, denoted by an abstract index
space, to each spillset that requires it, and for the liveranges in
which it requires it.

To facilitate this, we keep a btree per spillslot in the same way we
do per preg. We will allocate spillsets to slots in a way that avoids
interference.

Note that we actually overapproximate the required ranges for each
|
|
spillset in order to improve the behavior of a later phase (redundant
|
|
move elimination). Specifically, when we allocate a slot for a
|
|
spillset, we reserve that slot for *all* of the liveranges of *every*
|
|
vreg that is assigned to that spillset (due to merging rules that
|
|
initially merge one-vreg bundles into final merged bundles, there will
|
|
be no overlaps here). In other words, we rule out interleaving of
|
|
completely different values in the same slot, though bundle merging
|
|
does mean that potentially many (non-interfering) vregs may share
|
|
it. This provides the important property that if a vreg has been
|
|
reloaded, but not modified, its spillslot *still contains the
|
|
up-to-date value* (because the slot is reserved for all liveranges of
|
|
the vreg). This enables us to avoid another store to the spillslot
|
|
later if there is another spilled range.
|
|
|
|
We perform probing in a way that is somewhat different than for
|
|
registers, because the spillslot space is conceptually infinite. We
|
|
can thus optimize for slightly better allocation performance by giving
|
|
up and allocating a new slot at any time.

For each size class, we keep a linked list of slots. When we need to
allocate a spillset to a slot, we traverse down the list and try a
fixed number of slots. If we find one that fits the spillset's ranges,
we allocate, and we remove the slot from its current place in the list
and append it to the end. In this way, it is deprioritized from
probing "for a while", which tends to reduce contention. This is a
simple way to round-robin between slots. If we don't find one that
fits after a fixed number of probes, we allocate a new slot.
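
The following is a minimal sketch of this bounded round-robin probe.
The types, the probe budget, and the interference check (abstracted
here as a `fits` closure over the slot's btree of reservations) are
illustrative stand-ins, not regalloc2's actual interface:

```rust
use std::collections::VecDeque;

type SpillSlotIndex = usize;

/// Probe at most a fixed number of slots in one size class's list.
fn allocate_slot(
    list: &mut VecDeque<SpillSlotIndex>,
    next_fresh: &mut SpillSlotIndex,
    fits: impl Fn(SpillSlotIndex) -> bool,
) -> SpillSlotIndex {
    const MAX_PROBES: usize = 10; // assumption: small fixed budget
    for i in 0..MAX_PROBES.min(list.len()) {
        let slot = list[i];
        if fits(slot) {
            // Move the chosen slot to the back so later allocations
            // deprioritize it "for a while": cheap round-robin.
            list.remove(i);
            list.push_back(slot);
            return slot;
        }
    }
    // Nothing fit within the probe budget; the spillslot space is
    // conceptually infinite, so give up and mint a fresh slot.
    let slot = *next_fresh;
    *next_fresh += 1;
    list.push_back(slot);
    slot
}
```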

And with that, we have valid allocations for all vregs at all points
where they are live! Now we just need to modify the program to reify
these choices.

## Allocation Assignment

The first step in reifying the allocation is to iterate through all
mentions of a vreg and fill in the resulting `Allocation` array with
the appropriate allocations. We do this by simply traversing the
liveranges of each vreg, looking up the allocation from the bundle
(falling back to the spillset if the bundle has no specific
allocation), and, for each use, filling in the slot according to the
saved progpoint/slot info in the use data.

## Move Generation

The more difficult half of the reification step is generating the
*moves* that will put the values in the right spots.

There are two sources of moves that we must generate. The first is
moves between different ranges of the same vreg, as the split pieces
of that vreg's original bundle may have been assigned to different
locations. The second is moves that result from move semantics in the
input program: either assignments from blockparam args on branches to
the target block's params, or program move instructions. (Recall that
we reify program moves in a unified way with all other moves, so the
client should not generate any machine code for their original moves
in the pre-allocation program.)

Moves are tricky to handle efficiently because they join two
potentially very different locations in the program (in the case of
control-flow-edge moves). In order to avoid the need for random
lookups, which are a cache-locality nightmare even if we have O(log n)
lookups, we instead take a scan-sort-scan approach.

First, we scan over each vreg's liveranges, find the allocation for
each, and for each move that comes *to* or *from* this liverange,
generate a "half-move". The key idea is that we generate a record for
each "side" of the move, and these records are keyed in a way that,
after a sort, the "from" and "to" ends will be consecutive. We can
sort the vector of half-moves once (this is expensive, but not as
expensive as many separate pointer-chasing lookups), then scan it
again to actually generate the move instructions.

To enable the sort to work, half-moves are sorted by a key that is
equivalent to the tuple (from-block, to-block, to-vreg, kind), where
`kind` is "source" or "dest". For each key, the payload is an
allocation. The fields in this tuple are carefully chosen: we know all
of them at every location where we generate a half-move, without
expensive lookups, and sorting by this key will make the source and
all dests (there can be more than one) contiguous in the final order.
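
As an illustration, such a key can be packed into a single integer so
that a plain integer compare performs the whole tuple comparison (the
bit-field widths below are hypothetical, not regalloc2's actual
layout; see also the performance appendix):

```rust
/// Half-move kind: sources sort before dests on the same edge/vreg.
const KIND_SOURCE: u64 = 0;
const KIND_DEST: u64 = 1;

/// Pack (from-block, to-block, to-vreg, kind) into a u64 whose natural
/// integer order matches the lexicographic tuple order.
fn half_move_key(from_block: u32, to_block: u32, to_vreg: u32, kind: u64) -> u64 {
    debug_assert!(from_block < (1 << 21));
    debug_assert!(to_block < (1 << 21));
    debug_assert!(to_vreg < (1 << 21));
    ((from_block as u64) << 43)
        | ((to_block as u64) << 22)
        | ((to_vreg as u64) << 1)
        | kind
}
```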

Half-moves are generated for several situations. First, at the start
of every block covered by a liverange, we can generate "dest"
half-moves for blockparams, and at the end of every block covered by a
liverange, we can generate "source" half-moves for blockparam args on
branches. Incidentally, this is the reason that `blockparam_ins` and
`blockparam_outs` are sorted tuple-vectors whose tuples begin with
(vreg, block, ...): this is the order in which we do the toplevel scan
over allocations.

Second, at every block edge, if the vreg is live in any pred (at
block-start) or succ (at block-end), we generate a half-move to
transfer the vreg to its own location in the connected block.

This completes the "edge-moves". We sort the half-move array and then
have all of the alloc-to-alloc pairs for a given (from-block,
to-block) edge adjacent to one another.

There are also two kinds of moves that happen within blocks. First,
when one liverange ends and another begins for the same vreg in the
same block (i.e., a split in the middle of a block), we know both
sides of the move immediately (because it is the same vreg and we can
look up the adjacent allocation easily), and we can generate that
move directly.

Second, program moves occur within blocks. Here we need to do a
similar thing as for block-edge half-moves, but keyed on program point
instead. This is why the `prog_move_srcs` and `prog_move_dsts` arrays
are initially sorted by their (vreg, inst) keys: we can directly fill
in their allocation slots during our main scan. Note that when sorted
this way, the source and dest for a given move instruction will be at
different indices. After the main scan, we *re-sort* the arrays by
just the instruction, so the two sides of a move line up at the same
index; we can then traverse both arrays, zipped together, and generate
moves.

Finally, we generate moves to fix up multi-fixed-reg-constraint
situations, and to make reused inputs work, as described earlier.

## Move Resolution

During this whole discussion, we have described "generating moves",
but we have not said what that means. Note that in many cases, there
are several moves at a particular program point that semantically
happen *in parallel*. For example, if multiple vregs change
allocations between two instructions, all of those moves happen as
part of one parallel permutation. Similarly, blockparams have
parallel-assignment semantics. We thus enqueue all the moves that we
generate at program points and resolve them into sequences of
sequential moves that can actually be lowered to move instructions in
the machine code.

First, a word on *move priorities*. There are different kinds of moves
that are generated between instructions, and we have to ensure that
some happen before others, i.e., *not* in parallel. For example, a
vreg might change allocation (due to a split) before an instruction,
then be copied to an output register for an output with a reused-input
policy. The latter move must happen *after* the vreg has been moved
into its location for this instruction.

To enable this, we define "move priorities", which are a logical
extension of program points (i.e., they are sub-points) that enable
finer-grained ordering of moves. We currently have the following
priorities, sketched as an ordered enum after this list:

- In-edge moves, to place edge-moves before the first instruction in a
  block.
- Block-param metadata, used for the checker only.
- Regular, used for vreg movement between allocations.
- Post-regular, used for checker metadata related to pinned-vreg moves.
- Multi-fixed-reg, used for moves that handle the
  single-vreg-in-multiple-fixed-pregs constraint case.
- Reused-input, used for implementing outputs with reused-input policies.
- Out-edge moves, to place edge-moves after the last instruction
  (prior to the branch) in a block.
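
A sketch of these tiers as an ordered enum (the variant names are
ours, chosen for illustration): deriving `Ord` means a sort on a
`(ProgPoint, MovePrio)` key orders moves at the same program point
correctly.

```rust
/// Move priority tiers, in the order they must execute at a single
/// program point. The derived `Ord` follows declaration order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum MovePrio {
    InEdgeMoves,
    BlockParamMetadata,
    Regular,
    PostRegular,
    MultiFixedReg,
    ReusedInput,
    OutEdgeMoves,
}
```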

Every move is statically given one of these priorities by the code
that generates it.

We collect moves with (prog-point, prio) keys, and we sort by those
keys. We then have, for each such key, a set of moves that
semantically happen in parallel.

We then resolve those moves using a parallel-move resolver, as we now
describe.

### Parallel-Move Resolver

The fundamental issue that arises when resolving parallel moves to
sequential moves is *overlap*: some of the moves may overwrite
registers that other moves use as sources. We must carefully order the
moves so that they do not clobber values incorrectly.

We first check whether such overlap occurs. If it does not (this is
actually the most common case), the sequence of parallel moves can be
emitted as sequential moves directly. Done!

Otherwise, we have to order the moves carefully. Furthermore, if there
is a *cycle* anywhere among the moves, we will need a scratch
register. (Consider, e.g., t0 := t1 and t1 := t0 in parallel: with
only move instructions and no direct "exchange" instruction, we cannot
reify this without a third register.)

We first compute a mapping from each move instruction to the move
instruction, if any, that it must precede. Note that there can be only
one such move for a given move, because each destination can be
written only once; so a move can be constrained to come before only
the single move that overwrites its source. (This will be important in
a bit!)

Our task is now to find an ordering of moves that respects these
dependencies. To do so, we perform a depth-first search on the graph
induced by the dependencies, which generates the sequence of
sequential moves in reverse order. We keep a stack of moves, starting
with any move that has not been visited yet, and in each iteration we
examine the move on the top of the stack:

- If it has no out-edge to another move (i.e., it does not need to
  come before any others), push it to a result vector, followed by all
  others on the stack (in popped order).
- If its out-edge targets a move that is already visited and no longer
  on the stack (so it has already been emitted), likewise emit this
  move and the rest on the stack.
- If its out-edge targets a move not yet visited, push that move on
  the stack and continue.
- Otherwise, its out-edge targets a move currently on the stack, so we
  have found a cycle. In this case, we emit the moves on the stack
  with a modification: the first move in the cycle writes to a scratch
  register instead of its dest, and we emit an additional final move
  from the scratch register to that first move's dest. This breaks the
  cycle.

The astute reader may notice that this sounds like a canonical
application of Tarjan's algorithm for finding SCCs (strongly-connected
components). Why don't we have the full complexity of that algorithm?
In particular, *why* can we emit the cycle *right away* once we find
it, rather than ensuring that we have gotten all of the SCC first?

The answer is that because there is at most *one* out-edge per move (a
move can be required to precede only *one* other move), all SCCs must
be simple cycles. This means that once we have found a cycle, no other
nodes (moves) can be part of the SCC, because every node's single
out-edge is already accounted for. This is what allows us to avoid a
fully general SCC algorithm.

Once the vector of moves in reverse order has been constructed, we
reverse it and return it.
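
The following is a self-contained sketch of that search; `Loc` and
`Move` are simplified stand-ins for the allocator's real types, and
the quadratic computation of successors is for clarity only:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Loc(u32);

#[derive(Clone, Copy, Debug)]
struct Move {
    from: Loc,
    to: Loc,
}

/// Order parallel moves (each destination written at most once) into a
/// sequential list, breaking cycles with `scratch`.
fn resolve_parallel_moves(moves: &[Move], scratch: Loc) -> Vec<Move> {
    let n = moves.len();
    // succ[i]: the move that move i must precede, i.e., the unique
    // move (if any) that overwrites move i's source.
    let succ: Vec<Option<usize>> = (0..n)
        .map(|i| (0..n).find(|&j| j != i && moves[j].to == moves[i].from))
        .collect();

    let mut out_rev: Vec<Move> = Vec::with_capacity(n + 1);
    let mut visited = vec![false; n];
    let mut on_stack = vec![false; n];

    for start in 0..n {
        if visited[start] {
            continue;
        }
        let mut stack = vec![start];
        visited[start] = true;
        on_stack[start] = true;
        loop {
            let top = *stack.last().unwrap();
            match succ[top] {
                // Follow the single out-edge to an unvisited move.
                Some(next) if !visited[next] => {
                    visited[next] = true;
                    on_stack[next] = true;
                    stack.push(next);
                }
                // Out-edge points back into the stack: a cycle. `next`
                // is the first move of the cycle in execution order.
                Some(next) if on_stack[next] => {
                    // Reverse order: the scratch-to-dest fixup runs last.
                    out_rev.push(Move { from: scratch, to: moves[next].to });
                    while let Some(m) = stack.pop() {
                        on_stack[m] = false;
                        if m == next {
                            // Redirect the first cycle move into scratch.
                            out_rev.push(Move { from: moves[m].from, to: scratch });
                        } else {
                            out_rev.push(moves[m]);
                        }
                    }
                    break;
                }
                // No out-edge, or successor already emitted: flush stack.
                _ => {
                    while let Some(m) = stack.pop() {
                        on_stack[m] = false;
                        out_rev.push(moves[m]);
                    }
                    break;
                }
            }
        }
    }
    out_rev.reverse();
    out_rev
}
```

For example, the swap `t0, t1 := t1, t0` resolves to `scratch := t1;
t1 := t0; t0 := scratch`.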

Note that this "move resolver" is fuzzed separately with a simple
symbolic move simulator (the `moves` fuzz-target).

### Stack-to-Stack Moves

There is one potentially difficult situation that could arise from the
move-resolution logic so far: if a vreg moves from one spillslot to
another, this implies a memory-to-memory move, which most machine
architectures cannot handle natively. It would be much nicer if we
could ensure within the regalloc that this never occurs.

This is in fact possible to do in a postprocessing step. We iterate
through the sequential moves, tracking whether the scratch register is
in use (i.e., has been written). When we see a stack-to-stack move:
(i) if the scratch register is not in use, generate a stack-to-scratch
move and a scratch-to-stack move; otherwise, (ii) if the scratch
register is in use, allocate an "extra spillslot" if one has not
already been allocated, move the scratch reg to that slot, do the
above stack-to-scratch / scratch-to-stack sequence, then reload the
scratch reg from the extra spillslot.
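
A sketch of this postprocessing pass, with hypothetical `Loc` and
`MoveEdit` types standing in for regalloc2's allocation and edit
types (the in-use tracking here is deliberately conservative):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Loc {
    Reg(u8),
    Stack(u32),
}

#[derive(Clone, Copy, Debug)]
struct MoveEdit {
    from: Loc,
    to: Loc,
}

/// Rewrite stack-to-stack moves into legal sequences; the extra
/// spillslot is allocated lazily, at most once.
fn fix_stack_to_stack(
    moves: &[MoveEdit],
    scratch: Loc,
    mut alloc_extra_slot: impl FnMut() -> Loc,
) -> Vec<MoveEdit> {
    let mut out = Vec::with_capacity(moves.len());
    let mut scratch_in_use = false;
    let mut extra: Option<Loc> = None;
    for &m in moves {
        let stack_to_stack =
            matches!(m.from, Loc::Stack(_)) && matches!(m.to, Loc::Stack(_));
        if !stack_to_stack {
            // Track writes to the scratch register by ordinary moves.
            if m.to == scratch {
                scratch_in_use = true;
            }
            out.push(m);
        } else if !scratch_in_use {
            // Simple case: bounce the value through the scratch reg.
            out.push(MoveEdit { from: m.from, to: scratch });
            out.push(MoveEdit { from: scratch, to: m.to });
        } else {
            // Scratch holds a live value: save it to the extra
            // spillslot, bounce through scratch, then restore it.
            let slot = *extra.get_or_insert_with(&mut alloc_extra_slot);
            out.push(MoveEdit { from: scratch, to: slot });
            out.push(MoveEdit { from: m.from, to: scratch });
            out.push(MoveEdit { from: scratch, to: m.to });
            out.push(MoveEdit { from: slot, to: scratch });
        }
    }
    out
}
```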

## Redundant-Spill/Load Elimination

As a final step before returning the vector of program edits to the
client, we perform one optimization: redundant-spill/load elimination.

To understand the need for this, consider what will occur when a vreg
is (i) defined once, (ii) used many times, and (iii) spilled multiple
times between some of the uses: with the design described above, we
will move the value from the preg to the stack after every segment of
uses, and then reload it when the next use occurs. However, only the
first spill is actually needed; as we noted above, we allocate
spillslots so that the slot that corresponded to the vreg at the first
spill will always be reserved for that vreg as long as it is live. If
no other defs or mods occur, the value in the slot can be reloaded,
and need not be written back every time.

This inefficiency is a result of our invariant that a vreg lives in
exactly one place at a time, and these locations are joined by
moves. This is a simple and effective design for most of the
allocation pipeline, but it falls flat here. It is especially
inefficient when the unnecessary spill occurs in an inner loop (e.g.,
a value defined at the top of the function is spilled, then used once
in the middle of an inner loop body).

The opposite case can also sometimes occur, though it is rarer: a
value is loaded into a register, spilled, and then reloaded into the
same register. This can happen when hinting is successful at getting
several segments of a vreg to use the same preg, but splitting has
trimmed part of the liverange between uses and put it in the spill
bundle, and the spill bundle did not get a reg.

In order to resolve this inefficiency, we implement a general
redundant-spill/load elimination pass (an even more general solution
would be a full redundant-move elimination pass, but we focus on moves
that are spills/loads to contain the complexity for now). This pass
tracks, for every allocation (reg or spillslot), whether it is a copy
of another allocation. This state is invalidated whenever either that
allocation or the allocation of which it is a copy is
overwritten. When we see a move instruction, if the destination is
already a copy of the source, we elide the move. (There are some
additional complexities to preserve checker metadata which we do not
describe here.)
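
A minimal sketch of this copy-tracking state, with a hypothetical
`Alloc` id standing in for regalloc2's allocation type (a fuller
version would also canonicalize chains of copies):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Alloc(u32);

/// Tracks, per allocation, which other allocation it is a known copy of.
#[derive(Default)]
struct CopyTracker {
    copy_of: HashMap<Alloc, Alloc>,
}

impl CopyTracker {
    /// Any non-move write to `dst` invalidates facts in both directions:
    /// `dst` is no longer a copy, and nothing is a copy of `dst`.
    fn on_write(&mut self, dst: Alloc) {
        self.copy_of.remove(&dst);
        self.copy_of.retain(|_, src| *src != dst);
    }

    /// For a move `dst := src`: returns true if the move is redundant
    /// and can be elided; otherwise records the new copy fact.
    fn on_move(&mut self, dst: Alloc, src: Alloc) -> bool {
        if dst == src || self.copy_of.get(&dst) == Some(&src) {
            return true;
        }
        self.on_write(dst);
        self.copy_of.insert(dst, src);
        false
    }
}
```

Since the analysis is intra-block only (see below), this state is
reset at each block boundary.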

Note that this could, in principle, be done as a fixpoint analysis
over the CFG; it would have to be, if we wanted to preserve state
across blocks, because a location is only a copy of another if that
holds on every incoming edge. However, to avoid the cost and
complexity of such an analysis, we instead take the much simpler
approach of doing only an intra-block analysis. This turns out to be
sufficient to remove most redundant moves, especially in the common
case of a single use of an otherwise-spilled value.

Note that we could do better *if* we accepted only SSA code, because
we would know that a value cannot be redefined once written. We
should consider this again once we clean up and remove the non-SSA
support.

# Future Plans

## SSA-Only Cleanup

When the major user (Cranelift via the regalloc.rs shim) migrates to
generating SSA code and native regalloc2 operands, there are many bits
of complexity we can remove, as noted throughout this
writeup. Briefly, we could (i) remove special handling of program
moves, (ii) remove the pinned-vreg hack, (iii) simplify redundant-move
elimination, (iv) remove special handling of "mod" operands, and (v)
probably simplify plenty of code given the invariant that a def always
starts a range.

More importantly, we expect this change to result in potentially much
better allocation performance. The use of special pinned vregs and
moves to/from them instead of fixed-reg constraints, explicit moves
for every reused-input constraint, and already-sequentialized series
of move instructions on edges for phi nodes are all expensive ways of
encoding regalloc2's native input primitives that then have to be
reverse-engineered. Removing that translation layer would be
ideal. Also, allowing regalloc2 to handle phi-node (blockparam)
lowering in a way that is integrated with other moves will likely
generate better code than the current interaction between
program-move handling and Cranelift's manually lowered phi-moves.

## Better Split Heuristics

We have spent quite some effort trying to improve splitting behavior,
and it is now generally decent, but more work could be done here,
especially with regard to the interaction between splits and the loop
nest.

## Native Debuginfo Output

Cranelift currently computes value locations (in registers and
stack-slots) for detailed debuginfo with an expensive post-pass, after
regalloc is complete. This is because the existing register allocator
does not support returning this information directly. However,
providing such information by generating it while we scan over
liveranges in each vreg would be relatively simple, and has the
potential to be much faster and more reliable for Cranelift. We should
investigate adding an interface for this to regalloc2 and using it.

# Appendix: Comparison to IonMonkey Allocator

There are a number of differences between the [IonMonkey
allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp)
and this one. While this allocator initially began as an attempt to
clone IonMonkey's, it has drifted significantly as we optimized the
design (especially after we built the regalloc.rs shim and had to
adapt to its code style); it is easier at this point to name the
similarities than the differences.

* The core abstractions of "liverange", "bundle", "vreg", "preg", and
  "operand" (with policies/constraints) are the same.

* The overall allocator pipeline is the same, and the top-level
  structure of each stage should look similar. Both allocators begin
  by computing liveranges, then merging bundles, then handling bundles
  and splitting/evicting as necessary, then doing second-chance
  allocation, then reifying the decisions.

* The cost functions are very similar, though the heuristics that make
  decisions based on them are not.

Several notable high-level differences are:

* There are [many different fuzz targets](fuzz/fuzz_targets/) that
  exercise the allocator, including a full symbolic checker
  (`ion_checker` target) based on the [symbolic checker in
  regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
  and, e.g., a targeted fuzzer for the parallel move-resolution
  algorithm (`moves`) and the SSA generator used for generating cases
  for the other fuzz targets (`ssagen`).

* The data-structure invariants are simplified. While the IonMonkey
  allocator allowed LiveRanges and Bundles to overlap in certain
  cases, this allocator sticks to a strict invariant: ranges do not
  overlap in bundles, and bundles do not overlap. There are other
  examples too: e.g., the definition of minimal bundles is very simple
  and does not depend on scanning the code at all. In general, we
  should be able to state simple invariants and see by inspection (as
  well as by fuzzing -- see above) that they hold.

* The data structures themselves are simplified. Where IonMonkey uses
  linked lists in many places, this allocator stores simple inline
  smallvecs of liveranges on bundles and vregs, and smallvecs of uses
  on liveranges. We also (i) find a way to construct liveranges
  in-order immediately, without any need for splicing, unlike
  IonMonkey, and (ii) relax sorting invariants where possible to allow
  for cheap append operations in many cases.

* The splitting heuristics are significantly reworked. Whereas
  IonMonkey has an all-at-once approach to splitting an entire bundle,
  and has a list of complex heuristics to choose where to split, this
  allocator does conflict-based splitting, and tries to decide whether
  to split or evict and which split to take based on cost heuristics.

* The liverange computation is exact, whereas IonMonkey approximates
  using a single-pass algorithm that makes vregs live across entire
  loop bodies. We have found that precise liveness improves allocation
  performance and generated code quality, even though the liveness
  itself is slightly more expensive to compute.

* Many of the algorithms in the IonMonkey allocator are built with
  helper functions that do linear scans. These "small quadratic" loops
  are likely not a huge issue in practice, but nevertheless have the
  potential to be in corner cases. As much as possible, all work in
  this allocator is done in linear scans.

* There are novel schemes for solving certain interesting design
  challenges. One example: in IonMonkey, liveranges are connected
  across blocks by, when reaching one end of a control-flow edge in a
  scan, doing a lookup of the allocation at the other end. This is in
  principle a linear lookup (so quadratic overall). We instead
  generate a vector of "half-moves", keyed on the edge and from/to
  vregs, with each holding one of the allocations. By sorting and then
  scanning this vector, we can generate all edge moves in one linear
  scan. There are a number of other examples of simplifications: for
  example, we handle multiple conflicting
  physical-register-constrained uses of a vreg in a single instruction
  by recording a copy to do in a side-table, then removing the
  constraints for the core regalloc. Ion instead has to tweak its
  definition of minimal bundles and create two liveranges that overlap
  (!) to represent the two uses.

* Using block parameters rather than phi-nodes significantly
  simplifies handling of inter-block data movement. IonMonkey had to
  special-case phis in many ways because they are actually quite
  weird: their uses happen semantically in other blocks, and their
  defs happen in parallel at the top of the block. Block parameters
  naturally and explicitly represent these semantics in a direct way.

* The allocator supports irreducible control flow and arbitrary block
  ordering (its only CFG requirement is that critical edges are
  split).

* The allocator supports non-SSA code, and has native support for
  handling program moves specially.

# Appendix: Performance-Tuning Lessons

In the course of optimizing the allocator's performance, we found a
number of general principles:

* We got substantial performance speedups from using vectors rather
  than linked lists everywhere. This is well-known, but nevertheless,
  it took some thought to work out how to avoid the need for any
  splicing, and it turns out that even when our design is slightly
  less efficient asymptotically (e.g., append-and-re-sort rather than
  a linear-time merge of two sorted liverange lists when merging
  bundles), it is faster.

* We initially used a direct translation of IonMonkey's splay tree as
  an allocation map for each PReg. This turned out to be significantly
  (!) less efficient than Rust's built-in BTree data structures, for
  the usual cache-efficiency vs. pointer-chasing reasons.

* We initially used dense bitvecs, as IonMonkey does, for
  livein/liveout bits. It turned out that a chunked sparse design (see
  below) was much more efficient.

* Precise liveness significantly improves performance because it
  reduces the size of liveranges (i.e., interference), and probing
  registers with liveranges is the most significant hot inner
  loop. Paying a fraction of a percent of runtime for the iterative
  dataflow algorithm to get precise bitsets is more than worth it.

* The randomized probing of registers was a huge win: as above, the
  probing is very expensive, and reducing the average number of probes
  it takes to find a free register is very important.

* In general, single-pass algorithms, and data structures designed to
  enable them, are important. For example, the half-move technique
  avoids the need to do any O(log n) search at all, and is relatively
  cache-efficient. As another example, a side-effect of the precise
  liveness was that we could then process operands within blocks in
  actual instruction order (in reverse), which allowed us to simply
  append liveranges to in-progress vreg liverange vectors and then
  reverse at the end. The expensive part is a single pass; only the
  bitset computation is a fixpoint loop.

* Sorts are better than always-sorted data structures (like btrees):
  they amortize all the comparison and update cost to one phase, and
  this phase is much more cache-friendly than a bunch of spread-out
  updates.

* Take care of basic data structures and their operator definitions!
  We initially used the auto-derived comparator on ProgPoint, and let
  ProgPoint be a normal struct (with a u32 inst index and a
  Before/After enum). The comparator for this, used in many sorting
  inner loops, was a compound thing with conditionals. Instead, we
  pack them into a u32 and do a simple compare (and save half the
  memory as well). Likewise, the half-move key is a single value
  packed in a u64; this is far more efficient than the tuple
  comparator on a 4-tuple, and the half-move sort (which can be a few
  percent or more of total allocation time) became multiple times
  cheaper.
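
A minimal sketch of the packed representation (the exact bit layout
shown here is illustrative): the Before/After bit lives in the
least-significant bit, so the derived comparator is a single integer
compare, and Before sorts ahead of After for the same instruction.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ProgPoint(u32);

impl ProgPoint {
    fn before(inst: u32) -> Self {
        ProgPoint(inst << 1)
    }
    fn after(inst: u32) -> Self {
        ProgPoint((inst << 1) | 1)
    }
    fn inst(self) -> u32 {
        self.0 >> 1
    }
    fn is_after(self) -> bool {
        self.0 & 1 != 0
    }
}
```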

# Appendix: Data Structure: Chunked Sparse BitVec

We use a "chunked sparse bitvec" to store liveness information, which
is just a set of VReg indices. The design is fairly simple: the
toplevel is a HashMap from "chunk" to a `u64`, and each `u64`
represents 64 contiguous indices.
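
A minimal sketch of the structure (the method names here are ours,
chosen for illustration):

```rust
use std::collections::HashMap;

/// A sparse set of indices: one u64 of presence bits per chunk of 64
/// contiguous indices, stored only for non-empty chunks.
#[derive(Default, Clone)]
struct ChunkedBitVec {
    chunks: HashMap<u32, u64>,
}

impl ChunkedBitVec {
    fn insert(&mut self, idx: u32) {
        *self.chunks.entry(idx / 64).or_insert(0) |= 1u64 << (idx % 64);
    }

    fn contains(&self, idx: u32) -> bool {
        self.chunks
            .get(&(idx / 64))
            .map_or(false, |&bits| bits & (1u64 << (idx % 64)) != 0)
    }

    /// Union `other` into `self`, returning whether anything changed;
    /// this is exactly what the liveness fixpoint loop needs to know.
    fn union_with(&mut self, other: &ChunkedBitVec) -> bool {
        let mut changed = false;
        for (&chunk, &bits) in &other.chunks {
            let entry = self.chunks.entry(chunk).or_insert(0);
            let new = *entry | bits;
            changed |= new != *entry;
            *entry = new;
        }
        changed
    }
}
```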

The intuition is that while the vreg sets are likely sparse overall,
they will probably be dense within small regions of the index
space. For example, in the Nth block in a function, the values that
flow from block N-1 will largely be almost-contiguous vreg indices, if
vregs are allocated in sequence down the function body. Or, at least,
they will be some local vregs together with a few defined at the top
of the function; two separate chunks will cover that.

We tried a number of other designs as well. Initially we used a simple
dense bitvec, but this was prohibitively expensive: O(n^2) space when
the real need is closer to O(n) (i.e., a classic sparse matrix). We
also tried a hybrid scheme that kept a vector of indices when small
and used either a bitvec or a hashset when large. This did not perform
as well because (i) it was less memory-efficient (the chunking helps
with this) and (ii) insertions are more expensive when they always
require a full hashset/hashmap insert.

# Appendix: Fuzzing

We have five fuzz targets: `ssagen`, `domtree`, `moves`, `ion`, and
`ion_checker`.

## SSAGen

The SSAGen target tests our SSA generator, which generates cases for
the full allocator fuzz targets. The SSA generator is careful to
always generate a valid CFG, with split critical edges, and valid SSA,
so that we never have to throw out a test input before we reach the
allocator itself. (An alternative fuzzing approach randomly generates
programs and then throws out those that do not meet certain conditions
before using them as legitimate testcases; this is much simpler, but
less efficient.)

To generate a valid CFG, with no unreachable blocks and with no
critical edges, the generator glues together units of either one or
three blocks (A->B, A->C), forming either a straight-through section
or a conditional. These units are chained together into a "spine", and
the conditionals (the "C" blocks), where they exist, are then linked
to a random target block chosen among the main blocks of these one- or
three-block units. The targets are chosen either randomly, for
potentially irreducible CFGs, or in a way that ensures proper nesting
of loop backedges, if a structured CFG is requested.

SSA is generated by first choosing which vregs will be defined in each
block, and which will be defined as blockparams vs. instruction
defs. Instructions are then generated, with operands chosen among the
"available" vregs: those defined so far in the current block and all
of those in any other block that dominates this one.

The SSAGen fuzz target runs the above code generator against an SSA
validator, and thus ensures that it will only generate valid SSA code.

## Domtree

The `domtree` fuzz target computes dominance using the algorithm that
we use elsewhere in our CFG analysis, and then walks a
randomly-generated path through the CFG. It checks that the dominance
definition ("a dom b if any path from entry to b must pass through a")
is consistent with this particular randomly-chosen path.

## Moves

The `moves` fuzz target tests the parallel move resolver. It generates
a random sequence of parallel moves, careful to ensure that each
destination is written only once. It then runs the parallel move
resolver, and then *abstractly interprets* the resulting sequential
series of moves, thus determining which inputs flow to which
outputs. This must match the original set of parallel moves.
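
A sketch of that symbolic check (simplified: locations are plain
integers, each initially holding a unique symbol, and only parallel
destinations are verified, which implicitly permits scratch-register
clobbers):

```rust
use std::collections::HashMap;

/// `parallel` and `sequential` are lists of (from, to) location pairs.
fn check_equivalent(parallel: &[(u32, u32)], sequential: &[(u32, u32)]) -> bool {
    // Each location starts out holding a symbol equal to its own name.
    let mut state: HashMap<u32, u32> = HashMap::new();
    let read = |state: &HashMap<u32, u32>, loc: u32| *state.get(&loc).unwrap_or(&loc);
    // Interpret the sequential moves, copying symbols.
    for &(from, to) in sequential {
        let sym = read(&state, from);
        state.insert(to, sym);
    }
    // Parallel semantics: every destination must now hold the symbol
    // originally in its source.
    parallel.iter().all(|&(from, to)| read(&state, to) == from)
}
```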

## Ion and Ion-checker

The `ion` fuzz target runs the allocator over test programs generated
by SSAGen. It does not validate the output; it only tests that the
allocator runs to completion and does not panic. This was used mainly
during development, and is now less useful than the checker-based
target.

The `ion_checker` fuzz target runs the allocator's result through a
symbolic checker, which is adapted from the one developed for
regalloc.rs (see [this blog
post](https://cfallin.org/blog/2021/01/22/cranelift-isel-2/) for more
details). This is the most useful fuzz target in the fuzzing suite,
and has found many bugs in development.