From 6ec6207717b73ffcb5672c1490b854a35d5d02d1 Mon Sep 17 00:00:00 2001 From: Chris Fallin Date: Fri, 18 Jun 2021 13:59:12 -0700 Subject: [PATCH] Add design document. --- README.md | 138 +---- doc/DESIGN.md | 1625 +++++++++++++++++++++++++++++++++++++++++++++++++ src/lib.rs | 2 +- 3 files changed, 1635 insertions(+), 130 deletions(-) create mode 100644 doc/DESIGN.md diff --git a/README.md b/README.md index c187fe9..a160ed3 100644 --- a/README.md +++ b/README.md @@ -1,139 +1,19 @@ ## regalloc2: another register allocator -This is a register allocator that started life as, and is about 75% +This is a register allocator that started life as, and is about 50% still, a port of IonMonkey's backtracking register allocator to -Rust. The data structures and invariants have been simplified a little -bit, and the interfaces made a little more generic and reusable. In -addition, it contains substantial amounts of testing infrastructure +Rust. In many regards, it has been generalized, optimized, and +improved since the initial port, and now supports both SSA and non-SSA +use-cases. + +In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator. -### Design Overview +See the [design overview](doc/DESIGN.md) for (much!) more detail on +how the allocator works. -TODO - -- SSA with blockparams - -- Operands with constraints, and clobbers, and reused regs; contrast - with regalloc.rs approach of vregs and pregs and many moves that get - coalesced/elided - -### Differences from IonMonkey Backtracking Allocator - -There are a number of differences between the [IonMonkey -allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp) -and this one: - -* Most significantly, there are [fuzz/fuzz_targets/](many different - fuzz targets) that exercise the allocator, including a full symbolic - checker (`ion_checker` target) based on the [symbolic checker in - regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/) - and, e.g., a targetted fuzzer for the parallel move-resolution - algorithm (`moves`) and the SSA generator used for generating cases - for the other fuzz targets (`ssagen`). - -* The data-structure invariants are simplified. While the IonMonkey - allocator allowed for LiveRanges and Bundles to overlap in certain - cases, this allocator sticks to a strict invariant: ranges do not - overlap in bundles, and bundles do not overlap. There are other - examples too: e.g., the definition of minimal bundles is very simple - and does not depend on scanning the code at all. In general, we - should be able to state simple invariants and see by inspection (as - well as fuzzing -- see above) that they hold. - -* Many of the algorithms in the IonMonkey allocator are built with - helper functions that do linear scans. These "small quadratic" loops - are likely not a huge issue in practice, but nevertheless have the - potential to be in corner cases. As much as possible, all work in - this allocator is done in linear scans. For example, bundle - splitting is done in a single compound scan over a bundle, ranges in - the bundle, and a sorted list of split-points. - -* There are novel schemes for solving certain interesting design - challenges. One example: in IonMonkey, liveranges are connected - across blocks by, when reaching one end of a control-flow edge in a - scan, doing a lookup of the allocation at the other end. 
This is in - principle a linear lookup (so quadratic overall). We instead - generate a list of "half-moves", keyed on the edge and from/to - vregs, with each holding one of the allocations. By sorting and then - scanning this list, we can generate all edge moves in one linear - scan. There are a number of other examples of simplifications: for - example, we handle multiple conflicting - physical-register-constrained uses of a vreg in a single instruction - by recording a copy to do in a side-table, then removing constraints - for the core regalloc. Ion instead has to tweak its definition of - minimal bundles and create two liveranges that overlap (!) to - represent the two uses. - -* Using block parameters rather than phi-nodes significantly - simplifies handling of inter-block data movement. IonMonkey had to - special-case phis in many ways because they are actually quite - weird: their uses happen semantically in other blocks, and their - defs happen in parallel at the top of the block. Block parameters - naturally and explicitly reprsent these semantics in a direct way. - -* The allocator supports irreducible control flow and arbitrary block - ordering (its only CFG requirement is that critical edges are - split). It handles loops during live-range computation in a way that - is similar in spirit to IonMonkey's allocator -- in a single pass, - when we discover a loop, we just mark the whole loop as a liverange - for values live at the top of the loop -- but we find the loop body - without the fixpoint workqueue loop that IonMonkey uses, instead - doing a single linear scan for backedges and finding the minimal - extent that covers all intermingled loops. In order to support - arbitrary block order and irreducible control flow, we relax the - invariant that the first liverange for a vreg always starts at its - def; instead, the def can happen anywhere, and a liverange may - overapproximate. It turns out this is not too hard to handle and is - a more robust invariant. (It also means that non-SSA code *may* not - be too hard to adapt to, though I haven't seriously thought about - this.) - -### Rough Performance Comparison with Regalloc.rs - -The allocator has not yet been wired up to a suitable compiler backend -(such as Cranelift) to perform a true apples-to-apples compile-time -and runtime comparison. However, we can get some idea of compile speed -by running suitable test cases through the allocator and measuring -*throughput*: that is, instructions per second for which registers are -allocated. 
-
-To do so, I measured the `qsort2` benchmark in
-[regalloc.rs](https://github.com/bytecodealliance/regalloc.rs),
-register-allocated with default options in that crate's backtracking
-allocator, using the Criterion benchmark framework to measure ~620K
-instructions per second:
-
-
-```plain
-benches/0 time: [365.68 us 367.36 us 369.04 us]
- thrpt: [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]
-```
-
-I then measured three different fuzztest-SSA-generator test cases in
-this allocator, `regalloc2`, measuring between 1.1M and 2.3M
-instructions per second (closer to the former for larger functions):
-
-```plain
-==== 459 instructions
-benches/0 time: [377.91 us 378.09 us 378.27 us]
- thrpt: [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]
-
-==== 225 instructions
-benches/1 time: [202.03 us 202.14 us 202.27 us]
- thrpt: [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]
-
-==== 21 instructions
-benches/2 time: [9.5605 us 9.5655 us 9.5702 us]
- thrpt: [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]
-```
-
-Though not apples-to-apples (SSA vs. non-SSA, completely different
-code only with similar length), this is at least some evidence that
-`regalloc2` is likely to lead to at least a compile-time improvement
-when used in e.g. Cranelift.
-
-### License
+## License
 
 Unless otherwise specified, code in this crate is licensed under the
 Apache 2.0 License with LLVM Exception. This license text can be found in the file
diff --git a/doc/DESIGN.md b/doc/DESIGN.md
new file mode 100644
index 0000000..c55887a
--- /dev/null
+++ b/doc/DESIGN.md
@@ -0,0 +1,1625 @@
+# regalloc2 Design Overview
+
+This document describes the basic architecture of the regalloc2
+register allocator. It describes the externally-visible interface
+(input CFG, instructions, operands, with their invariants; meaning of
+various parts of the output); core data structures; and the allocation
+pipeline, or series of algorithms that compute an allocation. It ends
+with a description of future work and expectations, as well as an
+appendix that notes design influences and similarities to the
+IonMonkey backtracking allocator.
+
+# API, Input IR and Invariants
+
+The toplevel API to regalloc2 consists of a single entry point `run()`
+that takes a register environment, which specifies all physical
+registers, and the input program. The function returns either an error
+or an `Output` struct that provides allocations for each operand and a
+list of additional instructions (moves, loads, stores) to insert.
+
+## Register Environment
+
+The allocator takes a `MachineEnv` which specifies, for each of the
+two register classes `Int` and `Float`, a list of `PReg`s by index. A
+`PReg` is nothing more than the class and index within the class; the
+allocator does not need to know anything more.
+
+The `MachineEnv` provides a list of preferred and non-preferred
+physical registers per class. Any register not on either list will not
+be allocated. Usually, registers that do not need to be saved in the
+prologue if used (i.e., caller-save registers) are given in the
+"preferred" list. The environment also provides exactly one scratch
+register per class. This register must not be in the preferred or
+non-preferred lists, and is used whenever a set of moves that need to
+occur logically in parallel has a cycle (for a simple example,
+consider a swap `r0, r1 := r1, r0`).
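+
+To make the environment concrete, here is a minimal sketch of the
+information it carries. This is illustrative only -- the real
+`MachineEnv` type differs in its details (for example, it keeps
+separate per-class lists), and all names here are hypothetical:
+
+```rust
+// Simplified stand-ins for the allocator's types; not the real API.
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum RegClass { Int, Float }
+
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+struct PReg { class: RegClass, index: u8 }
+
+struct MachineEnv {
+    preferred_regs: Vec<PReg>,     // e.g., caller-save registers
+    non_preferred_regs: Vec<PReg>, // e.g., callee-save registers
+    scratch: [PReg; 2],            // exactly one scratch reg per class
+}
+
+fn example_env() -> MachineEnv {
+    let p = |class, index| PReg { class, index };
+    MachineEnv {
+        preferred_regs: (0..8).map(|i| p(RegClass::Int, i)).collect(),
+        non_preferred_regs: (8..15).map(|i| p(RegClass::Int, i)).collect(),
+        // The scratch register must not appear in either list above.
+        scratch: [p(RegClass::Int, 15), p(RegClass::Float, 31)],
+    }
+}
+```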
+
+With some more work, we could potentially remove the need for the
+scratch register by requiring support for an additional edit type from
+the client ("swap"), but we have not pursued this.
+
+## CFG and Instructions
+
+The allocator operates on an input program that is in a standard CFG
+representation: the function body is a list of basic blocks, and each
+block has a list of instructions and zero or more successors. The
+allocator also requires the client to provide predecessors for each
+block, and these must be consistent with the successor
+lists.
+
+Instructions are opaque to the allocator except for a few important
+bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
+branch instruction); (3) `is_call` (is a call instruction, for
+heuristic purposes only); (4) `is_move` (is a move between registers);
+and (5) a list of Operands, covered below. Every block must end in a
+return or branch.
+
+Both instructions and blocks are named by indices in contiguous index
+spaces. A block's instructions must be a contiguous range of
+instruction indices, and block i's first instruction must come
+immediately after block i-1's last instruction.
+
+The CFG must have *no critical edges*. A critical edge is an edge from
+block A to block B such that A has more than one successor *and* B has
+more than one predecessor. For this definition, the entry block has an
+implicit predecessor, and any block that ends in a return has an
+implicit successor.
+
+Note that there are *no* requirements related to the ordering of
+blocks, and there is no requirement that the control flow be
+reducible. Some *heuristics* used by the allocator will perform better
+if the code is reducible and ordered in reverse postorder (RPO),
+however: in particular, (1) this interacts better with the
+contiguous-range-of-instruction-indices live range representation that
+we use, and (2) the "approximate loop depth" metric will actually be
+exact if both these conditions are met.
+
+## Operands and VRegs
+
+Every instruction operates on values by way of `Operand`s. An operand
+consists of the following fields:
+
+- VReg, or virtual register. *Every* operand mentions a virtual
+  register, even if it is constrained to a single physical register in
+  practice. This is because we track liveranges uniformly by vreg.
+
+- Policy, or "constraint". Every reference to a vreg can apply some
+  constraint to the vreg at that point in the program. Valid policies are:
+
+  - Any location;
+  - Any register of the vreg's class;
+  - Any stack slot;
+  - A particular fixed physical register; or
+  - For a def (output), a *reuse* of an input register.
+
+- The "kind" of reference to this vreg: Def, Use, Mod. A def
+  (definition) writes to the vreg, and disregards any possible earlier
+  value. A mod (modify) reads the current value then writes a new
+  one. A use simply reads the vreg's value.
+
+- The position: before or after the instruction.
+  - Note that to have a def (output) register available in a way that
+    does not conflict with inputs, the def should be placed at the
+    "before" position. Similarly, to have a use (input) register
+    available in a way that does not conflict with outputs, the use
+    should be placed at the "after" position.
+
+This operand-specification design allows for SSA and non-SSA code (see
+section below for details).
+
+VRegs, or virtual registers, are specified by an index and a register
+class (Float or Int). The classes are not given separately; they are
+encoded on every mention of the vreg. (In a sense, the class is an
+extra index bit, or part of the register name.) The input function
+trait does require the client to provide the exact vreg count,
+however.
+
+Implementation note: both vregs and operands are bit-packed into
+u32s. This is essential for memory efficiency. As a result of the
+operand bit-packing in particular (including the policy constraints!),
+the allocator supports up to 2^20 (1M) vregs per function, and 2^5
+(32) physical registers per class. Later we will also see a limit of
+2^20 (1M) instructions per function. These limits are considered
+sufficient for the anticipated use-cases (e.g., compiling Wasm, which
+also has function-size implementation limits); for larger functions,
+it is likely better to use a simpler register allocator in any case.
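+
+As a concrete illustration of this kind of bit-packing (the actual
+layout in regalloc2 differs; this sketch just shows the technique):
+
+```rust
+// A vreg index in the low 20 bits and the class in bit 20; the
+// remaining bits are free for other operand fields. Illustrative only.
+#[derive(Clone, Copy)]
+struct VReg(u32);
+
+impl VReg {
+    const INDEX_BITS: u32 = 20;
+
+    fn new(index: u32, is_float: bool) -> Self {
+        assert!(index < (1 << Self::INDEX_BITS));
+        VReg(index | ((is_float as u32) << Self::INDEX_BITS))
+    }
+    fn index(self) -> u32 {
+        self.0 & ((1 << Self::INDEX_BITS) - 1)
+    }
+    fn is_float(self) -> bool {
+        (self.0 >> Self::INDEX_BITS) & 1 != 0
+    }
+}
+```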
+
+## Reuses and Two-Address ISAs
+
+Some instruction sets primarily have instructions that name only two
+registers for a binary operator, rather than three: both registers are
+inputs, and the result is placed in one of the registers, clobbering
+its original value. The most well-known modern example is x86. It is
+thus imperative that we support this pattern well in the register
+allocator.
+
+This instruction-set design is somewhat at odds with an SSA
+representation, where a value cannot be redefined. Even in non-SSA
+code, it is awkward to overwrite a vreg that may need to be used again
+later.
+
+Thus, the allocator supports a useful fiction of sorts: the
+instruction can be described as if it has three register mentions --
+two inputs and a separate output -- and neither input will be
+clobbered. The output, however, is special: its register-placement
+policy is "reuse input i" (where i == 0 or 1). The allocator
+guarantees that the register assignment for that input and the output
+will be the same, so the instruction can use that register as its
+"modifies" operand. If the input is needed again later, the allocator
+will take care of the necessary copying.
+
+We will see below how the allocator makes this work by doing some
+preprocessing so that the core allocation algorithms do not need to
+worry about this constraint.
+
+Note that some non-SSA clients, such as Cranelift using the
+regalloc.rs-to-regalloc2 compatibility shim, will instead generate
+their own copies (copying to the output vreg first) and then use "mod"
+operand kinds, which allow the output vreg to be both read and
+written. regalloc2 works hard to make this as efficient as the
+reused-input scheme by treating moves specially (see below).
+
+## SSA
+
+regalloc2 was originally designed to take an SSA IR as input, where
+the usual definitions apply: every vreg is defined exactly once, and
+every vreg use is dominated by its one def. (Using blockparams means
+that we do not need additional conditions for phi-nodes.)
+
+The allocator then evolved to support non-SSA inputs as well. As a
+result, the input is maximally flexible right now: it does not check
+for or enforce the single-def rule, nor try to take advantage of
+it. However, blockparams are still available.
+
+In the future, we hope to change this, however, once compilation of
+non-SSA inputs is no longer needed. Specifically, if we can migrate
+Cranelift to the native regalloc2 API rather than the regalloc.rs
+compatibility shim, we will be able to remove "mod" operand kinds,
+assume (and verify) single defs, and take advantage of this when
+reasoning about various algorithms in the allocator.
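+
+To illustrate the reused-input scheme described above: for a
+two-address add `v2 = add v0, v1` lowered to an x86-style `dst op=
+src`, the operands might look like the following sketch (hypothetical
+names, not the exact regalloc2 API):
+
+```rust
+enum Policy { AnyReg, Reuse(usize) } // Reuse(i): share input i's register
+enum Kind { Use, Def }
+
+struct Operand { vreg: u32, kind: Kind, policy: Policy }
+
+fn two_address_add_operands() -> Vec<Operand> {
+    vec![
+        Operand { vreg: 0, kind: Kind::Use, policy: Policy::AnyReg },
+        Operand { vreg: 1, kind: Kind::Use, policy: Policy::AnyReg },
+        // The def reuses input 0's register; if v0 is live afterward,
+        // the allocator inserts the necessary copy itself.
+        Operand { vreg: 2, kind: Kind::Def, policy: Policy::Reuse(0) },
+    ]
+}
+```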
+ +## Block Parameters + +Every block can have *block parameters*, and a branch to a block with +block parameters must provide values for those parameters via +operands. When a branch has more than one successor, it provides +separate operands for each possible successor. These block parameters +are equivalent to phi-nodes; we chose this representation because they +are in many ways a more consistent representation of SSA. + +To see why we believe block parameters are a slightly nicer design +choice than use of phi nodes, consider: phis are special +pseudoinstructions that must come first in a block, are all defined in +parallel, and whose uses occur on the edge of a particular +predecessor. All of these facts complicate any analysis that scans +instructions and reasons about uses and defs. It is much closer to the +truth to actually put those uses *in* the predecessor, on the branch, +and put all the defs at the top of the block as a separate kind of +def. The tradeoff is that a vreg's def now has two possibilities -- +ordinary instruction def or blockparam def -- but this is fairly +reasonable to handle. + +## Non-SSA + +As mentioned, regalloc2 supports non-SSA inputs as well. No special +flag is needed to place the allocator in this mode or disable SSA +verification. However, we hope to eventually remove this functionality +when it is no longer needed. + +## Program Moves + +As an especially useful feature for non-SSA IR, regalloc2 supports +special handling of "move" instructions: it will try to merge the +input and output allocations to elide the move altogether. + +It turns out that moves are used frequently in the non-SSA input that +we observe from Cranelift via the regalloc.rs compatibility shim. They +are used in three different ways: + +- Moves to or from physical registers, used to implement ABI details + or place values in particular registers required by certain + instructions. +- Moves between vregs on program edges, as lowered from phi/blockparam + dataflow in the higher-level SSA IR (CLIF). +- Moves just prior to two-address-form instructions that modify an + input to form an output: the input is moved to the output vreg to + avoid clobbering the input. + +Note that, strictly speaking, special handling of program moves is +redundant because each of these kinds of uses has an equivalent in the +"native" regalloc2 API: + +- Moves to/from physical registers can become operand constraints, + either on a particular instruction that requires/produces the values + in certain registers (e.g., a call or ret with args/results in regs, + or a special instruction with fixed register args), or on a ghost + instruction at the top of function that defs vregs for all in-reg + args. + +- Moves between vregs as a lowering of blockparams/phi nodes can be + replaced with use of regalloc2's native blockparam support. + +- Moves prior to two-address-form instructions can be replaced with + the reused-input mechanism. + +Thus, eventually, special handling of program moves should be +removed. However, it is very important for performance at the moment. + +## Output + +The allocator produces two main data structures as output: an array of +`Allocation`s and a list of edits. Some other data, such as stackmap +slot info, is also provided. + +### Allocations + +The allocator provides an array of `Allocation` values, one per +`Operand`. Each `Allocation` has a kind and an index. 
The kind may +indicate that this is a physical register or a stack slot, and the +index gives the respective register or slot. All allocations will +conform to the constraints given, and will faithfully preserve the +dataflow of the input program. + +### Inserted Moves + +In order to implement the necessary movement of data between +allocations, the allocator needs to insert moves at various program +points. + +The list of inserted moves contains tuples that name a program point +and an "edit". The edit is either a move, from one `Allocation` to +another, or else a kind of metadata used by the checker to know which +VReg is live in a given allocation at any particular time. The latter +sort of edit can be ignored by a backend that is just interested in +generating machine code. + +Note that the allocator will never generate a move from one stackslot +directly to another, by design. Instead, if it needs to do so, it will +make use of the scratch register. (Sometimes such a move occurs when +the scratch register is already holding a value, e.g. to resolve a +cycle of moves; in this case, it will allocate another spillslot and +spill the original scratch value around the move.) + +Thus, the single "edit" type can become either a register-to-register +move, a load from a stackslot into a register, or a store from a +register into a stackslot. + +# Data Structures + +We now review the data structures that regalloc2 uses to track its +state. + +## Program-Derived Alloc-Invariant Data + +There are a number of data structures that are computed in a +deterministic way from the input program and then subsequently used +only as read-only data during the core allocation procedure. + +### Livein/Liveout Bitsets + +The livein and liveout bitsets (`liveins` and `liveouts` on the `Env`) +are allocated one per basic block and record, per block, which vregs +are live entering and leaving that block. They are computed using a +standard backward iterative dataflow analysis and are exact; they do +not over-approximate (this turns out to be important for performance, +and is also necessary for correctness in the case of stackmaps). + +### Blockparam Lists: Source-Side and Dest-Side + +The initialization stage scans the input program and produces two +lists that represent blockparam flows from branches to destination +blocks: `blockparam_ins` and `blockparam_outs`. + +These two lists are the first instance we will see of a recurring +pattern: the lists contain tuples that are carefully ordered in a way +such that their sort-order is meaningful. "Build a list lazily then +sort" is a common idiom: it batches the O(n log n) cost into one +operation that the stdlib has aggressively optimized, it provides +dense storage, and it allows for a scan in a certain order that often +lines up with a scan over the program. + +In this particular case, we will build lists of (vreg, block) points +that are meaningful either at the start or end of a block, so that +later, when we scan over a particular vreg's allocations in block +order, we can generate another list of allocations. One side (the +"outs") also contains enough information that it can line up with the +other side (the "ins") in a later sort. + +To make this work, `blockparam_ins` contains a list of (to-vreg, +to-block, from-block) tuples, and has an entry for every blockparam of +every block. Note that we can compute this without actually observing +from-blocks; we only need to iterate over `block_preds` at any given +block. 
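+
+A sketch of how the "ins" list is built and then sorted in bulk
+(helper names and types are hypothetical):
+
+```rust
+type VRegIdx = u32;
+type BlockIdx = u32;
+
+/// Build (to-vreg, to-block, from-block) tuples: one per blockparam
+/// per predecessor. `params[b]` lists block b's blockparams and
+/// `preds[b]` its predecessors.
+fn build_blockparam_ins(
+    params: &[Vec<VRegIdx>],
+    preds: &[Vec<BlockIdx>],
+) -> Vec<(VRegIdx, BlockIdx, BlockIdx)> {
+    let mut ins = Vec::new();
+    for (block, blockparams) in params.iter().enumerate() {
+        for &to_vreg in blockparams {
+            for &from_block in &preds[block] {
+                ins.push((to_vreg, block as BlockIdx, from_block));
+            }
+        }
+    }
+    // Build lazily, then sort once in bulk: a later in-order scan can
+    // line these tuples up with the matching "outs" entries.
+    ins.sort_unstable();
+    ins
+}
+```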
+
+Then, `blockparam_outs` contains a list of (from-vreg, from-block,
+to-block, to-vreg), and has an entry for every parameter on every
+branch that ends a block. There is exactly one "out" tuple for every
+"in" tuple. As mentioned above, we will later scan over both to
+generate moves.
+
+### Program-Move Lists: Source-Side and Dest-Side
+
+Similar to blockparams, we handle moves specially. In fact, we ingest
+all moves in the input program into a set of lists -- "move sources"
+and "move dests", analogous to the "ins" and "outs" blockparam lists
+described above -- and then completely ignore the moves in the program
+thereafter. The semantics of the API are such that all program moves
+will be recreated with regalloc-inserted edits, and should not still
+be emitted after regalloc. This may seem inefficient, but in fact it
+allows for better code because it integrates program-moves with the
+move resolution that handles other forms of vreg movement. We
+previously took the simpler approach of handling program-moves as
+opaque instructions with a source and dest, and we found that there
+were many redundant move-chains (A->B, B->C) that are eliminated when
+everything is handled centrally.
+
+We also construct a `prog_move_merges` list of live-range index pairs
+to attempt to merge when we reach that stage of allocation.
+
+## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs
+
+We now come to the core data structures: live-ranges, bundles, virtual
+registers and their state, and physical registers and their state.
+
+First we must define a `ProgPoint` precisely: a `ProgPoint` is an
+instruction index and a `Before` or `After` suffix. We pack the
+before/after suffix into the LSB of a `u32`, so a `ProgPoint` can be
+incremented and compared as a simple integer.
+
+A live-range is a contiguous range of program points (half-open,
+i.e. including `from` and excluding `to`) for which a particular vreg
+is live with a value.
+
+A live-range contains a list of uses. Each use contains four parts:
+the Operand word (directly copied, so there is no need to dereference
+it); the ProgPoint at which the use occurs; the operand slot on that
+instruction, if any, that the operand comes from; and the use's
+"weight". (It's possible to have "ghost uses" that do not derive from
+any slot on the instruction.) These four parts are packed into three
+`u32`s: the slot can fit in 8 bits, and the weight in 16.
+
+The live-range carries its program-point range, uses, vreg index,
+bundle index (see below), and some metadata: spill weight and
+flags. The spill weight is the sum of weights of each use. The flags
+set currently carries one flag only: whether the live-range starts at
+a Def-kind operand. (This is equivalent to whether the range consumes
+a value at its start or not.)
+
+Uses are owned only by live-ranges and have no separate identity, but
+live-ranges live in a toplevel array and are known by `LiveRangeIndex`
+values throughout the allocator. New live-ranges can be created
+(e.g. during splitting); old ones are not cleaned up, but rather, all
+state is bulk-freed at the end.
+
+Live-ranges are aggregated into "bundles". A bundle is a collection of
+ranges that does not overlap.
Each bundle carries: a list (inline +SmallVec) of (range, live-range index) tuples, an allocation (starts +as "none"), a "spillset" (more below), and some metadata, including a +spill weight (sum of ranges' weights), a priority (sum of ranges' +lengths), and three property flags: "minimal", "contains fixed +constraints", "contains stack constraints". + +VRegs also contain their lists of live-ranges, in the same form as a +bundle does (inline SmallVec that has inline (from, to) range bounds +and range indices). + +There are two important overlap invariants: (i) no liveranges within a +bundle overlap, and (ii) no liveranges within a vreg overlap. These +are extremely important and we rely on them implicitly in many places. + +The live-range lists in bundles and vregs, and use-lists in ranges, +have various sorting invariants as well. These invariants differ +according to the phase of the allocator's computation. First, during +live-range construction, live-ranges are placed into vregs in reverse +order (because the computation is a reverse scan) and uses into ranges +in reverse order; these are sorted into forward order at the end of +live-range computation. When bundles are first constructed, their +range lists are sorted, and they remain so for the rest of allocation, +as we need for interference testing. However, as ranges are created +and split, sortedness of vreg ranges is *not* maintained; they are +sorted once more, in bulk, when allocation is done and we start to +resolve moves. + +Finally, we have physical registers. The main data associated with +each is the allocation map. This map is a standard BTree, indexed by +ranges (`from` and `to` ProgPoints) and yielding a LiveRange for each +location range. The ranges have a custom comparison operator defined +that compares equal for any overlap. + +This comparison operator allows us to determine whether a range is +free, i.e. has no overlap with a particular range, in one probe -- the +btree will not contain a match. However, it makes iteration over *all* +overlapping ranges somewhat tricky to get right. Notably, Rust's +BTreeMap does not guarantee that the lookup result will be the *first* +equal key, if multiple keys are equal to the probe key. Thus, when we +want to enumerate all overlapping ranges, we probe with a range that +consists of the single program point *before* the start of the actual +query range, using the API that returns an iterator over a range in +the BTree, and then iterate through the resulting iterator to gather +all overlapping ranges (which will be contiguous). + +## Spill Bundles + +It is worth describing "spill bundles" separately. Every spillset (see +below; a group of bundles that originated from one bundle) optionally +points to a single bundle that we designate the "spill bundle" for +that spillset. Contrary to the name, this bundle is not +unconditionally spilled. Rather, one can see it as a sort of fallback: +it is where liveranges go when we give up on processing them via the +normal backtracking loop, and will only process them once more in the +"second-chance" stage. + +The spill bundle acquires liveranges in two ways. First, as we split +bundles, we will trim the split pieces in certain ways so that some +liveranges are immediately placed in the spill bundle. 
Intuitively, +the "empty" regions that just carry a value, but do not satisfy any +operands, should be in the spill bundle: it is better to have a single +consistent location for the value than to move it between lots of +different split pieces without using it, as moves carry a cost. + +Second, the spill bundle acquires the liveranges of a bundle that has +no requirement to be in a register when that bundle is processed, but +only if the spill bundle already exists. In other words, we won't +create a second-chance spill bundle just for a liverange with an "Any" +use; but if it was already forced into existence by splitting and +trimming, then we might as well use it. + +Note that unlike other bundles, a spill bundle's liverange list +remains unsorted until we do the second-chance allocation. This allows +quick appends of more liveranges. + +## Allocation Queue + +The allocation queue is simply a priority queue (built with a binary +max-heap) of (prio, bundle-index) tuples. + +## Spillsets and Spillslots + +Every bundle contains a reference to a spillset. Spillsets are used to +assign spillslots near the end of allocation, but before then, they +are also a convenient place to store information that is common among +*all bundles* that share the spillset. In particular, spillsets are +initially assigned 1-to-1 to bundles after all bundle-merging is +complete; so spillsets represent in some sense the "original bundles", +and as splitting commences, the smaller bundle-pieces continue to +refer to their original spillsets. + +We stash some useful information on the spillset because of this: a +register hint, used to create some "stickiness" between pieces of an +original bundle that are assigned separately after splitting; the +spill bundle; the common register class of all vregs in this bundle; +the vregs whose liveranges are contained in this bundle; and then some +information actually used if this is spilled to the stack (`required` +indicates actual stack use; `size` is the spillslot count; `slot` is +the actual stack slot). + +Spill *sets* are later allocated to spill *slots*. Multiple spillsets +can be assigned to one spillslot; the only constraint is that +spillsets assigned to a spillslot must not overlap. When we look up +the allocation for a bundle, if the bundle is not given a specific +allocation (its `alloc` field is `Allocation::none()`), this means it +is spilled, and we traverse to the spillset then spillslot. + +## Other: Fixups, Stats, Debug Annotations + +There are a few fixup lists that we will cover in more detail +later. Of particular note is the "multi-fixed-reg fixup list": this +handles instructions that constrain the same input vreg to multiple, +different, fixed registers for different operands at the same program +point. The only way to satisfy such a set of constraints is to +decouple all but one of the inputs (make them no longer refer to the +vreg) and then later insert copies from the first fixed use of the +vreg to the other fixed regs. + +The `Env` also carries a statistics structure with counters that are +incremented, which can be useful for evaluating the effects of +changes; and a "debug annotations" hashmap from program point to +arbitrary strings that is filled out with various useful diagnostic +information if enabled, so that an annotated view of the program with +its liveranges, bundle assignments, inserted moves, merge and split +decisions, etc. can be viewed. + +# Allocation Pipeline + +We now describe the pipeline that computes register allocations. 
+
+## Live-range Construction
+
+The first step in performing allocation is to analyze the input
+program to understand its dataflow: that is, the ranges during which
+virtual registers must be assigned to physical registers. Computing
+these ranges is what allows us to do better than a trivial "every vreg
+lives in a different location, always" allocation.
+
+We compute precise liveness first using an iterative dataflow
+algorithm with BitVecs. (See below for our sparse chunked BitVec
+description.) This produces the `liveins` and `liveouts` vectors of
+BitVecs per block.
+
+We then perform a single pass over blocks in reverse order, and scan
+instructions in each block in reverse order. Why reverse order? We
+must see instructions within a block in reverse to properly compute
+liveness (a value is live backward from a use to a def). Because we
+want to keep liveranges in-order as we build them, to enable
+coalescing, we visit blocks in reverse order as well, so overall this
+is simply a scan over the whole instruction index space in reverse
+order.
+
+For each block, we perform a scan with the following state:
+
+- A liveness bitvec, initialized at the start from `liveouts`.
+- A vector of live-range indices, with one entry per vreg, initially
+  "invalid" (this vector is allocated once and reused at each block).
+- In-progress list of live-range indices per vreg in the vreg state,
+  in *reverse* order (we will reverse it when we're done).
+
+A vreg is live at the current point in the scan if its bit is set in
+the bitvec; its entry in the vreg-to-liverange vec may be stale, but
+if the bit is not set, we ignore it.
+
+We initially create a liverange for all vregs that are live out of the
+block, spanning the whole block. We will trim this below if it is
+locally def'd and does not pass through the block.
+
+For each instruction, we process its effects on the scan state:
+
+- For all clobbers (which logically happen at the end of the
+  instruction), add a single-program-point liverange to each clobbered
+  preg.
+
+- If not a move:
+  - for each program point [after, before]:
+    - for each operand at this point(\*):
+      - if a def or mod:
+        - if not currently live, this is a dead def; create an empty
+          LR.
+        - if a def:
+          - set the start of the LR for this vreg to this point.
+          - set as dead.
+      - if a use:
+        - create LR if not live, with start at beginning of block.
+
+- Else, if a move:
+  - simple case (no pinned vregs):
+    - add to `prog_move` data structures, and update LRs as above.
+    - effective point for the use is *after* the move, and for the def
+      is *before* the *next* instruction. Why not more conventional
+      use-before, def-after? Because this allows the move to happen in
+      parallel with other moves that the move-resolution inserts
+      (between split fragments of a vreg); these moves always happen
+      at the gaps between instructions. We place it after, not before,
+      because before may land at a block-start and interfere with edge
+      moves, while after is always a "normal" gap (a move cannot end a
+      block).
+  - otherwise: see below (pinned vregs).
+
+
+(\*) an instruction operand's effective point is adjusted in a few
+cases. If the instruction is a branch, its uses (which are
+blockparams) are extended to the "after" point. If there is a reused
+input, all *other* inputs are extended to "after": this ensures proper
+interference (as we explain more below).
+
+We then treat blockparams as defs at the end of the scan (beginning of
+the block), and create the "ins" tuples.
(The uses for the other side +of the edge are already handled as normal uses on a branch +instruction.) + +### Optimization: Pinned VRegs and Moves + +In order to efficiently handle the translation from the regalloc.rs +API, which uses named RealRegs that are distinct from VirtualRegs +rather than operand constraints, we need to implement a few +optimizations. The translation layer translates RealRegs as particular +vregs at the regalloc2 layer, because we need to track their liveness +properly. Handling these as "normal" vregs, with massive bundles of +many liveranges throughout the function, turns out to be a very +inefficient solution. So we mark them as "pinned" with a hook in the +RA2 API. Semantically, this means they are always assigned to a +particular preg whenever mentioned in an operand (but *NOT* between +those points; it is possible for a pinned vreg to move all about +registers and stackslots as long as it eventually makes it back to its +home preg in time for its next use). + +This has a few implications during liverange construction. First, when +we see an operand that mentions a pinned vreg, we translate this to an +operand constraint that names a fixed preg. Later, when we build +bundles, we will not create a bundle for the pinned vreg; instead we +will transfer its liveranges directly as unmoveable reservations in +pregs' allocation maps. Finally, we need to handle moves specially. + +With the caveat that "this is a massive hack and I am very very +sorry", here is how it works. A move between two pinned vregs is easy: +we add that to the inserted-moves list right away because we know the +Allocation on both sides. A move from a pinned vreg to a normal vreg +is the first interesting case. In this case, we (i) create a ghost def +with a fixed-register policy on the normal vreg, doing the other +liverange-maintenance bits as above, and (ii) adjust the liveranges on +the pinned vreg (so the preg) in a particular way. If the preg is live +flowing downward, then this move implies a copy, because the normal +vreg and the pinned vreg are both used in the future and cannot +overlap. But we cannot keep the preg continuously live, because at +exactly one program point, the normal vreg is pinned to it. So we cut +the downward-flowing liverange just *after* the normal vreg's +fixed-reg ghost def. Then, whether it is live downward or not, we +create an upward-flowing liverange on the pinned vreg that ends just +*before* the ghost def. + +The move-from-normal-to-pinned case is similar. First, we create a +ghost use on the normal vreg that pins its value at this program point +to the fixed preg. Then, if the preg is live flowing downward, we trim +its downward liverange to start just after the fixed use. + +There are also some tricky metadata-maintenance records that we emit +so that the checker can keep this all straight. + +The outcome of this hack, together with the operand-constraint +translation on normal uses/defs/mods on pinned vregs, is that we +essentially are translating regalloc.rs's means of referring to real +registers to regalloc2's preferred abstractions by doing a bit of +reverse-engineering. It is not perfect, but it works. Still, we hope +to rip it all out once we get rid of the need for the compatibility +shim. + +### Handling Reused Inputs + +Reused inputs are also handled a bit specially. 
We have already
+described how we essentially translate the idiom so that the output's
+allocation is used for input and output, and there is a move just
+before the instruction that copies the actual input (which will not be
+clobbered) to the output. Together with an attempt to merge the
+bundles for the two, to elide the move if possible, this works
+perfectly well as long as we ignore all of the other inputs.
+
+But we can't do that: we have to ensure that other inputs' allocations
+are correct too. Note that using the output's allocation as the input
+is actually potentially incorrect if the output is at the After point
+and the input is at the Before: the output might share a register with
+one of the *other* (normal, non-reused) inputs if that input's vreg
+were dead afterward. This would mean that we clobber the other input.
+
+So, to get the interference right, we *extend* all other (non-reused)
+inputs of an instruction with a reused input to the After point. This
+ensures that the other inputs are *not* clobbered by the slightly
+premature use of the output register.
+
+The source has a link to a comment in IonMonkey that implies that it
+uses a similar solution to this problem, though it's not entirely
+clear.
+
+(This odd dance, like many of the others above and below, is "written
+in fuzzbug failures", so to speak. It's not entirely obvious until one
+sees the corner case where it's necessary!)
+
+## Bundle Merging
+
+Once we have built the liverange lists for every vreg, we can reverse
+these lists (recall, they were built in strict reverse order) and
+initially assign one bundle per (non-pinned) vreg. We then try to
+merge bundles together as long as we find pairs of bundles that do not
+overlap and that (heuristically) make sense to merge.
+
+Note that this is the only point in the allocation pipeline where
+bundles get larger. We initially merge as large as we dare (but not
+too large, because then we'll just cause lots of conflicts and
+splitting later), and then try out assignments, backtrack via
+eviction, and split continuously to chip away at the problem until we
+have a working set of allocation assignments.
+
+We attempt to merge three kinds of bundle pairs: reused-input to
+corresponding output; across program moves; and across blockparam
+assignments.
+
+To merge two bundles, we traverse over both their sorted liverange
+lists at once, checking for overlaps. Note that we can do this without
+pointer-chasing to the liverange data; the (from, to) range is in the
+liverange list itself.
+
+We also check whether the merged bundle would have conflicting
+requirements (see below for more on requirements). We do a coarse
+check first, checking 1-bit flags that indicate whether either bundle
+has any fixed-reg constraints or stack-only constraints. If so, we
+need to do a detailed check by actually computing merged requirements
+on both sides, merging, and checking for Conflict (the lattice bottom
+value). If no conflict, we merge.
+
+A performance note: merging is extremely performance-sensitive, and it
+turns out that a mergesort-like merge of the liverange lists is too
+expensive, partly because it requires allocating a separate result
+vector (in-place merge in mergesort is infamously complex). Instead,
+we simply append one vector onto the end of the other and invoke
+Rust's builtin sort. We could special-case "one bundle is completely
+before the other", but we currently don't do that (performance idea!).
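+
+The overlap check during merging can be a single forward walk over the
+two sorted range lists, along these lines (a sketch; names are
+illustrative):
+
+```rust
+type ProgPoint = u32; // instruction index with before/after in the LSB
+
+/// Do two sorted lists of disjoint half-open (from, to) ranges
+/// intersect anywhere? Runs in O(|a| + |b|) with no pointer-chasing.
+fn ranges_overlap(a: &[(ProgPoint, ProgPoint)], b: &[(ProgPoint, ProgPoint)]) -> bool {
+    let (mut i, mut j) = (0, 0);
+    while i < a.len() && j < b.len() {
+        if a[i].1 <= b[j].0 {
+            i += 1; // a[i] ends before b[j] begins
+        } else if b[j].1 <= a[i].0 {
+            j += 1; // b[j] ends before a[i] begins
+        } else {
+            return true; // the two ranges intersect
+        }
+    }
+    false
+}
+```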
+
+Once all bundles are merged as far as they will go, we compute cached
+bundle properties (priorities and weights) and enqueue them on the
+priority queue for allocation.
+
+## Recurring: Bundle Property Computation
+
+The core allocation loop is a recurring iteration of the following: we
+take the highest-priority bundle from the allocation queue; we compute
+its requirements; we try to find it a register according to those
+requirements; if no fit, we either evict some other bundle(s) from
+their allocations and try again, or we split the bundle and put the
+parts back on the queue. We record all the information we need to make
+the evict-or-split decision (and where to split) *during* the physical
+register allocation-map scans, so we don't need to go back again to
+compute that.
+
+Termination is nontrivial to see, because of eviction. How do we
+guarantee we don't get into an infinite loop where two bundles fight
+over a register forever? In fact, this can easily happen if there is a
+bug; we fixed many fuzzbugs like this, and we have a check for
+"infinite loop" based on an upper bound on iterations. But if the
+allocator is correct, it should never happen.
+
+Termination is guaranteed because (i) bundles always get smaller, (ii)
+eviction only occurs when a bundle is *strictly* higher weight (not
+higher-or-equal), and (iii) once a bundle gets down to its "minimal"
+size, it has an extremely high weight that is guaranteed to evict any
+non-minimal bundle. A minimal bundle is one that covers only one
+instruction. As long as the input program does not have impossible
+constraints that require more than one vreg to exist in one preg, an
+allocation problem of all minimal bundles will always have a solution.
+
+## Bundle Processing
+
+Let's now talk about what happens when we take a bundle off the
+allocation queue. The three basic outcomes are: allocate; split and
+requeue; or evict and try again immediately (and eventually allocate
+or split/requeue).
+
+### Properties: Weight, Priority, and Requirements
+
+To process a bundle, we have to compute a few properties. In fact we
+will have already computed a few of these beforehand, but we describe
+them all here.
+
+- Priority: a bundle's priority determines the order in which it is
+  considered for allocation. RA2 defines it as the sum of the lengths
+  (in instruction index space) of each liverange. This causes the
+  allocator to consider larger bundles first, when the allocation maps
+  are generally more free; they can always be evicted and split later.
+
+- Weight: a bundle's weight indicates how important (in terms of
+  runtime) its uses/register mentions are. In an approximate sense,
+  inner loop bodies create higher-weight uses. Fixed register
+  constraints add some weight, and defs add some weight. Finally,
+  weight is divided by priority, so a very large bundle that happens
+  to have a few important uses does not uniformly exert its weight
+  across its entire range. This has the effect of causing bundles to
+  be more important (more likely to evict others) the more they are
+  split.
+
+- Requirement: a bundle's requirement is a value in a lattice that we
+  have defined, where top is "Unknown" and bottom is
+  "Conflict". Between these two, we have: any register (of a class);
+  any stackslot (of a class); a particular register. "Any register"
+  can degrade to "a particular register", but any other pair of
+  different requirements meets to Conflict.
Requirements are derived
+  from the operand constraints for all uses in all liveranges in a
+  bundle, and then merged with the lattice meet-function.
+
+Once we have the Requirement for a bundle, we can decide what to do.
+
+### No-Register-Required Cases
+
+If the requirement indicates that no register is needed (`Unknown` or
+`Any`), *and* if the spill bundle already exists for this bundle's
+spillset, then we move all the liveranges over to the spill bundle, as
+described above.
+
+If the requirement indicates that the stack is needed explicitly
+(e.g., for a safepoint), we set our spillset as "required" (this will
+cause it to allocate a spillslot) and return; because the bundle has
+no other allocation set, it will look to the spillset's spillslot by
+default.
+
+If the requirement indicates a conflict, we immediately split and
+requeue the split pieces. This split is a special one: rather than
+split in a way informed by conflicts (see below), we unconditionally
+split off the first use. This is a heuristic and we could in theory do
+better by finding the source of the conflict; but in practice this
+works well enough. Note that a bundle can reach this stage with a
+conflicting requirement only if the original liverange had conflicting
+uses (e.g., a liverange from a def in a register to a use on stack, or
+a liverange between two different fixed-reg-constrained operands); our
+bundle merging logic explicitly avoids merging two bundles if it would
+create a conflict.
+
+### Allocation-Map Probing
+
+If we did not immediately dispose of the bundle as described above,
+then we *can* use a register (either `Any`, which accepts a register
+as one of several options, or `Reg`, which must have one, or `Fixed`,
+which must have a particular one).
+
+We determine the list of physical registers whose allocation maps we
+will probe, and in what order. If a particular fixed register is
+required, we probe only that register. Otherwise, we probe all
+registers in the required class.
+
+The order in which we probe, if we are not constrained to a single
+register, is carefully chosen. First, if there is a hint register from
+the spillset (this is set by the last allocation into a register of
+any other bundle in this spillset), we probe that. Then, we probe all
+preferred registers; then all non-preferred registers.
+
+For each of the preferred and non-preferred register lists, we probe
+in an *offset* manner: we start at some index partway through the
+list, determined by some heuristic number that is random and
+well-distributed. (In practice, we use the sum of the bundle index and
+the instruction index of the start of the first range in the bundle.)
+We then march through the list and wrap around, stopping before we hit
+our starting point again.
+
+The purpose of this offset is to distribute the contention and speed
+up the allocation process. In the common case where there are enough
+registers to hold values without spilling (for small functions), we
+are more likely to choose a free register right away if we throw the
+dart at random than if we start *every* probe at register 0, in
+order. This has a large allocation performance impact in practice.
+
+For each register in probe order, we probe the allocation map, and
+gather, simultaneously, several results: (i) whether the entire range
+is free; (ii) if not, the list of all conflicting bundles, *and* the
+highest weight among those bundles; (iii) if not, the *first* conflict
+point.
+ +We do this by iterating over all liveranges in the preg's btree that +overlap with each range in the current bundle. This iteration is +somewhat subtle due to multiple "equal" keys (see above where we +describe the use of the btree). It is also adaptive for performance +reasons: it initially obtains an iterator into the btree corresponding +to the start of the first range in the bundle, and concurrently +iterates through both the btree and the bundle. However, if there is a +large gap in the bundle, this might require skipping many irrelevant +entries in the btree. So, if we skip too many entries (heuristically, +16, right now), we do another lookup from scratch in the btree for the +start of the next range in the bundle. This balances between the two +cases: dense bundle, where O(1) iteration through the btree is faster, +and sparse bundle, where O(log n) lookup for each entry is better. + +### Decision: Allocate, Evict, or Split + +First, the "allocate" case is easy: if, during our register probe +loop, we find a physical register whose allocations do not overlap +this bundle, then we allocate this register; done! + +If not, then we need to decide whether to evict some conflicting +bundles and retry, or to split the current bundle into smaller pieces +that may have better luck. + +A bit about our split strategy first: contrary to the IonMonkey +allocator which inspired much of our design, we do *not* have a list +of split strategies that split one bundle into many pieces at +once. Instead, each iteration of the allocation loop splits at most +*once*. This simplifies the splitting code greatly, but also turns out +to be a nice heuristic: we split at the point that the bundle first +encounters a conflict for a particular preg assignment, then we hint +that preg for the first (pre-conflict) piece when we retry. In this +way, we always make forward progress -- one piece of the bundle is +always allocated -- and splits are informed by the actual situation at +hand, rather than best guesses. Also note that while this may appear +at first to be a greedy algorithm, it still allows backtracking: the +first half of the split bundle, which we *can* now assign to a preg, +does not necessarily remain on that preg forever (it can still be +evicted later). It is just a split that is known to make at least one +part of the allocation problem solvable. + +To determine whether to split or evict, we track our best options: as +we probe, we track the "lowest cost eviction option", which is a set +of bundles and the maximum weight in that set of bundles. We also +track the "lowest cost split option", which is the cost (more below), +the point at which to split, and the register for this option. + +For each register we probe, if there is a conflict but none of the +conflicts are fixed allocations, we receive a list of bundles that +conflicted, and also separately, the first conflicting program +point. We update the lowest-cost eviction option if the cost (max +weight) of the conflicting bundles is less than the current best. We +update the lowest-cost split option if the cost is less as well, +according to the following definition of cost: a split's cost is the +cost of its move, as defined by the weight of a normal def operand at +the split program point, plus the cost of all bundles beyond the split +point (which will still be conflicts even after the split). 
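+
+A sketch of the bookkeeping this implies during the probe loop (field
+and type names here are hypothetical):
+
+```rust
+struct BestOptions {
+    // Initialized to u32::MAX / None before probing begins.
+    evict_cost: u32,       // lowest max-weight among conflict sets seen
+    evict_preg: Option<u8>,
+    split_cost: u32,       // lowest split cost seen so far
+    split_at: Option<u32>, // first-conflict ProgPoint for that option
+    split_preg: Option<u8>,
+}
+
+impl BestOptions {
+    fn note_conflict(
+        &mut self,
+        preg: u8,
+        max_conflict_weight: u32,
+        any_conflict_is_fixed: bool,
+        first_conflict_point: u32,
+        split_cost: u32,
+    ) {
+        // Eviction is an option only if no conflict is a fixed reservation.
+        if !any_conflict_is_fixed && max_conflict_weight < self.evict_cost {
+            self.evict_cost = max_conflict_weight;
+            self.evict_preg = Some(preg);
+        }
+        // A split candidate exists either way; keep the cheapest one.
+        if split_cost < self.split_cost {
+            self.split_cost = split_cost;
+            self.split_at = Some(first_conflict_point);
+            self.split_preg = Some(preg);
+        }
+    }
+}
+```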
+ +If there is a conflict with a fixed allocation, then eviction is not +an option, but we can still compute the candidate split point and cost +in the same way as above. + +Finally, as an optimization, we pass in the current best cost to the +btree probe inner loop; if, while probing, we have already exceeded +the best cost, we stop early (this improves allocation time without +affecting the result). + +Once we have the best cost for split and evict options, we split if +(i) the bundle is not already a minimal bundle, and (ii) we've already +evicted once in this toplevel iteration without success, or the weight +of the current bundle is less than the eviction cost. We then requeue +*both* resulting halves of the bundle with the preg that resulted in +this option as the register hint. Otherwise, we evict all conflicting +bundles and try again. + +Note that the split cost does not actually play into the above (split +vs. evict) decision; it is only used to choose *which* split is +best. This is equivalent to saying: we never evict if the current +bundle is less important than the evicted bundles, even if the split +is more expensive still. This is important for forward progress, and +the case where the split would be even more expensive should be very +very rare (it would have to come from a costly move in the middle of +an inner loop). + +### How to Split + +The actual split procedure is fairly simple. We are given a bundle and +a split-point. We create a new bundle to take on the second half +("rest") of the original. We find the point in the liverange list that +corresponds to the split, and distribute appropriately. If the +split-point lands in the middle of a liverange, then we split that +liverange as well. + +In the case that a new liverange is created, we add the liverange to +the corresponding vreg liverange list as well. Note that, as described +above, the vreg's liverange list is unsorted while splitting is +occurring (because we do not need to traverse it or do any lookups +during this phase); so we just append. + +The splitting code also supports a "minimal split", in which it simply +peels off the first use. This is used to ensure forward progress when +a bundle has conflicting requirements within it (see above). + +#### Spill Bundle and Splitting + +Once a split occurs, however, it turns out that we can improve results +by doing a little cleanup. Once we distribute a bundle's liveranges +across two half-bundles, we postprocess by trimming a bit. + +In particular, if we see that the "loose ends" around the split point +extend beyond uses, we will create and move ranges to a spill +bundle. That is: if the last liverange in the first-half bundle +extends beyond its last use, we trim that part off into an empty (no +uses) liverange and place that liverange in the spill +bundle. Likewise, if the first liverange in the second-half bundle +starts before its first use, we trim that part off into an empty +liverange and place it in the spill bundle. + +This is, empirically, an improvement: it reduces register contention +and makes splitting more effective. 
The intuition is twofold: (i) it
+is better to put all of the "flow-through" parts of a vreg's liveness
+into one bundle that is never split, and can be spilled to the stack
+if needed, to avoid unnecessary moves; and (ii) if contention is high
+enough to cause splitting, it is more likely there will be an actual
+stack spill, and if so, it is better to do the store just after the
+last use of one bundle and the reload just before the first use of the
+next.
+
+Unfortunately, this heuristic choice does interact somewhat poorly
+with program moves: moves between two normal (non-pinned) vregs do not
+create ghost uses or defs, and so these points of the ranges can be
+spilled, turning a normal register move into a move from or to the
+stack. However, empirically, we have found that adding such ghost
+uses/defs actually regresses some cases as well, because it pulls
+values back into registers when we could have had a stack-to-stack
+move (which might even be a no-op if both sides share the same
+spillset); overall, it seems better to trim. Trimming also improves
+allocation performance by reducing contention in the registers during
+the core loop (before second-chance allocation).
+
+## Second-Chance Allocation: Spilled Bundles
+
+Once the main allocation loop terminates, when all bundles have either
+been allocated or punted to the "spilled bundles" list, we do
+second-chance allocation. This is a simpler loop that never evicts and
+never splits. Instead, each bundle gets one second chance, in which it
+can probe pregs and attempt to allocate. If it fails, it will actually
+live on the stack.
+
+This is correct because we are careful to place on the spilled-bundles
+list only bundles that are *allowed* to live on the
+stack. Specifically, only the canonical spill bundles (which will
+contain only empty ranges) and other bundles that have an "any" or
+"unknown" requirement are placed here (but *not* bundles with "stack"
+requirements; those *must* be on the stack, so they do not undergo
+second-chance allocation).
+
+At the end of this process, we have marked spillsets as required
+whenever at least one bundle in the spillset actually requires a stack
+slot. We can then allocate slots to the spillsets.
+
+## Spillslot Allocation
+
+We must allocate space on the stack, denoted by an abstract index
+space, to each spillset that requires it, covering the liveranges
+during which it is required.
+
+To facilitate this, we keep a btree per spillslot in the same way we
+do per preg. We will allocate spillsets to slots in a way that avoids
+interference.
+
+Note that we actually overapproximate the required ranges for each
+spillset in order to improve the behavior of a later phase
+(redundant-move elimination). Specifically, when we allocate a slot
+for a spillset, we reserve that slot for *all* of the liveranges of
+*every* vreg that is assigned to that spillset (the merging rules that
+build final bundles from initial one-vreg bundles guarantee that there
+will be no overlaps here). In other words, we rule out interleaving of
+completely different values in the same slot, though bundle merging
+does mean that potentially many (non-interfering) vregs may share
+it. This provides the important property that if a vreg has been
+reloaded, but not modified, its spillslot *still contains the
+up-to-date value* (because the slot is reserved for all liveranges of
+the vreg). This enables us to avoid another store to the spillslot
+later if there is another spilled range.
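+
+As an illustration, here is one way the per-slot structure and the
+"reserve everything" rule could look. This is a simplified sketch with
+invented names; the real code presumably shares the liverange-key
+machinery described earlier for the preg btrees:
+
+```rust
+use std::collections::BTreeMap;
+
+// Half-open [from, to) ranges, keyed by start point. Reserved ranges in
+// one slot never overlap, which is what makes the lookup below valid.
+struct SpillSlot {
+    ranges: BTreeMap<u32, u32>,
+}
+
+impl SpillSlot {
+    fn overlaps(&self, from: u32, to: u32) -> bool {
+        // Because reserved ranges are disjoint, the only candidate for an
+        // overlap is the range with the greatest start below `to`.
+        self.ranges
+            .range(..to)
+            .next_back()
+            .map_or(false, |(_, &end)| end > from)
+    }
+
+    // Reserve *all* ranges of *all* vregs in the spillset, or nothing.
+    fn try_reserve(&mut self, spillset_ranges: &[(u32, u32)]) -> bool {
+        if spillset_ranges.iter().any(|&(f, t)| self.overlaps(f, t)) {
+            return false;
+        }
+        for &(f, t) in spillset_ranges {
+            self.ranges.insert(f, t);
+        }
+        true
+    }
+}
+```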
+
+We perform probing in a way that differs somewhat from register
+probing, because the spillslot space is conceptually infinite. We can
+thus trade a little space for allocation speed: we are free to give up
+and allocate a fresh slot at any time.
+
+For each size class, we keep a linked list of slots. When we need to
+allocate a spillset to a slot, we traverse the list and try a fixed
+number of slots. If we find one that fits the spillset's ranges, we
+allocate, remove the slot from its current place in the list, and
+append it to the end. In this way, it is deprioritized from probing
+"for a while", which tends to reduce contention. This is a simple way
+to round-robin between slots. If we don't find one that fits after a
+fixed number of probes, we allocate a new slot.
+
+And with that, we have valid allocations for all vregs for all points
+at which they are live! Now we just need to modify the program to
+reify these choices.
+
+## Allocation Assignment
+
+The first step in reifying the allocation is to iterate through all
+mentions of a vreg and fill in the resulting `Allocation` array with
+the appropriate allocations. We do this by simply traversing the
+liveranges of each vreg, looking up the allocation via the bundle (or
+the spillset, if the bundle has no specific allocation), and, for each
+use, filling in the slot according to the saved progpoint/slot info in
+the use data.
+
+## Move Generation
+
+The more difficult half of the reification step is generating the
+*moves* that will put the values in the right spots.
+
+There are two sources of moves that we must generate. The first are
+moves between different ranges of the same vreg, as the split pieces
+of that vreg's original bundle may have been assigned to different
+locations. The second are moves that result from move semantics in the
+input program: either assignments from blockparam args on branches to
+the target block's params, or program move instructions. (Recall that
+we reify program moves in a unified way with all other moves, so the
+client should not generate any machine code for the original moves in
+the pre-allocation program.)
+
+Moves are tricky to handle efficiently because they join two
+potentially very different locations in the program (in the case of
+control-flow-edge moves). In order to avoid the need for random
+lookups, which are a cache-locality nightmare even if we have O(log n)
+lookups, we instead take a scan-sort-scan approach.
+
+First, we scan over each vreg's liveranges, find the allocation for
+each, and for each move that comes *to* or *from* this liverange,
+generate a "half-move". The key idea is that we generate a record for
+each "side" of the move, and these records are keyed in a way that,
+after a sort, the "from" and "to" ends will be consecutive. We can
+sort the list of half-moves once (this is expensive, but not as
+expensive as many separate pointer-chasing lookups), then scan it
+again to actually generate the move instructions.
+
+To enable the sort to work, half-moves are sorted by a key that is
+equivalent to the tuple (from-block, to-block, to-vreg, kind), where
+`kind` is "source" or "dest". For each key, the payload is an
+allocation. The fields in this tuple are carefully chosen: we know all
+of them at every location where we generate a half-move, without
+expensive lookups, and sorting by this key will make the source and
+all dests (there can be more than one) contiguous in the final order.
+
+Half-moves are generated in several situations.
First, at the start
+of every block covered by a liverange, we can generate "dest"
+half-moves for blockparams, and at the end of every block covered by a
+liverange, we can generate "source" half-moves for blockparam args on
+branches. Incidentally, this is the reason that `blockparam_ins` and
+`blockparam_outs` are sorted tuple-lists whose tuples begin with
+(vreg, block, ...): this is the order in which we do the toplevel scan
+over allocations.
+
+Second, at every block edge, if the vreg is live in any pred (at
+block-start) or succ (at block-end), we generate a half-move to
+transfer the vreg to its own location in the connected block.
+
+This completes the "edge-moves". We sort the half-move array and then
+have all of the alloc-to-alloc pairs on a given (from-block, to-block)
+edge.
+
+There are also two kinds of moves that happen within blocks. First,
+when a liverange ends and another begins for the same vreg in the same
+block (i.e., a split in the middle of a block), we know both sides of
+the move immediately (because it is the same vreg and we can look up
+the adjacent allocation easily), and we can generate that move.
+
+Second, program moves occur within blocks. Here we do something
+similar to the block-edge half-moves, but keyed on program point
+instead. This is why the `prog_move_srcs` and `prog_move_dsts` arrays
+are initially sorted by their (vreg, inst) keys: we can directly fill
+in their allocation slots during our main scan. Note that when sorted
+this way, the source and dest for a given move instruction will be at
+different indices. After the main scan, we *re-sort* the arrays by
+just the instruction, so the two sides of a move line up at the same
+index; we can then traverse both arrays, zipped together, and generate
+moves.
+
+Finally, we generate moves to fix up multi-fixed-reg-constraint
+situations and to make reused inputs work, as described earlier.
+
+## Move Resolution
+
+Throughout this discussion, we have spoken of "generating moves", but
+we have not said what that means concretely. Note that in many cases,
+there are several moves at a particular program point that
+semantically happen *in parallel*. For example, if multiple vregs
+change allocations between two instructions, all of those moves happen
+as part of one parallel permutation. Similarly, blockparams have
+parallel-assignment semantics. We thus enqueue all the moves that we
+generate at program points and resolve them into lists of sequential
+moves that can actually be lowered to move instructions in the machine
+code.
+
+First, a word on *move priorities*. There are different kinds of moves
+that are generated between instructions, and we have to ensure that
+some happen before others, i.e., *not* in parallel. For example, a
+vreg might change allocation (due to a split) before an instruction,
+then be copied to an output register for an output with a reused-input
+policy. The latter move must happen *after* the vreg has been moved
+into its location for this instruction.
+
+To enable this, we define "move priorities", which are a logical
+extension of program points (i.e., they are sub-points) that enable
+finer-grained ordering of moves. We currently have the following
+priorities:
+
+- In-edge moves, to place edge-moves before the first instruction in a
+  block.
+- Block-param metadata, used for the checker only.
+- Regular, used for vreg movement between allocations.
+- Post-regular, used for checker metadata related to pinned-vreg moves.
+- Multi-fixed-reg, used for moves that handle the
+  single-vreg-in-multiple-fixed-pregs constraint case.
+- Reused-input, used for implementing outputs with reused-input policies.
+- Out-edge moves, to place edge-moves after the last instruction
+  (prior to the branch) in a block.
+
+Every move is statically given one of these priorities by the code
+that generates it.
+
+We collect moves with (prog-point, prio) keys, and we sort by those
+keys. We then have, for each such key, a list of moves that
+semantically happen in parallel.
+
+We then resolve those moves using a parallel-move resolver, as we now
+describe.
+
+### Parallel-Move Resolver
+
+The fundamental issue that arises when resolving parallel moves to
+sequential moves is *overlap*: some of the moves may overwrite
+registers that other moves use as sources. We must carefully order
+moves so that this does not clobber values incorrectly.
+
+We first check whether such overlap occurs. If it does not (this is
+actually the most common case), the list of parallel moves can be
+emitted as sequential moves directly. Done!
+
+Otherwise, we have to order the moves carefully. Furthermore, if there
+is a *cycle* anywhere among the moves, we will need a scratch
+register. (Consider, e.g., t0 := t1 and t1 := t0 in parallel: with
+only move instructions and no direct "exchange" instruction, we cannot
+reify this without a third register.)
+
+We first compute a mapping from each move instruction to the move
+instruction, if any, that it must precede. Note that there can be only
+one such move for a given move, because each destination can be
+written only once; so a move can be constrained to precede only the
+single move that overwrites its source. (This will be important in a
+bit!)
+
+Our task is now to find an ordering of moves that respects these
+dependencies. To do so, we perform a depth-first search on the graph
+induced by the dependencies, which will generate a list of sequential
+moves in reverse order. We keep a stack of moves; we start with any
+move that has not been visited yet. In each iteration, if the
+top-of-stack move has no out-edge to another move (i.e., it does not
+need to come before any other), we push it to a result vector,
+followed by all others on the stack (in popped order). If it does have
+an out-edge and the target is already visited and no longer on the
+stack (so already emitted), we likewise emit this move and the rest on
+the stack. If it has an out-edge to a move not yet visited, we push
+that move on the stack and continue. Otherwise, if it has an out-edge
+to a move currently on the stack, we have found a cycle. In this case,
+we emit the moves on the stack with a modification: the first move
+writes to a scratch register, and we emit an additional move that
+moves from the scratch to the first move's dest. This breaks the
+cycle.
+
+The astute reader may notice that this sounds like a canonical
+application of Tarjan's algorithm for finding SCCs (strongly-connected
+components). Why don't we have the full complexity of that algorithm?
+In particular, *why* can we emit the cycle *right away* once we find
+it, rather than ensuring that we have gotten all of the SCC first?
+
+The answer is that, because each move has at most *one* out-edge (a
+move can be required to precede only *one* other move), all SCCs must
+be simple cycles. This means that once we have found a cycle, no other
+nodes (moves) can be part of the SCC, because every node's single
+out-edge is already accounted for. This is what allows us to avoid a
+fully general SCC algorithm.
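+
+The following is a compact, self-contained sketch of this DFS. The
+names and the plain-`u32` location representation are invented for
+illustration (the real resolver works over `Allocation`s and carries
+extra metadata), and self-moves are assumed to have been filtered out:
+
+```rust
+// Sequentialize parallel moves `(src, dst)`, where each `dst` is written
+// at most once and `scratch` is a location no move touches.
+fn sequentialize(moves: &[(u32, u32)], scratch: u32) -> Vec<(u32, u32)> {
+    #[derive(Clone, Copy, PartialEq)]
+    enum Mark { Unvisited, OnStack, Done }
+    // Each move has at most one out-edge: the unique move (if any) that
+    // overwrites its source, which it must precede.
+    let succ: Vec<Option<usize>> = moves
+        .iter()
+        .map(|&(src, _)| moves.iter().position(|&(_, dst)| dst == src))
+        .collect();
+    let mut mark = vec![Mark::Unvisited; moves.len()];
+    let mut rev: Vec<(u32, u32)> = Vec::new(); // built in reverse order
+    for start in 0..moves.len() {
+        if mark[start] != Mark::Unvisited {
+            continue;
+        }
+        let mut stack = vec![start];
+        mark[start] = Mark::OnStack;
+        while let Some(&top) = stack.last() {
+            match succ[top] {
+                Some(j) if mark[j] == Mark::Unvisited => {
+                    mark[j] = Mark::OnStack;
+                    stack.push(j);
+                }
+                Some(j) if mark[j] == Mark::OnStack => {
+                    // Cycle: stack[m..] is a simple cycle through `j`. Emit
+                    // (in reverse) `dst_j := scratch`, then the cycle body,
+                    // then `scratch := src_j`, which breaks the cycle.
+                    let m = stack.iter().position(|&x| x == j).unwrap();
+                    rev.push((scratch, moves[j].1));
+                    while stack.len() > m + 1 {
+                        let i = stack.pop().unwrap();
+                        mark[i] = Mark::Done;
+                        rev.push(moves[i]);
+                    }
+                    stack.pop();
+                    mark[j] = Mark::Done;
+                    rev.push((moves[j].0, scratch));
+                }
+                // No out-edge, or successor already emitted: emit the whole
+                // stack in popped order.
+                _ => {
+                    while let Some(i) = stack.pop() {
+                        mark[i] = Mark::Done;
+                        rev.push(moves[i]);
+                    }
+                }
+            }
+        }
+    }
+    rev.reverse();
+    rev
+}
+```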
+ +Once the list of moves in-reverse has been constructed, we reverse it +and return. + +Note that this "move resolver" is fuzzed separately with a simple +symbolic move simulator (the `moves` fuzz-target). + +### Stack-to-Stack Moves + +There is one potentially difficult situation that could arise from the +move-resolution logic so far: if a vreg moves from one spillslot to +another, this implies a memory-to-memory move, which most machine +architectures cannot handle natively. It would be much nicer if we +could ensure within the regalloc that this never occurs. + +This is in fact possible to do in a postprocessing step. We iterate +through the sequential moves, tracking whether the scratch register is +in use (has been written). When we see a stack-to-stack move: (i) if +the scratch register is not in use, generate a stack-to-scratch move +and scratch-to-stack move; otherwise, (ii) if the scratch register is +in use, allocate an "extra spillslot" if one has not already been +allocated, move the scratch reg to that, do the above stack-to-scratch +/ scratch-to-stack sequence, then reload the scratch reg from the +extra spillslot. + +## Redundant-Move Elimination + +As a final step before returning the list of program edits to the +client, we perform one optimization: redundant-move elimination. + +To understand the need for this, consider what will occur when a vreg +is (i) defined once, (ii) used many times, and (iii) spilled multiple +times between some of the uses: with the design described above, we +will move the value from the preg to the stack after every segment of +uses, and then reload it when the next use occurs. However, only the +first spill is actually needed; as we noted above, we allocate +spillslots so that the slot that corresponded to the vreg at the first +spill will always be reserved for that vreg as long as it is live. If +no other defs or mods occur, the value in the slot can be reloaded, +and need not be written back every time. + +This inefficiency is a result of our invariant that a vreg lives in +exactly one place at a time, and these locations are joined by +moves. This is a simple and effective design to use for most of the +allocation pipeline, but falls flat here. It is especially inefficient +when the unnecessary spill occurs in an inner loop. (E.g.: value +defined at top of function is spilled, then used once in the middle of +an inner loop body.) + +The opposite case can also sometimes occur, though it is rarer: a +value is loaded into a register, spilled, and then reloaded into the +same register. This can happen when hinting is successful at getting +several segments of a vreg to use the same preg, but splitting has +trimmed part of the liverange between uses and put it in the spill +bundle, and the spill bundle did not get a reg. + +In order to resolve this inefficiency, we implement a general +redundant-move elimination pass. This pass tracks, for every +allocation (reg or spillslot), whether it is a copy of another +allocation. This state is invalidated whenever either that allocation +or the allocation of which it is a copy is overwritten. When we see a +move instruction, if the destination is already a copy of the source, +we elide the move. (There are some additional complexities to preserve +checker metadata which we do not describe here.) + +Note that this could, in principle, be done as a fixpoint analysis +over the CFG; it must be, if we try to preserve state across +blocks. 
This is because a location is only a copy of another if that +is true on every incoming edge. However, to avoid the cost and +complexity of doing such an analysis, we instead take the much simpler +approach of doing only an intra-block analysis. This turns out to be +sufficient to remove most redundant moves, especially in the common +case of a single use of an otherwise-spilled value. + +Note that we could do better *if* we accepted only SSA code, because +we would know that a value could not be redefined once written. We +should consider this again once we clean up and remove the non-SSA +support. + +# Future Plans + +## SSA-Only Cleanup + +When the major user (Cranelift via the regalloc.rs shim) migrates to +generate SSA code and native regalloc2 operands, there are many bits +of complexity we can remove, as noted throughout this +writeup. Briefly, we could (i) remove special handling of program +moves, (ii) remove the pinned-vreg hack, (iii) simplify redundant-move +elimination, (iv) remove special handling of "mod" operands, and (v) +probably simplify plenty of code given the invariant that a def always +starts a range. + +More importantly, we expect this change to result in potentially much +better allocation performance. The use of special pinned vregs and +moves to/from them instead of fixed-reg constraints, explicit moves +for every reused-input constraint, and already-sequentialized series +of move instructions on edges for phi nodes, are all expensive ways of +encoding regalloc2's native input primitives that have to be +reverse-engineered. Removing that translation layer would be +ideal. Also, allowing regalloc2 to handle phi-node (blockparam) +lowering in a way that is integrated with other moves will likely +generate better code than the way that program-move handling interacts +with Cranelift's manually lowered phi-moves at the moment. + +## Better Split Heuristics + +We have spent quite some effort trying to improve splitting behavior, +and it is now generally decent, but more work could be done here, +especially with regard to the interaction between splits and the loop +nest. + +## Native Debuginfo Output + +Cranelift currently computes value locations (in registers and +stack-slots) for detailed debuginfo with an expensive post-pass, after +regalloc is complete. This is because the existing register allocator +does not support returning this information directly. However, +providing such information by generating it while we scan over +liveranges in each vreg would be relatively simple, and has the +potential to be much faster and more reliable for Cranelift. We should +investigate adding an interface for this to regalloc2 and using it. + +# Appendix: Comparison to IonMonkey Allocator + +There are a number of differences between the [IonMonkey +allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp) +and this one. While this allocator initially began as an attempt to +clone IonMonkey's, it has drifted significantly as we optimized the +design (especially after we built the regalloc.rs shim and had to +adapt to its code style); it is easier at this point to name the +similarities than the differences. + +* The core abstractions of "liverange", "bundle", "vreg", "preg", and + "operand" (with policies/constraints) are the same. + +* The overall allocator pipeline is the same, and the top-level + structure of each stage should look similar. 
Both allocators begin
+  by computing liveranges, then merging bundles, then handling bundles
+  and splitting/evicting as necessary, then doing second-chance
+  allocation, then reifying the decisions.
+
+* The cost functions are very similar, though the heuristics that make
+  decisions based on them are not.
+
+Several notable high-level differences are:
+
+* There are [many different fuzz targets](fuzz/fuzz_targets/) that
+  exercise the allocator, including a full symbolic checker
+  (`ion_checker` target) based on the [symbolic checker in
+  regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
+  and, e.g., a targeted fuzzer for the parallel move-resolution
+  algorithm (`moves`) and the SSA generator used for generating cases
+  for the other fuzz targets (`ssagen`).
+
+* The data-structure invariants are simplified. While the IonMonkey
+  allocator allowed for LiveRanges and Bundles to overlap in certain
+  cases, this allocator sticks to a strict invariant: ranges do not
+  overlap in bundles, and bundles do not overlap. There are other
+  examples too: e.g., the definition of minimal bundles is very simple
+  and does not depend on scanning the code at all. In general, we
+  should be able to state simple invariants and see by inspection (as
+  well as fuzzing -- see above) that they hold.
+
+* The data structures themselves are simplified. Where IonMonkey uses
+  linked lists in many places, this allocator stores simple inline
+  smallvecs of liveranges on bundles and vregs, and smallvecs of uses
+  on liveranges. We also (i) find a way to construct liveranges
+  in-order immediately, without any need for splicing, unlike
+  IonMonkey, and (ii) relax sorting invariants where possible to allow
+  for cheap append operations in many cases.
+
+* The splitting heuristics are significantly reworked. Whereas
+  IonMonkey has an all-at-once approach to splitting an entire bundle,
+  and has a list of complex heuristics to choose where to split, this
+  allocator does conflict-based splitting, and tries to decide whether
+  to split or evict and which split to take based on cost heuristics.
+
+* The liverange computation is exact, whereas IonMonkey approximates
+  using a single-pass algorithm that makes vregs live across entire
+  loop bodies. We have found that precise liveness improves allocation
+  performance and generated code quality, even though the liveness
+  itself is slightly more expensive to compute.
+
+* Many of the algorithms in the IonMonkey allocator are built with
+  helper functions that do linear scans. These "small quadratic" loops
+  are likely not a huge issue in practice, but nevertheless have the
+  potential to be so in corner cases. As much as possible, all work in
+  this allocator is done in linear scans.
+
+* There are novel schemes for solving certain interesting design
+  challenges. One example: in IonMonkey, liveranges are connected
+  across blocks by looking up, when a scan reaches one end of a
+  control-flow edge, the allocation at the other end. This is in
+  principle a linear lookup (so quadratic overall). We instead
+  generate a list of "half-moves", keyed on the edge and from/to
+  vregs, with each holding one of the allocations. By sorting and then
+  scanning this list, we can generate all edge moves in one linear
+  scan.
There are a number of other examples of simplifications: for
+  example, we handle multiple conflicting
+  physical-register-constrained uses of a vreg in a single instruction
+  by recording a copy to perform in a side-table, then removing the
+  constraints for the core regalloc. Ion instead has to tweak its
+  definition of minimal bundles and create two liveranges that overlap
+  (!) to represent the two uses.
+
+* Using block parameters rather than phi-nodes significantly
+  simplifies handling of inter-block data movement. IonMonkey had to
+  special-case phis in many ways because they are actually quite
+  weird: their uses happen semantically in other blocks, and their
+  defs happen in parallel at the top of the block. Block parameters
+  naturally and explicitly represent these semantics in a direct way.
+
+* The allocator supports irreducible control flow and arbitrary block
+  ordering (its only CFG requirement is that critical edges are
+  split).
+
+* The allocator supports non-SSA code, and has native support for
+  handling program moves specially.
+
+# Appendix: Performance-Tuning Lessons
+
+In the course of optimizing the allocator's performance, we found a
+number of general principles:
+
+* We got substantial performance speedups from using vectors rather
+  than linked lists everywhere. This is well-known, but nevertheless,
+  it took some thought to work out how to avoid the need for any
+  splicing, and it turns out that even when our design is slightly
+  less efficient asymptotically (e.g., append-and-re-sort rather than
+  linear-time merge of two sorted liverange lists when merging
+  bundles), it is faster.
+
+* We initially used a direct translation of IonMonkey's splay tree as
+  an allocation map for each PReg. This turned out to be significantly
+  (!) less efficient than Rust's built-in BTree data structures, for
+  the usual cache-efficiency vs. pointer-chasing reasons.
+
+* We initially used dense bitvecs, as IonMonkey does, for
+  livein/liveout bits. It turned out that a chunked sparse design (see
+  below) was much more efficient.
+
+* Precise liveness significantly improves performance because it
+  reduces the size of liveranges (i.e., interference), and probing
+  registers with liveranges is the most significant hot inner
+  loop. Paying a fraction of a percent of runtime for the iterative
+  dataflow algorithm to get precise bitsets is more than worth it.
+
+* The randomized probing of registers was a huge win: as above, the
+  probing is very expensive, and reducing the average number of probes
+  it takes to find a free register is very important.
+
+* In general, single-pass algorithms and design of data structures to
+  enable them are important. For example, the half-move technique
+  avoids the need to do any O(log n) search at all, and is relatively
+  cache-efficient. As another example, a side-effect of the precise
+  liveness was that we could then process operands within blocks in
+  actual instruction order (in reverse), which allowed us to simply
+  append liveranges to in-progress vreg liverange lists and then
+  reverse at the end. The expensive part is a single pass; only the
+  bitset computation is a fixpoint loop.
+
+* Sorts are better than always-sorted data structures (like btrees):
+  they amortize all the comparison and update cost to one phase, and
+  this phase is much more cache-friendly than a bunch of spread-out
+  updates.
+
+* Take care of basic data structures and their operator definitions!
We initially used the auto-derived comparator on ProgPoint, and let
+  ProgPoint be a normal struct (with a u32 inst index and a
+  Before/After enum). The comparator for this, used in many sorting
+  inner loops, was a compound thing with conditionals. Instead, we now
+  pack both into a single u32 and do a simple integer compare (and
+  save half the memory as well). Likewise, the half-move key is a
+  single value packed in a u64; this is far more efficient than the
+  tuple comparator on a 4-tuple, and the half-move sort (which can be
+  a few percent or more of total allocation time) became multiple
+  times cheaper.
+
+# Appendix: Data Structure: Chunked Sparse BitVec
+
+We use a "chunked sparse bitvec" to store liveness information, which
+is just a set of VReg indices. The design is fairly simple: the
+toplevel is a HashMap from "chunk" to a `u64`, and each `u64`
+represents 64 contiguous indices.
+
+The intuition is that while the vreg sets are likely sparse overall,
+they will probably be dense within small regions of the index
+space. For example, in the Nth block in a function, the values that
+flow from block N-1 will largely be almost-contiguous vreg indices, if
+vregs are allocated in sequence down the function body. Or, at least,
+they will be some local vregs together with a few defined at the top
+of the function; two separate chunks will cover that.
+
+We tried a number of other designs as well. Initially we used a simple
+dense bitvec, but this was prohibitively expensive: O(n^2) space when
+the real need is closer to O(n) (i.e., a classic sparse matrix). We
+also tried a hybrid scheme that kept a list of indices when small and
+used either a bitvec or a hashset when large. This did not perform as
+well because (i) it was less memory-efficient (the chunking helps with
+this) and (ii) insertions are more expensive when they always require
+a full hashset/hashmap insert.
+
+# Appendix: Fuzzing
+
+We have five fuzz targets: `ssagen`, `domtree`, `moves`, `ion`, and
+`ion_checker`.
+
+## SSAGen
+
+The SSAGen target tests our SSA generator, which generates cases for
+the full allocator fuzz targets. The SSA generator is careful to
+always generate a valid CFG, with split critical edges, and valid SSA,
+so that we never have to throw out a test input before we reach the
+allocator itself. (An alternative fuzzing approach randomly generates
+programs and then throws out those that do not meet certain conditions
+before using them as legitimate testcases; this is much simpler, but
+less efficient.)
+
+To generate a valid CFG, with no unreachable blocks and with no
+critical edges, the generator glues together units of either one or
+three blocks (A->B, A->C), each forming either a straight-through
+section or a conditional. These are glued together into a "spine", and
+the conditionals (the "C" block), where they exist, are then linked to
+a random target block chosen among the main blocks of these one- or
+three-block units. The targets are chosen either randomly, for
+potentially irreducible CFGs, or in a way that ensures proper nesting
+of loop backedges, if a structured CFG is requested.
+
+SSA is generated by first choosing which vregs will be defined in each
+block, and which will be defined as blockparams vs. instruction
+defs. Instructions are then generated, with operands chosen among the
+"available" vregs: those defined so far in the current block and all
+of those in any other block that dominates this one.
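+
+A minimal sketch of that availability rule (the names and flat data
+layout here are invented for illustration; the real generator lives in
+the fuzzing support code):
+
+```rust
+// Operands may use any vreg defined earlier in the current block, or any
+// vreg defined in a block that strictly dominates the current block.
+fn available_vregs(
+    defs_so_far_in_block: &[u32],
+    defs_per_block: &[Vec<u32>],
+    strict_dominators: &[Vec<usize>], // per block, its strict dominators
+    block: usize,
+) -> Vec<u32> {
+    let mut avail = defs_so_far_in_block.to_vec();
+    for &dom in &strict_dominators[block] {
+        avail.extend_from_slice(&defs_per_block[dom]);
+    }
+    avail
+}
+```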
+ +The SSAGen fuzz target runs the above code generator against an SSA +validator, and thus ensures that it will only generate valid SSA code. + +## Domtree + +The `domtree` fuzz target computes dominance using the algorithm that +we use elsewhere in our CFG analysis, and then walks a +randomly-generated path through the CFG. It checks that the dominance +definition ("a dom b if any path from entry to b must pass through a") +is consistent with this particular randomly-chosen path. + +## Moves + +The `moves` fuzz target tests the parallel move resolver. It generates +a random sequence of parallel moves, careful to ensure that each +destination is written only once. It then runs the parallel move +resolver, and then *abstractly interprets* the resulting sequential +series of moves, thus determining which inputs flow to which +outputs. This must match the original set of parallel moves. + +## Ion and Ion-checker + +The `ion` fuzz target runs the allocator over test programs generated +by SSAGen. It does not validate the output; it only tests that the +allocator runs to completion and does not panic. This was used mainly +during development, and is now less useful than the checker-based +target. + +The `ion_checker` fuzz target runs the allocator's result through a +symbolic checker, which is adapted from the one developed for +regalloc.rs (see [this blog +post](https://cfallin.org/blog/2021/01/22/cranelift-isel-2/) for more +details). This is the most useful fuzz target in the fuzzing suite, +and has found many bugs in development. diff --git a/src/lib.rs b/src/lib.rs index 7d55624..3a6ecb4 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,5 +1,5 @@ /* - * The fellowing license applies to this file, which derives many + * The following license applies to this file, which derives many * details (register and constraint definitions, for example) from the * files `BacktrackingAllocator.h`, `BacktrackingAllocator.cpp`, * `LIR.h`, and possibly definitions in other related files in