Add design document.
138
README.md
@@ -1,139 +1,19 @@
 ## regalloc2: another register allocator
 
-This is a register allocator that started life as, and is about 75%
+This is a register allocator that started life as, and is about 50%
 still, a port of IonMonkey's backtracking register allocator to
-Rust. The data structures and invariants have been simplified a little
-bit, and the interfaces made a little more generic and reusable. In
-addition, it contains substantial amounts of testing infrastructure
+Rust. In many regards, it has been generalized, optimized, and
+improved since the initial port, and now supports both SSA and non-SSA
+use-cases.
+
+In addition, it contains substantial amounts of testing infrastructure
 (fuzzing harnesses and checkers) that does not exist in the original
 IonMonkey allocator.
 
-### Design Overview
-
-TODO
+See the [design overview](doc/DESIGN.md) for (much!) more detail on
+how the allocator works.
+
+## License
 
-- SSA with blockparams
-
-- Operands with constraints, and clobbers, and reused regs; contrast
-with regalloc.rs approach of vregs and pregs and many moves that get
-coalesced/elided
-
-### Differences from IonMonkey Backtracking Allocator
-
-There are a number of differences between the [IonMonkey
-allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp)
-and this one:
-
-* Most significantly, there are [fuzz/fuzz_targets/](many different
-fuzz targets) that exercise the allocator, including a full symbolic
-checker (`ion_checker` target) based on the [symbolic checker in
-regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
-and, e.g., a targetted fuzzer for the parallel move-resolution
-algorithm (`moves`) and the SSA generator used for generating cases
-for the other fuzz targets (`ssagen`).
-
-* The data-structure invariants are simplified. While the IonMonkey
-allocator allowed for LiveRanges and Bundles to overlap in certain
-cases, this allocator sticks to a strict invariant: ranges do not
-overlap in bundles, and bundles do not overlap. There are other
-examples too: e.g., the definition of minimal bundles is very simple
-and does not depend on scanning the code at all. In general, we
-should be able to state simple invariants and see by inspection (as
-well as fuzzing -- see above) that they hold.
-
-* Many of the algorithms in the IonMonkey allocator are built with
-helper functions that do linear scans. These "small quadratic" loops
-are likely not a huge issue in practice, but nevertheless have the
-potential to be in corner cases. As much as possible, all work in
-this allocator is done in linear scans. For example, bundle
-splitting is done in a single compound scan over a bundle, ranges in
-the bundle, and a sorted list of split-points.
-
-* There are novel schemes for solving certain interesting design
-challenges. One example: in IonMonkey, liveranges are connected
-across blocks by, when reaching one end of a control-flow edge in a
-scan, doing a lookup of the allocation at the other end. This is in
-principle a linear lookup (so quadratic overall). We instead
-generate a list of "half-moves", keyed on the edge and from/to
-vregs, with each holding one of the allocations. By sorting and then
-scanning this list, we can generate all edge moves in one linear
-scan. There are a number of other examples of simplifications: for
-example, we handle multiple conflicting
-physical-register-constrained uses of a vreg in a single instruction
-by recording a copy to do in a side-table, then removing constraints
-for the core regalloc. Ion instead has to tweak its definition of
-minimal bundles and create two liveranges that overlap (!) to
-represent the two uses.
-
-* Using block parameters rather than phi-nodes significantly
-simplifies handling of inter-block data movement. IonMonkey had to
-special-case phis in many ways because they are actually quite
-weird: their uses happen semantically in other blocks, and their
-defs happen in parallel at the top of the block. Block parameters
-naturally and explicitly reprsent these semantics in a direct way.
-
-* The allocator supports irreducible control flow and arbitrary block
-ordering (its only CFG requirement is that critical edges are
-split). It handles loops during live-range computation in a way that
-is similar in spirit to IonMonkey's allocator -- in a single pass,
-when we discover a loop, we just mark the whole loop as a liverange
-for values live at the top of the loop -- but we find the loop body
-without the fixpoint workqueue loop that IonMonkey uses, instead
-doing a single linear scan for backedges and finding the minimal
-extent that covers all intermingled loops. In order to support
-arbitrary block order and irreducible control flow, we relax the
-invariant that the first liverange for a vreg always starts at its
-def; instead, the def can happen anywhere, and a liverange may
-overapproximate. It turns out this is not too hard to handle and is
-a more robust invariant. (It also means that non-SSA code *may* not
-be too hard to adapt to, though I haven't seriously thought about
-this.)
-
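The half-move scheme described in the removed README text above (record one half per edge endpoint, sort, then pair sources with destinations in a single pass) can be illustrated with a minimal Rust sketch. The names `HalfMove`, `Kind`, `EdgeMove`, and `resolve_edge_moves`, and the use of plain `u32` ids for blocks, vregs, and allocations, are assumptions made for illustration; they are not taken from regalloc2's source.

```rust
/// Which end of a control-flow edge a half-move describes.
/// `Source` is declared first so that it sorts before `Dest`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Source,
    Dest,
}

/// One half of an edge move (hypothetical layout). The derived `Ord`
/// compares fields in declaration order, so sorting groups halves by
/// (from_block, to_block, vreg) and puts the source half first.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct HalfMove {
    from_block: u32,
    to_block: u32,
    vreg: u32,
    kind: Kind,
    /// Opaque allocation id (a register or a stack slot).
    alloc: u32,
}

/// A move to be placed on the edge `from_block -> to_block`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EdgeMove {
    from_block: u32,
    to_block: u32,
    from_alloc: u32,
    to_alloc: u32,
}

/// Sort the half-moves, then pair each source with the destinations that
/// share its key in one pass over the sorted list.
fn resolve_edge_moves(mut halves: Vec<HalfMove>) -> Vec<EdgeMove> {
    halves.sort();
    let mut moves = Vec::new();
    let mut i = 0;
    while i < halves.len() {
        let src = halves[i];
        i += 1;
        if src.kind != Kind::Source {
            // A destination without a matching source: nothing to emit.
            continue;
        }
        let key = (src.from_block, src.to_block, src.vreg);
        while i < halves.len() {
            let dst = halves[i];
            if dst.kind != Kind::Dest || (dst.from_block, dst.to_block, dst.vreg) != key {
                break;
            }
            // Only emit a move when the two ends disagree on location.
            if dst.alloc != src.alloc {
                moves.push(EdgeMove {
                    from_block: src.from_block,
                    to_block: src.to_block,
                    from_alloc: src.alloc,
                    to_alloc: dst.alloc,
                });
            }
            i += 1;
        }
    }
    moves
}
```

In this sketch no per-edge lookup is needed: after the sort, every index is visited once, which is the property the text contrasts with the lookup-per-edge approach.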
-### Rough Performance Comparison with Regalloc.rs
-
-The allocator has not yet been wired up to a suitable compiler backend
-(such as Cranelift) to perform a true apples-to-apples compile-time
-and runtime comparison. However, we can get some idea of compile speed
-by running suitable test cases through the allocator and measuring
-*throughput*: that is, instructions per second for which registers are
-allocated.
-
-To do so, I measured the `qsort2` benchmark in
-[regalloc.rs](https://github.com/bytecodealliance/regalloc.rs),
-register-allocated with default options in that crate's backtracking
-allocator, using the Criterion benchmark framework to measure ~620K
-instructions per second:
-
-
-```plain
-benches/0 time: [365.68 us 367.36 us 369.04 us]
-thrpt: [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]
-```
-
-I then measured three different fuzztest-SSA-generator test cases in
-this allocator, `regalloc2`, measuring between 1.1M and 2.3M
-instructions per second (closer to the former for larger functions):
-
-```plain
-==== 459 instructions
-benches/0 time: [377.91 us 378.09 us 378.27 us]
-thrpt: [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]
-
-==== 225 instructions
-benches/1 time: [202.03 us 202.14 us 202.27 us]
-thrpt: [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]
-
-==== 21 instructions
-benches/2 time: [9.5605 us 9.5655 us 9.5702 us]
-thrpt: [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]
-```
-
-Though not apples-to-apples (SSA vs. non-SSA, completely different
-code only with similar length), this is at least some evidence that
-`regalloc2` is likely to lead to at least a compile-time improvement
-when used in e.g. Cranelift.
-
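The `thrpt` figures in the removed benchmark output above follow from dividing the instruction count by the measured time. A small sketch of that arithmetic, using a hypothetical helper `throughput_melem_per_sec` (an illustrative name, not part of Criterion or regalloc2):

```rust
/// Throughput in millions of elements (here, instructions) per second.
/// One element per microsecond equals one Melem/s, so dividing the
/// instruction count by the time in microseconds needs no extra scaling.
fn throughput_melem_per_sec(instructions: u64, time_us: f64) -> f64 {
    instructions as f64 / time_us
}
```

With the median times quoted above, 459 instructions in 378.09 us gives about 1.214 Melem/s, and 21 instructions in 9.5655 us gives about 2.195 Melem/s, which agrees with the reported `thrpt` lines.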
-### License
-
 Unless otherwise specified, code in this crate is licensed under the Apache 2.0
 License with LLVM Exception. This license text can be found in the file
1625
doc/DESIGN.md
Normal file
File diff suppressed because it is too large
@@ -1,5 +1,5 @@
 /*
-* The fellowing license applies to this file, which derives many
+* The following license applies to this file, which derives many
 * details (register and constraint definitions, for example) from the
 * files `BacktrackingAllocator.h`, `BacktrackingAllocator.cpp`,
 * `LIR.h`, and possibly definitions in other related files in