160 lines
7.4 KiB
Markdown
160 lines
7.4 KiB
Markdown
## regalloc2: another register allocator
|
|
|
|
This is a register allocator that started life as, and is about 75%
|
|
still, a port of IonMonkey's backtracking register allocator to
|
|
Rust. The data structures and invariants have been simplified a little
|
|
bit, and the interfaces made a little more generic and reusable. In
|
|
addition, it contains substantial amounts of testing infrastructure
|
|
(fuzzing harnesses and checkers) that does not exist in the original
|
|
IonMonkey allocator.
|
|
|
|
### Design Overview
|
|
|
|
TODO
|
|
|
|
- SSA with blockparams
|
|
|
|
- Operands with constraints, and clobbers, and reused regs; contrast
|
|
with regalloc.rs approach of vregs and pregs and many moves that get
|
|
coalesced/elided
|
|
|
|
### Differences from IonMonkey Backtracking Allocator
|
|
|
|
There are a number of differences between the [IonMonkey
|
|
allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp)
|
|
and this one:
|
|
|
|
* Most significantly, there are [fuzz/fuzz_targets/](many different
|
|
fuzz targets) that exercise the allocator, including a full symbolic
|
|
checker (`ion_checker` target) based on the [symbolic checker in
|
|
regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
|
|
and, e.g., a targetted fuzzer for the parallel move-resolution
|
|
algorithm (`moves`) and the SSA generator used for generating cases
|
|
for the other fuzz targets (`ssagen`).
|
|
|
|
* The data-structure invariants are simplified. While the IonMonkey
|
|
allocator allowed for LiveRanges and Bundles to overlap in certain
|
|
cases, this allocator sticks to a strict invariant: ranges do not
|
|
overlap in bundles, and bundles do not overlap. There are other
|
|
examples too: e.g., the definition of minimal bundles is very simple
|
|
and does not depend on scanning the code at all. In general, we
|
|
should be able to state simple invariants and see by inspection (as
|
|
well as fuzzing -- see above) that they hold.
|
|
|
|
* Many of the algorithms in the IonMonkey allocator are built with
|
|
helper functions that do linear scans. These "small quadratic" loops
|
|
are likely not a huge issue in practice, but nevertheless have the
|
|
potential to be in corner cases. As much as possible, all work in
|
|
this allocator is done in linear scans. For example, bundle
|
|
splitting is done in a single compound scan over a bundle, ranges in
|
|
the bundle, and a sorted list of split-points.
|
|
|
|
* There are novel schemes for solving certain interesting design
|
|
challenges. One example: in IonMonkey, liveranges are connected
|
|
across blocks by, when reaching one end of a control-flow edge in a
|
|
scan, doing a lookup of the allocation at the other end. This is in
|
|
principle a linear lookup (so quadratic overall). We instead
|
|
generate a list of "half-moves", keyed on the edge and from/to
|
|
vregs, with each holding one of the allocations. By sorting and then
|
|
scanning this list, we can generate all edge moves in one linear
|
|
scan. There are a number of other examples of simplifications: for
|
|
example, we handle multiple conflicting
|
|
physical-register-constrained uses of a vreg in a single instruction
|
|
by recording a copy to do in a side-table, then removing constraints
|
|
for the core regalloc. Ion instead has to tweak its definition of
|
|
minimal bundles and create two liveranges that overlap (!) to
|
|
represent the two uses.
|
|
|
|
* Using block parameters rather than phi-nodes significantly
|
|
simplifies handling of inter-block data movement. IonMonkey had to
|
|
special-case phis in many ways because they are actually quite
|
|
weird: their uses happen semantically in other blocks, and their
|
|
defs happen in parallel at the top of the block. Block parameters
|
|
naturally and explicitly reprsent these semantics in a direct way.
|
|
|
|
* The allocator supports irreducible control flow and arbitrary block
|
|
ordering (its only CFG requirement is that critical edges are
|
|
split). It handles loops during live-range computation in a way that
|
|
is similar in spirit to IonMonkey's allocator -- in a single pass,
|
|
when we discover a loop, we just mark the whole loop as a liverange
|
|
for values live at the top of the loop -- but we find the loop body
|
|
without the fixpoint workqueue loop that IonMonkey uses, instead
|
|
doing a single linear scan for backedges and finding the minimal
|
|
extent that covers all intermingled loops. In order to support
|
|
arbitrary block order and irreducible control flow, we relax the
|
|
invariant that the first liverange for a vreg always starts at its
|
|
def; instead, the def can happen anywhere, and a liverange may
|
|
overapproximate. It turns out this is not too hard to handle and is
|
|
a more robust invariant. (It also means that non-SSA code *may* not
|
|
be too hard to adapt to, though I haven't seriously thought about
|
|
this.)
|
|
|
|
### Rough Performance Comparison with Regalloc.rs
|
|
|
|
The allocator has not yet been wired up to a suitable compiler backend
|
|
(such as Cranelift) to perform a true apples-to-apples compile-time
|
|
and runtime comparison. However, we can get some idea of compile speed
|
|
by running suitable test cases through the allocator and measuring
|
|
*throughput*: that is, instructions per second for which registers are
|
|
allocated.
|
|
|
|
To do so, I measured the `qsort2` benchmark in
|
|
[regalloc.rs](https://github.com/bytecodealliance/regalloc.rs),
|
|
register-allocated with default options in that crate's backtracking
|
|
allocator, using the Criterion benchmark framework to measure ~620K
|
|
instructions per second:
|
|
|
|
|
|
```plain
|
|
benches/0 time: [365.68 us 367.36 us 369.04 us]
|
|
thrpt: [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]
|
|
```
|
|
|
|
I then measured three different fuzztest-SSA-generator test cases in
|
|
this allocator, `regalloc2`, measuring between 1.05M and 2.3M
|
|
instructions per second (closer to the former for larger functions):
|
|
|
|
```plain
|
|
==== 459 instructions
|
|
benches/0 time: [424.46 us 425.65 us 426.59 us]
|
|
thrpt: [1.0760 Melem/s 1.0784 Melem/s 1.0814 Melem/s]
|
|
|
|
==== 225 instructions
|
|
benches/1 time: [213.05 us 213.28 us 213.54 us]
|
|
thrpt: [1.0537 Melem/s 1.0549 Melem/s 1.0561 Melem/s]
|
|
|
|
Found 1 outliers among 100 measurements (1.00%)
|
|
1 (1.00%) high mild
|
|
==== 21 instructions
|
|
benches/2 time: [9.0495 us 9.0571 us 9.0641 us]
|
|
thrpt: [2.3168 Melem/s 2.3186 Melem/s 2.3206 Melem/s]
|
|
|
|
Found 4 outliers among 100 measurements (4.00%)
|
|
2 (2.00%) high mild
|
|
2 (2.00%) high severe
|
|
```
|
|
|
|
Though not apples-to-apples (SSA vs. non-SSA, completely different
|
|
code only with similar length), this is at least some evidence that
|
|
`regalloc2` is likely to lead to at least a compile-time improvement
|
|
when used in e.g. Cranelift.
|
|
|
|
### License
|
|
|
|
Unless otherwise specified, code in this crate is licensed under the Apache 2.0
|
|
License with LLVM Exception. This license text can be found in the file
|
|
`LICENSE`.
|
|
|
|
Files in the `src/ion/` directory are directly ported from original C++ code in
|
|
IonMonkey, a part of the Firefox codebase. Parts of `src/lib.rs` are also
|
|
definitions that are directly translated from this original code. As a result,
|
|
these files are derivative works and are covered by the Mozilla Public License
|
|
(MPL) 2.0, as described in license headers in those files. Please see the
|
|
notices in relevant files for links to the original IonMonkey source files from
|
|
which they have been translated/derived. The MPL text can be found in
|
|
`src/ion/LICENSE`.
|
|
|
|
Parts of the code are derived from regalloc.rs: in particular,
|
|
`src/checker.rs` and `src/domtree.rs`. This crate has the same license
|
|
as regalloc.rs, so the license on these files does not differ.
|