Files
regalloc2/README.md
2021-04-13 17:40:12 -07:00

160 lines
7.4 KiB
Markdown

## regalloc2: another register allocator
This is a register allocator that started life as, and is about 75%
still, a port of IonMonkey's backtracking register allocator to
Rust. The data structures and invariants have been simplified a little
bit, and the interfaces made a little more generic and reusable. In
addition, it contains substantial amounts of testing infrastructure
(fuzzing harnesses and checkers) that does not exist in the original
IonMonkey allocator.
### Design Overview
TODO
- SSA with blockparams
- Operands with constraints, and clobbers, and reused regs; contrast
with regalloc.rs approach of vregs and pregs and many moves that get
coalesced/elided
### Differences from IonMonkey Backtracking Allocator
There are a number of differences between the [IonMonkey
allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp)
and this one:
* Most significantly, there are [fuzz/fuzz_targets/](many different
fuzz targets) that exercise the allocator, including a full symbolic
checker (`ion_checker` target) based on the [symbolic checker in
regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
and, e.g., a targetted fuzzer for the parallel move-resolution
algorithm (`moves`) and the SSA generator used for generating cases
for the other fuzz targets (`ssagen`).
* The data-structure invariants are simplified. While the IonMonkey
allocator allowed for LiveRanges and Bundles to overlap in certain
cases, this allocator sticks to a strict invariant: ranges do not
overlap in bundles, and bundles do not overlap. There are other
examples too: e.g., the definition of minimal bundles is very simple
and does not depend on scanning the code at all. In general, we
should be able to state simple invariants and see by inspection (as
well as fuzzing -- see above) that they hold.
* Many of the algorithms in the IonMonkey allocator are built with
helper functions that do linear scans. These "small quadratic" loops
are likely not a huge issue in practice, but nevertheless have the
potential to be in corner cases. As much as possible, all work in
this allocator is done in linear scans. For example, bundle
splitting is done in a single compound scan over a bundle, ranges in
the bundle, and a sorted list of split-points.
* There are novel schemes for solving certain interesting design
challenges. One example: in IonMonkey, liveranges are connected
across blocks by, when reaching one end of a control-flow edge in a
scan, doing a lookup of the allocation at the other end. This is in
principle a linear lookup (so quadratic overall). We instead
generate a list of "half-moves", keyed on the edge and from/to
vregs, with each holding one of the allocations. By sorting and then
scanning this list, we can generate all edge moves in one linear
scan. There are a number of other examples of simplifications: for
example, we handle multiple conflicting
physical-register-constrained uses of a vreg in a single instruction
by recording a copy to do in a side-table, then removing constraints
for the core regalloc. Ion instead has to tweak its definition of
minimal bundles and create two liveranges that overlap (!) to
represent the two uses.
* Using block parameters rather than phi-nodes significantly
simplifies handling of inter-block data movement. IonMonkey had to
special-case phis in many ways because they are actually quite
weird: their uses happen semantically in other blocks, and their
defs happen in parallel at the top of the block. Block parameters
naturally and explicitly reprsent these semantics in a direct way.
* The allocator supports irreducible control flow and arbitrary block
ordering (its only CFG requirement is that critical edges are
split). It handles loops during live-range computation in a way that
is similar in spirit to IonMonkey's allocator -- in a single pass,
when we discover a loop, we just mark the whole loop as a liverange
for values live at the top of the loop -- but we find the loop body
without the fixpoint workqueue loop that IonMonkey uses, instead
doing a single linear scan for backedges and finding the minimal
extent that covers all intermingled loops. In order to support
arbitrary block order and irreducible control flow, we relax the
invariant that the first liverange for a vreg always starts at its
def; instead, the def can happen anywhere, and a liverange may
overapproximate. It turns out this is not too hard to handle and is
a more robust invariant. (It also means that non-SSA code *may* not
be too hard to adapt to, though I haven't seriously thought about
this.)
### Rough Performance Comparison with Regalloc.rs
The allocator has not yet been wired up to a suitable compiler backend
(such as Cranelift) to perform a true apples-to-apples compile-time
and runtime comparison. However, we can get some idea of compile speed
by running suitable test cases through the allocator and measuring
*throughput*: that is, instructions per second for which registers are
allocated.
To do so, I measured the `qsort2` benchmark in
[regalloc.rs](https://github.com/bytecodealliance/regalloc.rs),
register-allocated with default options in that crate's backtracking
allocator, using the Criterion benchmark framework to measure ~620K
instructions per second:
```plain
benches/0 time: [365.68 us 367.36 us 369.04 us]
thrpt: [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]
```
I then measured three different fuzztest-SSA-generator test cases in
this allocator, `regalloc2`, measuring between 1.05M and 2.3M
instructions per second (closer to the former for larger functions):
```plain
==== 459 instructions
benches/0 time: [424.46 us 425.65 us 426.59 us]
thrpt: [1.0760 Melem/s 1.0784 Melem/s 1.0814 Melem/s]
==== 225 instructions
benches/1 time: [213.05 us 213.28 us 213.54 us]
thrpt: [1.0537 Melem/s 1.0549 Melem/s 1.0561 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
==== 21 instructions
benches/2 time: [9.0495 us 9.0571 us 9.0641 us]
thrpt: [2.3168 Melem/s 2.3186 Melem/s 2.3206 Melem/s]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
```
Though not apples-to-apples (SSA vs. non-SSA, completely different
code only with similar length), this is at least some evidence that
`regalloc2` is likely to lead to at least a compile-time improvement
when used in e.g. Cranelift.
### License
Unless otherwise specified, code in this crate is licensed under the Apache 2.0
License with LLVM Exception. This license text can be found in the file
`LICENSE`.
Files in the `src/ion/` directory are directly ported from original C++ code in
IonMonkey, a part of the Firefox codebase. Parts of `src/lib.rs` are also
definitions that are directly translated from this original code. As a result,
these files are derivative works and are covered by the Mozilla Public License
(MPL) 2.0, as described in license headers in those files. Please see the
notices in relevant files for links to the original IonMonkey source files from
which they have been translated/derived. The MPL text can be found in
`src/ion/LICENSE`.
Parts of the code are derived from regalloc.rs: in particular,
`src/checker.rs` and `src/domtree.rs`. This crate has the same license
as regalloc.rs, so the license on these files does not differ.