regalloc2

Chris Fallin 15ed2d6522 Allow multiple defs per vreg (i.e., accept non-SSA code).

This generalizes the allocator to accept multiple defs by making defs
just another type of "use" (uses are now perhaps more properly called
"mentions", but for now we abuse the terminology slightly).

It turns out that this actually was not terribly hard, because we don't
rely on the properties that a strict SSA requirement otherwise might
allow us to: e.g., defs always at exactly the start of a vreg's ranges.
Because we already accepted arbitrary block order and irreducible CFGs,
and approximated live-ranges with the single-pass algorithm, we are
robust in our "stitching" (move insertion) and so all we really care
about is computing some superset of the actual live-ranges and then a
non-interfering coloring of (split pieces of) those ranges. Multiple
defs don't change that, as long as we compute the ranges properly.

We still have blockparams in this design, so the client *can* provide
SSA directly, and everything will work as before. But a client that
produces non-SSA need not use them at all; it can just happily reassign
to vregs and everything will Just Work.

This is part of the effort to port Cranelift over to regalloc2; I have
decided that it may be easier to build a compatibility shim that matches
regalloc.rs's interface than to continue boiling the ocean and
converting all of the lowering sequences to SSA. It then becomes a
separable piece of work (and simply further performance improvements and
simplifications) to remove the need for this shim.

2021-05-05 22:49:45 -07:00

.github/workflows

Add GitHub CI config.

2021-04-18 13:18:18 -07:00

fuzz

Factored out test program and fuzzing features; core crate now only depends on smallvec and log.

2021-04-18 14:19:32 -07:00

src

Allow multiple defs per vreg (i.e., accept non-SSA code).

2021-05-05 22:49:45 -07:00

test

Factored out test program and fuzzing features; core crate now only depends on smallvec and log.

2021-04-18 14:19:32 -07:00

.gitignore

Initial public commit of regalloc2.

2021-04-13 17:40:12 -07:00

Cargo.toml

Factored out test program and fuzzing features; core crate now only depends on smallvec and log.

2021-04-18 14:19:32 -07:00

LICENSE

Initial public commit of regalloc2.

2021-04-13 17:40:12 -07:00

README.md

Heuristic improvement: reg-scan offset by inst location.

2021-04-13 23:31:34 -07:00

README.md

regalloc2: another register allocator

This is a register allocator that started life as, and is about 75% still, a port of IonMonkey's backtracking register allocator to Rust. The data structures and invariants have been simplified a little bit, and the interfaces made a little more generic and reusable. In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator.

Design Overview

TODO

SSA with blockparams
Operands with constraints, and clobbers, and reused regs; contrast with regalloc.rs approach of vregs and pregs and many moves that get coalesced/elided

Differences from IonMonkey Backtracking Allocator

There are a number of differences between the IonMonkey allocator and this one:

Most significantly, there are many different fuzz targets (in fuzz/fuzz_targets/) that exercise the allocator, including a full symbolic checker (ion_checker target) based on the symbolic checker in regalloc.rs and, e.g., a targeted fuzzer for the parallel move-resolution algorithm (moves) and the SSA generator used for generating cases for the other fuzz targets (ssagen).
The data-structure invariants are simplified. While the IonMonkey allocator allowed for LiveRanges and Bundles to overlap in certain cases, this allocator sticks to a strict invariant: ranges do not overlap in bundles, and bundles do not overlap. There are other examples too: e.g., the definition of minimal bundles is very simple and does not depend on scanning the code at all. In general, we should be able to state simple invariants and see by inspection (as well as fuzzing -- see above) that they hold.
Many of the algorithms in the IonMonkey allocator are built with helper functions that do linear scans. The resulting "small quadratic" loops are likely not a huge issue in practice, but they can nevertheless become expensive in corner cases. As much as possible, all work in this allocator is done in linear scans. For example, bundle splitting is done in a single compound scan over a bundle, ranges in the bundle, and a sorted list of split-points.
There are novel schemes for solving certain interesting design challenges. One example: in IonMonkey, liveranges are connected across blocks by looking up the allocation at the other end of a control-flow edge whenever a scan reaches one end of that edge. This is in principle a linear lookup (so quadratic overall). We instead generate a list of "half-moves", keyed on the edge and from/to vregs, with each holding one of the allocations. By sorting and then scanning this list, we can generate all edge moves in one linear scan. There are a number of other examples of simplifications: for example, we handle multiple conflicting physical-register-constrained uses of a vreg in a single instruction by recording a copy to do in a side-table, then removing constraints for the core regalloc. Ion instead has to tweak its definition of minimal bundles and create two liveranges that overlap (!) to represent the two uses.
Using block parameters rather than phi-nodes significantly simplifies handling of inter-block data movement. IonMonkey had to special-case phis in many ways because they are actually quite weird: their uses happen semantically in other blocks, and their defs happen in parallel at the top of the block. Block parameters represent these semantics naturally and explicitly.
The allocator supports irreducible control flow and arbitrary block ordering (its only CFG requirement is that critical edges are split). It handles loops during live-range computation in a way that is similar in spirit to IonMonkey's allocator -- in a single pass, when we discover a loop, we just mark the whole loop as a liverange for values live at the top of the loop -- but we find the loop body without the fixpoint workqueue loop that IonMonkey uses, instead doing a single linear scan for backedges and finding the minimal extent that covers all intermingled loops. In order to support arbitrary block order and irreducible control flow, we relax the invariant that the first liverange for a vreg always starts at its def; instead, the def can happen anywhere, and a liverange may overapproximate. It turns out this is not too hard to handle and is a more robust invariant. (It also means that non-SSA code may not be too hard to adapt to, though I haven't seriously thought about this.)
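The half-move scheme mentioned in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: the names Alloc, Kind, HalfMove, EdgeMove and resolve_edge_moves are hypothetical, and the sketch assumes the same vreg is named at both ends of an edge (it ignores the renaming that block parameters introduce).

```rust
// Hypothetical allocation type: a physical register or a stack slot.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Alloc {
    Reg(u8),
    Stack(u32),
}

// Declared so that Source sorts before Dest for the same key.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Source,
    Dest,
}

// One end of an edge move. Field order matters: the derived Ord compares
// (from_block, to_block, vreg, kind, alloc) lexicographically.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct HalfMove {
    from_block: u32,
    to_block: u32,
    vreg: u32,
    kind: Kind,
    alloc: Alloc,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EdgeMove {
    from_block: u32,
    to_block: u32,
    vreg: u32,
    src: Alloc,
    dst: Alloc,
}

// Sort the half-moves, then pair each Source with the Dest entries that
// follow it under the same (edge, vreg) key, in a single linear scan.
fn resolve_edge_moves(mut halves: Vec<HalfMove>) -> Vec<EdgeMove> {
    halves.sort();
    let mut moves = Vec::new();
    let mut i = 0;
    while i < halves.len() {
        let src = halves[i];
        i += 1;
        if src.kind != Kind::Source {
            // A Dest with no matching Source: skipped in this sketch.
            continue;
        }
        while i < halves.len()
            && halves[i].kind == Kind::Dest
            && halves[i].from_block == src.from_block
            && halves[i].to_block == src.to_block
            && halves[i].vreg == src.vreg
        {
            let dst = halves[i];
            // Only emit a move when the two ends disagree on location.
            if dst.alloc != src.alloc {
                moves.push(EdgeMove {
                    from_block: src.from_block,
                    to_block: src.to_block,
                    vreg: src.vreg,
                    src: src.alloc,
                    dst: dst.alloc,
                });
            }
            i += 1;
        }
    }
    moves
}
```

Each scan over a block boundary only appends a half-move to the list, so no per-edge lookup is needed; the cost is one sort plus one pass over the list.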

Rough Performance Comparison with Regalloc.rs

The allocator has not yet been wired up to a suitable compiler backend (such as Cranelift) to perform a true apples-to-apples compile-time and runtime comparison. However, we can get some idea of compile speed by running suitable test cases through the allocator and measuring throughput: that is, instructions per second for which registers are allocated.

To do so, I measured the qsort2 benchmark in regalloc.rs, register-allocated with default options in that crate's backtracking allocator, using the Criterion benchmark framework to measure ~620K instructions per second:

benches/0               time:   [365.68 us 367.36 us 369.04 us]
                        thrpt:  [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]

I then measured three different fuzztest-SSA-generator test cases in this allocator, regalloc2, measuring between 1.1M and 2.3M instructions per second (closer to the former for larger functions):

==== 459 instructions
benches/0               time:   [377.91 us 378.09 us 378.27 us]
                        thrpt:  [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]

==== 225 instructions
benches/1               time:   [202.03 us 202.14 us 202.27 us]
                        thrpt:  [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]

==== 21 instructions
benches/2               time:   [9.5605 us 9.5655 us 9.5702 us]
                        thrpt:  [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]

Though not apples-to-apples (SSA vs. non-SSA, and completely different code that is only similar in length), this is at least some evidence that regalloc2 is likely to yield a compile-time improvement when used in, e.g., Cranelift.
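As a sanity check on the figures above, the reported throughput is simply the instruction count divided by the elapsed time. The small helper below (a hypothetical name, not part of either crate) reproduces the middle Criterion estimates from the instruction counts and median times listed for regalloc2.

```rust
/// Throughput in instructions per second, given an instruction count and
/// the time taken to allocate registers for them, in microseconds.
fn throughput(insts: u64, micros: f64) -> f64 {
    insts as f64 / (micros * 1e-6)
}

/// Same value in millions of instructions per second (Criterion's "Melem/s").
fn throughput_melem(insts: u64, micros: f64) -> f64 {
    throughput(insts, micros) / 1e6
}
```

For instance, throughput_melem(459, 378.09) comes out at roughly 1.214, matching the first regalloc2 measurement.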

License

Unless otherwise specified, code in this crate is licensed under the Apache 2.0 License with LLVM Exception. This license text can be found in the file LICENSE.

Files in the src/ion/ directory are directly ported from original C++ code in IonMonkey, a part of the Firefox codebase. Parts of src/lib.rs are also definitions that are directly translated from this original code. As a result, these files are derivative works and are covered by the Mozilla Public License (MPL) 2.0, as described in license headers in those files. Please see the notices in relevant files for links to the original IonMonkey source files from which they have been translated/derived. The MPL text can be found in src/ion/LICENSE.

Parts of the code are derived from regalloc.rs: in particular, src/checker.rs and src/domtree.rs. This crate has the same license as regalloc.rs, so the license on these files does not differ.