2021-04-13 17:40:12 -07:00

regalloc2: another register allocator

This is a register allocator that started life as a port of IonMonkey's backtracking register allocator to Rust, and about 75% of it still is. The data structures and invariants have been simplified a little bit, and the interfaces made a little more generic and reusable. In addition, it contains a substantial amount of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator.

Design Overview

TODO

  • SSA with blockparams

  • Operands with constraints, and clobbers, and reused regs; contrast with regalloc.rs approach of vregs and pregs and many moves that get coalesced/elided

Differences from IonMonkey Backtracking Allocator

There are a number of differences between the IonMonkey allocator and this one:

  • Most significantly, there are many different fuzz targets (in fuzz/fuzz_targets/) that exercise the allocator. These include a full symbolic checker (ion_checker target) based on the symbolic checker in regalloc.rs, as well as, e.g., targeted fuzzers for the parallel move-resolution algorithm (moves) and for the SSA generator used to generate cases for the other fuzz targets (ssagen).

  • The data-structure invariants are simplified. While the IonMonkey allocator allowed for LiveRanges and Bundles to overlap in certain cases, this allocator sticks to a strict invariant: ranges do not overlap in bundles, and bundles do not overlap. There are other examples too: e.g., the definition of minimal bundles is very simple and does not depend on scanning the code at all. In general, we should be able to state simple invariants and see by inspection (as well as fuzzing -- see above) that they hold.

  • Many of the algorithms in the IonMonkey allocator are built with helper functions that do linear scans. The resulting "small quadratic" loops are likely not a huge issue in practice, but they can become one in corner cases. As much as possible, all work in this allocator is done in linear scans. For example, bundle splitting is done in a single compound scan over a bundle, the ranges in the bundle, and a sorted list of split-points.

  • There are novel schemes for solving certain interesting design challenges. One example: in IonMonkey, when a scan reaches one end of a control-flow edge, liveranges are connected across blocks by looking up the allocation at the other end. This is in principle a linear lookup (so quadratic overall). We instead generate a list of "half-moves", keyed on the edge and from/to vregs, with each holding one of the allocations. By sorting and then scanning this list, we can generate all edge moves in one linear scan. There are a number of other examples of simplifications: for example, we handle multiple conflicting physical-register-constrained uses of a vreg in a single instruction by recording the required copy in a side-table, then removing the constraints for the core regalloc. Ion instead has to tweak its definition of minimal bundles and create two liveranges that overlap (!) to represent the two uses.

  • Using block parameters rather than phi-nodes significantly simplifies handling of inter-block data movement. IonMonkey had to special-case phis in many ways because they are actually quite weird: their uses happen semantically in other blocks, and their defs happen in parallel at the top of the block. Block parameters represent these semantics naturally and explicitly.

  • The allocator supports irreducible control flow and arbitrary block ordering (its only CFG requirement is that critical edges are split). It handles loops during live-range computation in a way that is similar in spirit to IonMonkey's allocator -- in a single pass, when we discover a loop, we just mark the whole loop as a liverange for values live at the top of the loop -- but we find the loop body without the fixpoint workqueue loop that IonMonkey uses, instead doing a single linear scan for backedges and finding the minimal extent that covers all intermingled loops. In order to support arbitrary block order and irreducible control flow, we relax the invariant that the first liverange for a vreg always starts at its def; instead, the def can happen anywhere, and a liverange may overapproximate. It turns out this is not too hard to handle and is a more robust invariant. (It also means that non-SSA code may not be too hard to adapt to, though I haven't seriously thought about this.)
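The "half-moves" scheme described in the list above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the types `Alloc`, `Kind`, `HalfMove`, and `EdgeMove` and the function `resolve_edge_moves` are hypothetical names invented for this example. It also assumes, for simplicity, a single vreg per key, whereas with block parameters the source and destination vregs may differ.

```rust
/// Hypothetical stand-in for an allocation: a physical register or a spill slot.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Alloc {
    Reg(u8),
    Stack(u32),
}

/// Which end of the edge a half-move describes. `Source` is declared first so
/// that the derived ordering sorts it ahead of `Dest` for an equal key.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Source,
    Dest,
}

/// One half of an edge move. Field order matters: the derived `Ord` compares
/// fields top to bottom, so sorting groups halves by (edge, vreg), then kind.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct HalfMove {
    from_block: u32,
    to_block: u32,
    vreg: u32,
    kind: Kind,
    alloc: Alloc,
}

/// A fully resolved move to be placed on the edge `from_block -> to_block`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EdgeMove {
    from_block: u32,
    to_block: u32,
    vreg: u32,
    from: Alloc,
    to: Alloc,
}

/// Sort the half-moves, then pair each Source with the Dest halves that share
/// its key in a single pass over the sorted list.
fn resolve_edge_moves(mut halves: Vec<HalfMove>) -> Vec<EdgeMove> {
    halves.sort_unstable();
    let mut moves = Vec::new();
    let mut i = 0;
    while i < halves.len() {
        let src = halves[i];
        i += 1;
        if src.kind != Kind::Source {
            continue; // a Dest with no matching Source; ignored in this sketch
        }
        // Consume every Dest half that shares the Source's (edge, vreg) key.
        while i < halves.len() {
            let dst = halves[i];
            let same_key = dst.kind == Kind::Dest
                && (dst.from_block, dst.to_block, dst.vreg)
                    == (src.from_block, src.to_block, src.vreg);
            if !same_key {
                break;
            }
            // A move is needed only if the two ends disagree on location.
            if dst.alloc != src.alloc {
                moves.push(EdgeMove {
                    from_block: src.from_block,
                    to_block: src.to_block,
                    vreg: src.vreg,
                    from: src.alloc,
                    to: dst.alloc,
                });
            }
            i += 1;
        }
    }
    moves
}
```

In this sketch, each end of an edge is recorded independently while scanning liveranges, with no lookup of the other end. The pairing work is deferred to one sort followed by one pass over the sorted list.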

Rough Performance Comparison with Regalloc.rs

The allocator has not yet been wired up to a suitable compiler backend (such as Cranelift) to perform a true apples-to-apples compile-time and runtime comparison. However, we can get some idea of compile speed by running suitable test cases through the allocator and measuring throughput: that is, instructions per second for which registers are allocated.

To do so, I measured the qsort2 benchmark in regalloc.rs, register-allocated with default options in that crate's backtracking allocator, using the Criterion benchmark framework to measure ~620K instructions per second:

benches/0               time:   [365.68 us 367.36 us 369.04 us]
                        thrpt:  [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]

I then measured three different fuzztest-SSA-generator test cases in this allocator, regalloc2, measuring between 1.05M and 2.3M instructions per second (closer to the former for larger functions):

==== 459 instructions
benches/0               time:   [424.46 us 425.65 us 426.59 us]
                        thrpt:  [1.0760 Melem/s 1.0784 Melem/s 1.0814 Melem/s]

==== 225 instructions
benches/1               time:   [213.05 us 213.28 us 213.54 us]
                        thrpt:  [1.0537 Melem/s 1.0549 Melem/s 1.0561 Melem/s]

Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
==== 21 instructions
benches/2               time:   [9.0495 us 9.0571 us 9.0641 us]
                        thrpt:  [2.3168 Melem/s 2.3186 Melem/s 2.3206 Melem/s]

Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Though not apples-to-apples (SSA vs. non-SSA, and completely different code of only roughly similar length), this is at least some evidence that regalloc2 is likely to yield a compile-time improvement when used in, e.g., Cranelift.
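As a sanity check on the figures above, the throughput numbers are just element counts divided by the measured time. The small sketch below reproduces that arithmetic. It assumes that one Criterion "element" corresponds to one instruction, and the function names are invented for this example.

```rust
/// Throughput in elements per second, given an element count and a time in
/// microseconds (the unit used in the benchmark output above).
fn throughput_per_sec(elements: f64, time_us: f64) -> f64 {
    elements / (time_us * 1e-6)
}

/// Inverse relation: the element count implied by a rate (elements per
/// second) and a time in microseconds.
fn implied_elements(rate: f64, time_us: f64) -> f64 {
    rate * time_us * 1e-6
}
```

For example, `throughput_per_sec(459.0, 425.65)` gives roughly 1.08 million per second, in line with the first regalloc2 case. Under the same assumption, `implied_elements(620.65e3, 367.36)` suggests that the qsort2 input is on the order of a couple of hundred instructions, comparable in length to the 225-instruction case.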

License

Unless otherwise specified, code in this crate is licensed under the Apache 2.0 License with LLVM Exception. This license text can be found in the file LICENSE.

Files in the src/ion/ directory are directly ported from original C++ code in IonMonkey, a part of the Firefox codebase. Parts of src/lib.rs are also definitions that are directly translated from this original code. As a result, these files are derivative works and are covered by the Mozilla Public License (MPL) 2.0, as described in license headers in those files. Please see the notices in relevant files for links to the original IonMonkey source files from which they have been translated/derived. The MPL text can be found in src/ion/LICENSE.

Parts of the code are derived from regalloc.rs: in particular, src/checker.rs and src/domtree.rs. This crate has the same license as regalloc.rs, so the license on these files does not differ.
