* Make regalloc2 `#![no_std]`
This crate doesn't require any features from the standard library, so it
can be made `no_std` to allow it to be used in environments that can't
use the Rust standard library.
This PR mainly performs the following mechanical changes:
- `std::collections` is replaced with `alloc::collections`.
- `std::*` is replaced with `core::*`.
- `Vec`, `vec!`, `format!` and `ToString` are imported when needed since
they are no longer in the prelude.
- `HashSet` and `HashMap` are taken from the `hashbrown` crate, which is
the same implementation that the standard library uses.
- `FxHashSet` and `FxHashMap` are typedefs in `lib.rs` that are based on
the `hashbrown` types.
The only functional change is that `RegAllocError` no longer implements
the `Error` trait since that is not available in `core`.
Dependencies were adjusted to not require `std` and this is tested in CI
by building against the `thumbv6m-none-eabi` target that doesn't have
`std`.
* Add the Error trait impl back under a "std" feature
* Do conflict-set hash lookups once, not twice
This makes the small wasmtime bz2 benchmark 1% faster, per Hyperfine and
Sightglass. The effect disappears into the noise on larger benchmarks.
* Inline PosWithPrio::key
When compiling the pulldown-cmark benchmark from Sightglass, this is the
single most frequently called function: it's invoked 2.5 million times.
Inlining it reduces instructions retired by 1.5% on that benchmark,
according to `valgrind --tool=callgrind`.
This patch is "1.01 ± 0.01 times faster" according to Hyperfine for the
bz2, pulldown-cmark, and spidermonkey benchmarks from Sightglass.
Sightglass, in turn, agrees that all three benchmarks are 1.01x faster
by instructions retired, and the first two are around 1.01x faster by
CPU cycles as well.
* Inline and simplify AdaptiveMap::expand
Previously, `get_or_insert` would iterate over the keys to find one that
matched; then, if none did, iterate over the values to check if any are
0; then iterate again to remove all zero values and compact the map.
This commit instead focuses on picking an index to use: preferably one
where the key already exists; but if it's not in the map, then an unused
index; but if there aren't any, then an index where the value is zero.
As a result this iterates the two arrays at most once each, and both
iterations can stop early.
The downside is that keys whose value is zero are not removed as
aggressively. It might be worth pruning such keys in `IndexSet::set`.
Also:
- `#[inline]` both implementations of `Iterator::next`
- Replace `set_bits` with using the `SetBitsIter` constructor directly
These changes together reduce instructions retired when compiling the
pulldown-cmark benchmark by 0.9%.
When a liverange starts at a *late* point of an instruction, and it
undergoes the fallback "split into all minimal pieces" transform, we end
up creating one minimal bundle that starts at the *early* point of the
instruction at the start of the original LR. This can create
impossible-to-allocate situations where a fixed-constraint LR overlaps
another constrained to the same register (e.g. at calls). We fix this by
ensuring the minimal bundle is trimmed only to the half of the
instruction that overlaps the original LR.
This is analogous to the third fix in #74, but on the other end (start
of LR rather than end of it).
* Some fixes to allow for call instructions to name args, returns, and clobbers with constraints.
- Allow early-pos uses with fixed regs that conflict with
clobbers (which happen at late-pos), in addition to the
existing logic for conflicts with late-pos defs with fixed
regs.
This is a pretty subtle issue that was uncovered in #53 for the def
case, and the fix here is the mirror of that fix for clobbers. The
root cause for all this complexity is that we can't split in the
middle of an instruction (because there's no way to insert a move
there!) so if a use is live-downward, we can't let it live in preg A
at early-pos and preg B != A at late-pos; instead we need to rewrite
the constraints and use a fixup move.
The earlier change to fix#53 was actually a bit too conservative in
that it always applied when such conflicts existed, even if the
downward arg was not live. This PR fixes that (it's fine for the
early-use and late-def to be fixed to the same reg if the use's
liverange ends after early-pos) and adapts the same flexibility to
the clobbers case as well.
- Reworks the fixups for the def case mentioned above to not shift the
def to the Early point. Doing so causes issues when the def is to a
reffy vreg: it can then be falsely included in a stackmap if the
instruction containing this operand is a safepoint.
- Fixes the last-resort split-bundle-into-minimal-pieces logic from
#59 to properly limit a minimal bundle piece to end after the
early-pos, rather than cover the entire instruction. This was causing
artificial overlaps between args that end after early-pos and defs
that start at late-pos when one of the vregs hit the fallback split
behavior.
* Fix fuzzbug: do not merge when a liverange has a fixed-reg def.
This can create impossible situations: e.g., if a vreg is constrained
to p0 as a late-def, and another, completely different vreg is
constrained to p0 as an early-use on the same instruction, and the
instruction also has a third vreg (early-use), we do not want to merge
the liverange for the third vreg with the first, because it would
result in an unsolveable conflict for p0 at the early-point.
* Review comments.
* Limit split count per original bundle with fallback 1-to-N split.
Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.
This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.
This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.
This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.
Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).
* Review comments.
Currently, we unconditionally trim the ends of liveranges around a split
when we do a split, including splits due to conflicts in a
liverange/bundle's requirements (e.g., a liverange with both a register
and a stack use). These trimmed ends, if they exist, go to the spill
bundle, and the spill bundle may receive a register during second-chance
allocation or otherwise will receive a stack slot.
This was previously measured to reduce contention significantly, because
it reduces the sizes of liveranges that participate in the first-chance
competition for allocations. When a split has to occur, we might as well
relegate the "connecting pieces" to a process that comes later, with a
hint to try to get the right register if possible but no hard connection
to either end.
However, in the case of a split arising from a reg-to-stack /
stack-to-reg conflict, as happens when references are used or def'd as
registers and then cross safepoints, this extra step in the connectivity
(normal LR with register use, then spill bundle, then normal LR with
stack use) can lead to extra moves. Additionally, when one of the LRs
has a stack constraint, contention is far less important; so it doesn't
hurt to skip the trimming step. In fact, it's likely much better to put
the "connecting piece" together with the stack side of the conflict.
Ideally we would handle this with the same move-cost logic we use for
conflicts detected during backtracking, but the requirements-related
splitting happens separately and that logic would need to be generalized
further. For now, this is sufficient to eliminate redundant moves as
seen in e.g. bytecodealliance/wasmtime#3785.
Even if the trace log level is disabled, the presence of the trace!
macro still has a significant impact on performance because it is
present in the inner loops of the allocator.
Removing the trace! calls at compile-time reduces instruction count by
~7%.
Changes in computation of bundle priorities during review of the initial
PR introduced a possible mis-ordering of priorities: inner-loop bundle
use weights could exceed the weights of 1_000_000 and 2_000_000 used for
minimal bundles without and with fixed uses (respectively). These two
kinds of minimal bundle are meant to be the highest-priority bundles,
evicting any other bundle they need to, because they can't be split
further. This PR introduces two special bundle weights for these two
kinds of bundles, and clamps all other bundle weights to just below
them.
Thanks to @Amanieu for reporting the issue! Fixes#19.
Large parts of the code in regalloc2 are currently licensed under the
Mozilla Public License (MPL) 2.0, because they derive in meaningful
ways from the register allocator in IonMonkey, which is part of
Firefox. The relevant source files are marked as such, with references
to the files in the Firefox source tree.
The intent of the regalloc2 project was to port the register allocator
from Firefox to use in Cranelift, borrowing good technology and
improving on it in the spirit of open source.
However, Several use-cases of Cranelift require, or at least strongly
prefer, the Apache-2.0 license with the LLVM exception (matching the
license of Cranelift itself, and Bytecode Alliance projects
generally). While using this license is not strictly necessary for
regalloc2 to be usable (The MPL is an excellent open-source license!),
relicensing fully under this license to harmonize with the rest of
Cranelift and Bytecode Alliance codebases significantly widens
possibilities and reduces friction; then regalloc2 is "just another
part of Cranelift" and doesn't have to be treated specially.
The source in `src/ion/` specifically began as a fairly direct port of
the algorithms in the following files in the `mozilla-central`
repository (Firefox codebase):
* The bulk of the "backtracking allocator" algorithm:
* `js/src/jit/BacktrackingAllocator.{cpp,h}`
* Helpers and definitions in the surrounding infrastructure:
* `js/src/jit/RegisterAllocator.h`
* `js/src/jit/RegisterAllocator.cpp`
* `js/src/jit/StackSlotAllocator.h`
* `js/src/jit/LIR.h`
* A few data structure implementations:
* `js/src/ds/SplayTree.h`
* `js/src/ds/PriorityQueue.h`
Subsequent work in improving regalloc2 has caused it to drift from the
direct port -- for example, it no longer uses splay trees or the
direct port of the priority queue above -- but it is of course very
clearly still a derivative work.
Analysis of the contributors to these files indicates that we need
signoff from the following folks:
* Mozilla Corp, for contributions made by Mozilla employees (the
majority of the code). Communications with Mozilla (thanks
@tschneidereit and @bholley for doing the work here!) indicate that
@ekr is able to sign off when ready here.
* Andy Wingo, specifically for the work done in [Bug
1620197](https://bugzilla.mozilla.org/show_bug.cgi?id=1620197) and
[Bug 1609057](https://bugzilla.mozilla.org/show_bug.cgi?id=1609057) to
generalize the stack allocator for a Wasm feature (multiple returns).
Additionally, since the initial port, we have had three contributions
from @Amanieu:
[#9](https://github.com/bytecodealliance/regalloc2/pull/9),
[#11](https://github.com/bytecodealliance/regalloc2/pull/11),
[#13](https://github.com/bytecodealliance/regalloc2/pull/13).
So, if everyone applicable is happy with this relicensing, this PR
removes the MPL-2.0 license in `src/ion/` and marks all files as
covered under `Apache-2.0 WITH LLVM-exception`. Please let us know if
this is OK!
Signoffs:
- [ ] @ekr, for Mozilla's contributions
- [ ] @wingo, for contributions to original code in `mozilla-central`
- [ ] @Amanieu, for the three PRs linked above
Thanks!
In wasmtime's `gc::many_live_refs` unit-test, approximately ~1K vregs
are live over ~1K safepoints (actually, each vreg is live over half the
safepoints on average, in a LIFO sort of arrangement).
This causes a huge slowdown with the current heuristics. Basically, each
vreg had a `Conflict` requirement because it had both stack uses
(safepoints) and register uses (the actual def and normal use). The
action in this case when processing the vreg's bundle is to split off
the first use -- a conservative-but-correct approach that will always
eventually split bundles far enough to get non-conflicting-requirement
pieces.
However, because each vreg had N stack uses followed by one register
use, this meant that each had to be split N times (!) -- so we had
O(n^2) splits and O(n^2) bundles by the end of the allocation.
This instead implements another simple heuristic that is much better:
when the requirements are conflicting, scan forward and find the exact
point at which the requirements become conflicting, such that the prefix
(first half prior to the split) still has no conflict, and split there.
This turns the above test-case into an O(n)-bundle / O(n)-split
situation.