This adds derived `Serialize` and `Deserialize` implementations for
exposed types that describe registers, operands, and related program
inputs; entity indices; and regalloc output types. This allows
serialization of any of the embedder's IR data types that may embed or
build upon regalloc2 types.
These implementations (and the dependency on the `serde` crate itself)
are enabled only when the non-default `enable-serde` feature is
specified.
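
One common way to gate derives behind a feature is `cfg_attr`. The following is a minimal sketch only: `BlockIndex` is a hypothetical stand-in for an entity-index type, and the attribute placement in regalloc2 itself may differ.

```rust
// Hypothetical entity-index type, used only for illustration.
// The serde derives are expanded only when the `enable-serde` feature is on,
// so a build without the feature never references the `serde` crate.
#[cfg_attr(
    feature = "enable-serde",
    derive(serde::Serialize, serde::Deserialize)
)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BlockIndex(pub u32);

impl BlockIndex {
    /// Returns the raw index as a `usize`.
    pub fn index(self) -> usize {
        self.0 as usize
    }
}
```

With the feature disabled, the type behaves exactly as before and carries no serialization code.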
Currently, if an instruction lists multiple defs of the same vreg,
overlapping liveranges are created, and we hit an assert that checks our
non-overlapping and built-in-reverse-order invariants during liverange
construction.
One could argue that multiple defs of a single vreg don't make a ton of
sense -- which def's value is valid after the instruction? -- but if
they all get the same alloc, then the answer is "whatever was put in
that alloc", and this is just a case of an instruction being a bit
over-eager when listing its registers.
This can arise in practice when operand lists come from combinations or
concatenations: for example, in Cranelift's s390x backend, there is a
"Loop" pseudo-instruction, and the operands of the Loop are the operands
of all the sub-instructions.
It seems more logically cohesive overall to say that one can state an
operand as many times as one likes; so this PR makes it so.
This resolves an issue seen when the source program uses multiple
regclasses (Int and Float): in some cases, the logic that grabs the
vregs and retains them (with class) in `vreg_regs` missed a register and
we had a class mismatch. This occurred because data structures were
initialized assuming `Int` regclass at first.
This PR instead removes the `vreg_regs` array, stores the class
explicitly as an `Option<RegClass>` in the `VRegData`, and provides a
`Env::vreg()` method that reconstitutes a `VReg` given its index and its
observed class. We "observe" the class of every vreg seen during the
liveness pass (and we assert that every occurrence of the vreg index has
the same class). In this way, we still have a single source-of-truth for
the vreg class (the mention of the vreg itself) and we explicitly
represent the "not observed yet" state (and panic on attempting to use
such a vreg) rather than implicitly taking the wrong class.
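
The "observe, then reconstitute" scheme can be sketched roughly as follows. The types here are simplified stand-ins (the real `VReg`, `VRegData`, and `Env` carry much more state), and `observe_vreg_class` is a name assumed for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RegClass {
    Int,
    Float,
}

/// Simplified stand-in: a vreg is an index plus a register class.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct VReg {
    pub index: usize,
    pub class: RegClass,
}

/// Per-vreg data; `None` means "class not observed yet".
#[derive(Clone, Debug, Default)]
pub struct VRegData {
    pub class: Option<RegClass>,
}

pub struct Env {
    pub vregs: Vec<VRegData>,
}

impl Env {
    pub fn new(num_vregs: usize) -> Self {
        Env { vregs: vec![VRegData::default(); num_vregs] }
    }

    /// Record the class of a vreg mention (hypothetical helper name).
    /// Every mention of the same index must agree on the class.
    pub fn observe_vreg_class(&mut self, vreg: VReg) {
        let slot = &mut self.vregs[vreg.index].class;
        match *slot {
            Some(prev) => assert_eq!(prev, vreg.class, "vreg class mismatch"),
            None => *slot = Some(vreg.class),
        }
    }

    /// Rebuild a `VReg` from its index; panics if the class was never observed.
    pub fn vreg(&self, index: usize) -> VReg {
        let class = self.vregs[index]
            .class
            .expect("class of vreg not observed yet");
        VReg { index, class }
    }
}
```

Using `Option` makes the unobserved state explicit, so a missed vreg shows up as a panic rather than as a silently wrong `Int` class.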
The `Operand` abstraction allows a def to be positioned at the "early"
point of an instruction, before its effect and alongside its normal
uses. This is intended to allow the embedder to express that a def may
be written before all uses are read, so it should not conflict with the
uses.
It's also convenient to use early defs to express temporaries, which
should be available throughout a regalloc-level instruction's emitted
sequence. In such a case, the register should not be used again after
the instruction, so it is dead following the instruction.
Strictly speaking, and according to regalloc2 prior to this PR, the
temp will *only* conflict with the uses at the early-point, and not the
defs at the late-point (after the instruction), because it's dead past
its point of definition. But for a temp we really want it to register
conflicts not just with the normal uses but with the normal defs as
well.
This PR changes the semantics so that an early def builds a liverange
that spans the early- and late-point of an instruction when the vreg is
dead flowing down from the instruction, giving the semantics we want for
temps.
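
To illustrate the difference, here is a small sketch with inclusive ranges over hypothetical `ProgPoint`s. The `span_whole_inst` flag stands for the old (`false`) versus new (`true`) behavior and is not an actual option in the allocator.

```rust
/// Position within an instruction; `Early < Late` by declaration order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Point {
    Early,
    Late,
}

/// Hypothetical program point: ordered by instruction, then by position.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ProgPoint {
    pub inst: u32,
    pub point: Point,
}

/// Inclusive liverange `(from, to)` for a def whose vreg is dead after `inst`.
/// With `span_whole_inst`, the range is extended through the late point.
pub fn dead_def_range(
    inst: u32,
    def_at: Point,
    span_whole_inst: bool,
) -> (ProgPoint, ProgPoint) {
    let from = ProgPoint { inst, point: def_at };
    let to = if span_whole_inst {
        ProgPoint { inst, point: Point::Late }
    } else {
        from
    };
    (from, to)
}

/// Two inclusive ranges overlap if each starts no later than the other ends.
pub fn overlaps(a: (ProgPoint, ProgPoint), b: (ProgPoint, ProgPoint)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

Under the old behavior, an early-def temp and a normal late def of the same instruction do not overlap and could share a register; under the new behavior they overlap and therefore conflict.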
* Generalize debug-info support a bit.
Previously, debug value-label support required each vreg to have a
disjoint sequence of instruction ranges, each with one label.
Unfortunately, it's entirely possible for multiple values at the program
level to map to one vreg at the IR level, leading to multiple labels.
This PR generalizes the debug-info generation support to allow for
arbitrary (label, range, vreg) tuples, as long as they are sorted by
vreg, with no other requirements. The lookup is a little more costly
when we generate the debuginfo, but we shouldn't have more than a *few*
debug value labels per vreg in practice, so the constants should be
small.
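
A lookup over such a sorted list might look like the sketch below. `DebugLabel`, `labels_for_vreg`, and `labels_at` are names assumed for illustration; the tuple layout in the real API may differ.

```rust
use std::ops::Range;

/// Hypothetical (label, range, vreg) record; the list is sorted by `vreg` only.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct DebugLabel {
    pub vreg: u32,
    pub insts: Range<u32>,
    pub label: u32,
}

/// Binary-search to the first entry for `vreg`, then take the contiguous run.
pub fn labels_for_vreg(sorted: &[DebugLabel], vreg: u32) -> &[DebugLabel] {
    let start = sorted.partition_point(|l| l.vreg < vreg);
    let len = sorted[start..]
        .iter()
        .take_while(|l| l.vreg == vreg)
        .count();
    &sorted[start..start + len]
}

/// All labels attached to `vreg` at instruction `inst`. Ranges for one vreg
/// may overlap, so this is a linear scan over that vreg's (few) entries.
pub fn labels_at(sorted: &[DebugLabel], vreg: u32, inst: u32) -> Vec<u32> {
    labels_for_vreg(sorted, vreg)
        .iter()
        .filter(|l| l.insts.contains(&inst))
        .map(|l| l.label)
        .collect()
}
```

The linear scan is the "little more costly" part: it is proportional to the number of labels on one vreg, not to the total number of labels.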
* Typo fix from Amanieu
Co-authored-by: Amanieu d'Antras <amanieu@gmail.com>
The "reused input" operand constraint allows for an instruction to have
a def-operand whose allocation is constrained to reuse the same
allocation as one of the uses.
This is useful to express constraints needed for some instruction sets,
like x86, where at the ISA level, one register serves both as an input
and the output.
Unfortunately the way that we lower the constraints to liveranges does
not work if we have the same vreg used both for the reused input and
another input -- it results in impossible-to-solve constraints. For
example, the instruction
```
alu_op v42 use, v42 use, v43 def reuse(0)
```
would result in an impossible allocation.
This fixes liverange construction to properly handle all uses of the
vreg whose operand is reused, rather than just the one reused operand.
Previously, the regalloc required all liveins to be defined by a
pseudoinstruction at the start of the function body. The regalloc.rs
compatibility shim did this, but it's slightly inconvenient when using
the API directly. This change allows pinned vregs to be implicit liveins
to the function body instead.
Simplify pinned-vreg API: don't require slice of all pinned vregs.
Previously, we kept a bool flag `is_pinned` in the `VRegData`, and we
required a `&[VReg]` of all pinned vregs to be provided by
`Function::pinned_vregs()`. This was (I think) done for convenience, but
it turns out not to really be necessary, as we can just query
`is_pinned_vreg` where needed (and in the likely implementation, e.g. in
Cranelift, this will be a `< NUM_PINNED_VREGS` check that can be
inlined). This adds convenience for the embedder (the main benefit), and
also reduces complexity, removes some state, and avoids some work
initializing the regalloc state for a run.
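
As a rough illustration of the "likely implementation" mentioned above, an embedder could answer the query with a single comparison. The constant's value, the `MyFunction` type, and the `bool` return type are assumptions of this sketch, not the exact trait signature.

```rust
/// Hypothetical count of pinned vregs; in this sketch the pinned vregs
/// occupy indices `0..NUM_PINNED_VREGS`.
pub const NUM_PINNED_VREGS: usize = 32;

/// Hypothetical embedder-side function type.
pub struct MyFunction;

impl MyFunction {
    /// A single comparison that the compiler can inline at each query site,
    /// so no slice of pinned vregs has to be built or stored.
    #[inline]
    pub fn is_pinned_vreg(&self, vreg_index: usize) -> bool {
        vreg_index < NUM_PINNED_VREGS
    }
}
```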
Support for debug-labels.
If the client adds labels to vregs across ranges of instructions in the
input program, the regalloc will provide metadata in the `Output` that
describes the `Allocation`s in which each such vreg is stored for those
ranges. This allows the client to emit debug metadata telling a debugger
where to find program values at each point in the program.
Even if the trace log level is disabled, the presence of the trace!
macro still has a significant impact on performance because it is
present in the inner loops of the allocator.
Removing the trace! calls at compile-time reduces instruction count by
~7%.
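
One way to compile such calls out is to wrap them in a macro that checks a Cargo feature with `cfg!`. This is a sketch under assumptions: the feature name `trace-log` is chosen for illustration, and `eprintln!` stands in for a real logging call to keep the example self-contained.

```rust
// Because `cfg!` expands to a constant `true` or `false`, the whole branch
// (including evaluation of the format arguments) is dead code when the
// feature is off, and the optimizer removes it.
macro_rules! trace {
    ($($arg:tt)*) => {
        if cfg!(feature = "trace-log") {
            eprintln!($($arg)*);
        }
    };
}

/// Example hot loop: the `trace!` call costs nothing unless the
/// (assumed) `trace-log` feature is enabled at compile time.
pub fn sum_with_tracing(values: &[u32]) -> u32 {
    let mut total = 0;
    for &v in values {
        trace!("adding {}", v);
        total += v;
    }
    total
}
```

The arguments are still type-checked in every build, so trace statements cannot silently rot while the feature is disabled.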
When an instruction uses the same vreg constrained to multiple different
fixed registers, the allocator converts all but one of the fixed
constraints to `Any` and then records a special fixup move that copies
the value to the other fixed registers just before the instruction. This
allows the allocator to maintain the invariant that a value lives in
only one place at a time throughout most of its logic, and constrains
the complexity-fallout of this corner case to just a special last-minute
edit.
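
The rewrite described above can be sketched as follows. All types and names are hypothetical, and keeping the *first* fixed constraint is an assumption of this sketch; the text only says that all but one are converted.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Constraint {
    Any,
    FixedReg(u8),
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Operand {
    pub vreg: u32,
    pub constraint: Constraint,
}

/// A copy to be inserted just before the instruction.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FixupMove {
    pub vreg: u32,
    pub from_preg: u8,
    pub to_preg: u8,
}

/// For each vreg, keep its first fixed-register constraint; relax any later
/// constraint to a *different* register to `Any` and record a fixup move.
pub fn rewrite_multi_fixed(operands: &mut [Operand]) -> Vec<FixupMove> {
    let mut kept: HashMap<u32, u8> = HashMap::new();
    let mut fixups = Vec::new();
    for op in operands.iter_mut() {
        if let Constraint::FixedReg(preg) = op.constraint {
            match kept.get(&op.vreg).copied() {
                None => {
                    kept.insert(op.vreg, preg);
                }
                Some(first) if first != preg => {
                    op.constraint = Constraint::Any;
                    fixups.push(FixupMove {
                        vreg: op.vreg,
                        from_preg: first,
                        to_preg: preg,
                    });
                }
                Some(_) => {} // same register requested again: nothing to do
            }
        }
    }
    fixups
}
```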
Unfortunately some recent CPU time thrown at the fuzzer has uncovered
a subtle interaction with the redundant move eliminator that confuses
the checker.
Specifically, when the correct value is *already* in the second
constrained fixed reg, because of an unrelated other move (e.g. because
of a blockparam or other vreg moved from the original), the redundant
move eliminator can delete the fixup move without telling the checker
that it has done so.
Such an optimization is perfectly valid, and the generated code is
correct; but the checker thinks that some other vreg (the one that was
copied from the original) is in the second preg, and panics.
The fix is to use the mechanism that indicates "this move defines a new
vreg" (emitting a `defalloc` checker-instruction) to force the checker
to understand that after the fixup move, the given preg actually
contains the appropriate vreg.
Changes in computation of bundle priorities during review of the initial
PR introduced a possible mis-ordering of priorities: inner-loop bundle
use weights could exceed the weights of 1_000_000 and 2_000_000 used for
minimal bundles without and with fixed uses (respectively). These two
kinds of minimal bundle are meant to be the highest-priority bundles,
evicting any other bundle they need to, because they can't be split
further. This PR introduces two special bundle weights for these two
kinds of bundles, and clamps all other bundle weights to just below
them.
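
The clamping scheme might be sketched like this. The constant names and values are assumptions chosen for illustration; only the ordering between them matters.

```rust
/// Highest weight an ordinary (splittable) bundle may have (assumed value).
pub const MAX_NORMAL_WEIGHT: u32 = u32::MAX - 2;
/// Special weight for minimal bundles without fixed uses.
pub const MINIMAL_BUNDLE_WEIGHT: u32 = MAX_NORMAL_WEIGHT + 1;
/// Special weight for minimal bundles with fixed uses.
pub const MINIMAL_FIXED_BUNDLE_WEIGHT: u32 = MAX_NORMAL_WEIGHT + 2;

/// Final priority weight for a bundle. Minimal bundles get a reserved
/// weight; everything else is clamped so it can never outrank them,
/// no matter how large the loop-depth-scaled use weights become.
pub fn bundle_weight(raw_weight: u32, minimal: bool, fixed: bool) -> u32 {
    match (minimal, fixed) {
        (true, true) => MINIMAL_FIXED_BUNDLE_WEIGHT,
        (true, false) => MINIMAL_BUNDLE_WEIGHT,
        (false, _) => raw_weight.min(MAX_NORMAL_WEIGHT),
    }
}
```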
Thanks to @Amanieu for reporting the issue! Fixes #19.
Large parts of the code in regalloc2 are currently licensed under the
Mozilla Public License (MPL) 2.0, because they derive in meaningful
ways from the register allocator in IonMonkey, which is part of
Firefox. The relevant source files are marked as such, with references
to the files in the Firefox source tree.
The intent of the regalloc2 project was to port the register allocator
from Firefox to use in Cranelift, borrowing good technology and
improving on it in the spirit of open source.
However, several use-cases of Cranelift require, or at least strongly
prefer, the Apache-2.0 license with the LLVM exception (matching the
license of Cranelift itself, and Bytecode Alliance projects
generally). While using this license is not strictly necessary for
regalloc2 to be usable (the MPL is an excellent open-source license!),
relicensing fully under this license to harmonize with the rest of
Cranelift and Bytecode Alliance codebases significantly widens
possibilities and reduces friction; then regalloc2 is "just another
part of Cranelift" and doesn't have to be treated specially.
The source in `src/ion/` specifically began as a fairly direct port of
the algorithms in the following files in the `mozilla-central`
repository (Firefox codebase):
* The bulk of the "backtracking allocator" algorithm:
  * `js/src/jit/BacktrackingAllocator.{cpp,h}`
* Helpers and definitions in the surrounding infrastructure:
  * `js/src/jit/RegisterAllocator.h`
  * `js/src/jit/RegisterAllocator.cpp`
  * `js/src/jit/StackSlotAllocator.h`
  * `js/src/jit/LIR.h`
* A few data structure implementations:
  * `js/src/ds/SplayTree.h`
  * `js/src/ds/PriorityQueue.h`
Subsequent work in improving regalloc2 has caused it to drift from the
direct port -- for example, it no longer uses splay trees or the
direct port of the priority queue above -- but it is of course very
clearly still a derivative work.
Analysis of the contributors to these files indicates that we need
signoff from the following folks:
* Mozilla Corp, for contributions made by Mozilla employees (the
majority of the code). Communications with Mozilla (thanks
@tschneidereit and @bholley for doing the work here!) indicate that
@ekr is able to sign off when ready here.
* Andy Wingo, specifically for the work done in [Bug
1620197](https://bugzilla.mozilla.org/show_bug.cgi?id=1620197) and
[Bug 1609057](https://bugzilla.mozilla.org/show_bug.cgi?id=1609057) to
generalize the stack allocator for a Wasm feature (multiple returns).
Additionally, since the initial port, we have had three contributions
from @Amanieu:
[#9](https://github.com/bytecodealliance/regalloc2/pull/9),
[#11](https://github.com/bytecodealliance/regalloc2/pull/11),
[#13](https://github.com/bytecodealliance/regalloc2/pull/13).
So, if everyone applicable is happy with this relicensing, this PR
removes the MPL-2.0 license in `src/ion/` and marks all files as
covered under `Apache-2.0 WITH LLVM-exception`. Please let us know if
this is OK!
Signoffs:
- [ ] @ekr, for Mozilla's contributions
- [ ] @wingo, for contributions to original code in `mozilla-central`
- [ ] @Amanieu, for the three PRs linked above
Thanks!
The documentation says that this is only used for heuristics, but it
is never actually called. This should be removed for now and perhaps
added back later if we find an actual use for it.
In wasmtime's `gc::many_live_refs` unit-test, approximately 1K vregs
are live over ~1K safepoints (actually, each vreg is live over half the
safepoints on average, in a LIFO sort of arrangement).
This causes a huge slowdown with the current heuristics. Basically, each
vreg had a `Conflict` requirement because it had both stack uses
(safepoints) and register uses (the actual def and normal use). The
action in this case when processing the vreg's bundle is to split off
the first use -- a conservative-but-correct approach that will always
eventually split bundles far enough to get non-conflicting-requirement
pieces.
However, because each vreg had N stack uses followed by one register
use, this meant that each had to be split N times (!) -- so we had
O(n^2) splits and O(n^2) bundles by the end of the allocation.
This instead implements another simple heuristic that is much better:
when the requirements are conflicting, scan forward and find the exact
point at which the requirements become conflicting, such that the prefix
(first half prior to the split) still has no conflict, and split there.
This turns the above test-case into an O(n)-bundle / O(n)-split
situation.
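
The scan can be sketched as below, using a simplified `Requirement` lattice. The names are assumed for illustration, and the real allocator tracks more kinds of requirement than these three.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Requirement {
    Any,
    Register,
    Stack,
}

impl Requirement {
    /// Combine two requirements; `None` means they conflict.
    pub fn merge(self, other: Requirement) -> Option<Requirement> {
        match (self, other) {
            (Requirement::Any, r) | (r, Requirement::Any) => Some(r),
            (a, b) if a == b => Some(a),
            _ => None,
        }
    }
}

/// Index of the first use whose requirement conflicts with the merged
/// requirement of all uses before it, or `None` if there is no conflict.
/// Splitting just before this use leaves a conflict-free prefix.
pub fn first_conflict(uses: &[Requirement]) -> Option<usize> {
    let mut merged = Requirement::Any;
    for (i, &req) in uses.iter().enumerate() {
        match merged.merge(req) {
            Some(m) => merged = m,
            None => return Some(i),
        }
    }
    None
}
```

For a bundle with N stack uses followed by one register use, this returns index N, so a single split separates the stack-only prefix from the register use instead of peeling off one use at a time.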
This feature needs more thought; for now we will of course continue to
support pinned vregs, but perhaps we can do better for
"pass-through-and-forget" operands that are given non-allocatable
registers.
This reverts commit 736f636c36.