* Re-introduce optional dedicated scratch registers
Dedicated scratch registers used for resolving move cycles were removed
in #51 and replaced with an algorithm to automatically allocate a
scratch register as needed.
However in many cases, a client will already have a non-allocatable
scratch register available for things like extended jumps (see #91). It
makes sense to re-use this register for regalloc than potentially
spilling an existing register.
* Clarify comment
* Remove register class from `SpillSlot`
The register allocator was already allowing moves between spillslots and
registers of different classes, so this PR formalizes this by making
spillslots independent of register class.
This also fixes#79 by properly tracking the register class of an
`InsertedMove` with the `to_vreg` field which turns out to never
be `None` in practice. Removing the `Option` allows the register
class of the `VReg` to be used when building the per-class move lists.
Fixes#79
* Address review feedback
Fixed stack slots are treated as `PReg`s by most of the register
allocator, but need some additional handling the move resolver to avoid
generating stack-to-stack moves.
This allows a non-allocatable `PReg` to be passed on directly to the
allocations vector without any liverange tracking from the register
allocator. The main intended use case is to support ISA-specific special
registers such as a fixed zero register.
The `vreg_def_blockparam` and `vreg_def_inst` fields on CFGInfo are only
used in `validate_ssa`, which in turn is only used in the ssagen fuzz
target.
Since these fields are never read in normal usage, initializing them is
entirely wasted effort. According to valgrind/DHAT, when running
`wasmtime compile` on the Sightglass Spidermonkey benchmark, removing
these fields saves about 100M instructions, 23k heap allocations
totalling 40MiB, and 47MiB of writes to the heap.
* Remove log_enabled statements around annotate calls, which are already guarded against annotations_enabled
* Use a trace_enabled!() macro that follows the same logic as trace!() to find if additional traces have been enabled or not
* Clobbers: use a more efficient bitmask representation in API.
Currently, the `Function` trait requires a `&[PReg]` for the
clobber-list for a given instruction. In most cases where clobbers are
used, the list may be long: e.g., ABIs specify a fixed set of registers
that are clobbered and there may be ~half of all registers in this list.
What's more, the list can't be shared for e.g. all calls of a given ABI,
because actual return-values (defs) can't be clobbers. So we need to
allocate space for long, sometimes-slightly-different lists; this is
inefficient for the embedder and for us.
It's much more efficient to use a bitmask to represent a set of physical
registers. With current data structure bitpacking limitations, we can
support at most 128 physical registers; this means we can use a `u128`
bitmask. This also allows e.g. an embedder to start with a constant for
a given ABI, and mask out bits for actual return-value registers on call
instructions.
This PR makes that change, for minor but positive performance impact.
* Review comments.
Currently, regalloc2 sets aside one register per class, unconditionally,
to make move resolution possible. To solve the "parallel moves problem",
we sometimes need to conjure a cyclic permutation of data among
registers or stack slots (this can result, for example, from blockparam
flow that swaps two values on a loop backedge). This set-aside scratch
register is used when a cycle exists.
regalloc2 also uses the scratch register when needed to break down a
stack-to-stack move (which could happen due to blockparam moves on edges
when source and destination are both spilled) into a stack-to-reg move
followed by reg-to-stack, because most machines have loads and stores
but not memory-to-memory moves.
A set-aside register is certainly the simplest solution, but it is not
optimal: it means that we have one fewer register available for use by
the program, and this can be costly especially on machines with fewer
registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some
registers may be set aside by our embedder for other purposes too. Every
register we can reclaim is some nontrivial performance in large function
bodies!
This PR removes this restriction and allows regalloc2 to use all
available physical registers. It then solves the two problems above,
cyclic moves and stack-to-stack moves, with a two-stage approach:
- First, it finds a location to use to resolve cycles, if any exist. If
a register is unallocated at the location of the move, we can use it.
Often we get lucky and this is the case. Otherwise, we allocate a
stackslot to use as the temp. This is perfectly fine at this stage,
even if it means that we have more stack-to-stack moves.
- Then, it resolves stack-to-stack moves into stack-to-reg /
reg-to-stack. There are two subcases here. If there is *another*
available free physical register, we opportunistically use it for this
decomposition. If not, we fall back to our last-ditch option: we pick
a victim register of the appropriate class, we allocate another
temporary stackslot, we spill the victim to that slot just for this
move, we do the move in the above way (stack-to-reg / reg-to-stack)
with the victim, then we reload the victim. So one move (original
stack-to-stack) becomes four moves, but no state is clobbered.
This PR extends the `moves` fuzz-target to exercise this functionality
as well, randomly choosing for some spare registers to exist or not, and
randomly generating {stack,reg}-to-{stack,reg} moves in the initial
parallel-move input set. The target does a simple symbolic simulation of
the sequential move sequence and ensures that the final state is
equivalent to the parallel-move semantics.
I fuzzed both the `moves` target, focusing on the new logic; as well as
the `ion_checker` target, checking the whole register allocator, and
both seem clean (~150M cases on the former, ~1M cases on the latter).
The checker was built to validate programs produced by the fuzzing
testcase generator, which was built before regalloc2 supported special
handling of moves. (In a pure-SSA world, move elision is not needed,
because moves are not needed, and blockparams are the only way of
tying together vregs.)
Due to this, the checker works great for our independent regalloc2
fuzzing setup, but when used on regalloc inputs produced by Cranelift,
cannot prove correctness.
This PR extends the checker's analysis to properly handle "program
moves", which are distinct from regalloc-inserted moves in that they
are present in the original program and hence are semantically
relevant. A program move edits all sets of symbolic vregs at all
allocs, and where the source vreg appears, it inserts the dest vreg as
well. (It also removes the dest vreg from all other sets, since the
old value becomes stale, as is done for other defs.)
Given this, and given some additional checking for moves to/from
pinned vregs, the checker can now be used to fully validate
Cranelift-sourced regalloc2 invocations.
This adds derived `Serialize` and `Deserialize` implementations for
exposed types that describe registers, operands, and related program
inputs; entity indices; and regalloc output types. This allows
serialization of any of the embedder's IR data types that may embed or
build upon regalloc2 types.
These implementations (and the dependency on the `serde` crate itself)
are enabled only when the non-default `enable-serde` feature is
specified.
Currently, if this is done, overlapping liveranges are created, and we
hit an assert that ensures our non-overlapping and
built-in-reverse-order invariants during liverange construction.
One could argue that multiple defs of a single vreg don't make a ton of
sense -- which def's value is valid after the instruction? -- but if
they all get the same alloc, then the answer is "whatever was put in
that alloc", and this is just a case of an instruction being a bit
over-eager when listing its registers.
This can arise in practice when operand lists come from combinations or
concatenations: for example, in Cranelift's s390x backend, there is a
"Loop" pseudo-instruction, and the operands of the Loop are the operands
of all the sub-instructions.
It seems more logically cohesive overall to say that one can state an
operand as many times as one likes; so this PR makes it so.
The `Operand` abstraction allows a def to be positioned at the "early"
point of an instruction, before its effect and alongside its normal
uses. This is intended to allow the embedder to express that a def may
be written before all uses are read, so it should not conflict with the
uses.
It's also convenient to use early defs to express temporaries, which
should be available throughout a regalloc-level instruction's emitted
sequence. In such a case, the register should not be used again after
the instruction, so it is dead following the instruction.
Strictly speaking, and according to regalloc2 prior to this PR, then the
temp will *only* conflict with the uses at the early-point, and not the
defs at the late-point (after the instruction), because it's dead past
its point of definition. But for a temp we really want it to register
conflicts not just with the normal uses but with the normal defs as
well.
This PR changes the semantics so that an early def builds a liverange
that spans the early- and late-point of an instruction when the vreg is
dead flowing down from the instruction, giving the semantics we want for
temps.
* Generalize debug-info support a bit.
Previously, debug value-label support required each vreg to have a
disjoint sequence of instruction ranges, each with one label.
Unfortunately, it's entirely possible for multiple values at the program
level to map to one vreg at the IR level, leading to multiple labels.
This PR generalizes the debug-info generation support to allow for
arbitrary (label, range, vreg) tuples, as long as they are sorted by
vreg, with no other requirements. The lookup is a little more costly
when we generate the debuginfo, but in practice we shouldn't have more
than a *few* debug value labels per vreg, so in practice the constants
should be small.
* Typo fix from Amanieu
Co-authored-by: Amanieu d'Antras <amanieu@gmail.com>
Co-authored-by: Amanieu d'Antras <amanieu@gmail.com>
Previously, the regalloc required all liveins to be defined by a
pseudoinstruction at the start of the function body. The regalloc.rs
compatibility shim did this, but it's slightly inconvenient when using
the API directly. This change allows pinned vregs to be implicit liveins
to the function body instead.
Simplify pinned-vreg API: don't require slice of all pinned vregs.
Previously, we kept a bool flag `is_pinned` in the `VRegData`, and we
required a `&[VReg]` of all pinned vregs to be provided by
`Function::pinned_vregs()`. This was (I think) done for convenience, but
it turns out not to really be necessary, as we can just query
`is_pinned_vreg` where needed (and in the likely implementation, e.g. in
Cranelift, this will be a `< NUM_PINNED_VREGS` check that can be
inlined). This adds convenience for the embedder (the main benefit), and
also reduces complexity, removes some state, and avoids some work
initializing the regalloc state for a run.
Support for debug-labels.
If the client adds labels to vregs across ranges of instructions in the
input program, the regalloc will provide metadata in the `Output` that
describes the `Allocation`s in which each such vreg is stored for those
ranges. This allows the client to emit debug metadata telling a debugger
where to find program values at each point in the program.
Even if the trace log level is disabled, the presence of the trace!
macro still has a significant impact on performance because it is
present in the inner loops of the allocator.
Removing the trace! calls at compile-time reduces instruction count by
~7%.
The documentation says that this is only used for heuristics, but it
is never actually called. This should be removed for now and perhaps
added back later if we find an actual use for it.