* Make regalloc2 `#![no_std]`
This crate doesn't require any features from the standard library, so it
can be made `no_std` to allow it to be used in environments that can't
use the Rust standard library.
This PR mainly performs the following mechanical changes:
- `std::collections` is replaced with `alloc::collections`.
- `std::*` is replaced with `core::*`.
- `Vec`, `vec!`, `format!` and `ToString` are imported when needed since
they are no longer in the prelude.
- `HashSet` and `HashMap` are taken from the `hashbrown` crate, which is
the same implementation that the standard library uses.
- `FxHashSet` and `FxHashMap` are typedefs in `lib.rs` that are based on
the `hashbrown` types.
The only functional change is that `RegAllocError` no longer implements
the `Error` trait since that is not available in `core`.
Dependencies were adjusted to not require `std`, and this is tested in CI
by building against the `thumbv6m-none-eabi` target, which doesn't have
`std`.
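A rough sketch of the resulting typedefs (the exact hasher is an
assumption; regalloc2 may pick a different one):

```rust
// Sketch of no_std-friendly hash aliases, assuming the `hashbrown` and
// `rustc-hash` crates; the concrete hasher regalloc2 uses may differ.
use core::hash::BuildHasherDefault;
use rustc_hash::FxHasher;

pub type FxHashMap<K, V> = hashbrown::HashMap<K, V, BuildHasherDefault<FxHasher>>;
pub type FxHashSet<T> = hashbrown::HashSet<T, BuildHasherDefault<FxHasher>>;
```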
* Add the Error trait impl back under a "std" feature
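A minimal sketch of the feature-gated impl, assuming `RegAllocError`
already implements `Debug` and `Display`:

```rust
// Only compile the Error impl when the "std" feature is enabled, keeping
// the default build no_std-compatible.
#[cfg(feature = "std")]
impl std::error::Error for RegAllocError {}
```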
* Re-introduce optional dedicated scratch registers
Dedicated scratch registers used for resolving move cycles were removed
in #51 and replaced with an algorithm to automatically allocate a
scratch register as needed.
However, in many cases a client will already have a non-allocatable
scratch register available for things like extended jumps (see #91). It
makes sense to re-use this register for regalloc rather than potentially
spilling an existing register.
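As a hedged sketch of how an embedder might supply one (the
`scratch_by_class` field name and its `Option` shape are assumptions about
the `MachineEnv` API):

```rust
use regalloc2::{MachineEnv, PReg, RegClass};

// Sketch: dedicate p15 as the integer-class scratch register so the
// allocator never has to spill a live register to break move cycles in
// that class. Field name and array shape are assumptions.
fn env_with_scratch(mut env: MachineEnv) -> MachineEnv {
    env.scratch_by_class[RegClass::Int as usize] = Some(PReg::new(15, RegClass::Int));
    env
}
```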
* Clarify comment
This allows a non-allocatable `PReg` to be passed directly through to the
allocations vector without any liverange tracking by the register
allocator. The main intended use case is to support ISA-specific special
registers such as a fixed zero register.
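A minimal sketch of how an embedder might use this, assuming a constructor
along the lines of `Operand::fixed_nonallocatable` (the name is an
assumption):

```rust
use regalloc2::{Operand, PReg, RegClass};

// Sketch: refer to a fixed, non-allocatable register (here imagined as a
// zero register) directly as an operand; the allocator passes the PReg
// through to the allocations vector without tracking a liverange for it.
// The constructor name is an assumption about the API.
fn zero_reg_operand() -> Operand {
    let zero = PReg::new(31, RegClass::Int); // must not be in the allocatable set
    Operand::fixed_nonallocatable(zero)
}
```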
Currently there is a loop that takes a variable step toward an end point
with an integer from `Arbitrary`; if this integer is always zero (for
example due to end-of-input?) then we add debug labels to a particular
input SSA value forever. This eventually causes an OOM crash. This PR
bounds the loop at a reasonable count (10) instead.
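A sketch of the bounded-loop pattern with hypothetical names (the real
generator differs in its details):

```rust
use arbitrary::{Result, Unstructured};

// Sketch with hypothetical names: advance toward an end point by steps
// taken from the fuzz input, but cap the iteration count so an input that
// always yields zero (e.g. once the input is exhausted) can't keep adding
// debug labels forever and OOM.
fn add_debug_labels(u: &mut Unstructured<'_>, end: u32) -> Result<Vec<u32>> {
    let mut labels = Vec::new();
    let mut pos = 0u32;
    for _ in 0..10 {                        // hard bound on iterations
        if pos >= end {
            break;
        }
        labels.push(pos);
        pos = pos.saturating_add(u.arbitrary::<u32>()? % 8);
    }
    Ok(labels)
}
```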
* Remove unused regalloc2-test crate
This code doesn't build, and Chris says it's "a really old harness that
existed prior to building the fuzzing and was used mainly to profile and
get stats before integration with Cranelift".
* Re-export libfuzzer/arbitrary from fuzzing module
This avoids needing to keep dependencies on `arbitrary` in sync across
the three different Cargo.toml files in this project.
However, before version 0.4.2, libfuzzer-sys only supported using its
macros if it was available at the top-level `libfuzzer_sys` path, which
breaks when re-exporting it. So I'm upgrading to that version (or the
newest patch release of it).
Upgrading libfuzzer-sys in turn brings in the 1.0 release of the
arbitrary crate, with a minor API change along the way.
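The re-export pattern itself is small; a rough sketch (module contents are
illustrative):

```rust
// Sketch: fuzz targets pull `arbitrary` and `libfuzzer_sys` through this
// module instead of declaring their own, possibly mismatched, versions.
pub mod fuzzing {
    pub use arbitrary;
    pub use libfuzzer_sys;
    // ...testcase generators and other fuzzing-only helpers live here too.
}
```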
* Clobbers: use a more efficient bitmask representation in API.
Currently, the `Function` trait requires a `&[PReg]` for the
clobber-list for a given instruction. In most cases where clobbers are
used, the list may be long: e.g., ABIs specify a fixed set of registers
that are clobbered and there may be ~half of all registers in this list.
What's more, the list can't be shared for e.g. all calls of a given ABI,
because actual return-values (defs) can't be clobbers. So we need to
allocate space for long, sometimes-slightly-different lists; this is
inefficient for the embedder and for us.
It's much more efficient to use a bitmask to represent a set of physical
registers. With current data structure bitpacking limitations, we can
support at most 128 physical registers; this means we can use a `u128`
bitmask. This also allows e.g. an embedder to start with a constant for
a given ABI, and mask out bits for actual return-value registers on call
instructions.
This PR makes that change, for minor but positive performance impact.
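A sketch of the idea (whether the public API exposes a raw `u128` or a
dedicated set type is not shown; register indices are assumed to be below
128, per the bitpacking limit above):

```rust
use regalloc2::PReg;

// Sketch: start from a per-ABI constant of clobbered registers and clear
// the bits of the actual return-value registers, since defs can't be
// clobbers.
fn call_clobbers(abi_clobbers: u128, return_regs: &[PReg]) -> u128 {
    let mut clobbers = abi_clobbers;
    for &preg in return_regs {
        clobbers &= !(1u128 << preg.index());
    }
    clobbers
}
```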
* Review comments.
* Extend fuzzer to generate cases like #53.
Currently, the fuzz testcase generator will add at most one
fixed-register constraint to an instruction per physical register. This
avoids impossible situations, such as requiring that both `v0` and `v1`
be placed in the same physical register `p0` at the same program point.
However, it *should* be possible to say that `v0` is in `p0` before the
instruction, and `v1` is in `p0` after the instruction (i.e., at `Early`
and `Late` operand positions).
This in fact exposes a limitation in the current allocator design: when
`v0` is live downward, with the above constraints, it will result in an
impossible allocation situation because we cannot split in the middle of
an instruction. A subsequent fix will rectify this by using the
multi-fixed-reg fixup mechanism.
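For concreteness, the kind of operand pair this allows (constructor and
enum names follow the current public API, which is an assumption on my
part):

```rust
use regalloc2::{Operand, OperandConstraint, OperandKind, OperandPos, PReg, RegClass, VReg};

// Sketch: v0 must be in p0 at Early (a use) and v1 must be in p0 at Late
// (a def). Both constraints name the same preg, but at different positions
// within the instruction, so they should be satisfiable.
fn early_late_fixed_pair() -> [Operand; 2] {
    let p0 = PReg::new(0, RegClass::Int);
    let v0 = VReg::new(0, RegClass::Int);
    let v1 = VReg::new(1, RegClass::Int);
    [
        Operand::new(v0, OperandConstraint::FixedReg(p0), OperandKind::Use, OperandPos::Early),
        Operand::new(v1, OperandConstraint::FixedReg(p0), OperandKind::Def, OperandPos::Late),
    ]
}
```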
* Handle conflicting Before and After fixed-reg constraints with a copy.
This fixes #53. Previously, if two operands on an instruction
specified *different* vregs constrained to the same physical register
at the Before (Early) and After (Late) points of the instruction, and
the Before was live downward as well, we would panic: we can't insert
a move into the middle of an instruction, so putting the first vreg in
the preg at Early implies we have an unsolvable conflict at Late.
We can solve this issue by adding some new logic to insert a copy, and
rewrite the constraint. This reuses the multi-fixed-reg-constraint
fixup logic. While that logic handles the case where the *same* vreg
has multiple *different* fixed-reg constraints, this new logic
handles *different* vregs with the *same* fixed-reg constraints, but
at different *program points*; so the two are complementary.
This addresses the specific test case in #53, and also fuzzes cleanly
with the change to the fuzz testcase generator to generate these
cases (which also immediately found the bug).
* Add a reservation to the PReg when rewriting the constraint so it is not doubly-allocated.
* Distinguish initial fixup moves from secondary moves.
* Use `trace` macro, not `log::trace`, to avoid trace output when feature is disabled.
* Rework operand rewriting to properly handle bundle-merging edge case.
When the liverange for the defined vreg with fixed constraint at Late is
*merged* with the liverange for the used vreg with fixed constraint at
Early, the strategy of putting a fixed reservation on the preg at Early
fails, because the whole bundle is minimal (if it spans just the
instruction's Early and Late and nothing else). This could happen if
e.g. the def flows into a blockparam arg that merges with a blockparam
defining the used value.
Instead we move the def one halfstep earlier, to the Early point, with
its fixed-reg constraint still in place. This has the same effect but
works when the two are merged.
* Fix checker issue: make the checker more flexible in the presence of victim-register saves.
Currently, regalloc2 sets aside one register per class, unconditionally,
to make move resolution possible. To solve the "parallel moves problem",
we sometimes need to conjure a cyclic permutation of data among
registers or stack slots (this can result, for example, from blockparam
flow that swaps two values on a loop backedge). This set-aside scratch
register is used when a cycle exists.
regalloc2 also uses the scratch register when needed to break down a
stack-to-stack move (which could happen due to blockparam moves on edges
when source and destination are both spilled) into a stack-to-reg move
followed by reg-to-stack, because most machines have loads and stores
but not memory-to-memory moves.
A set-aside register is certainly the simplest solution, but it is not
optimal: it means that we have one fewer register available for use by
the program, and this can be costly especially on machines with fewer
registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some
registers may be set aside by our embedder for other purposes too. Every
register we can reclaim buys nontrivial performance in large function
bodies!
This PR removes this restriction and allows regalloc2 to use all
available physical registers. It then solves the two problems above,
cyclic moves and stack-to-stack moves, with a two-stage approach:
- First, it finds a location to use to resolve cycles, if any exist. If
a register is unallocated at the location of the move, we can use it.
Often we get lucky and this is the case. Otherwise, we allocate a
stackslot to use as the temp. This is perfectly fine at this stage,
even if it means that we have more stack-to-stack moves.
- Then, it resolves stack-to-stack moves into stack-to-reg /
reg-to-stack. There are two subcases here. If there is *another*
available free physical register, we opportunistically use it for this
decomposition. If not, we fall back to our last-ditch option: we pick
a victim register of the appropriate class, we allocate another
temporary stackslot, we spill the victim to that slot just for this
move, we do the move in the above way (stack-to-reg / reg-to-stack)
with the victim, then we reload the victim. So one move (original
stack-to-stack) becomes four moves, but no state is clobbered.
This PR extends the `moves` fuzz-target to exercise this functionality
as well, randomly choosing whether some spare registers exist, and
randomly generating {stack,reg}-to-{stack,reg} moves in the initial
parallel-move input set. The target does a simple symbolic simulation of
the sequential move sequence and ensures that the final state is
equivalent to the parallel-move semantics.
I fuzzed both the `moves` target (focusing on the new logic) and the
`ion_checker` target (exercising the whole register allocator); both seem
clean (~150M cases on the former, ~1M cases on the latter).
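The second stage's stack-to-stack decomposition can be pictured with the
following illustrative sketch; the types and helpers here are hypothetical,
not the allocator's actual move-resolution code:

```rust
// Illustrative sketch only; all names and types are hypothetical.
enum Loc {
    Reg(u8),
    Stack(u32),
}

// Decompose one stack-to-stack move into register-based moves.
fn resolve_stack_to_stack(
    from: Loc,
    to: Loc,
    free_reg: Option<u8>,
    mut alloc_temp_slot: impl FnMut() -> u32,
) -> Vec<(Loc, Loc)> {
    match free_reg {
        // Lucky case: a register is free at this program point, so the move
        // becomes stack-to-reg followed by reg-to-stack.
        Some(r) => vec![(from, Loc::Reg(r)), (Loc::Reg(r), to)],
        // Last-ditch case: spill a victim register of the right class around
        // the move; one move becomes four, but no state is clobbered.
        None => {
            let victim = 0u8;             // chosen to avoid the move's own regs
            let save = alloc_temp_slot(); // temporary slot to hold the victim
            vec![
                (Loc::Reg(victim), Loc::Stack(save)), // save victim
                (from, Loc::Reg(victim)),             // stack -> reg
                (Loc::Reg(victim), to),               // reg -> stack
                (Loc::Stack(save), Loc::Reg(victim)), // restore victim
            ]
        }
    }
}
```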
* Support for debug-labels.
If the client adds labels to vregs across ranges of instructions in the
input program, the regalloc will provide metadata in the `Output` that
describes the `Allocation`s in which each such vreg is stored for those
ranges. This allows the client to emit debug metadata telling a debugger
where to find program values at each point in the program.
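Conceptually the data flow looks like the sketch below; the entry shapes
are assumptions for illustration, not the exact API:

```rust
use regalloc2::{Allocation, Inst, VReg};

// Hypothetical shapes: the client supplies one request per labeled vreg and
// range, and the regalloc reports where that value actually lives over each
// sub-range so the client can emit debugger location lists.
struct DebugLabelRequest {
    vreg: VReg,
    label: u32,
    from: Inst, // label is valid over [from, to)
    to: Inst,
}

struct DebugLabelLocation {
    label: u32,
    from: Inst,
    to: Inst,
    alloc: Allocation, // register or spillslot holding the value in this range
}
```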
The documentation says that this is only used for heuristics, but it
is never actually called. This should be removed for now and perhaps
added back later if we find an actual use for it.
- Support preferred and non-preferred subsets of a register class. This
allows allocating, e.g., caller-saved registers before callee-saved
registers (see the sketch after this list).
- Allow branch blockparam args to start at a certain offset in branch
operands; this allows branches to have other operands too (e.g.,
conditional-branch inputs).
- Allow `OperandOrAllocation` to be constructed from an `Allocation` and
`OperandKind` as well (i.e., an allocation with a use/def bit).
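For the first item, a sketch of how an embedder might populate the two
subsets (the field names are my assumption about the `MachineEnv` shape);
registers in the preferred set are tried first:

```rust
use regalloc2::{MachineEnv, PReg, RegClass};

// Sketch: caller-saved registers go in the preferred set, callee-saved
// registers in the non-preferred set. Field names are assumptions.
fn int_class_env(mut env: MachineEnv) -> MachineEnv {
    let class = RegClass::Int as usize;
    env.preferred_regs_by_class[class] =
        (0..8).map(|i| PReg::new(i, RegClass::Int)).collect(); // caller-saved
    env.non_preferred_regs_by_class[class] =
        (8..16).map(|i| PReg::new(i, RegClass::Int)).collect(); // callee-saved
    env
}
```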
The main enhancement in this commit is support for reference types and
stackmaps. This requires tracking whether each VReg holds a "reference" or
"pointer" value. At certain instructions designated as "safepoints", the
regalloc will (i) ensure that all references are in spillslots rather
than in registers, and (ii) provide a list of exactly which spillslots
have live references at that program point. This can be used by, e.g., a
GC to trace and possibly modify pointers. The stackmap of spillslots is
precise: it includes all live references, and *only* live references.
This commit also brings in some API tweaks as part of the in-progress
Cranelift glue. In particular, it makes Allocations and Operands
mutually disjoint by using the same bitfield for the type-tag in both
and choosing non-overlapping tags. This will allow instructions to carry
an Operand for each register slot and then overwrite these in place with
Allocations. The `OperandOrAllocation` type does the necessary magic to
make this look like an enum while staying in 32 bits.
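As a rough sketch of how an embedder might consume the stackmap data (the
`(ProgPoint, SpillSlot)` pair layout is an assumption about the output
format):

```rust
use regalloc2::{ProgPoint, SpillSlot};

// Sketch: given the per-safepoint (program point, spillslot) pairs the
// allocator reports, collect the slots holding live references at one
// safepoint so a GC can trace (and possibly update) those pointers.
fn live_ref_slots(safepoint_slots: &[(ProgPoint, SpillSlot)], at: ProgPoint) -> Vec<SpillSlot> {
    safepoint_slots
        .iter()
        .filter(|&&(p, _)| p == at)
        .map(|&(_, slot)| slot)
        .collect()
}
```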