- Cache the most recent u64 chunk in the set to avoid some hashmap
lookups;
- Defer the live-set union'ing over the loop body until query time
(remember the set that would have been union'd in instead), and lazily
propagate the liveness bit at that query time, union-find style;
- Do n-1 rather than n union operations for n successors (first is a
clone instead);
- Don't union in liveness sets from blocks we haven't visited yet (the
loop-body/backedge handling handles these).
This generalizes the allocator to accept multiple defs by making defs
just another type of "use" (uses are now perhaps more properly called
"mentions", but for now we abuse the terminology slightly).
It turns out that this actually was not terribly hard, because we don't
rely on the properties that a strict SSA requirement otherwise might
allow us to: e.g., defs always at exactly the start of a vreg's ranges.
Because we already accepted arbitrary block order and irreducible CFGs,
and approximated live-ranges with the single-pass algorithm, we are
robust in our "stitching" (move insertion) and so all we really care
about is computing some superset of the actual live-ranges and then a
non-interfering coloring of (split pieces of) those ranges. Multiple
defs don't change that, as long as we compute the ranges properly.
We still have blockparams in this design, so the client *can* provide
SSA directly, and everything will work as before. But a client that
produces non-SSA need not use them at all; it can just happily reassign
to vregs and everything will Just Work.
This is part of the effort to port Cranelift over to regalloc2; I have
decided that it may be easier to build a compatibility shim that matches
regalloc.rs's interface than to continue boiling the ocean and
converting all of the lowering sequences to SSA. It then becomes a
separable piece of work (and simply further performance improvements and
simplifications) to remove the need for this shim.
- Support preferred and non-preferred subsets of a register class. This
allows allocating, e.g., caller-saved registers before callee-saved
registers.
- Allow branch blockparam args to start an a certain offset in branch
operands; this allows branches to have other operands too (e.g.,
conditional-branch inputs).
- Allow `OperandOrAllocation` to be constructed from an `Allocation` and
`OperandKind` as well (i.e., an allocation with an use/def bit).
The main enhancement in this commit is support for reference types and
stackmaps. This requires tracking whether each VReg is a "reference" or
"pointer". At certain instructions designated as "safepoints", the
regalloc will (i) ensure that all references are in spillslots rather
than in registers, and (ii) provide a list of exactly which spillslots
have live references at that program point. This can be used by, e.g., a
GC to trace and possibly modify pointers. The stackmap of spillslots is
precise: it includes all live references, and *only* live references.
This commit also brings in some API tweaks as part of the in-progress
Cranelift glue. In particular, it makes Allocations and Operands
mutually disjoint by using the same bitfield for the type-tag in both
and choosing non-overlapping tags. This will allow instructions to carry
an Operand for each register slot and then overwrite these in place with
Allocations. The `OperandOrAllocation` type does the necessary magic to
make this look like an enum, but staying in 32 bits.
We currently use a heuristic that our scan for an available PReg
starts at an index into the register list that rotates with the bundle
index. This is a simple way to distribute contention across the whole
register file more evenly and avoid repeating less-likely-to-succeed
reg-map probes to lower-numbered registers for every bundle.
After some experimentation with different options (queue that
dynamically puts registers at end after allocating, various
ways of mixing/hashing indices, etc.), adding the *instruction offset*
(of the start of the first range in the bundle) as well gave the best
results. This is very simple and gives us a likely better-than-random
conflict avoidance because ranges tend to be local, so rotating
through registers as we scan down the list of instructions seems like
a very natural strategy.
On the tests used by our `cargo bench` benchmark, this reduces regfile
probes for the largest (459 instruction) benchmark from 1538 to 829,
i.e., approximately by half, and results in an 11% allocation speedup.