309 Commits

Nick Fitzgerald
fdcf7b694f Avoid indexing and use iteration (#98)
...while still appeasing the borrow checker by taking spillsets out of `self`
and then putting them back in again when we're done.

I was just doing this to enable playing with some data structures in
follow-up commits, but decided to benchmark this commit as-is and found
2-4% speedups in Cranelift compilation!

```
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 39946528.13 ± 38398.29 (confidence = 99%)

  no-index.so is 1.04x to 1.04x faster than main.so!

  [985704952 985984130.24 986180413] main.so
  [945649144 946037602.11 946262076] no-index.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 48413802.56 ± 34288.05 (confidence = 99%)

  no-index.so is 1.03x to 1.03x faster than main.so!

  [1593663899 1593926801.92 1594246604] main.so
  [1545196678 1545512999.36 1545802144] no-index.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 841028066.56 ± 253404.59 (confidence = 99%)

  no-index.so is 1.02x to 1.02x faster than main.so!

  [34798712681 34801346430.28 34802786661] main.so
  [33958847844 33960318363.72 33962177143] no-index.so
```
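
The underlying change is just the take-and-put-back idiom. A minimal sketch, with `SpillSet`, the field, and the method names as hypothetical stand-ins for regalloc2's actual definitions:

```
struct SpillSet;

struct Env {
    spillsets: Vec<SpillSet>,
}

impl Env {
    fn process_spillsets(&mut self) {
        // Move the spillsets out of `self` so the loop below can call
        // `&mut self` methods without holding a borrow of `self.spillsets`.
        let mut spillsets = std::mem::take(&mut self.spillsets);
        for spillset in &mut spillsets {
            self.process_one(spillset);
        }
        // Put them back in when we're done.
        self.spillsets = spillsets;
    }

    fn process_one(&mut self, _spillset: &mut SpillSet) { /* ... */ }
}
```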
2022-11-01 10:00:09 -07:00
Chris Fallin
b4eedf3f32 Fix the moves fuzz target. (#97)
* Fix iteration bug in move fuzz-target driver.

* Remove panic no longer needed: stack-to-stack moves are now properly handled.
2022-10-21 09:08:27 -07:00
Alex Crichton
d6b15095c5 Add explicit dependency on libfuzzer-sys (#96)
This fixes a build issue with the most recent version of `libfuzzer-sys`
where the prior version of the `fuzz_target!` macro didn't need the
crate as an explicit dependency but the latest version does.
2022-10-21 08:33:50 -07:00
Jamey Sharp
eb259e8aba Some small perf improvements (#95)
* Do conflict-set hash lookups once, not twice

This makes the small wasmtime bz2 benchmark 1% faster, per Hyperfine and
Sightglass. The effect disappears into the noise on larger benchmarks.
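
The idiom here, sketched under the assumption that the conflict set is a plain `HashSet` (names hypothetical):

```
use std::collections::HashSet;

// `HashSet::insert` already reports whether the element was newly
// added, so a separate `contains` check before the `insert` would
// hash the key a second time for no benefit.
fn note_conflict(conflict_set: &mut HashSet<u32>, bundle: u32) -> bool {
    conflict_set.insert(bundle)
}
```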

* Inline PosWithPrio::key

When compiling the pulldown-cmark benchmark from Sightglass, this is the
single most frequently called function: it's invoked 2.5 million times.
Inlining it reduces instructions retired by 1.5% on that benchmark,
according to `valgrind --tool=callgrind`.

This patch is "1.01 ± 0.01 times faster" according to Hyperfine for the
bz2, pulldown-cmark, and spidermonkey benchmarks from Sightglass.
Sightglass, in turn, agrees that all three benchmarks are 1.01x faster
by instructions retired, and the first two are around 1.01x faster by
CPU cycles as well.

* Inline and simplify AdaptiveMap::expand

Previously, `get_or_insert` would iterate over the keys to find one that
matched; then, if none did, iterate over the values to check if any are
0; then iterate again to remove all zero values and compact the map.

This commit instead focuses on picking an index to use: preferably one
where the key already exists; otherwise, an unused index; and if there
are none of those either, an index where the value is zero.

As a result this iterates the two arrays at most once each, and both
iterations can stop early.

The downside is that keys whose value is zero are not removed as
aggressively. It might be worth pruning such keys in `IndexSet::set`.
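
A sketch of that selection order, assuming a small fixed-capacity map with parallel key/value arrays (not the actual `AdaptiveMap` code):

```
fn pick_index(keys: &[u32], values: &[u64], len: usize, key: u32) -> Option<usize> {
    // First pass over the keys: stop early if the key already exists.
    if let Some(i) = keys[..len].iter().position(|&k| k == key) {
        return Some(i);
    }
    // Otherwise, take an unused index if the map is not full.
    if len < keys.len() {
        return Some(len);
    }
    // Failing that, reuse the first slot whose value is zero.
    values[..len].iter().position(|&v| v == 0)
}
```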

Also:

- `#[inline]` both implementations of `Iterator::next`
- Replace `set_bits` with direct use of the `SetBitsIter` constructor

These changes together reduce instructions retired when compiling the
pulldown-cmark benchmark by 0.9%.
2022-10-11 08:23:02 -07:00
Chris Fallin
1efaa73943 Modify a SmallVec inline size for UseList to be slightly larger. (#93)
This PR updates the `UseList` type alias to a `SmallVec` with 4
`Use`s (which are 4 bytes each) rather than 2, because we get 16 bytes
of space "for free" in a `SmallVec` on a 64-bit machine.

This PR improves the compilation performance of Cranelift by 1% on
SpiderMonkey.wasm (measured on a Linux desktop with pinned CPU
frequency, and pinned to one core).
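
The change itself is small; a sketch with a stand-in 4-byte `Use` (regalloc2's real `Use` is more elaborate):

```
use smallvec::SmallVec;

// A hypothetical 4-byte stand-in, matching the size quoted above.
#[derive(Clone, Copy)]
struct Use(u32);

// An inline capacity of 4 fills the 16 bytes a `SmallVec` gets "for
// free" on a 64-bit machine before any heap allocation is needed.
type UseList = SmallVec<[Use; 4]>;
```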

It's worth noting also that before making these changes, I explored
whether it would be possible to put the lists of uses and liveranges
in single large backing `Vec`s; the basic reason why we can't do this
is that during liverange construction, we append to many lists
concurrently. One could use a linked-list arrangement, and in fact RA2
did this early in its development; the separate `SmallVec`s were
better for performance overall because the cache locality wins when we
traverse the lists many times. It may still be worth investigating use
of an arena to allocate the vecs rather than the default heap allocator.
2022-10-07 13:37:27 -07:00
Chris Fallin
6394913c1d Add README note that non-SSA support is deprecated. (#94)
As noted in #4.
2022-10-05 10:43:15 -07:00
Amanieu d'Antras
227a9fde91 Cache HashSet in try_to_allocate_bundle_to_reg (#90)
Keep `conflict_set` allocated in `Env` instead of allocating a new one
on every call. This improves register allocation performance by about
2%.
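
A sketch of the caching pattern, with hypothetical names: keep the set alive in `Env` and `clear()` it per call so its allocation is reused.

```
use std::collections::HashSet;

struct Env {
    conflict_set: HashSet<u32>,
}

impl Env {
    fn try_to_allocate_bundle_to_reg(&mut self, conflicting: &[u32]) {
        // Reuse the allocation left over from the previous call rather
        // than constructing a fresh HashSet each time.
        self.conflict_set.clear();
        for &bundle in conflicting {
            self.conflict_set.insert(bundle);
        }
        /* ... */
    }
}
```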
2022-09-26 16:14:43 -07:00
Amanieu d'Antras
67f5c167a8 Fix documentation for inst_clobbers (#89)
Clobbers don't conflict with early uses, so they effectively act as a
late def, not an early def.
2022-09-26 16:14:08 -07:00
Chris Fallin
061963f469 Version 0.4.1. (#86) 2022-09-22 15:20:16 -07:00
Chris Fallin
bcfc10c44e Fix fallback-split behavior: trim start of minimal bundle wrt start of original LR. (#85)
When a liverange starts at a *late* point of an instruction, and it
undergoes the fallback "split into all minimal pieces" transform, we end
up creating one minimal bundle that starts at the *early* point of the
instruction at the start of the original LR. This can create
impossible-to-allocate situations where a fixed-constraint LR overlaps
another constrained to the same register (e.g. at calls). We fix this by
ensuring the minimal bundle is trimmed only to the half of the
instruction that overlaps the original LR.

This is analogous to the third fix in #74, but on the other end (start
of LR rather than end of it).
2022-09-22 15:09:49 -07:00
Chris Fallin
0fb9c03e09 Version 0.4.0. (#84)
This is identical to v0.3.3, but it turns out that the removal of the
register class from `SpillSlot` (#80) is an API change. I missed this
and published 0.3.3 but yanked it after it was causing build issues for
folks.
2022-09-20 17:27:38 -07:00
Chris Fallin
b98e8ad517 Version 0.3.3. (#83) 2022-09-20 16:08:01 -07:00
Chris Fallin
1b38a71e38 Some fixes to allow for call instructions to name args, returns, and clobbers with constraints. (#74)
* Some fixes to allow for call instructions to name args, returns, and clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in #53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix #53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Reworks the fixups for the def case mentioned above to not shift the
  def to the Early point. Doing so causes issues when the def is to a
  reffy vreg: it can then be falsely included in a stackmap if the
  instruction containing this operand is a safepoint.

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  #59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.

* Fix fuzzbug: do not merge when a liverange has a fixed-reg def.

This can create impossible situations: e.g., if a vreg is constrained
to `p0` as a late-def, and another, completely different vreg is
constrained to `p0` as an early-use on the same instruction, and the
instruction also has a third vreg (early-use), we do not want to merge
the liverange for the third vreg with the first, because it would
result in an unsolvable conflict for `p0` at the early-point.

* Review comments.
2022-09-20 15:58:20 -07:00
Amanieu d'Antras
906a053208 Remove register class from SpillSlot (#80)
* Remove register class from `SpillSlot`

The register allocator was already allowing moves between spillslots and
registers of different classes, so this PR formalizes this by making
spillslots independent of register class.

This also fixes #79 by properly tracking the register class of an
`InsertedMove` with the `to_vreg` field, which turns out never to
be `None` in practice. Removing the `Option` allows the register
class of the `VReg` to be used when building the per-class move lists.

Fixes #79

* Address review feedback
2022-09-20 14:05:23 -07:00
Jeffrey Crocker
3db1b7199b Fix design doc typo (#82) 2022-09-20 14:04:56 -07:00
Amanieu d'Antras
520cafa129 Handle fixed stack slots in the move resolver (#78)
Fixed stack slots are treated as `PReg`s by most of the register
allocator, but need some additional handling in the move resolver to
avoid generating stack-to-stack moves.
2022-09-19 12:27:24 -07:00
Amanieu d'Antras
1495c1e342 Add fixed-non-allocatable operand support (#77)
This allows a non-allocatable `PReg` to be passed directly through to
the allocations vector without any liverange tracking from the register
allocator. The main intended use case is to support ISA-specific special
registers such as a fixed zero register.
2022-09-19 12:23:53 -07:00
Amanieu d'Antras
aeef47a06b Make more PReg & VReg methods const (#76)
Also remove the previous workaround for const-assert now that it is
available on stable Rust.
2022-09-12 10:03:29 -07:00
Chris Fallin
be47ac39e8 Fuzzing function generator: bound the debug-labels size. (#73)
Currently there is a loop that takes a variable step toward an end point
with an integer from `Arbitrary`; if this integer is always zero (for
example due to end-of-input?) then we add debug labels to a particular
input SSA value forever. This eventually causes an OOM crash. This PR
bounds the loop at a reasonable count (10) instead.
2022-08-31 10:10:38 -07:00
Teymour Aldridge
ad39c66fe7 Fix typo. (#71) 2022-08-04 20:24:36 -07:00
Jamey Sharp
81c11d37d6 Version 0.3.2 (#70)
Includes a modest improvement in memory usage and performance by
removing analysis that was only used during fuzzing (#66).
2022-08-03 15:25:45 -07:00
Jamey Sharp
47cb8234b6 Remove old tests and reexport libfuzzer from fuzzing module (#69)
* Remove unused regalloc2-test crate

This code doesn't build, and Chris says it's "a really old harness that
existed prior to building the fuzzing and was used mainly to profile and
get stats before integration with Cranelift".

* Re-export libfuzzer/arbitrary from fuzzing module

This avoids needing to keep dependencies on `arbitrary` in sync across
the three different Cargo.toml files in this project.

However, before version 0.4.2, libfuzzer-sys only supported using its
macros if it was available at the top-level `libfuzzer_sys` path, which
breaks when re-exporting it. So I'm upgrading to that version (or the
newest patch release of it).

Upgrading libfuzzer-sys in turn brings in the 1.0 release of the
arbitrary crate, with a minor API change along the way.
2022-08-03 14:28:37 -07:00
Jamey Sharp
1f915bb9b6 validate_ssa: Check def uniqueness first (#68)
* validate_ssa: Check def uniqueness first

I had trouble understanding `validate_ssa`, so this changes the division
of work to establish one invariant at a time:

- First, check that no VReg is defined in more than one place, and
  record which block it's defined in.
- Second, check that each use refers to a VReg that was defined either
  earlier in the same block, or in a block dominating it.

I'm also enhancing the validation to check that every VReg use is in an
instruction that is strictly later than its def. Previously, an earlier
operand in the same instruction would also be considered a valid def,
but that isn't actually legal in SSA.
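
A sketch of the two-pass validation, simplified so that every def and use is a `(vreg, block, inst)` triple (blockparams are ignored) and dominance is supplied by the caller:

```
use std::collections::HashMap;

fn validate_ssa(
    defs: &[(u32, u32, u32)],             // (vreg, block, inst)
    uses: &[(u32, u32, u32)],             // (vreg, block, inst)
    dominates: impl Fn(u32, u32) -> bool, // does block A dominate block B?
) -> Result<(), String> {
    // Pass 1: every vreg is defined in exactly one place; record where.
    let mut def_site: HashMap<u32, (u32, u32)> = HashMap::new();
    for &(vreg, block, inst) in defs {
        if def_site.insert(vreg, (block, inst)).is_some() {
            return Err(format!("vreg {vreg} defined more than once"));
        }
    }
    // Pass 2: every use is strictly after its def in the same block,
    // or in a block dominated by the def's block.
    for &(vreg, block, inst) in uses {
        let &(def_block, def_inst) = def_site
            .get(&vreg)
            .ok_or_else(|| format!("vreg {vreg} used but never defined"))?;
        let ok = if block == def_block {
            inst > def_inst // a def in the same instruction doesn't count
        } else {
            dominates(def_block, block)
        };
        if !ok {
            return Err(format!("vreg {vreg} used before its def"));
        }
    }
    Ok(())
}
```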

* Address review comments
2022-08-03 10:52:50 -07:00
Jamey Sharp
1955c6dfb5 Only record vreg definitions when fuzzing (#66)
The `vreg_def_blockparam` and `vreg_def_inst` fields on CFGInfo are only
used in `validate_ssa`, which in turn is only used in the ssagen fuzz
target.

Since these fields are never read in normal usage, initializing them is
entirely wasted effort. According to valgrind/DHAT, when running
`wasmtime compile` on the Sightglass Spidermonkey benchmark, removing
these fields saves about 100M instructions, 23k heap allocations
totalling 40MiB, and 47MiB of writes to the heap.
2022-07-29 10:25:27 -07:00
Chris Fallin
0eb3deb384 Version 0.3.1. (#65) 2022-07-20 10:57:15 -07:00
Benjamin Bouvier
a33b044d6c Streamline log enablement (#64)
* Remove `log_enabled` statements around `annotate` calls, which are already guarded by `annotations_enabled`

* Use a `trace_enabled!()` macro that follows the same logic as `trace!()` to determine whether additional tracing has been enabled
2022-07-20 10:44:31 -07:00
Chris Fallin
8bede950d0 Release 0.3.0. (#63)
Includes improvements in splitting performance (#59) and more efficient
handling of clobber lists (#58).

Semver break because of API change in #58.
2022-06-27 14:09:15 -07:00
Chris Fallin
4eb2a2528b Limit split count per original bundle with fallback 1-to-N split. (#59)
* Limit split count per original bundle with fallback 1-to-N split.

Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.

This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.

This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.

This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle, via the spill bundle, to a stackslot (or, at
best, a register from second-chance allocation), and then makes minimal
reservations of registers just at uses/defs, moving the "spilled" value
into/out of them immediately.
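
A sketch of the threshold logic, with hypothetical names (the commit fixes the preset threshold at 10):

```
const MAX_SPLITS_PER_SPILLSET: u32 = 10;

enum SplitKind {
    // The normal strategy: split the bundle into two halves.
    Bisect,
    // The fallback: split into minimal bundles along the whole length,
    // with use-free regions relegated to the spill bundle.
    IntoMinimalPieces,
}

fn choose_split(splits_so_far: &mut u32) -> SplitKind {
    if *splits_so_far >= MAX_SPLITS_PER_SPILLSET {
        SplitKind::IntoMinimalPieces
    } else {
        *splits_so_far += 1;
        SplitKind::Bisect
    }
}
```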

Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).

* Review comments.
2022-06-27 13:23:09 -07:00
Chris Fallin
9733cb2227 Clobbers: use a more efficient bitmask representation in API. (#58)
* Clobbers: use a more efficient bitmask representation in API.

Currently, the `Function` trait requires a `&[PReg]` for the
clobber-list for a given instruction. In most cases where clobbers are
used, the list may be long: e.g., ABIs specify a fixed set of registers
that are clobbered and there may be ~half of all registers in this list.
What's more, the list can't be shared for e.g. all calls of a given ABI,
because actual return-values (defs) can't be clobbers. So we need to
allocate space for long, sometimes-slightly-different lists; this is
inefficient for the embedder and for us.

It's much more efficient to use a bitmask to represent a set of physical
registers. With current data structure bitpacking limitations, we can
support at most 128 physical registers; this means we can use a `u128`
bitmask. This also allows e.g. an embedder to start with a constant for
a given ABI, and mask out bits for actual return-value registers on call
instructions.
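
A hedged sketch of that usage pattern; the ABI constant and register indices below are invented for illustration:

```
// Hypothetical: low 16 bits = the caller-saved registers of some ABI.
const CALL_CLOBBERS: u128 = 0xffff;

// Start from the per-ABI constant and mask out the registers actually
// used for return values, since defs can't also be clobbers.
fn clobbers_for_call(return_value_regs: &[u8]) -> u128 {
    let mut clobbers = CALL_CLOBBERS;
    for &preg_index in return_value_regs {
        clobbers &= !(1u128 << preg_index);
    }
    clobbers
}
```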

This PR makes that change, for minor but positive performance impact.

* Review comments.
2022-06-27 12:27:19 -07:00
Chris Fallin
06b3baf9f9 Version 0.2.3. (#61) 2022-06-27 11:37:57 -07:00
Chris Fallin
68aa61571b Bugfix: no hole in liveranges for pinned vreg move src. (#60)
Right now, pinned vregs are a way of naming real registers (a
compatibility shim of sorts for Cranelift's `RealReg`s) and can be used
as sources and dests of moves. When the input program does so, regalloc2
converts these into "ghost" uses and defs on the other vreg of the move
(dest or source, respectively) with a fixed-register constraint. So
`move v128, v0` where `v0` is pinned to `p0` turns into a "ghost def"
on `v128` with constraint `fixed p0`.

There is some fancy manipulation of liveranges to make this all work
while properly recording where the preg's value must be preserved.
Unfortunately, there was an off-by-one in the location of the move and
the transition of live-ranges, which interacted poorly with the "implicit
live-in" of pinned vregs at function start. As a result, a function body
that starts like:

```
    move v128, v0
    def v9000
    move v129, v1
```

might allocate `p1` (to which `v1` is pinned) for `v9000`. This clobbers
the original value.

Fortunately this only impacts the implicit live-in, and Cranelift's use
of regalloc2 is such that it will always copy all values out of pinned
vregs (creating ghost defs) without intervening defs, *except* in the
case of `sret` ("structure return") arguments. If a program does not use
`sret` arguments (and the `cranelift-wasm` frontend does not), then this
bug should not be reachable.

Long-term, we really need to kill pinned vregs with fire (#3); the
special cases that arise from these, and from special handling of moves,
are too much incidental complexity. All of this can go away once
Cranelift migrates all fixed-register cases to operand constraints
instead. That will be a happy day.

Thanks to @bjorn3 for finding and reporting this issue!
2022-06-27 11:21:19 -07:00
Chris Fallin
b78ccbce6e Version 0.2.2. (#57) 2022-06-03 16:15:54 -07:00
Chris Fallin
427e041f1c Fix spillslot allocation to actually reuse spillslots. (#56)
* Fix spillslot allocation to actually reuse spillslots.

The old logic, which did some linked-list rearranging to try to probe
more-likely-to-be-free slots first and which was inherited straight from
the original IonMonkey allocator, was slightly broken (error in
translation and not in IonMonkey, to be clear): it did not get the
list-splicing right, so quite often dropped a slot on the floor and
failed to consider it for further reuse.

On some experimentation, it seems to work just as well to keep a
SmallVec of spillslot indices per size class instead, and save the last
probe-point in order to spread load throughout the allocated slots while
limiting the number of probes (to bound quadratic behavior).
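
A sketch of that probing scheme, with hypothetical names and probe bound; one such slot list would exist per size class:

```
const MAX_PROBES: usize = 8; // hypothetical bound on probes

fn find_reusable_slot(
    slots: &[u32],           // spillslot indices for this size class
    probe_start: &mut usize, // saved cursor from the previous search
    is_free: impl Fn(u32) -> bool,
) -> Option<u32> {
    if slots.is_empty() {
        return None;
    }
    // Start where the last search left off, spreading load across the
    // allocated slots while bounding the number of probes.
    for i in 0..slots.len().min(MAX_PROBES) {
        let idx = (*probe_start + i) % slots.len();
        if is_free(slots[idx]) {
            *probe_start = idx + 1; // resume here next time
            return Some(slots[idx]);
        }
    }
    None // no luck within the bound: caller allocates a fresh slot
}
```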

This change moves the maximum slot count from 285 to 92 in `python.wasm`
from bytecodealliance/wasmtime#4214, and the maximum frame size from
2384 bytes to 752 bytes.
2022-06-03 16:01:10 -07:00
Chris Fallin
257c5ccc18 Bump version to 0.2.1. (#55) 2022-05-31 14:33:55 -07:00
Chris Fallin
52818a7ed6 Handle conflicting Before and After fixed-reg constraints with a copy. (#54)
* Extend fuzzer to generate cases like #53.

Currently, the fuzz testcase generator will add at most one
fixed-register constraint to an instruction per physical register. This
avoids impossible situations, such as specifying that both `v0` and `v1`
must be placed into the same `p0`.

However, it *should* be possible to say that `v0` is in `p0` before the
instruction, and `v1` is in `p0` after the instruction (i.e., at `Early`
and `Late` operand positions).

This in fact exposes a limitation in the current allocator design: when
`v0` is live downward, with the above constraints, it will result in an
impossible allocation situation because we cannot split in the middle of
an instruction. A subsequent fix will rectify this by using the
multi-fixed-reg fixup mechanism.

* Handle conflicting Before and After fixed-reg constraints with a copy.

This fixes #53. Previously, if two operands on an instruction
specified *different* vregs constrained to the same physical register
at the Before (Early) and After (Late) points of the instruction, and
the Before was live downward as well, we would panic: we can't insert
a move into the middle of an instruction, so putting the first vreg in
the preg at Early implies we have an unsolvable conflict at Late.

We can solve this issue by adding some new logic to insert a copy, and
rewrite the constraint. This reuses the multi-fixed-reg-constraint
fixup logic. While that logic handles the case where the *same* vreg
has multiple *different* fixed-reg constraints, this new logic
handles *different* vregs with the *same* fixed-reg constraints, but
at different *program points*; so the two are complementary.

This addresses the specific test case in #53, and also fuzzes cleanly
with the change to the fuzz testcase generator to generate these
cases (which also immediately found the bug).

* Add a reservation to the PReg when rewriting the constraint so it is not doubly-allocated.

* Distinguish initial fixup moves from secondary moves.

* Use `trace` macro, not `log::trace`, to avoid trace output when feature is disabled.

* Rework operand rewriting to properly handle bundle-merging edge case.

When the liverange for the defined vreg with fixed constraint at Late is
*merged* with the liverange for the used vreg with fixed constraint at
Early, the strategy of putting a fixed reservation on the preg at Early
fails, because the whole bundle is minimal (if it spans just the
instruction's Early and Late and nothing else). This could happen if
e.g. the def flows into a blockparam arg that merges with a blockparam
defining the used value.

Instead we move the def one halfstep earlier, to the Early point, with
its fixed-reg constraint still in place. This has the same effect but
works when the two are merged.

* Fix checker issue: make the checker more flexible in the presence of victim-register saves.
2022-05-31 14:01:27 -07:00
Chris Fallin
0395614545 Bump version to 0.2.0. (#52)
The change in #51 was API-visible (removal of the "scratch register"
field in the `Env`) so this is a semver bump.
2022-05-23 11:13:16 -07:00
Chris Fallin
869c21e79c Remove an explicitly-set-aside scratch register per class. (#51)
Currently, regalloc2 sets aside one register per class, unconditionally,
to make move resolution possible. To solve the "parallel moves problem",
we sometimes need to conjure a cyclic permutation of data among
registers or stack slots (this can result, for example, from blockparam
flow that swaps two values on a loop backedge). This set-aside scratch
register is used when a cycle exists.

regalloc2 also uses the scratch register when needed to break down a
stack-to-stack move (which could happen due to blockparam moves on edges
when source and destination are both spilled) into a stack-to-reg move
followed by reg-to-stack, because most machines have loads and stores
but not memory-to-memory moves.

A set-aside register is certainly the simplest solution, but it is not
optimal: it means that we have one fewer register available for use by
the program, and this can be costly especially on machines with fewer
registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some
registers may be set aside by our embedder for other purposes too. Every
register we can reclaim buys some nontrivial performance in large
function bodies!

This PR removes this restriction and allows regalloc2 to use all
available physical registers. It then solves the two problems above,
cyclic moves and stack-to-stack moves, with a two-stage approach:

- First, it finds a location to use to resolve cycles, if any exist. If
  a register is unallocated at the location of the move, we can use it.
  Often we get lucky and this is the case. Otherwise, we allocate a
  stackslot to use as the temp. This is perfectly fine at this stage,
  even if it means that we have more stack-to-stack moves.

- Then, it resolves stack-to-stack moves into stack-to-reg /
  reg-to-stack (see the sketch after this list). There are two subcases
  here. If there is *another*
  available free physical register, we opportunistically use it for this
  decomposition. If not, we fall back to our last-ditch option: we pick
  a victim register of the appropriate class, we allocate another
  temporary stackslot, we spill the victim to that slot just for this
  move, we do the move in the above way (stack-to-reg / reg-to-stack)
  with the victim, then we reload the victim. So one move (original
  stack-to-stack) becomes four moves, but no state is clobbered.
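
A sketch of the second stage's decomposition, with hypothetical types; moves are `(from, to)` pairs:

```
enum Loc {
    Reg(u8),
    Stack(u32),
}

fn lower_stack_to_stack(
    from: u32,
    to: u32,
    free_reg: Option<u8>, // a register unallocated at this point, if any
    victim: u8,           // last-ditch: a register of the right class
    temp_slot: u32,       // extra stackslot for saving the victim
    out: &mut Vec<(Loc, Loc)>,
) {
    match free_reg {
        Some(r) => {
            // Opportunistic case: stack-to-reg, then reg-to-stack.
            out.push((Loc::Stack(from), Loc::Reg(r)));
            out.push((Loc::Reg(r), Loc::Stack(to)));
        }
        None => {
            // One move becomes four, but no state is clobbered.
            out.push((Loc::Reg(victim), Loc::Stack(temp_slot))); // spill victim
            out.push((Loc::Stack(from), Loc::Reg(victim)));
            out.push((Loc::Reg(victim), Loc::Stack(to)));
            out.push((Loc::Stack(temp_slot), Loc::Reg(victim))); // reload victim
        }
    }
}
```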

This PR extends the `moves` fuzz-target to exercise this functionality
as well, randomly choosing for some spare registers to exist or not, and
randomly generating {stack,reg}-to-{stack,reg} moves in the initial
parallel-move input set. The target does a simple symbolic simulation of
the sequential move sequence and ensures that the final state is
equivalent to the parallel-move semantics.

I fuzzed both the `moves` target, focusing on the new logic, and the
`ion_checker` target, checking the whole register allocator; both seem
clean (~150M cases on the former, ~1M cases on the latter).
2022-05-23 10:48:37 -07:00
Chris Fallin
33611a68b9 Bump to version 0.1.3. (#50) 2022-05-16 22:44:35 -07:00
Chris Fallin
1379c65a6a Handle conflict-related liverange splits arising from stack constraints without falling back to spill bundle. (#49)
Currently, we unconditionally trim the ends of liveranges around a split
when we do a split, including splits due to conflicts in a
liverange/bundle's requirements (e.g., a liverange with both a register
and a stack use). These trimmed ends, if they exist, go to the spill
bundle, and the spill bundle may receive a register during second-chance
allocation or otherwise will receive a stack slot.

This was previously measured to reduce contention significantly, because
it reduces the sizes of liveranges that participate in the first-chance
competition for allocations. When a split has to occur, we might as well
relegate the "connecting pieces" to a process that comes later, with a
hint to try to get the right register if possible but no hard connection
to either end.

However, in the case of a split arising from a reg-to-stack /
stack-to-reg conflict, as happens when references are used or def'd as
registers and then cross safepoints, this extra step in the connectivity
(normal LR with register use, then spill bundle, then normal LR with
stack use) can lead to extra moves. Additionally, when one of the LRs
has a stack constraint, contention is far less important; so it doesn't
hurt to skip the trimming step. In fact, it's likely much better to put
the "connecting piece" together with the stack side of the conflict.

Ideally we would handle this with the same move-cost logic we use for
conflicts detected during backtracking, but the requirements-related
splitting happens separately and that logic would need to be generalized
further. For now, this is sufficient to eliminate redundant moves as
seen in e.g. bytecodealliance/wasmtime#3785.
2022-05-16 22:36:51 -07:00
Chris Fallin
9b83635980 Bump version to 0.1.2. (#44)
This will get #43 into a release to allow us to use the checker on
Cranelift outputs.
2022-04-18 13:20:07 -07:00
Chris Fallin
a5c48fda8a Support program moves, including pinned vregs, in the checker. (#43)
The checker was built to validate programs produced by the fuzzing
testcase generator, which was built before regalloc2 supported special
handling of moves. (In a pure-SSA world, move elision is not needed,
because moves are not needed, and blockparams are the only way of
tying together vregs.)

Due to this, the checker works great for our independent regalloc2
fuzzing setup, but when used on regalloc inputs produced by Cranelift,
it cannot prove correctness.

This PR extends the checker's analysis to properly handle "program
moves", which are distinct from regalloc-inserted moves in that they
are present in the original program and hence are semantically
relevant. A program move edits all sets of symbolic vregs at all
allocs, and where the source vreg appears, it inserts the dest vreg as
well. (It also removes the dest vreg from all other sets, since the
old value becomes stale, as is done for other defs.)
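
A sketch of that update for a program move `dst := src`, with the checker state simplified to one set of symbolic vregs per allocation:

```
use std::collections::HashSet;

fn apply_program_move(alloc_sets: &mut [HashSet<u32>], src: u32, dst: u32) {
    for set in alloc_sets.iter_mut() {
        if set.contains(&src) {
            // Wherever the source vreg appears, the dest vreg now
            // holds the same value.
            set.insert(dst);
        } else {
            // Everywhere else, any old value of `dst` is stale,
            // as with any other def.
            set.remove(&dst);
        }
    }
}
```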

Given this, and given some additional checking for moves to/from
pinned vregs, the checker can now be used to fully validate
Cranelift-sourced regalloc2 invocations.
2022-04-18 10:36:26 -07:00
Chris Fallin
f307ed170c Bump version to 0.1.1. (#41) 2022-04-13 10:37:43 -07:00
Chris Fallin
4cac1614bf Add serde support for exposed types. (#40)
This adds derived `Serialize` and `Deserialize` implementations for
exposed types that describe registers, operands, and related program
inputs; entity indices; and regalloc output types. This allows
serialization of any of the embedder's IR data types that may embed or
build upon regalloc2 types.

These implementations (and the dependency on the `serde` crate itself)
are enabled only when the non-default `enable-serde` feature is
specified.
2022-04-13 10:14:00 -07:00
Chris Fallin
94cd6c421c Bump version to 0.1.0 for release. (#39) 2022-04-04 16:37:01 -07:00
Chris Fallin
0cb08095e6 Make multiple defs of one vreg possible on one instruction. (#38)
Currently, if this is done, overlapping liveranges are created, and we
hit an assert that ensures our non-overlapping and
built-in-reverse-order invariants during liverange construction.

One could argue that multiple defs of a single vreg don't make a ton of
sense -- which def's value is valid after the instruction? -- but if
they all get the same alloc, then the answer is "whatever was put in
that alloc", and this is just a case of an instruction being a bit
over-eager when listing its registers.

This can arise in practice when operand lists come from combinations or
concatenations: for example, in Cranelift's s390x backend, there is a
"Loop" pseudo-instruction, and the operands of the Loop are the operands
of all the sub-instructions.

It seems more logically cohesive overall to say that one can state an
operand as many times as one likes; so this PR makes it so.
2022-04-04 16:22:05 -07:00
Chris Fallin
a369150213 Make some improvements to clarity of checker implementation. (#37)
This PR makes two changes, both suggested by @fitzgen in #36:

1. It updates the top-level description of the analysis to more
   simply and accurately describe the analysis lattice.

2. It modifies both the `CheckerValue` and `CheckerState` types to be
   enums with separate arms for the top/universe value, and adds helpers
   as appropriate to update the values. There should be no functional
   change; this update just makes the meet-functions and updates more
   clear, and makes a bad state ("top" but with values) unrepresentable.

Closes #36.
2022-03-30 11:23:56 -07:00
Chris Fallin
ad41f8a7a5 Record vreg classes explicitly during liverange pass. (#35)
This resolves an issue seen when the source program uses multiple
regclasses (Int and Float): in some cases, the logic that grabs the
vregs and retains them (with class) in `vreg_regs` missed a register and
we had a class mismatch. This occurred because data structures were
initialized assuming `Int` regclass at first.

This PR instead removes the `vreg_regs` array, stores the class
explicitly as an `Option<RegClass>` in the `VRegData`, and provides a
`Env::vreg()` method that reconstitutes a `VReg` given its index and its
observed class. We "observe" the class of every vreg seen during the
liveness pass (and we assert that every occurrence of the vreg index has
the same class). In this way, we still have a single source-of-truth for
the vreg class (the mention of the vreg itself) and we explicitly
represent the "not observed yet" state (and panic on attempting to use
such a vreg) rather than implicitly taking the wrong class.
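
A sketch of that observe-and-reconstitute scheme, with simplified types:

```
#[derive(Clone, Copy, PartialEq)]
enum RegClass { Int, Float }

struct VRegData {
    class: Option<RegClass>, // None until observed in the liveness pass
}

struct Env { vregs: Vec<VRegData> }

impl Env {
    // Record the class at every occurrence of the vreg, asserting that
    // all occurrences agree.
    fn observe_vreg_class(&mut self, index: usize, class: RegClass) {
        let slot = &mut self.vregs[index].class;
        debug_assert!(slot.map_or(true, |c| c == class));
        *slot = Some(class);
    }

    // Reconstitute (index, class); panics if the vreg was never observed.
    fn vreg(&self, index: usize) -> (usize, RegClass) {
        (index, self.vregs[index].class.expect("vreg class not observed"))
    }
}
```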
2022-03-29 14:00:14 -07:00
Chris Fallin
433e8b3776 Early defs reserve a register for whole instruction. (#32)
The `Operand` abstraction allows a def to be positioned at the "early"
point of an instruction, before its effect and alongside its normal
uses. This is intended to allow the embedder to express that a def may
be written before all uses are read, so it should not conflict with the
uses.

It's also convenient to use early defs to express temporaries, which
should be available throughout a regalloc-level instruction's emitted
sequence. In such a case, the register should not be used again after
the instruction, so it is dead following the instruction.

Strictly speaking, and according to regalloc2 prior to this PR, the
temp will *only* conflict with the uses at the early-point, and not the
defs at the late-point (after the instruction), because it's dead past
its point of definition. But for a temp we really want it to register
conflicts not just with the normal uses but with the normal defs as
well.

This PR changes the semantics so that an early def builds a liverange
that spans the early- and late-point of an instruction when the vreg is
dead flowing down from the instruction, giving the semantics we want for
temps.
2022-03-18 10:32:49 -07:00
Chris Fallin
4f1161d9e4 Generalize debug-info support a bit. (#34)
* Generalize debug-info support a bit.

Previously, debug value-label support required each vreg to have a
disjoint sequence of instruction ranges, each with one label.

Unfortunately, it's entirely possible for multiple values at the program
level to map to one vreg at the IR level, leading to multiple labels.

This PR generalizes the debug-info generation support to allow for
arbitrary (label, range, vreg) tuples, as long as they are sorted by
vreg, with no other requirements. The lookup is a little more costly
when we generate the debuginfo, but we shouldn't have more than a *few*
debug value labels per vreg, so in practice the constants should be
small.

* Typo fix from Amanieu

Co-authored-by: Amanieu d'Antras <amanieu@gmail.com>

2022-03-18 10:32:27 -07:00
Chris Fallin
00dc692489 Allow for reused inputs when the reused vreg is also used as other (normal) uses. (#33)
The "reused input" operand constraint allows for an instruction to have
a def-operand whose allocation is constrained to reuse the same
allocation as one of the uses.

This is useful to express constraints needed for some instruction sets,
like x86, where at the ISA level, one register serves both as an input
and the output.

Unfortunately the way that we lower the constraints to liveranges does
not work if we have the same vreg used both for the reused input and
another input -- it results in impossible-to-solve constraints. For
example, the instruction

```
    alu_op v42 use, v42 use, v43 def reuse(0)
```

would result in an impossible allocation.

This fixes liverange construction to properly handle all uses of the
vreg whose operand is reused, rather than just the one reused operand.
2022-03-18 10:14:27 -07:00