Commit Graph

31 Commits

Nick Fitzgerald
b41b1f9a3c Use maximum inline capacity available for SmallVec<VRegIndex> in SpillSet (#100)
* Use maximum inline capacity available for `SmallVec<VRegIndex>` in `SpillSet`

We were using 2, which is the maximum for 32-bit architectures, but on 64-bit
architectures we can get 4 inline elements without growing the size of the
`SmallVec`.

This is a statistically significant speed-up, but it is so small that our
formatting of floats truncates it (i.e., less than 1%).

```
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 3360297.85 ± 40136.18 (confidence = 99%)

  more-inline-capacity.so is 1.00x to 1.00x faster than main.so!

  [945563401 945906690.73 946043245] main.so
  [942192473 942546392.88 942729104] more-inline-capacity.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1780540.13 ± 39362.84 (confidence = 99%)

  more-inline-capacity.so is 1.00x to 1.00x faster than main.so!

  [1544990595 1545359408.41 1545626251] main.so
  [1543269057 1543578868.28 1543851201] more-inline-capacity.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 36577153.54 ± 243753.54 (confidence = 99%)

  more-inline-capacity.so is 1.00x to 1.00x faster than main.so!

  [33956158997 33957780594.50 33959538220] main.so
  [33919762415 33921203440.96 33923023358] more-inline-capacity.so
```

* Use a `const fn` to calculate the number of inline elements
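
A minimal sketch of the `const fn` approach, assuming the `smallvec` crate's
usual layout (one `usize` of capacity plus a union of the inline array and a
pointer/length pair); names here are illustrative, not the crate's actual code:

```rust
use smallvec::SmallVec;

type VRegIndex = u32; // 4 bytes per element

/// Largest inline capacity that fits in the two words a spilled
/// `SmallVec` occupies anyway, so inline storage comes at no size cost.
const fn inline_capacity(elem_size: usize) -> usize {
    let free = 2 * core::mem::size_of::<usize>();
    if free / elem_size < 2 {
        2
    } else {
        free / elem_size
    }
}

// 2 inline elements on 32-bit targets, 4 on 64-bit.
type VRegList = SmallVec<[VRegIndex; inline_capacity(core::mem::size_of::<VRegIndex>())]>;
```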
2022-11-02 12:16:22 -07:00
Jamey Sharp
eb259e8aba Some small perf improvements (#95)
* Do conflict-set hash lookups once, not twice

This makes the small wasmtime bz2 benchmark 1% faster, per Hyperfine and
Sightglass. The effect disappears into the noise on larger benchmarks.
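
The single-lookup pattern, sketched with illustrative types: `HashSet::insert`
already reports whether the element was newly added, so a separate `contains`
check hashes the key a second time for nothing.

```rust
use std::collections::HashSet;

/// Record `bundle` in the conflict set, reporting whether it was already
/// there. `insert` returns `false` for a pre-existing element, so the
/// separate `contains` lookup (and its second hash of the key) goes away.
fn note_conflict(conflict_set: &mut HashSet<u32>, bundle: u32) -> bool {
    !conflict_set.insert(bundle)
}
```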

* Inline PosWithPrio::key

When compiling the pulldown-cmark benchmark from Sightglass, this is the
single most frequently called function: it's invoked 2.5 million times.
Inlining it reduces instructions retired by 1.5% on that benchmark,
according to `valgrind --tool=callgrind`.

This patch is "1.01 ± 0.01 times faster" according to Hyperfine for the
bz2, pulldown-cmark, and spidermonkey benchmarks from Sightglass.
Sightglass, in turn, agrees that all three benchmarks are 1.01x faster
by instructions retired, and the first two are around 1.01x faster by
CPU cycles as well.

* Inline and simplify AdaptiveMap::expand

Previously, `get_or_insert` would iterate over the keys to find one that
matched; then, if none did, iterate over the values to check if any are
0; then iterate again to remove all zero values and compact the map.

This commit instead focuses on picking an index to use: preferably one
where the key already exists; but if it's not in the map, then an unused
index; but if there aren't any, then an index where the value is zero.

As a result this iterates the two arrays at most once each, and both
iterations can stop early.

The downside is that keys whose value is zero are not removed as
aggressively. It might be worth pruning such keys in `IndexSet::set`.

Also:

- `#[inline]` both implementations of `Iterator::next`
- Replace `set_bits` with using the `SetBitsIter` constructor directly

These changes together reduce instructions retired when compiling the
pulldown-cmark benchmark by 0.9%.
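
A sketch of the single-pass index-picking order for `get_or_insert` (the map's
real representation differs; this shows only the search strategy, with a zero
value marking a dead entry):

```rust
/// Pick the slot to use for `key` in a small inline map. Each array is
/// scanned at most once, and both scans can stop early. The caller writes
/// `keys[i] = key` for a freshly chosen slot.
fn pick_index(keys: &[u32], values: &[u32], len: &mut usize, key: u32) -> Option<usize> {
    // 1. Preferably, a slot where the key already exists.
    if let Some(i) = keys[..*len].iter().position(|&k| k == key) {
        return Some(i);
    }
    // 2. Otherwise, a never-used slot, if any remain.
    if *len < keys.len() {
        *len += 1;
        return Some(*len - 1);
    }
    // 3. Last resort: reuse a slot whose value has dropped to zero.
    values.iter().position(|&v| v == 0)
}
```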
2022-10-11 08:23:02 -07:00
Chris Fallin
1efaa73943 Modify a SmallVec inline size for UseList to be slightly larger. (#93)
This PR updates the `UseList` type alias to a `SmallVec` with 4
`Use`s (which are 4 bytes each) rather than 2, because we get 16 bytes
of space "for free" in a `SmallVec` on a 64-bit machine.

This PR improves the compilation performance of Cranelift by 1% on
SpiderMonkey.wasm (measured on a Linux desktop with pinned CPU
frequency, and pinned to one core).
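
The size reasoning can be checked directly; a minimal sketch (a 4-byte
stand-in for `Use`; the equality assumes `smallvec`'s usual layout on a
64-bit target):

```rust
use smallvec::SmallVec;

#[derive(Clone, Copy)]
struct Use(u32); // 4 bytes, as in the commit message

fn main() {
    // One word of capacity plus a two-word union already leaves room for
    // four 4-byte elements inline on a 64-bit machine.
    #[cfg(target_pointer_width = "64")]
    assert_eq!(
        std::mem::size_of::<SmallVec<[Use; 4]>>(),
        std::mem::size_of::<SmallVec<[Use; 2]>>()
    );
}
```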

It's worth noting also that before making these changes, I explored
whether it would be possible to put the lists of uses and liveranges
in single large backing `Vec`s; the basic reason why we can't do this
is that during liverange construction, we append to many lists
concurrently. One could use a linked-list arrangement, and in fact RA2
did this early in its development; the separate `SmallVec`s were
better for performance overall because of the cache-locality wins when we
traverse the lists many times. It may still be worth investigating the use
of an arena to allocate the vecs rather than the default heap allocator.
2022-10-07 13:37:27 -07:00
Amanieu d'Antras
227a9fde91 Cache HashSet in try_to_allocate_bundle_to_reg (#90)
Keep `conflict_set` allocated in `Env` instead of allocating a new one
on every call. This improves register allocation performance by about
2%.
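
A sketch of the caching pattern (the surrounding details are illustrative):
the set lives in the long-lived `Env` and is cleared per call, so its backing
storage is reused rather than reallocated.

```rust
use std::collections::HashSet;

struct Env {
    /// Scratch set retained across calls; `clear` keeps its capacity.
    conflict_set: HashSet<u32>,
}

impl Env {
    fn try_to_allocate_bundle_to_reg(&mut self, conflicting: &[u32]) {
        self.conflict_set.clear(); // drop contents, keep the allocation
        for &bundle in conflicting {
            self.conflict_set.insert(bundle);
        }
        // ... probe candidate allocations against self.conflict_set ...
    }
}
```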
2022-09-26 16:14:43 -07:00
Chris Fallin
1b38a71e38 Some fixes to allow for call instructions to name args, returns, and clobbers with constraints. (#74)
* Some fixes to allow for call instructions to name args, returns, and clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in #53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix #53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Reworks the fixups for the def case mentioned above to not shift the
  def to the Early point. Doing so causes issues when the def is to a
  reffy vreg: it can then be falsely included in a stackmap if the
  instruction containing this operand is a safepoint.

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  #59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.

* Fix fuzzbug: do not merge when a liverange has a fixed-reg def.

This can create impossible situations: e.g., if a vreg is constrained
to p0 as a late-def, and another, completely different vreg is
constrained to p0 as an early-use on the same instruction, and the
instruction also has a third vreg (early-use), we do not want to merge
the liverange for the third vreg with the first, because it would
result in an unsolvable conflict for p0 at the early point.
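
The problematic shape, sketched with regalloc2's operand constructors
(assuming the public API of this era; register numbers are illustrative):
three operands on one instruction, two of them pinning different vregs to
the same preg at different program points.

```rust
use regalloc2::{Operand, PReg, RegClass, VReg};

fn operands_with_fixed_conflict() -> Vec<Operand> {
    let p0 = PReg::new(0, RegClass::Int);
    let v0 = VReg::new(0, RegClass::Int);
    let v1 = VReg::new(1, RegClass::Int);
    let v2 = VReg::new(2, RegClass::Int);
    vec![
        Operand::reg_fixed_use(v0, p0), // v0 must be in p0 at the early point
        Operand::reg_fixed_def(v1, p0), // v1 must be in p0 at the late point
        Operand::reg_use(v2),           // merging v2's range into v1's would
                                        // demand p0 at the early point, too
    ]
}
```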

* Review comments.
2022-09-20 15:58:20 -07:00
Amanieu d'Antras
906a053208 Remove register class from SpillSlot (#80)
* Remove register class from `SpillSlot`

The register allocator was already allowing moves between spillslots and
registers of different classes, so this PR formalizes this by making
spillslots independent of register class.

This also fixes #79 by properly tracking the register class of an
`InsertedMove` via its `to_vreg` field, which turns out never to
be `None` in practice. Removing the `Option` allows the register
class of the `VReg` to be used when building the per-class move lists.

Fixes #79

* Address review feedback
2022-09-20 14:05:23 -07:00
Chris Fallin
4eb2a2528b Limit split count per original bundle with fallback 1-to-N split. (#59)
* Limit split count per original bundle with fallback 1-to-N split.

Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.

This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.

This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.

This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.

Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).
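
The policy, sketched as control flow (the threshold is from this PR;
everything else is illustrative):

```rust
/// Per-spillset (i.e., per-original-bundle) split budget.
const MAX_SPLITS: u32 = 10;

enum SplitPlan {
    /// Normal case: split into two halves at the chosen point.
    TwoWay,
    /// Fallback: split directly into minimal per-use bundles, sending
    /// use-free regions to the spill bundle -- O(|bundle|) work once,
    /// rather than O(|bundle|) per split (quadratic overall).
    Minimal,
}

fn choose_split(splits_so_far: &mut u32) -> SplitPlan {
    if *splits_so_far >= MAX_SPLITS {
        SplitPlan::Minimal
    } else {
        *splits_so_far += 1;
        SplitPlan::TwoWay
    }
}
```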

* Review comments.
2022-06-27 13:23:09 -07:00
Chris Fallin
427e041f1c Fix spillslot allocation to actually reuse spillslots. (#56)
* Fix spillslot allocation to actually reuse spillslots.

The old logic, which did some linked-list rearranging to try to probe
more-likely-to-be-free slots first and which was inherited straight from
the original IonMonkey allocator, was slightly broken (error in
translation and not in IonMonkey, to be clear): it did not get the
list-splicing right, so quite often dropped a slot on the floor and
failed to consider it for further reuse.

On some experimentation, it seems to work just as well to keep a
SmallVec of spillslot indices per size class instead, and save the last
probe-point in order to spread load throughout the allocated slots while
limiting the number of probes (to bound quadratic behavior).
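
A sketch of the replacement scheme (types and the probe bound are
illustrative): a vector of slot indices per size class plus a saved probe
point that rotates through them.

```rust
use smallvec::SmallVec;

struct SizeClass {
    slots: SmallVec<[u32; 8]>, // spillslot indices of this size class
    probe_start: usize,        // resume point from the previous search
}

impl SizeClass {
    fn find_reusable(&mut self, is_free: impl Fn(u32) -> bool, max_probes: usize) -> Option<u32> {
        let n = self.slots.len();
        for i in 0..n.min(max_probes) {
            let j = (self.probe_start + i) % n;
            if is_free(self.slots[j]) {
                self.probe_start = j + 1; // spread load across the slots
                return Some(self.slots[j]);
            }
        }
        None // caller allocates a fresh slot and records it in `slots`
    }
}
```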

This change reduces the maximum slot count from 285 to 92 in `python.wasm`
from bytecodealliance/wasmtime#4214, and the maximum frame size from
2384 bytes to 752 bytes.
2022-06-03 16:01:10 -07:00
Chris Fallin
52818a7ed6 Handle conflicting Before and After fixed-reg constraints with a copy. (#54)
* Extend fuzzer to generate cases like #53.

Currently, the fuzz testcase generator will add at most one
fixed-register constraint to an instruction per physical register. This
avoids impossible situations, such as specifying that both `v0` and `v1`
must be placed into the same `p0`.

However, it *should* be possible to say that `v0` is in `p0` before the
instruction, and `v1` is in `p0` after the instruction (i.e., at `Early`
and `Late` operand positions).

This in fact exposes a limitation in the current allocator design: when
`v0` is live downward, with the above constraints, it will result in an
impossible allocation situation because we cannot split in the middle of
an instruction. A subsequent fix will rectify this by using the
multi-fixed-reg fixup mechanism.

* Handle conflicting Before and After fixed-reg constraints with a copy.

This fixes #53. Previously, if two operands on an instruction
specified *different* vregs constrained to the same physical register
at the Before (Early) and After (Late) points of the instruction, and
the Before was live downward as well, we would panic: we can't insert
a move into the middle of an instruction, so putting the first vreg in
the preg at Early implies we have an unsolvable conflict at Late.

We can solve this issue by adding some new logic to insert a copy, and
rewrite the constraint. This reuses the multi-fixed-reg-constraint
fixup logic. While that logic handles the case where the *same* vreg
has multiple *different* fixed-reg constraints, this new logic
handles *different* vregs with the *same* fixed-reg constraints, but
at different *program points*; so the two are complementary.

This addresses the specific test case in #53, and also fuzzes cleanly
with the change to the fuzz testcase generator to generate these
cases (which also immediately found the bug).

* Add a reservation to the PReg when rewriting constraint so it is not doubly-allocated.

* Distinguish initial fixup moves from secondary moves.

* Use `trace` macro, not `log::trace`, to avoid trace output when feature is disabled.

* Rework operand rewriting to properly handle bundle-merging edge case.

When the liverange for the defined vreg with fixed constraint at Late is
*merged* with the liverange for the used vreg with fixed constraint at
Early, the strategy of putting a fixed reservation on the preg at Early
fails, because the whole bundle is minimal (if it spans just the
instruction's Early and Late and nothing else). This could happen if
e.g. the def flows into a blockparam arg that merges with a blockparam
defining the used value.

Instead we move the def one halfstep earlier, to the Early point, with
its fixed-reg constraint still in place. This has the same effect but
works when the two are merged.

* Fix checker issue: make more flexible in the presence of victim-register saves.
2022-05-31 14:01:27 -07:00
Chris Fallin
869c21e79c Remove an explicitly-set-aside scratch register per class. (#51)
Currently, regalloc2 sets aside one register per class, unconditionally,
to make move resolution possible. To solve the "parallel moves problem",
we sometimes need to conjure a cyclic permutation of data among
registers or stack slots (this can result, for example, from blockparam
flow that swaps two values on a loop backedge). This set-aside scratch
register is used when a cycle exists.

regalloc2 also uses the scratch register when needed to break down a
stack-to-stack move (which could happen due to blockparam moves on edges
when source and destination are both spilled) into a stack-to-reg move
followed by reg-to-stack, because most machines have loads and stores
but not memory-to-memory moves.

A set-aside register is certainly the simplest solution, but it is not
optimal: it means that we have one fewer register available for use by
the program, and this can be costly especially on machines with fewer
registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some
registers may be set aside by our embedder for other purposes too. Every
register we can reclaim buys some nontrivial performance in large function
bodies!

This PR removes this restriction and allows regalloc2 to use all
available physical registers. It then solves the two problems above,
cyclic moves and stack-to-stack moves, with a two-stage approach:

- First, it finds a location to use to resolve cycles, if any exist. If
  a register is unallocated at the location of the move, we can use it.
  Often we get lucky and this is the case. Otherwise, we allocate a
  stackslot to use as the temp. This is perfectly fine at this stage,
  even if it means that we have more stack-to-stack moves.

- Then, it resolves stack-to-stack moves into stack-to-reg /
  reg-to-stack. There are two subcases here. If there is *another*
  available free physical register, we opportunistically use it for this
  decomposition. If not, we fall back to our last-ditch option: we pick
  a victim register of the appropriate class, we allocate another
  temporary stackslot, we spill the victim to that slot just for this
  move, we do the move in the above way (stack-to-reg / reg-to-stack)
  with the victim, then we reload the victim. So one move (original
  stack-to-stack) becomes four moves, but no state is clobbered.
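
The last-ditch decomposition from the second bullet, sketched with
illustrative types: one stack-to-stack move becomes four, and the victim's
value is preserved.

```rust
#[derive(Clone, Copy)]
enum Loc {
    Reg(u8),
    Stack(u32),
}

/// Decompose `Stack(from) -> Stack(to)` when no register is free, using
/// `victim` as a borrowed register and `temp_slot` as its save area.
fn stack_to_stack_via_victim(from: u32, to: u32, victim: u8, temp_slot: u32) -> [(Loc, Loc); 4] {
    [
        (Loc::Reg(victim), Loc::Stack(temp_slot)), // save the victim
        (Loc::Stack(from), Loc::Reg(victim)),      // stack -> reg
        (Loc::Reg(victim), Loc::Stack(to)),        // reg -> stack
        (Loc::Stack(temp_slot), Loc::Reg(victim)), // restore the victim
    ]
}
```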

This PR extends the `moves` fuzz-target to exercise this functionality
as well, randomly choosing for some spare registers to exist or not, and
randomly generating {stack,reg}-to-{stack,reg} moves in the initial
parallel-move input set. The target does a simple symbolic simulation of
the sequential move sequence and ensures that the final state is
equivalent to the parallel-move semantics.

I fuzzed both the `moves` target, focusing on the new logic; as well as
the `ion_checker` target, checking the whole register allocator, and
both seem clean (~150M cases on the former, ~1M cases on the latter).
2022-05-23 10:48:37 -07:00
Chris Fallin
4cac1614bf Add serde support for exposed types. (#40)
This adds derived `Serialize` and `Deserialize` implementations for
exposed types that describe registers, operands, and related program
inputs; entity indices; and regalloc output types. This allows
serialization of any of the embedder's IR data types that may embed or
build upon regalloc2 types.

These implementations (and the dependency on the `serde` crate itself)
are enabled only when the non-default `enable-serde` feature is
specified.
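
The gating pattern this describes, sketched on an illustrative type:

```rust
// Derives appear only when the embedder opts in to the non-default
// `enable-serde` feature, keeping `serde` out of default builds.
#[cfg_attr(feature = "enable-serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Clone, Copy, Debug)]
pub struct EntityIndex(u32); // stand-in for the exposed types
```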
2022-04-13 10:14:00 -07:00
Chris Fallin
ad41f8a7a5 Record vreg classes explicitly during liverange pass. (#35)
This resolves an issue seen when the source program uses multiple
regclasses (Int and Float): in some cases, the logic that grabs the
vregs and retains them (with class) in `vreg_regs` missed a register and
we had a class mismatch. This occurred because data structures were
initialized assuming `Int` regclass at first.

This PR instead removes the `vreg_regs` array, stores the class
explicitly as an `Option<RegClass>` in the `VRegData`, and provides a
`Env::vreg()` method that reconstitutes a `VReg` given its index and its
observed class. We "observe" the class of every vreg seen during the
liveness pass (and we assert that every occurrence of the vreg index has
the same class). In this way, we still have a single source-of-truth for
the vreg class (the mention of the vreg itself) and we explicitly
represent the "not observed yet" state (and panic on attempting to use
such a vreg) rather than implicitly taking the wrong class.
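
A sketch of the observe-and-reconstitute scheme (internal names follow the
commit message, but the details are illustrative):

```rust
use regalloc2::{RegClass, VReg};

#[derive(Default)]
struct VRegData {
    class: Option<RegClass>, // None until observed during liveness
}

struct Env {
    vregs: Vec<VRegData>,
}

impl Env {
    /// Called for every vreg mention during the liveness pass.
    fn observe_vreg_class(&mut self, vreg: VReg) {
        let slot = &mut self.vregs[vreg.vreg()].class;
        // Every occurrence of a given index must agree on its class.
        debug_assert!(slot.map_or(true, |c| c == vreg.class()));
        *slot = Some(vreg.class());
    }

    /// Reconstitute a `VReg` from its index and observed class; panics if
    /// the class was never observed, rather than assuming `Int`.
    fn vreg(&self, index: usize) -> VReg {
        VReg::new(index, self.vregs[index].class.expect("unobserved vreg"))
    }
}
```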
2022-03-29 14:00:14 -07:00
Chris Fallin
fe021ad6d4 Simplify pinned-vreg API: don't require slice of all pinned vregs. (#28)
Simplify pinned-vreg API: don't require slice of all pinned vregs.

Previously, we kept a bool flag `is_pinned` in the `VRegData`, and we
required a `&[VReg]` of all pinned vregs to be provided by
`Function::pinned_vregs()`. This was (I think) done for convenience, but
it turns out not to really be necessary, as we can just query
`is_pinned_vreg` where needed (and in the likely implementation, e.g. in
Cranelift, this will be a `< NUM_PINNED_VREGS` check that can be
inlined). This adds convenience for the embedder (the main benefit), and
also reduces complexity, removes some state, and avoids some work
initializing the regalloc state for a run.
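
The likely embedder-side shape alluded to above, sketched (the constant is
hypothetical):

```rust
/// Hypothetical embedder constant: pinned vregs occupy the low indices.
const NUM_PINNED_VREGS: usize = 192;

#[inline]
fn is_pinned_vreg(vreg_index: usize) -> bool {
    vreg_index < NUM_PINNED_VREGS
}
```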
2022-03-04 15:12:16 -08:00
Chris Fallin
14442df3fc Support for debug-labels. (#27)
Support for debug-labels.

If the client adds labels to vregs across ranges of instructions in the
input program, the regalloc will provide metadata in the `Output` that
describes the `Allocation`s in which each such vreg is stored for those
ranges. This allows the client to emit debug metadata telling a debugger
where to find program values at each point in the program.
2022-03-03 16:58:33 -08:00
Amanieu d'Antras
6b1a5e8b1b Address review feedback 2022-01-11 22:27:15 +00:00
Amanieu d'Antras
2d9d5dd82b Rearrange some struct fields to work better with u64_key/u128_key
This allows the compiler to load the whole key with 1 or 2 64-bit
accesses, assuming little-endian ordering.

Improves instruction count by ~1%.
2022-01-11 13:24:51 +00:00
Amanieu d'Antras
d95a9d9399 Combine sort keys into u64/u128
This allows the compiler to perform branch-less comparisons, which are
more efficient.

This results in ~5% fewer instructions executed.
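
The packing idea, sketched (field meanings are illustrative): combine the
sort fields into one integer with the most significant field in the high
bits, so `<` on the packed key matches the lexicographic order and compiles
to a single branchless comparison.

```rust
#[inline]
fn u64_key(primary: u32, secondary: u32) -> u64 {
    ((primary as u64) << 32) | (secondary as u64)
}

#[inline]
fn u128_key(a: u32, b: u32, c: u32, d: u32) -> u128 {
    ((a as u128) << 96) | ((b as u128) << 64) | ((c as u128) << 32) | (d as u128)
}
```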
2022-01-11 13:03:21 +00:00
Amanieu d'Antras
053375f049 Remove PRegData::reg and use PReg::from_index instead
Performance impact is negligible but this is a good cleanup.
2022-01-11 13:02:08 +00:00
Amanieu d'Antras
51493ab03a Apply review feedback 2021-12-12 00:33:30 +00:00
Amanieu d'Antras
8f435243e0 Properly handle fixed stack slots during multi-fixed-reg fixup 2021-12-11 22:39:14 +00:00
Amanieu d'Antras
77e6a9e0d7 Add support for fixed stack slots
This works by allowing a `PReg` to be marked as a stack location
instead of a physical register.
2021-12-11 22:31:58 +00:00
Chris Fallin
ef6c8f3226 Fix fuzzbug: add checker metadata for new vreg on multi-fixed-reg fixup move.
When an instruction uses the same vreg constrained to multiple different
fixed registers, the allocator converts all but one of the fixed
constraints to `Any` and then records a special fixup move that copies
the value to the other fixed registers just before the instruction. This
allows the allocator to maintain the invariant that a value lives in
only one place at a time throughout most of its logic, and constrains
the complexity-fallout of this corner case to just a special last-minute
edit.

Unfortunately some recent CPU time thrown at the fuzzer has uncovered
a subtle interaction with the redundant move eliminator that confuses
the checker.

Specifically, when the correct value is *already* in the second
constrained fixed reg, because of an unrelated other move (e.g. because
of a blockparam or other vreg moved from the original), the redundant
move eliminator can delete the fixup move without telling the checker
that it has done so.

Such an optimization is perfectly valid, and the generated code is
correct; but the checker thinks that some other vreg (the one that was
copied from the original) is in the second preg, and panics.

The fix is to use the mechanism that indicates "this move defines a new
vreg" (emitting a `defalloc` checker-instruction) to force the checker
to understand that after the fixup move, the given preg actually
contains the appropriate vreg.
2021-12-04 23:30:30 -08:00
Chris Fallin
c53fbb4a5c Fix fuzzbug related to bundle priority ordering.
Changes in computation of bundle priorities during review of the initial
PR introduced a possible mis-ordering of priorities: inner-loop bundle
use weights could exceed the weights of 1_000_000 and 2_000_000 used for
minimal bundles without and with fixed uses (respectively). These two
kinds of minimal bundle are meant to be the highest-priority bundles,
evicting any other bundle they need to, because they can't be split
further. This PR introduces two special bundle weights for these two
kinds of bundles, and clamps all other bundle weights to just below
them.

Thanks to @Amanieu for reporting the issue! Fixes #19.
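
The fix, sketched (the two reserved weights are from the commit message; the
clamp is the essential part):

```rust
/// Reserved priorities for bundles that cannot be split further.
const MINIMAL_BUNDLE_WEIGHT: u32 = 1_000_000;
const MINIMAL_FIXED_BUNDLE_WEIGHT: u32 = 2_000_000;

/// Ordinary bundles (even hot inner-loop ones) must never outrank minimal
/// bundles, which must be able to evict anything.
fn clamp_ordinary_weight(computed: u32) -> u32 {
    computed.min(MINIMAL_BUNDLE_WEIGHT - 1)
}
```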
2021-11-30 15:36:12 -08:00
Chris Fallin
c7bc6c941c Merge pull request #15 from cfallin/relicensing
Relicense fully to Apache-2.0 WITH LLVM-exception.
2021-11-18 12:40:54 -08:00
Amanieu d'Antras
a516e6d6f3 Return safepoint_slots as Allocations instead of SpillSlots
This enables us to support reftype vregs in register locations in the
future.
2021-11-16 00:47:43 +00:00
Amanieu d'Antras
a527a6d25a Remove unused clobbers vector 2021-11-16 00:46:05 +00:00
Chris Fallin
cf0d515709 Relicense fully to Apache-2.0 WITH LLVM-exception.
Large parts of the code in regalloc2 are currently licensed under the
Mozilla Public License (MPL) 2.0, because they derive in meaningful
ways from the register allocator in IonMonkey, which is part of
Firefox. The relevant source files are marked as such, with references
to the files in the Firefox source tree.

The intent of the regalloc2 project was to port the register allocator
from Firefox to use in Cranelift, borrowing good technology and
improving on it in the spirit of open source.

However, several use-cases of Cranelift require, or at least strongly
prefer, the Apache-2.0 license with the LLVM exception (matching the
license of Cranelift itself, and Bytecode Alliance projects
generally). While using this license is not strictly necessary for
regalloc2 to be usable (the MPL is an excellent open-source license!),
relicensing fully under this license to harmonize with the rest of
Cranelift and Bytecode Alliance codebases significantly widens
possibilities and reduces friction; then regalloc2 is "just another
part of Cranelift" and doesn't have to be treated specially.

The source in `src/ion/` specifically began as a fairly direct port of
the algorithms in the following files in the `mozilla-central`
repository (Firefox codebase):

* The bulk of the "backtracking allocator" algorithm:
  * `js/src/jit/BacktrackingAllocator.{cpp,h}`
* Helpers and definitions in the surrounding infrastructure:
  * `js/src/jit/RegisterAllocator.h`
  * `js/src/jit/RegisterAllocator.cpp`
  * `js/src/jit/StackSlotAllocator.h`
  * `js/src/jit/LIR.h`
* A few data structure implementations:
  * `js/src/ds/SplayTree.h`
  * `js/src/ds/PriorityQueue.h`

Subsequent work in improving regalloc2 has caused it to drift from the
direct port -- for example, it no longer uses splay trees or the
direct port of the priority queue above -- but it is of course very
clearly still a derivative work.

Analysis of the contributors to these files indicates that we need
signoff from the following folks:

* Mozilla Corp, for contributions made by Mozilla employees (the
  majority of the code). Communications with Mozilla (thanks
  @tschneidereit and @bholley for doing the work here!) indicate that
  @ekr is able to sign off when ready here.

* Andy Wingo, specifically for the work done in [Bug
  1620197](https://bugzilla.mozilla.org/show_bug.cgi?id=1620197) and
  [Bug 1609057](https://bugzilla.mozilla.org/show_bug.cgi?id=1609057) to
  generalize the stack allocator for a Wasm feature (multiple returns).

Additionally, since the initial port, we have had three contributions
from @Amanieu:
[#9](https://github.com/bytecodealliance/regalloc2/pull/9),
[#11](https://github.com/bytecodealliance/regalloc2/pull/11),
[#13](https://github.com/bytecodealliance/regalloc2/pull/13).

So, if everyone applicable is happy with this relicensing, this PR
removes the MPL-2.0 license in `src/ion/` and marks all files as
covered under `Apache-2.0 WITH LLVM-exception`. Please let us know if
this is OK!

Signoffs:

- [ ] @ekr, for Mozilla's contributions
- [ ] @wingo, for contributions to original code in `mozilla-central`
- [ ] @Amanieu, for the three PRs linked above

Thanks!
2021-11-10 10:54:28 -08:00
Chris Fallin
b19fa4857f Rename operand positions to Early and Late, and make weights f16/f32 values. 2021-08-31 17:31:23 -07:00
Chris Fallin
3a18564e98 Addressed more review comments. 2021-08-30 17:51:55 -07:00
Chris Fallin
f27abc9c48 Remove infinite-loop check: it is not a high enough bound in some pathological cases (e.g., gc::many_live_refs test in wasmtime), and it has served its purpose in testing. We can rely on more detailed assertions, e.g. that splits actually shrink bundles and that bundles evict only lower-priority bundles, instead. 2021-06-22 12:06:12 -07:00
Chris Fallin
b36a563d69 Cleanup: split allocator implementation into 11 files of more reasonable size. 2021-06-18 16:51:41 -07:00