regalloc2

Author	SHA1	Message	Date
Chris Fallin	9733cb2227	Clobbers: use a more efficient bitmask representation in API. (#58) * Clobbers: use a more efficient bitmask representation in API. Currently, the `Function` trait requires a `&[PReg]` for the clobber-list for a given instruction. In most cases where clobbers are used, the list may be long: e.g., ABIs specify a fixed set of registers that are clobbered and there may be ~half of all registers in this list. What's more, the list can't be shared for e.g. all calls of a given ABI, because actual return-values (defs) can't be clobbers. So we need to allocate space for long, sometimes-slightly-different lists; this is inefficient for the embedder and for us. It's much more efficient to use a bitmask to represent a set of physical registers. With current data structure bitpacking limitations, we can support at most 128 physical registers; this means we can use a `u128` bitmask. This also allows e.g. an embedder to start with a constant for a given ABI, and mask out bits for actual return-value registers on call instructions. This PR makes that change, for minor but positive performance impact. * Review comments.	2022-06-27 12:27:19 -07:00
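A minimal sketch of the bitmask idea described in the commit above, assuming at most 128 physical registers. The names `PRegIndex` and `ClobberMask` are hypothetical and are not taken from the regalloc2 API; the sketch only illustrates starting from a per-ABI constant and masking out return-value registers.

```rust
/// Hypothetical stand-in for a physical register index in 0..128.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PRegIndex(pub u8);

/// Hypothetical register set: bit `i` is set when register `i` is in the set.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct ClobberMask(pub u128);

impl ClobberMask {
    /// The empty set.
    pub const fn empty() -> Self {
        ClobberMask(0)
    }

    /// Returns a copy of the set with `r` added (assumes `r.0 < 128`).
    pub const fn with(self, r: PRegIndex) -> Self {
        ClobberMask(self.0 | (1u128 << (r.0 as u32)))
    }

    /// Returns a copy of the set with `r` removed, e.g. a call's return-value register.
    pub const fn without(self, r: PRegIndex) -> Self {
        ClobberMask(self.0 & !(1u128 << (r.0 as u32)))
    }

    /// Tests membership with a single shift and mask.
    pub const fn contains(self, r: PRegIndex) -> bool {
        (self.0 >> (r.0 as u32)) & 1 == 1
    }

    /// Iterates over members in increasing index order.
    pub fn iter(self) -> impl Iterator<Item = PRegIndex> {
        let mut bits = self.0;
        std::iter::from_fn(move || {
            if bits == 0 {
                return None;
            }
            let idx = bits.trailing_zeros() as u8;
            bits &= bits - 1; // clear the lowest set bit
            Some(PRegIndex(idx))
        })
    }
}

/// Hypothetical per-ABI constant: registers 0, 1 and 2 are caller-saved.
pub const ABI_CLOBBERS: ClobberMask = ClobberMask::empty()
    .with(PRegIndex(0))
    .with(PRegIndex(1))
    .with(PRegIndex(2));
```

For a call that returns its result in register 0, an embedder could use `ABI_CLOBBERS.without(PRegIndex(0))`. The value is a plain `Copy` integer, so no per-instruction list has to be allocated.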
Chris Fallin	52818a7ed6	Handle conflicting Before and After fixed-reg constraints with a copy. (#54) * Extend fuzzer to generate cases like #53. Currently, the fuzz testcase generator will add at most one fixed-register constraint to an instruction per physical register. This avoids impossible situations, such as specifying that both `v0` and `v1` must be placed into the same `p0`. However, it should be possible to say that `v0` is in `p0` before the instruction, and `v1` is in `p0` after the instruction (i.e., at `Early` and `Late` operand positions). This in fact exposes a limitation in the current allocator design: when `v0` is live downward, with the above constraints, it will result in an impossible allocation situation because we cannot split in the middle of an instruction. A subsequent fix will rectify this by using the multi-fixed-reg fixup mechanism. * Handle conflicting Before and After fixed-reg constraints with a copy. This fixes #53. Previously, if two operands on an instruction specified different vregs constrained to the same physical register at the Before (Early) and After (Late) points of the instruction, and the Before was live downward as well, we would panic: we can't insert a move into the middle of an instruction, so putting the first vreg in the preg at Early implies we have an unsolvable conflict at Late. We can solve this issue by adding some new logic to insert a copy, and rewrite the constraint. This reuses the multi-fixed-reg-constraint fixup logic. While that logic handles the case where the same vreg has multiple different fixed-reg constraints, this new logic handles different vregs with the same fixed-reg constraints, but at different program points; so the two are complementary. This addresses the specific test case in #53, and also fuzzes cleanly with the change to the fuzz testcase generator to generate these cases (which also immediately found the bug). 
* Add a reservation to the PReg when rewriting constraint so it is not doubly-allocated. * Distinguish initial fixup moves from secondary moves. * Use `trace` macro, not `log::trace`, to avoid trace output when feature is disabled. * Rework operand rewriting to properly handle bundle-merging edge case. When the liverange for the defined vreg with fixed constraint at Late is merged with the liverange for the used vreg with fixed constraint at Early, the strategy of putting a fixed reservation on the preg at Early fails, because the whole bundle is minimal (if it spans just the instruction's Early and Late and nothing else). This could happen if e.g. the def flows into a blockparam arg that merges with a blockparam defining the used value. Instead we move the def one halfstep earlier, to the Early point, with its fixed-reg constraint still in place. This has the same effect but works when the two are merged. * Fix checker issue: make more flexible in the presence of victim-register saves.	2022-05-31 14:01:27 -07:00
Chris Fallin	869c21e79c	Remove an explicitly-set-aside scratch register per class. (#51) Currently, regalloc2 sets aside one register per class, unconditionally, to make move resolution possible. To solve the "parallel moves problem", we sometimes need to conjure a cyclic permutation of data among registers or stack slots (this can result, for example, from blockparam flow that swaps two values on a loop backedge). This set-aside scratch register is used when a cycle exists. regalloc2 also uses the scratch register when needed to break down a stack-to-stack move (which could happen due to blockparam moves on edges when source and destination are both spilled) into a stack-to-reg move followed by reg-to-stack, because most machines have loads and stores but not memory-to-memory moves. A set-aside register is certainly the simplest solution, but it is not optimal: it means that we have one fewer register available for use by the program, and this can be costly especially on machines with fewer registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some registers may be set aside by our embedder for other purposes too. Every register we can reclaim is some nontrivial performance in large function bodies! This PR removes this restriction and allows regalloc2 to use all available physical registers. It then solves the two problems above, cyclic moves and stack-to-stack moves, with a two-stage approach: - First, it finds a location to use to resolve cycles, if any exist. If a register is unallocated at the location of the move, we can use it. Often we get lucky and this is the case. Otherwise, we allocate a stackslot to use as the temp. This is perfectly fine at this stage, even if it means that we have more stack-to-stack moves. - Then, it resolves stack-to-stack moves into stack-to-reg / reg-to-stack. There are two subcases here. If there is another available free physical register, we opportunistically use it for this decomposition. 
If not, we fall back to our last-ditch option: we pick a victim register of the appropriate class, we allocate another temporary stackslot, we spill the victim to that slot just for this move, we do the move in the above way (stack-to-reg / reg-to-stack) with the victim, then we reload the victim. So one move (original stack-to-stack) becomes four moves, but no state is clobbered. This PR extends the `moves` fuzz-target to exercise this functionality as well, randomly choosing for some spare registers to exist or not, and randomly generating {stack,reg}-to-{stack,reg} moves in the initial parallel-move input set. The target does a simple symbolic simulation of the sequential move sequence and ensures that the final state is equivalent to the parallel-move semantics. I fuzzed both the `moves` target, focusing on the new logic; as well as the `ion_checker` target, checking the whole register allocator, and both seem clean (~150M cases on the former, ~1M cases on the latter).	2022-05-23 10:48:37 -07:00
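The two-stage approach in the commit above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the regalloc2 implementation: `Loc`, `Temp`, `sequentialize`, and `lower_stack_to_stack` are hypothetical names, each destination is assumed to appear at most once in the parallel-move set, and the scratch location is assumed not to appear in any input move.

```rust
/// Hypothetical location: a physical register or a stack slot.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Loc {
    Reg(u8),
    Stack(u8),
}

/// Stage 1: turn parallel moves `(src, dst)` into a sequential list,
/// using `scratch` (a free register or a fresh stack slot) to break cycles.
pub fn sequentialize(moves: &[(Loc, Loc)], scratch: Loc) -> Vec<(Loc, Loc)> {
    // Self-moves are no-ops and are dropped up front.
    let mut pending: Vec<(Loc, Loc)> =
        moves.iter().copied().filter(|(s, d)| s != d).collect();
    let mut out = Vec::new();
    while !pending.is_empty() {
        // A move is ready when no pending move still reads its destination.
        let ready = pending
            .iter()
            .position(|&(_, d)| !pending.iter().any(|&(s, _)| s == d));
        match ready {
            Some(i) => out.push(pending.remove(i)),
            None => {
                // Only cycles remain: save one source in the scratch location
                // and redirect its readers there, which frees that source
                // to be overwritten.
                let (s, _) = pending[0];
                out.push((s, scratch));
                for m in pending.iter_mut() {
                    if m.0 == s {
                        m.0 = scratch;
                    }
                }
            }
        }
    }
    out
}

/// How stage 2 obtains a register for a stack-to-stack move.
#[derive(Clone, Copy, Debug)]
pub enum Temp {
    /// A register known to be free at this program point.
    Free(Loc),
    /// No free register: spill `reg` to `save_slot`, use it, then reload it.
    Victim { reg: Loc, save_slot: Loc },
}

/// Stage 2: expand stack-to-stack moves into stack-to-reg / reg-to-stack pairs.
pub fn lower_stack_to_stack(moves: &[(Loc, Loc)], temp: Temp) -> Vec<(Loc, Loc)> {
    let mut out = Vec::new();
    for &(src, dst) in moves {
        match (src, dst) {
            (Loc::Stack(_), Loc::Stack(_)) => match temp {
                Temp::Free(reg) => {
                    out.push((src, reg));
                    out.push((reg, dst));
                }
                Temp::Victim { reg, save_slot } => {
                    out.push((reg, save_slot)); // save the victim
                    out.push((src, reg));
                    out.push((reg, dst));
                    out.push((save_slot, reg)); // restore the victim
                }
            },
            _ => out.push((src, dst)),
        }
    }
    out
}
```

With a victim register, one stack-to-stack move expands to four moves, matching the count given in the commit message. Running stage 1 before stage 2 means a stack-slot scratch can introduce extra stack-to-stack moves, which stage 2 then lowers.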
Chris Fallin	ad41f8a7a5	Record vreg classes explicitly during liverange pass. (#35) This resolves an issue seen when the source program uses multiple regclasses (Int and Float): in some cases, the logic that grabs the vregs and retains them (with class) in `vreg_regs` missed a register and we had a class mismatch. This occurred because data structures were initialized assuming `Int` regclass at first. This PR instead removes the `vreg_regs` array, stores the class explicitly as an `Option<RegClass>` in the `VRegData`, and provides a `Env::vreg()` method that reconstitutes a `VReg` given its index and its observed class. We "observe" the class of every vreg seen during the liveness pass (and we assert that every occurrence of the vreg index has the same class). In this way, we still have a single source-of-truth for the vreg class (the mention of the vreg itself) and we explicitly represent the "not observed yet" state (and panic on attempting to use such a vreg) rather than implicitly taking the wrong class.	2022-03-29 14:00:14 -07:00
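The "observe the class" scheme in the commit above might look roughly like the sketch below. The type `VRegClasses` and its methods are hypothetical illustrations, not the actual `VRegData` or `Env::vreg()` code. It shows the `Option<RegClass>` state, the consistency assertion on repeated observations, and the panic when a class is used before it has been observed.

```rust
/// Register classes, as named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RegClass {
    Int,
    Float,
}

/// Hypothetical per-vreg table; `None` means "class not observed yet".
pub struct VRegClasses {
    classes: Vec<Option<RegClass>>,
}

impl VRegClasses {
    /// Creates a table for `num_vregs` vregs, all initially unobserved.
    pub fn new(num_vregs: usize) -> Self {
        VRegClasses {
            classes: vec![None; num_vregs],
        }
    }

    /// Records the class seen at one mention of vreg `index`.
    /// Panics if an earlier mention had a different class.
    pub fn observe(&mut self, index: usize, class: RegClass) {
        match self.classes[index] {
            None => self.classes[index] = Some(class),
            Some(prev) => assert_eq!(prev, class, "vreg {} seen with two classes", index),
        }
    }

    /// Returns the observed class, panicking rather than guessing a default.
    pub fn class_of(&self, index: usize) -> RegClass {
        self.classes[index].expect("vreg class used before it was observed")
    }

    /// Non-panicking query, useful for checks.
    pub fn try_class_of(&self, index: usize) -> Option<RegClass> {
        self.classes[index]
    }
}
```

Representing the unobserved state as `None` turns the earlier silent fallback to `Int` into an explicit failure.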
Chris Fallin	4f1161d9e4	Generalize debug-info support a bit. (#34) * Generalize debug-info support a bit. Previously, debug value-label support required each vreg to have a disjoint sequence of instruction ranges, each with one label. Unfortunately, it's entirely possible for multiple values at the program level to map to one vreg at the IR level, leading to multiple labels. This PR generalizes the debug-info generation support to allow for arbitrary (label, range, vreg) tuples, as long as they are sorted by vreg, with no other requirements. The lookup is a little more costly when we generate the debuginfo, but we shouldn't have more than a few debug value labels per vreg, so in practice the constants should be small. * Typo fix from Amanieu Co-authored-by: Amanieu d'Antras <amanieu@gmail.com>	2022-03-18 10:32:27 -07:00
Chris Fallin	fe021ad6d4	Simplify pinned-vreg API: don't require slice of all pinned vregs. (#28) Previously, we kept a bool flag `is_pinned` in the `VRegData`, and we required a `&[VReg]` of all pinned vregs to be provided by `Function::pinned_vregs()`. This was (I think) done for convenience, but it turns out not to really be necessary, as we can just query `is_pinned_vreg` where needed (and in the likely implementation, e.g. in Cranelift, this will be a `< NUM_PINNED_VREGS` check that can be inlined). This adds convenience for the embedder (the main benefit), and also reduces complexity, removes some state, and avoids some work initializing the regalloc state for a run.	2022-03-04 15:12:16 -08:00
Chris Fallin	14442df3fc	Support for debug-labels. (#27) If the client adds labels to vregs across ranges of instructions in the input program, the regalloc will provide metadata in the `Output` that describes the `Allocation`s in which each such vreg is stored for those ranges. This allows the client to emit debug metadata telling a debugger where to find program values at each point in the program.	2022-03-03 16:58:33 -08:00
Chris Fallin	ccd6b4fc2c	Remove DefAlloc -- no longer needed.	2022-01-19 23:57:31 -08:00
Amanieu d'Antras	ee4de54240	Guard trace! behind cfg!(debug_assertions) Even if the trace log level is disabled, the presence of the trace! macro still has a significant impact on performance because it is present in the inner loops of the allocator. Removing the trace! calls at compile-time reduces instruction count by ~7%.	2022-01-11 13:30:13 +00:00
Amanieu d'Antras	2d9d5dd82b	Rearrange some struct fields to work better with u64_key/u128_key This allows the compiler to load the whole key with 1 or 2 64-bit accesses, assuming little-endian ordering. Improves instruction count by ~1%.	2022-01-11 13:24:51 +00:00
Amanieu d'Antras	693fb6a975	Only emit DefAlloc edits when the "checker" feature is enabled. This reduces instruction counts by ~2% when disabled.	2022-01-11 13:03:24 +00:00
Amanieu d'Antras	d95a9d9399	Combine sort keys into u64/u128 This allows the compiler to perform branch-less comparisons, which are more efficient. This results in ~5% fewer instructions executed.	2022-01-11 13:03:21 +00:00
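As a rough illustration of the packed sort-key idea in the commit above: two `u32` fields that would otherwise need a two-step lexicographic comparison can be packed into one `u64`, so ordering is a single integer compare. The struct `MoveKey` and its fields are hypothetical and do not correspond to a specific regalloc2 type.

```rust
/// Hypothetical record sorted by (pos, prio), most significant field first.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MoveKey {
    pub pos: u32,
    pub prio: u32,
}

impl MoveKey {
    /// Packs both fields into one integer. `pos` occupies the high 32 bits,
    /// so comparing keys matches comparing `(pos, prio)` tuples.
    pub fn key(self) -> u64 {
        ((self.pos as u64) << 32) | (self.prio as u64)
    }
}

/// Sorts using the packed key: one integer comparison per pair of elements.
pub fn sort_by_packed_key(items: &mut [MoveKey]) {
    items.sort_unstable_by_key(|m| m.key());
}
```

The packing preserves order only because both fields are unsigned and the more significant field sits in the high bits. A signed field would need a bias applied before packing.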
Amanieu d'Antras	053375f049	Remove PRegData::reg and use PReg::from_index instead Performance impact is negligible but this is a good cleanup.	2022-01-11 13:02:08 +00:00
Amanieu d'Antras	74928b83fa	Replace all assert! with debug_assert! This results in a ~6% reduction in instruction count.	2022-01-11 03:54:08 +00:00
Amanieu d'Antras	51493ab03a	Apply review feedback	2021-12-12 00:33:30 +00:00
Amanieu d'Antras	8f435243e0	Properly handle fixed stack slots during multi-fixed-reg fixup	2021-12-11 22:39:14 +00:00
Amanieu d'Antras	77e6a9e0d7	Add support for fixed stack slots This works by allowing a PReg to be marked as being a stack location instead of a physical register.	2021-12-11 22:31:58 +00:00
Chris Fallin	ef6c8f3226	Fix fuzzbug: add checker metadata for new vreg on multi-fixed-reg fixup move. When an instruction uses the same vreg constrained to multiple different fixed registers, the allocator converts all but one of the fixed constraints to `Any` and then records a special fixup move that copies the value to the other fixed registers just before the instruction. This allows the allocator to maintain the invariant that a value lives in only one place at a time throughout most of its logic, and constrains the complexity-fallout of this corner case to just a special last-minute edit. Unfortunately some recent CPU time thrown at the fuzzer has uncovered a subtle interaction with the redundant move eliminator that confuses the checker. Specifically, when the correct value is already in the second constrained fixed reg, because of an unrelated other move (e.g. because of a blockparam or other vreg moved from the original), the redundant move eliminator can delete the fixup move without telling the checker that it has done so. Such an optimization is perfectly valid, and the generated code is correct; but the checker thinks that some other vreg (the one that was copied from the original) is in the second preg, and panics. The fix is to use the mechanism that indicates "this move defines a new vreg" (emitting a `defalloc` checker-instruction) to force the checker to understand that after the fixup move, the given preg actually contains the appropriate vreg.	2021-12-04 23:30:30 -08:00
Chris Fallin	cf0d515709	Relicense fully to Apache-2.0 WITH LLVM-exception. Large parts of the code in regalloc2 are currently licensed under the Mozilla Public License (MPL) 2.0, because they derive in meaningful ways from the register allocator in IonMonkey, which is part of Firefox. The relevant source files are marked as such, with references to the files in the Firefox source tree. The intent of the regalloc2 project was to port the register allocator from Firefox to use in Cranelift, borrowing good technology and improving on it in the spirit of open source. However, several use-cases of Cranelift require, or at least strongly prefer, the Apache-2.0 license with the LLVM exception (matching the license of Cranelift itself, and Bytecode Alliance projects generally). While using this license is not strictly necessary for regalloc2 to be usable (the MPL is an excellent open-source license!), relicensing fully under this license to harmonize with the rest of Cranelift and Bytecode Alliance codebases significantly widens possibilities and reduces friction; then regalloc2 is "just another part of Cranelift" and doesn't have to be treated specially. The source in `src/ion/` specifically began as a fairly direct port of the algorithms in the following files in the `mozilla-central` repository (Firefox codebase): * The bulk of the "backtracking allocator" algorithm: * `js/src/jit/BacktrackingAllocator.{cpp,h}` * Helpers and definitions in the surrounding infrastructure: * `js/src/jit/RegisterAllocator.h` * `js/src/jit/RegisterAllocator.cpp` * `js/src/jit/StackSlotAllocator.h` * `js/src/jit/LIR.h` * A few data structure implementations: * `js/src/ds/SplayTree.h` * `js/src/ds/PriorityQueue.h` Subsequent work in improving regalloc2 has caused it to drift from the direct port -- for example, it no longer uses splay trees or the direct port of the priority queue above -- but it is of course very clearly still a derivative work. 
Analysis of the contributors to these files indicates that we need signoff from the following folks: * Mozilla Corp, for contributions made by Mozilla employees (the majority of the code). Communications with Mozilla (thanks @tschneidereit and @bholley for doing the work here!) indicate that @ekr is able to sign off when ready here. * Andy Wingo, specifically for the work done in [Bug 1620197](https://bugzilla.mozilla.org/show_bug.cgi?id=1620197) and [Bug 1609057](https://bugzilla.mozilla.org/show_bug.cgi?id=1609057) to generalize the stack allocator for a Wasm feature (multiple returns). Additionally, since the initial port, we have had three contributions from @Amanieu: [#9](https://github.com/bytecodealliance/regalloc2/pull/9), [#11](https://github.com/bytecodealliance/regalloc2/pull/11), [#13](https://github.com/bytecodealliance/regalloc2/pull/13). So, if everyone applicable is happy with this relicensing, this PR removes the MPL-2.0 license in `src/ion/` and marks all files as covered under `Apache-2.0 WITH LLVM-exception`. Please let us know if this is OK! Signoffs: - [ ] @ekr, for Mozilla's contributions - [ ] @wingo, for contributions to original code in `mozilla-central` - [ ] @Amanieu, for the three PRs linked above Thanks!	2021-11-10 10:54:28 -08:00
Chris Fallin	6f0893d69d	Address review comments.	2021-08-31 17:56:06 -07:00
Chris Fallin	6389071e09	Address review comments.	2021-08-31 17:42:50 -07:00
Chris Fallin	b19fa4857f	Rename operand positions to Early and Late, and make weights f16/f32 values.	2021-08-31 17:31:23 -07:00
Chris Fallin	6d313f2b56	Address review comments: more doc comments and some minor refactorings.	2021-08-30 17:15:37 -07:00
Chris Fallin	2f856435f4	Review feedback.	2021-08-12 14:08:10 -07:00
Chris Fallin	3e1e0f39b6	Convert all log::debug to log::trace.	2021-08-12 12:05:19 -07:00
Chris Fallin	84285c26fb	Rename OperandPolicy to OperandConstraint as per feedback from @julian-seward1.	2021-08-12 11:17:52 -07:00
Chris Fallin	b36a563d69	Cleanup: split allocator implementation into 11 files of more reasonable size.	2021-06-18 16:51:41 -07:00

27 Commits