Commit Graph

363 Commits

Author SHA1 Message Date
Chris Fallin
230e2135d6 Cranelift: remove non-egraphs optimization pipeline and use_egraphs option. (#6167)
* Cranelift: remove non-egraphs optimization pipeline and `use_egraphs` option.

This PR removes the LICM, GVN, and preopt passes, and associated support
pieces, from `cranelift-codegen`. Not to worry, we still have
optimizations: the egraph framework subsumes all of these, and has been
on by default since #5181.

A few decision points:

- Filetests for the legacy LICM, GVN and simple_preopt were removed too.
  As we built optimizations in the egraph framework we wrote new tests
  for the equivalent functionality, and many of the old tests were
  testing specific behaviors in the old implementations that may not be
  relevant anymore. However if folks prefer I could take a different
  approach here and try to port over all of the tests.

- The corresponding filetest modes (commands) were deleted too. The
  `test alias_analysis` mode remains, but no longer invokes a separate
  GVN first (since there is no separate GVN that will not also do alias
  analysis) so the tests were tweaked slightly to work with that. The
  egrpah testsuite also covers alias analysis.

- The `divconst_magic_numbers` module is removed since it's unused
  without `simple_preopt`, though this is the one remaining optimization
  we still need to build in the egraphs framework, pending #5908. The
  magic numbers will live forever in git history so removing this in the
  meantime is not a major issue IMHO.

- The `use_egraphs` setting itself was removed at both the Cranelift and
  Wasmtime levels. It has been marked deprecated for a few releases now
  (Wasmtime 6.0, 7.0, upcoming 8.0, and corresponding Cranelift
  versions) so I think this is probably OK. As an alternative if anyone
  feels strongly, we could leave the setting and make it a no-op.

* Update test outputs for remaining test differences.
2023-04-06 18:11:03 +00:00
Remo Senekowitsch
7eb8914090 Chaos mode MVP: Skip branch optimization in MachBuffer (#6039)
* fuzz: Add chaos mode control plane

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: Skip branch optimization with chaos mode

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: Rename chaos engine -> control plane

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* chaos mode: refactoring ControlPlane to be passed through the call stack by reference

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Remo Senekowitsch <contact@remsle.dev>

* fuzz: annotate chaos todos

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: cleanup control plane

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: remove control plane from compiler context

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: move control plane into emit state

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fuzz: fix remaining compiler errors

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* fix tests

* refactor emission state ctrl plane accessors

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* centralize conditional compilation of chaos mode

Also cleanup a few straggling dependencies on cranelift-control
that aren't needed anymore.

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* add cranelift-control to published crates

prtest:full

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

* add cranelift-control to public crates

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>

---------

Co-authored-by: Falk Zwimpfer <24669719+FalkZ@users.noreply.github.com>
Co-authored-by: Moritz Waser <mzrw.dev@pm.me>
Co-authored-by: Remo Senekowitsch <contact@remsle.dev>
2023-04-05 19:28:46 +00:00
Alex Crichton
a3b21031d4 Add a MachBuffer::defer_trap method (#6011)
* Add a `MachBuffer::defer_trap` method

This commit adds a new method to `MachBuffer` to defer trap opcodes to
the end of a function in a similar manner to how constants are deferred
to the end of the function. This is useful for backends which frequently
use `TrapIf`-style opcodes. Currently a jump is emitted which skips the
next instruction, a trap, and then execution continues normally. While
there isn't any pressing problem with this construction the trap opcode
is in the middle of the instruction stream as opposed to "off on the
side" despite rarely being taken.

With this method in place all the backends (except riscv64 since I
couldn't figure it out easily enough) have a new lowering of their
`TrapIf` opcode. Now a trap is deferred, which returns a label, and then
that label is jumped to when executing the trap. A fixup is then
recorded in `MachBuffer` to get patched later on during emission, or at
the end of the function. Subsequently all `TrapIf` instructions
translate to a single branch plus a single trap at the end of the
function.

I've additionally further updated some more lowerings in the x64 backend
which were explicitly using traps to instead use `TrapIf` where
applicable to avoid jumping over traps mid-function. Other backends
didn't appear to have many jump-over-the-next-trap patterns.

Lots of tests have had their expectations updated here which should
reflect all the traps being sunk to the end of functions.

* Print trap code on all platforms

* Emit traps before constants

* Preserve source location information for traps

* Fix test expectations

* Attempt to fix s390x

The MachBuffer was registering trap codes with the first byte of the
trap, but the SIGILL handler was expecting it to be registered with the
last byte of the trap. Exploit that SIGILL is always represented with a
2-byte instruction and always march 2-backwards for SIGILL, continuing
to march backwards 1 byte for SIGFPE-generating instructions.

* Back out s390x changes

* Back out more s390x bits

* Review comments
2023-03-20 21:24:47 +00:00
Alex Crichton
5ae8575296 x64: Take SIGFPE signals for divide traps (#6026)
* x64: Take SIGFPE signals for divide traps

Prior to this commit Wasmtime would configure `avoid_div_traps=true`
unconditionally for Cranelift. This, for the division-based
instructions, would change emitted code to explicitly trap on trap
conditions instead of letting the `div` x86 instruction trap.

There's no specific reason for Wasmtime, however, to specifically avoid
traps in the `div` instruction. This means that the extra generated
branches on x86 aren't necessary since the `div` and `idiv` instructions
already trap for similar conditions as wasm requires.

This commit instead disables the `avoid_div_traps` setting for
Wasmtime's usage of Cranelift. Subsequently the codegen rules were
updated slightly:

* When `avoid_div_traps=true`, traps are no longer emitted for `div`
  instructions.
* The `udiv`/`urem` instructions now list their trap as divide-by-zero
  instead of integer overflow.
* The lowering for `sdiv` was updated to still explicitly check for zero
  but the integer overflow case is deferred to the instruction itself.
* The lowering of `srem` no longer checks for zero and the listed trap
  for the `div` instruction is a divide-by-zero.

This means that the codegen for `udiv` and `urem` no longer have any
branches. The codegen for `sdiv` removes one branch but keeps the
zero-check to differentiate the two kinds of traps. The codegen for
`srem` removes one branch but keeps the -1 check since the semantics of
`srem` mismatch with the semantics of `idiv` with a -1 divisor
(specifically for INT_MIN).

This is unlikely to have really all that much of a speedup but was
something I noticed during #6008 which seemed like it'd be good to clean
up. Plus Wasmtime's signal handling was already set up to catch
`SIGFPE`, it was just never firing.

* Remove the `avoid_div_traps` cranelift setting

With no known users currently removing this should be possible and helps
simplify the x64 backend.

* x64: GC more support for avoid_div_traps

Remove the `validate_sdiv_divisor*` pseudo-instructions and clean up
some of the ISLE rules now that `div` is allowed to itself trap
unconditionally.

* x64: Store div trap code in instruction itself

* Keep divisors in registers, not in memory

Don't accidentally fold multiple traps together

* Handle EXC_ARITHMETIC on macos

* Update emit tests

* Update winch and tests
2023-03-16 00:18:45 +00:00
Alex Crichton
5c1b468648 x64: Migrate {s,u}{div,rem} to ISLE (#6008)
* x64: Add precise-output tests for div traps

This adds a suite of `*.clif` files which are intended to test the
`avoid_div_traps=true` compilation of the `{s,u}{div,rem}` instructions.

* x64: Remove conditional regalloc in `Div` instruction

Move the 8-bit `Div` logic into a dedicated `Div8` instruction to avoid
having conditionally-used registers with respect to regalloc.

* x64: Migrate non-trapping, `udiv`/`urem` to ISLE

* x64: Port checked `udiv` to ISLE

* x64: Migrate urem entirely to ISLE

* x64: Use `test` instead of `cmp` to compare-to-zero

* x64: Port `sdiv` lowering to ISLE

* x64: Port `srem` lowering to ISLE

* Tidy up regalloc behavior and fix tests

* Update docs and winch

* Review comments

* Reword again

* More refactoring test fixes

* More test fixes
2023-03-14 01:44:06 +00:00
Alex Crichton
52896e020d aarch64: Add specialized shuffle lowerings (#5977)
* aarch64: Add `shuffle` lowerings for the `uzp{1,2}` instructions

This commit uses the same style of patterns in the x64 backend to start
adding specific lowerings of the Cranelift `shuffle` instruction to
particular AArch64 instructions.

* aarch64: Add `shuffle` lowerings to the `zip{1,2}` instructions

These instructions match the `punpck*` family of instructions on x64 and
should help provide more efficient lowerings than the current `shuffle`
fallback.

* aarch64: Add `shuffle` lowerings for `trn{1,2}`

Along the lines of prior commits adds specific patterns to lowering for
individual AArch64 instructions available.

* aarch64: Add a `shuffle` lowering for the `ext` instruction

This instruction will more-or-less concatenate two 128-bit vector
registers to create a 256-bit value, shift it right, and then take the
lower 128-bits into the destination. This can be modeled with a
`shuffle` of consecutive bytes so this adds a lowering rule to generate
this instruction.

* aarch64: Add `shuffle` special case for `dup`

This commit adds special cases for Cranelift's `shuffle` on AArch64 when
the lowering can be represented with a `dup` instruction which
broadcasts one vector's lane into all lanes of the destination.

* aarch64: Add `shuffle` specializations for `rev` instructions

This commit adds shuffle mask specializations for the `rev{16,32,64}`
family of instructions on AArch64 which can be used to reverse bytes,
16-bit values, or 32-bit values within larger values.

* Fix tests

* Add doc-comments in ISLE
2023-03-10 21:37:13 +00:00
Alex Crichton
1c3a1bda6c x64: Add a smattering of lowerings for shuffle specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function instead generate a zero with a `pxor`
instruction and then use `pshufb` to do the broadcast.

* Review comments
2023-03-09 22:58:19 +00:00
Kevin Rizzo
013b35ff32 winch: Refactoring wasmtime compiler integration pieces to share more between Cranelift and Winch (#5944)
* Enable the native target by default in winch

Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.

* Refactor Cranelift's ISA builder to share more with Winch

This commit refactors the `Builder` type to have a type parameter
representing the finished ISA with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.

* Moving compiler shared components to a separate crate

* Restore native flag inference in compiler building

This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.

* Move `Compiler::page_size_align` into wasmtime-environ

The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.

* Fill out `Compiler::{flags, isa_flags}` for Winch

These are easy enough to plumb through with some shared code for
Wasmtime.

* Plumb the `is_branch_protection_enabled` flag for Winch

Just forwarding an isa-specific setting accessor.

* Moving executable creation to shared compiler crate

* Adding builder back in and removing from shared crate

* Refactoring the shared pieces for the `CompilerBuilder`

I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.

With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.

I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstraction: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.

* Documentation updates, and renaming the finish method

* Adding back in a default temporarily to pass tests, and removing some unused imports

* Fixing winch tests with wrong method name

* Removing unused imports from codegen shared crate

* Apply documentation formatting updates

Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>

* Adding back in cranelift_native flag inferring

* Adding new shared crate to publish list

* Adding write feature to pass cargo check

---------

Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
2023-03-08 15:07:13 +00:00
Trevor Elliott
8abfe928d6 Reuse the DominatorTree postorder travesal in BlockLoweringOrder (#5843)
* Rework the blockorder module to reuse the dom tree's cfg postorder

* Update domtree tests

* Treat br_table with an empty jump table as multiple block exits

* Bless tests

* Change branch_idx to succ_idx and fix the comment
2023-02-23 22:05:20 +00:00
Trevor Elliott
d711872d63 Refactor collect_branches_and_targets to not need a smallvec (#5803)
* Refactor collect_branches_and_targets to not need a smallvec

Basic blocks are terminated by at most one branch instruction now, so we
can use that assumption in `collect_branches_and_targets` to return the
last instruction we saw instead.

* Review comments
2023-02-16 21:30:17 +00:00
Trevor Elliott
80c147d9c0 Rework br_table to use BlockCall (#5731)
Rework br_table to use BlockCall, allowing us to avoid adding new nodes during ssa construction to hold block arguments. Additionally, many places where we previously matched on InstructionData to extract branch destinations can be replaced with a use of branch_destination or branch_destination_mut.
2023-02-16 09:23:27 -08:00
Trevor Elliott
cc073593a4 Fix block label printing in precise-output tests (#5798)
As a follow-up to #5780, disassemble the regions identified by bb_starts, falling back on disassembling the whole buffer. This ensures that instructions like br_table that introduce a lot of constants don't throw off capstone for the remainder of the function.

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-16 02:35:26 +00:00
Trevor Elliott
f04decc4a1 Use capstone to validate precise-output tests (#5780)
Use the capstone library to disassemble precise-output tests, in addition to pretty-printing their vcode.
2023-02-15 16:35:10 -08:00
Trevor Elliott
116e5a665f Bump regalloc2 to 0.6.0 (#5742)
* Bump regalloc2
* Certify regalloc2 0.6.0
2023-02-07 15:57:49 -08:00
Trevor Elliott
2c8425998b Refactor matches that used to consume BranchInfo (#5734)
Explicitly borrow the instruction data, and use a mutable borrow to avoid rematch.
2023-02-07 13:29:42 -08:00
Trevor Elliott
c8a6adf825 Remove analyze_branch and BranchInfo (#5730)
We don't have overlap in behavior for branch instructions anymore, so we can remove analyze_branch and instead match on the InstructionData directly.

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-06 17:06:57 -08:00
Trevor Elliott
a5698cedf8 cranelift: Remove brz and brnz (#5630)
Remove the brz and brnz instructions, as their behavior is now redundant with brif.
2023-01-30 20:34:56 +00:00
Trevor Elliott
b58a197d33 cranelift: Add a conditional branch instruction with two targets (#5446)
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.

This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.

Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
2023-01-24 14:37:16 -08:00
Chris Fallin
1faff8c2ce Enable egraph-based optimization by default. (#5587)
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default.

Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on.

This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"), now it runs egraph opts if the opt level indicates (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!).

Fixes #5181.
2023-01-19 15:46:53 -08:00
Trevor Elliott
1e6c13d83e cranelift: Rework block instructions to use BlockCall (#5464)
Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.

To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that both sets of arguments:

inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrite all values in an instruction and block arguments with values from the iterator

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-01-17 16:31:15 -08:00
Jamey Sharp
3a2ca67570 Generalize iterator types in compute_use_states (#5586)
This is a cleanup to help prepare for #5464.

Most of the diff is inlining the closure for `mark_all_uses_as_multiple`
which was only called once. That avoids some borrow-checker challenges.

The key change is that the former `push_args_on_stack` closure no longer
actually pushes the iterator on the stack, but just returns it. That
way this closure doesn't need the name of the stack's type. It also
allows it to be reused in the debug_assert.
2023-01-17 22:17:17 +00:00
Chris Fallin
03463458e4 Cranelift: fix branch-of-icmp/fcmp regression: look through uextend. (#5487)
In #5031, we removed `bool` types from CLIF, using integers instead for
"truthy" values. This greatly simplified the IR, and was generally an
improvement.

However, because x86's `SETcc` instruction sets only the low 8 bits of a
register, we chose to use `i8` types as the result of `icmp` and `fcmp`,
to avoid the need for a masking operation when materializing the result.

Unfortunately this means that uses of truthy values often now have
`uextend` operations, especially when coming from Wasm (where truthy
values are naturally `i32`-typed). For example, where we previously had
`(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`.

It's arguable whether or not we should switch to `i32` truthy values --
in most cases we can avoid materializing a value that's immediately used
for a branch or select, so a mask would in most cases be unnecessary,
and it would be a win at the IR level -- but irrespective of that, this
change *did* regress our generated code quality: our backends had
patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we
were *always* materializing truthy values. Many blocks thus ended with
"cmp; setcc; cmp; test; branch" rather than "cmp; branch".

In #5391 we noticed this and fixed it on x64, but it was a general
problem on aarch64 and riscv64 as well. This PR introduces a
`maybe_uextend` extractor that "looks through" uextends, and uses it
where we consume truthy values, thus fixing the regression.  This PR
also adds compile filetests to ensure we don't regress again.

The riscv64 backend has not been updated here because doing so appears
to trigger another issue in its branch handling; fixing that is TBD.
2022-12-22 01:43:44 -08:00
Trevor Elliott
fac4a915a3 Assert that we only use virtual registers with moves (#5440)
Assert that we never see real registers as arguments to move instructions in VCodeBuilder::collect_operands.

Also fix a bug in the riscv64 backend that was discovered by these assertions: the lowerings of get_stack_pointer and get_frame_pointer were using physical registers 8 and 2 directly. The solution was similar to other backends: add a move instruction specifically for moving out of physical registers, whose source operand is opaque to regalloc2.
2022-12-20 18:22:47 -08:00
Trevor Elliott
25bf8e0e67 Make DataFlowGraph::insts public, but restricted (#5450)
We have some operations defined on DataFlowGraph purely to work around borrow-checker issues with InstructionData and other data on DataFlowGraph. Part of the problem is that indexing the DFG directly hides the fact that we're only indexing the insts field of the DFG.

This PR makes the insts field of the DFG public, but wraps it in a newtype that only allows indexing. This means that the borrow checker is better able to tell when operations on memory held by the DFG won't conflict, which comes up frequently when mutating ValueLists held by InstructionData.
2022-12-16 10:46:09 -08:00
Ulrich Weigand
f0af622208 Simplify LowerBackend interface (#5432)
* Refactor lower_branch to have Unit result

Branches cannot have any output, so it is more straightforward
to have the ISLE term return Unit instead of InstOutput.

Also provide a new `emit_side_effect` term to simplify
implementation of `lower_branch` rules with Unit result.

* Simplify LowerBackend interface

Move all remaining asserts from the LowerBackend::lower and
::lower_branch_group into the common call site.

Change return value of ::lower to Option<InstOutput>, and
return value of ::lower_branch_group to Option<()> to match
ISLE term signature.

Only pass the first branch into ::lower_branch_group and
rename it to ::lower_branch.

As a result of all those changes, LowerBackend routines
now consists solely to calls to the corresponding ISLE
routines.
2022-12-14 00:48:25 +00:00
Ulrich Weigand
df923f18ca Remove MachInst::gen_constant (#5427)
* aarch64: constant generation cleanup

Add support for MOVZ and MOVN generation via ISLE.
Handle f32const, f64const, and nop instructions via ISLE.
No longer call Inst::gen_constant from lower.rs.

* riscv64: constant generation cleanup

Handle f32const, f64const, and nop instructions via ISLE.

* s390x: constant generation cleanup

Fix rule priorities for "imm" term.
Only handle 32-bit stack offsets; no longer use load_constant64.

* x64: constant generation cleanup

No longer call Inst::gen_constant from lower.rs or abi.rs.

* Refactor LowerBackend::lower to return InstOutput

No longer write to the per-insn output registers; instead, return
an InstOutput vector of temp registers holding the outputs.

This will allow calling LowerBackend::lower multiple times for
the same instruction, e.g. to rematerialize constants.

When emitting the primary copy of the instruction during lowering,
writing to the per-insn registers is now done in lower_clif_block.

As a result, the ISLE lower_common routine is no longer needed.
In addition, the InsnOutput type and all code related to it
can be removed as well.

* Refactor IsleContext to hold a LowerBackend reference

Remove the "triple", "flags", and "isa_flags" fields that are
copied from LowerBackend to each IsleContext, and instead just
hold a reference to LowerBackend in IsleContext.

This will allow calling LowerBackend::lower from within callbacks
in src/machinst/isle.rs, e.g. to rematerialize constants.

To avoid having to pass LowerBackend references through multiple
functions, eliminate the lower_insn_to_regs subroutines in those
targets that still have them, and just inline into the main
lower routine.  This also eliminates lower_inst.rs on aarch64
and riscv64.

Replace all accesses to the removed IsleContext fields by going
through the LowerBackend reference.

* Remove MachInst::gen_constant

This addresses the problem described in issue
https://github.com/bytecodealliance/wasmtime/issues/4426
that targets currently have to duplicate code to emit
constants between the ISLE logic and the gen_constant
callback.

After the various cleanups in earlier patches in this series,
the only remaining user of get_constant is put_value_in_regs
in Lower.  This can now be removed, and instead constant
rematerialization can be performed in the put_in_regs ISLE
callback by simply directly calling LowerBackend::lower
on the instruction defining the constant (using a different
output register).

Since the check for egraph mode is now no longer performed in
put_value_in_regs, the Lower::flags member becomes obsolete.

Care needs to be taken that other calls directly to the
Lower::put_value_in_regs routine now handle the fact that
no more rematerialization is performed.  All such calls in
target code already historically handle constants themselves.
The remaining call site in the ISLE gen_call_common helper
can be redirected to the ISLE put_in_regs callback.

The existing target implementations of gen_constant are then
unused and can be removed.  (In some target there may still
be further opportunities to remove duplication between ISLE
and some local Rust code - this can be left to future patches.)
2022-12-13 13:00:04 -08:00
Saúl Cabrera
a76e0e8aa5 Decouple MachBufferFinalized<Stencil> from ir::FunctionParameters (#5419)
This commit allows retrieving a `MachBufferFinalized<Final>` from a
`MachBufferFinalized<Stencil>` without relying on
`ir::FunctionParameters`. Instead it uses the function's base source location,
which is the only piece used by the previous `apply_params` definition.

This change allows other uses cases (e.g. Winch) to use an opaque, common
concept, exposed outside of `cranelift-codegen` to get the finalized state of
the machine buffer. This change implies that Winch will transitively know about
the `Stencil` compilation phase, but the `Stencil` phase is not exposed to Winch.

Other alternatives considered:

Parametrizing `MachBufferFinzalized` in a way such that it allows specifying
which compilation phase the caller is targetting. Such approach would require
also parametrizing the `MachSrcLoc` definition. One of the main drawbacks of
this approach is that it also requires changing how the `MachBuffer`'s
`start_srcloc` works: for caller requesting a `Final` `MachBufferFinalized`, the
`MachBuffer` will need to work in terms of `SourceLoc` rather than in
`RelSourceLoc` terms. This approach doesn't necessarily present more advantages
than the approach presented in this change, in which there's no need to make any
fundamental changes and in which all the `cranelift-codegen` primitives are
already exposed.
2022-12-13 07:13:45 -05:00
Timothy Chen
122872fb0c Remove references for sig (#5414) 2022-12-12 08:46:23 -08:00
Timothy Chen
8035945502 Reduce sig data size by changing sized spaces (#5402)
* Reduce sig sizes

* Fix test

* Change compute_args_loc to return u32
2022-12-11 15:32:30 -08:00
Ulrich Weigand
e913cf3647 Remove IFLAGS/FFLAGS types (#5406)
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed.  This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.

Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
  machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
  isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.

All of these features are currently unused; no functional change
intended by this patch.

This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
2022-12-09 13:42:03 -08:00
Chris Fallin
8c55b81300 Optimizations to egraph framework (#5391)
* Optimizations to egraph framework:

- Save elaborated results by canonical value, not latest value (union
  value). Previously we were artificially skipping and re-elaborating
  some values we already had because we were not finding them in the
  map.

- Make some changes to handling of icmp results: when icmp became
  I8-typed (when bools went away), many uses became `(uextend $I32 (icmp
  $I8 ...))`, and so patterns in lowering backends were no longer
  matching.

  This PR includes an x64-specific change to match `(brz (uextend (icmp
  ...)))` and similarly for `brnz`, but it also takes advantage of the
  ability to write rules easily in the egraph mid-end to rewrite selects
  with icmp inputs appropriately.

- Extend constprop to understand selects in the egraph mid-end.

With these changes, bz2.wasm sees a ~1% speedup, and spidermonkey.wasm
with a fib.js input sees a 16.8% speedup:

```
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.base.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  2.14s user 0.01s system 99% cpu 2.148 total
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.egraphs.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  1.78s user 0.01s system 99% cpu 1.788 total
```

* Review feedback.
2022-12-07 13:23:13 -08:00
Trevor Elliott
c5379051c4 Enable the ssa verifier in debug builds (#5354)
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
2022-12-07 12:22:51 -08:00
Jamey Sharp
29b23d41b6 ISLE rule cleanups (#5389)
* cranelift-codegen: Use ISLE matching, not same_value

The `same_value` function just wrapped an equality test into an external
constructor, but we can do that with ISLE's equality constraints
instead.

* riscv64: Remove custom condition-code tests

The `lower_icmp` term exists solely to decide whether to sign-extend or
zero-extend the comparison operands, based on whether the condition code
requires a signed comparison. It additionally tested whether the
condition code was == or !=, but produced the same result as for other
unsigned comparisons.

We already have `signed_cond_code` in the ISLE prelude, which classifies
the total-ordering condition codes according to whether they're signed.
It also lumps == and != in the "unsigned" camp, as desired.

So this commit uses the existing method from the prelude instead of
riscv64-local definitions.

Because this version has no constraints on the left-hand side of the
rule in the unsigned case, ISLE generates Rust that always returns
`Some`. That shows that the current use of `unwrap` is justified, at the
only Rust-side call-site of `constructor_lower_icmp`, which is in
cranelift/codegen/src/isa/riscv64/lower/isle.rs.

* ISLE prelude: make offset32 infallible

This extractor always returns `Some`, so it doesn't need to be fallible.
2022-12-07 02:55:59 +00:00
Chris Fallin
f980defe17 egraph support: rewrite to work in terms of CLIF data structures. (#5382)
* egraph support: rewrite to work in terms of CLIF data structures.

This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.

The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one.  This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.

Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).

Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.

The implementation performs two passes overall:

- One, a forward pass in RPO (to see defs before uses), that (i) removes
  "pure" instructions from the layout and (ii) optimizes as it goes. As
  before, we eagerly optimize, so we form the entire union of optimized
  forms of a value before we see any uses of that value. This lets us
  rewrite uses to use the most "up-to-date" form of the value and
  canonicalize and optimize that form.

  The eager rewriting and acyclic representation make each other work
  (we could not eagerly rewrite if there were cycles; and acyclicity
  does not miss optimization opportunities only because the first time
  we introduce a value, we immediately produce its "best" form). This
  design choice is also what allows us to avoid the "parent pointers"
  and fixpoint loop of traditional egraphs.

  This forward optimization pass keeps a scoped hashmap to "intern"
  nodes (thus performing GVN), and also interleaves on a per-instruction
  level with alias analysis. The interleaving with alias analysis allows
  alias analysis to see the most optimized form of each address (so it
  can see equivalences), and allows the next value to see any
  equivalences (reuses of loads or stored values) that alias analysis
  uncovers.

- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
  back into the layout, possibly in multiple places if needed. This
  tracks the loop nest and hoists nodes as needed, performing LICM as it
  goes. Note that by doing this in forward order, we avoid the
  "fixpoint" that traditional LICM needs: we hoist a def before its
  uses, so when we place a node, we place it in the right place the
  first time rather than moving later.

This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.

On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).

Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* Review feedback.

* Review feedback.

* Review feedback.

* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2022-12-06 14:58:57 -08:00
Trevor Elliott
d54a27d0ea Allocate temporary intermediates when loading constants on aarch64 (#5366)
As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.
2022-12-01 22:29:36 +00:00
Alex Crichton
830885383f Implement inline stack probes for AArch64 (#5353)
* Turn off probestack by default in Cranelift

The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.

* aarch64: Improve codegen for AMode fallback

Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used but both
registers are 64-bit so shouldn't be affected by the extending
operation.

* aarch64: Implement inline stack probes

This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.

* Enable inline probestack for AArch64 and Riscv64

This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.

* Enable probestack for aarch64 in cranelift-fuzzgen

* Address review comments

* Remove implicit stack overflow traps from x64 backend

This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.

This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.

* Fix a style issue
2022-11-30 12:30:00 -06:00
Timothy Chen
67fc5389b0 Remove sig data arg and ret fields to reduce size (#5319)
* Remove sig data arg and ret fields to reduce size

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix offsets

* Add comment

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-30 07:19:41 -08:00
Trevor Elliott
f138fc0ed3 Bump regalloc2 to 0.5.0 (#5345)
* Bump the regalloc2 dependency to 0.5.0
* Replace preg_set_from_machine_env with PRegSet::from
* Vet the regalloc2 update
2022-11-29 11:25:35 -08:00
Trevor Elliott
a5a0645aff Don't allow reuse_def constraints in the s390x Loop instruction (#5336) 2022-11-28 17:52:11 -08:00
Nick Fitzgerald
58a5089e48 Cranelift: log number of CLIF insts/blocks to optimize/lower (#5333) 2022-11-28 19:35:29 +00:00
Nick Fitzgerald
6fe69d00ca Cranelift: add debug logs counting how many vcode instructions and blocks we lower to (#5332) 2022-11-28 18:57:02 +00:00
Trevor Elliott
54a6d2f79a Generate more fixed_nonallocatable constraints, and add debug assertions (#5132)
Add assertions to the OperandCollector that show we're not using pinned vregs, and use reg_fixed_nonallocatable constraints when a real register is used with other constraint generation functions like reg_use etc.
2022-11-28 10:31:56 -08:00
Timothy Chen
48ee42efc2 Refactor Sigdata methods with sigset (#5307)
* Refactor sigdata methods

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Address comments

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-22 09:03:51 -08:00
Trevor Elliott
54cfa4df34 cranelift: Fix implicit pointer argument register use (#5301)
* Fix arg handling to write to VRegs instead of physical regs

* Make is_included_in_clobbers required, and handle Args on x64 and riscv64
2022-11-18 16:47:03 -08:00
Timothy Chen
de6e4a4e20 Shrink the size of SigData in Cranelift (#5261)
* Shrink the size of SigData in Cranelift

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Change ret arg length to u16

* Add test

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-17 00:15:19 +00:00
Trevor Elliott
a007e02bd2 Add fixed_nonallocatable constraints when appropriate (#5253)
Plumb the set of allocatable registers through the OperandCollector and use it validate uses of fixed-nonallocatable registers, like %rsp on x86_64.
2022-11-15 12:49:17 -08:00
Denys Zadorozhnyi
d3692c2f2b fix typo in caller_conv arg name in ABIMachineSpec::gen_call; (#5259) 2022-11-14 09:02:07 -08:00
Trevor Elliott
0367fbc2d4 cranelift: Rework pinned register lowering (#5249)
Rework pinned register lowering to avoid the use of pinned virtual registers, instead using the MovFromPReg and MovToPReg pseudo instructions.
2022-11-10 16:19:25 -08:00
Trevor Elliott
b077854b57 Generate SSA code from returns (#5172)
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.
2022-11-08 16:00:49 -08:00
Trevor Elliott
70bca801ab cranelift: Resize with types::INVALID isntead of types::I8 (#5227) 2022-11-08 20:42:20 +00:00