Commit Graph

636 Commits

Alex Crichton
fcddb9ca81 x64: Add lea-based lowering for iadd (#5986)
* x64: Refactor `Amode` computation in ISLE

This commit replaces the previous computation of `Amode` with a
different set of rules that are intended to achieve the same purpose but
are structured differently. The motivation for this commit becomes more
relevant in the next commit, where `lea` may be used to lower the `iadd`
instruction on x64. Doing so caused a stack overflow in the test suite
during the compilation phase of a wasm module, specifically in the
`amode_add` function. This function is recursively defined in terms of
itself and recurses as deep as the deepest `iadd` chain in a program. A
particular test in our test suite has a 10k-long chain of `iadd`
instructions, which ended up causing a stack overflow in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE
unconditionally peels all the `iadd` nodes away and looks at all of
them, even if most end up in intermediate registers along the way. Given
that structure I couldn't find a way to easily abort the recursion. The
new `to_amode` helper is structured in a similar fashion but only
recurses far enough to fold items into the final `Amode`, rather than
recursing through items which themselves don't end up in the `Amode`.
Put another way: where the old `amode_add` helper might emit `x64_add`
instructions along the way, the new helper does not.

The goal of this commit is to preserve all the original `Amode`
optimizations, however. For some parts, though, it relies more on the
egraph optimizations having run, since if an `iadd` is 10k deep it won't
try to find a constant buried 9k levels down to fold into the `Amode`.
The hope is that the egraph pass has already shuffled constants to the
right most of the time and folded together whatever it could.
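
As a rough illustration of the restructuring (the real helpers are ISLE
rules; `Expr`, `Amode`, `to_amode`, and `materialize` below are
illustrative stand-ins, not Cranelift's actual API), the new shape folds
only what lands in the final `Amode` and bails out to a register
otherwise:

```rust
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Const(i32),
    Value(u32), // stands in for an SSA value living in a register
}

struct Amode { base: Option<u32>, index: Option<u32>, offset: i32 }

// Stand-in for putting a compound expression in a register.
fn materialize(_e: &Expr) -> u32 { 99 }

// Fold at most a base, an index, and a constant offset into the `Amode`;
// anything deeper is materialized into a register rather than recursed
// through, so recursion depth stays bounded regardless of how deep the
// `iadd` chain is.
fn to_amode(e: &Expr) -> Amode {
    match e {
        Expr::Add(a, b) => match (&**a, &**b) {
            (Expr::Value(b0), Expr::Const(off)) =>
                Amode { base: Some(*b0), index: None, offset: *off },
            (Expr::Value(b0), Expr::Value(ix)) =>
                Amode { base: Some(*b0), index: Some(*ix), offset: 0 },
            _ => Amode { base: Some(materialize(e)), index: None, offset: 0 },
        },
        other => Amode { base: Some(materialize(other)), index: None, offset: 0 },
    }
}

fn main() {
    let e = Expr::Add(Box::new(Expr::Value(0)), Box::new(Expr::Const(16)));
    let m = to_amode(&e);
    assert_eq!((m.base, m.index, m.offset), (Some(0), None, 16));
}
```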

* x64: Add `lea`-based lowering for `iadd`

This commit adds a rule for the lowering of `iadd` to use `lea` for 32
and 64-bit addition. The theoretical benefit of `lea` over the `add`
instruction is that the `lea` variant can emulate a 3-operand
instruction which doesn't destructively modify one of its operands.
Additionally the `lea` operation can fold in other components such as
constant additions and shifts.

In practice, however, if `lea` is unconditionally used instead of `add`
it ends up losing 10% performance on a local `meshoptimizer` benchmark.
My best guess as to what's going on here is that my CPU's dedicated
units for address computation are all overloaded while the ALUs sit
basically idle in a memory-intensive loop. Previously, when the ALU was
used for `add` and the address units for stores/loads, things in theory
pipelined better (most of this is me shooting in the dark). To prevent
the performance loss here I've updated the lowering of `iadd` to
sometimes use `lea` and sometimes use `add`, depending on how
"complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm`
continue to use `add` (along with the extra `mov` into the result that
it may require). More complicated ones like `a + b + $imm` or
`a + b << c + $imm` use `lea` as it can remove the need for extra
instructions. Locally at least this fixes the performance loss relative
to unconditionally using `lea`.
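
A hedged sketch of that heuristic in plain Rust (the field names and the
exact threshold are assumptions for illustration, not the actual ISLE
rule):

```rust
struct Amode {
    has_base: bool,
    has_index: bool, // possibly shifted, as in `a + b << c`
    shift: u8,
    offset: i32,
}

// `a + b` and `a + $imm` stay on the ALU via `add`; an addressing mode
// that combines three or more components (or a shifted index) is where
// `lea` actually saves instructions.
fn use_lea(a: &Amode) -> bool {
    let parts =
        a.has_base as u32 + a.has_index as u32 + (a.offset != 0) as u32;
    parts >= 3 || a.shift > 0
}

fn main() {
    // `a + b`: keep `add`.
    assert!(!use_lea(&Amode { has_base: true, has_index: true, shift: 0, offset: 0 }));
    // `a + b + $imm`: use `lea`.
    assert!(use_lea(&Amode { has_base: true, has_index: true, shift: 0, offset: 16 }));
}
```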

One note is that this adds an `OperandSize` argument to the
`MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit
`lea` in addition to the preexisting 64-bit encoding.

* Conditionally use `lea` based on regalloc
2023-03-15 17:14:25 +00:00
Alex Crichton
5c1b468648 x64: Migrate {s,u}{div,rem} to ISLE (#6008)
* x64: Add precise-output tests for div traps

This adds a suite of `*.clif` files which are intended to test the
`avoid_div_traps=true` compilation of the `{s,u}{div,rem}` instructions.

* x64: Remove conditional regalloc in `Div` instruction

Move the 8-bit `Div` logic into a dedicated `Div8` instruction to avoid
having conditionally-used registers with respect to regalloc.

* x64: Migrate non-trapping `udiv`/`urem` to ISLE

* x64: Port checked `udiv` to ISLE

* x64: Migrate urem entirely to ISLE

* x64: Use `test` instead of `cmp` to compare-to-zero

* x64: Port `sdiv` lowering to ISLE

* x64: Port `srem` lowering to ISLE

* Tidy up regalloc behavior and fix tests

* Update docs and winch

* Review comments

* Reword again

* More refactoring test fixes

* More test fixes
2023-03-14 01:44:06 +00:00
Alex Crichton
e2a6fe99c2 x64: Add shuffle specialization for palignr (#5999)
* x64: Add `shuffle` specialization for `palignr`

This commit adds specializations for the `palignr` instruction to the
x64 backend to specialize some more patterns of byte shuffles.

* Fix tests
2023-03-13 21:01:24 +00:00
Alex Crichton
03b5dbb3e0 aarch64: Use VCodeConstant for f64/v128 constants (#5997)
* aarch64: Translate float and splat lowering to ISLE

I was looking into `constant_f128` and its fallback lowering into memory
and to get familiar with the code I figured it'd be good to port some
Rust logic to ISLE. This commit ports the `constant_{f128,f64,f32}`
helpers into ISLE from Rust as well as the `splat_const` helper which
ended up being closely related.

Tests reflect a number of regalloc changes, but one major difference is
that in the lowering of `f32` a 32-bit immediate is now created instead
of a 64-bit immediate (in a GP register, before it's moved into an FP
register). This has no semantic effect, but the generated code is
slightly different in a few minor cases.

* aarch64: Load f64/v128 constants from a pool

This commit removes the `LoadFpuConst64` and `LoadFpuConst128`
pseudo-instructions from the AArch64 backend which internally loaded a
nearby constant and then jumped over it. Constants now go through the
`VCodeConstant` infrastructure which gets placed at the end of the
function, similar to how x64 works. Some minor support was also added
for a new addressing mode: a `MachLabel`-relative load.
2023-03-13 19:33:52 +00:00
Alex Crichton
6ecdc2482e x64: Improve memory support in {insert,extract}lane (#5982)
* x64: Improve memory support in `{insert,extract}lane`

This commit adds support to Cranelift to emit `pextr{b,w,d,q}`
with a memory destination, merging a store-of-extract operation into one
instruction. Additionally AVX support is added for the `pextr*`
instructions.

I've additionally tried to ensure that codegen tests and runtests exist
for all forms of these instructions too.

* Add missing commas

* Fix tests
2023-03-13 19:30:44 +00:00
Saúl Cabrera
d03612c2d9 cranelift-codegen(x64): Expose CallInfo (#6005)
This commit exposes the `CallInfo` struct, needed by Winch to emit function
calls.
2023-03-13 17:50:53 +00:00
Alex Crichton
0ec7b872fa x64: Optimize store-of-extract-lane-0 (#5924)
* x64: Optimize store-of-extract-lane-0

The `movss` and `movsd` instructions can be used to store the 0th lane
of a `t32x4` or a `t64x2` vector into memory, enabling fusing a `store`
and an `extractlane` instruction.

* Fix merge conflict with `main`
2023-03-10 01:06:38 +00:00
Alex Crichton
83f21e784a x64: Add more support for more AVX instructions (#5931)
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone suffices. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.
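
As a sketch of when this applies (illustrative Rust, not the actual
lowering code): a 16-byte `shuffle` mask is `pshufd`-compatible when it
moves whole, aligned 32-bit lanes drawn from a single input:

```rust
// Returns the `pshufd` immediate if the byte-shuffle mask is expressible
// as a 32-bit-lane permutation of the first input; `None` otherwise.
fn pshufd_imm(mask: [u8; 16]) -> Option<u8> {
    let mut imm = 0u8;
    for lane in 0..4 {
        let base = mask[lane * 4];
        // The lane must start 4-byte aligned and come from the first
        // input (mask bytes 0..16 select the first vector).
        if base % 4 != 0 || base >= 16 {
            return None;
        }
        // The four bytes must be one contiguous 32-bit lane.
        if (1..4).any(|i| mask[lane * 4 + i] != base + i as u8) {
            return None;
        }
        imm |= (base / 4) << (lane * 2);
    }
    Some(imm)
}

fn main() {
    // Reverse the four 32-bit lanes: `pshufd` immediate 0b00_01_10_11.
    let mask = [12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3];
    assert_eq!(pshufd_imm(mask), Some(0b00_01_10_11));
}
```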

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
interleave the high or low 32-bit and 64-bit lanes of two vectors. This
corresponds to the preexisting specific lowerings for interleaving the 8
and 16-bit lanes.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector, which means that while it's not generally applicable it
should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions, which
permute the 16-bit values in either the upper or lower half of a 128-bit
vector.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor` instruction
and then use `pshufb` to do the broadcast.
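
For reference, here is a small Rust model of the `pshufb` semantics that
make this work: an all-zeros mask register selects byte 0 for every
lane, i.e. it broadcasts the first byte:

```rust
// Model of `pshufb` semantics (illustrative, not Cranelift code).
fn pshufb(a: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        // The high bit of a mask byte zeroes the lane; otherwise the
        // low four bits index into `a`.
        out[i] = if mask[i] & 0x80 != 0 { 0 } else { a[(mask[i] & 0xf) as usize] };
    }
    out
}

fn main() {
    let v = *b"abcdefghijklmnop";
    // A zeroed mask broadcasts byte 0 across the whole vector.
    assert_eq!(pshufb(v, [0; 16]), [b'a'; 16]);
}
```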

* Review comments

* x64: Add an AVX encoding for the `pshufd` instruction

This will benefit from lack of need for alignment vs the `pshufd`
instruction if working with a memory operand and additionally, as I've
just learned, this reduces dependencies between instructions because the
`v*` instructions zero the upper bits as opposed to preserving them
which could accidentally create false dependencies in the CPU between
instructions.

* x64: Add more support for AVX loads/stores

This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument, which led to a small refactoring of
the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.

* x64: Enable using AVX instructions for zero regs

This commit refactors the internal ISLE helpers for creating zeroed
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions
with a bit of reorganization.

* x64: Remove `XmmConstOp` as an instruction

All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction so there's no longer any need for these otherwise
constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.

* Review comments

* Update test expectations
2023-03-09 23:57:42 +00:00
Alex Crichton
1c3a1bda6c x64: Add a smattering of lowerings for shuffle specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone suffices. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
interleave the high or low 32-bit and 64-bit lanes of two vectors. This
corresponds to the preexisting specific lowerings for interleaving the 8
and 16-bit lanes.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector, which means that while it's not generally applicable it
should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions, which
permute the 16-bit values in either the upper or lower half of a 128-bit
vector.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor` instruction
and then use `pshufb` to do the broadcast.

* Review comments
2023-03-09 22:58:19 +00:00
Chris Fallin
7f3500a172 Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s. (#5972)
* Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s.

@avanhatt has been looking at our address-mode lowering and found an
example where feeding an `I32`-typed address into a load or store can
violate assumptions and produce incorrect codegen.

This should never be reachable in practice, because all producers on
64-bit architectures use 64-bit types for addresses. However, our IR is
insufficiently constrained, and allows loads/stores to `I32` addresses
as well. This is nonsensical on a 64-bit architecture.

Initially I had thought we should tighten either the instruction
definition's accepted types, or the CLIF verifier, to reject this.
However both are target-independent, and we don't want to bake
an assumption of 64-bit-ness into the compiler core. Instead this PR
tightens specific backends' lowerings to reject loads/stores of
`I32`-typed addresses.

tl;dr: no security implications as all producers use I64-typed
addresses (and must, for correct operation); but we currently accept
I32-typed addresses too, and this breaks other assumptions.

* Allow R64 as well as I64 types.

* Add an explicit extractor to match 64-bit address types.
2023-03-09 19:08:16 +00:00
Alex Crichton
63fb30e4b4 Merge pull request from GHSA-ff4p-7xrq-q5r8
* x64: Remove incorrect `amode_add` lowering rules

This commit removes two incorrect rules as part of the x64 backend's
computation of addressing modes. These two rules folded a zero-extended
32-bit computation into the address mode operand. This isn't correct:
the 32-bit computation should be truncated to 32 bits, but when folded
into the address mode it is performed with 64-bit operands, so the
truncation never happens.
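
A concrete illustration of the wrapping behavior at stake (plain Rust,
standing in for the generated code):

```rust
fn main() {
    let x: u32 = 0xffff_fff0;
    let y: u32 = 0x20;
    // Correct: the 32-bit add wraps, then the result is zero-extended.
    let correct = x.wrapping_add(y) as u64;
    // Buggy fold: the add happens with 64-bit operands, so it never wraps.
    let folded = x as u64 + y as u64;
    assert_eq!(correct, 0x10);
    assert_eq!(folded, 0x1_0000_0010);
    assert_ne!(correct, folded);
}
```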

* Add release notes
2023-03-08 13:00:40 -06:00
Alex Crichton
5dc2bbccbb Merge pull request from GHSA-xm67-587q-r2vw
This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16 and above are remapped to
select from the first vector, since the first and second inputs are the
same, but the subtraction accidentally used 15 rather than 16.
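
The remapping in question, illustrated in plain Rust:

```rust
fn main() {
    // When both shuffle inputs are the same vector, a mask lane in
    // 16..=31 can be redirected to the first input by subtracting 16.
    let mask_lane = 16u8; // selects byte 0 of the second input
    assert_eq!(mask_lane - 16, 0); // correct: byte 0 of the first input
    assert_eq!(mask_lane - 15, 1); // off-by-one: the wrong byte
}
```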
2023-03-08 13:00:00 -06:00
Kevin Rizzo
013b35ff32 winch: Refactoring wasmtime compiler integration pieces to share more between Cranelift and Winch (#5944)
* Enable the native target by default in winch

Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.

* Refactor Cranelift's ISA builder to share more with Winch

This commit refactors the `Builder` type to have a type parameter
representing the finished ISA, with Cranelift and Winch each defining
their own `Builder` typedef. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.

* Moving compiler shared components to a separate crate

* Restore native flag inference in compiler building

This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.

* Move `Compiler::page_size_align` into wasmtime-environ

The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.

* Fill out `Compiler::{flags, isa_flags}` for Winch

These are easy enough to plumb through with some shared code for
Wasmtime.

* Plumb the `is_branch_protection_enabled` flag for Winch

Just forwarding an isa-specific setting accessor.

* Moving executable creation to shared compiler crate

* Adding builder back in and removing from shared crate

* Refactoring the shared pieces for the `CompilerBuilder`

I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.

With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.

I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstraction: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.

* Documentation updates, and renaming the finish method

* Adding back in a default temporarily to pass tests, and removing some unused imports

* Fixing winch tests with wrong method name

* Removing unused imports from codegen shared crate

* Apply documentation formatting updates

Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>

* Adding back in cranelift_native flag inferring

* Adding new shared crate to publish list

* Adding write feature to pass cargo check

---------

Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
2023-03-08 15:07:13 +00:00
Alex Crichton
07518dfd36 Remove the Cranelift vselect instruction (#5918)
* Remove the Cranelift `vselect` instruction

This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:

* x64 - uses the high bit of each lane for `f32x4` and `f64x2`, and
  otherwise uses the high bit of each byte, doing a byte-wise lane select
  rather than one based on the controlling type.

* AArch64 - this is the same as `bitselect` which is a bit-wise
  selection rather than a lane-wise selection.

* s390x - this is the same as AArch64, a bit-wise selection rather than
  lane-wise.

* interpreter - the interpreter implements the documented semantics of
  selecting based on "truthy" values.

Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.
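
The semantic difference, shown on a single 8-bit "lane" in plain Rust
(`bitselect` and `vselect_lane` here are models of the documented
semantics, not Cranelift code):

```rust
// `bitselect` chooses individual bits from the condition mask.
fn bitselect(c: u8, x: u8, y: u8) -> u8 {
    (x & c) | (y & !c)
}

// The documented `vselect` chose whole lanes by truthiness.
fn vselect_lane(c: u8, x: u8, y: u8) -> u8 {
    if c != 0 { x } else { y }
}

fn main() {
    // A condition that is truthy but not all-ones distinguishes the two.
    assert_eq!(bitselect(0x0f, 0xaa, 0x55), 0x5a);
    assert_eq!(vselect_lane(0x0f, 0xaa, 0x55), 0xaa);
}
```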

Given this situation this commit subsequently removes `vselect` and all
usage of it throughout Cranelift.

Closes #5917

* Review comments

* Bring back vselect opts as bitselect opts

* Clean up vselect usage in the interpreter

* Move bitcast in nan canonicalization

* Add a comment about float optimization
2023-03-08 00:42:05 +00:00
Alex Crichton
8bb183f16e Implement the relaxed SIMD proposal (#5892)
* Initial support for the Relaxed SIMD proposal

This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is implemented for the x64 and
AArch64 backends at this time.

The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.

A summary of changes made in this commit are:

* Lowerings for all relaxed simd opcodes have been added, currently all
  exhibiting deterministic behavior. This means that few lowerings are
  optimal on the x86 backend, but on the AArch64 backend, for example,
  all lowerings should be optimal.

* Support is added to codegen to, eventually, conditionally generate
  different code based on input codegen flags. This is intended to
  enable codegen to more efficient instructions on x86 by default, for
  example, while still allowing embedders to force
  architecture-independent semantics and behavior. One good example of
  this is the `f32x4.relaxed_fmadd` instruction, which when deterministic
  forces a true `fma` operation; otherwise, if the backend doesn't
  support `fma`, separate multiply and add operations are performed
  instead.

* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
  x86 backend as they're now exercised by the deterministic lowerings of
  relaxed simd instructions.

* Sample codegen tests were added for x86 and aarch64 for some relaxed
  simd instructions.

* Wasmtime embedder support for the relaxed-simd proposal and forcing
  determinism have been added to `Config` and the CLI.

* Support has been added to the `*.wast` runtime execution for the
  `(either ...)` matcher used in the relaxed-simd proposal.

* Tests for relaxed-simd are run both with a default `Engine` as well as
  a "force deterministic" `Engine` to test both configurations.

* All tests from the upstream repository were copied into Wasmtime.
  These tests should be deleted when WebAssembly/testsuite is updated.

* x64: Add x86-specific lowerings for relaxed simd

This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.

* Add libcall support for fma to Wasmtime

This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
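
A small demonstration of why the libcall matters for determinism: a
fused multiply-add rounds once, while the multiply-then-add fallback
rounds twice, and the results can differ in the low bits:

```rust
fn main() {
    let (a, b, c) = (0.1f64, 10.0f64, -1.0f64);
    let fused = a.mul_add(b, c); // one rounding (fma instruction/libcall)
    let unfused = a * b + c;     // two roundings (multiply, then add)
    // This difference is exactly the nondeterminism the relaxed-simd
    // proposal permits, and what forcing determinism papers over.
    assert_ne!(fused, unfused);
    println!("fused = {fused:e}, unfused = {unfused:e}");
}
```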

* Ignore relaxed-simd tests on s390x and riscv64

* Enable relaxed-simd tests on s390x

* Update cranelift/codegen/meta/src/shared/instructions.rs

Co-authored-by: Andrew Brown <andrew.brown@intel.com>

* Add a FIXME from review

* Add notes about deterministic semantics

* Don't default `has_native_fma` to `true`

* Review comments and rebase fixes

---------

Co-authored-by: Andrew Brown <andrew.brown@intel.com>
2023-03-07 15:52:41 +00:00
Alex Crichton
52b4c48a1b x64: Improve codegen for i8x16.shr_u (#5906)
This catches a case that wasn't handled previously by #5880 to allow a
constant load to be folded into an instruction rather than forcing it to
be loaded into a temporary register.
2023-03-02 05:43:42 +00:00
Alex Crichton
f05babc744 x64: Add shuffle cases for punpck{h,l}bw (#5905)
* x64: Add `shuffle` cases for `punpck{h,l}bw`

I noticed this difference between LLVM and Cranelift for something I was
looking at recently, and while it's probably not all that common I
figured I'd add it here since it should be somewhat useful nevertheless.

* Review feedback

* Use u128 extractor instead
2023-03-01 21:49:00 +00:00
Alex Crichton
e0ef0b7c72 x64: Add support for phadd{w,d} instructions (#5896)
This commit adds support for the bare lowering of the `iadd_pairwise`
instruction with `i16x8` and `i32x4` types on the x64 backend. These
lowerings are achieved with the `phaddw` and `phaddd` instructions,
respectively. AVX encodings of these instructions are added as well.

The motivation for these new lowerings comes from the relaxed-simd
proposal which will use them in the deterministic lowering of some
instructions on the x64 backend.
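
For reference, a Rust model of the `iadd_pairwise` semantics that
`phaddw` implements for `i16x8` (a sketch of the documented behavior,
not Cranelift code):

```rust
// Adjacent lanes of each input are summed, with `x`'s sums in the low
// half of the result and `y`'s sums in the high half.
fn iadd_pairwise(x: [i16; 8], y: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..4 {
        out[i] = x[2 * i].wrapping_add(x[2 * i + 1]);
        out[i + 4] = y[2 * i].wrapping_add(y[2 * i + 1]);
    }
    out
}

fn main() {
    let x = [1, 2, 3, 4, 5, 6, 7, 8];
    assert_eq!(iadd_pairwise(x, x), [3, 7, 11, 15, 3, 7, 11, 15]);
}
```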
2023-02-28 23:35:53 +00:00
Alex Crichton
f2dce812c3 x64: Sink constant loads into xmm instructions (#5880)
A number of places in the x64 backend make use of 128-bit constants for
various wasm SIMD-related instructions although most of them currently
use the `x64_xmm_load_const` helper to load the constant into a
register. Almost all xmm instructions, however, enable using a memory
operand which means that these loads can be folded into instructions to
help reduce register pressure. Automatic conversions were added for a
`VCodeConstant` into an `XmmMem` value and then explicit loads were all
removed in favor of forwarding the `XmmMem` value directly to the
underlying instruction. Note that some instances of `x64_xmm_load_const`
remain since they're used in contexts where load sinking won't work
(e.g. they're the first operand, not the second for non-commutative
instructions).
2023-02-27 22:02:42 +00:00
Alex Crichton
9b86a0b9b1 Remove the widening_pairwise_dot_product_s clif instruction (#5889)
This was added for the wasm SIMD proposal, but I've been poking around
at this recently and the instruction can instead be represented by its
component parts with, I believe, the same semantics. This commit removes
the instruction, instead representing it with the existing
`iadd_pairwise` instruction (among others), and updates backends with
new pattern matches to keep the same codegen as before.
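
A sketch of the equivalence being relied on (plain Rust rather than
CLIF; `dot_direct` models the removed instruction and `dot_decomposed`
the replacement built from sign-widening, multiplies, and
`iadd_pairwise`):

```rust
// Per i32 lane: x[2i]*y[2i] + x[2i+1]*y[2i+1] (x64's `pmaddwd`).
fn dot_direct(x: [i16; 8], y: [i16; 8]) -> [i32; 4] {
    std::array::from_fn(|i| {
        x[2 * i] as i32 * y[2 * i] as i32
            + x[2 * i + 1] as i32 * y[2 * i + 1] as i32
    })
}

fn dot_decomposed(x: [i16; 8], y: [i16; 8]) -> [i32; 4] {
    // swiden_low/swiden_high plus imul on the widened halves...
    let lo: [i32; 4] = std::array::from_fn(|i| x[i] as i32 * y[i] as i32);
    let hi: [i32; 4] = std::array::from_fn(|i| x[i + 4] as i32 * y[i + 4] as i32);
    // ...then iadd_pairwise(lo, hi).
    std::array::from_fn(|i| {
        if i < 2 {
            lo[2 * i] + lo[2 * i + 1]
        } else {
            hi[2 * (i - 2)] + hi[2 * (i - 2) + 1]
        }
    })
}

fn main() {
    let (x, y) = ([1i16, -2, 3, -4, 5, -6, 7, -8], [8i16, 7, 6, 5, 4, 3, 2, 1]);
    assert_eq!(dot_direct(x, y), dot_decomposed(x, y));
}
```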

This interestingly entirely removed the codegen rule with no replacement
on the AArch64 backend as the existing rules all existed to produce the
same codegen.
2023-02-27 18:43:43 +00:00
Afonso Bordado
36e92add6f riscv64: Move is_null/is_invalid to ISLE (#5874)
* riscv64: Move `is_null`/`is_invalid` to ISLE

* riscv64: Fix `is_invalid` codegen

* Implement review suggestions

Thanks!

Co-authored-by: Jamey Sharp <jamey@minilop.net>

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-25 12:48:44 +00:00
Jamey Sharp
7d790fcdfe x64: Only branch once in br_table (#5850)
This uses the `cmov`, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.
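
A sketch of the resulting control flow in plain Rust (illustrative only;
`br_table_target` is not the emitted code):

```rust
// Clamp the index with a conditional move instead of branching on the
// bounds check, with the default target stored as the last table entry.
fn br_table_target(targets: &[usize], default: usize, idx: usize) -> usize {
    // Conceptually the `cmov`: any out-of-range index maps to the slot
    // holding the default target.
    let clamped = idx.min(targets.len());
    let table: Vec<usize> = targets.iter().copied().chain([default]).collect();
    table[clamped] // the single (indirect) branch
}

fn main() {
    assert_eq!(br_table_target(&[10, 20, 30], 99, 1), 20);
    assert_eq!(br_table_target(&[10, 20, 30], 99, 7), 99); // out of range
}
```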

Since there isn't a bounds-check branch any more, this sequence no
longer needs Spectre mitigation. And since we don't need to be careful
about preserving flags, half the instructions can be removed from this
pseudoinstruction and emitted as regular instructions instead.

This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.

My benchmark results show a very small effect on runtime performance
with this change.

The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.

The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executes a few more instructions per br_table,
but maybe the branches were predicted better.
2023-02-24 04:46:38 +00:00
Trevor Elliott
c5d9d5b10f Remove module-level code generation tests (#5870)
* Remove module-level code generation tests

* Add cold block tests for each backend

* Better cold block tests
2023-02-24 01:19:26 +00:00
Alex Crichton
3fc3bc9ec8 x64: Fill out more AVX instructions (#5849)
* x64: Fill out more AVX instructions

This commit fills out AVX counterparts for more of the SSE instructions
currently used. Many of these instructions do not benefit from the
3-operand form that AVX uses, but instead benefit from being able to use
`XmmMem` instead of `XmmMemAligned`, which may avoid some extra
temporary registers in some cases.

* Review comments
2023-02-23 22:31:31 +00:00
Trevor Elliott
8abfe928d6 Reuse the DominatorTree postorder traversal in BlockLoweringOrder (#5843)
* Rework the blockorder module to reuse the dom tree's cfg postorder

* Update domtree tests

* Treat br_table with an empty jump table as multiple block exits

* Bless tests

* Change branch_idx to succ_idx and fix the comment
2023-02-23 22:05:20 +00:00
Alex Crichton
bd3dcd313d x64: Add more fma instruction lowerings (#5846)
The relaxed-simd proposal for WebAssembly adds a fused-multiply-add
operation for `v128` types so I was poking around at Cranelift's
existing support for its `fma` instruction. I was also poking around at
the x86_64 ISA's offerings for the FMA operation and ended up with this
PR that improves the lowering of the `fma` instruction on the x64
backend in a number of ways:

* A libcall-based fallback is now provided for `f32x4` and `f64x2` types
  in preparation for eventual support of the relaxed-simd proposal.
  These encodings are horribly slow, but it's expected that if FMA
  semantics must be guaranteed then it's the best that can be done
  without the `fma` feature. Otherwise it'll be up to producers (e.g.
  Wasmtime embedders) whether wasm-level FMA operations should be FMA or
  multiply-then-add.

* In addition to the existing `vfmadd213*` instructions opcodes were
  added for `vfmadd132*`. The `132` variant is selected based on which
  argument can have a sinkable load.

* Any argument in the `fma` CLIF instruction can now have a
  `sinkable_load` and it'll generate a single FMA instruction.

* All `vfnmadd*` opcodes were added as well. These are pattern-matched
  where one of the arguments to the CLIF instruction is an `fneg`. I
  opted to not add a new CLIF instruction here since it seemed like
  pattern matching was easy enough but I'm also not intimately familiar
  with the semantics here so if that's the preferred approach I can do
  that too.
2023-02-21 20:51:22 +00:00
Alex Crichton
d82ebcc102 x64: Enable load-coalescing for SSE/AVX instructions (#5841)
* x64: Enable load-coalescing for SSE/AVX instructions

This commit unlocks the ability to fold loads into operands of SSE and
AVX instructions. This is beneficial both for function size, when it
happens, and for reducing register pressure.
Previously this was not done because most SSE instructions require
memory to be aligned. AVX instructions, however, do not have alignment
requirements.

The solution implemented here is one recommended by Chris which is to
add a new `XmmMemAligned` newtype wrapper around `XmmMem`. All SSE
instructions are now annotated as requiring an `XmmMemAligned` operand
except for new instruction styles used specifically for instructions
that don't require alignment (e.g. `movdqu`, `*sd`, and
`*ss` instructions). All existing instruction helpers continue to take
`XmmMem`, however. This way if an AVX lowering is chosen it can be used
as-is. If an SSE lowering is chosen, however, then an automatic
conversion from `XmmMem` to `XmmMemAligned` kicks in. This automatic
conversion only fails for unaligned addresses in which case a load
instruction is emitted and the operand becomes a temporary register
instead. A number of prior `Xmm` arguments have now been converted to
`XmmMem` as well.

One change from this commit is that loading an unaligned operand for an
SSE instruction previously would use the "correct type" of load, e.g.
`movups` for f32x4 or `movupd` for f64x2, but now the loading happens in
a context without type information so the `movdqu` instruction is
generated. According to [this stack overflow question][question] it
looks like modern processors won't penalize this "wrong" choice of type
when the operand is then used for f32 or f64 oriented instructions.

Finally this commit improves some reuse of logic in the `put_in_*_mem*`
helper to share code with `sinkable_load` and avoid duplication. With
this in place some various ISLE rules have been updated as well.

In the tests it can be seen that AVX-instructions are now automatically
load-coalesced and use memory operands in a few cases.

[question]: https://stackoverflow.com/questions/40854819/is-there-any-situation-where-using-movdqu-and-movupd-is-better-than-movups

* Fix tests

* Fix move-and-extend to be unaligned

These don't have alignment requirements like other xmm instructions as
well. Additionally add some ISA tests to ensure that their output is
tested.

* Review comments
2023-02-21 19:10:19 +00:00
Alex Crichton
c65de1f1b1 x64: Remove conditional SseOpcode::uses_src1 (#5842)
This is a follow-up to comments in #5795 to remove some cruft in the x64
instruction model to ensure that the shape of an `Inst` reflects what's
going to happen in regalloc and encoding. This accessor was used to
handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones
had already moved to the appropriate `XmmUnary*` variant and `pshufb`
was additionally moved over to that variant as well.

The `pextr*` instructions got a new `Inst` variant and additionally had
their constructors slightly modified to no longer require the type as
input. The encoding for these instructions now automatically handles the
various type-related operands through a new `SseOpcode::Pextrq` operand
to represent 64-bit movements.
2023-02-21 18:17:07 +00:00
Alex Crichton
e6a5ec3fde x64: Tidy up some handling of sinkable loads (#5840)
This commit refactors a bit about how sinkable loads are handled in the
x64 backend. The intention is to bring most handling around sinkable
loads up to date with the current state of the backend since things have
changed since these were originally introduced, namely automatic
conversions between types in ISLE. For example the `Value` type can be
automatically converted to `RegMem` to perform load sinking, but some
rules are still explicitly doing matching themselves.

Here I've removed explicit handling of immediates and sinkable loads
when they're the right-hand-side of an operation. These cases are
already handled by the "base case" when converting a `Value` to a
`RegMemImm`. Instead only rules explicitly for left-hand-side immediates
and sinkable loads remain. This helps cut down on the number of explicit
rules needed.

Additionally in the same manner that `Value` can be automatically
converted to `RegMem` I've added automatic conversions from
`SinkableLoad` to `RegMem` and the various other newtypes. This helps
cut down a bit on rule verbosity where `sink_load_*` is largely no
longer necessary.
2023-02-21 18:15:08 +00:00
Alex Crichton
c26a65a854 x64: Add most remaining AVX lowerings (#5819)
* x64: Add most remaining AVX lowerings

This commit goes through `inst.isle` and adds a corresponding AVX
lowering for most SSE lowerings. I opted to skip instructions where the
SSE lowering didn't read/modify a register, such as `roundps`. I think
that AVX will benefit these instructions when there's load-merging since
AVX doesn't require alignment, but I've deferred that work to a future
PR.

Otherwise though in this PR I think all (or almost all) of the 3-operand
forms of AVX instructions are supported with their SSE counterparts.
This should ideally improve codegen slightly by removing register
pressure and the need for `movdqa` between registers. I've attempted to
ensure that there's at least one codegen test for all the new instructions.

As a side note, the recent capstone integration into `precise-output`
tests helped me catch a number of encoding bugs much earlier than
otherwise, so I've found that incredibly useful in tests!

* Move `vpinsr*` instructions to their own variant

Use true `XmmMem` and `GprMem` types in the instruction as well to get
more type-level safety for what goes where.

* Remove `Inst::produces_const` accessor

Instead of conditionally defining regalloc and various other operations
instead add dedicated `MInst` variants for operations which are intended
to produce a constant to have more clear interactions with regalloc and
printing and such.

* Fix tests

* Register traps in `MachBuffer` for load-folding ops

This adds a missing `add_trap` to encoding of VEX instructions with
memory operands to ensure that if they cause a segfault there's
appropriate metadata for Wasmtime to understand that the instruction
could in fact trap. This fixes a fuzz test case found locally where v8
trapped and Wasmtime didn't catch the signal and crashed the fuzzer.
2023-02-20 15:11:52 +00:00
Alex Crichton
453330b2db x64: Add rudimentary support for some AVX instructions (#5795)
* x64: Add rudimentary support for some AVX instructions

I was poking around Spidermonkey's wasm backend and saw that the various
assembler functions used are all `v*`-prefixed which look like they're
intended for use with AVX instructions. I looked at Cranelift and it
currently doesn't have support for many AVX-based instructions, so I
figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my
general goal was to take a single instruction from the SIMD proposal for
WebAssembly and migrate all of its component instructions to AVX. I, by
random chance, picked a pretty complicated instruction of `f32x4.min`.
This wasm instruction is implemented on x64 with 4 unique SSE
instructions and ended up being a pretty good candidate.

Further digging about AVX-vs-SSE shows that there should be two major
benefits to using AVX over SSE:

* Primarily AVX instructions largely use a three-operand form where two
  input registers are operated with and an output register is also
  specified. This is in contrast to SSE's predominant
  one-register-is-input-but-also-output pattern. This should help free
  up the register allocator a bit and additionally remove the need for
  movement between registers.

* As #4767 notes the memory-based operations of VEX-encoded instructions
  (aka AVX instructions) do not have strict alignment requirements which
  means we would be able to sink loads and stores into individual
  instructions instead of having separate instructions.

So I set out on my journey to implement the instructions used by
`f32x4.min`. The first few were fairly easy. The machinst backends are
already of the shape "take these inputs and compute the output" where
the x86 requirement of a register being both input and output is
postprocessed in. This means that the `inst.isle` creation helpers for
SSE instructions were already of the correct form to use AVX. I chose to
add new `rule` branches for the instruction creation helpers, for
example `x64_andnps`. The new `rule` conditionally only runs if AVX is
enabled and emits an AVX instruction instead of an SSE instruction for
achieving the same goal. This means that no lowerings of clif
instructions were modified; instead, just new instructions are being
generated.

The VEX encoding was previously not heavily used in Cranelift. The only
current users are the FMA-style instructions that Cranelift has at this
time. These FMA instructions have one extra operand than `vandnps`, for
example, so I split the existing `XmmRmRVex` into a few more variants to
fit the shape of the instructions that needed generating for
`f32x4.min`. This was then accompanied by more AVX opcode definitions,
more emission support, etc.

Upon implementing all of this it turned out that the test suite was
failing on my machine due to the memory-operand encodings of VEX
instructions not being supported. I didn't explicitly add those in
myself but some preexisting RIP-relative addressing was leaking into the
new instructions with existing tests. I opted to go ahead and fill out
the memory addressing modes of VEX encoding to get the tests passing
again.

All-in-all this PR adds new instructions to the x64 backend for a number
of AVX instructions, updates 5 existing instruction producers to use AVX
instructions conditionally, implements VEX memory operands, and adds
some simple tests for the new output of `f32x4.min`. The existing
runtest for `f32x4.min` caught a few intermediate bugs along the way and
I additionally added a plain `target x86_64` to that runtest to ensure
that it executes with and without AVX to test the various lowerings.
I'll also note that this, and future support, should be well-fuzzed
through Wasmtime's fuzzing which may explicitly disable AVX support
despite the machine having access to AVX, so non-AVX lowerings should be
well-tested into the future.

It's also worth mentioning that I am not an AVX or VEX or x64 expert.
Implementing the memory operand part for VEX was the hardest part of
this PR and while I think it should be good someone else should
definitely double-check me. Additionally I haven't added many
instructions to the x64 backend yet so I may have missed obvious places
to add tests or such, so I'm happy to follow up with anything to be more
thorough if necessary.

Finally I should note that this is just the tip of the iceberg when it
comes to AVX. My hope is to get some of the idioms sorted out to make it
easier for future PRs to add one-off instruction lowerings or such.

* Review feedback
2023-02-17 01:29:55 +00:00
Alex Crichton
cae3b26623 x64: Improve codegen for vectors with constant shift amounts (#5797)
I stumbled across this working on #5795 and figured this was a nice
opportunity to improve the codegen here.
2023-02-16 20:47:59 +00:00
Trevor Elliott
80c147d9c0 Rework br_table to use BlockCall (#5731)
Rework br_table to use BlockCall, allowing us to avoid adding new nodes during ssa construction to hold block arguments. Additionally, many places where we previously matched on InstructionData to extract branch destinations can be replaced with a use of branch_destination or branch_destination_mut.
2023-02-16 09:23:27 -08:00
Trevor Elliott
f04decc4a1 Use capstone to validate precise-output tests (#5780)
Use the capstone library to disassemble precise-output tests, in addition to pretty-printing their vcode.
2023-02-15 16:35:10 -08:00
Trevor Elliott
f0137c2618 x64: Fix the formatting for andn (#5789)
* Print AluRmRVex instructions with the destination last
* Update andn tests
2023-02-15 11:16:59 -08:00
Alex Crichton
0037b71b11 Use xmm_rm_r more frequently in x64 backend (#5787)
This updates the signature of the `xmm_rm_r` helper function, updates
existing users, and migrates other users to the helper now that the
type information is no longer required.
2023-02-15 09:03:19 -08:00
Trevor Elliott
19f337e29b Move the default block to the front of the underlying jump table storage (#5770)
The new API on JumpTableData makes it easy to keep the default label first, which shrinks the diff in #5731 a bit.
2023-02-13 20:50:29 +00:00
Trevor Elliott
d99783fc91 Move default blocks into jump tables (#5756)
Move the default block off of the br_table instruction, and into the JumpTable that it references.
2023-02-10 08:53:30 -08:00
Alex Crichton
de0e0bea3f Legalize b{and,or,xor}_not into component instructions (#5709)
* Remove trailing whitespace in `lower.isle` files

* Legalize the `band_not` instruction into simpler form

This commit legalizes the `band_not` instruction into `band`-of-`bnot`,
or two instructions. This is intended to assist with egraph-based
optimizations, so that the `band_not` instruction doesn't have to be
specifically included in other bit-operation-patterns.

Lowerings of the `band_not` instruction have been moved to a
specialization of the `band` instruction.
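
The underlying bit identity, for reference:

```rust
fn main() {
    // `band_not(x, y)` is just `band(x, bnot(y))`, i.e. x & !y.
    let (x, y) = (0b1100u8, 0b1010u8);
    assert_eq!(x & !y, 0b0100);
}
```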

* Legalize `bor_not` into components

Same as prior commit, but for the `bor_not` instruction.

* Legalize bxor_not into bxor-of-bnot

Same as prior commits. I think this also ended up fixing a bug in the
s390x backend where `bxor_not x y` was actually translated as `bnot
(bxor x y)` by accident given the test update changes.

* Simplify not-fused operands for riscv64

Looks like some delegated-to rules have special-cases for "if this
feature is enabled use the fused instruction" so move the clause for
testing the feature up to the lowering phase to help trigger other rules
if the feature isn't enabled. This should make the riscv64 backend more
consistent with how other backends are implemented.

* Remove B{and,or,xor}Not from cost of egraph metrics

These shouldn't ever reach egraphs now that they're legalized away.

* Add an egraph optimization for `x^-1 => ~x`

This adds a simplification node to translate xor-against-minus-1 to a
`bnot` instruction. This helps trigger various other optimizations in
the egraph implementation and also various backend lowering rules for
instructions. This is chiefly useful as wasm doesn't have a `bnot`
equivalent, so it's encoded as `x^-1`.
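
The identity driving the rule:

```rust
fn main() {
    // XOR against all-ones (-1) is bitwise NOT, which is how wasm
    // producers encode `bnot`.
    let x: i32 = 0x1234_5678;
    assert_eq!(x ^ -1, !x);
}
```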

* Add a wasm test for end-to-end bitwise lowerings

Test that end-to-end various optimizations are being applied for input
wasm modules.

* Specifically don't self-update rustup on CI

I forget why this was here originally, but this is failing on Windows
CI. In general there's no need to update rustup, so leave it as-is.

* Cleanup some aarch64 lowering rules

Previously a 32/64 split was necessary due to the `ALUOp` being different
but that's been refactored away now, so there's no longer any need for
duplicate rules.

* Narrow a x64 lowering rule

This previously made more sense when it was `band_not` and rarely used,
but be more specific in the type-filter on this rule that it's only
applicable to SIMD types with lanes.

* Simplify xor-against-minus-1 rule

No need to have the commutative version since constants are already
shuffled right for egraphs

* Optimize band-of-bnot when bnot is on the left

Use some more rules in the egraph algebraic optimizations to
canonicalize band/bor/bxor with a `bnot` operand to put the operand on
the right. That way the lowerings in the backends only have to list the
rule once, with the operand on the right, to optimize both styles of
input.

* Add commutative lowering rules

* Update cranelift/codegen/src/isa/x64/lower.isle

Co-authored-by: Jamey Sharp <jamey@minilop.net>

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-06 13:53:40 -06:00
Trevor Elliott
6d8f2be9e1 Use andn for band_not when bmi1 is present (#5701)
We can use the andn instruction for the lowering of band_not on x64 when bmi1 is available.
2023-02-03 16:23:18 -08:00
Jun Ryung Ju
9cd4146939 Implemented b{and,or,xor}_not bitops for ty_int_ref_scalar_64 type. (#5604)
* Implemented `b{and,or,xor}_not` bitops for ty_int_ref_scalar_64 type.

* Added tests.
2023-02-01 21:57:18 -08:00
Nick Fitzgerald
bdfb746548 Cranelift: Introduce the return_call and return_call_indirect instructions (#5679)
* Cranelift: Introduce the `tail` calling convention

This is an unstable-ABI calling convention that we will eventually use to
support Wasm tail calls.

Co-Authored-By: Jamey Sharp <jsharp@fastly.com>

* Cranelift: Introduce the `return_call` and `return_call_indirect` instructions

These will be used to implement tail calls for Wasm and any other language
targeting CLIF. The `return_call_indirect` instruction differs from the Wasm
instruction of the same name by taking a native address callee rather than a
Wasm function index.

Co-Authored-By: Jamey Sharp <jsharp@fastly.com>

* Cranelift: Implement verification rules for `return_call[_indirect]`

They must:

* have the same return types between the caller and callee,
* have the same calling convention between caller and callee,
* and that calling convention must support tail calls.

Co-Authored-By: Jamey Sharp <jsharp@fastly.com>

* cargo fmt

---------

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2023-02-01 21:20:35 +00:00
Trevor Elliott
a5698cedf8 cranelift: Remove brz and brnz (#5630)
Remove the brz and brnz instructions, as their behavior is now redundant with brif.
2023-01-30 20:34:56 +00:00
Trevor Elliott
a181ad2932 Cleanup the use of maybe_uextend in the x64 lowerings (#5637)
Use maybe_uextend for the brnz lowerings on x64.
2023-01-25 17:28:48 -08:00
Trevor Elliott
b58a197d33 cranelift: Add a conditional branch instruction with two targets (#5446)
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.

This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.

Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
2023-01-24 14:37:16 -08:00
Trevor Elliott
1e6c13d83e cranelift: Rework block instructions to use BlockCall (#5464)
Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.

To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that cover both sets of arguments:

inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrite all values in an instruction and block arguments with values from the iterator

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-01-17 16:31:15 -08:00
Saúl Cabrera
6cb68f3287 cranelift-codegen: Expose x64 settings (#5561)
Exposes x64 settings so that they can be consumed from Winch for binary emission.
2023-01-11 18:33:03 -05:00
Alexa VanHattum
44913825b5 cranelift: fix register for srem.i8 on x86_64 (#5540)
* Change the register written to in a specific srem case. Add a regression test as a filetest case. Fixes #5470

* Add another test case, newline

* Update comment
2023-01-06 22:18:16 +00:00
Sam Sartor
1efa3d6f8b Add clif-util compile option to output object file (#5493)
* add clif-util compile option to output object file

* switch from a box to a borrow

* update objectmodule tests to use borrowed isa

* put targetisa into an arc
2023-01-06 12:53:48 -08:00
uint256_t
b00455135e Cranelift: Implement 'iabs' for scalar types on x86_64 (#5527)
* Implement 'iabs' for scalar types on x86_64

* Small fix
2023-01-05 21:33:12 -08:00