Commit Graph

1825 Commits

Author SHA1 Message Date
Alex Crichton
d6ce632b5b aarch64: Specialize constant vector shifts (#5976)
* aarch64: Specialize constant vector shifts

This commit adds special lowering rules for
vector-shifts-by-constant-amounts to use dedicated instructions which
cuts down on the codegen here quite a bit for constant values.

* Fix codegen for 0-shift-rights

* Special-case zero left-shifts as well

* Remove left-shift special case
2023-03-13 22:37:59 +00:00
Alex Crichton
e2a6fe99c2 x64: Add shuffle specialization for palignr (#5999)
* x64: Add `shuffle` specialization for `palignr`

This commit adds specializations for the `palignr` instruction to the
x64 backend to specialize some more patterns of byte shuffles.

* Fix tests
2023-03-13 21:01:24 +00:00
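
As a rough illustration of the pattern being specialized here, the scalar Rust model below (illustrative only; not Cranelift's actual code) mimics `palignr`: the two inputs are concatenated, the 32-byte result is shifted right by a byte count, and the low 16 bytes are kept, which is why a shuffle mask of 16 consecutive byte indices can map onto this instruction.

```rust
/// Scalar model of x86 `palignr dst, src, imm8`: concatenate `dst:src`
/// (with `src` in the low half), shift right by `imm8` bytes, and keep
/// the low 16 bytes; bytes shifted in from beyond the pair are zero.
fn palignr(dst: [u8; 16], src: [u8; 16], imm8: usize) -> [u8; 16] {
    let mut concat = [0u8; 32];
    concat[..16].copy_from_slice(&src);
    concat[16..].copy_from_slice(&dst);
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = *concat.get(i + imm8).unwrap_or(&0);
    }
    out
}

fn main() {
    let a: [u8; 16] = core::array::from_fn(|i| i as u8); // bytes 0..=15
    let b: [u8; 16] = core::array::from_fn(|i| 16 + i as u8); // bytes 16..=31
    // Selecting 16 consecutive bytes starting at offset 4 of the
    // concatenation corresponds to `palignr` with `imm8 = 4`.
    assert_eq!(palignr(a, b, 4)[0], 20);
    println!("{:?}", palignr(a, b, 4));
}
```
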
Alex Crichton
03b5dbb3e0 aarch64: Use VCodeConstant for f64/v128 constants (#5997)
* aarch64: Translate float and splat lowering to ISLE

I was looking into `constant_f128` and its fallback lowering into memory
and to get familiar with the code I figured it'd be good to port some
Rust logic to ISLE. This commit ports the `constant_{f128,f64,f32}`
helpers into ISLE from Rust as well as the `splat_const` helper which
ended up being closely related.

Tests reflect a number of regalloc changes that happened but also namely
one major difference is that in the lowering of `f32` a 32-bit immediate
is created now instead of a 64-bit immediate (in a GP register before
it's moved into an FP register). This has no semantic effect, but the
generated code is slightly different in a few minor cases.

* aarch64: Load f64/v128 constants from a pool

This commit removes the `LoadFpuConst64` and `LoadFpuConst128`
pseudo-instructions from the AArch64 backend which internally loaded a
nearby constant and then jumped over it. Constants now go through the
`VCodeConstant` infrastructure which gets placed at the end of the
function similar to how x64 works. Some minor support was added in as
well to add a new addressing mode for a `MachLabel`-relative load.
2023-03-13 19:33:52 +00:00
Alex Crichton
6ecdc2482e x64: Improve memory support in {insert,extract}lane (#5982)
* x64: Improve memory support in `{insert,extract}lane`

This commit adds support to Cranelift to emit `pextr{b,w,d,q}`
with a memory destination, merging a store-of-extract operation into one
instruction. Additionally AVX support is added for the `pextr*`
instructions.

I've additionally tried to ensure that codegen tests and runtests exist
for all forms of these instructions too.

* Add missing commas

* Fix tests
2023-03-13 19:30:44 +00:00
Afonso Bordado
5c95e6fbaf riscv64: Codemotion cleanups to ISLE files (#5984)
* riscv64: Fix typo in extensions

* riscv64: Move converters to top of file

* riscv64: Group up all imm12 rules

* riscv64: Move zero_reg helpers to Physical Regs section

* riscv64: Move helpers away from `clz` lowerings

These were in the middle of the `clz` rules and are kind of distracting

* riscv64: Move `cls` rules next to `ctz`/`clz`

* cranelift: Move `u8_and` / `u32_add` to Primitive Arithmetic section

* riscv64: Mark some imm12 constructors as pure

* cranelift: Move `s32_add_fallible` next to `u32_add`

* riscv64: Fix Typo
2023-03-13 19:20:15 +00:00
Afonso Bordado
ad0bce3a36 riscv64: Fix regalloc panic with bor+bnot on floats (#5857) 2023-03-13 18:29:36 +00:00
Saúl Cabrera
d03612c2d9 cranelift-codegen(x64): Expose CallInfo (#6005)
This commit exposes the `CallInfo` struct, needed by Winch to emit function
calls.
2023-03-13 17:50:53 +00:00
Alex Crichton
7956dc6ba2 Change CLIF shuffle to validate lane indices (#5995)
* Change CLIF `shuffle` to validate lane indices

Previously the CLIF `shuffle` instruction did not perform any validation
on the lane shuffle mask and specified that out-of-bounds lanes always
returned 0 as the value. This behavior though is not required by
WebAssembly which validates that lane indices are always in-bounds.
Additionally since these are static immediates even other code
generators should be able to verify that the immediates are in-bounds.

As a result this commit updates the definition of the `shuffle`
instruction to specify that all byte immediates must be in-bounds in the
range of [0, 32). The verifier has been updated and some test cases have
been removed that were testing this functionality.

Closes #5989

* Only generate valid shuffle immediates in fuzzer
2023-03-13 14:24:11 +00:00
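
A minimal sketch of the new rule as a standalone Rust check (the helper name is hypothetical; this is not the actual verifier code): every byte of the immediate must select one of the 32 input bytes.

```rust
/// Hypothetical stand-alone check mirroring the verifier rule: each byte
/// of a `shuffle` immediate must lie in [0, 32).
fn validate_shuffle_mask(mask: &[u8; 16]) -> Result<(), String> {
    for (i, &idx) in mask.iter().enumerate() {
        if idx >= 32 {
            return Err(format!("byte {i}: index {idx} out of bounds (must be < 32)"));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_shuffle_mask(&[0; 16]).is_ok());
    let mut bad = [0u8; 16];
    bad[15] = 32; // first out-of-bounds value
    assert!(validate_shuffle_mask(&bad).is_err());
}
```
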
Chris Fallin
264089e29d Cranelift: aarch64: fix undefined dest reg in f32x4.splat case. (#5987)
One of the cases for a splat operation, as updated in #5370, wrote to
a temp reg but then only conditionally transformed the temp into the
final destination register. In another codepath, `rd` was left
undefined. This causes a panic later when regalloc2 verifies SSA
properties of its input (here, value not def'd before use).

Fixes #5985.
2023-03-11 00:22:29 +00:00
Alex Crichton
52896e020d aarch64: Add specialized shuffle lowerings (#5977)
* aarch64: Add `shuffle` lowerings for the `uzp{1,2}` instructions

This commit uses the same style of patterns in the x64 backend to start
adding specific lowerings of the Cranelift `shuffle` instruction to
particular AArch64 instructions.

* aarch64: Add `shuffle` lowerings to the `zip{1,2}` instructions

These instructions match the `punpck*` family of instructions on x64 and
should help provide more efficient lowerings than the current `shuffle`
fallback.

* aarch64: Add `shuffle` lowerings for `trn{1,2}`

Along the lines of prior commits adds specific patterns to lowering for
individual AArch64 instructions available.

* aarch64: Add a `shuffle` lowering for the `ext` instruction

This instruction will more-or-less concatenate two 128-bit vector
registers to create a 256-bit value, shift it right, and then take the
lower 128-bits into the destination. This can be modeled with a
`shuffle` of consecutive bytes so this adds a lowering rule to generate
this instruction.

* aarch64: Add `shuffle` special case for `dup`

This commit adds special cases for Cranelift's `shuffle` on AArch64 when
the lowering can be represented with a `dup` instruction which
broadcasts one vector's lane into all lanes of the destination.

* aarch64: Add `shuffle` specializations for `rev` instructions

This commit adds shuffle mask specializations for the `rev{16,32,64}`
family of instructions on AArch64 which can be used to reverse bytes,
16-bit values, or 32-bit values within larger values.

* Fix tests

* Add doc-comments in ISLE
2023-03-10 21:37:13 +00:00
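
For the `dup` case, the mask analysis boils down to checking that every byte of the immediate selects the same input byte (shown at byte granularity; wider lanes work analogously). A small Rust sketch with a hypothetical helper name, not the actual ISLE extractor:

```rust
/// Returns the selected byte index if the shuffle mask broadcasts a
/// single input byte, i.e. something `dup Vd.16b, Vn.b[lane]` can do.
fn as_byte_broadcast(mask: &[u8; 16]) -> Option<u8> {
    let lane = mask[0];
    if lane < 32 && mask.iter().all(|&b| b == lane) {
        Some(lane)
    } else {
        None
    }
}

fn main() {
    assert_eq!(as_byte_broadcast(&[5; 16]), Some(5));
    let alternating: [u8; 16] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1];
    assert_eq!(as_byte_broadcast(&alternating), None);
}
```
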
Ulrich Weigand
411781d2fe s390x: Fix mistake in available_in_isa (#5981)
The 32-bit float<->int conversion instructions are part of
the VXRS_EXT2 facility, not MIE2.

Fixes https://github.com/bytecodealliance/wasmtime/issues/5979.
2023-03-10 19:41:41 +00:00
bjorn3
108f7917c8 Support plugging external profilers into the Cranelift timing infrastructure (#5749)
* Remove no-std code for cranelift_codegen::timings

no-std mode isn't supported by Cranelift anymore

* Simplify define_passes macro

* Add egraph opt timings

* Replace the add_to_current api with PassTimes::add

* Omit a couple of unused time measurements

* Reduce divergence between run and run_passes a bit

* Introduce a Profiler trait

This allows plugging external profilers into the Cranelift profiling
framework.

* Add Pass::description method

* Remove duplicate usage of the compile pass timing

* Rustfmt
2023-03-10 19:33:56 +00:00
yuyang
4e875f33a7 Codegen fix fcvt_from_sint.f32 with small types on riscv64. (#5964)
* fix issue5952

* We should only extend i8 and i16

* remove extra space

* move some code
2023-03-10 10:29:55 +00:00
Alex Crichton
0ec7b872fa x64: Optimize store-of-extract-lane-0 (#5924)
* x64: Optimize store-of-extract-lane-0

The `movss` and `movsd` instructions can be used to store the 0th lane
of a `t32x4` or a `t64x2` vector into memory, enabling fusing a `store`
and an `extractlane` instruction.

* Fix merge conflict with `main`
2023-03-10 01:06:38 +00:00
Alex Crichton
83f21e784a x64: Add more support for more AVX instructions (#5931)
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor` instruction
and then use `pshufb` to do the broadcast.

* Review comments

* x64: Add an AVX encoding for the `pshufd` instruction

The AVX encoding benefits from not requiring alignment of a memory
operand, unlike the SSE `pshufd` instruction, and additionally, as I've
just learned, it reduces dependencies between instructions because the
`v*` instructions zero the upper bits as opposed to preserving them,
which could accidentally create false dependencies in the CPU between
instructions.

* x64: Add more support for AVX loads/stores

This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument which ended up doing a small refactoring of
the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.

* x64: Enable using AVX instructions for zero regs

This commit refactors the internal ISLE helpers for creating zero'd
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions
with a bit of reorganization.

* x64: Remove `XmmConstOp` as an instruction

All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction so there's no longer any need for these otherwise
constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.

* Review comments

* Update test expectations
2023-03-09 23:57:42 +00:00
Alex Crichton
1c3a1bda6c x64: Add a smattering of lowerings for shuffle specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor` instruction
and then use `pshufb` to do the broadcast.

* Review comments
2023-03-09 22:58:19 +00:00
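
The `pshufd` specialization above hinges on recognizing when a 16-byte shuffle immediate is really a permutation of aligned 32-bit lanes drawn from a single input. A sketch of that analysis in plain Rust (illustrative only; not the actual ISLE rules), returning the `pshufd` `imm8` plus which operand it reads:

```rust
/// If the mask moves whole, aligned 32-bit lanes and only reads one of
/// the two inputs, return the `pshufd` immediate and the input index.
fn as_pshufd(mask: &[u8; 16]) -> Option<(u8, usize)> {
    let mut lanes = [0u8; 4];
    for group in 0..4 {
        let base = mask[group * 4];
        if base % 4 != 0 || base >= 32 {
            return None; // not an aligned 32-bit lane
        }
        for byte in 1..4 {
            if mask[group * 4 + byte] != base + byte as u8 {
                return None; // group is not a contiguous 32-bit lane
            }
        }
        lanes[group] = base / 4; // 32-bit lane index in 0..8
    }
    let from_second = lanes[0] >= 4;
    if lanes.iter().any(|&l| (l >= 4) != from_second) {
        return None; // mixes both inputs; pshufd alone cannot do that
    }
    // imm8 packs one 2-bit source lane per destination lane.
    let imm8 = lanes
        .iter()
        .enumerate()
        .fold(0u8, |acc, (i, &l)| acc | ((l % 4) << (i * 2)));
    Some((imm8, if from_second { 1 } else { 0 }))
}

fn main() {
    // Reverse the four 32-bit lanes of the first input: imm8 = 0b00_01_10_11.
    let mask: [u8; 16] = [12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3];
    assert_eq!(as_pshufd(&mask), Some((0b00_01_10_11, 0)));
}
```
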
Chris Fallin
7f3500a172 Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s. (#5972)
* Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s.

@avanhatt has been looking at our address-mode lowering and found an
example where when feeding an `I32`-typed address into a load or store,
we can violate assumptions and get incorrect codegen.

This should never be reachable in practice, because all producers on
64-bit architectures use 64-bit types for addresses. However, our IR is
insufficiently constrained, and allows loads/stores to `I32` addresses
as well. This is nonsensical on a 64-bit architecture.

Initially I had thought we should tighten either the instruction
definition's accepted types, or the CLIF verifier, to reject this.
However both are target-independent, and we don't want to bake
an assumption of 64-bit-ness into the compiler core. Instead this PR
tightens specific backends' lowerings to reject loads/stores of
`I32`-typed addresses.

tl;dr: no security implications as all producers use I64-typed
addresses (and must, for correct operation); but we currently accept
I32-typed addresses too, and this breaks other assumptions.

* Allow R64 as well as I64 types.

* Add an explicit extractor to match 64-bit address types.
2023-03-09 19:08:16 +00:00
Alex Crichton
63fb30e4b4 Merge pull request from GHSA-ff4p-7xrq-q5r8
* x64: Remove incorrect `amode_add` lowering rules

This commit removes two incorrect rules as part of the x64 backend's
computation of addressing modes. These two rules folded a zero-extended
32-bit computation into the address mode operand, but this isn't correct
as the 32-bit computation should be truncated to 32-bits but when folded
into the address mode computation it happens with 64-bit operands,
meaning truncation doesn't happen.

* Add release notes
2023-03-08 13:00:40 -06:00
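
The truncation problem can be seen with plain integer arithmetic. The standalone Rust snippet below (not compiler code) contrasts the correct semantics, a wrapping 32-bit add followed by zero-extension, with the effect of folding the add into a 64-bit address computation:

```rust
fn main() {
    let x: u32 = 0xffff_fff0;
    let y: u32 = 0x20;
    // Correct semantics: add in 32 bits (wrapping), then zero-extend.
    let correct = x.wrapping_add(y) as u64; // 0x0000_0010
    // Incorrect folding: zero-extend first, then add with 64-bit operands,
    // so the 32-bit wrap-around never happens.
    let folded = x as u64 + y as u64; // 0x1_0000_0010
    assert_ne!(correct, folded);
    println!("correct = {correct:#x}, folded = {folded:#x}");
}
```
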
Alex Crichton
5dc2bbccbb Merge pull request from GHSA-xm67-587q-r2vw
This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16-and-above are mapped to select
from the first vector since the first and second element are the same,
but the subtraction was with 15 rather than 16 by accident.
2023-03-08 13:00:00 -06:00
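
A sketch of the remapping in question (illustrative only, not the backend code): when the two shuffle operands are the same vector, indices 16..31 refer to the second copy of that same value, so they are rebased by subtracting 16; subtracting 15 was the off-by-one being fixed.

```rust
/// Rebase shuffle indices for a vector shuffled with itself: lanes that
/// would read the "second" operand read the same value, shifted down by 16.
fn remap_self_shuffle(mask: [u8; 16]) -> [u8; 16] {
    mask.map(|idx| if idx >= 16 { idx - 16 } else { idx })
}

fn main() {
    let mask = [0, 17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31];
    assert_eq!(
        remap_self_shuffle(mask),
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    );
}
```
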
Kevin Rizzo
013b35ff32 winch: Refactoring wasmtime compiler integration pieces to share more between Cranelift and Winch (#5944)
* Enable the native target by default in winch

Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.

* Refactor Cranelift's ISA builder to share more with Winch

This commit refactors the `Builder` type to have a type parameter
representing the finished ISA with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.

* Moving compiler shared components to a separate crate

* Restore native flag inference in compiler building

This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.

* Move `Compiler::page_size_align` into wasmtime-environ

The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.

* Fill out `Compiler::{flags, isa_flags}` for Winch

These are easy enough to plumb through with some shared code for
Wasmtime.

* Plumb the `is_branch_protection_enabled` flag for Winch

Just forwarding an isa-specific setting accessor.

* Moving executable creation to shared compiler crate

* Adding builder back in and removing from shared crate

* Refactoring the shared pieces for the `CompilerBuilder`

I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.

With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.

I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstractions: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.

* Documentation updates, and renaming the finish method

* Adding back in a default temporarily to pass tests, and removing some unused imports

* Fixing winch tests with wrong method name

* Removing unused imports from codegen shared crate

* Apply documentation formatting updates

Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>

* Adding back in cranelift_native flag inferring

* Adding new shared crate to publish list

* Adding write feature to pass cargo check

---------

Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
2023-03-08 15:07:13 +00:00
Alex Crichton
07518dfd36 Remove the Cranelift vselect instruction (#5918)
* Remove the Cranelift `vselect` instruction

This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:

* x64 - uses the high bit for `f32x4` and `f64x2` and otherwise uses the
  high bit of each byte doing a byte-wise lane select rather than
  whatever the controlling type is.

* AArch64 - this is the same as `bitselect` which is a bit-wise
  selection rather than a lane-wise selection.

* s390x - this is the same as AArch64, a bit-wise selection rather than
  lane-wise.

* interpreter - the interpreter implements the documented semantics of
  selecting based on "truthy" values.

Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.

Given this situation this commit subsequently removes `vselect` and all
usage of it throughout Cranelift.

Closes #5917

* Review comments

* Bring back vselect opts as bitselect opts

* Clean up vselect usage in the interpreter

* Move bitcast in nan canonicalization

* Add a comment about float optimization
2023-03-08 00:42:05 +00:00
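
The distinction driving the removal is bit-wise versus lane-wise selection. A scalar Rust model of `bitselect` (an illustration, not Cranelift code): each bit of the result comes from the first or second value according to the corresponding bit of the condition, so it only behaves like a lane select when every lane of the condition is all-ones or all-zeros.

```rust
fn bitselect(c: u64, x: u64, y: u64) -> u64 {
    (c & x) | (!c & y)
}

fn main() {
    // With an all-ones/all-zeros mask per 16-bit lane, this acts like a
    // lane-wise select:
    assert_eq!(
        bitselect(0x0000_ffff_0000_ffff, 0x1111_1111_1111_1111, 0x2222_2222_2222_2222),
        0x2222_1111_2222_1111
    );
    // With an arbitrary mask it mixes bits within a lane, which a
    // "truthy lane" vselect would never do:
    assert_eq!(bitselect(0x0f, 0xaa, 0x55), 0x5a);
}
```
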
Alex Crichton
afde4ea4e3 Fix the original block for block params in egraphs (#5960)
This fixes an issue where block params were always listed as being
members of the current block in egraphs, even when the block param was
actually defined in a separate block. This then enables instructions
which depend on these parameters to get hoisted up out of inner loops at
least to the block that defined the argument.

Closes #5957
2023-03-07 23:58:03 +00:00
Trevor Elliott
709257011e Restrict uextend and sextend to scalar integers (#5953) 2023-03-07 19:10:50 +00:00
Alex Crichton
8bb183f16e Implement the relaxed SIMD proposal (#5892)
* Initial support for the Relaxed SIMD proposal

This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is implemented on the x64 and
AArch64 backends at this time.

The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.

A summary of changes made in this commit are:

* Lowerings for all relaxed simd opcodes have been added, currently all
  exhibiting deterministic behavior. This means that few lowerings are
  optimal on the x86 backend, but on the AArch64 backend, for example,
  all lowerings should be optimal.

* Support is added to codegen to, eventually, conditionally generate
  different code based on input codegen flags. This is intended to
  enable generating more efficient instructions on x86 by default, for
  example, while still allowing embedders to force
  architecture-independent semantics and behavior. One good example of
  this is the `f32x4.relaxed_fmadd` instruction which when deterministic
  forces the `fma` instruction, but otherwise if the backend doesn't
  have support for `fma` then intermediate operations are performed
  instead.

* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
  x86 backend as they're now exercised by the deterministic lowerings of
  relaxed simd instructions.

* Sample codegen tests for added for x86 and aarch64 for some relaxed
  simd instructions.

* Wasmtime embedder support for the relaxed-simd proposal and forcing
  determinism have been added to `Config` and the CLI.

* Support has been added to the `*.wast` runtime execution for the
  `(either ...)` matcher used in the relaxed-simd proposal.

* Tests for relaxed-simd are run both with a default `Engine` as well as
  a "force deterministic" `Engine` to test both configurations.

* All tests from the upstream repository were copied into Wasmtime.
  These tests should be deleted when WebAssembly/testsuite is updated.

* x64: Add x86-specific lowerings for relaxed simd

This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.

* Add libcall support for fma to Wasmtime

This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.

* Ignore relaxed-simd tests on s390x and riscv64

* Enable relaxed-simd tests on s390x

* Update cranelift/codegen/meta/src/shared/instructions.rs

Co-authored-by: Andrew Brown <andrew.brown@intel.com>

* Add a FIXME from review

* Add notes about deterministic semantics

* Don't default `has_native_fma` to `true`

* Review comments and rebase fixes

---------

Co-authored-by: Andrew Brown <andrew.brown@intel.com>
2023-03-07 15:52:41 +00:00
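
One of the determinism questions above, for `f32x4.relaxed_madd`, comes down to single versus double rounding. A standalone Rust illustration (not the Wasmtime lowering itself):

```rust
fn main() {
    let (a, b, c) = (0.1f32, 10.3f32, -1.03f32);
    // A fused multiply-add rounds once, which is what an `fma` instruction
    // (or the libcall fallback) computes.
    let fused = a.mul_add(b, c);
    // Separate multiply and add round twice.
    let split = a * b + c;
    // For inputs like these the two results differ in the low bits, which
    // is exactly the slack the relaxed instruction permits.
    assert_ne!(fused, split);
    println!("fused = {fused:e}, split = {split:e}");
}
```
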
yuyang
812b4b5229 Codegen fix atomic_cas with small types on riscv64 (#5919)
* fix issue5901

* add regression test file.

* fix regression targets.

* fix a comment.

* enable atomic-cas-little for riscv64

* specify little endian some s390x can pass test.

* fix register error
2023-03-07 13:32:28 +00:00
Alex Crichton
18ee645ebe Allow hoisting vconst instructions out of loops (#5909)
* Allow hoisting `vconst` instructions out of loops

Staring at some SIMD code and what LLVM and v8 both generate it appears
that a common technique for SIMD-loops is to hoist constants outside of
loops since they're nontrivial to rematerialize unlike integer
constants. This commit updates the `loop_hoist_level` calculation with
egraphs to have a nonzero default for instructions that have no
arguments (e.g. consts) which enables hoisting these instructions out of
loops.

Note, though, that for now I've listed the maximum as hoisting outside
of one loop, but not all of them. While theoretically vconsts could move
up to the top of the function I'd be worried about their impact on
register pressure and having to save/restore around calls or similar, so
hopefully if the hot part of a program is a single loop then hoisting
out of one loop is a reasonable-enough heuristic for now.

Locally on x64 with a benchmark that just encodes binary to hex this saw
a 15% performance improvement taking hex encoding from ~6G/s to ~6.7G/s.

* Test vconst is only hoisted one loop out
2023-03-06 15:29:43 +00:00
yuyang
20198d94c6 Codegen fix atomic_rmw_loop missing move result to dst register On riscv64. (#5898)
* fix issue5884.

* fix issue5884

* fix test failure

* fix atomic rmw missing move result to dst register.

* specify little endian some s390x can pass test.
2023-03-06 11:27:46 +00:00
Alex Crichton
3ff3994a12 Add egraph optimization for fneg's cancelling out (#5910)
This implements comments from #5895 to cancel out `fneg` operations in
`fma` instructions. Additional support for `fmul` is added as well.
2023-03-02 18:28:32 +00:00
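
The rewrite relies on the identity that negating both multiplicands leaves a (fused or plain) multiply unchanged; a quick numeric check in Rust:

```rust
fn main() {
    let (x, y, z) = (1.5f64, -2.25f64, 0.125f64);
    // fma(-x, -y, z) == fma(x, y, z): the sign flips cancel exactly.
    assert_eq!((-x).mul_add(-y, z), x.mul_add(y, z));
    // The same cancellation holds for a plain multiply.
    assert_eq!((-x) * (-y), x * y);
}
```
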
Jan-Justin van Tonder
db8fe0108f cranelift: Add big and little endian memory accesses to interpreter (#5893)
* Added `mem_flags` parameter to `State::checked_{load,store}` as the means
for determining the endianness, typically derived from an instruction.

* Added `native_endianness` property to `InterpreterState` as fallback when
determining endianness, such as in cases where there are no memory flags
available or set.

* Added `to_be` and `to_le` methods to `DataValue`.

* Added `AtomicCas` and `AtomicRmw` to list of instructions with retrievable
memory flags for `InstructionData::memflags`.

* Enabled `atomic-{cas,rmw}-subword-{big,little}.clif` for interpreter run
tests.
2023-03-02 11:57:01 +00:00
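
A rough sketch of the endianness selection described above, using hypothetical types rather than the actual interpreter API: the byte order comes from the instruction's memory flags when present, and otherwise from the interpreter's configured native endianness.

```rust
#[derive(Clone, Copy)]
enum Endianness {
    Big,
    Little,
}

/// Load a 32-bit value using the flag's endianness, falling back to the
/// interpreter's native endianness when the flags leave it unspecified.
fn load_i32(bytes: [u8; 4], flag: Option<Endianness>, native: Endianness) -> i32 {
    match flag.unwrap_or(native) {
        Endianness::Big => i32::from_be_bytes(bytes),
        Endianness::Little => i32::from_le_bytes(bytes),
    }
}

fn main() {
    let bytes = [0x12, 0x34, 0x56, 0x78];
    assert_eq!(load_i32(bytes, Some(Endianness::Big), Endianness::Little), 0x1234_5678);
    assert_eq!(load_i32(bytes, None, Endianness::Little), 0x7856_3412);
}
```
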
Alex Crichton
9984e959cd aarch64: Add support for the fmls instruction (#5895)
This commit adds lowerings to the AArch64 backend for the `fmls`
instruction which is intended to be leveraged in the relaxed-simd
proposal for WebAssembly. This should hopefully allow for a
teeny-bit-more efficient codegen for this operator instead of using the
`fmla` instruction plus a negation instruction.
2023-03-02 05:45:58 +00:00
Alex Crichton
52b4c48a1b x64: Improve codegen for i8x16.shr_u (#5906)
This catches a case that wasn't handled previously by #5880 to allow a
constant load to be folded into an instruction rather than forcing it to
be loaded into a temporary register.
2023-03-02 05:43:42 +00:00
Chris Fallin
7b8854f803 egraphs: fix handling of effectful-but-idempotent ops and GVN. (#5800)
* Revert "egraphs: disable GVN of effectful idempotent ops (temporarily). (#5808)"

This reverts commit c7e2571866.

* egraphs: fix handling of effectful-but-idempotent ops and GVN.

This PR addresses #5796: currently, ops that are effectful, i.e., remain
in the side-effecting skeleton (which we keep in the `Layout` while the
egraph exists), but are idempotent and thus mergeable by a GVN pass, are
not handled properly.

GVN is still possible on effectful but idempotent ops precisely because
our GVN does not create partial redundancies: it removes an instruction
only when it is dominated by an identical instruction. An isntruction
will not be "hoisted" to a point where it could execute in the optimized
code but not in the original.

However, there are really two parts to the egraph implementation that
produce this effect: the deduplication on insertion into the egraph, and
the elaboration with a scoped hashmap. The deduplication lets us give a
single name (value ID) to all copies of an identical instruction, and
then elaboration will re-create duplicates if GVN should not hoist or
merge some of them.

Because deduplication need not worry about dominance or scopes, we use a
simple (non-scoped) hashmap to dedup/intern ops as "egraph nodes".

When we added support for GVN'ing effectful but idempotent ops (#5594),
we kept the use of this simple dedup'ing hashmap, but these ops do not
get elaborated; instead they stay in the side-effecting skeleton. Thus,
we inadvertently created potential for weird code-motion effects.

The proposal in #5796 would solve this in a clean way by treating these
ops as pure again, and keeping them out of the skeleton, instead putting
"force" pseudo-ops in the skeleton. However, this is a little more
complex than I would like, and I've realized that @jameysharp's earlier
suggestion is much simpler: we can keep an actual scoped hashmap
separately just for the effectful-but-idempotent ops, and use it to GVN
while we build the egraph. In effect, we're fusing a separate GVN pass
with the egraph pass (but letting it interact corecursively with
egraph rewrites). This is in principle similar to how we keep a separate
map for loads and fuse this pass with the egraph rewrite pass as well.

Note that we can use a `ScopedHashMap` here without the "context" (as
needed by `CtxHashMap`) because, as noted by @jameysharp, in practice
the ops we want to GVN have all their args inline. Equality on the
`InstructionData` itself is conservative: two insts whose struct
contents compare shallowly equal are definitely identical, but identical
insts in a deep-equality sense may not compare shallowly equal, due to
list indirection. This is fine for GVN, because it is still sound to
skip any given GVN opportunity (and keep the original instructions).

Fixes #5796.

* Add comments from review.
2023-03-02 02:10:42 +00:00
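
A heavily simplified model of the scoped-hashmap GVN idea (names and structure are invented for illustration; this is not the egraph implementation): entries are rolled back when leaving a dominator-tree scope, so an instruction is only merged with a copy that is guaranteed to dominate it.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash)]
struct InstKey {
    opcode: &'static str,
    args: Vec<u32>, // value numbers of the operands, stored inline
}

struct ScopedGvn {
    map: HashMap<InstKey, u32>,
    scopes: Vec<Vec<InstKey>>, // keys inserted per scope, for rollback
    next_value: u32,
}

impl ScopedGvn {
    fn new() -> Self {
        Self { map: HashMap::new(), scopes: vec![Vec::new()], next_value: 0 }
    }

    fn enter_scope(&mut self) {
        self.scopes.push(Vec::new());
    }

    fn leave_scope(&mut self) {
        // Forget everything defined in this scope: it no longer dominates.
        for key in self.scopes.pop().unwrap() {
            self.map.remove(&key);
        }
    }

    /// Returns (value, reused). `reused` is true when an identical,
    /// dominating instruction already produced the value, so the new copy
    /// can be dropped instead of emitted.
    fn visit(&mut self, key: InstKey) -> (u32, bool) {
        if let Some(&v) = self.map.get(&key) {
            return (v, true);
        }
        let v = self.next_value;
        self.next_value += 1;
        self.map.insert(key.clone(), v);
        self.scopes.last_mut().unwrap().push(key);
        (v, false)
    }
}

fn main() {
    let udiv = |a, b| InstKey { opcode: "udiv", args: vec![a, b] };
    let mut gvn = ScopedGvn::new();
    let (v0, _) = gvn.visit(udiv(1, 2)); // entry block
    gvn.enter_scope(); // a block dominated by the entry block
    assert_eq!(gvn.visit(udiv(1, 2)), (v0, true)); // merged with dominator's copy
    assert!(!gvn.visit(udiv(3, 4)).1); // first time seen: kept
    gvn.leave_scope();
    gvn.enter_scope(); // a sibling block, not dominated by the previous one
    assert!(!gvn.visit(udiv(3, 4)).1); // not merged: no dominating copy
    gvn.leave_scope();
}
```
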
Alex Crichton
f05babc744 x64: Add shuffle cases for punpck{h,l}bw (#5905)
* x64: Add `shuffle` cases for `punpck{h,l}bw`

I noticed this difference between LLVM and Cranelift for something I was
looking at recently, and while it's probably not all that common I
figured I'd add it here since it should be somewhat useful nevertheless.

* Review feedback

* Use u128 extractor instead
2023-03-01 21:49:00 +00:00
Alex Crichton
e0ef0b7c72 x64: Add support for phadd{w,d} instructions (#5896)
This commit adds support for the bare lowering of the `iadd_pairwise`
instruction with `i16x8` and `i32x4` types on the x64 backend. These
lowerings are achieved with the `phaddw` and `phaddd` instructions,
respectively. Additionally AVX encodings of these instructions are added
too.

The motivation for these new lowerings comes from the relaxed-simd
proposal which will use them in the deterministic lowering of some
instructions on the x64 backend.
2023-02-28 23:35:53 +00:00
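
For reference, a scalar Rust model of the `iadd_pairwise` semantics being lowered here (an illustration, not backend code): adjacent lanes of each input are summed, with the first input's sums filling the low half of the result and the second input's sums the high half.

```rust
fn iadd_pairwise_i16x8(x: [i16; 8], y: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..4 {
        out[i] = x[2 * i].wrapping_add(x[2 * i + 1]); // low half from `x`
        out[i + 4] = y[2 * i].wrapping_add(y[2 * i + 1]); // high half from `y`
    }
    out
}

fn main() {
    let x = [1, 2, 3, 4, 5, 6, 7, 8];
    let y = [10, 20, 30, 40, 50, 60, 70, 80];
    assert_eq!(iadd_pairwise_i16x8(x, y), [3, 7, 11, 15, 30, 70, 110, 150]);
}
```
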
yuyang
32cfd60877 fix codegen riscv64 normalize_cmp_value. (#5873)
* fix issue5839

* add target.

* fix normalize_cmp_value.

* fix test failure.

* fix test failure.

* fix parameter type.

* Update cranelift/codegen/src/isa/riscv64/inst.isle

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Update cranelift/codegen/src/isa/riscv64/lower.isle

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* remove convert rule from IntCC to ExtendOp

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-28 23:00:23 +00:00
Afonso Bordado
2dd6064005 fuzzgen: Generate multiple functions per testcase (#5765)
* fuzzgen: Generate multiple functions per testcase

* fuzzgen: Fix typo

Co-authored-by: Jamey Sharp <jamey@minilop.net>

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-28 18:47:09 +00:00
Alex Crichton
f2dce812c3 x64: Sink constant loads into xmm instructions (#5880)
A number of places in the x64 backend make use of 128-bit constants for
various wasm SIMD-related instructions although most of them currently
use the `x64_xmm_load_const` helper to load the constant into a
register. Almost all xmm instructions, however, enable using a memory
operand which means that these loads can be folded into instructions to
help reduce register pressure. Automatic conversions were added for a
`VCodeConstant` into an `XmmMem` value and then explicit loads were all
removed in favor of forwarding the `XmmMem` value directly to the
underlying instruction. Note that some instances of `x64_xmm_load_const`
remain since they're used in contexts where load sinking won't work
(e.g. they're the first operand, not the second for non-commutative
instructions).
2023-02-27 22:02:42 +00:00
Alex Crichton
9b86a0b9b1 Remove the widening_pairwise_dot_product_s clif instruction (#5889)
This was added for the wasm SIMD proposal but I've been poking around at
this recently and the instruction can instead be represented by its
component parts with the same semantics I believe. This commit removes
the instruction and instead represents it with the existing
`iadd_pairwise` instruction (among others) and updates backends to with
new pattern matches to have the same codegen as before.

Interestingly, this entirely removed the codegen rule on the AArch64
backend with no replacement needed, as the existing rules already
produce the same codegen.
2023-02-27 18:43:43 +00:00
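
A scalar Rust model of what the removed instruction computed (illustrative only): each adjacent pair of signed 16-bit lanes is multiplied with the matching pair of the other operand and the two widened products are added, which is the behavior now expressed through widening ops plus `iadd_pairwise`.

```rust
fn dot_i16x8_to_i32x4(x: [i16; 8], y: [i16; 8]) -> [i32; 4] {
    core::array::from_fn(|i| {
        let lo = (x[2 * i] as i32) * (y[2 * i] as i32);
        let hi = (x[2 * i + 1] as i32) * (y[2 * i + 1] as i32);
        lo.wrapping_add(hi)
    })
}

fn main() {
    let x = [1, 2, 3, 4, 5, 6, 7, 8];
    let y = [1, 1, 1, 1, 2, 2, 2, 2];
    assert_eq!(dot_i16x8_to_i32x4(x, y), [3, 7, 22, 30]);
}
```
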
Jamey Sharp
6cf7155052 Cranelift: Generalize (x << k) >> k optimization (#5746)
* Generalize unsigned `(x << k) >> k` optimization

Split the existing rule into three parts:
- A dual of the rule for `(x >> k) << k` that is only valid for unsigned
  shifts.
- Known-bits analysis for `(band (uextend x) k)`.
- A new rule for converting `sextend` to `uextend` if the sign-extended
  bits are masked out anyway.

The first two together cover the existing rule.

* Generalize signed `(x << k) >> k` optimization

* Review comments

* Generalize sign-extending shifts further

The shifts can be eliminated even if the shift amount isn't exactly
equal to the difference in bit-widths between the narrow and wide types.

* Add filetests
2023-02-27 17:34:46 +00:00
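
The unsigned half of the optimization rests on a simple identity: `(x << k) >> k` just masks off the top `k` bits. A quick numeric check in Rust (not the ISLE rules themselves):

```rust
fn main() {
    let k = 24u32;
    for x in [0u32, 1, 0xff, 0x1234_5678, u32::MAX] {
        // Unsigned: shifting left then right by k is a mask of the low bits.
        assert_eq!((x << k) >> k, x & ((1u32 << (32 - k)) - 1));
    }
    // The signed (arithmetic-shift) variant instead sign-extends the low
    // 8 bits, which is why it corresponds to an `sextend` of a narrow value.
    assert_eq!(((0xabu32 << 24) as i32) >> 24, -85);
}
```
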
yuyang
3864286596 fix issue 5714. (#5845)
* fix issue 5714.

* add target for regression test.

* remove x86_64 test because of not implemented.
2023-02-26 16:25:38 +00:00
Afonso Bordado
36e92add6f riscv64: Move is_null/is_invalid to ISLE (#5874)
* riscv64: Move `is_null`/`is_invalid` to ISLE

* riscv64: Fix `is_invalid` codegen

* Implement review suggestions

Thanks!

Co-authored-by: Jamey Sharp <jamey@minilop.net>

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-25 12:48:44 +00:00
Jamey Sharp
7d790fcdfe x64: Only branch once in br_table (#5850)
This uses the `cmov`, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.

Since there isn't a bounds-check branch any more, this sequence no
longer needs Spectre mitigation. And since we don't need to be careful
about preserving flags, half the instructions can be removed from this
pseudoinstruction and emitted as regular instructions instead.

This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.

My benchmark results show a very small effect on runtime performance
with this change.

The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.

The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executes a few more instructions per br_table,
but maybe the branches were predicted better.
2023-02-24 04:46:38 +00:00
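
A sketch of the dispatch strategy in scalar Rust (illustrative; `min` stands in for the branchless `cmov` clamp): the default target becomes the last table entry and the index is clamped instead of bounds-checked.

```rust
fn dispatch(table: &[u32], index: usize) -> u32 {
    // The last entry is the default target; clamping replaces the separate
    // bounds-check branch, so only the indexed jump remains.
    let clamped = index.min(table.len() - 1);
    table[clamped]
}

fn main() {
    // Three real targets (blocks 10, 20, 30) plus the default (block 99).
    let table = [10, 20, 30, 99];
    assert_eq!(dispatch(&table, 1), 20);
    assert_eq!(dispatch(&table, 3), 99); // exactly at the old bound
    assert_eq!(dispatch(&table, 1000), 99); // out-of-range hits the default
}
```
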
Trevor Elliott
c5d9d5b10f Remove module-level code generation tests (#5870)
* Remove module-level code generation tests

* Add cold block tests for each backend

* Better cold block tests
2023-02-24 01:19:26 +00:00
Alex Crichton
3fc3bc9ec8 x64: Fill out more AVX instructions (#5849)
* x64: Fill out more AVX instructions

This commit fills out more AVX instructions for SSE counterparts
currently used. Many of these instructions do not benefit from the
3-operand form that AVX uses but instead benefit from being able to use
`XmmMem` instead of `XmmMemAligned` which may be able to avoid some
extra temporary registers in some cases.

* Review comments
2023-02-23 22:31:31 +00:00
Trevor Elliott
8abfe928d6 Reuse the DominatorTree postorder traversal in BlockLoweringOrder (#5843)
* Rework the blockorder module to reuse the dom tree's cfg postorder

* Update domtree tests

* Treat br_table with an empty jump table as multiple block exits

* Bless tests

* Change branch_idx to succ_idx and fix the comment
2023-02-23 22:05:20 +00:00
Ulrich Weigand
4314210162 s390x: Fix implementation of {s,u}{min,max} (#5864)
When expanding a min/max operation to a pair of icmp + select,
do not attempt to expand the input value operands twice, as
this might fail with memory operands.

Fixes https://github.com/bytecodealliance/wasmtime/issues/5859.
2023-02-23 20:01:51 +00:00
Afonso Bordado
fc080c739e fuzzgen: Add AtomicRMW (#5861) 2023-02-23 18:34:28 +00:00
Ulrich Weigand
9719147f91 s390x: Fix integer overflow during negation (#5866)
Use wrapping_neg in i{64,32,16}_from_negated_value to avoid Rust
aborts due to integer overflow.  The resulting INT_MIN is already
handled correctly in subsequent operations.

Fixes https://github.com/bytecodealliance/wasmtime/issues/5863.
2023-02-23 16:32:10 +00:00
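
The overflow being avoided is easy to reproduce directly; a small Rust illustration (not the s390x backend code itself):

```rust
fn main() {
    let x = i64::MIN;
    // Plain negation of i64::MIN overflows (and panics in debug builds).
    assert_eq!(x.checked_neg(), None);
    // wrapping_neg yields i64::MIN again, which later operations handle.
    assert_eq!(x.wrapping_neg(), i64::MIN);
}
```
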
Afonso Bordado
f6c6bc2155 riscv64: Improve signed and zero extend codegen (#5844)
* riscv64: Remove unused code

* riscv64: Group extend rules

* riscv64: Remove more unused rules

* riscv64: Cleanup existing extension rules

* riscv64: Move the existing Extend rules to ISLE

* riscv64: Use `sext.w` when extending

* riscv64: Remove duplicate extend tests

* riscv64: Use `zbb` instructions when extending values

* riscv64: Use `zbkb` extensions when zero extending

* riscv64: Enable additional tests for extend i128

* riscv64: Fix formatting for `Inst::Extend`

* riscv64: Reverse register for pack

* riscv64: Misc Cleanups

* riscv64: Cleanup extend rules
2023-02-22 17:41:14 +00:00
Afonso Bordado
6e6a1034d7 riscv64: Add bitmanip extension flags (#5847) 2023-02-21 22:12:44 +00:00