* Enable the native target by default in winch
Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.
* Refactor Cranelift's ISA builder to share more with Winch
This commit refactors the `Builder` type to have a type parameter
representing the finished ISA, with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.
* Moving compiler shared components to a separate crate
* Restore native flag inference in compiler building
This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.
* Move `Compiler::page_size_align` into wasmtime-environ
The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.
* Fill out `Compiler::{flags, isa_flags}` for Winch
These are easy enough to plumb through with some shared code for
Wasmtime.
* Plumb the `is_branch_protection_enabled` flag for Winch
Just forwarding an isa-specific setting accessor.
* Moving executable creation to shared compiler crate
* Adding builder back in and removing from shared crate
* Refactoring the shared pieces for the `CompilerBuilder`
I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.
With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.
I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstractions: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.
* Documentation updates, and renaming the finish method
* Adding back in a default temporarily to pass tests, and removing some unused imports
* Fixing winch tests with wrong method name
* Removing unused imports from codegen shared crate
* Apply documentation formatting updates
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Adding back in cranelift_native flag inference
* Adding new shared crate to publish list
* Adding write feature to pass cargo check
---------
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Remove the Cranelift `vselect` instruction
This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:
* x64 - uses the high bit of each lane for `f32x4` and `f64x2`, and otherwise
uses the high bit of each byte to do a byte-wise lane select rather than a
select based on the controlling type.
* AArch64 - this is the same as `bitselect` which is a bit-wise
selection rather than a lane-wise selection.
* s390x - this is the same as AArch64, a bit-wise selection rather than
lane-wise.
* interpreter - the interpreter implements the documented semantics of
selecting based on "truthy" values.
Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.
Given this situation, this commit subsequently removes `vselect` and all
usage of it throughout Cranelift.
Closes #5917
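For reference, `bitselect` is a purely bit-wise selection, which is why it can take over the nan-canonicalization role: when the condition comes from a comparison (all-ones or all-zeros lanes), bit-wise and lane-wise selection agree. A minimal scalar sketch of that semantics (illustrative only, not Cranelift code):

```rust
/// Bit-wise select: for each bit, take `if_true` where the corresponding bit
/// of `cond` is 1, and `if_false` where it is 0.
fn bitselect(cond: u64, if_true: u64, if_false: u64) -> u64 {
    (if_true & cond) | (if_false & !cond)
}

fn main() {
    // With an all-ones or all-zeros condition (as produced by a comparison),
    // bit-wise and lane-wise selection coincide.
    assert_eq!(bitselect(u64::MAX, 1, 2), 1);
    assert_eq!(bitselect(0, 1, 2), 2);
}
```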
* Review comments
* Bring back vselect opts as bitselect opts
* Clean up vselect usage in the interpreter
* Move bitcast in nan canonicalization
* Add a comment about float optimization
This fixes an issue where block params were always listed as being
members of the current block in egraphs, even when the block param was
actually defined in a separate block. This then enables instructions
which depend on these parameters to get hoisted up out of inner loops at
least to the block that defined the argument.
Closes #5957
* Initial support for the Relaxed SIMD proposal
This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen is supported on the x64 and AArch64
backends at this time.
The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.
A summary of changes made in this commit are:
* Lowerings for all relaxed simd opcodes have been added, currently all
exhibiting deterministic behavior. This means that few lowerings are
optimal on the x86 backend, but on the AArch64 backend, for example,
all lowerings should be optimal.
* Support is added to codegen to, eventually, conditionally generate
different code based on input codegen flags. This is intended to
enable codegen of more efficient instructions on x86 by default, for
example, while still allowing embedders to force
architecture-independent semantics and behavior. One good example of
this is the `f32x4.relaxed_fmadd` instruction, which when deterministic
forces the `fma` instruction, but otherwise, if the backend doesn't
have support for `fma`, intermediate operations are performed
instead (see the sketch after this list).
* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
x86 backend as they're now exercised by the deterministic lowerings of
relaxed simd instructions.
* Sample codegen tests were added for x86 and AArch64 for some relaxed
simd instructions.
* Wasmtime embedder support for the relaxed-simd proposal and forcing
determinism have been added to `Config` and the CLI.
* Support has been added to the `*.wast` runtime execution for the
`(either ...)` matcher used in the relaxed-simd proposal.
* Tests for relaxed-simd are run both with a default `Engine` as well as
a "force deterministic" `Engine` to test both configurations.
* All tests from the upstream repository were copied into Wasmtime.
These tests should be deleted when WebAssembly/testsuite is updated.
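To make the determinism point above concrete, here is a purely illustrative single-lane sketch of the difference between the fused (deterministic) lowering of `f32x4.relaxed_fmadd` and an unfused fallback; it is not Wasmtime code, just the numeric behavior being described:

```rust
/// Deterministic lowering: a fused multiply-add with a single rounding step.
fn fused_madd(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

/// Fallback when the backend has no FMA support: two rounding steps, so the
/// result may differ in the last bit under relaxed semantics.
fn unfused_madd(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

fn main() {
    let (a, b, c) = (0.1_f32, 10.0, -1.0);
    // The two results are allowed to differ under relaxed semantics.
    println!("{} vs {}", fused_madd(a, b, c), unfused_madd(a, b, c));
}
```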
* x64: Add x86-specific lowerings for relaxed simd
This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
* Add libcall support for fma to Wasmtime
This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
* Ignore relaxed-simd tests on s390x and riscv64
* Enable relaxed-simd tests on s390x
* Update cranelift/codegen/meta/src/shared/instructions.rs
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add a FIXME from review
* Add notes about deterministic semantics
* Don't default `has_native_fma` to `true`
* Review comments and rebase fixes
---------
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add checks to `InterpreterState::checked_{load,store}` to trap on misaligned memory accesses
where the `aligned` memory flag is set (see the sketch after this list).
* Alter `stack_{load,store}` instructions to now rely on `MemFlags::new()` instead of
`MemFlags::trusted` since `InterpreterState::checked_{load,store}` is only able to
deduce type alignment and not stack slot alignment.
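A hypothetical simplification of the check being added (names and the trap mechanism are illustrative, not the interpreter's actual code):

```rust
/// When the `aligned` flag is set, an access whose address is not a multiple
/// of the accessed type's alignment traps. Note that only the type's
/// alignment is known here, not a stack slot's alignment, which is why
/// `stack_{load,store}` now use `MemFlags::new()` rather than
/// `MemFlags::trusted()`.
fn check_alignment(addr: u64, type_align: u64, aligned_flag: bool) -> Result<(), &'static str> {
    if aligned_flag && addr % type_align != 0 {
        return Err("misaligned access"); // the interpreter raises a trap here
    }
    Ok(())
}

fn main() {
    assert!(check_alignment(8, 4, true).is_ok());
    assert!(check_alignment(6, 4, true).is_err());
    assert!(check_alignment(6, 4, false).is_ok()); // allowed without the flag
}
```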
This notably updates `wasmparser` for updates to the relaxed-simd
proposal and an implementation of the function-references proposal.
Additionally there are some minor bug fixes being picked up for WIT and
the component model.
* Allow hoisting `vconst` instructions out of loops
Staring at some SIMD code and what LLVM and v8 both generate, it appears
that a common technique for SIMD loops is to hoist constants outside of
loops, since they're nontrivial to rematerialize, unlike integer
constants. This commit updates the `loop_hoist_level` calculation with
egraphs to have a nonzero default for instructions that have no
arguments (e.g. consts) which enables hoisting these instructions out of
loops.
Note, though, that for now I've listed the maximum as hoisting outside
of one loop, but not all of them. While theoretically vconsts could move
up to the top of the function I'd be worried about their impact on
register pressure and having to save/restore around calls or similar, so
hopefully if the hot part of a program is a single loop then hoisting
out of one loop is a reasonable-enough heuristic for now.
Locally on x64 with a benchmark that just encodes binary to hex this saw
a 15% performance improvement taking hex encoding from ~6G/s to ~6.7G/s.
* Test vconst is only hoisted one loop out
* Fix issue #5884.
* Fix test failure.
* Fix atomic rmw missing a move of the result to the dst register.
* Specify little endian so the s390x tests can pass.
* Added `mem_flags` parameter to `State::checked_{load,store}` as the means
for determining the endianness, typically derived from an instruction.
* Added `native_endianness` property to `InterpreterState` as a fallback when
determining endianness, such as in cases where there are no memory flags
available or set (see the sketch after this list).
* Added `to_be` and `to_le` methods to `DataValue`.
* Added `AtomicCas` and `AtomicRmw` to list of instructions with retrievable
memory flags for `InstructionData::memflags`.
* Enabled `atomic-{cas,rmw}-subword-{big,little}.clif` for interpreter run
tests.
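An illustrative sketch of the endianness handling described above, with made-up helper names standing in for the interpreter's actual types:

```rust
#[derive(Clone, Copy)]
enum Endianness {
    Big,
    Little,
}

/// Prefer the endianness recorded in the instruction's memory flags, falling
/// back to the interpreter's native endianness when the flags don't specify one.
fn effective_endianness(flag_endianness: Option<Endianness>, native: Endianness) -> Endianness {
    flag_endianness.unwrap_or(native)
}

/// Byte-order conversion in the spirit of the `to_be`/`to_le` additions to
/// `DataValue`, shown here for a plain integer.
fn to_memory_order(value: u32, endianness: Endianness) -> [u8; 4] {
    match endianness {
        Endianness::Big => value.to_be_bytes(),
        Endianness::Little => value.to_le_bytes(),
    }
}

fn main() {
    let le = to_memory_order(0x0102_0304, effective_endianness(None, Endianness::Little));
    assert_eq!(le, [0x04, 0x03, 0x02, 0x01]);
    assert_eq!(to_memory_order(0x0102_0304, Endianness::Big), [1, 2, 3, 4]);
}
```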
This commit adds lowerings to the AArch64 backend for the `fmls`
instruction which is intended to be leveraged in the relaxed-simd
proposal for WebAssembly. This should hopefully allow for a
teeny-bit-more efficient codegen for this operator instead of using the
`fmla` instruction plus a negation instruction.
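Per lane, the difference is just the sign of the product term; a small illustrative sketch (not the backend lowering itself):

```rust
/// `fmla` accumulates `acc + a * b` with one fused rounding step.
fn fmla(acc: f32, a: f32, b: f32) -> f32 {
    a.mul_add(b, acc)
}

/// `fmls` computes `acc - a * b` in a single fused operation, which is what
/// previously required a separate negation followed by `fmla`.
fn fmls(acc: f32, a: f32, b: f32) -> f32 {
    (-a).mul_add(b, acc)
}

fn main() {
    assert_eq!(fmls(10.0, 2.0, 3.0), 4.0);
    assert_eq!(fmls(10.0, 2.0, 3.0), fmla(10.0, -2.0, 3.0));
}
```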
This catches a case that wasn't handled previously by #5880 to allow a
constant load to be folded into an instruction rather than forcing it to
be loaded into a temporary register.
* Revert "egraphs: disable GVN of effectful idempotent ops (temporarily). (#5808)"
This reverts commit c7e2571866.
* egraphs: fix handling of effectful-but-idempotent ops and GVN.
This PR addresses #5796: currently, ops that are effectful, i.e., remain
in the side-effecting skeleton (which we keep in the `Layout` while the
egraph exists), but are idempotent and thus mergeable by a GVN pass, are
not handled properly.
GVN is still possible on effectful but idempotent ops precisely because
our GVN does not create partial redundancies: it removes an instruction
only when it is dominated by an identical instruction. An instruction
will not be "hoisted" to a point where it could execute in the optimized
code but not in the original.
However, there are really two parts to the egraph implementation that
produce this effect: the deduplication on insertion into the egraph, and
the elaboration with a scoped hashmap. The deduplication lets us give a
single name (value ID) to all copies of an identical instruction, and
then elaboration will re-create duplicates if GVN should not hoist or
merge some of them.
Because deduplication need not worry about dominance or scopes, we use a
simple (non-scoped) hashmap to dedup/intern ops as "egraph nodes".
When we added support for GVN'ing effectful but idempotent ops (#5594),
we kept the use of this simple dedup'ing hashmap, but these ops do not
get elaborated; instead they stay in the side-effecting skeleton. Thus,
we inadvertently created potential for weird code-motion effects.
The proposal in #5796 would solve this in a clean way by treating these
ops as pure again, and keeping them out of the skeleton, instead putting
"force" pseudo-ops in the skeleton. However, this is a little more
complex than I would like, and I've realized that @jameysharp's earlier
suggestion is much simpler: we can keep an actual scoped hashmap
separately just for the effectful-but-idempotent ops, and use it to GVN
while we build the egraph. In effect, we're fusing a separate GVN pass
with the egraph pass (but letting it interact corecursively with
egraph rewrites). This is in principle similar to how we keep a separate
map for loads and fuse this pass with the egraph rewrite pass as well.
Note that we can use a `ScopedHashMap` here without the "context" (as
needed by `CtxHashMap`) because, as noted by @jameysharp, in practice
the ops we want to GVN have all their args inline. Equality on the
`InstructionData` itself is conservative: two insts whose struct
contents compare shallowly equal are definitely identical, but identical
insts in a deep-equality sense may not compare shallowly equal, due to
list indirection. This is fine for GVN, because it is still sound to
skip any given GVN opportunity (and keep the original instructions).
Fixes #5796.
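To illustrate the fused GVN-with-scoped-map idea, here is a self-contained toy sketch; the types (`InstKey`, `ScopedMap`) are stand-ins and deliberately much simpler than the real `ScopedHashMap`/`InstructionData`:

```rust
use std::collections::HashMap;

/// Stand-in for `InstructionData`: shallow equality on opcode plus inline
/// args is a conservative key, exactly as described above.
#[derive(Clone, PartialEq, Eq, Hash)]
struct InstKey {
    opcode: &'static str,
    args: Vec<u32>, // value numbers of the arguments
}

/// Toy scoped hash map: one `HashMap` per open dominator-tree scope. Lookups
/// walk from the innermost scope outward, so an entry is only visible while
/// we are dominated by the block that inserted it.
struct ScopedMap {
    scopes: Vec<HashMap<InstKey, u32>>,
}

impl ScopedMap {
    fn new() -> Self {
        Self { scopes: vec![HashMap::new()] }
    }
    fn enter_scope(&mut self) {
        self.scopes.push(HashMap::new());
    }
    fn leave_scope(&mut self) {
        self.scopes.pop();
    }
    fn get(&self, key: &InstKey) -> Option<u32> {
        self.scopes.iter().rev().find_map(|s| s.get(key).copied())
    }
    fn insert(&mut self, key: InstKey, value: u32) {
        self.scopes.last_mut().unwrap().insert(key, value);
    }
}

/// GVN step for an effectful-but-idempotent op: if a dominating identical
/// instruction already produced a value, alias to it; otherwise record this
/// instruction's result. Skipping a GVN opportunity is always sound.
fn gvn_idempotent(map: &mut ScopedMap, key: InstKey, result: u32) -> u32 {
    match map.get(&key) {
        Some(existing) => existing,
        None => {
            map.insert(key, result);
            result
        }
    }
}

fn main() {
    let mut map = ScopedMap::new();
    let key = InstKey { opcode: "udiv", args: vec![1, 2] };
    let first = gvn_idempotent(&mut map, key.clone(), 10);
    map.enter_scope(); // e.g. entering a dominated block
    let second = gvn_idempotent(&mut map, key, 11);
    assert_eq!(first, second); // the second instruction reuses the first result
}
```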
* Add comments from review.
* x64: Add `shuffle` cases for `punpck{h,l}bw`
I noticed this difference between LLVM and Cranelift for something I was
looking at recently, and while it's probably not all that common I
figured I'd add it here since it should be somewhat useful nevertheless.
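For context, these are the byte-interleaving patterns the two unpack instructions implement, written as `shuffle` lane indices (0..16 select bytes of the first input, 16..32 bytes of the second); the reference `shuffle` below is only a model of the semantics, not the backend rule:

```rust
/// Patterns implemented by `punpcklbw` and `punpckhbw`, respectively.
const PUNPCKLBW: [u8; 16] = [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23];
const PUNPCKHBW: [u8; 16] = [8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31];

/// Reference semantics of a byte shuffle on two 16-byte inputs.
fn shuffle(a: [u8; 16], b: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    mask.map(|i| if i < 16 { a[i as usize] } else { b[i as usize - 16] })
}

fn main() {
    let a: [u8; 16] = core::array::from_fn(|i| i as u8); // 0..16
    let b: [u8; 16] = core::array::from_fn(|i| (i as u8) + 16); // 16..32
    // Low unpack interleaves the low halves, high unpack the high halves.
    assert_eq!(&shuffle(a, b, PUNPCKLBW)[..4], &[0, 16, 1, 17]);
    assert_eq!(&shuffle(a, b, PUNPCKHBW)[..4], &[8, 24, 9, 25]);
}
```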
* Review feedback
* Use u128 extractor instead
This instruction is only defined with i8x16 inputs and outputs so
there's no need for a type variable, so shadow the otherwise-generic `a`
result with a concrete i8x16 type.
This commit adds support for the bare lowering of the `iadd_pairwise`
instruction with `i16x8` and `i32x4` types on the x64 backend. These
lowerings are achieved with the `phaddw` and `phaddd` instructions,
respectively. AVX encodings of these instructions are added as well.
The motivation for these new lowerings comes from the relaxed-simd
proposal which will use them in the deterministic lowering of some
instructions on the x64 backend.
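For reference, the pairwise-add behavior being lowered (which is also what `phaddw` computes for `i16x8`) can be modeled like this; illustrative only, not the backend lowering:

```rust
/// Adjacent lanes of `x` are summed into the low half of the result and
/// adjacent lanes of `y` into the high half.
fn iadd_pairwise_i16x8(x: [i16; 8], y: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..4 {
        out[i] = x[2 * i].wrapping_add(x[2 * i + 1]);
        out[4 + i] = y[2 * i].wrapping_add(y[2 * i + 1]);
    }
    out
}

fn main() {
    let x = [1, 2, 3, 4, 5, 6, 7, 8];
    let y = [10, 20, 30, 40, 50, 60, 70, 80];
    assert_eq!(iadd_pairwise_i16x8(x, y), [3, 7, 11, 15, 30, 70, 110, 150]);
}
```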
A number of places in the x64 backend make use of 128-bit constants for
various wasm SIMD-related instructions although most of them currently
use the `x64_xmm_load_const` helper to load the constant into a
register. Almost all xmm instructions, however, enable using a memory
operand which means that these loads can be folded into instructions to
help reduce register pressure. Automatic conversions were added from a
`VCodeConstant` into an `XmmMem` value, and then explicit loads were all
removed in favor of forwarding the `XmmMem` value directly to the
underlying instruction. Note that some instances of `x64_xmm_load_const`
remain since they're used in contexts where load sinking won't work
(e.g. they're the first operand, not the second for non-commutative
instructions).
This was added for the wasm SIMD proposal, but I've been poking around at
this recently and I believe the instruction can instead be represented by
its component parts with the same semantics. This commit removes the
instruction and instead represents it with the existing `iadd_pairwise`
instruction (among others), updating backends with new pattern matches to
keep the same codegen as before.
Interestingly, this entirely removed the codegen rule on the AArch64
backend with no replacement, as the existing rules already produced the
same codegen.
* Generalize unsigned `(x << k) >> k` optimization
Split the existing rule into three parts:
- A dual of the rule for `(x >> k) << k` that is only valid for unsigned
shifts.
- Known-bits analysis for `(band (uextend x) k)`.
- A new rule for converting `sextend` to `uextend` if the sign-extended
bits are masked out anyway.
The first two together cover the existing rule.
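A small worked example of the identities these rules rely on (illustrative, not the ISLE rules themselves):

```rust
fn main() {
    // For an unsigned value, shifting left then logically right by the same
    // amount k is the same as masking off the top k bits:
    //     (x << k) >> k  ==  x & (MAX >> k)
    let x: u32 = 0xDEAD_BEEF;
    for k in 0..32 {
        assert_eq!((x << k) >> k, x & (u32::MAX >> k));
    }

    // Sign-extension followed by such a mask cannot observe the extended sign
    // bits, which is why `sextend` can be rewritten to `uextend` there.
    let narrow: i8 = -1;
    let sext = narrow as i32 as u32; // 0xFFFF_FFFF
    let uext = narrow as u8 as u32; // 0x0000_00FF
    assert_eq!(sext & 0xFF, uext & 0xFF);
}
```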
* Generalize signed `(x << k) >> k` optimization
* Review comments
* Generalize sign-extending shifts further
The shifts can be eliminated even if the shift amount isn't exactly
equal to the difference in bit-widths between the narrow and wide types.
* Add filetests
For instructions with no results (such as branches and stores) or
instructions with multiple results (such as add with carry), we have
assertions checking that an optimization rule doesn't try to match on
or construct such instructions.
When we generate terms for matching or constructing instructions, the
terms for these instructions are guaranteed to panic if they're ever
used. So let's just not generate them.
In the future we may wish to generate terms with different types for
these instructions, to make them usable in ISLE rules for optimization
that fall outside our current egraph constraints.
This uses the `cmov` instruction, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.
Since there isn't a bounds-check branch any more, this sequence no
longer needs Spectre mitigation. And since we don't need to be careful
about preserving flags, half the instructions can be removed from this
pseudoinstruction and emitted as regular instructions instead.
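An illustrative model of the resulting dispatch logic (not the emitted machine code): the clamp plays the role of the `cmov`, and the default target occupies the last table slot:

```rust
/// Clamp the index instead of bounds-checking it with a branch, and place the
/// default target as the final table entry so one indirect jump handles both
/// the in-range and out-of-range cases.
fn br_table_target(index: usize, in_table_targets: &[u32], default_target: u32) -> u32 {
    let mut table = in_table_targets.to_vec();
    table.push(default_target); // default is the last entry
    // `min` models the cmp + cmov clamp; there is no conditional branch here.
    let clamped = index.min(table.len() - 1);
    table[clamped]
}

fn main() {
    let targets = [100, 200, 300];
    assert_eq!(br_table_target(1, &targets, 999), 200);
    assert_eq!(br_table_target(7, &targets, 999), 999); // out of range -> default
}
```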
This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.
My benchmark results show a very small effect on runtime performance
with this change.
The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.
The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executes a few more instructions per br_table,
but maybe the branches were predicted better.
* x64: Fill out more AVX instructions
This commit fills out more AVX instructions for SSE counterparts
currently used. Many of these instructions do not benefit from the
3-operand form that AVX uses but instead benefit from being able to use
`XmmMem` instead of `XmmMemAligned` which may be able to avoid some
extra temporary registers in some cases.
* Review comments
* Rework the blockorder module to reuse the dom tree's cfg postorder
* Update domtree tests
* Treat br_table with an empty jump table as multiple block exits
* Bless tests
* Change branch_idx to succ_idx and fix the comment
When expanding a min/max operation to a pair of icmp + select,
do not attempt to expand the input value operands twice, as
this might fail with memory operands.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5859.
Use wrapping_neg in i{64,32,16}_from_negated_value to avoid Rust
aborts due to integer overflow. The resulting INT_MIN is already
handled correctly in subsequent operations.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5863.
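The relevant edge case, illustrated in plain Rust (not the ISLE helper itself):

```rust
fn main() {
    // `-i32::MIN` overflows (it panics in debug builds and with
    // overflow-checks enabled); `wrapping_neg` yields `i32::MIN` again, which
    // the subsequent operations handle correctly as noted above.
    assert_eq!(i32::MIN.wrapping_neg(), i32::MIN);
    assert_eq!(5i32.wrapping_neg(), -5);
}
```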
As per the linked issue, atomic_rmw was implemented without specific regard for thread safety.
Additionally, the relevant filetest (atomic-rmw-little.clif) was enabled and altered to fix an
incorrect call to test function `%atomic_rmw_and_i64` after setting up test function
`%atomic_rmw_and_i32`.