* Remove no-std code for cranelift_codegen::timings
no-std mode isn't supported by Cranelift anymore
* Simplify define_passes macro
* Add egraph opt timings
* Replace the add_to_current api with PassTimes::add
* Omit a couple of unused time measurements
* Reduce divergence between run and run_passes a bit
* Introduce a Profiler trait
This allows plugging in external profilers into the Cranelift profiling
framework.
* Add Pass::description method
* Remove duplicate usage of the compile pass timing
* Rustfmt
Maps to the corresponding `wasmtime::Config` option. The motivation here
is largely completeness and was something I was looking into with the
failures in #5970
* Update the spec test suite submodule
Delete the local copies of the relaxed-simd test suite as well as
they're now incorporated.
Closes#5914
* Remove page guards in QEMU emulation
Otherwise `(memory 0 0)` was being compiled as a static memory with huge
guards which we're trying to avoid in QEMU.
Under some test case layouts the call relocation
panicking with an underflow. Use `wrapping_sub` to
signal that this is expected.
The fuzzer took a while to generate such a test case.
And I can't introduce it as a regression test because
when running via the regular clif-util run tests the
layout is different and the test case passes!
I think this is because in the fuzzer we only add
one trampoline, while in clif-util we build trampolines
for each funcion in the file.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* x64: Optimize store-of-extract-lane-0
The `movss` and `movsd` instructions can be used to store the 0th lane
of a `t32x4` or a `t64x2` vector into memory, enabling fusing a `store`
and an `extractlane` instruction.
* Fix merge conflict with `main`
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`
Add some special cases for `shuffle` for more specialized x86
instructions.
* x64: Add `shuffle` lowerings for `pshufd`
This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.
* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`
This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.
* x64: Add `shuffle` lowerings for `shufps`
This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.
* x64: Add shuffle support for `pshuf{l,h}w`
This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.
* x64: Specialize `shuffle` with an all-zeros immediate
Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function instead generate a zero with a `pxor`
instruction and then use `pshufb` to do the broadcast.
* Review comments
* x64: Add an AVX encoding for the `pshufd` instruction
This will benefit from lack of need for alignment vs the `pshufd`
instruction if working with a memory operand and additionally, as I've
just learned, this reduces dependencies between instructions because the
`v*` instructions zero the upper bits as opposed to preserving them
which could accidentally create false dependencies in the CPU between
instructions.
* x64: Add more support for AVX loads/stores
This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument which ended up doing a small refactoring of
the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.
* x64: Enable using AVX instructions for zero regs
This commit refactors the internal ISLE helpers for creating zero'd
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions
with a bit of reorganization.
* x64: Remove `XmmConstOp` as an instruction
All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction so there's no longer any need for these otherwise
constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.
* Review comments
* Update test expectations
* x64: Add lowerings for `punpck{h,l}wd`
Add some special cases for `shuffle` for more specialized x86
instructions.
* x64: Add `shuffle` lowerings for `pshufd`
This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.
* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`
This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.
* x64: Add `shuffle` lowerings for `shufps`
This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.
* x64: Add shuffle support for `pshuf{l,h}w`
This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.
* x64: Specialize `shuffle` with an all-zeros immediate
Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function instead generate a zero with a `pxor`
instruction and then use `pshufb` to do the broadcast.
* Review comments
This will allow us to build developer tools for Wasmtime and Cranelift like WAT
and asm side-by-side viewers (a la Godbolt).
These are not proper public APIs, so they are marked `doc(hidden)` and have
comments saying they are only for use within this repo's workspace.
* Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s.
@avanhatt has been looking at our address-mode lowering and found an
example where when feeding an `I32`-typed address into a load or store,
we can violate assumptions and get incorrect codegen.
This should never be reachable in practice, because all producers on
64-bit architectures use 64-bit types for addresses. However, our IR is
insufficiently constrained, and allows loads/stores to `I32` addresses
as well. This is nonsensical on a 64-bit architecture.
Initially I had thought we should tighten either the instruction
definition's accepted types, or the CLIF verifier, to reject this.
However both are target-independent, and we don't want to bake
an assumption of 64-bit-ness into the compiler core. Instead this PR
tightens specific backends' lowerings to rejecct loads/stores of
`I32`-typed addresses.
tl;dr: no security implications as all producers use I64-typed
addresses (and must, for correct operation); but we currently accept
I32-typed addresses too, and this breaks other assumptions.
* Allow R64 as well as I64 types.
* Add an explicit extractor to match 64-bit address types.
We've adopted this pattern in Cranelift's instruction definitions where
we let-bind some calls to `Operand::new` and then later use them in one
or more calls to `Inst::new`.
That pattern has two problems:
- It puts the type of each operand somewhere potentially far removed
from the instruction in which it's used.
- We let-bind the same name for many different operands, compounding the
first problem by making it harder to find _which_ definition is used.
So instead this commit removes all let-bindings for operand definitions
and constructs a new `Operand` every time.
Constructing an `Operand` at every use means we duplicate some
documentation strings, but not all that many of them as it turns out.
I've left the let-bound type-sets alone, so those are currently still
shared across many instructions. They have some of the same problems and
should be reviewed as well.
This follows the same strategy pioneered by the `wit-bindgen` guest Rust
bindgen which keeps track of the latest name of an interface for how to
refer to an interface.
Closes#5961
I was debugging [an issue] recently where it appears that the underlying
cause was a discrepancy in the size/align of a WIT type between Wasmtime
and `wit-parser`. This commit adds compile-time assertions that the size
of a WIT type is the same with `wit-parser` as it is in Wasmtime since
the two have different systems to calculate the size of a type. The hope
is that this will head off any future issues if they crop up.
[an issue]: https://github.com/bytecodealliance/wit-bindgen/issues/526
* x64: Remove incorrect `amode_add` lowering rules
This commit removes two incorrect rules as part of the x64 backend's
computation of addressing modes. These two rules folded a zero-extended
32-bit computation into the address mode operand, but this isn't correct
as the 32-bit computation should be truncated to 32-bits but when folded
into the address mode computation it happens with 64-bit operands,
meaning truncation doesn't happen.
* Add release notes
This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16-and-above are mapped to select
from the first vector since the first and second element are the same,
but the subtraction was with 15 rather than 16 by accident.
* Enable the native target by default in winch
Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.
* Refactor Cranelift's ISA builder to share more with Winch
This commit refactors the `Builder` type to have a type parameter
representing the finished ISA with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.
* Moving compiler shared components to a separate crate
* Restore native flag inference in compiler building
This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.
* Move `Compiler::page_size_align` into wasmtime-environ
The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.
* Fill out `Compiler::{flags, isa_flags}` for Winch
These are easy enough to plumb through with some shared code for
Wasmtime.
* Plumb the `is_branch_protection_enabled` flag for Winch
Just forwarding an isa-specific setting accessor.
* Moving executable creation to shared compiler crate
* Adding builder back in and removing from shared crate
* Refactoring the shared pieces for the `CompilerBuilder`
I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.
With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.
I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstraction: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.
* Documentation updates, and renaming the finish method
* Adding back in a default temporarily to pass tests, and removing some unused imports
* Fixing winch tests with wrong method name
* Removing unused imports from codegen shared crate
* Apply documentation formatting updates
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Adding back in cranelift_native flag inferring
* Adding new shared crate to publish list
* Adding write feature to pass cargo check
---------
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Remove the Cranelift `vselect` instruction
This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:
* x64 - uses the high bit for `f32x4` and `f64x2` and otherwise uses the
high bit of each byte doing a byte-wise lane select rather than
whatever the controlling type is.
* AArch64 - this is the same as `bitselect` which is a bit-wise
selection rather than a lane-wise selection.
* s390x - this is the same as AArch64, a bit-wise selection rather than
lane-wise.
* interpreter - the interpreter implements the documented semantics of
selecting based on "truthy" values.
Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.
Given this situation this commit subsqeuently removes `vselect` and all
usage of it throughout Cranelift.
Closes#5917
* Review comments
* Bring back vselect opts as bitselect opts
* Clean up vselect usage in the interpreter
* Move bitcast in nan canonicalization
* Add a comment about float optimization
This fixes an issue where block params were always listed as being
members of the current block in egraphs, even when the block param was
actually defined in a separate block. This then enables instructions
which depend on these parameters to get hoisted up out of inner loops at
least to the block that defined the argument.
Closes#5957
Noted in #5954 this'll report `cargo vet` status checks on PRs that
modify the `supply-chain` directory in addition to `Cargo.lock`
modifications that already happen.
* Initial support for the Relaxed SIMD proposal
This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is supported on the x64 and
AArch64 backends on this time.
The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.
A summary of changes made in this commit are:
* Lowerings for all relaxed simd opcodes have been added, currently all
exhibiting deterministic behavior. This means that few lowerings are
optimal on the x86 backend, but on the AArch64 backend, for example,
all lowerings should be optimal.
* Support is added to codegen to, eventually, conditionally generate
different code based on input codegen flags. This is intended to
enable codegen to more efficient instructions on x86 by default, for
example, while still allowing embedders to force
architecture-independent semantics and behavior. One good example of
this is the `f32x4.relaxed_fmadd` instruction which when deterministic
forces the `fma` instruction, but otherwise if the backend doesn't
have support for `fma` then intermediate operations are performed
instead.
* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
x86 backend as they're now exercised by the deterministic lowerings of
relaxed simd instructions.
* Sample codegen tests for added for x86 and aarch64 for some relaxed
simd instructions.
* Wasmtime embedder support for the relaxed-simd proposal and forcing
determinism have been added to `Config` and the CLI.
* Support has been added to the `*.wast` runtime execution for the
`(either ...)` matcher used in the relaxed-simd proposal.
* Tests for relaxed-simd are run both with a default `Engine` as well as
a "force deterministic" `Engine` to test both configurations.
* All tests from the upstream repository were copied into Wasmtime.
These tests should be deleted when WebAssembly/testsuite is updated.
* x64: Add x86-specific lowerings for relaxed simd
This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
* Add libcall support for fma to Wasmtime
This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
* Ignore relaxed-simd tests on s390x and riscv64
* Enable relaxed-simd tests on s390x
* Update cranelift/codegen/meta/src/shared/instructions.rs
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add a FIXME from review
* Add notes about deterministic semantics
* Don't default `has_native_fma` to `true`
* Review comments and rebase fixes
---------
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add checks to `InterpreterState::checked_{load,store}` to trap on misaligned memory accesses
where `aligned` memory flag is set.
* Alter `stack_{load,store}` instructions to now rely on `MemFlags::new()` instead of
`MemFlags::trusted` since `InterpreterState::checked_{load,store}` is only able to
deduce type alignment and not stack slot alignment.
This notably updates `wasmparser` for updates to the relaxed-simd
proposal and an implementation of the function-references proposal.
Additionally there are some minor bug fixes being picked up for WIT and
the component model.
* Allow hoisting `vconst` instructions out of loops
Staring at some SIMD code and what LLVM and v8 both generate it appears
that a common technique for SIMD-loops is to hoist constants outside of
loops since they're nontrivial to rematerialize unlike integer
constants. This commit updates the `loop_hoist_level` calculation with
egraphs to have a nonzero default for instructions that have no
arguments (e.g. consts) which enables hoisting these instructions out of
loops.
Note, though, that for now I've listed the maximum as hoisting outside
of one loop, but not all of them. While theoretically vconsts could move
up to the top of the function I'd be worried about their impact on
register pressure and having to save/restore around calls or similar, so
hopefully if the hot part of a program is a single loop then hoisting
out of one loop is a reasonable-enough heuristic for now.
Locally on x64 with a benchmark that just encodes binary to hex this saw
a 15% performance improvement taking hex encoding from ~6G/s to ~6.7G/s.
* Test vconst is only hoisted one loop out
* fix issue5884.
* fix issue5884
* fix test failure
* fix atomic rmw missing move result to dst register.
* specify little endian some s390x can pass test.
* wasi-threads: run test suite
This change enables the running of the wasi-threads [test suite]. It
relies on a Wasmtime CLI binary being available and runs all `*.wasm`
and `*.wat` files present in the test suite directory. The results of
each execution are compared against a JSON spec file with the same base
name as the WebAssembly module. The spec file defines the expected exit
code, e.g.
This commit does not yet build any `*.c` or `*.s` files from the test
suite. That could be done later, perhaps upstream; in the meantime, this
work is still valuable as it lays the foundation for running other WASI
tests from the in-progress [wasi-testsuite] which share the same JSON
spec infrastructure.
[test suite]: https://github.com/WebAssembly/wasi-threads/tree/main/test/testsuite
[wasi-testsuite]: https://github.com/WebAssembly/wasi-testsuite
* review: move testsuite to top-level tests
* fix: remove now-unnecessary wasi-threads test
* fix: update testsuite submodule name
* fix: ignore tests on Windows
prtest:full
* fix: `cfg_attr` syntax
prtest:full
* Added `mem_flags` parameter to `State::checked_{load,store}` as the means
for determining the endianness, typically derived from an instruction.
* Added `native_endianness` property to `InterpreterState` as fallback when
determining endianness, such as in cases where there are no memory flags
avaiable or set.
* Added `to_be` and `to_le` methods to `DataValue`.
* Added `AtomicCas` and `AtomicRmw` to list of instructions with retrievable
memory flags for `InstructionData::memflags`.
* Enabled `atomic-{cas,rmw}-subword-{big,little}.clif` for interpreter run
tests.
This commit adds lowerings to the AArch64 backend for the `fmls`
instruction which is intended to be leveraged in the relaxed-simd
proposal for WebAssembly. This should hopefully allow for a
teeny-bit-more efficient codegen for this operator instead of using the
`fmla` instruction plus a negation instruction.
This catches a case that wasn't handled previously by #5880 to allow a
constant load to be folded into an instruction rather than forcing it to
be loaded into a temporary register.
* Revert "egraphs: disable GVN of effectful idempotent ops (temporarily). (#5808)"
This reverts commit c7e2571866.
* egraphs: fix handling of effectful-but-idempotent ops and GVN.
This PR addresses #5796: currently, ops that are effectful, i.e., remain
in the side-effecting skeleton (which we keep in the `Layout` while the
egraph exists), but are idempotent and thus mergeable by a GVN pass, are
not handled properly.
GVN is still possible on effectful but idempotent ops precisely because
our GVN does not create partial redundancies: it removes an instruction
only when it is dominated by an identical instruction. An isntruction
will not be "hoisted" to a point where it could execute in the optimized
code but not in the original.
However, there are really two parts to the egraph implementation that
produce this effect: the deduplication on insertion into the egraph, and
the elaboration with a scoped hashmap. The deduplication lets us give a
single name (value ID) to all copies of an identical instruction, and
then elaboration will re-create duplicates if GVN should not hoist or
merge some of them.
Because deduplication need not worry about dominance or scopes, we use a
simple (non-scoped) hashmap to dedup/intern ops as "egraph nodes".
When we added support for GVN'ing effectful but idempotent ops (#5594),
we kept the use of this simple dedup'ing hashmap, but these ops do not
get elaborated; instead they stay in the side-effecting skeleton. Thus,
we inadvertently created potential for weird code-motion effects.
The proposal in #5796 would solve this in a clean way by treating these
ops as pure again, and keeping them out of the skeleton, instead putting
"force" pseudo-ops in the skeleton. However, this is a little more
complex than I would like, and I've realized that @jameysharp's earlier
suggestion is much simpler: we can keep an actual scoped hashmap
separately just for the effectful-but-idempotent ops, and use it to GVN
while we build the egraph. In effect, we're fusing a separate GVN pass
with the egraph pass (but letting it interact corecursively with
egraph rewrites. This is in principle similar to how we keep a separate
map for loads and fuse this pass with the egraph rewrite pass as well.
Note that we can use a `ScopedHashMap` here without the "context" (as
needed by `CtxHashMap`) because, as noted by @jameysharp, in practice
the ops we want to GVN have all their args inline. Equality on the
`InstructinoData` itself is conservative: two insts whose struct
contents compare shallowly equal are definitely identical, but identical
insts in a deep-equality sense may not compare shallowly equal, due to
list indirection. This is fine for GVN, because it is still sound to
skip any given GVN opportunity (and keep the original instructions).
Fixes#5796.
* Add comments from review.
* x64: Add `shuffle` cases for `punpck{h,l}bw`
I noticed this difference between LLVM and Cranelift for something I was
looking at recently, and while it's probably not all that common I
figured I'd add it here since it should be somewhat useful nevertheless.
* Review feedback
* Use u128 extractor instead