This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16-and-above are mapped to select
from the first vector since the first and second element are the same,
but the subtraction was with 15 rather than 16 by accident.
* Enable the native target by default in winch
Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.
* Refactor Cranelift's ISA builder to share more with Winch
This commit refactors the `Builder` type to have a type parameter
representing the finished ISA with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.
* Moving compiler shared components to a separate crate
* Restore native flag inference in compiler building
This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.
* Move `Compiler::page_size_align` into wasmtime-environ
The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.
* Fill out `Compiler::{flags, isa_flags}` for Winch
These are easy enough to plumb through with some shared code for
Wasmtime.
* Plumb the `is_branch_protection_enabled` flag for Winch
Just forwarding an isa-specific setting accessor.
* Moving executable creation to shared compiler crate
* Adding builder back in and removing from shared crate
* Refactoring the shared pieces for the `CompilerBuilder`
I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.
With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.
I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstraction: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.
* Documentation updates, and renaming the finish method
* Adding back in a default temporarily to pass tests, and removing some unused imports
* Fixing winch tests with wrong method name
* Removing unused imports from codegen shared crate
* Apply documentation formatting updates
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Adding back in cranelift_native flag inferring
* Adding new shared crate to publish list
* Adding write feature to pass cargo check
---------
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Remove the Cranelift `vselect` instruction
This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:
* x64 - uses the high bit for `f32x4` and `f64x2` and otherwise uses the
high bit of each byte doing a byte-wise lane select rather than
whatever the controlling type is.
* AArch64 - this is the same as `bitselect` which is a bit-wise
selection rather than a lane-wise selection.
* s390x - this is the same as AArch64, a bit-wise selection rather than
lane-wise.
* interpreter - the interpreter implements the documented semantics of
selecting based on "truthy" values.
Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.
Given this situation this commit subsqeuently removes `vselect` and all
usage of it throughout Cranelift.
Closes#5917
* Review comments
* Bring back vselect opts as bitselect opts
* Clean up vselect usage in the interpreter
* Move bitcast in nan canonicalization
* Add a comment about float optimization
This fixes an issue where block params were always listed as being
members of the current block in egraphs, even when the block param was
actually defined in a separate block. This then enables instructions
which depend on these parameters to get hoisted up out of inner loops at
least to the block that defined the argument.
Closes#5957
* Initial support for the Relaxed SIMD proposal
This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is supported on the x64 and
AArch64 backends on this time.
The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.
A summary of changes made in this commit are:
* Lowerings for all relaxed simd opcodes have been added, currently all
exhibiting deterministic behavior. This means that few lowerings are
optimal on the x86 backend, but on the AArch64 backend, for example,
all lowerings should be optimal.
* Support is added to codegen to, eventually, conditionally generate
different code based on input codegen flags. This is intended to
enable codegen to more efficient instructions on x86 by default, for
example, while still allowing embedders to force
architecture-independent semantics and behavior. One good example of
this is the `f32x4.relaxed_fmadd` instruction which when deterministic
forces the `fma` instruction, but otherwise if the backend doesn't
have support for `fma` then intermediate operations are performed
instead.
* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
x86 backend as they're now exercised by the deterministic lowerings of
relaxed simd instructions.
* Sample codegen tests for added for x86 and aarch64 for some relaxed
simd instructions.
* Wasmtime embedder support for the relaxed-simd proposal and forcing
determinism have been added to `Config` and the CLI.
* Support has been added to the `*.wast` runtime execution for the
`(either ...)` matcher used in the relaxed-simd proposal.
* Tests for relaxed-simd are run both with a default `Engine` as well as
a "force deterministic" `Engine` to test both configurations.
* All tests from the upstream repository were copied into Wasmtime.
These tests should be deleted when WebAssembly/testsuite is updated.
* x64: Add x86-specific lowerings for relaxed simd
This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
* Add libcall support for fma to Wasmtime
This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
* Ignore relaxed-simd tests on s390x and riscv64
* Enable relaxed-simd tests on s390x
* Update cranelift/codegen/meta/src/shared/instructions.rs
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add a FIXME from review
* Add notes about deterministic semantics
* Don't default `has_native_fma` to `true`
* Review comments and rebase fixes
---------
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Allow hoisting `vconst` instructions out of loops
Staring at some SIMD code and what LLVM and v8 both generate it appears
that a common technique for SIMD-loops is to hoist constants outside of
loops since they're nontrivial to rematerialize unlike integer
constants. This commit updates the `loop_hoist_level` calculation with
egraphs to have a nonzero default for instructions that have no
arguments (e.g. consts) which enables hoisting these instructions out of
loops.
Note, though, that for now I've listed the maximum as hoisting outside
of one loop, but not all of them. While theoretically vconsts could move
up to the top of the function I'd be worried about their impact on
register pressure and having to save/restore around calls or similar, so
hopefully if the hot part of a program is a single loop then hoisting
out of one loop is a reasonable-enough heuristic for now.
Locally on x64 with a benchmark that just encodes binary to hex this saw
a 15% performance improvement taking hex encoding from ~6G/s to ~6.7G/s.
* Test vconst is only hoisted one loop out
* fix issue5884.
* fix issue5884
* fix test failure
* fix atomic rmw missing move result to dst register.
* specify little endian some s390x can pass test.
* Added `mem_flags` parameter to `State::checked_{load,store}` as the means
for determining the endianness, typically derived from an instruction.
* Added `native_endianness` property to `InterpreterState` as fallback when
determining endianness, such as in cases where there are no memory flags
avaiable or set.
* Added `to_be` and `to_le` methods to `DataValue`.
* Added `AtomicCas` and `AtomicRmw` to list of instructions with retrievable
memory flags for `InstructionData::memflags`.
* Enabled `atomic-{cas,rmw}-subword-{big,little}.clif` for interpreter run
tests.
This commit adds lowerings to the AArch64 backend for the `fmls`
instruction which is intended to be leveraged in the relaxed-simd
proposal for WebAssembly. This should hopefully allow for a
teeny-bit-more efficient codegen for this operator instead of using the
`fmla` instruction plus a negation instruction.
This catches a case that wasn't handled previously by #5880 to allow a
constant load to be folded into an instruction rather than forcing it to
be loaded into a temporary register.
* Revert "egraphs: disable GVN of effectful idempotent ops (temporarily). (#5808)"
This reverts commit c7e2571866.
* egraphs: fix handling of effectful-but-idempotent ops and GVN.
This PR addresses #5796: currently, ops that are effectful, i.e., remain
in the side-effecting skeleton (which we keep in the `Layout` while the
egraph exists), but are idempotent and thus mergeable by a GVN pass, are
not handled properly.
GVN is still possible on effectful but idempotent ops precisely because
our GVN does not create partial redundancies: it removes an instruction
only when it is dominated by an identical instruction. An isntruction
will not be "hoisted" to a point where it could execute in the optimized
code but not in the original.
However, there are really two parts to the egraph implementation that
produce this effect: the deduplication on insertion into the egraph, and
the elaboration with a scoped hashmap. The deduplication lets us give a
single name (value ID) to all copies of an identical instruction, and
then elaboration will re-create duplicates if GVN should not hoist or
merge some of them.
Because deduplication need not worry about dominance or scopes, we use a
simple (non-scoped) hashmap to dedup/intern ops as "egraph nodes".
When we added support for GVN'ing effectful but idempotent ops (#5594),
we kept the use of this simple dedup'ing hashmap, but these ops do not
get elaborated; instead they stay in the side-effecting skeleton. Thus,
we inadvertently created potential for weird code-motion effects.
The proposal in #5796 would solve this in a clean way by treating these
ops as pure again, and keeping them out of the skeleton, instead putting
"force" pseudo-ops in the skeleton. However, this is a little more
complex than I would like, and I've realized that @jameysharp's earlier
suggestion is much simpler: we can keep an actual scoped hashmap
separately just for the effectful-but-idempotent ops, and use it to GVN
while we build the egraph. In effect, we're fusing a separate GVN pass
with the egraph pass (but letting it interact corecursively with
egraph rewrites. This is in principle similar to how we keep a separate
map for loads and fuse this pass with the egraph rewrite pass as well.
Note that we can use a `ScopedHashMap` here without the "context" (as
needed by `CtxHashMap`) because, as noted by @jameysharp, in practice
the ops we want to GVN have all their args inline. Equality on the
`InstructinoData` itself is conservative: two insts whose struct
contents compare shallowly equal are definitely identical, but identical
insts in a deep-equality sense may not compare shallowly equal, due to
list indirection. This is fine for GVN, because it is still sound to
skip any given GVN opportunity (and keep the original instructions).
Fixes#5796.
* Add comments from review.
* x64: Add `shuffle` cases for `punpck{h,l}bw`
I noticed this difference between LLVM and Cranelift for something I was
looking at recently, and while it's probably not all that common I
figured I'd add it here since it should be somewhat useful nevertheless.
* Review feedback
* Use u128 extractor instead
This instruction is only defined with i8x16 inputs and outputs so
there's no need for a type variable, so shadow the otherwise-generic `a`
result with a concrete i8x16 type.
This commit adds support for the bare lowering of the `iadd_pairwise`
instruction with `i16x8` and `i32x4` types on the x64 backend. These
lowerings are achieved with the `phaddw` and `phaddd` instructions,
respectively. Additionally AVX encodings of these instructions are added
too.
The motivation for these new lowerings comes from the relaxed-simd
proposal which will use them in the deterministic lowering of some
instructions on the x64 backend.
A number of places in the x64 backend make use of 128-bit constants for
various wasm SIMD-related instructions although most of them currently
use the `x64_xmm_load_const` helper to load the constant into a
register. Almost all xmm instructions, however, enable using a memory
operand which means that these loads can be folded into instructions to
help reduce register pressure. Automatic conversions were added for a
`VCodeConstant` into an `XmmMem` value and then explicit loads were all
removed in favor of forwarding the `XmmMem` value directly to the
underlying instruction. Note that some instances of `x64_xmm_load_const`
remain since they're used in contexts where load sinking won't work
(e.g. they're the first operand, not the second for non-commutative
instructions).
This was added for the wasm SIMD proposal but I've been poking around at
this recently and the instruction can instead be represented by its
component parts with the same semantics I believe. This commit removes
the instruction and instead represents it with the existing
`iadd_pairwise` instruction (among others) and updates backends to with
new pattern matches to have the same codegen as before.
This interestingly entirely removed the codegen rule with no replacement
on the AArch64 backend as the existing rules all existed to produce the
same codegen.
* Generalize unsigned `(x << k) >> k` optimization
Split the existing rule into three parts:
- A dual of the rule for `(x >> k) << k` that is only valid for unsigned
shifts.
- Known-bits analysis for `(band (uextend x) k)`.
- A new rule for converting `sextend` to `uextend` if the sign-extended
bits are masked out anyway.
The first two together cover the existing rule.
* Generalize signed `(x << k) >> k` optimization
* Review comments
* Generalize sign-extending shifts further
The shifts can be eliminated even if the shift amount isn't exactly
equal to the difference in bit-widths between the narrow and wide types.
* Add filetests
For instructions with no results (such as branches and stores) or
instructions with multiple results (such as add with carry), we have
assertions checking that an optimization rule doesn't try to match on
or construct such instructions.
When we generate terms for matching or constructing instructions, the
terms for these instructions are guaranteed to panic if they're ever
used. So let's just not generate them.
In the future we may wish to generate terms with different types for
these instructions, to make them usable in ISLE rules for optimization
that fall outside our current egraph constraints.
This uses the `cmov`, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.
Since there isn't a bounds-check branch any more, this sequence no
longer needs Spectre mitigation. And since we don't need to be careful
about preserving flags, half the instructions can be removed from this
pseudoinstruction and emitted as regular instructions instead.
This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.
My benchmark results show a very small effect on runtime performance
with this change.
The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.
The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executes a few more instructions per br_table,
but maybe the branches were predicted better.
* x64: Fill out more AVX instructions
This commit fills out more AVX instructions for SSE counterparts
currently used. Many of these instructions do not benefit from the
3-operand form that AVX uses but instead benefit from being able to use
`XmmMem` instead of `XmmMemAligned` which may be able to avoid some
extra temporary registers in some cases.
* Review comments
* Rework the blockorder module to reuse the dom tree's cfg postorder
* Update domtree tests
* Treat br_table with an empty jump table as multiple block exits
* Bless tests
* Change branch_idx to succ_idx and fix the comment
When expanding a min/max operation to a pair of icmp + select,
do not attempt to expand the input value operands twice, as
this might fail with memory operands.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5859.
Use wrapping_neg in i{64,32,16}_from_negated_value to avoid Rust
aborts due to integer overflow. The resulting INT_MIN is already
handled correctly in subsequent operations.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5863.
The relaxed-simd proposal for WebAssembly adds a fused-multiply-add
operation for `v128` types so I was poking around at Cranelift's
existing support for its `fma` instruction. I was also poking around at
the x86_64 ISA's offerings for the FMA operation and ended up with this
PR that improves the lowering of the `fma` instruction on the x64
backend in a number of ways:
* A libcall-based fallback is now provided for `f32x4` and `f64x2` types
in preparation for eventual support of the relaxed-simd proposal.
These encodings are horribly slow, but it's expected that if FMA
semantics must be guaranteed then it's the best that can be done
without the `fma` feature. Otherwise it'll be up to producers (e.g.
Wasmtime embedders) whether wasm-level FMA operations should be FMA or
multiply-then-add.
* In addition to the existing `vfmadd213*` instructions opcodes were
added for `vfmadd132*`. The `132` variant is selected based on which
argument can have a sinkable load.
* Any argument in the `fma` CLIF instruction can now have a
`sinkable_load` and it'll generate a single FMA instruction.
* All `vfnmadd*` opcodes were added as well. These are pattern-matched
where one of the arguments to the CLIF instruction is an `fneg`. I
opted to not add a new CLIF instruction here since it seemed like
pattern matching was easy enough but I'm also not intimately familiar
with the semantics here so if that's the preferred approach I can do
that too.
* x64: Enable load-coalescing for SSE/AVX instructions
This commit unlocks the ability to fold loads into operands of SSE and
AVX instructions. This is beneficial for both function size when it
happens in addition to being able to reduce register pressure.
Previously this was not done because most SSE instructions require
memory to be aligned. AVX instructions, however, do not have alignment
requirements.
The solution implemented here is one recommended by Chris which is to
add a new `XmmMemAligned` newtype wrapper around `XmmMem`. All SSE
instructions are now annotated as requiring an `XmmMemAligned` operand
except for a new new instruction styles used specifically for
instructions that don't require alignment (e.g. `movdqu`, `*sd`, and
`*ss` instructions). All existing instruction helpers continue to take
`XmmMem`, however. This way if an AVX lowering is chosen it can be used
as-is. If an SSE lowering is chosen, however, then an automatic
conversion from `XmmMem` to `XmmMemAligned` kicks in. This automatic
conversion only fails for unaligned addresses in which case a load
instruction is emitted and the operand becomes a temporary register
instead. A number of prior `Xmm` arguments have now been converted to
`XmmMem` as well.
One change from this commit is that loading an unaligned operand for an
SSE instruction previously would use the "correct type" of load, e.g.
`movups` for f32x4 or `movup` for f64x2, but now the loading happens in
a context without type information so the `movdqu` instruction is
generated. According to [this stack overflow question][question] it
looks like modern processors won't penalize this "wrong" choice of type
when the operand is then used for f32 or f64 oriented instructions.
Finally this commit improves some reuse of logic in the `put_in_*_mem*`
helper to share code with `sinkable_load` and avoid duplication. With
this in place some various ISLE rules have been updated as well.
In the tests it can be seen that AVX-instructions are now automatically
load-coalesced and use memory operands in a few cases.
[question]: https://stackoverflow.com/questions/40854819/is-there-any-situation-where-using-movdqu-and-movupd-is-better-than-movups
* Fix tests
* Fix move-and-extend to be unaligned
These don't have alignment requirements like other xmm instructions as
well. Additionally add some ISA tests to ensure that their output is
tested.
* Review comments
This is a follow-up to comments in #5795 to remove some cruft in the x64
instruction model to ensure that the shape of an `Inst` reflects what's
going to happen in regalloc and encoding. This accessor was used to
handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones
had already moved to the appropriate `XmmUnary*` variant and `pshufb`
was additionally moved over to that variant as well.
The `pextr*` instructions got a new `Inst` variant and additionally had
their constructors slightly modified to no longer require the type as
input. The encoding for these instructions now automatically handles the
various type-related operands through a new `SseOpcode::Pextrq` operand
to represent 64-bit movements.
This commit refactors a bit about how sinkable loads are handled in the
x64 backend. The intention is to bring most handling around sinkable
loads up to date with the current state of the backend since things have
changed since these were originally introduced, namely automatic
conversions between types in ISLE. For example the `Value` type can be
automatically converted to `RegMem` to perform load sinking, but some
rules are still explicitly doing matching themselves.
Here I've removed explicit handling of immediates and sinkable loads
when they're the right-hand-side of an operation. These cases are
already handle by the "base case" when converting a `Value` to a
`RegMemImm`. Instead only rules explicitly for left-hand-side immediates
and sinkable loads remain. This helps cut down on the number of explicit
rules needed.
Additionally in the same manner that `Value` can be automatically
converted to `RegMem` I've added automatic conversions from
`SinkableLoad` to `RegMem` and the various other newtypes. This helps
cut down a bit on rule verbosity where `sink_load_*` is largely no
longer necessary.
* x64: Add most remaining AVX lowerings
This commit goes through `inst.isle` and adds a corresponding AVX
lowering for most SSE lowerings. I opted to skip instructions where the
SSE lowering didn't read/modify a register, such as `roundps`. I think
that AVX will benefit these instructions when there's load-merging since
AVX doesn't require alignment, but I've deferred that work to a future
PR.
Otherwise though in this PR I think all (or almost all) of the 3-operand
forms of AVX instructions are supported with their SSE counterparts.
This should ideally improve codegen slightly by removing register
pressure and the need for `movdqa` between registers. I've attempted to
ensure that there's at least one codegen test for all the new instructions.
As a side note, the recent capstone integration into `precise-output`
tests helped me catch a number of encoding bugs much earlier than
otherwise, so I've found that incredibly useful in tests!
* Move `vpinsr*` instructions to their own variant
Use true `XmmMem` and `GprMem` types in the instruction as well to get
more type-level safety for what goes where.
* Remove `Inst::produces_const` accessor
Instead of conditionally defining regalloc and various other operations
instead add dedicated `MInst` variants for operations which are intended
to produce a constant to have more clear interactions with regalloc and
printing and such.
* Fix tests
* Register traps in `MachBuffer` for load-folding ops
This adds a missing `add_trap` to encoding of VEX instructions with
memory operands to ensure that if they cause a segfault that there's
appropriate metadata for Wasmtime to understand that the instruction
could in fact trap. This fixes a fuzz test case found locally where v8
trapped and Wasmtime didn't catch the signal and crashed the fuzzer.
Fix the postorder traversal computed by the `DominatorTree`. It was
recording nodes in the wrong order depending on the order child nodes
were visited. Consider the following program:
```
function %foo2(i8) -> i8 {
block0(v0: i8):
brif v0, block1, block2
block1:
return v0
block2:
jump block1
}
```
The postorder produced by the previous implementation was:
```
block2
block1
block0
```
Which is incorrect, as `block1` is branched to by `block2`. Changing the
branch order in the function would also change the postorder result,
yielding the expected order with `block1` emitted first.
The problem was that when pushing successor nodes onto the stack, the
old implementation would also mark them SEEN. This would then prevent
them from being pushed on the stack again in the future, which is
incorrect as they might be visited by other nodes that have not yet been
pushed. This causes nodes to potentially show up later in the postorder
traversal than they should.
This PR reworks the implementation of `DominatorTree::compute` to
produce an order where `block1` is always returned first, regardless of
the branch order in the original program.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* x64: Add rudimentary support for some AVX instructions
I was poking around Spidermonkey's wasm backend and saw that the various
assembler functions used are all `v*`-prefixed which look like they're
intended for use with AVX instructions. I looked at Cranelift and it
currently doesn't have support for many AVX-based instructions, so I
figured I'd take a crack at it!
The support added here is a bit of a mishmash when viewed alone, but my
general goal was to take a single instruction from the SIMD proposal for
WebAssembly and migrate all of its component instructions to AVX. I, by
random chance, picked a pretty complicated instruction of `f32x4.min`.
This wasm instruction is implemented on x64 with 4 unique SSE
instructions and ended up being a pretty good candidate.
Further digging about AVX-vs-SSE shows that there should be two major
benefits to using AVX over SSE:
* Primarily AVX instructions largely use a three-operand form where two
input registers are operated with and an output register is also
specified. This is in contrast to SSE's predominant
one-register-is-input-but-also-output pattern. This should help free
up the register allocator a bit and additionally remove the need for
movement between registers.
* As #4767 notes the memory-based operations of VEX-encoded instructions
(aka AVX instructions) do not have strict alignment requirements which
means we would be able to sink loads and stores into individual
instructions instead of having separate instructions.
So I set out on my journey to implement the instructions used by
`f32x4.min`. The first few were fairly easy. The machinst backends are
already of the shape "take these inputs and compute the output" where
the x86 requirement of a register being both input and output is
postprocessed in. This means that the `inst.isle` creation helpers for
SSE instructions were already of the correct form to use AVX. I chose to
add new `rule` branches for the instruction creation helpers, for
example `x64_andnps`. The new `rule` conditionally only runs if AVX is
enabled and emits an AVX instruction instead of an SSE instruction for
achieving the same goal. This means that no lowerings of clif
instructions were modified, instead just new instructions are being
generated.
The VEX encoding was previously not heavily used in Cranelift. The only
current user are the FMA-style instructions that Cranelift has at this
time. These FMA instructions have one extra operand than `vandnps`, for
example, so I split the existing `XmmRmRVex` into a few more variants to
fit the shape of the instructions that needed generating for
`f32x4.min`. This was accompanied then with more AVX opcode definitions,
more emission support, etc.
Upon implementing all of this it turned out that the test suite was
failing on my machine due to the memory-operand encodings of VEX
instructions not being supported. I didn't explicitly add those in
myself but some preexisting RIP-relative addressing was leaking into the
new instructions with existing tests. I opted to go ahead and fill out
the memory addressing modes of VEX encoding to get the tests passing
again.
All-in-all this PR adds new instructions to the x64 backend for a number
of AVX instructions, updates 5 existing instruction producers to use AVX
instructions conditionally, implements VEX memory operands, and adds
some simple tests for the new output of `f32x4.min`. The existing
runtest for `f32x.min` caught a few intermediate bugs along the way and
I additionally added a plain `target x86_64` to that runtest to ensure
that it executes with and without AVX to test the various lowerings.
I'll also note that this, and future support, should be well-fuzzed
through Wasmtime's fuzzing which may explicitly disable AVX support
despite the machine having access to AVX, so non-AVX lowerings should be
well-tested into the future.
It's also worth mentioning that I am not an AVX or VEX or x64 expert.
Implementing the memory operand part for VEX was the hardest part of
this PR and while I think it should be good someone else should
definitely double-check me. Additionally I haven't added many
instructions to the x64 backend yet so I may have missed obvious places
to tests or such, so am happy to follow-up with anything to be more
thorough if necessary.
Finally I should note that this is just the tip of the iceberg when it
comes to AVX. My hope is to get some of the idioms sorted out to make it
easier for future PRs to add one-off instruction lowerings or such.
* Review feedback
* Refactor collect_branches_and_targets to not need a smallvec
Basic blocks are terminated by at most one branch instruction now, so we
can use that assumption in `collect_branches_and_targets` to return the
last instruction we saw instead.
* Review comments
This is a short-term fix to the same bug that #5800 is addressing
(#5796), but with less risk: it simply turns off GVN'ing of effectful
but idempotent ops. Because we have an upcoming release, and this is a
miscompile (albeit to do with trapping behavior), we would like to make
the simplest possible fix that avoids the bug, and backport it. I will
then rebase #5800 on top of a revert of this followed by the more
complete fix.