* ISLE: move `icmp` rewrites to separate file.
Move `icmp`-related rewrite rules from `algebraic.isle` to `icmp.isle`.
Also move `icmp`-related tests from `algebraic.clif` to `icmp.clif`.
* Put parameterized and unparameterized `icmp` tests in separate files
* Undo refactoring of (ir)reflexivity rewrites
* Fix `icmp-parameterised.clif`
* Undo formatting/comment changes
* x64: Add AVX encodings of `vcvt{ss2sd,sd2ss}`
Additionally update the instruction helpers to take an `XmmMem` argument
to allow load sinking into the instruction.
* x64: Add AVX encoding of `sqrts{s,d}`
* x64: Add AVX support for `rounds{s,d}`
* x64: Deduplicate fcmp emission logic
The `select`-of-`fcmp` lowering duplicated a good deal of `FloatCC`
lowering logic that was already done by `emit_fcmp`, so this commit
refactors these lowering rules to instead delegate to `emit_fcmp` and
then handle that result.
* Swap order of condition codes
Shouldn't affect the correctness of this operation and it's a bit more
natural to write the lowering rule this way.
* Swap the order of comparison operands
No need to swap `a b`, only the `x y` needs swapping.
* Fix x64 printing of `XmmCmove`
* Add `precise_output` argument to `test optimise`.
Also allow optimise tests to be updated by `CRANELIFT_TEST_BLESS=1`
* Move `check_precise_output` and `update_test` to `subtest`
* Implement TLS on Aarch64 Mach-O
* Add aarch64 macho TLS filetest
* Address review comments
- `Aarch64` instead of `AArch64` in comments
- Remove unnecessary guard in tls_value lowering
- Remove unnecessary regalloc metadata in emission
* Use x1 as temporary register in emission
- Instead of passing in a temporary register to use when emitting
the TLS code, just use `x1`, as it's already in the clobber set.
This also keeps the size of `aarch64::inst::Inst` at 32 bytes.
- Update filetest accordingly
* Update aarch64 mach-o TLS filetest
Following up on the discussion in
https://github.com/bytecodealliance/wasmtime/pull/6011
this adds an improved implementation of TrapIf for s390x
using a single conditional branch instruction.
If the trap conditions is true, we branch into the middle of
the branch instruction - those middle two bytes are zero,
which matches the encoding of the trap instruction.
In addition, show the trap code for Trap and TrapIf
instructions in assembler output.
* x64: Add instruction helpers for `mov{d,q}`
These will soon grow AVX-equivalents so move them to instruction helpers
to have clauses for AVX in the future.
* x64: Don't auto-convert between RegMemImm and XmmMemImm
The previous conversion, `mov_rmi_to_xmm`, would move from GPR registers
to XMM registers which isn't what many of the other `convert` statements
between these newtypes do. This seemed like a possible footgun so I've
removed the auto-conversion and added an explicit helper to go from a
`u32` to an `XmmMemImm`.
* x64: Add AVX encodings of some more GPR-related insns
This commit adds some more support for AVX instructions where GPRs are
in use mixed in with XMM registers. This required a few more variants of
`Inst` to handle the new instructions.
* Fix vpmovmskb encoding
* Fix xmm-to-gpr encoding of vmovd/vmovq
* Fix typo
* Fix rebase conflict
* Fix rebase conflict with tests
* cranelift: Add extra runtests for `clz`/`ctz`
* riscv64: Restrict lowering rules for `ctz`/`clz`
* cranelift: Add `u64` isle helpers
* riscv64: Improve `ctz` codegen
* riscv64: Improve `clz` codegen
* riscv64: Improve `cls` codegen
* riscv64: Improve `clz.i128` codegen
Instead of checking if we have 64 zeros in the top half. Check
if it *is* 0, that way we avoid loading the `64` constant.
* riscv64: Improve `ctz.i128` codegen
Instead of checking if we have 64 zeros in the bottom half. Check
if it *is* 0, that way we avoid loading the `64` constant.
* riscv64: Use extended value in `lower_cls`
* riscv64: Use pattern matches on `bseti`
* Pick argument and return types based on opcode constraints
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Lazily build the OPCODE_SIGNATURES list
* Skip unsupported isplit/iconcat cases
* Add an issue reference for the isplit/iconcat exemption
* Refactor the deny lists to use exceptions!, and remove redundant entries
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Restrict the types for isplit and iconcat to match backends
* Admit unimplemented bitwidths to isplit/iconcat
* Modify the NarrowInt type instead of shadowing it
* Fix filetest failures
* Add a `MachBuffer::defer_trap` method
This commit adds a new method to `MachBuffer` to defer trap opcodes to
the end of a function in a similar manner to how constants are deferred
to the end of the function. This is useful for backends which frequently
use `TrapIf`-style opcodes. Currently a jump is emitted which skips the
next instruction, a trap, and then execution continues normally. While
there isn't any pressing problem with this construction the trap opcode
is in the middle of the instruction stream as opposed to "off on the
side" despite rarely being taken.
With this method in place all the backends (except riscv64 since I
couldn't figure it out easily enough) have a new lowering of their
`TrapIf` opcode. Now a trap is deferred, which returns a label, and then
that label is jumped to when executing the trap. A fixup is then
recorded in `MachBuffer` to get patched later on during emission, or at
the end of the function. Subsequently all `TrapIf` instructions
translate to a single branch plus a single trap at the end of the
function.
I've additionally further updated some more lowerings in the x64 backend
which were explicitly using traps to instead use `TrapIf` where
applicable to avoid jumping over traps mid-function. Other backends
didn't appear to have many jump-over-the-next-trap patterns.
Lots of tests have had their expectations updated here which should
reflect all the traps being sunk to the end of functions.
* Print trap code on all platforms
* Emit traps before constants
* Preserve source location information for traps
* Fix test expectations
* Attempt to fix s390x
The MachBuffer was registering trap codes with the first byte of the
trap, but the SIGILL handler was expecting it to be registered with the
last byte of the trap. Exploit that SIGILL is always represented with a
2-byte instruction and always march 2-backwards for SIGILL, continuing
to march backwards 1 byte for SIGFPE-generating instructions.
* Back out s390x changes
* Back out more s390x bits
* Review comments
Previously it could affect the PartialEq and Hash impls. Ignoring the
sequence number in PartialEq and Hash allows us to not renumber all
blocks in the incremental cache.
This commit goes through the `runtests` folder of the `filetests`
test suite and ensure that everything which uses simd or float-related
instructions on x64 is executed with the baseline support for x86_64 in
addition to adding in AVX support. Most of the instructions used have
AVX equivalents so this should help test all of the equivalents in
addition to the codegen filetests in the x64 folder.
* x64: Fix vbroadcastss with AVX2 and without AVX
This commit fixes a corner case in the emission of the
`vbroadcasts{s,d}` instructions. The memory-to-xmm form of these
instructions was available with the AVX instruction set, but the
xmm-to-xmm form of these instructions wasn't available until AVX2.
The instruction requirement for these are listed as AVX but the lowering
rules are appropriately annotated to use either AVX2 or AVX when
appropriate.
While this should work in practice this didn't work for the assertion
about enabled features for each instruction. The `vbroadcastss`
instruction was listed as requiring AVX but could get emitted when AVX2
was enabled (due to the reg-to-reg form being available). This caused an
issue for the fuzzer where AVX2 was enabled but AVX was disabled.
One possible fix would be to add more opcodes, one for reg-to-reg and
one for mem-to-reg. That seemed like somewhat overkill for a pretty
niche situation that shouldn't actually come up in practice anywhere.
Instead this commit changes all the `has_avx` accessors to the
`use_avx_simd` predicate already available in the target flags. The
`use_avx2_simd` predicate was then updated to additionally require
`has_avx`, so if AVX2 is enabled and AVX is disabled then the
`vbroadcastss` instruction won't get emitted any more.
Closes#6059
* Pass `enable_simd` on a few more files
* Add a program for checking the function_generator opcode signatures
* Rework as a test in function_generator instead
* Fix some invalid opcode signatures in the function generator
* Fix bnot exclusions
* Only allow pp_cmp within a single block
Block order shouldn't matter for codegen and restricting pp_cmp to a
single block will allow making instruction sequence numbers local to a
block.
* Make sequence numbers local to instructions
This allows renumbering to be localized to a single block where previously it
could affect the entire function. Also saves 32bit of overhead per block.
* cranelift-wasm: Add a bounds-checking optimization for dynamic memories and guard pages
This is a new special case for when we know that there are enough guard pages to
cover the memory access's offset and access size.
The precise should-we-trap condition is
index + offset + access_size > bound
However, if we instead check only the partial condition
index > bound
then the most out of bounds that the access can be, while that partial check
still succeeds, is `offset + access_size`.
However, when we have a guard region that is at least as large as `offset +
access_size`, we can rely on the virtual memory subsystem handling these
out-of-bounds errors at runtime. Therefore, the partial `index > bound` check is
sufficient for this heap configuration.
Additionally, this has the advantage that a series of Wasm loads that use the
same dynamic index operand but different static offset immediates -- which is a
common code pattern when accessing multiple fields in the same struct that is in
linear memory -- will all emit the same `index > bound` check, which we can GVN.
* cranelift: Add WAT tests for accessing dynamic memories with the same index but different offsets
The bounds check comparison is GVN'd but we still branch on values we should
know will always be true if we get this far in the code. This is actual `br_if`s
in the non-Spectre code and `select_spectre_guard`s that we should know will
always go a certain way if we have Spectre mitigations enabled.
Improving the non-Spectre case is pretty straightforward: walk the dominator
tree and remember which values we've already branched on at this point, and
therefore we can simplify any further conditional branches on those same values
into direct jumps.
Improving the Spectre case requires something that is morally the same, but has
a few snags:
* We don't have actual `br_if`s to determine whether the bounds checking
condition succeeded or not. We need to instead reason about dominating
`select_spectre_guard; {load, store}` instruction pairs.
* We have to be SUPER careful about reasoning "through" `select_spectre_guard`s.
Our general rule is never to do that, since it could break the speculative
execution sandboxing that the instruction is designed for.
* Use inst_block instead of pp_block where possible
* Remove unused is_block_gap method
* Remove ProgramOrder trait
It only has a single implementation
* Rename Layout::cmp to pp_cmp to distinguish it from Ord::cmp
* Make pp_block non-generic
* Use rpo_cmp_block instead of rpo_cmp in the verifier
* Remove ProgramPoint
* Rename ExpandedProgramPoint to ProgramPoint
* Remove From<ValueDef> for ProgramPoint impl
* x64: Elide more uextend with extractlane
I've confirmed locally now that `pextr{b,w,d}` all zero the upper bits
of the full 64-bit register size which means that the `extractlane`
operation with a zero-extend can be elided for more cases, including
8-to-64-bit casts as well as 32-to-64.
This helps elide a few extra `mov`s in a loop I was looking at and had a
modest corresponding increase in performance (my guess was due to the
slightly decreased code size mostly as opposed to the removed `mov`s).
* Remove stray file
* x64: Refactor sextend/uextend rules
Move much of the meaty logic from these lowering rules into the
`extend_to_gpr` helper to benefit other callers of `extend_to_gpr` to
elide instructions. This additionally simplifies `sextend` and `uextend`
lowerings to rely on optimizations happening within the `extend_to_gpr`
helper.
* x64: Skip `uextend` for `pextr{b,w}` instructions
These instructions are documented as automatically zeroing the upper
bits so `uextend` operations can be skipped. This slightly improves
codegen for the wasm `i{8x16,16x8}.extract_lane_u` instructions, for
example.
* Modernize an extractor pattern
* Trim some superfluous match clauses
Additionally rejigger priorities to be "mostly default" now.
* Refactor 32-to-64 predicate to a helper
Also adjust the pattern matched in the `extend_to_gpr` helper.
* Slightly refactor pextr{b,w} case
* Review comments
* cranelift: Add function name to tests
* cranelift: Move simd-ineg tests to separate file
* cranelift: Move `avg_round` tests to separate file
* cranelift: Move SIMD `fmin`/`fmax` tests to separate files
* cranelift-interpreter: Implement a bunch of SIMD arithmetic ops
Most of these are quite easy to adapt to be polymorphic
* cranelift: Move shift tests from `simd-arithmetic.clif` into shift files
* x64: Take SIGFPE signals for divide traps
Prior to this commit Wasmtime would configure `avoid_div_traps=true`
unconditionally for Cranelift. This, for the division-based
instructions, would change emitted code to explicitly trap on trap
conditions instead of letting the `div` x86 instruction trap.
There's no specific reason for Wasmtime, however, to specifically avoid
traps in the `div` instruction. This means that the extra generated
branches on x86 aren't necessary since the `div` and `idiv` instructions
already trap for similar conditions as wasm requires.
This commit instead disables the `avoid_div_traps` setting for
Wasmtime's usage of Cranelift. Subsequently the codegen rules were
updated slightly:
* When `avoid_div_traps=true`, traps are no longer emitted for `div`
instructions.
* The `udiv`/`urem` instructions now list their trap as divide-by-zero
instead of integer overflow.
* The lowering for `sdiv` was updated to still explicitly check for zero
but the integer overflow case is deferred to the instruction itself.
* The lowering of `srem` no longer checks for zero and the listed trap
for the `div` instruction is a divide-by-zero.
This means that the codegen for `udiv` and `urem` no longer have any
branches. The codegen for `sdiv` removes one branch but keeps the
zero-check to differentiate the two kinds of traps. The codegen for
`srem` removes one branch but keeps the -1 check since the semantics of
`srem` mismatch with the semantics of `idiv` with a -1 divisor
(specifically for INT_MIN).
This is unlikely to have really all that much of a speedup but was
something I noticed during #6008 which seemed like it'd be good to clean
up. Plus Wasmtime's signal handling was already set up to catch
`SIGFPE`, it was just never firing.
* Remove the `avoid_div_traps` cranelift setting
With no known users currently removing this should be possible and helps
simplify the x64 backend.
* x64: GC more support for avoid_div_traps
Remove the `validate_sdiv_divisor*` pseudo-instructions and clean up
some of the ISLE rules now that `div` is allowed to itself trap
unconditionally.
* x64: Store div trap code in instruction itself
* Keep divisors in registers, not in memory
Don't accidentally fold multiple traps together
* Handle EXC_ARITHMETIC on macos
* Update emit tests
* Update winch and tests
This commit goes through the lowerings for the CLIF `splat` instruction
and improves the support for each operator. Many of these lowerings are
mirrored from v8/SpiderMonkey and there are a number of improvements:
* AVX2 `v{p,}broadcast*` instructions are added and used when available.
* Float-based splats are much simpler and always a single-instruction
* Integer-based splats don't insert into an uninit xmm value and instead
start out with a `movd` to move into an `xmm` register. This
thoeretically breaks dependencies with prior instructions since `movd`
creates a fresh new value in the destination register.
* Loads are now sunk into all of the instructions. A new extractor,
`sinkable_load_exact`, was added to sink the i8/i16 loads.