wasmtime

Author	SHA1	Message	Date
Alex Crichton	9a4bd7c6df	x64: Begin to lift SSE 4.1 requirement for SIMD support (#6216 ) * x64: Change `use_sse41` to a constructor This refactors the existing `use_sse41` extractor to instead be a `constructor` to use with `if-let`. * x64: Gate the `pblendw` instruction on SSE4.1 being enabled This specialization of `shuffle` isn't a base case so adding an `if-let` here should be sufficient for gating this instruction properly on enabled CPU features. * x64: Gate `pmuldq` lowerings on SSE 4.1 The specialized rules using these instructions can fall back to the standard lowerings for non-SSE 4.1 instructions.	2023-04-17 16:09:58 +00:00
T0b1-iOS	f684a5fbee	remove `iadd_cout` and `isub_bout` (#6198 )	2023-04-11 23:39:32 +00:00
T0b1-iOS	569089e473	Add `{u,s}{add,sub,mul}_overflow` instructions (#5784 ) * add `{u,s}{add,sub,mul}_overflow` with interpreter * add `{u,s}{add,sub,mul}_overflow` for x64 * add `{u,s}{add,sub,mul}_overflow` for aarch64 * 128bit filetests for `{u,s}{add,sub,mul}_overflow` * `{u,s}{add,sub,mul}_overflow` emit tests for x64 * `{u,s}{add,sub,mul}_overflow` emit tests for aarch64 * Initial review changes * add `with_flags_extended` helper * add `with_flags_chained` helper	2023-04-11 20:16:04 +00:00
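
As a rough illustration of what the `{u,s}{add,sub,mul}_overflow` family computes, the sketch below models three of the operations on 32-bit values with Rust's standard `overflowing_*` methods. It assumes each instruction yields the wrapped result together with an overflow flag; the function names are hypothetical stand-ins, not Cranelift APIs.

```rust
/// Hypothetical model of `uadd_overflow` for 32-bit operands:
/// returns the wrapped sum and whether unsigned overflow occurred.
pub fn uadd_overflow(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_add(b)
}

/// Hypothetical model of `ssub_overflow` for 32-bit operands:
/// returns the wrapped difference and whether signed overflow occurred.
pub fn ssub_overflow(a: i32, b: i32) -> (i32, bool) {
    a.overflowing_sub(b)
}

/// Hypothetical model of `smul_overflow` for 32-bit operands:
/// returns the wrapped product and whether signed overflow occurred.
pub fn smul_overflow(a: i32, b: i32) -> (i32, bool) {
    a.overflowing_mul(b)
}
```

For example, `uadd_overflow(u32::MAX, 1)` wraps to zero and reports overflow, while `uadd_overflow(1, 2)` does not.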
Alex Crichton	afb417920d	x64: Deduplicate fcmp emission logic (#6113 ) * x64: Deduplicate fcmp emission logic The `select`-of-`fcmp` lowering duplicated a good deal of `FloatCC` lowering logic that was already done by `emit_fcmp`, so this commit refactors these lowering rules to instead delegate to `emit_fcmp` and then handle that result. * Swap order of condition codes Shouldn't affect the correctness of this operation and it's a bit more natural to write the lowering rule this way. * Swap the order of comparison operands No need to swap `a b`, only the `x y` needs swapping. * Fix x64 printing of `XmmCmove`	2023-03-29 16:24:25 +00:00
Alex Crichton	2fde25311e	x64: Refactor and fill out some gpr-vs-xmm bits (#6058 ) * x64: Add instruction helpers for `mov{d,q}` These will soon grow AVX-equivalents so move them to instruction helpers to have clauses for AVX in the future. * x64: Don't auto-convert between RegMemImm and XmmMemImm The previous conversion, `mov_rmi_to_xmm`, would move from GPR registers to XMM registers which isn't what many of the other `convert` statements between these newtypes do. This seemed like a possible footgun so I've removed the auto-conversion and added an explicit helper to go from a `u32` to an `XmmMemImm`. * x64: Add AVX encodings of some more GPR-related insns This commit adds some more support for AVX instructions where GPRs are in use mixed in with XMM registers. This required a few more variants of `Inst` to handle the new instructions. * Fix vpmovmskb encoding * Fix xmm-to-gpr encoding of vmovd/vmovq * Fix typo * Fix rebase conflict * Fix rebase conflict with tests	2023-03-22 14:58:09 +00:00
Alex Crichton	f7dda1ab2c	x64: Fix vbroadcastss with AVX2 and without AVX (#6060 ) * x64: Fix vbroadcastss with AVX2 and without AVX This commit fixes a corner case in the emission of the `vbroadcasts{s,d}` instructions. The memory-to-xmm form of these instructions was available with the AVX instruction set, but the xmm-to-xmm form of these instructions wasn't available until AVX2. The instruction requirement for these are listed as AVX but the lowering rules are appropriately annotated to use either AVX2 or AVX when appropriate. While this should work in practice this didn't work for the assertion about enabled features for each instruction. The `vbroadcastss` instruction was listed as requiring AVX but could get emitted when AVX2 was enabled (due to the reg-to-reg form being available). This caused an issue for the fuzzer where AVX2 was enabled but AVX was disabled. One possible fix would be to add more opcodes, one for reg-to-reg and one for mem-to-reg. That seemed like somewhat overkill for a pretty niche situation that shouldn't actually come up in practice anywhere. Instead this commit changes all the `has_avx` accessors to the `use_avx_simd` predicate already available in the target flags. The `use_avx2_simd` predicate was then updated to additionally require `has_avx`, so if AVX2 is enabled and AVX is disabled then the `vbroadcastss` instruction won't get emitted any more. Closes #6059 * Pass `enable_simd` on a few more files	2023-03-18 18:38:03 +00:00
Alex Crichton	5ebe53a351	x64: Elide more uextend with extractlane (#6045 ) * x64: Elide more uextend with extractlane I've confirmed locally now that `pextr{b,w,d}` all zero the upper bits of the full 64-bit register size which means that the `extractlane` operation with a zero-extend can be elided for more cases, including 8-to-64-bit casts as well as 32-to-64. This helps elide a few extra `mov`s in a loop I was looking at and had a modest corresponding increase in performance (my guess was due to the slightly decreased code size mostly as opposed to the removed `mov`s). * Remove stray file	2023-03-17 16:18:41 +00:00
Alex Crichton	8e500099b3	x64: Refactor and add extractlane special case for uextend/sextend (#6022 ) * x64: Refactor sextend/uextend rules Move much of the meaty logic from these lowering rules into the `extend_to_gpr` helper to benefit other callers of `extend_to_gpr` to elide instructions. This additionally simplifies `sextend` and `uextend` lowerings to rely on optimizations happening within the `extend_to_gpr` helper. * x64: Skip `uextend` for `pextr{b,w}` instructions These instructions are documented as automatically zeroing the upper bits so `uextend` operations can be skipped. This slightly improves codegen for the wasm `i{8x16,16x8}.extract_lane_u` instructions, for example. * Modernize an extractor pattern * Trim some superfluous match clauses Additionally rejigger priorities to be "mostly default" now. * Refactor 32-to-64 predicate to a helper Also adjust the pattern matched in the `extend_to_gpr` helper. * Slightly refactor pextr{b,w} case * Review comments	2023-03-16 22:14:59 +00:00
Alex Crichton	5ae8575296	x64: Take SIGFPE signals for divide traps (#6026 ) * x64: Take SIGFPE signals for divide traps Prior to this commit Wasmtime would configure `avoid_div_traps=true` unconditionally for Cranelift. This, for the division-based instructions, would change emitted code to explicitly trap on trap conditions instead of letting the `div` x86 instruction trap. There's no specific reason for Wasmtime, however, to specifically avoid traps in the `div` instruction. This means that the extra generated branches on x86 aren't necessary since the `div` and `idiv` instructions already trap for similar conditions as wasm requires. This commit instead disables the `avoid_div_traps` setting for Wasmtime's usage of Cranelift. Subsequently the codegen rules were updated slightly: * When `avoid_div_traps=true`, traps are no longer emitted for `div` instructions. * The `udiv`/`urem` instructions now list their trap as divide-by-zero instead of integer overflow. * The lowering for `sdiv` was updated to still explicitly check for zero but the integer overflow case is deferred to the instruction itself. * The lowering of `srem` no longer checks for zero and the listed trap for the `div` instruction is a divide-by-zero. This means that the codegen for `udiv` and `urem` no longer have any branches. The codegen for `sdiv` removes one branch but keeps the zero-check to differentiate the two kinds of traps. The codegen for `srem` removes one branch but keeps the -1 check since the semantics of `srem` mismatch with the semantics of `idiv` with a -1 divisor (specifically for INT_MIN). This is unlikely to have really all that much of a speedup but was something I noticed during #6008 which seemed like it'd be good to clean up. Plus Wasmtime's signal handling was already set up to catch `SIGFPE`, it was just never firing. * Remove the `avoid_div_traps` cranelift setting With no known users currently removing this should be possible and helps simplify the x64 backend. 
* x64: GC more support for avoid_div_traps Remove the `validate_sdiv_divisor` pseudo-instructions and clean up some of the ISLE rules now that `div` is allowed to itself trap unconditionally. x64: Store div trap code in instruction itself * Keep divisors in registers, not in memory Don't accidentally fold multiple traps together * Handle EXC_ARITHMETIC on macos * Update emit tests * Update winch and tests	2023-03-16 00:18:45 +00:00
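
The trap behavior described in the commit above can be sketched in plain Rust. This is only a model of the semantics, assuming 32-bit operands: the `Trap` enum and function names are hypothetical, and in the real backend the overflow and divide-by-zero cases that are not checked explicitly are raised by the `idiv`/`div` instruction itself.

```rust
/// Hypothetical trap kinds used only for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Trap {
    DivideByZero,
    IntegerOverflow,
}

/// Signed division: the zero check stays explicit so the two trap kinds
/// can be told apart; `INT_MIN / -1` is the overflow case that the
/// hardware instruction would fault on.
pub fn sdiv(a: i32, b: i32) -> Result<i32, Trap> {
    if b == 0 {
        return Err(Trap::DivideByZero);
    }
    a.checked_div(b).ok_or(Trap::IntegerOverflow)
}

/// Signed remainder: the -1 check stays explicit because `idiv` would
/// fault on `INT_MIN % -1`, whereas the required result is 0.
pub fn srem(a: i32, b: i32) -> Result<i32, Trap> {
    if b == -1 {
        return Ok(0);
    }
    // With -1 handled above, `checked_rem` only fails for a zero divisor.
    a.checked_rem(b).ok_or(Trap::DivideByZero)
}
```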
Alex Crichton	d76f7ee52e	x64: Improve codegen for splats (#6025 ) This commit goes through the lowerings for the CLIF `splat` instruction and improves the support for each operator. Many of these lowerings are mirrored from v8/SpiderMonkey and there are a number of improvements: * AVX2 `v{p,}broadcast` instructions are added and used when available. Float-based splats are much simpler and always a single instruction. * Integer-based splats don't insert into an uninit xmm value and instead start out with a `movd` to move into an `xmm` register. This theoretically breaks dependencies with prior instructions since `movd` creates a fresh new value in the destination register. * Loads are now sunk into all of the instructions. A new extractor, `sinkable_load_exact`, was added to sink the i8/i16 loads.	2023-03-15 21:33:56 +00:00
Alex Crichton	6ed90f86c8	x64: Add support for the `pblendw` instruction (#6023 ) This commit adds another case for `shuffle` lowering to the x64 backend for the `{,v}pblendw` instruction. This instruction selects 16-bit values from either of the inputs corresponding to an immediate 8-bit-mask where each bit selects the corresponding lane from the inputs.	2023-03-15 17:20:43 +00:00
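
The lane selection performed by `pblendw` can be modeled as below. This is a minimal sketch of the instruction's documented behavior rather than anything from the Cranelift sources; the function name and array representation are assumptions made for illustration.

```rust
/// Software model of `pblendw dst, src, imm8`: for each of the eight
/// 16-bit lanes, bit `i` of the immediate picks `src[i]` when set and
/// keeps `dst[i]` when clear.
pub fn pblendw(dst: [u16; 8], src: [u16; 8], imm: u8) -> [u16; 8] {
    let mut out = dst;
    for lane in 0..8 {
        if (imm >> lane) & 1 == 1 {
            out[lane] = src[lane];
        }
    }
    out
}
```

A `shuffle` whose byte immediate moves whole 16-bit lanes without changing their position, taking each from one input or the other, matches this pattern and can use the single instruction.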
Alex Crichton	fcddb9ca81	x64: Add lea-based lowering for iadd (#5986 ) * x64: Refactor `Amode` computation in ISLE This commit replaces the previous computation of `Amode` with a different set of rules that are intended to achieve the same purpose but are structured differently. The motivation for this commit is going to become more relevant in the next commit where `lea` will be used for the `iadd` instruction, possibly, on x64. When doing so it caused a stack overflow in the test suite during the compilation phase of a wasm module, namely as part of the `amode_add` function. This function is recursively defined in terms of itself and recurses as deep as the deepest `iadd`-chain in a program. A particular test in our test suite has a 10k-long chain of `iadd` which ended up causing a stack overflow in debug mode. This stack overflow is caused because the `amode_add` helper in ISLE unconditionally peels all the `iadd` nodes away and looks at all of them, even if most end up in intermediate registers along the way. Given that structure I couldn't find a way to easily abort the recursion. The new `to_amode` helper is structured in a similar fashion but attempts to instead only recurse far enough to fold items into the final `Amode` instead of recursing through items which themselves don't end up in the `Amode`. Put another way previously the `amode_add` helper might emit `x64_add` instructions, but it no longer does that. The goal of this commit is to preserve all the original `Amode` optimizations, however. For some parts, though, it relies more on egraph optimizations to run since if an `iadd` is 10k deep it doesn't try to find a constant buried 9k levels inside there to fold into the `Amode`. The hope, though, is that with egraphs having run already it's shuffled constants to the right most of the time and already folded any possible together. 
* x64: Add `lea`-based lowering for `iadd` This commit adds a rule for the lowering of `iadd` to use `lea` for 32 and 64-bit addition. The theoretical benefit of `lea` over the `add` instruction is that the `lea` variant can emulate a 3-operand instruction which doesn't destructively modify one of its operands. Additionally the `lea` operation can fold in other components such as constant additions and shifts. In practice, however, if `lea` is unconditionally used instead of `iadd` it ends up losing 10% performance on a local `meshoptimizer` benchmark. My best guess as to what's going on here is that my CPU's dedicated units for address computation are all overloaded while the ALUs are basically idle in a memory-intensive loop. Previously when the ALU was used for `add` and the address units for stores/loads it in theory pipelined things better (most of this is me shooting in the dark). To prevent the performance loss here I've updated the lowering of `iadd` to conditionally sometimes use `lea` and sometimes use `add` depending on how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm` continue to use `add` (and its subsequent hypothetical extra `mov` necessary into the result). More complicated ones like `a + b + $imm` or `a + b << c + $imm` use `lea` as it can remove the need for extra instructions. Locally at least this fixes the performance loss relative to unconditionally using `lea`. One note is that this adds an `OperandSize` argument to the `MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit `lea` in addition to the preexisting 64-bit encoding. * Conditionally use `lea` based on regalloc	2023-03-15 17:14:25 +00:00
Alex Crichton	5c1b468648	x64: Migrate {s,u}{div,rem} to ISLE (#6008 ) * x64: Add precise-output tests for div traps This adds a suite of `.clif` files which are intended to test the `avoid_div_traps=true` compilation of the `{s,u}{div,rem}` instructions. x64: Remove conditional regalloc in `Div` instruction Move the 8-bit `Div` logic into a dedicated `Div8` instruction to avoid having conditionally-used registers with respect to regalloc. * x64: Migrate non-trapping, `udiv`/`urem` to ISLE * x64: Port checked `udiv` to ISLE * x64: Migrate urem entirely to ISLE * x64: Use `test` instead of `cmp` to compare-to-zero * x64: Port `sdiv` lowering to ISLE * x64: Port `srem` lowering to ISLE * Tidy up regalloc behavior and fix tests * Update docs and winch * Review comments * Reword again * More refactoring test fixes * More test fixes	2023-03-14 01:44:06 +00:00
Alex Crichton	e2a6fe99c2	x64: Add `shuffle` specialization for `palignr` (#5999 ) * x64: Add `shuffle` specialization for `palignr` This commit adds specializations for the `palignr` instruction to the x64 backend to specialize some more patterns of byte shuffles. * Fix tests	2023-03-13 21:01:24 +00:00
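
For readers unfamiliar with `palignr`, the byte movement it performs can be sketched as follows. This models the instruction as documented for 128-bit operands; the function name and array representation are illustrative assumptions, not code from the backend.

```rust
/// Software model of `palignr dst, src, imm8` on 128-bit operands:
/// concatenate `dst:src` (with `dst` in the high half), shift right by
/// `imm` bytes, and keep the low 16 bytes. Bytes shifted in from beyond
/// the 32-byte concatenation are zero.
pub fn palignr(dst: [u8; 16], src: [u8; 16], imm: u8) -> [u8; 16] {
    let mut concat = [0u8; 32];
    concat[..16].copy_from_slice(&src);
    concat[16..].copy_from_slice(&dst);

    let mut out = [0u8; 16];
    for i in 0..16 {
        let idx = i + imm as usize;
        out[i] = if idx < 32 { concat[idx] } else { 0 };
    }
    out
}
```

A byte `shuffle` that selects 16 consecutive bytes from the concatenation of its two inputs therefore fits this single instruction.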
Alex Crichton	03b5dbb3e0	aarch64: Use `VCodeConstant` for f64/v128 constants (#5997 ) * aarch64: Translate float and splat lowering to ISLE I was looking into `constant_f128` and its fallback lowering into memory and to get familiar with the code I figured it'd be good to port some Rust logic to ISLE. This commit ports the `constant_{f128,f64,f32}` helpers into ISLE from Rust as well as the `splat_const` helper which ended up being closely related. Tests reflect a number of regalloc changes that happened but also namely one major difference is that in the lowering of `f32` a 32-bit immediate is created now instead of a 64-bit immediate (in a GP register before it's moved into a FP register). This semantically has no change but the generated code is slightly different in a few minor cases. * aarch64: Load f64/v128 constants from a pool This commit removes the `LoadFpuConst64` and `LoadFpuConst128` pseudo-instructions from the AArch64 backend which internally loaded a nearby constant and then jumped over it. Constants now go through the `VCodeConstant` infrastructure which gets placed at the end of the function similar to how x64 works. Some minor support was added in as well to add a new addressing mode for a `MachLabel`-relative load.	2023-03-13 19:33:52 +00:00
Alex Crichton	6ecdc2482e	x64: Improve memory support in `{insert,extract}lane` (#5982 ) * x64: Improve memory support in `{insert,extract}lane` This commit adds support to Cranelift to emit `pextr{b,w,d,q}` with a memory destination, merging a store-of-extract operation into one instruction. Additionally AVX support is added for the `pextr` instructions. I've additionally tried to ensure that codegen tests and runtests exist for all forms of these instructions too. Add missing commas * Fix tests	2023-03-13 19:30:44 +00:00
Alex Crichton	0ec7b872fa	x64: Optimize store-of-extract-lane-0 (#5924 ) * x64: Optimize store-of-extract-lane-0 The `movss` and `movsd` instructions can be used to store the 0th lane of a `t32x4` or a `t64x2` vector into memory, enabling fusing a `store` and an `extractlane` instruction. * Fix merge conflict with `main`	2023-03-10 01:06:38 +00:00
Alex Crichton	83f21e784a	x64: Add more support for more AVX instructions (#5931 ) * x64: Add a smattering of lowerings for `shuffle` specializations (#5930) * x64: Add lowerings for `punpck{h,l}wd` Add some special cases for `shuffle` for more specialized x86 instructions. * x64: Add `shuffle` lowerings for `pshufd` This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone is necessary. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both. * x64: Add shuffle lowerings for `punpck{h,l}{q,}dq` This adds specific permutations for some x86 instructions which specifically interleave high/low bytes for 32 and 64-bit values. This corresponds to the preexisting specific lowerings for interleaving 8 and 16-bit values. * x64: Add `shuffle` lowerings for `shufps` This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector which means while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`. * x64: Add shuffle support for `pshuf{l,h}w` This commit adds special lowering cases for these instructions which permute 16-bit values within a 128-bit value either within the upper or lower half of the 128-bit value. * x64: Specialize `shuffle` with an all-zeros immediate Instead of loading the all-zeros immediate from a rip-relative address at the end of the function instead generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast. 
* Review comments * x64: Add an AVX encoding for the `pshufd` instruction This will benefit from lack of need for alignment vs the `pshufd` instruction if working with a memory operand and additionally, as I've just learned, this reduces dependencies between instructions because the `v` instructions zero the upper bits as opposed to preserving them which could accidentally create false dependencies in the CPU between instructions. x64: Add more support for AVX loads/stores This commit adds VEX-encoded versions of instructions such as `mov{ss,sd,upd,ups,dqu}` for load and store operations. This also changes some signatures so the `load` helpers specifically take a `SyntheticAmode` argument which ended up doing a small refactoring of the `_regmove` variant used for `insertlane 0` into f64x2 vectors. x64: Enable using AVX instructions for zero regs This commit refactors the internal ISLE helpers for creating zero'd xmm registers to leverage the AVX support for all other instructions. This moves away from picking opcodes to instead picking instructions with a bit of reorganization. * x64: Remove `XmmConstOp` as an instruction All existing users can be replaced with usage of the `xmm_uninit_value` helper instruction so there's no longer any need for these otherwise constant operations. This additionally reduces manual usage of opcodes in favor of instruction helpers. * Review comments * Update test expectations	2023-03-09 23:57:42 +00:00
Alex Crichton	1c3a1bda6c	x64: Add a smattering of lowerings for `shuffle` specializations (#5930 ) * x64: Add lowerings for `punpck{h,l}wd` Add some special cases for `shuffle` for more specialized x86 instructions. * x64: Add `shuffle` lowerings for `pshufd` This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone is necessary. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both. * x64: Add shuffle lowerings for `punpck{h,l}{q,}dq` This adds specific permutations for some x86 instructions which specifically interleave high/low bytes for 32 and 64-bit values. This corresponds to the preexisting specific lowerings for interleaving 8 and 16-bit values. * x64: Add `shuffle` lowerings for `shufps` This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector which means while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`. * x64: Add shuffle support for `pshuf{l,h}w` This commit adds special lowering cases for these instructions which permute 16-bit values within a 128-bit value either within the upper or lower half of the 128-bit value. * x64: Specialize `shuffle` with an all-zeros immediate Instead of loading the all-zeros immediate from a rip-relative address at the end of the function instead generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast. * Review comments	2023-03-09 22:58:19 +00:00
Alex Crichton	07518dfd36	Remove the Cranelift `vselect` instruction (#5918 ) * Remove the Cranelift `vselect` instruction This instruction is documented as selecting lanes based on the "truthy" value of the condition lane, but the current status of the implementation of this instruction is: * x64 - uses the high bit for `f32x4` and `f64x2` and otherwise uses the high bit of each byte doing a byte-wise lane select rather than whatever the controlling type is. * AArch64 - this is the same as `bitselect` which is a bit-wise selection rather than a lane-wise selection. * s390x - this is the same as AArch64, a bit-wise selection rather than lane-wise. * interpreter - the interpreter implements the documented semantics of selecting based on "truthy" values. Coupled with the status of the implementation is the fact that this instruction is not used by WebAssembly SIMD today either. The only use of this instruction in Cranelift is the nan-canonicalization pass. By moving nan-canonicalization to `bitselect`, since that has the desired semantics, there's no longer any need for `vselect`. Given this situation this commit subsequently removes `vselect` and all usage of it throughout Cranelift. Closes #5917 * Review comments * Bring back vselect opts as bitselect opts * Clean up vselect usage in the interpreter * Move bitcast in nan canonicalization * Add a comment about float optimization	2023-03-08 00:42:05 +00:00
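
The difference between the bit-wise and lane-wise selections discussed above is easier to see with a small model. The sketch below treats a 128-bit vector as a `u128` and implements the bit-wise form; the function name mirrors the CLIF instruction, but the code is an illustrative assumption rather than the interpreter's implementation.

```rust
/// Bit-wise select on a 128-bit value: each result bit comes from `x`
/// where the corresponding bit of `c` is set and from `y` where it is
/// clear. No lane structure is involved.
pub fn bitselect(c: u128, x: u128, y: u128) -> u128 {
    (c & x) | (!c & y)
}
```

When every lane of the condition is either all ones or all zeros, as with the result of a vector comparison, this bit-wise form gives the same answer as a lane-wise select, which is why a pass that builds its mask from a comparison can use `bitselect` directly.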
Alex Crichton	8bb183f16e	Implement the relaxed SIMD proposal (#5892 ) * Initial support for the Relaxed SIMD proposal This commit adds initial scaffolding and support for the Relaxed SIMD proposal for WebAssembly. Codegen is supported on the x64 and AArch64 backends at this time. The purpose of this commit is to get all the boilerplate out of the way in terms of plumbing through a new feature, adding tests, etc. The tests are copied from the upstream repository at this time while the WebAssembly/testsuite repository hasn't been updated. A summary of changes made in this commit are: * Lowerings for all relaxed simd opcodes have been added, currently all exhibiting deterministic behavior. This means that few lowerings are optimal on the x86 backend, but on the AArch64 backend, for example, all lowerings should be optimal. * Support is added to codegen to, eventually, conditionally generate different code based on input codegen flags. This is intended to enable codegen to more efficient instructions on x86 by default, for example, while still allowing embedders to force architecture-independent semantics and behavior. One good example of this is the `f32x4.relaxed_fmadd` instruction which when deterministic forces the `fma` instruction, but otherwise if the backend doesn't have support for `fma` then intermediate operations are performed instead. * Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the x86 backend as they're now exercised by the deterministic lowerings of relaxed simd instructions. * Sample codegen tests were added for x86 and aarch64 for some relaxed simd instructions. * Wasmtime embedder support for the relaxed-simd proposal and forcing determinism have been added to `Config` and the CLI. * Support has been added to the `.wast` runtime execution for the `(either ...)` matcher used in the relaxed-simd proposal. 
Tests for relaxed-simd are run both with a default `Engine` as well as a "force deterministic" `Engine` to test both configurations. * All tests from the upstream repository were copied into Wasmtime. These tests should be deleted when WebAssembly/testsuite is updated. * x64: Add x86-specific lowerings for relaxed simd This commit builds on the prior commit and adds an array of `x86_` instructions to Cranelift which have semantics that match their corresponding x86 equivalents. Translation for relaxed simd is then additionally updated to conditionally generate different CLIF for relaxed simd instructions depending on whether the target is x86 or not. This means that for AArch64 no changes are made but for x86 most relaxed instructions now lower to some x86-equivalent with slightly different semantics than the "deterministic" lowering. Add libcall support for fma to Wasmtime This will be required to implement the `f32x4.relaxed_madd` instruction (and others) when an x86 host doesn't specify the `has_fma` feature. * Ignore relaxed-simd tests on s390x and riscv64 * Enable relaxed-simd tests on s390x * Update cranelift/codegen/meta/src/shared/instructions.rs Co-authored-by: Andrew Brown <andrew.brown@intel.com> * Add a FIXME from review * Add notes about deterministic semantics * Don't default `has_native_fma` to `true` * Review comments and rebase fixes --------- Co-authored-by: Andrew Brown <andrew.brown@intel.com>	2023-03-07 15:52:41 +00:00
Alex Crichton	52b4c48a1b	x64: Improve codegen for i8x16.shr_u (#5906 ) This catches a case that wasn't handled previously by #5880 to allow a constant load to be folded into an instruction rather than forcing it to be loaded into a temporary register.	2023-03-02 05:43:42 +00:00
Alex Crichton	f05babc744	x64: Add `shuffle` cases for `punpck{h,l}bw` (#5905 ) * x64: Add `shuffle` cases for `punpck{h,l}bw` I noticed this difference between LLVM and Cranelift for something I was looking at recently, and while it's probably not all that common I figured I'd add it here since it should be somewhat useful nevertheless. * Review feedback * Use u128 extractor instead	2023-03-01 21:49:00 +00:00
Alex Crichton	e0ef0b7c72	x64: Add support for `phadd{w,d}` instructions (#5896 ) This commit adds support for the bare lowering of the `iadd_pairwise` instruction with `i16x8` and `i32x4` types on the x64 backend. These lowerings are achieved with the `phaddw` and `phaddd` instructions, respectively. Additionally AVX encodings of these instructions are added too. The motivation for these new lowerings comes from the relaxed-simd proposal which will use them in the deterministic lowering of some instructions on the x64 backend.	2023-02-28 23:35:53 +00:00
Alex Crichton	f2dce812c3	x64: Sink constant loads into xmm instructions (#5880 ) A number of places in the x64 backend make use of 128-bit constants for various wasm SIMD-related instructions although most of them currently use the `x64_xmm_load_const` helper to load the constant into a register. Almost all xmm instructions, however, enable using a memory operand which means that these loads can be folded into instructions to help reduce register pressure. Automatic conversions were added for a `VCodeConstant` into an `XmmMem` value and then explicit loads were all removed in favor of forwarding the `XmmMem` value directly to the underlying instruction. Note that some instances of `x64_xmm_load_const` remain since they're used in contexts where load sinking won't work (e.g. they're the first operand, not the second for non-commutative instructions).	2023-02-27 22:02:42 +00:00
Alex Crichton	9b86a0b9b1	Remove the `widening_pairwise_dot_product_s` clif instruction (#5889 ) This was added for the wasm SIMD proposal but I've been poking around at this recently and the instruction can instead be represented by its component parts with the same semantics I believe. This commit removes the instruction and instead represents it with the existing `iadd_pairwise` instruction (among others) and updates backends with new pattern matches to have the same codegen as before. This interestingly entirely removed the codegen rule with no replacement on the AArch64 backend as the existing rules all existed to produce the same codegen.	2023-02-27 18:43:43 +00:00
Afonso Bordado	36e92add6f	riscv64: Move `is_null`/`is_invalid` to ISLE (#5874 ) * riscv64: Move `is_null`/`is_invalid` to ISLE * riscv64: Fix `is_invalid` codegen * Implement review suggestions Thanks! Co-authored-by: Jamey Sharp <jamey@minilop.net> --------- Co-authored-by: Jamey Sharp <jamey@minilop.net>	2023-02-25 12:48:44 +00:00
Jamey Sharp	7d790fcdfe	x64: Only branch once in br_table (#5850 ) This uses the `cmov`, which was previously necessary for Spectre mitigation, to clamp the table index instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases. Since there isn't a bounds-check branch any more, this sequence no longer needs Spectre mitigation. And since we don't need to be careful about preserving flags, half the instructions can be removed from this pseudoinstruction and emitted as regular instructions instead. This is a net savings of three bytes in the encoding of x64's br_table pseudoinstruction. The generated code can sometimes be longer overall because the blocks are emitted in a slightly different order. My benchmark results show a very small effect on runtime performance with this change. The spidermonkey benchmark in Sightglass runs "1.01x faster" than main by instructions retired, but with no significant difference in CPU cycles. I think that means it rarely hit the default case in any br_table instructions it executed. The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main by CPU cycles, but main runs "1.00x faster" by instructions retired. I think that means this benchmark hit the default case a significant amount of the time, so it executes a few more instructions per br_table, but maybe the branches were predicted better.	2023-02-24 04:46:38 +00:00
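
The clamping idea in the commit above can be sketched without any machine code. This is a minimal model, assuming the jump table is represented as a slice whose last entry is the default target; the function name and the use of `u32` block identifiers are hypothetical.

```rust
/// Model of the single-branch `br_table` dispatch: the default target is
/// stored as the last table entry, so an out-of-range index is clamped to
/// that entry instead of being handled by a separate bounds-check branch.
///
/// `table` must be non-empty (it always contains at least the default).
pub fn br_table_target(table: &[u32], index: u32) -> u32 {
    assert!(!table.is_empty(), "table must contain the default target");
    let last = table.len() - 1;
    // On x64 this clamp is what the `cmov` computes.
    let clamped = core::cmp::min(index as usize, last);
    table[clamped]
}
```

With a table such as `[10, 20, 30, 99]`, indices 0 through 2 pick their own targets and every larger index lands on the default `99`.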
Alex Crichton	bd3dcd313d	x64: Add more `fma` instruction lowerings (#5846 ) The relaxed-simd proposal for WebAssembly adds a fused-multiply-add operation for `v128` types so I was poking around at Cranelift's existing support for its `fma` instruction. I was also poking around at the x86_64 ISA's offerings for the FMA operation and ended up with this PR that improves the lowering of the `fma` instruction on the x64 backend in a number of ways: * A libcall-based fallback is now provided for `f32x4` and `f64x2` types in preparation for eventual support of the relaxed-simd proposal. These encodings are horribly slow, but it's expected that if FMA semantics must be guaranteed then it's the best that can be done without the `fma` feature. Otherwise it'll be up to producers (e.g. Wasmtime embedders) whether wasm-level FMA operations should be FMA or multiply-then-add. * In addition to the existing `vfmadd213` instructions opcodes were added for `vfmadd132`. The `132` variant is selected based on which argument can have a sinkable load. * Any argument in the `fma` CLIF instruction can now have a `sinkable_load` and it'll generate a single FMA instruction. * All `vfnmadd*` opcodes were added as well. These are pattern-matched where one of the arguments to the CLIF instruction is an `fneg`. I opted to not add a new CLIF instruction here since it seemed like pattern matching was easy enough but I'm also not intimately familiar with the semantics here so if that's the preferred approach I can do that too.	2023-02-21 20:51:22 +00:00
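
The distinction between a guaranteed fused multiply-add and a multiply-then-add sequence is observable in the results, which is why the fallback matters. The scalar sketch below uses Rust's standard `f32::mul_add`, which rounds once, against a two-step version that rounds twice; the function names are hypothetical, and `fnmadd` only illustrates the `fneg` pattern mentioned above, not the backend's matching rules.

```rust
/// Fused multiply-add: `a * b + c` with a single rounding step.
pub fn fused(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

/// Multiply-then-add: the product is rounded before the addition.
pub fn unfused(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Negated multiply-add, `-(a * b) + c`, expressed as an FMA whose first
/// operand is negated (the shape that an `fneg` operand produces).
pub fn fnmadd(a: f32, b: f32, c: f32) -> f32 {
    (-a).mul_add(b, c)
}
```

Taking `x = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact product `x * x` is `1 + 2^-11 + 2^-24`. The two-step version rounds the product to `1 + 2^-11` and returns zero, while the fused version keeps the low bit and returns `2^-24`.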
Alex Crichton	d82ebcc102	x64: Enable load-coalescing for SSE/AVX instructions (#5841 ) * x64: Enable load-coalescing for SSE/AVX instructions This commit unlocks the ability to fold loads into operands of SSE and AVX instructions. When it happens, this benefits function size and also reduces register pressure. Previously this was not done because most SSE instructions require memory to be aligned. AVX instructions, however, do not have alignment requirements. The solution implemented here is one recommended by Chris which is to add a new `XmmMemAligned` newtype wrapper around `XmmMem`. All SSE instructions are now annotated as requiring an `XmmMemAligned` operand except for a few new instruction styles used specifically for instructions that don't require alignment (e.g. `movdqu`, `sd`, and `ss` instructions). All existing instruction helpers continue to take `XmmMem`, however. This way if an AVX lowering is chosen it can be used as-is. If an SSE lowering is chosen, however, then an automatic conversion from `XmmMem` to `XmmMemAligned` kicks in. This automatic conversion only fails for unaligned addresses in which case a load instruction is emitted and the operand becomes a temporary register instead. A number of prior `Xmm` arguments have now been converted to `XmmMem` as well. One change from this commit is that loading an unaligned operand for an SSE instruction previously would use the "correct type" of load, e.g. `movups` for f32x4 or `movupd` for f64x2, but now the loading happens in a context without type information so the `movdqu` instruction is generated. According to [this stack overflow question][question] it looks like modern processors won't penalize this "wrong" choice of type when the operand is then used for f32 or f64 oriented instructions. Finally this commit improves some reuse of logic in the `put_in_*_mem` helper to share code with `sinkable_load` and avoid duplication. 
With this in place some various ISLE rules have been updated as well. In the tests it can be seen that AVX-instructions are now automatically load-coalesced and use memory operands in a few cases. [question]: https://stackoverflow.com/questions/40854819/is-there-any-situation-where-using-movdqu-and-movupd-is-better-than-movups * Fix tests * Fix move-and-extend to be unaligned These don't have alignment requirements like other xmm instructions as well. Additionally add some ISA tests to ensure that their output is tested. * Review comments	2023-02-21 19:10:19 +00:00
Alex Crichton	c65de1f1b1	x64: Remove conditional `SseOpcode::uses_src1` (#5842 ) This is a follow-up to comments in #5795 to remove some cruft in the x64 instruction model to ensure that the shape of an `Inst` reflects what's going to happen in regalloc and encoding. This accessor was used to handle `round`, `pextr`, and `pshufb` instructions. The `round` ones had already moved to the appropriate `XmmUnary` variant and `pshufb` was additionally moved over to that variant as well. The `pextr*` instructions got a new `Inst` variant and additionally had their constructors slightly modified to no longer require the type as input. The encoding for these instructions now automatically handles the various type-related operands through a new `SseOpcode::Pextrq` operand to represent 64-bit movements.	2023-02-21 18:17:07 +00:00
Alex Crichton	e6a5ec3fde	x64: Tidy up some handling of sinkable loads (#5840 ) This commit refactors a bit about how sinkable loads are handled in the x64 backend. The intention is to bring most handling around sinkable loads up to date with the current state of the backend since things have changed since these were originally introduced, namely automatic conversions between types in ISLE. For example the `Value` type can be automatically converted to `RegMem` to perform load sinking, but some rules are still explicitly doing matching themselves. Here I've removed explicit handling of immediates and sinkable loads when they're the right-hand-side of an operation. These cases are already handled by the "base case" when converting a `Value` to a `RegMemImm`. Instead only rules explicitly for left-hand-side immediates and sinkable loads remain. This helps cut down on the number of explicit rules needed. Additionally in the same manner that `Value` can be automatically converted to `RegMem` I've added automatic conversions from `SinkableLoad` to `RegMem` and the various other newtypes. This helps cut down a bit on rule verbosity where `sink_load_*` is largely no longer necessary.	2023-02-21 18:15:08 +00:00
Alex Crichton	c26a65a854	x64: Add most remaining AVX lowerings (#5819 ) * x64: Add most remaining AVX lowerings This commit goes through `inst.isle` and adds a corresponding AVX lowering for most SSE lowerings. I opted to skip instructions where the SSE lowering didn't read/modify a register, such as `roundps`. I think that AVX will benefit these instructions when there's load-merging since AVX doesn't require alignment, but I've deferred that work to a future PR. Otherwise though in this PR I think all (or almost all) of the 3-operand forms of AVX instructions are supported with their SSE counterparts. This should ideally improve codegen slightly by removing register pressure and the need for `movdqa` between registers. I've attempted to ensure that there's at least one codegen test for all the new instructions. As a side note, the recent capstone integration into `precise-output` tests helped me catch a number of encoding bugs much earlier than otherwise, so I've found that incredibly useful in tests! * Move `vpinsr` instructions to their own variant Use true `XmmMem` and `GprMem` types in the instruction as well to get more type-level safety for what goes where. Remove `Inst::produces_const` accessor Instead of conditionally defining regalloc and various other operations instead add dedicated `MInst` variants for operations which are intended to produce a constant to have more clear interactions with regalloc and printing and such. * Fix tests * Register traps in `MachBuffer` for load-folding ops This adds a missing `add_trap` to encoding of VEX instructions with memory operands to ensure that if they cause a segfault that there's appropriate metadata for Wasmtime to understand that the instruction could in fact trap. This fixes a fuzz test case found locally where v8 trapped and Wasmtime didn't catch the signal and crashed the fuzzer.	2023-02-20 15:11:52 +00:00
Alex Crichton	cae3b26623	x64: Improve codegen for vectors with constant shift amounts (#5797 ) I stumbled across this working on #5795 and figured this was a nice opportunity to improve the codegen here.	2023-02-16 20:47:59 +00:00
Trevor Elliott	d99783fc91	Move default blocks into jump tables (#5756 ) Move the default block off of the br_table instruction, and into the JumpTable that it references.	2023-02-10 08:53:30 -08:00
Alex Crichton	de0e0bea3f	Legalize `b{and,or,xor}_not` into component instructions (#5709 ) * Remove trailing whitespace in `lower.isle` files * Legalize the `band_not` instruction into simpler form This commit legalizes the `band_not` instruction into `band`-of-`bnot`, or two instructions. This is intended to assist with egraph-based optimizations where the `band_not` instruction doesn't have to be specifically included in other bit-operation-patterns. Lowerings of the `band_not` instruction have been moved to a specialization of the `band` instruction. * Legalize `bor_not` into components Same as prior commit, but for the `bor_not` instruction. * Legalize bxor_not into bxor-of-bnot Same as prior commits. I think this also ended up fixing a bug in the s390x backend where `bxor_not x y` was actually translated as `bnot (bxor x y)` by accident given the test update changes. * Simplify not-fused operands for riscv64 Looks like some delegated-to rules have special-cases for "if this feature is enabled use the fused instruction" so move the clause for testing the feature up to the lowering phase to help trigger other rules if the feature isn't enabled. This should make the riscv64 backend more consistent with how other backends are implemented. * Remove B{and,or,xor}Not from cost of egraph metrics These shouldn't ever reach egraphs now that they're legalized away. * Add an egraph optimization for `x^-1 => ~x` This adds a simplification node to translate xor-against-minus-1 to a `bnot` instruction. This helps trigger various other optimizations in the egraph implementation and also various backend lowering rules for instructions. This is chiefly useful as wasm doesn't have a `bnot` equivalent, so it's encoded as `x^-1`. * Add a wasm test for end-to-end bitwise lowerings Test that end-to-end various optimizations are being applied for input wasm modules. * Specifically don't self-update rustup on CI I forget why this was here originally, but this is failing on Windows CI. 
In general there's no need to update rustup, so leave it as-is. * Cleanup some aarch64 lowering rules Previously a 32/64 split was necessary due to the `ALUOp` being different but that's been refactored away now so there's no longer any need for duplicate rules. * Narrow an x64 lowering rule This previously made more sense when it was `band_not` and rarely used, but be more specific in the type-filter on this rule that it's only applicable to SIMD types with lanes. * Simplify xor-against-minus-1 rule No need to have the commutative version since constants are already shuffled right for egraphs * Optimize band-of-bnot when bnot is on the left Use some more rules in the egraph algebraic optimizations to canonicalize band/bor/bxor with a `bnot` operand to put the operand on the right. That way the lowerings in the backends only have to list the rule once, with the operand on the right, to optimize both styles of input. * Add commutative lowering rules * Update cranelift/codegen/src/isa/x64/lower.isle Co-authored-by: Jamey Sharp <jamey@minilop.net> --------- Co-authored-by: Jamey Sharp <jamey@minilop.net>	2023-02-06 13:53:40 -06:00
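Two of the rewrites described in #5709 rest on simple bit identities. A minimal Rust sketch on `u32` follows; the function names are illustrative only and are not CLIF instructions or Cranelift APIs.

```rust
/// `band_not x y` legalized into its components: `band` of `x` and `bnot y`.
fn band_of_bnot(x: u32, y: u32) -> u32 {
    x & !y
}

/// The shape wasm produces for bitwise-not: xor against minus one
/// (all bits set).
fn xor_minus_one(x: u32) -> u32 {
    x ^ (-1i32 as u32)
}

/// What the egraph rule `x ^ -1 => ~x` rewrites the above into.
fn bnot(x: u32) -> u32 {
    !x
}
```

Because `xor_minus_one` and `bnot` agree on every input, a backend rule written against `bnot` (for example an and-not lowering) can also fire on code that started out as `x ^ -1`.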
Trevor Elliott	6d8f2be9e1	Use `andn` for `band_not` when bmi1 is present (#5701 ) We can use the andn instruction for the lowering of band_not on x64 when bmi1 is available.	2023-02-03 16:23:18 -08:00
Jun Ryung Ju	9cd4146939	Implemented `b{and,or,xor}_not` bitops for ty_int_ref_scalar_64 type. (#5604 ) * Implemented `b{and,or,xor}_not` bitops for ty_int_ref_scalar_64 type. * Added tests.	2023-02-01 21:57:18 -08:00
Trevor Elliott	a5698cedf8	cranelift: Remove brz and brnz (#5630 ) Remove the brz and brnz instructions, as their behavior is now redundant with brif.	2023-01-30 20:34:56 +00:00
Trevor Elliott	a181ad2932	Cleanup the use of `maybe_uextend` in the x64 lowerings (#5637 ) Use maybe_uextend for the brnz lowerings on x64.	2023-01-25 17:28:48 -08:00
Trevor Elliott	b58a197d33	cranelift: Add a conditional branch instruction with two targets (#5446 ) Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both. This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks. Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.	2023-01-24 14:37:16 -08:00
Trevor Elliott	1e6c13d83e	cranelift: Rework block instructions to use BlockCall (#5464 ) Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList. To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that cover both sets of arguments: inst_values - returns an iterator that traverses values in the instruction and block arguments; map_inst_values - applies a function to each value in the instruction and block arguments; overwrite_inst_values - overwrites all values in an instruction and block arguments with values from the iterator. Co-authored-by: Jamey Sharp <jamey@minilop.net>	2023-01-17 16:31:15 -08:00
uint256_t	b00455135e	Cranelift: Implement 'iabs' for scalar types on x86_64 (#5527 ) * Implement 'iabs' for scalar types on x86_64 * Small fix	2023-01-05 21:33:12 -08:00
Chris Fallin	03463458e4	Cranelift: fix branch-of-icmp/fcmp regression: look through `uextend`. (#5487 ) In #5031, we removed `bool` types from CLIF, using integers instead for "truthy" values. This greatly simplified the IR, and was generally an improvement. However, because x86's `SETcc` instruction sets only the low 8 bits of a register, we chose to use `i8` types as the result of `icmp` and `fcmp`, to avoid the need for a masking operation when materializing the result. Unfortunately this means that uses of truthy values often now have `uextend` operations, especially when coming from Wasm (where truthy values are naturally `i32`-typed). For example, where we previously had `(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`. It's arguable whether or not we should switch to `i32` truthy values -- in most cases we can avoid materializing a value that's immediately used for a branch or select, so a mask would in most cases be unnecessary, and it would be a win at the IR level -- but irrespective of that, this change did regress our generated code quality: our backends had patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we were always materializing truthy values. Many blocks thus ended with "cmp; setcc; cmp; test; branch" rather than "cmp; branch". In #5391 we noticed this and fixed it on x64, but it was a general problem on aarch64 and riscv64 as well. This PR introduces a `maybe_uextend` extractor that "looks through" uextends, and uses it where we consume truthy values, thus fixing the regression. This PR also adds compile filetests to ensure we don't regress again. The riscv64 backend has not been updated here because doing so appears to trigger another issue in its branch handling; fixing that is TBD.	2022-12-22 01:43:44 -08:00
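The "look through" behavior described in #5487 can be modeled on a toy expression tree. This is a sketch under stated assumptions: the `Value` enum and `branch_fuses_with_compare` are hypothetical stand-ins, and the real `maybe_uextend` is an ISLE extractor over CLIF values, not a Rust function with this signature.

```rust
/// Toy stand-in for CLIF values; the payloads are placeholder operand ids.
#[derive(Debug, PartialEq)]
enum Value {
    Icmp(u32, u32),
    Uextend(Box<Value>),
    Other(u32),
}

/// Yield the operand of a `uextend` if there is one, otherwise the value
/// itself, so a single pattern covers both shapes.
fn maybe_uextend(v: &Value) -> &Value {
    match v {
        Value::Uextend(inner) => &**inner,
        other => other,
    }
}

/// A branch can consume the comparison directly ("cmp; branch") when its
/// condition is an `icmp`, with or without an intervening `uextend`.
fn branch_fuses_with_compare(cond: &Value) -> bool {
    matches!(maybe_uextend(cond), Value::Icmp(..))
}
```

Without the look-through step, the `Uextend(Icmp(..))` case would fall back to materializing the truthy value first, which is the regression the commit describes.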
Chris Fallin	22439f7b39	support select_spectre_guard and select on i128 conditions on all platforms. (#5460 ) Fixes #5199. Fixes #5200. Fixes #5452. Fixes #5453. On riscv64, there is apparently an autoconversion from `ValueRegs` to `Reg` that takes just the low register [0], and removing this conversion causes 48 errors. As a result of this, `select` with an `i128` condition was silently miscompiling, testing only the low 64 bits. We should remove this autoconversion to ensure we aren't missing any other silent truncations, but for now this PR just adds the explicit `I128` logic for `select` / `select_spectre_guard`. [0] `d9fdbfd50e/cranelift/codegen/src/isa/riscv64/inst.isle (L1762)`	2022-12-16 14:18:22 -08:00
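The miscompile described in #5460 comes down to testing only half of a 128-bit condition. A minimal Rust sketch of the two behaviors, with hypothetical function names and `u64` operands chosen for brevity:

```rust
/// Buggy shape: only the low 64 bits of the condition are tested, as
/// happens when a two-register value is silently narrowed to its low register.
fn select_low_only(cond: u128, a: u64, b: u64) -> u64 {
    let lo = cond as u64;
    if lo != 0 { a } else { b }
}

/// Correct shape: the condition is truthy if either half is non-zero.
fn select_full(cond: u128, a: u64, b: u64) -> u64 {
    let lo = cond as u64;
    let hi = (cond >> 64) as u64;
    if (lo | hi) != 0 { a } else { b }
}
```

Any condition whose set bits all sit in the high half, such as `1 << 64`, separates the two versions.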
Ulrich Weigand	f0af622208	Simplify LowerBackend interface (#5432 ) * Refactor lower_branch to have Unit result Branches cannot have any output, so it is more straightforward to have the ISLE term return Unit instead of InstOutput. Also provide a new `emit_side_effect` term to simplify implementation of `lower_branch` rules with Unit result. * Simplify LowerBackend interface Move all remaining asserts from the LowerBackend::lower and ::lower_branch_group into the common call site. Change return value of ::lower to Option<InstOutput>, and return value of ::lower_branch_group to Option<()> to match ISLE term signature. Only pass the first branch into ::lower_branch_group and rename it to ::lower_branch. As a result of all those changes, LowerBackend routines now consist solely of calls to the corresponding ISLE routines.	2022-12-14 00:48:25 +00:00
Trevor Elliott	a5ecb5e647	x64: Share a zero in the ushr translation on x64 to free up a register (#5424 ) Share a zero value in the translation of ushr for i128. This increases the lifetime of the value by a few instructions, and reduces the number of registers used in the translation by one, which seems like an acceptable trade-off.	2022-12-12 18:15:43 -08:00
Chris Fallin	9397ea1abe	Cranelift: implement general select_spectre_guard fallbacks. (#5420 ) When adding some optimization rules for `icmp` in the egraph infrastructure, we ended up creating a path to legal CLIF but with patterns unsupported by three of our four backends: specifically, `select_spectre_guard` with a general truthy input, rather than an `icmp`. In #5206 we discussed replacing `select_spectre_guard` with something more specific, and that could still be a long-term solution here, but doing so now would interfere with ongoing refactoring of heap access lowering, so I've opted not to do so. (In that issue I was concerned about complexity and didn't see the need but with this fuzzbug I'm starting to feel a bit differently; maybe we should remove this non-orthogonal op in the long run.) Fixes #5417.	2022-12-12 17:13:34 -08:00
Ulrich Weigand	e913cf3647	Remove IFLAGS/FFLAGS types (#5406 ) All instructions using the CPU flags types (IFLAGS/FFLAGS) were already removed. This patch completes the cleanup by removing all remaining instructions that define values of CPU flags types, as well as the types themselves. Specifically, the following features are removed: - The IFLAGS and FFLAGS types and the SpecialType category. - Special handling of IFLAGS and FFLAGS in machinst/isle.rs and machinst/lower.rs. - The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry, isub_ifbin, isub_ifbout, and isub_ifborrow instructions. - The writes_cpu_flags instruction property. - The flags verifier pass. - Flags handling in the interpreter. All of these features are currently unused; no functional change intended by this patch. This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.	2022-12-09 13:42:03 -08:00
Jamey Sharp	8726eeefb3	cranelift-isle: Add "partial" flag for constructors (#5392 ) * cranelift-isle: Add "partial" flag for constructors Instead of tying fallibility of constructors to whether they're either internal or pure, this commit assumes all constructors are infallible unless tagged otherwise with a "partial" flag. Internal constructors without the "partial" flag are not allowed to use constructors which have the "partial" flag on the right-hand side of any rules, because they have no way to report last-minute match failures. Multi-constructors should never be "partial"; they report match failures with an empty iterator instead. In turn this means you can't use partial constructors on the right-hand side of internal multi-constructor rules. However, you can use the same constructors on the left-hand side with `if` or `if-let` instead. In many cases, ISLE can already trivially prove that an internal constructor always returns `Some`. With this commit, those cases are largely unchanged, except for removing all the `Option`s and `Some`s from the generated code for those terms. However, for internal non-partial constructors where ISLE could not prove that, it now emits an `unreachable!` panic as the last-resort, instead of returning `None` like it used to do. Among the existing backends, here's how many constructors have these panic cases: - x64: 14% (53/374) - aarch64: 15% (41/277) - riscv64: 23% (26/114) - s390x: 47% (268/567) It's often possible to rewrite rules so that ISLE can tell the panic can never be hit. Just ensure that there's a lowest-priority rule which has no constraints on the left-hand side. But in many of these constructors, it's difficult to statically prove the unhandled cases are unreachable because that's only down to knowledge about how they're called or other preconditions. So this commit does not try to enforce that all terms have a last-resort fallback rule. 
* Check term flags while translating expressions Instead of doing it in a separate pass afterward. This involved threading all the term flags (pure, multi, partial) through the recursive `translate_expr` calls, so I extracted the flags to a new struct so they can all be passed together. * Validate multi-term usage Now that I've threaded the flags through `translate_expr`, it's easy to check this case too, so let's just do it. * Extract `ReturnKind` to use in `ExternalSig` There are only three legal states for the combination of `multi` and `infallible`, so replace those fields of `ExternalSig` with a three-state enum. * Remove `Option` wrapper from multi-extractors too If we'd had any external multi-constructors this would correct their signatures as well. * Update ISLE tests * Tag prelude constructors as pure where appropriate I believe the only reason these weren't marked `pure` before was because that would have implied that they're also partial. Now that those two states are specified separately we apply this flag more places. * Fix my changes to aarch64 `lower_bmask` and `imm` terms	2022-12-07 17:16:03 -08:00

167 Commits