Implement the relaxed SIMD proposal (#5892)

* Initial support for the Relaxed SIMD proposal

This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen is supported on the x64 and
AArch64 backends at this time.

The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository for now because the
WebAssembly/testsuite repository hasn't been updated yet.

A summary of the changes made in this commit:

* Lowerings for all relaxed simd opcodes have been added, currently all
  exhibiting deterministic behavior. This means that few lowerings are
  optimal on the x86 backend, but on the AArch64 backend, for example,
  all lowerings should be optimal.

* Support is added to codegen to, eventually, conditionally generate
  different code based on input codegen flags. This is intended to
  enable codegen to use more efficient instructions on x86 by default,
  for example, while still allowing embedders to force
  architecture-independent semantics and behavior. One good example of
  this is the `f32x4.relaxed_madd` instruction which, when deterministic,
  forces the `fma` instruction; otherwise, if the backend doesn't
  support `fma`, a separate multiply and add are performed instead.

* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
  x86 backend as they're now exercised by the deterministic lowerings of
  relaxed simd instructions.

* Sample codegen tests were added for x86 and aarch64 for some relaxed
  simd instructions.

* Wasmtime embedder support for the relaxed-simd proposal and for forcing
  determinism has been added to `Config` and the CLI.

* Support has been added to the `*.wast` runtime execution for the
  `(either ...)` matcher used in the relaxed-simd proposal.

* Tests for relaxed-simd are run both with a default `Engine` as well as
  a "force deterministic" `Engine` to test both configurations.

* All tests from the upstream repository were copied into Wasmtime.
  These tests should be deleted when WebAssembly/testsuite is updated.
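
To make the `relaxed_madd` bullet above concrete, here is a minimal scalar sketch of a single lane. It is not the Cranelift translation itself: the function name and its two boolean parameters are hypothetical stand-ins for the `relaxed_simd_deterministic()` and `has_native_fma()` hooks, and Rust's `f32::mul_add` is used to model the fused operation.

```rust
/// Scalar model of one lane of `f32x4.relaxed_madd`.
///
/// `deterministic` and `has_native_fma` are hypothetical parameters that
/// mirror the two conditions checked during translation in this commit.
fn relaxed_madd_lane(a: f32, b: f32, c: f32, deterministic: bool, has_native_fma: bool) -> f32 {
    if deterministic || has_native_fma {
        // Fused: `a * b + c` is computed exactly and rounded once.
        a.mul_add(b, c)
    } else {
        // Unfused: the product is rounded, then the sum is rounded again.
        a * b + c
    }
}
```

With `a = b = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact product is `1 + 2^-11 + 2^-24`. The fused path keeps the `2^-24` term, while the unfused path rounds it away in the multiply and returns zero, which is the kind of host-visible difference the deterministic flag is meant to rule out.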

* x64: Add x86-specific lowerings for relaxed simd

This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
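
As an illustration of what "slightly different semantics" means here, the following scalar sketch models three of the pairs per lane, based on the instruction documentation added in this commit. All function names are hypothetical and exist only for this example; Rust's saturating `as` cast is assumed as the model for the deterministic truncation.

```rust
/// Model of wasm `i8x16.swizzle` (CLIF `swizzle`): any index outside
/// `0..16` selects zero.
fn swizzle(x: [u8; 16], y: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if y[i] < 16 { x[y[i] as usize] } else { 0 };
    }
    out
}

/// Model of `x86_pshufb`: the low four bits select a lane unless the top
/// bit of the index is set, in which case the lane is zeroed.
fn x86_pshufb(x: [u8; 16], y: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if y[i] < 128 { x[(y[i] % 16) as usize] } else { 0 };
    }
    out
}

/// Model of one lane of `sqmul_round_sat` (wasm `i16x8.q15mulr_sat_s`).
fn sqmul_round_sat_lane(a: i16, b: i16) -> i16 {
    let r = (a as i32 * b as i32 + 0x4000) >> 15;
    r.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Model of one lane of `x86_pmulhrsw`: same rounding, but the result
/// wraps instead of saturating.
fn x86_pmulhrsw_lane(a: i16, b: i16) -> i16 {
    let r = (((a as i32 * b as i32) >> 14) + 1) >> 1;
    r as i16
}

/// Model of one lane of `fcvt_to_sint_sat`: NaN becomes 0 and
/// out-of-range values saturate (the behavior of Rust's `as` cast).
fn fcvt_to_sint_sat_lane(f: f32) -> i32 {
    f as i32
}

/// Model of one lane of `x86_cvtt2dq`: NaN and out-of-range values
/// produce `INT_MIN`.
fn x86_cvtt2dq_lane(f: f32) -> i32 {
    if f.is_nan() || f < -2147483648.0 || f >= 2147483648.0 {
        i32::MIN
    } else {
        f as i32
    }
}
```

The pairs agree on in-range inputs and differ only on edge cases: swizzle indices in `16..128`, a product of two `i16::MIN` lanes, and NaN or out-of-range floats.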

* Add libcall support for fma to Wasmtime

This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.

* Ignore relaxed-simd tests on s390x and riscv64

* Enable relaxed-simd tests on s390x

* Update cranelift/codegen/meta/src/shared/instructions.rs

Co-authored-by: Andrew Brown <andrew.brown@intel.com>

* Add a FIXME from review

* Add notes about deterministic semantics

* Don't default `has_native_fma` to `true`

* Review comments and rebase fixes

---------

Co-authored-by: Andrew Brown <andrew.brown@intel.com>
Alex Crichton
2023-03-07 09:52:41 -06:00
committed by GitHub
parent e2dcb19099
commit 8bb183f16e
34 changed files with 1727 additions and 37 deletions

View File

@@ -386,6 +386,27 @@ fn define_simd_lane_access(
.operands_out(vec![a]),
);
ig.push(
Inst::new(
"x86_pshufb",
r#"
A vector swizzle lookalike which has the semantics of `pshufb` on x64.
This instruction will permute the 8-bit lanes of `x` with the indices
specified in `y`. Each lane in the mask, `y`, uses the bottom four
bits for selecting the lane from `x` unless the most significant bit
is set, in which case the lane is zeroed. The output vector will have
the following contents when the element of `y` is in these ranges:
* `[0, 127]` -> `x[y[i] % 16]`
* `[128, 255]` -> 0
"#,
&formats.binary,
)
.operands_in(vec![x, y])
.operands_out(vec![a]),
);
let x = &Operand::new("x", TxN).with_doc("The vector to modify");
let y = &Operand::new("y", &TxN.lane_of()).with_doc("New lane value");
let Idx = &Operand::new("Idx", &imm.uimm8).with_doc("Lane index");
@@ -1436,7 +1457,7 @@ pub(crate) fn define(
Conditional select of bits.
For each bit in `c`, this instruction selects the corresponding bit from `x` if the bit
in `c` is 1 and the corresponding bit from `y` if the bit in `c` is 0. See also:
`select`, `vselect`.
"#,
&formats.ternary,
@@ -1445,6 +1466,24 @@ pub(crate) fn define(
.operands_out(vec![a]),
);
ig.push(
Inst::new(
"x86_blendv",
r#"
A bitselect-lookalike instruction except with the semantics of
`blendv`-related instructions on x86.
This instruction will use the top bit of each lane in `c`, the condition
mask. If the bit is 1 then the corresponding lane from `x` is chosen.
Otherwise the corresponding lane from `y` is chosen.
"#,
&formats.ternary,
)
.operands_in(vec![c, x, y])
.operands_out(vec![a]),
);
let c = &Operand::new("c", &TxN.as_bool()).with_doc("Controlling vector");
let x = &Operand::new("x", TxN).with_doc("Value to use where `c` is true");
let y = &Operand::new("y", TxN).with_doc("Value to use where `c` is false");
@@ -1698,6 +1737,22 @@ pub(crate) fn define(
.operands_out(vec![qa]),
);
ig.push(
Inst::new(
"x86_pmulhrsw",
r#"
A similar instruction to `sqmul_round_sat` except with the semantics
of x86's `pmulhrsw` instruction.
This is the same as `sqmul_round_sat` except when both input lanes are
`i16::MIN`.
"#,
&formats.binary,
)
.operands_in(vec![qx, qy])
.operands_out(vec![qa]),
);
{
// Integer division and remainder are scalar-only; most
// hardware does not directly support vector integer division.
@@ -3135,6 +3190,36 @@ pub(crate) fn define(
.operands_out(vec![a]),
);
let I8x16 = &TypeVar::new(
"I8x16",
"A SIMD vector type consisting of 16 lanes of 8-bit integers",
TypeSetBuilder::new()
.ints(8..8)
.simd_lanes(16..16)
.includes_scalars(false)
.build(),
);
let x = &Operand::new("x", I8x16);
let y = &Operand::new("y", I8x16);
let a = &Operand::new("a", I16x8);
ig.push(
Inst::new(
"x86_pmaddubsw",
r#"
An instruction with equivalent semantics to `pmaddubsw` on x86.
This instruction will take signed bytes from the first argument and
multiply them against unsigned bytes in the second argument. Adjacent
pairs are then added, with saturation, to a 16-bit value and are packed
into the result.
"#,
&formats.binary,
)
.operands_in(vec![x, y])
.operands_out(vec![a]),
);
let IntTo = &TypeVar::new(
"IntTo",
"A larger integer type with the same number of lanes",
@@ -3378,6 +3463,20 @@ pub(crate) fn define(
.operands_out(vec![a]),
);
ig.push(
Inst::new(
"x86_cvtt2dq",
r#"
A float-to-integer conversion instruction for vectors-of-floats which
has the same semantics as `cvttp{s,d}2dq` on x86. This specifically
returns `INT_MIN` for NaN or out-of-bounds lanes.
"#,
&formats.unary,
)
.operands_in(vec![x])
.operands_out(vec![a]),
);
let Int = &TypeVar::new(
"Int",
"A scalar or vector integer type",

View File

@@ -214,6 +214,10 @@ impl TargetIsa for AArch64Backend {
cs.set_skipdata(true)?;
Ok(cs)
}
fn has_native_fma(&self) -> bool {
true
}
}
impl fmt::Display for AArch64Backend {

View File

@@ -315,6 +315,13 @@ pub trait TargetIsa: fmt::Display + Send + Sync {
fn to_capstone(&self) -> Result<capstone::Capstone, capstone::Error> {
Err(capstone::Error::UnsupportedArch)
}
/// Returns whether this ISA has a native fused-multiply-and-add instruction
/// for floats.
///
/// Currently this only returns false on x86 when some native features are
/// not detected.
fn has_native_fma(&self) -> bool;
}
/// Methods implemented for free for target ISA!

View File

@@ -186,6 +186,10 @@ impl TargetIsa for Riscv64Backend {
cs.set_skipdata(true)?;
Ok(cs)
}
fn has_native_fma(&self) -> bool {
true
}
}
impl fmt::Display for Riscv64Backend {

View File

@@ -186,6 +186,10 @@ impl TargetIsa for S390xBackend {
Ok(cs)
}
fn has_native_fma(&self) -> bool {
true
}
}
impl fmt::Display for S390xBackend {

View File

@@ -1212,6 +1212,20 @@
(decl pure vconst_all_ones_or_all_zeros () Constant)
(extern extractor vconst_all_ones_or_all_zeros vconst_all_ones_or_all_zeros)
;;;; Rules for `x86_blendv` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $I8X16
(x86_blendv condition if_true if_false)))
(x64_pblendvb if_false if_true condition))
(rule (lower (has_type $I32X4
(x86_blendv condition if_true if_false)))
(x64_blendvps if_false if_true condition))
(rule (lower (has_type $I64X2
(x86_blendv condition if_true if_false)))
(x64_blendvpd if_false if_true condition))
;;;; Rules for `vselect` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type ty @ (multi_lane _bits _lanes)
@@ -2145,6 +2159,11 @@
(rule (lower (debugtrap))
(side_effect (x64_hlt)))
;; Rules for `x86_pmaddubsw` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $I16X8 (x86_pmaddubsw x y)))
(x64_pmaddubsw y x))
;; Rules for `fadd` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $F32 (fadd x y)))
@@ -3169,6 +3188,11 @@
;; values greater than max signed int.
(x64_paddd tmp1 dst)))
;; Rules for `x86_cvtt2dq` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $I32X4 (x86_cvtt2dq val @ (value_type $F32X4))))
(x64_cvttps2dq val))
;; Rules for `iadd_pairwise` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $I16X8 (iadd_pairwise x y)))
@@ -3304,6 +3328,12 @@
(dst Xmm (x64_minpd a tmp1)))
(x64_cvttpd2dq dst)))
;; This rule is a special case for handling the translation of the wasm op
;; `i32x4.relaxed_trunc_f64x2_s_zero`.
(rule (lower (has_type $I32X4 (snarrow (has_type $I64X2 (x86_cvtt2dq val))
(vconst (u128_from_constant 0)))))
(x64_cvttpd2dq val))
;; Rules for `unarrow` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (has_type $I8X16 (unarrow a @ (value_type $I16X8) b)))
@@ -3559,6 +3589,11 @@
(let ((mask Xmm (x64_paddusb mask (swizzle_zero_mask))))
(x64_pshufb src mask)))
;; Rules for `x86_pshufb` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (x86_pshufb src mask))
(x64_pshufb src mask))
;; Rules for `extractlane` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Remove the extractlane instruction, leaving the float where it is. The upper
@@ -3736,7 +3771,12 @@
(cmp Xmm (x64_pcmpeqw dst mask)))
(x64_pxor dst cmp)))
;; Rules for `sqmul_round_sat` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Rules for `x86_pmulhrsw` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(rule (lower (x86_pmulhrsw qx @ (value_type $I16X8) qy))
(x64_pmulhrsw qx qy))
;; Rules for `uunarrow` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; TODO: currently we only lower a special case of `uunarrow` needed to support
;; the translation of wasm's i32x4.trunc_sat_f64x2_u_zero operation.

View File

@@ -184,6 +184,10 @@ impl TargetIsa for X64Backend {
.syntax(arch::x86::ArchSyntax::Att)
.build()
}
fn has_native_fma(&self) -> bool {
self.x64_flags.use_fma()
}
}
impl fmt::Display for X64Backend {

View File

@@ -0,0 +1,87 @@
;;! target = "aarch64"
;;! compile = true
(module
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_s
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_u
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_s_zero
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_u_zero
)
(func (param v128 v128) (result v128)
local.get 0
local.get 1
i16x8.relaxed_dot_i8x16_i7x16_s
)
(func (param v128 v128 v128) (result v128)
local.get 0
local.get 1
local.get 2
i32x4.relaxed_dot_i8x16_i7x16_add_s
)
)
;; function u0:0:
;; block0:
;; fcvtzs v0.4s, v0.4s
;; b label1
;; block1:
;; ret
;;
;; function u0:1:
;; block0:
;; fcvtzu v0.4s, v0.4s
;; b label1
;; block1:
;; ret
;;
;; function u0:2:
;; block0:
;; fcvtzs v4.2d, v0.2d
;; sqxtn v0.2s, v4.2d
;; b label1
;; block1:
;; ret
;;
;; function u0:3:
;; block0:
;; fcvtzu v4.2d, v0.2d
;; uqxtn v0.2s, v4.2d
;; b label1
;; block1:
;; ret
;;
;; function u0:4:
;; block0:
;; smull v6.8h, v0.8b, v1.8b
;; smull2 v7.8h, v0.16b, v1.16b
;; addp v0.8h, v6.8h, v7.8h
;; b label1
;; block1:
;; ret
;;
;; function u0:5:
;; block0:
;; smull v17.8h, v0.8b, v1.8b
;; smull2 v18.8h, v0.16b, v1.16b
;; addp v17.8h, v17.8h, v18.8h
;; saddlp v17.4s, v17.8h
;; add v0.4s, v17.4s, v2.4s
;; b label1
;; block1:
;; ret

View File

@@ -0,0 +1,161 @@
;;! target = "x86_64"
;;! compile = true
;;! relaxed_simd_deterministic = true
;;! settings = ["has_avx=true"]
(module
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_s
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_u
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_s_zero
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_u_zero
)
(func (param v128 v128) (result v128)
local.get 0
local.get 1
i16x8.relaxed_dot_i8x16_i7x16_s
)
(func (param v128 v128 v128) (result v128)
local.get 0
local.get 1
local.get 2
i32x4.relaxed_dot_i8x16_i7x16_add_s
)
)
;; function u0:0:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; vcmpps $0 %xmm0, %xmm0, %xmm3
;; vandps %xmm0, %xmm3, %xmm5
;; vpxor %xmm3, %xmm5, %xmm7
;; vcvttps2dq %xmm5, %xmm9
;; vpand %xmm9, %xmm7, %xmm11
;; vpsrad %xmm11, $31, %xmm13
;; vpxor %xmm13, %xmm9, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:1:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; xorps %xmm3, %xmm3, %xmm3
;; vmaxps %xmm0, %xmm3, %xmm5
;; vpcmpeqd %xmm3, %xmm3, %xmm7
;; vpsrld %xmm7, $1, %xmm9
;; vcvtdq2ps %xmm9, %xmm11
;; vcvttps2dq %xmm5, %xmm13
;; vsubps %xmm5, %xmm11, %xmm15
;; vcmpps $2 %xmm11, %xmm15, %xmm1
;; vcvttps2dq %xmm15, %xmm3
;; vpxor %xmm3, %xmm1, %xmm5
;; pxor %xmm7, %xmm7, %xmm7
;; vpmaxsd %xmm5, %xmm7, %xmm9
;; vpaddd %xmm9, %xmm13, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:2:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; vcmppd $0 %xmm0, %xmm0, %xmm3
;; vandps %xmm3, const(0), %xmm5
;; vminpd %xmm0, %xmm5, %xmm7
;; vcvttpd2dq %xmm7, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:3:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; xorpd %xmm3, %xmm3, %xmm3
;; vmaxpd %xmm0, %xmm3, %xmm5
;; vminpd %xmm5, const(0), %xmm7
;; vroundpd $3, %xmm7, %xmm9
;; vaddpd %xmm9, const(1), %xmm11
;; vshufps $136 %xmm11, %xmm3, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:4:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; vpmovsxbw %xmm0, %xmm10
;; vpmovsxbw %xmm1, %xmm12
;; vpmullw %xmm10, %xmm12, %xmm14
;; vpalignr $8 %xmm0, %xmm0, %xmm8
;; vpmovsxbw %xmm8, %xmm10
;; vpalignr $8 %xmm1, %xmm1, %xmm12
;; vpmovsxbw %xmm12, %xmm15
;; vpmullw %xmm10, %xmm15, %xmm0
;; vphaddw %xmm14, %xmm0, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:5:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; vpmovsxbw %xmm0, %xmm13
;; vpmovsxbw %xmm1, %xmm15
;; vpmullw %xmm13, %xmm15, %xmm3
;; vpalignr $8 %xmm0, %xmm0, %xmm11
;; vpmovsxbw %xmm11, %xmm13
;; vpalignr $8 %xmm1, %xmm1, %xmm15
;; vpmovsxbw %xmm15, %xmm1
;; vpmullw %xmm13, %xmm1, %xmm4
;; vphaddw %xmm3, %xmm4, %xmm15
;; vpmaddwd %xmm15, const(0), %xmm15
;; vpaddd %xmm15, %xmm2, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret

View File

@@ -0,0 +1,140 @@
;;! target = "x86_64"
;;! compile = true
(module
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_s
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f32x4_u
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_s_zero
)
(func (param v128) (result v128)
local.get 0
i32x4.relaxed_trunc_f64x2_u_zero
)
(func (param v128 v128) (result v128)
local.get 0
local.get 1
i16x8.relaxed_dot_i8x16_i7x16_s
)
(func (param v128 v128 v128) (result v128)
local.get 0
local.get 1
local.get 2
i32x4.relaxed_dot_i8x16_i7x16_add_s
)
)
;; function u0:0:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; cvttps2dq %xmm0, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:1:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; xorps %xmm6, %xmm6, %xmm6
;; movdqa %xmm0, %xmm10
;; maxps %xmm10, %xmm6, %xmm10
;; pcmpeqd %xmm6, %xmm6, %xmm6
;; psrld %xmm6, $1, %xmm6
;; cvtdq2ps %xmm6, %xmm14
;; cvttps2dq %xmm10, %xmm13
;; subps %xmm10, %xmm14, %xmm10
;; cmpps $2, %xmm14, %xmm10, %xmm14
;; cvttps2dq %xmm10, %xmm0
;; pxor %xmm0, %xmm14, %xmm0
;; pxor %xmm7, %xmm7, %xmm7
;; pmaxsd %xmm0, %xmm7, %xmm0
;; paddd %xmm0, %xmm13, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:2:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; cvttpd2dq %xmm0, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:3:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; xorpd %xmm3, %xmm3, %xmm3
;; movdqa %xmm0, %xmm6
;; maxpd %xmm6, %xmm3, %xmm6
;; minpd %xmm6, const(0), %xmm6
;; roundpd $3, %xmm6, %xmm0
;; addpd %xmm0, const(1), %xmm0
;; shufps $136, %xmm0, %xmm3, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:4:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; movdqa %xmm1, %xmm4
;; pmaddubsw %xmm4, %xmm0, %xmm4
;; movdqa %xmm4, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret
;;
;; function u0:5:
;; pushq %rbp
;; unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
;; movq %rsp, %rbp
;; unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
;; block0:
;; movdqa %xmm0, %xmm8
;; movdqa %xmm1, %xmm0
;; pmaddubsw %xmm0, %xmm8, %xmm0
;; pmaddwd %xmm0, const(0), %xmm0
;; paddd %xmm0, %xmm2, %xmm0
;; jmp label1
;; block1:
;; movq %rbp, %rsp
;; popq %rbp
;; ret

View File

@@ -29,6 +29,9 @@ pub struct TestConfig {
#[serde(default)]
pub heaps: Vec<TestHeap>,
#[serde(default)]
pub relaxed_simd_deterministic: bool,
}
impl TestConfig {

View File

@@ -82,6 +82,7 @@ impl<'data> ModuleEnvironment<'data> for ModuleEnv {
wasmparser::WasmFeatures {
memory64: true,
multi_memory: true,
relaxed_simd: true,
..self.inner.wasm_features()
}
}
@@ -613,4 +614,12 @@ impl<'a> FuncEnvironment for FuncEnv<'a> {
{
self.inner.heaps()
}
fn relaxed_simd_deterministic(&self) -> bool {
self.config.relaxed_simd_deterministic
}
fn is_x86(&self) -> bool {
self.config.target.contains("x86_64")
}
}

View File

@@ -1358,6 +1358,11 @@ where
Opcode::GetFramePointer => unimplemented!("GetFramePointer"),
Opcode::GetStackPointer => unimplemented!("GetStackPointer"),
Opcode::GetReturnAddress => unimplemented!("GetReturnAddress"),
Opcode::X86Pshufb => unimplemented!("X86Pshufb"),
Opcode::X86Blendv => unimplemented!("X86Blendv"),
Opcode::X86Pmulhrsw => unimplemented!("X86Pmulhrsw"),
Opcode::X86Pmaddubsw => unimplemented!("X86Pmaddubsw"),
Opcode::X86Cvtt2dq => unimplemented!("X86Cvtt2dq"),
})
}

View File

@@ -1778,13 +1778,10 @@ pub fn translate_operator<FE: FuncEnvironment + ?Sized>(
state.push1(builder.ins().sshr(bitcast_a, b))
}
Operator::V128Bitselect => {
let (a, b, c) = state.pop3();
let bitcast_a = optionally_bitcast_vector(a, I8X16, builder);
let bitcast_b = optionally_bitcast_vector(b, I8X16, builder);
let bitcast_c = optionally_bitcast_vector(c, I8X16, builder);
let (a, b, c) = pop3_with_bitcast(state, I8X16, builder);
// The CLIF operand ordering is slightly different and the types of all three
// operands must match (hence the bitcast).
state.push1(builder.ins().bitselect(bitcast_c, bitcast_a, bitcast_b))
state.push1(builder.ins().bitselect(c, a, b))
}
Operator::V128AnyTrue => {
let a = pop1_with_bitcast(state, type_of(op), builder);
@@ -1938,11 +1935,23 @@ pub fn translate_operator<FE: FuncEnvironment + ?Sized>(
state.push1(builder.ins().snarrow(converted_a, zero));
}
Operator::I32x4TruncSatF32x4U => {
// FIXME(#5913): the relaxed instructions here are translated the same
// as the saturating instructions, even when the code generator
// configuration allows for different semantics across hosts. On x86,
// however, it's theoretically possible to have a slightly more optimal
// lowering which accounts for NaN differently, although the lowering is
// still not trivial (i.e. not a single instruction). At this time the
// more-optimal-but-still-large lowering for x86 is not implemented so
// the relaxed instructions are listed here instead of down below with
// the other relaxed instructions. An x86-specific implementation (or
// perhaps for other backends too) should be added and the codegen for
// the relaxed instruction should conditionally be different.
Operator::I32x4RelaxedTruncF32x4U | Operator::I32x4TruncSatF32x4U => {
let a = pop1_with_bitcast(state, F32X4, builder);
state.push1(builder.ins().fcvt_to_uint_sat(I32X4, a))
}
Operator::I32x4TruncSatF64x2UZero => {
Operator::I32x4RelaxedTruncF64x2UZero | Operator::I32x4TruncSatF64x2UZero => {
let a = pop1_with_bitcast(state, F64X2, builder);
let converted_a = builder.ins().fcvt_to_uint_sat(I64X2, a);
let handle = builder.func.dfg.constants.insert(vec![0u8; 16].into());
@@ -1950,6 +1959,7 @@ pub fn translate_operator<FE: FuncEnvironment + ?Sized>(
state.push1(builder.ins().uunarrow(converted_a, zero));
}
Operator::I8x16NarrowI16x8S => {
let (a, b) = pop2_with_bitcast(state, I16X8, builder);
state.push1(builder.ins().snarrow(a, b))
@@ -2156,27 +2166,175 @@ pub fn translate_operator<FE: FuncEnvironment + ?Sized>(
op
));
}
Operator::I8x16RelaxedSwizzle
| Operator::I32x4RelaxedTruncF32x4S
| Operator::I32x4RelaxedTruncF32x4U
| Operator::I32x4RelaxedTruncF64x2SZero
| Operator::I32x4RelaxedTruncF64x2UZero
| Operator::F32x4RelaxedMadd
| Operator::F32x4RelaxedNmadd
| Operator::F64x2RelaxedMadd
| Operator::F64x2RelaxedNmadd
| Operator::I8x16RelaxedLaneselect
Operator::F32x4RelaxedMax | Operator::F64x2RelaxedMax => {
let (a, b) = pop2_with_bitcast(state, type_of(op), builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics match the `fmax` instruction, or
// the `fAAxBB.max` wasm instruction.
builder.ins().fmax(a, b)
} else {
builder.ins().fmax_pseudo(a, b)
},
)
}
Operator::F32x4RelaxedMin | Operator::F64x2RelaxedMin => {
let (a, b) = pop2_with_bitcast(state, type_of(op), builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics match the `fmin` instruction, or
// the `fAAxBB.min` wasm instruction.
builder.ins().fmin(a, b)
} else {
builder.ins().fmin_pseudo(a, b)
},
);
}
Operator::I8x16RelaxedSwizzle => {
let (a, b) = pop2_with_bitcast(state, I8X16, builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics match the `i8x16.swizzle`
// instruction which is the CLIF `swizzle`.
builder.ins().swizzle(a, b)
} else {
builder.ins().x86_pshufb(a, b)
},
);
}
Operator::F32x4RelaxedMadd | Operator::F64x2RelaxedMadd => {
let (a, b, c) = pop3_with_bitcast(state, type_of(op), builder);
state.push1(
if environ.relaxed_simd_deterministic() || environ.has_native_fma() {
// Deterministic semantics are "fused multiply and add"
// which the CLIF `fma` guarantees.
builder.ins().fma(a, b, c)
} else {
let mul = builder.ins().fmul(a, b);
builder.ins().fadd(mul, c)
},
);
}
Operator::F32x4RelaxedNmadd | Operator::F64x2RelaxedNmadd => {
let (a, b, c) = pop3_with_bitcast(state, type_of(op), builder);
let a = builder.ins().fneg(a);
state.push1(
if environ.relaxed_simd_deterministic() || environ.has_native_fma() {
// Deterministic semantics are "fused multiply and add"
// which the CLIF `fma` guarantees.
builder.ins().fma(a, b, c)
} else {
let mul = builder.ins().fmul(a, b);
builder.ins().fadd(mul, c)
},
);
}
Operator::I8x16RelaxedLaneselect
| Operator::I16x8RelaxedLaneselect
| Operator::I32x4RelaxedLaneselect
| Operator::I64x2RelaxedLaneselect
| Operator::F32x4RelaxedMin
| Operator::F32x4RelaxedMax
| Operator::F64x2RelaxedMin
| Operator::F64x2RelaxedMax
| Operator::I16x8RelaxedQ15mulrS
| Operator::I16x8RelaxedDotI8x16I7x16S
| Operator::I32x4RelaxedDotI8x16I7x16AddS => {
return Err(wasm_unsupported!("proposed relaxed-simd operator {:?}", op));
| Operator::I64x2RelaxedLaneselect => {
let ty = type_of(op);
let (a, b, c) = pop3_with_bitcast(state, ty, builder);
// Note that the variable swaps here are intentional due to
// the difference of the order of the wasm op and the clif
// op.
//
// Additionally note that even on x86 the I16X8 type uses the
// `bitselect` instruction since x86 has no corresponding
// `blendv`-style instruction for 16-bit operands.
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() || ty == I16X8 {
// Deterministic semantics are a `bitselect` along the lines
// of the wasm `v128.bitselect` instruction.
builder.ins().bitselect(c, a, b)
} else {
builder.ins().x86_blendv(c, a, b)
},
);
}
Operator::I32x4RelaxedTruncF32x4S => {
let a = pop1_with_bitcast(state, F32X4, builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics are to match the
// `i32x4.trunc_sat_f32x4_s` instruction.
builder.ins().fcvt_to_sint_sat(I32X4, a)
} else {
builder.ins().x86_cvtt2dq(I32X4, a)
},
)
}
Operator::I32x4RelaxedTruncF64x2SZero => {
let a = pop1_with_bitcast(state, F64X2, builder);
let converted_a = if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics are to match the
// `i32x4.trunc_sat_f64x2_s_zero` instruction.
builder.ins().fcvt_to_sint_sat(I64X2, a)
} else {
builder.ins().x86_cvtt2dq(I64X2, a)
};
let handle = builder.func.dfg.constants.insert(vec![0u8; 16].into());
let zero = builder.ins().vconst(I64X2, handle);
state.push1(builder.ins().snarrow(converted_a, zero));
}
Operator::I16x8RelaxedQ15mulrS => {
let (a, b) = pop2_with_bitcast(state, I16X8, builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics are to match the
// `i16x8.q15mulr_sat_s` instruction.
builder.ins().sqmul_round_sat(a, b)
} else {
builder.ins().x86_pmulhrsw(a, b)
},
);
}
Operator::I16x8RelaxedDotI8x16I7x16S => {
let (a, b) = pop2_with_bitcast(state, I8X16, builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics are to treat both operands as
// signed integers and perform the dot product.
let alo = builder.ins().swiden_low(a);
let blo = builder.ins().swiden_low(b);
let lo = builder.ins().imul(alo, blo);
let ahi = builder.ins().swiden_high(a);
let bhi = builder.ins().swiden_high(b);
let hi = builder.ins().imul(ahi, bhi);
builder.ins().iadd_pairwise(lo, hi)
} else {
builder.ins().x86_pmaddubsw(a, b)
},
);
}
Operator::I32x4RelaxedDotI8x16I7x16AddS => {
let c = pop1_with_bitcast(state, I32X4, builder);
let (a, b) = pop2_with_bitcast(state, I8X16, builder);
let dot = if environ.relaxed_simd_deterministic() || !environ.is_x86() {
// Deterministic semantics are to treat both operands as
// signed integers and perform the dot product.
let alo = builder.ins().swiden_low(a);
let blo = builder.ins().swiden_low(b);
let lo = builder.ins().imul(alo, blo);
let ahi = builder.ins().swiden_high(a);
let bhi = builder.ins().swiden_high(b);
let hi = builder.ins().imul(ahi, bhi);
builder.ins().iadd_pairwise(lo, hi)
} else {
builder.ins().x86_pmaddubsw(a, b)
};
let dotlo = builder.ins().swiden_low(dot);
let dothi = builder.ins().swiden_high(dot);
let dot32 = builder.ins().iadd_pairwise(dotlo, dothi);
state.push1(builder.ins().iadd(dot32, c));
}
Operator::CallRef { .. }
@@ -2945,7 +3103,8 @@ fn type_of(operator: &Operator) -> Type {
| Operator::I8x16MaxU
| Operator::I8x16AvgrU
| Operator::I8x16Bitmask
| Operator::I8x16Popcnt => I8X16,
| Operator::I8x16Popcnt
| Operator::I8x16RelaxedLaneselect => I8X16,
Operator::I16x8Splat
| Operator::V128Load16Splat { .. }
@@ -2982,7 +3141,8 @@ fn type_of(operator: &Operator) -> Type {
| Operator::I16x8MaxU
| Operator::I16x8AvgrU
| Operator::I16x8Mul
| Operator::I16x8Bitmask => I16X8,
| Operator::I16x8Bitmask
| Operator::I16x8RelaxedLaneselect => I16X8,
Operator::I32x4Splat
| Operator::V128Load32Splat { .. }
@@ -3016,6 +3176,7 @@ fn type_of(operator: &Operator) -> Type {
| Operator::I32x4Bitmask
| Operator::I32x4TruncSatF32x4S
| Operator::I32x4TruncSatF32x4U
| Operator::I32x4RelaxedLaneselect
| Operator::V128Load32Zero { .. } => I32X4,
Operator::I64x2Splat
@@ -3040,6 +3201,7 @@ fn type_of(operator: &Operator) -> Type {
| Operator::I64x2Sub
| Operator::I64x2Mul
| Operator::I64x2Bitmask
| Operator::I64x2RelaxedLaneselect
| Operator::V128Load64Zero { .. } => I64X2,
Operator::F32x4Splat
@@ -3067,7 +3229,11 @@ fn type_of(operator: &Operator) -> Type {
| Operator::F32x4Ceil
| Operator::F32x4Floor
| Operator::F32x4Trunc
| Operator::F32x4Nearest => F32X4,
| Operator::F32x4Nearest
| Operator::F32x4RelaxedMax
| Operator::F32x4RelaxedMin
| Operator::F32x4RelaxedMadd
| Operator::F32x4RelaxedNmadd => F32X4,
Operator::F64x2Splat
| Operator::F64x2ExtractLane { .. }
@@ -3092,7 +3258,11 @@ fn type_of(operator: &Operator) -> Type {
| Operator::F64x2Ceil
| Operator::F64x2Floor
| Operator::F64x2Trunc
| Operator::F64x2Nearest => F64X2,
| Operator::F64x2Nearest
| Operator::F64x2RelaxedMax
| Operator::F64x2RelaxedMin
| Operator::F64x2RelaxedMadd
| Operator::F64x2RelaxedNmadd => F64X2,
_ => unimplemented!(
"Currently only SIMD instructions are mapped to their return type; the \
@@ -3219,6 +3389,18 @@ fn pop2_with_bitcast(
(bitcast_a, bitcast_b)
}
fn pop3_with_bitcast(
state: &mut FuncTranslationState,
needed_type: Type,
builder: &mut FunctionBuilder,
) -> (Value, Value, Value) {
let (a, b, c) = state.pop3();
let bitcast_a = optionally_bitcast_vector(a, needed_type, builder);
let bitcast_b = optionally_bitcast_vector(b, needed_type, builder);
let bitcast_c = optionally_bitcast_vector(c, needed_type, builder);
(bitcast_a, bitcast_b, bitcast_c)
}
fn bitcast_arguments<'a>(
builder: &FunctionBuilder,
arguments: &'a mut [Value],

View File

@@ -525,6 +525,27 @@ pub trait FuncEnvironment: TargetEnvironment {
/// Returns the target ISA's condition to check for unsigned addition
/// overflowing.
fn unsigned_add_overflow_condition(&self) -> ir::condcodes::IntCC;
/// Whether or not to force relaxed simd instructions to have deterministic
/// lowerings meaning they will produce the same results across all hosts,
/// regardless of the cost to performance.
fn relaxed_simd_deterministic(&self) -> bool {
true
}
/// Whether or not the target being translated for has a native fma
/// instruction. If it does not then when relaxed simd isn't deterministic
/// the translation of the `f32x4.relaxed_madd` instruction, for example,
/// will do a multiplication and then an add instead of the fused version.
fn has_native_fma(&self) -> bool {
false
}
/// Returns whether this is an x86 target, which may alter lowerings of
/// relaxed simd instructions.
fn is_x86(&self) -> bool {
false
}
}
/// An object satisfying the `ModuleEnvironment` trait can be passed as argument to the