Commit Graph

96 Commits

Author SHA1 Message Date
Johnnie Birch
6fbe0b72bd Add simd_extmul_* support for x64 2021-07-15 01:07:52 -07:00
Johnnie Birch
2d676d838f Implements f64x2.convert_low_i32x4_u for x64 2021-07-09 10:39:05 -07:00
Johnnie Birch
1770880e19 x64: add support for packed promote and demote (#2783)
* Add support for x64 packed promote low

* Add support for x64 packed floating point demote

* Update vector promote low and demote by adding constraints

Also does some renaming and minor refactoring
2021-06-04 15:59:20 -07:00
Andrew Brown
2a9f458ea3 x64: lower i8x16.shuffle to VPERMI2B when possible
When shuffling values from two different registers, the x64 lowering for
`i8x16.shuffle` must first shuffle each register separately and then OR
the results with SSE instructions. With `VPERMI2B`, available in
AVX512VL + AVX512VBMI, this can be done in a single instruction after
the shuffle mask has been moved into the destination register. This
change uses `VPERMI2B` for that case when the CPU supports it.
2021-06-01 11:40:53 -07:00
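A minimal Rust sketch of the lowering choice this commit describes; the flag and variant names here are illustrative stand-ins, not Cranelift's actual API:

```rust
// Hypothetical ISA-flag and lowering types; not Cranelift's real API.
struct IsaFlags {
    avx512vl: bool,
    avx512vbmi: bool,
}

enum ShuffleLowering {
    /// Single VPERMI2B after moving the shuffle mask into the destination.
    Vpermi2b,
    /// SSE fallback: PSHUFB each input separately, then OR the partial results.
    PshufbPor,
}

fn pick_shuffle_lowering(isa: &IsaFlags) -> ShuffleLowering {
    if isa.avx512vl && isa.avx512vbmi {
        ShuffleLowering::Vpermi2b
    } else {
        ShuffleLowering::PshufbPor
    }
}
```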
Andrew Brown
459fce3467 x64: lower i8x16.popcnt to VPOPCNTB when possible
When AVX512VL and AVX512BITALG are available, Wasm SIMD's `popcnt`
instruction can be lowered to a single x64 instruction, `VPOPCNTB`,
instead of 8+ instructions.
2021-05-25 12:16:25 -07:00
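For contrast with the single-instruction VPOPCNTB path, here is a scalar Rust model of the nibble-LUT technique that multi-instruction SSE fallbacks commonly use (the SIMD form does 16 such lookups at once with PSHUFB); this illustrates the general technique, not Cranelift's exact sequence:

```rust
// Per byte: look up the popcount of each 4-bit nibble in a 16-entry table.
fn popcnt_byte_via_lut(b: u8) -> u8 {
    const LUT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];
    LUT[(b & 0x0f) as usize] + LUT[(b >> 4) as usize]
}

fn main() {
    assert_eq!(popcnt_byte_via_lut(0b1011_0010), 4);
    assert_eq!(popcnt_byte_via_lut(0xff), 8);
}
```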
Chris Fallin
95559c01aa Merge pull request from GHSA-hpqh-2wqx-7qp5
Fix spillslot reload of narrow values: zero-extend, don't sign-extend. Release v0.74.0 as security-patch release.
2021-05-21 12:01:55 -07:00
Andrew Brown
54b45d28a3 x64: lower fcvt_from_uint to VCVTUDQ2PS when possible
When AVX512VL and AVX512F are available, use a single instruction
(`VCVTUDQ2PS`) instead of a lengthy 9-instruction sequence. This
optimization is a port from the legacy x86 backend.
2021-05-19 12:20:11 -07:00
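The splitting idea such multi-instruction fallbacks commonly rely on can be shown in scalar Rust (an illustration of the technique, not the backend's exact sequence): a u32 is split into two halves that are each exactly representable in f32, so a single final add yields the correctly rounded result.

```rust
// u32 -> f32 built from two exact 16-bit halves and one rounding add.
fn u32_to_f32_split(x: u32) -> f32 {
    let hi = (x >> 16) as f32 * 65536.0; // exact: value fits in f32's 24-bit mantissa
    let lo = (x & 0xFFFF) as f32;        // exact for the same reason
    hi + lo                              // one correctly-rounded addition
}

fn main() {
    for &x in &[0u32, 1, 0xFFFF_FFFF, 0x8000_0001, 12_345_678] {
        assert_eq!(u32_to_f32_split(x), x as f32);
    }
}
```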
Chris Fallin
a1c9b06cea Fix spillslot reload of narrow values: zero-extend, don't sign-extend.
Previously, the x64 backend's ABI code would generate a sign-extending
load when loading a less-than-64-bit integer from a spillslot. This is
incorrect: e.g., for i32s > 0x80000000, this would result in all high
bits set.

This interacts poorly with another optimization. Normally, the invariant
is that the high bits of a register holding a value of a certain type,
beyond that type's bits, are undefined. However, as an optimization, we
recognize and use the fact that on x86-64, 32-bit instructions zero the
upper 32 bits. This allows us to elide a 32-to-64-bit zero-extend op
(turning it into just a move, which can then sometimes disappear
entirely due to register coalescing).

If a spill and reload happen between the production of a 32-bit value
from an instruction known to zero the upper bits and its use, then we
will rely on zero upper bits that might actually be set by a
sign-extend. This will result in incorrect execution.

As a fix, we stick to a simple invariant: we always spill and reload a
full 64 bits when handling integer registers on x64. This ensures that
no bits are mangled.
2021-05-19 12:19:19 -07:00
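A worked example of the failure mode described above, as plain Rust arithmetic:

```rust
// For an i32 with the sign bit set, a sign-extending reload sets the
// upper 32 bits, breaking the elided-zero-extend assumption that
// 32-bit ops left them zero.
fn main() {
    let v: u32 = 0x8000_0000;
    let sign_extended = v as i32 as i64 as u64; // what the buggy reload produced
    let zero_extended = v as u64;               // what the elided zero-extend assumed
    assert_eq!(sign_extended, 0xFFFF_FFFF_8000_0000);
    assert_eq!(zero_extended, 0x0000_0000_8000_0000);
}
```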
Andrew Brown
7ef3ae2903 x64: implement vselect with variable blend instructions
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
2021-05-17 11:23:33 -07:00
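A scalar Rust model of why a known-boolean mask permits the single-instruction blend: when every lane is all ones or all zeros, per-bit `bitselect` and a BLENDV-style per-lane pick (which inspects only the lane's top bit) agree.

```rust
fn bitselect(mask: u32, a: u32, b: u32) -> u32 {
    (a & mask) | (b & !mask) // per-bit select, the general lowering
}

fn blend_lane(mask: u32, a: u32, b: u32) -> u32 {
    // BLENDVPS-style: only the top bit of the lane decides.
    if mask & 0x8000_0000 != 0 { a } else { b }
}

fn main() {
    for &mask in &[0xFFFF_FFFFu32, 0x0000_0000] {
        assert_eq!(bitselect(mask, 7, 9), blend_lane(mask, 7, 9));
    }
}
```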
Andrew Brown
e676589b0c x64: lower i64x2.imul to VPMULLQ when possible
This adds the machinery to encode the VPMULLQ instruction, which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
2021-05-13 20:14:05 -07:00
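A scalar Rust sketch of what a long fallback computes per 64-bit lane when VPMULLQ is unavailable: a 64x64->64 multiply assembled from 32-bit pieces (SIMD sequences do roughly the same with 32-bit multiplies such as PMULUDQ, shifts, and adds).

```rust
fn mul64_from_32bit_pieces(a: u64, b: u64) -> u64 {
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);
    // Cross terms land in the upper half; the hi*hi term overflows out entirely.
    let cross = (a_lo * b_hi).wrapping_add(a_hi * b_lo);
    (a_lo * b_lo).wrapping_add(cross << 32)
}

fn main() {
    let (a, b) = (0xDEAD_BEEF_CAFE_F00Du64, 0x0123_4567_89AB_CDEFu64);
    assert_eq!(mul64_from_32bit_pieces(a, b), a.wrapping_mul(b));
}
```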
Andrew Brown
0acc1451ea x64: lower iabs.i64x2 using a single AVX512 instruction when possible (#2819)
* x64: add EVEX encoding mechanism

Also includes an empty stub module for the VEX encoding.

* x64: lower abs.i64x2 to VPABSQ when available

* x64: refactor EVEX encodings to use `EvexInstruction`

This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.

* x64: rename encoding CodeSink to ByteSink
2021-04-15 11:53:58 -07:00
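A rough Rust sketch of the builder style this commit describes; the field names, methods, and byte layout here are simplified illustrations, not the real `EvexInstruction` API:

```rust
#[derive(Default)]
struct EvexSketch {
    opcode: u8,
    reg: u8, // destination register number
    rm: u8,  // source register number
}

impl EvexSketch {
    fn opcode(mut self, op: u8) -> Self { self.opcode = op; self }
    fn reg(mut self, r: u8) -> Self { self.reg = r; self }
    fn rm(mut self, r: u8) -> Self { self.rm = r; self }

    fn encode(&self, sink: &mut Vec<u8>) {
        // A real EVEX encoder first emits a 4-byte 0x62-prefixed group
        // derived from the register numbers and opcode map; elided here.
        sink.push(self.opcode);
        // ModRM byte, register-to-register form (mod = 0b11).
        sink.push(0xC0 | ((self.reg & 7) << 3) | (self.rm & 7));
    }
}

fn main() {
    let mut sink = Vec::new();
    EvexSketch::default().opcode(0x1F).reg(0).rm(1).encode(&mut sink);
    assert_eq!(sink, vec![0x1F, 0xC1]);
}
```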
Andrew Brown
8e495ac79d x64: match multiple ISA requirements before emitting
Because some instructions are present in more than one ISA feature set, we need to check whether any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
2021-04-08 10:30:39 -07:00
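A small Rust sketch of the "match any of several feature-set requirements" rule this commit describes, with illustrative names:

```rust
// An instruction may list several alternative feature-set requirements;
// emission is allowed if any one of them is fully satisfied.
fn can_emit(available: &[&str], alternatives: &[&[&str]]) -> bool {
    alternatives
        .iter()
        .any(|req| req.iter().all(|feature| available.contains(feature)))
}

fn main() {
    // VPABSQ-style example from the commit: present in both AVX512F
    // and AVX512VL, so either one suffices.
    let alternatives: &[&[&str]] = &[&["AVX512F"], &["AVX512VL"]];
    assert!(can_emit(&["AVX512VL"], alternatives));
    assert!(!can_emit(&["AVX2"], alternatives));
}
```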
Johnnie Birch
31d3db1ec2 Implements convert low signed integer to float for x64 simd 2021-03-26 12:13:29 -07:00
Alex Crichton
3f694ae319 Use stable Rust on CI to test the x64 backend (#2766)
* Use stable Rust on CI to test the x64 backend

This commit leverages the newly-released 1.51.0 compiler to test the
new backend on Windows and Linux with a stable compiler instead of a
nightly compiler. This isolates the nightly build to just the nightly
documentation generation and fuzzing, both of which rely on nightly for
the best results right now.

* Use updated stable in book build job

* Run rustfmt for new stable

* Silence new warnings for wasi-nn

* Allow some dead code in the x64 backend

Looks like new rustc is better about emitting some dead-code warnings

* Update rust in peepmatic job

* Fix a test in the pooling allocator

* Remove `package.metadata.docs.rs` temporarily

Needs resolution of https://github.com/rust-lang/cargo/pull/9300 first

* Fix a warning in a wasi-nn example
2021-03-25 13:18:59 -05:00
Chris Fallin
e41d882144 Merge pull request #2678 from cfallin/x64-fastcall
x86-64 Windows fastcall ABI support.
2021-03-05 10:46:47 -08:00
Chris Fallin
6c94eb82aa x86-64 Windows fastcall ABI support.
This adds support for the "fastcall" ABI, which is the native C/C++ ABI
on Windows platforms on x86-64. It is similar to but not exactly like
System V; primarily, its argument register assignments are different,
and it requires stack shadow space.

Note that this also adjusts the handling of multi-register values in the
shared ABI implementation, and with this change, adjusts handling of
`i128`s on *both* Fastcall/x64 *and* SysV/x64 platforms. This was done
to align with actual behavior by the "rustc ABI" on both platforms, as
mapped out experimentally (Compiler Explorer link in comments). This
behavior is gated under the `enable_llvm_abi_extensions` flag.

Note also that this does *not* add x64 unwind info on Windows. That will
come in a future PR (but is planned!).
2021-03-03 19:53:18 -08:00
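The core register-assignment difference described above, as a small Rust table; the register lists are facts about the ABIs themselves, while the helper function is illustrative, not Cranelift's implementation:

```rust
// Integer argument registers per ABI. fastcall additionally requires
// the caller to reserve a 32-byte shadow space below the arguments.
fn int_arg_reg(abi: &str, index: usize) -> Option<&'static str> {
    match abi {
        "fastcall" => ["rcx", "rdx", "r8", "r9"].get(index).copied(),
        "sysv" => ["rdi", "rsi", "rdx", "rcx", "r8", "r9"].get(index).copied(),
        _ => None,
    }
}

fn main() {
    assert_eq!(int_arg_reg("fastcall", 0), Some("rcx"));
    assert_eq!(int_arg_reg("sysv", 0), Some("rdi"));
    assert_eq!(int_arg_reg("fastcall", 4), None); // fifth arg goes on the stack
}
```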
Andrew Brown
508f8fa5a9 [x64] Add i64x2.abs
This instruction has a single-instruction lowering in AVX512F/VL and a three-instruction lowering in AVX, but neither is currently supported in the x64 backend. To implement this, we instead subtract the vector from zero and use a blending instruction to pick the lanes containing the absolute value.
2021-03-02 12:30:02 -08:00
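A scalar Rust model of the subtract-and-blend lowering described above:

```rust
// Per lane: negate via 0 - x (a PSUBQ-style subtract), then pick the
// non-negative of {x, -x} based on the sign, as a blend would.
fn iabs_lane(x: i64) -> i64 {
    let neg = 0i64.wrapping_sub(x);
    if x < 0 { neg } else { x }
}

fn main() {
    assert_eq!(iabs_lane(-5), 5);
    assert_eq!(iabs_lane(7), 7);
    assert_eq!(iabs_lane(i64::MIN), i64::MIN); // wraps, matching the SIMD op
}
```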
Kasey Carrothers
7bd96c8e2f Refactor x64::Insts that use an is_64 bool to use OperandSize. 2021-02-03 10:40:11 -08:00
Kasey Carrothers
3306408100 Refactor x64::Inst to use OperandSize instead of u8s.
TODO: some types take an `is_64_bit` bool. Those are left unchanged for now.
2021-02-03 10:40:11 -08:00
Kasey Carrothers
b12d41bfe9 Expand x64 OperandSize to support 8- and 16-bit operands.
This is in preparation for refactoring all x64::Inst arms to use OperandSize.

Current uses of OperandSize fall into two categories:
  1. XMM operations, which require 32/64-bit operands
  2. Immediates, which only care whether the operand is 64-bit or not

Adds assertions to existing Inst constructors to check that they are passed valid sizes.
This change also removes the implicit widening of 1- and 2-byte values to 4 bytes. from_bytes() is only used by category 2, so removing this behavior will not change any visible behavior.

Overall this change should be a no-op.
2021-02-03 10:40:11 -08:00
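A sketch of the expanded enum's shape after this change (illustrative, not Cranelift's exact definition):

```rust
#[derive(Copy, Clone, Debug, PartialEq)]
enum OperandSize {
    Size8,
    Size16,
    Size32,
    Size64,
}

impl OperandSize {
    // Per the commit, 1- and 2-byte values are no longer implicitly
    // widened to 4 bytes here.
    fn from_bytes(bytes: u32) -> Self {
        match bytes {
            1 => Self::Size8,
            2 => Self::Size16,
            4 => Self::Size32,
            8 => Self::Size64,
            _ => panic!("invalid operand size: {} bytes", bytes),
        }
    }
}

fn main() {
    assert_eq!(OperandSize::from_bytes(2), OperandSize::Size16);
}
```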
Benjamin Bouvier
13027ad670 cranelift x64: add instruction set checks for popcnt/tzcnt/lzcnt; 2021-01-30 13:38:55 +01:00
Benjamin Bouvier
2275519cb1 cranelift x64: use the POPCNT instruction for Popcount when it's available; 2021-01-29 19:41:01 +01:00
Benjamin Bouvier
6bf6612d96 cranelift x64: use the TZCNT instruction for Ctz when it's available; 2021-01-29 19:41:01 +01:00
Benjamin Bouvier
d3acd9a283 cranelift x64: use the LZCNT instruction for Clz when it's available; 2021-01-29 19:41:01 +01:00
Chris Fallin
71ead6e31d x64 backend: implement 128-bit ops and misc fixes.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).

The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.

Also includes ride-along fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:

- fix REX prefix logic for some 8-bit instructions.

  When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
  semantics are somewhat subtle: without the REX prefix, register numbers
  4--7 correspond to the second-to-lowest byte of the first four registers
  (AH, CH, BH, DH), whereas with the REX prefix, these register numbers
  correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
  emit a REX byte for instructions with 8-bit cases (this is harmless even
  if unneeded), but this would unnecessarily inflate code size; instead,
  the usual approach is to emit it only for these registers.

  This logic was present for some instructions but missing for several
  others: divide, not, negate, and shifts (the rule is sketched in the
  code example after this entry).

  Fixes #2508.

- avoid unaligned SSE loads on some f64 ops.

  The implementations of several FP ops, such as fabs/fneg, used SSE
  instructions. This is not a problem per se, except that load-op merging
  did not take *alignment* into account. Specifically, if an op on an f64
  loaded from memory happened to merge that load, and the instruction into
  which it was merged was an SSE instruction, then the SSE instruction
  imposes stricter (128-bit) alignment requirements than the load.f64 did.

  This PR simply forces any instruction lowerings that could use SSE
  instructions to implement non-SIMD operations to take inputs in
  registers only, and avoid load-op merging.

  Fixes #2507.

- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.

  - urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
    the remainder in AH, not RDX, different from all the other width-forms
    of this instruction.

  - select.b1: we were not recognizing selects of boolean values as
    integer-typed operations, so we were generating XMM moves instead (!).
2021-01-14 13:45:50 -08:00
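A sketch of the 8-bit REX rule from the first ride-along fix above; the rule itself is an x86-64 encoding fact, while the helper name is illustrative:

```rust
// Without a REX prefix, 8-bit register numbers 4-7 select the legacy
// high-byte registers AH/CH/DH/BH; with any REX prefix they select
// SPL/BPL/SIL/DIL instead. Registers 8-15 always require REX.
fn needs_rex_for_8bit_reg(hw_enc: u8) -> bool {
    hw_enc >= 4
}

fn main() {
    assert!(!needs_rex_for_8bit_reg(0)); // AL: unambiguous either way
    assert!(needs_rex_for_8bit_reg(6));  // SIL, not DH
    assert!(needs_rex_for_8bit_reg(9));  // r9b
}
```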
Andrew Brown
09a5b91b9d x64: make several structures debuggable 2021-01-08 16:21:57 -08:00
Chris Fallin
dbd2241b60 x64: handle tests of b1 values correctly (only LSB is defined).
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.

(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)

Found by Nathan Ringo on Zulip [1] (thanks!).

[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s
2021-01-05 14:45:46 -08:00
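A one-line Rust model of the fix:

```rust
// Only the LSB of a b1's register representation is defined; testing
// the whole register trusts undefined upper bits.
fn b1_is_true(raw: u64) -> bool {
    raw & 1 != 0
}

fn main() {
    // A b1 `false` whose undefined upper bits happen to be set must
    // still test false.
    let raw_false_with_garbage = 0xFFFF_FFFF_FFFF_FFF0u64;
    assert!(!b1_is_true(raw_false_with_garbage));
}
```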
Yury Delendik
2964023a77 [SIMD][x86_64] Add encoding for PMADDWD (#2530)
* [SIMD][x86_64] Add encoding for PMADDWD

* also for "experimental_x64"
2020-12-24 07:52:50 -06:00
Johnnie Birch
a548516f97 Enable SIMD spec tests for f32x4_rounding and f64x2_rounding.
Also address some review comments pointing out minor issues.
2020-12-02 13:44:51 -08:00
Johnnie Birch
a33e755cb2 Adds x86 SIMD support for Ceil, Floor, Trunc, and Nearest 2020-12-02 13:44:51 -08:00
Johnnie Birch
2cc501427e Add remaining x86_64 support for pack with signed/unsigned saturation
Adds lowering for packssdw, packusdw, packuswb
2020-11-22 23:14:29 -08:00
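A scalar Rust model of one lane of the signed-saturating pack (PACKSSDW-style, i32 to i16 with clamping):

```rust
fn pack_ssdw_lane(x: i32) -> i16 {
    // Saturate to the i16 range, as PACKSSDW does per lane.
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

fn main() {
    assert_eq!(pack_ssdw_lane(70_000), i16::MAX);
    assert_eq!(pack_ssdw_lane(-70_000), i16::MIN);
    assert_eq!(pack_ssdw_lane(1234), 1234);
}
```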
Johnnie Birch
124096735b Add support for palignr for X86_64 vcode backend 2020-11-22 22:14:02 -08:00
Johnnie Birch
615a575da1 Add support for x86_64 packed move lowering for the vcode backend 2020-11-22 20:23:00 -08:00
Chris Fallin
073c727a74 x64 and aarch64: carry MemFlags on loads/stores; don't emit trap info unless an op can trap.
This end result was previously achieved by carrying a `SourceLoc` on
every load/store, which was somewhat cumbersome, and only indirectly
encoded metadata about a memory reference (whether it can trap) by its
presence or absence. We have a type for this -- `MemFlags` -- that tells us
everything we might want to know about a load or store, and we should
plumb it through to code emission instead.

This PR attaches a `MemFlags` to an `Amode` on x64, and puts it on load
and store `Inst` variants on aarch64. These two choices seem to factor
things out in the nicest way: there are relatively few load/store insts
on aarch64 but many addressing modes, while the opposite is true on x64.
2020-11-17 11:43:06 -08:00
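A simplified Rust sketch of the shape this gives the x64 side, where the flags ride on the addressing mode; the types and fields here are illustrative stand-ins, not Cranelift's definitions:

```rust
#[derive(Copy, Clone, Default)]
struct MemFlags {
    notrap: bool,  // access is known not to trap
    aligned: bool, // access is known to be aligned
}

struct Amode {
    base: u8, // base register number, simplified
    offset: i32,
    flags: MemFlags, // on x64, the flags live on the addressing mode
}

fn emit_load(amode: &Amode, trap_records: &mut Vec<i32>) {
    // Record trap metadata only when the access can actually trap.
    if !amode.flags.notrap {
        trap_records.push(amode.offset); // stand-in for real trap records
    }
    // ... actual instruction bytes would be emitted here ...
}
```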
Andrew Brown
8ba92853be [machinst x64]: add punpack[hl]bw instructions 2020-11-12 14:21:45 -08:00
Andrew Brown
8131b15921 [machinst x64]: allow addressing of constants 2020-11-12 14:21:45 -08:00
Andrew Brown
6725b6b129 [machinst x64]: implement bitmask 2020-10-28 15:16:36 -07:00
Johnnie Birch
8bbe6a25a9 Add support for packed float to signed int conversion
Implements i32x4.trunc_sat_f32x4_s
2020-10-28 13:02:50 -07:00
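A scalar model of the Wasm semantics this lowers (saturate out-of-range values, map NaN to zero); conveniently, Rust's own `as` cast has had exactly these semantics since 1.45, which keeps the model short:

```rust
fn trunc_sat_f32_s_lane(x: f32) -> i32 {
    // Saturating truncation: NaN -> 0, out-of-range -> i32::MIN/MAX.
    x as i32
}

fn main() {
    assert_eq!(trunc_sat_f32_s_lane(3.9), 3);
    assert_eq!(trunc_sat_f32_s_lane(f32::NAN), 0);
    assert_eq!(trunc_sat_f32_s_lane(3.0e9), i32::MAX);
    assert_eq!(trunc_sat_f32_s_lane(-3.0e9), i32::MIN);
}
```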
Johnnie Birch
f27c0f3434 Adds support for packed signed integer conversion to float
Implements f32x4.convert_i32x4_s
2020-10-16 14:16:53 -07:00
Andrew Brown
3c55523d40 [machinst x64]: implement packed and, and_not, xor, or 2020-10-09 10:04:50 -07:00
Andrew Brown
c8cce5d2d7 [machinst x64]: enable packed saturated arithmetic 2020-10-08 08:46:20 -07:00
Benjamin Bouvier
a470f1e0cd machinst x64: remove dead code and allow(dead_code) annotation;
The BranchTarget is always used as a label, so just use a plain
MachLabel in this case.
2020-10-08 10:05:57 +02:00
Chris Fallin
71768bb6cf Fix AArch64 ABI to respect half-caller-save, half-callee-save vec regs.
This PR updates the AArch64 ABI implementation so that it (i) properly
respects that v8-v15 inclusive have callee-save lower halves, and
caller-save upper halves, by conservatively approximating (to full
registers) in the appropriate directions when generating prologue
caller-saves and when informing the regalloc of clobbered regs across
callsites.

In order to prevent saving all of these vector registers in the prologue
of every non-leaf function due to the above approximation, this also
makes use of a new regalloc.rs feature to exclude call instructions'
writes from the clobber set returned by register allocation. This is
safe whenever the caller and callee have the same ABI (because anything
the callee could clobber, the caller is allowed to clobber as well
without saving it in the prologue).

Fixes #2254.
2020-10-06 14:44:02 -07:00
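A Rust sketch of the two-sided conservative approximation described above; the helper names are illustrative, while the v8-v15 half-register rule is the AArch64 ABI fact the commit relies on:

```rust
#[derive(PartialEq)]
enum Half { Lower, Upper }

// Ground truth: for v8-v15 only the lower 64 bits are callee-saved;
// the upper 64 bits are caller-saved.
fn callee_saved(vreg: u8, half: Half) -> bool {
    (8..=15).contains(&vreg) && half == Half::Lower
}

// Prologue (callee side): round *up* to callee-saved for the whole
// register, so the caller's lower half is never lost.
fn prologue_treats_as_callee_saved(vreg: u8) -> bool {
    callee_saved(vreg, Half::Lower) || callee_saved(vreg, Half::Upper)
}

// Callsite (caller side): round *down* to not-preserved for the whole
// register, so a clobbered upper half is never trusted.
fn call_treats_as_preserved(vreg: u8) -> bool {
    callee_saved(vreg, Half::Lower) && callee_saved(vreg, Half::Upper)
}

fn main() {
    assert!(prologue_treats_as_callee_saved(8)); // save all of v8
    assert!(!call_treats_as_preserved(8));       // and assume calls clobber it
}
```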
Andrew Brown
50b9399006 [machinst x64]: lower remaining lane operations: any_true, all_true, splat 2020-10-02 08:29:31 -07:00
Andrew Brown
0579e9f9de [machinst x64]: add packed OR 2020-10-02 08:29:31 -07:00
Andrew Brown
74226d6781 [machinst x64]: add integer comparisons 2020-10-02 08:29:31 -07:00
Andrew Brown
4484a00ea5 [machinst x64]: calculate extension modes in one place 2020-09-29 14:48:59 -07:00
Andrew Brown
f50d905152 [machinst x64]: refactor using added RegMem::from(Writable<Reg>) 2020-09-29 08:45:12 -07:00
Andrew Brown
050f078f86 [machinst x64]: add saturating addition implementation 2020-09-29 08:45:12 -07:00
Andrew Brown
a64abf9b76 [machinst x64]: add shuffle implementation 2020-09-29 08:45:12 -07:00