Commit Graph

25 Commits

Author SHA1 Message Date
Alex Crichton
83f21e784a x64: Add more support for more AVX instructions (#5931)
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction for cases where the `pshufd` instruction alone suffices.
This is possible when the shuffle immediate permutes 32-bit values
within one of the vector inputs of the `shuffle` instruction, but not
both (see the sketch after this change list).

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specialized lowerings to the x86 instructions which
interleave elements from the high or low halves of vectors of 32-bit
and 64-bit values. This corresponds to the preexisting specialized
lowerings for interleaving 8-bit and 16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector; while it's not generally applicable, it should still be
more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor` instruction
and then use `pshufb` to do the broadcast.

* Review comments

* x64: Add an AVX encoding for the `pshufd` instruction

Unlike the SSE encoding of `pshufd`, the VEX encoding does not require
alignment when working with a memory operand. Additionally, as I've
just learned, it reduces dependencies between instructions because the
`v*` instructions zero the upper bits as opposed to preserving them,
which could accidentally create false dependencies in the CPU between
instructions.

* x64: Add more support for AVX loads/stores

This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument, which led to a small refactoring of the
`*_regmove` variant used for `insertlane 0` into f64x2 vectors.

* x64: Enable using AVX instructions for zero regs

This commit refactors the internal ISLE helpers for creating zeroed
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions
with a bit of reorganization.

* x64: Remove `XmmConstOp` as an instruction

All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction, so there's no longer any need for these
otherwise-constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.

* Review comments

* Update test expectations
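
As an illustration of the specializations above, here is a minimal sketch in Rust (a hypothetical helper, not Cranelift's actual ISLE lowering code) of how a 16-byte `shuffle` immediate can be classified to select a specialized instruction; the `shufps`, `punpck*`, and all-zeros cases follow the same classify-the-immediate pattern:

```
// Returns Some(imm8) when the byte-level shuffle mask moves whole 32-bit
// lanes and sources only the first input (byte indices 0..16), i.e. when
// a single `pshufd` suffices. Hypothetical helper for illustration.
fn as_pshufd_imm(imm: &[u8; 16]) -> Option<u8> {
    let mut sel = [0u8; 4];
    for lane in 0..4 {
        let base = imm[lane * 4];
        // Each 32-bit lane must be whole (4-byte aligned) and from input 0.
        if base % 4 != 0 || base >= 16 {
            return None;
        }
        for byte in 0..4 {
            if imm[lane * 4 + byte] != base + byte as u8 {
                return None;
            }
        }
        sel[lane] = base / 4;
    }
    // Pack four 2-bit lane selectors into the `pshufd` immediate byte.
    Some(sel[0] | sel[1] << 2 | sel[2] << 4 | sel[3] << 6)
}
```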
2023-03-09 23:57:42 +00:00
Alex Crichton
5dc2bbccbb Merge pull request from GHSA-xm67-587q-r2vw
This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16-and-above are mapped to select
from the first vector since the first and second element are the same,
but the subtraction was with 15 rather than 16 by accident.
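
A minimal sketch (hypothetical helper name) of the corrected remapping:

```
// When a vector is shuffled with itself, indices 16..32 select from the
// second (identical) operand, so they are rebased onto the first operand
// by subtracting 16 -- the buggy version subtracted 15.
fn remap_self_shuffle_index(idx: u8) -> u8 {
    if idx >= 16 { idx - 16 } else { idx }
}
```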
2023-03-08 13:00:00 -06:00
Alex Crichton
c4a2c1e818 clif: Remove the type variable from swizzle (#5897)
This instruction is only defined with i8x16 inputs and outputs, so
there's no need for a type variable; instead, shadow the
otherwise-generic `a` result with a concrete i8x16 type.
2023-03-01 00:38:53 +00:00
Alex Crichton
f2dce812c3 x64: Sink constant loads into xmm instructions (#5880)
A number of places in the x64 backend make use of 128-bit constants for
various wasm SIMD-related instructions although most of them currently
use the `x64_xmm_load_const` helper to load the constant into a
register. Almost all xmm instructions, however, accept a memory
operand, which means that these loads can be folded into the
instructions themselves to help reduce register pressure. Automatic
conversions were added from a `VCodeConstant` to an `XmmMem` value, and
the explicit loads were then all removed in favor of forwarding the
`XmmMem` value directly to the underlying instruction. Note that some
instances of `x64_xmm_load_const` remain since they're used in contexts
where load sinking won't work (e.g. where the constant is the first
operand rather than the second of a non-commutative instruction).
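
A minimal sketch (hypothetical types, not Cranelift's actual definitions) of the operand shape that makes this folding possible:

```
// An xmm operand that is either a register or a memory address. Giving
// instruction constructors this type lets a constant be referenced in
// place (rip-relative) instead of being loaded into a temporary first.
enum XmmMem {
    Reg(u8),       // an xmm register number
    ConstMem(u32), // a constant-pool entry, addressed rip-relative
}

// With load sinking, a lowering hands the constant's address straight to
// the consuming instruction, e.g. `paddd xmm0, [rip + const]`, leaving
// one fewer register live.
fn sink_constant(pool_entry: u32) -> XmmMem {
    XmmMem::ConstMem(pool_entry)
}
```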
2023-02-27 22:02:42 +00:00
Trevor Elliott
cc073593a4 Fix block label printing in precise-output tests (#5798)
As a follow-up to #5780, disassemble the regions identified by `bb_starts`, falling back to disassembling the whole buffer. This ensures that instructions like `br_table` that introduce a lot of constants don't throw off capstone for the remainder of the function.
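
A minimal sketch using the `capstone` crate (simplified, not the test harness's actual code) of per-region disassembly:

```
use capstone::prelude::*;

// Disassemble each basic-block region separately. Restarting at every
// block boundary keeps inline data (e.g. br_table jump tables) from
// desynchronizing the disassembler for the rest of the function.
fn disasm_regions(code: &[u8], bb_starts: &[usize]) -> Result<(), capstone::Error> {
    let cs = Capstone::new()
        .x86()
        .mode(arch::x86::ArchMode::Mode64)
        .build()?;
    let mut bounds = bb_starts.to_vec();
    bounds.push(code.len());
    for w in bounds.windows(2) {
        let (start, end) = (w[0], w[1]);
        for insn in cs.disasm_all(&code[start..end], start as u64)?.iter() {
            println!("{}", insn);
        }
    }
    Ok(())
}
```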

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-02-16 02:35:26 +00:00
Trevor Elliott
f04decc4a1 Use capstone to validate precise-output tests (#5780)
Use the capstone library to disassemble precise-output tests, in addition to pretty-printing their vcode.
2023-02-15 16:35:10 -08:00
Trevor Elliott
b077854b57 Generate SSA code from returns (#5172)
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the ABI, instead of directly emitting moves to those real registers.
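
A minimal sketch (hypothetical types) of the register pairs described above:

```
// Each return value carries a (virtual, physical) register pair; the
// register allocator satisfies the constraint, so the lowering no longer
// emits explicit moves into the ABI's return registers.
struct RetPair {
    vreg: u32, // virtual register holding the return value
    preg: u8,  // physical register required by the ABI, e.g. rax
}
```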
2022-11-08 16:00:49 -08:00
Trevor Elliott
32a7593c94 cranelift: Remove booleans (#5031)
Remove the boolean types from Cranelift, and the associated instructions `breduce`, `bextend`, `bconst`, and `bint`. Standardize on 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
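
A minimal sketch of the two conventions:

```
// Scalar boolean results are 1/0 in an integer type; vector boolean
// elements are all-ones/0, so a true lane doubles as a full select mask.
fn scalar_bool(b: bool) -> i64 {
    if b { 1 } else { 0 }
}

fn vector_bool_lane(b: bool) -> i64 {
    if b { -1 } else { 0 } // -1 is all bits set in two's complement
}
```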

Fixes #3205

Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-17 16:00:27 -07:00
Andrew Brown
f063082474 x64: remove Inst::XmmLoadConst (#4876)
This is a cherry-pick of a long-ago commit, 2d46637. The original
message reads:

> Now that `SyntheticAmode` can refer to constants, there is no longer a
> need for a separate instruction format--standard load instructions will
> work.

Since then, the transition to ISLE and the use of `XmmLoadConst` in many
more places makes this change a larger diff than the original. The basic
idea is the same, though: the extra indirection of `Inst::XmmLoadConst`
is removed and replaced by a direct use of `VCodeConstant` as a
`SyntheticAmode`. This has no effect on codegen, but the CLIF output is
now clearer in that the actual instruction is displayed (e.g., `movdqu`)
instead of a made-up instruction (`load_const`).
2022-09-07 12:52:13 -07:00
Trevor Elliott
9386409607 x64: Lower extractlane, scalar_to_vector, and splat in ISLE (#4780)
Lower extractlane, scalar_to_vector and splat in ISLE.

This PR also makes some changes to the `SinkableLoad` API:
* change the return type of `sink_load` to `RegMem`, as there are more functions available for dealing with `RegMem`
* add `reg_mem_to_reg_mem_imm` and register it as an automatic conversion
2022-08-25 09:38:03 -07:00
Trevor Elliott
b8b6f2781e x64: Lower shuffle and swizzle in ISLE (#4772)
Lower `shuffle` and `swizzle` in ISLE.

This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.

I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
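
A minimal sketch (hypothetical helper, not the actual ISLE lowering) of building that extra mask:

```
// For `vpermi2b`, any shuffle index outside the 32 input bytes must yield
// 0 in the result to match `shuffle`'s semantics. Build a byte mask to
// `pand` with the permuted result: all-ones keeps a lane, zero clears it.
fn zeroing_mask(imm: &[u8; 16]) -> [u8; 16] {
    let mut mask = [0u8; 16];
    for (i, &idx) in imm.iter().enumerate() {
        mask[i] = if idx < 32 { 0xff } else { 0x00 };
    }
    mask
}
```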
2022-08-24 21:49:51 +00:00
Trevor Elliott
4bdfa76370 x64: Migrate get_pinned_reg, set_pinned_reg, vconst, and raw_bitcast to ISLE (#4763)
https://github.com/bytecodealliance/wasmtime/pull/4763
2022-08-23 16:32:00 -07:00
Chris Fallin
019ebf47b1 x64: fix pretty-printing argument order for XmmRmR instructions. (#4094)
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).

`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.

Actual emitted code was never incorrect, just the pretty-printing.

Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
2022-05-03 10:12:58 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Chris Fallin
1c014d129a Cranelift: ensure ISA level needed for SIMD is present when SIMD is enabled. (#3816)
Addresses #3809: when we are asked to create a Cranelift backend with
shared flags that indicate support for SIMD, we should check that the
ISA level needed for our SIMD lowerings is present.
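
A minimal sketch (hypothetical flag names) of such a check:

```
// Refuse to construct a backend when SIMD is enabled but the ISA level
// the SIMD lowerings assume (for example SSE4.2 on x86_64) is absent.
fn check_simd_support(enable_simd: bool, has_sse42: bool) -> Result<(), String> {
    if enable_simd && !has_sse42 {
        return Err("SIMD support requires SSE4.2 on x86_64".to_string());
    }
    Ok(())
}
```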
2022-02-16 17:29:30 -08:00
Alex Crichton
1ef0abb12c Update lots of isa/*/*.clif tests to precise-output (#3677)
* Update lots of `isa/*/*.clif` tests to `precise-output`

This commit goes through the `aarch64` and `x64` subdirectories and
subjectively changes tests from `test compile` to add `precise-output`.
It then auto-updates all the test expectations so they can be updated
automatically instead of manually in the future. Not all tests were
migrated, largely subject to my own whims, mainly depending on whether
the test was looking for specific instructions or checking the whole
assembly output.

* Filter out `;;` comments from test expectations

Looks like the Cranelift parser picks up all comments, not just those
trailing the function, so use a convention where `;;` is used for
human-readable comments in test cases and `;`-prefixed comments are the
test expectation.
2022-01-10 13:38:23 -06:00
bjorn3
9e34df33b9 Remove the old x86 backend 2021-09-29 16:13:46 +02:00
Chris Fallin
cb48ea406e Switch default to new x86_64 backend.
This PR switches the default backend on x86, for both the
`cranelift-codegen` crate and for Wasmtime, to the new
(`MachInst`-style, `VCode`-based) backend that has been under
development and testing for some time now.

The old backend is still available by default in builds with the
`old-x86-backend` feature, or by requesting `BackendVariant::Legacy`
from the appropriate APIs.

As part of that switch, it adds some more runtime-configurable plumbing
to the testing infrastructure so that tests can be run using the
appropriate backend. `clif-util test` is now capable of parsing a
backend selector option from filetests and instantiating the correct
backend.

CI has been updated so that the old x86 backend continues to run its
tests, just as we used to run the new x64 backend separately.

At some point, we will remove the old x86 backend entirely, once we are
satisfied that the new backend has not caused any unforeseen issues and
we do not need to revert.
2021-04-02 11:35:53 -07:00
Andrew Brown
bb2dd5b68b [machinst x64]: implement load*_zero for x64 2021-01-08 16:21:57 -08:00
Chris Fallin
1dddba649a x64 regalloc register order: put caller-saves (volatiles) first.
The x64 backend currently builds the `RealRegUniverse` in a way that
is generating somewhat suboptimal code. In many blocks, we see uses of
callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even in
very short leaf functions where there are plenty of volatiles to use.
This is leading to unnecessary spills/reloads.

On one (local) test program, a medium-sized C benchmark compiled to Wasm
and run on Wasmtime, I am seeing a ~10% performance improvement with
this change; it will be less pronounced in programs with high register
pressure (there we are likely to use all registers regardless, so the
prologue/epilogue will save/restore all callee-saves), or in programs
with fewer calls, but this is a clear win for small functions and in
many cases removes prologue/epilogue clobber-saves altogether.

Separately, I think the RA's coalescing is tripping up a bit in some
cases; see e.g. the filetest touched by this commit that loads a value
into %rsi then moves to %rax and returns immediately. This is an
orthogonal issue, though, and should be addressed (if worthwhile) in
regalloc.rs.
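
A minimal sketch (hypothetical register order, not the backend's actual universe) of the idea:

```
// Listing caller-saved (volatile) registers first makes the allocator
// prefer them, so small leaf functions avoid touching callee-saved
// registers and thus avoid prologue/epilogue save/restore code.
const ALLOCATION_ORDER: &[&str] = &[
    // caller-saved: free to clobber without saving
    "rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11",
    // callee-saved: each use costs a save/restore in the prologue/epilogue
    "rbx", "r12", "r13", "r14", "r15",
];
```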
2020-12-06 22:37:43 -08:00
Andrew Brown
83f182b390 Implement initial emission of constants
This approach suffers from memory bloat at compile time due to the desire to de-duplicate the constants emitted and reduce runtime memory size. As a first step, though, this provides an end-to-end mechanism for constants to be emitted in the MachBuffer islands.
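
A minimal sketch (hypothetical types) of the de-duplication this trades compile-time memory for:

```
use std::collections::HashMap;

// Identical constant byte patterns share one pool slot: the map costs
// memory during compilation but shrinks the emitted constant islands.
struct ConstantPool {
    data: Vec<Vec<u8>>,
    dedup: HashMap<Vec<u8>, usize>,
}

impl ConstantPool {
    fn intern(&mut self, bytes: Vec<u8>) -> usize {
        if let Some(&idx) = self.dedup.get(&bytes) {
            return idx; // reuse the existing slot
        }
        let idx = self.data.len();
        self.dedup.insert(bytes.clone(), idx);
        self.data.push(bytes);
        idx
    }
}
```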
2020-11-05 14:25:02 -08:00
Benjamin Bouvier
e32e6fb612 machinst x64: check SSE requirements for instructions against enabled features; 2020-10-08 09:21:51 +02:00
Andrew Brown
16a2538ecd [machinst x64]: rename Inst::XmmUninitializedValue and document
This approach is not the best but avoids an extra instruction; perhaps at some point, as mentioned in https://github.com/bytecodealliance/wasmtime/pull/2248, we will add the extra instruction or refactor things in such a way that this `Inst` variant is unnecessary.
2020-10-02 08:29:31 -07:00
Andrew Brown
3d9f3bf728 [machinst x64]: port CLIF tests related to comparison and lane operations 2020-10-02 08:29:31 -07:00
Andrew Brown
b7217d454f [machinst x64]: add lane-related CLIF filetests 2020-09-29 08:45:12 -07:00