Commit Graph

1491 Commits

Author SHA1 Message Date
Alex Crichton
afb417920d x64: Deduplicate fcmp emission logic (#6113)
* x64: Deduplicate fcmp emission logic

The `select`-of-`fcmp` lowering duplicated a good deal of `FloatCC`
lowering logic that was already done by `emit_fcmp`, so this commit
refactors these lowering rules to instead delegate to `emit_fcmp` and
then handle that result.

* Swap order of condition codes

Shouldn't affect the correctness of this operation and it's a bit more
natural to write the lowering rule this way.

* Swap the order of comparison operands

No need to swap `a b`, only the `x y` needs swapping.

* Fix x64 printing of `XmmCmove`
2023-03-29 16:24:25 +00:00
Karl Meakin
dcf0ea9ff3 ISLE: rewrite and/or of icmp (#6095)
* ISLE: rewrite `and`/`or` of `icmp`

* Add `make-icmp-tests.sh` script

* Remove unused changes
2023-03-29 00:13:27 +00:00
Karl Meakin
97d9f77d94 Add precise_output argument to test optimize. (#6111)
* Add `precise_output` argument to `test optimize`.

Also allow optimize tests to be updated via `CRANELIFT_TEST_BLESS=1`

* Move `check_precise_output` and `update_test` to `subtest`
2023-03-28 22:32:04 +00:00
Maja Kądziołka
db07988ccb x64: emit_cmp: use x64_test for comparisons with 0 (#6086)
* x64: emit_cmp: use x64_test for comparisons with 0

See #5869

* fixup! x64: emit_cmp: use x64_test for comparisons with 0
2023-03-27 15:38:48 +00:00
Nathan Whitaker
c3decdf910 cranelift: Implement TLS on aarch64 Mach-O (Apple Silicon) (#5434)
* Implement TLS on Aarch64 Mach-O

* Add aarch64 macho TLS filetest

* Address review comments

- `Aarch64` instead of `AArch64` in comments
- Remove unnecessary guard in tls_value lowering
- Remove unnecessary regalloc metadata in emission

* Use x1 as temporary register in emission

- Instead of passing in a temporary register to use when emitting
the TLS code, just use `x1`, as it's already in the clobber set.
This also keeps the size of `aarch64::inst::Inst` at 32 bytes.
- Update filetest accordingly

* Update aarch64 mach-o TLS filetest
2023-03-24 17:54:01 +00:00
Afonso Bordado
602ff71fe4 riscv64: Add Zba extension instructions (#6087)
* riscv64: Use `add.uw` to zero extend

* riscv64: Implement `add.uw` optimizations

* riscv64: Add `Zba` `iadd+ishl` optimizations

* riscv64: Add `shl+uextend` optimizations based on `Zba`

* riscv64: Fix some issues with `Zba` instructions

* riscv64: Restrict shnadd selection

* riscv64: Fix `extend` priorities

* riscv64: Remove redundant `addw` rule

* riscv64: Specify type for `add` extend rules

* riscv64: Use `u64_from_imm64` extractor instead of `uimm8`

* riscv64: Restrict `uextend` in `shnadd.uw` rules

* riscv64: Use concrete type in `slli.uw` rule

* riscv64: Add extra arithmetic extends tests

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* riscv64: Make `Adduw` types concrete

* riscv64: Add extra arithmetic extend tests

* riscv64: Add `sextend`+Arithmetic rules

* riscv64: Fix whitespace

* cranelift: Move arithmetic extends tests with i128 to separate file

---------

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2023-03-23 20:06:03 +00:00
Ulrich Weigand
6f66abd5c7 s390x: Improved TrapIf implementation (#6079)
Following up on the discussion in
https://github.com/bytecodealliance/wasmtime/pull/6011
this adds an improved implementation of TrapIf for s390x
using a single conditional branch instruction.

If the trap condition is true, we branch into the middle of
the branch instruction - those middle two bytes are zero,
which matches the encoding of the trap instruction.

In addition, show the trap code for Trap and TrapIf
instructions in assembler output.
2023-03-23 14:50:43 +00:00
Alex Crichton
2fde25311e x64: Refactor and fill out some gpr-vs-xmm bits (#6058)
* x64: Add instruction helpers for `mov{d,q}`

These will soon grow AVX-equivalents so move them to instruction helpers
to have clauses for AVX in the future.

* x64: Don't auto-convert between RegMemImm and XmmMemImm

The previous conversion, `mov_rmi_to_xmm`, would move from GPR registers
to XMM registers which isn't what many of the other `convert` statements
between these newtypes do. This seemed like a possible footgun so I've
removed the auto-conversion and added an explicit helper to go from a
`u32` to an `XmmMemImm`.

* x64: Add AVX encodings of some more GPR-related insns

This commit adds some more support for AVX instructions where GPRs are
in use mixed in with XMM registers. This required a few more variants of
`Inst` to handle the new instructions.

* Fix vpmovmskb encoding

* Fix xmm-to-gpr encoding of vmovd/vmovq

* Fix typo

* Fix rebase conflict

* Fix rebase conflict with tests
2023-03-22 14:58:09 +00:00
Afonso Bordado
7a3df7dcc0 riscv64: Improve ctz/clz/cls codegen (#5854)
* cranelift: Add extra runtests for `clz`/`ctz`

* riscv64: Restrict lowering rules for `ctz`/`clz`

* cranelift: Add `u64` isle helpers

* riscv64: Improve `ctz` codegen

* riscv64: Improve `clz` codegen

* riscv64: Improve `cls` codegen

* riscv64: Improve `clz.i128` codegen

Instead of checking if we have 64 zeros in the top half, check
if it *is* 0; that way we avoid loading the `64` constant.

* riscv64: Improve `ctz.i128` codegen

Instead of checking if we have 64 zeros in the bottom half, check
if it *is* 0; that way we avoid loading the `64` constant.
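
For reference, a plain-Rust sketch (illustrative only, not the ISLE rules) of the shape these i128 lowerings take, branching on whether a half *is* zero rather than comparing a zero count against 64:

    // clz over i128: if the top half is zero, count zeros in the bottom half.
    fn clz_i128(x: u128) -> u32 {
        let (hi, lo) = ((x >> 64) as u64, x as u64);
        if hi == 0 { 64 + lo.leading_zeros() } else { hi.leading_zeros() }
    }

    // ctz over i128: if the bottom half is zero, count zeros in the top half.
    fn ctz_i128(x: u128) -> u32 {
        let (hi, lo) = ((x >> 64) as u64, x as u64);
        if lo == 0 { 64 + hi.trailing_zeros() } else { lo.trailing_zeros() }
    }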

* riscv64: Use extended value in `lower_cls`

* riscv64: Use pattern matches on `bseti`
2023-03-21 23:15:14 +00:00
Alexa VanHattum
13be5618a7 Cranelift: ISLE: aarch64: fix imm12_from_negated_value for i32, i16 (#6078)
* Fix the semantics of imm12_from_negated_value, swapping to a partial term + rule

* wrapping_neg
2023-03-21 19:16:25 +00:00
Trevor Elliott
861220c433 Restrict the types for isplit and iconcat to match backends (#6070)
* Restrict the types for isplit and iconcat to match backends

* Admit unimplemented bitwidths to isplit/iconcat

* Modify the NarrowInt type instead of shadowing it

* Fix filetest failures
2023-03-21 01:21:00 +00:00
Karl Meakin
7d9318fe77 cranelift: rewrite iabs(ineg(x)) and iabs(iabs(x)) (#6072)
* cranelift: rewrite `iabs(ineg(x))` and `iabs(iabs(x))`

* Fix comment on `iabs(iabs(x))` rewrite

* Remove subsume on rewrite for `iabs(ineg(x))`
2023-03-21 00:12:21 +00:00
Alex Crichton
a3b21031d4 Add a MachBuffer::defer_trap method (#6011)
* Add a `MachBuffer::defer_trap` method

This commit adds a new method to `MachBuffer` to defer trap opcodes to
the end of a function in a similar manner to how constants are deferred
to the end of the function. This is useful for backends which frequently
use `TrapIf`-style opcodes. Currently a jump is emitted which skips the
next instruction, a trap, and then execution continues normally. While
there isn't any pressing problem with this construction, the trap opcode
sits in the middle of the instruction stream as opposed to "off on the
side" despite rarely being taken.

With this method in place all the backends (except riscv64 since I
couldn't figure it out easily enough) have a new lowering of their
`TrapIf` opcode. Now a trap is deferred, which returns a label, and then
that label is jumped to when executing the trap. A fixup is then
recorded in `MachBuffer` to get patched later on during emission, or at
the end of the function. Subsequently all `TrapIf` instructions
translate to a single branch plus a single trap at the end of the
function.

I've additionally further updated some more lowerings in the x64 backend
which were explicitly using traps to instead use `TrapIf` where
applicable to avoid jumping over traps mid-function. Other backends
didn't appear to have many jump-over-the-next-trap patterns.

Lots of tests have had their expectations updated here which should
reflect all the traps being sunk to the end of functions.
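
As a rough illustration, here is a toy model in Rust (not the real `MachBuffer` API) of the idea: `TrapIf` becomes a single forward branch, and the trap bodies are only appended once the buffer is finalized:

    // Toy buffer: traps are recorded during lowering and emitted "off on the
    // side" after the normal instruction stream.
    struct ToyBuffer {
        code: Vec<String>,
        deferred: Vec<String>,
    }

    impl ToyBuffer {
        fn defer_trap(&mut self, trap_code: &str) -> String {
            let label = format!("trap{}", self.deferred.len());
            self.deferred.push(format!("{label}: ud2 ; {trap_code}"));
            label
        }

        fn finish(mut self) -> Vec<String> {
            self.code.append(&mut self.deferred); // traps land at the end
            self.code
        }
    }

    fn main() {
        let mut buf = ToyBuffer { code: vec![], deferred: vec![] };
        let trap = buf.defer_trap("int_divz");
        buf.code.push(format!("jz {trap}")); // single branch, no mid-stream trap
        buf.code.push("div rcx".to_string());
        println!("{}", buf.finish().join("\n"));
    }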

* Print trap code on all platforms

* Emit traps before constants

* Preserve source location information for traps

* Fix test expectations

* Attempt to fix s390x

The MachBuffer was registering trap codes with the first byte of the
trap, but the SIGILL handler was expecting it to be registered with the
last byte of the trap. Exploit the fact that SIGILL is always generated by a
2-byte instruction: always march 2 bytes backwards for SIGILL, and continue
to march 1 byte backwards for SIGFPE-generating instructions.

* Back out s390x changes

* Back out more s390x bits

* Review comments
2023-03-20 21:24:47 +00:00
Alex Crichton
dd7fa81b20 x64: Run more filetests with AVX support (#6063)
This commit goes through the `runtests` folder of the `filetests`
test suite and ensures that everything which uses simd or float-related
instructions on x64 is executed with the baseline support for x86_64 in
addition to adding in AVX support. Most of the instructions used have
AVX equivalents so this should help test all of the equivalents in
addition to the codegen filetests in the x64 folder.
2023-03-20 19:13:14 +00:00
Alex Crichton
f7dda1ab2c x64: Fix vbroadcastss with AVX2 and without AVX (#6060)
* x64: Fix vbroadcastss with AVX2 and without AVX

This commit fixes a corner case in the emission of the
`vbroadcasts{s,d}` instructions. The memory-to-xmm form of these
instructions was available with the AVX instruction set, but the
xmm-to-xmm form of these instructions wasn't available until AVX2.
The instruction requirement for these is listed as AVX but the lowering
rules are appropriately annotated to use either AVX2 or AVX when
appropriate.

While this should work in practice this didn't work for the assertion
about enabled features for each instruction. The `vbroadcastss`
instruction was listed as requiring AVX but could get emitted when AVX2
was enabled (due to the reg-to-reg form being available). This caused an
issue for the fuzzer where AVX2 was enabled but AVX was disabled.

One possible fix would be to add more opcodes, one for reg-to-reg and
one for mem-to-reg. That seemed like somewhat overkill for a pretty
niche situation that shouldn't actually come up in practice anywhere.
Instead this commit changes all the `has_avx` accessors to the
`use_avx_simd` predicate already available in the target flags. The
`use_avx2_simd` predicate was then updated to additionally require
`has_avx`, so if AVX2 is enabled and AVX is disabled then the
`vbroadcastss` instruction won't get emitted any more.
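
In effect (a simplified Rust sketch with illustrative field names, not the actual flag definitions):

    struct Flags { has_avx: bool, has_avx2: bool }

    // After this change an AVX2-gated lowering also requires AVX, so a
    // configuration with AVX2 enabled but AVX disabled can no longer select
    // the reg-to-reg `vbroadcastss` form.
    fn use_avx2_simd(f: &Flags) -> bool {
        f.has_avx2 && f.has_avx
    }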

Closes #6059

* Pass `enable_simd` on a few more files
2023-03-18 18:38:03 +00:00
Nick Fitzgerald
90d3eff0f3 cranelift-wasm: Refactor bounds checks to avoid repetition of Spectre and non-Spectre (#6054) 2023-03-17 20:30:42 +00:00
Karl Meakin
208d09e9f0 cranelift: rewrite x*-1 to ineg(x) (#6052)
* cranelift: rewrite `x*-1` to `ineg(x)`

* Add commuted test
2023-03-17 19:52:13 +00:00
Karl Meakin
c3f5b71b6a cranelift: cancel ineg when args to imul (#6053)
* cranelift: cancel `ineg`/`iabs` when args to `imul`

* Remove unsound `iabs(x) * iabs(y) == x*y` rewrite
2023-03-17 19:41:20 +00:00
Nick Fitzgerald
2e48babf23 cranelift-wasm: Add a bounds-checking optimization for dynamic memories and guard pages (#6031)
* cranelift-wasm: Add a bounds-checking optimization for dynamic memories and guard pages

This is a new special case for when we know that there are enough guard pages to
cover the memory access's offset and access size.

The precise should-we-trap condition is

    index + offset + access_size > bound

However, if we instead check only the partial condition

    index > bound

then the most out of bounds that the access can be, while that partial check
still succeeds, is `offset + access_size`.

However, when we have a guard region that is at least as large as `offset +
access_size`, we can rely on the virtual memory subsystem handling these
out-of-bounds errors at runtime. Therefore, the partial `index > bound` check is
sufficient for this heap configuration.

Additionally, this has the advantage that a series of Wasm loads that use the
same dynamic index operand but different static offset immediates -- which is a
common code pattern when accessing multiple fields in the same struct that is in
linear memory -- will all emit the same `index > bound` check, which we can GVN.
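
To make that concrete, a minimal Rust sketch (illustrative only, not the cranelift-wasm code) of the two check shapes:

    /// Full check: trap iff `index + offset + access_size > bound`
    /// (overflow handling omitted for clarity).
    fn full_check(index: u64, offset: u64, access_size: u64, bound: u64) -> bool {
        index + offset + access_size > bound
    }

    /// Partial check, valid only when the guard region is at least
    /// `offset + access_size` bytes: anything that slips past `bound` still
    /// lands in the guard pages, so the virtual memory subsystem catches it.
    fn partial_check(index: u64, bound: u64) -> bool {
        index > bound
    }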

* cranelift: Add WAT tests for accessing dynamic memories with the same index but different offsets

The bounds check comparison is GVN'd but we still branch on values we should
know will always be true if we get this far in the code. This shows up as
actual `br_if`s in the non-Spectre code and as `select_spectre_guard`s that
we should know will always go a certain way if we have Spectre mitigations
enabled.

Improving the non-Spectre case is pretty straightforward: walk the dominator
tree and remember which values we've already branched on at this point, and
therefore we can simplify any further conditional branches on those same values
into direct jumps.

Improving the Spectre case requires something that is morally the same, but has
a few snags:

* We don't have actual `br_if`s to determine whether the bounds checking
  condition succeeded or not. We need to instead reason about dominating
  `select_spectre_guard; {load, store}` instruction pairs.

* We have to be SUPER careful about reasoning "through" `select_spectre_guard`s.
  Our general rule is never to do that, since it could break the speculative
  execution sandboxing that the instruction is designed for.
2023-03-17 19:06:19 +00:00
Karl Meakin
73cc433bdd cranelift: simplify icmp against UMAX/SMIN/SMAX (#6037)
* cranelift: simplify `icmp` against UMAX/SMIN/SMAX

* Add tests for icmp against numeric limits
2023-03-17 18:54:29 +00:00
Alex Crichton
5ebe53a351 x64: Elide more uextend with extractlane (#6045)
* x64: Elide more uextend with extractlane

I've confirmed locally now that `pextr{b,w,d}` all zero the upper bits
of the full 64-bit register size which means that the `extractlane`
operation with a zero-extend can be elided for more cases, including
8-to-64-bit casts as well as 32-to-64.

This helps elide a few extra `mov`s in a loop I was looking at and had a
modest corresponding increase in performance (my guess was due to the
slightly decreased code size mostly as opposed to the removed `mov`s).

* Remove stray file
2023-03-17 16:18:41 +00:00
Karl Meakin
b53d66e634 cranelift: simplify x-x to 0 (#6032)
* cranelift: simplify `x-x` to `0`

* Guard `x-x => 0` rewrite with `fits_in_64`
2023-03-17 15:14:28 +00:00
Alex Crichton
8e500099b3 x64: Refactor and add extractlane special case for uextend/sextend (#6022)
* x64: Refactor sextend/uextend rules

Move much of the meaty logic from these lowering rules into the
`extend_to_gpr` helper to benefit other callers of `extend_to_gpr` to
elide instructions. This additionally simplifies `sextend` and `uextend`
lowerings to rely on optimizations happening within the `extend_to_gpr`
helper.

* x64: Skip `uextend` for `pextr{b,w}` instructions

These instructions are documented as automatically zeroing the upper
bits so `uextend` operations can be skipped. This slightly improves
codegen for the wasm `i{8x16,16x8}.extract_lane_u` instructions, for
example.

* Modernize an extractor pattern

* Trim some superfluous match clauses

Additionally rejigger priorities to be "mostly default" now.

* Refactor 32-to-64 predicate to a helper

Also adjust the pattern matched in the `extend_to_gpr` helper.

* Slightly refactor pextr{b,w} case

* Review comments
2023-03-16 22:14:59 +00:00
Karl Meakin
d479951469 cranelift: simplify fneg(fneg(x)) to x (#6034) 2023-03-16 22:14:12 +00:00
Karl Meakin
dccc2d6269 cranelift: simplify ineg(ineg(x)) to x (#6033) 2023-03-16 22:14:05 +00:00
Afonso Bordado
07136ae96d cranelift-interpreter: Implement a bunch of SIMD arithmetic ops (#5991)
* cranelift: Add function name to tests

* cranelift: Move simd-ineg tests to separate file

* cranelift: Move `avg_round` tests to separate file

* cranelift: Move SIMD `fmin`/`fmax` tests to separate files

* cranelift-interpreter: Implement a bunch of SIMD arithmetic ops

Most of these are quite easy to adapt to be polymorphic

* cranelift: Move shift tests from `simd-arithmetic.clif` into shift files
2023-03-16 18:44:16 +00:00
Alex Crichton
5ae8575296 x64: Take SIGFPE signals for divide traps (#6026)
* x64: Take SIGFPE signals for divide traps

Prior to this commit Wasmtime would configure `avoid_div_traps=true`
unconditionally for Cranelift. This, for the division-based
instructions, would change emitted code to explicitly trap on trap
conditions instead of letting the `div` x86 instruction trap.

There's no particular reason for Wasmtime, however, to avoid traps in the
`div` instruction. This means that the extra generated branches on x86
aren't necessary, since the `div` and `idiv` instructions already trap
under similar conditions to those wasm requires.

This commit instead disables the `avoid_div_traps` setting for
Wasmtime's usage of Cranelift. Subsequently the codegen rules were
updated slightly:

* When `avoid_div_traps=true`, traps are no longer emitted for `div`
  instructions.
* The `udiv`/`urem` instructions now list their trap as divide-by-zero
  instead of integer overflow.
* The lowering for `sdiv` was updated to still explicitly check for zero
  but the integer overflow case is deferred to the instruction itself.
* The lowering of `srem` no longer checks for zero and the listed trap
  for the `div` instruction is a divide-by-zero.

This means that the codegen for `udiv` and `urem` no longer has any
branches. The codegen for `sdiv` removes one branch but keeps the
zero-check to differentiate the two kinds of traps. The codegen for
`srem` removes one branch but keeps the -1 check, since the semantics of
`srem` don't match the semantics of `idiv` with a -1 divisor
(specifically for INT_MIN).
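
As a plain-Rust illustration of that `srem`/`idiv` mismatch (not the lowering itself):

    fn wasm_srem_i64(x: i64, y: i64) -> i64 {
        // Wasm defines `INT_MIN rem -1` as 0, but x86's `idiv` (like Rust's `%`)
        // treats it as overflow, hence the explicit check for a -1 divisor.
        // A zero divisor still traps (here: panics), matching divide-by-zero.
        if y == -1 { 0 } else { x % y }
    }

    fn main() {
        assert_eq!(wasm_srem_i64(i64::MIN, -1), 0);
        assert_eq!(i64::MIN.checked_rem(-1), None); // plain `%` would overflow here
    }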

This is unlikely to have really all that much of a speedup but was
something I noticed during #6008 which seemed like it'd be good to clean
up. Plus Wasmtime's signal handling was already set up to catch
`SIGFPE`, it was just never firing.

* Remove the `avoid_div_traps` cranelift setting

With no known users currently removing this should be possible and helps
simplify the x64 backend.

* x64: GC more support for avoid_div_traps

Remove the `validate_sdiv_divisor*` pseudo-instructions and clean up
some of the ISLE rules now that `div` is allowed to itself trap
unconditionally.

* x64: Store div trap code in instruction itself

* Keep divisors in registers, not in memory

Don't accidentally fold multiple traps together

* Handle EXC_ARITHMETIC on macos

* Update emit tests

* Update winch and tests
2023-03-16 00:18:45 +00:00
Alex Crichton
d76f7ee52e x64: Improve codegen for splats (#6025)
This commit goes through the lowerings for the CLIF `splat` instruction
and improves the support for each operator. Many of these lowerings are
mirrored from v8/SpiderMonkey and there are a number of improvements:

* AVX2 `v{p,}broadcast*` instructions are added and used when available.
* Float-based splats are much simpler and always a single instruction
* Integer-based splats don't insert into an uninit xmm value and instead
  start out with a `movd` to move into an `xmm` register. This
  theoretically breaks dependencies with prior instructions since `movd`
  creates a fresh new value in the destination register.
* Loads are now sunk into all of the instructions. A new extractor,
  `sinkable_load_exact`, was added to sink the i8/i16 loads.
2023-03-15 21:33:56 +00:00
Afonso Bordado
a10c50afe9 cranelift: Translate stack_* accesses as unaligned (#6016)
We can't currently ensure that these will be aligned, so we shouldn't mark them as such.
2023-03-15 18:05:55 +00:00
Alex Crichton
6ed90f86c8 x64: Add support for the pblendw instruction (#6023)
This commit adds another case for `shuffle` lowering to the x64 backend
for the `{,v}pblendw` instruction. This instruction selects 16-bit
lanes from either of the two inputs according to an immediate 8-bit mask,
where each bit chooses which input the corresponding lane comes from.
2023-03-15 17:20:43 +00:00
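
For reference, the lane selection `pblendw` performs, sketched as a plain-Rust model (not the backend code):

    // `pblendw a, b, imm8`: bit i of the immediate picks 16-bit lane i from `b`
    // when set and from `a` when clear.
    fn pblendw(a: [u16; 8], b: [u16; 8], imm8: u8) -> [u16; 8] {
        let mut out = [0u16; 8];
        for i in 0..8 {
            out[i] = if (imm8 >> i) & 1 == 1 { b[i] } else { a[i] };
        }
        out
    }
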
Alex Crichton
fcddb9ca81 x64: Add lea-based lowering for iadd (#5986)
* x64: Refactor `Amode` computation in ISLE

This commit replaces the previous computation of `Amode` with a
different set of rules that are intended to achieve the same purpose but
are structured differently. The motivation for this commit is going to
become more relevant in the next commit where `lea` will be used for the
`iadd` instruction, possibly, on x64. When doing so it caused a stack
overflow in the test suite during the compilation phase of a wasm
module, namely as part of the `amode_add` function. This function is
recursively defined in terms of itself and recurses as deep as the
deepest `iadd`-chain in a program. A particular test in our test suite
has a 10k-long chain of `iadd` which ended up causing a stack overflow
in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE
unconditionally peels all the `iadd` nodes away and looks at all of
them, even if most end up in intermediate registers along the way. Given
that structure I couldn't find a way to easily abort the recursion. The
new `to_amode` helper is structured in a similar fashion but attempts to
instead only recurse far enough to fold items into the final `Amode`
instead of recursing through items which themselves don't end up in the
`Amode`. Put another way previously the `amode_add` helper might emit
`x64_add` instructions, but it no longer does that.

The goal of this commit is to preserve all the original `Amode`
optimizations, however. For some parts, though, it relies more on egraph
optimizations to run, since if an `iadd` is 10k deep it doesn't try to
find a constant buried 9k levels inside there to fold into the `Amode`.
The hope, though, is that with the egraph pass having already run, it has
shuffled constants to the right most of the time and folded together
anything it could.

* x64: Add `lea`-based lowering for `iadd`

This commit adds a rule for the lowering of `iadd` to use `lea` for 32
and 64-bit addition. The theoretical benefit of `lea` over the `add`
instruction is that the `lea` variant can emulate a 3-operand
instruction which doesn't destructively modify one of its operands.
Additionally the `lea` operation can fold in other components such as
constant additions and shifts.

In practice, however, if `lea` is unconditionally used instead of `iadd`
it ends up losing 10% performance on a local `meshoptimizer` benchmark.
My best guess as to what's going on here is that my CPU's dedicated
units for address computation are all overloaded while the ALUs are
basically idle in a memory-intensive loop. Previously when the ALU was
used for `add` and the address units for stores/loads it in theory
pipelined things better (most of this is me shooting in the dark). To
prevent the performance loss here I've updated the lowering of `iadd` to
conditionally sometimes use `lea` and sometimes use `add` depending on
how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm`
continue to use `add` (and its subsequent hypothetical extra `mov`
necessary into the result). More complicated ones like `a + b + $imm` or
`a + b << c + $imm` use `lea` as it can remove the need for extra
instructions. Locally at least this fixes the performance loss relative
to unconditionally using `lea`.
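
Sketched in Rust (hypothetical shape names, mirroring only the description above), the heuristic is roughly:

    // `add` for the simple shapes, `lea` once the address expression folds
    // several components together.
    enum AddrShape {
        RegReg,         // a + b
        RegImm,         // a + $imm
        RegRegImm,      // a + b + $imm
        RegRegShiftImm, // a + (b << c) + $imm
    }

    fn prefer_lea(shape: &AddrShape) -> bool {
        matches!(shape, AddrShape::RegRegImm | AddrShape::RegRegShiftImm)
    }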

One note is that this adds an `OperandSize` argument to the
`MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit
`lea` in addition to the preexisting 64-bit encoding.

* Conditionally use `lea` based on regalloc
2023-03-15 17:14:25 +00:00
Trevor Elliott
68b937d965 cranelift: Fix shift overflow when constructing BitSet (#6020)
* Fix shift overflow when constructing the Wider constraint for integers

* Clarify comment
2023-03-14 22:25:51 +00:00
Trevor Elliott
e4d9bb7c5a cranelift: Exclude the control type in narrower and wider (#6018)
* Don't include the control type in `narrower` or `wider` constraints

* Add verifier tests for instructions that use narrower and wider
2023-03-14 18:09:15 +00:00
Trevor Elliott
f5ad74e546 cranelift: Add narrower and wider constraints to the instruction DSL (#6013)
* Add narrower and wider constraints to the instruction DSL

* Add docs to narrower/wider operands

* Update cranelift/codegen/meta/src/cdsl/instructions.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix assertion message

* Simplify upper bounds for the wider constraint

* Remove additional unnecessary cases in the verifier

* Remove unused variables

* Remove changes to is_ctrl_typevar_candidate

These changes were only necessary when the type returned by an
instruction was a variable constrained by `narrower` or `wider`. As we have
switched to requiring that constraints appear on argument types and
not return types, these changes are no longer necessary.

---------

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2023-03-14 16:34:17 +00:00
Alex Crichton
5c1b468648 x64: Migrate {s,u}{div,rem} to ISLE (#6008)
* x64: Add precise-output tests for div traps

This adds a suite of `*.clif` files which are intended to test the
`avoid_div_traps=true` compilation of the `{s,u}{div,rem}` instructions.

* x64: Remove conditional regalloc in `Div` instruction

Move the 8-bit `Div` logic into a dedicated `Div8` instruction to avoid
having conditionally-used registers with respect to regalloc.

* x64: Migrate non-trapping, `udiv`/`urem` to ISLE

* x64: Port checked `udiv` to ISLE

* x64: Migrate urem entirely to ISLE

* x64: Use `test` instead of `cmp` to compare-to-zero

* x64: Port `sdiv` lowering to ISLE

* x64: Port `srem` lowering to ISLE

* Tidy up regalloc behavior and fix tests

* Update docs and winch

* Review comments

* Reword again

* More refactoring test fixes

* More test fixes
2023-03-14 01:44:06 +00:00
Alex Crichton
d6ce632b5b aarch64: Specialize constant vector shifts (#5976)
* aarch64: Specialize constant vector shifts

This commit adds special lowering rules for
vector-shifts-by-constant-amounts to use dedicated instructions which
cuts down on the codegen here quite a bit for constant values.

* Fix codegen for 0-shift-rights

* Special-case zero left-shifts as well

* Remove left-shift special case
2023-03-13 22:37:59 +00:00
Alex Crichton
e2a6fe99c2 x64: Add shuffle specialization for palignr (#5999)
* x64: Add `shuffle` specialization for `palignr`

This commit adds specializations for the `palignr` instruction to the
x64 backend to specialize some more patterns of byte shuffles.

* Fix tests
2023-03-13 21:01:24 +00:00
Alex Crichton
03b5dbb3e0 aarch64: Use VCodeConstant for f64/v128 constants (#5997)
* aarch64: Translate float and splat lowering to ISLE

I was looking into `constant_f128` and its fallback lowering into memory
and to get familiar with the code I figured it'd be good to port some
Rust logic to ISLE. This commit ports the `constant_{f128,f64,f32}`
helpers into ISLE from Rust as well as the `splat_const` helper which
ended up being closely related.

Tests reflect a number of regalloc changes that happened, but the one
major difference is that in the lowering of `f32` a 32-bit immediate
is now created instead of a 64-bit immediate (in a GP register before
it's moved into an FP register). Semantically nothing changes, but the
generated code is slightly different in a few minor cases.

* aarch64: Load f64/v128 constants from a pool

This commit removes the `LoadFpuConst64` and `LoadFpuConst128`
pseudo-instructions from the AArch64 backend which internally loaded a
nearby constant and then jumped over it. Constants now go through the
`VCodeConstant` infrastructure which gets placed at the end of the
function similar to how x64 works. Some minor support was added in as
well to add a new addressing mode for a `MachLabel`-relative load.
2023-03-13 19:33:52 +00:00
Alex Crichton
6ecdc2482e x64: Improve memory support in {insert,extract}lane (#5982)
* x64: Improve memory support in `{insert,extract}lane`

This commit adds support to Cranelift to emit `pextr{b,w,d,q}`
with a memory destination, merging a store-of-extract operation into one
instruction. Additionally AVX support is added for the `pextr*`
instructions.

I've additionally tried to ensure that codegen tests and runtests exist
for all forms of these instructions too.

* Add missing commas

* Fix tests
2023-03-13 19:30:44 +00:00
Afonso Bordado
ad0bce3a36 riscv64: Fix regaloc panic with bor+bnot on floats (#5857) 2023-03-13 18:29:36 +00:00
Alex Crichton
7956dc6ba2 Change CLIF shuffle to validate lane indices (#5995)
* Change CLIF `shuffle` to validate lane indices

Previously the CLIF `shuffle` instruction did not perform any validation
on the lane shuffle mask and specified that out-of-bounds lanes always
returned 0 as the value. This behavior though is not required by
WebAssembly which validates that lane indices are always in-bounds.
Additionally since these are static immediates even other code
generators should be able to verify that the immediates are in-bounds.

As a result this commit updates the definition of the `shuffle`
instruction to specify that all byte immediates must be in-bounds in the
range of [0, 32). The verifier has been updated and some test cases have
been removed that were testing this functionality.

Closes #5989

* Only generate valid shuffle immediates in fuzzer
2023-03-13 14:24:11 +00:00
Chris Fallin
264089e29d Cranelift: aarch64: fix undefined dest reg in f32x4.splat case. (#5987)
One of the cases for a splat operation, as updated in #5370, wrote to
a temp reg but then only conditionally transformed the temp into the
final destination register. In another codepath, `rd` was left
undefined. This causes a panic later when regalloc2 verifies SSA
properties of its input (here, value not def'd before use).

Fixes #5985.
2023-03-11 00:22:29 +00:00
Alex Crichton
52896e020d aarch64: Add specialized shuffle lowerings (#5977)
* aarch64: Add `shuffle` lowerings for the `uzp{1,2}` instructions

This commit uses the same style of patterns in the x64 backend to start
adding specific lowerings of the Cranelift `shuffle` instruction to
particular AArch64 instructions.

* aarch64: Add `shuffle` lowerings to the `zip{1,2}` instructions

These instructions match the `punpck*` family of instructions on x64 and
should help provide more efficient lowerings than the current `shuffle`
fallback.

* aarch64: Add `shuffle` lowerings for `trn{1,2}`

Along the lines of prior commits adds specific patterns to lowering for
individual AArch64 instructions available.

* aarch64: Add a `shuffle` lowering for the `ext` instruction

This instruction will more-or-less concatenate two 128-bit vector
registers to create a 256-bit value, shift it right, and then place the
lower 128 bits into the destination. This can be modeled with a
`shuffle` of consecutive bytes so this adds a lowering rule to generate
this instruction.

* aarch64: Add `shuffle` special case for `dup`

This commit adds special cases for Cranelift's `shuffle` on AArch64 when
the lowering can be represented with a `dup` instruction which
broadcasts one vector's lane into all lanes of the destination.

* aarch64: Add `shuffle` specializations for `rev` instructions

This commit adds shuffle mask specializations for the `rev{16,32,64}`
family of instructions on AArch64 which can be used to reverse bytes,
16-bit values, or 32-bit values within larger values.

* Fix tests

* Add doc-comments in ISLE
2023-03-10 21:37:13 +00:00
bjorn3
108f7917c8 Support plugging external profilers into the Cranelift timing infrastructure (#5749)
* Remove no-std code for cranelift_codegen::timings

no-std mode isn't supported by Cranelift anymore

* Simplify define_passes macro

* Add egraph opt timings

* Replace the add_to_current api with PassTimes::add

* Omit a couple of unused time measurements

* Reduce divergence between run and run_passes a bit

* Introduce a Profiler trait

This allows plugging in external profilers into the Cranelift profiling
framework.

* Add Pass::description method

* Remove duplicate usage of the compile pass timing

* Rustfmt
2023-03-10 19:33:56 +00:00
yuyang
4e875f33a7 Codegen fix fcvt_from_sint.f32 with small types on riscv64. (#5964)
* fix issue5952

* We should only extend i8 and i16

* remove extra space

* move some code
2023-03-10 10:29:55 +00:00
Alex Crichton
0ec7b872fa x64: Optimize store-of-extract-lane-0 (#5924)
* x64: Optimize store-of-extract-lane-0

The `movss` and `movsd` instructions can be used to store the 0th lane
of a `t32x4` or a `t64x2` vector into memory, allowing a `store` and an
`extractlane` instruction to be fused.

* Fix merge conflict with `main`
2023-03-10 01:06:38 +00:00
Alex Crichton
83f21e784a x64: Add more support for more AVX instructions (#5931)
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.
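
A plain-Rust reference model of the selection pattern `shufps` supports (illustrative only, not the lowering):

    // `shufps a, b, imm8`: the two low output lanes come from `a`, the two high
    // lanes from `b`; each 2-bit field of the immediate picks the source lane.
    fn shufps(a: [f32; 4], b: [f32; 4], imm8: u8) -> [f32; 4] {
        let sel = |i: u32| ((imm8 >> (2 * i)) & 0b11) as usize;
        [a[sel(0)], a[sel(1)], b[sel(2)], b[sel(3)]]
    }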

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor`
instruction and then use `pshufb` to do the broadcast.

* Review comments

* x64: Add an AVX encoding for the `pshufd` instruction

Unlike `pshufd`, this encoding does not require alignment when working
with a memory operand. Additionally, as I've just learned, it reduces
dependencies between instructions because the `v*` instructions zero the
upper bits as opposed to preserving them, which could accidentally create
false dependencies in the CPU between instructions.

* x64: Add more support for AVX loads/stores

This commit adds VEX-encoded versions of instructions such as
`mov{ss,sd,upd,ups,dqu}` for load and store operations. This also
changes some signatures so the `load` helpers specifically take a
`SyntheticAmode` argument, which ended up requiring a small refactoring of
the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.

* x64: Enable using AVX instructions for zero regs

This commit refactors the internal ISLE helpers for creating zero'd
xmm registers to leverage the AVX support for all other instructions.
This moves away from picking opcodes to instead picking instructions
with a bit of reorganization.

* x64: Remove `XmmConstOp` as an instruction

All existing users can be replaced with usage of the `xmm_uninit_value`
helper instruction so there's no longer any need for these otherwise
constant operations. This additionally reduces manual usage of opcodes
in favor of instruction helpers.

* Review comments

* Update test expectations
2023-03-09 23:57:42 +00:00
Alex Crichton
1c3a1bda6c x64: Add a smattering of lowerings for shuffle specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that
match the pattern that `shufps` supports. The `shufps` instruction
selects two elements from the first vector and two elements from the
second vector which means while it's not generally applicable it should
still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address
at the end of the function, generate a zero with a `pxor`
instruction and then use `pshufb` to do the broadcast.

* Review comments
2023-03-09 22:58:19 +00:00
Alex Crichton
63fb30e4b4 Merge pull request from GHSA-ff4p-7xrq-q5r8
* x64: Remove incorrect `amode_add` lowering rules

This commit removes two incorrect rules as part of the x64 backend's
computation of addressing modes. These two rules folded a zero-extended
32-bit computation into the address mode operand, but this isn't correct:
the 32-bit computation should be truncated to 32 bits, yet when folded
into the address mode computation it happens with 64-bit operands,
meaning the truncation never happens.
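
A small Rust illustration (hypothetical, not the backend code) of why that folding is wrong:

    // Correct: the 32-bit add wraps to 32 bits *before* being widened to 64.
    fn zext_of_add32(base: u64, a: u32, b: u32) -> u64 {
        base.wrapping_add(a.wrapping_add(b) as u64)
    }

    // Incorrectly folded: the add happens with 64-bit operands, so a 32-bit
    // overflow leaks into the upper bits of the computed address.
    fn folded_into_amode(base: u64, a: u32, b: u32) -> u64 {
        base.wrapping_add(a as u64 + b as u64)
    }

With `a = u32::MAX` and `b = 1` the two results differ by exactly 2^32.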

* Add release notes
2023-03-08 13:00:40 -06:00
Alex Crichton
5dc2bbccbb Merge pull request from GHSA-xm67-587q-r2vw
This commit fixes an off-by-one error in the subtraction of indices when
shuffling a vector with itself. Lanes 16 and above are remapped to select
from the first vector, since the first and second operands are the same,
but the subtraction was by 15 rather than 16 by accident.
2023-03-08 13:00:00 -06:00
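
In plain Rust, the intended index remapping looks like this (an illustrative sketch):

    // When both shuffle operands are the same vector, lane indices 16..32
    // (nominally the second operand) are remapped to 0..16 (the first operand).
    fn remap_self_shuffle_lane(i: u8) -> u8 {
        debug_assert!(i < 32);
        if i >= 16 { i - 16 } else { i } // the bug subtracted 15 instead of 16
    }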