Commit Graph

551 Commits

Author SHA1 Message Date
Trevor Elliott
a5698cedf8 cranelift: Remove brz and brnz (#5630)
Remove the brz and brnz instructions, as their behavior is now redundant with brif.
2023-01-30 20:34:56 +00:00
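For the brz/brnz removal above, a minimal sketch in plain Rust of why the two are redundant with brif (this stands in for the CLIF semantics; it is not Cranelift's API):
```
// brif c, block_then, block_else: branch to block_then if c != 0, else to block_else.
fn brif(c: u64, block_then: u32, block_else: u32) -> u32 {
    if c != 0 { block_then } else { block_else }
}

fn main() {
    // A brnz-style "branch if non-zero" keeps its target as the then-block...
    assert_eq!(brif(5, 1, 2), 1);
    // ...while a brz-style "branch if zero" puts its target in the else slot.
    assert_eq!(brif(0, 2, 1), 1);
}
```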
Trevor Elliott
7926808e8e riscv64: improve unordered comparison generated code (#5636)
Improve the generated code for unordered floating point comparisons by negating the comparison and inverting the branches. This allows us to pick the unordered versions, which generate significantly better code.
2023-01-25 17:28:28 -08:00
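For context, a plain-Rust sketch of the equivalence such a rewrite relies on: an unordered predicate is the exact complement of an ordered one, so the backend can emit whichever form the ISA encodes cheaply and invert the branch targets.
```
fn fcmp_ordered_ge(a: f64, b: f64) -> bool {
    a >= b // false if either operand is NaN
}

fn fcmp_unordered_lt(a: f64, b: f64) -> bool {
    // uno_lt(a, b) == !ord_ge(a, b): rather than materializing the unordered
    // result directly, negate the ordered compare and swap the branch targets.
    !fcmp_ordered_ge(a, b)
}

fn main() {
    assert!(fcmp_unordered_lt(1.0, 2.0));
    assert!(fcmp_unordered_lt(f64::NAN, 2.0)); // NaN makes the unordered form true
    assert!(!fcmp_unordered_lt(3.0, 2.0));
}
```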
Trevor Elliott
b58a197d33 cranelift: Add a conditional branch instruction with two targets (#5446)
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.

This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.

Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
2023-01-24 14:37:16 -08:00
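A rough illustration of the layout change in plain Rust, with hypothetical stand-in types (not Cranelift's actual `InstructionData`): keeping a branch's block targets contiguous lets a `branch_destination`-style accessor hand back a slice instead of requiring per-format pattern matching.
```
// Hypothetical stand-ins for ir::Block and ir::BlockCall.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Block(u32);
#[derive(Clone, Copy, Debug, PartialEq)]
struct BlockCall(Block);

enum Inst {
    Jump { dest: [BlockCall; 1] },
    Brif { dests: [BlockCall; 2] }, // then-target, else-target
}

impl Inst {
    // One accessor covers every branch shape: callers just iterate the slice.
    fn branch_destination(&self) -> &[BlockCall] {
        match self {
            Inst::Jump { dest } => dest,
            Inst::Brif { dests } => dests,
        }
    }
}

fn main() {
    let br = Inst::Brif { dests: [BlockCall(Block(1)), BlockCall(Block(2))] };
    assert_eq!(br.branch_destination().len(), 2);
    let j = Inst::Jump { dest: [BlockCall(Block(3))] };
    assert_eq!(j.branch_destination(), &[BlockCall(Block(3))]);
}
```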
Jamey Sharp
fef9f64d2c x86: Test paired udiv/urem (#5573)
Ideally these pairs of CLIF instructions should emit a single x86
instruction, but they don't today. This test will tell us if somebody
fixes that.

Similar tests might make sense for imul/umulhi as well as signed
versions, but I haven't tried that.
2023-01-23 11:44:27 -08:00
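For context (plain Rust, not the emitted code): x86's `div` produces the quotient and remainder together, so a udiv/urem pair over the same operands could ideally share one instruction.
```
fn udiv_urem(a: u64, b: u64) -> (u64, u64) {
    // x86 `div` leaves the quotient in rax and the remainder in rdx,
    // so a udiv/urem pair over the same operands can share one instruction.
    (a / b, a % b)
}

fn main() {
    assert_eq!(udiv_urem(17, 5), (3, 2));
}
```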
yuyang
7e10bd1f58 fix issue #5497 #5524 #5526. (#5595)
* fix issue 5497.

* fix issue 5524

* fix issue 5497 5524 5526.

* Some CLIF changes due to register allocation.
2023-01-20 14:06:26 -08:00
yuyang
299b8187f8 fix issue 5525. (#5603)
* fix issue 5525.

* Register allocation changed.
2023-01-20 09:53:54 -08:00
Afonso Bordado
3ae373b073 cranelift: Disable select rule for i128 types on riscv64 (#5584)
* fuzzgen: Disable some selects for RISC-V

* cranelift: Force disable gen_select_reg rule for i128 values
2023-01-17 10:01:23 -08:00
uint256_t
b00455135e Cranelift: Implement 'iabs' for scalar types on x86_64 (#5527)
* Implement 'iabs' for scalar types on x86_64

* Small fix
2023-01-05 21:33:12 -08:00
Trevor Elliott
b2d5afdf83 riscv64: Implement fcmp in ISLE (#5512)
Rework the compilation of fcmp in the riscv64 backend to be in ISLE, removing the need for the dedicated Fcmp instruction. This change is motivated by #5500, which showed that the riscv64 backend was generating branch instructions in the middle of a basic block.

We can't remove lower_br_fcmp quite yet as it's used in a few places in the emit module, but it's now no longer reachable from the ISLE lowerings.

Fixes #5500
2023-01-04 11:52:00 -08:00
Afonso Bordado
52ba72f341 riscv64: Fix masking on iabs (#5505)
* cranelift: Add `iabs.i128` runtest

* riscv64: Fix incorrect extension in iabs

When lowering iabs, we were accidentally comparing the unextended value;
this caused the instruction to misbehave when certain top bits were set.

This commit also adds a zbb lowering that does not use jumps.
2023-01-03 17:37:25 -08:00
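A plain-Rust illustration of the kind of bug described, with illustrative values (the actual riscv64 lowering differs): testing the sign of the unextended register value looks at the wrong bit for narrow types.
```
// iabs semantics for i8.
fn iabs_i8(x: i8) -> i8 {
    if x < 0 { x.wrapping_neg() } else { x }
}

// A buggy variant that tests the *unextended* 32-bit register value:
// the i8 bits sit in the low byte, so e.g. -1 (0xFF) looks positive.
fn iabs_i8_buggy(reg: u32) -> i8 {
    let negate = (reg as i32) < 0; // wrong: should look at bit 7, not bit 31
    let x = reg as u8 as i8;
    if negate { x.wrapping_neg() } else { x }
}

fn main() {
    assert_eq!(iabs_i8(-1), 1);
    assert_eq!(iabs_i8_buggy(0x0000_00FF), -1); // disagrees: sign extension is missing
}
```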
Afonso Bordado
7e94704264 riscv64: Add masking for small types when lowering select (#5504)
When lowering `select+icmp` we have an optimization that allows us to
avoid materializing the icmp result.

We were accidentally not masking the high bits for i8 and i16 in this case.

Issue #5498 reported this as an illegal instruction, but what was actually
happening was that the invalid select result caused a division by zero.
2023-01-03 19:59:14 +00:00
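Roughly what the missing masking means, in illustrative plain Rust: an i8 operand held in a wider register must have its upper bits cleared before the comparison feeding the select, or leftover high bits flip the result.
```
// An i8 comparison performed on a 64-bit register.
fn icmp_eq_i8(reg_a: u64, reg_b: u64) -> bool {
    // Mask down to the type's width before comparing.
    (reg_a & 0xFF) == (reg_b & 0xFF)
}

fn main() {
    // Same i8 value (0x01), but different garbage in the upper bits.
    let a = 0xAB01u64;
    let b = 0x0001u64;
    assert!(icmp_eq_i8(a, b)); // correct with masking
    assert_ne!(a, b);          // a whole-register compare would disagree
}
```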
Chris Fallin
03463458e4 Cranelift: fix branch-of-icmp/fcmp regression: look through uextend. (#5487)
In #5031, we removed `bool` types from CLIF, using integers instead for
"truthy" values. This greatly simplified the IR, and was generally an
improvement.

However, because x86's `SETcc` instruction sets only the low 8 bits of a
register, we chose to use `i8` types as the result of `icmp` and `fcmp`,
to avoid the need for a masking operation when materializing the result.

Unfortunately this means that uses of truthy values often now have
`uextend` operations, especially when coming from Wasm (where truthy
values are naturally `i32`-typed). For example, where we previously had
`(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`.

It's arguable whether or not we should switch to `i32` truthy values --
in most cases we can avoid materializing a value that's immediately used
for a branch or select, so a mask would in most cases be unnecessary,
and it would be a win at the IR level -- but irrespective of that, this
change *did* regress our generated code quality: our backends had
patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we
were *always* materializing truthy values. Many blocks thus ended with
"cmp; setcc; cmp; test; branch" rather than "cmp; branch".

In #5391 we noticed this and fixed it on x64, but it was a general
problem on aarch64 and riscv64 as well. This PR introduces a
`maybe_uextend` extractor that "looks through" uextends, and uses it
where we consume truthy values, thus fixing the regression.  This PR
also adds compile filetests to ensure we don't regress again.

The riscv64 backend has not been updated here because doing so appears
to trigger another issue in its branch handling; fixing that is TBD.
2022-12-22 01:43:44 -08:00
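The "look through uextend" idea, sketched over a toy IR in plain Rust (the real implementation is an ISLE extractor named `maybe_uextend`):
```
enum Node {
    Icmp { lhs: i64, rhs: i64 },   // produces an i8 truthy value
    Uextend(Box<Node>),            // widens a truthy value, e.g. i8 -> i32
    Other,
}

// Return the icmp feeding a branch condition, skipping a widening uextend,
// so the backend can fuse "cmp; branch" instead of materializing the bool.
fn maybe_uextend(cond: &Node) -> Option<&Node> {
    match cond {
        Node::Uextend(inner) => maybe_uextend(inner),
        Node::Icmp { .. } => Some(cond),
        Node::Other => None,
    }
}

fn main() {
    let cond = Node::Uextend(Box::new(Node::Icmp { lhs: 1, rhs: 2 }));
    match maybe_uextend(&cond) {
        Some(Node::Icmp { lhs, rhs }) => println!("fuse compare of {lhs} and {rhs} into the branch"),
        _ => println!("materialize the truthy value"),
    }
}
```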
Nick Fitzgerald
c0b587ac5f Remove heaps from core Cranelift, push them into cranelift-wasm (#5386)
* cranelift-wasm: translate Wasm loads into lower-level CLIF operations

Rather than using `heap_{load,store,addr}`.

* cranelift: Remove the `heap_{addr,load,store}` instructions

These are now legalized in the `cranelift-wasm` frontend.

* cranelift: Remove the `ir::Heap` entity from CLIF

* Port basic memory operation tests to .wat filetests

* Remove test for verifying CLIF heaps

* Remove `heap_addr` from replace_branching_instructions_and_cfg_predecessors.clif test

* Remove `heap_addr` from readonly.clif test

* Remove `heap_addr` from `table_addr.clif` test

* Remove `heap_addr` from the simd-fvpromote_low.clif test

* Remove `heap_addr` from simd-fvdemote.clif test

* Remove `heap_addr` from the load-op-store.clif test

* Remove the CLIF heap runtest

* Remove `heap_addr` from the global_value.clif test

* Remove `heap_addr` from fpromote.clif runtests

* Remove `heap_addr` from fdemote.clif runtests

* Remove `heap_addr` from memory.clif parser test

* Remove `heap_addr` from reject_load_readonly.clif test

* Remove `heap_addr` from reject_load_notrap.clif test

* Remove `heap_addr` from load_readonly_notrap.clif test

* Remove `static-heap-without-guard-pages.clif` test

Will be subsumed when we port `make-heap-load-store-tests.sh` to generating
`.wat` tests.

* Remove `static-heap-with-guard-pages.clif` test

Will be subsumed when we port `make-heap-load-store-tests.sh` over to `.wat`
tests.

* Remove more heap tests

These will be subsumed by porting `make-heap-load-store-tests.sh` over to `.wat`
tests.

* Remove `heap_addr` from `simple-alias.clif` test

* Remove `heap_addr` from partial-redundancy.clif test

* Remove `heap_addr` from multiple-blocks.clif test

* Remove `heap_addr` from fence.clif test

* Remove `heap_addr` from extends.clif test

* Remove runtests that rely on heaps

Heaps are not a thing in CLIF or the interpreter anymore

* Add generated load/store `.wat` tests

* Enable memory-related wasm features in `.wat` tests

* Remove CLIF heap from fcmp-mem-bug.clif test

* Add a mode for compiling `.wat` all the way to assembly in filetests

* Also generate WAT to assembly tests in `make-load-store-tests.sh`

* cargo fmt

* Reinstate `f{de,pro}mote.clif` tests without the heap bits

* Remove undefined doc link

* Remove outdated SVG and dot file from docs

* Add docs about `None` returns for base address computation helpers

* Factor out `env.heap_access_spectre_mitigation()` to a local

* Expand docs for `FuncEnvironment::heaps` trait method

* Restore f{de,pro}mote+load clif runtests with stack memory
2022-12-15 00:26:45 +00:00
Nick Fitzgerald
be710df237 Cranelift: Add .wat to assembly test support and generate Wasm load/store tests for all ISAs (#5439)
* cranelift-filetest: Add the ability to test `.wat` to assembly

* Make the load/store test case generator script use `.wat` tests

And generate tests that exercise both Wasm-to-CLIF lowering and Wasm all the way
to assembly.

* Remove old versions of generated load/store tests

* Add new generated load/store tests

* Fix filename reference in script
2022-12-14 21:13:43 +00:00
Trevor Elliott
9dc4f1a83c s390x: Move the value out of the casloop_val_reg with mov_preg (#5430)
The casloop_emit function in the s390x backend was using the fixed non-allocatable register %r0 directly with move instructions, which produced a panic in the regalloc2 checker (#5425). This PR changes the casloop_result function to use mov_preg instead of copy_reg to fetch the result, since mov_preg is not viewed by regalloc2 as a move.

Fixes #5425
2022-12-14 13:06:35 -08:00
Ulrich Weigand
df923f18ca Remove MachInst::gen_constant (#5427)
* aarch64: constant generation cleanup

Add support for MOVZ and MOVN generation via ISLE.
Handle f32const, f64const, and nop instructions via ISLE.
No longer call Inst::gen_constant from lower.rs.

* riscv64: constant generation cleanup

Handle f32const, f64const, and nop instructions via ISLE.

* s390x: constant generation cleanup

Fix rule priorities for "imm" term.
Only handle 32-bit stack offsets; no longer use load_constant64.

* x64: constant generation cleanup

No longer call Inst::gen_constant from lower.rs or abi.rs.

* Refactor LowerBackend::lower to return InstOutput

No longer write to the per-insn output registers; instead, return
an InstOutput vector of temp registers holding the outputs.

This will allow calling LowerBackend::lower multiple times for
the same instruction, e.g. to rematerialize constants.

When emitting the primary copy of the instruction during lowering,
writing to the per-insn registers is now done in lower_clif_block.

As a result, the ISLE lower_common routine is no longer needed.
In addition, the InsnOutput type and all code related to it
can be removed as well.

* Refactor IsleContext to hold a LowerBackend reference

Remove the "triple", "flags", and "isa_flags" fields that are
copied from LowerBackend to each IsleContext, and instead just
hold a reference to LowerBackend in IsleContext.

This will allow calling LowerBackend::lower from within callbacks
in src/machinst/isle.rs, e.g. to rematerialize constants.

To avoid having to pass LowerBackend references through multiple
functions, eliminate the lower_insn_to_regs subroutines in those
targets that still have them, and just inline into the main
lower routine.  This also eliminates lower_inst.rs on aarch64
and riscv64.

Replace all accesses to the removed IsleContext fields by going
through the LowerBackend reference.

* Remove MachInst::gen_constant

This addresses the problem described in issue
https://github.com/bytecodealliance/wasmtime/issues/4426
that targets currently have to duplicate code to emit
constants between the ISLE logic and the gen_constant
callback.

After the various cleanups in earlier patches in this series,
the only remaining user of get_constant is put_value_in_regs
in Lower.  This can now be removed, and instead constant
rematerialization can be performed in the put_in_regs ISLE
callback by simply directly calling LowerBackend::lower
on the instruction defining the constant (using a different
output register).

Since the check for egraph mode is now no longer performed in
put_value_in_regs, the Lower::flags member becomes obsolete.

Care needs to be taken that other calls directly to the
Lower::put_value_in_regs routine now handle the fact that
no more rematerialization is performed.  All such calls in
target code already historically handle constants themselves.
The remaining call site in the ISLE gen_call_common helper
can be redirected to the ISLE put_in_regs callback.

The existing target implementations of gen_constant are then
unused and can be removed.  (In some target there may still
be further opportunities to remove duplication between ISLE
and some local Rust code - this can be left to future patches.)
2022-12-13 13:00:04 -08:00
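A heavily simplified, hypothetical sketch of the rematerialization idea in plain Rust (names are illustrative; the real code re-invokes `LowerBackend::lower` on the constant's defining instruction):
```
#[derive(Clone, Copy)]
enum Def {
    Const(i64),
    Computed, // defined by some non-constant instruction
}

struct Lowering {
    next_temp: u32,
}

impl Lowering {
    // Hypothetical stand-in for lowering one defining instruction into a fresh temp.
    fn lower_into_temp(&mut self, def: Def) -> u32 {
        let temp = self.next_temp;
        self.next_temp += 1;
        match def {
            Def::Const(c) => println!("v{temp} = iconst {c}   ; rematerialized here"),
            Def::Computed => println!("v{temp} = <lowered instruction>"),
        }
        temp
    }

    // put_in_regs: constants are re-lowered near the use rather than
    // flowing through a separate gen_constant path.
    fn put_in_regs(&mut self, def: Def, existing: Option<u32>) -> u32 {
        match (def, existing) {
            (Def::Const(_), _) => self.lower_into_temp(def),
            (_, Some(reg)) => reg,
            (_, None) => self.lower_into_temp(def),
        }
    }
}

fn main() {
    let mut cx = Lowering { next_temp: 0 };
    let a = cx.put_in_regs(Def::Const(42), Some(7)); // rematerializes into a new temp
    let b = cx.put_in_regs(Def::Computed, Some(7));  // reuses the existing register
    assert_ne!(a, b);
}
```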
Chris Fallin
92ce79366c riscv64: remove valueregs_2_reg extractor. (#5426)
This extractor had a side-effect of invoking `put_in_regs`, which is not
supposed to be invoked until the pattern-matching commits to evaluating
a rule right-hand side (i.e., cannot backtrack). In this case the
side-effect was mostly benign (in theory it could have caused additional
values to be computed needlessly), but in general we should be careful
to keep side-effects out of the left-hand side to enable further
optimizations and work on islec.

The implicit conversion from `Value` to `Reg` turns out to be enough to
make the rules in question work, so we can simply remove the use of the
extractor in this case.
2022-12-13 11:47:20 -08:00
Trevor Elliott
a5ecb5e647 x64: Share a zero in the ushr translation on x64 to free up a register (#5424)
Share a zero value in the translation of ushr for i128. This increases the lifetime of the value by a few instructions, and reduces the number of registers used in the translation by one, which seems like an acceptable trade-off.
2022-12-12 18:15:43 -08:00
Ulrich Weigand
e913cf3647 Remove IFLAGS/FFLAGS types (#5406)
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed.  This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.

Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
  machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
  isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.

All of these features are currently unused; no functional change
intended by this patch.

This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
2022-12-09 13:42:03 -08:00
Jamey Sharp
8bbd9bb228 aarch64: Test instruction selection for bmask (#5396)
I copied the `bmask` precise-output tests from x64 and used
CRANELIFT_TEST_BLESS=1 to generate this test.

I don't know aarch64 well enough to decide if this output is correct.
However, for I128 it is identical to the previous I128-only
precise-output tests, and the existing runtests for bmask pass on
aarch64, so I think it's likely correct.
2022-12-08 10:22:23 -08:00
Trevor Elliott
c5379051c4 Enable the ssa verifier in debug builds (#5354)
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
2022-12-07 12:22:51 -08:00
Trevor Elliott
293bb5b334 riscv64: Only emit jumps at the end of basic blocks (#5381)
This PR fixes two bugs in the riscv64 backend, where branch instructions were emitted in the middle of a basic block:

* Constant emission, where the constants are inlined into the vcode and are jumped over at runtime.
* The BrTableCheck pseudo-instruction, which was always emitted before a BrTable instruction and would handle jumping to the default label.

The first bug was resolved by introducing two new pseudo-instructions, LoadConst32 and LoadConst64. Both of these instructions serve to delay the original encoding to emission time, after regalloc2 has run.

The second bug was fixed by removing the BrTableCheck instruction. As it was always emitted directly before BrTable, it was easier to remove it and merge the two into a single instruction.
2022-12-06 10:54:10 -08:00
Trevor Elliott
353a681671 Avoid reusing a register during constant loading (#5379)
Avoid reusing a register when loading a constant, allocating a temporary instead.
2022-12-05 18:37:53 -08:00
Trevor Elliott
7d28d586da riscv64: Don't reuse registers when loading constants (#5376)
Rework the constant loading functions in the riscv64 backend to generate fresh temporaries instead of reusing the destination register.
2022-12-05 16:51:52 -08:00
Trevor Elliott
817c2b205c riscv64: Use a temporary when translating shift amount (#5375)
Use a temporary when translating the shift amount, instead of reusing the destination register.
2022-12-05 20:54:14 +00:00
Trevor Elliott
b475b9bd19 Terminate blocks with a single branch in riscv64 (#5374)
Ensure that we're terminating blocks with a single branch instruction, when testing I128 values against zero.
2022-12-05 20:13:28 +00:00
Trevor Elliott
6aea8e0d7e Don't reuse destination registers when lowering splat on aarch64 (#5370) 2022-12-05 08:18:49 -08:00
Trevor Elliott
2e9b0802ab aarch64: Rework amode compilation to produce SSA code (#5369)
Rework the compilation of amodes in the aarch64 backend to stop reusing registers and instead generate fresh virtual registers for intermediates. This resolves some SSA checker violations with the aarch64 backend, and as a nice side-effect removes some unnecessary movs in the generated code.
2022-12-02 01:23:15 +00:00
Trevor Elliott
d54a27d0ea Allocate temporary intermediates when loading constants on aarch64 (#5366)
As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.
2022-12-01 22:29:36 +00:00
Nick Fitzgerald
79f7fa6079 Cranelift: implement heap_{load,store} instruction legalization (#5351)
* Cranelift: implement `heap_{load,store}` instruction legalization

This does not remove `heap_addr` yet, but it does factor out the common
bounds-check-and-compute-the-native-address functionality that is shared between
all of `heap_{addr,load,store}`.

Finally, this adds a missing optimization for when we can dedupe explicit bounds
checks for static memories and Spectre mitigations.

* Cranelift: Enable `heap_load_store_*` run tests on all targets
2022-11-30 19:12:49 +00:00
Alex Crichton
830885383f Implement inline stack probes for AArch64 (#5353)
* Turn off probestack by default in Cranelift

The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.

* aarch64: Improve codegen for AMode fallback

Currently the final fallback for finalizing an `AMode` will generate both a
constant-loading instruction and an `add` of the base register into the same
temporary. This commit improves the codegen by removing the `add` instruction
and folding the final add into the finalized `AMode`. This changes the
`extendop` used, but both registers are 64-bit, so the result shouldn't be
affected by the extending operation.

* aarch64: Implement inline stack probes

This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64, except
that unlike x64 the stack pointer isn't modified during the unrolled probes,
which avoids needing to re-adjust it back up at the end.

* Enable inline probestack for AArch64 and Riscv64

This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.

* Enable probestack for aarch64 in cranelift-fuzzgen

* Address review comments

* Remove implicit stack overflow traps from x64 backend

This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.

This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.

* Fix a style issue
2022-11-30 12:30:00 -06:00
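The unroll-versus-loop shape described above, sketched in plain Rust; the threshold and page size here are illustrative, not the values the backend uses.
```
const PAGE_SIZE: u64 = 4096;     // illustrative
const UNROLL_THRESHOLD: u64 = 3; // illustrative

// Produce a textual trace of the probes a prologue would perform for a frame
// of `frame_size` bytes: one touch per page, unrolled for small frames,
// a loop pseudo-op for large ones.
fn emit_probes(frame_size: u64) -> Vec<String> {
    let pages = frame_size / PAGE_SIZE;
    if pages <= UNROLL_THRESHOLD {
        // Unrolled: probe each page at a fixed offset below sp, without
        // adjusting sp, so nothing has to be re-adjusted afterwards.
        (1..=pages)
            .map(|i| format!("store zero, [sp - {}]", i * PAGE_SIZE))
            .collect()
    } else {
        vec![format!("probe_loop pages={pages} step={PAGE_SIZE}")]
    }
}

fn main() {
    assert_eq!(emit_probes(8192).len(), 2);    // unrolled form
    assert_eq!(emit_probes(1 << 20).len(), 1); // loop form
}
```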
Trevor Elliott
368004428a Fix rule shadowing instances in x64 and aarch64 backends (#5334)
Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.
2022-11-28 15:48:34 -08:00
Trevor Elliott
54cfa4df34 cranelift: Fix implicit pointer argument register use (#5301)
* Fix arg handling to write to VRegs instead of physical regs

* Make is_included_in_clobbers required, and handle Args on x64 and riscv64
2022-11-18 16:47:03 -08:00
Trevor Elliott
07bd8bf34a Remove unnecessary moves in x64 gen_memcpy (#5277)
Remove some unnecessary moves in the x64 gen_memcpy implementation -- the call instruction that's generated will already constrain the args to those registers.
2022-11-16 10:33:00 -08:00
Nick Fitzgerald
9967782726 Cranelift(Aarch64): Optimize lowering of icmps with immediates (#5252)
We can encode more constants into 12-bit immediates if we do the following
rewrite for comparisons with odd constants:

        A >= B + 1
    ==> A - 1 >= B
    ==> A > B
2022-11-15 09:18:55 -08:00
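A concrete instance of the rewrite in plain Rust, with an example constant (not taken from the commit): `A >= C` becomes `A > C - 1`, which may fit AArch64's 12-bit compare immediate when `C` itself does not.
```
fn main() {
    let c: u64 = 0x1001; // odd constant; 0x1000 encodes more cheaply
    for a in [0x0FFFu64, 0x1000, 0x1001, 0x2000] {
        // A >= C  <=>  A > C - 1  (for unsigned integers with C >= 1)
        assert_eq!(a >= c, a > c - 1);
    }
}
```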
Nick Fitzgerald
c2a7ea7e24 Cranelift: de-duplicate bounds checks in legalizations (#5190)
* Cranelift: Add the `DataFlowGraph::display_value_inst` convenience method

* Cranelift: Add some `trace!` logs to some parts of legalization

* Cranelift: de-duplicate bounds checks in legalizations

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks plus spectre mitigations,
relative to virtual memory guard pages with spectre mitigations, from ~1.9-2.1x
to ~1.6-1.8x. That is about a 14-19% speed-up when dynamic memories and spectre
mitigations are enabled.

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so
```

</details>

* Update test expectations

* Add a test for deduplicating bounds checks between dynamic memories and spectre mitigations

* Define a struct for the Spectre comparison instead of using a tuple

* More trace logging for heap legalization
2022-11-15 08:47:22 -08:00
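The shape of the de-duplication in illustrative plain Rust (the real legalization emits CLIF): one `index + access_size > bound` comparison both drives the explicit trap and selects the Spectre-safe address.
```
fn legalized_heap_load(heap: &[u8], base: usize, bound: usize, index: usize) -> Option<u8> {
    // One comparison, used twice.
    let oob = index + 1 > bound; // access_size = 1 here

    // 1) Explicit bounds check for dynamic memories.
    if oob {
        return None; // stands in for a heap_oob trap
    }
    // 2) Spectre mitigation: select a null-ish address when out of bounds,
    //    reusing the same comparison result instead of a second compare.
    let addr = if oob { 0 } else { base + index };
    Some(heap[addr])
}

fn main() {
    let heap = vec![7u8; 16];
    assert_eq!(legalized_heap_load(&heap, 0, 16, 3), Some(7));
    assert_eq!(legalized_heap_load(&heap, 0, 16, 16), None);
}
```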
Trevor Elliott
dece901d16 Use regalloc constraints for sse blend operations (#5251)
Instead of using xmm0 explicitly for the mask argument to instructions like blendvpd, use regalloc constraints to constrain it to xmm0.
2022-11-14 16:44:34 -08:00
Trevor Elliott
0367fbc2d4 cranelift: Rework pinned register lowering (#5249)
Rework pinned register lowering to avoid the use of pinned virtual registers, instead using the MovFromPReg and MovToPReg pseudo instructions.
2022-11-10 16:19:25 -08:00
Nick Fitzgerald
fc62d4ad65 Cranelift: Make heap_addr return calculated base + index + offset (#5231)
* Cranelift: Make `heap_addr` return calculated `base + index + offset`

Rather than return just the `base + index`.

(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)

This move the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.

Before this commit, we were effectively doing

    load(spectre_guard(base + index) + offset)

Now we are effectively doing

    load(spectre_guard(base + index + offset))

Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
https://github.com/bytecodealliance/wasmtime/pull/5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.

* Update heap_addr docs

* Factor out `offset + size` to a helper
2022-11-09 19:53:51 +00:00
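The difference between the two formulas above, in plain-Rust terms (a sketch of the semantics, not the emitted code): folding the static `offset` inside the guard means a misspeculated access cannot land at `null + offset`.
```
fn spectre_guard(addr: u64, oob: bool) -> u64 {
    // Conditional move: a bogus (null) address when out of bounds.
    if oob { 0 } else { addr }
}

fn main() {
    let (base, index, offset) = (0x1_0000u64, 0x10u64, 0x8u64);
    let oob = true; // pretend the bounds check failed

    // Before: the offset is added *after* the guard, so a speculated access
    // could still read within the first `offset` bytes of memory.
    let old = spectre_guard(base + index, oob) + offset;
    // After: the full address is guarded.
    let new = spectre_guard(base + index + offset, oob);

    assert_eq!(old, 0x8);
    assert_eq!(new, 0);
}
```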
Trevor Elliott
b077854b57 Generate SSA code from returns (#5172)
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.
2022-11-08 16:00:49 -08:00
Ulrich Weigand
3e5938e65a Support big- and little-endian lane order with bitcast (#5196)
Add a MemFlags operand to the bitcast instruction, where only the
`big` and `little` flags are accepted.  These define the lane order
to be used when casting between types of different lane counts.

Update all users to pass an appropriate MemFlags argument.

Implement lane swaps where necessary in the s390x back-end.

This is the final part necessary to fix
https://github.com/bytecodealliance/wasmtime/issues/4566.
2022-11-07 14:41:10 -08:00
Ulrich Weigand
137a8b710f Move bitselect->vselect optimization to x64 back-end (#5191)
The simplifier was performing an optimization to replace bitselect
with vselect if all bytes of the condition mask could be shown
to be either all ones or all zeros.

This optimization only ever made any difference in codegen on the
x64 target.  Therefore, move this optimization to the x64 back-end
and perform it in ISLE instead.  Resulting codegen should be
unchanged, with slightly improved compile time.

This also eliminates a few endian-dependent bitcast operations.
2022-11-03 20:17:36 +00:00
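Why the rewrite needs that condition, sketched per mask byte in plain Rust: `bitselect` chooses individual bits, `vselect` chooses whole lanes, and the two agree exactly when every mask byte is all ones or all zeros.
```
fn bitselect(mask: u8, a: u8, b: u8) -> u8 {
    (a & mask) | (b & !mask)
}

fn vselect(mask: u8, a: u8, b: u8) -> u8 {
    if mask != 0 { a } else { b } // lane-granular choice (approximation)
}

fn main() {
    // All-ones / all-zeros mask bytes: the two operations coincide.
    for mask in [0x00u8, 0xFF] {
        assert_eq!(bitselect(mask, 0xAB, 0xCD), vselect(mask, 0xAB, 0xCD));
    }
    // A mixed mask byte: they differ, so the rewrite would be invalid.
    assert_ne!(bitselect(0x0F, 0xAB, 0xCD), vselect(0x0F, 0xAB, 0xCD));
}
```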
Afonso Bordado
3ef30b5b67 cranelift: Rename i{min,max} to s{min,max} (#5187)
This brings these instructions with our general naming convention
of signed instructions being prefixed with `s`.
2022-11-03 18:20:33 +00:00
Trevor Elliott
aeceea28e2 Remove trapif and trapff (#5162)
This branch removes the trapif and trapff instructions, in favor of using an explicit comparison and trapnz. This moves us closer to removing iflags and fflags, but introduces the need to implement instructions like iadd_cout in the x64 and aarch64 backends.
2022-11-03 09:25:11 -07:00
Ulrich Weigand
961107ec63 Merge raw_bitcast and bitcast (#5175)
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends

This implements the second step of the plan outlined here:
https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394
2022-11-02 10:16:27 -07:00
Trevor Elliott
09d8df6fab Switch to x64_rbp to avoid the use of a pinned register (#5168)
Avoid a use of preg_rbp in the x64 backend, using x64_rbp instead.
2022-11-01 13:23:33 -07:00
11evan
4ca9e82bd1 cranelift: Add Bswap instruction (#1092) (#5147)
Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.

* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instructions (Neg and Not) - it
only operates on a dst register, but is parameterized with both
a src and dst which are expected to be the same register.

As the x64 bswap instruction only operates on 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.

Updated the x64 RexFlags type to support emitting single-operand
instructions like bswap.

* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.

* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing

* For completeness, added bswap to the interpreter as well.

* added filetests and runtests for each ISA

* added bswap to fuzzgen, thanks to afonso360 for the code there

* 128-bit swaps are not yet implemented, that can be done later
2022-10-31 19:30:00 +00:00
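The 16-bit case mentioned above, in plain Rust: byte-swapping a 16-bit value is the same as rotating it by 8, which is why a rotate can stand in where no 16-bit bswap encoding exists.
```
fn main() {
    let x: u16 = 0x1234;
    assert_eq!(x.swap_bytes(), 0x3412);
    assert_eq!(x.rotate_left(8), x.swap_bytes()); // rotate by 8 == 16-bit bswap
}
```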
Afonso Bordado
2fb76be2e4 x64: Add bmask implementation (#5148) 2022-10-28 17:17:22 -07:00
Afonso Bordado
e8f3d03bbe cranelift: Mask high bits on bmask for types smaller than a register (#5118)
* aarch64: Fix incorrect masking for small types on bmask

`bmask` was accidentally relying on the uppermost bits of the register
for small types.

This was found by fuzzgen when it generated a shift left followed by
a bmask: the shift left moved bits out of the range of the input
type (i8), but these bits are not automatically cleared since they
remain inside the 32 bits of the register.

That caused issues when the bmask tried to compare the whole register
instead of just the bottom bits. The solution here is to mask the upper
bits for small types.

* aarch64: Emit 32bit cmp on bmask

This fixes an issue where bmask was accidentally comparing the
upper bits of the register by always using a 64bit cmp.

* riscv: Mask high bits in bmask

* riscv: Add compile tests for br{z,nz}

* riscv: Use shifts to mask 32bit values

This produces less code than the AND since that version needs to
load an immediate constant from memory.

* cranelift: Update test input to hexadecimal values

This makes it a bit more clear what is being tested.

* riscv: Use addiw for masking 32 bit values

Co-authored-by: Trevor Elliott <telliott@fastly.com>

* aarch64: Update bmask rule priority

Co-authored-by: Trevor Elliott <telliott@fastly.com>
2022-10-27 09:45:39 -07:00
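`bmask` semantics and the masking fix, in illustrative plain Rust for an i8 operand held in a 32-bit register (not the emitted instructions):
```
// bmask: all ones if the input is non-zero, all zeros otherwise.
fn bmask_i8(reg: u32) -> i8 {
    // Mask down to the type's width first; stale bits above bit 7
    // (e.g. left over from a shift) must not influence the compare.
    if (reg & 0xFF) != 0 { -1 } else { 0 }
}

fn main() {
    // An i8 value of 0 whose register still carries shifted-out bits.
    let reg = 0x0000_0100u32;
    assert_eq!(bmask_i8(reg), 0); // correct: the i8 value is zero
    assert_ne!(reg, 0);           // a whole-register compare would say non-zero
}
```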
Trevor Elliott
02620441c3 Add uadd_overflow_trap (#5123)
Add a new instruction uadd_overflow_trap, which is a fused version of iadd_ifcout and trapif. Adding this instruction removes a dependency on the iflags type, and would allow us to move closer to removing it entirely.

The instruction is defined for the i32 and i64 types only, and is currently only used in the legalization of heap_addr.
2022-10-27 09:43:15 -07:00
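The fused semantics in plain-Rust terms (a sketch; the real instruction traps with a trap code rather than panicking):
```
// uadd_overflow_trap: unsigned add that traps on overflow, replacing the
// separate iadd_ifcout (add, set carry) + trapif (trap on that flag) pair.
fn uadd_overflow_trap(a: u32, b: u32) -> u32 {
    let (sum, overflowed) = a.overflowing_add(b);
    if overflowed {
        panic!("trap: unsigned add overflow"); // stands in for the trap
    }
    sum
}

fn main() {
    assert_eq!(uadd_overflow_trap(1, 2), 3);
    // uadd_overflow_trap(u32::MAX, 1) would trap.
}
```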