Commit Graph

1806 Commits

Author SHA1 Message Date
Chris Fallin
8c55b81300 Optimizations to egraph framework (#5391)
* Optimizations to egraph framework:

- Save elaborated results by canonical value, not latest value (union
  value). Previously we were artificially skipping and re-elaborating
  some values we already had because we were not finding them in the
  map.

- Make some changes to handling of icmp results: when icmp became
  I8-typed (when bools went away), many uses became `(uextend $I32 (icmp
  $I8 ...))`, and so patterns in lowering backends were no longer
  matching.

  This PR includes an x64-specific change to match `(brz (uextend (icmp
  ...)))` and similarly for `brnz`, but it also takes advantage of the
  ability to write rules easily in the egraph mid-end to rewrite selects
  with icmp inputs appropriately.

- Extend constprop to understand selects in the egraph mid-end.

With these changes, bz2.wasm sees a ~1% speedup, and spidermonkey.wasm
with a fib.js input sees a 16.8% speedup:

```
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.base.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  2.14s user 0.01s system 99% cpu 2.148 total
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.egraphs.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  1.78s user 0.01s system 99% cpu 1.788 total
```

* Review feedback.
2022-12-07 13:23:13 -08:00
Trevor Elliott
c5379051c4 Enable the ssa verifier in debug builds (#5354)
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
2022-12-07 12:22:51 -08:00
Nick Fitzgerald
f0c4b6f3a1 Cranelift: Implement iadd_cout on x64 for 32- and 64-bit integers (#5285)
* Split the `iadd_cout` runtests by type

* Implement `iadd_cout` for 32- and 64-bit values on x64

* Delete trailing whitespace in `riscv/lower.isle`
2022-12-07 19:54:14 +00:00
Jamey Sharp
29b23d41b6 ISLE rule cleanups (#5389)
* cranelift-codegen: Use ISLE matching, not same_value

The `same_value` function just wrapped an equality test into an external
constructor, but we can do that with ISLE's equality constraints
instead.

* riscv64: Remove custom condition-code tests

The `lower_icmp` term exists solely to decide whether to sign-extend or
zero-extend the comparison operands, based on whether the condition code
requires a signed comparison. It additionally tested whether the
condition code was == or !=, but produced the same result as for other
unsigned comparisons.

We already have `signed_cond_code` in the ISLE prelude, which classifies
the total-ordering condition codes according to whether they're signed.
It also lumps == and != in the "unsigned" camp, as desired.

So this commit uses the existing method from the prelude instead of
riscv64-local definitions.

Because this version has no constraints on the left-hand side of the
rule in the unsigned case, ISLE generates Rust that always returns
`Some`. That shows that the current use of `unwrap` is justified, at the
only Rust-side call-site of `constructor_lower_icmp`, which is in
cranelift/codegen/src/isa/riscv64/lower/isle.rs.

* ISLE prelude: make offset32 infallible

This extractor always returns `Some`, so it doesn't need to be fallible.
2022-12-07 02:55:59 +00:00
Chris Fallin
f980defe17 egraph support: rewrite to work in terms of CLIF data structures. (#5382)
* egraph support: rewrite to work in terms of CLIF data structures.

This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.

The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one.  This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.

Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).

Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.

The implementation performs two passes overall:

- One, a forward pass in RPO (to see defs before uses), that (i) removes
  "pure" instructions from the layout and (ii) optimizes as it goes. As
  before, we eagerly optimize, so we form the entire union of optimized
  forms of a value before we see any uses of that value. This lets us
  rewrite uses to use the most "up-to-date" form of the value and
  canonicalize and optimize that form.

  The eager rewriting and acyclic representation make each other work
  (we could not eagerly rewrite if there were cycles; and acyclicity
  does not miss optimization opportunities only because the first time
  we introduce a value, we immediately produce its "best" form). This
  design choice is also what allows us to avoid the "parent pointers"
  and fixpoint loop of traditional egraphs.

  This forward optimization pass keeps a scoped hashmap to "intern"
  nodes (thus performing GVN), and also interleaves on a per-instruction
  level with alias analysis. The interleaving with alias analysis allows
  alias analysis to see the most optimized form of each address (so it
  can see equivalences), and allows the next value to see any
  equivalences (reuses of loads or stored values) that alias analysis
  uncovers.

- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
  back into the layout, possibly in multiple places if needed. This
  tracks the loop nest and hoists nodes as needed, performing LICM as it
  goes. Note that by doing this in forward order, we avoid the
  "fixpoint" that traditional LICM needs: we hoist a def before its
  uses, so when we place a node, we place it in the right place the
  first time rather than moving later.

This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.

On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).

Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* Review feedback.

* Review feedback.

* Review feedback.

* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2022-12-06 14:58:57 -08:00
Trevor Elliott
293bb5b334 riscv64: Only emit jumps at the end of basic blocks (#5381)
This PR fixes two bugs in the riscv64 backend, where branch instructions were emitted in the middle of a basic block:

Constant emission, where the constants are inlined into the vcode and are jumped over at runtime,
The BrTableCheck pseudo-instruction, which was always emitted before a BrTable instruction, and would handle jumping to the default label.
The first bug was resolved by introducing two new psuedo instructions, LoadConst32 and LoadConst64. Both of these instructions serve to delay the original encoding to emission time, after regalloc2 has run.

The second bug was fixed by removing the BrTableCheck instruction. As it was always emitted directly before BrTable, it was easier to remove it and merge the two into a single instruction.
2022-12-06 10:54:10 -08:00
Chris Fallin
feaa7ca75f Alias analysis: refactor for use by other driver loops. (#5380)
* Alias analysis: refactor for use by other driver loops.

This PR pulls the core of the alias analysis infrastructure into a
`process_inst()` method that operates on a single instruction, and
allows another compiler pass to apply store-to-load forwarding and
redundant-load elimination interleaved with other work. The existing
behavior remains unchanged; the pass's toplevel loop calls this
extracted method.

This refactor is a prerequisite for using the alias analysis as part of
a refactored egraph-based optimization framework.

* Review feedback: remove unneeded mut.
2022-12-06 18:30:02 +00:00
Trevor Elliott
353a681671 Avoid reusing a register during constant loading (#5379)
Avoid reusing a register when loading a constant, allocating a temporary instead.
2022-12-05 18:37:53 -08:00
Trevor Elliott
7d28d586da riscv64: Don't reuse registers when loading constants (#5376)
Rework the constant loading functions in the riscv64 backend to generate fresh temporaries instead of reusing the destination register.
2022-12-05 16:51:52 -08:00
Saúl Cabrera
28cfa57533 cranelift: Small documentation fixes (#5377)
* `translate_operator` doesn't return a boolean.
* `from_base_offset` doesn't panic if offset is smaller than base.
2022-12-06 00:46:58 +00:00
Trevor Elliott
817c2b205c riscv64: Use a temporary when translating shift amount (#5375)
Use a temporary when translating the shift amount, instead of reusing the destination register.
2022-12-05 20:54:14 +00:00
Trevor Elliott
b475b9bd19 Terminate blocks with a single branch in riscv64 (#5374)
Ensure that we're terminating blocks with a single branch instruction, when testing I128 values against zero.
2022-12-05 20:13:28 +00:00
Anton Romanov
29d4d1063f [codegen] Fixed mutability of domtree reference (#5371) 2022-12-05 08:19:32 -08:00
Trevor Elliott
6aea8e0d7e Don't reuse destination registers when lowering splat on aarch64 (#5370) 2022-12-05 08:18:49 -08:00
Trevor Elliott
2e9b0802ab aarch64: Rework amode compilation to produce SSA code (#5369)
Rework the compilation of amodes in the aarch64 backend to stop reusing registers and instead generate fresh virtual registers for intermediates. This resolves some SSA checker violations with the aarch64 backend, and as a nice side-effect removes some unnecessary movs in the generated code.
2022-12-02 01:23:15 +00:00
Trevor Elliott
d54a27d0ea Allocate temporary intermediates when loading constants on aarch64 (#5366)
As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.
2022-12-01 22:29:36 +00:00
Nick Fitzgerald
4510a4a805 Cranelift: mark post-legalization trapping blocks as cold (#5367)
Trapping is a rare event.
2022-12-01 12:46:26 -08:00
Trevor Elliott
37c3c5b1e0 Remove an unnecessary debug trace (#5359) 2022-11-30 20:37:20 -08:00
Trevor Elliott
c16f2956db Allocate a temporary for 64-bit constant loads in the s390x backend (#5357)
Avoid reusing a destination virtual register for 64-bit constants in the s390x backend. This change addresses a case identified by the regalloc2 ssa validator, as the destination register was written to twice when constants were generated via the MachInst::gen_constant function.
2022-11-30 17:01:14 -08:00
Trevor Elliott
d8dbabfe6b Don't reuse registers in the x64 div lowering (#5356)
Introduce a temporary for an intermediate value in the lowering of div in the x64 backend. Additionally, add a src argument to the shift_r smart constructor, which is why the diff got larger than just the div lowering.
2022-11-30 22:44:59 +00:00
Trevor Elliott
87b63174b1 Don't reuse registers in make_i64x2_from_lanes (#5355)
Avoid reusing output registers in make_i64x2_from_lanes by threading the output name instead, and using smart constructors for x64_pinsrd instead of constructing the instructions directly.
2022-11-30 14:37:01 -08:00
Nick Fitzgerald
79f7fa6079 Cranelift: implement heap_{load,store} instruction legalization (#5351)
* Cranelift: implement `heap_{load,store}` instruction legalization

This does not remove `heap_addr` yet, but it does factor out the common
bounds-check-and-compute-the-native-address functionality that is shared between
all of `heap_{addr,load,store}`.

Finally, this adds a missing optimization for when we can dedupe explicit bounds
checks for static memories and Spectre mitigations.

* Cranelift: Enable `heap_load_store_*` run tests on all targets
2022-11-30 19:12:49 +00:00
Alex Crichton
830885383f Implement inline stack probes for AArch64 (#5353)
* Turn off probestack by default in Cranelift

The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.

* aarch64: Improve codegen for AMode fallback

Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used but both
registers are 64-bit so shouldn't be affected by the extending
operation.

* aarch64: Implement inline stack probes

This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.

* Enable inline probestack for AArch64 and Riscv64

This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.

* Enable probestack for aarch64 in cranelift-fuzzgen

* Address review comments

* Remove implicit stack overflow traps from x64 backend

This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.

This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.

* Fix a style issue
2022-11-30 12:30:00 -06:00
Timothy Chen
67fc5389b0 Remove sig data arg and ret fields to reduce size (#5319)
* Remove sig data arg and ret fields to reduce size

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix offsets

* Add comment

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-30 07:19:41 -08:00
Nick Fitzgerald
913a2ec8c8 Cranelift: consider heap's guard pages when legalizing heap_addr (#5335)
* Cranelift: consider heap's guard pages when legalizing `heap_addr`

Fixes #5328

* Update comment to align more directly with implementation

* Add legalization tests for `heap_addr` and offset guard pages
2022-11-29 19:54:25 +00:00
Trevor Elliott
f138fc0ed3 Bump regalloc2 to 0.5.0 (#5345)
* Bump the regalloc2 dependency to 0.5.0
* Replace preg_set_from_machine_env with PRegSet::from
* Vet the regalloc2 update
2022-11-29 11:25:35 -08:00
Afonso Bordado
ec342c20e3 cranelift: Add iadd_cout lowerings for aarch64 (#5177)
* cranelift: Add `iadd_cout`/`isub_bout`  i128 tests

* aarch64: Add `iadd_cout` lowerings

* fuzzgen: Add `iadd_cout`
2022-11-29 10:58:44 -08:00
Trevor Elliott
a5a0645aff Don't allow reuse_def constraints in the s390x Loop instruction (#5336) 2022-11-28 17:52:11 -08:00
Trevor Elliott
368004428a Fix rule shadowing instances in x64 and aarch64 backends (#5334)
Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.
2022-11-28 15:48:34 -08:00
Nick Fitzgerald
58a5089e48 Cranelift: log number of CLIF insts/blocks to optimize/lower (#5333) 2022-11-28 19:35:29 +00:00
Nick Fitzgerald
6fe69d00ca Cranelift: add debug logs counting how many vcode instructions and blocks we lower to (#5332) 2022-11-28 18:57:02 +00:00
Trevor Elliott
54a6d2f79a Generate more fixed_nonallocatable constraints, and add debug assertions (#5132)
Add assertions to the OperandCollector that show we're not using pinned vregs, and use reg_fixed_nonallocatable constraints when a real register is used with other constraint generation functions like reg_use etc.
2022-11-28 10:31:56 -08:00
Timothy Chen
48ee42efc2 Refactor Sigdata methods with sigset (#5307)
* Refactor sigdata methods

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Address comments

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-22 09:03:51 -08:00
Nick Fitzgerald
d0d3245a35 Cranelift: Add heap_load and heap_store instructions (#5300)
* Cranelift: Define `heap_load` and `heap_store` instructions

* Cranelift: Implement interpreter support for `heap_load` and `heap_store`

* Cranelift: Add a suite runtests for `heap_{load,store}`

There are so many knobs we can twist for heaps and I wanted to exhaustively test
all of them, so I wrote a script to generate the tests. I've checked in the
script in case we want to make any changes in the future, but I don't think it
is worth adding this to CI to check that scripts are up to date or anything like
that.

* Review feedback
2022-11-21 23:00:39 +00:00
Trevor Elliott
54cfa4df34 cranelift: Fix implicit pointer argument register use (#5301)
* Fix arg handling to write to VRegs instead of physical regs

* Make is_included_in_clobbers required, and handle Args on x64 and riscv64
2022-11-18 16:47:03 -08:00
Jun Ryung Ju
e5f93d9ec0 cranelift: Support bnot, band, bor, bxor for x86_64. (#5036)
* Support `bnot`, `band`, `bor`, `bxor` for x86_64.

* Fix-up to handle `B{8,16,32,64}` type on bitops

* Fix-up conflict.
2022-11-18 07:45:54 -08:00
Nick Fitzgerald
3b6544dc66 Fix warnings in cranelift-codegen docs builds (#5292) 2022-11-17 21:13:24 +00:00
Timothy Chen
de6e4a4e20 Shrink the size of SigData in Cranelift (#5261)
* Shrink the size of SigData in Cranelift

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Change ret arg length to u16

* Add test

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-17 00:15:19 +00:00
Trevor Elliott
4780bd5902 Don't use %rcx directly with CoffTlsGetAddr (#5278)
Avoid naming %rcx as written by the CoffTlsGetAddr pseudo-instruction in the x64 backend, and instead emit a fixed-def constraint for a fresh VReg and %rcx.
2022-11-16 11:32:09 -08:00
Trevor Elliott
07bd8bf34a Remove unnecessary moves in x64 gen_memcpy (#5277)
Remove some unnecessary moves in the x64 gen_memcpy implementation -- the call instruction that's generated will already constrain the args to those registers.
2022-11-16 10:33:00 -08:00
Trevor Elliott
a007e02bd2 Add fixed_nonallocatable constraints when appropriate (#5253)
Plumb the set of allocatable registers through the OperandCollector and use it validate uses of fixed-nonallocatable registers, like %rsp on x86_64.
2022-11-15 12:49:17 -08:00
Nick Fitzgerald
f6ae67f3f0 Cranelift(aarch64): Use an existing extractor instead of a new pure constructor (#5273) 2022-11-15 20:40:44 +00:00
Nick Fitzgerald
d335dc8d5a Cranelift: Do not optimize heap bounds checking comparison in legalization (#5272)
That optimization is only for 12-bit immediates in Aarch64, which is now handled
in backend lowering, so we can simplify this code a bit now.
2022-11-15 19:54:52 +00:00
Nick Fitzgerald
9967782726 Cranelift(Aarch64): Optimize lowering of icmps with immediates (#5252)
We can encode more constants into 12-bit immediates if we do the following
rewrite for comparisons with odd constants:

        A >= B + 1
    ==> A - 1 >= B
    ==> A > B
2022-11-15 09:18:55 -08:00
Nick Fitzgerald
c2a7ea7e24 Cranelift: de-duplicate bounds checks in legalizations (#5190)
* Cranelift: Add the `DataFlowGraph::display_value_inst` convenience method

* Cranelift: Add some `trace!` logs to some parts of legalization

* Cranelift: de-duplicate bounds checks in legalizations

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks and spectre mitigations over
using virtual memory guard pages with spectre mitigations from ~1.9-2.1x
overhead to ~1.6-1.8x overhead. That is about a 14-19% speed up for when dynamic
memories and spectre mitigations are enabled.

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so
```

</details>

* Update test expectations

* Add a test for deduplicating bounds checks between dynamic memories and spectre mitigations

* Define a struct for the Spectre comparison instead of using a tuple

* More trace logging for heap legalization
2022-11-15 08:47:22 -08:00
Trevor Elliott
dece901d16 Use regalloc constraints for sse blend operations (#5251)
Instead of using xmm0 explicitly for the mask argument to instructions like blendvpd, use regalloc constraints to constrain it to xmm0 instead.
2022-11-14 16:44:34 -08:00
Denys Zadorozhnyi
d3692c2f2b fix typo in caller_conv arg name in ABIMachineSpec::gen_call; (#5259) 2022-11-14 09:02:07 -08:00
Trevor Elliott
0367fbc2d4 cranelift: Rework pinned register lowering (#5249)
Rework pinned register lowering to avoid the use of pinned virtual registers, instead using the MovFromPReg and MovToPReg pseudo instructions.
2022-11-10 16:19:25 -08:00
Nick Fitzgerald
fc62d4ad65 Cranelift: Make heap_addr return calculated base + index + offset (#5231)
* Cranelift: Make `heap_addr` return calculated `base + index + offset`

Rather than return just the `base + index`.

(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)

This move the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.

Before this commit, we were effectively doing

    load(spectre_guard(base + index) + offset)

Now we are effectively doing

    load(spectre_guard(base + index + offset))

Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
https://github.com/bytecodealliance/wasmtime/pull/5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.

* Update heap_addr docs

* Factor out `offset + size` to a helper
2022-11-09 19:53:51 +00:00
Trevor Elliott
b077854b57 Generate SSA code from returns (#5172)
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.
2022-11-08 16:00:49 -08:00