Commit Graph

101 Commits

Author SHA1 Message Date
Afonso Bordado
7a3df7dcc0 riscv64: Improve ctz/clz/cls codegen (#5854)
* cranelift: Add extra runtests for `clz`/`ctz`

* riscv64: Restrict lowering rules for `ctz`/`clz`

* cranelift: Add `u64` isle helpers

* riscv64: Improve `ctz` codegen

* riscv64: Improve `clz` codegen

* riscv64: Improve `cls` codegen

* riscv64: Improve `clz.i128` codegen

Instead of checking if we have 64 zeros in the top half. Check
if it *is* 0, that way we avoid loading the `64` constant.

* riscv64: Improve `ctz.i128` codegen

Instead of checking if we have 64 zeros in the bottom half. Check
if it *is* 0, that way we avoid loading the `64` constant.

* riscv64: Use extended value in `lower_cls`

* riscv64: Use pattern matches on `bseti`
2023-03-21 23:15:14 +00:00
Karl Meakin
ff6f17ca52 ISLE: add synonyms for all variations of icmp (#6081) 2023-03-21 22:13:00 +00:00
Alexa VanHattum
13be5618a7 Cranelift: ISLE: aarch64: fix imm12_from_negated_value for i32, i16 (#6078)
* Fix the semantics of imm12_from_negated_value, swapping to a partial term + rule

* wrapping_neg
2023-03-21 19:16:25 +00:00
Karl Meakin
73cc433bdd cranelift: simplify icmp against UMAX/SMIN/SMAX (#6037)
* cranelift: simplify `icmp` against UMAX/SMIN/SMAX

* Add tests for icmp against numeric limits
2023-03-17 18:54:29 +00:00
Alex Crichton
fcddb9ca81 x64: Add lea-based lowering for iadd (#5986)
* x64: Refactor `Amode` computation in ISLE

This commit replaces the previous computation of `Amode` with a
different set of rules that are intended to achieve the same purpose but
are structured differently. The motivation for this commit is going to
become more relevant in the next commit where `lea` will be used for the
`iadd` instruction, possibly, on x64. When doing so it caused a stack
overflow in the test suite during the compilation phase of a wasm
module, namely as part of the `amode_add` function. This function is
recursively defined in terms of itself and recurses as deep as the
deepest `iadd`-chain in a program. A particular test in our test suite
has a 10k-long chain of `iadd` which ended up causing a stack overflow
in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE
unconditionally peels all the `iadd` nodes away and looks at all of
them, even if most end up in intermediate registers along the way. Given
that structure I couldn't find a way to easily abort the recursion. The
new `to_amode` helper is structured in a similar fashion but attempts to
instead only recurse far enough to fold items into the final `Amode`
instead of recursing through items which themselves don't end up in the
`Amode`. Put another way previously the `amode_add` helper might emit
`x64_add` instructions, but it no longer does that.

This goal of this commit is to preserve all the original `Amode`
optimizations, however. For some parts, though, it relies more on egraph
optimizations to run since if an `iadd` is 10k deep it doesn't try to
find a constant buried 9k levels inside there to fold into the `Amode`.
The hope, though, is that with egraphs having run already it's shuffled
constants to the right most of the time and already folded any possible
together.

* x64: Add `lea`-based lowering for `iadd`

This commit adds a rule for the lowering of `iadd` to use `lea` for 32
and 64-bit addition. The theoretical benefit of `lea` over the `add`
instruction is that the `lea` variant can emulate a 3-operand
instruction which doesn't destructively modify on of its operands.
Additionally the `lea` operation can fold in other components such as
constant additions and shifts.

In practice, however, if `lea` is unconditionally used instead of `iadd`
it ends up losing 10% performance on a local `meshoptimizer` benchmark.
My best guess as to what's going on here is that my CPU's dedicated
units for address computation are all overloaded while the ALUs are
basically idle in a memory-intensive loop. Previously when the ALU was
used for `add` and the address units for stores/loads it in theory
pipelined things better (most of this is me shooting in the dark). To
prevent the performance loss here I've updated the lowering of `iadd` to
conditionally sometimes use `lea` and sometimes use `add` depending on
how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm`
continue to use `add` (and its subsequent hypothetical extra `mov`
necessary into the result). More complicated ones like `a + b + $imm` or
`a + b << c + $imm` use `lea` as it can remove the need for extra
instructions. Locally at least this fixes the performance loss relative
to unconditionally using `lea`.

One note is that this adds an `OperandSize` argument to the
`MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit
`lea` in addition to the preexisting 64-bit encoding.

* Conditionally use `lea` based on regalloc
2023-03-15 17:14:25 +00:00
Alex Crichton
03b5dbb3e0 aarch64: Use VCodeConstant for f64/v128 constants (#5997)
* aarch64: Translate float and splat lowering to ISLE

I was looking into `constant_f128` and its fallback lowering into memory
and to get familiar with the code I figured it'd be good to port some
Rust logic to ISLE. This commit ports the `constant_{f128,f64,f32}`
helpers into ISLE from Rust as well as the `splat_const` helper which
ended up being closely related.

Tests reflect a number of regalloc changes that happened but also namely
one major difference is that in the lowering of `f32` a 32-bit immediate
is created now instead of a 64-bit immediate (in a GP register before
it's moved into a FP register). This semantically has no change but the
generated code is slightly different in a few minor cases.

* aarch64: Load f64/v128 constants from a pool

This commit removes the `LoadFpuConst64` and `LoadFpuConst128`
pseudo-instructions from the AArch64 backend which internally loaded a
nearby constant and then jumped over it. Constants now go through the
`VCodeConstant` infrastructure which gets placed at the end of the
function similar to how x64 works. Some minor support was added in as
well to add a new addressing mode for a `MachLabel`-relative load.
2023-03-13 19:33:52 +00:00
Afonso Bordado
5c95e6fbaf riscv64: Codemotion cleanups to ISLE files (#5984)
* riscv64: Fix typo in extensions

* riscv64: Move converters to top of file

* riscv64: Group up all imm12 rules

* riscv64: Move zero_reg helpers to Physical Regs section

* riscv64: Move helpers away from `clz` lowerings

These were in the middle of the `clz` rules and are kind of distracting

* riscv64: Move `cls` rules next to `ctz`/`clz`

* cranelift: Move `u8_and` / `u32_add` to Primitive Arithmetic section

* riscv64: Mark some imm12 constructors as pure

* cranelift: Move `s32_add_fallible` next to `u32_add`

* riscv64: Fix Typo
2023-03-13 19:20:15 +00:00
Chris Fallin
7f3500a172 Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s. (#5972)
* Cranelift: x64, aarch64, s390x, riscv64: ensure addresses are I64s.

@avanhatt has been looking at our address-mode lowering and found an
example where when feeding an `I32`-typed address into a load or store,
we can violate assumptions and get incorrect codegen.

This should never be reachable in practice, because all producers on
64-bit architectures use 64-bit types for addresses. However, our IR is
insufficiently constrained, and allows loads/stores to `I32` addresses
as well. This is nonsensical on a 64-bit architecture.

Initially I had thought we should tighten either the instruction
definition's accepted types, or the CLIF verifier, to reject this.
However both are target-independent, and we don't want to bake
an assumption of 64-bit-ness into the compiler core. Instead this PR
tightens specific backends' lowerings to rejecct loads/stores of
`I32`-typed addresses.

tl;dr: no security implications as all producers use I64-typed
addresses (and must, for correct operation); but we currently accept
I32-typed addresses too, and this breaks other assumptions.

* Allow R64 as well as I64 types.

* Add an explicit extractor to match 64-bit address types.
2023-03-09 19:08:16 +00:00
Jamey Sharp
6cf7155052 Cranelift: Generalize (x << k) >> k optimization (#5746)
* Generalize unsigned `(x << k) >> k` optimization

Split the existing rule into three parts:
- A dual of the rule for `(x >> k) << k` that is only valid for unsigned
  shifts.
- Known-bits analysis for `(band (uextend x) k)`.
- A new rule for converting `sextend` to `uextend` if the sign-extended
  bits are masked out anyway.

The first two together cover the existing rule.

* Generalize signed `(x << k) >> k` optimization

* Review comments

* Generalize sign-extending shifts further

The shifts can be eliminated even if the shift amount isn't exactly
equal to the difference in bit-widths between the narrow and wide types.

* Add filetests
2023-02-27 17:34:46 +00:00
Jamey Sharp
97381792ac Generalize u/sextend constant folding to all types (#5706)
Also move these optimization rules to cprop.isle; it's where all the
other similar rules are.

Like the other cprop rules, these can subsume any other rules. We can't
do better than reducing an expression to a constant.

The new i64_sextend_imm64 and u64_uextend_imm64 constructors are useful
helpers to clean up other code. I applied them to `imm64_icmp` while I
was here, as well as using the existing `ty_mask` helper to clean up
`imm64_masked`.
2023-02-03 17:29:21 -08:00
Nick Fitzgerald
72c8513411 Cranelift: Correctly wrap shifts in constant propagation (#5695)
Fixes #5690
Fixes #5696

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2023-02-03 00:12:57 +00:00
Jamey Sharp
ac4d28f4dd Constant-fold icmp instructions (#5666)
We found examples of icmp instructions with both operands constant in
spidermonkey.wasm.
2023-02-01 21:55:36 +00:00
Nick Fitzgerald
ffbbfbffce Cranelift: Rewrite or(and(x, y), not(y)) => or(x, not(y)) again (#5684)
This rewrite was introduced in #5676 and then reverted in #5682 due to a footgun
where we accidentally weren't actually checking the `y == !z` precondition. This
commit fixes the precondition check. It also fixes the arithmetic to be
correctly masked to the value type's width.

This reverts commit 268f6bfc1d.
2023-02-01 20:53:22 +00:00
Trevor Elliott
268f6bfc1d Revert "Cranelift: Rewrite or(and(x, y), not(y)) => or(x, not(y)) (#5676)" (#5682)
This reverts commit 8c9eb9939b.

Fixes #5680
2023-02-01 02:53:23 +00:00
Nick Fitzgerald
8c9eb9939b Cranelift: Rewrite or(and(x, y), not(y)) => or(x, not(y)) (#5676)
Co-authored-by: Rainy Sinclair <844493+itsrainy@users.noreply.github.com>
2023-01-31 22:44:45 +00:00
Nick Fitzgerald
7aa240e0f2 Cranelift: constant propagate shifts (#5671)
Thanks to Souper for pointing out we weren't doing this!
2023-01-31 12:06:53 -08:00
Nick Fitzgerald
c9d1c068bc Cranelift: Add egraph rule to rewrite x * C ==> x << log2(C) when C is a power of two (#5647) 2023-01-31 18:04:17 +00:00
Trevor Elliott
7926808e8e riscv64: improve unordered comparison generated code (#5636)
Improve the generated code for unordered floating point comparisons by negating the comparison and inveritng the branches. This allows us to pick the unordered versions, which generate significantly better code.
2023-01-25 17:28:28 -08:00
Trevor Elliott
b58a197d33 cranelift: Add a conditional branch instruction with two targets (#5446)
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.

This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.

Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
2023-01-24 14:37:16 -08:00
Chris Fallin
244dce93f6 Fix optimization rules for narrow types: wrap i8 results to 8 bits. (#5409)
* Fix optimization rules for narrow types: wrap i8 results to 8 bits.

This fixes #5405.

In the egraph mid-end's optimization rules, we were rewriting e.g. imuls
of two iconsts to an iconst of the result, but without masking off the
high bits (beyond the result type's width). This was producing iconsts
with set high bits beyond their types' width, which is not legal.

In addition, this PR adds some optimizations to the algebraic rules to
recognize e.g. `x == x` (and all other integer comparison operators) and
resolve to 1 or 0 as appropriate.

* Review feedback.

* Review feedback, again.
2022-12-09 22:29:25 +00:00
Jamey Sharp
8726eeefb3 cranelift-isle: Add "partial" flag for constructors (#5392)
* cranelift-isle: Add "partial" flag for constructors

Instead of tying fallibility of constructors to whether they're either
internal or pure, this commit assumes all constructors are infallible
unless tagged otherwise with a "partial" flag.

Internal constructors without the "partial" flag are not allowed to use
constructors which have the "partial" flag on the right-hand side of any
rules, because they have no way to report last-minute match failures.

Multi-constructors should never be "partial"; they report match failures
with an empty iterator instead. In turn this means you can't use partial
constructors on the right-hand side of internal multi-constructor rules.
However, you can use the same constructors on the left-hand side with
`if` or `if-let` instead.

In many cases, ISLE can already trivially prove that an internal
constructor always returns `Some`. With this commit, those cases are
largely unchanged, except for removing all the `Option`s and `Some`s
from the generated code for those terms.

However, for internal non-partial constructors where ISLE could not
prove that, it now emits an `unreachable!` panic as the last-resort,
instead of returning `None` like it used to do. Among the existing
backends, here's how many constructors have these panic cases:

- x64: 14% (53/374)
- aarch64: 15% (41/277)
- riscv64: 23% (26/114)
- s390x: 47% (268/567)

It's often possible to rewrite rules so that ISLE can tell the panic can
never be hit. Just ensure that there's a lowest-priority rule which has
no constraints on the left-hand side.

But in many of these constructors, it's difficult to statically prove
the unhandled cases are unreachable because that's only down to
knowledge about how they're called or other preconditions.

So this commit does not try to enforce that all terms have a last-resort
fallback rule.

* Check term flags while translating expressions

Instead of doing it in a separate pass afterward.

This involved threading all the term flags (pure, multi, partial)
through the recursive `translate_expr` calls, so I extracted the flags
to a new struct so they can all be passed together.

* Validate multi-term usage

Now that I've threaded the flags through `translate_expr`, it's easy to
check this case too, so let's just do it.

* Extract `ReturnKind` to use in `ExternalSig`

There are only three legal states for the combination of `multi` and
`infallible`, so replace those fields of `ExternalSig` with a
three-state enum.

* Remove `Option` wrapper from multi-extractors too

If we'd had any external multi-constructors this would correct their
signatures as well.

* Update ISLE tests

* Tag prelude constructors as pure where appropriate

I believe the only reason these weren't marked `pure` before was because
that would have implied that they're also partial. Now that those two
states are specified separately we apply this flag more places.

* Fix my changes to aarch64 `lower_bmask` and `imm` terms
2022-12-07 17:16:03 -08:00
Chris Fallin
8c55b81300 Optimizations to egraph framework (#5391)
* Optimizations to egraph framework:

- Save elaborated results by canonical value, not latest value (union
  value). Previously we were artificially skipping and re-elaborating
  some values we already had because we were not finding them in the
  map.

- Make some changes to handling of icmp results: when icmp became
  I8-typed (when bools went away), many uses became `(uextend $I32 (icmp
  $I8 ...))`, and so patterns in lowering backends were no longer
  matching.

  This PR includes an x64-specific change to match `(brz (uextend (icmp
  ...)))` and similarly for `brnz`, but it also takes advantage of the
  ability to write rules easily in the egraph mid-end to rewrite selects
  with icmp inputs appropriately.

- Extend constprop to understand selects in the egraph mid-end.

With these changes, bz2.wasm sees a ~1% speedup, and spidermonkey.wasm
with a fib.js input sees a 16.8% speedup:

```
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.base.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  2.14s user 0.01s system 99% cpu 2.148 total
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.egraphs.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=.  ./fib.js  1.78s user 0.01s system 99% cpu 1.788 total
```

* Review feedback.
2022-12-07 13:23:13 -08:00
Jamey Sharp
29b23d41b6 ISLE rule cleanups (#5389)
* cranelift-codegen: Use ISLE matching, not same_value

The `same_value` function just wrapped an equality test into an external
constructor, but we can do that with ISLE's equality constraints
instead.

* riscv64: Remove custom condition-code tests

The `lower_icmp` term exists solely to decide whether to sign-extend or
zero-extend the comparison operands, based on whether the condition code
requires a signed comparison. It additionally tested whether the
condition code was == or !=, but produced the same result as for other
unsigned comparisons.

We already have `signed_cond_code` in the ISLE prelude, which classifies
the total-ordering condition codes according to whether they're signed.
It also lumps == and != in the "unsigned" camp, as desired.

So this commit uses the existing method from the prelude instead of
riscv64-local definitions.

Because this version has no constraints on the left-hand side of the
rule in the unsigned case, ISLE generates Rust that always returns
`Some`. That shows that the current use of `unwrap` is justified, at the
only Rust-side call-site of `constructor_lower_icmp`, which is in
cranelift/codegen/src/isa/riscv64/lower/isle.rs.

* ISLE prelude: make offset32 infallible

This extractor always returns `Some`, so it doesn't need to be fallible.
2022-12-07 02:55:59 +00:00
Chris Fallin
f980defe17 egraph support: rewrite to work in terms of CLIF data structures. (#5382)
* egraph support: rewrite to work in terms of CLIF data structures.

This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.

The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one.  This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.

Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).

Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.

The implementation performs two passes overall:

- One, a forward pass in RPO (to see defs before uses), that (i) removes
  "pure" instructions from the layout and (ii) optimizes as it goes. As
  before, we eagerly optimize, so we form the entire union of optimized
  forms of a value before we see any uses of that value. This lets us
  rewrite uses to use the most "up-to-date" form of the value and
  canonicalize and optimize that form.

  The eager rewriting and acyclic representation make each other work
  (we could not eagerly rewrite if there were cycles; and acyclicity
  does not miss optimization opportunities only because the first time
  we introduce a value, we immediately produce its "best" form). This
  design choice is also what allows us to avoid the "parent pointers"
  and fixpoint loop of traditional egraphs.

  This forward optimization pass keeps a scoped hashmap to "intern"
  nodes (thus performing GVN), and also interleaves on a per-instruction
  level with alias analysis. The interleaving with alias analysis allows
  alias analysis to see the most optimized form of each address (so it
  can see equivalences), and allows the next value to see any
  equivalences (reuses of loads or stored values) that alias analysis
  uncovers.

- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
  back into the layout, possibly in multiple places if needed. This
  tracks the loop nest and hoists nodes as needed, performing LICM as it
  goes. Note that by doing this in forward order, we avoid the
  "fixpoint" that traditional LICM needs: we hoist a def before its
  uses, so when we place a node, we place it in the right place the
  first time rather than moving later.

This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.

On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).

Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* Review feedback.

* Review feedback.

* Review feedback.

* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
2022-12-06 14:58:57 -08:00
Nick Fitzgerald
9967782726 Cranelift(Aarch64): Optimize lowering of icmps with immediates (#5252)
We can encode more constants into 12-bit immediates if we do the following
rewrite for comparisons with odd constants:

        A >= B + 1
    ==> A - 1 >= B
    ==> A > B
2022-11-15 09:18:55 -08:00
Trevor Elliott
32a7593c94 cranelift: Remove booleans (#5031)
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.

Fixes #3205

Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-17 16:00:27 -07:00
Chris Fallin
1aaea279e5 egraph opts: fix uextend-of-i32. (#5061)
This is a simple error in the const-prop rules: uextend was not
masking iconst's u64 immediate when extending from i32 to
i64. Arguably an iconst.i32 should not have nonzero bits in the upper
32 of its immediate, but that's a separate design question. For now,
if our invariant is that the upper bits are ignored, then it is
required to mask the bits when const-evaling a `uextend`.

Fixes #5047.
2022-10-17 12:45:49 -07:00
Chris Fallin
2be12a5167 egraph-based midend: draw the rest of the owl (productionized). (#4953)
* egraph-based midend: draw the rest of the owl.

* Rename `egg` submodule of cranelift-codegen to `egraph`.

* Apply some feedback from @jsharp during code walkthrough.

* Remove recursion from find_best_node by doing a single pass.

Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.

* Make elaboration non-recursive.

Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).

* Make elaboration traversal of the domtree non-recursive/stack-safe.

* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.

* Apply static recursion limit to rule application.

* Fix aarch64 wrt dynamic-vector support -- broken rebase.

* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!

* Fix multi-result call testcase.

* Include `cranelift-egraph` in `PUBLISHED_CRATES`.

* Fix atomic_rmw: not really a load.

* Remove now-unnecessary PartialOrd/Ord derivations.

* Address some code-review comments.

* Review feedback.

* Review feedback.

* No overlap in mid-end rules, because we are defining a multi-constructor.

* rustfmt

* Review feedback.

* Review feedback.

* Review feedback.

* Review feedback.

* Remove redundant `mut`.

* Add comment noting what rules can do.

* Review feedback.

* Clarify comment wording.

* Update `has_memory_fence_semantics`.

* Apply @jameysharp's improved loop-level computation.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion commit.

* Fix off-by-one in new loop-nest analysis.

* Review feedback.

* Review feedback.

* Review feedback.

* Use `Default`, not `std::default::Default`, as per @fitzgen

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Apply @fitzgen's comment elaboration to a doc-comment.

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Add stat for hitting the rewrite-depth limit.

* Some code motion in split prelude to make the diff a little clearer wrt `main`.

* Take @jameysharp's suggested `try_into()` usage for blockparam indices.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Take @jameysharp's suggestion to avoid double-match on load op.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion (add import).

* Review feedback.

* Fix stack_load handling.

* Remove redundant can_store case.

* Take @jameysharp's suggested improvement to FuncEGraph::build() logic

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Tweaks to FuncEGraph::build() on top of suggestion.

* Take @jameysharp's suggested clarified condition

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Clean up after suggestion (unused variable).

* Fix loop analysis.

* loop level asserts

* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.

* Take @jameysharp's suggestion re: result_tys

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix up after suggestion

* Take @jameysharp's suggestion to use fold rather than reduce

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fixup after suggestion

* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.

* Clarifying comment in terminator insts.

Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
2022-10-11 18:15:53 -07:00
Trevor Elliott
db06e4e622 ISLE: Resolve remaining x64 overlap errors (#4977)
Resolve overlap errors with the x64 backend.
2022-09-29 10:09:37 -07:00
Trevor Elliott
faf31f6216 ISLE: Resolve overlap in prelude.isle and x64/inst.isle (#4941)
Resolve overlap in the ISLE prelude and the x64 inst module by introducing new types that allow better sharing of extractor resuls, or falling back on priorities.
2022-09-28 10:54:39 -07:00
Damian Heaton
3a2b32bf4d Port branches to ISLE (AArch64) (#4943)
* Port branches to ISLE (AArch64)

Ported the existing implementations of the following opcodes for AArch64
to ISLE:
- `Brz`
- `Brnz`
- `Brif`
- `Brff`
- `BrIcmp`
- `Jump`
- `BrTable`

Copyright (c) 2022 Arm Limited

* Remove dead code

Copyright (c) 2022 Arm Limited
2022-09-26 09:45:32 +01:00
Damian Heaton
e9b08b856d Port icmp to ISLE (AArch64) (#4898)
* Port `icmp` to ISLE (AArch64)

Ported the existing implementation of `icmp` (and, by extension, the
`lower_icmp` function) to ISLE for AArch64.

Copyright (c) 2022 Arm Limited

* Allow 'producer chains', eliminating `Nop0`s

Copyright (c) 2022 Arm Limited
2022-09-13 08:56:50 -07:00
Chris Fallin
2986f6b0ff ABI: implement register arguments with constraints. (#4858)
* ABI: implement register arguments with constraints.

Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.

For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.

This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.

Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!

A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.

* Update based on review feedback.

* Review feedback.
2022-09-08 18:03:14 -07:00
Anton Kirilov
dd07e354b4 Cranelift AArch64: Fix the get_return_address lowering (#4851)
The previous implementation assumed that nothing had clobbered the
LR register since the current function had started executing, so
it would be incorrect for a non-leaf function, for example, that
contains the `get_return_address` operation right after a call.
The operation is valid only if the `preserve_frame_pointers` flag
is enabled, which implies that the presence of a frame record on
the stack is guaranteed.

Copyright (c) 2022, Arm Limited.
2022-09-07 11:09:22 -07:00
Nick Fitzgerald
f18a1f1488 Cranelift: Deduplicate ABI signatures during lowering (#4829)
* Cranelift: Deduplicate ABI signatures during lowering

This commit creates the `SigSet` type which interns and deduplicates the ABI
signatures that we create from `ir::Signature`s. The ABI signatures are now
referred to indirectly via a `Sig` (which is a `cranelift_entity` ID), and we
pass around a `SigSet` to anything that needs to access the actual underlying
`SigData` (which is what `ABISig` used to be).

I had to change a couple methods to return a `SmallInstVec` instead of emitting
directly to work around what would otherwise be shared and exclusive borrows of
the lowering context overlapping. I don't expect any of these to heap allocate
in practice.

This does not remove the often-unnecessary allocations caused by
`ensure_struct_return_ptr_is_returned`. That is left for follow up work.

This also opens the door for further shuffling of signature data into more
efficient representations in the future, now that we have `SigSet` to store it
all in one place and it is threaded through all the code. We could potentially
move each signature's parameter and return vectors into one big vector shared
between all signatures, for example, which could cut down on allocations and
shrink the size of `SigData` since those `SmallVec`s have pretty large inline
capacity.

Overall, this refactoring gives a 1-7% speedup for compilation on
`pulldown-cmark`:

```
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 8754213.66 ± 7526266.23 (confidence = 99%)

  dedupe.so is 1.01x to 1.07x faster than main.so!

  [191003295 234620642.20 280597986] dedupe.so
  [197626699 243374855.86 321816763] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [170406200 194299792.68 253001201] dedupe.so
  [172071888 193230743.11 223608329] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3870997347 4437735062.59 5216007266] dedupe.so
  [4019924063 4424595349.24 4965088931] main.so
```

* Use full path instead of import to avoid warnings in some build configurations

Warnings will then cause CI to fail.

* Move `SigSet` into `VCode`
2022-08-31 20:39:32 +00:00
Damian Heaton
3d9d759380 Port fcmp to ISLE (AArch64) (#4819)
Ported the existing implementation of `fcmp` for AArch64 to ISLE.

This also ports the `lower_vector_comparison` method to ISLE.

Copyright (c) 2022 Arm Limited
2022-08-30 09:06:15 -07:00
Chris Fallin
955d4e4ba1 AArch64: port load and store operations to ISLE. (#4785)
This retains `lower_amode` in the handwritten code (@akirilov-arm
reports that there is an upcoming patch to port this), but tweaks it
slightly to take a `Value` rather than an `Inst`.
2022-08-29 17:45:55 -07:00
Chris Fallin
a6eb24bd4f AArch64: port misc ops to ISLE. (#4796)
* Add some precise-output compile tests for aarch64.

* AArch64: port misc ops to ISLE.

- get_pinned_reg / set_pinned_reg
- bitcast
- stack_addr
- extractlane
- insertlane
- vhigh_bits
- iadd_ifcout
- fcvt_low_from_sint
2022-08-29 12:56:39 -07:00
Trevor Elliott
25d960f9c4 x64: Lower tlsvalue, sqmul_round_sat, and uunarrow in ISLE (#4793)
Lower tlsvalue, sqmul_round_sat, and uunarrow in ISLE.
2022-08-26 16:33:48 -07:00
Chris Fallin
8e8dfdf5f9 AArch64: Migrate calls and returns to ISLE. (#4788) 2022-08-26 16:26:39 -07:00
Trevor Elliott
ca6d648e37 x64: Ensure that constants are always 16 bytes for XmmMem (#4790)
Ensure that constants generated for the memory case of XmmMem values are always 16 bytes, ensuring that we don't accidantally perform an unaligned load.

Fixes #4761
2022-08-26 20:04:38 +00:00
Trevor Elliott
9386409607 x64: Lower extractlane, scalar_to_vector, and splat in ISLE (#4780)
Lower extractlane, scalar_to_vector and splat in ISLE.

This PR also makes some changes to the SinkableLoad api
* change the return type of sink_load to RegMem as there are more functions available for dealing with RegMem
* add reg_mem_to_reg_mem_imm and register it as an automatic conversion
2022-08-25 09:38:03 -07:00
Trevor Elliott
b8b6f2781e x64: Lower shuffle and swizzle in ISLE (#4772)
Lower `shuffle` and `swizzle` in ISLE.

This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.

I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
2022-08-24 21:49:51 +00:00
Trevor Elliott
4bdfa76370 x64: Migrate get_pinned_reg, set_pinned_reg, vconst, and raw_bitcast to ISLE (#4763)
https://github.com/bytecodealliance/wasmtime/pull/4763
2022-08-23 16:32:00 -07:00
Damian Heaton
da1fb305a3 Port vconst to ISLE (AArch64) (#4750)
* Port `vconst` to ISLE (AArch64)

Ported the existing implementation of `vconst` to ISLE for AArch64, and
added support for 64-bit vector constants.

Also introduced 64-bit `vconst` support to the interpreter.

Copyright (c) 2022 Arm Limited

* Replace if-chains with match statements

Copyright (c) 2022 Arm Limited
2022-08-23 09:40:11 -07:00
Ulrich Weigand
50fcab2984 s390x: Implement tls_value (#4616)
Implement the tls_value for s390 in the ELF general-dynamic mode.

Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
  needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
  the GOT (global offset table) is passed in %r12.

This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker.  This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).

We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.

Needed by the cg_clif frontend.
2022-08-10 10:02:07 -07:00
Damian Heaton
eb332b8369 Convert fma, valltrue & vanytrue to ISLE (AArch64) (#4608)
* Convert `fma`, `valltrue` & `vanytrue` to ISLE (AArch64)

Ported the existing implementations of the following opcodes to ISLE on
AArch64:
- `fma`
  - Introduced missing support for `fma` on vector values, as per the
    docs.
- `valltrue`
- `vanytrue`

Also fixed `fcmp` on scalar values in the interpreter, and enabled
interpreter tests in `simd-fma.clif`.

This introduces the `FMLA` machine instruction.

Copyright (c) 2022 Arm Limited

* Add comments for `Fmla` and `Bsl`

Copyright (c) 2022 Arm Limited
2022-08-05 09:47:56 -07:00
Ulrich Weigand
b17b1eb25d [s390x, abi_impl] Add i128 support (#4598)
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.

The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call.  This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.

To do so, we add a new variant ABIArg::ImplicitArg.  This acts like
StructArg, except that the value type is the actual target type,
not a pointer type.  The required conversions have to be inserted
in the prologue and at function call sites.

Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values.  In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.

For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI.  The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.

(This implements the second half of issue #4565.)
2022-08-04 20:41:26 +00:00
Trevor Elliott
1fc11bbe51 x64: Migrate brff and I128 branching instructions to ISLE (#4599)
https://github.com/bytecodealliance/wasmtime/pull/4599
2022-08-04 08:58:50 -07:00
Ulrich Weigand
b9dd48e34b [s390x, abi_impl] Support struct args using explicit pointers (#4585)
This adds support for StructArgument on s390x.  The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.

To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.

One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism.  Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.

However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot.  Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args.  This required updates
to all back ends.

In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers).  This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
  needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.

(This implements the first half of issue #4565.)
2022-08-03 19:00:07 +00:00