Commit Graph

45 Commits

Author SHA1 Message Date
Trevor Elliott
9dc4f1a83c s390x: Move the value out of the casloop_val_reg with mov_preg (#5430)
The casloop_emit function in the s390x backend was using the fixed non-allocatable register %r0 directly with move instructions, which produced a panic in the regalloc2 checker (#5425). This PR changes the casloop_result function to use mov_preg instead of copy_reg to fetch the result, as it's not viewed by regalloc2 as a move.

Fixes #5425
2022-12-14 13:06:35 -08:00
Ulrich Weigand
f0af622208 Simplify LowerBackend interface (#5432)
* Refactor lower_branch to have Unit result

Branches cannot have any output, so it is more straightforward
to have the ISLE term return Unit instead of InstOutput.

Also provide a new `emit_side_effect` term to simplify
implementation of `lower_branch` rules with Unit result.

* Simplify LowerBackend interface

Move all remaining asserts from the LowerBackend::lower and
::lower_branch_group into the common call site.

Change return value of ::lower to Option<InstOutput>, and
return value of ::lower_branch_group to Option<()> to match
ISLE term signature.

Only pass the first branch into ::lower_branch_group and
rename it to ::lower_branch.

As a result of all those changes, LowerBackend routines
now consists solely to calls to the corresponding ISLE
routines.
2022-12-14 00:48:25 +00:00
Ulrich Weigand
df923f18ca Remove MachInst::gen_constant (#5427)
* aarch64: constant generation cleanup

Add support for MOVZ and MOVN generation via ISLE.
Handle f32const, f64const, and nop instructions via ISLE.
No longer call Inst::gen_constant from lower.rs.

* riscv64: constant generation cleanup

Handle f32const, f64const, and nop instructions via ISLE.

* s390x: constant generation cleanup

Fix rule priorities for "imm" term.
Only handle 32-bit stack offsets; no longer use load_constant64.

* x64: constant generation cleanup

No longer call Inst::gen_constant from lower.rs or abi.rs.

* Refactor LowerBackend::lower to return InstOutput

No longer write to the per-insn output registers; instead, return
an InstOutput vector of temp registers holding the outputs.

This will allow calling LowerBackend::lower multiple times for
the same instruction, e.g. to rematerialize constants.

When emitting the primary copy of the instruction during lowering,
writing to the per-insn registers is now done in lower_clif_block.

As a result, the ISLE lower_common routine is no longer needed.
In addition, the InsnOutput type and all code related to it
can be removed as well.

* Refactor IsleContext to hold a LowerBackend reference

Remove the "triple", "flags", and "isa_flags" fields that are
copied from LowerBackend to each IsleContext, and instead just
hold a reference to LowerBackend in IsleContext.

This will allow calling LowerBackend::lower from within callbacks
in src/machinst/isle.rs, e.g. to rematerialize constants.

To avoid having to pass LowerBackend references through multiple
functions, eliminate the lower_insn_to_regs subroutines in those
targets that still have them, and just inline into the main
lower routine.  This also eliminates lower_inst.rs on aarch64
and riscv64.

Replace all accesses to the removed IsleContext fields by going
through the LowerBackend reference.

* Remove MachInst::gen_constant

This addresses the problem described in issue
https://github.com/bytecodealliance/wasmtime/issues/4426
that targets currently have to duplicate code to emit
constants between the ISLE logic and the gen_constant
callback.

After the various cleanups in earlier patches in this series,
the only remaining user of get_constant is put_value_in_regs
in Lower.  This can now be removed, and instead constant
rematerialization can be performed in the put_in_regs ISLE
callback by simply directly calling LowerBackend::lower
on the instruction defining the constant (using a different
output register).

Since the check for egraph mode is now no longer performed in
put_value_in_regs, the Lower::flags member becomes obsolete.

Care needs to be taken that other calls directly to the
Lower::put_value_in_regs routine now handle the fact that
no more rematerialization is performed.  All such calls in
target code already historically handle constants themselves.
The remaining call site in the ISLE gen_call_common helper
can be redirected to the ISLE put_in_regs callback.

The existing target implementations of gen_constant are then
unused and can be removed.  (In some target there may still
be further opportunities to remove duplication between ISLE
and some local Rust code - this can be left to future patches.)
2022-12-13 13:00:04 -08:00
Timothy Chen
122872fb0c Remove references for sig (#5414) 2022-12-12 08:46:23 -08:00
Jamey Sharp
8726eeefb3 cranelift-isle: Add "partial" flag for constructors (#5392)
* cranelift-isle: Add "partial" flag for constructors

Instead of tying fallibility of constructors to whether they're either
internal or pure, this commit assumes all constructors are infallible
unless tagged otherwise with a "partial" flag.

Internal constructors without the "partial" flag are not allowed to use
constructors which have the "partial" flag on the right-hand side of any
rules, because they have no way to report last-minute match failures.

Multi-constructors should never be "partial"; they report match failures
with an empty iterator instead. In turn this means you can't use partial
constructors on the right-hand side of internal multi-constructor rules.
However, you can use the same constructors on the left-hand side with
`if` or `if-let` instead.

In many cases, ISLE can already trivially prove that an internal
constructor always returns `Some`. With this commit, those cases are
largely unchanged, except for removing all the `Option`s and `Some`s
from the generated code for those terms.

However, for internal non-partial constructors where ISLE could not
prove that, it now emits an `unreachable!` panic as the last-resort,
instead of returning `None` like it used to do. Among the existing
backends, here's how many constructors have these panic cases:

- x64: 14% (53/374)
- aarch64: 15% (41/277)
- riscv64: 23% (26/114)
- s390x: 47% (268/567)

It's often possible to rewrite rules so that ISLE can tell the panic can
never be hit. Just ensure that there's a lowest-priority rule which has
no constraints on the left-hand side.

But in many of these constructors, it's difficult to statically prove
the unhandled cases are unreachable because that's only down to
knowledge about how they're called or other preconditions.

So this commit does not try to enforce that all terms have a last-resort
fallback rule.

* Check term flags while translating expressions

Instead of doing it in a separate pass afterward.

This involved threading all the term flags (pure, multi, partial)
through the recursive `translate_expr` calls, so I extracted the flags
to a new struct so they can all be passed together.

* Validate multi-term usage

Now that I've threaded the flags through `translate_expr`, it's easy to
check this case too, so let's just do it.

* Extract `ReturnKind` to use in `ExternalSig`

There are only three legal states for the combination of `multi` and
`infallible`, so replace those fields of `ExternalSig` with a
three-state enum.

* Remove `Option` wrapper from multi-extractors too

If we'd had any external multi-constructors this would correct their
signatures as well.

* Update ISLE tests

* Tag prelude constructors as pure where appropriate

I believe the only reason these weren't marked `pure` before was because
that would have implied that they're also partial. Now that those two
states are specified separately we apply this flag more places.

* Fix my changes to aarch64 `lower_bmask` and `imm` terms
2022-12-07 17:16:03 -08:00
Timothy Chen
67fc5389b0 Remove sig data arg and ret fields to reduce size (#5319)
* Remove sig data arg and ret fields to reduce size

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix offsets

* Add comment

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-30 07:19:41 -08:00
Timothy Chen
48ee42efc2 Refactor Sigdata methods with sigset (#5307)
* Refactor sigdata methods

* Update cranelift/codegen/src/machinst/abi.rs

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Address comments

Co-authored-by: Jamey Sharp <jamey@minilop.net>
2022-11-22 09:03:51 -08:00
Nick Fitzgerald
6d289723bd Cranelift: Use a single, shared vector allocation for all ABIArgs (#5127)
* Cranelift: Use a single, shared vector allocation for all `ABIArg`s

Instead of two `SmallVec`s per `SigData`.

* Remove `Deref` and `DerefMut` impls for `ArgsAccumulator`
2022-10-31 14:32:17 -07:00
Ulrich Weigand
39b3b1d772 s390x: Fix handling of sret arguments (#5116)
Skip synthetic StructReturn entries in the return value list.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5089
2022-10-25 10:40:10 -07:00
Ulrich Weigand
9dadba60a0 s390x: use constraints for call arguments and return values (#5092)
Use the regalloc constraint-based CallArgList / CallRetList
mechanism instead of directly using physregs in instructions.
2022-10-21 11:01:22 -07:00
Trevor Elliott
d9753fac2b Remove uses of reg_mod from s390x (#5073)
Remove uses of reg_mod from the s390x backend. This required moving away from using r0/r1 as the result registers from a few different pseudo instructions, standardizing instead on r2/r3. That change was necessary as regalloc2 will not correctly allocate registers that aren't listed in the allocatable set, which r0/r1 are not.

Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-21 09:22:16 -07:00
Trevor Elliott
32a7593c94 cranelift: Remove booleans (#5031)
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.

Fixes #3205

Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-17 16:00:27 -07:00
Chris Fallin
2be12a5167 egraph-based midend: draw the rest of the owl (productionized). (#4953)
* egraph-based midend: draw the rest of the owl.

* Rename `egg` submodule of cranelift-codegen to `egraph`.

* Apply some feedback from @jsharp during code walkthrough.

* Remove recursion from find_best_node by doing a single pass.

Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.

* Make elaboration non-recursive.

Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).

* Make elaboration traversal of the domtree non-recursive/stack-safe.

* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.

* Apply static recursion limit to rule application.

* Fix aarch64 wrt dynamic-vector support -- broken rebase.

* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!

* Fix multi-result call testcase.

* Include `cranelift-egraph` in `PUBLISHED_CRATES`.

* Fix atomic_rmw: not really a load.

* Remove now-unnecessary PartialOrd/Ord derivations.

* Address some code-review comments.

* Review feedback.

* Review feedback.

* No overlap in mid-end rules, because we are defining a multi-constructor.

* rustfmt

* Review feedback.

* Review feedback.

* Review feedback.

* Review feedback.

* Remove redundant `mut`.

* Add comment noting what rules can do.

* Review feedback.

* Clarify comment wording.

* Update `has_memory_fence_semantics`.

* Apply @jameysharp's improved loop-level computation.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion commit.

* Fix off-by-one in new loop-nest analysis.

* Review feedback.

* Review feedback.

* Review feedback.

* Use `Default`, not `std::default::Default`, as per @fitzgen

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Apply @fitzgen's comment elaboration to a doc-comment.

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Add stat for hitting the rewrite-depth limit.

* Some code motion in split prelude to make the diff a little clearer wrt `main`.

* Take @jameysharp's suggested `try_into()` usage for blockparam indices.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Take @jameysharp's suggestion to avoid double-match on load op.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion (add import).

* Review feedback.

* Fix stack_load handling.

* Remove redundant can_store case.

* Take @jameysharp's suggested improvement to FuncEGraph::build() logic

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Tweaks to FuncEGraph::build() on top of suggestion.

* Take @jameysharp's suggested clarified condition

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Clean up after suggestion (unused variable).

* Fix loop analysis.

* loop level asserts

* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.

* Take @jameysharp's suggestion re: result_tys

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix up after suggestion

* Take @jameysharp's suggestion to use fold rather than reduce

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fixup after suggestion

* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.

* Clarifying comment in terminator insts.

Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
2022-10-11 18:15:53 -07:00
Chris Fallin
2986f6b0ff ABI: implement register arguments with constraints. (#4858)
* ABI: implement register arguments with constraints.

Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.

For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.

This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.

Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!

A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.

* Update based on review feedback.

* Review feedback.
2022-09-08 18:03:14 -07:00
Jamey Sharp
3d6d49daba cranelift: Remove of/nof overflow flags from icmp (#4879)
* cranelift: Remove of/nof overflow flags from icmp

Neither Wasmtime nor cg-clif use these flags under any circumstances.
From discussion on #3060 I see it's long been unclear what purpose these
flags served.

Fixes #3060, fixes #4406, and fixes #4875... by deleting all the code
that could have been buggy.

This changes the cranelift-fuzzgen input format by removing some IntCC
options, so I've gone ahead and enabled I128 icmp tests at the same
time. Since only the of/nof cases were failing before, I expect these to
work.

* Restore trapif tests

It's still useful to validate that iadd_ifcout's iflags result can be
forwarded correctly to trapif, and for that purpose it doesn't really
matter what condition code is checked.
2022-09-07 08:38:41 -07:00
Nick Fitzgerald
f18a1f1488 Cranelift: Deduplicate ABI signatures during lowering (#4829)
* Cranelift: Deduplicate ABI signatures during lowering

This commit creates the `SigSet` type which interns and deduplicates the ABI
signatures that we create from `ir::Signature`s. The ABI signatures are now
referred to indirectly via a `Sig` (which is a `cranelift_entity` ID), and we
pass around a `SigSet` to anything that needs to access the actual underlying
`SigData` (which is what `ABISig` used to be).

I had to change a couple methods to return a `SmallInstVec` instead of emitting
directly to work around what would otherwise be shared and exclusive borrows of
the lowering context overlapping. I don't expect any of these to heap allocate
in practice.

This does not remove the often-unnecessary allocations caused by
`ensure_struct_return_ptr_is_returned`. That is left for follow up work.

This also opens the door for further shuffling of signature data into more
efficient representations in the future, now that we have `SigSet` to store it
all in one place and it is threaded through all the code. We could potentially
move each signature's parameter and return vectors into one big vector shared
between all signatures, for example, which could cut down on allocations and
shrink the size of `SigData` since those `SmallVec`s have pretty large inline
capacity.

Overall, this refactoring gives a 1-7% speedup for compilation on
`pulldown-cmark`:

```
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 8754213.66 ± 7526266.23 (confidence = 99%)

  dedupe.so is 1.01x to 1.07x faster than main.so!

  [191003295 234620642.20 280597986] dedupe.so
  [197626699 243374855.86 321816763] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [170406200 194299792.68 253001201] dedupe.so
  [172071888 193230743.11 223608329] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3870997347 4437735062.59 5216007266] dedupe.so
  [4019924063 4424595349.24 4965088931] main.so
```

* Use full path instead of import to avoid warnings in some build configurations

Warnings will then cause CI to fail.

* Move `SigSet` into `VCode`
2022-08-31 20:39:32 +00:00
Nick Fitzgerald
5392d7cdd7 cranelift: Merge abi and abi_impl modules (#4805) 2022-08-29 23:20:36 +00:00
Chris Fallin
8e8dfdf5f9 AArch64: Migrate calls and returns to ISLE. (#4788) 2022-08-26 16:26:39 -07:00
Benjamin Bouvier
8a9b1a9025 Implement an incremental compilation cache for Cranelift (#4551)
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime. 

After the suggestion of Chris, `Function` has been split into mostly two parts:

- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.

Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:

- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
  - `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
  - The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.

The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.

A basic fuzz target has been introduced that tries to do the bare minimum:

- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
  - This last check is less efficient and less likely to happen, so probably should be rethought a bit.

Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.

Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement. 

Fixes #4155.
2022-08-12 16:47:43 +00:00
Nick Fitzgerald
532fb22af6 Cranelift: Remove the LowerCtx trait (#4697)
The trait had only one implementation: the `Lower` struct. It is easier to just
use that directly, and not introduce unnecessary layers of generics and
abstractions.

Once upon a time, there was hope that we would have other implementations of the
`LowerCtx` trait, that did things like lower CLIF to SMTLIB for
verification. However, this is not practical these days given the way that the
trait has evolved over time, and our verification efforts are focused on ISLE
now anyways, and we're actually making some progress on that front (much more
than anyone ever did on a second `LowerCtx` trait implementation!)
2022-08-11 16:54:17 -07:00
Ulrich Weigand
67870d1518 s390x: Support both big- and little-endian vector lane order (#4682)
This implements the s390x back-end portion of the solution for
https://github.com/bytecodealliance/wasmtime/issues/4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR #4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR #4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
2022-08-11 12:10:46 -07:00
Ulrich Weigand
50fcab2984 s390x: Implement tls_value (#4616)
Implement the tls_value for s390 in the ELF general-dynamic mode.

Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
  needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
  the GOT (global offset table) is passed in %r12.

This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker.  This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).

We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.

Needed by the cg_clif frontend.
2022-08-10 10:02:07 -07:00
Ulrich Weigand
b17b1eb25d [s390x, abi_impl] Add i128 support (#4598)
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.

The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call.  This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.

To do so, we add a new variant ABIArg::ImplicitArg.  This acts like
StructArg, except that the value type is the actual target type,
not a pointer type.  The required conversions have to be inserted
in the prologue and at function call sites.

Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values.  In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.

For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI.  The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.

(This implements the second half of issue #4565.)
2022-08-04 20:41:26 +00:00
Ulrich Weigand
b9dd48e34b [s390x, abi_impl] Support struct args using explicit pointers (#4585)
This adds support for StructArgument on s390x.  The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.

To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.

One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism.  Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.

However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot.  Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args.  This required updates
to all back ends.

In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers).  This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
  needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.

(This implements the first half of issue #4565.)
2022-08-03 19:00:07 +00:00
Afonso Bordado
709716bb8e cranelift: Implement scalar FMA on x86 (#4460)
x86 does not have dedicated instructions for scalar FMA, lower
to a libcall which seems to be what llvm does.
2022-08-03 10:29:10 -07:00
Nick Fitzgerald
42bba452a6 Cranelift: Add instructions for getting the current stack/frame/return pointers (#4573)
* Cranelift: Add instructions for getting the current stack/frame pointers and return address

This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535

* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead

We just special case getting operands from `Amode`s now.

* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`

* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp

* Handle non-allocatable registers in Amode::with_allocs

* Use "stack" instead of "r15" on s390x

* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
2022-08-02 14:37:17 -07:00
Anton Kirilov
ead6edb0c5 Cranelift AArch64: Migrate Splat to ISLE (#4521)
Copyright (c) 2022, Arm Limited.
2022-07-26 17:57:15 +00:00
Trevor Elliott
ee7e4f4c6b x64: Port func_addr and symbol_value to ISLE (#4485)
https://github.com/bytecodealliance/wasmtime/pull/4485
2022-07-25 11:11:16 -07:00
Ulrich Weigand
638dc4e0b3 s390x: Implement full SIMD support (#4427)
This adds full support for all Cranelift SIMD instructions
to the s390x target.  Everything is matched fully via ISLE.

In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not.  This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).

The bulk of the changes are platform-specific, but there are
a few exceptions:

- Added ISLE extractors for the Immediate and Constant types,
  to enable matching the vconst and swizzle instructions.

- Added a missing accessor for call_conv to ABISig.

- Fixed endian conversion for vector types in data_value.rs
  to enable their use in runtests on the big-endian platforms.

- Enabled (nearly) all SIMD runtests on s390x.  [ Two test cases
  remain disabled due to vector shift count semantics, see below. ]

- Enabled all Wasmtime SIMD tests on s390x.

There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged.  I've opened the
following issues to track them:

- Vector shift count semantics
  https://github.com/bytecodealliance/wasmtime/issues/4424

- is_included_in_clobbers vs. link register
  https://github.com/bytecodealliance/wasmtime/issues/4425

- gen_constant callback
  https://github.com/bytecodealliance/wasmtime/issues/4426

All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
2022-07-18 14:00:48 -07:00
Sam Parker
9c43749dfe [RFC] Dynamic Vector Support (#4200)
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.

Copyright (c) 2022, Arm Limited.
2022-07-07 12:54:39 -07:00
Ulrich Weigand
ec83144c88 s390x: use full vector register file for FP operations (#4360)
This defines the full set of 32 128-bit vector registers on s390x.
(Note that the VRs overlap the existing FPRs.)  In addition, this
adds support to use all 32 vector registers to implement floating-
point operations, by using vector floating-point instructions with
the 'W' bit set to operate only on the first element.

This part of the vector instruction set mostly matches the old FP
instruction set, with two exceptions:

- There is no vector version of the COPY SIGN instruction.  Instead,
  now use a VECTOR SELECT with an appropriate bit mask to implement
  the fcopysign operation.

- There are no vector version of the float <-> int conversion
  instructions where source and target differ in bit size.  Use
  appropriate multiple conversion steps instead.  This also requires
  use of explicit checking to implement correct overflow handling.
  As a side effect, this version now also implements the i8 / i16
  variants of all conversions, which had been missing so far.

For all operations except those two above, we continue to use the
old FP instruction if applicable (i.e. if all operands happen to
have been allocated to the original FP register set), and use the
vector instruction otherwise.
2022-06-30 16:33:39 -07:00
Ulrich Weigand
95836ba114 s390x: clean up lower.rs (#4355)
Now that lowering is fully done in ISLE, clean up some code remnants
in lower.rs.  In particular, move code to lower/isle.rs where
possible, and inline lower_insn_to_regs into its caller and simplify.
2022-06-30 11:16:59 -07:00
Ulrich Weigand
7a9479f77c ISLE: Migrate call and return instructions (#3785)
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.

To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE.  All actual
code generation is then done in ISLE rules, following the
information provided by that signature.

[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]

To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
2022-06-29 14:22:50 -07:00
Chris Fallin
eb435f3057 x64: use constant pool for u64 constants rather than movabs. (#4088)
* Allow emitting u64 constants into constant pool.

* Use constant pool for constants on x64 that do not fit in a simm32 and are needed as a RegMem or RegMemImm.

* Fix rip-relative addressing bug in pinsrd emission.
2022-05-10 09:21:05 -07:00
Chris Fallin
f85047b084 Rework x64 addressing-mode lowering to be slightly more flexible. (#4080)
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).

This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.

(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this.  This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)

Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
2022-05-02 16:20:39 -07:00
Chris Fallin
03793b71a7 ISLE: remove all uses of argument polarity, and remove it from the language. (#4091)
This PR removes "argument polarity": the feature of ISLE extractors that lets them take
inputs aside from the value to be matched.

Cases that need this expressivity have been subsumed by #4072 with if-let clauses;
we can now finally remove this misfeature of the language, which has caused significant
confusion and has always felt like a bit of a hack.

This PR (i) removes the feature from the ISLE compiler; (ii) removes it from the reference
documentation; and (iii) refactors away all uses of the feature in our three existing
backends written in ISLE.
2022-05-02 09:52:12 -07:00
Chris Fallin
e4b7c8a737 Cranelift: fix #3953: rework single/multiple-use logic in lowering. (#4061)
* Cranelift: fix #3953: rework single/multiple-use logic in lowering.

This PR addresses the longstanding issue with loads trying to merge
into compares on x86-64, and more generally, with the lowering
framework falsely recognizing "single uses" of one op by
another (which would normally allow merging of side-effecting ops like
loads) when there is *indirect* duplication.

To fix this, we replace the direct `value_uses` count with a
transitive notion of uniqueness (not unlike Rust's `&`/`&mut` and how
a `&mut` downgrades to `&` when accessed through another `&`!). A
value is used multiple times transitively if it has multiple direct
uses, or is used by another op that is used multiple times
transitively.

The canonical example of badness is:

```
    v1 := load
    v2 := ifcmp v1, ...
    v3 := selectif v2, ...
    v4 := selectif v2, ...
```

both `v3` and `v4` effectively merge the `ifcmp` (`v2`), so even
though the use of `v1` is "unique", it is codegenned twice. This is
why we ~~can't have nice things~~ can't merge loads into
compares (#3953).

There is quite a subtle and interesting design space around this
problem and how we might solve it. See the long doc-comment on
`ValueUseState` in this PR for more justification for the particular
design here. In particular, this design deliberately simplifies a bit
relative to an "optimal" solution: some uses can *become* unique
depending on merging, but we don't design our data structures for such
updates because that would require significant extra costly
tracking (some sort of transitive refcounting). For example, in the
above, if `selectif` somehow did not merge `ifcmp`, then we would only
codegen the `ifcmp` once into its result register (and use that
register twice); then the load *is* uniquely used, and could be
merged. But that requires transitioning from "multiple use" back to
"unique use" with careful tracking as we do pattern-matching, which
I've chosen to make out-of-scope here for now. In practice, I don't
think it will matter too much (and we can always improve later).

With this PR, we can now re-enable load-op merging for compares. A
subsequent commit does this.

* Update x64 backend to allow load-op merging for `cmp`.

* Update filetests.

* Add test for cmp-mem merging on x64.

* Comment fixes.

* Rework ValueUseState analysis for better performance.

* Update s390x filetest: iadd_ifcout cannot merge loads anymore because it has multiple outputs (ValueUseState limitation)

* Address review comments.
2022-04-22 18:00:48 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Ulrich Weigand
10198553c7 ISLE: Common accessors for some insn data fields (#3781)
Add accessors to prelude.isle to access data fields of
`func_addr` and `symbol_value` instructions.

These are based on similar versions I had added to the s390x
back-end, but are a bit more straightforward to use.

- func_ref_data: Extract SigRef, ExternalName, and RelocDistance
  fields given a FuncRef.

- symbol_value_data: Extract ExternalName, RelocDistance, and
  offset fields given a GlobalValue representing a Symbol.

- reloc_distance_near: Test for RelocDistance::Near.

The s390x back-end is changed to use these common versions.

Note that this exposed a bug in common isle code: This extractor:

(extractor (load_sym inst)
  (and inst
       (load _ (def_inst (symbol_value
                           (symbol_value_data _
                             (reloc_distance_near) offset)))
               (i64_from_offset
                 (memarg_symbol_offset_sum <offset _)))))

would raise an assertion in sema.rs due to a supposed cycle in
extractor definitions.  But there was no actual cycle, it was
simply that the extractor tree refers twice to the `insn_data`
extractor (once via the `load` and once via the `symbol_value`
extractor).  Fixed by checking for pre-existing definitions only
along one path in the tree, not across the whole tree.
2022-02-08 17:57:27 -08:00
Ulrich Weigand
9c5c872b3b s390x: Add support for all remaining atomic operations (#3746)
This adds support for all atomic operations that were unimplemented
so far in the s390x back end:
- atomic_rmw operations xchg, nand, smin, smax, umin, umax
- $I8 and $I16 versions of atomic_rmw and atomic_cas
- little endian versions of atomic_rmw and atomic_cas

All of these have to be implemented by a compare-and-swap loop;
and for the $I8 and $I16 versions the actual atomic instruction
needs to operate on the surrounding aligned 32-bit word.

Since we cannot emit new control flow during ISLE instruction
selection, these compare-and-swap loops are emitted as a single
meta-instruction to be expanded at emit time.

However, since there is a large number of different versions of
the loop required to implement all the above operations, I've
implemented a facility to allow specifying the loop bodies
from within ISLE after all, by creating a vector of MInst
structures that will be emitted as part of the meta-instruction.

There are still restrictions, in particular instructions that
are part of the loop body may not modify any virtual register.
But even so, this approach looks preferable to doing everything
in emit.rs.

A few instructions needed in those compare-and-swap loop bodies
were added as well, in particular the RxSBG family of instructions
as well as the LOAD REVERSED in-register byte-swap instructions.

This patch also adds filetest runtests to verify the semantics
of all operations, in particular the subword and little-endian
variants (those are currently only executed on s390x).
2022-02-08 13:48:44 -08:00
Ulrich Weigand
36369a6f35 s390x: Migrate branches and traps to ISLE
In order to migrate branches to ISLE, we define a second entry
point `lower_branch` which gets the list of branch targets as
additional argument.

This requires a small change to `lower_common`: the `isle_lower`
callback argument is changed from a function pointer to a closure.
This allows passing the extra argument via a closure.

Traps make use of the recently added facility to emit safepoints
from ISLE, but are otherwise straightforward.
2022-01-25 18:15:32 +01:00
Chris Fallin
cd6b73fc90 Merge pull request #3723 from uweigand/isle-safepoint
ISLE: Allow emitting safepoint insns
2022-01-25 08:56:22 -08:00
Ulrich Weigand
906f6a35cf ISLE: Allow emitting safepoint insns
Change the implementation of emitted_insts in IsleContext from
a plain vector of instructions into a vector of tuples, where
the second element is a boolean that indicates whether this
instruction should be emitted as a safepoint.

This allows targets to emit safepoint insns via ISLE.
2022-01-25 14:21:41 +01:00
Ulrich Weigand
cee00c6591 s390x: Refactor branch and jumptable emission
The BranchTarget abstraction is no longer needed, since all branches are
being emitted using a MachLabel target.  Remove BranchTarget and simply
use MachLabel everywhere a branch target is required.  (This brings the
s390x back-end in line with what x64 does as well.)

In addition, simplify jumptable emission by moving all instructions
that do not depend on the internal label (i.e. the conditional branch
to the default label, as well as the scaling the index register) out of
the combined JTSequence instruction.

This refactoring will make moving branch generation to ISLE easier.
2022-01-24 12:22:53 +01:00
Ulrich Weigand
a94e72b5b7 s390x: Add ISLE support
This adds ISLE support for the s390x back-end and moves lowering
of most instructions to ISLE.  The only instructions still remaining
are calls, returns, traps, and branches, most of which will need
additional support in common code.

Generated code is not intended to be (significantly) different
than before; any additional optimizations now made easier to
implement due to the ISLE layer can be added in follow-on patches.

There were a few differences in some filetests, but those are all
just simple register allocation changes (and all to the better!).
2022-01-21 19:30:56 +01:00