Commit Graph

43 Commits

Author SHA1 Message Date
Ulrich Weigand
3e5938e65a Support big- and little-endian lane order with bitcast (#5196)
Add a MemFlags operand to the bitcast instruction, where only the
`big` and `little` flags are accepted.  These define the lane order
to be used when casting between types of different lane counts.

Update all users to pass an appropriate MemFlags argument.

Implement lane swaps where necessary in the s390x back-end.

This is the final part necessary to fix
https://github.com/bytecodealliance/wasmtime/issues/4566.
2022-11-07 14:41:10 -08:00
Afonso Bordado
3ef30b5b67 cranelift: Rename i{min,max} to s{min,max} (#5187)
This brings these instructions with our general naming convention
of signed instructions being prefixed with `s`.
2022-11-03 18:20:33 +00:00
Afonso Bordado
2c69b94744 cranelift: Add support for bswap.i128 (#5186)
* fuzzgen: Request only one variable for bswap

This was included by accident. Bswap only has one input, instead of two.

* cranelift: Add `bswap.i128` support

Adds support only for x86, AArch64, S390X.

RISCV does not yet have bswap.
2022-11-03 18:03:37 +00:00
Trevor Elliott
aeceea28e2 Remove trapif and trapff (#5162)
This branch removes the trapif and trapff instructions, in favor of using an explicit comparison and trapnz. This moves us closer to removing iflags and fflags, but introduces the need to implement instructions like iadd_cout in the x64 and aarch64 backends.
2022-11-03 09:25:11 -07:00
Ulrich Weigand
961107ec63 Merge raw_bitcast and bitcast (#5175)
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends

This implements the second step of the plan outlined here:
https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394
2022-11-02 10:16:27 -07:00
11evan
4ca9e82bd1 cranelift: Add Bswap instruction (#1092) (#5147)
Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.

* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instrutions (Neg and Not) - it
only operates on a dst register, but is parameterized with both
a src and dst which are expected to be the same register.

As x64 bswap instruction is only for 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.

Updated x64 RexFlags type to support emitting for single-operand
instructions like bswap

* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.

* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing

* For completeness, added bswap to the interpreter as well.

* added filetests and runtests for each ISA

* added bswap to fuzzgen, thanks to afonso360 for the code there

* 128-bit swaps are not yet implemented, that can be done later
2022-10-31 19:30:00 +00:00
Trevor Elliott
02620441c3 Add uadd_overflow_trap (#5123)
Add a new instruction uadd_overflow_trap, which is a fused version of iadd_ifcout and trapif. Adding this instruction removes a dependency on the iflags type, and would allow us to move closer to removing it entirely.

The instruction is defined for the i32 and i64 types only, and is currently only used in the legalization of heap_addr.
2022-10-27 09:43:15 -07:00
Afonso Bordado
4867813f77 cranelift: Remove copy instruction (#5125) 2022-10-25 17:27:33 -07:00
Ulrich Weigand
39b3b1d772 s390x: Fix handling of sret arguments (#5116)
Skip synthetic StructReturn entries in the return value list.
Fixes https://github.com/bytecodealliance/wasmtime/issues/5089
2022-10-25 10:40:10 -07:00
Trevor Elliott
ec12415b1f cranelift: Remove redundant branch and select instructions (#5097)
As discussed in the 2022/10/19 meeting, this PR removes many of the branch and select instructions that used iflags, in favor if using brz/brnz and select in their place. Additionally, it reworks selectif_spectre_guard to take an i8 input instead of an iflags input.

For reference, the removed instructions are: br_icmp, brif, brff, trueif, trueff, and selectif.
2022-10-24 16:14:35 -07:00
Ulrich Weigand
bfcf6616fe s390x: clean up remnants of non-SSA code generation (#5096)
Eliminate a few remaining instances of non-SSA code.
Remove infrastructure previously used for non-SSA code emission.
Related cleanup around flags handling.
2022-10-24 12:40:50 -07:00
Ulrich Weigand
9dadba60a0 s390x: use constraints for call arguments and return values (#5092)
Use the regalloc constraint-based CallArgList / CallRetList
mechanism instead of directly using physregs in instructions.
2022-10-21 11:01:22 -07:00
Chris Fallin
86e77953f8 Fix some egraph-related issues. (#5088)
This fixes #5086 by addressing two separate issues:

- The `ValueDataPacked::set_type()` helper had an embarrassing bitfield-manipulation bug that would mangle the rest of a `ValueDef` when setting its type. This is not normally used, only when the egraph elaboration fills in types after-the-fact on a multi-value node.
- The lowering rules for `isplit` on aarch64 and s390x were dispatching on the first output type, rather than the input type. When only the second output is used (as in the example in #5086), the first output type actually remains `INVALID` (and this is fine because it's never used).
2022-10-21 10:24:48 -07:00
Trevor Elliott
d9753fac2b Remove uses of reg_mod from s390x (#5073)
Remove uses of reg_mod from the s390x backend. This required moving away from using r0/r1 as the result registers from a few different pseudo instructions, standardizing instead on r2/r3. That change was necessary as regalloc2 will not correctly allocate registers that aren't listed in the allocatable set, which r0/r1 are not.

Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-21 09:22:16 -07:00
Trevor Elliott
32a7593c94 cranelift: Remove booleans (#5031)
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.

Fixes #3205

Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-10-17 16:00:27 -07:00
Trevor Elliott
a209cb63f5 ISLE: Enable the overlap checker (#5011)
This PR turns the overlap checker on by default, requiring the use of priorities to resolve overlap between rules.
2022-10-04 21:56:49 +00:00
Trevor Elliott
c9ff14e00b Resolve overlap in the s390x backend (#5002)
Resolve overlap in the s390x backend by adding rule priorities to disambiguate rule order.
2022-10-03 17:06:10 -07:00
Nick Fitzgerald
f18a1f1488 Cranelift: Deduplicate ABI signatures during lowering (#4829)
* Cranelift: Deduplicate ABI signatures during lowering

This commit creates the `SigSet` type which interns and deduplicates the ABI
signatures that we create from `ir::Signature`s. The ABI signatures are now
referred to indirectly via a `Sig` (which is a `cranelift_entity` ID), and we
pass around a `SigSet` to anything that needs to access the actual underlying
`SigData` (which is what `ABISig` used to be).

I had to change a couple methods to return a `SmallInstVec` instead of emitting
directly to work around what would otherwise be shared and exclusive borrows of
the lowering context overlapping. I don't expect any of these to heap allocate
in practice.

This does not remove the often-unnecessary allocations caused by
`ensure_struct_return_ptr_is_returned`. That is left for follow up work.

This also opens the door for further shuffling of signature data into more
efficient representations in the future, now that we have `SigSet` to store it
all in one place and it is threaded through all the code. We could potentially
move each signature's parameter and return vectors into one big vector shared
between all signatures, for example, which could cut down on allocations and
shrink the size of `SigData` since those `SmallVec`s have pretty large inline
capacity.

Overall, this refactoring gives a 1-7% speedup for compilation on
`pulldown-cmark`:

```
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 8754213.66 ± 7526266.23 (confidence = 99%)

  dedupe.so is 1.01x to 1.07x faster than main.so!

  [191003295 234620642.20 280597986] dedupe.so
  [197626699 243374855.86 321816763] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [170406200 194299792.68 253001201] dedupe.so
  [172071888 193230743.11 223608329] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3870997347 4437735062.59 5216007266] dedupe.so
  [4019924063 4424595349.24 4965088931] main.so
```

* Use full path instead of import to avoid warnings in some build configurations

Warnings will then cause CI to fail.

* Move `SigSet` into `VCode`
2022-08-31 20:39:32 +00:00
Chris Fallin
8e8dfdf5f9 AArch64: Migrate calls and returns to ISLE. (#4788) 2022-08-26 16:26:39 -07:00
Ulrich Weigand
67870d1518 s390x: Support both big- and little-endian vector lane order (#4682)
This implements the s390x back-end portion of the solution for
https://github.com/bytecodealliance/wasmtime/issues/4566

We now support both big- and little-endian vector lane order
in code generation.  The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
  (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
  (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:

- Ensure we always select a Wasmtime calling convention on s390x
  in crates/cranelift (func_signature).

- Fix vector immediates for filetests/runtests.  In PR #4427,
  I attempted to fix this by byte-swapping the V128 value, but
  with the new scheme, we'd instead need to perform a per-lane
  byte swap.  Since we do not know the actual type in write_to_slice
  and read_from_slice, this isn't easily possible.

  Revert this part of PR #4427 again, and instead just mark the
  memory buffer as little-endian when emitting the trampoline;
  the back-end will then emit correct code to load the constant.

- Change a runtest in simd-bitselect-to-vselect.clif to no longer
  make little-endian lane order assumptions.

- Remove runtests in simd-swizzle.clif that make little-endian
  lane order assumptions by relying on implicit type conversion
  when using a non-i16x8 swizzle result type (this feature should
  probably be removed anyway).

Tested with both wasmtime and cg_clif.
2022-08-11 12:10:46 -07:00
Ulrich Weigand
50fcab2984 s390x: Implement tls_value (#4616)
Implement the tls_value for s390 in the ELF general-dynamic mode.

Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
  needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
  the GOT (global offset table) is passed in %r12.

This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker.  This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).

We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.

Needed by the cg_clif frontend.
2022-08-10 10:02:07 -07:00
Ulrich Weigand
f552a53654 s390x: Implement bitrev (#4617)
Since we do not have an instruction for this, this is a simple
open-coded implementation.

Needed by the cg_clif frontend.
2022-08-04 16:24:55 -07:00
Ulrich Weigand
b17b1eb25d [s390x, abi_impl] Add i128 support (#4598)
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.

The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call.  This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.

To do so, we add a new variant ABIArg::ImplicitArg.  This acts like
StructArg, except that the value type is the actual target type,
not a pointer type.  The required conversions have to be inserted
in the prologue and at function call sites.

Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values.  In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.

For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI.  The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.

(This implements the second half of issue #4565.)
2022-08-04 20:41:26 +00:00
Ulrich Weigand
b9dd48e34b [s390x, abi_impl] Support struct args using explicit pointers (#4585)
This adds support for StructArgument on s390x.  The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.

To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.

One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism.  Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.

However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot.  Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args.  This required updates
to all back ends.

In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers).  This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
  needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.

(This implements the first half of issue #4565.)
2022-08-03 19:00:07 +00:00
Nick Fitzgerald
42bba452a6 Cranelift: Add instructions for getting the current stack/frame/return pointers (#4573)
* Cranelift: Add instructions for getting the current stack/frame pointers and return address

This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535

* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead

We just special case getting operands from `Amode`s now.

* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`

* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp

* Handle non-allocatable registers in Amode::with_allocs

* Use "stack" instead of "r15" on s390x

* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
2022-08-02 14:37:17 -07:00
Chris Fallin
43f1765272 Cranellift: remove Baldrdash support and related features. (#4571)
* Cranellift: remove Baldrdash support and related features.

As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.

In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.

This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the  understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425

* Review feedback.

* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.

The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.

* Review feedback.
2022-08-02 19:37:56 +00:00
Trevor Elliott
586ec95c11 ISLE: Allow shadowing in let expressions (#4562)
* Support shadowing in isle

* Re-run the isle build.rs if the examples change

* Print error messages when isle tests fail

* Move run tests

* Refactor `let` uses that don't need to introduce unique names
2022-08-01 21:10:28 +00:00
Anton Kirilov
ead6edb0c5 Cranelift AArch64: Migrate Splat to ISLE (#4521)
Copyright (c) 2022, Arm Limited.
2022-07-26 17:57:15 +00:00
Ulrich Weigand
dd40bf075a s390x: Enable more runtests, and fix a few bugs (#4516)
This enables more runtests to be executed on s390x.  Doing so
uncovered a two back-end bugs, which are fixed as well:

- The result of cls was always off by one.
- The result of popcnt.i16 has uninitialized high bits.

In addition, I found a bug in the load-op-store.clif test case:
     v3 = heap_addr.i64 heap0, v1, 4
     v4 = iconst.i64 42
     store.i32 v4, v3
This was clearly intended to perform a 32-bit store, but
actually performs a 64-bit store (it seems the type annotation
of the store opcode is ignored, and the type of the operand
is used instead).  That bug did not show any noticable symptoms
on little-endian architectures, but broke on big-endian.
2022-07-25 12:37:06 -07:00
Ulrich Weigand
638dc4e0b3 s390x: Implement full SIMD support (#4427)
This adds full support for all Cranelift SIMD instructions
to the s390x target.  Everything is matched fully via ISLE.

In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not.  This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).

The bulk of the changes are platform-specific, but there are
a few exceptions:

- Added ISLE extractors for the Immediate and Constant types,
  to enable matching the vconst and swizzle instructions.

- Added a missing accessor for call_conv to ABISig.

- Fixed endian conversion for vector types in data_value.rs
  to enable their use in runtests on the big-endian platforms.

- Enabled (nearly) all SIMD runtests on s390x.  [ Two test cases
  remain disabled due to vector shift count semantics, see below. ]

- Enabled all Wasmtime SIMD tests on s390x.

There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged.  I've opened the
following issues to track them:

- Vector shift count semantics
  https://github.com/bytecodealliance/wasmtime/issues/4424

- is_included_in_clobbers vs. link register
  https://github.com/bytecodealliance/wasmtime/issues/4425

- gen_constant callback
  https://github.com/bytecodealliance/wasmtime/issues/4426

All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
2022-07-18 14:00:48 -07:00
Sam Parker
9c43749dfe [RFC] Dynamic Vector Support (#4200)
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.

Copyright (c) 2022, Arm Limited.
2022-07-07 12:54:39 -07:00
Ulrich Weigand
ec83144c88 s390x: use full vector register file for FP operations (#4360)
This defines the full set of 32 128-bit vector registers on s390x.
(Note that the VRs overlap the existing FPRs.)  In addition, this
adds support to use all 32 vector registers to implement floating-
point operations, by using vector floating-point instructions with
the 'W' bit set to operate only on the first element.

This part of the vector instruction set mostly matches the old FP
instruction set, with two exceptions:

- There is no vector version of the COPY SIGN instruction.  Instead,
  now use a VECTOR SELECT with an appropriate bit mask to implement
  the fcopysign operation.

- There are no vector version of the float <-> int conversion
  instructions where source and target differ in bit size.  Use
  appropriate multiple conversion steps instead.  This also requires
  use of explicit checking to implement correct overflow handling.
  As a side effect, this version now also implements the i8 / i16
  variants of all conversions, which had been missing so far.

For all operations except those two above, we continue to use the
old FP instruction if applicable (i.e. if all operands happen to
have been allocated to the original FP register set), and use the
vector instruction otherwise.
2022-06-30 16:33:39 -07:00
Ulrich Weigand
7a9479f77c ISLE: Migrate call and return instructions (#3785)
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.

To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE.  All actual
code generation is then done in ISLE rules, following the
information provided by that signature.

[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]

To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
2022-06-29 14:22:50 -07:00
Ulrich Weigand
0243a16679 s390x: Fix bitwise operations (#4146)
Current codegen had a number of logic errors confusing
NAND with AND WITH COMPLEMENT, and NOR with OR WITH COMPLEMENT.

Add support for the missing z15 instructions and fix logic.
2022-05-12 10:05:22 -07:00
Chris Fallin
03793b71a7 ISLE: remove all uses of argument polarity, and remove it from the language. (#4091)
This PR removes "argument polarity": the feature of ISLE extractors that lets them take
inputs aside from the value to be matched.

Cases that need this expressivity have been subsumed by #4072 with if-let clauses;
we can now finally remove this misfeature of the language, which has caused significant
confusion and has always felt like a bit of a hack.

This PR (i) removes the feature from the ISLE compiler; (ii) removes it from the reference
documentation; and (iii) refactors away all uses of the feature in our three existing
backends written in ISLE.
2022-05-02 09:52:12 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Ulrich Weigand
07d615d3f7 ISLE: Lowering of multi-output instructions (#3783)
This changes the output of the `lower` constructor from a
`ValueRegs` to a new `InstOutput` type, which is a vector
of `ValueRegs`.

Code in `lower_common` is updated to use this new type to
handle instructions with multiple outputs.  All back-ends
are updated to use the new type.
2022-02-24 14:03:06 -08:00
Chris Fallin
e8881b2cc0 ISLE lowering rules: make use of implicit conversions. (#3847)
This PR makes use of the new implicit-conversion feature of the ISLE DSL
that was introduced in #3807 in order to make the lowering rules
significantly simpler and more concise.

The basic idea is to eliminate the repetitive and mechanical use of
terms that convert from one type to another when there is only one real
way to do the conversion -- for example, to go from a `WritableReg` to a
`Reg`, the only sensible way is to use `writable_reg_to_reg`.

This PR generally takes any term of the form "A_to_B" and makes it an
automatic conversion, as well as some others that are similar in spirit.

The notable exception to the pure-value-convsion category is the
`put_in_reg` family of operations, which actually do have side-effects.
However, as noted in the doc additions in #3807, this is fine as long as
the side-effects are idempotent. And on balance, making `put_in_reg`
automatic is a significant clarity win -- together with other operand
converters, it enables rules like:

```
;; Add two registers.
(rule (lower (has_type (fits_in_64 ty)
                       (iadd x y)))
      (add ty x y))
```

There may be other converters that we could define to make the rules
even simpler; we can make such improvements as we think of them, but
this should be a good start!
2022-02-23 16:14:38 -08:00
Andrew Brown
f87c61176a x64: port select to ISLE (#3682)
* x64: port `select` using an FP comparison to ISLE

This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
 - it adds a way to emit a `cmove` with multiple OR-ing conditions;
   because x64 ISLE cannot currently safely emit a comparison followed
   by several jumps, this adds `MachInst::CmoveOr` and
   `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
   instructions hide the multi-instruction sequence in `lower.isle`
 - to properly keep track of what instructions consume and produce
   flags, @cfallin added a way to pass around variants of
   `ConsumesFlags` and `ProducesFlags`--these changes affect all
   backends
 - then, to lower the `fcmp + select` CLIF, this change adds several
   `cmove*_from_values` helpers that perform all of the awkward
   conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
   upside is that now these lowerings have much-improved documentation
   explaining why the various `FloatCC` and `CC` choices are made the
   the way they are.

Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-02-23 10:03:16 -08:00
Ulrich Weigand
10198553c7 ISLE: Common accessors for some insn data fields (#3781)
Add accessors to prelude.isle to access data fields of
`func_addr` and `symbol_value` instructions.

These are based on similar versions I had added to the s390x
back-end, but are a bit more straightforward to use.

- func_ref_data: Extract SigRef, ExternalName, and RelocDistance
  fields given a FuncRef.

- symbol_value_data: Extract ExternalName, RelocDistance, and
  offset fields given a GlobalValue representing a Symbol.

- reloc_distance_near: Test for RelocDistance::Near.

The s390x back-end is changed to use these common versions.

Note that this exposed a bug in common isle code: This extractor:

(extractor (load_sym inst)
  (and inst
       (load _ (def_inst (symbol_value
                           (symbol_value_data _
                             (reloc_distance_near) offset)))
               (i64_from_offset
                 (memarg_symbol_offset_sum <offset _)))))

would raise an assertion in sema.rs due to a supposed cycle in
extractor definitions.  But there was no actual cycle, it was
simply that the extractor tree refers twice to the `insn_data`
extractor (once via the `load` and once via the `symbol_value`
extractor).  Fixed by checking for pre-existing definitions only
along one path in the tree, not across the whole tree.
2022-02-08 17:57:27 -08:00
Ulrich Weigand
9c5c872b3b s390x: Add support for all remaining atomic operations (#3746)
This adds support for all atomic operations that were unimplemented
so far in the s390x back end:
- atomic_rmw operations xchg, nand, smin, smax, umin, umax
- $I8 and $I16 versions of atomic_rmw and atomic_cas
- little endian versions of atomic_rmw and atomic_cas

All of these have to be implemented by a compare-and-swap loop;
and for the $I8 and $I16 versions the actual atomic instruction
needs to operate on the surrounding aligned 32-bit word.

Since we cannot emit new control flow during ISLE instruction
selection, these compare-and-swap loops are emitted as a single
meta-instruction to be expanded at emit time.

However, since there is a large number of different versions of
the loop required to implement all the above operations, I've
implemented a facility to allow specifying the loop bodies
from within ISLE after all, by creating a vector of MInst
structures that will be emitted as part of the meta-instruction.

There are still restrictions, in particular instructions that
are part of the loop body may not modify any virtual register.
But even so, this approach looks preferable to doing everything
in emit.rs.

A few instructions needed in those compare-and-swap loop bodies
were added as well, in particular the RxSBG family of instructions
as well as the LOAD REVERSED in-register byte-swap instructions.

This patch also adds filetest runtests to verify the semantics
of all operations, in particular the subword and little-endian
variants (those are currently only executed on s390x).
2022-02-08 13:48:44 -08:00
Ulrich Weigand
36369a6f35 s390x: Migrate branches and traps to ISLE
In order to migrate branches to ISLE, we define a second entry
point `lower_branch` which gets the list of branch targets as
additional argument.

This requires a small change to `lower_common`: the `isle_lower`
callback argument is changed from a function pointer to a closure.
This allows passing the extra argument via a closure.

Traps make use of the recently added facility to emit safepoints
from ISLE, but are otherwise straightforward.
2022-01-25 18:15:32 +01:00
Ulrich Weigand
a94e72b5b7 s390x: Add ISLE support
This adds ISLE support for the s390x back-end and moves lowering
of most instructions to ISLE.  The only instructions still remaining
are calls, returns, traps, and branches, most of which will need
additional support in common code.

Generated code is not intended to be (significantly) different
than before; any additional optimizations now made easier to
implement due to the ISLE layer can be added in follow-on patches.

There were a few differences in some filetests, but those are all
just simple register allocation changes (and all to the better!).
2022-01-21 19:30:56 +01:00