Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes#3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
This is a simple error in the const-prop rules: uextend was not
masking iconst's u64 immediate when extending from i32 to
i64. Arguably an iconst.i32 should not have nonzero bits in the upper
32 of its immediate, but that's a separate design question. For now,
if our invariant is that the upper bits are ignored, then it is
required to mask the bits when const-evaling a `uextend`.
Fixes#5047.
* egraph-based midend: draw the rest of the owl.
* Rename `egg` submodule of cranelift-codegen to `egraph`.
* Apply some feedback from @jsharp during code walkthrough.
* Remove recursion from find_best_node by doing a single pass.
Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.
* Make elaboration non-recursive.
Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).
* Make elaboration traversal of the domtree non-recursive/stack-safe.
* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.
* Apply static recursion limit to rule application.
* Fix aarch64 wrt dynamic-vector support -- broken rebase.
* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!
* Fix multi-result call testcase.
* Include `cranelift-egraph` in `PUBLISHED_CRATES`.
* Fix atomic_rmw: not really a load.
* Remove now-unnecessary PartialOrd/Ord derivations.
* Address some code-review comments.
* Review feedback.
* Review feedback.
* No overlap in mid-end rules, because we are defining a multi-constructor.
* rustfmt
* Review feedback.
* Review feedback.
* Review feedback.
* Review feedback.
* Remove redundant `mut`.
* Add comment noting what rules can do.
* Review feedback.
* Clarify comment wording.
* Update `has_memory_fence_semantics`.
* Apply @jameysharp's improved loop-level computation.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion commit.
* Fix off-by-one in new loop-nest analysis.
* Review feedback.
* Review feedback.
* Review feedback.
* Use `Default`, not `std::default::Default`, as per @fitzgen
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Apply @fitzgen's comment elaboration to a doc-comment.
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Add stat for hitting the rewrite-depth limit.
* Some code motion in split prelude to make the diff a little clearer wrt `main`.
* Take @jameysharp's suggested `try_into()` usage for blockparam indices.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Take @jameysharp's suggestion to avoid double-match on load op.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion (add import).
* Review feedback.
* Fix stack_load handling.
* Remove redundant can_store case.
* Take @jameysharp's suggested improvement to FuncEGraph::build() logic
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Tweaks to FuncEGraph::build() on top of suggestion.
* Take @jameysharp's suggested clarified condition
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Clean up after suggestion (unused variable).
* Fix loop analysis.
* loop level asserts
* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.
* Take @jameysharp's suggestion re: result_tys
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix up after suggestion
* Take @jameysharp's suggestion to use fold rather than reduce
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fixup after suggestion
* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.
* Clarifying comment in terminator insts.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Resolve overlap in the ISLE prelude and the x64 inst module by introducing new types that allow better sharing of extractor resuls, or falling back on priorities.
* Port branches to ISLE (AArch64)
Ported the existing implementations of the following opcodes for AArch64
to ISLE:
- `Brz`
- `Brnz`
- `Brif`
- `Brff`
- `BrIcmp`
- `Jump`
- `BrTable`
Copyright (c) 2022 Arm Limited
* Remove dead code
Copyright (c) 2022 Arm Limited
* Port `icmp` to ISLE (AArch64)
Ported the existing implementation of `icmp` (and, by extension, the
`lower_icmp` function) to ISLE for AArch64.
Copyright (c) 2022 Arm Limited
* Allow 'producer chains', eliminating `Nop0`s
Copyright (c) 2022 Arm Limited
* ABI: implement register arguments with constraints.
Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.
For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.
This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.
Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!
A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.
* Update based on review feedback.
* Review feedback.
The previous implementation assumed that nothing had clobbered the
LR register since the current function had started executing, so
it would be incorrect for a non-leaf function, for example, that
contains the `get_return_address` operation right after a call.
The operation is valid only if the `preserve_frame_pointers` flag
is enabled, which implies that the presence of a frame record on
the stack is guaranteed.
Copyright (c) 2022, Arm Limited.
* Cranelift: Deduplicate ABI signatures during lowering
This commit creates the `SigSet` type which interns and deduplicates the ABI
signatures that we create from `ir::Signature`s. The ABI signatures are now
referred to indirectly via a `Sig` (which is a `cranelift_entity` ID), and we
pass around a `SigSet` to anything that needs to access the actual underlying
`SigData` (which is what `ABISig` used to be).
I had to change a couple methods to return a `SmallInstVec` instead of emitting
directly to work around what would otherwise be shared and exclusive borrows of
the lowering context overlapping. I don't expect any of these to heap allocate
in practice.
This does not remove the often-unnecessary allocations caused by
`ensure_struct_return_ptr_is_returned`. That is left for follow up work.
This also opens the door for further shuffling of signature data into more
efficient representations in the future, now that we have `SigSet` to store it
all in one place and it is threaded through all the code. We could potentially
move each signature's parameter and return vectors into one big vector shared
between all signatures, for example, which could cut down on allocations and
shrink the size of `SigData` since those `SmallVec`s have pretty large inline
capacity.
Overall, this refactoring gives a 1-7% speedup for compilation on
`pulldown-cmark`:
```
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 8754213.66 ± 7526266.23 (confidence = 99%)
dedupe.so is 1.01x to 1.07x faster than main.so!
[191003295 234620642.20 280597986] dedupe.so
[197626699 243374855.86 321816763] main.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[170406200 194299792.68 253001201] dedupe.so
[172071888 193230743.11 223608329] main.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[3870997347 4437735062.59 5216007266] dedupe.so
[4019924063 4424595349.24 4965088931] main.so
```
* Use full path instead of import to avoid warnings in some build configurations
Warnings will then cause CI to fail.
* Move `SigSet` into `VCode`
Ported the existing implementation of `fcmp` for AArch64 to ISLE.
This also ports the `lower_vector_comparison` method to ISLE.
Copyright (c) 2022 Arm Limited
This retains `lower_amode` in the handwritten code (@akirilov-arm
reports that there is an upcoming patch to port this), but tweaks it
slightly to take a `Value` rather than an `Inst`.
Ensure that constants generated for the memory case of XmmMem values are always 16 bytes, ensuring that we don't accidantally perform an unaligned load.
Fixes#4761
Lower extractlane, scalar_to_vector and splat in ISLE.
This PR also makes some changes to the SinkableLoad api
* change the return type of sink_load to RegMem as there are more functions available for dealing with RegMem
* add reg_mem_to_reg_mem_imm and register it as an automatic conversion
Lower `shuffle` and `swizzle` in ISLE.
This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.
I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
* Port `vconst` to ISLE (AArch64)
Ported the existing implementation of `vconst` to ISLE for AArch64, and
added support for 64-bit vector constants.
Also introduced 64-bit `vconst` support to the interpreter.
Copyright (c) 2022 Arm Limited
* Replace if-chains with match statements
Copyright (c) 2022 Arm Limited
Implement the tls_value for s390 in the ELF general-dynamic mode.
Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
the GOT (global offset table) is passed in %r12.
This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker. This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).
We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.
Needed by the cg_clif frontend.
* Convert `fma`, `valltrue` & `vanytrue` to ISLE (AArch64)
Ported the existing implementations of the following opcodes to ISLE on
AArch64:
- `fma`
- Introduced missing support for `fma` on vector values, as per the
docs.
- `valltrue`
- `vanytrue`
Also fixed `fcmp` on scalar values in the interpreter, and enabled
interpreter tests in `simd-fma.clif`.
This introduces the `FMLA` machine instruction.
Copyright (c) 2022 Arm Limited
* Add comments for `Fmla` and `Bsl`
Copyright (c) 2022 Arm Limited
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.
The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call. This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.
To do so, we add a new variant ABIArg::ImplicitArg. This acts like
StructArg, except that the value type is the actual target type,
not a pointer type. The required conversions have to be inserted
in the prologue and at function call sites.
Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values. In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.
For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI. The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.
(This implements the second half of issue #4565.)
This adds support for StructArgument on s390x. The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.
To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.
One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism. Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.
However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot. Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args. This required updates
to all back ends.
In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers). This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.
(This implements the first half of issue #4565.)
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* Support shadowing in isle
* Re-run the isle build.rs if the examples change
* Print error messages when isle tests fail
* Move run tests
* Refactor `let` uses that don't need to introduce unique names
* [AArch64] Port SIMD narrowing to ISLE
Fvdemote, snarrow, unarrow and uunarrow.
Also refactor the aarch64 instructions descriptions to parameterize
on ScalarSize instead of using different opcodes.
The zero_value pure constructor has been introduced and used by the
integer narrow operations and it replaces, and extends, the compare
zero patterns.
Copright (c) 2022, Arm Limited.
* use short 'if' patterns
This adds full support for all Cranelift SIMD instructions
to the s390x target. Everything is matched fully via ISLE.
In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not. This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).
The bulk of the changes are platform-specific, but there are
a few exceptions:
- Added ISLE extractors for the Immediate and Constant types,
to enable matching the vconst and swizzle instructions.
- Added a missing accessor for call_conv to ABISig.
- Fixed endian conversion for vector types in data_value.rs
to enable their use in runtests on the big-endian platforms.
- Enabled (nearly) all SIMD runtests on s390x. [ Two test cases
remain disabled due to vector shift count semantics, see below. ]
- Enabled all Wasmtime SIMD tests on s390x.
There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged. I've opened the
following issues to track them:
- Vector shift count semantics
https://github.com/bytecodealliance/wasmtime/issues/4424
- is_included_in_clobbers vs. link register
https://github.com/bytecodealliance/wasmtime/issues/4425
- gen_constant callback
https://github.com/bytecodealliance/wasmtime/issues/4426
All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
* Convert `scalar_to_vector` to ISLE (AArch64)
Converted the exisiting implementation of `scalar_to_vector` for AArch64 to
ISLE.
Copyright (c) 2022 Arm Limited
* Add support for floats and fix FpuExtend
- Added rules to cover `f32 -> f32x4` and `f64 -> f64x2` for
`scalar_to_vector`
- Added tests for `scalar_to_vector` on floats.
- Corrected an invalid instruction emitted by `FpuExtend` on 64-bit
values.
Copyright (c) 2022 Arm Limited
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.
Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.
The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
Copyright (c) 2022, Arm Limited.
Move from passing and returning u8 and u16 values to u32 in many of
the functions. This removes a number of type conversions and gives
a small compilation time speedup, around ~0.7% on my aarch64 machine.
Copyright (c) 2022, Arm Limited.
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.
To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE. All actual
code generation is then done in ISLE rules, following the
information provided by that signature.
[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]
To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
* Allow emitting u64 constants into constant pool.
* Use constant pool for constants on x64 that do not fit in a simm32 and are needed as a RegMem or RegMemImm.
* Fix rip-relative addressing bug in pinsrd emission.
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).
This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.
(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this. This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)
Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
Also fix and extend the current implementation:
- AtomicRMWOp::Clr != AtomicRmwOp::And, as the input needs to be
inverted first.
- Inputs to the cmp for the RMWLoop case are sign-extended when
needed.
- Lower Xchg to Swp.
- Lower Sub to Add with a negated input.
- Added more runtests.
Copyright (c) 2022, Arm Limited.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
This change moves the majority of the lowerings for CLIF's `load`
instruction over to ISLE. To do so, it also migrates the previous
mechanism for creating an `Amode` (`lower_to_amode`) to several ISLE
rules (see `to_amode`).
* x64: port scalar `fcmp` to ISLE
Implement the CLIF lowering for the `fcmp` to ISLE. This adds a new
type-matcher, `ty_scalar_float`, for detecting uses of `F32` and `F64`.
* isle: rename `vec128` to `ty_vec12`
This refactoring changes the name of the `vec128` matcher function to
follow the `ty_*` convention of the other type matchers. It also makes
the helper an inline function call.
* x64: port vector `fcmp` to ISLE