This fixes two problems: minimum symbol alignment for the LARL
instruction, and alignment requirements for LRL/LGRL etc.
The first problem is that the LARL instruction used to load a
symbol address (PC relative) requires that the target symbol
is at least 2-byte aligned. This is always guaranteed for code
symbols (all instructions must be 2-aligned anyway), but not
necessarily for data symbols.
Other s390x compilers fix this problem by ensuring that all
global symbols are always emitted with a minimum 2-byte
alignment. This patch introduces an equivalent mechanism
for cranelift:
- Add a symbol_alignment routine to TargetIsa, similar to the
existing code_section_alignment routine.
- Respect symbol_alignment as minimum alignment for all symbols
emitted in the object backend (code and data).
The second problem is that PC-relative instructions that
directly *access* data (like LRL/LGRL, STRL/STGRL etc.)
not only have the 2-byte requirement like LARL, but actually
require that their memory operand is *naturally* aligned
(i.e. alignment is at least the size of the access).
This property (natural alignment for memory accesses) is
supposed to be provided by the "aligned" flag in MemFlags;
however, this is not implemented correctly at the moment.
To fix this, this patch:
- Only emits PC-relative memory access instructions if the
"aligned" flag is set in the associated MemFlags.
- Fixes a bug in emit_small_memory_copy and emit_small_memset
which currently set the aligned flag unconditionally, ignoring
the actual alignment info passed by their caller.
Tested with wasmtime and cg_clif.
* Add a test for iadd_pairwise with swiden input
* Implement iadd_pairwise for swiden_{low,high} input
* Add a test case for iadd_pairwise with uwiden input
* Implement iadd_pairwise with uwiden
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
* cranelift: Remove shifts-small-types runtests
These were moved to the main shifts file in #4519 but this file was accidentaly left in tree.
It also fixes the missing sshr_i8_i8 testcase
* cranelift: Add shifts to fuzzer
* cranelift: Add extends to fuzzer
* Add a test for the existing behavior of fcvt_from_unit
* Migrate the I8, I16, I32 cases of fcvt_from_uint
* Implement the I64 case of fcvt_from_uint
* Add a test for the existing behavior of fcvt_from_uint.f64x2
* Migrate fcvt_from_uint.f64x2 to ISLE
* Lower the last case of `fcvt_from_uint`
* Add a test for `fcvt_from_uint`
* Finish lowering fcmp_from_uint
* Format
This implements the s390x back-end portion of the solution for
https://github.com/bytecodealliance/wasmtime/issues/4566
We now support both big- and little-endian vector lane order
in code generation. The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)
Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
(insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
(iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)
In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.
A small number of changes to common code were also needed:
- Ensure we always select a Wasmtime calling convention on s390x
in crates/cranelift (func_signature).
- Fix vector immediates for filetests/runtests. In PR #4427,
I attempted to fix this by byte-swapping the V128 value, but
with the new scheme, we'd instead need to perform a per-lane
byte swap. Since we do not know the actual type in write_to_slice
and read_from_slice, this isn't easily possible.
Revert this part of PR #4427 again, and instead just mark the
memory buffer as little-endian when emitting the trampoline;
the back-end will then emit correct code to load the constant.
- Change a runtest in simd-bitselect-to-vselect.clif to no longer
make little-endian lane order assumptions.
- Remove runtests in simd-swizzle.clif that make little-endian
lane order assumptions by relying on implicit type conversion
when using a non-i16x8 swizzle result type (this feature should
probably be removed anyway).
Tested with both wasmtime and cg_clif.
* Fix sret for AArch64
AArch64 requires the struct return address argument to be stored in the x8
register. This register is never used for regular arguments.
* Add extra sret tests for x86_64
Implement the tls_value for s390 in the ELF general-dynamic mode.
Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
the GOT (global offset table) is passed in %r12.
This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker. This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).
We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.
Needed by the cg_clif frontend.
* cranelift: Upgrade libm to 0.2.4
This resolves an issue with incorrect fmaf on the x86_64-pc-windows-gnu target under some inputs.
See: #4517
* supply-chain: Vet `libm` 0.2.4
* cranelift: Use JIT in runtests
Using `cranelift-jit` in run tests allows us to preform relocations and
libcalls. This is important since some instruction lowerings fallback
to libcall's when an extension is missing, or when it's too complicated
to implement manually.
This is also a first step to being able to test `call`'s between functions
in the runtest suite. It should also make it easier to eventually test
TLS relocations, symbol resolution and ABI's.
Another benefit of this is that we also get to test the JIT more, since
it now runs the runtests, and gets some fuzzing via `fuzzgen` (which
uses the `SingleFunctionCompiler`).
This change causes regressions in terms of runtime for the filetests.
I haven't done any serious benchmarking but what I've been seeing is
that it now takes about ~3 seconds to run the testsuite while it
previously took around 2 seconds.
* Add FMA tests for X86
This is a nonsensical constraint: the entry block must come first in the
compiled code's layout, so it cannot also be sunk to the end of the
function.
This PR modifies the CLIF verifier to disallow this situation entirely.
It also adds an assert during final block-order computation to catch the
problem (and avoid a silent miscompile) even if the verifier is
disabled.
Fixes#4656.
Ported the existing implementations of the following opcodes on AArch64
to ISLE:
- `AvgRound`
- Also introduced support for `i64x2` vectors, as per the docs.
- `SqmulRoundSat`
Copyright (c) 2022 Arm Limited
* Cranelift: Don't print "skipped TEST can't run aarch64" on x64, etc
It's way too noisy. Move it to the logs.
* Cranelift: Enable Cranelift trace logs in `clif-util` by default
* cranelift-filetest: use `log::warn!` for warnings
Instead of `println!`
* rustfmt
* Convert `fma`, `valltrue` & `vanytrue` to ISLE (AArch64)
Ported the existing implementations of the following opcodes to ISLE on
AArch64:
- `fma`
- Introduced missing support for `fma` on vector values, as per the
docs.
- `valltrue`
- `vanytrue`
Also fixed `fcmp` on scalar values in the interpreter, and enabled
interpreter tests in `simd-fma.clif`.
This introduces the `FMLA` machine instruction.
Copyright (c) 2022 Arm Limited
* Add comments for `Fmla` and `Bsl`
Copyright (c) 2022 Arm Limited
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.
The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call. This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.
To do so, we add a new variant ABIArg::ImplicitArg. This acts like
StructArg, except that the value type is the actual target type,
not a pointer type. The required conversions have to be inserted
in the prologue and at function call sites.
Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values. In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.
For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI. The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.
(This implements the second half of issue #4565.)
This adds support for StructArgument on s390x. The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.
To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.
One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism. Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.
However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot. Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args. This required updates
to all back ends.
In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers). This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.
(This implements the first half of issue #4565.)
Give the user the option to sign and to authenticate function
return addresses with the operations introduced by the Pointer
Authentication extension to the Arm instruction set architecture.
Copyright (c) 2021, Arm Limited.
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
The gen_copy_arg_to_regs routine currently ignores argument extension
flags when loading incoming arguments. This causes a problem with
stack arguments on big-endian systems, since the argument address
points to the word on the stack as extended by the caller, but the
generated code only loads the inner type from the address, causing
it to receive an incorrect value. (This happens to work on little-
endian systems.)
Fixed by loading extended arguments as full words.
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* Move `emit_to_memory` to `MachCompileResult`
This small refactoring makes it clearer to me that emitting to memory
doesn't require anything else from the compilation `Context`. While it's
a trivial change, it's a small public API change that shouldn't cause
too much trouble, and doesn't seem RFC-worthy. Happy to hear different
opinions about this, though!
* hide the MachCompileResult behind a method
* Add a `CompileError` wrapper type that references a `Function`
* Rename MachCompileResult to CompiledCode
* Additionally remove the last unsafe API in cranelift-codegen
* cranelift: Add MinGW `fma` regression tests
* cranelift: Fix FMA in interpreter
* cranelift: Add separate `fma` test suite for the interpreter
The interpreter can run `fma.clif` on most platforms, however on
`x86_64-pc-windows-gnu` we use libm which has issues with some inputs.
We should delete `fma-interpreter.clif` and enable the interpreter on
the main `fma.clif` file once those are fixed.
Ported the existing implementation of the following Opcodes for AArch64
to ISLE:
- `Fence`
- `IsNull`
- `IsInvalid`
- `Debugtrap`
Copyright (c) 2022 Arm Limited
* cranelift: Reorganize test suite
Group some SIMD operations by instruction.
* cranelift: Deduplicate some shift tests
Also, new tests with the mod behaviour
* aarch64: Lower shifts with mod behaviour
* x64: Lower shifts with mod behaviour
* wasmtime: Don't mask SIMD shifts
* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* Allow 64-bit vectors and implement for interpreter
The AArch64 backend already supports 64-bit vectors; this simply allows
instructions to make use of that.
Implemented support for 64-bit vectors within the interpreter to allow
interpret runtests to use them.
Copyright (c) 2022 Arm Limited
* Disable 64-bit SIMD `iaddpairwise` tests on s390x
Copyright (c) 2022 Arm Limited
* [AArch64] Port SIMD narrowing to ISLE
Fvdemote, snarrow, unarrow and uunarrow.
Also refactor the aarch64 instructions descriptions to parameterize
on ScalarSize instead of using different opcodes.
The zero_value pure constructor has been introduced and used by the
integer narrow operations and it replaces, and extends, the compare
zero patterns.
Copright (c) 2022, Arm Limited.
* use short 'if' patterns