* Port `vconst` to ISLE (AArch64)
Ported the existing implementation of `vconst` to ISLE for AArch64, and
added support for 64-bit vector constants.
Also introduced 64-bit `vconst` support to the interpreter.
Copyright (c) 2022 Arm Limited
* Replace if-chains with match statements
Copyright (c) 2022 Arm Limited
Fixes#4736
Fix lowerings that were using values as both a Reg and a RegMem, making it look like a load could be sunk while its value in a register was still being used. Also add an assert that checks that loads that are sunk are never used.
* Cranelift: Remove the `ABICaller` trait
It has only one implementation: the `ABICallerImpl` struct. We can just use that
directly rather than having extra, unnecessary layers of generics and abstractions.
* Cranelift: Rename `ABICallerImpl` to `Caller`
* Cranelift: Remove `ABICallee` trait
It has only one implementation: the `ABICalleeImpl` struct. By using that
directly we can avoid unnecessary layers of generics and abstractions as well as
a couple `Box`es that were previously putting the single implementation into a
`Box<dyn>`.
* Cranelift: Rename `ABICalleeImpl` to `AbiCallee`
* Fix comments as per review
* Rename `AbiCallee` to `Callee`
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
The trait had only one implementation: the `Lower` struct. It is easier to just
use that directly, and not introduce unnecessary layers of generics and
abstractions.
Once upon a time, there was hope that we would have other implementations of the
`LowerCtx` trait, that did things like lower CLIF to SMTLIB for
verification. However, this is not practical these days given the way that the
trait has evolved over time, and our verification efforts are focused on ISLE
now anyways, and we're actually making some progress on that front (much more
than anyone ever did on a second `LowerCtx` trait implementation!)
Implement the tls_value for s390 in the ELF general-dynamic mode.
Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires the address of
the GOT (global offset table) is passed in %r12.
This means we need a new libcall entries for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker. This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).
We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.
Needed by the cg_clif frontend.
This is a nonsensical constraint: the entry block must come first in the
compiled code's layout, so it cannot also be sunk to the end of the
function.
This PR modifies the CLIF verifier to disallow this situation entirely.
It also adds an assert during final block-order computation to catch the
problem (and avoid a silent miscompile) even if the verifier is
disabled.
Fixes#4656.
The `MachBuffer` applies a set of peephole-optimization rules to do
branch threading, leverage fallthrough paths, eliminate empty blocks,
and flip conditional branches where needed to make branches more
efficient starting from naive always-branch-at-end-of-BB code.
This works by applying the rules at every label-bind, which is
equivalent to applying them at the end of every basic block, where
branches are usually inserted.
However, this misses one case: the end of the buffer! Currently we
don't optimize any redundant or foldable branches at the very end of
the machine code.
This usually doesn't matter when the function ends in an epilogue with
`ret` as the last instruction. However, when cold blocks exist, it can
actually matter.
Thanks to @mchesser for pointing out this issue in #4636.
To determine whether we need to insert a "veneer island" of
branch-range extension veneers, we need to know ahead of emitting a
basic block the worst-case size of that block. This is because veneers
only go between blocks (we could plop one in the middle of a block but
that would require another jump around it and would probably pessimize
some code significantly), and we can't back up once we emit a block.
To compute this worst-case size, we take the number of instructions
and multiply by the largest possible size of one pseudoinst (e.g., on
aarch64, this is 44 bytes; it explicitly excludes the `EmitIsland`
pseudo-op which is used before large jumptable inline offset tables
are emitted). This is conservative, but it always works, and veneers
are somewhat rare in practice (function body >1MiB on aarch64 for
example).
Unfortunately this logic didn't account for the spill/reload/move
instructions inserted by the register allocator, and in one example in
issue #4629, a block had only one instruction but 482
edge-moves (!). This came at just the wrong time as we were
approaching the 1MiB limit on aarch64.
This PR fixes that issue, and fixes the logic to actually look at the
correct next block (next in `final_order` rather than numerically
next), as a bonus correctness fix.
Fixes#4629.
* Convert `fma`, `valltrue` & `vanytrue` to ISLE (AArch64)
Ported the existing implementations of the following opcodes to ISLE on
AArch64:
- `fma`
- Introduced missing support for `fma` on vector values, as per the
docs.
- `valltrue`
- `vanytrue`
Also fixed `fcmp` on scalar values in the interpreter, and enabled
interpreter tests in `simd-fma.clif`.
This introduces the `FMLA` machine instruction.
Copyright (c) 2022 Arm Limited
* Add comments for `Fmla` and `Bsl`
Copyright (c) 2022 Arm Limited
And the `ABICallerImpl::ir_sig` field that was used to implement that
method. This removes 56 bytes from the size of `ABICallerImpl` and gives us
speed ups to compilation of about 7% on all benchmarks.
```
compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 8205119.48 ± 4069474.25 (confidence = 99%)
main.so is 0.91x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.10x faster than main.so!
[117729152 132258110.36 167484097] main.so
[107486500 124052990.88 138008797] feature.so
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 4645258.32 ± 1981104.59 (confidence = 99%)
main.so is 0.92x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.08x faster than main.so!
[76562171 85504479.28 93116863] main.so
[75180650 80859220.96 90591978] feature.so
compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 150575617.54 ± 65021102.57 (confidence = 99%)
main.so is 0.92x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.08x faster than main.so!
[2573089039 2843117485.10 3175982602] main.so
[2559784932 2692541867.56 3143529008] feature.so
```
This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.
The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in call. This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.
To do so, we add a new variant ABIArg::ImplicitArg. This acts like
StructArg, except that the value type is the actual target type,
not a pointer type. The required conversions have to be inserted
in the prologue and at function call sites.
Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values. In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.
For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI. The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.
(This implements the second half of issue #4565.)
This adds support for StructArgument on s390x. The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.
To implement this, I've added an optional "pointer" filed to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.
One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism. Currently, for struct args
we only need to copy the data (and that need to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.
However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot. Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provided *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call copy..to_buffer for all args, and
then call copy.._to_arg for all args. This required updates
to all back ends.
In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers). This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.
(This implements the first half of issue #4565.)
Give the user the option to sign and to authenticate function
return addresses with the operations introduced by the Pointer
Authentication extension to the Arm instruction set architecture.
Copyright (c) 2021, Arm Limited.
I essentially add these same logs back in every time I'm debugging something
related to this fuzz target or `externref`s in general. Probably like 5 times
I've added roughly these logs. We should just make them available whenever we
need them via `RUST_LOG=wasmtime_runtime=trace`.
This also changes a couple `if let`s to `unwrap`s that are now infallible after
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
The gen_copy_arg_to_regs routine currently ignores argument extension
flags when loading incoming arguments. This causes a problem with
stack arguments on big-endian systems, since the argument address
points to the word on the stack as extended by the caller, but the
generated code only loads the inner type from the address, causing
it to receive an incorrect value. (This happens to work on little-
endian systems.)
Fixed by loading extended arguments as full words.
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* Move `emit_to_memory` to `MachCompileResult`
This small refactoring makes it clearer to me that emitting to memory
doesn't require anything else from the compilation `Context`. While it's
a trivial change, it's a small public API change that shouldn't cause
too much trouble, and doesn't seem RFC-worthy. Happy to hear different
opinions about this, though!
* hide the MachCompileResult behind a method
* Add a `CompileError` wrapper type that references a `Function`
* Rename MachCompileResult to CompiledCode
* Additionally remove the last unsafe API in cranelift-codegen
* [AArch64] Port SIMD narrowing to ISLE
Fvdemote, snarrow, unarrow and uunarrow.
Also refactor the aarch64 instructions descriptions to parameterize
on ScalarSize instead of using different opcodes.
The zero_value pure constructor has been introduced and used by the
integer narrow operations and it replaces, and extends, the compare
zero patterns.
Copright (c) 2022, Arm Limited.
* use short 'if' patterns
* Skip new `table_ops` test under emulation
When emulating we already have to disable most pooling-allocator related
tests so this commit carries over that logic to the new fuzz test which
may run some configurations with the pooling allocator depending on the
random input.
* Fix panics in s390x codegen related to aliases
This commit fixes an issue introduced as part of the fix for
GHSA-5fhj-g3p3-pq9g. The `reftyped_vregs` list given to `regalloc2` is
not allowed to have duplicates in it and while the list originally
doesn't have duplicates once aliases are applied the list may have
duplicates. The fix here is to perform another pass to remove duplicates
after the aliases have been processed.
* Improve cranelift disassembly of stack maps
Print out extra information about stack maps such as their contents and
other related metadata available. Additionally also print out addresses
in hex to line up with the disassembly otherwise printed as well.
* Improve the `table_ops` fuzzer
* Generate more instructions by default
* Fix negative indices appearing in `table.{get,set}`
* Assert that the traps generated are expected to prevent accidental
other errors reporting a fuzzing success.
* Fix `reftype_vregs` reported to `regalloc2`
This fixes a mistake in the register allocation of Cranelift functions
where functions using reference-typed arguments incorrectly report which
virtual registers are reference-typed values if there are vreg aliases
in play. The fix here is to apply the vreg aliases to the final list of
reftyped regs which is eventually passed to `regalloc2`.
The main consequence of this fix is that functions which previously
accidentally didn't have correct stack maps should now have the missing
stack maps.
* Add a test that `table_ops` gc's eventually
* Add a comment about new alias resolution
* Update crates/fuzzing/src/oracles.rs
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Add some comments
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Preserving frame pointers -- even inside leaf functions -- makes it easy to
capture the stack of a running program, without requiring any side tables or
metadata (like `.eh_frame` sections). Many sampling profilers and similar tools
walk frame pointers to capture stacks. Enabling this option will play nice with
those tools.
This adds full support for all Cranelift SIMD instructions
to the s390x target. Everything is matched fully via ISLE.
In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not. This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).
The bulk of the changes are platform-specific, but there are
a few exceptions:
- Added ISLE extractors for the Immediate and Constant types,
to enable matching the vconst and swizzle instructions.
- Added a missing accessor for call_conv to ABISig.
- Fixed endian conversion for vector types in data_value.rs
to enable their use in runtests on the big-endian platforms.
- Enabled (nearly) all SIMD runtests on s390x. [ Two test cases
remain disabled due to vector shift count semantics, see below. ]
- Enabled all Wasmtime SIMD tests on s390x.
There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged. I've opened the
following issues to track them:
- Vector shift count semantics
https://github.com/bytecodealliance/wasmtime/issues/4424
- is_included_in_clobbers vs. link register
https://github.com/bytecodealliance/wasmtime/issues/4425
- gen_constant callback
https://github.com/bytecodealliance/wasmtime/issues/4426
All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
* Convert `scalar_to_vector` to ISLE (AArch64)
Converted the exisiting implementation of `scalar_to_vector` for AArch64 to
ISLE.
Copyright (c) 2022 Arm Limited
* Add support for floats and fix FpuExtend
- Added rules to cover `f32 -> f32x4` and `f64 -> f64x2` for
`scalar_to_vector`
- Added tests for `scalar_to_vector` on floats.
- Corrected an invalid instruction emitted by `FpuExtend` on 64-bit
values.
Copyright (c) 2022 Arm Limited
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.
Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.
The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
Copyright (c) 2022, Arm Limited.
* x64: port `atomic_rmw` to ISLE
This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.
* x64: remove remaining hardcoded register constraints for `atomic_rmw`
* x64: use `SyntheticAmode` in `AtomicRmwSeq`
* review: add missing reg collector for amode
* review: collect memory registers in the 'late' phase
Move from passing and returning u8 and u16 values to u32 in many of
the functions. This removes a number of type conversions and gives
a small compilation time speedup, around ~0.7% on my aarch64 machine.
Copyright (c) 2022, Arm Limited.
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.
To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE. All actual
code generation is then done in ISLE rules, following the
information provided by that signature.
[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]
To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
- Handle call instructions' clobbers with the clobbers API, using RA2's
clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
list;
- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
edge-case behavior w.r.t. liverange splitting.
Now the fiber implementation on AArch64 authenticates function
return addresses and includes the relevant BTI instructions, except
on macOS.
Also, change the locations of the saved FP and LR registers on the
fiber stack to make them compliant with the Procedure Call Standard
for the Arm 64-bit Architecture.
Copyright (c) 2022, Arm Limited.
* Remove unused srcloc in MachReloc
* Remove unused srcloc in MachTrap
* Use `into_iter` on array in bench code to suppress a warning
* Remove unused srcloc in MachCallSite
Previously, the pinned register (enabled by the `enable_pinned_reg`
Cranelift setting and used via the `get_pinned_reg` and `set_pinned_reg`
CLIF ops) was only used when Cranelift was embedded in SpiderMonkey, in
order to support a pinned heap register. SpiderMonkey has its own
calling convention in Cranelift (named after the integration layer,
"Baldrdash").
However, the feature is more general, and should be usable with the
default system calling convention too, e.g. SysV or Windows Fastcall.
This PR fixes the ABI code to properly treat the pinned register as a
globally allocated register -- and hence an implicit input and output to
every function, not saved/restored in the prologue/epilogue -- for SysV
on x86-64 and aarch64, and Fastcall on x86-64.
Fixes#4170.