This adds full i128 support to the s390x target, including new filetests
and enabling the existing i128 runtest on s390x.
The ABI requires that i128 is passed and returned via implicit pointer,
but the front end still generates direct i128 types in calls. This means
we have to implement ABI support to implicitly convert i128 types to
pointers when passing arguments.
To do so, we add a new variant ABIArg::ImplicitArg. This acts like
StructArg, except that the value type is the actual target type,
not a pointer type. The required conversions have to be inserted
in the prologue and at function call sites.
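To make the shape concrete, here is a minimal sketch; the field names are
illustrative assumptions, not the actual cranelift-codegen definitions:
```
/// Placeholder standing in for Cranelift's `ir::Type` in this sketch.
#[derive(Clone, Copy)]
struct Type;

/// Illustrative shape only; fields are assumptions for the sketch.
enum ABIArg {
    /// A value passed directly in registers and/or stack slots.
    Slots { offset: i64 },
    /// A struct passed via hidden pointer; the value type is a pointer.
    StructArg { offset: i64, size: u64 },
    /// Like StructArg, but the value type is the actual target type
    /// (e.g. i128), so the prologue and call sites must insert the
    /// implicit pointer conversions.
    ImplicitArg { offset: i64, ty: Type },
}
```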
Note that when dereferencing the implicit pointer in the prologue,
we may require a temp register: the pointer may be passed on the
stack so it needs to be loaded first, but the value register may
be in the wrong class for pointer values. In this case, we use
the "stack limit" register, which should be available at this
point in the prologue.
For return values, we use a mechanism similar to the one used for
supporting multiple return values in the Wasmtime ABI. The only
difference is that the hidden pointer to the return buffer must
be the *first*, not last, argument in this case.
(This implements the second half of issue #4565.)
* Add `try_use_var` method to `cranelift-frontend`.
- Unlike `use_var`, this method does not panic if the variable has not been
defined before use (see the sketch after this list)
* Add `try_declare_var` and `try_def_var`.
- Also implement Error for error enums.
* Use `write!` macro.
* Add `write!` use I missed.
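As referenced above, a hedged sketch of how the fallible variant might be
used; everything beyond the method names given above (error type, exact
signatures) is an assumption:
```
use cranelift_codegen::ir::Value;
use cranelift_frontend::{FunctionBuilder, Variable};

// Returns None instead of panicking when `var` was never defined.
fn read_var(builder: &mut FunctionBuilder, var: Variable) -> Option<Value> {
    match builder.try_use_var(var) {
        Ok(val) => Some(val),
        Err(e) => {
            // The error enums implement Error (and thus Display).
            eprintln!("cannot use variable yet: {}", e);
            None
        }
    }
}
```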
* Port `Shuffle` to ISLE (AArch64)
Ported the existing implementation of `Shuffle` for AArch64 to ISLE.
Copyright (c) 2022 Arm Limited
* Cleanup by shadowing `rn`, `rn2`, and `_`
Copyright (c) 2022 Arm Limited
* Wasmtime: Add a pointer to `VMRuntimeLimits` in component contexts
* Save exit Wasm FP and PC in component-to-host trampolines
Fixes #4535
* Add comment about why we deref the trampoline's FP
* Update some tests to use new `vmruntime_limits_*` methods
This adds support for StructArgument on s390x. The ABI for this
platform requires that the address of the buffer holding the copy
of the struct argument is passed from caller to callee as hidden
pointer, using a register or overflow stack slot.
To implement this, I've added an optional "pointer" field to
ABIArg::StructArg, and code to handle the pointer both in common
abi_impl code and the s390x back-end.
One notable change necessary to make this work involved the
"copy_to_arg_order" mechanism. Currently, for struct args
we only need to copy the data (and that needs to happen before
setting up any other args), while for non-struct args we only
need to set up the appropriate registers or stack slots.
This order is ensured by sorting the arguments appropriately
into a "copy_to_arg_order" list.
However, for struct args with explicit pointers we need to *both*
copy the data (again, before everything else), *and* set up a
register or stack slot. Since we now need to touch the argument
twice, we cannot solve the ordering problem by a simple sort.
Instead, the abi_impl common code now provides *two* callbacks,
emit_copy_regs_to_buffer and emit_copy_regs_to_arg, and expects
the back end to first call emit_copy_regs_to_buffer for all args,
and then call emit_copy_regs_to_arg for all args. This required
updates to all back ends.
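A sketch of the resulting call-site protocol; the trait and signatures here
are illustrative stand-ins, not the actual abi_impl API:
```
/// Illustrative stand-in for the per-call ABI object.
trait CallArgs {
    fn emit_copy_regs_to_buffer(&mut self, idx: usize);
    fn emit_copy_regs_to_arg(&mut self, idx: usize);
}

fn emit_args(site: &mut dyn CallArgs, num_args: usize) {
    // Pass 1: copy all struct-arg data into their buffers first...
    for i in 0..num_args {
        site.emit_copy_regs_to_buffer(i); // no-op for non-struct args
    }
    // Pass 2: ...then set up every register/stack slot, including the
    // hidden pointers for struct args.
    for i in 0..num_args {
        site.emit_copy_regs_to_arg(i);
    }
}
```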
In the s390x back end, in addition to the new ABI code, I'm now
adding code to actually copy the struct data, using the MVC
instruction (for small buffers) or a memcpy libcall (for larger
buffers). This also requires a bit of new infrastructure:
- MVC is the first memory-to-memory instruction we use, which
needed a bit of memory argument tweaking
- We also need to set up the infrastructure to emit libcalls.
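A sketch of the size-based copy strategy; the exact threshold used by the
back end and the helper names are assumptions (MVC itself encodes the length
in 8 bits, so a single instruction moves at most 256 bytes):
```
fn emit_struct_copy(size: u64) {
    if size == 0 {
        // Nothing to copy.
    } else if size <= 256 {
        emit_mvc(size); // one memory-to-memory instruction
    } else {
        emit_memcpy_libcall(size); // fall back to a libcall for large buffers
    }
}

// Stubs standing in for the back-end emission helpers in this sketch.
fn emit_mvc(_size: u64) {}
fn emit_memcpy_libcall(_size: u64) {}
```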
(This implements the first half of issue #4565.)
Give the user the option to sign and to authenticate function
return addresses with the operations introduced by the Pointer
Authentication extension to the Arm instruction set architecture.
Copyright (c) 2021, Arm Limited.
I essentially add these same logs back in every time I'm debugging something
related to this fuzz target or `externref`s in general; I've probably added
roughly these logs five times by now. We should just make them available
whenever we need them via `RUST_LOG=wasmtime_runtime=trace`.
This also changes a couple `if let`s to `unwrap`s that are now infallible after
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535 (a CLIF sketch of the new instructions follows this list)
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
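For reference, a minimal CLIF sketch of the accessors named in the first item
above; the exact syntax is an assumption based on the commit titles, and per
the notes above some targets require `preserve_frame_pointers=true`:
```
function %addrs() -> i64, i64, i64 {
block0:
    v0 = get_stack_pointer.i64
    v1 = get_frame_pointer.i64
    v2 = get_return_address.i64
    return v0, v1, v2
}
```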
The gen_copy_arg_to_regs routine currently ignores argument extension
flags when loading incoming arguments. This causes a problem with
stack arguments on big-endian systems, since the argument address
points to the word on the stack as extended by the caller, but the
generated code only loads the inner type from the address, causing
it to receive an incorrect value. (This happens to work on little-
endian systems.)
Fixed by loading extended arguments as full words.
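A self-contained illustration of the failure mode (the values are made up
for the example):
```
fn main() {
    // A 32-bit argument 0x1234_5678, zero-extended by the caller into an
    // 8-byte stack slot on a big-endian machine.
    let slot: u64 = 0x1234_5678u32 as u64;
    let bytes = slot.to_be_bytes(); // big-endian slot layout

    // The argument address points at the start of the extended word, so
    // loading only the inner 4-byte type reads the extension bytes:
    let wrong = u32::from_be_bytes(bytes[0..4].try_into().unwrap());
    assert_eq!(wrong, 0); // callee sees 0, not the argument

    // Loading the full extended word (and narrowing in-register) is correct:
    let right = u64::from_be_bytes(bytes) as u32;
    assert_eq!(right, 0x1234_5678);
}
```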
* Cranelift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues and instead communicate very precise
information about the expected frame size and layout, which was then
stitched together post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* Move `emit_to_memory` to `MachCompileResult`
This small refactoring makes it clearer to me that emitting to memory
doesn't require anything else from the compilation `Context`. While it's
a trivial change, it's a small public API change that shouldn't cause
too much trouble, and doesn't seem RFC-worthy. Happy to hear different
opinions about this, though!
* hide the MachCompileResult behind a method
* Add a `CompileError` wrapper type that references a `Function`
* Rename MachCompileResult to CompiledCode
* Additionally remove the last unsafe API in cranelift-codegen
* Cranelift: Don't `emit` inside lowering rules in aarch64
The lowering rules should be "pure" and side-effect free, using helpers defined
in `inst.isle` to perform actual side effects like emitting instructions.
* Cranelift: use 80 width for section separators in aarch64 lowering rules
* Support shadowing in isle
* Re-run the isle build.rs if the examples change
* Print error messages when isle tests fail
* Move run tests
* Refactor `let` uses that don't need to introduce unique names
The ISLE language's lexer previously used a very primitive
`i64::from_str_radix` call to parse integer constants, allowing values
in the range -2^63..2^63 only. Also, underscores to separate digits (as
is allowed in Rust) were not supported. Finally, 128-bit constants were
not supported at all.
This PR addresses all issues above:
- Integer constants are internally stored as 128-bit values.
- Parsing supports either the signed (-2^127..2^127) or the unsigned
(0..2^128) range. Negation works independently of that, so one can write
`-0xffff..ffff` (128 bits wide, i.e., -(2^128-1)), which wraps around
to `1` in 128 bits.
- Underscores are supported to separate groups of digits, so one can
write `0xffff_ffff`.
- A minor oversight was fixed: hex constants can start with `0X`
(uppercase) as well as `0x`, for consistency with Rust and C.
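A minimal sketch of the rules listed above; this is not the actual ISLE
lexer, just an illustration of 128-bit storage, optional sign, `0x`/`0X`
prefixes, and `_` separators:
```
fn parse_int(tok: &str) -> Option<i128> {
    let (neg, rest) = match tok.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, tok),
    };
    let (radix, digits) = match rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X")) {
        Some(r) => (16, r),
        None => (10, rest),
    };
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    // Accept the full unsigned range, then reinterpret the bits as i128
    // so that negation wraps, as described above.
    let value = u128::from_str_radix(&cleaned, radix).ok()? as i128;
    Some(if neg { value.wrapping_neg() } else { value })
}

fn main() {
    assert_eq!(parse_int("0xffff_ffff"), Some(0xffff_ffff));
    assert_eq!(parse_int("-1"), Some(-1));
}
```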
This PR also adds a new kind of ISLE test that actually runs a driver
linked to compiled ISLE code; we previously didn't have any such tests,
but it is now quite useful to assert correct interpretation of constant
values.
* cranelift: Add MinGW `fma` regression tests
* cranelift: Fix FMA in interpreter
* cranelift: Add separate `fma` test suite for the interpreter
The interpreter can run `fma.clif` on most platforms, however on
`x86_64-pc-windows-gnu` we use libm which has issues with some inputs.
We should delete `fma-interpreter.clif` and enable the interpreter on
the main `fma.clif` file once those are fixed.
Ported the existing implementation of the following Opcodes for AArch64
to ISLE:
- `Fence`
- `IsNull`
- `IsInvalid`
- `Debugtrap`
Copyright (c) 2022 Arm Limited
For wasm programs using SIMD vector types, the type known at function
entry or exit may not match the type used within the body of the
function, so we have to bitcast them. We have to check all calls and
returns for this condition, which involves comparing a subset of a
function's signature with the CLIF types we're trying to use. Currently,
this check heap-allocates a short-lived Vec for the appropriate subset
of the signature.
But most of the time none of the values need a bitcast. So this patch
avoids allocating unless at least one bitcast is actually required, and
only saves the pointers to values that need fixing up. I leaned heavily
on iterators to keep space usage constant until we discover allocation
is necessary after all.
I don't think it's possible to eliminate the allocation entirely,
because the function signature we're examining is borrowed from the
FuncBuilder, but we need to mutably borrow the FuncBuilder to insert the
bitcast instructions. Fortunately, the FromIterator implementation for
Vec doesn't reserve any heap space if the iterator is empty, so we can
unconditionally collect into a Vec and still avoid unnecessary
allocations.
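A minimal sketch of the pattern; the predicate and element types are
stand-ins, not the real signature-checking code:
```
// Stand-in predicate for "this value needs a bitcast".
fn needs_fixup(x: u32) -> bool {
    x % 2 == 1
}

// Iterate lazily; `collect` starts from the iterator's lower size bound,
// which is 0 for `filter`, so the common empty case never allocates.
fn values_to_fix(values: &[u32]) -> Vec<u32> {
    values.iter().copied().filter(|&v| needs_fixup(v)).collect()
}

fn main() {
    assert_eq!(values_to_fix(&[2, 4, 6]).capacity(), 0); // no heap space
    assert_eq!(values_to_fix(&[2, 3, 4]), vec![3]);
}
```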
Since the relationship between a function signature and a list of CLIF
values is somewhat complicated, I refactored all the uses of
`bitcast_arguments` and `wasm_param_types`. Instead there's
`bitcast_wasm_params` and `bitcast_wasm_returns` which share a helper
that combines the previous pair of functions into one.
According to DHAT, when compiling the Sightglass Spidermonkey benchmark,
this avoids 52k allocations averaging about 9 bytes each, which are
freed on average within 2k instructions.
Most Sightglass benchmarks, including Spidermonkey, show no performance
difference with this change. Only one has a slowdown, and it's small:
```
compilation :: nanoseconds :: benchmarks/shootout-matrix/benchmark.wasm
Δ = 689373.34 ± 593720.78 (confidence = 99%)
lazy-bitcast.so is 0.94x to 1.00x faster than main-e121c209f.so!
main-e121c209f.so is 1.00x to 1.06x faster than lazy-bitcast.so!
[19741713 21375844.46 32976047] lazy-bitcast.so
[19345471 20686471.12 30872267] main-e121c209f.so
```
But several Sightglass benchmarks have notable speed-ups, with smaller
improvements for shootout-ed25519, meshoptimizer, and pulldown-cmark,
and larger ones as follows:
```
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 20071545.47 ± 2950014.92 (confidence = 99%)
lazy-bitcast.so is 1.26x to 1.36x faster than main-e121c209f.so!
main-e121c209f.so is 0.73x to 0.80x faster than lazy-bitcast.so!
[55995164 64849257.68 89083031] lazy-bitcast.so
[79382460 84920803.15 98016185] main-e121c209f.so

compilation :: nanoseconds :: benchmarks/blake3-scalar/benchmark.wasm
Δ = 16620780.61 ± 5395162.13 (confidence = 99%)
lazy-bitcast.so is 1.14x to 1.28x faster than main-e121c209f.so!
main-e121c209f.so is 0.77x to 0.88x faster than lazy-bitcast.so!
[54604352 79877776.35 103666598] lazy-bitcast.so
[94011835 96498556.96 106200091] main-e121c209f.so

compilation :: nanoseconds :: benchmarks/intgemm-simd/benchmark.wasm
Δ = 36891254.34 ± 12403663.50 (confidence = 99%)
lazy-bitcast.so is 1.12x to 1.24x faster than main-e121c209f.so!
main-e121c209f.so is 0.79x to 0.90x faster than lazy-bitcast.so!
[131610845 201289587.88 247341883] lazy-bitcast.so
[232065032 238180842.22 250957563] main-e121c209f.so
```
This commit builds on bytecodealliance/wasm-tools#690 to add support to
testing of the component model to execute functions when running
`*.wast` files. This support is all built on #4442 as functions are
invoked through a "dynamic" API. Right now the testing and integration
are fairly crude, but I'm hoping that we can improve them over time
as necessary. For now this should provide a hopefully more convenient
syntax for unit tests and the like.
* cranelift: Reorganize test suite
Group some SIMD operations by instruction.
* cranelift: Deduplicate some shift tests
Also, new tests with the mod behaviour (see the sketch after this list)
* aarch64: Lower shifts with mod behaviour
* x64: Lower shifts with mod behaviour
* wasmtime: Don't mask SIMD shifts
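A minimal CLIF runtest sketch of the mod behaviour; the test shape is
illustrative rather than taken from the PR. The shift amount is taken
modulo the type width, so shifting an i32 by 33 behaves like shifting by 1:
```
test run
target aarch64
target x86_64

function %ishl_mod(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = ishl v0, v1
    return v2
}
; run: %ishl_mod(1, 33) == 2
```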
DHAT reports that when compiling the Spidermonkey Sightglass benchmark,
there are over 100k of these Vec allocations, averaging less than 4
bytes, and with an average lifetime of only about 500 instructions.
This function is only called from one place, which immediately converts
it into an iterator. So this commit just returns the iterator that was
previously being collected into a Vec. The iterator has to borrow from
the DataFlowGraph, so this would change borrow-check results, but in the
one caller that turns out to be okay.
(That sole caller is in cranelift/codegen/src/machinst/lower.rs, in
Lower::lower().)
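A sketch of the shape of this change, with the types simplified to
stand-ins rather than the real DataFlowGraph API:
```
struct DataFlowGraph {
    args: Vec<u32>,
}

impl DataFlowGraph {
    // Before (illustrative): allocate a tiny Vec on every call.
    fn arg_list(&self) -> Vec<u32> {
        self.args.clone()
    }

    // After: borrow from the graph and let the caller iterate directly.
    fn arg_iter(&self) -> impl Iterator<Item = u32> + '_ {
        self.args.iter().copied()
    }
}

fn main() {
    let dfg = DataFlowGraph { args: vec![1, 2, 3] };
    assert_eq!(dfg.arg_list(), dfg.arg_iter().collect::<Vec<_>>());
}
```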
According to Sightglass, this is a compile-time improvement of between
2% and 12% on the Spidermonkey benchmark:
```
instantiation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 14628.76 ± 10318.59 (confidence = 99%)
main-0e6ffd024.so is 0.87x to 0.98x faster than no-small-vecs.so!
no-small-vecs.so is 1.02x to 1.14x faster than main-0e6ffd024.so!
[142023 187464.24 301522] main-0e6ffd024.so
[103742 172835.48 263917] no-small-vecs.so

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 362392705.93 ± 267070467.06 (confidence = 99%)
main-0e6ffd024.so is 0.89x to 0.98x faster than no-small-vecs.so!
no-small-vecs.so is 1.02x to 1.12x faster than main-0e6ffd024.so!
[3655734131 5522594697.83 6471126699] main-0e6ffd024.so
[3278129811 5160201991.90 5810600015] no-small-vecs.so
```
First, we switch from a `BTreeSet` to a `HashSet` because clearing a `BTreeSet`
will deallocate the btree's nodes but clearing a `HashSet` will not deallocate
the backing hash table, saving the space to reuse for future insertions.
Then, we reuse the same set (and therefore the same allocation) across every
call to `can_optimize_var_lookup`.
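A sketch of the reuse pattern; the struct, method, and traversal logic here
are illustrative stand-ins, not the actual frontend code:
```
use std::collections::HashSet;

struct VarLookup {
    // Reused across calls: `clear` keeps the backing table allocated,
    // unlike clearing a BTreeSet, which frees its nodes.
    seen: HashSet<u32>,
}

impl VarLookup {
    fn can_optimize(&mut self, blocks: &[u32]) -> bool {
        self.seen.clear(); // retains capacity from previous calls
        for &b in blocks {
            if !self.seen.insert(b) {
                return false; // revisited a block: bail out
            }
        }
        true
    }
}

fn main() {
    let mut v = VarLookup { seen: HashSet::new() };
    assert!(v.can_optimize(&[1, 2, 3]));
    assert!(!v.can_optimize(&[1, 2, 1]));
}
```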
This results in a 1.22x to 1.32x speed up on various Sightglass benchmarks:
```
compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 39478181.76 ± 3441880.32 (confidence = 99%)
main.so is 0.75x to 0.79x faster than reuse-set.so!
reuse-set.so is 1.27x to 1.32x faster than main.so!
[160128343 172174751.09 213325968] main.so
[115055695 132696569.33 200782128] reuse-set.so
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 22576954.88 ± 1830771.68 (confidence = 99%)
main.so is 0.77x to 0.81x faster than reuse-set.so!
reuse-set.so is 1.25x to 1.29x faster than main.so!
[100449245 106820149.65 118628066] main.so
[77039172 84243194.77 128168647] reuse-set.so
compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 664533554.97 ± 22109170.05 (confidence = 99%)
main.so is 0.81x to 0.82x faster than reuse-set.so!
reuse-set.so is 1.22x to 1.23x faster than main.so!
[3549762523 3640587103.35 3798662501] main.so
[2793335181 2976053548.38 3192950484] reuse-set.so
```
* Add initial support for fused adapter trampolines
This commit lands a significant new piece of functionality to Wasmtime's
implementation of the component model in the form of the implementation
of fused adapter trampolines. Internally within a component core wasm
modules can communicate with each other by having their exports
`canon lift`'d to get `canon lower`'d into a different component. This
signifies that two components are communicating through a statically
known interface via the canonical ABI at this time. Previously Wasmtime
was able to identify that this communication was happening but it simply
panicked with `unimplemented!` upon seeing it. This commit is the
beginning of filling out this panic location with an actual
implementation.
The implementation route chosen here for fused adapters is to use a
WebAssembly module itself for the implementation. This means that, at
compile time of a component, Wasmtime is generating core WebAssembly
modules which then get recursively compiled within Wasmtime as well. The
choice to use WebAssembly itself as the implementation of fused adapters
stems from a few motivations:
* This does not represent a significant increase in the "trusted
compiler base" of Wasmtime. Getting the Wasm -> CLIF translation
correct once is hard enough much less for an entirely different IR to
CLIF. By generating WebAssembly no new interactions with Cranelift are
added which drastically reduces the possibilities for mistakes.
* Using WebAssembly means that component adapters are insulated from
miscompilations and mistakes. If something goes wrong it's defined
well within the WebAssembly specification how it goes wrong and what
happens as a result. This means that the "blast zone" for a wrong
adapter is the component instance but not the entire host itself.
Accesses to linear memory are guaranteed to be in-bounds and otherwise
handled via well-defined traps.
* A fully-finished fused adapter compiler is expected to be a
significant and quite complex component of Wasmtime. Functionality
along these lines is expected to be needed for Web-based polyfills of
the component model and by using core WebAssembly it provides the
opportunity to share code between Wasmtime and these polyfills for the
component model.
* Finally the runtime implementation of managing WebAssembly modules is
already implemented and quite easy to integrate with, so representing
fused adapters with WebAssembly results in very little extra support
necessary for the runtime implementation of instantiating and managing
a component.
The compiler added in this commit is dubbed Wasmtime's Fused Adapter
Compiler of Trampolines (FACT) because who doesn't like deriving a name
from an acronym. Currently the trampoline compiler is limited in its
support for interface types and only supports a few primitives. I plan
on filing future PRs to flesh out the support here for all the variants
of `InterfaceType`. For now this PR is primarily focused on all of the
other infrastructure for the addition of a trampoline compiler.
With the choice to use core WebAssembly to implement fused adapters it
means that adapters need to be inserted into a module. Unfortunately
adapters cannot all go into a single WebAssembly module because adapters
themselves have dependencies which may be provided transitively through
instances that were instantiated with other adapters. This means that a
significant chunk of this PR (`adapt.rs`) is dedicated to determining
precisely which adapters go into precisely which adapter modules. This
partitioning process attempts to make large modules wherever it can to
cut down on core wasm instantiations but is likely not optimal as
it's just a simple heuristic today.
With all of this added together it's now possible to start writing
`*.wast` tests that internally have adapted modules communicating with
one another. A `fused.wast` test suite was added as part of this PR
which is the beginning of tests for the support of the fused adapter
compiler added in this PR. Currently this is primarily testing some
various topologies of adapters along with direct/indirect modes. This
will grow many more tests over time as more types are supported.
Overall I'm not 100% satisfied with the testing story of this PR. When a
test fails it's very difficult to debug, since everything is written in
the text format of WebAssembly, meaning there are no "conveniences" to
print out the state of the world and easily debug when things go wrong.
I think this will become even more apparent as more tests are written
for more types in subsequent PRs. At this time though I know of no
better alternative other than leaning pretty heavily on fuzz-testing to
ensure this is all exercised.
* Fix an unused field warning
* Fix tests in `wasmtime-runtime`
* Add some more tests for compiled trampolines
* Remap exports when injecting adapters
The exports of a component were accidentally left unmapped which meant
that they indexed the instance indexes pre-adapter module insertion.
* Fix typo
* Rebase conflicts
* fuzzgen: Use Switch interface
Turns out this is an interface that the frontend provides.
We should fuzz it.
* cranelift: Restrict index range in Switch emission on fuzzgen
* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* Allow 64-bit vectors and implement for interpreter
The AArch64 backend already supports 64-bit vectors; this simply allows
instructions to make use of that.
Implemented support for 64-bit vectors within the interpreter to allow
interpret runtests to use them.
Copyright (c) 2022 Arm Limited
* Disable 64-bit SIMD `iaddpairwise` tests on s390x
Copyright (c) 2022 Arm Limited
* [AArch64] Port SIMD narrowing to ISLE
Fvdemote, snarrow, unarrow and uunarrow.
Also refactor the aarch64 instruction descriptions to parameterize
on ScalarSize instead of using different opcodes.
The zero_value pure constructor has been introduced and used by the
integer narrowing operations; it replaces, and extends, the compare-
with-zero patterns.
Copyright (c) 2022, Arm Limited.
* use short 'if' patterns
This enables more runtests to be executed on s390x. Doing so
uncovered two back-end bugs, which are fixed as well:
- The result of cls was always off by one.
- The result of popcnt.i16 had uninitialized high bits.
In addition, I found a bug in the load-op-store.clif test case:
```
v3 = heap_addr.i64 heap0, v1, 4
v4 = iconst.i64 42
store.i32 v4, v3
```
This was clearly intended to perform a 32-bit store, but
actually performs a 64-bit store (it seems the type annotation
of the store opcode is ignored, and the type of the operand
is used instead). That bug did not show any noticeable symptoms
on little-endian architectures, but broke on big-endian.