* Cranelift: extend docs on Inst to discuss `call` instructions
the docs on `Inst` noted that the type is returned by non-resultful
instructions built from `InstBuilder`, but did _not_ note that it is
also returned by `call` and `call_indirect`. if you're trying to learn
and use Cranelift by following the docs, this means you'd follow a doc
link to `Inst` that implies that `call` does not return a value - this
is actively misleading, since you'd want to use the returned `Inst` to
find exactly those returned values!
so, this adds a few sentences talking about the case of call `Inst`s.
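for illustration, a hedged sketch of the pattern those sentences describe (surrounding setup assumed; this uses `cranelift-frontend`'s `FunctionBuilder`):

```rust
use cranelift_codegen::ir::{FuncRef, InstBuilder, Value};
use cranelift_frontend::FunctionBuilder;

// Sketch only: `builder`, `func_ref`, and `args` are assumed to already exist.
fn call_and_use_results(
    builder: &mut FunctionBuilder<'_>,
    func_ref: FuncRef,
    args: &[Value],
) -> Vec<Value> {
    // `call` hands back an `Inst`, not the call's result values...
    let inst = builder.ins().call(func_ref, args);
    // ...and the results are then looked up through that `Inst`.
    builder.inst_results(inst).to_vec()
}
```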
* cranelift: Add assert to prevent the wrong InstFormat from being used with an opcode
* cranelift: Use correct instruction format when inserting opcodes in fuzzgen (fixes #4733)
* cranelift: Use debug assert on InstFormat assert
Fixes #4736
Fix lowerings that were using values as both a Reg and a RegMem, which made it look like a load could be sunk while its value was still being used in a register. Also add an assert that checks that loads that are sunk are never used.
The sse_cmp_op rule had cases that would produce SseOperand values that aren't legal to use with MInst.XmmRmR, and was only used in vector_all_ones when constructing an XmmRmR value. Additionally, vector_all_ones always called sse_cmp_op with the same type, so the other cases were redundant.
The solution in this PR is to remove sse_cmp_op entirely, inline a call to x64_pcmpeqd directly in vector_all_ones, and remove the unused argument from vector_all_ones.
Also, adjust the tests that are executed on that platform. Finally,
fix a bug with obtaining backtraces when back-edge CFI is enabled.
Copyright (c) 2022, Arm Limited.
When trying to read generated CLIF, it's nice to be able to see at a
glance that some of the operands are defined by `iconst` and similar
instructions, without having to go find each operand's definition
manually.
All of the `*_imm` instructions are rewritten during legalization to an
explicit `iconst` plus the general form of the operator, so backends
never see them. Therefore these ISLE rules in the x64 backend can never
match anything.
This fixes two problems: minimum symbol alignment for the LARL
instruction, and alignment requirements for LRL/LGRL etc.
The first problem is that the LARL instruction used to load a
symbol address (PC relative) requires that the target symbol
is at least 2-byte aligned. This is always guaranteed for code
symbols (all instructions must be 2-aligned anyway), but not
necessarily for data symbols.
Other s390x compilers fix this problem by ensuring that all
global symbols are always emitted with a minimum 2-byte
alignment. This patch introduces an equivalent mechanism
for cranelift:
- Add a symbol_alignment routine to TargetIsa, similar to the
existing code_section_alignment routine.
- Respect symbol_alignment as minimum alignment for all symbols
emitted in the object backend (code and data).
The second problem is that PC-relative instructions that
directly *access* data (like LRL/LGRL, STRL/STGRL etc.)
not only have the 2-byte requirement like LARL, but actually
require that their memory operand is *naturally* aligned
(i.e. alignment is at least the size of the access).
This property (natural alignment for memory accesses) is
supposed to be provided by the "aligned" flag in MemFlags;
however, this is not implemented correctly at the moment.
To fix this, this patch:
- Only emits PC-relative memory access instructions if the
"aligned" flag is set in the associated MemFlags.
- Fixes a bug in emit_small_memory_copy and emit_small_memset
which currently set the aligned flag unconditionally, ignoring
the actual alignment info passed by their caller.
Tested with wasmtime and cg_clif.
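A minimal sketch of the two checks described above; the names and signatures are illustrative, not the actual TargetIsa or object-backend code:

```rust
/// Object-backend side: clamp each symbol's requested alignment to the
/// ISA-defined minimum (2 bytes on s390x, so LARL can always address it).
fn effective_symbol_alignment(requested: u64, isa_symbol_alignment: u64) -> u64 {
    requested.max(isa_symbol_alignment)
}

/// Caller side (emit_small_memory_copy / emit_small_memset): only claim
/// natural alignment, i.e. set the `aligned` MemFlag, when the alignment
/// actually passed in covers the access size.
fn access_is_naturally_aligned(access_size: u64, known_alignment: u64) -> bool {
    known_alignment >= access_size
}
```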
* Add a test for iadd_pairwise with swiden input
* Implement iadd_pairwise for swiden_{low,high} input
* Add a test case for iadd_pairwise with uwiden input
* Implement iadd_pairwise with uwiden
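For context, a scalar model of the pattern these lowerings target (illustrative only, not backend code): `iadd_pairwise` applied to the `swiden_low`/`swiden_high` halves of a single `i16x8` value amounts to a pairwise widening add.

```rust
/// Scalar model of `iadd_pairwise(swiden_low(x), swiden_high(x))` on i16x8:
/// each i32 output lane is the sum of two adjacent sign-extended i16 lanes.
/// (The uwiden variant is the same with zero-extension instead.)
fn pairwise_widening_add(x: [i16; 8]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for (i, lane) in out.iter_mut().enumerate() {
        *lane = x[2 * i] as i32 + x[2 * i + 1] as i32;
    }
    out
}
```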
* Cranelift: Use bump allocation in `remove_constant_phis` pass
This makes compilation 2-6% faster for Sightglass's bz2 benchmark:
```
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 7290648.36 ± 4245152.07 (confidence = 99%)
bump.so is 1.02x to 1.06x faster than main.so!
[166388177 183238542.98 214732518] bump.so
[172836648 190529191.34 217514271] main.so
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[182220055 225793551.12 277857575] bump.so
[193212613 227784078.61 277175335] main.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[3848442474 4295214144.37 4665127241] bump.so
[3969505457 4262415290.10 4563869974] main.so
```
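A hedged sketch of the kind of change involved (illustrative, not the actual pass code): per-iteration temporaries live in a `bumpalo` arena and are released in one shot when the pass returns.

```rust
use bumpalo::Bump;

// Illustrative only: scratch data is carved out of a bump arena instead of
// being heap-allocated and freed individually.
fn remove_constant_phis_like_pass(block_sizes: &[usize]) {
    let bump = Bump::new();
    for &n in block_sizes {
        // One arena allocation per block; nothing is freed until `bump` drops.
        let summary: &mut [u32] = bump.alloc_slice_fill_copy(n, 0);
        summary.iter_mut().for_each(|slot| *slot += 1);
        // ... analyze the block using `summary` ...
    }
    // All arena memory is released here in a single shot.
}
```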
* Add audit for `bumpalo`
* Add an audit of `arrayvec` version 0.7.2
* Remove unnecessary `collect` into `Vec`
I wasn't able to measure any perf difference here, but it's nice to do anyway.
* Use a `SecondaryMap` for keeping track of summaries
* Cranelift: Remove the `ABICaller` trait
It has only one implementation: the `ABICallerImpl` struct. We can just use that
directly rather than having extra, unnecessary layers of generics and abstractions.
* Cranelift: Rename `ABICallerImpl` to `Caller`
* Cranelift: Remove `ABICallee` trait
It has only one implementation: the `ABICalleeImpl` struct. By using that
directly we can avoid unnecessary layers of generics and abstractions as well as
a couple `Box`es that were previously putting the single implementation into a
`Box<dyn>`.
* Cranelift: Rename `ABICalleeImpl` to `AbiCallee`
* Fix comments as per review
* Rename `AbiCallee` to `Callee`
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and a trait object to provide a backend for an all-inclusive experience in Wasmtime.
Following Chris's suggestion, `Function` has been split into two main parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accommodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location; instead, it's automatically detected the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, using that there would have been quite confusing. Instead, a new enum `UserFuncName` is introduced for this name; it's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
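A hedged sketch of the keying scheme (the types and the serialization step are stand-ins, not the real cranelift-codegen API):

```rust
use sha2::{Digest, Sha256};
use std::collections::HashMap;

/// Illustrative cache keyed only by a SHA-256 digest of the serialized
/// stencil; no `PartialEq` check is performed on the stencil itself.
#[derive(Default)]
struct StencilCache {
    entries: HashMap<[u8; 32], Vec<u8>>, // digest -> serialized compiled code
}

impl StencilCache {
    fn get_or_compile(
        &mut self,
        stencil_bytes: &[u8],              // stand-in for a serialized FunctionStencil
        compile: impl FnOnce() -> Vec<u8>, // stand-in for actually compiling it
    ) -> &[u8] {
        let key: [u8; 32] = Sha256::digest(stencil_bytes).into();
        self.entries.entry(key).or_insert_with(compile)
    }
}
```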
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will also be successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so it should probably be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes #4155.
The trait had only one implementation: the `Lower` struct. It is easier to just
use that directly, and not introduce unnecessary layers of generics and
abstractions.
Once upon a time, there was hope that we would have other implementations of
the `LowerCtx` trait that did things like lower CLIF to SMTLIB for
verification. However, this is not practical these days given the way the trait
has evolved over time; our verification efforts are focused on ISLE now anyway,
and we're actually making some progress on that front (much more than anyone
ever did on a second `LowerCtx` trait implementation!).
* Add a test for the existing behavior of fcvt_from_uint
* Migrate the I8, I16, I32 cases of fcvt_from_uint
* Implement the I64 case of fcvt_from_uint
* Add a test for the existing behavior of fcvt_from_uint.f64x2
* Migrate fcvt_from_uint.f64x2 to ISLE
* Lower the last case of `fcvt_from_uint`
* Add a test for `fcvt_from_uint`
* Finish lowering fcvt_from_uint
* Format
This implements the s390x back-end portion of the solution for
https://github.com/bytecodealliance/wasmtime/issues/4566
We now support both big- and little-endian vector lane order
in code generation. The order used for a function is determined
by the function's ABI: if it uses a Wasmtime ABI, it will use
little-endian lane order, and big-endian lane order otherwise.
(This ensures that all raw_bitcast instructions generated by
both wasmtime and other cranelift frontends can always be
implemented as a no-op.)
Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers
(insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers
(iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)
In addition, when calling a function using a different lane order,
we need to lane-swap all vector values passed or returned in registers.
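Both the lane-number-sensitive operations above and the call-boundary lane swaps boil down to roughly the same index remapping; a minimal sketch (illustrative, not the actual back-end helper):

```rust
/// Convert a lane index between the two lane orders: lane i in little-endian
/// lane order corresponds to lane (n - 1 - i) in big-endian lane order, and
/// vice versa.
fn mirror_lane(lane: u8, lane_count: u8) -> u8 {
    debug_assert!(lane < lane_count);
    lane_count - 1 - lane
}
```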
A small number of changes to common code were also needed:
- Ensure we always select a Wasmtime calling convention on s390x
in crates/cranelift (func_signature).
- Fix vector immediates for filetests/runtests. In PR #4427,
I attempted to fix this by byte-swapping the V128 value, but
with the new scheme, we'd instead need to perform a per-lane
byte swap. Since we do not know the actual type in write_to_slice
and read_from_slice, this isn't easily possible.
Revert this part of PR #4427 again, and instead just mark the
memory buffer as little-endian when emitting the trampoline;
the back-end will then emit correct code to load the constant.
- Change a runtest in simd-bitselect-to-vselect.clif to no longer
make little-endian lane order assumptions.
- Remove runtests in simd-swizzle.clif that make little-endian
lane order assumptions by relying on implicit type conversion
when using a non-i16x8 swizzle result type (this feature should
probably be removed anyway).
Tested with both wasmtime and cg_clif.
This enables the object backend for s390x, in particular the
processing of all required relocations.
This uncovered a bug: we need to use PLT relocations for the
target of calls, which we currently do not. Fixed by adding
a new S390xPLTRel32Dbl reloc type and using it where needed.
It is not easy to figure out which boolean settings go into each preset by
inspecting the DSL source (e.g. meta/src/isa/x86.rs). This patch extends
the comments in the Rust that's generated by that DSL to list the names
of the settings together with the name of the preset.
* Fix sret for AArch64
AArch64 requires the struct return address argument to be stored in the x8
register. This register is never used for regular arguments.
* Add extra sret tests for x86_64
Implement the tls_value instruction for s390x in the ELF general-dynamic mode.
Notable differences to the x86_64 implementation are:
- We use a __tls_get_offset libcall instead of __tls_get_addr.
- The current thread pointer (stored in a pair of access registers)
needs to be added to the result of __tls_get_offset.
- __tls_get_offset has a variant ABI that requires that the address of
the GOT (global offset table) be passed in %r12.
This means we need a new libcall entry for __tls_get_offset.
In addition, we also need a way to access _GLOBAL_OFFSET_TABLE_.
The latter is a "magic" symbol with a well-known name defined
by the ABI and recognized by the linker. This patch introduces
a new ExternalName::KnownSymbol variant to support such names
(originally due to @afonso360).
We also need to emit a relocation on a symbol placed in a
constant pool, as well as an extra relocation on the call
to __tls_get_offset required for TLS linker optimization.
Needed by the cg_clif frontend.
This is a nonsensical constraint: the entry block must come first in the
compiled code's layout, so it cannot also be sunk to the end of the
function.
This PR modifies the CLIF verifier to disallow this situation entirely.
It also adds an assert during final block-order computation to catch the
problem (and avoid a silent miscompile) even if the verifier is
disabled.
Fixes #4656.
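A minimal sketch of the shape of the new verifier check (illustrative flags, not the real verifier code):

```rust
/// Reject a function whose entry block would be sunk to the end of the
/// layout: the entry block must always come first in the compiled code.
fn check_entry_block(is_entry: bool, sunk_to_end: bool) -> Result<(), &'static str> {
    if is_entry && sunk_to_end {
        return Err("entry block cannot be sunk to the end of the function");
    }
    Ok(())
}
```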
The `MachBuffer` applies a set of peephole-optimization rules to do
branch threading, leverage fallthrough paths, eliminate empty blocks,
and flip conditional branches where needed to make branches more
efficient starting from naive always-branch-at-end-of-BB code.
This works by applying the rules at every label-bind, which is
equivalent to applying them at the end of every basic block, where
branches are usually inserted.
However, this misses one case: the end of the buffer! Currently we
don't optimize any redundant or foldable branches at the very end of
the machine code.
This usually doesn't matter when the function ends in an epilogue with
`ret` as the last instruction. However, when cold blocks exist, it can
actually matter.
Thanks to @mchesser for pointing out this issue in #4636.
To determine whether we need to insert a "veneer island" of
branch-range extension veneers, we need to know ahead of emitting a
basic block the worst-case size of that block. This is because veneers
only go between blocks (we could plop one in the middle of a block but
that would require another jump around it and would probably pessimize
some code significantly), and we can't back up once we emit a block.
To compute this worst-case size, we take the number of instructions
and multiply by the largest possible size of one pseudoinst (e.g., on
aarch64, this is 44 bytes; it explicitly excludes the `EmitIsland`
pseudo-op which is used before large jumptable inline offset tables
are emitted). This is conservative, but it always works, and veneers
are somewhat rare in practice (function body >1MiB on aarch64 for
example).
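A hedged sketch of that computation as it stood (the constant and the names are illustrative, not the real emission code):

```rust
/// Largest encoding of any single pseudo-inst on the target, excluding
/// `EmitIsland`, whose size is accounted for separately; 44 bytes is the
/// aarch64 figure mentioned above.
const MAX_INST_SIZE: u32 = 44;

/// Conservative upper bound on a block's emitted size: instruction count
/// times the largest possible single-instruction encoding.
fn worst_case_block_size(num_block_insts: u32) -> u32 {
    num_block_insts * MAX_INST_SIZE
}
```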
Unfortunately this logic didn't account for the spill/reload/move
instructions inserted by the register allocator, and in one example in
issue #4629, a block had only one instruction but 482
edge-moves (!). This came at just the wrong time as we were
approaching the 1MiB limit on aarch64.
This PR fixes that issue, and fixes the logic to actually look at the
correct next block (next in `final_order` rather than numerically
next), as a bonus correctness fix.
Fixes #4629.
Ported the existing implementations of the following opcodes on AArch64
to ISLE:
- `AvgRound`
- Also introduced support for `i64x2` vectors, as per the docs.
- `SqmulRoundSat`
Copyright (c) 2022 Arm Limited
Moves the following AArch64 opcodes into a separate `VecALUModOp` enum,
which is emitted via the `VecRRRMod` instruction. This distinguishes vector ALU
instructions that modify a register from instructions that write to a new register:
- `Bsl`
- `Fmla`
Addresses [a discussion](https://github.com/bytecodealliance/wasmtime/pull/4608#discussion_r937975581) in #4608.
Copyright (c) 2022 Arm Limited
We assert after emitting each instruction that its size was less than
the "worst-case size", which is used to determine when we need to
proactively emit an island so pending branch fixups don't go out of
bounds. However, the `EmitIsland` pseudo-inst itself can cause an
arbitrarily large island to be emitted; this should not have to fit
within the worst-case size (because island size is explicitly accounted
for by the threshold computation). This PR fixes the assert accordingly.
Fixes #4626.
* Convert `fma`, `valltrue` & `vanytrue` to ISLE (AArch64)
Ported the existing implementations of the following opcodes to ISLE on
AArch64:
- `fma`
- Introduced missing support for `fma` on vector values, as per the
docs.
- `valltrue`
- `vanytrue`
Also fixed `fcmp` on scalar values in the interpreter, and enabled
interpreter tests in `simd-fma.clif`.
This introduces the `FMLA` machine instruction.
Copyright (c) 2022 Arm Limited
* Add comments for `Fmla` and `Bsl`
Copyright (c) 2022 Arm Limited
And the `ABICallerImpl::ir_sig` field that was used to implement that
method. This removes 56 bytes from the size of `ABICallerImpl` and gives us
compilation speedups of about 7% on all benchmarks.
```
compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 8205119.48 ± 4069474.25 (confidence = 99%)
main.so is 0.91x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.10x faster than main.so!
[117729152 132258110.36 167484097] main.so
[107486500 124052990.88 138008797] feature.so
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 4645258.32 ± 1981104.59 (confidence = 99%)
main.so is 0.92x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.08x faster than main.so!
[76562171 85504479.28 93116863] main.so
[75180650 80859220.96 90591978] feature.so
compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 150575617.54 ± 65021102.57 (confidence = 99%)
main.so is 0.92x to 0.97x faster than feature.so!
feature.so is 1.03x to 1.08x faster than main.so!
[2573089039 2843117485.10 3175982602] main.so
[2559784932 2692541867.56 3143529008] feature.so
```