Previously, `Inst::store` only understood a subset of the scalar types, which resulted in failures seen in #2826. This change allows `Inst::store` to generate instructions for all scalar widths (`8 | 16 | 32 | 64`) since all of these are supported in the emission code of `Inst::MovRM`.
Because there are instructions that are present in more than one ISA feature set, we need to see if any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
* Fully support multiple returns in Wasmtime
For quite some time now Wasmtime has "supported" multiple return values,
but only in the mose bare bones ways. Up until recently you couldn't get
a typed version of functions with multiple return values, and never have
you been able to use `Func::wrap` with functions that return multiple
values. Even recently where `Func::typed` can call functions that return
multiple values it uses a double-indirection by calling a trampoline
which calls the real function.
The underlying reason for this lack of support is that cranelift's ABI
for returning multiple values is not possible to write in Rust. For
example if a wasm function returns two `i32` values there is no Rust (or
C!) function you can write to correspond to that. This commit, however
fixes that.
This commit adds two new ABIs to Cranelift: `WasmtimeSystemV` and
`WasmtimeFastcall`. The intention is that these Wasmtime-specific ABIs
match their corresponding ABI (e.g. `SystemV` or `WindowsFastcall`) for
everything *except* how multiple values are returned. For multiple
return values we simply define our own version of the ABI which Wasmtime
implements, which is that for N return values the first is returned as
if the function only returned that and the latter N-1 return values are
returned via an out-ptr that's the last parameter to the function.
These custom ABIs provides the ability for Wasmtime to bind these in
Rust meaning that `Func::wrap` can now wrap functions that return
multiple values and `Func::typed` no longer uses trampolines when
calling functions that return multiple values. Although there's lots of
internal changes there's no actual changes in the API surface area of
Wasmtime, just a few more impls of more public traits which means that
more types are supported in more places!
Another change made with this PR is a consolidation of how the ABI of
each function in a wasm module is selected. The native `SystemV` ABI,
for example, is more efficient at returning multiple values than the
wasmtime version of the ABI (since more things are in more registers).
To continue to take advantage of this Wasmtime will now classify some
functions in a wasm module with the "fast" ABI. Only functions that are
not reachable externally from the module are classified with the fast
ABI (e.g. those not exported, used in tables, or used with `ref.func`).
This should enable purely internal functions of modules to have a faster
calling convention than those which might be exposed to Wasmtime itself.
Closes#1178
* Tweak some names and add docs
* "fix" lightbeam compile
* Fix TODO with dummy environ
* Unwind info is a property of the target, not the ABI
* Remove lightbeam unused imports
* Attempt to fix arm64
* Document new ABIs aren't stable
* Fix filetests to use the right target
* Don't always do 64-bit stores with cranelift
This was overwriting upper bits when 32-bit registers were being stored
into return values, so fix the code inline to do a sized store instead
of one-size-fits-all store.
* At least get tests passing on the old backend
* Fix a typo
* Add some filetests with mixed abi calls
* Get `multi` example working
* Fix doctests on old x86 backend
* Add a mixture of wasmtime/system_v tests
In preparation for adding new encoding modes to the x64 backend (e.g. VEX,
EVEX), this change moves all of the current instruction encoding functions to
`encodings::rex`. This refactor does not change any logic.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.
The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.
This is in preparation for refactoring all x64::Inst arms to use OperandSize.
Current uses of OperandSize fall into two categories:
1. XMM operations which require 32/64 bit operands
2. Immediates which only care about 64-bit or not.
Adds assertions to existing Inst constructors to check that they are passed valid sizes.
This change also removes the implicit widening of 1 and 2 byte values to 4 bytes. from_bytes() is only used by category 2, so removing this behavior will not change any visible behavior.
Overall this change should be a no-op.
1. Restricts max nop size to 15 instead of 16.
2. Fixes an edge case where gen_nop() would return a zero sized intruction on multiples of 16.
3. Clarifies the documentation of the gen_nop interface to state that returning zero is allowed when preferred_size is zero.
This PR propagates "value labels" all the way from CLIF to DWARF
metadata on the emitted machine code. The key idea is as follows:
- Translate value-label metadata on the input into "value_label"
pseudo-instructions when lowering into VCode. These
pseudo-instructions take a register as input, denote a value label,
and semantically are like a "move into value label" -- i.e., they
update the current value (as seen by debugging tools) of the given
local. These pseudo-instructions emit no machine code.
- Perform a dataflow analysis *at the machine-code level*, tracking
value-labels that propagate into registers and into [SP+constant]
stack storage. This is a forward dataflow fixpoint analysis where each
storage location can contain a *set* of value labels, and each value
label can reside in a *set* of storage locations. (Meet function is
pairwise intersection by storage location.)
This analysis traces value labels symbolically through loads and
stores and reg-to-reg moves, so it will naturally handle spills and
reloads without knowing anything special about them.
- When this analysis converges, we have, at each machine-code offset, a
mapping from value labels to some number of storage locations; for
each offset for each label, we choose the best location (prefer
registers). Note that we can choose any location, as the symbolic
dataflow analysis is sound and guarantees that the value at the
value_label instruction propagates to all of the named locations.
- Then we can convert this mapping into a format that the DWARF
generation code (wasmtime's debug crate) can use.
This PR also adds the new-backend variant to the gdb tests on CI.
This follows the implementation in the legacy x86 backend, including
hardcoded sequence that is compatible with what the linker expects. We
could potentially do better here, but it is likely not necessary.
Thanks to @bjorn3 for a bugfix to an earlier version of this.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).
The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.
Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:
- fix REX prefix logic for some 8-bit instructions.
When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
semantics are somewhat subtle: without the REX prefix, register numbers
4--7 correspond to the second-to-lowest byte of the first four registers
(AH, CH, BH, DH), whereas with the REX prefix, these register numbers
correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
emit a REX byte for instructions with 8-bit cases (this is harmless even
if unneeded), but this would unnecessarily inflate code size; instead,
the usual approach is to emit it only for these registers.
This logic was present in some cases but missing for some other
instructions: divide, not, negate, shifts.
Fixes#2508.
- avoid unaligned SSE loads on some f64 ops.
The implementations of several FP ops, such as fabs/fneg, used SSE
instructions. This is not a problem per-se, except that load-op merging
did not take *alignment* into account. Specifically, if an op on an f64
loaded from memory happened to merge that load, and the instruction into
which it was merged was an SSE instruction, then the SSE instruction
imposes stricter (128-bit) alignment requirements than the load.f64 did.
This PR simply forces any instruction lowerings that could use SSE
instructions to implement non-SIMD operations to take inputs in
registers only, and avoid load-op merging.
Fixes#2507.
- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
the remainder in AH, not RDX, different from all the other width-forms
of this instruction.
- select.b1: we were not recognizing selects of boolean values as
integer-typed operations, so we were generating XMM moves instead (!).
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.
(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)
Found by Nathan Ringo on Zulip [1] (thanks!).
[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s
In existing MachInst backends, many instructions -- any that can trap or
result in a relocation -- carry `SourceLoc` values in order to propagate
the location-in-original-source to use to describe resulting traps or
relocation errors.
This is quite tedious, and also error-prone: it is likely that the
necessary plumbing will be missed in some cases, and in any case, it's
unnecessarily verbose.
This PR factors out the `SourceLoc` handling so that it is tracked
during emission as part of the `EmitState`, and plumbed through
automatically by the machine-independent framework. Instruction emission
code that directly emits trap or relocation records can query the
current location as necessary. Then we only need to ensure that memory
references and trap instructions, at their (one) emission point rather
than their (many) lowering/generation points, are wired up correctly.
This does have the side-effect that some loads and stores that do not
correspond directly to user code's heap accesses will have unnecessary
but harmless trap metadata. For example, the load that fetches a code
offset from a jump table will have a 'heap out of bounds' trap record
attached to it; but because it is bounds-checked, and will never
actually trap if the lowering is correct, this should be harmless. The
simplicity improvement here seemed more worthwhile to me than plumbing
through a "corresponds to user-level load/store" bit, because the latter
is a bit complex when we allow for op merging.
Closes#2290: though it does not implement a full "metadata" scheme as
described in that issue, this seems simpler overall.
This approach suffers from memory-size bloat during compile time due to the desire to de-duplicate the constants emitted and reduce runtime memory-size. As a first step, though, this provides an end-to-end mechanism for constants to be emitted in the MachBuffer islands.
In order to register traps for `load_splat`, several instruction formats need knowledge of `SourceLoc`s; however, since the x64 backend does not correctly and completely register traps for `RegMem::Mem` variants I opened https://github.com/bytecodealliance/wasmtime/issues/2290 to discuss and resolve this issue. In the meantime, the current behavior (i.e. remaining largely unaware of `SourceLoc`s) is retained.
A new associated type Info is added to MachInstEmit, which is the
immutable counterpart to State. It can't easily be constructed from an
ABICallee, since it would require adding an associated type to the
latter, and making so leaks the associated type in a lot of places in
the code base and makes the code harder to read. Instead, the EmitInfo
state can simply be passed to the `Vcode::emit` function directly.
As found by @julian-seward1, movss/movsd aren't included in the
zero-latency move instructions section of the Intel optimization manual.
Use MOVAPS instead for those moves.
This PR updates the AArch64 ABI implementation so that it (i) properly
respects that v8-v15 inclusive have callee-save lower halves, and
caller-save upper halves, by conservatively approximating (to full
registers) in the appropriate directions when generating prologue
caller-saves and when informing the regalloc of clobbered regs across
callsites.
In order to prevent saving all of these vector registers in the prologue
of every non-leaf function due to the above approximation, this also
makes use of a new regalloc.rs feature to exclude call instructions'
writes from the clobber set returned by register allocation. This is
safe whenever the caller and callee have the same ABI (because anything
the callee could clobber, the caller is allowed to clobber as well
without saving it in the prologue).
Fixes#2254.
This approach is not the best but avoids an extra instruction; perhaps at some point, as mentioned in https://github.com/bytecodealliance/wasmtime/pull/2248, we will add the extra instruction or refactor things in such a way that this `Inst` variant is unnecessary.
In particular:
- try to optimize the integer emission into a 32-bit emission, when the
high bits are all zero, and stop relying on the caller of `imm_r` to
ensure this.
- rename `Inst::imm_r`/`Inst::Imm_R` to `Inst::imm`/`Inst::Imm`.
- generate a sign-extending mov 32-bit immediate to 64-bits, whenever
possible.
- fix a few places where the previous commit did introduce the
generation of zero-constants with xor, when calling `put_input_to_reg`,
thus clobbering the flags before they were read.
Previously, in #2128, we factored out a common "vanilla 64-bit ABI"
implementation from the AArch64 ABI code, with the idea that this should
be largely compatible with x64. This PR alters the new x64 backend to
make use of the shared infrastructure, removing the duplication that
existed previously. The generated code is nearly (not exactly) the same;
the only difference relates to how the clobber-save region is padded in
the prologue.
This also changes some register allocations in the aarch64 code because
call support in the shared ABI infra now passes a temp vreg in, rather
than requiring use of a fixed, non-allocable temp; tests have been
updated, and the runtime behavior is unchanged.