Previously, the multiple flags for certain AVX512 instructions were
checked using `OR`: e.g., if the CPU has AVX512VL `OR` AVX512DQ,
emit `VPMULLQ`. This is incorrect--the logic should be `AND`. The Intel
Software Developer Manual, vol. 1, sec. 15.4, has more information on
this (notable there is the suggestion to check with `XGETBV` that the OS
is allowing the use of the XMM registers--but that is a separate issue).
This change switches to `AND` logic in the new backend.
When shuffling values from two different registers, the x64 lowering for
`i8x16.shuffle` must first shuffle each register separately and then OR
the results with SSE instructions. With `VPERMI2B`, available in
AVX512VL + AVX512VBMI, this can be done in a single instruction after
the shuffle mask has been moved into the destination register. This
change uses `VPERMI2B` for that case when the CPU supports it.
When AVX512VL or AVX512BITALG are available, Wasm SIMD's `popcnt`
instruction can be lowered to a single x64 instruction, `VPOPCNTB`,
instead of 8+ instructions.
When AVX512VL and AVX512F are available, use a single instruction
(`VCVTUDQ2PS`) instead of a length 9-instruction sequence. This
optimization is a port from the legacy x86 backend.
Previously, the x64 backend's ABI code would generate a sign-extending
load when loading a less-than-64-bit integer from a spillslot. This is
incorrect: e.g., for i32s > 0x80000000, this would result in all high
bits set.
This interacts poorly with another optimization. Normally, the invariant
is that the high bits of a register holding a value of a certain type,
beyond that type's bits, are undefined. However, as an optimization, we
recognize and use the fact that on x86-64, 32-bit instructions zero the
upper 32 bits. This allows us to elide a 32-to-64-bit zero-extend op
(turning it into just a move, which can then sometimes disappear
entirely due to register coalescing).
If a spill and reload happen between the production of a 32-bit value
from an instruction known to zero the upper bits and its use, then we
will rely on zero upper bits that might actually be set by a
sign-extend. This will result in incorrect execution.
As a fix, we stick to a simple invariant: we always spill and reload a
full 64 bits when handling integer registers on x64. This ensures that
no bits are mangled.
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
This adds the machinery to encode the VPMULLQ instruction which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
In order to benchmark the encoding code with criterion, the functions
and structures must be public. Moving this code to its own module
(instead of keeping as a submodule to `inst`), allows `inst` to remain
private. This avoids having to expose and document (or ignore
documenting) the numerous instruction variants in `inst` while allowing
access to the encoding code. This commit changes no functionality.
* x64: add EVEX encoding mechanism
Also, includes an empty stub module for the VEX encoding.
* x64: lower abs.i64x2 to VPABSQ when available
* x64: refactor EVEX encodings to use `EvexInstruction`
This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.
* x64: rename encoding CodeSink to ByteSink
The patch extends the unwinder to support targets that do not need
to use a dedicated frame pointer register. Specifically, the
changes include:
- Change the "fp" routine in the RegisterMapper to return an
*optional* frame pointer regsiter via Option<Register>.
- On targets that choose to not define a FP register via the above
routine, the UnwindInst::DefineNewFrame operation no longer switches
the CFA to be defined in terms of the FP. (The operation still can
be used to define the location of the clobber area.)
- In addition, on targets that choose not to define a FP register, the
UnwindInst::PushFrameRegs operation is not supported.
- There is a new operation UnwindInst::StackAlloc that needs to be
called on targets without FP whenever the stack pointer is updated.
This caused the CFA offset to be adjusted accordingly. (On
targets with FP this operation is a no-op.)
Previously, `Inst::store` only understood a subset of the scalar types, which resulted in failures seen in #2826. This change allows `Inst::store` to generate instructions for all scalar widths (`8 | 16 | 32 | 64`) since all of these are supported in the emission code of `Inst::MovRM`.
Because there are instructions that are present in more than one ISA feature set, we need to see if any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
* Fully support multiple returns in Wasmtime
For quite some time now Wasmtime has "supported" multiple return values,
but only in the mose bare bones ways. Up until recently you couldn't get
a typed version of functions with multiple return values, and never have
you been able to use `Func::wrap` with functions that return multiple
values. Even recently where `Func::typed` can call functions that return
multiple values it uses a double-indirection by calling a trampoline
which calls the real function.
The underlying reason for this lack of support is that cranelift's ABI
for returning multiple values is not possible to write in Rust. For
example if a wasm function returns two `i32` values there is no Rust (or
C!) function you can write to correspond to that. This commit, however
fixes that.
This commit adds two new ABIs to Cranelift: `WasmtimeSystemV` and
`WasmtimeFastcall`. The intention is that these Wasmtime-specific ABIs
match their corresponding ABI (e.g. `SystemV` or `WindowsFastcall`) for
everything *except* how multiple values are returned. For multiple
return values we simply define our own version of the ABI which Wasmtime
implements, which is that for N return values the first is returned as
if the function only returned that and the latter N-1 return values are
returned via an out-ptr that's the last parameter to the function.
These custom ABIs provides the ability for Wasmtime to bind these in
Rust meaning that `Func::wrap` can now wrap functions that return
multiple values and `Func::typed` no longer uses trampolines when
calling functions that return multiple values. Although there's lots of
internal changes there's no actual changes in the API surface area of
Wasmtime, just a few more impls of more public traits which means that
more types are supported in more places!
Another change made with this PR is a consolidation of how the ABI of
each function in a wasm module is selected. The native `SystemV` ABI,
for example, is more efficient at returning multiple values than the
wasmtime version of the ABI (since more things are in more registers).
To continue to take advantage of this Wasmtime will now classify some
functions in a wasm module with the "fast" ABI. Only functions that are
not reachable externally from the module are classified with the fast
ABI (e.g. those not exported, used in tables, or used with `ref.func`).
This should enable purely internal functions of modules to have a faster
calling convention than those which might be exposed to Wasmtime itself.
Closes#1178
* Tweak some names and add docs
* "fix" lightbeam compile
* Fix TODO with dummy environ
* Unwind info is a property of the target, not the ABI
* Remove lightbeam unused imports
* Attempt to fix arm64
* Document new ABIs aren't stable
* Fix filetests to use the right target
* Don't always do 64-bit stores with cranelift
This was overwriting upper bits when 32-bit registers were being stored
into return values, so fix the code inline to do a sized store instead
of one-size-fits-all store.
* At least get tests passing on the old backend
* Fix a typo
* Add some filetests with mixed abi calls
* Get `multi` example working
* Fix doctests on old x86 backend
* Add a mixture of wasmtime/system_v tests
This PR switches the default backend on x86, for both the
`cranelift-codegen` crate and for Wasmtime, to the new
(`MachInst`-style, `VCode`-based) backend that has been under
development and testing for some time now.
The old backend is still available by default in builds with the
`old-x86-backend` feature, or by requesting `BackendVariant::Legacy`
from the appropriate APIs.
As part of that switch, it adds some more runtime-configurable plumbing
to the testing infrastructure so that tests can be run using the
appropriate backend. `clif-util test` is now capable of parsing a
backend selector option from filetests and instantiating the correct
backend.
CI has been updated so that the old x86 backend continues to run its
tests, just as we used to run the new x64 backend separately.
At some point, we will remove the old x86 backend entirely, once we are
satisfied that the new backend has not caused any unforeseen issues and
we do not need to revert.
In preparation for adding new encoding modes to the x64 backend (e.g. VEX,
EVEX), this change moves all of the current instruction encoding functions to
`encodings::rex`. This refactor does not change any logic.
* Use stable Rust on CI to test the x64 backend
This commit leverages the newly-released 1.51.0 compiler to test the
new backend on Windows and Linux with a stable compiler instead of a
nightly compiler. This isolates the nightly build to just the nightly
documentation generation and fuzzing, both of which rely on nightly for
the best results right now.
* Use updated stable in book build job
* Run rustfmt for new stable
* Silence new warnings for wasi-nn
* Allow some dead code in the x64 backend
Looks like new rustc is better about emitting some dead-code warnings
* Update rust in peepmatic job
* Fix a test in the pooling allocator
* Remove `package.metdata.docs.rs` temporarily
Needs resolution of https://github.com/rust-lang/cargo/pull/9300 first
* Fix a warning in a wasi-nn example
The codegen for div/rem ops has two modes, depending on the
`avoid_div_traps` flag: it can either do all checks for trapping
conditions explicitly, and use explicit trap instructions, then use a
hardware divide instruction that will not trap (`avoid_div_traps ==
true`); or it can run in a mode where a hardware FP fault on the divide
instruction implies a Wasm trap (`avoid_div_traps == false`). Wasmtime
uses the former while Lucet (for example) uses the latter.
It turns out that because we run all our spec tests run under Wasmtime,
we missed a spec corner case that fails in the latter: INT_MIN % -1 == 0
per the spec, but causes a trap with the x86 signed divide/remainder
instruction. Hence, in Lucet, this specific remainder computation would
incorrectly result in a Wasm trap.
This PR fixes the issue by just forcing use of the explicit-checks
implementation for `srem` even when `avoid_div_traps` is false.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.
The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.
This adds support for the "fastcall" ABI, which is the native C/C++ ABI
on Windows platforms on x86-64. It is similar to but not exactly like
System V; primarily, its argument register assignments are different,
and it requires stack shadow space.
Note that this also adjusts the handling of multi-register values in the
shared ABI implementation, and with this change, adjusts handling of
`i128`s on *both* Fastcall/x64 *and* SysV/x64 platforms. This was done
to align with actual behavior by the "rustc ABI" on both platforms, as
mapped out experimentally (Compiler Explorer link in comments). This
behavior is gated under the `enable_llvm_abi_extensions` flag.
Note also that this does *not* add x64 unwind info on Windows. That will
come in a future PR (but is planned!).
This instruction has a single instruction lowering in AVX512F/VL and a three instruction lowering in AVX but neither is currently supported in the x64 backend. To implement this, we instead subtract the vector from 0 and use a blending instruction to pick the lanes containing the absolute value.
If an instruction has more than one trap record associated with it (for
example: a divide instruction that has participated in load-op fusion,
so we have both a heap-out-of-bounds trap record due to its load and a
divide-by-zero trap record due to its divide op), the current MachBuffer
code would emit only one of the trap records to the sink.
Separately, divide instructions probably shouldn't merge loads, because
the two separate possible traps at one location might be confusing for
some embedders (certainly in Lucet). Divide seems to be the only case in
our current codegen where such merging might occur. This PR changes the
lowering to always force the divisor into a register.
Finally, while working out why trap records were not appearing, I had
noticed that `isa::x64::emit_std_enc_mem()` was only emitting heap-OOB
trap metadata for loads/stores when it had a srcloc. This PR ensures
that the metadata is emitted even when the srcloc is empty.
Note that none of the above presents a security or correctness problem;
trap metadata only affects the status that we return to the embedder
when a Wasm program terminates with a trap.
This fixes#2672 and #2679, and also fixes an incorrect instruction
emission (`test` with small immediate) that we had missed earlier.
The shift-related fixes have to do with (i) shifts by 0 bits, as a
special case that must be handled; and (ii) shifts by a 128-bit amount,
which we can handle by just dropping the upper half (we only use 3--7
bits of shift amount).
This adjusts the lowerings appropriately, and also adds run-tests to
ensure that the lowerings actually execute correctly (previously we only
had compile-tests with golden lowerings; I'd like to correct this for
more ops eventually, adding run-tests beyond what the Wasm spec and
frontend covers).
This unifies the logic around Rex prefix emission and hopefully makes REX prefix errors less likely.
There are still several instructions that use other sources to determine the flags, so set_w and clear_w are left as is.
Additional cleanups:
* Change always_emit_if_8bit_needed to take a Reg instead of a u8 for type safety.
* Deduplicated emission code in MovRM.
This is in preparation for refactoring all x64::Inst arms to use OperandSize.
Current uses of OperandSize fall into two categories:
1. XMM operations which require 32/64 bit operands
2. Immediates which only care about 64-bit or not.
Adds assertions to existing Inst constructors to check that they are passed valid sizes.
This change also removes the implicit widening of 1 and 2 byte values to 4 bytes. from_bytes() is only used by category 2, so removing this behavior will not change any visible behavior.
Overall this change should be a no-op.
1. Restricts max nop size to 15 instead of 16.
2. Fixes an edge case where gen_nop() would return a zero sized intruction on multiples of 16.
3. Clarifies the documentation of the gen_nop interface to state that returning zero is allowed when preferred_size is zero.
This PR propagates "value labels" all the way from CLIF to DWARF
metadata on the emitted machine code. The key idea is as follows:
- Translate value-label metadata on the input into "value_label"
pseudo-instructions when lowering into VCode. These
pseudo-instructions take a register as input, denote a value label,
and semantically are like a "move into value label" -- i.e., they
update the current value (as seen by debugging tools) of the given
local. These pseudo-instructions emit no machine code.
- Perform a dataflow analysis *at the machine-code level*, tracking
value-labels that propagate into registers and into [SP+constant]
stack storage. This is a forward dataflow fixpoint analysis where each
storage location can contain a *set* of value labels, and each value
label can reside in a *set* of storage locations. (Meet function is
pairwise intersection by storage location.)
This analysis traces value labels symbolically through loads and
stores and reg-to-reg moves, so it will naturally handle spills and
reloads without knowing anything special about them.
- When this analysis converges, we have, at each machine-code offset, a
mapping from value labels to some number of storage locations; for
each offset for each label, we choose the best location (prefer
registers). Note that we can choose any location, as the symbolic
dataflow analysis is sound and guarantees that the value at the
value_label instruction propagates to all of the named locations.
- Then we can convert this mapping into a format that the DWARF
generation code (wasmtime's debug crate) can use.
This PR also adds the new-backend variant to the gdb tests on CI.
This follows the implementation in the legacy x86 backend, including
hardcoded sequence that is compatible with what the linker expects. We
could potentially do better here, but it is likely not necessary.
Thanks to @bjorn3 for a bugfix to an earlier version of this.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).
The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.
Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:
- fix REX prefix logic for some 8-bit instructions.
When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
semantics are somewhat subtle: without the REX prefix, register numbers
4--7 correspond to the second-to-lowest byte of the first four registers
(AH, CH, BH, DH), whereas with the REX prefix, these register numbers
correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
emit a REX byte for instructions with 8-bit cases (this is harmless even
if unneeded), but this would unnecessarily inflate code size; instead,
the usual approach is to emit it only for these registers.
This logic was present in some cases but missing for some other
instructions: divide, not, negate, shifts.
Fixes#2508.
- avoid unaligned SSE loads on some f64 ops.
The implementations of several FP ops, such as fabs/fneg, used SSE
instructions. This is not a problem per-se, except that load-op merging
did not take *alignment* into account. Specifically, if an op on an f64
loaded from memory happened to merge that load, and the instruction into
which it was merged was an SSE instruction, then the SSE instruction
imposes stricter (128-bit) alignment requirements than the load.f64 did.
This PR simply forces any instruction lowerings that could use SSE
instructions to implement non-SIMD operations to take inputs in
registers only, and avoid load-op merging.
Fixes#2507.
- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
the remainder in AH, not RDX, different from all the other width-forms
of this instruction.
- select.b1: we were not recognizing selects of boolean values as
integer-typed operations, so we were generating XMM moves instead (!).
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.