The StructReturn ABI is fairly simple at the codegen/isel level: we only
need to take care to return the sret pointer as one of the return values
if that wasn't specified in the initial function signature.
Struct arguments are a little more complex. A struct argument is stored
as a chunk of memory in the stack-args space. However, the CLIF
semantics are slightly special: on the caller side, the parameter passed
in is a pointer to an arbitrary memory block, and we must memcpy this
data to the on-stack struct-argument; and on the callee side, we provide
a pointer to the passed-in struct-argument as the CLIF block param
value.
This is necessary to support various ABIs other than Wasm, such as that
of Rust (with the cg_clif codegen backend).
This follows the implementation in the legacy x86 backend, including
hardcoded sequence that is compatible with what the linker expects. We
could potentially do better here, but it is likely not necessary.
Thanks to @bjorn3 for a bugfix to an earlier version of this.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).
The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.
Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:
- fix REX prefix logic for some 8-bit instructions.
When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
semantics are somewhat subtle: without the REX prefix, register numbers
4--7 correspond to the second-to-lowest byte of the first four registers
(AH, CH, BH, DH), whereas with the REX prefix, these register numbers
correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
emit a REX byte for instructions with 8-bit cases (this is harmless even
if unneeded), but this would unnecessarily inflate code size; instead,
the usual approach is to emit it only for these registers.
This logic was present in some cases but missing for some other
instructions: divide, not, negate, shifts.
Fixes#2508.
- avoid unaligned SSE loads on some f64 ops.
The implementations of several FP ops, such as fabs/fneg, used SSE
instructions. This is not a problem per-se, except that load-op merging
did not take *alignment* into account. Specifically, if an op on an f64
loaded from memory happened to merge that load, and the instruction into
which it was merged was an SSE instruction, then the SSE instruction
imposes stricter (128-bit) alignment requirements than the load.f64 did.
This PR simply forces any instruction lowerings that could use SSE
instructions to implement non-SIMD operations to take inputs in
registers only, and avoid load-op merging.
Fixes#2507.
- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
the remainder in AH, not RDX, different from all the other width-forms
of this instruction.
- select.b1: we were not recognizing selects of boolean values as
integer-typed operations, so we were generating XMM moves instead (!).
This commit updates the various tooling used by wasmtime which has new
updates to the module linking proposal. This is done primarily to sync
with WebAssembly/module-linking#26. The main change implemented here is
that wasmtime now supports creating instances from a set of values, nott
just from instantiating a module. Additionally subtyping handling of
modules with respect to imports is now properly handled by desugaring
two-level imports to imports of instances.
A number of small refactorings are included here as well, but most of
them are in accordance with the changes to `wasmparser` and the updated
binary format for module linking.
An intermittent failure during SIMD spectests is described in #2432. This patch
corrects code written in a way that assumes comparing fp equality of a register with itself will
always return true. This is not true when the register value is NaN as NaN. In this case, and
with all ordered comparisons involving NaN, the comparisons will always return false.
This patch corrects that assumption for SIMD Fabs and Fneg which seem to be the only
instructions generating the failure with #2432.
On x64, the new backend generates `cmp` instructions at their use-sites
when possible (when the icmp that generates a boolean is known) so that
the condition flows directly through flags rather than a materialized
boolean. E.g., both `bint` (boolean to int) and `select` (conditional
select) instruction lowerings invoke `emit_cmp()` to do so.
Load-op fusion in `emit_cmp()` nominally allowed `cmp` to use its `cmp
reg, mem` form.
However, the mergeable-load condition (load has only single use) was not
adequately checked. Consider the sequence:
```
v2 = load.i64 v1
v3 = icmp eq v0, v2
v4 = bint.i64 v3
v5 = select.i64 v3, v0, v1
```
The load `v2` is only used in the `icmp` at `v3`. However, the cmp will
be separately codegen'd twice, once for the `bint` and once for the
`select`.
Prior to this fix, the above example would result in the load at `v2`
sinking to the `cmp` just above the `select`; we then emit another `cmp`
for the `bint`, but the load has already been used once so we do not
allow merging. We thus (i) expect the register for `v2` to contain the
loaded value, but (ii) skip the codegen for the load because it has been
sunk. This results in a regalloc error (unexpected livein) as the
unfilled register is upward-exposed to the entry point.
Because of this, we need to accept only the reg, reg form in
`emit_cmp()` (and the FP equivalent). We could get marginally better
code by tracking whether the `cmp` we are emitting comes from an
`icmp`/`fcmp` with only one use; but IMHO simplicity is a better rule
here when subtle interactions occur.
A branch is considered side-effecting and so updates the instruction
color (which is our way of computing how far instructions can sink).
However, in the lowering loop, we did not update current instruction
color when scanning backward across branches, which are side-effecting.
As a result, the color was stale and fewer load-op merges were permitted
than are actually possible.
Note that this would not have resulted in any correctness issues, as the
stale color is too high (so no merges are permitted that should have
been disallowed).
Fixes#2562.
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
On AArch64, the zero register (xzr) and the stack pointer (xsp) are
alternately named by the same index `31` in machine code depending on
context. In particular, in the reg-reg-immediate ALU instruction form,
add/subtract will use the stack pointer, not the zero register, if index
31 is given for the first (register) source arg.
In a few places, we were emitting subtract instructions with the zero
register as an argument and a reg/immediate as the second argument. When
an immediate could be incorporated directly (we have the `iconst`
definition visible), this would result in incorrect code being
generated.
This issue was found in `ineg` and in the sequence for vector
right-shifts.
Reported by Ian Cullinan; thanks!
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.
(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)
Found by Nathan Ringo on Zulip [1] (thanks!).
[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s
This PR adds a new `isa::lookup_variant()` that takes a `BackendVariant`
(`Legacy`, `MachInst` or `Any`), and exposes both x86 backends as
separate variants if both are compiled into the build.
This will allow some new use-cases that require both backends in the
same process: for example, differential fuzzing between old and new
backends, or perhaps allowing for dynamic feature-flag selection between
the backends.
WebAssembly memory operations are by definition little-endian even on
big-endian target platforms. However, other memory accesses will require
native target endianness (e.g. to access parts of the VMContext that is
also accessed by VM native code). This means on big-endian targets,
the code generator will have to handle both little- and big-endian
memory accesses. However, there is currently no way to encode that
distinction into the Cranelift IR that describes memory accesses.
This patch provides such a way by adding an (optional) explicit
endianness marker to an instance of MemFlags. Since each Cranelift IR
instruction that describes memory accesses already has an instance of
MemFlags attached, this can now be used to provide endianness
information.
Note that by default, memory accesses will continue to use the native
target ISA endianness. To override this to specify an explicit
endianness, a MemFlags value that was built using the set_endianness
routine must be used. This patch does so for accesses that implement
WebAssembly memory operations.
This patch addresses issue #2124.
This commit updates all the wasm-tools crates that we use and enables
fuzzing of the module linking proposal in our various fuzz targets. This
also refactors some of the dummy value generation logic to not be
fallible and to always succeed, the thinking being that we don't want to
accidentally hide errors while fuzzing. Additionally instantiation is
only allowed to fail with a `Trap`, other failure reasons are unwrapped.
As a subtle consequence of the recent load-op fusion, popcnt of a
value that came from a load.i32 was compiling into a 64-bit load. This
is a result of the way in which x86 infers the width of loads: it is a
consequence of the instruction containing the memory reference, not the
memory reference itself. So the `input_to_reg_mem()` helper (convert an
instruction input into a register or memory reference) was providing the
appropriate memory reference for the result of a load.i32, but never
encoded the assumption that it would only be used in a 32-bit
instruction. It turns out that popcnt.i32 uses a 64-bit instruction to
load this RM op, hence widening a 32-bit to 64-bit load (which is
problematic when the offset is (memory_length - 4)).
Separately, popcnt was using the RM operand twice, resulting in two
loads if we merged a load. This isn't a correctness bug in practice
because only a racy sequence (store interleaving between the loads)
would produce incorrect results, but we decided earlier to treat loads
as effectful for now, neither reordering nor duplicating them, to
deliberately reduce complexity.
Because of the second issue, the fix is just to force the operand into a
register always, so any source load will not be merged.
Discovered via fuzzing with oss-fuzz.
Lucet uses stack probes rather than explicit stack limit checks as
Wasmtime does. In bytecodealliance/lucet#616, I have discovered that I
previously was not running some Lucet runtime tests with the new
backend, so was missing some test failures due to missing pieces in the
new backend.
This PR adds (i) calls to probestack, when enabled, in the prologue of
every function with a stack frame larger than one page (configurable via
flags); and (ii) trap metadata for every instruction on x86-64 that can
access the stack, hence be the first point at which a stack overflow is
detected when the stack pointer is decremented.
The x64 backend currently builds the `RealRegUniverse` in a way that
is generating somewhat suboptimal code. In many blocks, we see uses of
callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even in
very short leaf functions where there are plenty of volatiles to use.
This is leading to unnecessary spills/reloads.
On one (local) test program, a medium-sized C benchmark compiled to Wasm
and run on Wasmtime, I am seeing a ~10% performance improvement with
this change; it will be less pronounced in programs with high register
pressure (there we are likely to use all registers regardless, so the
prologue/epilogue will save/restore all callee-saves), or in programs
with fewer calls, but this is a clear win for small functions and in
many cases removes prologue/epilogue clobber-saves altogether.
Separately, I think the RA's coalescing is tripping up a bit in some
cases; see e.g. the filetest touched by this commit that loads a value
into %rsi then moves to %rax and returns immediately. This is an
orthogonal issue, though, and should be addressed (if worthwhile) in
regalloc.rs.
The current code doesn't correctly handle the case where `ExtendOp::UXTW` has
as source, a constant-producing insn that produces a negative (32-bit) value.
Then the value is incorrectly sign-extended to 64 bits (in fact, this has
already been done by `ctx.get_constant(insn)`), whereas it needs to be zero
extended. The obvious fix, done here, is just to force bits 63:32 of the
extension to zero, hence zero-extending it.
This fixes a subtle corner case exposed during fuzzing. If we have a bit
of CLIF like:
```
v0 = load.i64 ...
v1 = iadd.i64 v0, ...
v2 = do_other_thing v1
v3 = load.i64 v1
```
and if this is lowered using a machine backend that can merge loads into
ALU ops, *and* that has an addressing mode that can look through add
ops, then the following can happen:
1. We lower the load at `v3`. This looks backward at the address
operand tree and finds that `v1` is `v0` plus other things; it has an
addressing mode that can add `v0`'s register and the other things
directly; so it calls `put_value_in_reg(v0)` and uses its register in
the amode. At this point, the add producing `v1` has no references,
so it will not (yet) be codegen'd.
2. We lower `do_other_thing`, which puts `v1` in a register and uses it.
the `iadd` now has a reference.
3. We reach the `iadd` and, because it has a reference, lower it. Our
machine has the ability to merge a load into an ALU operation.
Crucially, *we think the load at `v0` is mergeable* because it has
only one user, the add at `v1` (!). So we merge it.
4. We reach the `load` at `v0` and because it has been merged into the
`iadd`, we do not separately codegen it. The register that holds `v0`
is thus never written, and the use of this register by the final load
(Step 1) will see an undefined value.
The logic error here is that in the presence of pattern matching that
looks through pure ops, we can end up with multiple uses of a value that
originally had a single use (because we allow lookthrough of pure ops in
all cases). In other words, the multiple-use-ness of `v1` "passes
through" in some sense to `v0`. However, the load sinking logic is not
aware of this.
The fix, I think, is pretty simple: we disallow an effectful instruction
from sinking/merging if it already has some other use when we look back
at it.
If we disallowed lookthrough of *any* op that had multiple uses, even
pure ones, then we would avoid this scenario; but earlier experiments
showed that to have a non-negligible performance impact, so (given that
we've worked out the logic above) I think this complexity is worth it.
It turns out that Souper does not allow a constant to be assigned to a variable,
they may only be used as operands. The 2.0.0 version of the `souper-ir` crate
correctly reflects this. In the `cranelift_codegen::souper_harvest` module, we
need to modify our Souper IR harvester so that it delays converting `iconst` and
`bconst` into Souper IR until their values are used as operands. Finally, some
unit tests in the `peepmatic-souper` crate need some small updates as well.
* Implement imported/exported modules/instances
This commit implements the final piece of the module linking proposal
which is to flesh out the support for importing/exporting instances and
modules. This ended up having a few changes:
* Two more `PrimaryMap` instances are now stored in an `Instance`. The value
for instances is `InstanceHandle` (pretty easy) and for modules it's
`Box<dyn Any>` (less easy).
* The custom host state for `InstanceHandle` for `wasmtime` is now
`Arc<TypeTables` to be able to fully reconstruct an instance's types
just from its instance.
* Type matching for imports now has been updated to take
instances/modules into account.
One of the main downsides of this implementation is that type matching
of imports is duplicated between wasmparser and wasmtime, leading to
posssible bugs especially in the subtelties of module linking. I'm not
sure how best to unify these two pieces of validation, however, and it
may be more trouble than it's worth.
cc #2094
* Update wat/wast/wasmparser
* Review comments
* Fix a bug in publish script to vendor the right witx
Currently there's two witx binaries in our repository given the two wasi
spec submodules, so this updates the publication script to vendor the
right one.
- Sort by generated-code offset to maintain invariant and avoid gimli
panic.
- Fix srcloc interaction with branch peephole optimization in
MachBuffer: if a srcloc range overlaps with a branch that is
truncated, remove that srcloc range.
These issues were found while fuzzing the new backend (#2453); I suspect
that they arise with the new backend because we can sink instructions
(e.g. loads or extends) in more interesting ways than before, but I'm
not entirely sure.
Test coverage will be via the fuzz corpus once #2453 lands.
A dynamic heap address computation may create up to two conditional
branches: the usual bounds-check, but also (in some cases) an
offset-addition overflow check.
The x64 backend had reversed the condition code for this check,
resulting in an always-trapping execution for a valid offset. I'm
somewhat surprised this has existed so long, but I suppose the
particular conditions (large offset, small offset guard, dynamic heap)
have been somewhat rare in our testing so far.
Found via fuzzing in #2453.