On x64, the new backend generates `cmp` instructions at their use-sites
when possible (when the icmp that generates a boolean is known) so that
the condition flows directly through flags rather than a materialized
boolean. E.g., both `bint` (boolean to int) and `select` (conditional
select) instruction lowerings invoke `emit_cmp()` to do so.
Load-op fusion in `emit_cmp()` nominally allowed `cmp` to use its `cmp
reg, mem` form.
However, the mergeable-load condition (load has only single use) was not
adequately checked. Consider the sequence:
```
v2 = load.i64 v1
v3 = icmp eq v0, v2
v4 = bint.i64 v3
v5 = select.i64 v3, v0, v1
```
The load `v2` is only used in the `icmp` at `v3`. However, the cmp will
be separately codegen'd twice, once for the `bint` and once for the
`select`.
Prior to this fix, the above example would result in the load at `v2`
sinking to the `cmp` just above the `select`; we then emit another `cmp`
for the `bint`, but the load has already been used once so we do not
allow merging. We thus (i) expect the register for `v2` to contain the
loaded value, but (ii) skip the codegen for the load because it has been
sunk. This results in a regalloc error (unexpected livein) as the
unfilled register is upward-exposed to the entry point.
Because of this, we need to accept only the reg, reg form in
`emit_cmp()` (and the FP equivalent). We could get marginally better
code by tracking whether the `cmp` we are emitting comes from an
`icmp`/`fcmp` with only one use; but IMHO simplicity is a better rule
here when subtle interactions occur.
A branch is considered side-effecting and so updates the instruction
color (which is our way of computing how far instructions can sink).
However, in the lowering loop, we did not update current instruction
color when scanning backward across branches, which are side-effecting.
As a result, the color was stale and fewer load-op merges were permitted
than are actually possible.
Note that this would not have resulted in any correctness issues, as the
stale color is too high (so no merges are permitted that should have
been disallowed).
Fixes#2562.
The translation of Operator::Select and Operator::TypedSelect for vector-typed
operands, lacks the relevant bitcasting of the operands to I8X16. This commit
adds it.
wiggle enforces this but the specially-overridden proc_exit
function did not. Now that we proc_exit through wiggle, wiggle
will trap if it cannot import the instance's memory
also, make noreturn functions always return a Trap
wasmtime-wiggle can trivially turn a wiggle::Trap into a wasmtime::Trap.
lucet will have to do the same.
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
the missing memory behavior was always a silly thing, that we generate a
function for wasmtime which is Result<_, Trap> we can just Err(Trap)
when the memory export is missing.
On AArch64, the zero register (xzr) and the stack pointer (xsp) are
alternately named by the same index `31` in machine code depending on
context. In particular, in the reg-reg-immediate ALU instruction form,
add/subtract will use the stack pointer, not the zero register, if index
31 is given for the first (register) source arg.
In a few places, we were emitting subtract instructions with the zero
register as an argument and a reg/immediate as the second argument. When
an immediate could be incorporated directly (we have the `iconst`
definition visible), this would result in incorrect code being
generated.
This issue was found in `ineg` and in the sequence for vector
right-shifts.
Reported by Ian Cullinan; thanks!
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.
(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)
Found by Nathan Ringo on Zulip [1] (thanks!).
[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s