Commit Graph

319 Commits

Chris Fallin
4638de673c x64 bugfix: prevent load-op fusion of cmp because it could be emitted multiple times.
On x64, the new backend generates `cmp` instructions at their use-sites
when possible (when the icmp that generates a boolean is known) so that
the condition flows directly through flags rather than a materialized
boolean. E.g., both `bint` (boolean to int) and `select` (conditional
select) instruction lowerings invoke `emit_cmp()` to do so.

Load-op fusion in `emit_cmp()` nominally allowed `cmp` to use its `cmp
reg, mem` form.

However, the mergeable-load condition (the load has only a single use) was
not adequately checked. Consider the sequence:

```
    v2 = load.i64 v1
    v3 = icmp eq v0, v2
    v4 = bint.i64 v3
    v5 = select.i64 v3, v0, v1
```

The load `v2` is only used in the `icmp` at `v3`. However, the cmp will
be separately codegen'd twice, once for the `bint` and once for the
`select`.

Prior to this fix, the above example would result in the load at `v2`
sinking to the `cmp` just above the `select`; we then emit another `cmp`
for the `bint`, but the load has already been used once so we do not
allow merging. We thus (i) expect the register for `v2` to contain the
loaded value, but (ii) skip the codegen for the load because it has been
sunk. This results in a regalloc error (unexpected livein) as the
unfilled register is upward-exposed to the entry point.

Because of this, we need to accept only the `reg, reg` form in
`emit_cmp()` (and its FP equivalent). We could get marginally better
code by tracking whether the `cmp` we are emitting comes from an
`icmp`/`fcmp` with only one use, but IMHO, given how subtle the
interactions here are, simplicity is the better rule.
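
As a rough illustration of the hazard and the fix, here is a small self-contained Rust model (not the actual Cranelift lowering code; all names are made up):

```
// Illustrative model only: a load may be sunk (merged) into at most one
// consumer, and once sunk it is never materialized into a register.
#[derive(Default)]
struct LoadState {
    sunk: bool,       // load was merged into a consumer; no standalone load is emitted
    reg_filled: bool, // a register holding the loaded value was emitted
}

// Old behavior: emit_cmp() tries to merge the load every time it is called.
fn emit_cmp_merging(load: &mut LoadState) {
    if !load.sunk {
        load.sunk = true; // first call (e.g. for `select`): sink into `cmp reg, mem`
    } else {
        // Second call (e.g. for `bint`): expects a register that was never
        // filled, which later surfaces as a regalloc "unexpected livein" error.
        assert!(load.reg_filled);
    }
}

// Fixed behavior: always force the operand into a register first (`cmp reg, reg`).
fn emit_cmp_reg_reg(load: &mut LoadState) {
    if !load.reg_filled {
        load.reg_filled = true; // codegen the load exactly once
    }
}

fn main() {
    let mut good = LoadState::default();
    emit_cmp_reg_reg(&mut good);
    emit_cmp_reg_reg(&mut good); // safe: the load was materialized once, reused twice

    let mut bad = LoadState::default();
    emit_cmp_merging(&mut bad);
    // emit_cmp_merging(&mut bad); // would panic, modeling the regalloc error
}
```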
2021-01-13 09:48:51 -08:00
Andrew Brown
b25a3c387e fix: dst should be Writable, not ValueRegs 2021-01-08 16:49:28 -08:00
Andrew Brown
09a5b91b9d x64: make several structures debuggable 2021-01-08 16:21:57 -08:00
Andrew Brown
bb2dd5b68b [machinst x64]: implement load*_zero for x64 2021-01-08 16:21:57 -08:00
Chris Fallin
81bc811236 Merge pull request #2558 from cfallin/pic-symbol-refs
x64: support PC-rel symbol references using the GOT when in PIC mode.
2021-01-08 10:03:10 -08:00
Chris Fallin
3ee898cb2c x64: support PC-rel symbol references using the GOT when in PIC mode. 2021-01-07 22:46:56 -08:00
Chris Fallin
6eea015d6c Multi-register value support: framework for Values wider than machine regs.
This will allow support for `I128` values everywhere, and for `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
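
A rough sketch of the kind of representation this framework introduces, with made-up types rather than the real MachInst ones:

```
// Illustrative only: a Value may occupy one or two machine registers.
#[derive(Clone, Copy, Debug)]
struct Reg(u8); // stand-in for a real machine-register type

#[derive(Clone, Copy, Debug)]
enum ValueRegs {
    One(Reg),
    Two { lo: Reg, hi: Reg }, // e.g. an I128 on x86-64, or an I64 on ARM32/x86-32
}

impl ValueRegs {
    fn num_regs(&self) -> usize {
        match self {
            ValueRegs::One(_) => 1,
            ValueRegs::Two { .. } => 2,
        }
    }
}

fn main() {
    let narrow = ValueRegs::One(Reg(0));
    let wide = ValueRegs::Two { lo: Reg(1), hi: Reg(2) };
    assert_eq!(narrow.num_regs(), 1);
    assert_eq!(wide.num_regs(), 2);
}
```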
2021-01-05 17:45:02 -08:00
Chris Fallin
dbd2241b60 x64: handle tests of b1 values correctly (only LSB is defined).
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.

(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)

Found by Nathan Ringo on Zulip [1] (thanks!).

[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s
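
A tiny Rust illustration of the invariant (not generated code, just the predicate being lowered): only the LSB of a `b1`'s register is meaningful, so a correct truthiness test masks it instead of comparing the whole register against zero.

```
// Bits above a b1's width are undefined; only the LSB is meaningful.
fn b1_is_true(reg: u64) -> bool {
    reg & 1 != 0 // test only the low bit, then branch
}

fn b1_is_true_buggy(reg: u64) -> bool {
    reg != 0 // old behavior: wrong whenever undefined high bits happen to be set
}

fn main() {
    let false_b1_with_garbage_high_bits = 0xffff_ffff_ffff_fff0_u64;
    assert!(!b1_is_true(false_b1_with_garbage_high_bits));
    assert!(b1_is_true_buggy(false_b1_with_garbage_high_bits)); // the bug
}
```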
2021-01-05 14:45:46 -08:00
Yury Delendik
2964023a77 [SIMD][x86_64] Add encoding for PMADDWD (#2530)
* [SIMD][x86_64] Add encoding for PMADDWD

* also for "experimental_x64"
2020-12-24 07:52:50 -06:00
Johnnie Birch
f705a72aeb Refactor packed moves to use xmm_mov instead of xmm_rm_r
Refactors the previous packed-move implementation to use xmm_mov
instead of xmm_rm_r, which simplifies register accounting during
lowering.
2020-12-16 17:13:27 -08:00
Johnnie Birch
51973aefbb Implements x64 SIMD loads for the new backend. 2020-12-16 17:13:27 -08:00
Chris Fallin
6632c45c01 x64 lowering fix: i32.popcnt should not merge load and make it 64-bit.
As a subtle consequence of the recent load-op fusion, popcnt of a
value that came from a load.i32 was compiling into a 64-bit load. This
is a result of the way load widths are determined on x86: the width
comes from the instruction containing the memory reference, not from the
memory reference itself. So the `input_to_reg_mem()` helper (convert an
instruction input into a register or memory reference) was providing the
appropriate memory reference for the result of a load.i32, but never
encoded the assumption that it would only be used in a 32-bit
instruction. It turns out that popcnt.i32 uses a 64-bit instruction to
consume this RM operand, hence widening the 32-bit load to a 64-bit one
(which is problematic when the offset is (memory_length - 4)).

Separately, popcnt was using the RM operand twice, resulting in two
loads if we merged a load. This isn't a correctness bug in practice,
because only a racy sequence (a store interleaved between the two loads)
would produce incorrect results, but we decided earlier to treat loads
as effectful for now, neither reordering nor duplicating them, to
deliberately reduce complexity.

Because of the second issue, the fix is simply to always force the
operand into a register, so a source load is never merged.

Discovered via fuzzing with oss-fuzz.
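
A small self-contained model of the width problem (illustrative only): the access width comes from the consumer, so a merged 32-bit load consumed by a 64-bit instruction reads past the end when the offset is near the heap bound.

```
// On x86 the load width comes from the consuming instruction, not from the
// memory operand, so merging a load.i32 into a 64-bit instruction silently
// widens the access from 4 bytes to 8.
fn load_bytes(heap: &[u8], offset: usize, width: usize) -> Option<&[u8]> {
    heap.get(offset..offset.checked_add(width)?)
}

fn main() {
    let heap = vec![0u8; 64 * 1024];
    let offset = heap.len() - 4;
    assert!(load_bytes(&heap, offset, 4).is_some()); // what load.i32 promises
    assert!(load_bytes(&heap, offset, 8).is_none()); // what the widened, merged load did
}
```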
2020-12-08 12:24:34 -08:00
Chris Fallin
2cec20aa57 Merge pull request #2486 from cfallin/fix-probestack
Two Lucet-related fixes to stack overflow handling.
2020-12-07 16:47:37 -08:00
Chris Fallin
3a01d14712 Two Lucet-related fixes to stack overflow handling.
Lucet uses stack probes rather than explicit stack limit checks as
Wasmtime does. In bytecodealliance/lucet#616, I have discovered that I
previously was not running some Lucet runtime tests with the new
backend, so was missing some test failures due to missing pieces in the
new backend.

This PR adds (i) calls to probestack, when enabled, in the prologue of
every function with a stack frame larger than one page (configurable via
flags); and (ii) trap metadata for every instruction on x86-64 that can
access the stack, hence be the first point at which a stack overflow is
detected when the stack pointer is decremented.
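
A hypothetical sketch of the prologue decision described in (i); the flag name and page size below are assumptions, not Cranelift's actual settings API:

```
// Assumed guard-page granularity for illustration.
const GUARD_PAGE_SIZE: u32 = 4096;

fn prologue_needs_probestack(frame_size: u32, probestack_enabled: bool) -> bool {
    // A frame no larger than one page cannot skip past the guard page, so only
    // larger frames need an explicit probe before the stack pointer moves.
    probestack_enabled && frame_size > GUARD_PAGE_SIZE
}

fn main() {
    assert!(!prologue_needs_probestack(256, true));
    assert!(prologue_needs_probestack(64 * 1024, true));
    assert!(!prologue_needs_probestack(64 * 1024, false));
}
```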
2020-12-07 16:08:53 -08:00
Chris Fallin
1dddba649a x64 regalloc register order: put caller-saves (volatiles) first.
The x64 backend currently builds the `RealRegUniverse` in a way that
generates somewhat suboptimal code. In many blocks, we see uses of
callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even in
very short leaf functions where there are plenty of volatiles to use.
This is leading to unnecessary spills/reloads.

On one (local) test program, a medium-sized C benchmark compiled to Wasm
and run on Wasmtime, I am seeing a ~10% performance improvement with
this change; it will be less pronounced in programs with high register
pressure (there we are likely to use all registers regardless, so the
prologue/epilogue will save/restore all callee-saves), or in programs
with fewer calls, but this is a clear win for small functions and in
many cases removes prologue/epilogue clobber-saves altogether.

Separately, I think the RA's coalescing is tripping up a bit in some
cases; see e.g. the filetest touched by this commit that loads a value
into %rsi then moves to %rax and returns immediately. This is an
orthogonal issue, though, and should be addressed (if worthwhile) in
regalloc.rs.
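
An illustrative sketch of the ordering change (the register lists and helper are made up, not the `RealRegUniverse` construction itself):

```
// Offer caller-saved (volatile) registers to the allocator before callee-saved
// ones, so short leaf functions avoid prologue/epilogue clobber-saves.
fn allocation_order<'a>(caller_saved: &[&'a str], callee_saved: &[&'a str]) -> Vec<&'a str> {
    caller_saved.iter().chain(callee_saved.iter()).copied().collect()
}

fn main() {
    let caller_saved = ["rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11"];
    let callee_saved = ["rbx", "r12", "r13", "r14", "r15"];
    let order = allocation_order(&caller_saved, &callee_saved);
    assert_eq!(order.first(), Some(&"rax")); // volatiles come first
    assert_eq!(order.last(), Some(&"r15"));  // callee-saves only as a last resort
}
```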
2020-12-06 22:37:43 -08:00
Johnnie Birch
a548516f97 Enable SIMD spec tests for f32x4_rounding and f64x2_rounding.
Also address some review comments pointing out minor issues.
2020-12-02 13:44:51 -08:00
Johnnie Birch
a33e755cb2 Adds x86 SIMD support for Ceil, Floor, Trunc, and Nearest 2020-12-02 13:44:51 -08:00
Chris Fallin
c9a81f008d x64 backend: fix condition-code used for part of explicit heap check.
A dynamic heap address computation may create up to two conditional
branches: the usual bounds-check, but also (in some cases) an
offset-addition overflow check.

The x64 backend had reversed the condition code for this check,
resulting in an always-trapping execution for a valid offset. I'm
somewhat surprised this has existed so long, but I suppose the
particular conditions (large offset, small offset guard, dynamic heap)
have been somewhat rare in our testing so far.

Found via fuzzing in #2453.
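
An illustrative model of the two checks (not Cranelift's actual lowering): the overflow check must trap exactly when the offset addition wraps, which is the condition that had been reversed.

```
// A dynamic-heap address computation traps if the index+offset addition
// overflows, or if the effective offset exceeds the dynamic bound.
fn heap_addr(base: u64, index: u64, offset: u64, bound: u64) -> Result<u64, &'static str> {
    // Offset-addition overflow check: trap only when the addition wraps.
    // (The bug was the reversed condition code, i.e. trapping when it did not wrap.)
    let effective = index.checked_add(offset).ok_or("heap_oob: offset add overflow")?;
    // Bounds check against the dynamically-loaded heap bound.
    if effective >= bound {
        return Err("heap_oob: out of bounds");
    }
    Ok(base.wrapping_add(effective))
}

fn main() {
    assert!(heap_addr(0x10000, 8, 16, 65536).is_ok());
    assert!(heap_addr(0x10000, u64::MAX, 16, 65536).is_err()); // overflow check fires
    assert!(heap_addr(0x10000, 70000, 0, 65536).is_err());     // bounds check fires
}
```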
2020-12-02 10:40:53 -08:00
Johnnie Birch
09f3d4e331 Refactor convert from float to unsigned int and add comments 2020-11-29 00:04:24 -08:00
Johnnie Birch
ade9f12c72 Add support for X86_64 SIMD narrow instructions for vcode backend
Adds lowering support for:
i8x16.narrow_i16x8_s
i8x16.narrow_i16x8_u
i16x8.narrow_i32x4_s
i16x8.narrow_i32x4_u
2020-11-23 09:58:39 -08:00
Johnnie Birch
2cc501427e Add remaining X86_64 support for pack w/ signed/unsigned saturation
Adds lowering for packssdw, packusdw, packuswb
2020-11-22 23:14:29 -08:00
Johnnie Birch
258013cff1 Add X86_64 support for SWidenHigh and UWidenHigh for the vcode backend
Support is based on SSE4.1
2020-11-22 22:14:19 -08:00
Johnnie Birch
124096735b Add support for palignr for X86_64 vcode backend 2020-11-22 22:14:02 -08:00
Johnnie Birch
f9937575d6 Add support for SwidenLow and UwidenLow for the X86_64 vcode backend
Adds support using lowerings compatible with SSE4.1
2020-11-22 21:38:53 -08:00
Johnnie Birch
615a575da1 Add support for x86_64 packed move lowering for the vcode backend 2020-11-22 20:23:00 -08:00
Johnnie Birch
b6d783a120 Adds support for i32x4.trunc_sat_f32x4_u 2020-11-22 12:00:54 -08:00
Alex Crichton
4d64c68b05 Run rustfmt 1.48
Run rustfmt over wasmtime with the new stable release which looks like
it wants to reformat a few lines.
2020-11-19 11:12:30 -08:00
Chris Fallin
073c727a74 x64 and aarch64: carry MemFlags on loads/stores; don't emit trap info unless an op can trap.
This end result was previously achieved by carrying a `SourceLoc` on
every load/store, which was somewhat cumbersome, and which only indirectly
encoded metadata about a memory reference (whether it can trap) by its presence
or absence. We have a type for this -- `MemFlags` -- that tells us
everything we might want to know about a load or store, and we should
plumb it through to code emission instead.

This PR attaches a `MemFlags` to an `Amode` on x64, and puts it on load
and store `Inst` variants on aarch64. These two choices seem to factor
things out in the nicest way: there are relatively few load/store insts
on aarch64 but many addressing modes, while the opposite is true on x64.
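
A hedged sketch of the shape of the change, with stand-in types rather than the exact Cranelift definitions:

```
// The memory access carries MemFlags describing what emission needs to know,
// so trap metadata is recorded only when the access can actually trap.
#[derive(Clone, Copy, Default)]
struct MemFlags {
    notrap: bool, // access is known not to trap (e.g. a spill-slot access)
}

struct Amode {
    offset: i32, // displacement from some base register, for illustration
    flags: MemFlags,
}

fn trap_record_for(amode: &Amode) -> Option<&'static str> {
    // ... actual instruction bytes would be emitted here ...
    if amode.flags.notrap {
        None // no trap metadata needed for accesses that cannot trap
    } else {
        Some("heap_oob trap record at this instruction's offset")
    }
}

fn main() {
    let heap_access = Amode { offset: 16, flags: MemFlags::default() };
    let spill_access = Amode { offset: -8, flags: MemFlags { notrap: true } };
    assert!(trap_record_for(&heap_access).is_some());
    assert!(trap_record_for(&spill_access).is_none());
}
```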
2020-11-17 11:43:06 -08:00
Chris Fallin
b97f07b405 x64 backend: merge loads into ALU ops when appropriate.
This PR makes use of the support in #2366 for sinking effectful
instructions and merging them with consumers. In particular, on x86, we
want to make use of the ability of many instructions to load one operand
directly from memory. That is, instead of this:

```
    movq 0(%rdi), %rax
    addq %rax, %rbx
```

we want to generate this:

```
    addq 0(%rdi), %rbx
```

As described in more detail in #2366, sinking and merging the load is
only possible under certain conditions. In particular, we need to ensure
that the use is the *only* use (otherwise the load happens more than
once), and we need to ensure that it does not move across other
effectful ops (see #2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is
in place: we simply pattern-match a load on one operand of an ALU
instruction that takes an RMI (reg, mem, or immediate) operand, and
generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use
statically-monomorphized `LowerCtx` types rather than a `&mut dyn
LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is
likely possible by following up with other instructions that can merge
memory loads as well.
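
A hypothetical sketch of the pattern match, with made-up types standing in for `LowerCtx` and the x64 operand kinds:

```
enum Operand {
    Reg(u8),
    Mem { base: u8, offset: i32 },
}

// If the ALU input is the sole use of a mergeable load (and the load may move
// over the intervening instructions), sink it and use the `op reg, mem` form;
// otherwise put the value in a register.
fn alu_rhs(mergeable_load: Option<(u8, i32)>, value_reg: u8) -> Operand {
    match mergeable_load {
        Some((base, offset)) => Operand::Mem { base, offset },
        None => Operand::Reg(value_reg),
    }
}

fn main() {
    assert!(matches!(alu_rhs(Some((7, 0)), 0), Operand::Mem { .. }));
    assert!(matches!(alu_rhs(None, 3), Operand::Reg(3)));
}
```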
2020-11-17 11:06:46 -08:00
Chris Fallin
712ff22492 AArch64 SIMD: pattern-match load+splat into LD1R instruction. 2020-11-16 15:59:28 -08:00
Chris Fallin
3c8cb7b908 MachInst lowering logic: allow effectful instructions to merge.
This PR updates the "coloring" scheme that accounts for side-effects in
the MachInst lowering logic. As a result, the new backends will now be
able to merge effectful operations (such as memory loads) *into* other
operations; previously, only the other way (pure ops merged into
effectful ops) was possible. This will allow, for example, a load+ALU-op
combination, as is common on x86. It should even allow a load + ALU-op +
store sequence to merge into one lowered instruction.

The scheme arose from many fruitful discussions with @julian-seward1
(thanks!); significant credit is due to him for the insights here.

The first insight is that given the right basic conditions, i.e.  that
the root instruction is the only use of an effectful instruction's
result, all we need is that the "color" of the effectful instruction is
*one less* than the color of the current instruction. It's easier to
think about colors on the program points between instructions: if the
color coming *out* of the first (effectful def) instruction and *in* to
the second (effectful or effect-free use) instruction are the same, then
they can merge. Basically the color denotes a version of global state;
if the same, then no other effectful ops happened in the meantime.

The second insight is that we can keep state as we scan, tracking the
"current color", and *update* this when we sink (merge) an op. Hence
when we sink a load into another op, we effectively *re-color* every
instruction it moved over; this may allow further sinks.

Consider the example (and assume that we consider loads effectful in
order to conservatively ensure a strong memory model; otherwise, replace
with other effectful value-producing insts):

```
  v0 = load x
  v1 = load y
  v2 = add v0, 1
  v3 = add v1, 1
```

Scanning from bottom to top, we first see the add producing `v3` and we
can sink the load producing `v1` into it, producing a load + ALU-op
machine instruction. This is legal because `v1` moves over only `v2`,
which is a pure instruction. Consider, though, `v2`: under a simple
scheme that has no other context, `v0` could not sink to `v2` because it
would move over `v1`, another load. But because we already sunk `v1`
down to `v3`, we are free to sink `v0` to `v2`; the update of the
"current color" during the scan allows this.

This PR also cleans up the `LowerCtx` interface a bit at the same time:
whereas previously it always gave some subset of (constant, mergeable
inst, register) directly from `LowerCtx::get_input()`, it now returns
zero or more of (constant, mergeable inst) from
`LowerCtx::maybe_get_input_as_source_or_const()`, and returns the
register only from `LowerCtx::put_input_in_reg()`. This removes the need
to explicitly denote uses of the register, so it's a little safer.

Note that this PR does not actually make use of the new ability to merge
loads into other ops; that will come in future PRs, especially to
optimize the `x64` backend by using direct-memory operands.
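
A toy Rust model of the coloring rule on the example above (not the real data structures):

```
// The color at a program point counts how many effectful instructions have
// executed so far; a sole-use effectful def may sink into its use iff the
// color flowing out of the def equals the color flowing into the use.
// Sinking re-colors the points the instruction moved over.
struct Inst {
    effectful: bool,
}

fn main() {
    // v0 = load x; v1 = load y; v2 = add v0, 1; v3 = add v1, 1
    let insts = [
        Inst { effectful: true },  // v0
        Inst { effectful: true },  // v1
        Inst { effectful: false }, // v2
        Inst { effectful: false }, // v3
    ];
    // colors[i] = color at the program point just before insts[i].
    let mut colors = vec![0u32; insts.len() + 1];
    for (i, inst) in insts.iter().enumerate() {
        colors[i + 1] = colors[i] + u32::from(inst.effectful);
    }
    // Can the load `v1` (index 1) sink into its sole use `v3` (index 3)?
    // Color out of the def == color into the use: only the pure `v2` lies between.
    assert_eq!(colors[2], colors[3]);
    // Sink it. Every point it moved over now has one fewer effectful op before it.
    for c in &mut colors[2..=3] {
        *c -= 1;
    }
    // That re-coloring is what now lets `v0` (index 0) sink into `v2` (index 2):
    assert_eq!(colors[1], colors[2]);
}
```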
2020-11-16 14:53:45 -08:00
Andrew Brown
bd93e69eb4 [machinst x64]: implement packed shifts 2020-11-12 14:21:45 -08:00
Andrew Brown
8ba92853be [machinst x64]: add punpack[hl]bw instructions 2020-11-12 14:21:45 -08:00
Andrew Brown
8131b15921 [machinst x64]: allow addressing of constants 2020-11-12 14:21:45 -08:00
Chris Fallin
89dbc4590d Merge pull request #2363 from cfallin/extend-only-if-abi
Do value-extensions at ABI boundaries only when ABI requires it.
2020-11-12 12:26:20 -08:00
Chris Fallin
fd6433aaf5 Merge pull request #2395 from cfallin/lucet-x64-support
Add support for brff/brif and ifcmp_sp to new x64 backend to support Lucet.
2020-11-12 12:10:52 -08:00
Chris Fallin
5df8840483 Add support for brff/brif and ifcmp_sp to new x64 backend to support Lucet.
`lucetc` currently *almost*, but not quite, works with the new x64
backend; the only missing piece is support for the particular
instructions emitted as part of its prologue stack-check.

We do not normally see `brff`, `brif`, or `ifcmp_sp` in CLIF generated by
`cranelift-wasm` without the old-backend legalization rules, so these
were not supported in the new x64 backend as they were not necessary for
Wasm MVP support. Using them resulted in an `unimplemented!()` panic.

This PR adds support for `brff` and `brif` analogously to how AArch64
implements them, by pattern-matching the `ifcmp` / `ffcmp` directly.
Then `ifcmp_sp` is a straightforward variant of `ifcmp`.

Along the way, this also removes the notion of "fallthrough block" from
the branch-group lowering method; instead, `fallthrough` instructions
are handled as normal branches to their explicitly-provided targets,
which (in the original CLIF) match the fallthrough block. The reason for
this is that the block reordering done as part of lowering can change
the fallthrough block. We were not using `fallthrough` instructions in
the output produced by `cranelift-wasm`, so this, too, was not
previously caught.

With these changes, the `lucetc` crate in Lucet passes all tests with
the `x64` feature-flag added to its `cranelift-codegen` dependency.
2020-11-11 13:43:39 -08:00
Chris Fallin
4dce51096d MachInst backends: handle SourceLocs out-of-band, not in Insts.
In existing MachInst backends, many instructions -- any that can trap or
result in a relocation -- carry `SourceLoc` values in order to propagate
the location-in-original-source to use to describe resulting traps or
relocation errors.

This is quite tedious, and also error-prone: it is likely that the
necessary plumbing will be missed in some cases, and in any case, it's
unnecessarily verbose.

This PR factors out the `SourceLoc` handling so that it is tracked
during emission as part of the `EmitState`, and plumbed through
automatically by the machine-independent framework. Instruction emission
code that directly emits trap or relocation records can query the
current location as necessary. Then we only need to ensure that memory
references and trap instructions, at their (one) emission point rather
than their (many) lowering/generation points, are wired up correctly.

This does have the side-effect that some loads and stores that do not
correspond directly to user code's heap accesses will have unnecessary
but harmless trap metadata. For example, the load that fetches a code
offset from a jump table will have a 'heap out of bounds' trap record
attached to it; but because it is bounds-checked, and will never
actually trap if the lowering is correct, this should be harmless.  The
simplicity improvement here seemed more worthwhile to me than plumbing
through a "corresponds to user-level load/store" bit, because the latter
is a bit complex when we allow for op merging.

Closes #2290: though it does not implement a full "metadata" scheme as
described in that issue, this seems simpler overall.
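
A hedged sketch of the idea, with approximate names rather than the exact MachInst API:

```
// The per-function emission state carries the current source location, so
// individual Inst variants no longer need to store a SourceLoc themselves.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct SourceLoc(u32);

#[derive(Default)]
struct EmitState {
    cur_srcloc: SourceLoc,
}

impl EmitState {
    // The machine-independent framework calls this before emitting each inst.
    fn pre_inst(&mut self, loc: SourceLoc) {
        self.cur_srcloc = loc;
    }
    // Emission code that produces a trap or relocation record queries it here,
    // at the single emission point rather than at every lowering site.
    fn cur_srcloc(&self) -> SourceLoc {
        self.cur_srcloc
    }
}

fn main() {
    let mut state = EmitState::default();
    state.pre_inst(SourceLoc(42));
    assert_eq!(state.cur_srcloc(), SourceLoc(42));
}
```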
2020-11-10 15:46:53 -08:00
Yury Delendik
f60c0f3ec3 cranelift: refactor unwind logic to accommodate multiple backends (#2357)
*    Make cranelift_codegen::isa::unwind::input public
*    Move UnwindCode's common offset field out of the structure
*    Make MachCompileResult::unwind_info more generic
*    Record initial stack pointer offset
2020-11-05 16:57:40 -06:00
Andrew Brown
83f182b390 Implement initial emission of constants
This approach suffers from some compile-time memory bloat because it de-duplicates the emitted constants in order to reduce runtime memory size. As a first step, though, this provides an end-to-end mechanism for emitting constants in the MachBuffer islands.
2020-11-05 14:25:02 -08:00
Chris Fallin
a2bbb198de Do value-extensions at ABI boundaries only when ABI requires it.
There has been some confusion over the meaning of the "sign-extend"
(`sext`) and "zero-extend" (`uext`) attributes on parameters and return
values in signatures. According to the three implemented backends, these
attributes indicate that a value narrower than a full register should
always be extended in the way specified. However, they are much more
useful if they mean "extend in this way if the ABI requires extending":
only the ABI backend knows whether or not a particular ABI (e.g., x64
SysV vs. x64 Baldrdash) requires extensions, while only the frontend
(CLIF generator) knows whether or not a value is signed, so the two have
to work in concert.

This is the result of some very helpful discussion in #2354 (thanks to
@uweigand for raising the issue and @bjorn3 for helping to reason about
it).

This change respects the extension attributes in the above way, rather
than unconditionally extending, to avoid potential performance
degradation as we introduce more extension attributes on signatures.
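
An illustrative sketch of "extend only if the ABI requires it", with stand-in types for the signature attributes:

```
#[derive(Clone, Copy)]
enum ArgExt {
    None,
    Uext, // frontend: if an extension is needed, zero-extend
    Sext, // frontend: if an extension is needed, sign-extend
}

// Only the ABI backend knows whether extension is required; only the frontend
// knows the signedness. The attribute supplies the latter, the flag the former.
fn pass_i32_arg(value: u32, ext: ArgExt, abi_requires_extension: bool) -> u64 {
    if !abi_requires_extension {
        // No extension performed (the cast here merely widens this model value).
        return value as u64;
    }
    match ext {
        ArgExt::None | ArgExt::Uext => value as u64,    // zero-extend
        ArgExt::Sext => value as i32 as i64 as u64,     // sign-extend
    }
}

fn main() {
    // An ABI that does not require extension: the narrow value passes through.
    assert_eq!(pass_i32_arg(0xffff_fff0, ArgExt::Sext, false), 0x0000_0000_ffff_fff0);
    // An ABI that does require full-register extension:
    assert_eq!(pass_i32_arg(0xffff_fff0, ArgExt::Sext, true), 0xffff_ffff_ffff_fff0);
}
```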
2020-11-05 11:54:35 -08:00
Ulrich Weigand
80c2d70d2d machinst ABI: Support for accumulating outgoing args
When performing a function call, the platform ABI may require space
on the stack to hold outgoing arguments and/or return values.

Currently, this is supported via decrementing the stack pointer
before the call and incrementing it afterwards, using the
emit_stack_pre_adjust and emit_stack_post_adjust methods of
ABICaller.  However, on some platforms it would be preferable
to just allocate enough space for any call done in the function
in the caller's prologue instead.

This patch adds support to allow back-ends to choose that method.
Instead of calling emit_stack_pre/post_adjust around a call, they
simply call a new accumulate_outgoing_args_size method of
ABICaller instead.  This will pass on the required size to the
ABICallee structure of the calling function, which will accumulate
the maximum size required for all function calls.

That accumulated size is then passed to the gen_clobber_save
and gen_clobber_restore functions so they can include the size
in the stack allocation / deallocation that already happens in
the prologue / epilogue code.
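
A hedged sketch of the accumulation scheme; the names approximate those in the message, not the exact trait methods:

```
#[derive(Default)]
struct AbiCalleeState {
    outgoing_args_size: u32, // max over all call sites in the function
}

impl AbiCalleeState {
    // Called once per call site, instead of emitting SP adjustments around it.
    fn accumulate_outgoing_args_size(&mut self, size: u32) {
        self.outgoing_args_size = self.outgoing_args_size.max(size);
    }
    // The prologue then allocates the fixed frame plus the accumulated area,
    // and the epilogue deallocates the same amount.
    fn total_frame_size(&self, fixed_frame_size: u32) -> u32 {
        fixed_frame_size + self.outgoing_args_size
    }
}

fn main() {
    let mut callee = AbiCalleeState::default();
    callee.accumulate_outgoing_args_size(16); // first call site
    callee.accumulate_outgoing_args_size(48); // larger call site
    assert_eq!(callee.total_frame_size(64), 112);
}
```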
2020-11-03 18:49:34 +01:00
Chris Fallin
5ab7b4aa7f Merge pull request #2345 from uweigand/abi-stackalign
machinst ABI: Allow back-end to define stack alignment
2020-11-03 09:02:41 -08:00
Ulrich Weigand
c9bc4edd08 machinst ABI: Pass fixed frame size to gen_clobber_restore
The ABI common code currently passes the fixed frame size to
the gen_clobber_save back-end routine, which is required to
emit code to allocate the required stack space in the prologue.

Similarly, the back-end needs to emit code to de-allocate the
stack in the epilogue.  However, at this point the back-end
does not have access to that fixed frame size value any more.
With targets that use a frame pointer, this does not matter,
since de-allocation can be done simply by assigning the frame
pointer back to the stack pointer.  However, on targets that
do not use a frame pointer, the frame size is required.

To give back-ends that option, this patch changes ABI common
code to pass the fixed frame size to gen_clobber_restore as
well (the same value as is passed to gen_clobber_save).
2020-11-03 11:15:03 +01:00
Ulrich Weigand
d02ae3940c machinst ABI: Allow back-end to define stack alignment
The common gen_prologue code currently assumes that the stack
pointer has to be aligned to twice the word size.  While this
is true for many ABIs, it does not hold universally.

This patch adds a new callback stack_align that back-ends can
provide to define the specific stack alignment required by the
ABI on that platform.
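
An illustrative sketch, with assumed trait and method names:

```
// The back-end supplies its ABI's stack alignment instead of common code
// assuming twice the word size.
trait AbiMachineSpec {
    fn word_bytes() -> u32;
    // Default preserves the old behavior; back-ends override as their ABI requires.
    fn stack_align() -> u32 {
        2 * Self::word_bytes()
    }
}

struct X64;
impl AbiMachineSpec for X64 {
    fn word_bytes() -> u32 {
        8
    }
    // 16-byte alignment matches the default, so no override is needed here.
}

fn align_frame(size: u32, align: u32) -> u32 {
    (size + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(X64::stack_align(), 16);
    assert_eq!(align_frame(40, X64::stack_align()), 48);
}
```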
2020-11-03 09:43:55 +01:00
Johnnie Birch
c32740ffcd Updates comments on Int to Float conversion
Int-to-float conversion for unsigned ints has already merged, but
review comments made on a different PR about the same change are
addressed in this PR.
2020-10-30 16:49:30 -07:00
Andrew Brown
6c6d958f38 [machinst x64]: implement packed pmin/pmax 2020-10-28 16:03:53 -07:00
Andrew Brown
6725b6b129 [machinst x64]: implement bitmask 2020-10-28 15:16:36 -07:00
Andrew Brown
5b9a21e099 Add missing SourceLoc to newly-emitted instructions
The changes in https://github.com/bytecodealliance/wasmtime/pull/2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
2020-10-28 14:33:09 -07:00
Johnnie Birch
8bbe6a25a9 Add support for packed float to signed int conversion
Implements i32x4.trunc_sat_f32x4_s
2020-10-28 13:02:50 -07:00