wasmtime

T0b1/wasmtime

Fork 0

Commit Graph

Author	SHA1	Message	Date
Chris Fallin	1dddba649a	x64 regalloc register order: put caller-saves (volatiles) first. The x64 backend currently builds the `RealRegUniverse` in a way that is generating somewhat suboptimal code. In many blocks, we see uses of callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even in very short leaf functions where there are plenty of volatiles to use. This is leading to unnecessary spills/reloads. On one (local) test program, a medium-sized C benchmark compiled to Wasm and run on Wasmtime, I am seeing a ~10% performance improvement with this change; it will be less pronounced in programs with high register pressure (there we are likely to use all registers regardless, so the prologue/epilogue will save/restore all callee-saves), or in programs with fewer calls, but this is a clear win for small functions and in many cases removes prologue/epilogue clobber-saves altogether. Separately, I think the RA's coalescing is tripping up a bit in some cases; see e.g. the filetest touched by this commit that loads a value into %rsi then moves to %rax and returns immediately. This is an orthogonal issue, though, and should be addressed (if worthwhile) in regalloc.rs.	2020-12-06 22:37:43 -08:00
Chris Fallin	3e516e784b	Fix lowering instruction-sinking (load-merging) bug. This fixes a subtle corner case exposed during fuzzing. If we have a bit of CLIF like: ``` v0 = load.i64 ... v1 = iadd.i64 v0, ... v2 = do_other_thing v1 v3 = load.i64 v1 ``` and if this is lowered using a machine backend that can merge loads into ALU ops, and that has an addressing mode that can look through add ops, then the following can happen: 1. We lower the load at `v3`. This looks backward at the address operand tree and finds that `v1` is `v0` plus other things; it has an addressing mode that can add `v0`'s register and the other things directly; so it calls `put_value_in_reg(v0)` and uses its register in the amode. At this point, the add producing `v1` has no references, so it will not (yet) be codegen'd. 2. We lower `do_other_thing`, which puts `v1` in a register and uses it. the `iadd` now has a reference. 3. We reach the `iadd` and, because it has a reference, lower it. Our machine has the ability to merge a load into an ALU operation. Crucially, we think the load at `v0` is mergeable because it has only one user, the add at `v1` (!). So we merge it. 4. We reach the `load` at `v0` and because it has been merged into the `iadd`, we do not separately codegen it. The register that holds `v0` is thus never written, and the use of this register by the final load (Step 1) will see an undefined value. The logic error here is that in the presence of pattern matching that looks through pure ops, we can end up with multiple uses of a value that originally had a single use (because we allow lookthrough of pure ops in all cases). In other words, the multiple-use-ness of `v1` "passes through" in some sense to `v0`. However, the load sinking logic is not aware of this. The fix, I think, is pretty simple: we disallow an effectful instruction from sinking/merging if it already has some other use when we look back at it. If we disallowed lookthrough of any op that had multiple uses, even pure ones, then we would avoid this scenario; but earlier experiments showed that to have a non-negligible performance impact, so (given that we've worked out the logic above) I think this complexity is worth it.	2020-12-03 14:59:12 -08:00
Chris Fallin	b97f07b405	x64 backend: merge loads into ALU ops when appropriate. This PR makes use of the support in #2366 for sinking effectful instructions and merging them with consumers. In particular, on x86, we want to make use of the ability of many instructions to load one operand directly from memory. That is, instead of this: ``` movq 0(%rdi), %rax addq %rax, %rbx ``` we want to generate this: ``` addq 0(%rdi), %rax ``` As described in more detail in #2366, sinking and merging the load is only possible under certain conditions. In particular, we need to ensure that the use is the only use (otherwise the load happens more than once), and we need to ensure that it does not move across other effectful ops (see #2366 for how we ensure this). This change is actually fairly simple, given that all the framework is in place: we simply pattern-match a load on one operand of an ALU instruction that takes an RMI (reg, mem, or immediate) operand, and generate the mem form when we match. Also makes a drive-by improvement in the x64 backend to use statically-monomorphized `LowerCtx` types rather than a `&mut dyn LowerCtx`. On `bz2.wasm`, this results in ~1% instruction-count reduction. More is likely possible by following up with other instructions that can merge memory loads as well.	2020-11-17 11:06:46 -08:00

Author

SHA1

Message

Date

Chris Fallin

1dddba649a

x64 regalloc register order: put caller-saves (volatiles) first.

The x64 backend currently builds the `RealRegUniverse` in a way that
is generating somewhat suboptimal code. In many blocks, we see uses of
callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even in
very short leaf functions where there are plenty of volatiles to use.
This is leading to unnecessary spills/reloads.

On one (local) test program, a medium-sized C benchmark compiled to Wasm
and run on Wasmtime, I am seeing a ~10% performance improvement with
this change; it will be less pronounced in programs with high register
pressure (there we are likely to use all registers regardless, so the
prologue/epilogue will save/restore all callee-saves), or in programs
with fewer calls, but this is a clear win for small functions and in
many cases removes prologue/epilogue clobber-saves altogether.

Separately, I think the RA's coalescing is tripping up a bit in some
cases; see e.g. the filetest touched by this commit that loads a value
into %rsi then moves to %rax and returns immediately. This is an
orthogonal issue, though, and should be addressed (if worthwhile) in
regalloc.rs.

2020-12-06 22:37:43 -08:00

Chris Fallin

3e516e784b

Fix lowering instruction-sinking (load-merging) bug.

This fixes a subtle corner case exposed during fuzzing. If we have a bit
of CLIF like:

```
    v0 = load.i64 ...
    v1 = iadd.i64 v0, ...
    v2 = do_other_thing v1
    v3 = load.i64 v1
```

and if this is lowered using a machine backend that can merge loads into
ALU ops, *and* that has an addressing mode that can look through add
ops, then the following can happen:

1. We lower the load at `v3`. This looks backward at the address
   operand tree and finds that `v1` is `v0` plus other things; it has an
   addressing mode that can add `v0`'s register and the other things
   directly; so it calls `put_value_in_reg(v0)` and uses its register in
   the amode. At this point, the add producing `v1` has no references,
   so it will not (yet) be codegen'd.
2. We lower `do_other_thing`, which puts `v1` in a register and uses it.
   the `iadd` now has a reference.
3. We reach the `iadd` and, because it has a reference, lower it. Our
   machine has the ability to merge a load into an ALU operation.
   Crucially, *we think the load at `v0` is mergeable* because it has
   only one user, the add at `v1` (!). So we merge it.
4. We reach the `load` at `v0` and because it has been merged into the
   `iadd`, we do not separately codegen it. The register that holds `v0`
   is thus never written, and the use of this register by the final load
   (Step 1) will see an undefined value.

The logic error here is that in the presence of pattern matching that
looks through pure ops, we can end up with multiple uses of a value that
originally had a single use (because we allow lookthrough of pure ops in
all cases). In other words, the multiple-use-ness of `v1` "passes
through" in some sense to `v0`. However, the load sinking logic is not
aware of this.

The fix, I think, is pretty simple: we disallow an effectful instruction
from sinking/merging if it already has some other use when we look back
at it.

If we disallowed lookthrough of *any* op that had multiple uses, even
pure ones, then we would avoid this scenario; but earlier experiments
showed that to have a non-negligible performance impact, so (given that
we've worked out the logic above) I think this complexity is worth it.

2020-12-03 14:59:12 -08:00

Chris Fallin

b97f07b405

x64 backend: merge loads into ALU ops when appropriate.

This PR makes use of the support in #2366 for sinking effectful
instructions and merging them with consumers. In particular, on x86, we
want to make use of the ability of many instructions to load one operand
directly from memory. That is, instead of this:

```
    movq 0(%rdi), %rax
    addq %rax, %rbx
```

we want to generate this:

```
    addq 0(%rdi), %rax
```

As described in more detail in #2366, sinking and merging the load is
only possible under certain conditions. In particular, we need to ensure
that the use is the *only* use (otherwise the load happens more than
once), and we need to ensure that it does not move across other
effectful ops (see #2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is
in place: we simply pattern-match a load on one operand of an ALU
instruction that takes an RMI (reg, mem, or immediate) operand, and
generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use
statically-monomorphized `LowerCtx` types rather than a `&mut dyn
LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is
likely possible by following up with other instructions that can merge
memory loads as well.

2020-11-17 11:06:46 -08:00

3 Commits