From 3e516e784b6e953fbb426eb38af57d5d72fdbcd0 Mon Sep 17 00:00:00 2001
From: Chris Fallin
Date: Thu, 3 Dec 2020 14:33:43 -0800
Subject: [PATCH] Fix lowering instruction-sinking (load-merging) bug.

This fixes a subtle corner case exposed during fuzzing. If we have a bit
of CLIF like:

```
v0 = load.i64 ...
v1 = iadd.i64 v0, ...
v2 = do_other_thing v1
v3 = load.i64 v1
```

and if this is lowered using a machine backend that can merge loads into
ALU ops, *and* that has an addressing mode that can look through add ops,
then the following can happen:

1. We lower the load at `v3`. This looks backward at the address operand
   tree and finds that `v1` is `v0` plus other things; it has an
   addressing mode that can add `v0`'s register and the other things
   directly; so it calls `put_value_in_reg(v0)` and uses its register in
   the amode. At this point, the add producing `v1` has no references,
   so it will not (yet) be codegen'd.

2. We lower `do_other_thing`, which puts `v1` in a register and uses it.
   The `iadd` now has a reference.

3. We reach the `iadd` and, because it has a reference, lower it. Our
   machine has the ability to merge a load into an ALU operation.
   Crucially, *we think the load at `v0` is mergeable* because it has
   only one user, the add at `v1` (!). So we merge it.

4. We reach the `load` at `v0` and, because it has been merged into the
   `iadd`, we do not separately codegen it. The register that holds `v0`
   is thus never written, and the use of this register by the final load
   (Step 1) will see an undefined value.

The logic error here is that, in the presence of pattern matching that
looks through pure ops, we can end up with multiple lowered uses of a
value that has only a single IR-level use (because we allow lookthrough
of pure ops in all cases). In other words, the multiple-use-ness of `v1`
"passes through" in some sense to `v0`. However, the load-sinking logic
is not aware of this.

The fix, I think, is pretty simple: we disallow an effectful instruction
from sinking/merging if its result already has a lowered use (i.e., some
other lowered code has already put it in a register) by the time we look
back at it. If we disallowed lookthrough of *any* op that had multiple
uses, even pure ones, then we would avoid this scenario; but earlier
experiments showed that to have a non-negligible performance impact, so
(given that we've worked out the logic above) I think this complexity is
worth it.
---
 cranelift/codegen/src/machinst/lower.rs            |  1 +
 .../filetests/filetests/isa/x64/load-op.clif       | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/cranelift/codegen/src/machinst/lower.rs b/cranelift/codegen/src/machinst/lower.rs
index 32dfaa336c..6a42f3a2b6 100644
--- a/cranelift/codegen/src/machinst/lower.rs
+++ b/cranelift/codegen/src/machinst/lower.rs
@@ -913,6 +913,7 @@ impl<'func, I: VCodeInst> Lower<'func, I> {
         // the code-motion.
         if self.cur_scan_entry_color.is_some()
             && self.value_uses[val] == 1
+            && self.value_lowered_uses[val] == 0
             && self.num_outputs(src_inst) == 1
             && self
                 .side_effect_inst_entry_colors
diff --git a/cranelift/filetests/filetests/isa/x64/load-op.clif b/cranelift/filetests/filetests/isa/x64/load-op.clif
index 87069262ed..77e4b05420 100644
--- a/cranelift/filetests/filetests/isa/x64/load-op.clif
+++ b/cranelift/filetests/filetests/isa/x64/load-op.clif
@@ -44,3 +44,18 @@ block0(v0: i64, v1: i8):
 ; nextln: addl %esi, %r12d
     return v3
 }
+
+function %no_merge_if_lookback_use(i64, i64) -> i64 {
+block0(v0: i64, v1: i64):
+    v2 = load.i64 v0
+    v3 = iadd.i64 v2, v0
+    store.i64 v3, v1
+    v4 = load.i64 v3
+    return v4
+    ; check: movq 0(%rdi), %r12
+    ; nextln: movq %r12, %r13
+    ; nextln: addq %rdi, %r13
+    ; nextln: movq %r13, 0(%rsi)
+    ; nextln: movq 0(%r12,%rdi,1), %r12
+    ; nextln: movq %r12, %rax
+}
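
For readers skimming the hunk above, here is a minimal sketch of the guard
this change adds. The standalone function and its parameter names are
illustrative stand-ins, not the real `Lower` API; only `value_uses`,
`value_lowered_uses`, and the entry-color/outputs checks correspond to
the fields tested in `lower.rs`, and the real code works on actual
instruction data rather than plain integers.

```rust
/// Illustrative restatement (not the real lower.rs code): an effectful
/// instruction may be sunk/merged into its single user only if no other
/// lowered code has already consumed its result in a register.
fn may_sink_effectful_result(
    in_scan: bool,             // stands in for cur_scan_entry_color.is_some()
    value_uses: usize,         // IR-level uses of the value
    value_lowered_uses: usize, // uses already emitted during lowering
    num_outputs: usize,        // outputs of the producing instruction
    colors_ok: bool,           // stands in for the side-effect color check
) -> bool {
    in_scan
        && value_uses == 1
        && value_lowered_uses == 0 // new guard: a lookback already used it
        && num_outputs == 1
        && colors_ok
}

fn main() {
    // The fuzz case above: `v0` has one IR-level use (the `iadd`), but the
    // lookback in Step 1 already put it in a register, so merging is refused.
    assert!(!may_sink_effectful_result(true, 1, 1, 1, true));
    // With no prior lowered use, the merge is still allowed as before.
    assert!(may_sink_effectful_result(true, 1, 0, 1, true));
}
```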