x64: Add lea-based lowering for iadd (#5986)
* x64: Refactor `Amode` computation in ISLE

This commit replaces the previous computation of `Amode` with a different set of rules that are intended to achieve the same purpose but are structured differently. The motivation for this commit becomes more relevant in the next commit, where `lea` will possibly be used to lower the `iadd` instruction on x64. Doing so caused a stack overflow in the test suite during the compilation phase of a wasm module, specifically in the `amode_add` function. This function is recursively defined in terms of itself and recurses as deep as the deepest `iadd` chain in a program. A particular test in our test suite has a 10k-long chain of `iadd` which ended up causing a stack overflow in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE unconditionally peels all the `iadd` nodes away and looks at all of them, even if most end up in intermediate registers along the way. Given that structure I couldn't find a way to easily abort the recursion. The new `to_amode` helper is structured in a similar fashion but only recurses far enough to fold items into the final `Amode`, instead of recursing through items which themselves don't end up in the `Amode`. Put another way, the `amode_add` helper previously might emit `x64_add` instructions, but it no longer does that.

The goal of this commit is to preserve all the original `Amode` optimizations, however. For some parts, though, it relies more on egraph optimizations having run, since if an `iadd` is 10k deep it doesn't try to find a constant buried 9k levels inside to fold into the `Amode`. The hope is that, with egraphs having already run, constants have been shuffled to the right most of the time and anything foldable has already been folded together.

* x64: Add `lea`-based lowering for `iadd`

This commit adds a rule for the lowering of `iadd` to use `lea` for 32 and 64-bit addition. The theoretical benefit of `lea` over the `add` instruction is that `lea` can emulate a 3-operand instruction which doesn't destructively modify one of its operands. Additionally, `lea` can fold in other components such as constant additions and shifts.

In practice, however, if `lea` is used unconditionally for `iadd` it ends up losing 10% performance on a local `meshoptimizer` benchmark. My best guess as to what's going on is that my CPU's dedicated units for address computation are all overloaded while the ALUs are basically idle in a memory-intensive loop. Previously, when the ALU was used for `add` and the address units for stores/loads, things in theory pipelined better (most of this is me shooting in the dark).

To prevent the performance loss I've updated the lowering of `iadd` to conditionally use either `lea` or `add` depending on how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm` continue to use `add` (along with the hypothetical extra `mov` into the result that it may require). More complicated ones like `a + b + $imm` or `a + b << c + $imm` use `lea`, as it can remove the need for extra instructions. Locally at least this fixes the performance loss relative to unconditionally using `lea`.

One note is that this adds an `OperandSize` argument to the `MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit `lea` in addition to the preexisting 64-bit encoding.

* Conditionally use `lea` based on regalloc
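As a rough illustration of the `to_amode` restructuring described above, here is a small standalone Rust sketch. The `Expr`/`Amode` types and the `to_amode_add` helper below are hypothetical simplifications, not Cranelift's actual ISLE helpers or data structures: a constant that is directly visible as an operand of the addition is folded into the addressing mode's displacement, and anything else is treated as a register-producing value rather than being recursed into.

#![allow(dead_code)]

// Hypothetical, heavily simplified model of the idea behind `to_amode`: fold
// only what is visible near the root of the `iadd` tree into the addressing
// mode, and never walk an arbitrarily deep `iadd` chain.

#[derive(Debug)]
enum Expr {
    Add(Box<Expr>, Box<Expr>), // an `iadd` node
    Const(i32),                // an `iconst` node
    Reg(u32),                  // stand-in for "a value living in a register"
}

#[derive(Debug)]
struct Amode {
    base: Expr,
    index: Option<Expr>,
    offset: i32,
}

// Build `base [+ index] + offset` from `x + y`. A constant that appears
// directly as either operand is folded into `offset`; everything else becomes
// a base or index operand as-is. Constants buried deeper in the chain are left
// alone -- the commit relies on egraph optimizations having already shuffled
// them toward the top.
fn to_amode_add(x: Expr, y: Expr, offset: i32) -> Amode {
    match (x, y) {
        (Expr::Const(c), reg) | (reg, Expr::Const(c)) => Amode {
            base: reg,
            index: None,
            offset: offset.wrapping_add(c),
        },
        (base, index) => Amode {
            base,
            index: Some(index),
            offset,
        },
    }
}

fn main() {
    // `v1 + 32` folds the constant into the displacement.
    let amode = to_amode_add(Expr::Reg(1), Expr::Const(32), 0);
    assert_eq!(amode.offset, 32);
    assert!(amode.index.is_none());
}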
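The "complicated enough for `lea`" heuristic described in the message can be sketched the same way. The `Amode` enum here is a hypothetical simplification of x64 addressing modes, not the real `Amode`/`MInst` definitions, and the cutoff simply mirrors the prose above: `a + b` and `a + $imm` stay on `add`, while three-operand or shifted forms go to `lea`.

#![allow(dead_code)]

// Hypothetical sketch of the lea-vs-add choice described above; register
// numbers stand in for allocated GPRs.

enum Amode {
    /// base + simm32
    ImmReg { simm32: i32, base: u32 },
    /// base + (index << shift) + simm32
    ImmRegRegShift { simm32: i32, base: u32, index: u32, shift: u8 },
}

// `lea` pays off only when it folds more work than a single `add` could do:
// two registers plus a displacement, or a shifted index. Simple forms stay on
// the ALU `add` path (possibly with an extra `mov` into the destination).
fn prefer_lea(amode: &Amode) -> bool {
    match amode {
        // `a + $imm`: one `add` instruction, keep it.
        Amode::ImmReg { .. } => false,
        // `a + b` with no displacement and no shift: also a single `add`.
        Amode::ImmRegRegShift { simm32: 0, shift: 0, .. } => false,
        // `a + b + $imm`, `a + (b << c)`, etc.: `lea` removes instructions.
        Amode::ImmRegRegShift { .. } => true,
    }
}

fn main() {
    // `rax + rbx` -> `add`; `rax + rbx*8 + 16` -> `lea`.
    assert!(!prefer_lea(&Amode::ImmRegRegShift { simm32: 0, base: 0, index: 1, shift: 0 }));
    assert!(prefer_lea(&Amode::ImmRegRegShift { simm32: 16, base: 0, index: 1, shift: 3 }));
}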
@@ -41,17 +41,27 @@
 ;; `i64` and smaller.
 
-;; Add two registers.
-(rule -5 (lower (has_type (fits_in_64 ty)
+;; Base case for 8 and 16-bit types
+(rule -6 (lower (has_type (fits_in_16 ty)
                           (iadd x y)))
       (x64_add ty x y))
 
-;; The above case handles when the rhs is an immediate or a sinkable load, but
-;; additionally add lhs meets these criteria.
+;; Base case for 32 and 64-bit types which might end up using the `lea`
+;; instruction to fold multiple operations into one.
+;;
+;; Note that at this time this always generates a `lea` pseudo-instruction,
+;; but the actual instruction emitted might be an `add` if it's equivalent.
+;; For more details on this see the `emit.rs` logic to emit
+;; `LoadEffectiveAddress`.
+(rule -5 (lower (has_type (ty_32_or_64 ty) (iadd x y)))
+      (x64_lea ty (to_amode_add (mem_flags_trusted) x y (zero_offset))))
+
+;; Higher-priority cases than the previous two where a load can be sunk into
+;; the add instruction itself. Note that both operands are tested for
+;; sink-ability since addition is commutative
 (rule -4 (lower (has_type (fits_in_64 ty)
-                          (iadd (simm32_from_value x) y)))
-      (x64_add ty y x))
+                          (iadd x (sinkable_load y))))
+      (x64_add ty x y))
 (rule -3 (lower (has_type (fits_in_64 ty)
                           (iadd (sinkable_load x) y)))
       (x64_add ty y x))
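The comment in the new rule above defers the final choice to emit time. Below is a hedged Rust sketch of the kind of decision that logic can make once register allocation has assigned a destination; the types and the exact set of cases are hypothetical, not the actual `emit.rs` code.

#![allow(dead_code)]

// Hypothetical model of lowering a `LoadEffectiveAddress` pseudo-instruction
// into the cheapest equivalent real instruction. Register numbers stand in
// for allocated GPRs; the real emit.rs logic is more involved.

enum Amode {
    ImmReg { simm32: i32, base: u32 },
    ImmRegRegShift { simm32: i32, base: u32, index: u32, shift: u8 },
}

#[derive(Debug, PartialEq)]
enum Emitted {
    Mov { dst: u32, src: u32 },    // dst = base
    AddImm { dst: u32, imm: i32 }, // dst += imm   (base already in dst)
    AddReg { dst: u32, src: u32 }, // dst += index (base already in dst)
    Lea { dst: u32 },              // genuine address computation
}

fn emit_lea(amode: &Amode, dst: u32) -> Emitted {
    match *amode {
        // `lea dst, [base]` is just a register move.
        Amode::ImmReg { simm32: 0, base } => Emitted::Mov { dst, src: base },
        // `lea dst, [base + imm]` where regalloc already placed base in dst.
        Amode::ImmReg { simm32, base } if base == dst => Emitted::AddImm { dst, imm: simm32 },
        // `lea dst, [base + index]` where base is already in dst.
        Amode::ImmRegRegShift { simm32: 0, base, index, shift: 0 } if base == dst => {
            Emitted::AddReg { dst, src: index }
        }
        // Anything else really does need the address-generation form.
        _ => Emitted::Lea { dst },
    }
}

fn main() {
    // Base register 3 is also the destination, so `lea 3, [3 + 16]` can be
    // emitted as `add r3, 16`.
    assert_eq!(
        emit_lea(&Amode::ImmReg { simm32: 16, base: 3 }, 3),
        Emitted::AddImm { dst: 3, imm: 16 }
    );
}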
@@ -442,13 +452,14 @@
 (extern constructor ishl_i8x16_mask_table ishl_i8x16_mask_table)
 (rule (ishl_i8x16_mask (RegMemImm.Reg amt))
       (let ((mask_table SyntheticAmode (ishl_i8x16_mask_table))
-            (base_mask_addr Gpr (x64_lea mask_table))
+            (base_mask_addr Gpr (x64_lea $I64 mask_table))
             (mask_offset Gpr (x64_shl $I64 amt
                                       (imm8_to_imm8_gpr 4))))
-        (amode_imm_reg_reg_shift 0
-                                 base_mask_addr
-                                 mask_offset
-                                 0)))
+        (Amode.ImmRegRegShift 0
+                              base_mask_addr
+                              mask_offset
+                              0
+                              (mem_flags_trusted))))
 
 (rule (ishl_i8x16_mask (RegMemImm.Mem amt))
       (ishl_i8x16_mask (RegMemImm.Reg (x64_load $I64 amt (ExtKind.None)))))
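For the mask-table lookups above, the `Amode.ImmRegRegShift` now carries its memory flags directly, but the address it computes is unchanged: the table holds one 16-byte mask per shift amount, so the entry lives at `base_mask_addr + (amt << 4)`. A small Rust sketch of that arithmetic, with illustrative values only:

// Illustrative arithmetic for the amode built above: `x64_shl $I64 amt 4`
// scales the shift amount by 16 (the size of one mask entry), and
// `Amode.ImmRegRegShift 0 base_mask_addr mask_offset 0` then computes
// `0 + base + (mask_offset << 0)`.
fn mask_entry_addr(base_mask_addr: u64, amt: u64) -> u64 {
    let mask_offset = amt << 4;
    base_mask_addr + mask_offset
}

fn main() {
    // A shift amount of 3 selects the fourth 16-byte entry in the table.
    assert_eq!(mask_entry_addr(0x1000, 3), 0x1030);
}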
@@ -546,14 +557,15 @@
 (extern constructor ushr_i8x16_mask_table ushr_i8x16_mask_table)
 (rule (ushr_i8x16_mask (RegMemImm.Reg amt))
       (let ((mask_table SyntheticAmode (ushr_i8x16_mask_table))
-            (base_mask_addr Gpr (x64_lea mask_table))
+            (base_mask_addr Gpr (x64_lea $I64 mask_table))
             (mask_offset Gpr (x64_shl $I64
                                       amt
                                       (imm8_to_imm8_gpr 4))))
-        (amode_imm_reg_reg_shift 0
-                                 base_mask_addr
-                                 mask_offset
-                                 0)))
+        (Amode.ImmRegRegShift 0
+                              base_mask_addr
+                              mask_offset
+                              0
+                              (mem_flags_trusted))))
 
 (rule (ushr_i8x16_mask (RegMemImm.Mem amt))
       (ushr_i8x16_mask (RegMemImm.Reg (x64_load $I64 amt (ExtKind.None)))))