Aarch64: handle csel with icmp/fcmp source without materializing the bool.
Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.
On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):
pre:
1117.232000 task-clock (msec) # 1.000 CPUs utilized
133 context-switches # 0.119 K/sec
1 cpu-migrations # 0.001 K/sec
5,041 page-faults # 0.005 M/sec
3,511,615,100 cycles # 3.143 GHz
4,272,427,772 instructions # 1.22 insn per cycle
<not supported> branches
27,980,906 branch-misses
1.117299838 seconds time elapsed
post:
1003.738075 task-clock (msec) # 1.000 CPUs utilized
121 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,052 page-faults # 0.005 M/sec
3,224,875,393 cycles # 3.213 GHz
4,000,838,686 instructions # 1.24 insn per cycle
<not supported> branches
27,928,232 branch-misses
1.003440004 seconds time elapsed
In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
This commit is contained in:
@@ -41,3 +41,14 @@ block0(v0: b1, v1: i8, v2: i8):
|
||||
|
||||
; check: subs wzr
|
||||
; nextln: csel
|
||||
|
||||
function %i(i32, i8, i8) -> i8 {
|
||||
block0(v0: i32, v1: i8, v2: i8):
|
||||
v3 = iconst.i32 42
|
||||
v4 = icmp.i32 eq v0, v3
|
||||
v5 = select.i8 v4, v1, v2
|
||||
return v5
|
||||
}
|
||||
|
||||
; check: subs wzr, w0, #42
|
||||
; nextln: csel x0, x1, x2, eq
|
||||
|
||||
Reference in New Issue
Block a user